Basis Function Adaptation in Temporal Difference Reinforcement Learning

Ishai Menache
Department of Electrical Engineering, Technion, Israel Institute of Technology, Haifa 32000, Israel
[email protected]

Shie Mannor∗
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139
[email protected], 617-4522351

Nahum Shimkin
Department of Electrical Engineering, Technion, Israel Institute of Technology, Haifa 32000, Israel
[email protected]

April 1, 2003

∗ All correspondence should be sent to this author.
Abstract

We examine methods for on-line optimization of the basis functions of temporal difference Reinforcement Learning algorithms. We concentrate on architectures with a linear parameterization of the value function. Our methods optimize the weights of the network while simultaneously adapting the parameters of the basis functions in order to decrease the Bellman approximation error. A gradient-based method and the Cross Entropy method are applied to the basis function adaptation problem. The performance of the proposed algorithms is evaluated and compared using simulation experiments.
Keywords: Reinforcement Learning, temporal difference algorithm, Cross Entropy method, radial basis functions.
Introduction

Reinforcement Learning (RL) has evolved over the last decade into a major approach for solving hard Markov Decision Problems (MDPs). This framework addresses in a unified manner the problems posed by an unknown environment and a large state space (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998). The underlying methods are based on Dynamic Programming, and include adaptive schemes that mimic either value iteration (such as Q-learning) or policy iteration (actor-critic methods). While the former attempt to learn the optimal value function directly, the latter are based on learning the value of the currently-used policy, followed by a slower policy-improvement step.

A large state space presents two major issues. The most obvious one is the storage problem, as it becomes impractical to store the value function (or optimal action) explicitly for each state. The other is the generalization problem, as limited experience does not provide sufficient data for each and every state. Both issues are addressed by the Function Approximation approach (Sutton, 1988), which approximates the value function by a functional approximator with a given architecture and a manageable number of adjustable parameters. Obviously, the success of this approach rests on some regularity properties of the state space, possibly induced by appropriate feature selection, and on the proper choice of an approximation architecture and size.

In a linear approximation architecture, the value of a state is computed by first mapping it to a low-dimensional feature vector, and then linearly weighting these features using adjustable weights. The mapping from states to features is often referred to as a basis function. A notable class of linear function approximators is that of Radial Basis Function (RBF) networks (Haykin, 1998). In the RL context, linear architectures uniquely enjoy convergence results and performance guarantees, particularly for the problem of approximating the value of a fixed stationary policy (see Tsitsiklis and Van Roy, 1997; Nedic and Bertsekas, 2001). Yet, the approximation quality obviously hinges on the proper choice of the basis functions.

In this paper we consider the possibility of on-line tuning of the basis functions. As is common in RBF network training, we separate the problem of estimating the linear weights of the network from the (harder) problem of adjusting the parameters of the basis functions themselves. The former is handled on a faster time scale, using standard TD(λ) algorithms that optimize the linear weights for fixed basis functions. Basis function parameters are tuned in a batch manner, using an explicit score function. This approach allows the use of global optimization methods. Furthermore, it provides safeguards against possible divergence and performance degradation, which are well known to occur
in on-line algorithms for non-linear architectures.

We consider here two algorithms for adjusting the basis function parameters: the first is a local, gradient-based method, while the second is based on the Cross Entropy (CE) method for global optimization (Rubinstein, 1999). We present experiments with both adaptation methods for a grid world. The experiments demonstrate that both methods manage to improve the initial estimation quality. Our experiments show that the gradient-based algorithm converges quickly, but tends to get trapped in local minima that may be quite far from the optimum. This could in fact be expected, since the approximating function is highly non-linear in the basis-function parameters, leading to a non-convex optimization problem. The CE method offers a significant improvement in this respect, at the expense of greater computational effort.

Our approach is related to the work of Singh et al. (1995) on soft state aggregation, where a gradient-based algorithm was proposed in order to improve the weights relating each state to the possible clusters. We note that the number of adjustable parameters required in Singh et al. (1995) is a multiple of the state cardinality, hence this scheme is not suitable as an approximation architecture for large problems. Previous applications of the CE method to MDP optimization include Dubin (2002), where a direct search in policy space is considered. This approach was later extended by Mannor et al. (2003). We also note that several works on optimization of systems that are essentially MDPs were conducted by the CE community. Specifically, the buffer allocation problem (Allon et al., 2001) and robot path planning (Helvik and Wittner, 2001) were recently studied.

The paper is organized as follows. We start in Section 1 with a short summary of necessary ideas concerning RL in large MDPs. We then present the basis function optimization problem in Section 2, and propose appropriate score functions. The CE and gradient-based adaptation methods are presented in Section 3. Our experiments are presented in Section 4. We conclude and mention directions for future research in Section 5.
1 MDPs and Reinforcement Learning
We consider a learning agent who faces a dynamic environment, modelled as an MDP. The model evolves in discrete time, and is specified by the finite state space S, finite action sets (A(s), s ∈ S), transition kernel p = (p(s'|s, a)), and reward function R = (R(s, a)). At each time step t = 0, 1, 2, . . ., the agent observes the state s_t ∈ S, and generates an action a_t ∈ A(s_t). As a result, it obtains a reward R_t = R(s_t, a_t), and the process moves to a new state s_{t+1} = s' with probability p(s'|s_t, a_t). The agent's goal is to find a policy π, mapping states to actions, that maximizes some reward functional. In this work we consider the discounted reward criterion, specified by

    V^\pi(s) = E_\pi\left[ \sum_{t=0}^{\infty} \gamma^t R_t \mid s_0 = s \right] .    (1)

Here 0 < γ < 1 is the discount factor, π is the agent's policy (to be optimized), E_π is the expectation operator induced by π, and s denotes the initial state. As is well known (Puterman, 1994; Bertsekas, 1995), an optimal policy exists within the class of (deterministic) stationary policies, which map states into actions. A randomized stationary policy can be identified with conditional probabilities π(a|s), which specify the probability of choosing action a at state s. The value function V^π for such a policy is the unique solution of Bellman's equation, namely the following set of linear equations:

    V^\pi(s) = \sum_{a \in A(s)} \pi(a|s) \sum_{s' \in S} p(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right], \qquad s \in S .    (2)
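For concreteness, when the model under the fixed policy is available and the state space is small, Eq. (2) can be solved directly as a linear system. The following is a minimal sketch under that assumption; the function and array names are ours, not part of the paper.

```python
# Minimal sketch: exact policy evaluation by solving the linear system in Eq. (2).
# P_pi[s, s'] is the transition probability under the fixed policy pi, and R_pi[s]
# the expected one-step reward under pi; both are assumed to be given.
import numpy as np

def evaluate_policy(P_pi: np.ndarray, R_pi: np.ndarray, gamma: float) -> np.ndarray:
    """Return V satisfying V = R_pi + gamma * P_pi @ V."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```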
As noted, computation of V^π is a basic step in the policy improvement algorithm, which is one of the central DP algorithms for computing the optimal policy. A direct solution of the above equation (or the corresponding optimality equation) may not be feasible, either due to unknown environment parameters (p and R) or due to the cardinality of the state space. In either case, temporal difference methods may be employed to evaluate V^π. Suppose we wish to approximate V^π using a functional approximator of the form Ṽ_r : S → IR, where r ∈ IR^K is a tunable parameter vector. Suppose further that the stationary policy π is in effect, and the agent observes the sequence of states and rewards (s_t, R_t). The TD(λ) algorithm, for λ ∈ [0, 1], updates the parameters r according to the iteration

    r := r + \alpha_t d_t \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \nabla \tilde{V}_r(s_k) .
Here ∇ denotes the gradient of Ṽ with respect to r, α_t is the algorithm gain, and d_t = R_t + γṼ_r(s_{t+1}) − Ṽ_r(s_t) is the temporal difference corresponding to the stage-t transition, evaluated at r = r_t. In a linear approximation architecture, the approximator Ṽ_r is given by

    \tilde{V}_r(s) = \sum_{k=1}^{K} r(k) \varphi_k(s) \equiv \varphi(s) \cdot r ,    (3)

where r = (r(1), . . . , r(K)) is the vector of tunable weights, ϕ_k : S → IR are the basis functions, also referred to as the state features, and ϕ(s) = (ϕ_1(s), . . . , ϕ_K(s)). With this linear approximator, the TD(λ) algorithm takes the form (Sutton, 1988; Tsitsiklis and Van Roy, 1997)

    r_{t+1} = r_t + \alpha_t z_t \left( R_t + (\gamma\varphi(s_{t+1}) - \varphi(s_t)) \cdot r_t \right) ,    (4)

where z_t = \sum_{i=0}^{t} (\gamma\lambda)^{t-i} \varphi(s_i) \in IR^K. Roughly, z_t(k) keeps track of the relevance of the present transition to parameter k. Obviously, it may be computed iteratively as

    z_{t+1} := \gamma\lambda z_t + \varphi(s_{t+1})    (5)

with z_{-1} = 0. Note that both r_t and z_t evolve in the K-dimensional parameter space. It has been shown in Tsitsiklis and Van Roy (1997) that, for appropriate gains α_t and assuming that the Markov chain induced by π is irreducible, the above algorithm converges to a unique parameter vector r*, and further that the approximation error is bounded by a fixed multiple of the optimal one. Finally, the LSTD(λ) algorithm (Boyan, 2002) is a batch variant of TD(λ), which converges to the same weight vector as the iterative algorithm above. This algorithm computes the following K-dimensional vector and K × K matrix:

    b_t = \sum_{i=0}^{t} z_i R_i , \qquad A_t = \sum_{i=0}^{t} z_i \left( \varphi(s_i) - \gamma\varphi(s_{i+1}) \right)^T .    (6)

The matrix A_t and the vector b_t may be incrementally updated after each transition. The eligibility vector z_i is updated as in (5). The approximating weight vector is calculated, when required, via r = A_t^{-1} b_t. This algorithm has been shown to give favorable convergence rates.
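As an illustration, a minimal sketch of the LSTD(λ) computation in Eqs. (5)-(6) is given below, accumulating A_t and b_t from a recorded trajectory of feature vectors and rewards. The array layout and the small regularization term added before inversion are our assumptions, not part of the algorithm as stated.

```python
# Minimal sketch of LSTD(lambda), Eqs. (5)-(6): accumulate A_t and b_t along a
# recorded trajectory and solve r = A^{-1} b.  The array layout and the small
# regularization term added before inversion are assumptions.
import numpy as np

def lstd(features: np.ndarray, rewards: np.ndarray, gamma: float, lam: float,
         reg: float = 1e-6) -> np.ndarray:
    """features has shape (T+1, K) with rows phi(s_t); rewards has shape (T,)."""
    K = features.shape[1]
    A, b = np.zeros((K, K)), np.zeros(K)
    z = np.zeros(K)                                            # eligibility vector, z_{-1} = 0
    for t in range(len(rewards)):
        z = gamma * lam * z + features[t]                      # Eq. (5)
        A += np.outer(z, features[t] - gamma * features[t + 1])  # Eq. (6), A_t term
        b += z * rewards[t]                                    # Eq. (6), b_t term
    return np.linalg.solve(A + reg * np.eye(K), b)             # r = A^{-1} b
```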
2 Evaluation Criteria
Assume now that each of the basis functions ϕ_k has some pre-determined parametric form ϕ_{θ_k}. For example, for the radial basis functions described below in Eq. (12), θ_k corresponds to (µ_k, Σ_k). The value function may be approximated via

    V^\pi(s) \approx \tilde{V}_{\theta,r}(s) = \sum_{k=1}^{K} r(k) \varphi_{\theta_k}(s) .
We refer to θ = (θ_1, θ_2, . . . , θ_K) as the basis function parameters, while r = (r(1), . . . , r(K)) are the weights. Prevalent approaches to supervised learning in radial basis and similar networks separate the weight adjustment from the (harder) problem of parameter selection (Haykin, 1998). In this spirit, we use the LSTD(λ) algorithm, as described above, to determine the weights while the basis functions are held fixed. Assuming that r is determined as a function of θ by an appropriate algorithm, we shall henceforth omit the explicit dependence of Ṽ on r in our notation. We are thus left with the problem of learning the "non-linear" parameters θ, namely tuning the basis functions, in order to allow a better approximation of the value function.

We shall require explicit evaluation criteria for the quality of a given approximation Ṽ. Since the true value function V^π is not available for comparison, we resort to the Bellman error, which is defined at each state as

    J_\theta^\pi(s) \triangleq \tilde{V}_\theta^\pi(s) - \left[ R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s, s') \tilde{V}_\theta^\pi(s') \right] ,    (7)

where P^π(s, s') is the probability that the next state is s' given that the current state is s and policy π is used. A justification for minimizing the norm of J_θ^π is the fact that the norm of the approximation error (V^π − Ṽ) is bounded by a constant multiple of the norm of the Bellman error (the norm used is a weighted 2-norm; see Bertsekas and Tsitsiklis, 1996). For the score function, it will be convenient to consider the following (weighted) 2-norm:

    M(\theta) = \sum_{s \in S} \beta(s) J_\theta^\pi(s)^2 .    (8)
The weights β(s) ≥ 0 may emphasize certain states at the expense of others. A reasonable choice for these weights is β(s) = p^π(s), the stationary state frequencies under the policy π. This means that we are willing to put up with relatively high Bellman errors for states which are not visited often (and for which it is harder to have good value approximations), as long as we have good approximations for frequently visited states. We note that a similar score function has been used in Singh et al. (1995); see also the discussion in Chapter 6.10 of Bertsekas and Tsitsiklis (1996).

Direct calculation of the score (8) requires the model of the environment (transition probabilities and rewards). In addition, depending on the choice of weights, the steady state probabilities under policy π may be required. When these items are not available, we might try to estimate or approximate them en route. When the state space is large, summing over all states is not feasible, and we may be forced to sum over a representative subset of states (see Bertsekas and Tsitsiklis, 1996). The representative states (RS) should typically be separated relative to some state space metric, and possibly represent significant states, such as bottlenecks (McGovern and Barto, 2001; Menache et al., 2002) or uncommon states (see Ratitch and Precup, 2002 for novel criteria for "complex" states). The actual number of RS is determined according to available memory and computation resources. The score function now becomes

    M(\theta) = \sum_{s \in RS} \beta(s) \left( \tilde{V}_\theta^\pi(s) - \left[ \tilde{R}^\pi(s) + \gamma \sum_{s' \in S} \tilde{P}^\pi(s, s') \tilde{V}_\theta^\pi(s') \right] \right)^2 ,    (9)

where P̃^π(·,·) and R̃^π(·) are estimates for the transition probabilities and rewards, respectively. When the MDP is deterministic (or almost so), a reasonable score function can be expressed directly in terms of the temporal differences, for example

    M(\theta) = \frac{1}{n} \sum_{t=1}^{n} \left( \tilde{V}_\theta(s_t) - \left[ R_t + \gamma \tilde{V}_\theta(s_{t+1}) \right] \right)^2 .    (10)

This score is comparable to (9), with β(s) = p̃^π(s), and with R̃^π(s) and P̃^π(s, s') taken as their empirical means. For batch processing the latter score does not seem to enjoy a particular advantage, while Eq. (9) allows greater flexibility, for example in evaluating the approximation error at unvisited states. We note that if the rewards or transition probabilities are stochastic then Eq. (10) considers the inherent variance in the value as well, and may therefore be biased towards favoring actions with small variance.

An alternative approach for defining a score for an unknown model is based on Monte-Carlo methods (Bertsekas and Tsitsiklis, 1996). It might be possible to obtain a high-precision estimate V̂(s) of the value function V^π(s) over a restricted number of states s ∈ RS, using direct simulation. The score of the approximation Ṽ_θ may then be defined as

    M(\theta) = \sum_{s \in RS} \beta(s) \left( \hat{V}^\pi(s) - \tilde{V}_\theta^\pi(s) \right)^2 .    (11)
The apparent advantage of (11) over (9) is that there is no need to estimate the transition probabilities and rewards for the representative states. Yet, using Eq. (11) requires exhaustive sampling of the representative states and their successor states. The advantage of the score (11) over (10) is that it considers only the value, while the randomness in the rewards or transitions is ignored.
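As an illustration of the score in Eq. (9), the following is a minimal sketch that sums weighted squared Bellman errors over a set of representative states; all inputs (the estimated model, the weights β, and the current approximation Ṽ_θ) are assumed to be given as arrays indexed by state.

```python
# Minimal sketch of the score in Eq. (9): a weighted sum of squared Bellman errors
# over the representative states.  The estimated model (P_hat, R_hat), the weights
# beta, and the approximate values V_tilde are assumed to be given, indexed by state.
import numpy as np

def bellman_error_score(V_tilde: np.ndarray, P_hat: np.ndarray, R_hat: np.ndarray,
                        beta: np.ndarray, gamma: float, rep_states) -> float:
    score = 0.0
    for s in rep_states:
        backup = R_hat[s] + gamma * P_hat[s] @ V_tilde     # one-step look-ahead
        score += beta[s] * (V_tilde[s] - backup) ** 2      # squared Bellman error
    return float(score)
```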
3 Basis Function Optimization
We present next two methods for optimizing the basis functions with respect to the selected score function. The first is gradient-based, and the second employs the CE method. In either case, the state-reward history needs to be observed only once, and then stored and re-used at subsequent stages. It is assumed that this history is sampled from a fixed policy π.
3.1 Gradient Based Adaptation
In its simplest form, this algorithm proceeds by interleaving optimization steps for either r or θ, while keeping the other fixed. Using the score function provided in Eq. (9), we are able to calculate the gradient analytically. At the initial phase, estimates of the model p̃^π(·), P̃^π(·,·), R̃^π(·) are obtained (at least for the representative states), and the basis functions are initialized with some choice of θ. The following steps are then repeated: (i) The LSTD(λ) algorithm is applied to estimate the optimal weights r (with θ fixed). (ii) The basis-function parameters θ are updated using gradient descent steps in θ, with respect to the score M(θ), while r is kept fixed. The gradient of M(θ) is easily obtained from (9), provided that the derivatives of the basis functions ϕ_θ with respect to their parameters θ are known.

The gradient descent step may be improved by letting r change optimally with θ, and computing the gradient in θ accordingly. Let r(θ) denote the optimal value of r given θ. Observe, from (9) and (3), that

    \frac{\partial}{\partial\theta} M(\theta) = 2 \sum_{s \in RS} \beta(s) J(s, \theta) \frac{\partial J(s, \theta)}{\partial\theta} ,

with

    J(s, \theta) = r(\theta) \cdot \varphi_\theta(s) - \left[ \tilde{R}(s) + \gamma \sum_{s' \in S} \tilde{P}(s, s') \, r(\theta) \cdot \varphi_\theta(s') \right] .
Letting θ_q denote the q-th entry of θ, the corresponding derivative is given by

    \frac{\partial}{\partial\theta_q} J(s, \theta) = \left( \frac{\partial r(\theta)}{\partial\theta_q} \cdot \varphi_\theta(s) + r(\theta) \cdot \frac{\partial \varphi_\theta(s)}{\partial\theta_q} - \gamma \sum_{s' \in S} \tilde{P}(s, s') \left[ \frac{\partial r(\theta)}{\partial\theta_q} \cdot \varphi_\theta(s') + r(\theta) \cdot \frac{\partial \varphi_\theta(s')}{\partial\theta_q} \right] \right) .

We note that ∂ϕ_θ(s)/∂θ_q is a vector of zeros except for one entry, corresponding to the derivative of the basis function to which θ_q belongs. The main issue remains the calculation of the derivatives of the linear parameters with respect to the non-linear ones, namely ∂r(θ)/∂θ. We briefly outline how to estimate ∂r(θ)/∂θ_q by extending the LSTD(λ) algorithm. Recall that the LSTD(λ) algorithm estimates the optimal weights via r(θ) = A^{-1} b, with A and b defined in (6). We thus have the estimate

    \frac{\partial r(\theta)}{\partial\theta_q} = -A^{-1} \frac{\partial A}{\partial\theta_q} A^{-1} b + A^{-1} \frac{\partial b}{\partial\theta_q} .

Using the expressions for A and b, one can form an estimate of their partial derivatives which is computed incrementally within the LSTD(λ) algorithm, in parallel with the usual estimates of A and b. ∂r(θ)/∂θ_q is then evaluated from the last equation when required. Further details are omitted here for lack of space, and may be found in Menache (2002).
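The sketch below illustrates the interleaved structure of the algorithm in its simplest form: LSTD(λ) is used to refit r for the current θ, and θ is then updated by a gradient step on M(θ) with r held fixed. The helper callables lstd_weights and score are placeholders, and for brevity the gradient is taken numerically here rather than through the analytic expressions above.

```python
# Minimal sketch of the interleaved scheme in its simplest form: refit r by
# LSTD(lambda) for the current theta, then take a gradient step on M(theta) with r
# held fixed.  The callables lstd_weights(theta) and score(theta, r) are placeholders,
# and the gradient is taken numerically here instead of via the analytic expressions.
import numpy as np

def adapt_basis_gradient(theta0, lstd_weights, score, step=5e-4, n_iters=10, eps=1e-4):
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_iters):
        r = lstd_weights(theta)                    # step (i): weights for fixed theta
        grad = np.zeros_like(theta)
        for q in range(theta.size):                # step (ii): central differences in theta_q
            d = np.zeros_like(theta)
            d[q] = eps
            grad[q] = (score(theta + d, r) - score(theta - d, r)) / (2 * eps)
        theta -= step * grad                       # gradient descent on M(theta)
    return theta
```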
3.2 Cross Entropy Based Adaptation
We next provide a brief outline of the CE algorithm as it applies to our problem; further details will be given in the next section. Recall that the score function of (9) may be estimated for a fixed vector of parameters θ. This score can naturally serve as the score function for the CE method for optimization (Rubinstein, 1999; de-Boer et al., 2002). We follow the conventions used in de-Boer et al. (2002) and refer the reader to that tutorial. We assume that the set of parameters θ is drawn from probability density functions (pdfs), which have some parametric form (e.g., Gaussian). A CE iteration generates a sample of candidate parameterizations θ_1, . . . , θ_N according to the cross entropy sampling mechanism. The parameters of the random mechanism are then updated using the outcome of the TD(λ) algorithm, in order to produce a better sample. The algorithm involves two steps: the simulation step, in which experience is used in order to estimate the value function and learn the model, and the optimization step, in which a better parameterization is sought. The evaluations of different parameter vectors are computationally independent of each other. Thus, if parallel computing resources are at hand, the same experience may be used for many θ's at the same time. The score under a fixed policy π may be chosen as either (9) or (11). The best scores (lowest errors) are used to determine the new meta-parameters. Further details are omitted here for lack of space, and may be found in Menache (2002).
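A minimal sketch of this scheme is given below, under the Gaussian sampling assumption adopted later in Section 4.3. The callable score(θ) stands for an LSTD(λ) fit followed by an evaluation of (9) or (11); the top-k elite selection and the small variance floor are our simplifications.

```python
# Minimal sketch of a CE iteration for basis-function adaptation, assuming every
# parameter is sampled from an independent Gaussian (as in Section 4.3).  The
# callable score(theta) stands for an LSTD(lambda) fit followed by Eq. (9) or (11);
# the top-k elite selection and the variance floor are simplifications.
import numpy as np

def ce_adapt(mu, sigma, score, n_samples=400, elite_frac=0.01, n_iters=25, seed=0):
    rng = np.random.default_rng(seed)
    mu, sigma = np.asarray(mu, float).copy(), np.asarray(sigma, float).copy()
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(n_iters):
        thetas = rng.normal(mu, sigma, size=(n_samples, mu.size))  # sample candidates
        scores = np.array([score(th) for th in thetas])            # evaluate each theta
        elite = thetas[np.argsort(scores)[:n_elite]]               # lowest Bellman error
        mu = elite.mean(axis=0)                                    # update sampling pdf
        sigma = elite.std(axis=0) + 1e-6                           # small floor (our addition)
    return mu
```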
4 Experiments
We describe below several experiments that were conducted on a maze-world. We start with a description of the maze world setup. We continue with results concerning the gradient based adaptation in Section 4.2. We finally describe experiments with CE based adaptation in Section 4.3.
4.1 General Setup
The domain chosen for the experiments is a discrete two-dimensional maze-world domain (e.g., Kaelbling et al., 1996). In this domain an agent roams around the maze, trying to reach goal states as fast as possible. The agent receives a small negative reward for each step (−0.5), and a positive reward (+8) for reaching goal states. A goal state is also an absorbing state, after which the agent starts a new episode at a randomly chosen state. The agent may choose to move to one of four neighboring tiles (unless there is an obstacle), and in our experiments its movement is perturbed with probability 0.1, meaning that it may fail to move in the chosen direction with this probability, in which case it moves in another (random) direction. The policy π that was selected for evaluation is an optimal policy, which always follows a direction of the shortest Manhattan distance to the closest goal (if there is more than one such direction, the agent assigns equal probabilities to all optimal directions). The maze and the policy π are presented in Figure 1. There are two goals in the maze (marked with "G"), one in the lower right corner and one in the middle, surrounded by a grey obstacle.
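For concreteness, a minimal sketch of the maze dynamics described above is given below; the grid size, obstacle set, and goal positions are placeholders rather than the exact maze of Figure 1.

```python
# Minimal sketch of the maze dynamics described above: -0.5 per step, +8 at a goal,
# and with probability 0.1 the chosen move is replaced by a random other direction.
# The grid size, obstacle set, and goal positions are placeholders, not the exact
# maze of Figure 1.
import random

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, goals, obstacles, size=25, slip=0.1):
    """Return (next_state, reward, done) for one move in the maze."""
    if random.random() < slip:                                  # movement perturbation
        action = random.choice([a for a in ACTIONS if a != action])
    dy, dx = ACTIONS[action]
    nxt = (state[0] + dy, state[1] + dx)
    if not (0 <= nxt[0] < size and 0 <= nxt[1] < size) or nxt in obstacles:
        nxt = state                                             # blocked by wall/obstacle
    if nxt in goals:
        return nxt, 8.0, True                                   # absorbing goal state
    return nxt, -0.5, False                                     # small step penalty
```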
Figure 1: The maze-world. The obstacle is grey and the goals are marked with "G". The optimal policy is shown by arrows. When two actions were optimal we plotted a diagonal arrow in the direction of the sum of the two optimal directions.

The basis functions, which have been chosen in order to approximate the value function, are the standard Gaussian basis functions:

    \varphi_k(x) = \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right) .    (12)

The adaptation in this case is aimed at determining the two-dimensional center position µ_k and the 2 × 2 covariance matrix Σ_k of each Gaussian. In order to reduce the complexity of the simulations, we assume that the two dimensions of each Gaussian are independent, reducing the covariance matrix to a diagonal matrix, and the number of tunable parameters to 4 (instead of 5) per basis function. The choice of the Gaussian family fits the differentiability requirement of the gradient-based algorithm, and is a natural choice for our domain, where the value function has local characteristics (e.g., nearby states have similar values). We note that there might be better choices of basis functions, even in our domain. The number of Gaussian functions was set to 11. While somewhat arbitrary, it seems that this number of basis functions suffices for a meaningful approximation.
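A minimal sketch of this feature map is given below, with each of the K = 11 basis functions described by four parameters (center and one width per axis); the packing of θ into an array is our assumption.

```python
# Minimal sketch of the feature map: K Gaussian basis functions as in Eq. (12) with
# a diagonal covariance, so each basis function has four parameters
# (center_x, center_y, width_x, width_y).  The packing of theta into an array is
# an assumption.
import numpy as np

def rbf_features(state: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """state has shape (2,); theta has shape (K, 4): rows (mu_x, mu_y, sigma_x, sigma_y)."""
    centers, widths = theta[:, :2], theta[:, 2:]
    diff = (state - centers) / widths                  # independent dimensions
    return np.exp(-0.5 * np.sum(diff ** 2, axis=1))    # phi_k(state), k = 1..K
```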
4.2 Gradient Based Adaptation
We implemented the gradient-based adaptation described in Section 3.1. In all the experiments performed, the convergence of the gradient-based adaptation was very fast and usually monotone (except in the vicinity of the minima, where it fluctuated). Some sample runs are presented in Figure 2, where a monotone decrease of the mean square error may be observed. However, convergence is typically to a non-optimal value which can differ significantly between different trials. This clearly indicates the existence of multiple local minima, as could be expected from the non-linear dependence of the basis functions on their parameters (see, e.g., Auer et al., 1996). To illustrate the extent of this problem, we plot in Figure 3 the mean square error, with respect to the true value function, of the value estimate to which the algorithm converged. Note that each point in Figure 3 represents a run of the gradient-based adaptation algorithm. Observe that some values are repeated, indicating convergence to identical parameter vectors; however, the many different values of the eventual error indicate a plethora of local minima, with widely varying error performance.

The parameters of the runs were as follows: the length of each episode was 10,000 steps, and the learning rate of each of the 44 parameters (11 Gaussians, four parameters each) was set to 5 × 10^{-4}. Each run terminated when the score no longer improved. The initial placement of the Gaussians was uniform. The initial width of each Gaussian was determined so that the entire state space is "covered", with some overlap between neighboring Gaussians.
4.3 Cross Entropy Adaptation
Applying the CE method to our problem requires choosing the distribution from which the parameters of each basis function are drawn in every iteration of the CE method. For simplicity we assumed that each parameter is drawn from a Gaussian distribution. This was done for each of the four parameters that define each basis function. For example, the x coordinate of the center of the i-th basis function (Gaussian), µ_{x_i}, is assumed to be distributed according to a probability density function (pdf) N(µ_{µ_{x_i}}, σ²_{µ_{x_i}}). A similar assumption holds for the other three parameters of each Gaussian: the y coordinate of the center, and the two standard deviations, one for each axis.
Figure 2: The mean square error for a few sample runs of the gradient-based algorithm.
Figure 3: Final (after convergence) values of the error of the gradient-based algorithm, sorted from lowest to highest error.

We comment that even though the standard deviations of the basis functions are non-negative, we still used Gaussian pdfs for sampling them. We have two parameters for the pdf of each of the 44 basis-function parameters, so altogether there are 88 parameters for the pdfs from which the CE method generates each sample.

Using a Gaussian pdf leads to easy update formulas, as the Gaussian distribution belongs to the natural exponential family (see de-Boer et al., 2002 for notations and details). Let m + 1 be the current CE iteration; let µ_{µ_{x_i}}^{m+1} and σ_{µ_{x_i}}^{m+1} be the updated parameters of the pdf of µ_{x_i}; and let ν_m be the current threshold (the ρ-percentile of the sample scores, which determines the elite samples). The corresponding pdf is updated according to the following pair of rules (see Rubinstein, 1999):
    \mu_{\mu_{x_i}}^{m+1} := \frac{ \sum_{j=1}^{N} I_{\{M(\theta_j) \le \nu_m\}} \, \mu_{x_i}(\theta_j) }{ \sum_{j=1}^{N} I_{\{M(\theta_j) \le \nu_m\}} } ,

    \left( \sigma_{\mu_{x_i}}^{m+1} \right)^2 := \frac{ \sum_{j=1}^{N} I_{\{M(\theta_j) \le \nu_m\}} \left( \mu_{x_i}(\theta_j) - \mu_{\mu_{x_i}}^{m+1} \right)^2 }{ \sum_{j=1}^{N} I_{\{M(\theta_j) \le \nu_m\}} } .
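The two rules above translate directly into code. The following is a minimal sketch for a single meta-parameter pair (here the pdf of µ_{x_i}); the array names are our assumptions.

```python
# A direct transcription of the two rules above for a single meta-parameter pair,
# here the pdf of mu_x_i.  param_samples[j] holds mu_x_i(theta_j) and scores[j]
# holds M(theta_j); the names are assumptions.
import numpy as np

def ce_update_gaussian(param_samples: np.ndarray, scores: np.ndarray, rho: float):
    nu_m = np.percentile(scores, 100 * rho)            # threshold over the N samples
    elite = scores <= nu_m                             # indicator I{M(theta_j) <= nu_m}
    mu_new = param_samples[elite].mean()               # first rule
    sigma_new = np.sqrt(((param_samples[elite] - mu_new) ** 2).mean())  # second rule
    return mu_new, sigma_new
```

With N = 400 and ρ = 1%, roughly the four best-scoring candidates determine the sampling distribution of the next iteration.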
We tested the CE adaptation algorithm on the same maze the gradientdescent was run. The LST D algorithm was used for the evaluation step, with λ = 0.9. The parameters of the CE method were set to N = 400 (samples per iteration) and ρ = 1% (percentage of elite samples). An experiment of 10000 time steps has been recorded and serves as the data for adaptation. A suitable choice of the initial values for the parameter pdfs (e.g., µ0µx and σµ0 x ) is essential i i for the success of the CE-based adaptation. A natural choice is to start with a uniform placement of the basis function centers (see the left part of Figure 6), meaning that there is a distinctive initial distribution (µ0µx , µ0µy ) for the means i i of each of the centers. The initial width of each Gaussian is determined in a way that all the state space is “covered”, with some overlap between neighboring Gaussians, or by any other reasonable heuristic. We use the score criterion (9), where due to the small environment, all visited states serve as representative states. In Figure 4 we present three value functions: the true value function, the approximated value function one obtained by the initial Gaussian placement, and the approximated value function using the final Gaussian placement. The final value function is closer to the true value function. The performance curves of the CE method are presented in Figure 5. The need for the adaptation process is evident. The score (9) is being monotonously improved until convergence of the CE method (i.e., when the variances of the pdf parameters used by the CE method are small.) In addition, using the true value function, the right graph shows the improvement of the (real) mean square error 14
Figure 4: Maps of the value functions. Bright areas represent states with a high value. The barrier is denoted by black. The left map shows the true value function, as calculated using dynamic programming; the middle map is the estimated value function obtained with the initial placement; the right map is the estimated value function obtained with the final placement of the basis functions.
In addition, using the true value function, the right graph shows the improvement of the (real) mean square error as a function of the number of tested parameterizations. Comparing Figure 5 to Figure 3 we observe that the CE method reached a better minimum (certainly better than most gradient runs). This behavior was typical of many experiments we conducted. To better understand the inner workings of the adaptation process, we present in Figure 6 the initial and final placements and widths of the radial basis functions. One may notice that the Gaussians wander to "interesting areas", those which suffer from a discontinuity in the value function (the barrier location in our case).

In order to assess the computational effectiveness of the CE method, we compared the CE adaptation method with a simple random search over the basis function parameters. In Figure 7 we plot the mean square error of the best parameterization produced by the CE method as a function of the number of tested parameterizations. On the same graph we show the best parameterization produced by a random search (starting with the same initial conditions as the first CE iteration). Results are averaged over ten runs, with an error bar representing one standard deviation. It can be observed that the CE method is much more robust and produces improved results at a reasonable cost. This superiority indicates that the selection mechanism inherent in the CE method achieves its goal of guiding the search to preferred areas of the parameter space.
Figure 5: Performance graphs of the CE method. On the left is the square error between the estimated value and the real value; on the right is the CE score given by Eq. (9). Results are averaged over 10 runs.
Figure 6: The Gaussian basis functions, before and after the adaptation process. We note that more Gaussians are placed at areas of discontinuity, in our case at the zone of the barrier.
Figure 7: Average performance as a function of the number of tested parameterizations, for random search and for CE-based adaptation.
5 Conclusion and Future Directions
In this paper we addressed the problem of on-line adaptation of the basis function parameters in temporal difference learning. Two methods were suggested: a gradient-based method and a cross entropy based method. In our experiments we observed the fast convergence of the gradient method to local minima, which could be significantly sub-optimal. The CE method was seen to provide an effective means to avoid local minima, and to approach an optimal tuning of the basis functions.

The algorithms and simulations that were presented here should be considered as a first step that demonstrates the feasibility of basis function tuning in an unsupervised, reward-driven environment. Additional development and experimentation is required in order to assess the efficacy of these methods in large real-world problems. On the algorithmic side, a promising idea is to combine the CE and gradient methods; iterating between the two may lead to faster convergence while still avoiding local minima. Alternative approaches to obtaining effective low-order approximations of the value function may also be considered. One interesting approach is to use in the first step a high-order radial basis function network, with its linear weights tuned in the standard way; in the second step, a low-order network is fully tuned to fit the high-order approximation. The measured data need not be used in the second step, although the state trace should probably be useful. It would be interesting to compare these approaches both theoretically and experimentally, in terms of approximation accuracy and computational load.

Within a reinforcement learning scheme, low-order approximations are important both to reduce the complexity of subsequent operations (such as performing a policy improvement step based on a value function approximation), and to increase the learning rate. One way to realize the latter is to tune the basis functions after some initial experimentation phase, and then proceed with the learning process with those fixed basis functions. Again, guidelines for optimizing this procedure are yet to be developed. Finally, methods for tuning the basis functions should be of interest in relation to other RL schemes, such as Q-learning and direct learning in the policy space (Sutton and Barto, 1998).
Acknowledgements

We are grateful to Reuven Rubinstein for introducing the Cross Entropy method to us and for helpful discussions. We would also like to thank Vadim Ratner and Oran Reichman for their help with the simulations. S.M. wishes to thank John Tsitsiklis for helpful discussions. This research was partially supported by the Fund for Promotion of Research at the Technion.
References

Allon, G., Raviv, T., and Rubinstein, R. Y. (2001). Application of the cross entropy method for buffer allocation problem. In Proceedings of the Third Aegean International Conference on Design and Analysis of Manufacturing Systems (pp. 269-278). Morgan Kaufmann.
Auer, P., Herbster, M., and Warmuth, M. (1996). Exponentially many local minima for single neurons. In D. Touretzky, M. Mozer, and M. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8 (pp. 316-322). MIT Press.
Bertsekas, D., and Tsitsiklis, J. (1996). Neuro-dynamic programming. Athena Scientific.
Bertsekas, D. P. (1995). Dynamic programming and optimal control. Athena Scientific.
Boyan, J. A. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49, 233-246.
de-Boer, P., Kroese, D., Mannor, S., and Rubinstein, R. (2002). A tutorial on the cross-entropy method. (Available from http://www.cs.utwente.nl/~ptdeboer/ce/; submitted to the Annals of Operations Research.)
Dubin, U. (2002). Application of the cross-entropy method to neural computation. Unpublished master's thesis, Technion.
Haykin, S. S. (1998). Neural networks: A comprehensive foundation. Prentice Hall.
Helvik, B. E., and Wittner, O. (2001). Using the cross-entropy method to guide/govern mobile agent's path finding in networks. In Proceedings of the 3rd International Workshop on Mobile Agents for Telecommunication Applications (MATA'01). Morgan Kaufmann.
Kaelbling, L. P., Littman, M., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237-285.
Mannor, S., Rubinstein, R., and Gat, Y. (2003). The cross entropy method for fast policy search. (Submitted for publication.)
McGovern, A., and Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the 18th International Conference on Machine Learning (pp. 361-368). Morgan Kaufmann.
Menache, I. (2002). Hierarchical structures in reinforcement learning. Unpublished master's thesis, Department of Electrical Engineering, Technion.
Menache, I., Mannor, S., and Shimkin, N. (2002). Q-Cut: Dynamic discovery of sub-goals in reinforcement learning. In Proceedings of the 13th European Conference on Machine Learning (pp. 295-306). Springer.
Nedic, A., and Bertsekas, D. P. (2001). Least-squares policy evaluation algorithms with linear function approximation. (LIDS Report LIDS-P-2537; to appear in J. of Discrete Event Systems.)
Puterman, M. (1994). Markov decision processes. Wiley-Interscience.
Ratitch, B., and Precup, D. (2002). Characterizing Markov decision processes. In Proceedings of the 13th European Conference on Machine Learning (pp. 391-404). Springer.
Rubinstein, R. Y. (1999). The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1, 127-190.
Singh, S. P., Jaakkola, T., and Jordan, M. I. (1995). Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems 7 (pp. 361-368). MIT Press.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9-44.
Sutton, R. S., and Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.
Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42, 674-690.