IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART A: SYSTEMS AND HUMANS, VOL. 27, NO. 5, SEPTEMBER 1997
A Reinforcement Learning Neural Network for Adaptive Control of Markov Chains

G. Santharam and P. S. Sastry, Member, IEEE

Manuscript received April 16, 1993; revised May 28, 1995 and September 19, 1996. G. Santharam is with Motorola India Electronics, Bangalore 560042, India. P. S. Sastry is with the Department of Electrical Engineering, Indian Institute of Science, Bangalore 560012, India (e-mail: sastry@expertix.ee.iisc.ernet.in).
Abstract—In this paper we consider the problem of reinforcement learning in a dynamically changing environment. In this context, we study the problem of adaptive control of finite-state Markov chains with a finite number of controls. The transition and payoff structures are unknown. The objective is to find an optimal policy which maximizes the expected total discounted payoff over the infinite horizon. A stochastic neural network model is suggested for the controller. The parameters of the neural net, which determine a random control strategy, are updated at each instant using a simple learning scheme. This learning scheme involves estimation of some relevant parameters using an adaptive critic. It is proved that the controller asymptotically chooses an optimal action in each state of the Markov chain with a high probability.

Index Terms—Adaptive critic, automata, generalized learning, reinforcement learning.

Fig. 1. Learning systems.
I. INTRODUCTION
THE THEORY of reinforcement learning has been successfully applied in the past for solving problems involving decision making under uncertainty [1], [2]. In this paper we consider the problem of associative reinforcement learning in a dynamically changing environment. The dynamic environment will be a Markov chain, and the learning problem is to identify an optimal control policy. The criterion of optimality is the expected discounted payoff over an infinite horizon [3]. The transition and payoff structure of the Markov chain is unknown.

A. Reinforcement Learning Systems

Reinforcement learning was originally discussed in the context of learning behavior in biological systems [1]. A simple model is the stochastic Learning Automaton (LA) [1], which interacts with a random environment. There are two assumptions in all models employing a simple LA. The first is that the model is not sensitive to any “context” in the environment, depending on which different actions may be optimal at different times. The second assumption is that of immediate reinforcement, that is, a complete (random) evaluation of an action becomes available before the automaton needs to perform the next action.
A simple example of a context-sensitive task is the pattern classification problem, where the desired output action could be a class label depending on the context vector (the pattern input) [5].¹ For such problems (referred to as associative reinforcement learning problems [6], [7]) the automaton has to be modified so that the choice of action also depends on the context vector input from the environment [5], [7]. Such a context-sensitive automaton is called the Generalized Learning Automaton (GLA) [1], [8]. In a simple GLA model, the environment generates immediate reinforcement and the environment is stationary. In some situations, the context may change according to the action taken by the GLA. Further, the reinforcement obtained may be delayed. A different learning approach is required in such dynamic environments [2], [6], as shown in Fig. 1. The figure shows a GLA along with an adaptive critic interacting with the environment. The function of the adaptive critic is to internally estimate a meaningful reinforcement signal for the GLA, given the context vector, the action taken by the GLA, and the payoff from the environment. A learning scheme for the model given in Fig. 1 has to specify the procedure for updating the parameters of the GLA and also for updating the parameters of the adaptive critic. Such learning systems employing an adaptive critic have been explored in the past [2], [9], [10]. However, there are no general convergence results available for learning systems employing an adaptive critic and performing online learning.

¹It is possible to formulate the pattern classification problem as a nonassociative task and solve it using a team of LA [4].
In this paper, we provide a complete analysis of the reinforcement learning system employing an adaptive critic.

B. Learning Control of a Markov Chain

The problem of control of a Markov chain (also known as the Markov decision problem) arises when state transitions are associated with some cost depending on the action taken in various states. Markov decision problems with various performance criteria have been studied in the past. When the transition probabilities and the reward structure are known, the optimal policy can be obtained through dynamic programming techniques [3], [11]. In many situations, the system model (i.e., the transition and reward structure) is not known a priori, and this leads to the problem of adaptive control of Markov chains. There are two schemes of adaptive control: direct and indirect. Conventional adaptive control schemes (for a survey, see [12]) are indirect in that an explicit model of the unknown system is constructed, and an optimal policy, determined by solving a dynamic programming problem with the current estimated model in lieu of the true model, is used in generating the next action [13], [14]. The complexity of these procedures increases rapidly with the dimension of the problem. Wheeler and Narendra [15] proposed a direct decentralized learning scheme using a team of Learning Automata for the adaptive control of a finite-state Markov chain, with the criterion being the expected average cost over the infinite horizon. Recently, Barto et al. [16] suggested a direct method for this problem with discounted payoff. At each step, the best possible action is identified by estimating the so-called Q-values or state-action values, originally introduced by Watkins [17], [18]. The action generation is based on some strategy which ensures that each action is generated infinitely often so that the estimates can converge to the true values [18]. Knowledge of the true state-action values amounts to knowledge of the optimal strategy. Through simulation studies, the utility of the direct method over the complex indirect method was demonstrated by Barto and Singh [16].

The adaptive control scheme proposed in this paper is a direct method for solving the learning control problem. A stochastic neural net model (a GLA) is used to decide the control strategy. A deterministic neural network is used for the adaptive critic. The function of the adaptive critic is to estimate the state-action values (as used in [16]) and generate a suitable reinforcement signal for the GLA (the controller). This method can be considered a policy iteration method, as opposed to the value iteration suggested in [16], [17]. The main result of the paper is a convergence proof for the GLA learning algorithm which ensures that asymptotically an optimal action is chosen in each state.

The rest of the paper is organized as follows. The problem of adaptive control of a finite-state Markov chain is described in Section II. The controller and the adaptive critic, along with the learning algorithms, are described in detail in Section III. In Section IV, we present the analysis of the learning algorithm. Section V provides a discussion of the algorithm, and simulation results are presented in Section VI.
II. PROBLEM OF CONTROL OF MARKOV CHAINS

Consider the process $\{X_n, \alpha_n\}$, where $X_n$ is the state at instant $n$ and $\alpha_n$ is the control action taken at instant $n$ after observing the state $X_n$. Let $X_n = i$ and $\alpha_n = a$. Then the following events occur.
1) The next state $X_{n+1} = j$ is chosen with transition probability $p_{ij}(a)$.
2) For the above transition, a random payoff $r_n$ is obtained, which depends on the states $i, j$ and the action $a$. All payoff values are bounded. Let $R(i, a, j)$ denote the expected payoff for the transition from $i$ to $j$ when the action is $a$.

The controls are to be chosen based on some policy, denoted by $\pi$. A policy is said to be “stationary” if the action taken at instant $n$ depends only on $X_n$ and not directly on $n$. A stationary policy in which actions are chosen based on a fixed probability distribution over the action space, for each state, is called a stationary random policy (SRP). A stationary pure policy (SP) assigns a fixed action to each state. The expected total discounted payoff for each state $i$, under a policy $\pi$, is defined as
$$V_\pi(i) = E_\pi\Big[\sum_{n=0}^{\infty} \beta^n r_n \,\Big|\, X_0 = i\Big]$$
where $\beta \in (0, 1)$ is the discount factor and $E_\pi$ denotes expectation with respect to the policy $\pi$. The optimal payoff value in any state is defined as, for each $i$,
$$V^*(i) = \max_\pi V_\pi(i).$$
This function is called the optimal value function. Any policy $\pi^*$ such that $V_{\pi^*}(i) = V^*(i)$ for all $i$ is an optimal policy.

Theorem 1:
(a) The optimal value function satisfies the following set of equations. For each $i$,
$$V^*(i) = \max_a \Big[ R(i, a) + \beta \sum_j p_{ij}(a) V^*(j) \Big] \qquad (1)$$
where $R(i, a) = \sum_j p_{ij}(a) R(i, a, j)$ is the average payoff per transition from state $i$ with action $a$.
(b) For a given transition and reward structure and a fixed $\beta$, the optimal value function is unique.
(c) There exists a stationary pure policy which is optimal.
Proof: See [3].

Definition 1: For each state $i$ and action $a$ we define the state-action value $Q^*(i, a)$ as the expected infinite-horizon total discounted payoff if action $a$ is performed in the initial state $i$ and an optimal policy is followed thereafter. We denote the state-action value vector by $Q^*$. Using (1) we get, for each $(i, a)$,
$$Q^*(i, a) = R(i, a) + \beta \sum_j p_{ij}(a) V^*(j). \qquad (2)$$

Lemma 2: For the specified Markov chain control problem, for each state $i$, we have
$$V^*(i) = \max_a Q^*(i, a). \qquad (3)$$
Proof: Follows immediately from (2) and (1).
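When the transition and payoff structure is known, (1)–(3) can be verified numerically by successive approximation. The sketch below is a minimal value-iteration routine in Python; it is not part of the learning scheme of this paper (which assumes the model is unknown), and the array layout, names, and example numbers are illustrative assumptions only.

```python
import numpy as np

def optimal_values(P, R, beta, tol=1e-10):
    """Compute V* and Q* for a known finite MDP by value iteration.

    P[a, i, j] : transition probability from state i to j under action a.
    R[a, i, j] : expected payoff for that transition.
    beta       : discount factor in (0, 1).

    The loop iterates the right-hand side of (1); Q* is then given by (2),
    and V*(i) = max_a Q*(i, a) as in (3).
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, i] = expected one-step payoff + discounted value of the next state
        Q = (P * (R + beta * V)).sum(axis=2)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

# Illustrative 2-state, 2-action example (numbers are arbitrary).
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.9, 0.1]]])
R = np.array([[[1.0, 0.0], [0.5, 0.0]],
              [[0.0, 2.0], [0.0, 0.3]]])
V_star, Q_star = optimal_values(P, R, beta=0.9)
print(V_star)                 # optimal value function, one entry per state
print(Q_star.argmax(axis=0))  # a stationary pure policy attaining it
```

Note that `Q_star` is stored with the action index first, so `Q_star[a, i]` corresponds to $Q^*(i, a)$ in the notation above.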
III. NEURAL NET MODEL AND LEARNING ALGORITHM

The block diagram of the learning system is shown in Fig. 1. The learning system has two major parts: the controller and the adaptive critic. The controller consists of a Generalized Learning Automaton (GLA), and it makes a stochastic choice of action in each state (supplied as its context input) based on its internal parameter vector. This parameter vector is updated at each instant using a scalar reinforcement signal supplied by the adaptive critic network. The adaptive critic network maintains estimates of the state-action values and supplies a reinforcement signal for the controller.

A. Controller Network

The structure of the controller (a GLA) is as follows. It has one unit corresponding to each action in the finite action set $A = \{a_1, \ldots, a_r\}$. The $j$th unit of the controller has a weight vector $w_j$, and $W = (w_1, \ldots, w_r)$ denotes the complete set of weights in the controller network. The context vector input to the GLA is a representation of the state of the chain at that instant. We represent the $i$th state by the $i$th unit vector (i.e., a vector the $i$th component of which is unity and all others are zero), which will be denoted by $x_i$. The controller generates an action in state $i$ stochastically according to the law
$$\Pr[\alpha_n = a_j \mid X_n = i] = g_j(x_i, W), \qquad j = 1, \ldots, r \qquad (4)$$
where the probabilities $g_j(x_i, W)$ are determined by the unit outputs
$$f(w_j^{\top} x_i), \qquad j = 1, \ldots, r \qquad (5)$$
and $f$ is the activation function, a continuous, strictly increasing sigmoid [cf. condition (C2) in Section IV]. From (4) it follows that each value of the parameter vector $W$ determines a specific stationary random policy.
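The following is a minimal sketch of such a controller. Since the states are coded by unit vectors, the net input of the $j$th unit in state $i$ is just the $i$th component of its weight vector. Normalizing the sigmoid unit outputs into a probability distribution is an assumption made for this sketch; the paper specifies the action-generation law only through (4) and (5).

```python
import numpy as np

def sigmoid(u):
    # A continuous, strictly increasing activation with values in (0, 1),
    # consistent with condition (C2); the paper's exact f is assumed here.
    return 1.0 / (1.0 + np.exp(-u))

def action_probabilities(W, state):
    """Action probabilities of the GLA controller in `state`.

    W is an (n_actions x n_states) array; row j holds the weight vector of
    the jth action unit.  Because the context vector is the unit vector of
    the current state, each unit's net input is W[j, state].  The sigmoid
    outputs are normalized into a distribution (an assumption, see above).
    """
    outputs = sigmoid(W[:, state])
    return outputs / outputs.sum()

def choose_action(rng, W, state):
    return rng.choice(W.shape[0], p=action_probabilities(W, state))

# Usage: with all weights zero the controller picks actions uniformly.
rng = np.random.default_rng(0)
W = np.zeros((2, 2))
print(action_probabilities(W, state=0))   # [0.5, 0.5]
print(choose_action(rng, W, state=0))
```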
B. Adaptive Critic Network

The function of the adaptive critic network is to update the estimates of the state-action values after observing the state and payoff. The network structure is shown in Fig. 2(a) for the simple case when there are only two states and two admissible actions in each state. The network has two layers, and all the units have deterministic input-output functions. The first layer of the network has one unit corresponding to each state-action pair, and its output is the current estimate of that state-action value. The inputs to the unit consist of a) the immediate payoff, b) its own output as a feedback, and c) a feedback from the second layer [which, to avoid complexity, is not shown in Fig. 2(a)].

Fig. 2. (a) Critic network and (b) comparator network.

Each unit of the first layer is of sample-and-hold type, activated by an input from a decoder that activates only one unit at any instant, depending on the current state and action. Only the unit that is activated updates its estimate. A cluster in the first layer consists of all units corresponding to the same state. The second layer has one comparator network corresponding to each cluster in the first layer. The function of the comparator network is to choose the maximum of all its inputs. Fig. 2(b) shows the comparator network for the two-input case; it can be suitably extended to a larger number of inputs [19]. The output from the second layer of the adaptive critic net is given as feedback to the first-layer units. If the next state after the transition is $j$, then $\max_a Q(j, a)$ is the feedback from the second layer to the first-layer unit that is activated.

C. Learning Algorithm

At instant $n$, let $W_n$, $Q_n$, and $X_n$ represent the weight vector, the state-action value estimates, and the current state, respectively. At $n = 0$, $W_0$ and $Q_0$ can be arbitrary.
(i) The controller generates an action $\alpha_n$ according to the law described in (4).
(ii) Let $X_{n+1} = j$ be the next state of the Markov chain. This transition occurs with probability $p_{ij}(\alpha_n)$, where $i = X_n$, and generates a random payoff $r_n$.
(iii) The adaptive critic supplies a reinforcement $s_n$ to the controller, given by (6): $s_n = 1$ if the chosen action $\alpha_n$ attains the maximum of the current state-action value estimates $Q_n(X_n, \cdot)$ in state $X_n$, and $s_n = 0$ otherwise.
(iv) The controller weights are updated as follows. At instant $n$, only the weight vector $w_{\alpha_n}$ (which corresponds to the action $\alpha_n$) is modified. We use a penalizing term in the weight update to ensure that the weights are kept inside a compact set (see Remarks 2 and 3). For this purpose we define in (7) a penalty function whose derivative vanishes on the interval $[-L, L]$, for some constant $L$, and is nonzero outside it. The weight update is then given by (8) and (9): the weight vector of the chosen unit is moved in the direction indicated by the reinforcement $s_n$, together with the derivative of the penalty term, with learning parameter $b > 0$.
(v) The updating procedure for the state-action value estimates is
$$Q_{n+1}(i, a) = \begin{cases} Q_n(i, a) + \tilde{b}\,\big[r_n + \beta \max_{a'} Q_n(j, a') - Q_n(i, a)\big], & \text{if } (i, a) = (X_n, \alpha_n)\\ Q_n(i, a), & \text{otherwise} \end{cases} \qquad (10)$$
where $\tilde{b} > 0$ is the learning parameter and $\beta$ is the discount factor. Thus, only the component corresponding to the current state-action pair is updated.

The above steps (i)–(v) are repeated at each time instant. Note that the constant $L$ in the definition of the penalty function is a parameter of our learning algorithm.

Remark 1: The updating in (8) is a reinforcement learning algorithm of the reward-penalty type (similar to the algorithm of [5]) if the penalizing term in (9) is not considered. The motivation for (10) for the estimates is obvious from (2). This equation is the same as that used by Watkins for estimating the state-action values [18], except that the step size parameter used here is a constant instead of decreasing with $n$. It is shown in [18] that the algorithm converges, under the standard stochastic approximation assumptions on the step size parameter, if the controller samples each action infinitely often. In the next section we show the convergence of the coupled dynamics of $W_n$ and $Q_n$ evolving according to the learning algorithm presented above.

Remark 2: From the algorithm it follows that the weights remain bounded inside a compact set. In fact, if a weight component moves outside the box $[-L, L]$, then in (9) the penalty term dominates the other terms, whose absolute value is bounded by unity, and the algorithm brings the weight back inside the box. Define the set in (11) as a box slightly larger than $[-L, L]$ in each weight component, for an arbitrary constant margin; it is now easy to see that if the weights start inside this set they remain inside it. The boundedness of the Q-estimates can also be established: assuming the payoffs and the initial estimates are bounded, the algorithm ensures that all subsequent estimates remain within a corresponding bound, and the proof is by induction on $n$.

IV. ANALYSIS

A. Qualitative Overview of the Analysis

Here we summarize at a qualitative level the analysis of the learning algorithm. The next two subsections present detailed proofs of the various theorems. The learning algorithm presented in the previous section can be represented in the generic form
$$z_{k+1} = z_k + b_k\, H(z_k, \xi_k)$$
where $z_k = (W_k, Q_k)$ denotes the estimates at step $k$. We are interested in the asymptotic behavior of the sequence of estimates $\{z_k\}$. Here $\xi_k$ is the set of observations at time step $k$ which is used to update $z_k$. In general, $\xi_k$ is a random variable with some distribution which is not known to the learning system, and it could be a function of the previous estimates. The function $H$ represents the updating specified in the learning algorithm; it is a deterministic nonlinear function satisfying certain regularity conditions (some of which are evident from the learning algorithm and others are due to assumptions on the payoff values, etc.). $\{b_k\}$ is a sequence of positive scalars, called the “gain sequence,” which specifies the step size for each iteration.
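The analysis below relates this recursion to an associated ordinary differential equation (the so-called ODE method of [20], [21]). As a toy illustration of that idea (deliberately much simpler than the coupled controller-critic recursion of this paper), the sketch below runs a scalar stochastic approximation with a small constant gain and compares it with the solution of its averaged ODE; the drift, noise level, and gain are arbitrary choices made only for illustration.

```python
import numpy as np

# Scalar constant-gain stochastic approximation:
#   x_{k+1} = x_k + b * (theta - x_k + noise_k),
# whose averaged drift is h(x) = theta - x, so the associated ODE is
#   dx/dt = theta - x,  with solution x(t) = theta + (x0 - theta) * exp(-t).
# The point is only that, for small b, the iterates stay close to the ODE
# solution on the interpolated time scale t = k*b.

rng = np.random.default_rng(1)
theta, x0, b, n_steps = 2.0, 0.0, 0.01, 2000

x, iterates = x0, []
for k in range(n_steps):
    noisy_drift = (theta - x) + rng.normal(scale=0.5)
    x += b * noisy_drift                 # constant step size b
    iterates.append(x)

t = b * np.arange(1, n_steps + 1)        # interpolated time, cf. (13)
ode = theta + (x0 - theta) * np.exp(-t)  # deterministic ODE trajectory

print(np.max(np.abs(np.array(iterates) - ode)))  # small for small b
print(iterates[-1])                              # settles near theta
```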
In stochastic approximation type algorithms [20], the gain sequence $b_k \to 0$ as $k \to \infty$. In tracking problems, the gain is usually held constant. In our algorithm, we maintain the gain at a small positive constant value. We first derive an associated ordinary differential equation (ODE) for a gain sequence decreasing to zero and then extend the results to derive conclusions on the long-term behavior of the algorithm in which the gain is fixed at a small positive value.

For a gain sequence decreasing to zero, the step size becomes small as $k$ increases, and hence for large $k$ the estimates change very little over any finite number of iterations. Due to the small step size, for large $k$ the observation process can be studied assuming that the estimates remain constant at the value $z_k$. For our algorithm, this observation sequence is a Markov chain for fixed estimates, and it is irreducible with a stationary distribution dependent on the estimates (see Lemma 5). We can now further extend our arguments and write the increments of the algorithm as
$$z_{m+1} - z_m \approx b_m\big[\bar{H}(z_k) + M_m\big], \qquad m \geq k$$
where $\bar{H}(z)$ is the update function averaged with respect to the stationary distribution of the observations for fixed $z$, and $M_m$ is a zero-mean random variable which could depend on $z_k$. (See Lemma 6 and Lemma 7 for the averaged functions in our algorithm.) Now, using the recursive updating algorithm, we can approximate the cumulative change of the estimates over many small steps by the averaged drift alone. This approximation suggests [20] that, by a suitable continuous-time interpolation of the iterates of the algorithm, the evolution of the algorithm can be approximated by the solution of the ODE
$$\dot{z} = \bar{H}(z)$$
with the same initial conditions as the algorithm. This associated ODE is easier to handle since it is deterministic. The heuristic argument presented above can be made precise [21]. Theorem 8 formally proves this result and derives the ODE for our algorithm [given by (23)–(24)]. The following two results from [20] formally summarize the relationship between the iterates of the learning algorithm and the solutions of the associated ODE. Let $z_k$ denote the iterate at step $k$ and let $z(t)$ denote the solution of the associated ODE.

Theorem 3: Given any finite time horizon and any required accuracy, for all sufficiently small values of the gain the iterates $z_k$ remain, with high probability, within the required accuracy of $z(kb)$ over the whole horizon.

Thus, the algorithm tracks the ODE for as large a time as required, provided the learning parameter is small. Further, whenever the ODE is globally stable with a unique stable equilibrium point $z^*$, the following result from [20] characterizes the asymptotic behavior of the algorithm.

Theorem 4: Given any required accuracy, for all sufficiently small values of the gain the iterates $z_k$ asymptotically remain within the required accuracy of $z^*$ with high probability.

Thus, for a small enough value of the learning parameter, the algorithm tracks its associated ODE asymptotically with the required accuracy. Having obtained the associated ODE, we proceed to analyze the asymptotic behavior of the trajectory of the ODE by constructing a Lyapunov function [22]. We first make the assumption that the Markov decision problem has a unique optimal action in each state of the Markov chain [cf. Assumption (A3)]. We can then show that the associated ODE has a unique asymptotically stable equilibrium point $(W^*, Q^*)$, where $Q^*$ is the true state-action value vector (see Lemma 9). We will show that if the weight vector of the controller network is $W^*$ and the state-action value vector is $Q^*$, then the controller chooses an optimal action in each state of the Markov chain with a high probability. Next, a similar result is derived by relaxing Assumption (A3). We define a surface as in (31) (see Remark 6) such that every point on this surface corresponds to choosing an optimal action with a high probability. We construct a Lyapunov function and show that the trajectories of the ODE asymptotically approach this surface.

B. Derivation of the ODE

Consider the process consisting of the weights, the estimates, the state, the action, and the payoff at each instant. Under our algorithm this is a Markov process. We have (12), which expresses the one-step update of $(W_n, Q_n)$ in terms of this process [the components of the update are as given by (8) and (10)]. Consider the piecewise-constant interpolation defined by (13): on each interval of length equal to the step size, the interpolated process takes the value of the corresponding iterate, where $b$ is the step size.

Assumptions:
(A1) All the payoffs resulting from state transitions are bounded.
(A2) For any stationary pure policy, the Markov chain is irreducible.

It may be noted that:
(C1) $\{x_1, \ldots, x_N\}$ is an orthonormal set and spans the state representation space, where each $x_i$ is the context vector corresponding to the $i$th state of the Markov chain.
(C2) The activation function $f$ for each unit is continuous, strictly increasing, and takes values in $(0, 1)$.

Lemma 5: Let the weight vector be fixed for all time. Under the assumptions (A1) and (A2):
(i) the Markov chain $\{X_n\}$ is homogeneous and irreducible;
(ii) the unique invariant distribution of the chain assigns every state a probability bounded below by a positive constant, uniformly over all weight vectors, as expressed in (14);
(iii) the invariant distribution is continuous in the weight vector $W$.
Proof: See Appendix.

Lemma 6: Let $W_n = W$ and $Q_n = Q$ be fixed for all $n$. Then, for the given updating procedure in (8) and (9), the averaged drift of the weight update is given by (15), where the expectation is with respect to the unique invariant distribution of the Markov chain for the fixed weight vector $W$, the action probabilities are as defined in (5), and the averaged reinforcement $\bar{h}(i, a, Q)$ is defined in (16)–(17): by (17) it equals 1 if action $a$ attains the maximum of $Q(i, \cdot)$ and 0 otherwise.
Proof: The proof is through simple algebraic manipulations and is provided in [23]. Note that $\bar{h}$ in (17) is independent of $W$ because, given the state $i$, the estimate $Q$, and the action $a$, the reinforcement is independent of the weight vector $W$.

Lemma 7: Let $W$ and $Q$ be fixed. Then the averaged drift of the state-action value estimates is given by (18), where the expectation is with respect to the unique invariant distribution of the Markov chain with the weight vector fixed at $W$ and the action probabilities as defined in (5); it can be written as a diagonal matrix of (positive) state-action occupation probabilities times the (doubly indexed) vector given by (19)–(20), whose $(i, a)$th component is $R(i, a) + \beta \sum_j p_{ij}(a)\max_{a'} Q(j, a') - Q(i, a)$, with $R(i, a)$ as defined in (1).
(ii) Let $Q^*$ denote the true state-action value vector for a given reward and transition structure. Then, for any weight vector $W$, the averaged drift in (18) is zero iff $Q = Q^*$.
Proof: Refer to [23] for the proof.

Theorem 8: The sequence of interpolated processes defined by (13) converges weakly, as the step size tends to zero, to a limit which satisfies the ordinary differential equation (ODE) given by (21)–(22), where the right-hand sides are the averaged versions of the updates in (12), the expectation being with respect to the unique invariant measure of the observation process for fixed $(W, Q)$; the averaged weight drift is given by (15) and the averaged estimate drift by (18).
Proof: See Appendix.

C. Analysis of the ODE

Using Lemma 6 and Lemma 7, we write the equivalent ODE explicitly as (23)–(24): (23) describes the evolution of the weights under the averaged reinforcement and penalty terms, and (24) describes the evolution of the state-action value estimates under the averaged drift of Lemma 7.

From Lemma 7(ii) it follows that (24) has a unique zero $Q^*$ (the true state-action value vector). In the analysis to follow we first make an extra assumption, (A3), given below. Later on we see how assumption (A3) can be relaxed.

Assumption (A3): For each state of the Markov chain the optimal action is unique, i.e., for each $i$ there exists a unique action $a^*(i)$ such that $Q^*(i, a^*(i)) > Q^*(i, a)$ for all $a \neq a^*(i)$.

Now define, for each state $i$, the margin by which the optimal action's value exceeds that of every other action, as in (25), and let the overall margin be the minimum of these quantities, as in (26). By assumption (A3) this overall margin is positive, and [using (17)] for any $Q$ close enough to $Q^*$ in the max norm, as quantified through (26), the action maximizing $Q(i, \cdot)$ is the optimal action in every state $i$, so the averaged reinforcement is the same as at $Q^*$.

Lemma 9: Under the assumptions (A1)–(A3), there exists a unique $W^*$ such that $(W^*, Q^*)$ is the unique zero (stationary point) of the ODE (23)–(24).
Proof: It follows from Lemma 7(ii) that $Q^*$ is the unique stationary point of (24), for any $W$. By Lemma 5, the invariant probability of every state is strictly positive. By (C1), the context vectors are orthonormal, so $W$ is a zero of (23) (with $Q = Q^*$) if and only if, for each state $i$ and action $a_j$, the corresponding weight component $w_j^{\top} x_i$ satisfies the scalar equation (27), which balances the averaged reinforcement term against the penalty term.

First observe that there cannot be any solution of (27) at which both terms vanish, because the averaged reinforcement takes only the values 0 or 1 for any $i$ and $a$ [see (17)]. Assume that $a_j$ is the optimal action in state $i$. One side of (27) is a strictly decreasing function of the weight component [by (C2)], while the other side is strictly increasing (note that the penalty derivative has the same sign as the weight component only outside the interval $[-L, L]$). Hence there is a unique weight component satisfying (27), and it lies above $L$. A similar argument holds when $a_j$ is not optimal in state $i$, but then the unique solution of (27) lies below $-L$. Since the above arguments hold for each state-action pair, we conclude that there exists a unique $W^*$ such that $(W^*, Q^*)$ is a zero of the ODE (23)–(24).

Remark 3: The existence of the equilibrium point of the ODE, proved in Lemma 9 above, depends crucially on the penalizing term in the learning algorithm (which ensures that the weights are bounded inside a compact set) and on the fact that the activation function $f$ is strictly increasing.

Remark 4: In the remaining part of the proof we need the notion of Dini derivatives (see [22]), which are briefly explained in this remark. Any continuous real-valued function defined over the real line has four Dini derivatives at each point of the real line. The right upper Dini derivative at a point is defined as the lim sup, over increments decreasing to zero from the right, of the corresponding difference quotients. Similarly we can define the right lower (as a lim inf), left upper (with the increment increasing to zero from the left), and left lower derivatives. If a function is differentiable at a point, then all four Dini derivatives have the same value as the derivative of the function at that point. A continuous function is strictly decreasing on an interval if its right upper Dini derivative is negative at every point of the interval [22, ch. II, Theorem 6.2].

Lemma 10: Let $V_1$ be the functional defined by (28), where $Q^*$ is the true state-action value vector; $V_1$ measures the distance of the current estimates from $Q^*$. Then, for any weight vector $W$, the right upper Dini derivative of $V_1$ is nonpositive along the trajectory of the ODE (23)–(24), for any initial condition.
Proof: See Appendix.

Now, for each state-action pair, let the equilibrium weight component be the unique solution of (27), so that $W^*$ is the unique zero of (23). Then we have the following result.

Lemma 11: Under (A1)–(A3), for all $W \neq W^*$ and for any $Q$ sufficiently close to $Q^*$ [as quantified through the margin defined in (25)], the distance of the weights from $W^*$ is strictly decreasing along the trajectory of the ODE (23)–(24).
Proof: See Appendix.

Now consider the following Lyapunov function candidate for the coupled ODE (23)–(24), defined in (29): a weighted combination of $V_1$ and of a term measuring the distance of the weights from $W^*$, with the weighting constant specified in Lemma 12 below. From Lemma 10, the contribution of the estimates to its Dini derivative is nonpositive for any $W$ and $Q$. From Lemma 11, when $W \neq W^*$ and the estimates are close enough to $Q^*$, the contribution of the weights is strictly negative. The following lemma establishes strict negativity of the derivative in the remaining case.

Lemma 12: Consider the function given by (29), with the weighting constant chosen in terms of the discount factor $\beta$, the lower bound on the invariant probabilities defined in (14), the total number of actions, the margin defined in (25), and the constant $L$ used in the penalty function in (7). Then, under assumptions (A1)–(A3), the right upper Dini derivative of this function along the ODE is strictly negative whenever $(W, Q) \neq (W^*, Q^*)$.
Proof: See Appendix.

Theorem 13: Under the assumptions (A1)–(A3), the ODE (23)–(24) has a unique stationary point $(W^*, Q^*)$, and if the initial condition is such that the weights lie in the compact set of (11), then the trajectories of the ODE asymptotically converge to $(W^*, Q^*)$.
Proof: See Appendix.

Remark 5: Since the stationary point is globally asymptotically stable, in view of Theorem 4 we have that $(W_k, Q_k)$ remains close to $(W^*, Q^*)$ for all large $k$, with a high probability depending on the gain parameter. By the continuity of the action probability function it follows that the action probabilities are close to those determined by $W^*$ for $k$ sufficiently large, with a high probability. It follows from the proof of Lemma 9 that each equilibrium weight component is greater than $L$ or less than $-L$ according as the corresponding action is or is not optimal in that state. Thus, by choosing $L$ large, the probability of choosing the optimal action can be made as large as desired, with a high probability.

D. Relaxing Assumption (A3)

When (A3) is not satisfied, more than one action could be optimal in some states. For each state, consider the set of actions that are optimal in that state. If some state has at least one nonoptimal action, define for that state the margin by which the best nonoptimal action falls short of the optimal value, and let the overall margin be the minimum of these quantities, as expressed in (30).
Remark 6: For any state $i$, an action contributes to the margin in (30) if and only if it is not optimal in state $i$. If the true state-action value vector is such that every action is optimal in every state, then any weight vector of the controller is optimal. Suppose instead that (A3) is satisfied; then the sets of optimal actions are singletons and give the optimal policy function. These two are the extreme cases. Now consider a Markov control problem with two states and three actions, where the transition and reward structure is such that actions 1 and 2 are both optimal in state 1 and only action 1 is optimal in state 2. In such problems, we desire that the probabilities of nonoptimal actions in any state be close to zero. The relative distribution of probability among the optimal actions in a state is of no interest. Since the algorithm ensures that all the weights remain bounded, we need to analyze the convergence of the weight components alone. This motivates the following definition of the surface and the ensuing analysis. Since, by (C1), the context vectors are orthonormal unit vectors, the probability of choosing an action in state $i$ is determined only by the weight components corresponding to state $i$. For each state-action pair, let the equilibrium weight component be the unique solution of the corresponding equation (cf. Lemma 9). Let the surface be defined as in (31): the set of all points at which $Q = Q^*$ and every weight component corresponding to a nonoptimal action equals its equilibrium value, the remaining components being arbitrary (within the compact set). We consider the solutions of the ODE (23)–(24) approaching this surface [instead of a unique point, as is the case under assumption (A3)]. Define the functions in (32) and (33) analogously to the Lyapunov functions used under (A3), with the distance of the weights from $W^*$ replaced by the distance of the relevant weight components from their equilibrium values, for some suitable constant. Then we have the following result.

Theorem 14: Under the assumptions (A1)–(A2), the trajectories of the ODE (23)–(24) approach the surface defined in (31) asymptotically, for all initial conditions such that the weights lie in the compact set of (11).
Proof: See Appendix.

V. DISCUSSION

In this paper we considered the problem of reinforcement learning in a dynamic environment. Viewed as a learning problem, there are two major difficulties here. The reinforcement available from the environment at any instant is not directly evaluative of the action at that instant, because our objective is to optimize the total expected discounted payoff. Further, the state changes of the environment are affected by the actions of the learning system, which complicates the credit assignment problem.

The approach we used is based on the notion of an adaptive critic. Adaptive critics are discussed in the context of temporal credit assignment by Barto and his colleagues [2], [9]. A recent review of adaptive critic methods for control is [17]. The main idea in this approach is that the critic should be able to supply the learning element with a reinforcement signal that is more meaningful than the immediate payoff from the environment.
In the Markov control problem, estimating the state-action values is the natural choice for designing the critic. Using such an estimation scheme for the critic and a simple associative learning algorithm for the learning element, we were able to prove global convergence of the overall system. Given that the estimates converge to the true values in the case when there is no learning and all actions are chosen infinitely often, we can obtain such convergence for the overall system with the learning algorithm by keeping enough exploratory behavior in the learning element. This basic idea of the proof is the same as that of estimator algorithms for LA [4].

The learning algorithm used to update the parameters of the controller is very similar to the algorithm used in [5] for a pattern classification problem with linearly independent patterns. An associative reward-inaction type algorithm was suggested in [7], and it was proved that the algorithm has a stochastic hill-climbing property. The environment was assumed to be stationary and the reinforcement immediate in all these works. Sutton has recently proposed a class of algorithms called temporal difference (TD) algorithms [24] which can handle delayed reinforcement. The TD algorithms have much in common with ideas in dynamic programming [17] and can be used for the Markov decision problems considered here (e.g., for estimating the Q-values under a fixed policy). Barto et al. [16] suggested an adaptive control scheme using the estimates of state-action values. The use of state-action values was first suggested by Watkins [18]. The action generation was based on the method proposed by Sato et al. [14], which ensures that all actions are generated infinitely often. This method requires the estimation of state-transition probabilities. Later, in [17], they suggested a different algorithm for action generation, which is similar to our rule in (4)–(5). It is proved by Watkins and Dayan that the estimates for the state-action values converge to their true values, under the standard stochastic approximation conditions, if all actions are chosen infinitely often [18].

The analysis presented here makes use of the special features of the Markov control problem only in getting the right estimates for the adaptive critic. Hence we hope to extend this approach to tackle a general class of reinforcement learning systems using adaptive critics in our future work. A shortcoming of the algorithm presented here is that it can handle only finite Markov chains, because it has to maintain explicit estimates of the state-action values. A challenging open problem is to design such reinforcement learning systems to tackle more general Markov decision problems.

VI. SIMULATION RESULTS

We give below simulation results for a Markov chain control problem (taken from [15]) with two states and two actions in each state; the transition and reward matrices are those given in [15].
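The sketch below shows how such a run could be set up in code. It is a hedged illustration, not a reproduction of the reported experiments: the transition and reward matrices are placeholder values (not those of [15]), the action probabilities are assumed to be normalized sigmoid outputs, the reinforcement from the critic is assumed to be the indicator that the chosen action is currently greedy for the Q-estimates, and the penalty term of (7)–(9) is replaced by simple clipping of the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder model (NOT the matrices of [15]): P[a, i, j] and mean payoffs.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
R_mean = np.array([[[1.0, 0.2], [0.4, 0.1]],
                   [[0.3, 0.0], [0.2, 0.9]]])

n_actions, n_states, _ = P.shape
beta, b, L = 0.9, 0.01, 10.0            # discount, step size, weight bound
W = np.zeros((n_actions, n_states))     # controller (GLA) weights
Q = np.zeros((n_states, n_actions))     # critic's state-action estimates

def action_probs(i):
    # Assumed concrete form of (4)-(5): normalized sigmoid unit outputs.
    out = 1.0 / (1.0 + np.exp(-W[:, i]))
    return out / out.sum()

i = 0                                   # initial state
for n in range(30_000):
    a = rng.choice(n_actions, p=action_probs(i))      # step (i)
    j = rng.choice(n_states, p=P[a, i])               # step (ii): transition
    r = R_mean[a, i, j] + rng.uniform(-0.25, 0.25)    # noisy payoff

    s = 1.0 if Q[i, a] >= Q[i].max() else 0.0         # step (iii), assumed form

    # Step (iv): push the chosen action's probability toward s (a simple
    # reward-penalty flavour); clipping stands in for the penalty term
    # that keeps the weights inside a compact set.
    W[a, i] = np.clip(W[a, i] + b * (s - action_probs(i)[a]), -L, L)

    # Step (v): constant-step Q-learning update of the activated critic unit.
    Q[i, a] += b * (r + beta * Q[j].max() - Q[i, a])

    i = j

print(action_probs(0), action_probs(1))   # action probabilities per state
print(Q)                                  # learned state-action estimates
```

With a layout like this, plots such as Figs. 3–5 can be produced by recording the action probabilities at regular intervals during the run.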
Fig. 3. Results of Simulation-1.
It can be easily shown [3], [11] that it is optimal to take action 1 in both the states.

Simulation-1: In the first simulation, the initial weights as well as the initial values of the estimates of the state-action values are set to zero. Rewards for the state transitions were generated with uniform probability over an interval of width 0.5 centered at the corresponding entry of the reward matrix. Fig. 3 shows how the probability of action 1 in each state varies with the number of control actions performed in a typical run. Typical values of the action probabilities and of the state-action value estimates were recorded after 32 500 control actions were performed.
Simulation-2: In this simulation the initial weight values of the controller are zero, but the initial values of the estimates are (given in the same order as above): 1, 4; 0, 1. (These values of the estimates make action 2 appear optimal in both states.) Fig. 4 shows the plot of the action probabilities against the number of control actions. It can be observed that the probability of action 1 decreases initially and then increases to 0.9. After 30 000 iterations the estimates are (in the same order as above) 6.013, 5.729; 6.508, 6.257, so that the larger estimate in each state again corresponds to action 1. This corroborates the analytical result that the learning algorithm converges globally.

Fig. 4. Results of Simulation-2.

Simulation-3: This example also has two states and two actions.
The transition probabilities and the reward matrix are chosen so that both actions are optimal in state 1 and only action 2 is optimal in state 2; the true state-action values are found analytically. All the other parameters have the same values as in Simulation-1. The plot in Fig. 5 shows that the probability of action 1 in state 2 converges to zero, while the probability of action 1 in state 1 oscillates between 0 and 1 (this is because of the finite nonzero step size and the fact that both actions in state 1 are optimal). This agrees well with the conclusions of Theorem 14.

Fig. 5. Results of Simulation-3.

We also performed simulations of a Markov chain control problem with five states and three actions (this example is in [16]). The probabilities of the optimal actions were recorded after 20 000 iterations, and the simulation was continued up to 80 000 iterations.
VII. CONCLUSIONS

In this paper the problem of learning in dynamic environments has been considered. The problem of control of Markov chains is analyzed in this perspective. A Generalized Learning Automaton model is used as the controller, and a deterministic neural net is used as the critic to estimate some relevant parameters for finding the optimal policy. An associative reward-penalty algorithm is used to update the controller parameters, and it is shown that the controller asymptotically approximates the optimal policy with as high a probability as desired.
APPENDIX

Proof of Lemma 5: Part (i) follows from Assumption (A2).
To prove part (ii): If we consider any stationary pure policy, then by (A2) the corresponding invariant distribution assigns strictly positive probability to every state. Since we have only a finite number of pure policies, these probabilities are bounded below by a positive constant over all states and all pure policies. Now, using the result in [25, ch. 5, Lemma 1.2], we get (14) as a special case.
To prove part (iii): Let $P_W$ denote the transition probability matrix of the chain for a fixed weight vector $W$ (i.e., when actions are chosen according to the stationary random policy determined by $W$). Since the action probabilities are continuous in $W$ [cf. (5)], $P_W$ is continuous in $W$. Now consider any sequence of weight vectors converging to $W$. The invariant distributions corresponding to the transition matrices in this sequence belong to a compact set (the probability simplex). Let $\eta$ be any cluster point of this sequence of invariant distributions. Since $P_W$ is continuous in $W$, and each distribution in the sequence is invariant for the corresponding transition matrix, passing to the limit shows that $\eta$ is invariant for $P_W$. By uniqueness of the invariant distribution, $\eta$ must equal the invariant distribution corresponding to $W$. As the cluster point above is arbitrary, the continuity of the invariant distribution follows.

Proof of Theorem 8: This is a particular case of a general result of Kushner [21, Theorem 3.2]. We observe the following with regard to the learning algorithm.
(i) The process consisting of the estimates and the observations is a Markov process, and the values taken by the process lie inside a compact subset of a metric space.
(ii) The update function in (12) is independent of the learning parameter; it is measurable and bounded over every compact set. (Note that the reinforcement term is measurable and bounded, while the remaining terms are continuous.)
(iii) If the estimates are held fixed, then the observation process has a unique invariant distribution. This follows from Lemma 5 and the learning algorithm.
(iv) The ODE in (21) [also given explicitly in (23) and (24)] has a unique solution for every initial condition. To see this, consider any two points in the domain; we will show that the right-hand side of the ODE satisfies a suitable Lipschitz-type bound between them.
Specifically, the difference of the right-hand sides at any two points is bounded by a constant times the distance between the points, where the constant is identified as follows. The averaged weight drift in (21) [which is also given in (15)] has all its components bounded; denote this bound by a constant. From (18) in Lemma 7, the averaged estimate drift is a diagonal matrix with nonzero, bounded entries (note that the invariant probabilities and the action probabilities are bounded) times the vector given by (20), which is a function of $Q$ alone and is itself Lipschitz in $Q$ with some constant. Choosing the overall constant accordingly, the right-hand side of the ODE is measurable and satisfies the conditions of the result in [26, Theorem 10], and hence the uniqueness of the solution follows.

By the observations (i)–(iv) above and the result in [21, ch. 5, Theorem 3.2], it follows that the sequence of interpolated processes converges weakly to the solution of the ODE (21) as the step size tends to zero.

Proof of Lemma 10: We require the following result in the proof of Lemma 10.

Lemma 15: Let $T$ be the operator defined by
$$(TQ)(i, a) = R(i, a) + \beta \sum_j p_{ij}(a)\, \max_{a'} Q(j, a') \qquad (34)$$
for each state $i$ and action $a$. Then $T$ is a contraction operator under the maximum norm, and $TQ^* = Q^*$, where $Q^*$ is the true state-action value vector.
Proof: Let $Q_1$ and $Q_2$ be any two vectors. Then, for each $(i, a)$,
$$|(TQ_1)(i, a) - (TQ_2)(i, a)| \le \beta \sum_j p_{ij}(a)\, \big|\max_{a'} Q_1(j, a') - \max_{a'} Q_2(j, a')\big| \le \beta\, \|Q_1 - Q_2\|_\infty.$$
Since $\beta < 1$, it follows that $T$ is a contraction operator. That $TQ^* = Q^*$ follows from Lemma 7(ii).

To complete the proof of Lemma 10, we substitute from (24) and (34) into the expression for the right upper Dini derivative of the functional $V_1$ defined in (28).
Using Lemma 15 and the fact that the discount factor is less than unity, we obtain (35) for small increments of time, and hence the right upper Dini derivative of $V_1$ is nonpositive along the ODE for any weight vector. This proves Lemma 10.

Proof of Lemma 11: Using (23) we get (36), an expression for the Dini derivative of the distance of the weights from $W^*$ as a sum of terms, one for each state-action pair. Since assumption (A3) holds and the estimates are close enough to $Q^*$, from (26) the averaged reinforcement in each term is the same as it would be at $Q^*$. By Lemma 5, the invariant probability of each state is strictly positive. From the proof of Lemma 9 it follows that, in each term, one factor is a strictly decreasing function of the corresponding weight component and the other is strictly increasing, while both have a zero at the equilibrium value. Hence each term in the sum in (36) is strictly less than zero whenever $W \neq W^*$. This proves the lemma.

Proof of Lemma 12: By the properties of Dini derivatives [22], and since the weight-distance term is differentiable, the right upper Dini derivative of the function in (29) along the ODE splits into the contribution of the estimates and the contribution of the weights. From (7) we have a bound on the penalty term (note that the weights remain bounded [cf. (11)]). Noting that the invariant probabilities are bounded below as in (14), we obtain (37), a bound on the contribution of the weights. From the proof of Lemma 10 [see (35)], we have (38), a bound on the contribution of the estimates. For any weight vector, the action probabilities are always strictly positive (by Lemma 5) and bounded by unity (because $f$ is a sigmoid function); hence (39). In the worst case, from (37)–(39), the required strict negativity of the Dini derivative is satisfied if the constant in (29) is chosen as stated in the lemma, using (26). This proves the lemma.

Proof of Theorem 13: The existence of a unique stationary point follows from Lemma 9. From Lemmas 10–12 we infer that, along the trajectories of the ODE (23)–(24), the right upper Dini derivative of the function in (29) is strictly negative (excepting the stationary point). Using the result in Rouche et al. [22, ch. 2, p. 89, Theorem 6.2], this function is a Lyapunov function for the ODE (23)–(24), and hence the trajectories converge asymptotically to the unique stationary point if the initial condition lies in the compact set of (11).

Proof of Theorem 14: We need to show that the function defined in (33) satisfies the conditions of the result in [22, p. 43, Th. 6.33]. We make the following observations. (a) Lemma 10 holds without any modification, i.e., the contribution of the estimates to the Dini derivative is nonpositive for any $W$ and $Q$. (b) A result similar to Lemma 11 holds with the function defined in (32), which measures the distance of the weights from the surface defined in (31), in place of the distance from $W^*$. (c) A result similar to Lemma 12 holds with the function defined in (33) substituted for the function in (29). Following the same arguments as in Lemma 12, we can find a constant such that the required inequality is satisfied. From the above it follows that the Dini derivative of the function in (33) is negative away from the surface, and hence, using the result in [22, p. 43, Th. 6.33], the theorem follows.
REFERENCES
[1] K. S. Narendra and M. A. L. Thathachar, Learning Automata: An Introduction. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[2] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike elements that can solve difficult learning control problems,” IEEE Trans. Syst., Man, Cybern., vol. SMC-13, pp. 835–846, 1983.
[3] S. Ross, Applied Probability Models with Optimization Applications. San Francisco, CA: Holden-Day, 1970.
[4] M. A. L. Thathachar and P. S. Sastry, “Learning optimal discriminant functions through a cooperative game of automata,” IEEE Trans. Syst., Man, Cybern., vol. SMC-17, pp. 73–85, 1987.
[5] A. G. Barto and P. Anandan, “Pattern-recognizing stochastic learning automata,” IEEE Trans. Syst., Man, Cybern., vol. SMC-15, pp. 360–374, 1985.
[6] A. G. Barto, “Learning by statistical cooperation of self-interested neuron-like computing elements,” COINS Tech. Rep. 81-11, Univ. Massachusetts, Amherst, Apr. 1985.
[7] R. J. Williams, “Reinforcement learning connectionist systems,” CCS Tech. Rep. NU-CCS-87-3, Northeastern Univ., Boston, MA, 1987.
[8] V. V. Phansalkar, “Learning automata algorithms for connectionist systems—local and global convergence,” Ph.D. dissertation, Dept. Elect. Eng., Indian Inst. Science, Bangalore, 1991.
[9] R. S. Sutton, “Temporal credit assignment in reinforcement learning,” Ph.D. dissertation, Dept. Comput. Inf. Sci., Univ. Massachusetts, Amherst, 1984.
[10] R. J. Williams and L. C. Baird, “Analysis of some incremental variants of policy iteration: First steps toward understanding actor-critic learning systems,” CCS Tech. Rep. NU-CCS-93-11, Northeastern Univ., Boston, MA, Sept. 1993.
[11] R. A. Howard, Dynamic Programming and Markov Processes. Cambridge, MA: MIT Press, 1960.
[12] P. R. Kumar, “Survey of results in stochastic adaptive control,” SIAM J. Contr. Optim., vol. 23, pp. 329–380, 1985.
[13] P. R. Kumar and W. Lin, “Optimal adaptive controllers for unknown Markov chains,” IEEE Trans. Automat. Contr., vol. AC-27, pp. 765–774, 1982.
[14] M. Sato, K. Abe, and H. Takeda, “Learning control of finite Markov chains with an explicit trade-off between estimation and control,” IEEE Trans. Syst., Man, Cybern., vol. 18, pp. 677–684, 1988.
[15] R. M. Wheeler, Jr. and K. S. Narendra, “Decentralized learning in finite Markov chains,” IEEE Trans. Automat. Contr., vol. AC-31, pp. 519–526, 1986.
[16] A. G. Barto and S. P. Singh, “On the computational economics of reinforcement learning,” in Proc. 1990 Connectionist Models Summer School, D. Touretzky, J. Elman, T. Sejnowski, and G. Hinton, Eds. San Mateo, CA: Morgan Kaufmann, 1990.
[17] A. G. Barto, S. J. Bradtke, and S. P. Singh, “Real-time learning and control using asynchronous dynamic programming,” Tech. Rep. 91-57, Dept. Comput. Sci., Univ. Massachusetts, Amherst, 1991.
[18] C. J. Watkins and P. Dayan, “Q-learning,” Mach. Learn., vol. 8, pp. 279–292, 1992.
[19] R. P. Lippmann, “An introduction to computing with neural nets,” IEEE ASSP Mag., pp. 4–22, Apr. 1987.
[20] A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations. New York: Springer-Verlag, 1987.
[21] H. J. Kushner, Approximation and Weak Convergence Methods for Random Processes. Cambridge, MA: MIT Press, 1984.
[22] N. Rouche, P. Habets, and M. Laloy, Stability Theory by Liapunov's Direct Method (Applied Mathematical Sciences 22). New York: Springer-Verlag, 1977.
[23] G. Santharam and P. S. Sastry, “A reinforcement learning neural network for adaptive control of Markov chains,” Dept. Elect. Eng. Tech. Rep. 87/2/92, Indian Inst. Science, Bangalore, Feb. 1992.
[24] R. S. Sutton, “Learning to predict by the method of temporal differences,” Mach. Learn., vol. 3, pp. 9–44, 1988.
[25] V. S. Borkar, Topics in Controlled Markov Chains (Pitman Research Notes in Mathematics Series 240). Essex, U.K.: Longman, 1991.
[26] A. F. Filippov, “Differential equations with discontinuous right-hand side,” Amer. Math. Soc. Translations, Series 2, vol. 42, pp. 199–231, 1968.
G. Santharam received the B.Sc. degree in physics from Madras University, Madras, India, in 1984, and the M.E. degree in electrical communication engineering and the Ph.D. degree in electrical engineering, both from the Indian Institute of Science, Bangalore, in 1988 and 1994, respectively. He is currently a Senior Software Engineer with Motorola India Electronics Pvt. Ltd., DSP Division, Bangalore. His research interests include speech compression, digital communications (TDMA and CDMA cellular systems), and learning algorithms.
P. S. Sastry (S’82–M’85) received the B.Sc. (Hons.) degree in physics from the Indian Institute of Technology, Kharagpur, in 1978, the B.E. degree in electrical communications engineering, and the Ph.D. degree in electrical engineering from the Indian Institute of Science, Bangalore, in 1981 and 1985, respectively. He is currently an Associate Professor in the Department of Electrical Engineering, Indian Institute of Science. He has held visiting positions at the University of Michigan, Ann Arbor, and General Motors Research Laboratories, Warren, MI. His research interests include reinforcement learning, pattern recognition, neural networks and computational neuroscience.