
Optimal Parameter Trajectory Estimation in Parameterized SDEs: An Algorithmic Procedure

Shalabh Bhatnagar, Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560 012, India. E-Mail: [email protected]

Karmeshu, School of Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi 110 067, India. E-Mail: [email protected]

Vivek Kumar Mishra, Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560 012, India. E-Mail: [email protected]

January 28, 2008


Abstract

We consider the problem of estimating the optimal parameter trajectory over a finite time interval in a parameterized stochastic differential equation (SDE), and propose a simulation-based algorithm for this purpose. Towards this end, we consider a discretization of the SDE over finite time instants and reformulate the problem as one of finding an optimal parameter at each of these instants. A stochastic approximation algorithm based on the smoothed functional technique is adapted to this setting for finding the optimal parameter trajectory. A proof of convergence of the algorithm is presented and results of numerical experiments over two different settings are shown. The algorithm is seen to exhibit good performance. We also present extensions of our framework to the case of finding optimal parameterized feedback policies for controlled SDEs and present numerical results in this scenario as well.

Key Words: Optimal parameter trajectory, parameterized stochastic differential equations (SDEs), simulation optimization, smoothed functional algorithm.

1 Introduction

Stochastic differential equations (SDEs) play a major role in the modeling and analysis of many real life engineering problems such as those arising in communication and wireless networks [28], [14], [24], aerospace systems [30], [26] and portfolio optimization [20]. Problems of control and optimization in systems governed by SDEs [22] arise naturally in many applications and have been studied both from a theoretical perspective as well as from the viewpoint of applications. For instance, [14] considers a dynamic channel model for wireless networks based on SDEs and proposes a new power control algorithm. In [20], the problem of portfolio optimization for finding an optimal investment strategy using a controlled SDE is dealt with. A problem of risk-sensitive control is considered in [23] for a system whose dynamics is governed by a controlled SDE. In [12], a controlled linear SDE is analyzed and certain stabilizability conditions are obtained. Also, in [13], a stochastic gradient algorithm is proposed in order to stabilize a system similar to that in [12]. Control and optimization problems involving SDEs arise in various areas such as location tracking and trajectory prediction of mobile users in wireless networks [24], and optimal tracking of maneuverable targets in space and defense applications [30], [26]. A book treatment on applications of the SDE approach to communications can be found in [28].

In recent times, problems of control and estimation in large scale systems have been studied using techniques based on reinforcement learning [4] that involve a parameterization of the value function and/or policy so as to tackle the curse of dimensionality.

Problems of estimation and control in stochastic systems have also frequently been recast as appropriate parameter optimization problems [25], [19], [1], [8], with a view to finding good estimates of the value function and/or policy in a computationally efficient manner. The above-mentioned references, however, consider the Markov reward process (MRP) or the Markov decision process (MDP) framework for studying these problems. An alternative framework is via parameterized SDEs, where the underlying parameters need to be tuned in order to achieve 'optimal' performance of the system. As an example, consider a slotted Aloha multi-access communication system [2]. In [27], the Markov chain for this system is approximated via a diffusion process (in terms of an SDE) with reflecting boundaries. Further, it is shown that the behavior of backlogged packets corresponds to a stochastic cusp catastrophe. However, it is assumed in [27] that the probability of retransmission is fixed and is given a priori, therefore resulting in a given SDE. One may consider an associated problem involving a class of SDEs parameterized by the probability of retransmission of backlogged packets, now treated as the tunable parameter, where the aim would be to obtain the optimal value of such a parameter. This would help in optimizing network performance, for instance, in terms of minimizing the number of packet collisions and delays on the one hand, and maximizing throughput on the other.

In this paper we propose an algorithmic procedure for finding the optimal parameter trajectory in parameterized SDEs when an expected cost objective over a time interval [0, T] (for some T < ∞) is specified. We assume that the parameter enters the drift term of the SDE. This is similar to the case of controlled SDEs treated in [22] where the control enters the drift term. Our methods would, however, also work in a similar manner for the case where the diffusion term is parameterized as well. We formulate this problem in the simulation optimization setting by considering a discretization of the SDE over a certain finite number of epochs and adapt a smoothed functional algorithm for optimization over this setting. We also present an extension of our approach to the case of controlled SDEs for which the feedback control is parameterized. Our method in such a case provides an optimal control policy within the specified class of policies. While we consider the problem of parameter optimization / parameterized control in SDEs here, the problem of estimation of transient statistical characteristics of a process governed by various classes of random differential equations has also been recently studied in [11].

Simulation optimization [5], [6], [7], [9], [10] is a useful technique for finding optimal parameters when the objective function involves an expectation and is not analytically computable. Most often, gradient search methods are used in such cases. In [5], [6] and [7], two-timescale stochastic approximation (TTSA) is used for simulation optimization. The idea there is that estimation of the expectation proceeds on the faster timescale, and parameter optimization/update proceeds on the slower one. TTSA has been a useful alternative to traditional perturbation analysis [17] and likelihood ratio methods [16], many of which update parameters only at certain regenerative epochs of an underlying process. Using TTSA, one performs parameter updates at regular (deterministic) time instants by making use of the two-timescale structure of the updates.

Amongst gradient estimation techniques, two approaches stand out. One of them that has attracted considerable attention is the simultaneous perturbation stochastic approximation (SPSA) method. The basic idea there is that in order to estimate the gradient, one perturbs the parameter randomly, simultaneously along all component directions (most commonly by using i.i.d. symmetric, Bernoulli random variables). The perturbation random variables in such a case must not (in particular) have any probability mass at zero and must have a finite inverse moment. The resulting gradient estimate requires only two simulations and updates all parameter components simultaneously. The second method, the smoothed functional (SF) technique [29], [32], in fact predates SPSA and is also based on the idea of randomly perturbing all parameter components simultaneously. However, the perturbations used in SF usually have the Gaussian distribution, although the Cauchy distribution may also be used. The original SF algorithm involved only one simulation. A two-simulation, finite-difference, SF-based estimate of the gradient is proposed in [32] and is seen to provide better performance than its one-simulation counterpart. The basic idea in SF is to convolve the gradient of the objective function with a Gaussian density, which can then be viewed as a convolution of a scaled Gaussian with the objective function itself. Note that both SPSA and SF are based on estimating gradients. In [9] and [10], certain Newton-based variants of the SPSA and SF algorithms, respectively, are presented. The Newton-based algorithms in [9] and [10] use three-timescale recursions. In [10], a two-timescale gradient-based algorithm with a finite-difference SF estimate is also developed in addition to the Newton-based variants.

In this paper we adapt the gradient-based SF algorithm of [10] for finding an optimal parameter trajectory in parameterized SDEs. Note that the parameter optimization setting in [5], [6], [7], [9], [10] deals with stationary parameterized Markov processes characterized by a single parameter (and not a parameter trajectory), and hence one is interested there in finding a stationary (time independent) optimal parameter. However, in the SDE setting of this paper, a more general problem is considered where the parameter itself changes deterministically over time according to a control policy. We are interested in optimizing the system behavior continuously over an interval of time [0, T] for given T < ∞, by determining the optimal parameter trajectory (viz., the parameter as a function of time) over this interval. It may be worth noting that any other efficient gradient or Newton search algorithm may be used here as well. The reason for adapting an SF algorithm to our setting is that SF is known to converge in many cases to the global optimum (and not just a local optimum) [32], [29] because of the 'objective function smoothing' that results from the use of Gaussian perturbations. We employ the gradient-based and not the Newton-based SF procedure here because of its simplicity and ease of implementation, and also because it exhibits good performance as well (though one of the Newton-based SF schemes has been observed to exhibit the best performance in [10]). In view of the reasons above, we have adopted gradient-based SF here primarily as a representative procedure for simulation-based optimization techniques.

Our main goal in this paper is to develop a theory for finding an optimal parameter trajectory in parameterized SDEs and to propose an efficient simulation-based optimization procedure for the same. To the best of our knowledge, this is the first work that develops a rigorous algorithmic procedure for simulation-based optimization in parameterized SDEs. We provide, in an appendix, a proof of convergence of our algorithm in this setting. An important aspect here that needs to be emphasized is that the parameter value at each time instant affects the (entire) state evolution from that instant onwards. Hence, the value of the initial parameter in the parameter trajectory affects all subsequent states while the value of the final parameter in the trajectory affects only the final state of the system. Our analysis formalizes this intuitive fact. We show numerical experiments over two different settings, one of which involves the well-studied linear quadratic Gaussian (LQG) system while the other involves a nonlinear system setting. Our algorithm is seen to exhibit good performance in both settings.

Finally, we also present extensions of our basic framework to the problem of finding optimal closed-loop parameterized policies in the case of controlled SDEs. We discuss here the advantages of our methodology over the 'approximating Markov chain' approach described in [22] and also present results of some numerical experiments for this case as well.

The rest of the paper is organized as follows: We describe our parameterized SDE model and problem formulation in Section 2. In Section 3, we present our SF-based algorithm for optimizing parameters in SDEs. Numerical experiments are presented in Section 4. Extensions of the proposed methodology for obtaining parameterized optimal feedback control policies are discussed in Section 5. The concluding remarks are provided in Section 6. Finally, the convergence analysis for the algorithm is presented in an appendix at the end of the paper.

2 Problem Formulation

Consider a process X(t), t ≥ 0, whose sample paths are governed by a parameterized stochastic differential equation (SDE) of the following form:

$$dX(t) = b(X(t), \theta(t))\,dt + \sigma(X(t))\,dW(t). \qquad (1)$$

Here θ(t), t ≥ 0, is the associated parameter process. We assume θ(t) ∈ U ⊂ R for all t ≥ 0, where U is a compact subset of R. In general, one may have the parameters θ(t) be R^m-valued for given m ≥ 1. We consider R-valued parameters here only in order to simplify notation in the subsequent analysis. We assume X(t) ∈ R^d, for given d ≥ 1, for all t ≥ 0, and W(·) is a 1-dimensional Brownian motion. Also, X(0) = X0 is a (deterministic) fixed quantity. Here b : R^d × R → R^d and σ : R^d → R^d are the drift and diffusion terms, respectively, which we assume are both Lipschitz continuous functions. Let [0, T] for some T > 0 be the interval over which we consider the evolution of SDE (1), with 0 and T being the initial and terminating time instants, respectively. Suppose ḡ : [0, T] × R^d → R represents the associated cost function. Define by J̄_{X0}(θ(·)) the quantity

$$\bar{J}_{X_0}(\theta(\cdot)) = E\left[\int_0^T \bar{g}(t, X(t))\,dt \;\Big|\; X(0) = X_0\right], \qquad (2)$$

that we assume is well defined and finite for all feasible parameter trajectories {θ(t), 0 ≤ t ≤ T}. Our objective is to find a function θ* : [0, T] → R with θ*(t) ∈ U, ∀t ∈ [0, T], that minimizes (2), given the initial state X0. In order to find the optimal parameter function θ*(·), we adapt multi-timescale simulation-based stochastic approximation methods to this setting. We consider an Euler discretization of the SDE (1) over discrete time epochs and recast the problem below over the resulting discrete-time setting. Our aim then becomes one of tuning the parameter trajectory for the discretized SDE in order to obtain optimal overall performance (by minimizing the cost objective) over a given (finite) number of stages. Let h > 0 be the small time element corresponding to which the time discretization of SDE (1) is performed. Without any loss of generality, let T = Nh for some integer N > 1. Let Xj ≡ X(jh) and θj ≡ θ(jh), respectively, correspond to the state and parameter at instant jh. The following is a general form for the Euler discretization of the SDE that can be seen to hold recursively:

$$X_1 = f_1(X_0, \theta_0, Z_1), \qquad X_j = f_j(X_0, \theta_0, \theta_1, \ldots, \theta_{j-1}, Z_1, Z_2, \ldots, Z_j), \qquad (3)$$

for all 1 < j ≤ N. Here f_j, 1 ≤ j ≤ N, are appropriate measurable functions that provide the states at the instants jh. Also, Z_j, 1 ≤ j ≤ N, are independent N(0, 1)-distributed samples. As an example of the above, for the case when the states X_j are R-valued (i.e., d = 1), a simple calculation as on pp. 340-343 of [15] based on Ito's formula gives the following discretization of the SDE, referred to as the Euler-Milstein scheme:

$$X_{j+1} = X_j + b(X_j, \theta_j)h + \sigma(X_j)\sqrt{h}\,Z_{j+1} + \frac{1}{2}\sigma'(X_j)\sigma(X_j)\,h\,(Z_{j+1}^2 - 1). \qquad (4)$$

Here σ'(·) is the derivative of σ(·). We consider the discretization (4) for our numerical experiments. From (4), the quantities X_j can be obtained as

$$X_j = X_0 + \sum_{i=0}^{j-1}\Big(b(X_i, \theta_i)h + \sigma(X_i)\sqrt{h}\,Z_{i+1} + \frac{1}{2}\sigma'(X_i)\sigma(X_i)\,h\,(Z_{i+1}^2 - 1)\Big), \qquad (5)$$

for all j = 1, . . . , N. A simple induction procedure now shows that the form (3) is valid for the above discretization (4). The forms of f_j, 1 ≤ j ≤ N, can be obtained from (5). More general discretizations of the SDE (1) may also be used [15].
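As an illustration, the recursion (4) can be simulated directly. The following is a minimal sketch (Python/NumPy, not from the paper); `b`, `sigma` and its derivative `dsigma` are user-supplied functions, scalar states are assumed, and all names are illustrative.

```python
import numpy as np

def simulate_trajectory(x0, theta, b, sigma, dsigma, h, rng):
    """Simulate X_1, ..., X_N from the Euler-Milstein recursion (4)
    for a given parameter trajectory theta[0..N-1]."""
    N = len(theta)
    X = np.empty(N + 1)
    X[0] = x0
    Z = rng.standard_normal(N)          # Z_1, ..., Z_N ~ N(0,1), independent
    for j in range(N):
        X[j + 1] = (X[j]
                    + b(X[j], theta[j]) * h
                    + sigma(X[j]) * np.sqrt(h) * Z[j]
                    + 0.5 * dsigma(X[j]) * sigma(X[j]) * h * (Z[j] ** 2 - 1.0))
    return X
```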

Now let g_j(X_j) ≡ ḡ(jh, X(jh)) denote the cost rate at instant jh. For our analysis, we will assume that g_j(·), j = 1, . . . , N, are Lipschitz continuous functions. Given the initial state X0 of the SDE, our aim is to find parameters θ0, θ1, . . ., θ_{N−1} (each) taking values in U that minimize

$$J_{X_0}(\theta_0, \ldots, \theta_{N-1}) \triangleq E\left[\sum_{j=1}^{N} g_j(X_j) \;\Big|\; X(0) = X_0\right] h \;\equiv\; E_{X_0}\left[\sum_{j=1}^{N} g_j(X_j)\right] h. \qquad (6)$$

Note that we do not include the cost rate g_0(X_0) in the above as, given X0, it is a constant for all parameters. Moreover, since h > 0 is fixed, it does not play a role in the minimization procedure. Hence, we drop the multiplier term h from (6) and consider the following function to be minimized (which, by an abuse of notation, we denote by J_{X_0}(θ0, . . . , θ_{N−1}) itself):

$$J_{X_0}(\theta_0, \ldots, \theta_{N-1}) \triangleq E_{X_0}\left[\sum_{j=1}^{N} g_j(X_j)\right]. \qquad (7)$$

We assume that J_{X_0}(θ0, . . . , θ_{N−1}) are continuously differentiable functions in θ0, . . ., θ_{N−1}. This is a general requirement in gradient-search-based techniques. Certain sufficiency conditions for functions such as J_{X_0}(θ0, . . . , θ_{N−1}), that are obtained as expectations (of certain cost objectives) with respect to parameterized stationary distributions, to be continuously differentiable are available in [33]. These, however, appear difficult to verify in practice. Note that we use a discretized setting here in order to make the problem amenable to solution approaches based on numerical/simulation techniques. Discretization is commonly used in many (numerical) solution approaches for continuous problems, see for instance [22]. However, as we explain in Section 5, whereas we only discretize time but not the state and control/parameter spaces, [22] considers discretizations of all three for their (numerical) solution methodology. Discretization in time is also meaningful because in many practical problems control can only be exercised at certain discrete time points. As stated before, we formulate this problem in the setting of simulation optimization. In particular, we simulate sample trajectories of the discretized SDE. Let Xj(n) denote the state obtained at instant jh in the nth (n ≥ 1) sample trajectory. We have X0(n) = X0 for all n. The parameter trajectory {θ0, θ1, . . . , θ_{N−1}} is tuned in order to optimize the cost (7). As can be seen above, the parameters (θj, 0 ≤ j ≤ N − 1) in our case are stage (or time) dependent. Now note that (7) can be decomposed as

$$J_{X_0}(\theta_0, \ldots, \theta_{N-1}) = J^1_{X_0}(\theta_0) + J^2_{X_0}(\theta_0, \theta_1) + \cdots + J^N_{X_0}(\theta_0, \ldots, \theta_{N-1}), \qquad (8)$$

for appropriate functions J^1_{X_0}(·), J^2_{X_0}(·, ·), etc., defined according to J^i_{X_0}(θ0, . . . , θ_{i−1}) = E_{X_0}[g_i(X_i)], i = 1, . . . , N. Since we use Monte-Carlo estimates for the above quantities, we have

$$E_{X_0}[g_i(X_i)] = \lim_{m \to \infty} \frac{1}{m} \sum_{n=1}^{m} g_i(X_i(n)),$$

for all 1 ≤ i ≤ N with X0(n) = X0, ∀n ≥ 1. Thus, our overall objective is to find parameters θ*_0, θ*_1, . . ., θ*_{N−1} ∈ U, given state X(0) = X0, such that

$$\{\theta^*_0, \theta^*_1, \ldots, \theta^*_{N-1}\} = \arg\min_{\{\theta_0, \theta_1, \ldots, \theta_{N-1}\}} J_{X_0}(\theta_0, \theta_1, \ldots, \theta_{N-1}).$$

Also, for any parameter trajectory {θ0, θ1, . . . , θ_{N−1}},

$$J_{X_0}(\theta_0, \theta_1, \ldots, \theta_{N-1}) = \lim_{m \to \infty} \frac{1}{m} \sum_{n=1}^{m} \sum_{i=1}^{N} g_i(X_i(n)).$$

We assume here that any sample trajectory {X0 (n), X1 (n), . . . , XN (n)}, n ≥ 1, with X0 (n) = X0 (fixed) is governed by the parameter trajectory {θ0 , . . . , θN −1 }. From now on, we shall many times also represent the above parameter trajectory as the vector (θ0 , . . . , θN −1 )T .
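A minimal Monte-Carlo sketch of the cost (7) for a fixed parameter trajectory is given below; it assumes the `simulate_trajectory()` sketch from Section 2 and a hypothetical list `g` of stage-cost functions, where `g[j-1]` plays the role of g_j.

```python
def estimate_cost(x0, theta, g, b, sigma, dsigma, h, num_paths, rng):
    """Average the total stage cost over num_paths simulated trajectories."""
    N = len(theta)
    total = 0.0
    for _ in range(num_paths):
        X = simulate_trajectory(x0, theta, b, sigma, dsigma, h, rng)
        total += sum(g[j - 1](X[j]) for j in range(1, N + 1))
    return total / num_paths        # approximates E_{X0}[ sum_j g_j(X_j) ]
```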

Next, we provide a stochastic approximation algorithm for finding the optimal parameter trajectory {θ*_0, θ*_1, . . ., θ*_{N−1}} with each θ*_i ∈ U, 0 ≤ i ≤ N − 1.

3 The Algorithm

We use here the smoothed functional approach [29], [32], [10]. The basic idea in this approach is to 'smooth' the objective by means of a convolution with a multivariate Gaussian density function. This is particularly helpful in the case of objectives that are ill-behaved. It turns out (as also explained below) that the convolution of the gradient of the objective function with a multivariate Gaussian density function is the same as the convolution of a scaled multivariate Gaussian with the objective function itself. We present first the form of the gradient estimates that we use. Let θ denote the parameter vector θ = (θ0, θ1, . . . , θ_{N−1})^T. Then θ ∈ U^N. For some scalar constant β > 0, let

$$D_{\beta,1} J_{X_0}(\theta) = \int G_\beta(\theta - \eta)\,\nabla_\eta J_{X_0}(\eta)\,d\eta \qquad (9)$$

represent the convolution of the gradient of the cost with the N-dimensional multivariate Gaussian density

$$G_\beta(\theta - \eta) = \frac{1}{(2\pi)^{N/2}\beta^N}\exp\left(-\frac{1}{2}\sum_{i=0}^{N-1}\frac{(\theta_i - \eta_i)^2}{\beta^2}\right).$$



Here η ∈ R^N with η = (η0, . . . , η_{N−1})^T. Integrating by parts in (9), it is easy to see that

$$D_{\beta,1} J_{X_0}(\theta) = \int \nabla_\theta G_\beta(\theta - \eta)\,J_{X_0}(\eta)\,d\eta = \int \nabla_\eta G_\beta(\eta)\,J_{X_0}(\theta - \eta)\,d\eta. \qquad (10)$$

One can easily check that $\nabla_\eta G_\beta(\eta) = \frac{-\eta}{\beta^2}\,G_\beta(\eta)$. Substituting the above and $\eta' = \frac{\eta}{\beta}$ in (10), one obtains

$$D_{\beta,1} J_{X_0}(\theta) = \frac{1}{\beta}\int \frac{1}{(2\pi)^{N/2}}\exp\left(-\frac{1}{2}\sum_{i=0}^{N-1}(\eta'_i)^2\right)(-\eta')\,J_{X_0}(\theta - \beta\eta')\,d\eta'. \qquad (11)$$

In the above, η = βη' = (βη'_1, . . . , βη'_N)^T and hence dη = β^N dη'_1 · · · dη'_N = β^N dη'. Upon substituting η̄ = −η', one obtains

$$D_{\beta,1} J_{X_0}(\theta) = E\left[\frac{1}{\beta}\,\bar\eta\,J_{X_0}(\theta + \beta\bar\eta)\right], \qquad (12)$$

where the expectation above is taken w.r.t. the N-variate Gaussian density

$$G(\bar\eta) = \frac{1}{(2\pi)^{N/2}}\exp\left(-\frac{1}{2}\sum_{i=0}^{N-1}\bar\eta_i^2\right).$$

The form of the gradient estimator suggested by (12) (for M large and β small) is

$$\nabla J_{X_0}(\theta) \approx \frac{1}{\beta}\frac{1}{M}\sum_{n=1}^{M}\eta(n)\,J_{X_0}(\theta + \beta\eta(n)). \qquad (13)$$

Here η(n) = (η0(n), . . . , η_{N−1}(n))^T, with η_i(n), i = 0, 1, . . . , N − 1, n ≥ 0, being independent N(0, 1)-distributed samples. It is observed in [32] that the use of the following (balanced) form of the gradient estimator,

$$D_{\beta,2} J_{X_0}(\theta) = E\left[\frac{\bar\eta}{2\beta}\big(J_{X_0}(\theta + \beta\bar\eta) - J_{X_0}(\theta - \beta\bar\eta)\big)\right], \qquad (14)$$

instead of (12) yields better performance, because the estimator in (14) has lower bias.
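A minimal sketch of the balanced (two-simulation) SF gradient estimate (14), averaged over M perturbations, is given below. `J` can be any routine returning (an estimate of) J_{X0}(θ), e.g. `estimate_cost()` from Section 2; all names are illustrative.

```python
import numpy as np

def sf_gradient(J, theta, beta, M, rng):
    """Two-sided smoothed functional gradient estimate of J at theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for _ in range(M):
        eta = rng.standard_normal(theta.shape)      # N(0,1) perturbations
        grad += eta * (J(theta + beta * eta) - J(theta - beta * eta)) / (2.0 * beta)
    return grad / M
```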

In [10], a two-timescale gradient search algorithm with gradient estimates based on (14) has been proposed, along with certain Newton-based algorithms that estimate both the gradient and the Hessian of the objective. Two-timescale stochastic approximation is necessitated in those cases where the objective function typically is an expectation or a long-run average over sample objectives. Since our objective can be viewed as the sum of certain long-run averages (as explained previously), we use two-timescale stochastic approximation in our algorithm. Along the faster timescale, we estimate the objective for any given set of parameter updates {θ0(n), θ1(n), . . ., θ_{N−1}(n)}, n ≥ 1, while along the slower one, we update the (above) parameters using a gradient search recursion with estimates based on (14). We now present our algorithm. Let {a(n)} and {c(n)} be two step-size sequences that satisfy the requirements

$$\sum_{n=0}^{\infty} a(n) = \sum_{n=0}^{\infty} c(n) = \infty; \qquad \sum_{n=0}^{\infty} a(n)^2,\; \sum_{n=0}^{\infty} c(n)^2 < \infty \quad \text{and} \quad a(n) = o(c(n)). \qquad (15)$$

For instance, a(n) = 1/(n + 1) and c(n) = 1/(n + 1)^{2/3} satisfy (15).

Recursions driven by {a(n)} are 'slower' while those driven by {c(n)} are 'faster'. This is so because beyond some finite integer N0 (i.e., for n ≥ N0), the increments in recursions governed by {a(n)} would be uniformly smaller than corresponding increments in recursions governed by {c(n)}. Hence recursions governed by {c(n)} converge faster, though with a higher variance, than the other recursions. Using the tunable parameter trajectory {θ0(n), θ1(n), . . ., θ_{N−1}(n)}, we obtain two sets of parallel simulations that are respectively generated by the trajectories {θ0^+(n), θ1^+(n), . . ., θ_{N−1}^+(n)} and {θ0^-(n), θ1^-(n), . . ., θ_{N−1}^-(n)}, n ≥ 0. Here θj^+(n) ≜ θj(n) + βηj(n) and θj^-(n) ≜ θj(n) − βηj(n), j = 0, 1, . . . , N − 1, n ≥ 0, where the ηj(n) are independent N(0, 1)-distributed random variables and β > 0 is a small constant. We then obtain perturbed SDE trajectories {X0^+(n), X1^+(n), X2^+(n), . . . , XN^+(n)} and {X0^-(n), X1^-(n), X2^-(n), . . . , XN^-(n)}, n ≥ 0, with X0^+(n) = X0^-(n) = X0 (for all n), that are governed by {θ0^+(n), θ1^+(n), . . ., θ_{N−1}^+(n)} and {θ0^-(n), θ1^-(n), . . ., θ_{N−1}^-(n)}, respectively, as follows:

$$X_1^+(n) = f_1(X_0, \theta_0^+(n), Z_1^+(n)), \qquad X_j^+(n) = f_j(X_0, \theta_0^+(n), \theta_1^+(n), \ldots, \theta_{j-1}^+(n), Z_1^+(n), Z_2^+(n), \ldots, Z_j^+(n)),$$

for j = 2, . . . , N, n ≥ 0. Similarly,

$$X_1^-(n) = f_1(X_0, \theta_0^-(n), Z_1^-(n)), \qquad X_j^-(n) = f_j(X_0, \theta_0^-(n), \theta_1^-(n), \ldots, \theta_{j-1}^-(n), Z_1^-(n), Z_2^-(n), \ldots, Z_j^-(n)),$$

for j = 2, . . . , N, n ≥ 0. Here, Zj^+(n) and Zj^-(n), n ≥ 0, j = 1, . . . , N, are independent N(0, 1)-distributed random variates. Thus, {X0^+(n), X1^+(n), X2^+(n), . . . , XN^+(n)} and {X0^-(n), X1^-(n), X2^-(n), . . ., XN^-(n)}, respectively, correspond to the nth sample paths or trajectories in the two perturbed systems. These sample paths are generated for given perturbed parameter vector updates (θ0^+(n), . . ., θ_{N−1}^+(n))^T and (θ0^-(n), . . ., θ_{N−1}^-(n))^T, respectively. After generation of these sample trajectories, averaging of the cost function is done and gradient estimates of the cost objective are obtained along the faster timescale. Next, along the slower timescale, parameter updates (along a gradient descent direction) are obtained. The algorithm is as follows.

$$Y_1(n+1) = Y_1(n) + c(n)\left(\frac{\eta_1(n)}{2\beta}\sum_{i=1}^{N}\big(g_i(X_i^+(n)) - g_i(X_i^-(n))\big) - Y_1(n)\right)$$
$$Y_2(n+1) = Y_2(n) + c(n)\left(\frac{\eta_2(n)}{2\beta}\sum_{i=2}^{N}\big(g_i(X_i^+(n)) - g_i(X_i^-(n))\big) - Y_2(n)\right)$$
$$\vdots$$
$$Y_N(n+1) = Y_N(n) + c(n)\left(\frac{\eta_N(n)}{2\beta}\big(g_N(X_N^+(n)) - g_N(X_N^-(n))\big) - Y_N(n)\right). \qquad (16)$$

Next, for 1 ≤ j ≤ N, we have

$$\theta_j(n+1) = \Gamma_j\big(\theta_j(n) - a(n)Y_j(n)\big). \qquad (17)$$

Here, Γj : R → R, j = 0, 1, . . . , N − 1, are projection operators that are defined so that

$$\Gamma(x) \triangleq (\Gamma_0(x_0), \Gamma_1(x_1), \ldots, \Gamma_{N-1}(x_{N-1}))^T \in U^N,$$

for any x ≜ (x0, x1, . . . , x_{N−1})^T ∈ R^N.

Note that the jth recursion in (16) corresponds to

$$Y_j(n+1) = Y_j(n) + c(n)\left(\frac{\eta_j(n)}{2\beta}\sum_{i=j}^{N}\big(g_i(X_i^+(n)) - g_i(X_i^-(n))\big) - Y_j(n)\right), \qquad (18)$$

where Yj(n) is the nth estimate of the partial derivative of J_{X_0}(θ0, . . . , θ_{N−1}) w.r.t. θj. From the decomposition of the latter in (8), one can see that

$$\nabla_j J_{X_0}(\theta_0, \ldots, \theta_{N-1}) = \sum_{i=j}^{N}\nabla_j J^i_{X_0}(\theta_0, \ldots, \theta_{i-1}).$$

Here, for a function f(x) (for some vector x = (x0, x1, . . ., x_{N−1})^T), ∇_j f(x) corresponds to the partial derivative of f(x) with respect to the jth component of x (viz., x_{j−1}). The above thus explains why the summation in (18) runs from i = j to N and not from i = 1 to N, as the other terms do not contribute.

Remark: We observe in our experiments that an additional averaging over M epochs of the faster timescale recursions (16) improves the performance of the algorithm. Thus, we run (16) for M epochs keeping the parameter update (17) fixed in between. Similar observations have also been made in [7] and [10]. The convergence analysis in the appendix is shown for M = 1 (i.e., the case for which the algorithm above is presented). However, it goes through in a similar manner for M > 1 as well.
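A minimal end-to-end sketch of the two-timescale recursions (16)-(17) is given below (Python/NumPy, not from the paper). It uses 0-based indexing, so `Y[j]` plays the role of Y_{j+1} in (16) and tracks the partial derivative of the cost with respect to θ_j; it assumes the `simulate_trajectory()` sketch and stage costs `g[0..N-1]` (with `g[j]` standing for g_{j+1}) from the earlier sketches, and the step-sizes follow the ones used later in the experiments of Section 4. All names are illustrative.

```python
import numpy as np

def sf_optimize(x0, N, g, b, sigma, dsigma, h, U, beta=1.0, M=25,
                num_iters=50000, rng=None):
    rng = rng or np.random.default_rng()
    theta = np.zeros(N)                      # initial parameter trajectory
    Y = np.zeros(N)                          # gradient estimates
    lo, hi = U                               # projection interval, e.g. (-1000, 1000)
    for n in range(num_iters):
        a_n = 1.0 / np.ceil((n + 1) / 1000.0) ** 0.75        # slower step-size
        c_n = 1.0 / np.ceil((n + 1) / 1000.0) ** (2.0 / 3.0)  # faster step-size
        for _ in range(M):                   # extra averaging of the fast recursion (Remark)
            eta = rng.standard_normal(N)
            Xp = simulate_trajectory(x0, theta + beta * eta, b, sigma, dsigma, h, rng)
            Xm = simulate_trajectory(x0, theta - beta * eta, b, sigma, dsigma, h, rng)
            diff = np.array([g[i](Xp[i + 1]) - g[i](Xm[i + 1]) for i in range(N)])
            for j in range(N):               # recursion (16): theta_j affects stages j+1,...,N
                Y[j] += c_n * (eta[j] * diff[j:].sum() / (2.0 * beta) - Y[j])
        theta = np.clip(theta - a_n * Y, lo, hi)   # recursion (17) with projection onto U
    return theta
```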

4 Numerical Experiments

We use our algorithm to find the optimal parameter trajectory {θ0, θ1, . . . , θ_{N−1}}, or analogously the optimal parameter vector θ = (θ0, θ1, . . . , θ_{N−1})^T. For any given θ, the functions b(·, ·) and σ(·) determine the evolution of the system with states Xj, j = 0, 1, . . . , N − 1. We assume that the states are real-valued and that the state evolution is governed according to (4), which is a discretization of the parameterized SDE (1). We have considered here two different settings for our experiments. The first setting (Setting A) below corresponds to the well studied LQG (linear-quadratic-Gaussian) problem [3], wherein the system dynamics is linear, the noise is Gaussian and the cost is quadratic. The second setting (Setting B) is a more general version with nonlinear system dynamics, even though a quadratic cost is still used there. We run our algorithm in all experiments for 50000 iterations with the parameters M (see the remark at the end of Section 3) and β set to 25 and 1, respectively. We use the following step-sizes in both settings:

$$a(n) = \frac{1}{\lceil n/1000\rceil^{3/4}} \quad \text{and} \quad c(n) = \frac{1}{\lceil n/1000\rceil^{2/3}},$$

respectively. We observe that the algorithm shows convergence (in all cases studied) in this number of iterations. Also, in all cases studied (for both settings), we let the constraint set U be the closed interval [−1000, 1000].

4.1 Setting A

We consider the following functions for b(·, ·), σ(·) and g_j(·), j = 1, . . . , N, respectively:

$$b(X_j, \theta_j) = b_1 X_j + b_2\theta_j, \qquad (19)$$

$$\sigma(X_j) = c, \qquad (20)$$

$$g_j(X_j) = k_j\,(X_j - \bar{X}_j)^2. \qquad (21)$$

Here, b1, b2, c and kj, j = 0, . . . , N, are given positive constants. Also, the X̄j, j = 0, . . . , N, can be viewed as target states at the time instants jh. The functions b(·, ·), σ(·) and g_j(·), j = 1, . . . , N, are seen to satisfy the requirements given in Section 2. It is easy to see that under (19)-(21), the discretized SDE is analogous to a linear parameterized system (i.e., the state updates are linear in the state, parameter and disturbance) with a quadratic cost criterion. Such a system with a closed-loop feedback control mechanism, wherein states are observed at instants jh, j = 1, . . . , N, and a control is picked at each of these instants according to the observed state, has been well studied in the control theory literature, see for instance [3]. The well known discrete time Riccati equation is obtained from the solution to the associated finite horizon dynamic programming equation for such a system. Our algorithm finds optimal parameters θ*_0, . . ., θ*_{N−1} that minimize the overall expected cost (7). One expects that in our setting here, the system states Xj would be driven towards the target states X̄j at the instants jh. We let N = 40, viz., we consider 40-dimensional parameter vectors. We consider four different cases here that differ from each other in the values of X̄j, j = 1, . . . , 40. The setting parameters for all four cases are chosen as follows: b1 = 0.1, b2 = 2, c = 0.1, h = 0.01 and kj = 1, respectively. The values of X̄j for the four cases are as given below.

1. Case 1: X̄j = 5.0, ∀j = 1, . . . , 40.

2. Case 2: X̄j = ⌊j/3⌋, j = 1, . . . , 40.

3. Case 3: X̄j = 13 − ⌊j/3⌋, j = 1, . . . , 40.

4. Case 4: X̄j = 5 + j mod 5, j = 1, . . . , 40.

In Figure 1, we show for Case 1 the plots of convergence of the parameters θ0, θ1 and θ20, respectively, as functions of the number of iterations when X0 = 1. These three (out of the 40) parameters were chosen arbitrarily. Similar convergence plots were also obtained for the other three cases.

Figure 1: Variation in θ-Values in Setting A, Case 1 (θ0, θ1 and θ20 versus the number of iterations)
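For concreteness, the Setting A, Case 1 ingredients (19)-(21) can be wired into the earlier sketches as follows. This is a minimal illustrative configuration only; the constants follow the values reported in the text, and the function and variable names are assumptions carried over from the earlier sketches.

```python
import numpy as np

b1, b2, c, h, N = 0.1, 2.0, 0.1, 0.01, 40
Xbar = np.full(N, 5.0)                          # Case 1 target states
b = lambda x, th: b1 * x + b2 * th              # linear drift (19)
sigma = lambda x: c                             # constant diffusion (20)
dsigma = lambda x: 0.0                          # sigma'(x) = 0, so (4) reduces to Euler
g = [lambda x, j=j: (x - Xbar[j]) ** 2 for j in range(N)]   # quadratic costs (21), k_j = 1

# theta_opt = sf_optimize(x0=1.0, N=N, g=g, b=b, sigma=sigma, dsigma=dsigma,
#                         h=h, U=(-1000.0, 1000.0))
```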

Let θ^opt = (θ0^opt, . . . , θ39^opt) denote the value of the parameter vector at the end of the simulation. In order to verify that the algorithm converges to the optimal parameter vector, one would ideally like to simulate the cost function JX0(θ) and plot it as a function of θ, to see whether or not θ^opt corresponds to the optimum parameter. However, since θ is a 40-dimensional vector, a plot of the cost as a function of θ is clearly not possible. Hence, we arbitrarily select a component θi that is then varied over the whole range of permissible values (in U), keeping the remaining components θj, j ≠ i, fixed at θj^opt, and obtain the expected cost curve as a function of θi. The corresponding plots for i = 0, 1 and 20, respectively, for all four cases when X0 = 1 are shown in Figures 2, 3, 4 and 5, respectively. From these plots, one can see that the parameters θ0^opt, θ1^opt and θ20^opt are optimal when the other parameter components are set to their optimal values as given by the algorithm.

In Tables 1-4, we show expected cost values obtained after simulating the system for different initial states X0 for two different sets of parameter values: (a) the initial parameter vector θ(0), and (b) the parameter θ^opt as obtained after convergence of the algorithm. We observe that the expected cost using the converged parameter values is very close to zero, while the initial cost is high in most cases.

Table 1: Initial and Terminal Costs for Setting A, Case 1

X0    JX0(θ(0))    JX0(θ^opt)
1     633.26       2.65
2     350.17       0.18
3     149.77       0.23
4     33.76        0.11
5     0.68         0.10

Table 2: Initial and Terminal Costs for Setting A, Case 2

X0    JX0(θ(0))    JX0(θ^opt)
1     1796.38      0.45
2     1385.88      0.62
3     1060.05      0.45
4     818.15       0.20
5     658.53       1.38

Table 3: Initial and Terminal Costs for Setting A, Case 3

X0    JX0(θ(0))    JX0(θ^opt)
1     1801.35      0.47
2     1399.83      0.56
3     1081.89      0.55
4     846.06       0.39
5     694.22       0.27

Table 4: Initial and Terminal Costs for Setting A, Case 4

X0    JX0(θ(0))    JX0(θ^opt)
1     1510.67      1.93
2     1063.44      1.13
3     698.71       1.34
4     421.28       3.12
5     224.01       1.03

Figure 2: Variation in Cost with θ in Setting A, Case 1 (expected cost versus θ0, θ1 and θ20)

4.2 Setting B

In this setting we use the following forms of the functions b(·, ·), σ(·) and h_j(·), j = 1, . . . , N, respectively. These functions also satisfy the requirements in Section 2.

$$b(X_j, \theta_j) = a_1 X_j^2 + a_2\,(\theta_j - \bar{\theta}_j)^2, \qquad (22)$$

$$\sigma(X_j) = c\,X_j, \qquad (23)$$

$$h_j(X_j) = k_j\,(X_j - \bar{X}_j)^2. \qquad (24)$$

Here a1, a2, c, kj, θ̄j and X̄j, j = 1, . . . , N, are given constants. For our experiments, we let a1 = a2 = 0.001 and θ̄j = 5, ∀j = 0, . . . , N − 1. Also, we let c = h = 0.01, kj = 1 and X̄j = 5.0 for all j = 1, . . . , N. The initial value of all parameters θj(0), j = 0, . . . , N − 1, is θj(0) = 0. Also, we let N = 40 as before. For our first experiment here, we let X0 = 1.0 be the initial state. We show in Figure 6 the plots of convergence of the parameters θ0, θ1 and θ20. Also, Figure 7 shows three plots of the expected cost JX0(θ) as functions of θ0, θ1 and θ20, respectively, when the other parameter components (in each plot) are fixed at their corresponding optimal values as given by the algorithm (viz., θj^opt for component j, as with the previous setting). These experiments are along similar lines as with Setting A. However, an interesting observation here is that while in Setting A the objective function (which is the expected cost) shows only one local minimum (which is therefore also the global minimum) for the components θ0, θ1 and θ20, respectively, when the other components are fixed (see Figures 2, 3, 4 and 5), in Setting B we observe multiple local minima in the projection of the objective function along each of these components. For instance, the plot of JX0(θ) w.r.t. θ0 shows two distinct local minima, while the plots of JX0(θ) w.r.t. θ1 and θ20, respectively, show a continuum of local minima. However, the cost at each of these local minima is nearly zero. Hence the algorithm converges to one of these, which is therefore also a global or near-global minimum. On the other hand, it has been observed in [32] that if the cost function has multiple local minima which exhibit different costs, the algorithm converges in most cases to a global minimum, unlike most gradient search algorithms. This is an advantage with the smoothed functional technique.

Figure 3: Variation in Cost with θ in Setting A, Case 2

Figure 4: Variation in Cost with θ in Setting A, Case 3

In Table 5, we show the (simulated) objective function values both with the initial and the final (viz., post-convergence) parameter vectors. One can again see that while the initial cost values in many cases are high, the final ones are very close to zero.

Table 5: Initial and Terminal Costs for Setting B

X0    JX0(θ(0))    JX0(θ^opt)
1     638.28       2.96
2     358.51       3.16
3     158.95       1.36
4     39.33        0.83
5     0.03         0.02

5 Extension to Controlled SDEs with Feedback

A controlled SDE with feedback has the following form:

$$dX(t) = b(X(t), u(t))\,dt + \sigma(X(t))\,dW(t). \qquad (25)$$

Here u(t) corresponds to the control picked at time t, which however depends on the state X(t). The problem then becomes one of finding an optimal feedback control policy under a cost objective defined in a similar manner as before. In [22], a numerical procedure for obtaining the optimal policy for controlled SDEs in a discretized setting is given. This involves finding an approximating controlled Markov chain from the original SDE by suitably discretizing both space and time. Thus, in this procedure, one considers not just discrete time instants; both the state and action spaces are suitably discretized as well. The controlled transition probabilities for the discretized state-valued process with a discrete action set are then obtained/estimated and the resulting Bellman equation (for the discretized process) is solved. For the purpose of computation, one ensures here that the discretized state and action spaces are finite. Note that obtaining the resulting controlled transition probability matrices for discretized states and actions is a time consuming procedure, with the computational complexity growing enormously with the numbers of states and actions considered. Thus, a finer discretization of the state and action spaces would result in a larger computational effort in obtaining the controlled transition probabilities. Moreover, once the controlled transition probabilities have been obtained, the amount of computational effort required in numerically solving the Bellman equation grows exponentially in the cardinality of the resulting state and action spaces. This is the well known curse of dimensionality effect.

Figure 5: Variation in Cost with θ in Setting A, Case 4

Our simulation-based framework can be extended to accommodate parameterized state-based feedback control. As an example, consider the following simple feedback control policy.

20

Figure 6: Variation in θ-Values in Setting B (θ0, θ1 and θ20 versus the number of iterations)

For j = 0, 1, . . . , N − 1,

$$\theta_j = \begin{cases} \theta_j^1 & \text{if } X_j < K_j^1 \\ \theta_j^2 & \text{if } K_j^1 \le X_j < K_j^2 \\ \;\vdots & \\ \theta_j^{L-1} & \text{if } K_j^{L-1} \le X_j < K_j^L \\ \theta_j^L & \text{if } X_j \ge K_j^L, \end{cases} \qquad (26)$$

the above parameter vector. Upon convergence, one would obtain an optimal closed-loop feedback control policy within the class of policies given by (26). The number of levels L may in general depend on the stage j as well. Thus one may have Lj levels at stage j, j = 0, 1, . . . , N − 1, with Lj 6= Lk for j 6= k. Note that in feedback policies such as (26), we do not discretize either the state or the control space as with [22]. On the other hand, we cluster together states in a certain way and assign one control to each state cluster at each stage. The algorithm may then be applied to find the best control policy within the 21

4000 θ0 θ1 θ

3500

20

3000

Expected Cost

2500

2000

1500

1000

500

0 −1000

−800

−600

−400

−200 0 200 Parameter Value

400

600

800

1000

Figure 7: Variation in Cost with θ in Setting B specified class of policies. Note also that if the state space is discretized into a finite number M of states (as with [22]), then by choosing L = M, the policy structure in (26) could be used in a way as to correspond to a general Markovian policy (for the discretized state setting) as under: For j = 0, 1, . . . , N − 1,

θj =

                        

θj1 θj2 · · · θjM −1 θjM

if if · · · if if

Xj Xj · · · Xj Xj

= Kj1 = Kj2 (27) = KjM −1 = KjM ,

with Kj^1, . . . , Kj^M now corresponding to the M (discretized) states at instant j (in fact, M may also depend on j). Our algorithm, when applied to policies such as (27), would conduct a gradient search in the space of the above policies (with the original action sets) and converge to an optimum policy in the discretized state-time setting. This is again an advantage over [22], since we do not discretize actions (unlike [22]) but perform a gradient search in the original action sets (corresponding to each discretized state). One may use policies (27) together with our algorithm in cases where the problem is not naturally parameterizable and discretizing states into a finite number of them is therefore necessary.

It must be noted that our procedure does not require a computation of the controlled parameterized transition probabilities. This is because we use stochastic approximation with simulated outcomes. We explain this more clearly below. As stated previously, the procedure in [22] requires solving the associated Bellman equation for a discretized state-action problem. The controlled transition probabilities are required there in order to compute the conditional expectation (given the current state) of the 'cost-to-go' or the 'value' of the subsequent (or next) state. Even though we do not require solving the Bellman equation for our procedure, we do require an expectation with respect to the parameterized stationary distribution of the underlying Markov process. Ordinarily, in order to obtain such an expectation, one would first require the parameterized stationary distribution for the associated Markov process, for which the parameterized transition probabilities need to be obtained. However, since we use stochastic approximation with simulated outcomes, we readily obtain the desired expectation itself (with respect to the parameterized stationary distribution) in the asymptotic limit. This follows as a consequence of the martingale convergence result in Lemma A.1 (see Appendix). Another potential advantage of our approach is that it would still work even if the functions b(·, ·) and σ(·) in the SDE (1) are not analytically known in functional form, provided the states Xj can be simulated or observations made available. This is because the algorithm (16)-(17) only requires observations (possibly simulated outcomes) of the states Xj and does not depend explicitly on the above functions. This is not the case with the procedure in [22], which requires explicit model information (in terms of the above functions as well as the transition probabilities) for its numerical procedure.

Finally, using our approach, we believe on the basis of observations from the experiments in Section 5.1 that the amount of computational effort would not grow tremendously with the number of policy levels L. Note that a larger L only has the effect of increasing the dimension of the parameter θ. Since our algorithm estimates the gradient by simultaneously perturbing all θ-components randomly using Gaussian variates and requires only two simulations at any instant, the increase in computational effort would mainly go towards generating the additional Gaussian samples. This fact has also been noted in [10] for similar algorithms in the simulation optimization setting and is also true of SPSA-based algorithms [31]. Our basic parameterized SDE framework described previously corresponds to the case of L = 1, in the context of policies (26). In Section 5.1, we show the results of experiments when L = 3. Even though this constitutes a threefold increase in the parameter dimension (from 40 for L = 1 to 120 for L = 3), we observed that the amount of computational effort for L = 3 did not increase significantly over L = 1. More detailed experiments with parameters of very high dimension and over other settings must however be conducted in order to conclusively verify the above.

Table 6: Initial and Terminal Costs for Setting B with Feedback Control

X0    JX0(θ(0))    JX0(θ^opt)
1     638.28       1.59
2     358.51       1.50
3     158.95       0.26
4     39.33        0.31
5     0.03         0.04

5.1 A Numerical Example for Controlled SDEs

We now illustrate via a numerical example the performance of our algorithm on controlled SDEs. We consider again a setting similar to Setting B. The system dynamics is, however, now governed by a controlled SDE similar to (25). We consider the following three-layered feedback control policies that are similar to (26): for j = 0, 1, . . . , N − 1,

$$\theta_j = \begin{cases} \theta_j^1 & \text{if } X_j < \bar{X}_j - \epsilon \\ \theta_j^2 & \text{if } \bar{X}_j - \epsilon \le X_j \le \bar{X}_j + \epsilon \\ \theta_j^3 & \text{if } X_j > \bar{X}_j + \epsilon. \end{cases} \qquad (28)$$
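A minimal sketch of how a policy of the form (28) selects the stage-j parameter is given below; `theta`, `Xbar` and `eps` are illustrative names, with `theta` holding the 3N tunable entries.

```python
def layered_control(theta, Xbar, eps, j, x):
    """theta has shape (N, 3); returns the parameter applied at stage j in state x."""
    if x < Xbar[j] - eps:
        return theta[j, 0]          # theta^1_j
    elif x <= Xbar[j] + eps:
        return theta[j, 1]          # theta^2_j
    else:
        return theta[j, 2]          # theta^3_j
```

In a simulation, this lookup would replace the direct use of theta[j] inside the Euler-Milstein step, while the 3N entries of theta are tuned with the same SF recursions as before.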

We let N = 40 as before and ǫ = 1 in our experiments. The parameter that we tune becomes in this case θ = (θ0^1, θ0^2, θ0^3, θ1^1, θ1^2, θ1^3, . . ., θ39^1, θ39^2, θ39^3)^T. We tune this parameter using our algorithm. Thus the feedback policy is also updated automatically at regular instants, and at each instant the most recent policy update specifies the control to use in any given state. We arbitrarily set the initial values of the parameter components as θj^k(0) = 0, ∀j, k. The initial and terminal costs obtained are shown in Table 6 for various values of X0. Note that the terminal costs JX0(θ^opt) exhibit an improvement (in most cases) over the corresponding terminal costs in Table 5 (that was for the case without feedback control).

More general feedback control policies than those described in (26) may also be considered. Our algorithm would in each case find an optimal feedback control policy within the specified class of policies. Finally, one may also develop and apply techniques based on reinforcement learning (RL) [4] in this setting. These would serve as simulation-based (direct) counterparts to the Markov chain approximation technique of [22], as RL-based algorithms are also geared towards solving the Bellman equation of optimality. Again, a major advantage with RL-based methods (as with our approach developed here) is that, unlike [22], one would not require an explicit computation of the transition probabilities and would instead work with simulated outcomes. Also, because of the use of stochastic approximation, one would directly obtain in RL-based algorithms [1], [8], in the asymptotic limit, the conditional expectation of the 'cost-to-go' given the current state, for solving the associated Bellman equation. Moreover, unlike [22], feature-based abstractions of the state and control spaces could be used in RL algorithms to overcome the curse of dimensionality associated with solving the Bellman equation. We leave the development of such methods in the context of controlled SDEs as future work.

6 Conclusions

We studied the problem of finding an optimal parameter trajectory for a system whose state evolution is obtained from a parameterized SDE. We considered a time discretization of the SDE and formulated the problem in the simulation optimization framework. An optimization algorithm based on the smoothed functional technique was presented and analyzed for convergence. Numerical results were shown over two different settings. In the first of these, the system dynamics was considered linear in the state, control and disturbance, and the quadratic cost criterion was used. The second setting involved a more general nonlinear state evolution. We chose the number of epochs to be 40 in both settings, resulting in parameter vectors of dimension 40. Our algorithm was found to perform well in both settings as it converged to the optimum set of parameters. While the original scheme presented was open-loop in nature, we also discussed extensions of the method to the case of controlled SDEs with parameterized feedback control policies and gave a detailed comparison of our approach with the one in [22]. We also showed results of some experiments on this setting for the case of three-layered feedback policies, which resulted in 120-dimensional parameters, and observed an improvement in performance over the setting with no feedback.

As future work, we shall investigate other applications of this framework and also study the performance of other simulation-based algorithms such as SPSA and the Newton-based smoothed functional algorithms in [10], which estimate both the gradient and the Hessian of the cost objective by using three-timescale stochastic approximation. Finally, as described in the last part of Section 5, an interesting future direction would be to develop and apply reinforcement learning algorithms with (possibly) feature-based parameterizations for the problem of obtaining optimal closed-loop feedback control policies for controlled SDEs.

Appendix: Convergence Analysis

Let F(l) = σ(θ_{j−1}(n), η_{j−1}(n), X_j^+(n), X_j^-(n), n ≤ l, j = 1, . . . , N), l ≥ 1, denote σ-fields generated by the quantities above. We first analyze the faster recursions, viz., those governed by the faster schedule {c(n)}. Consider the generic fast scale recursion (18). Also, consider the sequence {M_j(p)} (for given j ∈ {1, . . . , N}), as specified by (18), defined according to

$$M_j(p) = \sum_{n=1}^{p} c(n)\left(\frac{\eta_j(n)}{2\beta}\sum_{l=j}^{N}\big(g_l(X_l^+(n)) - g_l(X_l^-(n))\big) - E\left[\frac{\eta_j(n)}{2\beta}\sum_{l=j}^{N}\big(g_l(X_l^+(n)) - g_l(X_l^-(n))\big)\,\Big|\,F(n-1)\right]\right).$$

We have the following result.

Lemma A.1 The sequences {M_j(p), F(p)}, j = 1, . . . , N, are almost surely convergent martingale sequences.

Proof: Note that E[M_j(p + 1) | F(p)] = M_j(p) holds with probability one, for all p ≥ 0. Now

$$E[M_j^2(p)] \le \frac{L_{p,j}}{4\beta^2}\sum_{n=1}^{p}c^2(n)\,E\Big[\eta_j^2(n)\sum_{l=j}^{N}\big(g_l^2(X_l^+(n)) + g_l^2(X_l^-(n))\big) + E^2\Big[\eta_j(n)\sum_{l=j}^{N}\big(g_l(X_l^+(n)) - g_l(X_l^-(n))\big)\,\Big|\,F(n-1)\Big]\Big], \qquad (A.1)$$

for some constant L_{p,j} > 0 (that however depends on p, j and N). Now, by the conditional Jensen's inequality, we have that almost surely,

$$E^2\Big[\eta_j(n)\sum_{l=j}^{N}\big(g_l(X_l^+(n)) - g_l(X_l^-(n))\big)\,\Big|\,F(n-1)\Big] \le E\Big[\eta_j^2(n)\Big(\sum_{l=j}^{N}\big(g_l(X_l^+(n)) - g_l(X_l^-(n))\big)\Big)^2\,\Big|\,F(n-1)\Big] \le K_j\,E\Big[\eta_j^2(n)\sum_{l=j}^{N}\big(g_l^2(X_l^+(n)) + g_l^2(X_l^-(n))\big)\,\Big|\,F(n-1)\Big],$$

for some constant K_j > 0 (that depends on j and N). Hence from (A.1), for some K_{p,j} > 0,

$$E[M_j^2(p)] \le \frac{K_{p,j}}{4\beta^2}\sum_{n=1}^{p}c^2(n)\,E\Big[\eta_j^2(n)\sum_{l=j}^{N}\big(g_l^2(X_l^+(n)) + g_l^2(X_l^-(n))\big) + E\Big[\eta_j^2(n)\sum_{l=j}^{N}\big(g_l^2(X_l^+(n)) + g_l^2(X_l^-(n))\big)\,\Big|\,F(n-1)\Big]\Big].$$

Thus,

$$E[M_j^2(p)] \le \frac{K_{p,j}}{2\beta^2}\sum_{n=1}^{p}c^2(n)\,E\Big[\eta_j^2(n)\sum_{l=j}^{N}\big(g_l^2(X_l^+(n)) + g_l^2(X_l^-(n))\big)\Big] = \frac{K_{p,j}}{2\beta^2}\sum_{n=1}^{p}c^2(n)\Big(\sum_{l=j}^{N}E[\eta_j^2(n)g_l^2(X_l^+(n))] + \sum_{l=j}^{N}E[\eta_j^2(n)g_l^2(X_l^-(n))]\Big) \le \frac{K_{p,j}}{2\beta^2}\sum_{n=1}^{p}c^2(n)\sum_{l=j}^{N}E[\eta_j^4(n)]^{1/2}\Big(E[g_l^4(X_l^+(n))]^{1/2} + E[g_l^4(X_l^-(n))]^{1/2}\Big),$$

by the Cauchy-Schwarz inequality. Since the g_l(·) are Lipschitz continuous functions, we have |g_l(X_m)| − |g_l(0)| ≤ |g_l(X_m) − g_l(0)| ≤ K_l‖X_m‖, where K_l > 0 is the Lipschitz constant associated with g_l(·). Thus, |g_l(X_m)| ≤ C_1(1 + ‖X_m‖) for C_1 = max(K_l, |g_l(0)|) < ∞. Hence, one gets E[g_l^4(X_m)] ≤ C_2(1 + E[‖X_m‖^4]) for a constant C_2 > 0. It is easy to see that sup_n E[‖X_l^+(n)‖^4] < ∞ since j ≤ l ≤ N < ∞, i.e., we have finite (fixed) length trajectories and the only sources of randomness are the N random variables Z_l^+(n), l = 1, . . . , N, in any sample path, each of which is N(0, 1)-distributed. Moreover, as stated before, the parameter θ (or the parameter trajectory {θ0, . . . , θ_{N−1}}) takes values in the compact set U^N, and both functions b(·, ·) and σ(·) are Lipschitz continuous. Similarly, sup_n E[‖X_l^-(n)‖^4] < ∞ as well. Thus, E[M_j^2(p)] < ∞ for all p ≥ 1. One can also see that the M_j(p) are integrable random variables. Further, it is easy to see, using again the conditional Jensen and Cauchy-Schwarz inequalities, that

$$\sum_{p} E[(M_j(p+1) - M_j(p))^2 \mid F(p)] < \infty \quad \text{a.s.}$$

Thus, by the martingale convergence theorem, {M_j(p)} are almost surely convergent martingale sequences. □

Proceeding in a similar manner as in Lemma A.3 of [10], one can show that the iterates Y_j(n), j = 1, . . . , N, n ≥ 0, in (18) remain uniformly bounded with probability one. Let Y(n) ≜ (Y_1(n), . . . , Y_N(n))^T. Consider the following system of ordinary differential equations (ODEs):

$$\dot\theta(t) = 0, \qquad (A.2)$$

$$\dot Y(t) = D_{\beta,2} J_{X_0}(\theta(t)) - Y(t). \qquad (A.3)$$

Along the faster timescale (i.e., the one corresponding to the step-sizes c(n), n ≥ 0), one can show using Theorem 1, pp. 339 of [18], that the algorithm asymptotically tracks the trajectories of (A.2)-(A.3), and in particular one obtains

Lemma A.2 ‖Y(n) − D_{β,2} J_{X_0}(θ(n))‖ → 0 w.p. 1, as n → ∞. □

Next, we have the following key result that also provides the structure of the gradient of J_{X_0}(·) with respect to the θ-parameter.

Proposition A.1 ‖D_{β,2} J_{X_0}(θ(n)) − ∇J_{X_0}(θ(n))‖ → 0 as β → 0.

Proof: Recall that

$$D_{\beta,2} J_{X_0}(\theta(n)) = E\left[\frac{\eta(n)}{2\beta}\big(J_{X_0}(\theta(n) + \beta\eta(n)) - J_{X_0}(\theta(n) - \beta\eta(n))\big)\right] = \sum_{j=1}^{N} E\left[\frac{\eta(n)}{2\beta}\big(J^j_{X_0}(\theta_0(n) + \beta\eta_0(n), \ldots, \theta_{j-1}(n) + \beta\eta_{j-1}(n)) - J^j_{X_0}(\theta_0(n) - \beta\eta_0(n), \ldots, \theta_{j-1}(n) - \beta\eta_{j-1}(n))\big)\right].$$

Using appropriate Taylor series expansions of J^j_{X_0}(θ0(n) + βη0(n), . . . , θ_{j−1}(n) + βη_{j−1}(n)) and J^j_{X_0}(θ0(n) − βη0(n), . . . , θ_{j−1}(n) − βη_{j−1}(n)), respectively, around the point (θ0(n), . . . , θ_{j−1}(n))^T, j = 1, . . . , N, one gets

$$E\left[\frac{\eta(n)}{2\beta}\big(J^j_{X_0}(\theta_0(n) + \beta\eta_0(n), \ldots) - J^j_{X_0}(\theta_0(n) - \beta\eta_0(n), \ldots)\big)\right] = E\begin{bmatrix} \eta_0(n)\sum_{l=0}^{j-1}\eta_l(n)\nabla_l J^j_{X_0}(\theta_0(n), \ldots, \theta_{j-1}(n)) \\ \eta_1(n)\sum_{l=0}^{j-1}\eta_l(n)\nabla_l J^j_{X_0}(\theta_0(n), \ldots, \theta_{j-1}(n)) \\ \vdots \\ \eta_{N-1}(n)\sum_{l=0}^{j-1}\eta_l(n)\nabla_l J^j_{X_0}(\theta_0(n), \ldots, \theta_{j-1}(n)) \end{bmatrix} + o(\beta) \qquad (A.4)$$

$$= \begin{bmatrix} \nabla_0 J^j_{X_0}(\theta_0(n), \ldots, \theta_{j-1}(n)) \\ \nabla_1 J^j_{X_0}(\theta_0(n), \ldots, \theta_{j-1}(n)) \\ \vdots \\ \nabla_{j-1} J^j_{X_0}(\theta_0(n), \ldots, \theta_{j-1}(n)) \\ 0 \\ \vdots \\ 0 \end{bmatrix} + o(\beta), \qquad (A.5)$$

since η0(n), η1(n), . . ., η_{N−1}(n) are all independent N(0, 1)-distributed random variables. Hence,

$$D_{\beta,2} J_{X_0}(\theta(n)) = E\left[\frac{\eta(n)}{2\beta}\big(J_{X_0}(\theta(n) + \beta\eta(n)) - J_{X_0}(\theta(n) - \beta\eta(n))\big)\right] = \begin{bmatrix} \nabla_0 J^1_{X_0}(\theta_0(n)) + \nabla_0 J^2_{X_0}(\theta_0(n), \theta_1(n)) + \cdots + \nabla_0 J^N_{X_0}(\theta_0(n), \ldots, \theta_{N-1}(n)) \\ \nabla_1 J^2_{X_0}(\theta_0(n), \theta_1(n)) + \cdots + \nabla_1 J^N_{X_0}(\theta_0(n), \ldots, \theta_{N-1}(n)) \\ \vdots \\ \nabla_{N-1} J^N_{X_0}(\theta_0(n), \ldots, \theta_{N-1}(n)) \end{bmatrix} + o(\beta). \qquad (A.6)$$

The first term on the RHS of (A.6) corresponds to ∇J_{X_0}(θ(n)). The claim now follows. □

Note the structure of the gradient ∇J_{X_0}(θ(n)) from the above. This follows from the fact that the component θj(n) of the parameter vector θ(n), 0 ≤ j ≤ N − 1, affects all states at times j + 1, . . . , N and not the preceding states.

Lemma A.3 ‖Y(n) − ∇J_{X_0}(θ(n))‖ → 0 w.p. 1, as n → ∞ and β → 0.

Proof: Note that by the triangle inequality,

$$\|Y(n) - \nabla J_{X_0}(\theta(n))\| \le \|Y(n) - D_{\beta,2} J_{X_0}(\theta(n))\| + \|D_{\beta,2} J_{X_0}(\theta(n)) - \nabla J_{X_0}(\theta(n))\|.$$

The claim now follows from Lemma A.2 and Proposition A.1. □

Now consider the slower timescale recursion. The ODE associated with it is as follows:

$$\dot\theta = \tilde\Gamma\big(-\nabla J_{X_0}(\theta(t))\big), \qquad (A.7)$$

where for any y ∈ R^N and a bounded, continuous function v(·) : R^N → R^N,

$$\tilde\Gamma(v(y)) = \lim_{0 < \delta \to 0} \frac{\Gamma(y + \delta v(y)) - y}{\delta}.$$

Note that J_{X_0}(·) serves as an associated Liapunov function for (A.7). Let K = {θ ∈ U^N | Γ̃(−∇J_{X_0}(θ)) = 0} denote the set of fixed points of (A.7). Further, given ǫ > 0, let (K)_ǫ = {θ ∈ U^N | ‖θ − θ0‖ < ǫ, for some θ0 ∈ K}. We finally have:

Theorem A.1 Given ǫ > 0, there exists a β̂ > 0, such that for all β ∈ (0, β̂], the sequence {θ(n)} obtained using our algorithm converges to a point in (K)_ǫ with probability one as n → ∞.

Proof: As a consequence of Lemma A.3, recursion (17) can be viewed as an Euler discretization of (A.7) with a nonuniform (in fact, diminishing) step-size and an asymptotically diminishing noise. As noted above, J_{X_0}(·) serves as the associated Liapunov function for (A.7). The claim now follows from Theorem 5.3.1 on pp. 191 of [21]. □

References

[1] Abdulla, M.S. and Bhatnagar, S. (2007) "Reinforcement learning based algorithms for average cost Markov decision processes", Discrete Event Dynamic Systems, 17(1):23-52.

[2] Bertsekas, D.P. and Gallager, R.G. (1991) Data Networks, Prentice-Hall, NY.

[3] Bertsekas, D.P. (1995) Dynamic Programming and Optimal Control, Athena Scientific, Belmont, MA.

[4] Bertsekas, D.P. and Tsitsiklis, J.N. (1996) Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.

[5] Bhatnagar, S. and Borkar, V.S. (1998) "A two time-scale stochastic approximation scheme for simulation based parametric optimization", Probability in the Engineering and Informational Sciences, 12:519-531.

[6] Bhatnagar, S. and Borkar, V.S. (2003) "Multiscale chaotic SPSA and smoothed functional algorithms for simulation optimization", Simulation, 79(10):568-580.

[7] Bhatnagar, S., Fu, M.C., Marcus, S.I. and Bhatnagar, S. (2001) "Two timescale algorithms for simulation optimization of hidden Markov models", IIE Transactions, 33(3):245-258.

[8] Bhatnagar, S. and Kumar, S. (2004) "A simultaneous perturbation stochastic approximation based actor-critic algorithm for Markov decision processes", IEEE Transactions on Automatic Control, 49(4):592-598.

[9] Bhatnagar, S. (2005) "Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization", ACM Transactions on Modeling and Computer Simulation, 15(1):74-107.

[10] Bhatnagar, S. (2007) "Adaptive Newton-based smoothed functional algorithms for simulation optimization", accepted in ACM Transactions on Modeling and Computer Simulation.

[11] Bhatnagar, S. and Karmeshu (2007) "Monte-Carlo estimation of time-dependent statistical characteristics of a process governed by a random differential equation", submitted.

[12] Campillo, F. and Traore, A. (1994) "Lyapunov exponents of controlled SDEs and stabilizability property: some examples", Rapport de Recherche 2397, INRIA.

[13] Campillo, F. and Traore, A. (1995) "A stabilization algorithm for linear controlled SDEs", Proceedings of the IEEE Conference on Decision and Control, New Orleans, LA, 1034-1035.

[14] Charalambos, C.D., Djouadi, S.M. and Denic, S.Z. (2005) "Stochastic power control for wireless networks via SDEs: probabilistic QoS measures", IEEE Transactions on Information Theory, 51(12):4396-4401.

[15] Glasserman, P. (2005) Monte Carlo Methods in Financial Engineering, Springer, New York.

[16] Glynn, P.W. (1990) "Likelihood ratio gradient estimation for stochastic systems", Communications of the ACM, 33(10):75-84.

[17] Ho, Y.-C. and Cao, X.-R. (1991) Perturbation Analysis of Discrete Event Dynamic Systems, Kluwer, MA.

[18] Hirsch, M.W. (1989) "Convergent activation dynamics in continuous time networks", Neural Networks, 2:331-349.

[19] Konda, V.R. and Tsitsiklis, J.N. (2003) "Actor-critic algorithms", SIAM Journal on Control and Optimization, 42(4):1143-1166.

[20] Korn, R. and Kraft, H. (2002) "A stochastic control approach to portfolio problems with stochastic interest rates", SIAM Journal on Control and Optimization, 40(4):1250-1269.

[21] Kushner, H.J. and Clark, D.S. (1978) Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer Verlag, NY.

[22] Kushner, H.J. and Dupuis, P.G. (2001) Numerical Methods for Stochastic Control Problems in Continuous Time, Springer Verlag, NY.

[23] Lim, A.E.B., Zhou, X.Y. and Moore, J.B. (2003) "Multiple-objective risk-sensitive control and its small noise limit", Automatica, 39:533-541.

[24] Liu, T., Bahl, P. and Chlamtac, I. (1998) "Mobility modeling, location tracking, and trajectory prediction in wireless ATM networks", IEEE Journal on Selected Areas in Communications, 16(6):922-936.

[25] Marbach, P. and Tsitsiklis, J.N. (2001) "Simulation-based optimization of Markov reward processes", IEEE Transactions on Automatic Control, 46(2):191-209.

[26] Moose, R.L., Vanlandingham, H.F. and McCabe, D.H. (1979) "Modeling and estimation for tracking maneuvering targets", IEEE Transactions on Aerospace and Electronic Systems, AES-15(3):448-456.

[27] Nelson, R. (1987) "Stochastic catastrophe theory in computer performance modeling", Journal of the Association for Computing Machinery, 34(3):661-685.

[28] Primak, S., Kontorovich, V. and Lyandres, V. (2004) Stochastic Methods and their Applications to Communications: Stochastic Differential Equations Approach, Wiley, West Sussex, England.

[29] Rubinstein, R.Y. (1981) Simulation and the Monte Carlo Method, Wiley, New York.

[30] Singer, R.A. (1970) "Estimating optimal tracking filter performance for manned maneuvering targets", IEEE Transactions on Aerospace and Electronic Systems, AES-6(4):473-483.

[31] Spall, J.C. (1992) "Multivariate stochastic approximation using a simultaneous perturbation gradient approximation", IEEE Transactions on Automatic Control, 37:332-341.

[32] Styblinski, M.A. and Tang, T.-S. (1990) "Experiments in nonconvex optimization: stochastic approximation with function smoothing and simulated annealing", Neural Networks, 3:467-483.

[33] Vazquez-Abad, F.J. and Kushner, H.J. (1992) "Estimation of the derivative of a stationary measure with respect to a control parameter", J. Appl. Prob., 29:343-352.