Dynamic Admission and Service Rate Control of a Queue

Kranthi Mitra Adusumilli and John J. Hasenbein¹
Graduate Program in Operations Research and Industrial Engineering
Department of Mechanical Engineering
University of Texas at Austin, Austin, Texas 78712
{[email protected], [email protected]}

¹Research supported in part by National Science Foundation grant DMI-0132038.
Abstract

This paper investigates a queueing system in which the controller can perform admission and service rate control. In particular, we examine a single server queueing system with Poisson arrivals and exponentially distributed services with adjustable rates. At each decision epoch the controller may adjust the service rate. Also, the controller can reject incoming customers as they arrive. The objective is to minimize long-run average costs which include: a holding cost, which is a non-decreasing function of the number of jobs in the system; a service rate cost c(x), representing the cost per unit time for servicing jobs at rate x; and a rejection cost κ for rejecting a single job. From basic principles, we derive a simple, efficient algorithm for computing the optimal policy. Our algorithm also provides an easily computable bound on the optimality gap at every step. Finally, we demonstrate that, in the class of stationary policies, deterministic stationary policies are optimal for this problem.
April 13, 2010
1 Introduction
This paper investigates a joint admission and service rate control problem for a single-server queue with Poisson arrivals and exponential service times. The controller is allowed to choose a service rate in each state and also whether to outsource (or reject) arriving jobs in each state. Thus the model allows service rate control in addition to a form of admission control. There is a cost function associated with the available service rates, in addition to a rejection cost and a general holding cost. The controller's objective is to minimize the long-run average cost. There is a rich set of literature associated with this model and similar models, but the primary motivation for this work derives from an elegant analysis of the same problem, without admission control, performed in George and Harrison [4]. That paper specifically left open the problem of extending the model to incorporate admission control. Furthermore, [4] assumes that the optimal control is in the class of stationary deterministic control policies. In this paper, we extend their model to allow admission control (which also removes a technical
condition on the service rate cost function) and we prove that the optimal control policy is deterministic, within the class of stationary policies. The problem of controlling a queue with service rate or admission decisions has been addressed by many authors in various forms. Apart from [4], the most closely related paper is one by Stidham and Weber [10], who provided a uniform method for proving monotonicity of optimal rates in a variety of one-station queueing problems. There are two primary differences between their model and ours. First, they did not explicitly consider the case of admission control, in the sense that customers can be rejected. Although they allow arrival rate control, this is not necessarily equivalent to the problem with admission control. Second, they require the action space to be compact, an assumption not imposed in our model. Koole [7] also provided a framework, called “event-based dynamic programming,” to prove monotonicity results in queueing control problems of the type analyzed here. Koole's framework applies most directly to finite horizon discounted cost problems and it is possible that those results could be used to prove monotonicity results for the finite horizon version of our problem. However, extending the technique to particular infinite horizon average cost problems requires verifying technical side conditions. Furthermore, Koole's technique seems to rely on a uniformization, which would not apply in our model. The books of Sennott [9] and Puterman [8] also contain discussion and background on admission and rate control problems in queueing models. To our knowledge, the combined problem studied in this paper has not been analyzed previously. In particular, we allow the set of available service rates to be non-compact, which is different from the action space assumption in many standard Markov decision process (MDP) models, such as those in [2, 8, 9]. For example, in [5], Guo and Hernández-Lerma consider the overall action space to be unbounded; however, for each state, the action space is bounded. In our model, as in [4], the action space is allowed to be unbounded in each state. Ata and Shneorson [1], who extend the model of [4] to include capacity constraints, also consider an unbounded action space. There are three main contributions in this paper. First, we construct a simple, efficient iterative algorithm for solving the joint service rate and admission control problem for a single server queue. Although the system is relatively simple, the model is quite general in that there are minimal assumptions on the service rate and holding cost functions (see the next section for details). Furthermore, each step of the algorithm provides an immediately computable upper bound on the optimality gap. If the optimal policy is to reject customers at a system threshold level n, then the algorithm terminates when the truncation scheme reaches level n. If it is not optimal to apply admission control then the algorithm does not terminate, but converges to the optimal control policy as the truncation level goes to infinity. Hence, in this case, the bound on the optimality gap is useful when applying a stopping criterion. Although our approach is similar in spirit to the algorithm in [4], the iterative approximation scheme is different. In particular, our algorithm truncates the state space, whereas the algorithm in [4] truncates the holding costs.
The second contribution comes as a byproduct of the computational development, in which we also prove that the optimal service rates are monotone in the system state. The third contribution is to prove that deterministic policies
are optimal within the class of stationary policies. It remains to show that one can restrict the search for optimal policies to the class of stationary policies. Stidham and Weber [10] are able to show that this is the case when the action space is compact, and their analysis relies crucially on this assumption. Hernández-Lerma and Lasserre [6] provide more general results on the optimality of stationary policies for MDP models with unbounded action spaces. However, that book deals only with discrete-time Markov decision processes. Because our action space is unbounded, so are the transition rates, and thus uniformization cannot be invoked to convert our problem into an equivalent discrete-time problem. The rest of this paper is organized as follows. In Section 2 we introduce the control model. Section 3 presents the optimality equations and an associated verification theorem. Also, for a system with a fixed rejection threshold, modified optimality equations are provided. The computational algorithm is presented in Section 4, and numerical examples are given in Section 5. Finally, in Section 6 we prove that, within the class of stationary policies, deterministic policies are optimal.
2 The Control Model
Our model consists of a single server with an adjustable service rate. The service time of a customer being served at rate x > 0 is exponentially distributed with mean 1/x. Arrivals occur according to a Poisson process and, without loss of generality, we take the rate of this process to be 1. The system manager can change the service rate at any time. Further, an arrival can be denied admission by the system manager. There are costs associated both with providing service at a particular rate x and with denying admission. These costs, along with a holding cost, comprise the total system cost. The objective of the system manager is to minimize the long-run average cost per unit time. For practical applications, the rejection cost can be viewed as a cost to outsource jobs. It is assumed that all cost functions are known to the system manager. Let the cost per unit time to serve at rate x be c(x), and let h_n be the cost per unit time to hold n customers in the system. In general, there can be a non-negative holding cost h_0 incurred even when no customers are present in the system. The cost incurred for rejecting a customer is κ. One can also view the rejection of jobs as providing instantaneous service. As will be seen below, this view is useful when considering the technical conditions on the cost of service function c(·).

In summary, the system manager's decision policy consists of determining the service rates and admission policy at any instant of time, with the objective of minimizing the long-run average cost of operating the system. Under such an objective function, standard arguments show that the decision time points can be restricted to times when arrivals or departures occur. Thus, the control problem can be embedded in the framework of a continuous-time Markov decision process with a countable state space. In Section 6 we show that one need only consider the set of stationary, state-dependent controls.

We now discuss the technical assumptions in more detail. We make the following assumptions on the action space and system costs:

(A1) The action space A is a closed subset of [0, ∞) containing 0 and an element greater than 1.

(A2) The holding cost h_n is non-decreasing in n.

(A3) The cost for service function c(x) is non-decreasing in x.

(A4) c(·) is continuous on (0, ∞) (and right-continuous at 0) with c(0) = 0.

These assumptions are relatively weak and are essentially the minimal assumptions required to avoid pathological cases. They are also identical to the assumptions that appear in the motivating paper of George and Harrison [4], with some exceptions. First, George and Harrison require only left-continuity of c(·). Our computational approach requires the additional assumption of right-continuity. Second, they also require a geometric growth condition on the holding costs (see equation (2) in [4]). This condition was imposed primarily in order to prove convergence of optimal values under their algorithmic method, which involves truncating holding costs. Since our truncation method is different, we do not require this geometric growth condition. Finally, and most importantly, in [4], the following additional condition is imposed on the cost function when the action space A is unbounded:

\[ \lim_{y \to \infty} \inf\left\{ \frac{c(x)}{x} : x \in A,\ x \ge y \right\} = \infty. \]

If the limit above is finite, then the control model must take into account a possible additional mode of control: instantaneous service of a customer, for a cost equal to the limit. George and Harrison wished to exclude that mode of control from their analysis. We allow this mode of control, and thus the relaxed assumption below is imposed:

(A5) \lim_{y \to \infty} \inf\left\{ \dfrac{c(x)}{x} : x \in A,\ x \ge y \right\} \ge \kappa \ge 0.

Recall that κ is the cost to reject (or outsource) a customer. Assumption A5 implies that the cost for service per unit time is greater than or equal to the rejection cost, as the service rate grows without bound. When A5 is not satisfied, the rejection option can be ignored, since serving at some large service rate is always more beneficial than rejecting the customer. A5 is similar in form to the condition for “near monotone costs” discussed in Borkar and Meyn [3]. When this condition holds, they are able to show existence of optimal policies under the so-called risk-sensitive cost criterion. The five assumptions A1–A5 are the only technical assumptions needed for our analysis.

Since all interevent times are exponential in our model, the state space can be restricted to simply the current system size. The action space at each decision epoch involves the admission decision and the service rate control. Note that admission control is only relevant when the decision epoch is at a customer arrival time (as opposed to a departure time). We represent the admission control decision by a, where a ∈ {0, 1}: a = 1 when the decision is to admit the new customer and a = 0 when the decision is to reject. The service rate can be changed when the system state changes due to a departure or an arrival, even if the customer is not admitted. We assume that the admission and service rate decisions are made
simultaneously by the controller. Thus, at an arrival time the joint control decision can be represented as (a, x), where the service rate until the next decision epoch is 0 ≤ x < ∞. When the state changes due to a departure, then we represent the action as (1, x). Without loss of generality we assume x = 0 when n = 0. For most of this paper, we restrict our analysis to the set of deterministic stationary controls. Under a given stationary control, it is clear that the system size process is a continuous-time Markov chain (CTMC). If the Markov chain is positive recurrent under a policy, we call the policy ergodic.
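For concreteness, assumption A5 can be probed numerically for specific cost functions. The sketch below is our own illustration, not part of the paper's analysis; the helper a5_limit is hypothetical and A = [0, ∞) is assumed. It checks the quantity appearing in A5 for the two service cost functions used later in Section 5.

```python
import numpy as np

def a5_limit(c, y_values, x_max=1e6, n_pts=4000):
    """Rough numerical probe of lim_{y->inf} inf{c(x)/x : x in A, x >= y}.

    A5 is an analytic condition; here the infimum is approximated over a
    finite grid [y, x_max], so this is only a sanity check.
    """
    out = []
    for y in y_values:
        xs = np.linspace(y, x_max, n_pts)
        out.append(float(np.min(c(xs) / xs)))
    return out

# c(x) = x^2: c(x)/x = x, so the limit is infinite and A5 holds for any kappa.
print(a5_limit(lambda x: x ** 2, [10.0, 100.0, 1000.0]))

# c(x) = x - x^{1/(1+eps)}: the limit is 1, so A5 requires kappa <= 1.
eps = 0.5
print(a5_limit(lambda x: x - x ** (1.0 / (1.0 + eps)), [10.0, 100.0, 1000.0]))
```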
2.1 Dynamical Equations and Objective Functions
In this section we assume that the system is operated under a stationary ergodic policy. For simplicity of discussion we also assume that the system starts empty at time 0, although the results hold for any initial state. Since the control is stationary, if incoming customers are rejected when the system size is m, then the system size will never exceed m. Hence the set of controls can be divided into two general types: terminating and non-terminating. Under a given terminating policy we let m be the smallest state for which customers are rejected. For such a policy, we can specify the policy by (µ⃗, m). When there are n ≤ m customers in the system the server processes jobs at rate µ_n. When there are m customers in the system, all incoming customers are rejected. We refer to such a policy as m-terminating when the threshold level m needs to be emphasized. If a given stationary policy does not reject arrivals in any state then the policy is non-terminating and the policy can be specified by an infinite vector µ⃗. In this case, we think of m as taking the value infinity. Under an ergodic policy (µ⃗, m), let p_n(µ⃗, m) be the steady-state probability that the system size is n. Standard CTMC theory then implies that the long-run average cost per unit time under this policy is:

\[ z(\vec{\mu}, m) := p_m(\vec{\mu}, m)\,\kappa + \sum_{n=0}^{m} p_n(\vec{\mu}, m)\,\{c(\mu_n) + h_n\}. \tag{1} \]
For a non-terminating policy, with m = ∞, we take p_m(µ⃗, m) = 0. Hence, in this case the first term on the right-hand side of (1) is zero, and the sum is an infinite sum. The steady-state probabilities satisfy the local balance equations:

\[ p_n(\vec{\mu}, m) = p_{n-1}(\vec{\mu}, m)\,\mu_n^{-1} \qquad \forall\, 1 \le n \le m,\ \mu_n > 0. \tag{2} \]
It is possible to not serve customers in some states, i.e., to have µ_n = 0 for some n. In this case, the state space and balance equations are modified in a straightforward manner. Note that in order for a non-terminating policy to be ergodic the number of states with µ_n = 0 has to be finite. Next, define z*(m) := inf z(µ⃗, m), where the infimum is taken over all m-terminating policies. Note that z(0, 0) = z*(0). If customers are rejected in every state, the resulting policy is ergodic with z(0, 0) = h_0 + κ < ∞.
Thus, for every set of parameters there exists at least one ergodic policy with a finite average cost. As a result, the infimum above is well-defined and finite. An m-terminating policy with z(µ⃗, m) = z*(m), if it exists, is called m-optimal. The infimum among all ergodic policies is:

\[ z^* = \inf_{m \ge 0} z^*(m). \]
The control problem is to find an ergodic policy (µ⃗, m) which achieves the infimum, and a policy which does so is said to be globally optimal. We use this term to distinguish such policies from policies which are only optimal among all m-terminating policies, for a given m (we sometimes call such policies “m-optimal”). When holding cost rates are bounded, it is possible that h_n ↑ h_∞ ≤ z* < ∞, i.e., the long-run average cost under the “do-nothing policy” is no larger than the achievable long-run average cost under any ergodic policy. As in [4], in this case the MDP is said to be degenerate. When the problem is non-degenerate, we prove existence of the optimal policy in a constructive manner; in particular, we provide an algorithm to compute the policy.
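As a concrete illustration of (1) and (2), the following sketch (our own, not taken from the paper; the function name evaluate_policy is hypothetical) evaluates the long-run average cost of a given m-terminating policy by solving the birth-death balance equations:

```python
import numpy as np

def evaluate_policy(mu, kappa, h, c):
    """Average cost z(mu, m) of an m-terminating policy, per equation (1).

    mu: service rates [mu_1, ..., mu_m] (all > 0 here for simplicity); the arrival
    rate is 1 and customers are rejected in state m = len(mu).
    h: holding-cost function h(n); c: service-cost function c(x); kappa: rejection cost.
    """
    m = len(mu)
    # Unnormalized stationary probabilities from the balance equations (2):
    # p_n = p_{n-1} / mu_n, with arrival rate 1.
    q = [1.0]
    for rate in mu:
        q.append(q[-1] / rate)
    p = np.array(q) / sum(q)
    # Equation (1): rejection cost in state m plus holding and service costs.
    return p[m] * kappa + sum(
        p[n] * ((c(mu[n - 1]) if n >= 1 else 0.0) + h(n)) for n in range(m + 1))

# Example: threshold m = 3, quadratic service cost, linear holding cost.
print(evaluate_policy([1.5, 2.0, 2.5], kappa=5.0,
                      h=lambda n: 1.0 + 0.5 * n, c=lambda x: x ** 2))
```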
3 Optimality Equations and Verification
In this section, we provide the optimality equations and an associated verification theorem. The overall procedure is similar to that in [4]. However, several modifications are necessary due to the extra mode of control allowed. Standard arguments from the theory of Markov decision processes yield the following equations for the relative cost functions, which we denote by v_n, n ≥ 0:

\[
v_n = \min\left\{ \inf_{x \in A} \frac{c(x) + h_n - z + x v_{n-1} + v_{n+1}}{1+x},\;
\inf_{x \in A} \frac{c(x) + h_n - z + x v_{n-1} + v_n + \kappa}{1+x} \right\} \qquad \forall\, n \ge 1,
\]

which is equivalent to

\[
0 = \min\left\{ \inf_{x \in A} \frac{c(x) + h_n - z - x(v_n - v_{n-1}) + (v_{n+1} - v_n)}{1+x},\;
\inf_{x \in A} \frac{c(x) + h_n - z - x(v_n - v_{n-1}) + \kappa}{1+x} \right\} \qquad \forall\, n \ge 1, \tag{3a}
\]

and

\[ v_1 = v_0 - h_0 + z. \tag{3b} \]
Since for a given z the sequence of v_n's is determined only up to an additive constant, one usually works with the relative cost differences y_n := v_n − v_{n−1}, n ≥ 1. Following the arguments in [4] and [11], using the relative cost differences reduces the optimality equations to the following form:

\[
z - h_n = \min\left\{ \inf_{x \in A} \{c(x) - y_n x + y_{n+1}\},\; \inf_{x \in A} \{c(x) - y_n x + \kappa\} \right\} \qquad \forall\, n \ge 1, \tag{4a}
\]

and

\[ y_1 = z - h_0. \tag{4b} \]
Defining

\[ \phi(y) := \sup_{x \in A} \{yx - c(x)\}, \qquad \text{for } y \ge 0, \tag{5} \]
simplifies the optimality equations. Specifically, (4a) becomes

\[ h_n - z = \phi(y_n) - \min\{y_{n+1}, \kappa\} \qquad \forall\, n \ge 1. \tag{6} \]
It is worthwhile to note that the optimality equations remain the same even if the holding cost is not non-decreasing. The smallest value of n for which y_{n+1} ≥ κ is said to be the terminating state for the optimality equations. This corresponds to the threshold state m of the associated terminating policy. If y_{n+1} < κ for all n ≥ 1, then the solution pair is said to be non-terminating. Note that if a particular pair z and (y_1, y_2, . . . ) is a solution of the optimality equations (4b) and (6), then y_{n+1} < κ for any non-terminating state n ≥ 1. When y < κ, under A1, A4 and A5, the function φ(·) has finite values which are attained and the smallest maximizers in the set A exist. Let ψ(y) be the smallest maximizer in the definition of φ(y). Note that assumption A5 implies that ψ(y) is finite for each y ≥ 0. Further, if φ(κ) < ∞, then ψ(κ) is well defined. In the next subsection we confirm the validity of the optimality equations for bounded solutions (sometimes called a “verification theorem”). Also, a modified version of these optimality equations is introduced and verified.
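For orientation, here is a short worked instance (our own illustration, using the quadratic service cost that reappears in Section 5): with A = [0, ∞) and c(x) = x^2, the supremum in (5) is attained at x = y/2 for y ≥ 0, so

\[ \phi(y) = \sup_{x \ge 0}\{yx - x^2\} = \frac{y^2}{4}, \qquad \psi(y) = \frac{y}{2}, \]

and, below the rejection threshold, (6) reads h_n − z = y_n^2/4 − y_{n+1}, i.e., y_{n+1} = y_n^2/4 − h_n + z, a recursion that can be run forward in n once z is fixed.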
3.1 Verification Theorem
We first state and prove the main verification result.

Theorem 1. Let z < ∞ and (y_1, y_2, . . . ) be a solution to the optimality equations (4b) and (6), with the y_i's being uniformly bounded. Let m* be the corresponding terminating state (if there is no terminating state, then m* = ∞). Then z ≤ z(µ⃗, m) for every ergodic policy (µ⃗, m), that is, z ≤ z*. If the policy (µ⃗*, m*) defined by

\[
\mu^*_n = \begin{cases} \psi(y_n) & \text{for } 1 \le n < m^*; \\ \psi(y_{m^*}) & \text{for } n \ge m^* \text{ (when } m^* < \infty) \end{cases}
\]

is ergodic, then (µ⃗*, m*) is an optimal policy.

Proof. First, note that since z < ∞ and the y_i are bounded, φ(y_n) < ∞ for all n ≥ 1. Using (5) and (6), one obtains the following relations:

\[ x y_n - c(x) \le \phi(y_n) \le y_{n+1} + h_n - z \qquad \forall\, x \in A,\ n \ge 1, \tag{7} \]
\[ x y_n - c(x) \le \phi(y_n) \le \kappa + h_n - z \qquad \forall\, x \in A,\ n \ge 1. \tag{8} \]
Any ergodic policy is either a terminating or a non-terminating policy. First, consider an arbitrary terminating policy (µ⃗, m). Setting x = µ_n in (7), and x = µ_m and n = m in (8), we have

\[ \mu_n y_n - c(\mu_n) \le y_{n+1} + h_n - z \qquad \text{for } n \ge 1, \tag{9} \]
\[ \mu_m y_m - c(\mu_m) - \kappa \le h_m - z \qquad \text{for } m < \infty. \tag{10} \]
Multiplying both sides of (9) by p_n(µ⃗, m), multiplying both sides of (10) by p_m(µ⃗, m), substituting µ_n p_n(µ⃗, m) = p_{n−1}(µ⃗, m) from (2), and rearranging terms yields:

\[ p_n(\vec{\mu}, m)[h_n + c(\mu_n) - z] \ge p_{n-1}(\vec{\mu}, m)\,y_n - p_n(\vec{\mu}, m)\,y_{n+1} \qquad \text{for } 1 \le n < m, \tag{11a} \]
\[ p_m(\vec{\mu}, m)[h_m + c(\mu_m) + \kappa - z] \ge p_{m-1}(\vec{\mu}, m)\,y_m. \tag{11b} \]
Summing all the equations in (11a) and (11b), and using relation (1), gives

\[ z(\vec{\mu}, m) - p_0(\vec{\mu}, m)\,h_0 - z[1 - p_0(\vec{\mu}, m)] \ge p_0(\vec{\mu}, m)\,y_1. \tag{12} \]
Applying (4b) we conclude z(µ⃗, m) ≥ z, which establishes the result for any terminating policy. For a non-terminating policy (µ⃗, ∞), the derivation is analogous, where now we sum the equations in (11a) over all n ≥ 1. Since the y_i's are bounded, the sums due to the right-hand side of (11a) are finite and we again obtain (12). This establishes z(µ⃗, ∞) ≥ z for any non-terminating policy. Next, given a solution to the optimality equations with m* < ∞, set x = ψ(y_n) in (7) and x = ψ(y_{m*}) in (8). In that case, (9) and (10) hold with equality, implying that (11a) and (11b) also hold with equality. As a result, (12) holds with equality for this policy, i.e., (µ⃗*, m*) is optimal. An analogous argument holds in the non-terminating case.

In the theorem above, even for a terminating policy, we assigned a service rate for states beyond the threshold state. Of course, these states are transient, so the assigned rates are inconsequential in terms of long-run costs. In the theorem below, we consider the optimality equations for a fixed n-terminating policy. These equations will come into play in later sections. As such, we refer to (4b) and the equations in (13) as the n-optimality equations. Furthermore, we call a policy n-optimal if it is optimal for the control problem in which the state space is truncated to be {0, . . . , n}. The following theorem is stated without a proof, as it follows directly from arguments in the proof above.

Theorem 2. If there exist an n ≥ 1, a sequence (y_1, · · · , y_n), and z(n) < ∞ satisfying (4b) and

\[ h_k - z(n) = \phi(y_k) - y_{k+1} \quad \text{for } 1 \le k \le n, \qquad y_{n+1} = \kappa, \tag{13} \]

then z(n) = z*(n) and the policy (µ⃗*, n) given by

\[ \mu^*_k = \psi(y_k) \qquad \text{for } 1 \le k \le n \tag{14} \]
is n-optimal. This theorem can be applied when a system with a fixed threshold is considered and there is no option of rejecting customers unless the buffer is full. One such case is discussed in [1]. Note that both theorems in this section hold even when the holding cost h_n is not non-decreasing in n.
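To make Theorem 2 computational, observe that once z is fixed, (4b) and (13) determine y_1, . . . , y_{n+1} by a forward recursion, so solving the n-optimality equations reduces to a one-dimensional search for the value of z at which y_{n+1} = κ. A sketch of this building block is given below (our own illustration, not code from the paper; the names y_sequence and solve_n_optimality are hypothetical, φ is assumed to be available as a function, and a bracketing interval for z must be supplied):

```python
def y_sequence(z, n, h, phi):
    """Forward recursion from (4b) and (13): y_1 = z - h(0) and
    y_{k+1} = phi(y_k) - h(k) + z for k = 1, ..., n."""
    ys = [z - h(0)]
    for k in range(1, n + 1):
        ys.append(phi(ys[-1]) - h(k) + z)
    return ys  # ys[0] = y_1, ..., ys[n] = y_{n+1}

def solve_n_optimality(n, h, phi, kappa, z_lo, z_hi, tol=1e-10):
    """Bisect on z so that y_{n+1}(z) = kappa. Returns (z(n), [y_1, ..., y_n]),
    or None if no root is bracketed (no solution to the n-optimality equations)."""
    f = lambda z: y_sequence(z, n, h, phi)[-1] - kappa
    if f(z_lo) * f(z_hi) > 0:
        return None
    while z_hi - z_lo > tol:
        mid = 0.5 * (z_lo + z_hi)
        if f(z_lo) * f(mid) <= 0:
            z_hi = mid
        else:
            z_lo = mid
    z = 0.5 * (z_lo + z_hi)
    return z, y_sequence(z, n, h, phi)[:-1]
```

Since each y_k(z) is non-decreasing in z (a property also used in Section 4.3), y_{n+1}(z) − κ changes sign at most once on the bracket, so bisection is adequate.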
4 Policy Computation

4.1 Overview
In Section 3.1 we developed optimality equations and established their validity. In this section, we suggest a computational methodology to obtain an optimal or near optimal policy. The optimality equations alone do not suggest an efficient methodology for constructing near optimal policies. However, the structure of the model does suggest an algorithm, and most of this section is devoted to validating the approach. It should be noted that our algorithm differs from the one suggested in [4]. In their model, approximating problems are formed by “truncating” the holding costs at some buffer level n, i.e., they set h_n = h_{n+1} = . . . for some n ≥ 1. Since our model allows customer rejection, it is natural to consider a sequence of approximating problems which truncate the state space instead. This approach works because it can be shown that if there is a “local minimum” in these approximating problems, then the solution to the corresponding truncated problem yields the optimal policy for the original problem. If no local minimum exists, then it is not optimal to reject customers in any state. In either case, one can use the approximating problems to generate a bound on the optimality gap at any stage, thus allowing the implementation of a stopping criterion.

Because we allow the action space to be unbounded, there may be no formal solution to the optimality equations for a particular truncation level n. Such cases arise when the “optimal” service rate for a state below the truncation level is ∞. In such cases, as the algorithm below indicates, it is then globally optimal to reject at a lower threshold value. Assumption A5 ensures that such a policy is optimal in these cases. The algorithm is as follows:

Initialization. Set n = 1.

Step 1. Solve the n-optimality equations.

Step 2. If a solution to the optimality equations exists and z*(n) ≥ z*(n − 1), then the optimal (n − 1)-terminating policy is globally optimal. If no solution exists, then the optimal (n − 1)-terminating policy is globally optimal. If neither case applies, then increase n by 1 and go to Step 1.

Consider the sequence of solutions (z(n), (y_1^n, · · · , y_n^n)), n ≥ 1, satisfying equations (4b) and (13). If a solution pair exists for some n ≥ 1, then Theorem 2 applies, i.e., z(n) = z*(n). In particular, this solution pair corresponds to an optimal policy for a system with a limit of n customers. Starting from n = 1, solution pairs are computed for incremental values of n. As mentioned above, this computation continues until a local minimum is found or until no solution pair exists for some n. It is shown below that if ñ is the first local minimum of the sequence, then z(ñ) = z*. If there is no such local minimum, then lim_{n→∞} z(n) = z*.
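Putting these pieces together, the following is a minimal sketch of the overall procedure (our own illustration, not code from the paper; it reuses the hypothetical solve_n_optimality helper sketched after Theorem 2, and it assumes a bracketing interval [z_lo, z_hi] for the one-dimensional search in z is supplied):

```python
def compute_policy(h, phi, psi, kappa, z_lo, z_hi, n_max=10_000, tol=1e-9):
    """Iterate the truncation level n = 1, 2, ... as in the algorithm above.

    Returns (n, z(n), service rates) for the first local minimum (Step 2), or the
    last truncated solution if n_max is reached (the limiting, non-terminating case).
    """
    best = (0, h(0) + kappa, [])      # z*(0): reject customers in every state
    for n in range(1, n_max + 1):
        sol = solve_n_optimality(n, h, phi, kappa, z_lo, z_hi)
        if sol is None or sol[0] >= best[1] - tol:
            return best               # the (n-1)-terminating policy is globally optimal
        z_n, ys = sol
        best = (n, z_n, [psi(y) for y in ys])
    return best                       # no local minimum found up to n_max
```

Each pass of the loop is Step 1 of the algorithm; the two early-return conditions correspond to the two cases in Step 2.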
The sequence of terminating optimal policies can be seen as corresponding to progressive approximations of an optimal policy in the same way that [4] provides successive approximations via holding cost truncation. We now summarize the development in the remainder of this section. In Section 4.2 we prove optimality of the n-terminating policy corresponding to the first local minimum of the z(·) values described above. Section 4.3 treats the case when there is no local minimum. In that case, the limiting policy is shown to be optimal. In both cases, it is shown that the optimal service rates are monotone in the number of customers in the system.
4.2 Policy Computation: The Local Minimum Case
The sequence of solutions (z(1), y_1^1), (z(2), y_1^2, y_2^2), . . . might have a local minimum or terminate when no solution pair exists for a particular n + 1. When no solution exists it is shown that z(n) < z*(n + 1) (note that z*(n + 1) exists even if there is no solution z(n + 1)). Hence, in either case the sequence is said to have a local minimum. Here, we show that when z(n) is the first local minimum of the sequence, then the n-optimal policy is an optimal policy. We first establish a preliminary lemma.

Lemma 1. Consider a sequence of optimal terminating solutions (z(1), y_1^1), (z(2), y_1^2, y_2^2), . . .. If z(k) > z(n) for 1 ≤ k < n, then

\[ y_{k+1}^n < \kappa. \tag{15} \]

Proof. Since z(k) > z(n), we have

\[ h_k - z(n) > h_k - z(k) \;\Rightarrow\; \phi(y_k^n) - y_{k+1}^n > \phi(y_k^k) - \kappa, \tag{16} \]
where the latter inequality is due to (13). Next note that given a value z(n), the sequence of y's is constructed from (4b) and (13). This construction directly implies that if z(k) > z(n) then y_k^k > y_k^n. Since φ(·) is a non-decreasing function we have φ(y_k^k) ≥ φ(y_k^n). Combining this with (16) yields y_{k+1}^n < κ.

For a decreasing sequence of z's, Lemma 1 establishes that the y's are bounded away from κ, i.e., they satisfy the defining property of non-terminating states. The next theorem is the main result of this subsection.

Theorem 3. If there exists a sequence of optimal terminating solutions with values satisfying

\[ z(k) > z(n) \quad \text{for } 1 \le k < n, \qquad z^*(n+1) \ge z(n), \tag{17} \]

then the n-optimal policy corresponding to the solution (z(n), (y_1^n, · · · , y_n^n)) is globally optimal. Furthermore,

\[ y_1^n \le y_2^n \le \cdots \le y_n^n, \tag{18} \]

which implies that the optimal service rates are monotone in the number of jobs in the system.
Proof. By assumption, n ≥ 1. The case where even the 1-optimality equations cannot be satisfied is discussed at the end of this subsection. We first establish (18). For any ℓ such that 1 < ℓ ≤ n, we can apply (13) and the fact that the holding cost rates are non-decreasing to obtain:

\[ \phi(y_\ell^\ell) - \kappa = h_\ell - z(\ell) \ge h_{\ell-1} - z(\ell) \ge \phi(y_{\ell-1}^\ell) - \kappa, \]

where the last inequality is derived by applying both (13) and Lemma 1. Since φ(·) is non-decreasing, the inequality above implies y_ℓ^ℓ ≥ y_{ℓ−1}^ℓ, and another application of (13) yields

\[ \phi(y_{\ell-1}^\ell) - h_{\ell-1} \ge \phi(y_{\ell-2}^\ell) - h_{\ell-2}. \tag{19} \]

Since the holding costs are non-decreasing, (19) implies φ(y_{ℓ−1}^ℓ) ≥ φ(y_{ℓ−2}^ℓ), which gives y_{ℓ−1}^ℓ ≥ y_{ℓ−2}^ℓ. We can now recursively apply the last few observations to obtain (18).

We are now prepared to prove the main part of the theorem. By Lemma 1, y_{k+1}^n < κ for 1 ≤ k < n. Thus, the optimality equations in (13) from the terminating case can be written in the equivalent form of the original optimality equations:

\[ h_k - z(n) = \phi(y_k^n) - \min\{y_{k+1}^n, \kappa\} \qquad \text{for } 1 \le k < n. \tag{20} \]
Case 1: Suppose that there exists a solution to the set of (n + 1)-optimality equations. As previously noted, if z(n) ≤ z(n + 1), then by construction y_k^n ≤ y_k^{n+1} for all 1 ≤ k ≤ n. Further, we also have h_n − z(n) ≥ h_n − z(n + 1). Applying the n- and (n + 1)-optimality equations, plus the last two observations, gives

\[ \phi(y_n^n) - \kappa \ge \phi(y_n^{n+1}) - y_{n+1}^{n+1} \ge \phi(y_n^n) - y_{n+1}^{n+1} \;\Rightarrow\; y_{n+1}^{n+1} \ge \kappa. \tag{21} \]

Define δ := z(n + 1) − z(n) and, for k ≥ n + 1, set g := h_{n+1} − δ. Using these definitions, (13) and (21) yields:

\[ g - z(n) = \phi(y_{n+1}^{n+1}) - \min\{y_{n+1}^{n+1}, \kappa\} \qquad \text{for } k \ge n + 1. \tag{22} \]

Let us modify the holding cost vector by replacing h_i, i > n, with g, i.e., the new rates are: (h_0, h_1, · · · , h_{n−1}, h_n, g, g, · · · ). Since δ ≥ 0, the new holding cost rates for states larger than n are less than or equal to the original rates. Then, for this cost rate vector, the n-optimal policy corresponding to the solution (z(n), (y_1^n, · · · , y_n^n)) is globally optimal, using (4b), (20) and (22) and applying Theorem 1. Since it is optimal to reject customers in state n with these modified holding
costs, it must also be optimal to reject them when all holding costs are larger beyond state n, as they are in the original holding cost vector. Hence, the n-optimal policy is also globally optimal under the original holding cost vector.

Case 2: Suppose that a solution to the set of (n + 1)-optimality equations does not exist. In state n + 1, replace the holding cost h_{n+1} with the modified cost h'_{n+1} := z(n) + φ(κ) − κ. We claim that h_{n+1} − h'_{n+1} ≥ 0. To see this, first note that for the system with the modified cost, (z(n), (y_1^n, · · · , y_n^n, κ)) satisfies the (n + 1)-optimality equations. Now suppose that h_{n+1} − h'_{n+1} < 0. Let ẑ(n + 1) be the cost of implementing the policy corresponding to (y_1^n, · · · , y_n^n, κ) with the original holding cost vector. In that case, we must have ẑ(n + 1) < z(n), since the only thing that has been changed is the holding cost in state n + 1, which by assumption is strictly lower in the original system. This last inequality implies z*(n + 1) < z(n), which contradicts the assumption in the theorem statement that z*(n + 1) ≥ z(n). Thus we have established that h_{n+1} − h'_{n+1} ≥ 0. The remainder of the argument is similar to that of Case 1. In particular, we again modify the holding cost vector to be (h_0, h_1, · · · , h_{n−1}, h_n, h'_{n+1}, h'_{n+1}, · · · ) and argue that the n-optimal policy corresponding to the solution (z(n), (y_1^n, · · · , y_n^n)) is globally optimal in the modified system. Hence, it must also be globally optimal in the original system.

It is possible that there is not even a solution to the 1-optimality equations, i.e., there is no z and finite y_1 satisfying the equations. In this case, it is straightforward to argue that the policy of rejecting customers in all states (including state 0) is optimal.
4.3 Policy Computation: The Decreasing Sequence Case
If there exists a sequence of terminating optimal solutions satisfying

\[ z(n) > z(n+1) \qquad \forall\, n \ge 0, \tag{23} \]

then z(∞) := lim_{n→∞} z(n) is the cost incurred by using the limiting control policy, i.e., the policy (ζ⃗, ∞) where ζ_m = lim_{n→∞} ψ(y_m^n) for all m ≥ 1. From the discussion preceding Section 3.1 it is clear that the limiting service rate ζ_m is finite for all m ≥ 1. Thus, to prove optimality of the limiting control policy we need to show that a limiting policy exists and that z(∞) = z*. The theory in this section is closely related to the “truncating holding cost” case considered in Section 6 of [4]. So, consider a modified problem with holding costs (h_0, h_1, · · · , h_{n−1}, h_n, h_n, · · · ). The next lemma shows that there exists an n-terminating solution for this truncated holding cost problem. This result is used in establishing optimality of the limiting control policy.
Lemma 2. Suppose there exists a sequence of n-terminating solutions {(z(n), (y_1^n, . . . , y_n^n)), n ≥ 1} with decreasing values, i.e., (23) is satisfied. Then for each n ≥ 1 there exists a solution vector (ẑ(n), (ŷ_1^n, · · · , ŷ_n^n)) which solves the truncated holding cost problem. In particular, (ẑ(n), (ŷ_1^n, · · · , ŷ_n^n)) satisfies:

(i) (4b) and

\[ \hat{y}_k^n = \phi(\hat{y}_{k-1}^n) - h_{k-1} + \hat{z}(n) \qquad \text{for } 2 \le k \le n, \tag{24} \]
\[ \hat{y}_n^n = \phi(\hat{y}_n^n) - h_n + \hat{z}(n). \tag{25} \]

(ii) Furthermore, for each n ≥ 1 the ŷ^n's are non-decreasing:

\[ \hat{y}_1^n \le \hat{y}_2^n \le \cdots \le \hat{y}_n^n. \tag{26} \]
Proof. Consider an n-terminating solution (z, (y_1, · · · , y_n)). Such a solution satisfies (4b) and (24), but not necessarily (25). We wish to show that we can find a ẑ(n) which will generate a new sequence of ŷ^n's which also satisfies (25). Note that for any fixed z, if a sequence of y_n's exists it is unique. Hence, below we can think of y_n as a function of z, and we denote this by using the notation y_n(z). Define

\[ \Delta_n(z) := \phi(y_n(z)) - y_n(z) + z - h_n \tag{27} \]

and let S(n) represent the statement that there exists a ẑ(n) such that Δ_n(ẑ(n)) = 0. If for some n ≥ 1 the original n-terminating solution is such that Δ_n(z(n)) ≤ 0, then this solution satisfies the global optimality equations (4b)–(6), implying that the corresponding policy is globally optimal. This contradicts our assumption that the sequence of n-terminating optimal policies has decreasing values. Therefore,

\[ \Delta_n(z(n)) > 0 \qquad \forall\, n \ge 1. \]

For n = 1, plugging in z = h_0 in (27) and using the fact that the holding cost is non-decreasing, we have Δ_1(h_0) = φ(0) + h_0 − h_1 ≤ 0. By construction, it can be seen that Δ_1(·) is continuous on (0, ∞). Hence, from the observations in the last two displays we conclude that there exists a ẑ(1), h_0 ≤ ẑ(1) < z(1), with Δ_1(ẑ(1)) = 0, i.e., S(1) holds. Now assume S(m) holds for some m ≥ 1. Let ẑ(m) be the function argument for which S(m) holds, i.e., suppose Δ_m(ẑ(m)) = 0. Next, by definition,

\[ \Delta_{m+1}(\hat{z}(m)) = \phi(y_{m+1}(\hat{z}(m))) - y_{m+1}(\hat{z}(m)) + \hat{z}(m) - h_{m+1}. \tag{28} \]

Using (28), the identity y_{m+1}(ẑ(m)) = y_m(ẑ(m)), and the fact that S(m) holds, we have:

\[ \Delta_{m+1}(\hat{z}(m)) = h_m - h_{m+1}, \]
which implies that Δ_{m+1}(ẑ(m)) is non-positive. From the observations above, recall that Δ_{m+1}(z(m + 1)) > 0. Again using the continuity of Δ_{m+1}(·), we conclude that there exists a ẑ(m + 1),

\[ \hat{z}(m) \le \hat{z}(m+1) < z(m+1), \tag{29} \]

such that Δ_{m+1}(ẑ(m + 1)) = 0. So, S(m + 1) holds and, by induction, S(n) holds for all n ≥ 1, thus establishing (i). Result (ii) follows from an argument analogous to that in the proof of Theorem 3.

For completeness, we now state a lemma similar to Proposition 4 in [4]. The proposition establishes that the solutions of Lemma 2 correspond to optimal policies, at least when the control problem is nondegenerate.

Lemma 3. For a fixed n ≥ 1, consider the control problem with modified holding cost vector (h_0, · · · , h_{n−1}, h_n, h_n, · · · ). Let ẑ(n) be the optimal objective value for the modified problem, and let (μ̂_1^n, μ̂_2^n, . . .) be the corresponding optimal service rates. The following hold:

(i) If ẑ(n) ≥ h_n, then the modified control problem is degenerate.

(ii) If ẑ(n) < h_n, then (η⃗(n), ∞), where η⃗(n) = {μ̂_i^n}, is an optimal ergodic policy.

Proof. For clarity, note that the elements of η⃗(n) are dictated by ẑ(n), and μ̂_k^n = μ̂_n^n for k ≥ n. The result follows from Proposition 4 in [4] and observing that (ẑ(n), (ŷ_1^n, · · · , ŷ_n^n)) satisfies (4b), (24) and (25).

If the optimal value for a terminating policy is strictly smaller than the previous value, i.e., z(n) > z(n + 1), then Lemma 3 guarantees the existence of a non-terminating optimal policy for a system with holding cost truncated at h_n. This result holds even if z(n + 1) ≤ z(n + 2), i.e., the sequence of z(·)'s need only be decreasing up until stage n + 1 for the existence of an optimal policy under truncated holding costs. Recall that the original holding costs are non-decreasing in the number of jobs in the system. Hence, as we increase the truncation level, the sequence of ẑ(n)'s must also be non-decreasing. Furthermore, we have ẑ(n) ≤ z* for all n ≥ 1. Hence,

\[ \hat{z}(\infty) := \lim_{n \to \infty} \hat{z}(n) \le z^*. \tag{30} \]
Next, note that the ŷ_i^n(·) are bounded and increasing functions of ẑ. From these properties, (26), and the construction of non-terminating policies in Lemma 2, we have for i ≥ 1

\[ \hat{y}_i^* := \lim_{n \to \infty} \hat{y}_i^n = \hat{y}_i(\hat{z}(\infty)). \]

Similarly, the pre-limit property (26) of the ŷ_i^n yields ŷ_1^* ≤ ŷ_2^* ≤ · · · . Since ψ(·) is continuous and non-decreasing, (30) implies the existence of a limiting control rate, for each i ≥ 1:

\[ \hat{\mu}_i^* := \lim_{n \to \infty} \psi(\hat{y}_i^n) = \psi(\hat{y}_i^*). \tag{31} \]

With these observations in hand, we first consider the degenerate case.
Theorem 4. If h_n ↑ h_∞ as n ↑ ∞ and ẑ(∞) ≥ h_∞, then the original control problem is degenerate.

Proof. The result follows from (30) and the definition of degeneracy.

In the appendix we prove the following lemma, from which the main result will follow immediately.

Lemma 4. For every non-terminating ergodic policy (η⃗(n), ∞), construct a terminating policy (μ̈⃗^n, n), defined by

\[
\ddot{\mu}_k^n = \begin{cases} \hat{\mu}_k^n & \text{if } 1 \le k < n; \\ \hat{\mu}_n^n - 1 & \text{if } k = n. \end{cases}
\]

Let z̄(n) be the optimal values of these so-constructed policies. Then

\[ \lim_{n \to \infty} \bar{z}(n) = z^*. \]

We are now prepared to present the main result of this section, which provides the justification for the computational method when optimal objective values are decreasing.

Theorem 5. If the sequence of n-terminating optimal policies has decreasing values, i.e., (23) is satisfied, then the corresponding limiting control policy (µ⃗(∞), ∞) is an optimal policy. Furthermore, the optimal service rates are non-decreasing with respect to the number of customers in the system.

Proof. First, we wish to establish that the sequence of values of optimal terminating policies converges to z*. In Lemma 4 we establish this for terminating policies derived from the sequence of policies obtained from truncated holding cost problems. Recall that these values are denoted by z̄(n), n ≥ 1. It is easy to see that for all n, we have z̄(n) ≥ z(n), since z(n) is the optimal value among all terminating policies, and z̄(n) corresponds to a value under some (perhaps suboptimal) terminating policy, at the same truncation level. Lemma 4 shows that lim_{n→∞} z̄(n) = z*, which with the observation above immediately implies that

\[ \lim_{n \to \infty} z(n) = z^*. \]

From here on, our argument is analogous to that in the proof of Proposition 7 in [4]. In particular, y(·) and ψ(·) are continuous functions, implying that the optimal truncated service rates converge to a limiting service rate vector corresponding to z*:

\[ \lim_{n \to \infty} \psi(y_k(z(n))) = \mu_k(\infty) = \psi(y_k(z^*)), \]

for all k ≥ 1. These same continuity properties similarly imply that the optimal limiting service rates are monotone in the number of customers in the system.
A minor difference between our proof and the arguments appearing in [4] should be noted. George and Harrison require left-continuity of c(·) because they truncate holding costs, which means that the sequence of truncated optimal values converges to z* from the left. Since we truncate the buffer level instead, our sequence of optimal values converges to z* from the right, requiring right-continuity of c(·) and the related function ψ(·) for the argument above (left-continuity is used in earlier sections). Though y_n^* < κ for n ≥ 1, the optimal service rates associated with z(∞) can grow without bound as the number of customers in the system increases, i.e., the optimal service rates in this case could be unbounded.
5 Numerical Examples
In this section we illustrate the algorithm developed in the last section with a few numerical examples. First, we present a result which provides an upper bound on the optimality gap between the control policy obtained after an iteration of the algorithm and the optimal control policy. George and Harrison [4] also provided a bound on the optimality gap for their algorithm. However, their algorithm evolves in a different manner, since they truncate holding costs rather than the state space.

Theorem 6. If z(n) > z(n + 1) for some n ≥ 1, and z(k) > z(n) for all k < n, then

\[ z(n) - z^* \le \kappa - y_n^n. \tag{32} \]

Proof. Since z(n) > z(n + 1), the n-terminating policy is not globally optimal. It follows that

\[ y_n^n < \kappa. \tag{33} \]

If the above inequality does not hold, then setting y_m = y_n^n for m > n, we have a solution pair to the optimality equations when the holding cost vector is modified to be (h_0, h_1, · · · , h_{n−1}, h_n, h_n, · · · ), from Theorem 1 and Lemma 1. This implies z* = z(n), which contradicts our assumption. Set h'_m = h_m − θ_n for m < n, where θ_n = κ − y_n^n. From (33), θ_n is positive. For a system with modified holding costs (h'_0, h'_1, · · · , h'_{n−1}, h_n, h_n, · · · ), the pair (z(n) − θ_n, (y_1^n, y_2^n, · · · , y_{n−1}^n, y_n^n, y_n^n, · · · )) satisfies the optimality equations. Since the modified holding costs are less than or equal to the original costs in every state, we conclude z(n) − θ_n ≤ z*, which immediately implies the result.
16
Figure 1: Optimal buffer size: c(x) = x^2
5.1 Example Control Problems
In all of the numerical examples in this section we take A = [0, ∞) and set h_n = h_0 + s(n − M + 1)^+, where s and M are parameters that are varied across cases. In all the cases, s is strictly positive and M is an integer. In the first example we consider the quadratic cost structure for service used in [4], in particular, c(x) = x^2. Further, we set h_0 = 10 and M = 1. For this example, the optimal policies are terminating policies. Figure 1 shows the variation in the buffer limit under the optimal policy as the rejection cost κ and the multiplicative factor s in the holding cost are varied. It is clear that for fixed κ, the optimal buffer size increases with decreasing s and, further, this effect is more prominent for smaller values of s. An interesting result is the apparent lack of monotonicity in the optimal buffer size with respect to κ, for a fixed value of s. One also observes that this apparent lack of monotonicity is prominent only once the optimal buffer size “saturates.” As expected, the optimal buffer size is highest when the rejection cost is large and s is small. For small values of the rejection cost, irrespective of s, all customers are rejected. This is due primarily to the particular values chosen for the holding cost function.

For the same parameter settings, the optimal values are plotted against s and κ in Figure 2. As expected, the optimal value increases with either increased rejection or holding cost. However, note that the optimal values saturate relatively quickly in the rejection cost, and this saturation occurs faster for smaller values of s. The optimal buffer size and optimal objective values have opposite trends with respect to s, i.e., as the optimal cost decreases the optimal buffer size increases.

Figure 2: Optimal cost: c(x) = x^2

The next example is selected such that the cost for service per unit time approaches the rejection cost as the service rate approaches ∞. In this case A5 holds with equality. The cost for service is c(x) = x − x^{1/(1+ε)}, where ε is a positive parameter that is varied in the numerical experiments. In Figures 3 and 4, the optimal buffer limit and optimal cost, respectively, are plotted on the vertical axis, against ε and 1/s. The other parameters are fixed at κ = 1, M = 1 and h_0 = 1. As in the first example studied, there is an apparent lack of monotonicity of the buffer limit, with respect to ε, when the value of s is fixed. Note that both the optimal cost and optimal buffer size are essentially insensitive to the value of s. Clearly, as the service cost function decreases, i.e., ε increases, the optimal value decreases.

Figure 3: Optimal buffer size: c(x) = x − x^{1/(1+ε)}

Figure 4: Optimal cost: c(x) = x − x^{1/(1+ε)}

The algorithm presented in this paper is quite efficient, in the following sense. In Step 1 of the algorithm, one need only perform a one-dimensional search in terms of z to satisfy the terminating optimality equations. Generally speaking, the computation of φ(·) is not intensive, and thus Step 1 is computationally inexpensive. The number of these steps which must be performed is simply linear in the truncation level. Hence, for most practical problems, the algorithm will terminate, or converge, in less than a second on a standard desktop computer. Although we did not do so, one could also fine-tune the algorithm to make optimal re-use of the one-dimensional search results, which would increase the algorithmic efficiency even more.
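For the quadratic example, the pieces sketched earlier can be combined to reproduce a single illustrative parameter setting (again our own sketch; it relies on the hypothetical helpers compute_policy and solve_n_optimality defined above, the parameter values shown are only examples from the grids described in the text, and the closed forms φ(y) = y^2/4, ψ(y) = y/2 are specific to c(x) = x^2 with A = [0, ∞)):

```python
# Quadratic service cost example of Section 5.1: c(x) = x^2, h_n = h_0 + s(n - M + 1)^+.
h0, s, M, kappa = 10.0, 1.0, 1, 5.0
h = lambda n: h0 + s * max(n - M + 1, 0)
phi = lambda y: (y ** 2) / 4.0 if y >= 0 else 0.0   # sup_{x >= 0} (y*x - x^2)
psi = lambda y: max(y, 0.0) / 2.0                   # smallest maximizer

n_opt, z_opt, rates = compute_policy(h, phi, psi, kappa, z_lo=h0, z_hi=h0 + 50.0)
print(n_opt, z_opt, rates)   # optimal threshold, optimal average cost, service rates
```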
6 Optimality of Deterministic Policies
In George and Harrison's paper an important question regarding the control of their model was left unresolved. In particular, it remains to show that the search for an optimal policy can be restricted to deterministic stationary policies. For a large class of MDPs this question has been settled by previous work. For example, control problems with adjustable service rates along with additional constraints are discussed in [5, Section 6] and existence of an optimal stationary policy is established. However, there it is assumed that the service
rates are bounded. When the action space is uncountable, and non-compact, there are fewer results. As mentioned earlier, Chapter 5 of Hernández-Lerma and Lasserre [6] provides conditions for the optimality of stationary deterministic policies. The primary requirement there is for the transition probabilities to be setwise continuous. However, the results are given only for discrete-time MDPs. From basic principles, we now proceed to show that randomized policies need not be considered, assuming that one restricts to stationary policies a priori.

So, we now consider an arbitrary stationary policy, which may be randomized. When such a policy is considered, at any point in time the control can be considered to have two components: admission control and service rate control. Service rate control is applied only when a customer is present in the system and admission control is applied only when a new customer arrives. When control rates are random, the associated decision process is a continuous-time semi-Markov decision process (SMDP). Using the SMDP framework, we directly show that whenever a solution to the original optimality equations exists, and there is a corresponding stationary ergodic policy, then no randomized policy can do better. Hence, it is sufficient to restrict the search to deterministic policies.

We model the randomized decision in each state as follows. After a system event, the service rate X_n is assumed to be a random variable with c.d.f. F_n(·). Clearly, X_n must be restricted to take values in A. In this case, a system event is an arrival or departure of a customer. Furthermore, when there is an arrival to a system which is in state n, it is assumed that the customer is accepted with probability q_n. For clarity, note that we allow the controller to choose a new service rate X_n (drawn from F_n(·)) even if an incoming arrival is rejected. Thus the control decision now is to choose probabilities {q_n, n ≥ 0} and the c.d.f.'s {F_n(·), n ≥ 1} which minimize the long-run cost. Specifically, when the number of customers in the system is n ≥ 0, before an arrival or after a departure, the randomized policy is specified by Q_n ≡ (q_n, F_n(·)) and the entire control policy is given by the functional vector Q⃗ ≡ {Q_n, n ≥ 0}. Below, the expectation operator is defined with respect to Q⃗. So, for example, E[X_n] is the expected service rate under this policy, when there are n jobs in the system. Under Q⃗, we assume that E[X_n] < ∞ for all n. A policy with E[X_n] = ∞ will have an average cost per unit time equal to the cost when customers are always rejected in state n. Under a general stationary policy, the service time in a state n follows a distribution which is determined by Q_n. Given such a policy, it should be clear that the queue length process is then a semi-Markov process. Under an ergodic policy Q⃗ with E[X_n] < ∞ there exist steady-state probabilities v_n which represent the long-run fraction of time the SMDP spends in state n. Note that v_n > 0 for states below the rejection threshold. Using standard ergodic theory for SMDPs, the long-run average cost under any ergodic stationary policy Q⃗ is:
∞ X
vn [hn + (1 − qn )κ] +
n=0
∞ X
vn E[c(Xn )].
(34)
n=1
We now provide the extension to Theorem 1 which establishes the optimality of deter20
ministic stationary policies. Theorem 7. If there exist a z < ∞ and uniformly bounded (y1 , y2 , . . . ), satisfying the optimality equations (4b)-(6), and the corresponding policy as specified in Theorem 1 is ~ for every stationary policy Q. ~ ergodic, then z ≤ z(Q) ~ and a bounded solution z, (y1 , y2 , . . .) to the Proof. Consider then a fixed stationary policy Q, ~ First, optimality equations. Let µ ˜n := E[Xn ] be the mean service rate in state n, under Q. one can derive the following detailed balance equations for the SMDP via the embedded Markov chain (which is a simple random walk with self-transitions): qn νn = νn+1 µ ˜n+1
for n ≥ 0.
(35)
Next, equations (7) and(8) imply: xyn − c(x) ≤ yn+1 + hn − z ∀ x ∈ A, n ≥ 1, xyn − c(x) ≤ κ + hn − z ∀ x ∈ A, n ≥ 1. Taking the convex combination of these inequalities induced by (qn , 1 − qn ) yields: xyn − c(x) − (1 − qn )κ − yn+1 qn ≤ hn − z. This inequality holds for each n ≥ 1 and x ∈ A, thus it also holds for each realization of Xn taking values in A, i.e., Xn yn − c(Xn ) − (1 − qn )κ − yn+1 qn ≤ hn − z
w.p. 1,
for each n ≥ 1. Taking expectations gives: yn µ ˜n − E[c(Xn )] − (1 − qn )κ − yn+1 qn ≤ hn − z
∀ n ≥ 1.
(36)
Next, by definition if n = 0 is not a terminating state under the policy corresponding to z, (y1 , y2 , . . .) then y1 < κ. If n = 0 is a terminating state then it is straightforward to check that y1 = κ. Thus, in either case, y1 ≤ κ. Combining this with (4b) yields z − h0 ≤ κ. Taking a convex combination of (4b) and this last inequality gives z − h0 ≤ y1 q0 + (1 − q0 )κ. Now, multiplying (36) by νn for each n ≥ 1 we obtain: yn νn µ ˜n − νn E[c(Xn )] − νn (1 − qn )κ − νn qn yn+1 ≤ νn [hn − z]. Applying (35) gives yn qn−1 νn−1 − yn+1 qn νn − νn E[c(Xn )] − νn (1 − qn )κ ≤ νn [hn − z] ∀ n ≥ 1. 21
(37)
Finally, summing this last set of inequalities over all n ≥ 1 and using (37) (which is essentially the n = 0 case) we obtain: z≤
∞ X
vn [hn + (1 − qn )κ] +
n=0
∞ X
vn E[c(Xn )],
n=1
~ for every ergodic stationary policy Q. ~ Note that the last summation step is i.e., z ≤ z(Q) justified by the fact that the yi are bounded. From first principles we have proved that if there exists an optimal stationary policy, then there also exists an optimal deterministic stationary policy. The result is derived from the balance equations for the semi-Markov process and the original optimality equation. Acknowledgements. We would like to thank Mike Harrison, Levent Ko¸ca˘ga, Mark Lewis, Mike Veatch, and Amy Ward for advice while this research was in process.
Bibliography [1] B. Ata and S. Shneorson. Dynamic control of an M/M/1 service system with adjustable arrival and service rates. Management Science, 52(11):1778–1791, 2006. [2] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, Massachusetts, 3rd edition, 2007. [3] V. S. Borkar and S. P. Meyn. Risk-sensitive optimal control for Markov decision processes with monotone cost. Mathematics of Operations Research, 27(1):192–209, 2002. [4] J. M. George and J. M. Harrison. Dynamic control of a queue with adjustable service rate. Operations Research, 49(5):720–731, 2001. [5] X. Guo and O. Hern´andez-Lerma. Drift and monotonicity conditions for continuous-time controlled Markov chains with an average criterion. IEEE Transactions on Automatic Control, 48:236–245, 2003. [6] O. Hern´andez-Lerma and J. B. Lasserre. Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, 1996. [7] G. Koole. Structural results for the control of queueing systems using event-based dynamic programming. Queueing Systems: Theory and Applications, 30:323–339, 1998. [8] M. L. Puterman. Markov Decision Processes. Wiley-Interscience, New York, 1994. [9] L. I. Sennott. Stochastic Dynamic Programming and the Control of Queueing Systems. Wiley-Interscience, New York, 1998.
22
[10] S. Stidham and R. Weber. Monotonic and insensitive optimal policies for control of queues with undiscounted costs. Operations Research, 37(4):611–625, 1989. [11] J. Wijngaard and S. Stidham. Forward recursion for Markov decision processes with skip-free-to-the-right transitions. Mathematics of Operations Research, 11:295–308, 1986.
7
Appendix
Before proceeding to prove Lemma 4, we need to introduce the notation M (n). For an optimal non-terminating policy (~η (n), ∞) defined in Lemma 3 which is ergodic we have, using the balance equations (2), zˆ(n) = ⇒ zˆ(n) =
n−1 X k=0 n−1 X
pk (~η (n), ∞) {c(ˆ µnk )
+ hk } +
∞ X
pk (~η (n), ∞) {c(ˆ µnn ) + hn }
k=n
pk (~η (n), ∞) {c(ˆ µnk )) + hk } + M (n) {c(ˆ µnn ) + hn } ,
(38a)
k=0
where
n−1 X
µ ˆnn . M (n) = 1 − pk (~η (n), ∞) = pn (~η (n), ∞) n µ ˆn − 1 k=0
(38b)
~¨n , n), Now, for every non-terminating policy (~η (n), ∞) construct a terminating policy (µ defined by ( µ ˆnk if 1 ≤ k < n; µ ¨nk = η (n),∞) n pn (~ if k = n. µ ˆk M (n) It can be checked that the expression for the k = n case reduces to µ ˆnn − 1 as in the statement of Lemma 4. The more complex expression above is useful in the proof below. It is argued below that µ ˆnn − 1 is positive for large enough n. Without loss of generality we assume that the policies constructed above are for large enough n. Proof of Lemma 4. Let z¯(n) be the objective value under the terminating policy constructed above. Then by construction we have z¯(n) = zˆ(n) + M (n) {c(¨ µnn ) − c(ˆ µnn ) + κ} ≤ z ∗ + M (n)κ µ ˆn = z ∗ + pn (~η (n), ∞) n n κ, µ ˆn − 1
23
(39)
where the second inequality holds since µ ¨nn < µ ˆnn and zˆ(n) ≤ z ∗ . Since the degenerate case is excluded, (30) implies that there exists N ≥ 1 such that zˆ(n) ≤ zˆ(∞) < hn
∀ n ≥ N.
(40)
Therefore from (25), φ(ˆ ynn ) > yˆnn for all n ≥ N . This implies that ψ(ˆ ynn ) > 1 for all n ≥ N , which implies µ ˆnn > 1 for all n ≥ N . Since the yni are non-decreasing in i and ψ(·) is non-decreasing we have ψ(ynm ) > 1 and µ ˆm n > 1 ∀ m ≥ n ≥ N.
(41)
Using (2), and the structure of the non-terminating policies defined in Lemma 3, we have ( !) n−1 n X Y µ ˆnn pn ((~η (n), ∞)) µ ˆn = 1. (42) + µ ˆnn − 1 k=0 i=k+1 i Note that in the case when µ ˆni = 0 for some i < N , the equations above can be suitably modified. Due to (41), the sum in the brackets in (42) diverges as n → ∞. Thus it must be the case that pn ((~η (n), ∞) ↓ 0 as n ↑ ∞. (43) Applying (43) to (39) yields that lim supn→∞ z¯(n) ≤ z ∗ . Furthermore we have z¯(n) ≥ z ∗ for all n ≥ 1 since any terminating policy is a feasible, ergodic policy. Thus, lim inf n→∞ z¯(n) ≥ z ∗ . These last two observations finally yield that limn→∞ z¯(n) = z ∗ as desired.
24