A Probabilistic Analysis of Bias Optimality in Unichain Markov Decision Processes

Mark E. Lewis
Department of Industrial and Operations Engineering University of Michigan, 1205 Beal Avenue, Ann Arbor, MI 48109-2117
[email protected] 734-764-6473
Martin L. Puterman
Faculty of Commerce and Business Administration University of British Columbia, 2053 Main Mall, Vancouver, BC Canada V6T 1Z2
[email protected] 604-822-1800
Submitted August 12, 1999

Abstract
Since the long-run average reward optimality criterion is underselective, a decision-maker often uses bias to distinguish between multiple average optimal policies. We study bias optimality in unichain, finite state and action space Markov Decision Processes. A probabilistic approach is used to give intuition as to why a bias-based decision-maker prefers a particular policy over another. Using relative value functions from the long-run average reward model, we present new methods for computing optimal bias. Furthermore, while the properties of discounting are lost in the long-run average formulation, we show how and why bias implicitly discounts future rewards and costs. Each of these observations is motivated by examples that have applications to queueing theory.
1 Introduction

AMS Subject Classifications: primary 90C40, Markov decision processes; secondary 60K25, theory. IAOR Subject Classifications: primary 3160, Markov processes; secondary 3390, dynamic programming: theory.

Bias optimality has previously been regarded as a theoretical concept in Markov Decision Process (MDP) theory. It was viewed as one of many optimality criteria that is more sensitive
than long-run average optimality, but its usefulness in application had not been considered. In many applications where there are multiple gain optimal policies, there is only one bias optimal policy. Hence, the bias-based decision-maker need not look any further to decide between a group of gain optimal policies. We show through probabilistic arguments that bias can be used to make such decisions rather easily. In fact, in the examples we propose, computing the bias is not necessary, since the bias optimal policy can be obtained directly from the average optimality equations. Furthermore, there are numerous similarities between finding bias optimal policies and finding average optimal policies. This relates the bias to the vast literature on average optimality and justifies a similar general appeal. Discount and average (gain) optimality have received considerable attention in the MDP literature. However, substantially less attention has been paid to bias optimality. Recently, Haviv and Puterman [5] showed in an admission control model, with one server and a holding cost, that one can distinguish between two average optimal solutions by appealing to their bias. Their work was extended by Lewis et al. [8] to a finite capacity, multi-class system with the possibility of multiple gain optimal policies. Further, Lewis and Puterman [9] showed that in the Haviv-Puterman model, the timing of rewards affects bias optimality. Whereas the Haviv-Puterman paper showed that when rewards are received upon acceptance and there are two consecutive gain optimal control limits, say L and L + 1, only L + 1 is bias optimal, the Lewis-Puterman paper showed that if the rewards are received upon departure, only control limit L is bias optimal. This suggests that bias may implicitly discount rewards received later. However, these papers do not address why bias differentiates between gain optimal policies and how the timing of rewards comes into play.
In this paper, we present a new approach to compute the bias directly from the average optimality equations. This leads to sample path arguments that provide alternative derivations of the above-mentioned results. We also discuss the similarities involved in computing the bias and the gain using
the bias optimality equations, and show that the policy iteration algorithm can be used to find bias optimal policies. Discount and average optimality have been considered expansively in the literature; therefore, we will not provide a complete review here. For a comprehensive review, refer to the survey paper of Arapostathis et al. [1] or Chapters 8 and 9 of Puterman [14]. Howard introduced a policy iteration algorithm to solve the average reward model in the finite state space case. This has been considerably extended; for example, see the recent work of Hernandez-Lerma and Lasserre [7] or Meyn [13]. In contrast, bias optimality has not received much direct attention in the literature. In fact, to our knowledge, in addition to the previously mentioned papers ([5], [8], [9]), the use of bias to distinguish between gain optimal policies has only appeared in a short section of an expository chapter by Veinott [18]. Methods of computing optimal bias were considered for the finite state and action space case by Denardo [3] and on countable state and compact action spaces by Mann [12]. Blackwell's [2] classic paper showed the existence of stationary optimal policies in the discounted finite state case and introduced a more sensitive optimality criterion now called Blackwell optimality. In essence, Blackwell optimal policies are discount optimal for all discount rates close to 1. It turns out that Blackwell optimality implies bias optimality, so that we have the existence of bias optimal policies in the finite state and action space case as well. There is also a vast literature on sensitive optimality that indirectly addresses bias optimality (cf. Veinott [17]). However, none of these works gives an intuitive explanation for what the bias-based decision-maker prefers and why. The rest of the paper is organized as follows: Section 2 lays out the model formulation and relevant definitions. We show methods to compute and interpret the gain and the bias in Section 3.
This is extended to computing optimal gain and bias in Sections 4 and 5. The issue of discounting is discussed in Section 6. We conclude in Section 7.
2 Model Formulation

Our notation and formulation follow Puterman [14]. Consider an infinite horizon, discrete time Markov decision process (MDP) with finite state space S. Let A_s be the finite set of actions available to a decision-maker when in state s. If the decision-maker chooses action a ∈ A_s when in state s, an immediate (expected) reward of r(s, a) is received and the system enters state j with probability P(j|s, a). Let A = ∪_{s∈S} A_s be the action space.
A deterministic, Markovian decision rule d maps S into A and specifies the action the decision-maker will take when the system is in state s. A sequence of such decision rules π = {d_1, d_2, ...} is called a deterministic, Markovian policy and specifies the decision-maker's actions for each state, for all time. We say that a policy is stationary if it uses the same decision rule at each decision epoch; the stationary policy that uses decision rule d at every epoch is written d^∞, and the set of such policies is denoted D^∞. Each policy π generates a sequence of random variables {(X_n, Y_n), n = 1, 2, ...}, where X_n denotes the state of the system and Y_n denotes the action chosen by policy π at decision epoch n given X_n. Unless otherwise noted, we assume the Markov decision process is unichain; that is, all stationary policies generate Markov chains that consist of a single ergodic class and possibly some transient states. We now formalize the definitions of gain and bias.
Definition 1 The long-run average reward or gain of a policy π, given that the system starts in state s ∈ S, denoted g^π(s), is given by

g^π(s) = lim_{N→∞} (1/N) E_s^π [ Σ_{n=0}^{N−1} r(X_n, Y_n) ],

where the expectation is conditioned on the state at time zero and taken with respect to the probability measure generated by π. Furthermore, a policy π* is called gain optimal if

g^{π*}(s) ≥ g^π(s) for all s ∈ S and for all π.
Definition 2 The bias of a stationary policy d^∞, given that the system starts in state s, denoted h_d(s), is defined to be

h_d(s) = Σ_{n=0}^∞ E_s^d [ r(X_n, d(X_n)) − g_d(X_n) ].   (1)

We say that a policy (d*)^∞ is bias optimal if it is gain optimal and

h_{d*}(s) ≥ h_d(s) for all s ∈ S and for all d.

If the Markov chain generated by d^∞ is aperiodic, this sum is convergent; otherwise, replace the above sum with sums in the Cesàro sense.
Remark 1 Our definition of bias optimality restricts attention to stationary policies. This causes no actual restriction in our formulation, as this set of policies is large enough to guarantee the existence of a bias optimal policy. We will briefly discuss this fact momentarily.
Definition 3 For a particular policy d^∞ and s ∈ S, let

e_d(s) = r_d(s) − g_d(s)   (2)

be called the excess reward of d^∞.
If one defines a new system in which the excess reward replaces the reward function, then the bias is the expected (finite) total reward in the modified system. Alternatively, the bias represents the expected difference in total reward under policy d^∞ between two different initial conditions: when the process begins in state s, and when the process begins with the state selected according to the probability distribution defined by the s-th row of the limiting matrix P*_d = lim_{n→∞} (1/n) Σ_{i=0}^{n−1} P_d^i. Since we assume that the process under d^∞ is unichain, this initial distribution is the stationary distribution of the chain. When the process is multichain, the distributions specified by the rows of P*_d may vary with the initial state.
As a result of these observations, if two stationary policies have the same gain, a decision-maker would prefer the one with the greater bias. We show that in certain models the bias has the further characteristic of implicitly discounting rewards received later. Of course, there are other, more sensitive optimality criteria, including 1-optimality and Blackwell optimality.
Definition 4 A policy π* is called Blackwell optimal if there exists λ* such that 0 ≤ λ* < 1 and v_λ^{π*} ≥ v_λ^π for all π and all λ* ≤ λ < 1, where v_λ^π is the total discounted reward of the Markov decision process when using policy π with discount rate λ.
Since Blackwell [2] showed the existence of stationary Blackwell optimal policies when the rewards are bounded and the state and action spaces are finite, and Blackwell optimality implies bias optimality, which in turn implies gain optimality (cf. Puterman [14], Theorem 10.1.5), we restrict our attention to stationary, deterministic policies.
3 Computing the Gain and Bias

We next discuss the computation of the gain and the bias of a fixed stationary policy d^∞. Denote by r_d the reward vector corresponding to decision rule d. When d is deterministic, P_d(j|s) = p(j|s, d(s)) and r_d(s) = r(s, d(s)). We refer to

Y_d ≡ (I − P_d + P*_d)^{−1} (I − P*_d)

as the deviation matrix of P_d. It is well known that

g_d = P*_d r_d   (3)

and

h_d = Y_d r_d.   (4)
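Equations (3) and (4) are easy to check numerically. The sketch below uses a hypothetical two-state chain (the transition matrix and rewards are illustrative, not from the paper) and verifies that the resulting bias satisfies P*_d h_d = 0:

```python
import numpy as np

# Hypothetical unichain transition matrix and reward vector for a
# fixed decision rule d.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
n = P.shape[0]

# In the unichain case every row of P* equals the stationary
# distribution pi; solve pi P = pi, sum(pi) = 1 as a consistent
# overdetermined system.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi = np.linalg.lstsq(A, b, rcond=None)[0]
P_star = np.tile(pi, (n, 1))

# Deviation matrix Y_d = (I - P_d + P*_d)^(-1) (I - P*_d), then (3)-(4).
Y = np.linalg.inv(np.eye(n) - P + P_star) @ (np.eye(n) - P_star)
gain = P_star @ r            # g_d = P*_d r_d
bias = Y @ r                 # h_d = Y_d r_d

assert np.allclose(gain, gain[0])         # the gain is constant
assert np.allclose(P_star @ bias, 0.0)    # normalization P*_d h_d = 0
```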
Alternatively, the gain and the bias of d^∞ may be computed by solving the following system of linear equations:

g = P_d g   (5)

h = r_d − g + P_d h   (6)

and

w = −h + P_d w   (7)

for g, h, and w. Specifically, the gain and the bias of d^∞ satisfy (5) and (6), and there exists some vector w which together with the bias satisfies (7). Moreover, the gain and the bias are the unique vectors with this property. We refer to (5) and (6) as the average evaluation equations (AEE) and to (5), (6), and (7) as the bias evaluation equations (BEE). Since P_d is
unichain,

1. g is a constant, which we express as g1, where 1 is a vector of 1's of dimension |S|.

2. (5) is redundant and (6) becomes

h = r_d − g1 + P_d h.   (8)

3. If (g, h) satisfies (8), then g = g_{d^∞} and h is unique up to a constant.

4. (g_{d^∞}, h_{d^∞}) is the unique solution of (8) together with the additional condition P*_d h = 0.

With these observations in mind, we have the following definition.
Definition 5 Let d^∞ ∈ D^∞ be a fixed stationary policy. For each solution (g_d, h) to the average evaluation equations, the constant difference between h and the bias of d^∞, denoted c_d(h), is called the bias constant associated with h.
Definition 6 If (g, h_d^{rv(α)}) satisfies (8) and h_d^{rv(α)}(α) = 0, then h_d^{rv(α)} is unique and is called a relative value function of d^∞.

Notice that there is a relative value function for each state α. Suppose now that α is any recurrent state for the chain generated by the policy d^∞ ∈ D^∞. Let the first passage time of a process on S to α be denoted τ. That is,

τ = min{n ≥ 0 | X_n = α}.

Applying a classic result in renewal theory to (3) (cf. Resnick [15]),

g_d = E_α^d [ Σ_{n=0}^{τ−1} r(X_n, Y_n) ] / E_α^d [τ].   (9)
This allows for a probabilistic interpretation of the gain. To get a similar interpretation for the bias, let

h_d^α(s) = E_s^d [ Σ_{n=0}^{τ−1} (r(X_n, Y_n) − g_d) ].   (10)

Note that

h_d^α(s) = r(s, d(s)) − g_d + E^d [ Σ_{n=1}^{τ−1} (r(X_n, Y_n) − g_d) | X_0 = s ]
 = r(s, d(s)) − g_d + (P_d h_d^α)(s).

Hence, (g_d, h_d^α) satisfies (8) for d^∞, and h_d^α represents the total excess reward earned until the process enters state α, given that the process started in state s and uses policy d^∞. Furthermore, h_d^α(α) = 0. Thus, the function h_d^α is the relative value function of d^∞ with reference state α; h_d^α = h_d^{rv(α)}. The fact that (g_d, h_d^α) satisfies the AEE was first reported in Derman and Veinott [4].
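The relative value function in (10) can be obtained by solving a small first-passage linear system rather than by simulation; a sketch on a hypothetical two-state chain with reference state α = 0:

```python
import numpy as np

# Hypothetical two-state chain for a fixed rule d; reference state alpha.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
alpha, n = 0, P.shape[0]

# Gain from the stationary distribution.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
pi = np.linalg.lstsq(A, np.concatenate([np.zeros(n), [1.0]]), rcond=None)[0]
g = pi @ r

# h^alpha(s) = E_s[sum_{n=0}^{tau-1} (r(X_n) - g)]: fixing h(alpha) = 0
# (tau = 0 when the chain starts in alpha), the remaining states solve
# a first-passage linear system over the substochastic block of P.
mask = np.arange(n) != alpha
h = np.zeros(n)
h[mask] = np.linalg.solve(np.eye(mask.sum()) - P[np.ix_(mask, mask)],
                          (r - g)[mask])

# (g, h^alpha) satisfies the AEE (8) at every state, including alpha.
assert np.allclose(h, r - g + P @ h)
```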
Since, for a fixed policy d^∞, the relative value functions and the bias satisfy the AEE, by our previous observations they must differ by a constant. The next proposition uses this fact to show a close relationship between the two. This proposition may also be shown to hold by rewriting (7) in a form equivalent to Poisson's equation and applying results discussed in Derman and Veinott [4] or, more recently, Makowski and Shwartz [11].
Proposition 1 Suppose a finite state and action space Markov decision process is unichain. Let d^∞ be a stationary policy and denote a relative value function of d^∞ by h_d^{rv(α)}. The bias of d^∞ is given by h_d = h_d^{rv(α)} − P*_d h_d^{rv(α)}.
Proof. We know that any relative value function is within a constant of the bias. Recall that c_d(h_d^{rv(α)}) is the bias constant associated with h_d^{rv(α)}. Then

h_d = h_d^{rv(α)} + c_d(h_d^{rv(α)})1,   (11)

and

P*_d h_d = P*_d h_d^{rv(α)} + P*_d c_d(h_d^{rv(α)})1.   (12)

However, we know that P*_d h_d = 0, so that

P*_d h_d^{rv(α)} = −P*_d c_d(h_d^{rv(α)})1   (13)
 = −c_d(h_d^{rv(α)})1,   (14)

where the last equality follows since P*_d is stochastic; note also that the unichain assumption implies that the rows of P*_d are equal, so that P*_d h_d^{rv(α)} is a constant vector. Making the appropriate substitution into (11) yields the result. □

Recall that the gain of a stationary policy d^∞ can be computed as P*_d r_d. Hence, in the same manner that we compute the gain of a policy given the reward function, we can compute the bias using a relative value function by computing P*_d h_d^{rv(α)}. Again applying a classic result in renewal theory, we have the following important corollary.
Corollary 1 If h_d^{rv(α)} is the relative value function of d^∞ relative to the recurrent state α, we may compute the bias of d^∞ as

h_d(s) = h_d^{rv(α)}(s) − E_α^d [ Σ_{n=0}^{τ−1} h_d^{rv(α)}(X_n) ] / E_α^d [τ].   (15)

This expression allows us to compute the bias of a stationary policy sample-pathwise. In the next two sections, we show how this can be used to find bias optimal policies.
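Proposition 1 and Corollary 1 can be verified numerically. The sketch below (on a hypothetical two-state chain) computes the bias once from the deviation matrix (4) and once by subtracting the constant P*_d h_d^{rv(α)} from a relative value function:

```python
import numpy as np

# Hypothetical two-state chain: bias via the deviation matrix versus
# bias via Proposition 1 applied to a relative value function.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
alpha, n = 0, P.shape[0]

A = np.vstack([P.T - np.eye(n), np.ones(n)])
pi = np.linalg.lstsq(A, np.concatenate([np.zeros(n), [1.0]]), rcond=None)[0]
g = pi @ r
P_star = np.tile(pi, (n, 1))

# Relative value function with reference state alpha (h_rv(alpha) = 0).
mask = np.arange(n) != alpha
h_rv = np.zeros(n)
h_rv[mask] = np.linalg.solve(np.eye(mask.sum()) - P[np.ix_(mask, mask)],
                             (r - g)[mask])

Y = np.linalg.inv(np.eye(n) - P + P_star) @ (np.eye(n) - P_star)
bias_direct = Y @ r                 # eq. (4)
bias_prop1 = h_rv - pi @ h_rv       # Proposition 1: subtract P*_d h_rv
assert np.allclose(bias_direct, bias_prop1)
```

By (15), the subtracted constant pi @ h_rv is also the renewal-reward average of h_rv over a cycle, which is how the sample path computations in the examples proceed.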
4 The Average Optimality Equation

Since the state and action spaces are finite, computation of gain optimal policies reduces to solving the average optimality equations (AOE)

h = max_{d∈D} {r_d − g1 + P_d h}   (16)

for g and h. Let G(h) be the set of policies that achieve the maximum in (16). That is,

d* ∈ argmax_{d∈D} {r_d + P_d h} ≡ G(h).   (17)

We refer to (17) as the average optimality selection equations (AOSE). To begin our analysis of the average optimality equations, we consider a special case of a result of Schweitzer and Federgruen [16]. In essence, the result states that solutions of the AOE must differ by a constant, just as they do for the AEE. We include a simple proof of this result to keep this paper self-contained. In addition to being useful for establishing several results below, it shows that the average optimality equations do not determine the set of gain optimal solutions in unichain average reward models. Thus, the optimal gain is uniquely determined by the AOE, but the solution of the AOE is not unique.
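The AOE can be solved by policy iteration, alternating evaluation of (8) with an improvement step over (17); a minimal sketch on a hypothetical two-state unichain MDP (all data below are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical two-state unichain MDP. P[s][a] is the transition row
# and r[s][a] the reward for the a-th action available in state s.
P = {0: [np.array([0.9, 0.1]), np.array([0.2, 0.8])],
     1: [np.array([0.5, 0.5])]}
r = {0: [1.0, 2.0], 1: [0.0]}
S = [0, 1]

def evaluate(d):
    """Solve the AEE (8) for rule d with the normalization h(0) = 0."""
    Pd = np.array([P[s][d[s]] for s in S])
    rd = np.array([r[s][d[s]] for s in S])
    n = len(S)
    # Unknowns (g, h(1), ..., h(n-1)); fixing h(0) = 0 pins down the
    # additive constant, so the system is square and nonsingular.
    A = np.hstack([np.ones((n, 1)), (np.eye(n) - Pd)[:, 1:]])
    sol = np.linalg.solve(A, rd)
    return sol[0], np.concatenate([[0.0], sol[1:]])

d = {0: 1, 1: 0}                     # arbitrary starting rule
while True:
    g, h = evaluate(d)
    # Improvement step: maximize r_d + P_d h state by state (argmax's
    # first-index tie-breaking is adequate for this small instance).
    d_new = {s: int(np.argmax([r[s][a] + P[s][a] @ h
                               for a in range(len(r[s]))])) for s in S}
    if d_new == d:
        break
    d = d_new
```

On this instance the algorithm switches from the second action to the first in state 0 and stops with optimal gain g = 5/6.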
Proposition 2 Suppose all stationary policies are unichain and let (g_1, h_1) and (g_2, h_2) be solutions to the AOE. Then g_1 = g_2 and

h_1 = h_2 + c1   (18)

for some constant c. In particular, if h_1 = h* is the optimal bias, then

h* = h_2 + c(h_2)1.   (19)

Proof. Suppose (g_1, h_1) is a solution of (16). Then there exists a δ ∈ G(h_1) for which

h_1 = max_{d∈D} {r_d − g_1 1 + P_d h_1} = r_δ − g_1 1 + P_δ h_1.   (20)

Since P_δ is unichain, (20) uniquely determines g_1 and determines h_1 up to a constant. Since (g_1, h*) also satisfies (16), (19) follows. The case for a general solution of (16) is analogous. □

Remark 2 Note that this result does not require that S be finite, only that the gain is constant. A nice discussion of the average optimality equations on Borel spaces can be found in Hernandez-Lerma and Lasserre [6]. We refer to c(h_2) as the optimal bias constant associated with h_2.
Proposition 2 implies that any two solutions to the AOE must have relative value functions that differ by a constant. This leads to the question of whether there are decision rules that are gain optimal but do not satisfy the AOE. When all policies generate irreducible Markov chains, it is known that the average optimality equations are indeed necessary and sufficient (see, for example, Lewis et al. [8]). The following example shows that this does not hold in the unichain case.
Example 1 Suppose S = {1, 2}, A_1 = {a, b}, A_2 = {c}, r(1, a) = 2, r(1, b) = 3, r(2, c) = 1, and p(2|1, a) = p(2|1, b) = p(2|2, c) = 1. Let δ be the decision rule that chooses action a in state 1 and let γ be the decision rule that chooses action b in state 1. Clearly this model is unichain and g^δ = g^γ = 1, h^δ(1) = 1, h^γ(1) = 2, h^δ(2) = h^γ(2) = 0. Since h^δ and h^γ do not differ by a constant, it follows from Proposition 2 that (g^δ, h^δ) and (g^γ, h^γ) cannot both satisfy the optimality equation.

If α is recurrent for all stationary policies, then a solution to the AOE, (g, h), such that h(α) = 0 must be unique. Recall that the relative value function defined in (10) is such that for a recurrent state α, h_d^{rv(α)}(α) = 0. Suppose d is gain optimal, so that g_d = g. From the AOE we have

h^{rv(α)}(s) = max_{d∈D} { r(s, d(s)) − g + (P_d h^{rv(α)})(s) }
 = max_{d∈D} { E_s^d [ Σ_{n=0}^{τ−1} (r(X_n, Y_n) − g_d) ] + (g_d − g) E_s^d [τ] }.   (21)

Let D_0 be the set of decision rules whose corresponding stationary policies attain the optimal gain g. Then, since g_d = g for d ∈ D_0, from (21)

h^{rv(α)}(s) = max_{d∈D_0} { E_s^d [ Σ_{n=0}^{τ−1} (r(X_n, Y_n) − g) ] }.

In the previous example, one might ask which pair, (g*, h^δ) or (g*, h^γ), satisfies the average optimality equations. From the preceding analysis we have the following proposition.
Proposition 3 Suppose there exists a state α that is positive recurrent for all gain optimal stationary policies. If (g, h) satisfies the AOE and d ∈ G(h), then h_d^{rv(α)} ≥ h_{d'}^{rv(α)} for all gain optimal d'.

Thus, in Example 1 only (g*, h^γ) satisfies the average optimality equations. Denote the relative value function associated with the optimal gain by h^{rv(α)}; that is, (g, h^{rv(α)}) is a solution to the AOE. Whereas previous work has focused on algorithmic methods to compute the bias (cf. Veinott [17]), a probabilistic interpretation has yet to be established. In the sequel we show that our previous observations lead to simple sample path arguments for optimal bias.
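Example 1 can also be verified numerically from (3) and (4); in the sketch below the states {1, 2} are relabeled {0, 1}:

```python
import numpy as np

# Example 1 with states {1, 2} relabeled {0, 1}. Under either rule the
# chain moves to the absorbing state; only the first reward differs.
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])          # same transitions under delta and gamma
r_delta = np.array([2.0, 1.0])      # action a in state 1
r_gamma = np.array([3.0, 1.0])      # action b in state 1

P_star = np.array([[0.0, 1.0],      # chain is absorbed in the second state
                   [0.0, 1.0]])
Y = np.linalg.inv(np.eye(2) - P + P_star) @ (np.eye(2) - P_star)

g = float((P_star @ r_delta)[0])    # gain = 1 under both rules
h_delta, h_gamma = Y @ r_delta, Y @ r_gamma
# h_delta = [1, 0] and h_gamma = [2, 0] do not differ by a constant,
# so only one of the two pairs can satisfy the AOE; gamma, with the
# larger bias, is the bias optimal rule.
```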
5 The Bias Optimality Equation

Suppose that, in addition to satisfying the AOE, h satisfies

w = max_{d∈G} {−h + P_d w}   (22)

and d* satisfies

d* ∈ argmax_{d∈G} {P_d w},   (23)

where G is the set of policies that achieve the maximum in the AOE (16) for g and h, for some vector w. Then (d*)^∞ is bias optimal and h is the optimal bias. We refer to the above set of equations as the bias optimality equations (BOE). Upon substituting (19) into (22) for the relative value function with reference state α that together with the optimal gain satisfies the AOE, we have the following result.

Proposition 4 Suppose h^{rv(α)} is a relative value function with reference state α such that (g, h^{rv(α)}) is a solution to the AOE. The BOE (22) can be rewritten

w = max_{d∈G} {−h^{rv(α)} − c(h^{rv(α)})1 + P_d w}.   (24)

Observe that (24) has exactly the same form as the AOE (16). That is to say, if r_d = −h^{rv(α)} and g = −c(h^{rv(α)}), we again have the AOE. Thus, in a unichain model, (24) uniquely determines c(h^{rv(α)}) and determines w up to a constant. Furthermore, all solution methods and theory for the AOE apply directly in this case. In particular, (24) can be solved by value iteration or policy iteration. Alternatively, as in the AEE, if (g_d, h) satisfies the AOE and

P*_d h = 0,   (25)

where each row of P*_d is the stationary distribution of the chain generated by d, then h is the optimal bias. Neglecting the trivial case r_d = 0 for all d ∈ D, it is interesting to note that, since P*_d is positive on the recurrent class generated by d, the optimal bias must have both positive and negative elements. We will show in the examples that follow that we can take advantage of this fact. Suppose that d^∞ is bias optimal. From Proposition 1,
h = h^{rv(α)} − P*_d h^{rv(α)}.   (26)

In essence, solving for the policy with maximum bias reduces to finding the policy that achieves the maximum bias constant, say c. That is,

c = max_{d∈G} {−P*_d h^{rv(α)}}   (27)
 = −min_{d∈G} {P*_d h^{rv(α)}},   (28)

where h^{rv(α)} is any relative value function of a gain optimal policy. Thus, under the assumption that there exists a state α that is recurrent for all decision rules in G, we can alternatively compute the optimal bias constant by solving

c = −min_{d∈G} E_α^d [ Σ_{n=0}^{τ−1} h^{rv(α)}(X_n) ] / E_α^d [τ],   (29)

where the expectation is taken with respect to the probability transition function conditioned on starting in state α. Since we are minimizing, h^{rv(α)} can be interpreted as a cost function. Thus, finding a bias optimal policy corresponds to a minimum average cost problem. Furthermore, one might notice that, given a relative value function, we can solve for the gain by noting g_d = r_d + P_d h_d^{rv(α)} − h_d^{rv(α)}. That is, the relative value function can be used to obtain both the gain and the bias. We emphasize the importance of these observations in the following examples. Since we will often be interested in the difference of a function between states s and s + 1, define, for a function b on S, Δb(s) ≡ b(s + 1) − b(s). In the following example, we show that we can use this interpretation to acquire insight into the structure of bias optimal policies for non-trivial problems.
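Equations (27)-(28) translate directly into a computational recipe: evaluate P*_d h^{rv(α)} for each gain optimal decision rule and keep the minimizer. A sketch on a hypothetical three-state instance with two rules of gain 0 that differ only in whether a +1 reward precedes or follows a −1 cost:

```python
import numpy as np

# Two gain optimal rules (both have gain 0): rule 'a' earns +1 then
# pays -1; rule 'b' pays -1 then earns +1. State 0 is the decision
# state; states 1 and 2 are the intermediate states of each cycle.
chains = {
    'a': (np.array([[0., 1., 0.], [1., 0., 0.], [1., 0., 0.]]),
          np.array([1.0, -1.0, 1.0])),
    'b': (np.array([[0., 0., 1.], [1., 0., 0.], [1., 0., 0.]]),
          np.array([-1.0, -1.0, 1.0])),
}

def gain_and_bias_constant(P, r, alpha=0):
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    pi = np.linalg.lstsq(A, b, rcond=None)[0]   # stationary distribution
    g = pi @ r
    # Relative value function with h(alpha) = 0 via first passage to alpha.
    mask = np.arange(n) != alpha
    h_rv = np.zeros(n)
    h_rv[mask] = np.linalg.solve(
        np.eye(mask.sum()) - P[np.ix_(mask, mask)], (r - g)[mask])
    return g, -(pi @ h_rv)                      # eq. (27): c_d = -P*_d h_rv

consts = {name: gain_and_bias_constant(P, r) for name, (P, r) in chains.items()}
# Rule 'a' attains the larger bias constant: its reward arrives before
# its cost, so 'a' is the bias optimal rule.
```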
Example 2 Consider an admission-controlled M/M/1/k queueing system with Poisson arrival rate λ and exponential service rate µ. Assume that a holding cost is accrued at rate f(s) while there are s customers in the system. If admitted, the job enters the queue and the decision-maker immediately receives reward R; rejected customers are lost. Assume that the cost f is convex and increasing in s with f(0) = 0. Furthermore, assume that we discretize the model by applying the standard uniformization technique of Lippman [10]; without loss of generality, the uniformization constant is λ + µ = 1. Since rejecting all customers yields g = 0, we assume customers are accepted in state zero. This example was previously considered in Haviv and Puterman [5], where it was shown algebraically that bias distinguishes between gain optimal policies. For this model, the average optimality equations are

h(s) = max{ λR − g − f(s) + λh(s + 1) + µh((s − 1)^+),
 −g − f(s) + λh(s) + µh(s − 1) }.   (30)

Consider the set of control limit policies that accept customers until the number of customers in the system reaches some control limit L > 0 and reject customers for all s ≥ L; denote by L the stationary policy that uses control limit L. It is known that there exists a Blackwell optimal policy within this set. The following lemma asserts that it is better to start with fewer customers in the system. We will use this result in the sample path arguments to follow.
Lemma 1 Suppose (g, h) satisfies the optimality equations. For s ∈ S, Δh(s) < 0.

Proof. We show the assertion by induction. Suppose L is gain optimal and π^L is its stationary distribution. If g is the optimal gain, we have

g = Σ_{i=0}^{L−1} π^L(i)[λR − f(i)] + π^L(L)[−f(L)].   (31)

Since each element of π^L is strictly less than one, a little algebra yields

g − λR + f(s) < −Σ_{i∈{0,...,L}\{s}} π^L(i) f(i)   (32)
 < 0.   (33)

In state zero the gain optimality equations yield

λΔh(0) = g − λR < 0.   (34)

Assume the assertion is true for s − 1. If it is optimal to reject in state s, the gain optimality equations yield that R + Δh(s) ≤ 0 and the result is trivially satisfied. Thus, assume that
accept is optimal in state s. Hence, we have

λΔh(s) = g − λR + f(s) + µΔh(s − 1).   (35)

[Figure 1: We would like to compare control limits L and L + 1. The figure shows a sample path over time near states L and L + 1, with x marking departures and o marking arrivals.]
Applying (32) and the induction hypothesis yields the result. □

Haviv and Puterman [5] show that if there are two gain optimal control limits, only the higher one is bias optimal. Let L and L + 1 be gain optimal control limits. Comparing the gain optimality equations for the two control limits along a common sample path shows that c_L < c_{L+1}, and the bias of control limit L + 1 is larger than that of L.

The previous example shows that, by an astute choice of the reference state, a simple sample path argument can be used to show the usefulness of bias in distinguishing among gain optimal policies. This analysis begs the question: why is the higher control limit preferred? In essence, the choice the decision-maker must make is whether to add more waiting space. If optimal gain is the primary objective, it is clear that if adding this space reduces the gain, it should not be added. On the other hand, if adding the waiting space does not change the gain but decreases the average cost as measured by h^{rv(α)}, the decision-maker would prefer to add
the space. The question of why the relative value functions measure cost remains open. However, we can make the observation that, with L as the reference state, Lemma 1 implies that h^{rv(L)}(s) < 0 for s > L while h^{rv(L)}(s) > 0 for s < L. Thus, the average cost is decreased by time spent with more than L customers in the system. Hence, the bias-based
decision-maker prefers negative relative value functions. In the next section we discuss how bias implicitly discounts rewards received late in the cycle using the relative value functions.
6 Bias and Implicit Discounting

Neither the interpretation of the bias as the total excess reward before reaching stationarity nor as the average reward over a cycle gives a complete picture. If either were so, one might conjecture that a decision-maker using bias as the optimality criterion would be indifferent to when in the cycle rewards were received. After all, in most total reward models this is the case. Suppose we consider Example 2, except that rewards are received upon service completion instead of upon acceptance. Using the bias optimality equation (22), [9] showed that if there are two gain optimal control limits, it is in fact the lower control limit that is bias optimal. Thus, by changing when rewards are received, we have changed which control limit is preferred. This result can easily be obtained using a sample path argument similar to Example 2.
Example 3 Consider the M/M/1/k queueing system of Example 2, except that the rewards are received upon service completion instead of upon acceptance. Assume L and L + 1 are gain optimal control limits and consider the gain optimality equations, again with reference state L. From the equation in state zero we know that Δh^{rv(L)}(0) > 0. Thus, since h^{rv(L)}(1) ≤ h^{rv(L)}(L) = 0, we have h^{rv(L)}(0) < 0 and h^{rv(L)}(s) ≤ 0 for s < L. When a departure is the first event starting in state L, the average cost accrued before returning to L must be negative: a reward. Applying our previous argument, the average cost accrued during a cycle would in fact be increased by adding another waiting space.

It is interesting to note that in Example 2, h^{rv(L)}(s) > 0 when s < L, while h^{rv(L)}(s) ≤ 0 for s ≥ L. In the present example these inequalities are reversed. The bias-based decision-maker prefers the negative relative values. Thus, while viewing bias through an average cost problem is correct, it is not a standard average cost problem. This analysis does, however, allow us to interpret why bias prefers control limit L or L + 1. In Example 2, rewards are received at arrivals; thus, the reward is received before the cost of having the customer in the system is accrued. On the other hand, in Example 3, since rewards are received at service completions, the decision-maker must accrue the cost of having a customer in the system before receiving the reward. Thus, the decision-maker only chooses to increase the amount of waiting space if the reward is received before the cost. Furthermore, since the optimal policies are not the same for both problems, it is clear that the bias-based decision-maker differentiates between receiving rewards on entrance or
exit. The following result provides supporting evidence for this assertion.
Theorem 1 Suppose that α is a positive recurrent state for a fixed policy d^∞ ∈ D^∞. Further suppose that h_d^{rv(α)} is the relative value function of d^∞ with h_d^{rv(α)}(α) = 0. Let c_d be the bias constant associated with h_d^{rv(α)}. Then

c_d = − E_α^d [ Σ_{n=0}^{τ−1} (n + 1)(r(X_n) − g) ] / E_α^d [τ].   (40)
Proof. From Proposition 1, it suffices to show that

P*_d h_d^{rv(α)}(α) = E_α^d [ Σ_{n=0}^{τ−1} (n + 1)(r(X_n) − g) ] / E_α^d [τ].

Recall,

P*_d h_d^{rv(α)}(α) = E_α^d [ Σ_{t=0}^{τ−1} h_d^{rv(α)}(X_t) ] / E_α^d [τ].

Consider the numerator:

E_α^d [ Σ_{t=0}^{τ−1} h_d^{rv(α)}(X_t) ] = E_α^d [ Σ_{t=0}^{τ−1} E_{X_t}^d ( Σ_{n=0}^{τ−1} (r(X_n) − g) ) ]
 = E_α^d [ Σ_{t=0}^{τ−1} E^d ( Σ_{n=t}^{τ−1} (r(X_n) − g) | X_t ) ],   (41)

where the second equality follows from the time-homogeneity of the process. Conditioning on the first passage time, given that the initial state is α, we get

E_α^d [ Σ_{t=0}^{τ−1} h_d^{rv(α)}(X_t) ]
 = Σ_{k=1}^∞ E_α^d [ Σ_{t=0}^{k−1} E^d ( Σ_{n=t}^{τ−1} (r(X_n) − g) | X_t ) | τ = k ] P(τ = k)
 = Σ_{k=1}^∞ E_α^d [ Σ_{t=0}^{k−1} Σ_{n=t}^{k−1} (r(X_n) − g) | τ = k ] P(τ = k)
 = E_α^d [ Σ_{t=0}^{τ−1} Σ_{n=t}^{τ−1} (r(X_n) − g) ]
 = E_α^d [ Σ_{n=0}^{τ−1} Σ_{t=0}^{n} (r(X_n) − g) ]
 = E_α^d [ Σ_{n=0}^{τ−1} (n + 1)(r(X_n) − g) ].

Substituting this into (14) yields the result. □

Since we are maximizing this term, this shows that excess rewards received later in the cycle are "worth less" than those received earlier. The factor n + 1 in the previous proof was mentioned in Meyn [13]; however, to our knowledge, this is the first time that it has been used to explain the implicit discounting captured by the bias. Consider the following deterministic example.
Example 4 Suppose S = {0, 1, 2, 3}, A_0 = {a, b, ab}, A_1 = {a}, A_2 = {b}, A_3 = {ab}, r(0, a) = r(2, b) = 1, r(0, b) = r(1, a) = −1, r(0, ab) = r(3, ab) = 0, and p(1|0, a) = p(2|0, b) = p(3|1, a) = p(3|2, b) = p(3|0, ab) = p(0|3, ab) = 1. Let δ be the decision rule that chooses action a in state zero, γ the one that chooses ab, and ψ that which chooses action b. Clearly, this model is unichain and g^δ = g^γ = g^ψ = 0. It is also easy to see that the bias constant c^γ of γ^∞ must be zero. Suppose we choose {0} as the reference state (so h^{rv(0)}(0) = 0). By examination of Figure 2 we have h^{rv(0)}(1) = −1, h^{rv(0)}(2) = 1, and h^{rv(0)}(3) = 0. The stationary distributions π^d are π^δ = {1/3, 1/3, 0, 1/3} and π^ψ = {1/3, 0, 1/3, 1/3}. Thus, the bias constants are −π^δ h^{rv(0)} = 1/3 and −π^ψ h^{rv(0)} = −1/3. By Proposition 1, δ^∞ is bias optimal.

[Figure 2: A deterministic example with average reward 0. Quantities in parentheses denote actions and rewards, respectively: 0 →(a, 1)→ 1 →(a, −1)→ 3 under δ; 0 →(b, −1)→ 2 →(b, 1)→ 3 under ψ; and 0 →(ab, 0)→ 3 →(ab, 0)→ 0.]
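Both bias constants can be checked numerically, once via Proposition 1 and once via the (n + 1)-weighted cycle sum of Theorem 1 (the state and reward sequences below follow the deterministic cycles of δ and ψ):

```python
# Example 4's bias constants computed two ways: via Proposition 1
# (c_d = -pi_d h^{rv(0)}) and via Theorem 1's (n+1)-weighted cycle sum.
g = 0.0
h_rv = [0.0, -1.0, 1.0, 0.0]                        # reference state 0
cycles = {'delta': ([0, 1, 3], [1.0, -1.0, 0.0]),   # 0 -> 1 -> 3 -> 0
          'psi':   ([0, 2, 3], [-1.0, 1.0, 0.0])}   # 0 -> 2 -> 3 -> 0

consts = {}
for name, (states, rewards) in cycles.items():
    tau = len(states)                               # deterministic cycle length
    c_prop1 = -sum(h_rv[s] for s in states) / tau
    c_thm1 = -sum((n + 1) * (rn - g) for n, rn in enumerate(rewards)) / tau
    assert abs(c_prop1 - c_thm1) < 1e-12            # the two formulas agree
    consts[name] = c_prop1
# delta (reward before cost) beats psi (cost before reward).
```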
Alternatively, we can use Theorem 1 to compute the constant:

c^δ = − ( {[r(0, a) − g] + [r(1, a) − g] + [r(3, ab) − g]} + {[r(1, a) − g] + [r(3, ab) − g]} + {[r(3, ab) − g]} ) / 3
 = − ( [r(0, a) − g] + 2[r(1, a) − g] + 3[r(3, ab) − g] ) / 3
 = −(1 + 2(−1) + 3(0)) / 3 = 1/3.

Similarly,

c^ψ = −(−1 + 2(1) + 3(0)) / 3 = −1/3.

Notice that when the excess reward is received earlier it is worth more (−1 compared to −2), and when the cost is received earlier it is more costly than when it is received later (1 compared to 2 · 1). In this example we have an explicit cost that must be accrued, as opposed to the implicit holding cost of the previous example. The bias-based decision-maker, however, must make a similar decision. If we compare δ to ψ, the decision-maker chooses to receive the immediate reward and accrue the later cost. If we compare γ to ψ, then the decision-maker prefers not to accrue the immediate cost, despite the fact that there is a reward to be received later. Precisely the same logic can be applied to the prior queueing example: since the reward received later is discounted, the decision-maker chooses not to accept the arriving customer. Next we show that the sample path arguments and implicit discounting discussed in Proposition 1 and Theorem 1 are useful in other examples. We consider a simplification of the model discussed in Lewis et al. [8].
Example 5 Consider an M/M/1/1 queueing system with two arriving customer classes. Upon arrival each customer offers a reward that is received upon acceptance to the system.

Figure 3: An example of the two class, one-server model. Note that λ* is either λ_1 + λ_2 or λ_1, depending on whether class 2 is accepted or rejected, respectively.

Accepted customers are served with rate μ regardless of their class. Assume that class i customers offer reward r_i and arrive with rate λ_i, i = 1, 2, and that r_1 > r_2. Since class one customers offer a higher reward they are always accepted, and decisions only need to be made when class two customers arrive. Rejected customers are lost. Since customers are served at the same rate regardless of their class, the state space need not include the class of the customer that is being served. However, as we will also be interested in this model when the customers pay after service, we include this information in the state space. Hence, the state of the system is the class of customer being served, where state zero corresponds to an empty system (see Figure 3). Assume that the uniformization constant λ_1 + λ_2 + μ = 1. Thus, we have
the following optimality equations:

h(0) = max{λ_1 r_1 + λ_2 r_2 - g + λ_1 h(1) + λ_2 h(2) + (1 - (λ_1 + λ_2)) h(0),
           λ_1 r_1 - g + λ_1 h(1) + (1 - λ_1) h(0)}                          (42)
h(1) = -g + μ h(0) + (1 - μ) h(1)
h(2) = -g + μ h(0) + (1 - μ) h(2).

If we let the reference state be state zero, so that h_rv(0) = 0, a little algebra yields h_rv(1) = h_rv(2) = -g/μ. Let ρ_i = λ_i/μ. Consider the policy that accepts both classes.
A little computation shows that its stationary distribution, π^1, is

π^1(0) = 1/(1 + ρ_1 + ρ_2),
π^1(1) = ρ_1/(1 + ρ_1 + ρ_2),
π^1(2) = ρ_2/(1 + ρ_1 + ρ_2).

Hence,

π^1 h_rv = -g((ρ_1 + ρ_2)/μ)/(1 + ρ_1 + ρ_2).
Similarly, denote the stationary distribution of the policy that only accepts class 1 customers by π^2. Computing the bias constant we have

π^2 h_rv = -g(ρ_1/μ)/(1 + ρ_1).
To compare the two, consider

π^1 h_rv / π^2 h_rv = [-g((ρ_1 + ρ_2)/μ)/(1 + ρ_1 + ρ_2)] / [-g(ρ_1/μ)/(1 + ρ_1)]
                    = (1 + ρ_1)(ρ_1 + ρ_2)/(ρ_1(1 + ρ_1 + ρ_2))
                    > 1.

It is bias optimal to accept both classes. On the other hand, if payment is not received until service is completed, the AOE are
h(0) = max{-g + λ_1 h(1) + λ_2 h(2) + (1 - (λ_1 + λ_2)) h(0),
           -g + λ_1 h(1) + (1 - λ_1) h(0)}                                   (43)
h(1) = μ r_1 - g + μ h(0) + (1 - μ) h(1)
h(2) = μ r_2 - g + μ h(0) + (1 - μ) h(2).

Notice that if h(2) > h(0) it is optimal to accept class 2 customers; if h(2) < h(0) it is optimal to reject them. At equality, a gain optimizing decision-maker is indifferent. Again, letting zero be the reference state and assuming that it is optimal both to accept and to reject class 2 customers, we have h_rv(2) = h_rv(0) = 0. Further, from the gain optimality equations, h_rv(i) = r_i - g/μ, i = 1, 2. To compare the bias of the two policies, consider
π^1 h_rv / π^2 h_rv = {[(r_1 - g/μ)ρ_1 + (r_2 - g/μ)ρ_2]/(1 + ρ_1 + ρ_2)} / {(r_1 - g/μ)ρ_1/(1 + ρ_1)}
                    = (1 + ρ_1)/(1 + ρ_1 + ρ_2)
                    < 1,

where the second equality uses r_2 - g/μ = h_rv(2) = 0. Hence, it is bias optimal to reject class 2 customers.
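Both ratios above can be checked numerically. The sketch below is ours; the rates μ, λ_1, λ_2 and the reward r_1 are illustrative choices satisfying the uniformization condition, and in the pay-on-exit case g is chosen so that both policies are gain optimal (the indifference condition h_rv(2) = 0).

```python
mu, lam1, lam2 = 0.5, 0.3, 0.2            # uniformized rates: mu + lam1 + lam2 = 1
rho1, rho2 = lam1 / mu, lam2 / mu

# Unnormalized stationary weights: pi^1 accepts both classes, pi^2 only class 1.
pi1 = {0: 1.0, 1: rho1, 2: rho2}
pi2 = {0: 1.0, 1: rho1, 2: 0.0}

def weighted_avg(pi, h):
    """Stationary expectation of the relative value function h."""
    z = sum(pi.values())
    return sum(pi[s] * h[s] for s in h) / z

# Pay on entrance: h_rv(1) = h_rv(2) = -g/mu (any g > 0 works; it cancels).
g = 1.5
h_in = {0: 0.0, 1: -g / mu, 2: -g / mu}
ratio_in = weighted_avg(pi1, h_in) / weighted_avg(pi2, h_in)
assert abs(ratio_in - (1 + rho1) * (rho1 + rho2) / (rho1 * (1 + rho1 + rho2))) < 1e-9
assert ratio_in > 1

# Pay on exit: indifference forces h_rv(2) = r2 - g/mu = 0, i.e. g = mu * r2,
# and g = mu * r1 * rho1 / (1 + rho1) makes both policies gain optimal.
r1 = 8.0
g = mu * r1 * rho1 / (1 + rho1)
r2 = g / mu
h_out = {0: 0.0, 1: r1 - g / mu, 2: r2 - g / mu}
gain1 = mu * (rho1 * r1 + rho2 * r2) / (1 + rho1 + rho2)   # accept both
gain2 = mu * rho1 * r1 / (1 + rho1)                        # accept class 1 only
assert abs(gain1 - g) < 1e-9 and abs(gain2 - g) < 1e-9     # both gain optimal
ratio_out = weighted_avg(pi1, h_out) / weighted_avg(pi2, h_out)
assert abs(ratio_out - (1 + rho1) / (1 + rho1 + rho2)) < 1e-9
assert ratio_out < 1
```

With these rates the entrance ratio is 4/3 and the exit ratio is 4/5, matching the two closed forms.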
Notice that in this example, upon substituting the appropriate values of h_rv(1) and h_rv(2) into the state zero equation, the gain optimality equations for pay on entrance and pay on exit are precisely the same. Thus, the gain is the same for both models. However, when paying on entrance a typical reward stream might look like {r_1, 0, 0, ..., 0}, whereas when paying on exit the stream is {0, 0, 0, ..., 0, r_1}. When the customer must pay on entrance, the reward is received first, while when they pay on exit, the cost of having customers in the system is accrued first. Using Proposition 1, the excess reward streams become cost streams and might look like {r_1 - g, -g, -g, ..., -g} and {-g, -g, -g, ..., r_1 - g} when paying on entrance or exit, respectively. Again the excess reward is discounted (remember we want the maximum) so that it is not worth as much if received at the end of the cycle. This makes explicit the fact that the bias also captures the desirable properties of discounting.
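The effect is easy to illustrate numerically (a sketch of ours, with an arbitrary cycle length n and r_1 = n·g so that each cycle's excess rewards sum to zero): the average running sum of excess reward, the quantity the bias constant averages in Proposition 1, is strictly larger when the reward arrives at the start of the cycle.

```python
from itertools import accumulate

# Excess-reward streams over one cycle of length n:
# pay on entrance: r1 - g, -g, ..., -g;  pay on exit: -g, ..., -g, r1 - g.
n, g = 4, 1.0
r1 = n * g                                    # each cycle sums to zero
entrance = [r1 - g] + [-g] * (n - 1)
exit_stream = [-g] * (n - 1) + [r1 - g]

avg_entrance = sum(accumulate(entrance)) / n  # average running sum
avg_exit = sum(accumulate(exit_stream)) / n
assert avg_entrance > avg_exit                # the early reward is worth more
print(avg_entrance, avg_exit)                 # 1.5 -1.5
```

Every running sum of the entrance stream dominates the corresponding running sum of the exit stream, which is the sample path version of the implicit discounting discussed above.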
7 Conclusions

We have presented a probabilistic approach to interpreting bias optimality. This leads to simple sample path arguments for results that previously required more algebra and offered no intuition for why the bias optimizing decision-maker would prefer a particular policy. Furthermore, this probabilistic analysis leads to an explanation of the implicit discounting in bias. It is important to note that the discounting captured by bias is only valid for recurrent states. In fact, it is easy to construct examples in which the bias is indifferent to receiving reward earlier or later in transient states (see Veinott [18]). To capture this one must turn to more sensitive optimality criteria. Finally, we have restricted our attention to finite state, finite action space models. The authors hope that this paper makes apparent the need to develop these ideas on more general spaces and for multichain Markov decision processes.
8 Acknowledgements

We would like to thank Enrique Lemus for some preliminary discussions on bias optimality.
References

[1] A. Arapostathis, V. S. Borkar, E. Fernandez-Gaucherand, M. K. Ghosh, and S. I. Marcus. Discrete-time controlled Markov processes with average cost criterion: A survey. SIAM Journal on Control and Optimization, 31(2):282–344, March 1993.
[2] D. Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33:719–726, 1962.

[3] E. V. Denardo. Computing a bias optimal policy in a discrete-time Markov decision problem. Operations Research, 18:279–289, 1970.

[4] C. Derman and A. F. Veinott, Jr. A solution to a countable system of equations arising in Markovian decision processes. Annals of Mathematical Statistics, 38(2), 1967.

[5] M. Haviv and M. L. Puterman. Bias optimality in controlled queueing systems. Journal of Applied Probability, 35:136–150, 1998.
[6] O. Hernandez-Lerma and J. B. Lasserre. Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer-Verlag Inc., New York, 1996.
[7] O. Hernandez-Lerma and J. B. Lasserre. Policy iteration in average cost Markov control processes on Borel spaces. Acta Applicandae Mathematicae, 47:125–154, 1997.

[8] M. E. Lewis, H. Ayhan, and R. D. Foley. Bias optimality in a queue with admission control. Probability in the Engineering and Informational Sciences, 13:309–327, 1999.
[9] M. E. Lewis and M. L. Puterman. A note on bias optimality in controlled queueing systems. Journal of Applied Probability, 37(1), 2000. To appear.

[10] S. A. Lippman. Applying a new device in the optimization of exponential queueing systems. Operations Research, 23(4):687–712, 1975.

[11] A. M. Makowski and A. Shwartz. On the Poisson equation for Markov chains: Existence of solutions and parameter dependence by probabilistic methods. Technical report, Technion–Israel Institute of Technology, September 1994.

[12] E. Mann. Optimality equations and sensitive optimality in bounded Markov decision processes. Optimization, 16(5):767–781, 1985.

[13] S. Meyn. The policy iteration algorithm for average reward Markov decision processes with general state space. IEEE Transactions on Automatic Control, 42:1663–1680, December 1997.

[14] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, Inc., New York, 1994.
[15] S. Resnick. Adventures in Stochastic Processes. Birkhauser, Boston, 1992.

[16] P. Schweitzer and A. Federgruen. The functional equations of undiscounted Markov renewal programming. Mathematics of Operations Research, 3:308–321, 1977.

[17] A. F. Veinott, Jr. Discrete dynamic programming. Annals of Mathematical Statistics, 40(5):1635–1660, October 1969.

[18] A. F. Veinott, Jr. Markov decision chains. In Studies in Optimization, volume 10 of Studies in Mathematics, pages 124–159. Mathematics Association of America, 1974.