52nd IEEE Conference on Decision and Control, December 10-13, 2013, Florence, Italy
Analysis of Best-Reply Strategies in Repeated Finite Markov Chains Games

Julio Clempner and Alexander Poznyak
J. Clempner is with the High School of Physics and Mathematics of the National Polytechnic Institute (IPN), Mexico D.F. A. Poznyak is with the Automatic Control Department, CINVESTAV, Mexico D.F.

Abstract— The "best-reply strategy" is a natural and the most commonly applied type of action that players prefer to use during a repeated game. Usually, the behavior of an individual cost-function under such strategies turns out to be non-monotonic and, as a result, concluding that these strategies lead to some equilibrium point is a non-trivial and delicate task. Moreover, even in repeated games, convergence to a stationary equilibrium is not always guaranteed. Here we show that in the ergodic class of finite controllable Markov chains dynamic games the best-reply actions necessarily lead to one of the Nash equilibrium points. This conclusion is drawn by means of the Lyapunov games concept, which is based on the design of an individual Lyapunov function (related to an individual cost function) that monotonically decreases (does not increase) during the game. The suggested approach is illustrated by the repeated asynchronous "Prisoner's Dilemma" game with best-reply actions.

I. INTRODUCTION

The "best-reply strategy" is a natural and the most commonly applied type of action that players prefer to use during a repeated game. In fact, such an action realizes a local (one-step) predicted optimization assuming that the past history (states and actions) cannot be changed anymore. The behavior of an individual cost-function under such strategies turns out to be non-monotonic and, as a result, concluding that these strategies lead to some equilibrium point (usually, the Nash equilibrium [4]) is a hard task requiring a special additional analysis. Moreover, even in repeated games, convergence to a stationary equilibrium is not always guaranteed (see [1], [3]).

Here we show that in the ergodic class of finite controllable Markov chains dynamic games the best-reply actions necessarily lead to one of the Nash equilibrium points. This conclusion is drawn by means of the Lyapunov games concept, which is based on the design of an individual Lyapunov function (related to an individual cost function) that monotonically decreases (does not increase) during the game. In Lyapunov games [2] the existence of an equilibrium point is ensured by definition, and convergence to an equilibrium point is also guaranteed. A Lyapunov-like function monotonically decreases and converges to a Lyapunov equilibrium point, tracking the state-space in a forward direction. The best-reply dynamics result in a natural implementation of the behavior of a Lyapunov-like function. As a result, a Lyapunov game also has the benefit that it is common knowledge among the players that only best replies are chosen. A Lyapunov equilibrium point presents properties of stability that are not necessarily present in a Nash equilibrium point.

In general, a game is said to be stable with respect to a set of strategies if the iterated process of strategy selection (in our case, the best-reply dynamics) converges to an equilibrium point, regardless of the initial strategies the players start with. To converge to an equilibrium point, every player selects his strategies by optimizing his individual cost function given the available strategies of the other players [6]. Any deviation from such an equilibrium point returns back to the same equilibrium point, because the natural evolution of the iterated process of strategy selection follows the optimal strategies and rectifies the trajectory toward a stable equilibrium point (this is the case when the equilibrium point is unique). In this sense, a Lyapunov equilibrium point is a strategy profile such that, once in the stable state of the strategy choices, it is in no player's interest to unilaterally change strategy.

An important advantage of Lyapunov games is that every ergodic system can be represented by a Lyapunov-like function. For a repeated (ergodic) game, a recursive mechanism is implemented to justify an equilibrium play [2]. If the ergodic process of the stochastic game converges, then we have reached an equilibrium point and, moreover, a highly justifiable one [7].

We present a method for the construction of a Lyapunov-like function (with a monotonic behavior) that has a one-to-one relationship with a given cost-function. Being bounded from below, a decreasing Lyapunov-like function provides the existence of an equilibrium point for the applied local-optimal policy and, besides, ensures the convergence of the cost-function to a minimal value [2]. The resulting vector Lyapunov-like function is a monotonic function whose components can only decrease over time. As a result, a repeated game may be represented by a one-shot game. It is important to note that in our case the justification becomes more delicate, because repeated games are transformed into one-shot games by replacing the recursive mechanism with a Lyapunov-like function. The suggested approach is illustrated by the repeated asynchronous "Prisoner's Dilemma" game with best-reply actions.

The main contribution of this paper consists of the following points: 1) we show that the behavior of the cost sequence corresponding to the local-optimal (best-reply) strategy has a non-monotonic character that does not permit one to prove directly the existence of a limit point; 2) we suggest a "one-to-one" mapping between the current cost-function and a new "energy function" (Lyapunov-like function) which is monotonically non-increasing on the trajectories of the system under the local-optimal (best-reply) strategy; 3) we show that a Lyapunov equilibrium point is a Nash equilibrium point (the inverse is not true), and in addition it presents several advantages: a) the existence of the equilibrium point is ensured by definition, b) a Lyapunov-like function can be constructed to respect the constraints imposed by the game, c) a Lyapunov-like function definitely converges to a Lyapunov equilibrium point, and d) a Lyapunov equilibrium point presents properties of stability; 4) the convergence of the local-optimal (best-reply) strategy is obtained for a class of ergodic controllable finite Markov chains; and 5) we provide an analytical formula for the numerical realization of the local-optimal (best-reply) strategy.
II. PRELIMINARIES

As usual, let the set of real numbers be denoted by $\mathbb{R}$ and the set of non-negative integers by $\mathbb{N}$. The inner product of two vectors $u,v \in \mathbb{R}^{n}$ is denoted by $\langle u,v\rangle = v^{\top}u$. Let $S$ be a finite set, called the state space, consisting of $N \in \mathbb{N}$ states $\{s(1),\ldots,s(N)\}$. A stationary Markov chain [5] is a sequence of $S$-valued random variables $s_n$, $n \in \mathbb{N}$, satisfying the Markov condition:
\[
P(s_{n+1}=s(j)\mid s_n=s(i),\ s_{n-1}=s(i_{n-1}),\ldots,s_1=s(i_1)) = P(s_{n+1}=s(j)\mid s_n=s(i)) = \pi(ij) \qquad (1)
\]
The Markov chain can be represented by a complete graph whose nodes are the states and where each edge $(s(i),s(j)) \in S^{2}$ is labeled by the transition probability (1). The matrix $\Pi = (\pi(ij))_{(s(i),s(j))\in S^{2}} \in [0,1]^{S\times S}$ determines the evolution of the chain: for each $k \in \mathbb{N}$, the power $\Pi^{k}$ has in each entry $(s(i),s(j))$ the probability of going from state $s(i)$ to state $s(j)$ in exactly $k$ steps.

Definition 1: A Markov Decision Process is a 5-tuple
\[
MDP = \{S, A, \mathbb{K}, \Pi, V\} \qquad (2)
\]
where:
1) $S$ is a finite set of states, $S \subset \mathbb{N}$, endowed with the discrete topology;
2) $A$ is the set of actions, which is a metric space. For each $s \in S$, $A(s) \subseteq A$ is the non-empty set of admissible actions at state $s \in S$. Without loss of generality we may take $A = \bigcup_{s\in S} A(s)$;
3) $\mathbb{K} = \{(s,a)\mid s \in S,\ a \in A(s)\}$ is the set of admissible state-action pairs, which is a measurable subset of $S \times A$;
4) $\Pi(k) = [\pi(ij|k)]$ is a stationary controlled transition matrix, where
\[
\pi(ij|k) \equiv P(s_{n+1}=s(j)\mid s_n=s(i),\ a_n=a(k))
\]
is the probability associated with the transition from state $s(i)$ to state $s(j)$ under an action $a(k) \in A(s(i))$, $k = 1,\ldots,M$;
5) $V : S \to \mathbb{R}$ is a cost function, associating to each state a real value.

The Markov property of the decision process in (2) is said to be fulfilled if
\[
P(s_{n+1}\mid (s_1,s_2,\ldots,s_{n-1}),\ s_n=s(i),\ a_n=a(k)) = P(s_{n+1}\mid s_n=s(i),\ a_n=a(k))
\]
The strategy (policy) $d_n(k|i) \equiv P(a(k)\mid s_n=s(i))$ represents the probability measure associated with the occurrence of an action $a_n$ from state $s_n=s(i)$. The elements of the transition matrix of the controllable Markov chain can then be expressed as
\[
P\{s_{n+1}=s(j)\mid s_n=s(i)\} = \sum_{k=1}^{M} P\{s_{n+1}=s(j)\mid s_n=s(i),\ a_n=a(k)\}\, d_n(k|i)
\]
We use the notation $D = \prod_{j\in\mathcal{N}} D^{j}$ for the mixed-strategy profile and $D^{-\ell} = \prod_{j\in\mathcal{N}\setminus\{\ell\}} D^{j}$ for the mixed-strategy profile of all the players except player $\ell$. Let us denote the collection $\{d_n(k|i)\}$ by $\Delta_n$ as follows:
\[
\Delta_n = \{d_n(k|i)\}_{k=\overline{1,M},\; i=\overline{1,N}}
\]
In this paper we investigate the class of the so-called local-optimal policies (strategies) defined below.

Definition 2: A policy $\{d_n^{loc}\}_{n\geq 0}$ is said to be local-optimal (or best-reply) if for each $n \geq 0$ it minimizes the conditional mathematical expectation of the cost-function $V(s_{n+1})$ under the condition that the prehistory of the process
\[
\mathcal{F}_n := \left\{\Delta_0,\ \{P\{s_0=s(j)\}\}_{j=\overline{1,N}};\ \ldots;\ \Delta_{n-1},\ \{P\{s_n=s(j)\}\}_{j=\overline{1,N}}\right\}
\]
is fixed and cannot be changed hereafter, i.e., it realizes the "one-step ahead" conditional optimization rule
\[
d_n^{loc} := \arg\min_{d_n\in D}\ \mathbb{E}\{V(s_{n+1})\mid \mathcal{F}_n\} \qquad (3)
\]
where $V(s_{n+1})$ is the cost function of the player at the state $s_{n+1}$.

Remark 1: A locally optimal policy is known as a "myopic" policy in the games literature.

A non-cooperative stochastic game is a tuple
\[
\mathcal{G} = \left\langle \mathcal{N},\ S,\ (A^{\ell})_{\ell\in\mathcal{N}},\ (\mathbb{K}^{\ell})_{\ell\in\mathcal{N}},\ \Pi,\ (V^{\ell})_{\ell\in\mathcal{N}} \right\rangle
\]
For a strategy tuple $d = (d^{1},\ldots,d^{|\mathcal{N}|}) \in D$ we denote the complement strategy $d^{-\ell} = (d^{1},\ldots,d^{\ell-1},d^{\ell+1},\ldots,d^{|\mathcal{N}|})$ and, with an abuse of notation, $d = (d^{\ell}, d^{-\ell})$. The state $d = (d^{1},\ldots,d^{|\mathcal{N}|})$ represents the distribution vector of strategy frequencies and can only move on $D$. Let us denote by $V_{n+1}^{d}$ the vector average cost function at the state $s_{n+1}$ and time $(n+1)$ under the fixed strategy $d_n = d$, that is,
\[
V_{n+1}^{d} := \left(V_{n+1}^{d,1},\ldots,V_{n+1}^{d,|\mathcal{N}|}\right)
\]
where $V_{n+1}^{d,\ell}$ is the average cost function at the state $s_{n+1}$ and time $(n+1)$ for the player $\ell$, namely,
\[
V_{n+1}^{d,\ell} := \mathbb{E}\left(V^{\ell}(s_{n+1})\mid d_n\right)
\]
Here $V^{\ell}(s_{n+1})$ is the cost function of the $\ell$-th player at the state $s_{n+1}$, and $\mathbb{E}(\cdot\mid d_n)$ is the operator of conditional mathematical expectation subject to the constraint that at time $n$ the mixed strategy $d_n$ has been applied.

A Lyapunov game is a tuple
\[
\mathcal{G} = \left\langle \mathcal{N},\ S,\ (A^{\ell})_{\ell\in\mathcal{N}},\ (\mathbb{K}^{\ell})_{\ell\in\mathcal{N}},\ \Pi,\ (V^{\ell})_{\ell\in\mathcal{N}} \right\rangle
\]
where $V^{\ell}$ is a Lyapunov-like function (monotonically decreasing in time).
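As a concrete illustration of the controllable transition law above, the following Python sketch (hypothetical helper names, not part of the paper) builds the transition matrix induced by a mixed strategy, $P\{s_{n+1}=s(j)\mid s_n=s(i)\} = \sum_{k}\pi(ij|k)\,d_n(k|i)$:

```python
import numpy as np

def induced_transition_matrix(pi, d):
    """pi[k] is the N x N matrix [pi(ij|k)] for action a(k), k = 0..M-1.
    d[i, k] = d(k|i) is the probability of action a(k) in state s(i).
    Returns the N x N matrix P with P[i, j] = sum_k pi(ij|k) d(k|i)."""
    M, N, _ = pi.shape
    P = np.zeros((N, N))
    for k in range(M):
        # each row i of pi[k] is weighted by the probability d(k|i)
        P += d[:, k][:, None] * pi[k]
    return P

# toy example: N = 2 states, M = 2 actions, uniform mixed strategy
pi = np.array([[[0.1, 0.9], [0.8, 0.2]],
               [[0.6, 0.4], [0.3, 0.7]]])
d = np.full((2, 2), 0.5)
print(induced_transition_matrix(pi, d))  # each row sums to 1
```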
III. PROBLEM FORMULATION

To tackle this problem we propose representing the state-value function $V$ by a model that is linear with respect to the control $d \in D$. We then obtain the policy $d$ that yields the minimal trajectory value and, finally, present $V$ in a recursive matrix format.

A. The State-Value Function

The probability for the player $\ell \in \mathcal{N}$ in the game $\mathcal{G}$ to find itself in the next state is
\[
P(s_{n+1}=s(j)\mid s_n=s(i)) = \sum_{k=1}^{M} P(s_{n+1}=s(j)\mid s_n=s(i),\ a_n=a(k))\, d_n(k|i) = \sum_{k=1}^{M}\pi(ij|k)\, d_n(k|i)
\]
The cost function $V$ of any fixed policy $d$ is defined over all possible combinations of states and actions and indicates the expected value when taking action $a$ in state $s$ and following policy $d$ thereafter. The $V$-values for all the states of (2) can be expressed by
\[
\mathbb{E}_{d}(V(s_{n+1})) := \sum_{j=1}^{N}\sum_{i=1}^{N}\sum_{k=1}^{M} V(s_{n+1}=s(j)\mid s_n=s(i),\ a_n=a(k))\,\pi(ij|k)\,d_n(k|i)\,P(s_n=s(i)) \qquad (4)
\]
where $V(s_{n+1}=s(j)\mid s_n=s(i),\ a_n=a(k))$ is a constant at state $s(i)$ when the action $a(k)$ is applied (without loss of generality it can be assumed to be positive), and $P(s_n)$ for any given $P(s_0)$ is defined as follows:
\[
P(s_{n+1}=s(j)) = \sum_{i=1}^{N} P(s_{n+1}=s(j)\mid s_n=s(i))\,P(s_n=s(i)) = \sum_{i=1}^{N}\left[\sum_{k=1}^{M}\pi(ij|k)\,d_n(k|i)\right]P(s_n=s(i))
\]
or, in matrix format,
\[
p_{n+1} = (\Pi_n)^{\top} p_n, \qquad (\Pi_n)_{ij} := \sum_{k=1}^{M}\pi(ij|k)\,d_n(k|i)
\]
where $p_n$ is the state-distribution vector with $(p_n)_i := P(s_n=s(i))$.

In vector format, the formula (4) can be expressed as
\[
V_{n+1} := \mathbb{E}_{d}(V(s_{n+1})) = \sum_{i=1}^{N}\left[\sum_{j=1}^{N}\sum_{k=1}^{M} V(s_{n+1}=s(j)\mid s_n=s(i),\ a_n=a(k))\,\pi(ij|k)\,d_n(k|i)\right]P(s_n=s(i)) = \langle w_n, p_n\rangle \qquad (5)
\]
where
\[
(w_n)_i := \sum_{k=1}^{M}\left[\sum_{j=1}^{N} V_{ij|k}\,\pi(ij|k)\right] d_n(k|i), \qquad V_{ij|k} := V(s_{n+1}=s(j)\mid s_n=s(i),\ a_n=a(k))
\]
Remark 2: We will assume hereafter that $V(s_{n+1}=s(i)\mid s_n=s(i),\ a_n=a(k)) > 0$ for all $\ell$. Indeed, by the identity
\[
\mathbb{E}_{d}(V(s_{n+1})) = \sum_{j=1}^{N}\sum_{i=1}^{N}\sum_{k=1}^{M} V_{ij|k}\,\pi(ij|k)\,d_n(k|i)\,P(s_n=s(i)) = \sum_{j=1}^{N}\sum_{i=1}^{N}\sum_{k=1}^{M}\left[V_{ij|k} - c\right]\pi(ij|k)\,d_n(k|i)\,P(s_n=s(i)) + c
\]
the minimization of the state-value function $\mathbb{E}_{d}(V(s_n))$ is equivalent to the minimization of the function $\mathbb{E}_{d}(\tilde V(s_n))$, where $\tilde V(s_n) = V(s_n) - c$, which is strictly positive if we take $c < \min_{i,j,k} V_{ij|k}$.
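A minimal sketch of (4)-(5), assuming zero-based array indexing and illustrative names only, computes the one-step expected cost as $\langle w_n, p_n\rangle$:

```python
import numpy as np

def expected_cost(pi, d, V, p):
    """One-step expected cost E_d(V(s_{n+1})) = <w_n, p_n> from (4)-(5).
    pi[k, i, j] = pi(ij|k); d[i, k] = d(k|i); V[i, j, k] = V_{ij|k};
    p[i] = P(s_n = s(i)).  Returns the cost and the weight vector w_n."""
    M, N, _ = pi.shape
    w = np.zeros(N)
    for i in range(N):
        for k in range(M):
            # inner sum over the next state j, weighted by the action probability d(k|i)
            w[i] += d[i, k] * np.dot(V[i, :, k], pi[k, i, :])
    return float(np.dot(w, p)), w
```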
B. The Recursive Matrix Form

Let us first introduce the following proposition.

Proposition 1: Let $D$ be the unit simplex in $\mathbb{R}^{M}$, that is,
\[
D = \left\{u \in \mathbb{R}^{M}\ \middle|\ \sum_{k=1}^{M} u(k) = 1,\ u(k) \geq 0\right\}
\]
Then
\[
\min_{u\in D}\sum_{k=1}^{M} v(k)u(k) = \min_{k=1,\ldots,M} v(k) = v(\kappa)
\]
and the minimum is achieved at least for $u^{*} = (0,\ldots,0,1,0,\ldots,0)^{\top}$, the unit vector with the $1$ in a minimizing position $\kappa$. Indeed, it is evident that
\[
\sum_{k=1}^{M} v(k)u(k) \geq \sum_{k=1}^{M}\Big(\min_{k} v(k)\Big)u(k) = \min_{k} v(k)\sum_{k=1}^{M} u(k) = \min_{k} v(k) = v(\kappa)
\]
and the equality is achieved at least for $u = u^{*}$.

As a result we have
\[
V_{n+1} = \langle w_n, p_n\rangle = \sum_{i=1}^{N}(w_n)_i(p_n)_i \geq \sum_{i=1}^{N}\min_{d_n(\cdot|i)\in D}(w_n)_i(p_n)_i = \sum_{i=1}^{N}(p_n)_i\min_{k=1,\ldots,M}\sum_{j=1}^{N} V_{ij|k}\,\pi(ij|k)
\]
Then, given the fixed history of the process $p_0, d_0, d_1,\ldots,d_{n-1}$,
\[
V_{n+1} \geq \sum_{i=1}^{N}\left[\min_{d_n}\sum_{j=1}^{N}\sum_{k=1}^{M} V_{ij|k}\,\pi(ij|k)\,d_n(k|i)\right](p_n)_i \qquad (6)
\]
and the identity in (6) is achieved for the stationary local-optimal policy
\[
d_n(k|i) = \delta_{k,k^{*}(i)}, \qquad n = 0,1,\ldots \qquad (7)
\]
where $\delta_{k,k^{*}(i)}$ is the Kronecker symbol and $k^{*}(i)$ is an index on which the minimum is attained, that is,
\[
k^{*}(i) := \arg\min_{k=1,\ldots,M}\sum_{j=1}^{N} V_{ij|k}\,\pi(ij|k) \qquad (8)
\]
Under this best-reply strategy the cost takes the form
\[
V_{n+1} = \langle w, p_n\rangle \qquad (9)
\]
where $w := ((w)_1,\ldots,(w)_N)$ and
\[
(w)_i := \min_{k=1,\ldots,M}\sum_{j=1}^{N} V_{ij|k}\,\pi(ij|k) = \sum_{j=1}^{N} V_{ij|k^{*}(i)}\,\pi(ij|k^{*}(i))
\]
while the state distribution evolves as
\[
p_{n+1} = (\Pi^{*})^{\top} p_n = \left((\Pi^{*})^{\top}\right)^{n+1} p_0 \qquad (10)
\]
with
\[
(\Pi^{*})_{ij} := \sum_{k=1}^{M}\pi(ij|k)\,\delta_{k,k^{*}(i)} = \pi(ij|k^{*}(i))
\]
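The best-reply construction (7)-(10) is straightforward to evaluate numerically. A possible sketch (illustrative helper names, not the paper's code) is:

```python
import numpy as np

def best_reply_policy(pi, V):
    """Stationary local-optimal (best-reply) policy of (7)-(8) and the
    induced quantities w and Pi* of (9)-(10).
    pi[k, i, j] = pi(ij|k); V[i, j, k] = V_{ij|k}."""
    M, N, _ = pi.shape
    # q[i, k] = sum_j V_{ij|k} pi(ij|k): expected one-step cost of action k in state i
    q = np.einsum('ijk,kij->ik', V, pi)
    k_star = q.argmin(axis=1)               # best-reply action index k*(i)
    w = q[np.arange(N), k_star]             # (w)_i = min_k q[i, k]
    Pi_star = pi[k_star, np.arange(N), :]   # (Pi*)_{ij} = pi(ij|k*(i))
    return k_star, w, Pi_star

def iterate(Pi_star, w, p0, n_steps):
    """Iterates p_{n+1} = (Pi*)^T p_n and records the costs V_{n+1} = <w, p_n>."""
    p, costs = p0.copy(), []
    for _ in range(n_steps):
        costs.append(float(w @ p))
        p = Pi_star.T @ p
    return np.array(costs), p
```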
IV. CONSTRUCTION OF A LYAPUNOV-LIKE FUNCTION

The aim of this section is to associate to any cost function $V_n$, governed by (9), a Lyapunov-like function which monotonically decreases (does not increase) on the trajectories of the given system.
A. Recurrent form for the cost function

In view of (5), under the stationary best-reply strategy (9), let us represent $V_{n+1}$ as
\[
\begin{aligned}
V_{n+1} = \langle w, p_n\rangle &= \langle w, p_{n-1}\rangle + \langle w, p_n - p_{n-1}\rangle \\
&= V_n + \left\langle w,\ \left((\Pi^{*})^{\top}\right)^{n} p_0 - \left((\Pi^{*})^{\top}\right)^{n-1} p_0\right\rangle \\
&= V_n + \left\langle w,\ \left((\Pi^{*})^{\top}\right)^{n-1}\left[(\Pi^{*})^{\top} - I\right] p_0\right\rangle \\
&= \left(1 + \frac{\left\langle\left[\Pi^{*} - I\right] w,\ p_{n-1}\right\rangle}{V_n}\right) V_n
\end{aligned}
\]
Denoting
\[
\alpha_n := \frac{\left\langle\left[\Pi^{*} - I\right] w,\ p_{n-1}\right\rangle}{V_n} = \frac{\left\langle\left[\Pi^{*} - I\right] w,\ p_{n-1}\right\rangle}{\langle w, p_{n-1}\rangle}
\]
we get
\[
V_{n+1} = (1+\alpha_n)V_n \qquad (11)
\]
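As a quick numerical sanity check of (11) on made-up data (not taken from the paper), one can verify that $(1+\alpha_n)V_n$ reproduces $V_{n+1}$ along the trajectory $p_{n+1}=(\Pi^{*})^{\top}p_n$:

```python
import numpy as np

rng = np.random.default_rng(0)

# random 3-state chain Pi* (rows sum to 1), positive weights w, uniform p0
Pi_star = rng.random((3, 3)); Pi_star /= Pi_star.sum(axis=1, keepdims=True)
w = rng.random(3) + 0.1
p_prev = np.full(3, 1 / 3)

for n in range(1, 6):
    V_n = w @ p_prev                                  # V_n = <w, p_{n-1}>
    alpha_n = ((Pi_star - np.eye(3)) @ w) @ p_prev / V_n
    p_next = Pi_star.T @ p_prev                       # p_n = (Pi*)^T p_{n-1}
    V_next = w @ p_next                               # V_{n+1} = <w, p_n>
    assert np.isclose((1 + alpha_n) * V_n, V_next)    # identity (11)
    p_prev = p_next
```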
B. Ergodicity property for Markov chains controllable by the best-reply strategy

Following Section 1.4 in [7], let us introduce the next definition.

Definition 3: For a homogeneous (stationary) finite Markov chain with transition matrix $\Pi = [\pi_{ij}]_{i,j=1,\ldots,N}$, the parameter $k_{erg}(n_0)$ defined by
\[
k_{erg}(n_0) := 1 - \frac{1}{2}\max_{i,j=1,\ldots,N}\sum_{m=1}^{N}\left|(\tilde\pi_{im})_{n_0} - (\tilde\pi_{jm})_{n_0}\right| \qquad (12)
\]
is said to be the coefficient of ergodicity of this Markov chain at time $n_0$, where
\[
(\tilde\pi_{im})_{n_0} = P\{s_{n_0}=s(m)\mid s_0=s(i)\} = (\Pi^{n_0})_{im} \in [0,1)
\]
is the probability to evolve from the initial state $s_0=s(i)$ to the state $s_{n_0}=s(m)$ after $n_0$ transitions.

Theorem 3: If, for the finite Markov chain corresponding to the player $\ell$ and controllable by the best-reply strategy (8), the lower bound estimate of the ergodicity coefficient
\[
\pi_{erg} := \max_{j=1,\ldots,N}\ \min_{i=1,\ldots,N}\ (\tilde\pi_{ij})_{n_0} \qquad (13)
\]
is strictly positive, that is, $\pi_{erg} > 0$, then the following properties hold:
1) there exists a unique stationary distribution
\[
p^{*} = \lim_{n\to\infty} p_n \qquad (14)
\]
2) the convergence of the current state distribution to the stationary one is exponential:
\[
\left|p_n(i) - p^{*}(i)\right| \leq C\exp\{-Dn\}, \qquad C = \frac{1}{1-\pi_{erg}}, \quad D = \frac{1}{n_0^{*}}\ln C \qquad (15)
\]
where
\[
n_0^{*} = \arg\min_{n_0}\ \max_{j=1,\ldots,N}\ \min_{i=1,\ldots,N}\ (\tilde\pi_{ij})_{n_0}
\]
Now we are ready to formulate the main result of this paper.
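The quantities in (12)-(13) can be evaluated directly from the chain $\Pi^{*}$ induced by the best reply. A small sketch (assumed helper name, not part of the paper) is:

```python
import numpy as np

def ergodicity_measures(Pi_star, n0=1):
    """Coefficient of ergodicity k_erg(n0) of (12) and the lower bound
    pi_erg of (13) for a row-stochastic matrix Pi*."""
    P = np.linalg.matrix_power(Pi_star, n0)          # (pi~_{im})_{n0}
    N = P.shape[0]
    k_erg = 1.0 - 0.5 * max(np.abs(P[i] - P[j]).sum()
                            for i in range(N) for j in range(N))
    pi_erg = P.min(axis=0).max()                     # max_j min_i (pi~_{ij})_{n0}
    return k_erg, pi_erg
```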
C. The Lyapunov function design

Defining $\tilde\alpha_n$ as
\[
\tilde\alpha_n = \begin{cases}\alpha_n & \text{if } \alpha_n \geq 0 \\ 0 & \text{if } \alpha_n < 0\end{cases} \qquad (16)
\]
we get
\[
V_{n+1} = (1+\alpha_n)V_n \leq (1+\tilde\alpha_n)V_n \qquad (17)
\]
which leads to the following statement.

Theorem 4: Let
\[
\mathcal{G} = \left\langle \mathcal{N},\ S,\ (A^{\ell})_{\ell\in\mathcal{N}},\ (\mathbb{K}^{\ell})_{\ell\in\mathcal{N}},\ \Pi,\ (V^{\ell})_{\ell\in\mathcal{N}} \right\rangle
\]
be a non-cooperative stochastic game and let the recursive matrix format be represented by (11). Then a possible Lyapunov-like function $V_{n,mon}$ (which is monotonically non-increasing) for $\mathcal{G}$ has the form
\[
V_{n,mon} = V_n\prod_{t=1}^{n-1}(1+\tilde\alpha_t)^{-1} = \frac{1+\alpha_{n-1}}{1+\tilde\alpha_{n-1}}\,V_{n-1,mon}, \qquad V_{0,mon} = V_0 \qquad (18)
\]
Proof: Let us consider the recursion $x_{n+1} \leq (1+\alpha_n)x_n + \beta_n$ with $\alpha_n, x_n, \beta_n \geq 0$. Defining
\[
\tilde x_n := x_n\prod_{t=1}^{n-1}(1+\alpha_t)^{-1}, \qquad \tilde\beta_n := \beta_n\prod_{t=1}^{n}(1+\alpha_t)^{-1}
\]
and
\[
y_n = \tilde x_n - \sum_{t=1}^{n-1}\tilde\beta_t
\]
we obtain $y_{n+1} \leq y_n$. Indeed,
\[
\tilde x_{n+1} = x_{n+1}\prod_{t=1}^{n}(1+\alpha_t)^{-1} \leq \left[x_n(1+\alpha_n)+\beta_n\right]\prod_{t=1}^{n}(1+\alpha_t)^{-1} = \tilde x_n + \tilde\beta_n
\]
which implies
\[
y_{n+1} = \tilde x_{n+1} - \sum_{t=1}^{n}\tilde\beta_t \leq \tilde x_n + \tilde\beta_n - \sum_{t=1}^{n}\tilde\beta_t = \tilde x_n - \sum_{t=1}^{n-1}\tilde\beta_t = y_n
\]
and therefore $y_{n+1} \leq y_n$. In view of this we have
\[
V_{n+1,mon} = V_{n+1}\prod_{t=1}^{n}(1+\tilde\alpha_t)^{-1} \leq (1+\tilde\alpha_n)V_n\prod_{t=1}^{n}(1+\tilde\alpha_t)^{-1} = V_n\prod_{t=1}^{n-1}(1+\tilde\alpha_t)^{-1} = V_{n,mon}
\]
which proves the result.
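A minimal sketch of the envelope construction, in the spirit of (16)-(18) (every past positive relative increase of the cost is divided out), could be:

```python
import numpy as np

def lyapunov_envelope(V):
    """Monotone Lyapunov-like sequence built from a positive cost sequence
    V_0, V_1, ... generated by (11); the product of (1 + alpha~_t) over all
    past steps t < n is divided out (cf. (16)-(18)).  Illustrative sketch."""
    V = np.asarray(V, dtype=float)
    alpha = V[1:] / V[:-1] - 1.0            # V_{n+1} = (1 + alpha_n) V_n
    alpha_tilde = np.maximum(alpha, 0.0)    # (16): keep only the increases
    prods = np.cumprod(1.0 + alpha_tilde)   # running product of (1 + alpha~_t)
    return np.concatenate(([V[0]], V[1:] / prods))

# the returned sequence is non-increasing:
# np.all(np.diff(lyapunov_envelope(V)) <= 1e-12) holds for any positive V
```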
Corollary 5: Since the sequence $\{V_{n,mon}\}$ is bounded from below and monotonically non-increasing, by the Weierstrass theorem it converges, that is, there exists a limit
\[
V_{\infty,mon} := \lim_{n\to\infty} V_{n,mon}
\]
Corollary 6: If the series $\sum_t\tilde\alpha_t$ converges, i.e.,
\[
\sum_{t=1}^{\infty}\tilde\alpha_t < \infty
\]
then the product $\prod_{t=1}^{n}(1+\tilde\alpha_t)$ also converges (by applying the inequality $1+x \leq e^{x}$, valid for any $x \in \mathbb{R}$), namely,
\[
\prod_{t=1}^{\infty}(1+\tilde\alpha_t) < \infty \qquad (19)
\]
which implies the existence of a limit (a convergence) of the sequence $\{V_n\}$ of the given loss-function too, i.e.,
\[
V_{\infty} := \lim_{n\to\infty} V_n = V_{\infty,mon}\prod_{t=1}^{\infty}(1+\tilde\alpha_t) \qquad (20)
\]
Remark 4: Notice that by the property (15) the infinite product in (20) always exists for ergodic Markov chains, that is, $\prod_{t=1}^{\infty}(1+\tilde\alpha_t) < \infty$, since by the Corollary above
\[
\begin{aligned}
\prod_{t=1}^{\infty}(1+\tilde\alpha_t) &\leq \exp\Big\{\sum_{t=1}^{\infty}\tilde\alpha_t\Big\} \leq \exp\Big\{\sum_{t=1}^{\infty}\frac{\left|\left\langle\left[\Pi^{*}-I\right]w,\ p_{t-1}\right\rangle\right|}{V_t}\Big\} = \exp\Big\{\sum_{t=1}^{\infty}\frac{\left|\langle p_t - p_{t-1},\ w\rangle\right|}{V_t}\Big\} \\
&\leq \exp\Big\{\frac{\|w\|}{c}\sum_{t=1}^{\infty}\|p_t - p_{t-1}\|\Big\} \leq \exp\Big\{\frac{\|w\|}{c}\sum_{t=1}^{\infty}\big(\|p_t - p^{*}\| + \|p_{t-1} - p^{*}\|\big)\Big\} \\
&\leq \exp\Big\{\frac{2\|w\|}{c}\sum_{t=1}^{\infty}\|p_t - p^{*}\|\Big\} \leq \exp\Big\{\frac{2\|w\|}{c}\sum_{t=1}^{\infty}C\exp\{-Dt\}\Big\} < \infty
\end{aligned}
\]
where $c > 0$ is a lower bound for $V_t$ (for instance $c = \min_i (w)_i$, which is positive by Remark 2). This means that the behavior of the sequence $\{V_{n,mon}\}$ may serve as an indicator of the convergence of the game: the approach of the vector cost-function $\{V_{n,mon}\}$ to its limit point $V_{\infty,mon}$ means that we are close to one of the equilibrium points of the game. Note that this convergence is exponential.

V. EXAMPLE

Let us consider the repeated "Prisoner's Dilemma" game (choices are made asynchronously). Let $N=2$ be the number of states and $M=2$ the number of actions. The payoff is defined by
\[
V^{1}(i,j) = \begin{pmatrix} 5 & 1 \\ 10 & 3 \end{pmatrix}\ \text{for player 1}, \qquad V^{2}(i,j) = \begin{pmatrix} 5 & 10 \\ 1 & 3 \end{pmatrix}\ \text{for player 2}
\]
and the transition matrices for $k=1,2$ are defined as follows: for player 1,
\[
\pi^{1}(i,j,1) = \begin{pmatrix} 0.0247 & 0.9753 \\ 0.9756 & 0.0244 \end{pmatrix}, \qquad \pi^{1}(i,j,2) = \begin{pmatrix} 0.5668 & 0.4332 \\ 0.9960 & 0.0040 \end{pmatrix}
\]
and for player 2,
\[
\pi^{2}(i,j,1) = \begin{pmatrix} 0.1904 & 0.8096 \\ 0.8612 & 0.1388 \end{pmatrix}, \qquad \pi^{2}(i,j,2) = \begin{pmatrix} 0.0027 & 0.9973 \\ 0.7693 & 0.2307 \end{pmatrix}
\]
The initial profile is supposed to be uniform, that is, $P(s_0=j)=0.5$ for each player $\ell=1,2$ and each state $j=1,2$. But, as follows from the statements above, in the ergodic case this profile can be selected arbitrarily without any influence on the final equilibrium point.

For $d^{1}$ and $d^{2}$ the fixed local-optimal strategies (7) and $k^{*}$ the best-reply strategy, the following results have been obtained: for player 1,
\[
\Pi^{1*} = \begin{pmatrix} 0.0247 & 0.9753 \\ 0.9756 & 0.0244 \end{pmatrix}, \quad n_0 = 1, \quad \pi^{1}_{erg} = 0.0247, \quad k^{1*} = [1,\ 1], \quad w^{1} = [1.0986,\ 9.8290]
\]
and for player 2,
\[
\Pi^{2*} = \begin{pmatrix} 0.1904 & 0.8096 \\ 0.8612 & 0.1388 \end{pmatrix}, \quad n_0 = 1, \quad \pi^{2}_{erg} = 0.1904, \quad k^{2*} = [1,\ 1], \quad w^{2} = [9.0482,\ 1.2775]
\]
1) In Fig. 1 and Fig. 3 the state-value function behavior is shown (where during the game repetition the states of the players fluctuate according to the given probabilistic dynamics), exhibiting a completely non-monotonic behavior;
2) In Fig. 2 and Fig. 4 the corresponding Lyapunov-like functions (18) are plotted, definitely demonstrating a monotonically decreasing behavior;
3) The results of the two methods clearly show that, under the same fixed local-optimal strategy, the original cost functions converge non-monotonically to the values 5.4643 (for the first player) and 5.0854 (for the second player), while the corresponding Lyapunov-like functions converge monotonically to the values 5.4543 and 4.9298, respectively, which, obviously, are very close.
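For player 1 the reported values can be checked with a short script (a sketch; it reads the example's payoff as $V_{ij|k}=V^{1}(i,j)$ for both actions, an assumption consistent with the numbers above):

```python
import numpy as np

# Player 1 data from the example: V1[i, j] and pi1[k, i, j] (indices start at 0)
V1  = np.array([[5.0, 1.0], [10.0, 3.0]])
pi1 = np.array([[[0.0247, 0.9753], [0.9756, 0.0244]],
                [[0.5668, 0.4332], [0.9960, 0.0040]]])

# q[i, k] = sum_j V1[i, j] * pi1[k, i, j]: expected one-step cost of action k in state i
q = np.einsum('ij,kij->ik', V1, pi1)
k_star = q.argmin(axis=1)                 # best-reply actions; expected [0, 0], i.e. k* = [1, 1]
w = q[np.arange(2), k_star]               # expected approximately [1.0986, 9.8290]
Pi_star = pi1[k_star, np.arange(2), :]    # chain induced by the best reply
pi_erg = Pi_star.min(axis=0).max()        # expected 0.0247
print(k_star + 1, w, pi_erg)
```

Repeating the same computation with $V^{2}$ and $\pi^{2}$ should recover the reported player-2 quantities.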
Fig. 1. Non-monotonic behavior of the cost function for Player 1.

Fig. 2. Monotonic behavior of the Lyapunov-like function for Player 1.

Fig. 3. Non-monotonic behavior of the cost function for Player 2.
REFERENCES
[1] X. Chen and X. Deng, "Settling the complexity of 2-player Nash equilibrium," in Proceedings of IEEE FOCS, 2006, pp. 261-270.
[2] J. B. Clempner and A. S. Poznyak, "Convergence properties and computational complexity analysis for Lyapunov games," International Journal of Applied Mathematics and Computer Science, vol. 21, no. 2, pp. 349-361, 2011.
[3] C. Daskalakis, P. Goldberg and C. Papadimitriou, "The complexity of computing a Nash equilibrium," in Proceedings of ACM STOC, 2006, pp. 71-78.
[4] M. Goemans, V. Mirrokni and A. Vetta, "Sink equilibria and convergence," in Proc. 46th IEEE Symposium on Foundations of Computer Science, 2005, pp. 142-154.
[5] O. Hernández-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria. Berlin: Springer, 1996.
[6] J. Hillas, M. Jansen, J. Potters and D. Vermeulen, "Independence of inadmissible strategies and best reply stability: a direct proof," International Journal of Game Theory, vol. 32, pp. 371-377, 2003.
[7] A. S. Poznyak, K. Najim and E. Gómez-Ramírez, Self-Learning Control of Finite Markov Chains. New York: Marcel Dekker, 2000.
Fig. 4. Monotonic behavior of the Lyapunov-like function for Player 2.