Dynamic programming
Martin Ellison
1 Motivation
Dynamic programming is one of the most fundamental building blocks of modern macroeconomics. It gives us the tools and techniques to analyse (usually numerically but often analytically) a whole class of models in which the problems faced by economic agents have a recursive nature. Recursive problems pervade macroeconomics: any model in which agents face repeated decision problems tends to have a recursive formulation. This lecture introduces two key concepts: the value function and value function iteration. To fully understand the intuition of dynamic programming, we begin with simple deterministic models. Stochastic and nonlinear models will be considered in future lectures.
2 Key reading
This lecture draws on the material in chapters 2 and 3 of “Dynamic Economics: Quantitative Methods and Applications” by Jérôme Adda and Russell Cooper, MIT Press, 2003. We will also use the book later in the course. It is a very accessible introduction to techniques for dynamic economics, covering topics including consumption, investment and employment. As it is very reasonably priced, I recommend this text for purchase.
3 Other reading
The same material is covered in several textbooks, most notably “Recursive Macroeconomic Theory”, 2nd edition, by Lars Ljungqvist and Thomas Sargent, MIT Press, 2004, and the original grandfather text “Recursive Methods in Economic Dynamics” by Nancy Stokey and Robert Lucas, Harvard University Press, 1989. These cover the material with greater mathematical rigour, whereas Adda and Cooper place greater weight on intuitive understanding.
4 The value function
At the heart of dynamic programming is the value function, which shows the value of being in a particular state of the world. For example, what is the value of having income I when there are two goods in the economy, x_1 and x_2, at prices p_1 and p_2, and utility is logarithmic and separable? To answer this question, we begin by asking how the consumer would allocate income I across the two goods. The utility maximisation problem faced by the consumer is
max_{x_1, x_2} [ω ln x_1 + (1 − ω) ln x_2]
s.t. p_1 x_1 + p_2 x_2 = I

Taking first-order conditions gives the familiar solution in which there are constant expenditure shares:
p_1 x_1 = ω I
p_2 x_2 = (1 − ω) I

Solving these two equations for x_1 and x_2, we can then substitute into the utility function to obtain

V(I, p_1, p_2) = ω ln(ω I / p_1) + (1 − ω) ln((1 − ω) I / p_2)

Students of microeconomics will immediately recognise this as the indirect utility function. It describes how utility depends on the state variables I, p_1 and p_2. In the dynamic programming terminology, however, we refer to it as the value function: the value associated with the state variables. Note that it is intrinsic to the value function that the agent (in this case the consumer) is optimising. More generally, we can write

V(I, p) = max_{c ∈ C} u(c)
s.t. p·c = I

where c is a vector of consumption goods chosen from the feasible set C and p is the vector of their corresponding prices. This formulation makes it explicit that the value function incorporates optimisation. Whilst the example in this section is trivial, we will see that recasting economic problems in terms of value functions turns out to be extremely powerful.
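As a quick sanity check, the value function can be verified numerically. The following Matlab sketch (all parameter values are illustrative assumptions, not from the text) compares the closed-form indirect utility with the result of direct numerical maximisation using fminbnd.

    % Verify the indirect utility function numerically.
    % omega, p1, p2 and I below are arbitrary illustrative values.
    omega=0.3; p1=1; p2=2; I=10;
    x1=omega*I/p1;                          % constant expenditure shares
    x2=(1-omega)*I/p2;
    Vclosed=omega*log(x1)+(1-omega)*log(x2);
    % Maximise utility over x1 directly, with x2 given by the budget
    % constraint; the objective is negated because fminbnd minimises.
    u=@(x) -(omega*log(x)+(1-omega)*log((I-p1*x)/p2));
    x1num=fminbnd(u,1e-6,I/p1-1e-6);
    Vnum=-u(x1num);
    disp([Vclosed Vnum]);                   % the two values should agree

The two numbers agree to the tolerance of the optimiser, confirming that the value function already embodies optimal behaviour.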
5 Cake-eating example
To introduce dynamics to the problem, we now consider the problem of how quickly one should eat a cake of given size. Imagine the cake is initially of size W_1 and all cake should be eaten before time T (by which time presumably either the cake has become moldy or the consumer has died and become moldy!). Instantaneous utility derived from eating cake is given by the function u(c_t) and the consumer discounts future utility by the factor β. This is a finite-horizon dynamic problem with discounting:

max_{{c_t}} Σ_{t=1}^{T} β^{t−1} u(c_t)
s.t. W_{t+1} = W_t − c_t, W_1 given
The problem is to find the optimal path {c_t}. In standard undergraduate or master's courses in macroeconomics, the preferred solution method is one of direct attack. We solve the budget constraint forward to obtain

Σ_{t=1}^{T} c_t + W_{T+1} = W_1

and define the Lagrangian

L = Σ_{t=1}^{T} β^{t−1} u(c_t) + λ [ W_1 − W_{T+1} − Σ_{t=1}^{T} c_t ]

The first-order conditions for t = 1, …, T are

β^{t−1} u′(c_t) = λ

or, equating the conditions for consecutive periods,

u′(c_t) = β u′(c_{t+1})
This is the familiar Euler equation, equating the discounted marginal utility of consumption across consecutive time periods. In itself, it is not sufficient to uniquely determine how the cake should be eaten. For that, we also require that W_{T+1} = 0, a terminal condition stating that no cake should be left over after period T. To solve the problem using the method of dynamic programming, we define a value function V_T(W_1) as the solution derived above with the method of direct attack, i.e.

V_T(W_1) = max_{{c_t}} Σ_{t=1}^{T} β^{t−1} u(c_t)
s.t. W_{t+1} = W_t − c_t, W_1 given
Although we do not know the function V_T(W_1), we do know its derivative V_T′(W_1). An increment in the initial cake size W_1 allows consumption in any period to increase, so V_T′(W_1) = β^{t−1} u′(c_t) for any t. It does not matter in which period the extra cake is eaten since, under optimality, the return (in terms of the value function) to eating extra cake is equalised across periods. The power of dynamic programming becomes apparent when we add an additional period 0 to our problem. The problem at time 0 is to solve
max_{c_0} [u(c_0) + β V_T(W_1)]
s.t. W_1 = W_0 − c_0, W_0 given
This is a simple problem to solve because we only have to choose c_0 rather than a whole time path for consumption {c_t}. The first-order condition is simply

u′(c_0) = β V_T′(W_1)

However, we know from before that V_T′(W_1) = β^{t−1} u′(c_t), which evaluated at t = 1 gives V_T′(W_1) = u′(c_1). We can therefore conclude that u′(c_t) = β u′(c_{t+1}), and we have derived the Euler equation using the dynamic programming method. Notice how we did not need to worry about decisions from time t = 1 onwards. This is an example of the Bellman principle of optimality: it is sufficient to optimise today conditional on future behaviour being optimal. The ease with which we did this is of course illusory, because we already knew the form of V_T′(W_1) from the direct attack approach. In general this will not be the case, and we will not know the exact form of the value function or its first derivative. Fortunately, this is not an insurmountable problem. Our approach will be to make a first guess at the value function and then perform value function iterations until our guesses converge on the true value function. The next section is devoted to showing how these value function iterations are carried out and under what conditions they converge to the true value function.
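Before moving on, it is worth verifying the solution in a concrete case. With logarithmic utility u(c) = ln c, the Euler equation implies c_{t+1} = β c_t, and combining this with the budget constraint Σ c_t = W_1 and the terminal condition W_{T+1} = 0 gives c_1 = (1 − β) W_1 / (1 − β^T). The following Matlab sketch, with illustrative values β = 0.75, T = 10 and W_1 = 1 (assumptions, not from the text), checks that this path exhausts the cake and satisfies the Euler equation.

    % Closed-form solution of the finite-horizon cake-eating problem
    % under log utility; beta, T and W1 are illustrative values.
    beta=0.75; T=10; W1=1;
    c1=(1-beta)*W1/(1-beta^T);       % first-period consumption
    c=c1*beta.^(0:T-1);              % Euler equation: c(t+1)=beta*c(t)
    disp(sum(c)-W1);                 % budget residual: should be zero
    disp(1./c(1:T-1)-beta./c(2:T));  % Euler residuals: should be zero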
6 Value function iterations
We illustrate the convergence of value function iterations to the true value function in a general formulation of the dynamic programming problem. The key ingredients are a payoff function σ̃(s_t, c_t) and a transition equation s_{t+1} = τ(s_t, c_t). The payoff function describes the instantaneous return from choosing a vector of controls c_t at a given vector of states s_t. In the cake-eating example, σ̃(·) is simply the (direct) utility function. The transition equation describes the evolution of the vector of state variables s_t; for the cake-eating example, τ(·) is the intertemporal budget constraint. Of crucial importance for the remainder of this course is that σ̃(·) and τ(·) are not time-dependent, so the problem is stationary. This ensures that the problem has a recursive representation: for a given state vector, the problem faced by the agent is always the same. The value function is now defined as the value of being in a particular state s (we remove the time index and use s and s′ to denote the state vector in adjacent time periods).
V(s) = max_{c ∈ C(s)} [σ̃(s, c) + β V(s′)]
s.t. s′ = τ(s, c)

C(s) is the set of all possible choices of the controls c for a given state vector s. To make the notation more compact, we invert the transition equation to define the control c as a function of current and future states, enabling the payoff function to be written in terms of s and s′. Instead of choosing c, the agent chooses the future state s′ from the set of feasible states Γ(s):

V(s) = max_{s′ ∈ Γ(s)} [σ(s, s′) + β V(s′)]
The problem now is to find the value function V(s). In some cases it is possible to make an intuitive guess at its form (e.g. quadratic in the state variables) and then proceed via the method of undetermined coefficients to show that the guess is consistent with optimality. The approach we take, through value function iterations, is more general, although it leads to a numerical rather than an analytical solution. We begin by making an initial guess of the value function, W(·). The next guess of the value function is obtained by applying an operator T, defined as follows:

T(W)(s) = max_{s′ ∈ Γ(s)} [σ(s, s′) + β W(s′)]
To put the iteration in words, what we are doing in each iteration is reoptimising the choice of the future state s′. In doing this, we need to know how a change in the future state affects the payoff this period and in all future periods (the latter is often known as the continuation value). For the payoff this period, we can use the function σ(s, s′). For the payoff in future periods, we use the previous guess of the value function W(s′). In effect, we are reoptimising our choice of s′ under the assumption that W(s′) is a correct representation of the true value function from the next period onwards. Clearly it is not, but successive iterations will generally converge, so that W(s′) approaches the true value function V(s′) and the guess of the value function stops changing between successive iterations. How do we know that value function iterations will converge? Even if they do, how do we know that they converge to the unique value function? To answer these questions we need a fixed point theorem, since we wish to show that value function iterations converge to the unique fixed point defined by

T(V)(s) = max_{s′ ∈ Γ(s)} [σ(s, s′) + β V(s′)]
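The operator T translates directly into code. The sketch below implements value function iteration on a small finite state space, stopping when successive guesses are close in the sup norm; the payoff matrix, discount factor and tolerance are arbitrary illustrative assumptions, not from the text.

    % Value function iteration on a finite state space.
    % sigma(i,j) is the payoff sigma(s,s') of moving from state i to
    % state j; the numbers are arbitrary illustrative values.
    beta=0.75; tol=1e-8;
    sigma=[1.0 0.2; 0.5 0.8];
    W=zeros(2,1);                           % initial guess
    dist=Inf;
    while dist>tol
        [TW,pol]=max(sigma+beta*W.',[],2);  % apply the operator T
        dist=max(abs(TW-W));                % sup-norm distance
        W=TW;
    end
    disp(W);    % (approximate) fixed point V
    disp(pol);  % optimal choice of s' in each state

In practice this loop terminates for any initial guess; the remainder of this section explains why such iterations are guaranteed to converge, and to converge to the unique fixed point.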
There are many fixed point theorems, some more useful than others. For our purposes, the most useful result is based on Blackwell's sufficiency conditions, which ensure convergence to a unique fixed point. The conditions are 1) monotonicity and 2) discounting of the T operator. The mathematics behind these conditions can be found in Stokey and Lucas and many other places. Here, we will concentrate on gaining intuition into why Blackwell's sufficiency conditions guarantee convergence of value function iterations to the unique true value function. To illustrate the intuition, we describe a simple example in which the state space collapses to a single point. In this example, there is no possibility of changing state, so there is no control and no optimisation decision. However, there is a value associated with the (unique) state, so we can still illustrate the operation of value function iterations. In our simple example the value function will be a constant. If the initial guess of the value function is W, then the next guess of the value function is obtained by applying the operator

T(W) = σ + β W

Notice that the assumption of a collapsed state space removes the state dependency of the payoff function σ(·). It is a simple matter to plot the mapping graphically.
[Figure: the mapping T(W) = σ + βW plotted against the 45° line, with fixed point V where the two cross.]
This mapping converges to V for any starting value. It is an example of a contraction mapping, as the span of W is contracted at each iteration by the T operator. The key condition for convergence in the simple model is |β| < 1, which is guaranteed by discounting. Graphically, this ensures that the T mapping cuts the 45° line from above, with a gradient of modulus less than one.
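The convergence is easy to demonstrate numerically. The sketch below iterates T(W) = σ + βW from two very different starting guesses and shows that both reach the same fixed point V = σ/(1 − β); the values σ = 1 and β = 0.75 are illustrative assumptions.

    % Iterating T(W) = sigma + beta*W from two arbitrary starting values;
    % sigma and beta are illustrative numbers.
    sigma=1; beta=0.75;
    V=sigma/(1-beta);              % fixed point: here V = 4
    W=[-10; 50];                   % two very different initial guesses
    for iter=1:50
        W=sigma+beta*W;            % apply the operator T
    end
    disp([W [V;V]]);               % both guesses have converged to V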
Once we move to problems with a fully specified state space, the operator T is applied to a function W(s′) rather than a constant W. In this case, discounting alone is not a sufficient condition for unique convergence of value function iterations to the true value function. Intuitively, what we require is that the T mapping cuts the 45° surface from above in every direction and that the dynamics are stable. The condition of monotonicity ensures this. The original Blackwell paper from 1965 contains a formal proof, as does Stokey and Lucas. It is easy to see that Blackwell's sufficiency conditions apply to the dynamic programming problems we will study. Monotonicity requires that if W(s) ≥ Q(s) for all s ∈ S, then T(W)(s) ≥ T(Q)(s) for all s ∈ S. This is guaranteed because our problem is one of maximisation. We have
T(W)(s) = max_{s′ ∈ Γ(s)} [σ(s, s′) + β W(s′)]
        ≥ σ(s, s″) + β W(s″)
        ≥ σ(s, s″) + β Q(s″)
        = T(Q)(s)
where s″ is the future state that is optimal under the guess Q, and the middle inequality uses W(s″) ≥ Q(s″). Discounting is satisfied if, when we add a constant k to the value function, T(W + k)(s) ≤ T(W)(s) + β k. This is satisfied trivially in our model because

T(W + k)(s) = max_{s′ ∈ Γ(s)} [σ(s, s′) + β (W(s′) + k)] = T(W)(s) + β k
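Both conditions can also be checked numerically for the finite-state operator sketched earlier. The code below (reusing the illustrative payoff matrix; all numbers are assumptions) confirms that T(W + k) = T(W) + βk and that raising W nowhere lowers T(W).

    % Numerical check of Blackwell's conditions for the operator T.
    beta=0.75; k=2;
    sigma=[1.0 0.2; 0.5 0.8];             % illustrative payoff matrix
    T=@(W) max(sigma+beta*W.',[],2);      % the operator T
    W=[0.3; -1.2];                        % an arbitrary guess
    Q=W-[0.5; 0.1];                       % Q(s) <= W(s) everywhere
    disp(T(W+k)-(T(W)+beta*k));           % discounting: zeros
    disp(T(W)-T(Q));                      % monotonicity: nonnegative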
7 Numerical example
In this final section we show how to apply the principles of dynamic programming to the cake-eating problem in practice. We discuss the Matlab program available from the (very preliminary) Adda and Cooper book's homepage at http://www.eco.utexas.edu/~cooper/dynprog/dynprog1.html. This program iterates the value function and derives an optimal policy function.

Initialise the program by clearing the workspace, and define the number of value function iterations and the discount factor.

    clear;
    dimIter=30;
    beta=0.75;

We discretise the possible cake sizes into a vector K. It contains 101 points, starting from 0 and increasing in steps of 0.01 to 1. We store the row and column size of K in rowK and colK.

    K=0:0.01:1;
    [rowK,colK]=size(K);

V is a matrix that stores the results of the value function iterations. The rows of V correspond to the values of the possible cake sizes defined in K. The columns of V contain successive value function iterations. The initial guess of the value function is zero for all sizes of cake.

    V=zeros(colK,dimIter);

Begin with the first value function iteration and continue until the desired number of iterations has been completed.

    for iter=1:dimIter;

Define aux as an auxiliary matrix with the same number of rows and columns as there are cake sizes in K. We will use this matrix to store the value of choosing to leave K(ik2) cake for the next period when the current size of the cake is K(ik). We will actually only use the lower-left triangle of the aux matrix, since it is impossible to leave more cake in the future than you have at present, i.e. K(ik2) ≤ K(ik).

    aux=zeros(colK,colK)+NaN;

Beginning with the first possible current cake size, loop through all possible cake sizes.

    for ik=1:colK;

For each current cake size K(ik), we examine the value of leaving each possible future cake size K(ik2) < K(ik). We ignore the possibility K(ik2) = K(ik), since that would imply no consumption and therefore starvation under logarithmic utility.

    for ik2=1:(ik-1)

The value of choosing K(ik2) when the current cake has size K(ik) is stored in the (ik, ik2) element of aux. It consists of two parts: the (here logarithmic) payoff log(K(ik) − K(ik2)) and the continuation value V(ik2, iter). Note that this uses the value function from the previous iteration.

    aux(ik,ik2)=log(K(ik)-K(ik2))+beta*V(ik2,iter);
    end
    end

The newly iterated value function is derived by choosing the best future cake K(ik2) for each current cake K(ik). Simply looking for the maximum value of each row of aux (or alternatively the maximum value of each column of aux') is sufficient to find the optimal cake choices.
    V(:,iter+1)=max(aux')';

Loop to the next value function iteration until the last iteration, dimIter, is reached.

    end;

The value function iterations are now complete. The final value function is stored in Val, with the corresponding indices of future cake choices in Ind. optK converts these indices into actual cake sizes, with optC the implied consumption (adding Val*0 to optK propagates NaN to cake sizes for which no feasible choice exists).

    [Val,Ind]=max(aux');
    optK=K(Ind);
    optK=optK+Val*0;
    optC=K'-optK';

Plot a graph of the successive value function iterations.

    figure(1)
    plot(K,V);
    xlabel('Size of Cake');
    ylabel('Value Function');

Plot the optimal policy function.

    figure(2)
    plot(K,optC,'LineWidth',2)
    hold on
    plot(K,K','--r','LineWidth',2)
    xlabel('Size of Cake');
    ylabel('Optimal Consumption');
    text(0.4,0.65,'45 degree line','FontSize',18)
    text(0.4,0.13,'Optimal Consumption','FontSize',18)
    legend('Optimal Consumption','45 degree line',2)
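Before looking at the output, note that the infinite-horizon cake-eating problem with logarithmic utility has the well-known closed-form policy c = (1 − β)W. The lines below are a hypothetical addition to the program (they assume K, optC and beta are still in memory) that overlays this analytic policy on the computed one; with only 30 iterations and a coarse grid the two should be close but not identical.

    % Overlay the closed-form infinite-horizon policy c = (1-beta)*W
    % on the numerical policy computed above.
    figure(3)
    plot(K,optC,'LineWidth',2)
    hold on
    plot(K,(1-beta)*K,':k','LineWidth',2)
    xlabel('Size of Cake');
    ylabel('Optimal Consumption');
    legend('Numerical policy','Analytic policy');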
The output of the code is shown below. In the first figure, the value function clearly converges over successive iterations. The second figure shows the optimal consumption as a function of current cake size.

[Figure: iteration of the value function. Successive value function iterations plotted against the size of the cake.]

[Figure: optimal policy function. Optimal consumption plotted against the size of the cake, together with the 45 degree line.]