Introduction Background Strings of Actions The End
Performance Guarantees for Approximate Dynamic Programming Schemes

Edwin K. P. Chong, Colorado State University
IAS, 24 August 2017
Joint work with Yajing Liu and Ali Pezeshki. Partially supported by NSF grant CCF-1422658.
Motivation
Optimal decision making with multiple actions: typically computationally intractable.
Usual approach: resort to approximations and heuristics.
Downside: often no provable performance guarantees.
Motivation
In some cases, these decision problems have a special structure called submodularity.
Often, approximations/heuristics are special cases of greedy schemes.
For submodular problems, greedy schemes have provable guarantees.
Typically, greedy is more than 1 − e^(−1) (about 63%) as good as optimal.
Outline
Background on submodularity.
Strings of actions and ADP.
We give only the simplest versions of results, enough to convey the basic ideas. No proofs!
Monotonicity and Submodularity

Function from real numbers to real numbers: f : R → R; WLOG, f(0) = 0.
Monotone: ∀x ≤ y ∈ R, f(x) ≤ f(y).
Diminishing return: ∀x ≤ y ∈ R, ∀z > 0, f(x + z) − f(x) ≥ f(y + z) − f(y).
[Figure: the increment of f over an interval of length z is smaller starting at y than at x.]
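The scalar diminishing-return property is easy to check numerically. A minimal sketch, using the (assumed, not from the talk) example f(x) = √x, which is monotone with f(0) = 0:

```python
import math

def f(x):
    # Monotone, concave example with f(0) = 0 (a hypothetical choice: sqrt).
    return math.sqrt(x)

def has_diminishing_return(f, xs, z):
    # For every pair x <= y, the gain from adding z at x
    # must be at least the gain from adding z at y.
    return all(f(x + z) - f(x) >= f(y + z) - f(y)
               for x in xs for y in xs if x <= y)

print(has_diminishing_return(f, [0.5 * i for i in range(10)], 1.0))  # True
```

A convex function such as f(x) = x² fails the same check, since its increments grow rather than shrink.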
More General Setting
Want to go beyond the real line to a more general setting.
Specifically, want to consider objective functions with multiple decision “actions” as arguments.
Two specific settings:
Sets of actions.
Strings (ordered sets) of actions.
Here, consider only strings of actions. Typically, consider discrete actions.
Notation and Terminology Optimization and Bounds Bounding ADP General Optimal Control Myopic Scheme ADP Schemes Bounds for ADP
Actions and Strings

Set of possible actions: X.
String of actions: A = (a_1, a_2, …, a_k), a_i ∈ X for i = 1, 2, …, k.
Length of string: |A| = k.
All possible strings: X* = {(a_1, a_2, …, a_k) : k = 0, 1, … and a_i ∈ X, i = 1, 2, …, k}.
Empty string: ∅.
Concatenation of M = (m_1, …, m_{k_1}) and N = (n_1, …, n_{k_2}): M ⊕ N = (m_1, …, m_{k_1}, n_1, …, n_{k_2}).
Prefix: M ⪯ N if N = M ⊕ L for some L ∈ X*.
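As a concrete reading of this notation, strings of actions can be modeled as plain tuples. A small Python sketch (the action names are arbitrary placeholders):

```python
def concat(M, N):
    # M ⊕ N: concatenation of two strings of actions.
    return M + N

def is_prefix(M, N):
    # M ⪯ N iff N = M ⊕ L for some string L in X*.
    return N[:len(M)] == M

A = ('a1', 'a2')          # a string of length |A| = 2
B = concat(A, ('a3',))    # A ⊕ (a3)
print(is_prefix(A, B), len(B))  # True 3
```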
String Functions

Function from strings to real numbers: f : X* → R; WLOG, f(∅) = 0.
Prefix monotone: ∀M ⪯ N ∈ X*, f(M) ≤ f(N).
Diminishing return: ∀M ⪯ N ∈ X*, ∀a ∈ X, f(M ⊕ (a)) − f(M) ≥ f(N ⊕ (a)) − f(N).
f is string submodular if both conditions above hold.
Postfix monotone: ∀M, N ∈ X*, f(M ⊕ N) ≥ f(N).
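For small action sets, string submodularity can be verified by brute force. A sketch under a hypothetical coverage-style objective (reward = number of distinct actions used), which does satisfy both conditions:

```python
from itertools import product

def is_string_submodular(f, X, K):
    # Brute-force check of prefix monotonicity and diminishing return
    # over strings of length <= K (feasible only for tiny X and K).
    strings = [s for k in range(K + 1) for s in product(X, repeat=k)]
    for M in strings:
        for L in strings:
            N = M + L  # so M is a prefix of N
            if len(N) > K:
                continue
            if f(M) > f(N):
                return False  # prefix monotonicity fails
            for a in X:
                if f(M + (a,)) - f(M) < f(N + (a,)) - f(N):
                    return False  # diminishing return fails
    return True

f = lambda S: len(set(S))  # coverage-style toy objective
print(is_string_submodular(f, ('a', 'b'), 3))  # True
```

By contrast, an objective like f(S) = |S|² rewards length superlinearly and fails the diminishing-return check.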
Optimization
maximize f(M) subject to M ∈ X*, |M| ≤ K.   (1)

This constraint set is called a uniform string-matroid of rank K. Can extend to more general string-matroids [TAC 2016].
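Problem (1) can in principle be solved by exhaustive search over the uniform string-matroid, which is exponential in K; this is precisely why greedy and ADP schemes are of interest. A minimal sketch with the same hypothetical coverage-style objective:

```python
from itertools import product

def optimal_string(f, X, K):
    # Exhaustively search all strings of length <= K (problem (1)).
    best = ()
    for k in range(K + 1):
        for M in product(X, repeat=k):
            if f(M) > f(best):
                best = M
    return best

f = lambda S: len(set(S))  # coverage-style toy objective
print(optimal_string(f, ('a', 'b'), 3))  # ('a', 'b')
```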
Optimal and Greedy
Optimal string: if f is prefix monotone, then there exists an optimal string of length K, denoted O_K.
Greedy string: G_k = (g_1, g_2, …, g_k) is called greedy if, for all i = 1, 2, …, k,
  g_i ∈ argmax_{g ∈ X} f((g_1, g_2, …, g_{i−1}, g)),
where argmax denotes the set of actions that maximize f((g_1, g_2, …, g_{i−1}, ·)).
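The greedy definition translates directly into code. A sketch with the same hypothetical coverage-style objective (ties broken by the order of X):

```python
def greedy_string(f, X, K):
    # Build G_K one action at a time; each step maximizes f given the prefix.
    G = ()
    for _ in range(K):
        G += (max(X, key=lambda g: f(G + (g,))),)
    return G

f = lambda S: len(set(S))  # coverage-style toy objective
print(greedy_string(f, ('a', 'b', 'c'), 2))  # ('a', 'b')
```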
Performance Bounds for Greedy Strategies
Theorem (Streeter & Golovin, 2008)
If f is string submodular and postfix monotone, then any greedy string G_K satisfies
  f(G_K)/f(O_K) ≥ 1 − (1 − 1/K)^K > 1 − e^(−1).
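The guaranteed fraction 1 − (1 − 1/K)^K decreases toward the limiting value 1 − e^(−1) ≈ 0.632 as K grows; a quick numeric check:

```python
import math

def greedy_bound(K):
    # Guaranteed fraction of optimal for a greedy string of length K.
    return 1 - (1 - 1 / K) ** K

for K in (1, 2, 5, 100):
    print(K, greedy_bound(K))
print(1 - math.exp(-1))  # limiting value, about 0.6321
```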
Performance Bounds for Greedy Strategies
Result requires f to be postfix monotone (in addition to submodular).
Recall: postfix monotone means ∀M, N ∈ X*, f(M ⊕ N) ≥ f(N).
The same result holds under the weaker condition ∀G ⪯ G_K, f(G ⊕ O_K) ≥ f(O_K) [TAC 2016].
For an even weaker condition, see [CDC 2015]. For tighter bounds with curvature, see [TAC 2016].
Applications
Bounds for the greedy strategy have many obvious applications.
Main goal from here on: apply them to optimal control and ADP.
Strings of Actions and ADP
Recall the problem: maximize an objective function with respect to a string of actions over a finite horizon.
Usual approach: dynamic programming via Bellman’s principle. Computational complexity grows exponentially.
Approximate Dynamic Programming (ADP) methods: performance guarantees remain elusive.
Example ADP: rollout.
Questions about ADP
How good is an ADP scheme relative to the optimal strategy in terms of the objective function?
Sometimes, ADP is not much better than the myopic scheme.
Sometimes, rolling out a base policy does not improve performance much.
What’s going on?
Applying Bounds to ADP
If the optimal control problem is submodular, then the greedy scheme has provably bounded performance relative to optimal.
Main idea: even if the optimal control problem is not submodular, we might still be able to use our results to bound the performance of ADP.
General Optimal Control Problem
maximize over a_1, …, a_K ∈ A:  Σ_{k=1}^K r_k(x_k, a_k)   (2)
subject to x_{k+1} = h_k(x_k, a_k), k = 1, …, K − 1.

X: set of states; x_k ∈ X: state at time k.
A: set of control actions; a_k ∈ A: control action applied at time k.
h_k : X × A → X: transition function at time k.
r_k : X × A → R_+: reward function at time k.
Optimal Solution

Define V_k(x_k, (a_k, …, a_K)) = Σ_{i=k}^K r_i(x_i, a_i) for k = 1, …, K.
The optimal control problem can be written as
  maximize over a_1, …, a_K ∈ A:  V_1(x_1, (a_1, …, a_K))
  subject to x_{k+1} = h_k(x_k, a_k), k = 1, …, K − 1.
Optimal solution: O_K = (o_1, …, o_K).
Optimal state sequence: x*_1 = x_1, x*_{k+1} = h_k(x*_k, o_k) for k = 1, …, K − 1.
Bellman’s Principle

Bellman’s principle: for k = 1, …, K,
  o_k ∈ argmax_{a ∈ A} {r_k(x*_k, a) + V_{k+1}(h_k(x*_k, a), (o_{k+1}, …, o_K))}.   (3)
V_{k+1}(h_k(x*_k, a), (o_{k+1}, …, o_K)) is the value-to-go (VTG).
Dynamic programming algorithm: use (3) to iterate backwards over time indices k = K, K − 1, …, 1, keeping the states as variables, working all the way back to k = 1.
Curse of dimensionality: merely storing the iterates V_k(·, (o_k, …, o_K)) requires an exponential amount of memory.
Myopic Scheme

Myopic solution: â_k ∈ argmax_{a ∈ A} {r_k(x̂_k, a)}, with x̂_{k+1} = h_k(x̂_k, â_k).
Define f((a_1, …, a_k)) = Σ_{i=1}^k r_i(x_i, a_i).
The myopic solution is the greedy scheme for this f.
If this f is string submodular, then the previous bounds hold. But in general, string submodularity fails for this f: the whole point of optimal control is delayed gratification, which is counter to submodularity.
ADP Schemes
Approximate Dynamic Programming (ADP): approximate the VTG V_{k+1}(h_k(x*_k, a), (o_{k+1}, …, o_K)) by some other term W_{k+1}(x̂_k, a).
Start at time k = 1 at state x̂_1 = x_1, and for each k = 1, …, K, compute the control action and state using
  â_k ∈ argmax_{a ∈ A} {r_k(x̂_k, a) + W_{k+1}(x̂_k, a)},   (4)
  x̂_{k+1} = h_k(x̂_k, â_k).
Examples of ADP

The VTG approximation W_{k+1}(x̂_k, a) can be based on a number of methods, for example:
When W_{k+1}(x̂_k, a) ≡ 0, we get the myopic scheme.
When W_{k+1}(x̂_k, a) = Σ_{i=k+1}^K r_i(x̂_i, π_b(x̂_i)), the scheme is called rollout, where
  π_b : X → A is a given base policy,
  x̂_{k+1} = h_k(x̂_k, a), and
  x̂_{i+1} = h_i(x̂_i, π_b(x̂_i)) for i = k + 1, …, K − 1.
See [JDEDS 2009] for other examples. Also, simulation optimization and AlphaGo. How to bound the performance of an ADP scheme?
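The rollout scheme can be sketched end-to-end on a toy problem. All data below — h, r, and the base policy π_b — are hypothetical choices for illustration, not from the talk:

```python
K = 4                      # horizon
ACTIONS = (0, 1)
h = lambda k, x, a: (x + a) % 3               # transition h_k (toy, time-invariant)
r = lambda k, x, a: float(x == 2) + 0.1 * a   # reward r_k
pi_b = lambda x: 1                            # base policy: always take action 1

def W(k1, x, a):
    # Rollout VTG W_{k+1}(x, a), with k1 = k + 1: simulate the base
    # policy from x_{k+1} = h_k(x, a) out to the horizon K.
    total, xi = 0.0, h(k1 - 1, x, a)
    for i in range(k1, K + 1):
        total += r(i, xi, pi_b(xi))
        xi = h(i, xi, pi_b(xi))
    return total

x, total = 0, 0.0
for k in range(1, K + 1):
    # ADP step (4): maximize current reward plus approximate VTG.
    a = max(ACTIONS, key=lambda g: r(k, x, g) + W(k + 1, x, g))
    total += r(k, x, a)
    x = h(k, x, a)
print(total)  # total reward collected by the rollout scheme
```

Note that W(K + 1, ·, ·) is automatically 0 because the simulation loop is empty, matching the convention W_{K+1}(·) ≡ 0 used later.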
Optimal Control as String Optimization

Key idea: define a string-optimization problem for which the greedy strategy is the ADP solution.
A^k: set of all strings in A with length not exceeding k.
Define f : A^K → R_+ by f(∅) = 0 and, for k = 1, …, K,
  f((a_1, a_2, …, a_k)) = Σ_{i=1}^k r_i(x_i, a_i) + W_{k+1}(x_k, a_k),
where x_{k+1} = h_k(x_k, a_k) and W_{K+1}(·) ≡ 0.
For strings of length K, f((a_1, …, a_K)) = Σ_{i=1}^K r_i(x_i, a_i). This is the original objective function in (2). Hence, the string-optimization problem defined above is equivalent to the optimal control problem (2).
ADP Solution as Greedy String

Given (g_1, …, g_{k−1}):
  g_k ∈ argmax_{g ∈ A} f((g_1, …, g_{k−1}, g))
      = argmax_{g ∈ A} {Σ_{i=1}^{k−1} r_i(x_i, g_i) + r_k(x_k, g) + W_{k+1}(x_k, g)}   (5)
      = argmax_{g ∈ A} {r_k(x_k, g) + W_{k+1}(x_k, g)}.
This is simply the ADP scheme in (4).
Hence, we have the following.

Proposition. The ADP scheme in (4) is a greedy strategy for the string-optimization problem
  maximize f((a_1, a_2, …, a_K)) subject to (a_1, a_2, …, a_K) ∈ A^K.   (6)
Applying the Bound
To apply the bound, we need some notation. Given 1 ≤ κ_1 < κ_2 < ⋯ < κ_L ≤ K, define
  β = min_{1 ≤ i ≤ L} {(L − 1)(f(G_{κ_i}) − f(G_{κ_{i−1}})) + (f(G_{κ_i}) − f(G_1))},
where G_{κ_0} = ∅. Just some number!
Theorem. If f satisfies
1. there exist 1 ≤ κ_1 < κ_2 < ⋯ < κ_L ≤ K such that V_2(x*_1, o_1) − W_2(x*_1, o_1) ≤ β, and
2. f is nondecreasing with respect to G_i, 1 ≤ i ≤ L,
then the ADP scheme G_K satisfies
  f(G_K)/f(O_K) ≥ 1 − (1 − 1/L)^L > 1 − e^(−1).
Condition 1: W_2(x*_1, o_1) and V_2(x*_1, o_1) are not too different.
Condition 2: f is monotone with respect to the greedy (sub)strings.
These can be shown to hold for specific cases; e.g., rollout [CDC 2014].
Other Ongoing Work
Easily checkable conditions under which specific ADP schemes satisfy the bound. Example: rollout [CDC 2014].
Design of ADP schemes based on these conditions.
Canonical examples that satisfy these conditions.
Tighter bounds based on curvature [TAC 2016].
More general constraint sets (string matroids), but with looser bounds [TAC 2016].
Stochastic optimal control problems (MDP and POMDP).
Questions?
[email protected] www.edwinchong.us