Introduction Background Strings of Actions The End

Performance Guarantees for Approximate Dynamic Programming Schemes
Edwin K. P. Chong, Colorado State University

IAS, 24 August 2017
Joint work with Yajing Liu and Ali Pezeshki. Partially supported by NSF grant CCF-1422658.


Motivation

Optimal decision making with multiple actions: typically computationally intractable.
Usual approach: resort to approximations and heuristics.
Downside: often no provable performance guarantees.


Motivation

In some cases, these decision problems have a special structure called submodularity.
Often, approximations/heuristics are special cases of greedy schemes.
For submodular problems, greedy schemes have provable guarantees.
Typically, greedy is > 1 − e^{−1} (≈ 63%) as good as optimal.


Outline

Background on submodularity.
Strings of actions and ADP.
We give only the simplest versions of results, enough to convey the basic ideas. No proofs!


Monotonicity and Submodularity

Function from real numbers to real numbers: f : R → R; WLOG f(0) = 0.
Monotone: ∀ x ≤ y ∈ R, f(x) ≤ f(y).
Diminishing return: ∀ x ≤ y ∈ R, ∀ z > 0, f(x + z) − f(x) ≥ f(y + z) − f(y).

[Figure: a concave curve; equal increments z applied at x and at y yield a smaller gain at y.]
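These two properties can be checked numerically; a minimal Python sketch, using √x as an illustrative concave function (the function is my choice, not from the slides):

```python
import math

def f(x):
    """A monotone function with diminishing returns (concave), f(0) = 0."""
    return math.sqrt(x)

# Monotone: x <= y implies f(x) <= f(y).
assert f(1.0) <= f(4.0)

# Diminishing return: for x <= y and z > 0,
# f(x + z) - f(x) >= f(y + z) - f(y).
x, y, z = 1.0, 4.0, 2.0
assert f(x + z) - f(x) >= f(y + z) - f(y)
```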


More General Setting

Want to go beyond the real line to a more general setting.
Specifically, want to consider objective functions with multiple decision “actions” as arguments.
Two specific settings:
  Sets of actions
  Strings (ordered sequences) of actions
Here, consider only strings of actions. Typically, consider discrete actions.


Notation and Terminology Optimization and Bounds Bounding ADP General Optimal Control Myopic Scheme ADP Schemes Bounds for ADP

Actions and Strings

Set of possible actions: X.
String of actions: A = (a1, a2, ..., ak), ai ∈ X for i = 1, 2, ..., k.
Length of string: |A| = k.
All possible strings: X* = {(a1, a2, ..., ak) | k = 0, 1, ... and ai ∈ X, i = 1, 2, ..., k}.
Empty string: ∅.
Concatenation of M = (m1, ..., mk1) and N = (n1, ..., nk2): M ⊕ N = (m1, ..., mk1, n1, ..., nk2).
Prefix: M ⪯ N if N = M ⊕ L for some L ∈ X*.
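A minimal Python sketch of this notation, representing strings of actions as tuples (an illustrative encoding, not from the slides):

```python
def concat(M, N):
    """Concatenation M ⊕ N of two strings of actions (tuples)."""
    return M + N

def is_prefix(M, N):
    """M ⪯ N iff N = M ⊕ L for some string L, i.e. M is an initial segment of N."""
    return N[:len(M)] == M

M = ("a", "b")
N = ("c",)
assert concat(M, N) == ("a", "b", "c")
assert is_prefix(M, concat(M, N))      # M ⪯ M ⊕ N
assert is_prefix((), M)                # the empty string is a prefix of everything
assert not is_prefix(("b",), M)
```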


String Functions

Function from strings to real numbers: f : X* → R; WLOG f(∅) = 0.
Prefix monotone: ∀ M ⪯ N ∈ X*, f(M) ≤ f(N).
Diminishing return: ∀ M ⪯ N ∈ X*, ∀ a ∈ X, f(M ⊕ (a)) − f(M) ≥ f(N ⊕ (a)) − f(N).
f is string submodular if both of the above hold.
Postfix monotone: ∀ M, N ∈ X*, f(M ⊕ N) ≥ f(N).

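For tiny action sets, both defining properties can be verified by brute force; a hypothetical sketch, using a coverage-style objective as the illustrative example (neither the checker nor the example is from the slides):

```python
from itertools import product

def is_string_submodular(f, actions, max_len):
    """Brute-force check of prefix monotonicity and diminishing returns
    over all strings up to length max_len (only feasible for tiny examples)."""
    strings = [()] + [s for k in range(1, max_len + 1)
                      for s in product(actions, repeat=k)]
    for M in strings:
        for tail in strings:
            N = M + tail                  # every pair with M ⪯ N
            if len(N) > max_len:
                continue
            if f(M) > f(N):               # prefix monotonicity fails
                return False
            for a in actions:
                if f(M + (a,)) - f(M) < f(N + (a,)) - f(N):
                    return False          # diminishing return fails
    return True

# Coverage objective: each action covers a set of elements; f counts
# the distinct elements covered so far. This is string submodular.
cover = {"a": {1, 2}, "b": {2, 3}}
f = lambda A: len(set().union(*(cover[x] for x in A))) if A else 0
assert is_string_submodular(f, ("a", "b"), max_len=2)

# A counterexample: len(A)^2 has *increasing* returns.
g = lambda A: len(A) ** 2
assert not is_string_submodular(g, ("a", "b"), max_len=2)
```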


Optimization

maximize f(M)
subject to M ∈ X*, |M| ≤ K.   (1)

This constraint set is called a uniform string-matroid of rank K.
Can extend to more general string-matroids [TAC 2016].


Optimal and Greedy

Optimal string: if f is prefix monotone, then there exists an optimal string of length K, denoted OK.
Greedy string: Gk = (g1, g2, ..., gk) is called greedy if, for all i = 1, 2, ..., k,
  gi ∈ argmax_{g ∈ X} f((g1, g2, ..., gi−1, g)),
where argmax denotes the set of actions that maximize f((g1, g2, ..., gi−1, ·)).

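The greedy construction can be sketched directly from this definition; the coverage objective below is an illustrative example of my choosing, not from the slides:

```python
def greedy_string(f, actions, K):
    """Greedy construction of G_K = (g_1, ..., g_K):
    g_i maximizes f((g_1, ..., g_{i-1}, g)) over g in the action set."""
    G = ()
    for _ in range(K):
        g = max(actions, key=lambda a: f(G + (a,)))
        G += (g,)
    return G

# Toy coverage objective: each action covers a set of elements;
# f counts the distinct elements covered so far (f(∅) = 0).
cover = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
f = lambda A: len(set().union(*(cover[a] for a in A))) if A else 0

G2 = greedy_string(f, list(cover), 2)
assert f(G2) == 3  # e.g. "a" then "b" covers elements {1, 2, 3}
```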


Performance Bounds for Greedy Strategies

Theorem (Streeter & Golovin, 2008)
If f is string submodular and postfix monotone, then any greedy string GK satisfies
  f(GK)/f(OK) ≥ 1 − (1 − 1/K)^K > 1 − e^{−1}.

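The factor 1 − (1 − 1/K)^K can be tabulated to see that it always stays strictly above 1 − e^{−1} and approaches it as K grows; a quick numerical check:

```python
import math

def bound(K):
    """Guarantee factor 1 - (1 - 1/K)^K from the Streeter-Golovin bound."""
    return 1 - (1 - 1 / K) ** K

# (1 - 1/K)^K increases toward e^{-1}, so bound(K) decreases toward
# 1 - e^{-1} ≈ 0.632 while remaining strictly above it.
for K in (1, 2, 10, 100):
    assert bound(K) > 1 - math.exp(-1)
assert bound(2) == 0.75
assert abs(bound(10**6) - (1 - math.exp(-1))) < 1e-6
```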


Performance Bounds for Greedy Strategies

Result requires f to be postfix monotone (in addition to submodular).
Recall: postfix monotone means ∀ M, N ∈ X*, f(M ⊕ N) ≥ f(N).
The same result holds under the weaker condition ∀ G ⪯ GK, f(G ⊕ OK) ≥ f(OK) [TAC 2016].
For an even weaker condition, see [CDC 2015].
For tighter bounds with curvature, see [TAC 2016].


Applications

Bounds for greedy strategy have many obvious applications. Main goal from here on: Apply to optimal control and ADP.


Strings of Actions and ADP

Recall problem: maximize an objective function with respect to a string of actions over a finite horizon.
Usual approach: dynamic programming via Bellman’s principle. Computational complexity grows exponentially.
Approximate Dynamic Programming (ADP) methods: performance guarantees remain elusive.
Example ADP: rollout.


Questions about ADP

How good is an ADP scheme relative to the optimal strategy in terms of the objective function?
Sometimes, ADP is not much better than the myopic scheme.
Sometimes, rolling out a base policy does not improve performance much.
What’s going on?


Applying Bounds to ADP

If optimal control problem is submodular, then greedy scheme has provably bounded performance relative to optimal. Main idea: Even if optimal control problem is not submodular, might still be able to use our results to bound the performance of ADP.


General Optimal Control Problem

maximize over a1, ..., aK ∈ A:  Σ_{k=1}^{K} rk(xk, ak)   (2)
subject to x_{k+1} = hk(xk, ak), k = 1, ..., K − 1.

X: set of states; xk ∈ X: state at time k.
A: set of control actions; ak ∈ A: control action applied at time k.
hk : X × A → X: transition function at time k.
rk : X × A → R+: reward function at time k.
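The objective in (2) can be evaluated by forward simulation of the dynamics; a minimal sketch, where the scalar system h and reward r are hypothetical stand-ins for hk and rk:

```python
def total_reward(x1, actions, h, r):
    """Evaluate the finite-horizon objective sum_k r_k(x_k, a_k)
    under the dynamics x_{k+1} = h_k(x_k, a_k), starting from x1."""
    x, total = x1, 0.0
    for k, a in enumerate(actions, start=1):
        total += r(k, x, a)
        x = h(k, x, a)
    return total

# Hypothetical scalar system: state accumulates actions, reward is x * a.
h = lambda k, x, a: x + a
r = lambda k, x, a: x * a
assert total_reward(1.0, (1.0, 2.0), h, r) == 5.0  # 1*1 at k=1, then 2*2 at k=2
```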


Optimal Solution

Define Vk(xk, (ak, ..., aK)) = Σ_{i=k}^{K} ri(xi, ai) for k = 1, ..., K.

Optimal control problem can be written as
  maximize over a1, ..., aK ∈ A:  V1(x1, (a1, ..., aK))
  subject to x_{k+1} = hk(xk, ak), k = 1, ..., K − 1.

Optimal solution: OK = (o1, ..., oK).
Optimal state sequence: x∗1 = x1, x∗_{k+1} = hk(x∗k, ok) for k = 1, ..., K − 1.


Bellman’s Principle

Bellman’s principle: for k = 1, ..., K,
  ok ∈ argmax_{a ∈ A} { rk(x∗k, a) + V_{k+1}(hk(x∗k, a), (o_{k+1}, ..., oK)) }.   (3)

V_{k+1}(hk(x∗k, a), (o_{k+1}, ..., oK)): value-to-go (VTG).

Dynamic Programming Algorithm: use (3) to iterate backwards over time indices k = K, K − 1, ..., 1, keeping the states as variables, working all the way back to k = 1.
Curse of dimensionality: merely storing the iterates Vk(·, (ok, ..., oK)) requires an exponential amount of memory.
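The backward iteration can be sketched concretely for a finite state and action space; the two-state system below is purely illustrative, not from the slides:

```python
def dp_backward(states, actions, h, r, K):
    """Exact DP: compute value functions V_k by iterating Bellman's
    principle backwards from k = K to k = 1 over a finite state space."""
    V = {x: 0.0 for x in states}              # V_{K+1} ≡ 0
    policy = {}
    for k in range(K, 0, -1):
        Vk = {}
        for x in states:
            best = max(actions, key=lambda a: r(k, x, a) + V[h(k, x, a)])
            Vk[x] = r(k, x, best) + V[h(k, x, best)]
            policy[(k, x)] = best
        V = Vk                                # V now holds V_k
    return V, policy

# Tiny hypothetical 2-state system: staying in state 1 pays 1 per stage;
# "move" toggles the state and pays nothing.
states, actions = (0, 1), ("stay", "move")
h = lambda k, x, a: x if a == "stay" else 1 - x
r = lambda k, x, a: 1.0 if (x == 1 and a == "stay") else 0.0

V1, policy = dp_backward(states, actions, h, r, K=3)
assert V1[1] == 3.0  # stay in state 1 for all three stages
assert V1[0] == 2.0  # move to state 1 once, then stay twice
```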


Myopic Scheme

Myopic solution:
  âk ∈ argmax_{a ∈ A} { rk(x̂k, a) },
  x̂_{k+1} = hk(x̂k, âk).

Define f((a1, ..., ak)) = Σ_{i=1}^{k} ri(xi, ai).

The myopic solution is the greedy scheme for this f.
If this f is string submodular, then the previous bounds hold.
But in general, string submodularity fails for this f.
The whole point of optimal control is delayed gratification, which is counter to submodularity.


ADP Schemes

Approximate Dynamic Programming (ADP): approximate the VTG V_{k+1}(hk(x∗k, a), (o_{k+1}, ..., oK)) by some other term W_{k+1}(x̂k, a).

Start at time k = 1 at state x̂1 = x1, and for each k = 1, ..., K, compute the control action and state using
  âk ∈ argmax_{a ∈ A} { rk(x̂k, a) + W_{k+1}(x̂k, a) },   (4)
  x̂_{k+1} = hk(x̂k, âk).
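The scheme in (4) is a single forward pass; a generic sketch, where the VTG approximation W and the scalar dynamics are illustrative stand-ins of my choosing:

```python
def adp_forward(x1, actions, h, r, W, K):
    """Generic ADP scheme (4): at each stage pick an action maximizing
    immediate reward plus an approximate value-to-go W(k+1, x, a)."""
    x, plan, total = x1, [], 0.0
    for k in range(1, K + 1):
        a = max(actions, key=lambda u: r(k, x, u) + W(k + 1, x, u))
        plan.append(a)
        total += r(k, x, a)
        x = h(k, x, a)
    return tuple(plan), total

# With W ≡ 0 this reduces to the myopic scheme.
h = lambda k, x, a: x + a
r = lambda k, x, a: x * a
myopic_W = lambda k1, x, a: 0.0

plan, total = adp_forward(1.0, (0.0, 1.0), h, r, myopic_W, K=2)
assert plan == (1.0, 1.0) and total == 3.0  # rewards 1*1 then 2*1
```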


Examples of ADP

The VTG approximation W_{k+1}(x̂k, a) can be based on a number of methods, for example:
  When W_{k+1}(x̂k, a) ≡ 0, we get the myopic scheme.
  W_{k+1}(x̂k, a) = Σ_{i=k+1}^{K} ri(x̂i, πb(x̂i)), called rollout, where
    πb : X → A is a given base policy,
    x̂_{k+1} = hk(x̂k, a),
    x̂_{i+1} = hi(x̂i, πb(x̂i)) for i = k + 1, ..., K − 1.

See [JDEDS 2009] for other examples. Also, simulation optimization and AlphaGo.
How to bound the performance of an ADP scheme?
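The rollout VTG can be sketched as a closure that simulates the base policy out to the horizon; the scalar system and the always-play-1 base policy below are hypothetical examples, not from the slides:

```python
def rollout_W(h, r, base_policy, K):
    """Rollout value-to-go: after trying action a at stage k, follow the
    base policy to the horizon and sum the remaining rewards."""
    def W(k1, x, a):                  # k1 = k + 1
        x_i = h(k1 - 1, x, a)         # x̂_{k+1} = h_k(x̂_k, a)
        total = 0.0
        for i in range(k1, K + 1):
            u = base_policy(x_i)
            total += r(i, x_i, u)
            x_i = h(i, x_i, u)
        return total
    return W

# Hypothetical scalar system; the base policy always plays 1.0.
h = lambda k, x, a: x + a
r = lambda k, x, a: x * a
W = rollout_W(h, r, lambda x: 1.0, K=3)

# From x = 1 with a = 1 at stage 1: x2 = 2; base policy then earns
# 2*1 at stage 2 and 3*1 at stage 3.
assert W(2, 1.0, 1.0) == 5.0
```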


Optimal Control as String Optimization

Key idea: define a string-optimization problem for which the greedy strategy is the ADP solution.

A^k: set of all strings in A with length not exceeding k.
Define f : A^K → R+ by f(∅) = 0 and, for k = 1, ..., K,
  f((a1, a2, ..., ak)) = Σ_{i=1}^{k} ri(xi, ai) + W_{k+1}(xk, ak),
where x_{k+1} = hk(xk, ak) and W_{K+1}(·) ≡ 0.

For strings of length K, f((a1, ..., aK)) = Σ_{i=1}^{K} ri(xi, ai). This is the original objective function in (2).
Hence, the string-optimization problem defined above is equivalent to the optimal control problem (2).


ADP Solution as Greedy String

Given (g1, ..., g_{k−1}):
  gk ∈ argmax_{g ∈ A} f((g1, ..., g_{k−1}, g))
     ∈ argmax_{g ∈ A} { Σ_{i=1}^{k−1} ri(xi, gi) + rk(xk, g) + W_{k+1}(xk, g) }   (5)
     ∈ argmax_{g ∈ A} { rk(xk, g) + W_{k+1}(xk, g) },
since the sum Σ_{i=1}^{k−1} ri(xi, gi) does not depend on g.

This is simply the ADP scheme in (4).


ADP Solution as Greedy String

Hence, we have the following.

Proposition. The ADP scheme in (4) is a greedy strategy for the string-optimization problem
  maximize f((a1, a2, ..., aK))
  subject to (a1, a2, ..., aK) ∈ A^K.   (6)


Applying the Bound

To apply the bound, we need some notation. Given 1 ≤ κ1 < κ2 < · · · < κL ≤ K, define
  β = min_{1 ≤ i ≤ L} { (L − 1)(f(Gκi) − f(Gκ_{i−1})) + (f(Gκi) − f(G1)) },
where Gκ0 = ∅. Just some number!
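β can be computed directly from the greedy values f(Gκi); a small sketch, where the numeric values are purely illustrative placeholders:

```python
def beta(f_vals, f_G1):
    """β = min over 1 <= i <= L of
    (L-1)*(f(G_{κ_i}) - f(G_{κ_{i-1}})) + (f(G_{κ_i}) - f(G_1)),
    where f_vals = [f(G_{κ_0}), f(G_{κ_1}), ..., f(G_{κ_L})] with f(G_{κ_0}) = 0."""
    L = len(f_vals) - 1
    return min(
        (L - 1) * (f_vals[i] - f_vals[i - 1]) + (f_vals[i] - f_G1)
        for i in range(1, L + 1)
    )

# Illustrative numbers only (hypothetical greedy values, L = 3):
# terms are 2*2+0 = 4, 2*1+1 = 3, 2*0.5+1.5 = 2.5, so β = 2.5.
assert beta([0.0, 2.0, 3.0, 3.5], f_G1=2.0) == 2.5
```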


Applying the Bound

Theorem. If f satisfies
  1. there exist 1 ≤ κ1 < κ2 < · · · < κL ≤ K such that V2(x∗1, o1) − W2(x∗1, o1) ≤ β,
  2. f is nondecreasing with respect to Gi, 1 ≤ i ≤ L,
then the ADP scheme GK satisfies
  f(GK)/f(OK) ≥ 1 − (1 − 1/L)^L > 1 − e^{−1}.


Applying the Bound

Condition 1: W2(x∗1, o1) and V2(x∗1, o1) are not too different.
Condition 2: monotone with respect to the greedy (sub)string.
Can be shown to hold for specific cases; e.g., rollout [CDC 2014].


Other Ongoing Work

Easily checkable conditions under which specific ADP schemes satisfy the bound. Example: rollout [CDC 2014].
Design of ADP schemes based on these conditions.
Canonical examples that satisfy these conditions.
Tighter bounds based on curvature [TAC 2016].
More general constraint sets (string matroids), but with looser bounds [TAC 2016].
Stochastic optimal control problems (MDP and POMDP).


Questions?

[email protected] www.edwinchong.us
