Bayes-Adaptive POMDPs

Stéphane Ross†, Brahim Chaib-draa‡ and Joelle Pineau†

† School of Computer Science, McGill University, Montreal, Canada

‡ Department of Computer Science, Laval University, Quebec City, Canada

Motivation

Many real-world problems can be modeled by POMDPs, e.g. robot navigation, helicopter control, machine maintenance, car driving systems, dialogue management, etc.

In practice, the exact parameters describing these systems are usually unknown or uncertain a priori!

We need approaches that can:
• Learn the POMDP parameters as experience is acquired.
• Plan while taking the uncertainty on the POMDP parameters into account.

How should an agent behave in order to maximize its expected long-term return given an uncertain model? Bayesian Reinforcement Learning addresses this problem in MDPs. ⇒ We would like to extend these ideas to POMDPs.

Background

MDP: (S, A, T, R)
• S: Set of states
• A: Set of actions
• T(s, a, s′) = Pr(s′|s, a), the transition probabilities
• R(s, a) ∈ ℝ, the immediate rewards

Bayesian Reinforcement Learning in MDPs:

Assume the transition function T is the only unknown.
• Define a prior Pr(T).
• Maintain the posterior Pr(T | s_1, a_1, s_2, a_2, . . . , a_{t−1}, s_t) via Bayes' rule.
• Act so as to maximize the expected return given the current posterior and how it will evolve.

If the prior Pr(T(s, a, ·)) is Dirichlet, then the posterior is also Dirichlet. The Dirichlet distribution is parametrised by the counts φ^a_{ss′} of the number of times each transition s → s′ under action a was seen.

[Figure: example Dirichlet densities Dir(1,1), Dir(3,3), Dir(2,5), Dir(10,1) and Dir(75,25), plotted as a function of the parameter p.]
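To make this count-based posterior concrete, here is a minimal Python sketch (not part of the poster; the class, state and action names are illustrative): Bayes' rule for a Dirichlet prior reduces to incrementing the matching count, and the posterior-mean transition model is just the normalised counts.

```python
from collections import defaultdict

class DirichletTransitionPosterior:
    """Posterior over T(s, a, .) as independent Dirichlet distributions,
    parametrised by transition counts phi[(s, a)][s']."""

    def __init__(self, states, prior_count=1.0):
        self.states = list(states)
        # phi[(s, a)][s'] = prior pseudo-count + number of observed s -> s' under a
        self.phi = defaultdict(lambda: {s2: prior_count for s2 in self.states})

    def update(self, s, a, s_next):
        """Bayes' rule for a Dirichlet prior: increment the matching count."""
        self.phi[(s, a)][s_next] += 1.0

    def expected_T(self, s, a):
        """Posterior mean of T(s, a, .): normalised counts."""
        counts = self.phi[(s, a)]
        total = sum(counts.values())
        return {s2: c / total for s2, c in counts.items()}

# Hypothetical usage on a 2-state problem:
post = DirichletTransitionPosterior(states=["s1", "s2"])
post.update("s1", "a", "s2")
print(post.expected_T("s1", "a"))  # {'s1': 1/3, 's2': 2/3}
```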

This can be framed as an infinite-state MDP (S′, A′, T′, R′) (called the Bayes-Adaptive MDP) as follows:
• S′ = S × ℕ^{|S|²|A|}
• A′ = A
• T′(s, φ, a, s′, φ′) = (φ^a_{ss′} / Σ_{s′′} φ^a_{ss′′}) · I(φ + δ^a_{ss′}, φ′)
• R′(s, φ, a) = R(s, a)

POMDPs: (S, A, T, R, Z, O)
• Z: Set of observations
• O(s′, a, z) = Pr(z|s′, a), the observation probabilities

The agent only observes some z ∈ Z at every step; it can maintain a belief over the state via Bayes' rule.

Belief monitoring: b_t(s′) = η O(s′, a_{t−1}, z_{t−1}) Σ_{s∈S} T(s, a_{t−1}, s′) b_{t−1}(s)

Bellman equation: V*(b) = max_{a∈A} [ R(b, a) + γ Σ_{z∈Z} Pr(z|b, a) V*(τ(b, a, z)) ]
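The belief-monitoring equation above translates almost directly into code. A hedged sketch, assuming T and O are given as nested dictionaries keyed by (state, action) (names and layout are illustrative, not from the poster):

```python
def belief_update(b, a, z, T, O, states):
    """b_t(s') = eta * O(s', a, z) * sum_s T(s, a, s') * b_{t-1}(s)."""
    new_b = {}
    for s_next in states:
        new_b[s_next] = O[(s_next, a)][z] * sum(T[(s, a)][s_next] * b[s] for s in states)
    eta = sum(new_b.values())  # normalisation constant
    if eta == 0.0:
        raise ValueError("Observation %r has zero probability under this belief" % (z,))
    return {s: p / eta for s, p in new_b.items()}
```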

Bayes-Adaptive POMDP

Assume a POMDP model where T and O are the only unknowns.

Let φ^a_{ss′} = the number of times the transition s → s′ under action a has occurred, and ψ^a_{s′z} = the number of times observation z has occurred after moving to state s′ by doing a.
• φ and ψ specify the full joint Dirichlet posterior over the POMDP model.
• However, their values are unknown, since the state is not observed.
• But we can still maintain a belief over their values via Bayes' rule.

Model: M = (S′, A, Z, P′, R′)
• S′ = S × ℕ^{|S|²|A|} × ℕ^{|S||A||Z|}
• P′(s, φ, ψ, a, s′, φ′, ψ′, z) = (φ^a_{ss′} / Σ_{s′′} φ^a_{ss′′}) · (ψ^a_{s′z} / Σ_{z′} ψ^a_{s′z′}) · I(φ + δ^a_{ss′}, φ′) · I(ψ + δ^a_{s′z}, ψ′)
• R′(s, φ, ψ, a) = R(s, a)

P′ is the joint transition-observation function. The initial belief is b′_0(s, φ, ψ) = b_0(s) I(φ, φ_0) I(ψ, ψ_0), where φ_0 and ψ_0 are the prior on the POMDP model.

• We now have an infinite-state POMDP with a known model.
• The belief in the BAPOMDP is a distribution over both the state and the model (a mixture of Dirichlets).
• The probability density will concentrate over the most likely (s, T, O) given the prior and the action-observation sequence.

Example: Belief in Tiger with unknown sensor accuracy.
b_0: Pr(L, ⟨5, 3⟩) = 1/2; Pr(R, ⟨5, 3⟩) = 1/2
  →(l, l) b_1: Pr(L, ⟨6, 3⟩) = 5/8; Pr(R, ⟨5, 4⟩) = 3/8
  →(l, l) b_2: Pr(L, ⟨7, 3⟩) = 5/7; Pr(R, ⟨5, 5⟩) = 2/7
  →(l, l) b_3: Pr(L, ⟨8, 3⟩) = 7/9; Pr(R, ⟨5, 6⟩) = 2/9
  →(r) b_4: Pr(L, ⟨8, 3⟩) = 7/18; Pr(L, ⟨5, 6⟩) = 2/18; Pr(R, ⟨8, 3⟩) = 7/18; Pr(R, ⟨5, 6⟩) = 2/18

[Figure: evolution of the belief components b_0(L,·), …, b_4(L,·) and b_3(R,·) as a function of the sensor accuracy in the Tiger example.]
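A sketch of the corresponding exact BAPOMDP belief update, representing b as a weighted set of (s, φ, ψ) particles. The data layout (count dictionaries frozen into hashable keys) is an assumption made for illustration, and strictly positive prior counts are assumed, as a Dirichlet requires. Each particle spawns up to |S| successors per step, which is exactly the O(|S|^{t+1}) growth addressed in the Approximate Monitoring and Planning section below.

```python
def bapomdp_belief_update(belief, a, z, states, observations):
    """Exact BAPOMDP belief update.

    belief: dict mapping (s, phi, psi) -> probability, where phi is a dict
    {(s, a, s'): count} and psi is a dict {(s', a, z): count}, both stored
    as frozensets of items so they can be used as dictionary keys.
    """
    new_belief = {}
    for (s, phi_key, psi_key), w in belief.items():
        phi, psi = dict(phi_key), dict(psi_key)
        n_phi = sum(phi.get((s, a, s2), 0) for s2 in states)
        if n_phi == 0:
            continue  # assumes positive prior counts, so this should not happen
        for s2 in states:
            n_psi = sum(psi.get((s2, a, z2), 0) for z2 in observations)
            if n_psi == 0:
                continue
            # weight of the successor particle: P'(s, phi, psi, a, s2, ., ., z)
            p = w * (phi.get((s, a, s2), 0) / n_phi) * (psi.get((s2, a, z), 0) / n_psi)
            if p == 0.0:
                continue
            phi2, psi2 = dict(phi), dict(psi)
            phi2[(s, a, s2)] = phi2.get((s, a, s2), 0) + 1   # phi + delta^a_{s s'}
            psi2[(s2, a, z)] = psi2.get((s2, a, z), 0) + 1   # psi + delta^a_{s' z}
            key = (s2, frozenset(phi2.items()), frozenset(psi2.items()))
            new_belief[key] = new_belief.get(key, 0.0) + p
    eta = sum(new_belief.values())
    return {k: v / eta for k, v in new_belief.items()}
```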

Approximating the BAPOMDP Model: Theoretical Results

Define N^{sa}_φ = Σ_{s′∈S} φ^a_{ss′}, N^{sa}_ψ = Σ_{z∈Z} ψ^a_{sz}, D^{sa}_S(φ, φ′) = Σ_{s′∈S} |φ^a_{ss′}/N^{sa}_φ − φ′^a_{ss′}/N^{sa}_{φ′}| and D^{sa}_Z(ψ, ψ′) = Σ_{z∈Z} |ψ^a_{sz}/N^{sa}_ψ − ψ′^a_{sz}/N^{sa}_{ψ′}|.

Theorem 1:

sup_{α_t∈Γ_t, s∈S} |α_t(s, φ, ψ) − α_t(s, φ′, ψ′)| ≤ (2γ‖R‖_∞ / (1−γ)²) · sup_{s,s′∈S, a∈A} [ D^{sa}_S(φ, φ′) + D^{s′a}_Z(ψ, ψ′) + (4 / ln(γ^{−e})) ( Σ_{s′′∈S} |φ^a_{ss′′} − φ′^a_{ss′′}| / ((N^{sa}_φ + 1)(N^{sa}_{φ′} + 1)) + Σ_{z∈Z} |ψ^a_{s′z} − ψ′^a_{s′z}| / ((N^{s′a}_ψ + 1)(N^{s′a}_{ψ′} + 1)) ) ]

Given any ε > 0, define ε′ = (1−γ)² ε / (8γ‖R‖_∞), ε′′ = (1−γ)² ε ln(γ^{−e}) / (32γ‖R‖_∞), N_S = max( |S|(1+ε′)/ε′, 1/ε′′ − 1 ) and N_Z = max( |Z|(1+ε′)/ε′, 1/ε′′ − 1 ).

Finite POMDP Approximation: M̃ = (S̃, A, Z, P̃, R̃)
• S̃ = S × {φ ∈ ℕ^{|S|²|A|} | ∀s, a : 0 < N^{sa}_φ ≤ N_S} × {ψ ∈ ℕ^{|S||A||Z|} | ∀s, a : 0 < N^{sa}_ψ ≤ N_Z}
• P̃(s, φ, ψ, a, s′, φ′, ψ′, z) = (φ^a_{ss′} / Σ_{s′′} φ^a_{ss′′}) · (ψ^a_{s′z} / Σ_{z′} ψ^a_{s′z′}) · I(P(s′, φ + δ^a_{ss′}, ψ + δ^a_{s′z}), (s′, φ′, ψ′))
• R̃(s, φ, ψ, a) = R(s, a)
where P : S′ → S̃ returns the state in S̃ minimizing the bound of Theorem 1.

Theorem 2:

|α̃_t(P(s, φ, ψ)) − α_t(s, φ, ψ)| < ε/(1−γ) for α̃_t computed with M̃ and α_t, its corresponding α-vector, computed with M.
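To make the quantities in Theorem 1 concrete, here is a hedged sketch of the bracketed term of the bound for a fixed (s, s′, a), with the relevant rows of φ and ψ passed as aligned lists of counts (the leading factor 2γ‖R‖_∞/(1−γ)² and the sup over s, s′, a are left to the caller). This is the kind of quantity that the Weighted Distance Minimization filter in the next section tries to keep small; names and the list-based layout are assumptions for illustration.

```python
import math

def theorem1_inner_term(phi_row, phi2_row, psi_row, psi2_row, gamma):
    """D_S(phi, phi') + D_Z(psi, psi') plus the count-difference correction,
    for one (s, a) row of phi and one (s', a) row of psi (aligned count lists,
    all sums assumed positive, 0 < gamma < 1)."""
    n_phi, n_phi2 = sum(phi_row), sum(phi2_row)
    n_psi, n_psi2 = sum(psi_row), sum(psi2_row)
    d_s = sum(abs(p / n_phi - q / n_phi2) for p, q in zip(phi_row, phi2_row))
    d_z = sum(abs(p / n_psi - q / n_psi2) for p, q in zip(psi_row, psi2_row))
    correction = (4.0 / math.log(gamma ** -math.e)) * (
        sum(abs(p - q) for p, q in zip(phi_row, phi2_row)) / ((n_phi + 1) * (n_phi2 + 1))
        + sum(abs(p - q) for p, q in zip(psi_row, psi2_row)) / ((n_psi + 1) * (n_psi2 + 1))
    )
    return d_s + d_z + correction
```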

Approximate Monitoring and Planning

Belief Monitoring:

Problem: Computing b_t exactly in the BAPOMDP is in O(|S|^{t+1}).

Particle Filters:
• Monte Carlo: Perform the belief update by sampling K particles and state transitions (see the sketch after this list).
• K Most Likely: After each belief update, keep only the K particles with highest probability.
• Weighted Distance Minimization: After each belief update, use an approximate algorithm that tries to keep the K particles which minimize a weighted sum of the bound in Theorem 1.
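One plausible reading of the Monte Carlo variant, reusing the particle layout of the exact-update sketch above (function and variable names are mine, not the authors'): sample K particles from the current belief, sample a state transition for each from its φ counts, and weight the result by the observation probability from its ψ counts.

```python
import random

def monte_carlo_belief_update(belief, a, z, states, observations, K):
    """Approximate BAPOMDP belief update with K sampled particles.
    Assumes positive prior counts and that z has nonzero probability
    under at least one sampled particle."""
    particles, weights = list(belief.keys()), list(belief.values())
    new_belief = {}
    for _ in range(K):
        s, phi_key, psi_key = random.choices(particles, weights=weights, k=1)[0]
        phi, psi = dict(phi_key), dict(psi_key)
        # Sample s' from the transition distribution encoded by the phi counts.
        succ_w = [phi.get((s, a, s2), 0) for s2 in states]
        s2 = random.choices(states, weights=succ_w, k=1)[0]
        # Weight the particle by the probability of observing z in s'.
        n_psi = sum(psi.get((s2, a, z2), 0) for z2 in observations)
        w = psi.get((s2, a, z), 0) / n_psi
        if w == 0.0:
            continue
        phi[(s, a, s2)] = phi.get((s, a, s2), 0) + 1
        psi[(s2, a, z)] = psi.get((s2, a, z), 0) + 1
        key = (s2, frozenset(phi.items()), frozenset(psi.items()))
        new_belief[key] = new_belief.get(key, 0.0) + w
    eta = sum(new_belief.values())
    return {k: v / eta for k, v in new_belief.items()}
```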

Planning:
• Online planning via a D-step lookahead search from the current belief of the agent (a sketch follows this list).
• The particle filter is used to compute the reachable beliefs.
• An approximate value is obtained for each action in the current belief via Bellman backups in the tree; the best action is executed.
• The planning algorithm is executed at every step.
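A hedged sketch of a D-step lookahead of this kind: a generic online expectimax over the belief tree, not the authors' exact implementation. Here reward_fn, obs_prob_fn and update_fn are assumed helpers (e.g. the particle-filter update above for update_fn) supplied by the caller.

```python
def lookahead_value(belief, depth, actions, observations, gamma,
                    reward_fn, obs_prob_fn, update_fn):
    """Estimate V(belief) by a depth-limited search over actions and observations.

    reward_fn(belief, a)      -> expected immediate reward under the belief
    obs_prob_fn(belief, a, z) -> Pr(z | belief, a)
    update_fn(belief, a, z)   -> next belief (e.g. the particle filter above)
    """
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        q = reward_fn(belief, a)
        for z in observations:
            p = obs_prob_fn(belief, a, z)
            if p > 0.0:
                q += gamma * p * lookahead_value(update_fn(belief, a, z), depth - 1,
                                                 actions, observations, gamma,
                                                 reward_fn, obs_prob_fn, update_fn)
        best = max(best, q)
    return best

def plan_action(belief, depth, actions, observations, gamma,
                reward_fn, obs_prob_fn, update_fn):
    """Run the D-step lookahead from the current belief and return the greedy action."""
    def q_value(a):
        r = reward_fn(belief, a)
        for z in observations:
            p = obs_prob_fn(belief, a, z)
            if p > 0.0:
                r += gamma * p * lookahead_value(update_fn(belief, a, z), depth - 1,
                                                 actions, observations, gamma,
                                                 reward_fn, obs_prob_fn, update_fn)
        return r
    return max(actions, key=q_value)
```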

Experimental Results

Tiger: (Littman et al. 1995)

Here we consider only the sensor accuracy to be unknown. We start with a prior (ψ_{Ll}, ψ_{Lr}, ψ_{Rl}, ψ_{Rr}) = (5, 3, 3, 5), i.e. 62.5% sensor accuracy in each state.

[Figure: Return, WL1 and Planning Time/Action (ms) per episode on Tiger, comparing the exact model, the prior model, Most Probable (2), Monte Carlo (64) and Weighted Distance (2), with planning-time bars for MP (2), MC (64), WD (2) and WD (16).]

Follow:

A robot has to follow an individual with a priori unknown behavior. There are 2 possible individuals with different state-time-invariant behaviors. The current individual is unknown and can only be inferred through its behavior. Here we consider only the behavior of the individuals to be unknown. We start with a prior (φ1_{NA}, φ1_N, φ1_E, φ1_S, φ1_W) = (2, 3, 1, 2, 2) for person 1 and (φ2_{NA}, φ2_N, φ2_E, φ2_S, φ2_W) = (2, 1, 3, 2, 2) for person 2, while person 1 moves with Pr = (0.3, 0.4, 0.2, 0.05, 0.05) and person 2 with Pr = (0.1, 0.05, 0.8, 0.03, 0.02).

[Figure: Return, WL1 and Planning Time/Action (ms) per episode on Follow, comparing the exact model, the prior model, Most Probable (64), Monte Carlo (64) and Weighted Distance (16).]
