Bayes-Adaptive POMDPs
Stéphane Ross†, Brahim Chaib-draa‡ and Joelle Pineau†
† School of Computer Science, McGill University, Montreal, Canada
‡ Department of Computer Science, Laval University, Quebec City, Canada
Motivation

Many real-world problems can be modeled by POMDPs, e.g. robot navigation, helicopter control, machine maintenance, car driving systems, dialogue management, etc.

In practice, the exact parameters describing these systems are usually unknown or uncertain a priori!

We need approaches that can:
• Learn the POMDP parameters as experience is acquired.
• Plan considering the uncertainty on the POMDP parameters.

How should an agent behave in order to maximize its expected long-term return given an uncertain model? Bayesian Reinforcement Learning addresses this problem in MDPs. ⇒ We would like to extend these ideas to POMDPs.
Background

MDP: (S, A, T, R)
• S: set of states
• A: set of actions
• T(s, a, s') = Pr(s'|s, a), the transition probabilities
• R(s, a) ∈ ℝ, the immediate rewards

POMDPs: (S, A, T, R, Z, O)
• Z: set of observations
• O(s', a, z) = Pr(z|s', a), the observation probabilities
The agent only observes some z ∈ Z at every step; it can maintain a belief over the state via Bayes' rule.
Belief monitoring: b_t(s') = η O(s', a_{t−1}, z_{t−1}) Σ_{s∈S} T(s, a_{t−1}, s') b_{t−1}(s)
Bellman equation: V*(b) = max_{a∈A} [ R(b, a) + γ Σ_{z∈Z} Pr(z|b, a) V*(τ(b, a, z)) ]
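The belief monitoring equation can be implemented directly when T and O are known. Below is a minimal Python sketch; the dictionary layout for b, T and O is an assumption made for illustration, not the authors' implementation.

```python
# Minimal sketch of exact POMDP belief monitoring:
#   b_t(s') = eta * O(s', a, z) * sum_s T(s, a, s') * b_{t-1}(s)
# Assumed (hypothetical) layout: T[s][a][s2] = Pr(s2|s, a), O[s2][a][z] = Pr(z|s2, a).

def belief_update(b, a, z, states, T, O):
    new_b = {}
    for s2 in states:
        new_b[s2] = O[s2][a][z] * sum(T[s][a][s2] * b[s] for s in states)
    eta = sum(new_b.values())                 # normalization constant
    if eta == 0.0:
        raise ValueError("observation z has zero probability under belief b")
    return {s2: p / eta for s2, p in new_b.items()}
```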
Bayesian Reinforcement Learning in MDPs:

Assume the transition function T is the only unknown.
• Define a prior Pr(T).
• Maintain the posterior Pr(T | s_1, a_1, s_2, a_2, ..., a_{t−1}, s_t) via Bayes' rule.
• Act such as to maximize the expected return given the current posterior and how it will evolve.

If the prior Pr(T(s, a, ·)) is Dirichlet, then the posterior is also Dirichlet. The Dirichlet distribution is parametrised by the counts φ^a_{ss'} of the number of times each transition s → s' under action a was seen.

[Figure: Dirichlet densities Dir(1,1), Dir(3,3), Dir(2,5), Dir(10,1) and Dir(75,25) as a function of p.]

This can be framed as an infinite-state MDP (S', A', T', R') (called the Bayes-Adaptive MDP) as follows:
• S' = S × ℕ^{|S|²|A|}
• A' = A
• T'(s, φ, a, s', φ') = (φ^a_{ss'} / Σ_{s''∈S} φ^a_{ss''}) I(φ + δ^a_{ss'}, φ')
• R'(s, φ, a) = R(s, a)
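Since the Dirichlet posterior is fully described by the transition counts φ^a_{ss'}, the Bayes-Adaptive MDP bookkeeping amounts to incrementing a count and normalizing. A minimal sketch, assuming a flat dictionary of counts (this layout is an illustration, not the authors' implementation):

```python
from collections import defaultdict

# phi[(s, a, s2)] = count of observed transitions s -> s2 under action a
phi = defaultdict(lambda: 1)   # e.g. a uniform Dir(1, ..., 1) prior over next states

def update_counts(phi, s, a, s2):
    """Posterior update after observing the transition (s, a) -> s2: increment its count."""
    phi[(s, a, s2)] += 1

def expected_transition(phi, s, a, states):
    """Mean of the Dirichlet posterior, i.e. the term phi^a_{ss'} / sum_{s''} phi^a_{ss''}
    that appears in the Bayes-Adaptive MDP transition function T'."""
    total = sum(phi[(s, a, s2)] for s2 in states)
    return {s2: phi[(s, a, s2)] / total for s2 in states}
```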
Bayes-Adaptive POMDP

Assume a POMDP model where T and O are the only unknowns.
Let φ^a_{ss'} = # times the transition s → s' occurred under action a, and ψ^a_{s'z} = # times observation z occurred after moving to state s' by doing a.
• φ and ψ specify the full joint Dirichlet posterior over the POMDP model.
• However, their values are unknown since the state is not observed.
• But we can still maintain a belief over their values via Bayes' rule.

Model: M = (S', A, Z, P', R')
• S' = S × ℕ^{|S|²|A|} × ℕ^{|S||A||Z|}
• P'(s, φ, ψ, a, s', φ', ψ', z) = (φ^a_{ss'} / Σ_{s''∈S} φ^a_{ss''}) (ψ^a_{s'z} / Σ_{z'∈Z} ψ^a_{s'z'}) I(φ + δ^a_{ss'}, φ') I(ψ + δ^a_{s'z}, ψ')
• R'(s, φ, ψ, a) = R(s, a)
P' is the joint transition-observation function. The initial belief is b'_0(s, φ, ψ) = b_0(s) I(φ, φ_0) I(ψ, ψ_0), where φ_0 and ψ_0 are the prior on the POMDP model.

• We now have an infinite-state POMDP with a known model.
• The belief in the BAPOMDP is a distribution over both state and model (a mixture of Dirichlets).
• The probability density concentrates on the most likely (s, T, O) given the prior and the action-observation sequence.

Example: belief in Tiger with unknown sensor accuracy.
b0: Pr(L, ⟨5, 3⟩) = 1/2; Pr(R, ⟨5, 3⟩) = 1/2
→(l, l)→ b1: Pr(L, ⟨6, 3⟩) = 5/8; Pr(R, ⟨5, 4⟩) = 3/8
→(l, l)→ b2: Pr(L, ⟨7, 3⟩) = 5/7; Pr(R, ⟨5, 5⟩) = 2/7
→(l, l)→ b3: Pr(L, ⟨8, 3⟩) = 7/9; Pr(R, ⟨5, 6⟩) = 2/9
→(r)→ b4: Pr(L, ⟨8, 3⟩) = 7/18; Pr(L, ⟨5, 6⟩) = 2/18; Pr(R, ⟨8, 3⟩) = 7/18; Pr(R, ⟨5, 6⟩) = 2/18
Here (l, l) denotes listening and hearing the tiger on the left, and (r) denotes opening a door, which resets the tiger's position.

[Figure: densities b_i(L, ·) and b_i(R, ·) over the sensor accuracy for the successive beliefs above.]
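The Tiger belief sequence above can be reproduced in a few lines. The sketch below tracks, for each particle, only the pair (#correct, #incorrect) observation counts, a simplification of the full (φ, ψ) hyperstate that yields the same numbers under the symmetric prior ⟨5, 3⟩; the function name and data layout are hypothetical.

```python
from fractions import Fraction

# Each particle is (state, (n_correct, n_incorrect)) with state "l" (tiger left) or "r"
# (tiger right); its weight is the posterior probability of that joint hypothesis.

def listen_update(belief, heard):
    """belief: dict {(state, (c, i)): weight}; heard: 'l' or 'r' (the observation)."""
    new_belief = {}
    for (state, (c, i)), w in belief.items():
        correct = (state == heard)                    # did the sensor report the true state?
        p_obs = Fraction(c if correct else i, c + i)  # Dirichlet-mean observation probability
        counts = (c + 1, i) if correct else (c, i + 1)
        key = (state, counts)
        new_belief[key] = new_belief.get(key, Fraction(0)) + w * p_obs
    total = sum(new_belief.values())
    return {k: v / total for k, v in new_belief.items()}

b0 = {("l", (5, 3)): Fraction(1, 2), ("r", (5, 3)): Fraction(1, 2)}
b1 = listen_update(b0, "l")
# b1 == {("l", (6, 3)): Fraction(5, 8), ("r", (5, 4)): Fraction(3, 8)}, matching b1 above.
```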
Approximating the BAPOMDP model: Theoretical Results

Define N^{sa}_φ = Σ_{s'∈S} φ^a_{ss'}, N^{sa}_ψ = Σ_{z∈Z} ψ^a_{sz}, D^{sa}_S(φ, φ') = Σ_{s'∈S} |φ^a_{ss'}/N^{sa}_φ − φ'^a_{ss'}/N^{sa}_{φ'}| and D^{sa}_Z(ψ, ψ') = Σ_{z∈Z} |ψ^a_{sz}/N^{sa}_ψ − ψ'^a_{sz}/N^{sa}_{ψ'}|.

Theorem 1:
sup_{α_t∈Γ_t, s∈S} |α_t(s, φ, ψ) − α_t(s, φ', ψ')| ≤ (2γ||R||_∞ / (1−γ)²) sup_{s,s'∈S, a∈A} [ D^{sa}_S(φ, φ') + D^{s'a}_Z(ψ, ψ') + (4 / ln(γ^{−e})) ( Σ_{s''∈S} |φ^a_{ss''} − φ'^a_{ss''}| / ((N^{sa}_φ + 1)(N^{sa}_{φ'} + 1)) + Σ_{z∈Z} |ψ^a_{s'z} − ψ'^a_{s'z}| / ((N^{s'a}_ψ + 1)(N^{s'a}_{ψ'} + 1)) ) ]

Given any ε > 0, define ε' = ε(1−γ)² / (8γ||R||_∞), ε'' = ε(1−γ)² ln(γ^{−e}) / (32γ||R||_∞), N^ε_S = max( |S|(1+ε')/ε', 1/ε'' − 1 ) and N^ε_Z = max( |Z|(1+ε')/ε', 1/ε'' − 1 ).

Finite POMDP Approximation: M̃ = (S̃, A, Z, P̃, R̃)
• S̃ = S × {φ ∈ ℕ^{|S|²|A|} | ∀s, a: 0 < N^{sa}_φ ≤ N^ε_S} × {ψ ∈ ℕ^{|S||A||Z|} | ∀s, a: 0 < N^{sa}_ψ ≤ N^ε_Z}
• P̃(s, φ, ψ, a, s', φ', ψ', z) = (φ^a_{ss'} / Σ_{s''∈S} φ^a_{ss''}) (ψ^a_{s'z} / Σ_{z'∈Z} ψ^a_{s'z'}) I(P(s', φ + δ^a_{ss'}, ψ + δ^a_{s'z}), (s', φ', ψ'))
• R̃(s, φ, ψ, a) = R(s, a)
where P : S' → S̃ returns the state in S̃ minimizing the bound of Theorem 1.

Theorem 2:
For any α_t computed with M and α̃_t its corresponding α-vector computed with M̃, |α̃_t(P(s, φ, ψ)) − α_t(s, φ, ψ)| < ε / (1−γ).
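For reference, the count-vector distances D^{sa}_S and D^{sa}_Z used in the bound are L1 distances between normalized counts. A sketch, assuming counts stored in dictionaries keyed by (s, a, s') and (s, a, z) with strictly positive totals (as required by the finite approximation):

```python
def dist_S(phi1, phi2, s, a, states):
    """D_S^{sa}(phi, phi') = sum over s' of |phi^a_{ss'}/N^{sa}_phi - phi'^a_{ss'}/N^{sa}_phi'|."""
    n1 = sum(phi1.get((s, a, s2), 0) for s2 in states)
    n2 = sum(phi2.get((s, a, s2), 0) for s2 in states)
    return sum(abs(phi1.get((s, a, s2), 0) / n1 - phi2.get((s, a, s2), 0) / n2)
               for s2 in states)

def dist_Z(psi1, psi2, s, a, observations):
    """D_Z^{sa}(psi, psi') = sum over z of |psi^a_{sz}/N^{sa}_psi - psi'^a_{sz}/N^{sa}_psi'|."""
    n1 = sum(psi1.get((s, a, z), 0) for z in observations)
    n2 = sum(psi2.get((s, a, z), 0) for z in observations)
    return sum(abs(psi1.get((s, a, z), 0) / n1 - psi2.get((s, a, z), 0) / n2)
               for z in observations)
```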
Approximate Monitoring and Planning

Belief Monitoring:
Problem: computing b_t exactly in the BAPOMDP is in O(|S|^{t+1}).
Particle Filters:
• Monte Carlo: perform the belief update by sampling K particles and state transitions (see the sketch after this list).
• K Most Likely: after each belief update, keep only the K particles with highest probability.
• Weighted Distance Minimization: after each belief update, use an approximate algorithm that tries to keep the K particles which minimize a weighted sum of the bound in Theorem 1.
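A minimal sketch of the Monte Carlo particle-filter update over BAPOMDP hyperstates. The per-particle transition and observation models are passed in as callables (sample_next, obs_prob), which would typically use the Dirichlet means of that particle's counts; these names and the particle layout are assumptions for illustration, not the authors' implementation.

```python
import random

def monte_carlo_update(particles, a, z, sample_next, obs_prob, K):
    """Approximate BAPOMDP belief update with K sampled particles.

    particles:  list of (weight, (s, phi, psi)) tuples approximating b_{t-1}
    sample_next(s, phi, a)      -> s' sampled from the particle's transition model
    obs_prob(s_next, psi, a, z) -> probability of z under the particle's observation model
    """
    new_particles = []
    weights = [w for w, _ in particles]
    for _ in range(K):
        w, (s, phi, psi) = random.choices(particles, weights=weights, k=1)[0]
        s2 = sample_next(s, phi, a)
        w2 = obs_prob(s2, psi, a, z)                       # reweight by observation likelihood
        phi2, psi2 = dict(phi), dict(psi)                  # copy counts before updating them
        phi2[(s, a, s2)] = phi2.get((s, a, s2), 0) + 1     # update transition count
        psi2[(s2, a, z)] = psi2.get((s2, a, z), 0) + 1     # update observation count
        new_particles.append((w2, (s2, phi2, psi2)))
    total = sum(w for w, _ in new_particles)
    return [(w / total, p) for w, p in new_particles] if total > 0 else new_particles
```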
Planning:
• Online planning via a D-step lookahead search from the current belief of the agent (a sketch of the recursion follows below).
• The particle filter is used to compute the reachable beliefs.
• An approximate value is obtained for each action in the current belief via Bellman backups in the tree; the best action is executed.
• The planning algorithm is executed at every step.

[Figure: D-step lookahead tree rooted at the current belief b_0, branching on actions a_1, ..., a_n and observations o_1, ..., o_n.]
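A sketch of a D-step lookahead recursion over approximate beliefs. The belief update, reward and observation-probability functions are passed in as hypothetical callables (e.g. the particle-filter update above), and the default discount is only a placeholder; this illustrates the recursion, not the authors' planner.

```python
def lookahead(belief, depth, actions, observations, update, reward, obs_prob, gamma=0.95):
    """Depth-limited Bellman backup: returns (value, best_action) for the given belief.

    update(belief, a, z)   -> next (approximate) belief, e.g. via the particle filter
    reward(belief, a)      -> expected immediate reward under the belief
    obs_prob(belief, a, z) -> Pr(z | belief, a)
    """
    if depth == 0:
        return 0.0, None
    best_val, best_act = float("-inf"), None
    for a in actions:
        q = reward(belief, a)
        for z in observations:
            p = obs_prob(belief, a, z)
            if p > 0.0:                        # only expand reachable beliefs
                v, _ = lookahead(update(belief, a, z), depth - 1, actions,
                                 observations, update, reward, obs_prob, gamma)
                q += gamma * p * v
        if q > best_val:
            best_val, best_act = q, a
    return best_val, best_act
```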
Experimental Results

Tiger: (Littman et al. 1995)
Here we consider only the sensor accuracy to be unknown. We start with a prior (ψ_Ll, ψ_Lr, ψ_Rl, ψ_Rr) = (5, 3, 3, 5), i.e. 62.5% sensor accuracy in each state.

[Figure: return and WL1 per episode on Tiger for the exact model, the prior model, Most Probable (2), Monte Carlo (64) and Weighted Distance (2), and planning time per action (ms) for each approximation.]
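The quoted 62.5% accuracy is simply the mean of the Dirichlet prior over the observation probabilities; a quick check:

```python
from fractions import Fraction

# Prior counts over observations in each state: (psi_Ll, psi_Lr, psi_Rl, psi_Rr) = (5, 3, 3, 5)
psi_Ll, psi_Lr, psi_Rl, psi_Rr = 5, 3, 3, 5

acc_L = Fraction(psi_Ll, psi_Ll + psi_Lr)   # Pr(hear left | tiger left)   = 5/8
acc_R = Fraction(psi_Rr, psi_Rl + psi_Rr)   # Pr(hear right | tiger right) = 5/8
assert acc_L == acc_R == Fraction(5, 8)     # 0.625, i.e. 62.5% in each state
```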
Follow:
A robot has to follow an individual with a priori unknown behavior. There are 2 possible individuals with different behaviors that are invariant over states and time. The current individual is unknown and can only be inferred through its behavior. Here we consider only the behavior of the individuals to be unknown. We start with a prior (φ1_NA, φ1_N, φ1_E, φ1_S, φ1_W) = (2, 3, 1, 2, 2) for person 1 and (φ2_NA, φ2_N, φ2_E, φ2_S, φ2_W) = (2, 1, 3, 2, 2) for person 2, while person 1 moves with Pr = (0.3, 0.4, 0.2, 0.05, 0.05) and person 2 with Pr = (0.1, 0.05, 0.8, 0.03, 0.02).

[Figure: return and WL1 per episode on Follow for the exact model, the prior model, Most Probable (64), Monte Carlo (64) and Weighted Distance (16), and planning time per action (ms) for each approximation.]
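As an illustration only (hypothetical names, not the authors' code), the Follow priors can be encoded as Dirichlet counts over each person's moves; observed moves then update the counts, and the resulting likelihoods shift the belief over which person is being followed.

```python
from fractions import Fraction

# Prior Dirichlet counts over each person's moves (NA = no move, then N, E, S, W).
MOVES = ["NA", "N", "E", "S", "W"]
prior = {
    "person1": dict(zip(MOVES, [2, 3, 1, 2, 2])),
    "person2": dict(zip(MOVES, [2, 1, 3, 2, 2])),
}

def predicted_move_dist(counts):
    """Dirichlet-mean prediction of the next move."""
    total = sum(counts.values())
    return {m: Fraction(c, total) for m, c in counts.items()}

def update(counts, move):
    """Posterior update after observing one move: increment the matching count."""
    counts[move] += 1

# Example: under the priors, person 1 is predicted to move N with probability 3/10 and
# person 2 to move E with probability 3/10, so repeatedly observing E moves both adapts
# the counts and shifts the belief toward person 2.
```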