Hypervolume-based Multi-Objective Reinforcement Learning
Kristof Van Moffaert, Madalina M. Drugan, Ann Nowé
Overview
• Single-objective reinforcement learning (RL)
• Multi-objective RL
• State of the art
• Hypervolume-based RL
• Experiments
• Conclusions
Reinforcement Learning
• Origin in psychology
• Learning from interaction
• The agent senses and acts upon its environment
• The chosen action influences the state of the environment, which in turn determines the reward
[Diagram: agent-environment loop with action a(t), next state s(t+1) and reward r(t+1)]
Reinforcement Learning
• Environment?
  ‣ A Markov Decision Process (MDP) contains:
    1. A set of possible states S
    2. A set of possible actions A
    3. A real-valued reward function R(s, a)
    4. A transition function T : S x A → Prob(S)
[Diagram: agent-environment loop with action a(t), next state s(t+1) and reward r(t+1)]
Reinforcement Learning
• Goal?
  ‣ RL learns from experience (trial and error), in contrast to supervised learning
  ‣ Maximize the long-term reward R: a sequential decision problem
  ‣ Learn a policy π, i.e. determine the (optimal) action to take in each state
• Trajectory: s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, ...
• Return: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^∞ γ^k r_{t+k+1},
  with γ, 0 ≤ γ ≤ 1, the discount rate (see the sketch below)
• Online learning: exploration vs. exploitation
• Theoretical background: dynamic programming & stochastic approximation
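As a small illustration of the return formula above, here is a minimal Python sketch (not from the slides) that accumulates a finite reward sequence with discount rate γ:

import random  # not needed here, but used in the later sketches

def discounted_return(rewards, gamma=0.9):
    # R_t = sum_k gamma^k * r_{t+k+1}, for a finite reward sequence
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Example: three rewards of 1 with gamma = 0.9 -> 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0]))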
Reinforcement Learning
• How?
  ‣ Q-values store the estimated quality of a state-action pair, i.e. Q(s, a)
  ‣ The update rule adapts the Q-values in the direction of the discounted future reward
Single-objective Q-learning
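A minimal Python sketch of single-objective tabular Q-learning with ε-greedy exploration; the environment interface (env.reset, env.step, env.actions) and the parameter values are illustrative assumptions, not taken from the slides or the paper.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q[(state, action)] stores the estimated quality of that state-action pair
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)  # assumed interface: next state, reward, done flag
            # move Q(s, a) in the direction of the discounted future reward
            best_next = max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q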
Algorithm 1: Scalarized ε-greedy action selection, scal-ε-greedy()
 1: SQList ← {}
 2: for each action a_i ∈ A do
 3:   o ← {Q(s, a_i, o_1), ..., Q(s, a_i, o_m)}
 4:   SQ(s, a) ← scalarize(o)              ▷ Scalarize Q-values
 5:   Append SQ(s, a) to SQList
 6: end for
 7: return ε-greedy(SQList)

Algorithm 2: ε-greedy action selection, ε-greedy()
 1: r ← rnd
 2: if r > ε then return argmax_a Q(s, a)   ▷ take current best
 3: else return random_a Q(s, a)            ▷ take random
 4: end if

Algorithm 1 presents the scalarized action selection strategy for MO Q-learning; at line 4, the scalarize function can be instantiated by any scalarization function (a sketch follows below).
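A Python sketch of Algorithms 1 and 2; the vector-valued Q-table layout (Q[(s, a)] is a list with one entry per objective) and the scalarize argument are assumptions for illustration, e.g. a weighted sum or the Chebyshev function sketched later.

import random

def scal_eps_greedy(Q, state, actions, scalarize, epsilon=0.1):
    # Algorithm 1: scalarize the Q-vector of every action in the current state
    sq_list = {a: scalarize(Q[(state, a)]) for a in actions}
    # Algorithm 2: epsilon-greedy over the scalarized values
    if random.random() > epsilon:
        return max(sq_list, key=sq_list.get)   # take current best
    return random.choice(actions)              # take random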
Multi-objective reinforcement learning (MORL)
• Multiple objectives
  ‣ MOMDP environment
  ‣ Vector of rewards
  ‣ Vector of Q-values
[Diagram: agent-MOMDP environment loop with action a(t), next state s(t+1) and reward vector r0(t+1), ..., rm(t+1)]
• Values now live in a multi-dimensional objective space!
• Goal:
  ‣ Single-objective: V(s0) = Q(s0, a0); each policy gives one value
  ‣ Multi-objective: each policy gives a vector of values
State of the art MORL
• Scalarization approaches (see the sketch below)
  1. Linear scalarization MORL
     ‣ Weighted sum [Vamplew, 2011]
  2. Non-linear scalarization MORL
     ‣ Chebyshev function [Van Moffaert, 2013]
• Problems are similar to problems in MO optimization:
  ➡ Weights must be defined a priori
  ➡ Performance heavily depends on the weights used
  ➡ Not all solutions in the Pareto front are discovered
• Alternative solution? Indicator-based search!
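For concreteness, minimal sketches of the two scalarization families mentioned above. The weighted-sum form is straightforward; the Chebyshev form shown here is the usual weighted L-infinity distance to a utopian point z*, which is an assumption of this sketch rather than a detail extracted from the slides.

def weighted_sum(q_vector, weights):
    # Linear scalarization: larger is better
    return sum(w * q for w, q in zip(weights, q_vector))

def chebyshev(q_vector, weights, utopian):
    # Weighted Chebyshev (L-infinity) distance to a utopian point z*: smaller is better
    return max(w * abs(q - z) for w, q, z in zip(weights, q_vector, utopian))

A weighted sum can only reach solutions on the convex hull of the Pareto front, which is one reason the linear learner struggles on the non-convex Deep Sea Treasure front shown later; both forms still require a weight tuple to be fixed a priori.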
Hypervolume unary indicator
• A unary quality indicator I assigns a real number to a Pareto set approximation, i.e. I : Pareto set approximations → ℝ
[Diagram: three solutions S1, S2, S3 in a 2-objective space (1st objective vs. 2nd objective) together with a reference point r]
• The indicator measures the hypervolume enclosed between r and S1, S2 and S3
• Used in EMO algorithms: MO-CMA-ES, HypE, SMS-EMOA, ...
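A minimal sketch of how the two-objective hypervolume in the figure can be computed with a simple sweep; the example points and the reference point are made up for illustration, and the case with more than two objectives is usually handled by dedicated libraries.

def hypervolume_2d(points, ref):
    # Area dominated by `points` with respect to the reference point `ref`,
    # assuming both objectives are to be maximized.
    rx, ry = ref
    pts = [(x, y) for x, y in points if x > rx and y > ry]  # keep points that dominate ref
    pts.sort(key=lambda p: p[0], reverse=True)              # sweep from largest 1st objective
    hv, prev_y = 0.0, ry
    for x, y in pts:
        if y > prev_y:                                      # only new strips add area
            hv += (x - rx) * (y - prev_y)
            prev_y = y
    return hv

# Example: three solutions and reference point r = (0, 0) -> area 6.0
print(hypervolume_2d([(1, 3), (2, 2), (3, 1)], ref=(0, 0)))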
Hypervolume-based MORL

Algorithm 3: Greedy Hypervolume-based Action Selection, greedy HBAS(s, l)
 1: volumes ← {}                          ▷ the list collects the hv contribution of each action
 2: for each action a_i ∈ A of state s do
 3:   o ← {Q(s, a_i, o_1), ..., Q(s, a_i, o_m)}
 4:   hv ← calculate_hv(l + o)            ▷ compute the hv contribution of a_i to l
 5:   Append hv to volumes
 6: end for
 7: return argmax_a volumes               ▷ return the action with the maximal contribution, taking into account the contents of l

Algorithm 4: Hypervolume-based Q-learning algorithm
 1: Initialize Q(s, a, o) arbitrarily
 2: for each episode T do
 3:   Initialize s, l = {}                ▷ l is the list of previously visited Q-vectors
 4:   repeat
 5:     Choose a from s using the policy derived from Q (e.g. ε-greedy HBAS(s, l))
 6:     Take action a and observe state s' ∈ S and reward vector r
 7:     o ← {Q(s, a, o_1), ..., Q(s, a, o_m)}
 8:     Add o to l                        ▷ add the current Q-vector of the selected action a to l
 9:     max_a' ← greedy HBAS(s', l)       ▷ get the greedy action in s' based on the new l
10:     for each objective o do           ▷ update the Q-value of each objective individually
11:       Q(s, a, o) ← Q(s, a, o) + α [r(s, a, o) + γ Q(s', max_a', o) − Q(s, a, o)]
12:     end for
13:     s ← s'                            ▷ proceed to the next state
14:   until s is terminal
15: end for

• The user can still decide afterwards which policies or trade-offs are preferred; the advantage is that no emphasis on particular objectives is required beforehand.
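A compact Python sketch of Algorithms 3 and 4. The Q-table layout, the environment interface and the hypervolume helper (for example the 2-D sweep sketched earlier) are assumptions for illustration, not the authors' implementation.

import random
from collections import defaultdict

def greedy_hbas(Q, state, actions, visited, hypervolume, ref):
    # Algorithm 3: pick the action whose Q-vector adds the largest hypervolume
    # contribution to the list `visited` of previously visited Q-vectors.
    volumes = [hypervolume(visited + [Q[(state, a)]], ref) for a in actions]
    return actions[volumes.index(max(volumes))]

def hb_q_learning(env, hypervolume, ref, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Algorithm 4: hypervolume-based Q-learning (illustrative sketch).
    m = env.num_objectives
    Q = defaultdict(lambda: [0.0] * m)          # Q[(s, a)] -> one Q-value per objective
    for _ in range(episodes):
        s, visited, done = env.reset(), [], False
        while not done:
            if random.random() < epsilon:       # epsilon-greedy on top of HBAS
                a = random.choice(env.actions)
            else:
                a = greedy_hbas(Q, s, env.actions, visited, hypervolume, ref)
            s_next, r_vec, done = env.step(a)   # r_vec is the reward vector
            visited.append(list(Q[(s, a)]))     # add the current Q-vector to l
            a_greedy = greedy_hbas(Q, s_next, env.actions, visited, hypervolume, ref)
            for o in range(m):                  # update each objective individually
                Q[(s, a)][o] += alpha * (r_vec[o] + gamma * Q[(s_next, a_greedy)][o]
                                         - Q[(s, a)][o])
            s = s_next
    return Q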
Benchmark 1
• Benchmark instances [Vamplew, 2011]
• Deep Sea Treasure world
  ‣ Minimize time and maximize treasure value
  ‣ Transformed into a full maximization problem: time objective x -1 (see the sketch below)
  ‣ 10 Pareto optimal policies
  ‣ They represent a non-convex Pareto front
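To make the maximization transformation concrete, a hypothetical sketch of the per-step reward vector in the Deep Sea Treasure world: the time objective becomes -1 per step (i.e. multiplied by -1), and the treasure objective pays the treasure value only when a treasure is collected. The treasure values are those listed on the Fig. 2(c) axis; the grid layout itself is not reproduced here.

# Treasure values of the 10 Pareto optimal policies (cf. Fig. 2(c))
TREASURE_VALUES = [1, 2, 3, 5, 8, 16, 24, 50, 74, 124]

def dst_reward(treasure_collected=None):
    # Bi-objective reward vector (time, treasure), both to be maximized
    time_reward = -1.0                                   # "minimize time" flipped by x -1
    treasure_reward = float(treasure_collected or 0.0)   # treasure value at a treasure cell
    return [time_reward, treasure_reward]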
Learning curve

[Fig. 1: Learning curves (hypervolume vs. run) on (a) the Deep Sea Treasure world and (b) the MO Mountain Car world. Curves shown: Pareto front, Linear scal. Q-learning, Chebyshev Q-learning, Hypervolume Q-learning]
Pareto optimal policies learned

[Fig. 2: Figs. 2(a) and 2(b) depict the number of Pareto optimal policies learned by each algorithm over the runs on (a) the Deep Sea Treasure world and (b) the MO Mountain Car world. Fig. 2(c) shows the frequency with which each of the 10 Pareto dominating treasure values (1, 2, 3, 5, 8, 16, 24, 50, 74, 124) was reached in the Deep Sea world, for Linear scal. Q-learning, Chebyshev Q-learning and Hypervolume Q-learning]

• As expected, the linear scalarization learner was ineffective in the non-convex environment
• The Chebyshev learner obtained the best spread, but not all of the time (cf. the learning graph)
• HB-MORL focuses on policies that maximize the hypervolume, given a particular reference point
Benchmark 2
• MO Mountain Car world
  ‣ 3 objectives: minimize time, the number of reversal actions and the number of acceleration actions
  ‣ Transformed into a maximization problem
  ‣ 470 elements in the Pareto front
Learning curve

[Fig. 1: Learning curves (hypervolume vs. run) on (a) the Deep Sea Treasure world and (b) the MO Mountain Car world. Curves shown: Pareto front, Linear scal. Q-learning, Chebyshev Q-learning, Hypervolume Q-learning]
Quality indicator comparison

Table 4: Five quality indicators for each of the three algorithms on the two benchmark instances (DS = Deep Sea Treasure, MC = MO Mountain Car). The first three indicators are to be minimized, the latter two are to be maximized.

Indicator                        World  Linear     Chebyshev  HB-MORL
Inverted generational distance   DS     0.128      0.0342     0.0371
                                 MC     0.012      0.010      0.005
Generalized spread               DS     3.14e-16   0.743      0.226
                                 MC     0.683      0.808      0.701
Generational distance            DS     0          0          0
                                 MC     0.0427     0.013824   0.013817
Hypervolume                      DS     762        959.26     1040.2
                                 MC     15727946   23028392   23984880
Cardinality                      DS     2          8          5
                                 MC     15         38         37
➡ The HB-MORL algorithm was very consistent in finding solutions that maximize the hypervolume metric, but could be improved upon in terms of spread
➡ Note that HB-MORL is not set up with any information on how the objectives should be balanced or weighed (two of the distance-based indicators above are sketched below)
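To clarify two of the distance-based indicators in the table, a minimal sketch under their usual definitions: the generational distance averages, over the learned approximation set, the distance to the nearest point of the true Pareto front, while the inverted generational distance swaps the roles of the two sets. Normalization and power parameters vary between implementations, so this is an assumption rather than the exact setup behind Table 4.

import math

def _avg_min_dist(from_set, to_set):
    # Average Euclidean distance from each point in from_set to its nearest point in to_set
    return sum(min(math.dist(p, q) for q in to_set) for p in from_set) / len(from_set)

def generational_distance(approximation, pareto_front):
    return _avg_min_dist(approximation, pareto_front)

def inverted_generational_distance(approximation, pareto_front):
    return _avg_min_dist(pareto_front, approximation)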
Conclusions
• We have combined EMO principles with RL to design a hybrid MORL algorithm
• HB-MORL uses the hypervolume measure to guide the action selection
• Results
  ‣ The linear scalarization learner is not generally applicable
  ‣ The Chebyshev learner obtains more spread results, but is not robust all of the time
  ‣ Scalarization methods and their performance depend on the weight tuples used
➡ HB-MORL focuses on policies that maximize the hypervolume and finds them nearly always
Thank you
Kristof Van Moffaert, Madalina M. Drugan, Ann Nowé