Hypervolume-based Multi-Objective Reinforcement Learning
Kristof Van Moffaert, Madalina M. Drugan, Ann Nowé

Overview
• Single-objective reinforcement learning (RL)
• Multi-objective RL
• State of the art
• Hypervolume-based RL
• Experiments
• Conclusions

Reinforcement Learning
[Figure: agent-environment interaction loop; the agent sends action a(t) to the environment and receives the next state s(t+1) and reward r(t+1)]
• Origin in psychology
• Learning from interaction
• The agent senses and acts upon its environment
• The chosen action influences the state of the environment, which determines the reward

Reinforcement Learning
• Environment?
  ‣ A Markov Decision Process (MDP) contains:
    1. A set of possible states S
    2. A set of possible actions A
    3. A real-valued reward function R(s, a)
    4. A transition function T : S × A → Prob(S)

Reinforcement Learning
• Goal?
  ‣ RL learns from experience (trial and error), in contrast to supervised learning
  ‣ Maximize the long-term reward R: a sequential decision problem
  ‣ Learn a policy: determine the (optimal) action to take in each state

[Figure: trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, \ldots$]

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,

with $\gamma$, $0 \leq \gamma \leq 1$, the discount rate.

  ‣ Online learning: exploration vs. exploitation
  ‣ Theoretical background: dynamic programming & stochastic approximation

Reinforcement Learning
• How?
  ‣ Q-values store the estimated quality of a state-action pair, i.e. Q(s, a)
  ‣ The update rule adapts the Q-values in the direction of the discounted future reward:

$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$
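A minimal Python sketch of this tabular update; the learning rate `alpha` and discount `gamma` are illustrative values, not taken from the slides:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) towards the
    discounted future reward r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Q is a |S| x |A| table of estimated state-action qualities
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```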

Single-objective Q-learning

Algorithm 1: Scalarized ε-greedy action selection, scal-ε-greedy()
1: SQList ← {}
2: for each action a_i ∈ A do
3:   o⃗ ← {Q(s, a_i, o_1), ..., Q(s, a_i, o_m)}
4:   SQ(s, a_i) ← scalarize(o⃗)        ▷ Scalarize Q-values
5:   Append SQ(s, a_i) to SQList
6: end for
7: return ε-greedy(SQList)

Algorithm 2: ε-greedy action selection, ε-greedy()
1: r ← rnd
2: if r > ε then return argmax_a Q(s, a)    ▷ take current best
3: else return random_a Q(s, a)             ▷ take random
4: end if

Algorithm 1 presents the scalarized action selection strategy for MO Q-learning. At line 4, the scalarize function can be instantiated by any scalarization function.
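A minimal sketch of the two routines above. The weighted-sum instantiation of `scalarize` is an assumption (any scalarization function could be plugged in), and `Q` is assumed to map (state, action) to a list of m Q-values:

```python
import random

def scal_eps_greedy(Q, s, actions, weights, epsilon=0.1):
    """Algorithm 1: scalarize each action's Q-vector, then act epsilon-greedily."""
    sq_list = []
    for a in actions:
        o = Q[(s, a)]                                  # vector of m Q-values
        sq = sum(w * q for w, q in zip(weights, o))    # scalarize: weighted sum
        sq_list.append((sq, a))
    return eps_greedy(sq_list, epsilon)

def eps_greedy(sq_list, epsilon):
    """Algorithm 2: exploit the scalarized best action with prob. 1 - epsilon."""
    if random.random() > epsilon:
        return max(sq_list, key=lambda t: t[0])[1]     # take current best
    return random.choice(sq_list)[1]                   # take random
```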



Multiple objectives
• Multi-objective reinforcement learning (MORL)
  ‣ MOMDP environment: the agent sends action a(t) and observes the next state s(t+1) together with a reward vector (r_0(t+1), ..., r_m(t+1))
  ‣ Vector of rewards
  ‣ Vector of Q-values (see the sketch below)
• Goal: values in objective space
  ‣ Single-objective: each policy gives one value, e.g. V(s_0) or Q(s_0, a_0)
  ‣ Multi-objective: each policy corresponds to a vector of values, one per objective
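A minimal sketch of such a vector-valued Q-table; the sizes and the example reward are illustrative assumptions:

```python
import numpy as np

n_states, n_actions, n_objectives = 100, 4, 2   # illustrative sizes

# Q(s, a, o): one Q-value per objective for every state-action pair
Q = np.zeros((n_states, n_actions, n_objectives))

# A single transition now yields a reward *vector*, e.g. (time, treasure)
reward = np.array([-1.0, 0.0])
Q[0, 2] += 0.1 * (reward - Q[0, 2])   # element-wise: one estimate per objective
```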

State of the art MORL
• Scalarization approaches
  1. Linear scalarization MORL: weighted sum [Vamplew, 2011]
  2. Non-linear scalarization MORL: Chebyshev function [Van Moffaert, 2013]
• Problems are similar to those of scalarization in multi-objective optimization (illustrated in the sketch below):
  ‣ Weights have to be defined a priori
  ‣ Performance heavily depends on the weights used
  ‣ Not all solutions in the Pareto front are discovered
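A sketch of both scalarization functions; the weight vector `w` and the utopian reference point `z_star` for the Chebyshev metric are assumed inputs, and the weight dependence criticized above is explicit in both:

```python
import numpy as np

def linear_scalarize(q_vec, w):
    """Weighted sum: only reaches solutions on the convex hull of the front."""
    return float(np.dot(w, q_vec))

def chebyshev_scalarize(q_vec, w, z_star):
    """Weighted Chebyshev metric (to be minimized): weighted distance
    to a utopian reference point z*."""
    q_vec, w, z_star = map(np.asarray, (q_vec, w, z_star))
    return float(np.max(w * np.abs(z_star - q_vec)))
```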

Alternative solution? Indicator-based search!

Hypervolume unary indicator
• A unary quality indicator I assigns a real number to a Pareto set approximation: I : Ψ → ℝ, with Ψ the set of Pareto set approximations
[Figure: three solutions S1, S2, S3 in a two-objective space with reference point r; the shaded region between r and the solutions is their hypervolume]
• Measures the hypervolume between r and S1, S2 and S3
• Used in EMO algorithms: MO-CMA-ES, HypE, SMS-EMOA, ...
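A minimal sketch of the two-objective case (the area dominated by a point set, bounded by the reference point r, with both objectives maximized); the example coordinates are hypothetical, and the general m-dimensional case needs a dedicated algorithm such as HSO or WFG:

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` w.r.t. reference point `ref`
    (both objectives to be maximized)."""
    # Keep only points that strictly dominate the reference point
    pts = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from best to worst on the first objective
    pts = sorted(pts, reverse=True)
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                       # only non-dominated slices add area
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

# Hypothetical coordinates for S1, S2, S3 from the figure
print(hypervolume_2d([(1, 5), (3, 4), (4, 2)], ref=(0, 0)))   # 15.0
```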

Hypervolume-based MORL

Algorithm 4: Hypervolume-based Q-learning algorithm
1: Initialize Q(s, a, o) arbitrarily
2: for each episode T do
3:   Initialize s, l = {}                      ▷ l: list of previously visited Q-vectors
4:   repeat
5:     Choose a from s using the policy derived from Q (e.g. ε-greedy HBAS(s, l))   ▷ action selection based on the current state and l
6:     Take action a and observe state s' ∈ S and reward vector r⃗
7:     o⃗ ← {Q(s, a, o_1), ..., Q(s, a, o_m)}
8:     Add o⃗ to l                              ▷ add the current Q-vector of the selected action a to l
9:     max_a' ← greedy HBAS(s', l)             ▷ get the greedy action in s' based on the new l
10:    for each objective o do                 ▷ update the Q-value for each objective individually
11:      Q(s, a, o) ← Q(s, a, o) + α [ r(s, a, o) + γ Q(s', max_a', o) - Q(s, a, o) ]
12:    end for
13:    s ← s'                                  ▷ proceed to the next state
14:  until s is terminal
15: end for

Algorithm 3: Greedy Hypervolume-based Action Selection, HBAS(s, l)
1: volumes ← {}                     ▷ collects the hypervolume contribution of each action
2: for each action a_i ∈ A of state s do
3:   o⃗ ← {Q(s, a_i, o_1), ..., Q(s, a_i, o_m)}
4:   hv ← calculate_hv(l + o⃗)       ▷ compute the hypervolume contribution of a_i to l
5:   Append hv to volumes
6: end for
7: return argmax_a volumes          ▷ return the action with the maximal contribution in hypervolume, taking into account the contents of l

The user must still make his/her decision on which policies or trade-offs are preferred, but the advantage is that emphasis on particular objectives is not required beforehand.
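A compact Python sketch of Algorithm 3 and the inner loop of Algorithm 4. The `hypervolume` callable can be the 2D routine sketched earlier or any other hypervolume implementation, and the learning rate `alpha` and discount `gamma` are illustrative assumptions:

```python
import numpy as np

def hbas(Q, s, actions, visited, hypervolume, ref):
    """Algorithm 3: pick the action whose Q-vector adds the most hypervolume
    to the list of Q-vectors visited so far in this episode."""
    contributions = []
    for a in actions:
        o = Q[s, a]                                    # m-dimensional Q-vector
        hv = hypervolume(visited + [tuple(o)], ref)    # hv of l extended with o
        contributions.append(hv)
    return actions[int(np.argmax(contributions))]

def hb_q_update(Q, s, a, r_vec, s_next, visited, hypervolume, ref,
                alpha=0.1, gamma=0.9):
    """One step of Algorithm 4: per-objective Q-learning update, where the
    greedy action in s' is chosen by hypervolume contribution."""
    visited.append(tuple(Q[s, a]))                     # add current Q-vector to l
    a_greedy = hbas(Q, s_next, list(range(Q.shape[1])), visited,
                    hypervolume, ref)
    for o in range(Q.shape[2]):                        # update each objective individually
        td = r_vec[o] + gamma * Q[s_next, a_greedy, o] - Q[s, a, o]
        Q[s, a, o] += alpha * td
    return Q
```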

Benchmark 1
• Benchmark instances [Vamplew, 2011]
• Deep Sea Treasure world
  ‣ Minimize time and maximize treasure value
  ‣ Transformed into a full maximization problem (time objective × -1)
  ‣ 10 Pareto optimal policies
  ‣ Represents a non-convex Pareto front

Learning curve
[Fig. 1: Learning curves (hypervolume vs. run) on (a) the Deep Sea Treasure world and (b) the MO Mountain Car world, comparing the Pareto front, linear scalarized Q-learning, Chebyshev Q-learning and hypervolume-based Q-learning]

Pareto optimal policies learned
[Fig. 2: (a) and (b) show the Pareto optimal policies learned over the runs by each learning algorithm in the Deep Sea Treasure and MO Mountain Car worlds; (c) shows the frequency with which each of the 10 Pareto optimal treasure values (1, 2, 3, 5, 8, 16, 24, 50, 74, 124) was reached in the Deep Sea world by linear scalarized, Chebyshev and hypervolume-based Q-learning]
• As expected, the linear scalarization learner was ineffective in the non-convex environment
• The Chebyshev learner obtained the best spread, but not all the time (cf. the learning curve)
• HB-MORL focuses on policies that maximize the hypervolume, given a particular reference point

Benchmark 2
• MO Mountain Car world
  ‣ 3 objectives: minimize time, the number of reversal actions and the number of acceleration actions
  ‣ Transformed into a maximization problem
  ‣ 470 elements in the Pareto front

Learning curve
[Fig. 1(b): Learning curve (hypervolume vs. run) on the MO Mountain Car world, comparing the Pareto front, linear scalarized Q-learning, Chebyshev Q-learning and hypervolume-based Q-learning]

Quality indicator comparison

Table 4: Five quality indicators for each of the three algorithms on the two benchmark instances (DS = Deep Sea Treasure world, MC = MO Mountain Car world). The first three indicators are to be minimized, the latter two are to be maximized. The best value in each row is marked with *.

Indicator                        Instance   Linear       Chebyshev    HB-MORL
Inverted generational distance   DS         0.128        0.0342*      0.0371
                                 MC         0.012        0.010        0.005*
Generalized spread               DS         3.14e-16*    0.743        0.226
                                 MC         0.683*       0.808        0.701
Generational distance            DS         0*           0*           0*
                                 MC         0.0427       0.013824     0.013817*
Hypervolume                      DS         762          959.26       1040.2*
                                 MC         15727946     23028392     23984880*
Cardinality                      DS         2            8*           5
                                 MC         15           38*          37

The HB-MORL algorithm was very consistent in finding solutions that maximize the hypervolume metric, but could be improved by obtaining better-spread results.

Weights vs. quality indicator. In the following test, we compare in more detail the results of HB-MORL to the results obtained for each weight tuple by the scalarization-based algorithms (Table 5). It is important to note that the HB-MORL algorithm is not set up with any information on how the objectives should be balanced or weighed; therefore, in the table, its values remain ...
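For reference, a minimal sketch of two of the distance-based indicators in Table 4, using Euclidean distances and assuming the true Pareto front is known:

```python
import numpy as np

def generational_distance(approx, front):
    """Average distance from each point of the approximation set to its
    nearest point on the true Pareto front (to be minimized)."""
    approx, front = np.asarray(approx, float), np.asarray(front, float)
    d = np.linalg.norm(approx[:, None, :] - front[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def inverted_generational_distance(approx, front):
    """Average distance from each Pareto-optimal point to its nearest
    point in the approximation set (to be minimized)."""
    return generational_distance(front, approx)
```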

Conclusions
• We have combined EMO principles with RL to design a hybrid MORL algorithm
  ‣ HB-MORL uses the hypervolume measure to guide the action selection
• Results
  ‣ The linear scalarization learner is not generally applicable
  ‣ The Chebyshev learner obtains better-spread results, but is not robust all the time
  ‣ Scalarization methods and their performance depend on the weight tuples used
  ‣ HB-MORL focuses on policies that maximize the hypervolume and finds them nearly always

Thank you
Kristof Van Moffaert, Madalina M. Drugan, Ann Nowé