# Identify the difference between learning and control
# Describe temporal-difference and actor-critic models
# Describe the difference between model-free and model-based RL
# Describe some findings from RL studies of depression
BUILDING A REINFORCEMENT LEARNING MODEL
CREATE A WORLD AND AN AGENT
[Diagram: two components, a "World" and an "Agent", which will interact with each other]
PROPERTIES OF THE WORLD
# States, s ∈ S
# Available actions, a ∈ A
# State transition dynamics, P(s_{t+1} | s_t, a_t)
  ◦ Stochastic vs. deterministic
  ◦ May or may not be action dependent
# Reward function, r : S → R
  ◦ May take into account a_t
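As an illustrative sketch (not from the slides), a minimal world with these properties might look like the following in Python. The two-state layout, transition probabilities, and reward values are made up for the example.

```python
import numpy as np

class World:
    """A tiny stochastic world: states, actions, transition dynamics, and rewards."""

    def __init__(self, rng=None):
        self.states = [0, 1]             # S
        self.actions = [0, 1]            # A
        # P(s_{t+1} | s_t, a_t): transition probabilities, indexed [state][action]
        self.P = {
            0: {0: np.array([0.9, 0.1]), 1: np.array([0.2, 0.8])},
            1: {0: np.array([0.5, 0.5]), 1: np.array([0.1, 0.9])},
        }
        self.rewards = {0: 0.0, 1: 1.0}  # r : S -> R (here reward depends only on the state)
        self.rng = rng or np.random.default_rng()
        self.state = 0

    def step(self, action):
        """Apply an action; return (reward, next_state)."""
        next_state = self.rng.choice(self.states, p=self.P[self.state][action])
        self.state = next_state
        return self.rewards[next_state], next_state
```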
PROPERTIES OF THE AGENT
# State value, V(s), or state-action value, Q(s, a), "table"
# Learning rule for V(s) or Q(s, a)
# Control policy, π : S → A
# May or may not have an internal "model" of the world
# Goal: to find the policy π(s) that maximizes the total future reward
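A matching sketch of an agent (again hypothetical, not the slides' code): it keeps a Q(s, a) table, uses an ε-greedy control policy π, and learns with a simple Q-learning rule. The parameter values are placeholders.

```python
import numpy as np

class Agent:
    """Model-free agent: a Q(s, a) table, an epsilon-greedy policy, and a learning rule."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1, rng=None):
        self.Q = np.zeros((n_states, n_actions))  # Q(s, a) "table"
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor on future reward
        self.epsilon = epsilon  # exploration rate
        self.rng = rng or np.random.default_rng()

    def policy(self, state):
        """pi : S -> A, epsilon-greedy over the current Q values."""
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.Q.shape[1]))
        return int(np.argmax(self.Q[state]))

    def learn(self, state, action, reward, next_state):
        """Q-learning update: move Q(s, a) toward reward plus discounted best next value."""
        target = reward + self.gamma * np.max(self.Q[next_state])
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])
```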
AT EACH TIME STEP
[Diagram: at time t, the world presents state s_t to the agent]
AT EACH TIME STEP...
[Diagram: the agent applies its policy to the state, a_t = π(s_t), and acts on the world]
AT EACH TIME STEP...
[Diagram: the world returns reward r and next state s_{t+1} to the agent]
Then the agent updates its values V(s) (i.e., it "learns")
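A tiny self-contained sketch of one such time step (states, actions, and reward values are made up, and the random policy and TD(0)-style value update are illustrative choices, not the slides' specification):

```python
import random

# One time step of the loop above, with placeholder states and rewards.
V = {"s0": 0.0, "s1": 0.0}      # state-value table V(s)
alpha, gamma = 0.1, 0.9          # learning rate and discount factor (placeholder values)

s_t = "s0"                                # the world presents the current state s_t
a_t = random.choice(["left", "right"])    # the agent selects a_t = pi(s_t) (placeholder random policy)
r, s_next = 1.0, "s1"                     # the world returns reward r and next state s_{t+1}

# The agent "learns": a TD(0)-style update of V(s_t) toward r + gamma * V(s_{t+1})
V[s_t] += alpha * (r + gamma * V[s_next] - V[s_t])
```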
FOR YOUR FUTURE REFERENCE

procedure RLSimulate(· · · )
    Instantiate world Ω, then agent Ψ
    Sample first state s ∈ S from Ω
    Sample first action a ∈ A from Ψ
    Ψ acts on Ω
    Ω produces reward r ∈ R and next state s′ for observation
    Ψ selects next action a′
    for t = 1 : T do
        Ψ learns/updates model Q(s, a), R(s, a, s′), T(s, a, s′)
        s ← s′, a ← a′
        Ψ acts on Ω
        Ω produces reward r and next state s′ for observation
        Ψ selects next action a′
    end for
end procedure
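One way this procedure might be sketched in Python, assuming the hypothetical World and Agent classes from the earlier sketches are in scope (Ω = world, Ψ = agent, T = number of steps); the ordering of act, observe, select next action, then learn follows the pseudocode above.

```python
def rl_simulate(T=1000):
    world = World()                                    # instantiate world Omega
    agent = Agent(n_states=len(world.states),
                  n_actions=len(world.actions))        # then agent Psi
    s = world.state                                    # sample first state s from Omega
    a = agent.policy(s)                                # sample first action a from Psi
    r, s_next = world.step(a)                          # Psi acts on Omega; Omega returns r and s'
    a_next = agent.policy(s_next)                      # Psi selects next action a'

    for t in range(T):
        agent.learn(s, a, r, s_next)                   # Psi learns/updates Q(s, a)
        s, a = s_next, a_next                          # s <- s', a <- a'
        r, s_next = world.step(a)                      # Psi acts on Omega; Omega returns r and s'
        a_next = agent.policy(s_next)                  # Psi selects next action a'

    return agent.Q
```

Usage would simply be something like `Q = rl_simulate(T=5000)`, after which `Q` holds the learned state-action values.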
LEARNING RULES
LEARNING RULE FORM
v = w · u
δ = r − v
then, w ← w + α δ u

Basically gradient descent:
1. Take the difference between reality and my prediction (δ)
2. Change weights by a small fraction (α) of the error
   ◦ α: "learning rate" or "step size" (usually
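A small numerical sketch of this delta-rule update (the feature vector, weights, and reward values are made up): the prediction is linear in the stimulus features, and the weights move a fraction α of the prediction error δ.

```python
import numpy as np

u = np.array([1.0, 0.0, 1.0])   # stimulus / feature vector
w = np.array([0.2, 0.5, 0.1])   # learned weights
alpha = 0.1                      # learning rate (step size)
r = 1.0                          # actual reward

v = w @ u                        # prediction: v = w . u        -> 0.3
delta = r - v                    # prediction error: r - v      -> 0.7
w = w + alpha * delta * u        # update: w <- w + alpha * delta * u
# w is now [0.27, 0.5, 0.17]: only weights of active features (u_i = 1) changed
```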