Projective simulation applied to the grid-world and the mountain-car problem

Alexey A. Melnikov, Adi Makmal, and Hans J. Briegel

Institute for Quantum Optics and Quantum Information, Austrian Academy of Sciences
Institute for Theoretical Physics, University of Innsbruck
Introduction
The model of projective simulation (PS) is a novel approach to artificial intelligence. The PS model is based on a random-walk process and is a natural candidate for quantization. The performance of the PS agent has been studied in a number of discrete toy problems. In this work we analyse the PS model further, in more complicated scenarios. We consider two well-studied benchmarking problems, the "grid-world" and the "mountain-car" problem, which challenge the model with large and continuous input spaces. The performance of the PS model is compared with that of existing models, and we show that the PS agent exhibits competitive performance in these tasks as well.
The PS model - Brief summary

The PS model is represented by a network of clips:
Figure 1: The PS clip network [1].
Once a percept clip is excited, the excitation hops between clips probabilistically until it reaches an action clip. The transition probabilities are defined by time-dependent weights $h_{ij}^{(t)}$:
\[
p^{(t)}(c_j|c_i) = \frac{h_{ij}^{(t)}}{\sum_k h_{ik}^{(t)}}. \tag{1}
\]
We also consider an alternative expression, known as the "softmax" distribution function:
\[
p^{(t)}(c_j|c_i) = \frac{e^{h_{ij}^{(t)}}}{\sum_k e^{h_{ik}^{(t)}}}. \tag{2}
\]
The h-values of each edge of the PS network are updated with reward $\lambda$ [2]:
\[
h_{ij}^{(t+1)} = h_{ij}^{(t)} - \gamma\bigl(h_{ij}^{(t)} - 1\bigr) + g_{ij}^{(t)}\,\lambda, \qquad
g_{ij}^{(t+1)} = g_{ij}^{(t)}\,(1 - \eta),
\]
where $\gamma$ and $\eta$ are the damping parameters.
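The decision and update rules above can be summarised in a short sketch. The following Python class is a minimal two-layered PS agent (percept clips wired directly to action clips); the class and parameter names are illustrative, and the convention of setting the glow of the traversed edge to 1 is taken from [2] rather than stated explicitly above. Calling act() performs the random walk of Eqs. (1)-(2), and learn() applies the h- and g-updates once per time step.

```python
import numpy as np

class PSAgent:
    """Minimal two-layered PS agent: percept clips connected directly to action clips.

    h[s, a] are the edge weights of Eqs. (1)-(2); g[s, a] are the glow values.
    gamma and eta are the damping parameters of the update rule.
    """

    def __init__(self, num_percepts, num_actions, gamma=0.0, eta=0.07, softmax=False):
        self.h = np.ones((num_percepts, num_actions))    # h-values are initialised to 1
        self.g = np.zeros((num_percepts, num_actions))   # glow values start at 0
        self.gamma, self.eta, self.softmax = gamma, eta, softmax

    def act(self, percept):
        """Excite the percept clip and hop to an action clip."""
        h_row = self.h[percept]
        if self.softmax:
            weights = np.exp(h_row - h_row.max())   # Eq. (2), shifted for numerical stability
        else:
            weights = h_row                         # Eq. (1)
        p = weights / weights.sum()
        action = np.random.choice(len(p), p=p)
        self.g *= (1.0 - self.eta)                  # glow damping, g -> g * (1 - eta)
        self.g[percept, action] = 1.0               # the traversed edge glows (convention of [2])
        return action

    def learn(self, reward):
        """h -> h - gamma * (h - 1) + g * lambda, applied to all edges."""
        self.h += -self.gamma * (self.h - 1.0) + self.g * reward
```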
Grid World
The grid-world environment [3] is a maze in which an agent should learn an optimal path to a fixed goal. The world is divided into discrete cells in which the agent can reside (Fig. 2).

Figure 2: The goal of the game is to find the "star". The shortest path to the goal is composed of 14 steps; one such optimal path is marked by a dashed line.

The rules of the game are the following [3]:
• The agent always starts from the (1,3) cell
• It can choose among four actions: left, right, up or down
• If the agent decides to go to a square labeled as "wall" or to go beyond the grid, no movement is performed, but the time step is counted
• A reward of λ = 1 is received only after reaching the goal
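The environment itself is easy to sketch. The class below implements the four rules above; the grid size, the wall cells and the goal cell are placeholders, since the exact maze of [3] is only shown in Fig. 2 and not reproduced here.

```python
# Illustrative grid-world environment implementing the rules above.
# Grid size, wall cells and goal cell are placeholders, not the exact maze of [3].
class GridWorld:
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}   # left, right, up, down

    def __init__(self, width=6, height=9, walls=frozenset({(3, 3), (3, 4), (3, 5)}),
                 goal=(6, 9), start=(1, 3)):
        self.width, self.height = width, height
        self.walls, self.goal, self.start = walls, goal, start
        self.pos = start

    def reset(self):
        self.pos = self.start             # the agent always starts from the (1,3) cell
        return self.pos

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        x, y = self.pos[0] + dx, self.pos[1] + dy
        # a move into a wall or beyond the grid leaves the position unchanged,
        # but the time step is still counted by the caller
        if 1 <= x <= self.width and 1 <= y <= self.height and (x, y) not in self.walls:
            self.pos = (x, y)
        reward = 1.0 if self.pos == self.goal else 0.0   # lambda = 1 only at the goal
        return self.pos, reward, self.pos == self.goal
```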
Figure 3: The PS network. The network has 46 percept clips and 4 action clips. Each edge is associated with an h-value $h_{ij}^{(t)}$ and a glow value $g_{ij}^{(t)}$.

Figure 4: The PS network during the learning process. The bold arrows correspond to strong connections.
Learning curves

The performance of an agent in this task is evaluated by the number of steps it makes before reaching the goal in each trial.
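As a sketch of how such a learning curve can be generated, the loop below reuses the PSAgent and GridWorld classes from the sketches above, records the number of steps in each trial and averages over independent agents. The percept encoding simply enumerates all cells (the poster's network uses 46 percept clips, i.e. wall cells are excluded), and the parameters are illustrative.

```python
import numpy as np

def percept_index(pos, width=6):
    """Map a cell (x, y) to a percept-clip index (illustrative encoding)."""
    x, y = pos
    return (y - 1) * width + (x - 1)

def learning_curve(num_agents=100, num_trials=100, max_steps=10_000):
    """Average number of steps per trial over independent agents."""
    curves = np.zeros((num_agents, num_trials))
    for a in range(num_agents):
        env = GridWorld()
        agent = PSAgent(num_percepts=env.width * env.height, num_actions=4, eta=0.07)
        for trial in range(num_trials):
            pos, done, steps = env.reset(), False, 0
            while not done and steps < max_steps:
                action = agent.act(percept_index(pos, env.width))
                pos, reward, done = env.step(action)
                agent.learn(reward)
                steps += 1
            curves[a, trial] = steps
    return curves.mean(axis=0)   # average over agents gives the learning curve
```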
Figure 5: PS learning curves (average number of steps per trial as a function of the trial number), shown for the optimal values of η (η = 0.07 for Eq. 1 and η = 0.12 for the softmax rule of Eq. 2). The performance improves with the number of trials: from about 870 steps at the first trial to 45 (solid red) and 15.4 (dashed blue) steps after 100 trials [4]. The PI model [3] learns to reach the goal in 14 steps after 100 trials.
Mountain car

In the mountain-car task, defined in [5], an agent drives a car on a surface between two hills, where a goal awaits at the top of the right hill, as shown in Fig. 6.

Figure 6: The goal is to find the "star" at x = 0.5.

The rules of the game are the following [5]:
• The agent always starts with a random position and velocity: x ∈ [−1.2, 0.5], v ∈ [−0.07, 0.07]
• It can choose among 3 actions: forward thrust (to the right), no thrust, and reverse thrust (to the left)
• The next state is defined by the equations $v_\mathrm{new} = v_\mathrm{old} + 0.001 \cdot \mathrm{Action} - 0.0025\cos(3 x_\mathrm{old})$ and $x_\mathrm{new} = x_\mathrm{old} + v_\mathrm{old}$
• A reward of λ = 1 is received only after reaching the goal
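The dynamics above translate directly into code. The sketch below follows the update equations as written (the position update uses the old velocity); the clipping of position and velocity at the interval boundaries and the zeroing of the velocity at the left wall follow the usual convention for this task [5] and are assumptions of the sketch.

```python
import math
import random

X_MIN, X_MAX = -1.2, 0.5      # position range; the goal ("star") is at x = 0.5
V_MIN, V_MAX = -0.07, 0.07    # velocity range

def reset():
    """Random initial position and velocity, as in the rules above."""
    return random.uniform(X_MIN, X_MAX), random.uniform(V_MIN, V_MAX)

def step(x, v, action):
    """One time step; action is -1 (reverse thrust), 0 (no thrust) or +1 (forward thrust)."""
    v_new = v + 0.001 * action - 0.0025 * math.cos(3.0 * x)
    v_new = max(V_MIN, min(V_MAX, v_new))
    x_new = x + v                      # position update uses the old velocity, as written above
    x_new = max(X_MIN, min(X_MAX, x_new))
    if x_new == X_MIN:                 # hitting the left wall stops the car
        v_new = 0.0
    reached_goal = x_new >= X_MAX
    reward = 1.0 if reached_goal else 0.0   # lambda = 1 only at the goal
    return x_new, v_new, reward, reached_goal
```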
Figure 7: The PS network, composed of 20 × 20 percept clips and 3 action clips.

Learning curves

The performance of an agent in this task is evaluated by the number of steps it makes before reaching the goal in each trial.

Figure 8: PS learning curves (average number of steps per trial as a function of the trial number), shown for the optimal value of η (η = 0.02 for both Eq. 1 and the softmax rule of Eq. 2). The performance of the PS is 313 steps/trial (solid red) and 223 steps/trial (dashed blue). The SARSA model [5] does the same in 450 steps/trial.

References

[1] H. J. Briegel and G. De las Cuevas, Scientific Reports 2 (2012)
[2] J. Mautner, A. Makmal, D. Manzano, M. Tiersch, and H. J. Briegel, New Generation Computing 33, in press (2015)
[3] R. S. Sutton, Proc. of the 7th International Conference on Machine Learning (1990)
[4] A. A. Melnikov, A. Makmal, and H. J. Briegel, Artificial Intelligence Research 3 (2014)
[5] S. P. Singh and R. S. Sutton, Machine Learning 22, 123 (1996)
28th SFB meeting in Vienna, December 11–12, 2014