Learning Dynamics in Economic Games

Martin Spanknebel∗1 and Klaus Pawelzik†1,2

1 Institute for Theoretical Physics, University of Bremen
2 Center for Cognitive Sciences, University of Bremen

arXiv:1508.05288v1 [physics.soc-ph] 21 Aug 2015

August 24, 2015

When playing games, human decision behaviour is often found to be diverse. For instance, in repeated prisoner's dilemma games humans exhibit broad distributions of cooperativity and on average do not optimize their mean payoff. Deviations from optimal behaviour have been attributed to auxiliary causes including randomness of decisions, mis-estimations of probabilities, accessory objectives, or emotional biases. Here we show that also the dynamics resulting from a general tendency to shift preferences towards previously rewarding choices, in interaction with the strategy of the opponent, can contribute substantially to the observed lack of 'rationality'. As a representative example we investigate the dynamics of choice preferences in prisoner's dilemma games against opponents exhibiting different degrees of extortion and generosity. We find that already a very simple model of human learning can account for a surprisingly wide range of human decision behaviours. In particular, the theory reproduces the reduced cooperation against extortionists and the large degree of cooperation in generous games, and it explains the broad distributions of individual choice rates in ensembles of players. We conclude that important aspects of the alleged irrationality of human behaviours could be rooted in the dynamics of elementary learning mechanisms.

∗ [email protected]
† [email protected]


Humans and animals regularly fail to achieve beneficial outcomes even in simple choice situations, and even when they are allowed to try them out over many moves. These observations have triggered a plethora of research into the lack of 'rationality' in human and animal decision behaviour. Deviations from utility maximization were explained by psychological peculiarities including probability matching [3] and mis-estimations of utilities [7]. Also, auxiliary motives or emotions can have a substantial influence on decision behaviour, which cannot always be excluded experimentally [2].


Figure 1: Expected reward of Y and of the extortion and generous strategies as a function of Y's cooperation probability. In this example we assume that Y uses only a single cooperation probability, i.e. she does not use the information from the last round to make her decision.

Social dilemma games have served as a paradigm in theoretical research on the evolution of cooperativity. Being particularly well understood, they provide a testbed for investigating human deviations from optimal decision behaviour. In the prisoner's dilemma two players, named X and Y, play a game in which they can either cooperate or defect. In the case of mutual cooperation both get a reward rCC; if one player defects and the other cooperates, the defector gets an even larger reward rCD while the cooperator gets a lower reward rDC. If both players defect, both get a reward rDD. These rewards must satisfy two inequalities: rCD > rCC > rDD > rDC leads to a Nash equilibrium of mutual defection, while 2 rCC > rDC + rCD guarantees that mutual cooperation has the best global outcome. In the following we use the standard values defined in table 1.

Press and Dyson [11] recently introduced a new class of strategies for the prisoner's dilemma, the so-called zero-determinant (ZD) strategies. ZD strategies are memory-one strategies, i.e. they only use the information from the last round to make their decision.
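To make the construction behind figure 1 concrete, the following minimal sketch computes the expected per-round reward of a Y who cooperates with a single fixed probability against a memory-one opponent, by solving for the stationary distribution of the Markov chain over joint moves (the same construction used later in the Methods). The payoff values are taken from table 1 and the memory-one probabilities from table 2 below; the code is illustrative, and all function and variable names are our own choices rather than the paper's.

```python
import numpy as np

# Payoffs to Y (in EUR) keyed by the joint move (x, y) = (X's move, Y's move); values from table 1.
R = {('C', 'C'): 0.3, ('C', 'D'): 0.5,
     ('D', 'C'): 0.0, ('D', 'D'): 0.1}

# Memory-one cooperation probabilities of X, keyed by the previous joint move (x', y');
# "strong extortion" and "strong generous" values from table 2.
ZD = {'strong extortion': {('C', 'C'): 0.692, ('C', 'D'): 0.000,
                           ('D', 'C'): 0.538, ('D', 'D'): 0.000},
      'strong generous':  {('C', 'C'): 1.000, ('C', 'D'): 0.182,
                           ('D', 'C'): 1.000, ('D', 'D'): 0.364}}

STATES = [('C', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'D')]

def stationary(pX, pY):
    """Stationary distribution over joint moves when Y cooperates with a fixed probability pY."""
    M = np.zeros((4, 4))
    for j, (xp, yp) in enumerate(STATES):        # previous joint move (x', y')
        for i, (x, y) in enumerate(STATES):      # current joint move (x, y)
            px = pX[(xp, yp)] if x == 'C' else 1.0 - pX[(xp, yp)]
            py = pY if y == 'C' else 1.0 - pY
            M[i, j] = px * py
    vals, vecs = np.linalg.eig(M)                # column-stochastic matrix: eigenvalue 1 exists
    s = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return s / s.sum()

for name, pX in ZD.items():
    expected = [sum(w * R[st] for w, st in zip(stationary(pX, p), STATES))
                for p in np.linspace(0.0, 1.0, 11)]
    print(name, np.round(expected, 3))
```

Sweeping the fixed cooperation probability from 0 to 1 reproduces the qualitative shape of figure 1, with Y's expected reward maximal at full cooperation for both strategies [11].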


Table 1: Definition of the rewards r. Computer: X, human: Y. Each cell lists Y's payoff / X's payoff in €.

                          X cooperates (x = C)       X defects (x = D)
  Y cooperates (y = C)    0.3 (rCC) / 0.3 (rCC)      0.0 (rDC) / 0.5 (rCD)
  Y defects (y = D)       0.5 (rCD) / 0.0 (rDC)      0.1 (rDD) / 0.1 (rDD)

It can be shown mathematically that longer-memory strategies cannot have any advantage over memory-one strategies. ZD strategies have two further characteristics: (i) they force a linear relationship between the rewards of the players; (ii) the best strategy for the co-player, meaning the strategy with the highest reward, is total cooperation. In this paper we concentrate on two types of ZD strategies: the extortion strategy, which dominates any adaptive opponent [11], and the generous strategy, which does not exceed the reward of the co-player [12]. Figure 1 shows the expected reward of a player (Y) having only a single fixed cooperation probability playing against the extortion or the generous strategy. For this very simple strategy of Y, but also for all other possible strategies, total cooperation always leads to the maximum reward [11].

Recently, an elegant experiment by Hilbe et al. demonstrated a particularly striking lack of cooperation in humans [5] when repeatedly playing prisoner's dilemma games against zero-determinant strategies that extort their opponents [11]. As an explanation for this substantially sub-optimal behaviour a desire for punishment was offered. The experiment also revealed a large diversity of individual cooperativities, sometimes leading to bimodal distributions after a series of moves, where some participants ended up always defecting while others always cooperated [5, supplement, Fig. 1]. Obviously, the evolution of choice preferences during the course of a game can lead to opposite behaviours in different individuals. We wondered if the coevolution of learning in the human player in interaction with the opponent's strategy could yield a dynamics that explains this spread of individual behaviours.

Since the pioneering work of Pavlov more than a hundred years ago, research into reward-dependent learning of choice preferences has brought about a range of hypothetical learning rules based on success-dependent reinforcement of behaviour. It turned out that human as well as animal behaviours are far from trivial even in very simple choice situations. For instance, the phenomenology of classical conditioning can be explained only partially by the famous Rescorla-Wagner learning rule [9]. Therefore, more sophisticated algorithms for reinforcement learning were introduced, among which TD-learning [13] is now a leading paradigm in behavioural brain research and the Bush-Mosteller rule [1] is frequently used in psychology and economics. All learning rules share the basic principle of reinforcing preferences for successful choices and/or weakening preferences for negatively rewarded decisions. Systematic biases, either inherent in the particular learning rule or in specific parameter settings, were frequently


used to explain systematic deviations from optimal behaviour. As a representative example we here only cite a recent careful study in which success-dependent learning effects in professional basketball players were explained by different learning rates for updating the respective preferences [10].

Some studies investigated the evolution of choice preferences in their interaction with the dynamic responses of the environment. An important example is the work of Herrnstein and co-workers [3], which shows that neither probability matching nor payoff maximization can explain human decision behaviour when the response of the environment is history dependent. To our knowledge, a first systematic analysis of the dynamics of choice preferences was performed by Izquierdo and co-workers [6], who used the Bush-Mosteller rule to investigate the flow of choice preferences in social dilemma games. While their method is interesting from a theoretical perspective, the concrete predictions of this study are difficult to test, since this would require knowledge of the strategies used by both human players, which is not available. The case considered by Hilbe et al. [5] is far more useful in this respect. These elegant experiments, where humans played the prisoner's dilemma game against fixed zero-determinant strategies, are analytically transparent [11]. We use the results from these experiments as a benchmark for investigating the evolution of individual choice preferences in interaction with a dynamic environment.

Results

We use a stochastic model of decision behaviour with probabilities P for choice preferences that can depend on the previous decisions of player and opponent. The choice preferences are adapted according to the success of behaviour (see Methods) by a simple learning rule that is a linearisation of the Bush-Mosteller algorithm [1]. When applied to opponents employing zero-determinant strategies, the model is able to reproduce most details of the experiment by Hilbe et al. [5], even quantitatively, with a single parameter setting.

Figure 2 compares the fraction of cooperating subjects during the course of the game for the extortion and generous treatments. In the experiment [5] the humans start with a cooperation rate of 30-40 %. In the generous treatment the subjects behave "rationally", i.e. they optimise their strategy towards cooperation. At the end they have a cooperation probability of 70-80 %, which leads to a mean payoff of 0.27 €, not far from the best possible payoff of 0.30 €. Against extortion strategies humans behaved differently: they stayed at a very low cooperation probability (30-40 %) and only obtained a mean reward of 0.15 €, which is quite far from the maximum mean payoff of 0.22 €. The simulation leads to similar behaviour. In the generous case the cooperation probability starts at 50 % and goes up to 60 %. In the extortion case the agents also start with a cooperation probability of 50 % but then decrease it during the course of the game and end with a probability of ≈ 25 %. In both cases the simulation shows the same trend as the experiment: high cooperation probability in the generous case and low cooperation probability in the extortion case.



Figure 2: Fraction of cooperating subjects during the course of the game in the experiment by Hilbe et al. and in simulations. The errors are 95 % confidence intervals. For the calculation of the errors in the simulation we assumed that we had used as many agents as humans taking part in the experiment, although we actually used N = 10201 agents. Left part of the figure taken from [5].

Figure 3 compares the behaviour of the model after 60 rounds with the results from Hilbe et al. [5]. In the strong extortion case the results of the model match the experimental data; the peak at zero has nearly the same height. The generous case matches qualitatively: the large peak at ten appears in both the simulation and the experiment, but in the simulation this peak is smaller than in the experiment.

To understand the dynamics of the system we calculate the expectation value of equation (1) (see Methods) and display it as a vector field in preference space, as in Izquierdo et al. [6]. Figure 4 shows all the unstable and stable fixed points and the corresponding basins of attraction. There are four unstable fixed points and three stable manifolds with basins of attraction that are delimited by the nullclines.

In the strong generous case the fixed point in the middle of the preference space at (0.64, 0.64) is the only fixed point which is not caused by the boundary. It determines an edge of the basin of attraction of the stable fixed point at the lower left corner at (0, 0), together with the fixed point at the left boundary at (0, 0.675) and the one at the lower boundary at (0.66, 0). This stable fixed point is the state of permanent defection. The basin of attraction of the stable fixed point at the lower right corner at (1, 0) is formed by the central and the left fixed point and the fixed point at the right boundary at (1, 0.64). At this fixed point the agents always take the opposite action to the one the opponent took in the last round, so we can call this behaviour "anti-Tit-for-Tat".



Figure 3: Cooperative acts during the last ten rounds. Left part of the figure taken from [5].

The stable manifold most relevant for the experiment is the upper one at p^Y_C = 1. The corresponding basin of attraction is formed by the left, central and right unstable fixed points. On the one hand, the basin of attraction of this manifold has the largest overlap with the initial conditions (see Methods), so most of the agents will move up towards this stable manifold. On the other hand, agents on this manifold always cooperate if the opponent has cooperated in the last round, and because generous strategies behave in the same way (see table 2) this is a state of permanent mutual cooperation. This explains why nearly every human in the experiment and the largest fraction of agents in the simulation cooperated permanently at the end of the game (see figure 3).

For the strong extortion case the fixed points are similar to those of the strong generous case. However, the central unstable fixed point at (0.69, 0.67) together with the upper (0.68, 1), the lower (0.72, 0) and the right (1, 0.66) unstable fixed points now divide the space into three distinct basins of attraction.



Figure 4: Flow in the strong extortion and strong generous cases with the unstable fixed points (black dots), the stable manifolds (red dots and lines) and the corresponding basins of attraction (black lines). The green curves are examples of the movements of single agents with starting point (0.6, 0.65) over 500 rounds.

Table 2: Cooperation probabilities of the ZD strategies. p^X_CD is the cooperation probability of the ZD strategy if the ZD strategy has cooperated and the opponent has defected in the previous round.

  Strategy            p^X_CC   p^X_CD   p^X_DC   p^X_DD
  strong extortion    0.692    0.000    0.538    0.000
  strong generous     1.000    0.182    1.000    0.364

All agents which are initialized in the upper right basin tend to go to the stable fixed point at (1, 1) of permanent cooperation. In contrast, the agents which start in the lower right basin move towards the anti-Tit-for-Tat fixed point at (1, 0). All agents who start in the left basin go to the large stable manifold at p^Y_D = 0, which is the most important one for understanding the experimental results. On the one hand this basin encompasses the initialization area, and on the other hand agents on this manifold always defect if the opponent has defected in the previous round. Since extortion strategies do the same (see table 2), this leads to a state of permanent mutual defection. This explains why the largest fraction of subjects in the experiment and of agents in the simulation prefer defection at the end of the game (see figure 3).


While the mean behaviour of the agents depends strongly on the initial conditions, individual trajectories can leave their basins of attraction because of the noise (green curves in figure 4). We suspect this to be the source of the differences between the experimental and numerical results. While we assumed a very simple initialisation area, it might be more complex in reality. In particular, variations in the shape and the density of the initial area can lead to strong changes in the dynamics of the model.

We developed a method to estimate the flow from experimental data, which allows hypothetical learning rules to be checked (see Methods). We tested this method with computer-generated data of 500 agents in a game of 200 rounds, which leads to a good match with the analytic flow (figure 5).

Discussion

The dynamics of choice preferences was analysed for a simple model of learning in subjects playing iterated prisoner's dilemma games against zero-determinant strategies. The flow in preference space exhibits characteristic basins of attraction: depending on their initial readiness to cooperate, after many rounds of the game players tend to either always cooperate or always defect. Thereby small individual differences in cooperation preferences may lead to opposite behaviours in the long run. While the insight that learning can lead to extreme behaviours is not new [6, 8], the general method introduced here allows one to determine the potential influence of the dynamics on systematic deviations from optimal behaviour for all environments that depend on few past moves. For instance, we suspect that many of the systematic deviations from both utility maximization and matching observed experimentally in [3] can be explained by our model.

When playing against opponents with zero-determinant strategies, a longer memory beyond the last move cannot improve the payoff [11], an intriguing fact which makes these games a benchmark for investigating the joint dynamics of learning and environment. With a plausible initial distribution of cooperativities the model was able to explain the full phenomenology of the population data gathered by Hilbe et al. [5]. In particular, it reproduces the average dynamics of cooperation and explains the wide distributions of cooperativity among individual players.

Our results are robust with respect to many details of the learning rule. As a test we applied the full Bush-Mosteller rule [1] and obtained similar results when the aspiration parameter was small. The aspiration is an offset to which the received payoff is compared; only the difference is used for reinforcement of the preferences. Since the subjects in the experiment of Hilbe et al. were paid according to the absolute payoff of the prisoner's dilemma game and had no explicit costs, setting this parameter to zero is a plausible choice. If, however, the subjects had to compare the received rewards to some explicit costs, our model would predict the emergence of stable fixed points in preference space located at intermediate values of cooperation. This is a strong, experimentally testable prediction of our approach.
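For concreteness, one common formulation of an aspiration-based Bush-Mosteller update is sketched below (in the spirit of [1, 6, 8]); the exact variant used for the robustness test is not spelled out in the paper, so this particular form, like the parameter names, is an assumption. Dropping the saturation factors (1 − p) and p and setting the aspiration to zero yields a reinforcement proportional to the raw payoff, i.e. essentially the linearised rule (1) used in this paper.

```python
def bush_mosteller_update(p, chose_cooperation, payoff, aspiration=0.0, rate=0.025, max_dev=0.5):
    """One common aspiration-based Bush-Mosteller update (an assumed variant, cf. [1, 6, 8]).
    p is the cooperation preference being reinforced; max_dev normalises the stimulus,
    assuming payoffs deviate from the aspiration by at most 0.5."""
    s = (payoff - aspiration) / max_dev            # stimulus in [-1, 1]
    if chose_cooperation:
        return p + rate * s * (1.0 - p) if s >= 0 else p + rate * s * p
    return p - rate * s * p if s >= 0 else p - rate * s * (1.0 - p)
```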


In principle, the flow itself can also be estimated from experimental data. Using a naive method, flow estimation in simulations (see Methods) became significant with 500 players making 200 moves. This demonstrates that corresponding experiments are within reach. Given that the learning rule is known, a Bayesian approach would also allow a rough estimate of the cooperativities of individual subjects and thereby enable predictions of individual behaviour during the course of the game. Because the method of determining the flow from groups of players can be implemented for any learning rule, it can in principle also be used for testing the characteristics of learning that are implemented in selected groups of human subjects. We speculate that it would thereby become possible to investigate the relation of pathological behaviour (as e.g. in addiction) to systematic aberrations of learning.

Taken together, our results suggest a fundamental, previously overlooked role of the joint dynamics of learning and environment in explaining experimentally observed deviations from optimal decision behaviour.

Methods

Model

To explain the human (Y) behaviour in a repeated prisoner's dilemma against ZD strategies we assume a reactive strategy, which only depends on the co-player's (X) behaviour in the previous round and not on Y's own behaviour. If X has cooperated (C) in the last round, Y will cooperate in the current round with probability p^Y_C, and if X has defected (D), Y will cooperate with probability p^Y_D. These probabilities are not constant but change during the course of the game. If Y cooperates, her cooperation probability increases proportionally to her reward, and if she defects, her cooperation probability decreases proportionally to her reward. Let p^Y_{k,t} be the cooperation probability at time t with k ∈ {C, D}, so

    p^Y_{k,t+1} = p^Y_{k,t} + ε σ_y r_{xy} δ_{x′k},    respectively    Δp^Y_k ∝ σ_y r_{xy} δ_{x′k}.    (1)

ε is the learning rate and σ_y is

    σ_y = +1 for y = C,    σ_y = −1 for y = D.    (2)

r_{xy} is Y's reward in the current round, δ is the Kronecker delta and x′ ∈ {C, D} is the behaviour of X in the previous round. Because p^Y_k is a probability, it is constrained to p^Y_k ∈ [0, 1]; e.g. if p^Y_k would become smaller than zero, it is set to zero. For the initial conditions we assume that human beings rather cooperate if their opponent has cooperated as well, and rather defect if the opponent has defected as well. So we choose a uniform distribution with the borders 0 < p^Y_D < 0.5 and 0.5 < p^Y_C < 1. In the first round we choose one of the two cooperation probabilities at random. For our simulations we choose a learning rate of ε = 0.025 and N = 10201 players.
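A minimal simulation sketch of this learning rule, playing a single agent against the strong extortion strategy, could look as follows. The payoffs follow table 1 and the strategy probabilities table 2; how the very first round is handled on the opponent's side is not fully specified in the text, so the random fictitious initial memory below is an assumption, as are all function and variable names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Payoffs to Y (table 1) and memory-one probabilities of X (table 2, strong extortion).
R = {('C', 'C'): 0.3, ('C', 'D'): 0.5, ('D', 'C'): 0.0, ('D', 'D'): 0.1}            # keys: (x, y)
PX = {('C', 'C'): 0.692, ('C', 'D'): 0.000, ('D', 'C'): 0.538, ('D', 'D'): 0.000}   # keys: (x', y')

EPS = 0.025  # learning rate used in the simulations

def play(rounds=60):
    # Initial preferences as in the Methods: 0.5 < pC < 1 and 0 < pD < 0.5.
    p = {'C': rng.uniform(0.5, 1.0), 'D': rng.uniform(0.0, 0.5)}
    # The first round has no history; drawing a random fictitious previous move is an assumption.
    x_prev, y_prev = rng.choice(['C', 'D']), rng.choice(['C', 'D'])
    cooperated = []
    for _ in range(rounds):
        y = 'C' if rng.random() < p[x_prev] else 'D'                 # Y reacts to X's previous move
        x = 'C' if rng.random() < PX[(x_prev, y_prev)] else 'D'      # X reacts to the previous joint move
        r = R[(x, y)]
        sigma = 1.0 if y == 'C' else -1.0
        # Learning rule (1): only the preference keyed by X's previous move is updated, then clipped.
        p[x_prev] = min(1.0, max(0.0, p[x_prev] + EPS * sigma * r))
        cooperated.append(y == 'C')
        x_prev, y_prev = x, y
    return np.mean(cooperated), p

print(play())
```

Averaging the returned cooperation fractions over many such agents gives curves of the kind shown in the right panel of figure 2.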


Mean Field Theory

To understand the results of the model, we calculate the expectation value F_k = <Δp^Y_k> of equation (1). σ_y only depends on Y's behaviour in the current round, the reward r_{xy} depends on both players' behaviour in the current round, and the Kronecker delta δ_{x′k} depends on X's behaviour in the previous round. The behaviour of X and Y in the current round is called x and y, respectively, and in the previous round x′ and y′. F_k only depends on x, y and x′. To calculate this expectation value we need the joint probability π(x, y, x′):

    F_k ∝ Σ_{x,y,x′ ∈ {C,D}} π(x, y, x′) σ_y r_{xy} δ_{x′k}.    (3)

To calculate π(x, y, x′) we sum over the joint probabilities which include the full information of the last and current round:

    π(x, y, x′) = Σ_{y′ ∈ {C,D}} π(x, y, x′, y′) = Σ_{y′ ∈ {C,D}} π(x, y | x′, y′) S_{x′y′}.    (4)

S_{x′y′} is the probability of the occurrence of the state (x′, y′). A repeated prisoner's dilemma can be seen as a Markov process. The only problem is that Markov processes assume constant transition probabilities, which is not the case in our model because of the learning rule. We now assume that this process is a quasistatic process, which allows us to use the steady-state probabilities of the Markov process. These state probabilities can easily be calculated by solving the following eigenvalue equation (states ordered (x, y) ∈ {CC, CD, DC, DD}, columns indexed by the previous state (x′, y′) in the same order):

    S = M S,    M =
      ( p^X_CC p^Y_C              p^X_CD p^Y_C              p^X_DC p^Y_D              p^X_DD p^Y_D            )
      ( p^X_CC (1 − p^Y_C)        p^X_CD (1 − p^Y_C)        p^X_DC (1 − p^Y_D)        p^X_DD (1 − p^Y_D)      )    (5)
      ( (1 − p^X_CC) p^Y_C        (1 − p^X_CD) p^Y_C        (1 − p^X_DC) p^Y_D        (1 − p^X_DD) p^Y_D      )
      ( (1 − p^X_CC)(1 − p^Y_C)   (1 − p^X_CD)(1 − p^Y_C)   (1 − p^X_DC)(1 − p^Y_D)   (1 − p^X_DD)(1 − p^Y_D) )

where p^X_CD is the cooperation probability of the ZD strategy if the ZD strategy has cooperated and the human opponent has defected in the previous round.

Quasistatic processes in general are processes which have an infinitely long break after each parameter change, during which the system can equilibrate, so that there is a permanent equilibrium. In our case of a repeated prisoner's dilemma this means that after every round with a parameter change we assume an infinite number of rounds without parameter changes, so that the system can equilibrate.

The probability π(x, y | x′, y′) is the probability that the state (x, y) occurs in the current round if the state (x′, y′) occurred in the previous round. This probability can


be calculated with the known cooperation probabilities of X and Y:

    π(C, C | x′, y′) = p^X_{x′y′} p^Y_{x′}    (6)
    π(D, C | x′, y′) = (1 − p^X_{x′y′}) p^Y_{x′}    (7)
    π(C, D | x′, y′) = p^X_{x′y′} (1 − p^Y_{x′})    (8)
    π(D, D | x′, y′) = (1 − p^X_{x′y′}) (1 − p^Y_{x′})    (9)
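The mean-field flow F = (F_D, F_C) defined by equations (3)-(9), which underlies the vector fields in figure 4, can be evaluated numerically as in the following sketch. The stationary distribution S is obtained from the eigenvector of the transition matrix for eigenvalue one; the function names and overall structure are illustrative, not taken from the paper.

```python
import numpy as np

STATES = [('C', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'D')]                      # joint moves (x, y)
R = {('C', 'C'): 0.3, ('C', 'D'): 0.5, ('D', 'C'): 0.0, ('D', 'D'): 0.1}       # Y's payoffs (table 1)
PX = {'strong extortion': {('C', 'C'): 0.692, ('C', 'D'): 0.000, ('D', 'C'): 0.538, ('D', 'D'): 0.000},
      'strong generous':  {('C', 'C'): 1.000, ('C', 'D'): 0.182, ('D', 'C'): 1.000, ('D', 'D'): 0.364}}

def transition_matrix(pX, pY):
    """Entries pi(x, y | x', y') from equations (6)-(9); columns index the previous state."""
    M = np.zeros((4, 4))
    for j, (xp, yp) in enumerate(STATES):
        for i, (x, y) in enumerate(STATES):
            px = pX[(xp, yp)] if x == 'C' else 1.0 - pX[(xp, yp)]
            py = pY[xp] if y == 'C' else 1.0 - pY[xp]        # Y conditions only on X's previous move
            M[i, j] = px * py
    return M

def flow(pX, pYC, pYD):
    """Mean-field drift (F_D, F_C) of the preferences, up to the learning rate, equations (3)-(5)."""
    pY = {'C': pYC, 'D': pYD}
    M = transition_matrix(pX, pY)
    vals, vecs = np.linalg.eig(M)
    S = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    S = S / S.sum()                                          # quasistatic state probabilities S_{x'y'}
    F = {'C': 0.0, 'D': 0.0}
    for k in ('C', 'D'):
        for j, (xp, yp) in enumerate(STATES):
            if xp != k:                                      # Kronecker delta over X's previous move
                continue
            for i, (x, y) in enumerate(STATES):
                sigma = 1.0 if y == 'C' else -1.0
                F[k] += M[i, j] * S[j] * sigma * R[(x, y)]
    return F['D'], F['C']

# Example: drift at one point of the preference plane under the strong extortion strategy.
print(flow(PX['strong extortion'], pYC=0.65, pYD=0.60))
```

Evaluating this drift on a grid of (p^Y_D, p^Y_C) values and locating its zeros gives the fixed points and nullclines shown in figure 4.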

Measurement of the Flow

To check whether the learning rule is correct or not we want to estimate the flow from experimental data. Therefore we divide the sequence into several parts (which can overlap) and calculate the mean values of p^Y_C and p^Y_D in each part:

    p̄^Y_C = n_{CC′} / (n_{CC′} + n_{DC′})    (10)
    p̄^Y_D = n_{CD′} / (n_{CD′} + n_{DD′})    (11)

Here, e.g., n_{CD′} is the number of times the human subject cooperates in the current round given that the computer defected in the previous round. If the sequences are long enough we can assume that the differences between the mean values are caused by the change of the cooperation probabilities and not by noise. In the case that an agent only cooperates or only defects it is only possible to estimate one of the two probabilities. In this case we assume that the other cooperation probability is unchanged and we use the probability of the previous part. If this occurs at the beginning of the game we take the next probability instead.

We tested this very simple method with computer-generated data of 500 agents in a game of 200 rounds. This is a large but realistic experimental number of rounds: if the human subjects have e.g. 15 seconds for each decision, which is quite a lot for pressing one of two buttons, they need 50 minutes. An experiment with 500 human subjects is a lot of work, but every subject plays independently against a computer. The humans do not have to play at the same time, so the amount of work increases linearly with the number of human subjects. The results of this measurement are shown in figure 5.

To determine the quality of this method we calculate the coherence of the analytic flow F_a and the measured flow F_n:

    c = <F_a · F_n>² / (<F_a · F_a> <F_n · F_n>)    (12)

The largest coherences were found for a part length of 110 rounds and an overlap between the parts of 90 % (so part one goes from round 1 to 110, part two from round 12 to 121, etc.). The coherences for the shown examples are 0.40 and 0.53, which are typical for these strategies. Especially at the border the analytic and the measured flow do not match well. This is caused by the boundary condition, which is applied in the computer-generated data but not in the calculation of the analytic flow.
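A sketch of this estimation procedure is given below, using the window length and 90 % overlap mentioned above; the flow estimate is then obtained from the differences between consecutive window estimates. The rule of taking the next probability when a preference cannot be estimated at the very beginning is omitted here for brevity, and all names are illustrative.

```python
import numpy as np

def estimate_preferences(x_moves, y_moves, part_len=110, step=11):
    """Sliding-window estimates of (pYD, pYC) from one subject's move sequence, equations (10)-(11).
    x_moves, y_moves hold 'C'/'D' for the computer and the human; step = 11 gives the 90 % overlap."""
    estimates, last = [], (np.nan, np.nan)
    for start in range(0, len(y_moves) - part_len + 1, step):
        nCC = nDC = nCD = nDD = 0
        for t in range(start + 1, start + part_len):          # condition on X's move in round t - 1
            if x_moves[t - 1] == 'C':
                nCC += y_moves[t] == 'C'
                nDC += y_moves[t] == 'D'
            else:
                nCD += y_moves[t] == 'C'
                nDD += y_moves[t] == 'D'
        pC = nCC / (nCC + nDC) if nCC + nDC else last[1]      # reuse the previous part if undefined
        pD = nCD / (nCD + nDD) if nCD + nDD else last[0]
        last = (pD, pC)
        estimates.append(last)
    return estimates                                          # consecutive differences estimate the flow

def coherence(Fa, Fn):
    """Coherence between the analytic and the measured flow fields, equation (12)."""
    Fa, Fn = np.asarray(Fa, float).ravel(), np.asarray(Fn, float).ravel()
    return np.dot(Fa, Fn) ** 2 / (np.dot(Fa, Fa) * np.dot(Fn, Fn))
```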



Figure 5: Measurement of the flow for the strong extortion (coherence = 0.53) and strong generous (coherence = 0.40) strategies, comparing the analytic and the measured flow for computer-generated data sequences of 500 agents and 200 rounds.

The coherences of the extortion cases are significantly higher than those of the generous cases. This is caused by the left part of the flow (p^Y_D = 0 … 0.5), where nearly all arrows are parallel, so that errors in p^Y_C do not lead to a lower coherence.

References

[1] Robert R. Bush and Frederick Mosteller. A Stochastic Model with Applications to Learning. The Annals of Mathematical Statistics, 24(4):559–585, 1953.

[2] Joseph Henrich, Robert Boyd, Samuel Bowles, Colin Camerer, Ernst Fehr, Herbert Gintis, and Richard McElreath. In Search of Homo Economicus: Behavioral Experiments in 15 Small-Scale Societies. The American Economic Review, 91(2), 2001.

[3] R. J. Herrnstein, George F. Loewenstein, Drazen Prelec, and William Vaughan. Utility Maximization and Melioration: Internalities in Individual Choice. Journal of Behavioral Decision Making, 6:149–185, 1993.

[4] Christian Hilbe, Martin A. Nowak, and Karl Sigmund. Evolution of extortion in Iterated Prisoner's Dilemma games. Proceedings of the National Academy of Sciences of the United States of America, 110(17):6913–6918, April 2013.


[5] Christian Hilbe, Torsten Röhl, and Manfred Milinski. Extortion subdues human players but is finally punished in the prisoner's dilemma. Nature Communications, 5:3976, January 2014.

[6] Segismundo S. Izquierdo, Luis R. Izquierdo, and Nicholas M. Gotts. Reinforcement Learning Dynamics in Social Dilemmas. Journal of Artificial Societies and Social Simulation, 11(2), 2008.

[7] Daniel Kahneman. Maps of Bounded Rationality: Psychology for Behavioral Economics. The American Economic Review, 93(5):1449–1475, 2003.

[8] Michael W. Macy and Andreas Flache. Learning dynamics in social dilemmas. Proceedings of the National Academy of Sciences of the United States of America, 99(Suppl.):7229–7236, 2002.

[9] R. R. Miller, R. C. Barnet, and N. J. Grahame. Assessment of the Rescorla-Wagner model. Psychological Bulletin, 117(3):363–386, 1995.

[10] Tal Neiman and Yonatan Loewenstein. Reinforcement learning in professional basketball players. Nature Communications, 2:569, 2011.

[11] William H. Press and Freeman J. Dyson. Iterated Prisoner's Dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences of the United States of America, 109(26), June 2012.

[12] Alexander J. Stewart and Joshua B. Plotkin. Extortion and cooperation in the Prisoner's Dilemma. Proceedings of the National Academy of Sciences of the United States of America, 109(26):10134–10135, June 2012.

[13] Richard Sutton and Andrew Barto. Time-Derivative Models of Pavlovian Reinforcement. In Michael Gabriel and John Moore, editors, Learning and Computational Neuroscience: Foundations of Adaptive Networks, chapter 12, pages 497–537. MIT Press, Cambridge, MA, 1990.
