
Neural Networks 19 (2006) 1091–1105 www.elsevier.com/locate/neunet doi:10.1016/j.neunet.2006.05.034

2006 Special Issue

Computational algorithms and neuronal network models underlying decision processes

Yutaka Sakai a, Hiroshi Okamoto b, Tomoki Fukai c,∗

a Department of Intelligent Information Systems, Tamagawa University, Tamagawa Gakuen 6-1-1, Machida, Tokyo 194-8610, Japan
b Corporate Research Laboratory, Fuji Xerox Co., Ltd., 430 Sakai, Nakai-machi, Ashigarakami-gun, Kanagawa 259-0157, Japan
c Laboratory for Neural Circuit Theory, RIKEN Brain Science Institute, Hirosawa 2-1, Wako, Saitama 351-0198, Japan

Received 6 December 2005; accepted 24 May 2006

Abstract

Animals and humans often encounter situations in which they must choose behavioral responses to be made in the near or distant future. Such a decision is made through continuous and bidirectional interactions between the environment surrounding the brain and the brain's internal state or dynamical processes. Decision making therefore provides a unique field of research for studying information processing by the brain, a biological system open to information exchange with the external world. To make a decision, the brain must analyze pieces of information given externally, past experiences in similar situations, possible behavioral responses, and the predicted outcomes of the individual responses. In this article, we review results of recent experimental and theoretical studies of the neuronal substrates and computational algorithms underlying decision processes.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Decision making; Recurrent network; Neural integrator; Bistable neurons; Reward; Reinforcement learning; TD error; Matching behavior

1. Introduction

Our behavior comprises sequences of actions driven by inputs from various information sources to the brain areas responsible for decision processes. The primary sources are the sensory inputs from the external world, while others may be internally represented in the brain, as memories of sensory or behavioral experiences. While an easy decision may only require the analysis of sensory stimuli, a difficult one can be made only after careful evaluation of the expected outcomes of different strategies for generating behavioral responses. Depending on the specific demands of decision making, the underlying decision process requires various neural computations. Any decision process, however, seems to share the common feature that the information sources for the decision exhibit a certain degree of uncertainty.



While decision making has long been studied in the psychological (Davison & McCarthy, 1987; Herrnstein, 1997; Herrnstein & Heyman, 1979; Mazur, 1981; Sakagami, Hursh, Christensen, & Silberberg, 1989), economic (Rachlin, Green, Kagel, & Battalio, 1976) and mathematical sciences (Heyman, 1979; Sutton & Barto, 1998), research on its neural substrates has made significant progress only recently (Barraclough, Conroy, & Lee, 2004; Breiter, Aharon, Kahneman, Dale, & Shizgal, 2001; Haruno et al., 2004; Knutson, Adams, Fong, & Hommer, 2001; McClure, Berns, & Montague, 2003; Montague & Berns, 2002; Morris, Arkadir, Nevet, Vaadia, & Bergman, 2004; Platt & Glimcher, 1999; Samejima, Ueda, Doya, & Kimura, 2005; Schultz, Dayan, & Montague, 1997; Sugrue, Corrado, & Newsome, 2004). In the present article, we review the results of recent studies of two different, but mutually related, aspects of decision making processes.

In the first part, we discuss sensory discrimination tasks and explain why temporal integration is crucial for decision making (Gold & Shadlen, 2000; Shadlen & Newsome, 2001). When a faint sensory stimulus is presented to animals, they usually require a certain amount of time before they can correctly recognize the stimulus.



For instance, it is crucial for the survival of small animals moving through a field of grass to detect predators hiding behind the grass as early as possible. In this cognitive task, faint differences between a visual object (a predator) and the background (grass) must be perceived, which seems possible only after the brain has collected sufficient evidence for the presence of the visual object. We briefly describe how a Bayesian framework accounts for the accumulation of pieces of information, each of which gives either positive evidence for or negative evidence against the discrimination of a sensory stimulus. Such evidence for making a decision is considered to be accumulated by frontal or parietal neurons that may function as neural integrators. For instance, in the sensory discrimination task, it is considered that the evidence for a particular stimulus, and presumably also the evidence against it, is formed through the temporal integration of sensory inputs (Gold & Shadlen, 2000; Shadlen & Newsome, 2001). Such temporal integration may also underlie the graded neuronal activity that typically appears in the delay period of working memory tasks (Brody, Hernandez, Zainos, & Romo, 2003; Brody, Romo, & Kepecs, 2003; Funahashi, Bruce, & Goldman-Rakic, 1989; Goldman-Rakic, 1995; Romo, Brody, Hernandez, & Lemus, 1999; Takeda & Funahashi, 2002). The graded, climbing or descending, activity is considered to represent the prospective and/or retrospective information required for performing a cognitive task. Temporal integration is therefore a fundamental brain function essential for a broad range of cognitive tasks. Several models of temporal integration have been proposed based on intracellular (Durstewitz, 2003; Goldman, Levine, Major, Tank, & Seung, 2003; Loewenstein & Sompolinsky, 2003; Teramae & Fukai, 2005) or network-level (Fukai, Kitano, & Okamoto, 2003; Koulakov, Raghavachari, Kepecs, & Lisman, 2002; Miller, Brody, Romo, & Wang, 2003; Rosen, 1972; Seung, 1996; Seung, Lee, Reis, & Tank, 2000; Shapiro & Wearden, 2002; Wang, 2002) computations, yielding similar or dissimilar predictions about the activity patterns of the encoding neurons. We discuss these attempts to model the neural mechanisms of temporal integration.

In the second part, we discuss decision making from the perspective of computational theories of reward-based decision making. When a subject must choose among options that are rewarded according to a probabilistic rule or schedule, the subject's choices may follow an empirical rule such as matching behavior (Davison & McCarthy, 1987; DeCarlo, 1985; Herrnstein, 1961, 1997; Herrnstein & Heyman, 1979; Mazur, 1981; Sugrue et al., 2004; Vaughan, 1981). Choice behavior is a typical example of a decision process, and has been studied by means of temporal difference (TD) learning in reinforcement learning theory (Sutton & Barto, 1998). Reinforcement learning was originally invented for solving difficult engineering problems that require predicting which choice of action sequence will maximize future rewards. TD learning was later shown to provide a good account of the reward-related activity of midbrain dopamine neurons (Daw & Touretzky, 2002; Montague, Dayan, & Sejnowski, 1996; Schultz, 2004; Schultz et al., 1997).

While the original mission of reinforcement learning was simply to maximize rewards, several studies have shown that a similar algorithm may account for a subject's selection of behaviors (Tanaka et al., 2004). We review the possible role of reinforcement learning in a subject's choice behavior.

2. Perceptual decision making

In this section, we discuss the neural mechanisms underlying perceptual decision making, in which the perception of sensory stimuli suffers a certain degree of uncertainty. One of the simplest tasks for perceptual decision making is the sensory discrimination task.

2.1. Sensory discrimination tasks

When the brain is required to discriminate an attribute of faint sensory stimuli, it needs to accumulate information about the stimuli before it reaches any decision. In a typical experiment to test such a cognitive ability, moving dots are presented to monkeys on a computer display in front of them (Gold & Shadlen, 2000; Shadlen & Newsome, 2001). While the majority of the dots move in random directions, some of them may exhibit coherent motion in the same direction. The monkeys have to indicate the perceived direction of the coherent motion by making a saccadic eye movement in that direction. As the fraction of coherently moving dots is reduced, the task becomes harder and the percentage of correct responses decreases.

What kind of decision rule must be obeyed in solving this task? Let h_L or h_R be the hypothesis that the coherent motion is in the leftward or the rightward direction, respectively, and let P(e|h_L) denote the conditional probability that evidence e (an observed visual stimulus consisting of moving dots) occurs under the condition that hypothesis h_L is true. Similarly, P(h_L|e) stands for the posterior probability that hypothesis h_L is true when evidence e is observed. Generally, the role of a sensory decoder is to reveal a likely hypothesis when some evidence is given. It is reasonable to consider h_L the likely hypothesis, that is, to answer "left", if P(h_L|e) > P(h_R|e). Bayesian decision theory utilizes the following basic relationships to convert the conditional probabilities (Dayan & Abbott, 2001):

P(h_L|e) = P(e|h_L)P(h_L)/P(e),    P(h_R|e) = P(e|h_R)P(h_R)/P(e),    (1a)

where P(h_{L,R}) is called the prior probability that hypothesis h_{L,R} is true. Then, we can rewrite the above criterion for choosing hypothesis h_L as (Gold & Shadlen, 2001)

P(e|h_L)P(h_L) > P(e|h_R)P(h_R).    (1b)
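As a minimal illustration of the decision rule in Eqs. (1a) and (1b) (not part of the original paper), the following Python sketch assumes Gaussian likelihoods for a scalar piece of motion evidence; the means, standard deviation, and priors are arbitrary values chosen only for the example.

import math

def gaussian_likelihood(e, mean, sigma=1.0):
    """P(e | h) under an assumed Gaussian model of the motion evidence."""
    return math.exp(-0.5 * ((e - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def decide(e, prior_L=0.5, prior_R=0.5, mean_L=-0.2, mean_R=0.2):
    """Answer 'left' or 'right' by comparing P(e|h_L)P(h_L) with P(e|h_R)P(h_R), Eq. (1b)."""
    score_L = gaussian_likelihood(e, mean_L) * prior_L
    score_R = gaussian_likelihood(e, mean_R) * prior_R
    return 'left' if score_L > score_R else 'right'

print(decide(-0.5))   # 'left'
print(decide(0.3))    # 'right'

Because P(e) is common to both posteriors, comparing the products of likelihood and prior is equivalent to comparing the posteriors themselves.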

The subject's performance in detecting coherent motion is generally influenced by the signal (coherently moving dots) to noise ratio. Some researchers regarded the fraction of coherently moving dots as this ratio (Burgi, Yuille, & Grzywacz, 2000),


whereas others considered that the spurious motion signals, which are generated by incorrect pairing of dots in consecutive frames of the motion display, set the threshold for signal detection (Barlow & Tripathy, 1997). Thus, there is some debate about what constitutes the signal-to-noise ratio in this task, or about what the evidence e and P(e|h_{L,R}) actually represent.

In many practical situations, each choice is followed by a positive or a negative reward (i.e., a cost). Bayesian decision theory provides a framework for making a decision in such cases as well. Let V_RL be the reward obtained by choosing a "left" answer when the correct answer is "right". We can define V_LL, V_LR and V_RR in a similar fashion. Then, the average loss expected for choosing a "left" answer is given as Loss(L) = (V_RR − V_RL)P(h_R|e). Similarly, Loss(R) = (V_LL − V_LR)P(h_L|e) represents the average loss expected for a "right" answer. It seems reasonable to answer "left" if Loss(L) < Loss(R). This inequality can be rewritten using Eq. (1a) as

(V_LL − V_LR)P(e|h_L)P(h_L) > (V_RR − V_RL)P(e|h_R)P(h_R).    (2)

In a realistic situation, a decision should be made based on evidence collected from multiple sources of information about the hypotheses. Denoting the multiple pieces of evidence by e_1, e_2, ..., e_m, and defining the likelihood ratios as LR(h_L, h_R|e_1) = P(e_1|h_L)/P(e_1|h_R) and so on, we can write down the condition for choosing h_L rather than h_R as

log LR(h_L, h_R|e_1) + log LR(h_L, h_R|e_2) + ··· + log LR(h_L, h_R|e_m) + log(P(h_L)/P(h_R)) + log((V_LL − V_LR)/(V_RR − V_RL)) > 0.    (3)

We note that the pieces of evidence must be mutually independent for their likelihood ratios to simply sum in the above inequality. The left-hand side of the inequality represents the sum of the relative information obtained for the likeliness of the two hypotheses from the set of evidence, and the relative information obtained from a priori knowledge about the hypotheses, rewards and costs. The expression in Eq. (3) implies that decision making in general relies on the accumulation, or temporal integration, of the (logarithmic) likelihood of each hypothesis. A psychophysical model called the "linear approach to threshold with ergodic rate" (LATER) model proposes that subjects make a decision when the accumulated evidence reaches a certain decision criterion (Reddi & Carpenter, 2000). In the random-dot displays, the collection of evidence may be performed separately for the "left" and "right" answers. By contrast, the drift diffusion model proposes that evidence accumulation is a stochastic process, typically a Wiener process with a drift force (Ratcliff, 2001). Subjects make a decision when a decision variable reaches either a positive or a negative criterion value. While both models can explain known results of psychophysical experiments on two-alternative decisions, it seems difficult to extend the diffusion model to cases where more than two alternatives exist.
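To make the accumulation idea concrete, here is a small Python sketch (not from the original paper) that sums the log-likelihood ratios of Eq. (3) for successive, independent pieces of evidence until a decision bound is reached, in the spirit of the accumulator and diffusion models discussed above; the Gaussian evidence model, bound, and parameter values are illustrative assumptions.

import random

def loglik_ratio(e, mean_L=-0.2, mean_R=0.2, sigma=1.0):
    """log[P(e|h_L)/P(e|h_R)] for the assumed Gaussian evidence model."""
    return (-(e - mean_L) ** 2 + (e - mean_R) ** 2) / (2.0 * sigma ** 2)

def accumulate_to_bound(true_mean, bound=3.0, prior_term=0.0, value_term=0.0):
    """Sum the log-likelihood ratios of successive samples until either bound is crossed."""
    decision_variable = prior_term + value_term   # log-prior and log-payoff terms of Eq. (3)
    steps = 0
    while abs(decision_variable) < bound:
        e = random.gauss(true_mean, 1.0)          # one independent piece of evidence
        decision_variable += loglik_ratio(e)
        steps += 1
    return ('left' if decision_variable > 0 else 'right'), steps

random.seed(0)
print(accumulate_to_bound(true_mean=0.2))   # rightward evidence: usually ('right', ...)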


Results of recent electrophysiological experiments in behaving monkeys have revealed that such decision variables are represented by the activities of parietal and/or prefrontal neurons. For instance, in an experiment in which monkeys were asked to make a saccade in one of two possible directions when the target direction was instructed by a visual cue (Schall & Hanes, 1998), the activity of some parietal neurons was correlated with the probability that a movement into the cell's response field would be instructed. In the sensory discrimination tasks mentioned previously, the responses of parietal neurons likely accumulated the evidence for sensory signals relevant to the monkey's judgment of motion direction.

2.2. Models of temporal integrator neurons and networks

Decision making relies crucially on the ability of neural systems to accumulate evidence for each hypothesis. The accumulation of evidence may be performed by the brain through temporal integration of an externally or internally driven input that is informative about the event of interest. Temporal integration has often been related to the gradually climbing or descending neuronal activity observed in various brain areas and cognitive tasks. Neural systems possessing multi-stable or continuous attractors are considered to play a crucial role in performing temporal integration (Brody, Romo et al., 2003). For instance, the oculomotor neurons of the goldfish encode the angular eye position, which can take continuous values, in their firing rates (Aksay, Baker, Seung, & Tank, 2003; Aksay, Gamkrelidze, Seung, Baker, & Tank, 2001; Pastor, De la Cruz, & Baker, 1994; Seung et al., 2000). To this end, these neurons integrate an input representing the angular velocity of the eye movement, and accordingly their firing rate increases while the eye is moving. When the eye stops at some position, the neurons exhibit persistent activity at the constant firing rate corresponding to the eye position. This temporal integration of angular velocity can be explained if the oculomotor system has multiple or continuous states that are stable during eye fixation. During an eye movement, the oculomotor system slides along these states. After the eye movement ceases, the oculomotor system settles into the state it has reached at the end of the movement. This stable state, or the firing rate of the oculomotor neurons in this state, represents the result of temporal integration of the past history of eye movements. Several neural mechanisms that produce stable continuous attractors or graded neuronal activity have been proposed; they are reviewed below.

Continuous attractor networks with finely tuned parameters

It is well known that recurrent networks of neurons with symmetric synaptic connections can acquire fixed-point attractors in their retrieval dynamics. Each fixed-point attractor, which corresponds to a local minimum of the 'energy landscape' and hence can be reached by the 'down-hill' retrieval dynamics, may represent a stable state with a constant firing rate. However, stable states given as local minima of the energy landscape are generally discrete, and modeling neural mechanisms that provide stable continuous attractors is not so easy. In the pioneering study by Seung et al., this was achieved by tuning parameters, such as synaptic weights or activation thresholds, in a recurrent network model (Seung, 1996; Seung et al., 2000).



To this end, consider an all-to-all connected network of N neurons. The ith neuron may discharge at a frequency of

r_i = A[I_i + B_i]_+,    (4)

where I_i stands for a recurrent input, and A and B_i are constants. The response function of the neuron is defined as [x]_+ = x if x ≥ 0, and [x]_+ = 0 otherwise. The gating variable s_i of the synapses made on other neurons by the ith neuron may be written as

s_i = f(r_i)    (i = 1, ..., N),    (5)

where f(x) (x > 0) is a monotonic function taking values between 0 and 1 (e.g., we may set f(x) = x/(x + a)). Denoting the recurrent input by I_i = Σ_{j=1}^{N} W_{ij} s_j in terms of the synaptic weights W_{ij}, and defining a new function F as F(x) ≡ f(A[x]_+) (see Fig. 1a), we obtain the following self-consistent equations for the gating variables:

s_i = F( Σ_{j=1}^{N} W_{ij} s_j + B_i )    (i = 1, ..., N).    (6)

If the synaptic couplings are given by a correlation matrix,

W_{ij} = ξ_i η_j,    (7)

we can derive from Eq. (6) an equation for the variable Ê defined as Ê = Σ_{i=1}^{N} η_i s_i:

Ê = Σ_{i=1}^{N} η_i F(ξ_i Ê + B_i).    (8)

On the right-hand side of this expression, the gains and thresholds of the outputs of the individual neurons are mutually different as functions of Ê (Fig. 1b). If the values of ξ_i, η_i and B_i are properly tuned so that the profile of Σ_{i=1}^{N} η_i F(ξ_i Ê + B_i) is almost linear as a function of Ê (Fig. 1c), the fixed points of the network dynamics are given by the intersections between Y = Σ_{i=1}^{N} η_i F(ξ_i Ê + B_i) and Y = Ê (Fig. 1d, filled circles). If the network size N becomes sufficiently large, these fixed points can be regarded as distributed almost continuously along the line. We note that each fixed point maintains its own attractor basin and hence is stable, and that the fixed-point attractors arranged on the line do not exhibit marginal stability as long as N is finite.

Fig. 1. Neural integrator circuit with fine parameter tuning. (a) Each model neuron shows a saturating response to a large input. (b) Owing to recurrent connections, the individual model neurons exhibit neuron-dependent gains and thresholds of their responses. (c) With fine tuning of the synaptic weights, the recurrent network can possess a continuous attractor in the input–output space. (d) A magnified part of the line attractor, showing that it consists of multiple stable and unstable fixed points (filled and open circles, respectively) along a line; a trajectory escapes from an unstable fixed point and falls into a stable fixed point.
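The fixed-point structure of Eq. (8) can be explored numerically. The sketch below (not from the original paper) uses arbitrary, untuned parameters, so it will typically locate only a few isolated crossings of Y = Ê; the point of the fine tuning in the model is precisely to make the self-consistency map hug the diagonal so that many such crossings line up quasi-continuously.

import numpy as np

N = 50
rng = np.random.default_rng(1)
xi  = rng.uniform(0.5, 1.5, N)        # illustrative values of xi_i
eta = rng.uniform(0.5, 1.5, N) / N    # illustrative values of eta_i
B   = rng.uniform(-1.0, 1.0, N)       # illustrative thresholds B_i
A, a = 10.0, 5.0                      # gain and saturation constant of f(x) = x/(x + a)

def F(x):
    """F(x) = f(A[x]_+) with f(x) = x/(x + a): the saturating response of Fig. 1a."""
    r = A * np.maximum(x, 0.0)
    return r / (r + a)

def G(E):
    """Right-hand side of Eq. (8): sum_i eta_i F(xi_i*E + B_i)."""
    return np.sum(eta * F(xi * E + B))

# Fixed points of Eq. (8) solve G(E) = E; locate them as sign changes of G(E) - E.
Es = np.linspace(0.0, 2.0, 2001)
h = np.array([G(E) - E for E in Es])
crossings = Es[:-1][np.sign(h[:-1]) != np.sign(h[1:])]
print("approximate fixed points of E-hat:", crossings)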

Robust temporal integration by a tuned network of bistable neurons

The above model of graded persistent activity demonstrates how a recurrent neural network of non-identical neurons can perform temporal integration. The performance of the model, however, relies crucially on the fine tuning of parameter values that achieves the fine arrangement of the intersections shown in Fig. 1c. Indeed, even an error of 1% in adjusting these values can easily destroy the continuous attractor. To solve this fine-tuning problem, Koulakov et al. (2002) proposed a recurrent network model of bistable excitatory neurons. Whether a single cortical neuron has a bistable response property remains elusive, and they considered both the case in which the bistability is achieved by a single neuron and the case in which it is achieved by a local circuit of cortical neurons. Since both versions are based on essentially the same mechanism, here we consider the former. It is in fact possible to construct several types of bistable neuron model based on biologically plausible mechanisms, such as the NMDA receptor-mediated synaptic current (Lisman, Fellous, & Wang, 1998). Due to the bistability, the neuronal response displays a hysteresis curve, as demonstrated in Fig. 2a, top. Consequently, the relationship between the gating variable and the recurrent input also shows a hysteresis curve (Fig. 2b, top). Suppose that all recurrent synapses have an equal weight W. Then the recurrent inputs to the individual neurons are the same, given by I = Σ_{i=1}^{N} W s_i. If different neurons in the recurrent network have regularly spaced, overlapping hysteresis curves (Fig. 2a and b, bottom), the sum Σ_{i=1}^{N} W s_i(I) can be represented by a pile of them within a parallelogram. The attractors of the recurrent network dynamics are given by the intersection points between Y = I and Y = Σ_{i=1}^{N} W s_i(I), as shown in Fig. 2c. These attractor states are robust against changes in the parameter values as long as the corresponding intersection points remain within the parallelogram. Thus, fine tuning of the parameter values is unnecessary in this model.
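A toy version of this mechanism (my own sketch with arbitrary parameter values, not the model of Koulakov et al.) can be written with a handful of binary hysteretic units whose switch-on and switch-off thresholds are staggered; transient input pulses recruit additional units, and the recurrent feedback W·Σs holds the new activity level after each pulse ends.

import numpy as np

N, W = 20, 0.03                        # number of bistable units, uniform recurrent weight
theta_on  = np.linspace(0.2, 1.2, N)   # staggered switch-on thresholds (illustrative values)
theta_off = theta_on - 0.7             # hysteresis: switch-off threshold well below switch-on
s = np.zeros(N)                        # 0 = resting state, 1 = elevated firing state

def relax(I_ext, n_iter=50):
    """Let the hysteretic units settle under external input I_ext plus recurrent feedback W*sum(s)."""
    global s
    for _ in range(n_iter):
        I = I_ext + W * s.sum()
        s = np.where(I > theta_on, 1.0, np.where(I < theta_off, 0.0, s))

for pulse in (0.3, 0.5, 0.7):   # successive transient input pulses
    relax(pulse)                # during the pulse, some additional units switch on
    relax(0.0)                  # after the pulse, recurrent feedback holds the new level
    print("active units after a pulse of", pulse, "->", int(s.sum()))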



Fig. 2. A continuous attractor in a recurrent network of bistable neurons with moderate tuning of activity threshold. (a) The intrinsic bistability of each neuron generates a hysteresis in the neuronal response (left). The individual neurons possess different activation thresholds and display different hysteretic curves (right). (b) The hysteretic neuronal responses result in hysteretic relationships between the gate variables and the recurrent inputs. (c) A line attractor is obtained as the intersections of a multi-hysteresis curve and a line in the input–output space of the recurrent network.

Temporal integration by a uniform network of bistable neurons

Note that the mechanism proposed in the previous subsection requires a broad spectrum of bistability ranges across the individual neurons. Here, we present yet another possible mechanism of temporal integration, by a recurrent network of identical bistable neurons with uniform synaptic weights (Fukai et al., 2003). In this model, excitatory neurons and inhibitory neurons are randomly connected by recurrent synapses. To mimic cortical biology, we may set the connectivity of the neuronal wiring at about 10%; however, the dynamical behavior of the network model is essentially unchanged even with all-to-all connectivity. The individual neurons receive balanced background synaptic inputs, which may be represented by independent Poisson spike trains mediated by excitatory and inhibitory synapses, or simply by independent Gaussian white noise. The background synaptic input plays a crucial role in this temporal integration mechanism, unlike in the previous network models, in which the dynamics is essentially deterministic. Each neuron is described by a leaky integrate-and-fire model with biophysically realistic currents. In particular, a Ca2+-dependent nonselective cationic current generates an afterdepolarization that is responsible for the bistability of the excitatory model neurons. Mathematical details of the model can be found in Fukai et al. (2003).

Suppose that all excitatory bistable neurons are initially settled in the resting state and receive a constant excitatory input I at times t > 0. Driven by this external input, each neuron makes a transition at some point in time from the resting to the elevated firing state. This process is governed by the stochastic dynamics of the neural network, and the timing of the transitions varies from neuron to neuron. We can show that the average number n of activated excitatory neurons increases at a nearly constant rate in appropriate ranges of the uniform synaptic weight and noise intensity, and that the rate of activity change increases if the input is strengthened. Thus, this model performs temporal integration at the level of a population of neurons.

Although the following mathematical analysis is not a rigorous proof of temporal integration in the above network model, it shows the essential mechanism by means of a simple stochastic process. Instead of describing the bistable neurons explicitly, we introduce a truncated description of the bistable state transitions. We may use integrate-and-fire neurons to describe the transition from the resting to the elevated firing state, with the average transition rate given by the inverse of the average first passage time to threshold. In addition, we regard the firing rate of neurons in the elevated firing state as approximately constant (f_ON Hz). Then, the following equation describes the transition process from the resting to the active firing state:

C_m dV/dt = −g_L(V − V_L) − g_syn Σ_{k=1}^{N} c_k s_k(t) V + I + σξ(t).    (9)

Here, V and C_m are the membrane potential and capacitance, V_L and g_L the reversal potential and conductance of a leak current, s_k and g_syn the open fraction and conductance of the synaptic current, ξ(t) a normalized Gaussian white noise with zero mean, and σ the standard deviation of the background noise. The coefficient c_k = 1 or 0 according to whether neuron k is connected or disconnected with this neuron. The neuron fires when V reaches the threshold V_th, and is then reset to V_rst. From the assumption that the firing rate is f_ON in the elevated firing state, we can derive the average synaptic activation as

s̄ = τ_s f_ON (1 − e^{−1/(τ_s f_ON)}),    (10)

with τ_s the decay constant of the synaptic current. The average first passage time to threshold may be expressed analytically as (Ricciardi, 1977)

τ_FP = τ_ref + √π τ̃_m ∫_{x_0}^{x_U} dx e^{x²}(1 + erf(x)),    (11)

with the error function erf(x) = (2/√π) ∫_0^x dy exp(−y²). In the above expression, the lower and upper bounds of the integral are determined as

x_0 = C_m (V_rst − (I_ext + g_L V_L)/g̃_L) / √(τ̃_m σ²),    (12)

x_U = C_m (V_th − (I_ext + g_L V_L)/g̃_L) / √(τ̃_m σ²),    (13)



where g̃_L(n) = g_L + c n g_syn s̄, τ̃_m = C_m/g̃_L, and c is the connectivity of the network. The possible shifts of x_0 induced by recurrent input during the state transitions were neglected in the above derivation. Below, we rewrite the synaptic weight g_syn c simply as g_syn. We can find how the stochastic transitions proceed in the coupled bistable elements by solving the following equation for n numerically:

dn/dt = τ_FP(n g_syn, σ, I_ext)^{−1}(N − n) − βn,    (14)

with N representing the total number of processing units. The parameters were set as C_m = 0.5 nF, τ_s = 4 ms, τ_ref = 2 ms, g_L = 0.025 mS, V_th = −52 mV, V_rst = −60 mV, V_L = −70 mV, β = 0.02 s⁻¹ and f_ON = 10 Hz. Here, we only discuss temporal integration by climbing activity. In the range of external input suitable for climbing activity, the rate of transitions from the active to the resting state is much smaller than that of the reverse transitions. Therefore, we may regard β as a constant in analyzing climbing activity. We have found that n grows at a nearly constant rate in wide ranges of g_syn and σ (Fig. 3). Outside such ranges, the increase of n accelerates supra-linearly or decelerates sub-linearly as n/N approaches unity (data not shown). These results are well consistent with those of the original network model of biologically realistic neurons. The linear growth of n indicates that the transitions from the resting to the active firing state in the original network proceed uniformly in the temporal domain, in the manner schematically illustrated in Fig. 3. The stochastic process model demonstrates that the stochastic transitions in the network display such a time course only in the presence of recurrent synaptic inputs. In fact, if the bistable elements are uncoupled (g_syn = 0) and make independent stochastic transitions, we can easily find that n/N decelerates as 1 − exp(−(τ_FP^{−1} + β)t) at large t. In the presence of recurrent connections, the deceleration of graded activity can be compensated by the accelerating effect of reverberating synaptic input.

An interesting characteristic of this uniform network is that it does not have any prescribed order of the state transitions among the excitatory neurons. As a consequence, the time of the state transition in each neuron also varies significantly from trial to trial. Thus, this model predicts a very large trial-to-trial variability in the single-cell spiking activity during the accumulation process. Such a large trial-to-trial variability in spike trains can often be seen in the graded activity recorded from the cortices of behaving monkeys.

Fig. 3. An analysis of accumulating activity in a stochastic recurrent network of bistable neurons. (a) An integrate-and-fire neuron with recurrent, external, and noisy inputs. The recurrent inputs are assumed to be proportional to the instantaneous number n of neurons in the active state. (b) A steady, nearly linear growth of n is achieved with appropriate tuning of σ and g_syn. Different curves show the time evolution of n from weak (lower) to strong (upper) external input. (c) Outside the parameter range employed in (b), the growth of n is not steady.
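For readers who want to reproduce the flavor of this analysis, the sketch below (not the authors' code) integrates Eq. (14) with τ_FP evaluated from Eqs. (10)–(13). The parameter values, in consistent units of nF, mV, µS, nA and ms, are illustrative choices rather than the tuned values behind Fig. 3, so the printed growth of n may be sub- or supra-linear; as discussed in the text, near-linear growth requires an appropriate combination of g_syn and σ.

import math
import numpy as np

# Illustrative parameters in consistent units (nF, mV, uS, nA, ms); they are example values
# for this sketch, not the tuned values used for Fig. 3.
Cm, gL, VL  = 0.5, 0.025, -70.0
Vth, Vrst   = -52.0, -60.0
tau_s, f_on = 4.0, 0.01            # synaptic decay (ms) and ON-state firing rate (spikes/ms)
tau_ref     = 2.0                  # refractory period (ms)
beta        = 2e-5                 # rate of transitions back to rest (1/ms)
N_units     = 1000
g_syn, sigma, I_ext = 2e-5, 0.2, 0.3   # recurrent weight (uS), noise strength, external drive (nA)

s_bar = tau_s * f_on * (1.0 - math.exp(-1.0 / (tau_s * f_on)))      # Eq. (10)

def tau_fp(n):
    """Average first-passage time to threshold, Eqs. (11)-(13), with n neurons in the active state."""
    gL_eff  = gL + n * g_syn * s_bar          # effective leak conductance g~_L(n)
    tau_eff = Cm / gL_eff                     # effective membrane time constant
    v_inf   = (I_ext + gL * VL) / gL_eff      # mean membrane potential of the reduced dynamics
    scale   = math.sqrt(tau_eff * sigma ** 2)
    x0 = max(Cm * (Vrst - v_inf) / scale, -8.0)   # deeply subthreshold tail contributes little
    xU = Cm * (Vth  - v_inf) / scale
    if xU <= x0:
        return tau_ref
    x = np.linspace(x0, xU, 400)
    integrand = np.exp(x ** 2) * (1.0 + np.array([math.erf(v) for v in x]))
    integral = 0.5 * np.sum(integrand[1:] + integrand[:-1]) * (x[1] - x[0])   # trapezoid rule
    return tau_ref + math.sqrt(math.pi) * tau_eff * integral

# Euler integration of Eq. (14): dn/dt = (N - n)/tau_FP(n) - beta*n
n, dt = 0.0, 20.0
for step in range(5001):
    if step % 1000 == 0:
        print(f"t = {step * dt / 1000.0:5.0f} s   n = {n:7.1f}")
    n = min(max(n + dt * ((N_units - n) / tau_fp(n) - beta * n), 0.0), float(N_units))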

Temporal integration by slow synaptic currents

While the previous models utilize bistability at the cellular level, Mongillo, Amit, and Brunel (2003) employed bistability at the network level, achieved by reverberating synaptic input, and analyzed fast and slow bistable transitions in the network dynamics. The fast bistable transitions occur if synaptic communication is principally mediated by AMPA-type current: the firing rates of individual neurons change abruptly and coherently and, therefore, the population activity in single trials shows no slowly climbing profile. If NMDA-type current is dominant in synaptic communication, the bistable transitions are slower and yield a gradual pile-up of the population activity. The slow dynamics of glutamatergic synaptic transmission underlies temporal integration also in other recurrent network models (Machens, Romo, & Brody, 2005; Miller et al., 2003; Wang, 2002). In the models of slow synaptic reverberation, individual neurons typically display gradual and continuous increases of firing rate in response to an external input. Such a neuronal activity pattern contrasts with that in the previous models with bistable neurons. Multiunit recordings of cortical neurons in behaving animals are awaited to discriminate between the different models of temporal integration.

Temporal integration by single neurons

Persistent neuronal activity, including the graded activity discussed thus far, has generally been considered to emerge as a result of synaptic reverberation in recurrent neural networks. Remarkable findings that might upset this traditional view were reported by Egorov, Hamam, Fransen, Hasselmo, and Alonso (2002). Individual neurons in layer V of the rat entorhinal cortex in vitro responded to consecutive stimuli with graded changes in firing frequency that remained stable after each stimulus presentation. This was observed under pharmacological blockade of synaptic transmission, which clearly demonstrates that single cells are able to generate graded persistent activity without exploiting network-level properties.


Consistent with these findings, models of intracellular mechanisms that generate graded persistent activity and achieve temporal integration at the single-cell level have been proposed by several authors. Loewenstein and Sompolinsky (2003) proposed a temporal integrator neuron based on specifically designed nonlinear dynamics of intracellular calcium. Assuming that the cytoplasmic membrane is impermeable to calcium, they described the calcium flow along a dendritic tree in terms of the reaction–diffusion equations

∂c/∂t = f(c) + D ∂²c/∂x² + I,    (15a)

f(c) = −K(c − c_1)(c − c_2)(c − c_3),    (c_1 < c_2 < c_3)    (15b)

where c and D represent the position-dependent calcium concentration and the diffusion constant, respectively, and K, c_1, c_2 and c_3 are constants. An external input to the neuron is represented as I. The cubic function f(c) describes the flux of calcium-dependent calcium release from intracellular stores. The flux term attempts to set the local calcium concentration at either c_1 or c_3, while the diffusion term tries to produce a spatially uniform calcium concentration. Therefore, if the calcium concentration satisfies the boundary conditions c = c_1 at x = 0 (the proximal end-point of the dendrite) and c = c_3 at x = x_∞ (the distal end-point), the effects of the two terms conflict with one another and the concentration should change abruptly from one allowed value to the other at some spatial location between the two end-points. Such a solution can be given as

c(x, t) = (c_1 + c_3)/2 + ((c_3 − c_1)/2) tanh((x − L(t))/λ).    (16)

See Loewenstein and Sompolinsky (2003) for the explicit forms of L(t) and λ. In this solution, the calcium concentration rapidly changes at x = L(t) from c = c_1 (x < L(t)) to c = c_3 (x > L(t)). In particular, if we finely tune the parameters to satisfy c_2 = (c_1 + c_3)/2, we can show that the position of the calcium wave front L is constant in time. The value of L can be shifted in a negative or a positive direction by an excitatory or an inhibitory transient input, respectively. Thus, the position of the calcium wave front encodes the result of temporal integration of the past external inputs to the neuron model. The information stored in the calcium concentration may be translated into firing rate by, say, a Ca2+-dependent cation current distributed uniformly along the dendrite. While this model has suggested a possible role of the intracellular calcium concentration, it is unclear whether all of the required boundary conditions and the parameter tuning can be fulfilled by real neurons.
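The behavior of Eqs. (15a)–(16) is easy to reproduce with a crude finite-difference integration. The sketch below (not from the original paper) uses arbitrary illustrative constants with c_2 = (c_1 + c_3)/2, so the front is stationary in the absence of input; a transient positive input I shifts the front position L, which then stays at its new location, i.e., the front position integrates the input.

import numpy as np

# Illustrative parameters for Eqs. (15a)-(15b); the values are arbitrary choices for this sketch.
c1, c2, c3 = 0.0, 0.5, 1.0        # c2 = (c1 + c3)/2 makes the front stationary
K, D       = 1.0, 1.0
dx, dt     = 0.1, 0.002
x  = np.arange(0.0, 20.0 + dx, dx)
c  = 0.5 * (c1 + c3) + 0.5 * (c3 - c1) * np.tanh((x - 10.0) / 1.0)   # initial front near x = 10

def front_position(conc):
    """Location L(t) where the calcium concentration crosses the middle value."""
    return x[np.argmax(conc >= 0.5 * (c1 + c3))]

def integrate(I, T):
    """Explicit finite-difference integration of Eq. (15a) with external input I for duration T."""
    global c
    for _ in range(int(T / dt)):
        lap = np.zeros_like(c)
        lap[1:-1] = (c[2:] - 2.0 * c[1:-1] + c[:-2]) / dx ** 2
        c = c + dt * (-K * (c - c1) * (c - c2) * (c - c3) + D * lap + I)
        c[0], c[-1] = c1, c3                      # boundary conditions of the model

integrate(I=0.0,  T=10.0); print("front at", front_position(c))   # stationary front
integrate(I=0.02, T=10.0); print("front at", front_position(c))   # transient excitatory input shifts it
integrate(I=0.0,  T=10.0); print("front at", front_position(c))   # the front stays at its new position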


Inspired by the experimental findings of graded persistent activity in rat entorhinal cortex neurons (Egorov et al., 2002), Teramae and Fukai (2005) proposed another cellular mechanism to generate multiple stable states in single neurons. In the proposed mechanism, bistable concentrations of inositol 1,4,5-trisphosphate (IP3) and calcium are achieved by IP3 formation and IP3-induced calcium release from stores within each local subcellular compartment, and a number of such compartments are coupled through the diffusion of calcium and IP3. This model may therefore be regarded as a single-cell realization of the mechanism proposed by Koulakov et al. (2002). The model exhibited response properties consistent with the experimental ones. In particular, it could demonstrate why the induction of graded rate changes in entorhinal neurons requires unnaturally long stimuli, and how the critical stimulus duration required for this can be significantly reduced by modulating the coupled dynamics of Ca2+ and IP3. A short critical duration seems to be crucial if this single-neuron graded persistent activity is to play a practical role in temporal integration. In addition, the model argued for a crucial role of a store-operated Ca2+ channel in refilling the Ca2+ store (Blaustein & Golovina, 2001; Parekh & Penner, 1997; Putney & Ribeiro, 2000) during persistent firing. Fransen, Tahvildari, Egorov, Hasselmo, and Alonso (2006) recently reported that a Ca2+ store is unlikely to contribute to graded persistent activity in entorhinal cortex neurons. They proposed a model based on Ca2+-dependent biochemical processes that mediate metabotropic changes in a Ca2+-dependent cation current, such as its phosphorylation and dephosphorylation. Durstewitz (2003) proposed another single-neuron model of temporal integration by means of the NMDA receptor-mediated synaptic current and spike firing induced by an afterdepolarization current. A line attractor is achieved in the two-dimensional space of calcium concentration and membrane potential by fine tuning of the parameters involved in the two currents. In addition, such a tuning could be self-organized if the long-term average (typically over 10 s) of the calcium concentration is calculated by a certain dynamical variable.

2.3. Concluding remarks for temporal integration

The different models explained above may be discriminated by several experiments, including multi-unit recording studies. The bistable-neuron network model of Koulakov et al. (2002) achieves temporal integration with a specific wiring pattern and/or neuron-specific activity thresholds. This model therefore preferentially activates neurons in an approximately fixed order, with a certain degree of trial-to-trial jitter due to biological noise. By contrast, the bistable-neuron network model of Fukai et al. (2003) has a uniform network structure and produces essentially no preferred order of activation among neurons. In the slow-synapse-based mechanism, not only the trial or ensemble average of neuronal activity, but also the activities of the individual neurons, exhibit steady increases in firing rate. The situation will be the same in the single-neuron integrator models.

The models discussed above attempted to explain how cortical neurons or their networks might integrate external or internal input. Temporal integration, however, is only a part of the larger neural system that governs the entire decision procedure to generate goal-directed behavior (Koene & Hasselmo, 2005).



A neural network model with slow synapses, which are modifiable by algorithms similar to reinforcement learning, has been proposed for solving probabilistic decision tasks (Wang, 2002). In this model, two separate populations of excitatory neurons innervated by independent Poisson spike trains compete through mutual inhibition. A winner population is determined in a probabilistic manner according to the input firing rates, and displays gradually climbing activity. In a typical sensory discrimination task, subjects are asked to remember a sensory stimulus and later to compare it with a second stimulus to choose an appropriate motor response. It is widely considered that solving such a task requires separate neural systems, one for storing the information about the first stimulus (working memory) and one for comparing it with the information about the second stimulus. Machens et al. (2005) have recently proposed an interesting framework that uses a single system of mutually inhibiting neural populations for both purposes. This model involves an external source of input that informs the system whether the current stimulus is the first or the second one. Such a hypothesis must be tested by future experiments.

3. Reward-driven decision making

In the previous section, we discussed perceptual decision making between ambiguous sensory stimuli. This section is devoted to a different kind of decision making, namely the reward-driven selection of behavioral responses. In this type of behavioral task, the sensory cues indicating the behavioral responses are given unambiguously, but uncertainty may exist in the reward delivery. A typical example of reward-based decision making is the alternative choice task, in which subjects are asked to choose one of several alternative behavioral responses to obtain a reward. The reward may be given in a stochastic manner following the reinforcement schedules assigned to the individual responses. Various types of stochastic reinforcement schedule have been introduced to clarify how subjects make a particular decision under various conditions. We are particularly interested in the relationship between optimal behavior, which maximizes the long-term average of reward, and matching behavior, which is commonly seen in animal behavior and does not necessarily maximize reward. In particular, we discuss the biological mechanism and computational implications of matching behavior, which are not yet fully understood.

3.1. Stochastic reinforcement schedules and the matching law

Typical examples of reinforcement schedules are the variable ratio (VR) and variable interval (VI) schedules (see Mazur (2005)). In a VR schedule, choices of an alternative are reinforced with a constant ratio to the choice frequency: every choice of the alternative may result in a reward with a fixed conditional probability. In a VI schedule, choices of an alternative are reinforced at average intervals that are independent of the choice frequency: a reward is set to the alternative at a rate independent of the current choices. The once-set reward remains available, and no additional reward is set, until it is taken by the subject. Because of this persistence of set rewards, the likelihood of being rewarded by choosing the alternative increases with the time that has passed since the last choice of that alternative in the VI schedule, whereas it remains constant in the VR schedule (Fig. 4).

Fig. 4. The likelihood of being rewarded by choosing an alternative after time T has passed since the last choice of that alternative. In a VI schedule with reward setting rate λ it is Pr(reward|T) = 1 − e^{−λT} in continuous-time tasks and Pr(reward|T) = 1 − (1 − λ)^T in discrete-time tasks. By contrast, in a VR schedule the likelihood is constant and equal to the reward rate λ.
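The two formulas in the caption can be restated directly in code (a trivial sketch, not from the original paper):

import math

def p_reward_vr(lam, T):
    """VR schedule: the reward probability is lam regardless of the time since the last choice."""
    return lam

def p_reward_vi(lam, T, discrete=True):
    """VI schedule: an armed reward persists, so the probability grows with T (Fig. 4)."""
    return 1.0 - (1.0 - lam) ** T if discrete else 1.0 - math.exp(-lam * T)

for T in (1, 2, 5, 10):
    print(T, p_reward_vr(0.3, T), round(p_reward_vi(0.25, T), 3))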

The VR schedule may resemble the situations which many carnivorous animals encounter during hunting. For instance, a cheetah may decide which prey, a zebra or a gazelle, she should chase according to the success rate of her past hunting. On the other hand, the VI schedule may imitate the situations which herbivorous animals meet in foraging behavior: once a patch of grass has been foraged, it becomes available again only after a certain period.

In a choice task with concurrent VR schedules, in which VR schedules with different ratios are assigned to the individual alternatives, subjects learn to select the response rewarded more frequently after a sufficiently long training time. It is obvious that subjects are able to maximize the reward by following this policy of decision making. By contrast, in a choice task with concurrent VI schedules, subjects must scatter their choices over the alternatives to increase the obtainable reward, because of the persistence of once-set rewards. Simply choosing the alternative that is rewarded more frequently does not ensure a maximal reward. In fact, in concurrent VI tasks, the subject's choice behavior is known to obey the matching law, which says that the fractions of the choice frequencies of the individual alternatives match the fractions of the past incomes (Herrnstein, 1961, 1997),

P_a / Σ_{a'} P_{a'} = R_a / Σ_{a'} R_{a'},    (17)

where P_a represents the choice frequency of alternative a, and R_a represents the average income per unit time from choosing a. The summation over a' runs over the alternatives, i.e., a' = 1, 2. The generalization to a choice task with n alternatives is straightforward: a' = 1, 2, ..., n.



The matching law differs from the so-called "probability matching", in which the fractions of the choice frequencies in the face of a variable combination of VR schedules match the fractions of the reward rates of the VR schedules (e.g. Morris et al. (2004)). "Probability matching" is a law concerning the use of past knowledge in a variable situation, while the matching law describes an asymptotic behavior in a stationary situation. The matching law is widely seen in many species including humans (Davison & McCarthy, 1987), and has been shown to approximate the optimal probabilistic behavior (Baum, 1981; Heyman, 1979) if the amount or strength of the reward obtainable from each alternative is equivalent, as was the case in many previous studies of animal or human behavior.

The seemingly different behaviors in the concurrent VR and the concurrent VI tasks are consistent with the hypothesis that the animal's behavior is reinforced so as to learn the optimal behavior that maximizes reward. However, the exclusive choice behavior seen in the VR task is also consistent with the matching law, in the trivial sense that all options but one are never chosen and hence produce no income. Therefore, there is another possibility: the different behaviors may emerge from a common computational mechanism that attains the matching law (Herrnstein, 1997; Herrnstein & Vaughan, 1980; Seung, 2003; Sugrue et al., 2004). The issue of matching versus maximizing may be clarified in a choice task in which matching behavior gives only a sub-optimal solution to decision making.

3.2. Insight into matching or optimization

Matching behavior is not necessarily optimal in some alternative choice tasks, and previous work has described several examples in which it is sub-optimal. A simple example is the case where the amount of reward obtained in a single choice is not identical for the different alternatives in the concurrent VI task. Matching behavior of animals was actually observed in this type of experimental setting (Baum & Rachlin, 1969; Heyman & Monaghan, 1994). However, the relationship between an animal's subjective values of rewards, on which the animal's decisions are based, and their physical strength, on which the quantitative design of the experiments was based, is unknown. Therefore, the above results may not provide clear evidence that the animals exhibit matching rather than optimal behavior. Another example can be found in tasks with concurrent VI and VR schedules (Herrnstein, 1997; Herrnstein & Heyman, 1979). In this case, the optimal choice probabilities deviate from the matching ones even if an identical reward is assigned to the different alternatives. Again, matching behavior was observed in this type of task (Herrnstein & Heyman, 1979; Savastano & Fantino, 1994; Vyse & Belke, 1992), although controversial results have also been reported (Sakagami et al., 1989). The issue of matching or maximizing was examined in other types of choice task (DeCarlo, 1985; Jacobs & Hackenberg, 1996; Mazur, 1981), but the validity of the matching behavior remains unclear (Mazur, 2005). These choice tasks were designed in continuous time: subjects were allowed to respond at any time. Such a task setting is certainly more complicated than one employing discrete time steps, in which subjects are allowed to respond only at restricted times (Sugrue et al., 2004).

In addition, these experiments introduced "change-over delays" between consecutive choices. A change-over delay imposes a certain cost (usually a delay in reward delivery) on frequent changes of choice, in order to discourage an alternating choice pattern or a win-stay lose-switch strategy, in which subjects keep choosing the same option when it was rewarded in the immediate past and leave it when it was not (Herrnstein, 1997; Stubbs, Pliskoff, & Reid, 1977). These complex experimental settings make it difficult to compare the results of the different experiments.

Vaughan (1981) introduced a choice task that may be yet more suitable for studying whether animals develop matching or optimal behavior (Herrnstein, 1997; Vaughan, 1981). In this task, a response to an alternative may result in reward delivery with a probability determined as a function of the frequency of the past choices of that alternative. The amount of obtainable reward does not depend on the local order of choosing alternatives; only the total frequencies of choosing the individual alternatives matter. Therefore, there is no quantitative difference in the obtainable reward between periodic and probabilistic choice behaviors. In Vaughan's task, animals exhibit neither the optimal behavior nor the alternating choice behavior, but matching behavior (Vaughan, 1981). Egelman, Person, and Montague (1998) observed human behavior in a discrete-time version of Vaughan's task and found that most subjects exhibited near-matching choice frequencies, although some exhibited near-optimal ones. It is thus likely that human subjects exhibit matching behavior even in this task. It is of extreme interest to test whether matching behavior is observed in other species performing the discrete-time version of the task.

In the discrete-time version of Vaughan's task, the conditional reward expectation for the current choice a_t is given as a function of the choice frequency averaged over the past τ time steps,

⟨r_t | a_t = a⟩ = Q_a(P_a),    P_a ≡ (1/τ) Σ_{t'=t−τ}^{t−1} δ_{a a_{t'}},    (18)

where r_t represents the amount of reward obtained at time t, and δ_{ij} = 1 if i = j and 0 otherwise. If the subject makes random choices of alternatives with a constant set of choice probabilities {p_a}, the past choice frequency P_a is approximately equal to the choice probability p_a, i.e., P_a ≅ p_a. Therefore, the average income ⟨r_t⟩ and the fractional average incomes R_a are described as

⟨r_t⟩ = Σ_a R_a,    R_a = Q_a(p_a) p_a.    (19)

The optimal choice probabilities are those that maximize the average income ⟨r_t⟩. The matching law can be reformulated as R_1/P_1 = R_2/P_2, as far as P_1 ≠ 0 and P_2 ≠ 0. The choice probabilities satisfying the matching law, other than those of the exclusive choice behavior (P_1 = 0 or P_2 = 0), are determined by the equation Q_1(p_1) = Q_2(p_2) = Q_2(1 − p_1).



Fig. 5. (a) Examples of the functions Q_1 and Q_2 determining the probabilities that a reward is given as a result of a response to alternative 1 (dashed line) and alternative 2 (dot–dashed line) in Vaughan's task (Vaughan, 1981). If the subject makes random choices with a constant set of choice probabilities p_1 and p_2 = 1 − p_1, the average income is ⟨r_t⟩ = Q_1(p_1) p_1 + Q_2(1 − p_1)(1 − p_1) (solid curve). The vertical solid and dotted lines indicate the optimal and matching choice probabilities. (b) The same relation in a task with concurrent VI and VR schedules. The reward probabilities Q_1 and Q_2 are determined by the reward setting rates λ_1 and λ_2 of the VI and VR schedules. The VI schedule with reward setting rate λ_1 = 0.25 is assigned to alternative 1, and the VR schedule with reward rate λ_2 = 0.3 is assigned to alternative 2.

Therefore, the intersection point of the functions Q_1(p_1) and Q_2(1 − p_1) gives the choice probabilities of the matching behavior. Fig. 5(a) shows examples of the functions Q_1(p_1) (dashed curve) and Q_2(1 − p_1) (dot–dashed curve), and the average income derived from them, ⟨r_t⟩ = Q_1(p_1) p_1 + Q_2(1 − p_1)(1 − p_1) (solid curve). The solid and dotted vertical lines indicate the optimal and matching choice probabilities. We can see that an arbitrary combination of optimal and matching choice probabilities and average amounts of obtainable reward can be selected by designing the functions Q_1 and Q_2.

By contrast, the Q functions in the VR and VI schedules are determined explicitly. In a VR schedule with reward rate λ, the Q value is a constant equal to λ. In a discrete-time VI schedule with reward setting rate λ, the probability that a reward has been set to the alternative by time T after its previous choice is Pr(set|T) = 1 − (1 − λ)^T. If a subject makes random choices with choice probability P, the Q function is determined as

Q(P) = ρ Σ_{T=1}^{∞} Pr(set|T)(1 − P)^{T−1} P = ρλ / (1 − (1 − λ)(1 − P)),    (20)

where ρ is the amount of reward obtainable in a single choice. Fig. 5(b) shows an example of the function Q_1(p_1) for a VI schedule assigned to alternative 1 (dashed curve) and the function Q_2(1 − p_1) for a VR schedule assigned to alternative 2 (dot–dashed curve), together with the average income in the concurrent VI and VR task (solid curve). We can see that the freedom in designing the optimal and matching choice probabilities and the average amounts of obtainable reward is much more restricted in the concurrent VI and VR task than in Vaughan's task.
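As a sketch of how the matching and optimal choice probabilities can be computed for the concurrent VI and VR case just described (using the schedule parameters of Fig. 5b and assuming an equal reward magnitude ρ = 1 for both alternatives; the grid search is my own illustration, not from the paper):

import numpy as np

rho = 1.0                    # reward magnitude per rewarded choice (assumed equal for both)

def Q_vi(p, lam=0.25):
    """Eq. (20): conditional reward expectation for a discrete-time VI schedule."""
    return rho * lam / (1.0 - (1.0 - lam) * (1.0 - p))

def Q_vr(p, lam=0.3):
    """VR schedule: conditional reward expectation is independent of the choice probability."""
    return rho * lam

p1 = np.linspace(1e-3, 1.0 - 1e-3, 10001)
income   = Q_vi(p1) * p1 + Q_vr(1.0 - p1) * (1.0 - p1)       # Eq. (19): <r_t> as a function of p1
matching = p1[np.argmin(np.abs(Q_vi(p1) - Q_vr(1.0 - p1)))]  # solves Q1(p1) = Q2(1 - p1)
optimal  = p1[np.argmax(income)]
print(f"matching p1 = {matching:.3f}, optimal p1 = {optimal:.3f}")

For this parameter choice the matching probability of choosing the VI alternative comes out considerably larger than the optimal one, which illustrates why this schedule combination can dissociate the two kinds of behavior.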

3.3. Comparison between different algorithms to attain the matching law

There exist several learning algorithms that account for matching behavior. Here, we consider learning algorithms for discrete-time choice tasks with n alternatives (a = 1, 2, ..., n). The "local matching law" introduced by Sugrue et al. (2004) determines the current choice probabilities {p_a} directly as the fractions of the local incomes averaged over a certain interval in the immediate past. The local average income R̂_a can be calculated by online learning as

p_a = R̂_a / Σ_{a'} R̂_{a'},    ΔR̂_a = α(r_t δ_{a a_t} − R̂_a),    (21)

where the values of R̂_a for all alternatives are updated by ΔR̂_a at every time step, R̂_a + ΔR̂_a → R̂_a. The value of R̂_a decays at rate α at each time step; hence the inverse of the learning rate, 1/α, represents the typical interval for temporal averaging. To test the matching law, let us consider the average incomes {R_a} and the choice frequencies {P_a} over τ time steps. The average changes of {R̂_a} over τ time steps are written as

⟨ΔR̂_a⟩_τ = α(⟨r_t δ_{a a_t}⟩_τ − ⟨R̂_a⟩_τ),    (22)

where the bracket ⟨·⟩_τ denotes averaging over τ time steps. The term ⟨r_t δ_{a a_t}⟩_τ represents the average income obtained by choosing action a, and hence ⟨r_t δ_{a a_t}⟩_τ = R_a. The frequency of choosing action a over τ time steps is nearly equal to the temporal average of the current choice probability over τ time steps, if the averaging span τ is sufficiently long,

P_a = ⟨p_a⟩_τ = ⟨ R̂_a / Σ_{a'} R̂_{a'} ⟩_τ.    (23)



In the local steady state defined by ⟨ΔR̂_a⟩_τ = 0 over τ time steps, the temporal average of the local income R̂_a is equal to the actual average income, ⟨R̂_a⟩_τ = R_a. In general,

⟨ R̂_a / Σ_{a'} R̂_{a'} ⟩_τ ≠ ⟨R̂_a⟩_τ / Σ_{a'} ⟨R̂_{a'}⟩_τ,    (24)

so the matching law does not always hold exactly in the steady state of the "local matching law". However, if the learning rate α is sufficiently small, so that α ≪ 1/τ, then the local incomes {R̂_a} are approximately constant during τ time steps. In this case, the "local matching law" locally exhibits matching behavior.

"Melioration", introduced by Herrnstein and Vaughan (1980), proposes to update the current choice probabilities so as to increase the choice probability associated with the greatest value of Q_a = R_a/P_a. This learning rule gives an equilibrium state in which all alternatives have the same average value of these ratios, hence leading to the matching law. Let us introduce one mathematical formulation of "melioration". The value of Q_a is estimated by online learning,

ΔQ̂_a = α(r_t − Q̂_a)δ_{a a_t},    (25)

and the current choice probabilities { pa } are described with the policy parameters {qa } updated by using Qˆ a values, , X βqa pa = e eβqa0 , (26)

a ∗ ≡ arg max Qˆ a 0 a

where the learning updates only the policy parameter associated with the greatest Qˆ a value. In a steady state defined by h1 Qˆ a iτ = 0 and h1qa iτ = 0, the temporal averages of the estimated values { Qˆ a } are equal,

(28)

"Melioration" and the "local matching law" are designed to acquire the matching behavior. Now, we attempt to determine whether matching behavior is observed in other computational algorithms designed for another purpose. Reinforcement learning theory provides a computational framework to develop the optimal choice behavior in Markov decision processes, including the case in which an actual reward is given after certain state transitions caused by the learner's actions (Sutton & Barto, 1998). We deal with reinforcement learning without state transitions, since no explicitly varying state seems to exist in the present alternative-choice tasks. The Markov decision process without state transitions is equivalent to the concurrent VR task. Therefore, in choice tasks other than the concurrent VR task, reinforcement learning algorithms without state variables do not necessarily ensure the optimal choice behavior.

In "actor-critic" learning, which is one of the most popular reinforcement learning algorithms, the "critic" predicts the rewards obtainable in the future, and the "actor" changes the system's internal variables and selects an action that presumably leads to an optimized future reward according to the prediction (Sutton & Barto, 1998). In actor-critic learning without state variables, the "critic" updates the estimation of the average income, $\Delta V = \alpha\,(r_t - V)$. The "actor" chooses an action from the $n$ alternatives according to the current choice probabilities $\{p_a\}$, where every choice is made independently of choices in the past. The choice probabilities are determined by the policy parameters $\{q_a\}$, which are updated by using the error in the reward estimation, $r_t - V$,
$$p_a = e^{\beta q_a} \Big/ \sum_{a'} e^{\beta q_{a'}}, \qquad \Delta q_a = \alpha\,(r_t - V)\,\delta_{a,a_t}, \tag{29}$$
where only the policy parameter corresponding to the current choice $a_t$ is updated at time step $t$. The current choice $a_t$ made by the "actor" is evaluated by the "critic" in terms of the error in the reward estimation, $r_t - V$. If the error is positive, i.e., the obtained reward is greater than what was expected, the "actor" increases the probability of choosing the current action. The long-term averages of the changes in the policy parameters over $\tau$ time steps are given as
$$\langle \Delta q_a \rangle_\tau = \alpha\bigl( \langle r_t\,\delta_{a,a_t} \rangle_\tau - \langle V\,\delta_{a,a_t} \rangle_\tau \bigr) = \alpha\bigl( R_a - \langle V\,\delta_{a,a_t} \rangle_\tau \bigr). \tag{30}$$
If the learning rate $\alpha$ is sufficiently small that $\alpha \ll 1/\tau$, then the reward estimation $V$ is approximately constant during the averaging span $\tau$. In this case, $\langle V\,\delta_{a,a_t} \rangle_\tau$ can be factorized into $\langle V \rangle_\tau \langle \delta_{a,a_t} \rangle_\tau$. The average $\langle V \rangle_\tau$ coincides with the actual average income $\langle r_t \rangle_\tau$ over $\tau$ time steps in a steady state, since the estimation error should vanish in that state. Therefore,
$$\langle V \rangle_\tau = \langle r_t \rangle_\tau = \sum_{a=1}^{n} R_a. \tag{31}$$

The average $\langle \delta_{a,a_t} \rangle_\tau$ represents the choice frequency of the individual actions, i.e., $\langle \delta_{a,a_t} \rangle_\tau = P_a$. Thus, we obtain the slow dynamics of the long-term averages over $\tau$ time steps,
$$\langle \Delta q_a \rangle_\tau \simeq \alpha\Bigl( R_a - P_a \sum_{a'=1}^{n} R_{a'} \Bigr), \tag{32}$$
and the steady state defined by $\langle \Delta q_a \rangle_\tau \simeq 0$ implies the matching law (17) over $\tau$ time steps. The result shows that actor-critic learning always exhibits the matching behavior in the steady state, as far as the learning rate $\alpha$ can be regarded as sufficiently small.
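For concreteness, here is a minimal sketch of actor-critic learning without state variables, as defined by the critic update $\Delta V = \alpha(r_t - V)$ and Eq. (29). The Python implementation and the environment interface are assumptions used for illustration only.

```python
import numpy as np

def actor_critic_no_state(env, n_actions, alpha=0.01, beta=10.0, n_steps=10000, rng=None):
    """Actor-critic without state variables: the critic tracks the average income V,
    and the actor updates only the chosen policy parameter by the error r_t - V (Eq. 29)."""
    rng = rng or np.random.default_rng(0)
    V = 0.0                          # critic's estimate of the average income
    q = np.zeros(n_actions)          # actor's policy parameters q_a
    for t in range(n_steps):
        p = np.exp(beta * q)
        p /= p.sum()                 # soft-max choice probabilities
        a = rng.choice(n_actions, p=p)
        r = env(a)
        delta = r - V                # reward-estimation error
        q[a] += alpha * delta        # Delta q_a = alpha * (r_t - V) * delta_{a,a_t}
        V += alpha * delta           # Delta V   = alpha * (r_t - V)
    return q, V
```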

Fig. 6. The matching behavior of the learning algorithms in the task with concurrent VI and VR schedules (blocks 1 and 2) and in Vaughan's task (blocks 3 and 4). The VI schedule of reward rate $\lambda = 0.25$ and the VR schedule of reward rate $\lambda = 0.3$ are assigned to alternatives 1 and 2 in block 1, and to alternatives 2 and 1 in block 2, respectively. The Q functions determining the reward probabilities are set as the linear functions $Q_1(p_1) = -(p_1 - 1)$ and $Q_2(1 - p_1) = -0.13\,p_1 + 0.35$ in block 3, and alternatives 1 and 2 are swapped in block 4. (a)–(d) The reward probabilities in each block are plotted in the same way as in Fig. 5. (e) The four curves show the time courses of the local frequencies with which alternative 1 is chosen during the past 100 time steps in "actor-critic" learning (circles), the "direct actor" (diamonds), the "local matching law" (triangles), and "melioration" (squares). The parameters of the learning algorithms are set to $\alpha = 0.01$ and $\beta = 10$ for all algorithms, and the initial conditions are $q_1 = q_2 = V = \hat{Q}_1 = \hat{Q}_2 = 0$ and $\hat{R}_1 = \hat{R}_2 = 1$ at time $t = 0$. The optimal and matching choice probabilities in the individual blocks are plotted as horizontal solid and dotted lines, corresponding to the vertical solid and dotted lines in (a)–(d).

Other types of reinforcement learning algorithms, such as "Q-learning", evaluate individual action values. The conditional reward expectation obtained by action $a$, i.e., the action value $Q_a$, is estimated by online learning, as in "melioration",
$$\Delta\hat{Q}_a = \alpha\,(r_t - \hat{Q}_a)\,\delta_{a,a_t}, \tag{33}$$
and the current choice probabilities $\{p_a\}$ are directly determined by the estimated action values $\hat{Q}_a$. For instance, soft-max exploration is often used,
$$p_a = e^{\beta \hat{Q}_a} \Big/ \sum_{a'} e^{\beta \hat{Q}_{a'}}. \tag{34}$$
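The following is a minimal sketch of this value-based scheme (stateless Q-learning, i.e., the "indirect actor" discussed below), again an assumed Python illustration with an illustrative environment interface. Letting $\beta \to \infty$ in the soft-max of Eq. (34) gives the greedy choice discussed next.

```python
import numpy as np

def q_learning_no_state(env, n_actions, alpha=0.01, beta=10.0, n_steps=10000, rng=None):
    """Stateless Q-learning ('indirect actor'): estimate the action values online
    (Eq. 33) and choose by soft-max over the estimates (Eq. 34)."""
    rng = rng or np.random.default_rng(0)
    Q_hat = np.zeros(n_actions)            # estimated action values Q_hat[a]
    for t in range(n_steps):
        z = beta * Q_hat
        p = np.exp(z - z.max())            # soft-max of Eq. (34), shifted for numerical stability
        p /= p.sum()
        a = rng.choice(n_actions, p=p)
        r = env(a)
        Q_hat[a] += alpha * (r - Q_hat[a]) # Eq. (33): update only the chosen action
    return Q_hat
```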

In the concurrent VR task, as mentioned in the previous section, the learner is able to maximize the income by choosing the action associated with the maximum action value $Q_a$ at every time step. Therefore, if the choice probabilities are determined as in (34), then the value of $\beta$ must be infinitely large, $\beta \to \infty$, in order to obtain the optimal behavior in the concurrent VR task. In this case, Q-learning persists in choosing the action that is most likely to be rewarded according to the estimated action values $\hat{Q}_a$, without trying other actions. Hence, the limit $\beta \to \infty$ is called the "greedy limit". In the greedy limit, the choice probability associated with the maximum $\hat{Q}_a$ value is increased up to 1; greedy Q-learning is thus an extreme case of "melioration".

Dayan and Abbott (2001) discussed learning algorithms designed for the case of no state variables, called the "indirect actor" and the "direct actor". The "indirect actor" is equivalent to Q-learning without state variables: the current choice probabilities are updated indirectly through the estimated action values $\hat{Q}_a$. By contrast, in the "direct actor", the policy parameters $\{q_a\}$ that fix the current choice probabilities are updated according to a gradient-ascent rule so as to maximize the income in the concurrent VR task (see Chapter 9 in Dayan and Abbott (2001)),
$$p_a = e^{\beta q_a} \Big/ \sum_{a'} e^{\beta q_{a'}}, \qquad \Delta q_a = \alpha\,(\delta_{a,a_t} - p_a)\,r_t. \tag{35}$$

The long-term averages of the changes in the policy parameters are given as
$$\langle \Delta q_a \rangle_\tau = \alpha\bigl( \langle r_t\,\delta_{a,a_t} \rangle_\tau - \langle p_a r_t \rangle_\tau \bigr) = \alpha\bigl( R_a - \langle p_a r_t \rangle_\tau \bigr). \tag{36}$$
If the learning rate $\alpha$ is sufficiently small, then the current choice probabilities $\{p_a\}$ are approximately constant during the averaging span $\tau$. In this case, the term $\langle p_a r_t \rangle_\tau$ can be factorized into $\langle p_a \rangle_\tau \langle r_t \rangle_\tau$. The average of a current choice probability, $\langle p_a \rangle_\tau$, is approximately equal to the average choice frequency $P_a$. Thus, the steady-state condition $\langle \Delta q_a \rangle_\tau \simeq 0$ implies the matching law over $\tau$ time steps:
$$P_a = R_a \big/ \langle r_t \rangle_\tau = R_a \Big/ \sum_{a'=1}^{n} R_{a'}. \tag{37}$$
Therefore, the "direct actor" operating at a sufficiently small learning rate always exhibits the matching behavior in the steady state.
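A corresponding sketch of the "direct actor" of Eq. (35) is given below (assumed Python; unlike the actor-critic rule, the gradient-ascent step moves all policy parameters whenever a reward is obtained).

```python
import numpy as np

def direct_actor(env, n_actions, alpha=0.01, beta=10.0, n_steps=10000, rng=None):
    """Direct actor (after Dayan & Abbott, 2001, Ch. 9, as summarized above):
    soft-max policy over q_a, updated by Delta q_a = alpha * (delta_{a,a_t} - p_a) * r_t (Eq. 35)."""
    rng = rng or np.random.default_rng(0)
    q = np.zeros(n_actions)                    # policy parameters q_a
    for t in range(n_steps):
        p = np.exp(beta * q)
        p /= p.sum()                           # soft-max choice probabilities
        a = rng.choice(n_actions, p=p)
        r = env(a)
        onehot = (np.arange(n_actions) == a).astype(float)
        q += alpha * (onehot - p) * r          # Eq. (35): every parameter moves when r_t != 0
    return q
```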

3.4. Numerical simulations

We find that the matching behavior is always observed in the steady state of the above-mentioned learning algorithms, the "local matching law", "melioration", "actor-critic" and the "direct actor", as long as the learning rate $\alpha$ is sufficiently small. Fig. 6 shows simulations of the learning algorithms in four successive blocks consisting of the concurrent VI and VR task and Vaughan's task. In block 1, certain VI and VR schedules are assigned to alternatives 1 and 2, respectively, and in block 2, the VI and VR schedules are assigned to alternatives 2 and 1. In block 3, different linear functions are assigned to alternatives 1 and 2 as the Q functions of Vaughan's task, and in block 4, alternatives 1 and 2 are swapped.
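To indicate how such a simulation can be set up, here is a minimal sketch of a two-alternative environment with one VI and one VR schedule, as in blocks 1 and 2. It assumes the standard discrete-time schedule definitions (a VR alternative pays off with a fixed probability per choice, while a VI alternative is baited with probability $\lambda$ per time step and holds the bait until it is chosen); the paper's own implementation details are not specified here, and the function names are illustrative. Any of the agents sketched above can be run against it, e.g. `local_matching_law(make_concurrent_vi_vr(vi_arm=0), n_actions=2)`.

```python
import numpy as np

def make_concurrent_vi_vr(lambda_vi=0.25, lambda_vr=0.3, vi_arm=0, rng=None):
    """Two alternatives: a VI arm (baited with prob. lambda_vi per time step,
    the bait persisting until collected) and a VR arm (reward with prob.
    lambda_vr per choice). Returns env(action) -> reward in {0.0, 1.0}."""
    rng = rng or np.random.default_rng(1)
    state = {"baited": False}
    def env(action):
        # the VI arm may become baited on every time step, whichever arm is chosen
        state["baited"] = state["baited"] or (rng.random() < lambda_vi)
        if action == vi_arm:
            r = 1.0 if state["baited"] else 0.0
            state["baited"] = False           # collecting the reward clears the bait
            return r
        return 1.0 if rng.random() < lambda_vr else 0.0
    return env
```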

Y. Sakai et al. / Neural Networks 19 (2006) 1091–1105

The top panels (Fig. 6a–d) show the reward probabilities and the average incomes in the same way as Fig. 5. Solid curves in the top panels show the average incomes $\langle r_t \rangle$ as functions of the choice probability $P_1$ in the corresponding blocks. The vertical solid and dotted lines indicate the optimal and matching choice probabilities in the individual blocks, respectively, which are represented by the horizontal solid and dotted lines in the bottom panel (Fig. 6e). The solid curves in Fig. 6e show the time courses of the local frequencies of choosing alternative 1 during the past 100 time steps in "actor-critic" learning, the "direct actor", the "local matching law" and "melioration". We can see that all learning algorithms with a practical learning rate learn the matching choice probabilities (horizontal dotted lines) rather than the optimal choice probabilities (horizontal solid lines).

3.5. Decision algorithm in neural systems

How can we find which algorithm is the most likely candidate for the decision algorithm implemented in neural systems? We have confirmed the matching law in the long-term choice frequencies of the following learning algorithms: the "local matching law", "melioration", "actor-critic" and the "direct actor". Therefore, there is no difference in the long-term choice frequencies in the steady state. Is there any difference in the transient time courses of the choice frequencies? Such transient time courses were examined in animals' behaviors by Gallistel, Mark, King, and Latham (2001). However, there seems to be no significant difference among the learning algorithms in the transient time courses of the choice frequencies following the changes in the task setting (Fig. 6e). The present results imply that detailed experimental analyses, such as trial-by-trial fittings between theoretical and behavioral results (Samejima et al., 2005), are required to determine which algorithm is most likely to work in an animal's decision behavior.

For the sake of simplicity, we have omitted state transitions and state-dependent choices, which are included in the conventional framework of reinforcement learning theory. The state-dependent choice probabilities can reflect the reward expectation in the future through the state transitions caused by action sequences (see Sutton and Barto (1998)). Therefore, the state dependence plays an active computational role when the reward associated with an action may be delivered with a delay. Temporal difference (TD) learning is one of the learning algorithms used to estimate the expected value of future reward. The framework we discussed here, including the "matching law" and "melioration", can easily be generalized to incorporate state dependence and TD learning.

Neural correlates of behavioral results might tell which algorithm is the most likely. The neural mechanism of decision making has been studied extensively in various behavioral tasks (Barraclough et al., 2004; Breiter et al., 2001; Knutson et al., 2001; Montague & Berns, 2002). Accumulating evidence suggests that the errors in reward prediction are signaled by dopamine neurons in the ventral tegmental area and the basal ganglia (McClure et al., 2003; Montague et al., 1996; Schultz et al., 1997), and the mechanisms to generate such neuronal responses have been suggested by modeling studies (Dayan & Balleine, 2002; Doya, 2000; Haruno et al., 2004; Houk, Davis, & Beiser, 1994; O'Doherty et al., 2004; Schultz, 2004; Seung, 2003; Tanaka et al., 2004; Wang, 2002). Recently, Samejima et al. (2005) introduced a well-designed combination of concurrent VR tasks to determine whether the recorded neuronal activities are correlated with action-dependent values, action-independent values, or actions. It has been suggested that the activities of many neurons in the striatum of monkeys are correlated with action-dependent values. Because the policy parameters are also correlated with action-dependent values, these results do not necessarily imply that learning algorithms based on action values, such as "Q-learning" or "melioration", are more likely than those based on action-independent values, such as "actor-critic". In addition, we can reformulate the "actor-critic" without using the value $V$, since $V$ can be replaced by the summation of the policy parameters, $\Delta q_a = \alpha\,(r_t - \sum_{a'} q_{a'})\,\delta_{a,a_t}$; indeed, summing the actor update in (29) over all alternatives reproduces the critic update $\Delta V = \alpha(r_t - V)$, so that $V$ and $\sum_{a'} q_{a'}$ remain equal if they are initialized equally. Therefore, the "actor-critic" system does not require any explicit representation of the action-independent value $V$. To determine the algorithm that optimally fits the known behavior of subjects, further experimental tests are required for discriminating the action values $\{Q_a\}$ from the policy parameters $\{q_a\}$.

4. Concluding remarks

In this article, we have reviewed recent progress in our understanding of the neural mechanisms and computational algorithms underlying decision making. Since decision making in general suffers from uncertainty, information must be accumulated before a decision is made by the brain. For the accumulation of information, the underlying neuronal network conducts temporal integration of its external input. We have described several mechanisms that have been proposed for temporal integration, based on the fine tuning of synaptic weights, slow reverberating synapses, bistable neurons with neuron-dependent parameter tuning, or stochastically driven bistable neurons. It is known that some cortical neurons show a temporal integration ability with multiple stable firing states without reverberating synaptic input; such neurons have also been described in some detail. We note that the stochastic dynamics of a bistable-neuron network (Fukai et al., 2003) qualitatively differ from those of a similar network of discrete bistable units (Okamoto & Fukai, 2001). In the latter, the transitions between the bistable states occur rapidly and nearly simultaneously in a majority of the units; this cooperative network behavior resembles phase-transition phenomena in statistical physics. By contrast, the recurrent synaptic input in the former reduces the effective membrane time constant of the individual spiking neurons, thus modulating the speed of temporal integration by the individual neurons, and the neural population dynamics show a gradual increase in activity.

Decision making relies on a wider range of computations than simple temporal integration. For instance, new information must often be compared with the information already stored in the brain, and decision making often involves selecting one option from among others.

It is generally considered that networks of inhibitory neurons may provide the competition mechanism required for such a selection process, although an alternative mechanism based on continuous attractor dynamics has recently been proposed (Machens et al., 2005). Not all of these points have been discussed in this article.

In the second half of the paper, we have discussed the computational algorithms proposed in the reinforcement learning paradigm and their implications for the matching behavior of animals and humans in probabilistic choice tasks. There is now sufficiently strong evidence that TD learning plays a crucial role in an animal's goal-directed behaviors, including choice behaviors, and that the TD error is represented by the activity of midbrain dopamine neurons. It is, however, less clear how the TD error is used by the neuronal networks that are responsible for determining the strategy of choosing options. In addition, different computational algorithms may often generate similar behavioral responses, as demonstrated in this article. Further studies are required for clarifying the neural basis of decision making and choice behavior.

References

Aksay, E., Baker, R., Seung, S., & Tank, D. W. (2003). Correlated discharge among cell pairs within the oculomotor horizontal velocity-to-position integrator. Journal of Neuroscience, 23, 10852–10858.
Aksay, E., Gamkrelidze, G., Seung, H. S., Baker, R., & Tank, D. W. (2001). In vivo intracellular recording and perturbation of persistent activity in a neural integrator. Nature Neuroscience, 4, 184–193.
Barlow, H., & Tripathy, S. P. (1997). Correspondence noise and signal pooling in the detection of coherent visual motion. The Journal of Neuroscience, 17, 7954–7966.
Barraclough, D., Conroy, M., & Lee, D. (2004). Prefrontal cortex and decision making in a mixed-strategy game. Nature Neuroscience, 7(4), 404–410.
Baum, W., & Rachlin, H. (1969). Choice as time allocation. Journal of the Experimental Analysis of Behavior, 12, 861–874.
Baum, W. M. (1981). Optimization and the matching law as accounts of instrumental behavior. Journal of the Experimental Analysis of Behavior, 36, 387–402.
Blaustein, M. P., & Golovina, V. A. (2001). Structural complexity and functional diversity of endoplasmic reticulum Ca2+ store. Trends in Neurosciences, 24, 602–608.
Breiter, H. C., Aharon, I., Kahneman, D., Dale, A., & Shizgal, P. (2001). Functional imaging of neural responses to expectancy and experience of monetary gains and losses. Neuron, 30, 619–639.
Brody, C. D., Hernandez, A., Zianos, A., & Romo, R. (2003). Timing and neural encoding of somatosensory parametric working memory in macaque prefrontal cortex. Cerebral Cortex, 13, 1196–1207.
Brody, C. D., Romo, R., & Kepecs, A. (2003). Basic mechanisms for graded persistent activity: Discrete attractors, continuous attractors, and dynamical representations. Current Opinion in Neurobiology, 13, 204–211.
Burgi, P. Y., Yuille, A. L., & Grzywacz, N. M. (2000). Probabilistic motion estimation based on temporal coherence. Neural Computation, 12, 1839–1867.
Davison, M., & McCarthy, D. (1987). The matching law: A research review. Lawrence Erlbaum Assoc Inc.
Daw, N. D., & Touretzky, D. S. (2002). Long-term reward prediction in TD models of the dopamine system. Neural Computation, 14, 2567–2583.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: The MIT Press.
Dayan, P., & Balleine, B. W. (2002). Reward, motivation, and reinforcement learning. Neuron, 36, 285–298.
DeCarlo, L. T. (1985). Matching and maximizing with variable-time schedules. Journal of the Experimental Analysis of Behavior, 43, 75–81.

Doya, K. (2000). Complementary roles of basal ganglia and cerebellum in learning and motor control. Current Opinion in Neurobiology, 10, 732–739.
Durstewitz, D. (2003). Self-organizing neural integrator predicts interval times through climbing activity. Journal of Neuroscience, 23, 5342–5353.
Egelman, D. M., Person, C., & Montague, P. R. (1998). A computational role for dopamine delivery in human decision-making. Journal of Cognitive Neuroscience, 10, 623–630.
Egorov, A. V., Hamam, B. N., Fransen, E., Hasselmo, M. E., & Alonso, A. A. (2002). Graded persistent activity in entorhinal cortex neurons. Nature, 420, 173–178.
Fransen, E., Tahvildari, B., Egorov, A. V., Hasselmo, M. E., & Alonso, A. A. (2006). Mechanism of graded persistent cellular activity of entorhinal cortex layer V neurons. Neuron, 49, 735–746.
Fukai, T., Kitano, K., & Okamoto, H. (2003). Time representation in the cortex: Two models inspired by prefrontal persistent activity, synfire chain and unitary events. Biological Cybernetics, 88, 387–394.
Funahashi, S., Bruce, C. J., & Goldman-Rakic, P. S. (1989). Mnemonic coding of visual space in the monkey's dorsolateral prefrontal cortex. Journal of Neurophysiology, 61, 331–349.
Gallistel, C., Mark, T., King, A., & Latham, P. (2001). The rat approximates an ideal detector of changes in rates of reward: Implications for the law of effect. Journal of Experimental Psychology: Animal Behavior Processes, 27, 354–372.
Gold, J. I., & Shadlen, M. N. (2000). Representations of a perceptual decision in developing oculomotor commands. Nature, 404, 390–394.
Gold, J. I., & Shadlen, M. N. (2001). Neural computations that underlie decisions about sensory stimuli. Trends in Cognitive Sciences, 5, 10–16.
Goldman, M. S., Levine, J. H., Major, G., Tank, D. W., & Seung, H. S. (2003). Robust persistent neural activity in a model integrator with multiple hysteretic dendrites per neuron. Cerebral Cortex, 13, 1185–1195.
Goldman-Rakic, P. S. (1995). Cellular basis of working memory. Neuron, 14, 477–485.
Haruno, M., Kuroda, T., Doya, K., Toyama, K., Kimura, M., Samejima, K., et al. (2004). A neural correlate of reward-based behavioral learning in caudate nucleus: A functional magnetic resonance imaging study of a stochastic decision task. Journal of Neuroscience, 24, 1660–1665.
Herrnstein, R. J. (1961). Relative and absolute strength of response as a function of frequency of reinforcement. Journal of the Experimental Analysis of Behavior, 4, 267–272.
Herrnstein, R. J. (1997). The matching law: Papers in psychology and economics. Cambridge, MA: Harvard Univ. Press.
Herrnstein, R. J., & Heyman, G. M. (1979). Is matching compatible with reinforcement maximization on concurrent variable interval, variable ratio? Journal of the Experimental Analysis of Behavior, 31, 209–223.
Herrnstein, R. J., & Vaughan, W. J. (1980). Melioration and behavioral allocation. In J. Staddon (Ed.), Limits to action: The allocation of individual behavior. New York: Academic Press.
Heyman, G., & Monaghan, M. (1994). Reinforcer magnitude (sucrose concentration) and the matching law theory of response strength. Journal of the Experimental Analysis of Behavior, 61, 505–516.
Heyman, G. M. (1979). A Markov model description of changeover probabilities on concurrent variable-interval schedules. Journal of the Experimental Analysis of Behavior, 31, 41–51.
Houk, J. C., Davis, J. L., & Beiser, D. G. (1994). Models of information processing in the Basal Ganglia (Computational neuroscience). Bradford Books.
Jacobs, E. A., & Hackenberg, T. D. (1996). Humans' choices in situations of time-based diminishing returns: Effects of fixed-interval duration and progressive-interval step size. Journal of the Experimental Analysis of Behavior, 65, 5–19.
Knutson, B., Adams, C. M., Fong, G. W., & Hommer, D. J. (2001). Anticipation of increasing monetary reward selectively recruits nucleus accumbens. Journal of Neuroscience, 15, 1–5.
Koene, R. A., & Hasselmo, M. E. (2005). An integrate-and-fire model of prefrontal cortex neuronal activity during performance of goal-directed decision making. Cerebral Cortex, 15, 1964–1981.
Koulakov, A. A., Raghavachari, S., Kepecs, A., & Lisman, J. E. (2002). Model for a robust neural integrator. Nature Neuroscience, 5, 775–782.

Lisman, J. E., Fellous, J. M., & Wang, X. J. (1998). A role for NMDA-receptor channels in working memory. Nature Neuroscience, 1, 273–275.
Loewenstein, Y., & Sompolinsky, H. (2003). Temporal integration by calcium dynamics in a model neuron. Nature Neuroscience, 6, 961–967.
Machens, C. K., Romo, R., & Brody, C. D. (2005). Flexible control of mutual inhibition: A neural model of two-interval discrimination. Science, 307, 1121–1124.
Mazur, J. (1981). Optimization theory fails to predict performance of pigeons in a two-response situation. Science, 214(4522), 823–825.
Mazur, J. E. (2005). Learning and behavior (6th ed.). Prentice Hall.
McClure, S., Berns, G. S., & Montague, P. (2003). Temporal prediction errors in a passive learning task activate human striatum. Neuron, 38, 339–346.
Miller, P., Brody, C. D., Romo, R., & Wang, X. J. (2003). A recurrent network model of somatosensory parametric working memory in the prefrontal cortex. Cerebral Cortex, 13, 1208–1218.
Mongillo, G., Amit, D. J., & Brunel, N. (2003). Retrospective and prospective persistent activity induced by Hebbian learning in a recurrent cortical network. European Journal of Neuroscience, 18, 2011–2024.
Montague, P., & Berns, G. (2002). Neural economics and the biological substrates of valuation. Neuron, 36, 265–284.
Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16, 1936–1947.
Morris, G., Arkadir, D., Nevet, A., Vaadia, E., & Bergman, H. (2004). Coincident but distinct messages of midbrain dopamine and striatal tonically active neurons. Neuron, 43, 133–143.
O'Doherty, J., Dayan, P., Schultz, J., Deichmann, R., Friston, K., & Dolan, R. J. (2004). Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science, 304, 452–454.
Okamoto, H., & Fukai, T. (2001). Neural mechanism for a cognitive timer. Physical Review Letters, 86, 3919–3922.
Pastor, A. M., De la Cruz, R. R., & Baker, R. (1994). Eye position and eye velocity integrators reside in separate brainstem nuclei. Proceedings of the National Academy of Sciences USA, 91, 807–811.
Parekh, A. B., & Penner, R. (1997). Store depletion and calcium influx. Physiological Reviews, 77, 901–930.
Platt, M., & Glimcher, P. (1999). Neural correlates of decision variables in parietal cortex. Nature, 400(6741), 233–238.
Putney, J. W., Jr., & Ribeiro, C. M. (2000). Signaling pathways between the plasma membrane and endoplasmic reticulum calcium stores. Cellular and Molecular Life Sciences, 57, 1272–1286.
Rachlin, H., Green, L., Kagel, J., & Battalio, R. (1976). Economic demand theory and psychological studies of choice. In G. Bower (Ed.), The psychology of learning and motivation: vol. 10 (pp. 129–154). New York: Academic Press.
Ratcliff, R. (2001). Putting noise into neurophysiological models of simple decision making. Nature Neuroscience, 4, 336–337.
Reddi, B. A., & Carpenter, R. H. (2000). The influence of urgency on decision time. Nature Neuroscience, 3, 827–830.
Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin, Heidelberg: Springer-Verlag.
Romo, R., Brody, C. D., Hernandez, A., & Lemus, L. (1999). Neuronal correlates of parametric working memory in the prefrontal cortex. Nature, 399, 470–473.
Rosen, M. J. (1972). A theoretical neural integrator. IEEE Transactions on Biomedical Engineering, 19, 362–367.


Sakagami, T., Hursh, S. R., Christensen, J., & Silberberg, A. (1989). Income maximizing in concurrent interval-ratio schedules. Journal of the Experimental Analysis of Behavior, 52, 41–46.
Samejima, K., Ueda, Y., Doya, K., & Kimura, M. (2005). Representation of action-specific reward values in the striatum. Science, 310, 1337–1340.
Savastano, H. I., & Fantino, E. (1994). Human choice in concurrent ratio-interval schedules of reinforcement. Journal of the Experimental Analysis of Behavior, 61, 453–463.
Schultz, W. (2004). Neural coding of basic reward terms of animal learning theory, game theory, microeconomics and behavioural ecology. Current Opinion in Neurobiology, 14, 139–147.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.
Seung, H. S. (1996). How the brain keeps the eye still. Proceedings of the National Academy of Sciences USA, 93, 13339–13344.
Seung, H. (2003). Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron, 40, 1063–1073.
Seung, H. S., Lee, D. D., Reis, B. Y., & Tank, D. W. (2000). Stability of the memory of eye position in a recruitment network of conductance-based model neurons. Neuron, 26, 259–271.
Schall, J. D., & Hanes, D. P. (1998). Neural mechanisms of selection and control of visually guided eye movements. Neural Networks, 11, 1241–1251.
Shadlen, M. N., & Newsome, W. T. (2001). Neural basis of a perceptual decision in the parietal cortex (area LIP) of the Rhesus monkey. Journal of Neurophysiology, 86, 1916–1936.
Shapiro, J. L., & Wearden, J. (2002). Reinforcement learning and time perception — a model of animal experiments. Advances in Neural Information Processing Systems, 14.
Stubbs, D. A., Pliskoff, S. S., & Reid, H. M. (1977). Concurrent schedules: A quantitative relation between changeover behavior and its consequences. Journal of the Experimental Analysis of Behavior, 27, 85–96.
Sugrue, L. P., Corrado, G. S., & Newsome, W. T. (2004). Matching behavior and the representation of value in the parietal cortex. Science, 304, 1782–1787.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning. Cambridge, MA: MIT Press.
Takeda, K., & Funahashi, S. (2002). Prefrontal task-related activity representing visual cue location or saccade direction in spatial working memory tasks. Journal of Neurophysiology, 87, 567–588.
Tanaka, S. C., Doya, K., Okada, G., Ueda, K., Okamoto, Y., & Yamawaki, S. (2004). Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops. Nature Neuroscience, 7, 887–893.
Teramae, J., & Fukai, T. (2005). A cellular mechanism for graded persistent activity in a model neuron and its implications in working memory. Journal of Computational Neuroscience, 18, 105–121.
Vaughan, W., Jr. (1981). Melioration, matching, and maximization. Journal of the Experimental Analysis of Behavior, 36, 141–149.
Vyse, S. A., & Belke, T. W. (1992). Maximizing versus matching on concurrent variable-interval schedules. Journal of the Experimental Analysis of Behavior, 58, 325–334.
Wang, X. (2002). Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36(5), 955–968.