
How Do People Learn to Allocate Resources? Comparing Two Learning Theories

Jörg Rieskamp, Jerome R. Busemeyer, and Tei Laine Indiana University, Bloomington, Indiana

Running Head: Learning to Allocate Resources


Abstract

How do people learn to allocate resources? To answer this question, two major learning models are compared, each incorporating different learning principles. One is a global search model, which assumes that allocations are made probabilistically based on expectations that are formed through the entire history of past decisions. The second is a local adaptation model, which assumes that allocations are made by comparing the present decision to the most successful decision up to that point, ignoring all other past decisions. The models' predictions are tested in two studies in which participants repeatedly allocated a capital resource to three financial assets. In both studies substantial learning effects occurred, although the optimal allocation was often not found. In particular, many participants became trapped at a local rather than the global maximum. The models calibrated in Study 1 were used to generate a priori predictions, which were then tested in Study 2. This generalization test demonstrated that the local adaptation model provides a better account of learning in resource allocation tasks than the global search model.

Keywords: Resource allocation, decision making, reinforcement learning, hill-climbing learning, local search model, global search model, directional learning, local and global maximization


How do people learn to improve their decision-making behavior through past experience? The purpose of this article is to compare two fundamentally different learning approaches introduced in the decision-making literature that address this issue. One approach, called global search models, assumes that individuals form expectancies for every feasible choice alternative by keeping track of the history of all previous decisions and searching for the strongest of all these expectancies. Prominent recent examples that belong to this approach are the reinforcement-learning models of Erev and Roth (1998; see also Roth & Erev, 1995; Erev, 1998). These models follow up a long tradition of stochastic learning models (Bush & Mosteller, 1955; Estes, 1950; Luce, 1959). The second approach, called local adaptation models, does not assume that people acquire a representation for every choice alternative and keep track of the history of all past decisions. Instead the approach assumes that people compare the consequences of a current decision with a reference point (i.e. the previous decision or the most successful decision up to that point) and adjust their decision in the direction of successful decisions. Prominent learning models that belong to this approach are the learning direction theory by Selten and Stöcker (1986), the error-correction learning models (e.g. Thomas, 1973; Dorfman, Saslow, & Simpson, 1975) and the hill-climbing learning model by Busemeyer and Myung (1987; see also Busemeyer & Myung, 1992). Global search models have recently accumulated a large amount of support especially within a particular domain of decision making—constant-sum games—in which there are a small number of actions available on each trial (Erev & Roth, 1998). It is not clear that this success will extend to decision problems involving a large continuous set of choice options. Global search models may be applicable to situations in which the decision alternatives form a small set of qualitatively different strategies, whereas local adaptation models may be applicable in


situations in which the decision alternatives form a continuous metric space of strategies (see Busemeyer & Myung, 1992). Constant-sum games are representative of the former decision problems and resource allocation tasks are representative of the latter decision problems. In the latter domain, local adaptation models may work better. Our main goal consists of comparing the ability of the two learning models to describe a learning process for a particular decision problem, called the resource allocation problem, which provides a large continuous set of choice options to the decision maker on each trial. In the following, we will compare two versions of learning models that best represent the two approaches for the resource allocation task. Direct comparisons of learning models have rarely been done, especially not for the type of task we are considering. Resource Allocation Decision Making Allocating resources to different assets is a decision problem people often face. A few examples of resource allocation decision making are dividing work time between different activities, dividing attention between different tasks, allocating a portfolio to different financial assets, or devoting land to different types of farming. Despite the ubiquity in real life, resource allocation decision making has not been thoroughly acknowledged in the psychological literature. How good are people at making resource allocation decisions? Only a handful of studies have tried to address this question. In one of the earliest studies by Gingrich and Soli (1984), participants were asked to evaluate all of the potential assets before making their allocations. Although the assets were evaluated accurately, the majority of participants failed to find the optimal allocation. Northcraft and Neale (1986) also demonstrated individuals’ difficulties with allocation decisions when attention had to be paid to financial setbacks and opportunity costs.


Benartzi and Thaler (2001) studied retirement asset allocations. For this allocation problem the strategy of diversifying one’s investment among assets (i.e. bonds and stocks) appears to be reasonable (Brennan, Schwartz, & Lagnado, 1997). Benartzi and Thaler (2001) showed that many people follow a “1/n strategy” by equally dividing the resource among the suggested investment assets. Although such a strategy leads to a sufficient diversification, the final allocation depends on the number of assets and thus can lead to inconsistent decisions. The above studies show that individuals often do not allocate their resources in an optimal way, which is not surprising given the complexity of most allocation problems. Furthermore, in the above studies little opportunity was provided for learning as the allocation decisions were made only once or infrequently. In contrast, Langholtz, Gettys, and Foote (1993) required participants to make allocations repeatedly (eight times). Two resources—fuel and personnel hours for helicopters—were allocated across a working week to maximize the operating hours of the helicopters. Participants improved their performance substantially through learning and almost reached the optimal allocation. However, under conditions of risk or uncertainty, in which the amount of the resource fluctuated over time, the improvement was less substantial. For similar allocation problems, Langholtz, Gettys, and Foote (1994, 1995) and Langholtz, Ball, Sopchak, and Auble (1997) showed again that learning leads to substantially improved allocations. Interestingly, a tendency to allocate the resource equally among the assets was found here also. Ball, Langholtz, Auble, and Sopchak (1998) investigated people’s verbal protocols when they solved an allocation problem. According to these protocols participants seemed to use simplifying heuristics, which got them surprisingly close to the optimal allocation (i.e. reached on average 94% efficiency).


In a study by Busemeyer, Swenson, and Lazarte (1986), an extensive learning opportunity was provided, as participants made 30 resource allocations. Participants quickly found the optimal allocation for a simple allocation problem that had a single global maximum. However, when the payoff function had several maxima the optimal allocation was frequently not found. Busemeyer and Myung (1987) studied the effect of the range of payoffs between the best and worst allocation and the variability of return rates for the assets. For the majority of conditions, participants reached good allocations through substantial learning. However, under a condition with widely varying return rates and with a low range of payoffs, participants got lost and did not exhibit much learning effect. Furthermore, they showed that a hill-climbing learning model, which assumes that individuals improve their allocations step-by-step, provided a good description of the learning process. However, this model was not compared against alternative models, and therefore it remains unclear whether or not this is the best way to characterize learning in these tasks. It can be concluded that when individuals make only a single resource allocation decision, with no opportunity to learn from experience, they generally do not find good allocations at once. In such situations, individuals have a tendency to allocate an equal share of the resource to the different assets, which, depending on the situation, can lead to bad outcomes. Alternatively, when individuals are given the opportunity to improve their allocations through feedback, substantial learning effects are found and individuals often approach optimal allocations. However, if local maxima are present or the payoffs are highly variable, then suboptimal allocations can result even after extensive training. In the following two studies, repeated allocation decisions with outcome feedback were made, providing sufficient opportunity for learning. The allocation problem of both studies can


be defined as follows: The decision maker is provided with a financial resource that can be invested in three financial assets. A particular allocation i (allocation alternative) can be represented by a three-dimensional vector X, where each dimension represents the proportion of the resource invested in one of the three assets. For repeated decisions, the symbol X_t represents the allocation made at trial t. The distance between two allocations can be measured by the Euclidean distance, which is the length of the vector that leads from one allocation to the other.1 The restriction of proportions to integer percentages implies a finite number of N = 5,151 possible allocations.

Learning Models

The first learning model proposed to describe learning effects of repeated resource allocations represents the global search (GLOS) model approach. The GLOS model presented here is a modified version of the reinforcement-learning model proposed by Erev (1998). The second learning model represents the local adaptation (LOCAD) model approach. The LOCAD model presented here is a modified version of the hill-climbing learning model proposed by Busemeyer and Myung (1987). As pointed out above, previous research has shown that both learning approaches can successfully describe people's learning processes for various decision problems: Busemeyer and Myung (1987) found that a hill-climbing learning model described people's learning process for a resource allocation task, and Busemeyer and Myung (1992) successfully applied a hill-climbing model to describe criterion learning in a probabilistic categorization task. In contrast, Erev (1998) has shown that a reinforcement-learning model is also appropriate to describe the learning process for a categorization task. Furthermore, Erev (1998) explicitly proposed the reinforcement-learning


model as an alternative to the hill-climbing model. Moreover, Erev and Gopher (1999) suggested the reinforcement-learning model for a resource allocation task in which attention was the resource to be allocated, and showed by simulation that the model's predictions are consistent with experimental findings. In sum, direct comparisons of the two approaches appear necessary to decide in which domain each learning model approach works best. To extend the generality of our comparison of the GLOS and LOCAD models, we will first compare the models with respect to how well they predict a learning process for repeated resource allocations, and second we will test whether the models are also capable of predicting individual characteristics of the learning process. Finally, while it is true that we can only test special cases for each approach, we currently do not know of any other examples within either approach that can outperform the versions that we are testing.

Global search model. Erev (1998), Roth and Erev (1995), and Erev and Roth (1998) have proposed, in varying forms, a reinforcement-learning model for learning in different decision problems. The GLOS model was designed specifically for the resource allocation problem. The basic idea of the model is that decisions are made probabilistically in proportion to expectancies (called propensities by Erev and colleagues). The expectancy for a particular option increases whenever a positive payoff or reward is provided after it is chosen. This general reinforcement idea can be traced back to early work by Bush and Mosteller (1955), Estes (1950), and Luce (1959); for more recent learning models see Börgers and Sarin (1997), Camerer and Ho (1999a, 1999b), Harley (1981), Stahl (1996), and Sutton and Barto (1998). The GLOS learning model for the resource allocation problem is based on the following assumptions: Each allocation alternative is assigned a particular "expectancy." First, an


allocation is selected probabilistically proportional to the expectancies. Second, the received payoff is used to determine the reinforcement for all allocation alternatives, in such a way that the chosen allocation alternative receives reinforcement equal to the obtained payoff, allocation alternatives close to the chosen one receive slightly less reinforcement, and allocation alternatives that are far away from the chosen allocation alternative receive very little reinforcement. Finally, the reinforcement is used to update the expectancies of each allocation alternative and the process returns to the first step. In more detail GLOS is defined as follows: The preferences for the different allocation alternatives are expressed by expectancies q_{it}, where i is an index of the finite number of possible allocations. The probability p_{it} that a particular allocation i is chosen at trial t is defined by (cf. Erev & Roth, 1998):

p_{it} = q_{it} / Σ_{i=1}^{N} q_{it} .  (1)
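To illustrate the selection, reinforcement, and updating cycle just described, the following sketch implements one GLOS trial in Python with NumPy. It is an illustration only, not the authors' code: the grid construction, the placeholder payoff function, and the parameter values are assumptions, and the generalization and updating rules it uses anticipate Equations 2 and 3 below.

```python
import numpy as np
from itertools import product

# All N = 5,151 integer-percentage allocations (shares of assets A, B, and C).
GRID = np.array([(a, b, 100 - a - b)
                 for a, b in product(range(101), repeat=2) if a + b <= 100], dtype=float)
N = len(GRID)

def glos_trial(q, payoff, sigma_r=2.0, phi=0.25, v_min=1e-4, rng=np.random.default_rng()):
    """One GLOS trial: choice proportional to expectancies (Equation 1), distance-based
    generalization of the reinforcement (Equation 2), and updating with a forgetting
    rate (Equation 3). `payoff` is a placeholder for the payoff function of the task."""
    p = q / q.sum()                               # Equation 1: choice probabilities
    j = rng.choice(N, p=p)                        # index of the chosen allocation
    r_j = payoff(GRID[j])                         # payoff obtained on this trial
    x = np.linalg.norm(GRID - GRID[j], axis=1)    # Euclidean distances to the chosen allocation
    g = np.exp(-x ** 2 / (2 * sigma_r ** 2))      # generalization function g(x_ij)
    r = r_j * g if r_j >= 0 else r_j * g - r_j    # modification used for negative payoffs
    q = (1 - phi) * q + r                         # Equation 3: expectancy updating
    return j, r_j, np.maximum(q, v_min)           # enforce the minimum expectancy v
```

On the first trial, q would be initialized to w times the payoff expected from random choice, as described next.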

For the first trial, all expectancies are assumed to be equal and determined by the average payoff that can be expected from random choice, multiplied by w, which is a free, so-called "initial strength parameter" and is restricted to w > 0. After a choice of allocation alternative j on trial t is made, the expectancies are updated by the reinforcement received from the decision, which is defined as the received payoff, r_{jt}. For a large grid of allocation alternatives, it is reasonable to assume that not only the chosen allocation is reinforced but also similar allocations. Therefore, to update the expectancies of any given allocation alternative i, the reinforcement r_{it} is determined by the following generalization function (cf. Erev, 1998):

r_{it} = r_{jt} · g(x_{ij}) = r_{jt} · exp(−x_{ij}² / (2σ_R²))  (2)

where x_{ij} is the Euclidean distance of a particular allocation i to the chosen allocation j, and the standard deviation σ_R is the second free parameter. This function was chosen so that the reinforcement r_{it} for the chosen allocation j should equal the received payoff r_{jt}.2 In the case of a negative payoff, r_{jt} < 0, Equation 2 was modified as follows: r_{it} = r_{jt} · g(x_{ij}) − r_{jt}. With this modification, if the current payoff is negative, the chosen allocation receives a reinforcement of zero, whereas all other allocation alternatives receive positive reinforcements. Finally, the determined reinforcement is used to update the expectancies by the following updating rule (see Erev & Roth, 1998):

q_{it} = (1 − φ) q_{i(t−1)} + r_{it} ,  (3)

where φ ∈[0,1] is the third free parameter, the “forgetting rate.” The forgetting rate determines how strongly previous expectancies affect new expectancies. If the forgetting rate is large, the obtained reinforcement has a strong effect on the new expectancies. To ensure that all possible allocation alternatives are chosen, at least with a small probability, the minimum expectancy for all options is restricted to v=0.0001 (according to Erev, 1998). After the updating process, the probability of selecting any particular allocation alternative is determined again. In sum, the GLOS learning model has three free parameters: (1) the initial strength parameter w, which determines the impact of the initial expectancies; (2) the standard deviation

σ_R of the generalization function, which determines how similar (close) allocations have to be to the chosen allocation to receive substantial reinforcement; and (3) the forgetting rate φ, which determines the impact of past experience compared to present experience. It is important to keep the number of parameters small, because models built on too many parameters will fail to generalize to new experimental conditions.

Local adaptation learning model. The LOCAD learning model incorporates the idea of a hill-climbing learning mechanism. In general, hill-climbing mechanisms are widely used heuristics for optimization problems


whose analytic solutions are too complex (Russell & Norvig, 1995). The basic idea is to start with a randomly chosen decision as a temporary solution and to change the decision slightly in the next trial. If the present decision led to a better outcome than a reference outcome (i.e. the best previous outcome), this decision is taken as the new temporary solution. Starting from this solution, a slightly different decision is made in the same direction as the present one. If the present decision led to an inferior outcome, the temporary solution is kept, and starting from this solution a new decision is made in the opposite direction from the present decision. The step size, that is, the distance between successive decisions, usually declines during search. The search stops when no further changes using this method yield substantial improvement. This process requires that the available decision alternatives have an underlying causal structure, such that they can be ordered by some criteria and a direction of change exists. Consequently, for decision problems that do not fulfill this requirement, the LOCAD learning model can hardly be applied. It is well known that hill-climbing heuristics are efficient, as they often require little search, but their disadvantage can be suboptimal convergence (Russell & Norvig, 1995), that is, "getting stuck" in a local maximum. LOCAD is defined as follows: It is assumed that decisions are made probabilistically, as in the GLOS learning model. In the first trial, identical to GLOS, an initial allocation is selected with equal probability from all possible allocations. For the second allocation, the probability p_{it} of selecting any particular allocation i is defined by the following distribution function:

p_{it} = f_S(x_{ij}) / K = exp[−(x_{ij} − s_t)² / (2σ_S²)] / K  (4)

where x_{ij} is the Euclidean distance of any allocation i to the first chosen allocation j, the standard deviation σ_S is the first free parameter, and K is simply a constant that normalizes the probabilities so that they sum to one. The step size, s_t, changes across trials as follows:


s_t = s_1 |v_{t−1} − v_{t−2}| / (2 v_b) + s_1 / t  (5)

where s_1, the initial step size, is the second free parameter, v_t is the payoff received at trial t (with v_0 = 0), and v_b is the payoff of the "reference allocation." The reference allocation is the allocation alternative that produced the highest payoff in the past and is represented by the index b for the best allocation so far. Accordingly, the step size is defined by two components. The first component depends on the payoffs of the preceding allocations and the maximum payoff received so far. The second component is time, manipulated so that the step size automatically declines over time. Note that for trial t = 2 the step size s_2 equals the initial step size s_1. For the third and all following trials, the probability of selecting any particular allocation is determined by the product of two operations, one that selects the step size and the other that selects the direction of change. More formally, the probability of selecting an allocation alternative i on trial t > 2 is given by

p_{it} = f_S(x_{ib}) f_A(y_{ij}) / K .  (6)

In the above equation, the probability of selecting a step size is determined by the function f_S(x_{ib}), which is the same function previously defined in Equation 4, with the distance x_{ib} defined as the Euclidean distance from any allocation i to the reference allocation b. The second function is given by f_A(y_{ij}) = exp[−(y_{ij} − a_t)² / (2σ_A²)], where y_{ij} is the angle between the direction vector of any allocation i and the direction vector of the preceding allocation j, and a_t equals 0° if the preceding allocation led to a payoff higher than or equal to that of the reference allocation; otherwise a_t equals 180°. The direction vector of any allocation i is defined as the vector from the preceding allocation j to the allocation i (i.e. X_i − X_j). The angle between the two direction vectors ranges from 0° to 180° (mathematically, the angle is determined by the arccosine of the scalar


product of the two direction vectors normalized to a length of one). The function f_A(y_{ij}) has a standard deviation σ_A as the third free parameter. In sum, the LOCAD learning model takes the following steps. In the first trial an allocation alternative is chosen with equal probability, and in the second trial a slightly different allocation alternative is selected. For selecting an allocation alternative in the third and all following trials, the payoff received in the preceding trial is compared to the payoff of the reference allocation, the allocation that produced the maximum payoff received so far (this is an important difference from the model proposed by Busemeyer & Myung, 1987, where the reference allocation was the previous allocation). If the payoff increased (or stayed the same), allocations in the same direction as the preceding allocation are likely to be selected. On the other hand, if the payoff decreased, allocations in the opposing direction are more likely to be selected. The LOCAD learning model has three free parameters: (1) the initial step size s_1, which is used to determine the most likely distance between the first and second allocation, and on which the succeeding step sizes depend; (2) the standard deviation σ_S of the distribution function f_S, which determines how much the distance between new allocations and the reference allocation is likely to differ from the distance defined by the step size s_t; and (3) the standard deviation σ_A of the distribution function f_A, which determines how much the direction of new allocations is likely to differ from the direction (or opposing direction) of the preceding allocation. The LOCAD learning model has similarities to the learning direction theory proposed by Selten and Stöcker (1986) and to the hill-climbing learning model proposed by Busemeyer and Myung (1987). Learning direction theory also assumes that decisions are slightly adjusted based on feedback, by comparing the outcome of a decision to hypothetical outcomes of alternative decisions. The LOCAD model represents a simple learning model with only three free


parameters, compared to the hill-climbing model proposed by Busemeyer and Myung (1987) with eight free parameters. The LOCAD model is to some extent also related to so-called belief-based learning models (Brown, 1951; Cheung & Friedman, 1997; Fudenberg & Levine, 1995, see also Camerer & Ho’s, 1999a, 1999b, experience-weighted attraction learning model that includes belief-based models as a special case). These models, which have been prevalently applied to learning in games, assume that people form beliefs about others’ future behavior based on past experience. Thereby decision alternatives that had not been chosen in the past and therefore did not obtain any reinforcement but would have resulted in good payoffs are likely to be chosen in the future. Similar to belief-based models the LOCAD model predicts that people form “beliefs” about which decision alternatives might produce higher payoffs compared to the present decision. However, in contrast to belief-based models these beliefs are not based on foregone payoffs that are determined by the total history of past decisions but are instead based on an assumption of the underlying causal structure of the decision problem. The relationship of the two learning models. The two models presented, in our view, are appropriate implementations of the two approaches of learning models we consider. Any empirical test of the two models strictly speaking only allows conclusions on the empirical accuracy of the particular learning models implemented. However, keeping this restriction in mind, both learning models are provided with a flexibility (expressed in the three free parameters of each model) that allows them to predict various learning processes. Variations of our implementations (e.g. using an exponential choice rule for determining choice probabilities instead of the implemented linear choice rule) might


increase the empirical fit of the model but will not abolish substantial different predictions made by the two learning models for the allocation decision problem we consider.3 What are the different predictions that can be derived from the two learning models? In general, the GLOS model predicts that the probabilities with which decision alternatives are selected depend on the total stock of previous reinforcements for these alternatives. This implies a global search process within the entire set of alternatives, which should frequently find the optimal alternative. In contrast, the LOCAD model only compares the outcome of the present decision with the best outcome so far and ignores all other experienced outcomes, which are not integrated in an “expectancy score” for each alternative. Instead, which alternatives will be chosen depends on the success and the direction of the present decision; thereby an alternative similar to the present alternative will most likely be selected. This implies a strong path dependency, so that depending on the starting point of the learning process, the model will often not converge to the optimal outcome if several payoff maxima exist. However, the specific predictions of the models depend on the parameters, so that particular parameter values could lead to similar behavior for both models. For instance, if the GLOS model has a high forgetting rate, the present allocation strongly influences the succeeding allocation, resulting in a local search similar to the LOCAD model, so that it could also explain convergence to local payoff maxima. Likewise, if the LOCAD model incorporates a large initial step size it implies a more global, random search process, and so it could explain convergence to a global maximum. Due to this flexibility of both models, one can expect a relatively good fit of both models when the parameter values are fitted to the data. Therefore, we used the generalization method (Busemeyer & Wang, 2000) to compare the models, which entails using a two-stage procedure. As a first stage, in Study 1, each model was


fit to the individual learning data, and the fits of the two models were compared. These fits provided estimates of the distribution of the parameters over individuals for each model. As a second stage, the parameter distributions estimated from Study 1 were used to generate model predictions for a new learning condition presented in Study 2. The accuracies of the a priori predictions of the two models for the new condition in Study 2 provide the basis for a rigorous comparison of the two models. Study 1 In the experiment, the decision problem consisted of repeatedly allocating a resource among three financial assets. The rates of return were initially unknown, but they could be learned by feedback from past decisions. To add a level of difficulty to the decision problem the rate of return for each asset varied depending on the amount invested in that asset and on the amount invested in the other assets. One could imagine a real-life analogue in which financial assets have varying returns because of fixed costs, economy of scale, or efficiency, depending on investments in other assets. The aim of the first study was to explore how people learn to improve their allocation decisions and whether they are able to find the optimal allocation that leads to the maximum payoff. Study 1 was also used to compare the fits of the two models to the individual data, and to estimate the distribution of parameters for each model. Method Participants. Twenty persons (14 women and 6 men) with an average age of 22 participated in the experiment. The computerized task lasted approximately 1 hr. Most participants (95%) were students in various departments of Indiana University. For their participation, they received a


show-up payment of $2. All additional payment depended on the participants’ performance; the average payment was $18. Procedure. The total payoff from an allocation is defined as the sum of payoffs obtained from the three assets. The selection of the particular payoff function was motivated by the two learning models’ predictions. As can be seen in Figure 1 the allocation problem was constructed such that a local and a global maximum with respect to the possible payoffs resulted. In general, one would expect that people will get stuck at the local payoff maximum if their learning process is consistent with the LOCAD model. In contrast, the GLOS model predicts a learning process that frequently should converge at the global payoff maximum. Figure 1 only shows the proportion invested in asset B and asset C, the rest being invested in asset A. High investments in asset C lead to low payoffs (in the worst case a payoff of -3.28), whereas low investments in asset C result in higher payoffs. The difficult part is to find out that there are two payoff maxima: first, the local maximum with a payoff of 32.82 when investing 28% in asset B and 19% in asset C, and second, the global maximum with a payoff of 34.46 when investing 88% in asset B and 12% in asset C, yielding a difference of 1.64 between the two maxima. Note that there is no variability in the payoffs at each allocation alternative so that if a person compares the local and the global maximum, it is perfectly obvious that there is a payoff difference between them favoring the latter. The main difficulty for the person is finding the global maximum, not detecting a difference between the local and global maxima. The Euclidean distance between the corresponding allocations of the local and global maximum is 80 (the maximum possible distance between two allocations is 141). From random choice an average


payoff of 24.39 can be expected. The payoff functions for each asset are provided in the Appendix.

[ Figure 1 ]
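As an aside, the structure of this search space is easy to reproduce programmatically. The sketch below is an illustration under stated assumptions, not the authors' code: it enumerates the 5,151 integer-percentage allocations, defines the Euclidean distance used throughout, and scans an arbitrary payoff function for its local and global maxima. The actual payoff functions of Study 1 are those given in the Appendix and are represented here only by a hypothetical placeholder `total_payoff`.

```python
import numpy as np
from itertools import product

# All integer-percentage allocations (A, B, C); there are 5,151 of them.
grid = np.array([(a, b, 100 - a - b)
                 for a, b in product(range(101), repeat=2) if a + b <= 100], dtype=float)

def distance(x, y):
    """Euclidean distance between two allocations (at most about 141)."""
    return float(np.linalg.norm(np.asarray(x, float) - np.asarray(y, float)))

def find_maxima(payoff, neighbourhood=2.0):
    """Return the global maximum and every allocation that is a local maximum, i.e.,
    pays at least as much as all allocations within `neighbourhood` distance units."""
    values = np.array([payoff(x) for x in grid])
    local = [i for i, x in enumerate(grid)
             if values[i] >= values[np.linalg.norm(grid - x, axis=1) <= neighbourhood].max()]
    best = int(np.argmax(values))
    return (grid[best], float(values[best])), [(grid[i], float(values[i])) for i in local]

# Example, assuming some implementation of the Appendix's payoff functions:
# (global_alloc, global_pay), local_maxima = find_maxima(total_payoff)
```

With the maxima reported above, distance((53, 28, 19), (0, 88, 12)) evaluates to roughly 80, the separation between the local and global maximum, and the largest possible separation, for example distance((100, 0, 0), (0, 100, 0)), is about 141.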

The participants received the following instructions: They were to make repeated allocation decisions in two phases of 100 trials. On each trial, they would receive a loan of $100 that had to be allocated among three “financial assets” from which they could earn profit. The loan had to be repaid after each round, so that the profit from the investment decisions equals the participant’s gains. The three assets were described as follows: Investments in asset A “pay a guaranteed return equal to 10% of your investment,” whereas the returns from asset B and asset C depended on how much of the loan was invested in the asset. Participants were informed that there existed an allocation between asset A, asset B, and asset C that would maximize the total payoffs and that the return rates for the three assets were fixed for the whole experiment. It was explained that they would receive 0.25% of their total gains as payment for their participation. After the first phase of 100 trials, participants took a small break. Thereafter they received the information that the payoff functions for asset B and asset C were changed, but that everything else was identical to the first block. In fact, the payoff functions for asset B and asset C were interchanged for the second phase. To control any order effects of which payoff function was assigned to asset B and which to asset C, for one half of the participants the payoff function of asset B in the first phase was assigned to asset C and the payoff function for asset C was assigned to asset B. For the second phase the reverse order was used.


Results First, a potential learning effect is analyzed before the two learning models are compared and more specific characteristics of the learning process are considered. Learning effects. The average investment in asset B increased from 26% (SD=15%) in the first trial to an average investment of 36% (SD=21%) in the 100th trial, whereas the average investment in asset C decreased from an average of 34% (SD=21%) in the first trial to an average of 19% (SD=8%) in the 100th trial. This difference represents a substantial change in allocations corresponding to an average Euclidian distance of 28 (SD=29; t(19)=4.2, p=.001, d=0.94). Furthermore, this change leads to an improvement in payoffs, which will be discussed in more detail. Figure 2 shows the learning curve for the first phase of the experiment. The percentages invested in asset B and asset C are plotted as a function of training (with a moving average of 9 trials). To investigate the potential learning effect, the 100 trials of each phase were aggregated into blocks of 10 trials (trial blocks). A repeated measurement analysis of variance (ANOVA) was conducted, with the average obtained payoff as the dependent variable, the trial blocks and the two phases of 100 trials as two within-subject factors, and the order in which the payoff functions were assigned to the assets as a between-subjects factor. A strong learning effect could be documented, as the average obtained payoff of $28 in the first block (SD=2.2) increased substantially across the 100 trials to an average payoff of $32 (SD=1.6) in the last block (F(9,10)=5.2, p=.008, η2=0.82). Additionally there was a learning effect between the two phases as participants on average did better in the second phase (M=$30 for the first phase, SD=2.9 vs. M=$31 for the second phase, SD=5.7; F(1,18)=12.7, p=.002, η2=0.41). However, this effect was moderated by an interaction between trial blocks and the two phases (F(9,10)=3.3, p=.038,


η2=0.28). This interaction can be attributed to a more rapid learning process for the second phase compared to the first phase: The average obtained payoff was higher in the second phase from the second to the fifth trial block, whereas for the first trial block and the last five trial blocks, the payoffs did not differ. The order in which the payoff functions were assigned to asset B and asset C had no effect on the average payoffs (therefore, for simplicity, in the following and for the presented figures, the investments in asset B and C are interchanged for one half of the participants). No other interactions were observed.

Model comparison. How well do the two learning models fit the observed learning data? We wished to compare the models under conditions where participants have no prior knowledge, and so we only used the data from the first phase to test the models. Each model was fit separately to each individual's learning data as follows. First, a set of parameter values was selected for a model separately for each individual. Using the model and parameters, we generated a prediction for each new trial, conditioned on the past allocations and received payoffs of the participant before that trial. The model's prediction for each trial is a probability distribution across all 5,151 possible allocation alternatives; the observed response is coded as a vector in which the allocation alternative selected by the participant receives a value of 1 and all other allocation alternatives receive values of 0. The accuracy of the prediction for each trial was evaluated using the sum of squared error. That is, we computed the squared error between the observed (0 or 1) response and the predicted probability for each of the 5,151 allocation alternatives and summed these squared errors across all the alternatives to obtain the sum of squared error for each trial (this score ranges between 0 and 2). To assess the overall fit for a given


individual, model, and set of parameters, we determined the average of the sum of squared error (denoted SSE) across all 100 trials.4 To compare the fits of the two learning models for Study 1, we searched for the parameter values that minimized the SSE for each model and individual. To optimize the parameters for each participant and model, reasonable parameter values were first selected by a grid-search technique, and thereafter the best-fitting grid values were used as a starting point for a subsequent optimization using the Nelder–Mead simplex method (Nelder & Mead, 1965). For the optimization process the parameter values for the GLOS model were restricted to initial strength values w between 0 and 10, standard deviations σR of the generalization function between 1 and 141, and forgetting rates φ between 0 and 1. The parameter values for the LOCAD model were restricted to initial step sizes between 1 and 141, a standard deviation σS of the distribution function fS between 1 and 141, and a standard deviation σA of the distribution function fA between 0° and 360°. The above procedure was applied to each of the 20 participants to obtain 20 sets of optimal parameter estimates. For the GLOS model, this produced the following means and standard deviations for the three parameters: initial strength mean of w=3.4 (SD=4.4), forgetting rate mean of φ=0.24 (SD=0.18), and a standard deviation mean of σR =1.8 (SD=1.5) of the generalization function. The mean and standard deviation of the SSE for the GLOS model was 0.94 (SD=0.10). For the LOCAD learning model, this estimation procedure produced the following means and standard deviations: initial step size mean of s1=23 (SD=30), a standard deviation mean for the distribution function fS of σS=22 (SD=40), and a standard deviation mean for the distribution function fA of σA=119o (SD=72). The mean and standard deviation of the SSE for the LOCAD


model was 0.91 (SD=0.18). In sum, according to the SSE, the LOCAD model predicted participants' allocations in Study 1 slightly better than the GLOS model (Z=1.5, p=.135; Wilcoxon signed rank test). Figure 2 shows the average allocation of the participants across the first 100 trials. Additionally, it shows the average allocation predicted by both learning models when fitted to each participant. Both models adequately describe the last two thirds of the learning process. However, for the first third, GLOS predicts an excessively large proportion invested in asset C, whereas LOCAD overestimates the proportion invested in asset B and underestimates the proportion invested in asset C.5

[ Figure 2 ]
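A schematic version of the fitting procedure described above, under the stated scoring rule, might look as follows. This is a hedged sketch rather than the authors' implementation: the `model` and `data` interfaces are invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def trial_sse(p_pred, chosen):
    """Squared error between the predicted choice probabilities and the observed 0/1
    response, summed over all 5,151 allocation alternatives (ranges from 0 to 2)."""
    observed = np.zeros_like(p_pred)
    observed[chosen] = 1.0
    return float(np.sum((observed - p_pred) ** 2))

def mean_sse(params, model, data):
    """Average SSE over a participant's 100 trials; `model(params, history)` is assumed
    to return the predicted distribution for the next trial given the trials so far."""
    scores = [trial_sse(model(params, data.history_before(t)), choice)
              for t, choice in enumerate(data.choices)]
    return float(np.mean(scores))

def fit_participant(model, data, grid_points):
    """Two-stage search: coarse grid search, then Nelder-Mead simplex optimization
    started from the best grid point (Nelder & Mead, 1965)."""
    start = min(grid_points, key=lambda p: mean_sse(np.asarray(p, float), model, data))
    result = minimize(mean_sse, x0=np.asarray(start, float),
                      args=(model, data), method="Nelder-Mead")
    return result.x, result.fun
```

The parameter restrictions listed above are not enforced by the simplex search itself and would have to be imposed separately, for example by clipping or by reparameterization.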

Individual characteristics of the learning process. Besides analyzing the allocations of the participants, one can ask whether the learning models are also capable of predicting individual characteristics of the learning process. One characteristic is whether a participant eventually found the global maximum, or ended up stuck close to the local maximum, or was not close to either maximum. Figure 3a shows the percentage of participants who were close (±5%) to the allocations that produced the global or local maximum across the 100 trials. In the first trial, no participant made an allocation corresponding to the local or global maximum. At the end of training, only 10% of participants were able to find the optimal allocation producing the maximum payoff, whereas 50% of the participants ended up choosing allocations close to the local maximum. Figure 3a also shows the predictions


of the models. Both models accurately describe the proportion of participants who make allocations according to the local or global maximum.

[ Figure 3 ]
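For concreteness, the classification used in Figure 3a could be computed along the following lines. The sketch assumes that the ±5% tolerance applies to each asset's share separately, and the shares of asset A are inferred from the reported percentages for assets B and C; both of these readings are assumptions rather than details given in the text.

```python
import numpy as np

# Allocations (A, B, C) that produced the local and global maximum in Study 1.
LOCAL_MAX = np.array([53.0, 28.0, 19.0])    # 28% in B, 19% in C, remainder in A
GLOBAL_MAX = np.array([0.0, 88.0, 12.0])    # 88% in B, 12% in C

def classify(allocation, tol=5.0):
    """Label an allocation as 'global', 'local', or 'neither', counting it as a hit when
    every asset's share lies within `tol` percentage points of the respective maximum."""
    a = np.asarray(allocation, dtype=float)
    if np.all(np.abs(a - GLOBAL_MAX) <= tol):
        return "global"
    if np.all(np.abs(a - LOCAL_MAX) <= tol):
        return "local"
    return "neither"

# Proportion of participants near either maximum on one trial:
# labels = [classify(x) for x in allocations_on_trial]
# share_near_local = labels.count("local") / len(labels)
```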

As noted earlier, participants were able to increase their payoffs over the 100 trials through learning (see Figure 3b). Both learning models also accurately describe this increase in payoffs. As a third criterion for comparing the two learning models, the effect of training on the magnitude with which individuals changed their decisions was considered. To describe these changes during learning, the Euclidean distances between successive trials were determined. Figure 4a shows that in the beginning of the learning process, succeeding allocations differed substantially, with an average Euclidean distance of 30 units, whereas at the end of the task, small changes in allocations were observed (mean distance of 9 units). LOCAD predicts the magnitude with which participants change their allocations more accurately than GLOS, which on average predicts too small a magnitude of change between successive trials.

[ Figure 4 ]

A fourth characteristic to examine is the direction of change in allocations that individuals made following different types of outcome feedback. The LOCAD model predicts that the outcome of a decision is compared to the outcome of the most successful decision to that point and, if the present decision leads to a greater payoff, the direction of the succeeding decision is likely to be in the same direction as the present decision. If a decision leads to a smaller payoff,


the succeeding decision is likely to be in the opposite direction. In contrast, the GLOS model predicts that a decision is based on the aggregated success and failure of all past decisions, so that no strong correlation between the success of a present decision and the direction of the succeeding decision is expected. To test this prediction, the angles between the direction of an allocation and the direction of the preceding allocation were determined for all allocations. Figure 4b shows the proportion of preceding allocations that were successful (i.e. led to a greater payoff than the allocation before), categorized with respect to the angle between the direction of an allocation and the direction of the preceding allocation. Consistent with the LOCAD model, we did observe an association between the participants' allocation directions and their success: For 70% of all allocations made in the same direction as the preceding allocation, the preceding allocation was successful, compared to only 35% of all allocations made in an opposite direction. This association was predicted, although to different extents, by both models. As expected, LOCAD predicted that a preceding allocation was likely to be successful (in 67% of all cases) when the direction of an allocation was the same as the direction of the preceding allocation (angles between 0° and 30°), whereas the preceding allocation was unlikely to be successful (in only 41% of all cases) when the direction of an allocation was opposite to the preceding direction. Surprisingly, this association was also predicted by GLOS: For 73% of all allocations made in a similar direction to the preceding allocation, the preceding allocation was successful, compared to 39% of all allocations made in an opposite direction. However, the proportions of successful preceding allocations for the different angles were more strongly correlated with LOCAD's predictions (r=.95) than with GLOS's predictions (r=.84).
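The directional analysis just reported, and the f_A component of the LOCAD model it speaks to, can be made concrete with a short sketch. The function names, the σ_A default (taken from the Study 1 estimates), and the binning are illustrative assumptions, not the authors' analysis code.

```python
import numpy as np

def direction_angle(prev_start, prev_end, new_end):
    """Angle in degrees (0-180) between the preceding allocation change
    (prev_start -> prev_end) and the new change (prev_end -> new_end), computed as the
    arccosine of the dot product of the two unit-length direction vectors."""
    d_prev = np.asarray(prev_end, float) - np.asarray(prev_start, float)
    d_new = np.asarray(new_end, float) - np.asarray(prev_end, float)
    cos = np.dot(d_prev, d_new) / (np.linalg.norm(d_prev) * np.linalg.norm(d_new))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def f_a(angle_deg, improved, sigma_a=119.0):
    """Direction weight of the LOCAD model: a_t is 0 degrees when the preceding payoff
    reached the reference payoff and 180 degrees otherwise."""
    a_t = 0.0 if improved else 180.0
    return float(np.exp(-(angle_deg - a_t) ** 2 / (2 * sigma_a ** 2)))

# For the empirical tabulation, one would record for every trial t >= 3 the angle between
# successive changes and whether the preceding allocation paid more than the one before it,
# and then compare the success rates in the 0-30 and 150-180 degree bins.
```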


Summary of Study 1 Study 1 shows that people are able to improve their decisions in an allocation situation substantially when provided with feedback. However, only a few participants were able to find the allocation that produces the maximum possible payoff. This result can be explained by the LOCAD learning model, which described the empirical results slightly better than the GLOS learning model, based on the goodness-of-fit criterion. If people start with a particular allocation and try to improve their situation by slightly adapting their decisions, as predicted by LOCAD, depending on their starting position, they will often not find the global payoff maximum. However, as both models were fitted to each individual separately it is difficult to decide which model is more appropriate, as the two models make similar predictions. When focusing on several individual learning characteristics only one out of four characteristics supports the LOCAD model: the magnitude with which allocations are changed in successive trials. The other three process characteristics are appropriately described by both learning models. This result is not very surprising if one considers that both models were fitted for each individual and only predict each new trial based on the information of previous trials. In contrast, in Study 2 both models make a priori predictions for independent data, enabling a rigorous comparison of the two models. Study 2 In light of the results found in Study 1 that people, even when provided with substantial learning opportunity, often end up with suboptimal outcomes, one might object that the function of the total payoff used in Study 1 only produced a relatively small payoff difference between the two maxima, providing small incentives for participants to search for the global maximum. Additionally, if one takes the opportunity costs of search into account, it might be reasonable to


stay at the local maximum. One could criticize that the small difference between the payoffs does not satisfy the criterion of payoff dominance (Smith, 1982), that is, the additional payoff does not dominate any (subjective) costs of finding the optimal outcome, so that participants are not sufficiently motivated to find the global payoff maximum. Study 2 addresses this critique by increasing the payoff difference between the local and global payoff maximum, but keeping the shape of the total payoff function similar to that in Study 1. Increasing the payoff difference between the local and global payoff maximum has direct implications for the predictions of the GLOS learning model: If the reinforcement for the global payoff maximum increases relative to the local payoff maximum, then the probability of selecting the allocation alternative corresponding to the global maximum should increase according to the GLOS model. Therefore, one would expect the GLOS model to predict that more people will find the global maximum. In contrast, a larger payoff difference between the local and global payoff maximum does not affect the prediction of the LOCAD model. Study 2 also provides an opportunity to test the two learning models on new independent data, by simulating 50,000 agents using the model parameter values randomly selected from normal distributions with the means and standard deviations of the parameter values derived from the individual fitting process of Study 1. Given that the models’ parameter values are not fitted by the data of Study 2, the models’ predictions provide a stronger empirical generalization test of the models, which has been often asked for but seldom done (Busemeyer & Wang, 2000). Method Participants. Twenty persons (13 women and 7 men) with an average age of 21 years participated in the experiment. The duration of the computerized task was approximately 1 hr. Most participants


(90%) were students in various departments of Indiana University. For their participation they received a show-up payment of $2. Additional payment was contingent on the participants' performance; the average payment was $20.

Procedure. The allocation problem was identical to the one used in Study 1, with the only difference being the modified payoff functions. The payoff functions differed by an increase in the payoff difference between the local and global payoff maximum (see Figure 5). Again, high investments in asset C lead to low payoffs, in the worst case a payoff of -34.6, whereas small investments in asset C result in higher payoffs. The local maximum with a payoff of 32.5 is obtained when investing 29% in asset B and 21% in asset C (compared to 28% and 19% with a payoff of 32.8 in Study 1), whereas the global maximum with a payoff of 41.2 is reached when investing 88% in asset B and 12% in asset C (the same allocation led to the global payoff maximum of 34.5 in Study 1). From random choice an average payoff of 17.44 can be expected. The payoff functions yield a difference of 8.7 and a Euclidean distance of 79 between the allocations corresponding to the local and global payoff maximum. The instructions for the task in Study 2 were identical to those used in Study 1.

[ Figure 5 ]

Results As for Study 1, first a potential learning effect is analyzed before the two learning models are compared and more specific characteristics of the learning process are considered.


Learning Effects. In the first trial, the average allocation consisted of an investment of 26% in asset B (SD=12%), which increased to an average investment of 48% in asset B (SD=27%) in the 100th trial. The investment in asset C decreased from 27% (SD=13%) in the first trial to 22% (SD=12%) in the 100th trial. As in Study 1, participants in Study 2 had the tendency in the first trial to invest slightly more in asset A, which guaranteed a fixed return. The allocation in the first trial substantially differed from that in the 100th trial with a Euclidean distance of 40 (t(19)=6.9, p=.001, d=1.55). To investigate any learning effect, the 100 trials of both phases were aggregated into blocks of 10 trials (trial blocks). A repeated measurement ANOVA was conducted, with the obtained payoff as the dependent variable, the trial blocks and the two phases of 100 trials as two within-subject factors, and the order in which the payoff functions were assigned to the assets as a between-subjects factor. A strong learning effect was documented, as the average obtained payoff of $25 in the first block (SD=3.6) increased substantially across the 100 trials to an average payoff of $34 (SD=4.7) in the last block (F(9,10)=4.1, p=.019, η2=0.79). Additionally, there was a learning effect between the two phases as participants on average did better in the second phase (M=30, SD=3.7 vs. M=33, SD=4.5; F(1,18)=8.6, p=.009, η2=0.32). In contrast to Study 1, the interaction between trial blocks and the two phases was not significant (F(9,10)=2.2, p=.112, η2=0.67). The order in which the payoff functions were assigned to asset B and asset C had no effect on the average payoffs (therefore, for simplicity, in the following, the investments in asset B and C are interchanged for half of the participants). No other interactions were observed.


Model comparison. How well do the two learning models predict participants' allocations across the first 100 trials? For Study 2 no parameter values were estimated. Instead, our testing approach consists of simulating a large number of agents with the models' parameter values randomly selected from normal distributions with the means and standard deviations of the parameter values derived from the fitting process of Study 1. The models' fits are then assessed by calculating the mean squared error (MSE) between the average observed and average predicted allocations (the deviation between two allocations is defined by the Euclidean distance). Figure 6 shows the development of the average allocation of the participants across all 100 trials. Additionally, it shows the average allocation predicted by both learning models. The LOCAD learning model describes the development of the allocations across the 100 trials better, reaching an MSE of 39. In contrast, the GLOS learning model describes the learning process less appropriately, with an MSE between the predicted and observed average allocations of 117. GLOS underestimates the magnitude of the learning effect for the allocation task.

[ Figure 6 ]
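The generalization test itself reduces to a small amount of simulation code. The sketch below is an assumed outline, not the authors' program: `simulate` stands for whichever model (GLOS or LOCAD) is being tested, the parameter means and standard deviations shown are those reported for the LOCAD model in Study 1, and in practice the sampled parameters would be truncated to their admissible ranges.

```python
import numpy as np

rng = np.random.default_rng(2003)

# LOCAD parameter estimates from Study 1: s1, sigma_S, sigma_A (means and SDs).
MEANS = np.array([23.0, 22.0, 119.0])
SDS = np.array([30.0, 40.0, 72.0])

def average_predicted_allocation(simulate, payoff, n_agents=50_000, n_trials=100):
    """Draw each agent's parameters from normal distributions fixed by the Study 1
    estimates, simulate its learning process, and average the allocations per trial.
    `simulate(params, payoff, n_trials, rng)` is assumed to return an (n_trials, 3) array."""
    totals = np.zeros((n_trials, 3))
    for _ in range(n_agents):
        params = rng.normal(MEANS, SDS)
        totals += simulate(params, payoff, n_trials, rng)
    return totals / n_agents

def mse(predicted, observed):
    """Mean squared deviation between average predicted and observed allocations,
    measuring each trial's deviation by the Euclidean distance."""
    d = np.linalg.norm(np.asarray(predicted) - np.asarray(observed), axis=1)
    return float(np.mean(d ** 2))
```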

Characteristics of the learning process. Does the LOCAD learning model predict individual characteristics of the learning process more suitably than the GLOS model? Figure 7a shows, again for Study 2, the proportion of allocations across all trials that correspond to the allocations that led to the local or global payoff maximum (with a tolerated deviation of ±5%). Similar to Study 1, the proportion of participants who made allocations according to the local or global maximum increased


substantially through learning across the 100 trials. However, again only a small number of participants (20%) finally “found” the allocation corresponding to the global payoff maximum, whereas a larger proportion (40%) “got stuck” at the allocation corresponding to the local payoff maximum. This result was again predicted by the LOCAD learning model. Although both models underestimate the proportion of allocations according to the local or global payoff maximum, the predicted proportions by LOCAD were closer to the observed data.

[ Figure 7 ]

Through learning, participants were able to increase their payoff over the 100 trials (see Figure 7b). Both models underestimated the payoff increase, but LOCAD’s prediction is closer to the observed payoff increase than GLOS’s prediction. The effect of training on the magnitude with which the participants changed their decisions was similar to Study 1 (see Figure 8a), starting with an average magnitude of a Euclidian distance of 29 for the first 10 trials and ending with an average magnitude of 5 for the last 10 trials. Although both models underestimated the decline in the magnitude with which decisions were adapted, the predictions of LOCAD come closer to the observed development.

[ Figure 8 ]

Similar to Study 1, an association between allocations’ directions and their success was observed for the participants’ decisions: For 74% of all allocations in the same direction as the preceding allocation (angles between 0 and 30o), the preceding allocation was successful,


compared to only 35% of all allocations made in an opposite direction (angles between 150 and 180o; see Figure 8b). An even stronger association was predicted by the LOCAD model: For 92% of all allocations made in the same direction as the preceding allocation, the preceding allocation was successful, compared to 20% of all allocations made in an opposite direction. In contrast, the GLOS model predicted a weak association: For 61% of all allocations made in the same direction as the preceding allocation, the preceding allocation was successful, compared to 46% of all allocations made in an opposite direction. The proportions of successful preceding allocations for the different angles were strongly correlated with both models’ predictions (r=.93 for LOCAD and r=.92 for GLOS). Summary of Study 2 Study 2 illustrates the robustness of the findings from Study 1. Although the payoff difference between the local and global payoff maximum was substantially increased, only a small proportion of participants were able to find the global maximum, whereas many participants got stuck at the local maximum. Such a result is consistent with the main learning mechanism of the LOCAD learning model, which better predicted the observed learning process for the allocation problem compared to the GLOS learning model. Of course one aspect of the payoff function that influences the difficulty with which the local or global payoff maxima can be detected is their localizations in the “search space” of possible allocations. The allocation corresponding to the local payoff maximum was located near the “center” of the search space, that is, near an allocation with an equal share invested in all three assets. In contrast, the allocation producing the global payoff maximum was located at the “border” of the search space, that is, an allocation with disproportional investments in the

Of course, one aspect of the payoff function that influences how easily the local or global payoff maximum can be detected is the location of these maxima in the “search space” of possible allocations. The allocation corresponding to the local payoff maximum was located near the “center” of the search space, that is, near an allocation with an equal share invested in all three assets. In contrast, the allocation producing the global payoff maximum was located at the “border” of the search space, that is, an allocation with disproportionate investments in the different assets. If people tend to start with evenly distributed investments in all three assets, and if they follow a learning process as predicted by the LOCAD model, they should frequently get stuck at the local payoff maximum. In contrast, one could imagine a payoff function for which the positions of the allocations corresponding to the local and global payoff maxima were interchanged. For such a payoff function the majority of participants would presumably find the global payoff maximum. However, such a function would not allow discrimination between the predictions of the two learning models and was therefore not employed. In sum, the finding that many participants got stuck at the local payoff maximum in both of our studies follows from the payoff function used and can be predicted with the proposed LOCAD learning model.

The generalization test of the learning models in Study 2 is more substantial than that in Study 1, because no parameter values were fitted to the data; instead, the models made a priori predictions of behavior in a different decision problem.

Discussion

Recently, several learning theories for decision-making problems have been proposed (e.g., Börgers & Sarin, 1997; Busemeyer & Myung, 1992; Camerer & Ho, 1999a, 1999b; Erev & Roth, 1998; Selten & Stöcker, 1986; Stahl, 1996). Most of them build on the basic idea that people do not solve a problem from scratch, but adapt their behavior based on experience. The theories differ according to the learning mechanism that they apply, that is, their assumptions about the underlying cognitive processes.

The reinforcement-learning model proposed by Erev and Roth (1998) and the experience-weighted attraction learning model proposed by Camerer and Ho (1999a, 1999b) belong, broadly speaking, to the class of global search models. These models assume that all possible decision alternatives can be assigned an overall evaluation. Whereas the evaluation in the reinforcement-learning model depends only on the experienced consequences of past decisions, the experience-weighted attraction model can additionally take hypothetical consequences and foregone payoffs into account. Both models assume that people integrate their experience into an overall evaluation and that alternatives evaluated more positively are more likely to be selected.

The other approach, local adaptation models, does not assume that people necessarily acquire a global representation of the consequences of the available decision alternatives through learning. Instead, the hill-climbing model by Busemeyer and Myung (1987) and the learning direction theory of Selten and Stöcker (1986) assume that decisions are adapted locally, so that a preceding decision might be slightly modified according to its success or failure.

Busemeyer and Myung (1992) suggested that models in the global search class may be applicable to situations in which the decision alternatives form a small set of qualitatively different strategies, whereas models in the local adaptation class may be applicable in situations in which the decision alternatives form a continuous metric space of strategies. Global search models have been successfully applied to constant-sum games, in which there are only a small number of options. The purpose of this research was to examine learning processes in a resource allocation task, which provides a continuous metric space of strategies. A new version of the global search model, called the GLOS model, and a new version of the local adaptation model, called the LOCAD model, were developed for this task.
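To make the contrast between the two classes concrete, the following sketch shows a minimal reinforcement-style global search over a coarse grid of three-asset allocations. It is our own illustration, not the GLOS model itself; the grid resolution, the softmax choice rule, the temperature, and the generic payoff argument are assumptions made purely for illustration.

```python
import math
import random

def softmax_choice(values, temperature=5.0):
    """Choose an index with probability proportional to exp(value / temperature)."""
    peak = max(values)
    weights = [math.exp((v - peak) / temperature) for v in values]  # stabilized softmax
    r = random.random() * sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r <= cumulative:
            return i
    return len(weights) - 1

def global_search(payoff, n_trials=100, step=10):
    """Reinforcement-style learning: keep a propensity for every feasible allocation
    (a coarse grid of percentage splits over three assets) and choose probabilistically
    according to these propensities, which accumulate the obtained payoffs."""
    grid = [(a, b, 100 - a - b)
            for a in range(0, 101, step)
            for b in range(0, 101 - a, step)]
    propensities = [1.0] * len(grid)
    for _ in range(n_trials):
        i = softmax_choice(propensities)
        propensities[i] += max(payoff(*grid[i]), 0.0)  # reinforce the chosen allocation
    best = max(range(len(grid)), key=lambda i: propensities[i])
    return grid[best]
```

Because the propensity vector covers every alternative, such a learner maintains a representation of the entire history of decisions, which is the defining feature of the global search class.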

These two models were the best representations of the two classes that we could construct for the resource allocation task. The models were compared in two studies. In the first study, the model parameters were estimated separately for each participant, and the model fits were compared to the individual data. In the second study, we used the estimated parameters from the first study to generate a priori predictions for a new payoff condition, and the predictions of the models were compared to the mean learning curves.

In both studies the resource allocation task consisted of repeatedly allocating a capital resource to different financial assets. The task was difficult because the rates of return were unknown for two assets, the rates of return depended in a nonlinear manner on the amount invested in the assets, and the number of allocation alternatives was quite large. However, because any investment led to a deterministic return, it was always obvious which of two allocations performed better once the payoffs for these allocation alternatives had been presented. Therefore, the essence of the task that the participants faced in both studies was a search for a good allocation alternative.

Given that the participants were provided with a large number of trials, finding the best possible allocation alternative was possible. However, it turned out that the majority of participants did not find the best possible allocation corresponding to the global payoff maximum, but instead got stuck at the local payoff maximum. Nevertheless, a substantial learning process was observed: At the beginning of the task there was a tendency to allocate an equal proportion of the resource to all three assets, with a slightly larger proportion invested in the asset that guaranteed a fixed return. These allocations led to relatively low average payoffs, which then increased substantially over the 100 trials through learning. This learning process was characterized by substantial changes of allocations at the beginning of the task, which declined substantially over time. The direction in which the allocations were changed depended strongly on the success of previous changes, characterizing a directional learning process.

These central findings correspond to the learning principles of the local adaptation model. It is therefore not surprising that the local adaptation model reached a better fit than the global search model in predicting individuals’ allocations in both studies. In Study 1, when fitting both models to each individual separately, LOCAD reached a slightly better fit in describing the learning process. In Study 2 the average allocations predicted a priori by the LOCAD model (see Figure 6) described the observed average allocations across the 100 trials well, corresponding to a smaller MSE for LOCAD than for GLOS. Given that the payoff function in Study 2 differed substantially from the payoff function in Study 1, these results provide strong empirical support for LOCAD.

The appropriateness of LOCAD for describing the learning process is also supported by the individual characteristics of the process. In Study 1 the LOCAD model, compared to GLOS, more accurately predicted the magnitude with which successive allocations were changed. In contrast, the other three individual characteristics of the learning process were described equally well by the two models in Study 1. This picture changes substantially in Study 2: here the LOCAD model also described more accurately the development of payoffs and the development of the number of allocations corresponding to the local and global payoff maximum. Unexpectedly, in both studies, the association between the direction of allocations and the success of previous allocations was described appropriately by the LOCAD model as well as by the GLOS model.

Why does the LOCAD model describe the learning process in the resource allocation task better than the GLOS model? Although the predictions of the two models can be similar in specific respects, their learning principles are quite different, and the learning principles of LOCAD seem to correspond more accurately to individuals’ behavior in this task. According to LOCAD, starting with a specific allocation, new allocations are made in the same direction as the preceding successful allocation. Although this learning principle is very effective at improving allocations, it can lead to missing the global maximum, because decisions have to be changed substantially to find it. Yet this is exactly what was found in both studies. In contrast, the GLOS model eventually found the global payoff maximum, especially when experience gained at the beginning of the learning process was not weighted too strongly. In this case the GLOS model selected many different kinds of allocations and at some point also selected allocations corresponding to the global payoff maximum, for which it then developed a preference. However, given that most participants did not find the global payoff maximum, fitting the GLOS model to the data selected parameter values under which the model does not converge to the global payoff maximum. With these parameter values, however, the model also does not converge reliably to any particular allocation, so it still fails to predict the convergence to the local payoff maximum that was observed for most participants.
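In the same spirit, a toy directional hill climber (again our own illustration, not the LOCAD model; the starting allocation, step-size schedule, and random directions are assumptions) shows how the local adaptation principle described above can improve payoffs quickly and yet settle on a local maximum.

```python
import random

def random_direction():
    """A random change of the three shares that sums to zero."""
    d = [random.uniform(-1.0, 1.0) for _ in range(3)]
    mean = sum(d) / 3.0
    return [x - mean for x in d]

def propose(current, direction, step):
    """Apply the direction with the given step size and renormalize to 100%."""
    new = [max(0.0, c + step * d) for c, d in zip(current, direction)]
    total = sum(new) or 1.0
    return [100.0 * x / total for x in new]

def local_adaptation(payoff, n_trials=100, start=(34.0, 33.0, 33.0), step=20.0):
    """Directional hill climbing over percentage allocations: keep moving in the
    last direction if it improved on the best payoff so far, otherwise try a new
    random direction, and let the step size shrink over trials."""
    current = list(start)
    best_payoff = payoff(*current)
    direction = random_direction()
    for _ in range(n_trials):
        candidate = propose(current, direction, step)
        value = payoff(*candidate)
        if value > best_payoff:            # success: keep the change and the direction
            current, best_payoff = candidate, value
        else:                              # failure: pick a new direction
            direction = random_direction()
        step *= 0.97                       # changes become smaller over time
    return current, best_payoff
```

Because the climber only ever compares a candidate with the best allocation found so far, large jumps away from a locally good region become increasingly unlikely as the step size shrinks, which mirrors the trapping behavior discussed above.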

To what extent can the results of the present studies be generalized to different learning models? The two models that we implemented are the best examples of the two approaches of learning models that we found. Both were supported by past research, and both were directly compared in previous theoretical analyses (see, e.g., Erev, 1998). More importantly, we also compared many variations of each model, although due to space limitations we present only the results for the best version of each. Nevertheless, our conclusions are supported in the sense that no variation of the GLOS model outperformed the LOCAD model presented here; instead, all of these variations performed worse than the GLOS model presented here. Furthermore, given the flexibility of the implemented models, that is, their free parameters, we doubt that slight modifications of the presented models would lead to substantially different results that would challenge our claim that the LOCAD learning model better predicts the learning process for the resource allocation problem.

To what extent did Study 2 provide a fair test of the two models? We argue that the test was more than fair. First, both types of learning models have been applied in previous theoretical analyses to resource allocation tasks similar to the one used in Study 1 (Erev & Gopher, 1999), so there is no reason to claim that Study 1 does not provide a suitable test ground. In the second study, we simply increased the payoff difference between the local and global maxima, which should have encouraged more participants to find the global maximum. This manipulation actually favors the GLOS model, because the a priori tendency of the LOCAD model is to be attracted to the local maximum. Thus the second study provided the best possible a priori chance for the GLOS model to outperform the LOCAD model in the generalization test.

To what extent can the results of the present studies be generalized to other decision problems? It should be emphasized that the current conclusions are restricted to the type of decision problem we considered. We expect that in similar decision problems that provide a large number of strategies forming a natural order, the LOCAD model will better describe the learning process. In such situations people can form a hypothesis about the underlying causal structure of the decision process, which enables a directed learning process. For instance, when deciding how much to invest in a repeated public-good game, a local adaptation learning process might occur. However, there are many situations for which global search learning models describe learning processes better. For instance, there is substantial empirical evidence that global search models appropriately describe the learning process for constant-sum games with a small number of actions (Erev & Roth, 1998). In a constant-sum game, there is no possibility for the players to increase their mutual payoff through “cooperation.” Instead, the game-theoretic prediction asserts that the different decision strategies (options) should be selected with particular probabilities. In such a situation, there are only a small number of categorically different alternatives, making it difficult to apply a local adaptation model, because the set of alternatives provides no natural order in which to define directions for changes in strategies.

The present article provides a rigorous test of two learning models representing two approaches in the recent learning literature. It also illustrates that, as argued, for instance, by Herbert Simon (1990) and Reinhard Selten (1991), learning often does not lead to optimal outcomes. Yet people improve their decisions substantially through learning: For example, even when individuals start with the suboptimal decision of allocating an equal share to the different assets, they quickly change their decisions and make allocations that produce higher payoffs. This learning process can be described by the local adaptation learning model, which is characterized by high efficiency but can lead to suboptimal outcomes. For other domains, other learning mechanisms might govern behavior, and each learning model might have its own domain in which it works well. Identifying these domains is a promising enterprise.


References

Ball, C. T., Langholtz, H. J., Auble, J., & Sopchak, B. (1998). Resource-allocation strategies: A verbal protocol analysis. Organizational Behavior & Human Decision Processes, 76, 70-88.
Benartzi, S., & Thaler, R. H. (2001). Naive diversification strategies in defined contribution saving plans. American Economic Review, 91, 79-98.
Börgers, T., & Sarin, R. (1997). Learning through reinforcement and replicator dynamics. Journal of Economic Theory, 77, 1-14.
Brennan, M. J., Schwartz, E. S., & Lagnado, R. (1997). Strategic asset allocation. Journal of Economic Dynamics & Control, 21, 1377-1403.
Brown, G. W. (1951). Iterative solutions of games by fictitious play. In Activity analysis of production and allocation (pp. 374-376). New York: Wiley.
Busemeyer, J. R., & Myung, I. J. (1987). Resource allocation decision-making in an uncertain environment. Acta Psychologica, 66, 1-19.
Busemeyer, J. R., & Myung, I. J. (1992). An adaptive approach to human decision-making: Learning theory, decision theory, and human performance. Journal of Experimental Psychology: General, 121, 177-194.
Busemeyer, J. R., Swenson, K., & Lazarte, A. (1986). An adaptive approach to resource allocation. Organizational Behavior & Human Decision Processes, 38, 318-341.
Busemeyer, J. R., & Wang, Y.-M. (2000). Model comparisons and model selections based on generalization criterion methodology. Journal of Mathematical Psychology, 44, 171-189.
Bush, R. R., & Mosteller, F. (1955). Stochastic models for learning. New York: Wiley.


Camerer, C., & Ho, T.-H. (1999a). Experience-weighted attraction learning in games: Estimates from weak-link games. In D. V. Budescu & I. Erev (Eds.), Games and human behavior: Essays in honor of Amnon Rapoport (pp. 31-51). Mahwah, NJ: Erlbaum.
Camerer, C., & Ho, T.-H. (1999b). Experience-weighted attraction learning in normal form games. Econometrica, 67, 827-874.
Cheung, Y.-W., & Friedman, D. (1997). Individual learning in normal form games: Some laboratory results. Games & Economic Behavior, 19, 46-76.
Dorfman, D. D., Saslow, C. F., & Simpson, J. C. (1975). Learning models for a continuum of sensory states reexamined. Journal of Mathematical Psychology, 12, 178-211.
Erev, I. (1998). Signal detection by human observers: A cutoff reinforcement-learning model of categorization decisions under uncertainty. Psychological Review, 105, 280-298.
Erev, I., & Gopher, D. (1999). A cognitive game-theoretic analysis of attention strategies, ability, and incentives. In D. Gopher & A. Koriat (Eds.), Attention and performance XVII: Cognitive regulation of performance: Interaction of theory and application (pp. 343-371). Cambridge, MA: MIT Press.
Erev, I., & Roth, A. E. (1998). Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. American Economic Review, 88, 848-881.
Estes, W. K. (1950). Toward a statistical theory of learning. Psychological Review, 57, 94-107.
Fudenberg, D., & Levine, D. K. (1995). Consistency and cautious fictitious play. Journal of Economic Dynamics & Control, 19, 1065-1089.


Gingrich, G., & Soli, S. D. (1984). Subjective evaluation and allocation of resources in routine decision-making. Organizational Behavior & Human Decision Processes, 33, 187-203.
Harley, C. B. (1981). Learning the evolutionarily stable strategy. Journal of Theoretical Biology, 89, 611-633.
Langholtz, H. J., Ball, C., Sopchak, B., & Auble, J. (1997). Resource-allocation behavior in complex but commonplace tasks. Organizational Behavior & Human Decision Processes, 70, 249-266.
Langholtz, H., Gettys, C., & Foote, B. (1993). Resource-allocation behavior under certainty, risk, and uncertainty. Organizational Behavior & Human Decision Processes, 54, 203-224.
Langholtz, H., Gettys, C., & Foote, B. (1994). Allocating resources over time in benign and harsh environments. Organizational Behavior & Human Decision Processes, 58, 28-50.
Langholtz, H., Gettys, C., & Foote, B. (1995). Are resource fluctuations anticipated in resource allocation tasks? Organizational Behavior & Human Decision Processes, 64, 274-282.
Luce, R. D. (1959). Individual choice behavior. New York: Wiley.
Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7, 308-313.
Northcraft, G. B., & Neale, M. A. (1986). Opportunity costs and the framing of resource allocation decisions. Organizational Behavior & Human Decision Processes, 37, 348-356.
Roth, A. E., & Erev, I. (1995). Learning in extensive-form games: Experimental data and simple dynamic models in the intermediate term. Games & Economic Behavior, 8, 164-212.
Russell, S. J., & Norvig, P. (1995). Artificial intelligence. Englewood Cliffs, NJ: Prentice-Hall.


Selten, R. (1991). Evolution, learning, and economic behavior. Games & Economic Behavior, 3, 3-24.
Selten, R. (1998). Axiomatic characterization of the quadratic scoring rules. Experimental Economics, 1, 43-62.
Selten, R., & Stöcker, R. (1986). End behavior in sequences of finite prisoner’s dilemma supergames: A learning theory approach. Journal of Economic Behavior & Organization, 7, 47-70.
Simon, H. A. (1990). Invariants of human behavior. Annual Review of Psychology, 41, 1-19.
Smith, V. L. (1982). Microeconomic systems as an experimental science. American Economic Review, 72, 923-955.
Stahl, D. O. (1996). Boundedly rational rule learning in a guessing game. Games & Economic Behavior, 16, 303-330.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Thomas, E. A. C. (1973). On a class of additive learning models: Error correcting and probability matching. Journal of Mathematical Psychology, 10, 241-264.


Appendix

In Study 1 the payoff functions were defined as follows. The first asset, A, produced a fixed rate of return of 10%, with the payoff function u_A(p_A) = 0.1 p_A R, where p_A ∈ [0, 1] is the proportion of the resource R invested in asset A. For the other two assets, the rate of return varied with the amount invested in the asset. For asset B, the payoff function was defined as u_B(p_B, p_A) = 10 − 0.1 p_A R + 40 × [sin(3.2π(p_B − 0.781) − 9) / (3.2π(p_B − 0.781) − 9)], with p_B, p_A ∈ [0, 1]. For asset C, the payoff function was defined as u_C(p_C) = 5 + 4R × [sin(1.1π(p_C − 0.781) − 24.6) / (1.1π(p_C − 0.781) − 24.6)], with p_C ∈ [0, 1].

In Study 2 the payoff functions were defined as follows. The payoff function for asset A was identical to the one used in Study 1. For asset B the payoff function was defined as u_B(p_B, p_A) = 6 − 0.2 p_A R + 80 × [sin(3.2π(p_B − 0.781) − 9) / (3.2π(p_B − 0.781) − 9)], with p_B, p_A ∈ [0, 1], and for asset C the payoff function was defined as u_C(p_C) = −4 + 8R × [sin(1.1π(p_C − 0.781) − 24.6) / (1.1π(p_C − 0.781) − 24.6)], with p_C ∈ [0, 1].
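For readers who want to reproduce the payoff landscapes in Figures 1 and 5, the following sketch implements the functions as we read the formulas above. The grouping of terms, the reading of u_A as a 10% return on the invested share, and the default resource value R = 100 are our assumptions; they are not taken verbatim from the original equations.

```python
import math

def damped_sine(x):
    """The sin(x)/x expression that appears in the payoff functions of assets B and C."""
    return math.sin(x) / x

def total_payoff_study1(p_a, p_b, p_c, R=100.0):
    """Total payoff in Study 1 for proportions p_a + p_b + p_c = 1 of the resource R."""
    u_a = 0.1 * p_a * R  # assumed reading: fixed 10% return on the share invested in asset A
    u_b = 10 - 0.1 * p_a * R + 40 * damped_sine(3.2 * math.pi * (p_b - 0.781) - 9)
    u_c = 5 + 4 * R * damped_sine(1.1 * math.pi * (p_c - 0.781) - 24.6)
    return u_a + u_b + u_c

def total_payoff_study2(p_a, p_b, p_c, R=100.0):
    """Total payoff in Study 2: asset A is unchanged; assets B and C use different constants."""
    u_a = 0.1 * p_a * R
    u_b = 6 - 0.2 * p_a * R + 80 * damped_sine(3.2 * math.pi * (p_b - 0.781) - 9)
    u_c = -4 + 8 * R * damped_sine(1.1 * math.pi * (p_c - 0.781) - 24.6)
    return u_a + u_b + u_c
```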


Authors’ Note

We gratefully acknowledge helpful comments by Jim Walker and Hugh Kelley, with whom the authors worked on a similar research project on which the present study is based. Additionally, we would like to thank Ido Erev, Scott Fisher, Wieland Müller, Elinor Ostrom, Reinhard Selten, the members of the bio-complexity project at Indiana University, and two anonymous reviewers for helpful comments. Special thanks are due for the financial support from the Center for the Study of Institutions, Population, and Environmental Change through National Science Foundation grants SBR9521918 and SES0083511. Correspondence concerning this article should be addressed to Jörg Rieskamp.

Address all correspondence to:
Jörg Rieskamp
Max Planck Institute for Human Development
Lentzeallee 94, 14195 Berlin, Germany
Phone: (+49 30) 82406363
Fax: (+49 30) 824 99 39
[email protected]


Footnotes

1. The Euclidean distance between two allocations X_i and X_j over the three assets k is defined as D_ij = √[ Σ_{k=1}^{3} (X_ki − X_kj)² ].
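For example, the largest possible change, say from the allocation (100, 0, 0) to (0, 100, 0), gives D = √(100² + 100² + 0²) ≈ 141, which is the upper bound of the step-size scale mentioned in the captions of Figures 4 and 8.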

2. This is one aspect in which the GLOS model differs from Erev’s (1998) reinforcement model, where the density of the generalization function is set equal to the received payoff. This constraint in Erev (1998) has the disadvantage that the standard deviation of the generalization function, which is supposed to be a free parameter, interacts with the received payoff used as a reinforcement, such that a large reinforcement for a chosen allocation is, for instance, only possible if a small standard deviation of the generalization function is chosen. We are confident that this difference and all other differences between the GLOS model and the reinforcement model by Erev (1998) represent improvements, in particular for the allocation task, which makes GLOS a strong competitor to the LOCAD model.

3. In fact, when we constructed the learning models for Study 1, various modifications of both learning models were tested. For example, for the GLOS model we used, among other things, different generalization functions to determine reinforcements (see also Footnote 2) and different methods to determine the reinforcement in the case of negative payoffs. For the LOCAD model we used, among other things, different reference outcomes with which the outcome of a present decision is compared to determine the “success” of a decision, and different methods for determining the step size of the current trial. In sum, the specified LOCAD and GLOS learning models were the best models (according to the goodness-of-fit criterion in Study 1) representing the two approaches of learning models. Therefore the conclusions we draw from the results of our model comparisons are robust to variations of the present definitions of the two learning models.

4. As an alternative to the least-squares estimation we employed, maximum likelihood estimation has the drawback that it is sensitive to very small predicted probabilities, which frequently occurred in the present task with its large number of possible allocations (for advantages of least-squares estimation see also Selten, 1998). Furthermore, the optimal properties of maximum likelihood hold only when the model is the true model, which is almost never the case. These properties also hold only if the parameters fall in the interior of the parameter space, which is not guaranteed in our models. In short, under conditions of possible model misspecification, least-squares estimation is more robust than maximum likelihood estimation, so the statistical justifications for maximum likelihood do not hold up under these conditions.

5. Note that the models’ parameters were not fitted by optimizing the predicted average allocations relative to the observed average allocations, but by optimizing the predicted probabilities of the selected allocations; otherwise a closer fit would have resulted.
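Footnotes 4 and 5 describe the estimation criterion only verbally. As a rough illustration, and not the exact implementation used in the paper (the trial record and model interface below are hypothetical placeholders), a quadratic loss over predicted choice probabilities, in the spirit of the quadratic scoring rule (Selten, 1998), could be written as follows.

```python
def quadratic_loss(predicted_probs, chosen_index):
    """Brier-type loss for one trial: squared distance between the predicted
    probability distribution over allocation alternatives and the observed
    choice coded as a one-hot vector."""
    return sum((p - (1.0 if i == chosen_index else 0.0)) ** 2
               for i, p in enumerate(predicted_probs))

def total_loss(model, trials):
    """Sum of per-trial losses; parameter values would be chosen to minimize this
    sum (model.predict and trial.choice are hypothetical names for illustration)."""
    return sum(quadratic_loss(model.predict(trial), trial.choice) for trial in trials)
```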


Figure Captions

Figure 1. The payoff function for the total payoff of the allocation problem in Study 1. The figure shows the investments in asset B and asset C (which determine the investment in asset A) and the corresponding payoff.
Figure 2. Average participants’ allocations and average predictions of the two learning models fitted to each individual. The figure shows a moving average of nine trials, such that for each trial the average of the present allocation and the four preceding and four succeeding allocations is presented. (Note that for the first four trials the moving average is determined by five to eight trials.)
Figure 3. Individual characteristics of the decision process in Study 1. a) Percentage of allocations corresponding to the local or global payoff maximum across all trials (with a tolerated deviation of ±5% from the allocations that lead to the global or local maximum), presented as a moving average of nine trials. b) Average payoff across all trials, presented as a moving average of nine trials.
Figure 4. Individual characteristics of the decision process in Study 1. a) Average magnitude of changes (step size), measured as the Euclidean distance between the allocations of successive trials (with possible values ranging from 0 to 141), presented as a moving average of nine trials. b) The angles between allocations’ directions and the directions of the preceding allocations were determined and categorized into six intervals. For each category the percentage of successful preceding allocations (i.e., those leading to a higher payoff than the allocations before them) is presented.
Figure 5. The payoff function for the total payoff of the allocation problem in Study 2.


Figure 6. Average participants’ allocations and average predictions of the two learning models when simulating 50,000 agents. The figure shows a moving average of nine trials, such that for each trial the average of the present allocation and the four preceding and four succeeding allocations is presented. (Note that for the first four trials the moving average is determined by five to eight trials.)
Figure 7. Individual characteristics of the decision process in Study 2. a) Percentage of allocations corresponding to the local or global payoff maximum across all trials (with a tolerated deviation of ±5% from the allocations that lead to the global or local maximum), presented as a moving average of nine trials. b) Average payoff across all trials, presented as a moving average of nine trials.
Figure 8. Characteristics of the decision process in Study 2. a) Average magnitude of changes (step size), measured as the Euclidean distance between the allocations of successive trials (with possible values ranging from 0 to 141), presented as a moving average of nine trials. b) The angles between allocations’ directions and the directions of the preceding allocations were determined and categorized into six intervals. For each category the percentage of successful preceding allocations (i.e., those leading to a higher payoff than the allocations before them) is presented.
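The captions refer repeatedly to a nine-trial moving average and to angles between successive allocation changes. The following sketch, our own illustration with assumed variable names, shows one way to compute both quantities.

```python
import math

def moving_average(series, window=9):
    """Nine-trial moving average: each trial is averaged with the four preceding and
    four succeeding trials; near the edges the window shrinks to the available trials."""
    half = window // 2
    out = []
    for t in range(len(series)):
        lo, hi = max(0, t - half), min(len(series), t + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

def angle_between_changes(prev_change, next_change):
    """Angle in degrees between two successive allocation changes, each given as a
    vector of differences of the three asset shares."""
    dot = sum(a * b for a, b in zip(prev_change, next_change))
    norm = (math.sqrt(sum(a * a for a in prev_change))
            * math.sqrt(sum(b * b for b in next_change)))
    if norm == 0.0:
        return 0.0
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
```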


Figures

Figure 1.
[Three-dimensional surface plot of the total payoff (approximately −5 to 35) as a function of the percentage invested in Asset B and the percentage invested in Asset C.]


Figure 2.
[Line plot of the percentage invested in assets (0%-50%) across trials 1-100 (moving average), with separate curves for the observed allocations (“Real”) and the GLOS and LOCAD predictions for Asset B and Asset C.]


Figure 3.


Figure 4.


Figure 5.
[Three-dimensional surface plot of the total payoff (approximately −40 to 50) as a function of the percentage invested in Asset B and the percentage invested in Asset C.]


Figure 6.
[Line plot of the percentage invested in assets (0%-50%) across trials 1-100 (moving average), with separate curves for the observed allocations (“Real”) and the GLOS and LOCAD predictions for Asset B and Asset C.]


Figure 7.


Figure 8.