Emotional Multiagent Reinforcement Learning in Social Dilemmas

Chao Yu, Minjie Zhang, and Fenghui Ren
School of Computer Science and Software Engineering
University of Wollongong, Wollongong, 2522, NSW, Australia
[email protected], {minjie,fren}@uow.edu.au

Abstract. Social dilemmas have attracted extensive interest in multiagent system research in order to study the emergence of cooperative behaviors among selfish agents. Without extra mechanisms or assumptions, directly applying multiagent reinforcement learning in social dilemmas will end up with convergence to the Nash equilibrium of mutual defection among the agents. This paper investigates the importance of emotions in modifying agent learning behaviors in order to achieve cooperation in social dilemmas. Two fundamental variables, individual wellbeing and social fairness, are considered in the appraisal of emotions that are used as intrinsic rewards for learning. Experimental results reveal that different structural relationships between the two appraisal variables can lead to distinct agent behaviors, and under certain circumstances, cooperation can be obtained among the agents.

1 Introduction

Cooperation is ubiquitous in the real world and can be observed at different scales of organization, ranging from microorganisms and animal groups to human societies [1]. Solving the puzzle of how cooperation emerges among self-interested entities is a challenging issue that has motivated scientists from various disciplines, including economics, psychology, sociology and computer science, for decades. The emergence of cooperation is often studied in the context of social dilemmas, in which selfish individuals must decide between a socially reciprocal behavior of cooperation, which benefits the whole group over time, and a self-interested behavior of defection, which pursues their own short-term benefits. Social dilemmas arise in many situations in Multi-Agent Systems (MASs), e.g., file sharing in peer-to-peer (P2P) systems and load balancing/packet routing in wireless sensor networks [2]. For this reason, mechanisms that promote the emergence of cooperation in social dilemmas are of great interest to researchers in MASs.

Although various mechanisms, such as kin selection, reciprocal altruism and spatial selection [3], have been proposed in recent years to explain the emergence of cooperation, most of these mechanisms are based on Evolutionary Game Theory (EGT) [4], with a focus either on macro-level population dynamics using replicator equations or on agent-level strategy dynamics based on predefined imitation rules. Real animals and humans, however, do not simply replicate or mimic others, but can learn efficient strategies from past interaction experience. This experience-based learning capability is important in building intelligent agents whose behavior aligns with that of humans, particularly when designers cannot anticipate all situations that the agents might encounter. A major family of experience-based learning is Reinforcement Learning (RL) [5], which enables an agent to learn an optimal strategy through trial-and-error interactions with an environment. When multiple agents learn at the same time, which is called Multi-Agent Reinforcement Learning (MARL) [6], each agent faces a non-stationary learning environment, and therefore each agent's individually optimal strategy does not necessarily guarantee a globally optimal performance for the whole system.

In the setting of social dilemmas, directly applying distributed MARL approaches will end up with convergence to the Nash equilibrium of mutual defection if no additional mechanisms are implemented. This convergence occurs because both agents adopt best-response actions during learning: neither agent can achieve a dominant position by choosing defection to exploit its opponent, because the opponent will eliminate such dominance by also choosing defection, resulting in an outcome of mutual defection. This rational behavior contradicts our daily observations of altruistic cooperative behaviors among humans. In fact, psychology and behavioral economics have provided convincing evidence that humans are not purely rational and self-interested, but generally express consideration for others and make decisions with bounded rationality [7,8]. Recent research in artificial intelligence and cognitive science has shown that emotions can be an important heuristic to assist humans' bounded rationality in effective decision-making [9]. Emotions play a fundamental role in learning by eliciting physiological signals that bias our behaviors toward maximizing reward and minimizing punishment. MARL mechanisms, in essence, should therefore rely on some emotional cues to indicate the advantage or disadvantage of an event.

In this paper, we investigate the importance of emotions in modifying agent rationality in MARL in order to achieve cooperation in social dilemmas. We focus on spatial versions of social dilemmas by considering topological structures among agents to study the impact of local interactions on the emergence of cooperation. An emotional MARL framework is proposed to endow agents with internal cognitive and emotional capabilities that can drive them to learn reciprocal behaviors in social dilemmas. Two fundamental variables, individual wellbeing and social fairness, are considered in the appraisal of emotions. Different structural relationships between these appraisal variables can derive various intrinsic emotional rewards for agent learning. Experimental results reveal that different ways of appraising emotions and various structural relationships between emotional appraisal variables can lead to distinct agent behaviors, and that under certain circumstances, cooperation can be obtained among the agents.

Although emotion-based concepts and mechanisms have been widely applied in agent learning [10,11,12], most of these studies focus primarily on exploiting emotions to facilitate learning efficiency or to adapt a single agent to dynamic and complex environments.


The other line of research [13,14] has examined the evolution of cooperation in social dilemmas by implementing emotion mechanisms within rule-based emotional frameworks (i.e., not from a learning perspective). Our work, therefore, bridges these two directions of research by incorporating emotions into agent learning to study the evolution of cooperation in social dilemmas. We claim this effort to be the main contribution of this paper.

The remainder of this paper is organized as follows. Section 2 describes social dilemmas. Section 3 introduces the emotional MARL framework. Section 4 gives the experimental studies. Section 5 lays out some related work. Finally, Section 6 concludes this paper with some directions for future research.

2 Social Dilemmas

This paper uses the well-known Prisoner's Dilemma (PD) game as a metaphor to study social dilemmas among self-interested agents. In PD, each player has two actions: 'cooperate' (C), which is a socially reciprocal behavior, and 'defect' (D), which is a self-interested behavior. Consider the typical payoff matrix of the PD game in Table 1. The rational action for both agents is to select D, because choosing D ensures a higher payoff for either agent no matter what the opponent does. In other words, mutual defection (DD) is the unique Nash equilibrium, and neither agent has an incentive to deviate from this equilibrium. As can be seen from Table 1, however, DD is not Pareto optimal, because each agent would be better off (i.e., R > P) and both agents together would receive a higher social reward (i.e., 2R > T + S) if both selected C. Therefore, when the game is played repeatedly, which is called the Iterated Prisoner's Dilemma (IPD), it may be beneficial even for a selfish agent to cooperate in some rounds, in the hope that reciprocal cooperative behavior will bring benefits in the long run. The IPD has been widely used for studying the emergence of cooperation among selfish individuals in a variety of disciplines, including artificial intelligence, economics and the social sciences [15].

Table 1. Payoff matrix of the PD game (payoff of the row player is shown first)

                 Cooperate (C)   Defect (D)
  Cooperate (C)  R(=3), R(=3)    S(=0), T(=5)
  Defect (D)     T(=5), S(=0)    P(=1), P(=1)
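As a concrete illustration of these conditions, the short Python sketch below encodes the payoff values of Table 1 and checks the inequalities discussed above; the variable names are ours and serve illustration only.

```python
# Payoff values of the PD game from Table 1 (T: temptation, R: reward,
# P: punishment, S: sucker's payoff). Names are illustrative only.
T, R, P, S = 5, 3, 1, 0

# payoffs[(row_action, col_action)] -> (row_payoff, col_payoff)
payoffs = {
    ('C', 'C'): (R, R),
    ('C', 'D'): (S, T),
    ('D', 'C'): (T, S),
    ('D', 'D'): (P, P),
}

# D is the best response to either action of the opponent, so DD is the
# unique Nash equilibrium ...
assert T > R and P > S
# ... but DD is not Pareto optimal: both agents prefer CC, individually and jointly.
assert R > P and 2 * R > T + S
```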

This paper uses the spatial IPD, which considers the topological structure among the agents, to study the impact of local interactions on the emergence of cooperation. Three types of networks are used to represent a spatial IPD: (1) grid networks GR_N, where N is the number of nodes; (2) small-world networks SW_N^{k,ρ}, where k is the average size of the neighborhood of a node and ρ is the re-wiring probability indicating the degree of randomness of the network; and (3) scale-free networks SF_N^{k,γ}, where γ is an exponent indicating that the probability of a node having k neighbors is roughly proportional to k^{-γ}.


A more detailed explanation of these networks is given in [16].
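For readers who wish to reproduce such topologies, the sketch below generates the three network types with NetworkX; the use of NetworkX and the concrete parameter values (taken from the experimental setup in Section 4) are our assumptions rather than part of the original implementation.

```python
import networkx as nx

N = 100  # population size used in the experiments of Section 4

# Grid network GR_100 (assumed here to be a 10 x 10 lattice, one agent per node).
grid = nx.grid_2d_graph(10, 10)

# Small-world network SW_100^{4,0.4}: Watts-Strogatz model with neighborhood
# size k = 4 and re-wiring probability rho = 0.4.
small_world = nx.watts_strogatz_graph(N, k=4, p=0.4)

# Scale-free network SF_100^{k,3}: Barabasi-Albert growth, attaching each new
# node with one edge, which yields a degree distribution roughly ~ k^(-3).
scale_free = nx.barabasi_albert_graph(N, m=1)

# Each agent interacts only with its direct neighbors in the chosen network.
neighbors = {node: list(small_world.neighbors(node)) for node in small_world.nodes}
```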

3 Emotional MARL in Social Dilemmas

This section introduces the emotional MARL framework and its implementation in social dilemmas based on existing appraisal theories of emotions.

3.1 Emotional MARL Framework

In recent years, researchers in artificial intelligence, cognitive science and psychology have shown increasing interest in defining the source and nature of rewards in RL [17]. It has been pointed out that the standard view of defining rewards in RL simply as the output of a “critic” in the external environment (Fig. 1(a)) is seriously misleading and thus not a suitable reflection of real-life human and animal reward systems [18]. The environment of an RL agent should not simply be identified with the external environment where the agent is physically located, but should also include the agent's internal environment, constituted by multiple intrinsic emotion circuits and drives. A framework called Intrinsically Motivated Reinforcement Learning (IMRL) [18] has been proposed to implement intrinsic motivation systems in learning agents by dividing rewards into “extrinsic rewards”, which define a specific task related to an agent's external goal or cost, and “intrinsic rewards”, which relate to an agent's internal emotion circuits and drives. The IMRL framework provides a computationally sound approach that better reflects the reward systems of real humans and animals, and thus enables a learning agent to achieve more adaptive behaviors by overcoming its perceptual limitations. Most work on the IMRL framework, however, has focused mainly on single-agent scenarios, in which the motivational system is used to drive a single agent to learn adaptive behaviors in complex environments. In addition, most previous work lacks an explicit description of how to derive the intrinsic rewards for learning, especially through an emotion system.

To this end, we propose the emotional MARL framework shown in Fig. 1(b), which extends the IMRL framework by implementing an emotion-driven intrinsic reward system based on computational component models of emotion appraisal [19]. The framework consists of two parts: an agent and its external environment. The agent takes an action on the external environment and receives sensations and extrinsic rewards from the environment. An internal environment, including an emotion appraisal derivation model and an emotion derivation model, exists inside the agent to generate emotions, which are then used in an emotion consequent model to adapt learning behaviors.

The emotion appraisal derivation model transforms an agent's belief about its relationship with the environment into a set of quantified appraisal variables. The appraisal variables correspond to a set of judgments that the agent can use to produce different emotional responses.

[Fig. 1. (a) The standard RL framework [5], in which a critic in the external environment supplies states and rewards directly to the learning algorithm. (b) The proposed emotional MARL framework (adapted from [18,19]), in which an internal environment, comprising an emotion appraisal derivation model (social fairness, individual wellbeing), an emotion derivation model and an emotion consequent model, converts sensations and extrinsic rewards from the external environment into emotions (e.g., happiness, fear, sadness) that serve as intrinsic rewards for the learning algorithm.]

Two variables are used to appraise emotions, considering not only the agent's individual wellbeing, which is derived from the extrinsic rewards of the environment, but also its sense of social fairness, which is derived from the agent's sensations of its neighbors' decisions.

The emotion derivation model maps the appraisal variables to an emotional state and specifies how an agent reacts emotionally once a pattern of appraisals has been determined. By combining the variables using different emotion derivation functions, different agent personalities can be defined. For example, if the function is defined such that social fairness is the core factor determining an agent's final emotional state, the agent can be considered a socially aware agent. Otherwise, the agent is more like an egoist that cares more about its own wellbeing than about social fairness.

The emotion consequent model maps emotions into behavioral or cognitive changes. In our framework, the consequent emotions generated in the internal environment are simply used as intrinsic rewards for agent learning. The learning model consists of general RL approaches [5] that can directly use the emotional intrinsic rewards to update learning information. In this research, the widely used Q-learning algorithm is adopted as the basic RL approach.

Many studies [10,11,12] have claimed that RL mechanisms in nature rely on emotional cues, rather than on direct exogenous stimuli from the environment, to indicate the advantage or disadvantage of an event. The proposed framework takes this position by differentiating an agent's external and internal environments and defining a new reward system that endows agents with internal emotional capabilities to drive them towards more complex and adaptive behaviors. The following subsections show in detail how to implement the proposed framework in the spatial IPD so that agents can learn to achieve cooperative behaviors.
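Before detailing these models, the following Python skeleton sketches one possible decomposition of the internal environment described above; the class and method names are ours and are not prescribed by the framework.

```python
from dataclasses import dataclass

@dataclass
class Appraisal:
    fairness: float   # F in [-1, 1], derived from sensed neighbor decisions
    wellbeing: float  # W in [-1, 1], derived from extrinsic rewards

class AppraisalDerivationModel:
    """Transforms the agent's beliefs about its environment into appraisal variables."""
    def appraise(self, neighbor_actions, extrinsic_rewards) -> Appraisal:
        raise NotImplementedError

class EmotionDerivationModel:
    """Maps appraisal variables to an emotion label and its overall state E_x."""
    def derive(self, appraisal: Appraisal):
        raise NotImplementedError  # e.g., the FW or WF function of Section 3.3

class EmotionConsequentModel:
    """Maps the derived emotion to an intrinsic reward for the learning algorithm."""
    def intrinsic_reward(self, emotion: str, state: float) -> float:
        # Positive emotions (joy) yield +state, negative ones (fear, anger,
        # sadness) yield -state, as formalized later in Equ. 12.
        return state if emotion == 'joy' else -state
```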

3.2 Appraisal of Emotions

One way of explaining how emotions are generated according to one's relationship with the environment is through appraisal theories of emotions [20]. In these theories, the elicitation and differentiation of emotions are based on an agent's individual evaluation of the significance of events and situations for itself along a set of judgments.


These judgments, formally referred to as appraisal variables, are used to produce different emotional responses by the agents. As Ellsworth and Scherer [20] pointed out, emotions can generally be considered to be composed of interacting elementary variables from two main dimensions: the basic/motivational dimension, dealing with individual goals, needs and pleasantness, and the social dimension, dealing with social norms, justice and fairness. Two fundamental appraisal variables, social fairness and individual wellbeing, are therefore adopted in the appraisal of emotions in this research.

Appraisal of Social Fairness. Research in the field of behavioral economics has shown that humans are not purely self-interested, but care strongly about fairness [21]. Humans generally show a remarkable ability to solve social dilemmas due to their tendency to consider fairness to other people. In order to evaluate social fairness, an agent needs to assess its own situation in its neighborhood context. We define a focal agent's neighborhood context C as:

C = \frac{1}{N} \sum_{i=1}^{N} \frac{n_c^i - n_d^i}{M},    (1)

where n_c^i and n_d^i are the counts of C actions and D actions adopted by neighbor i in M steps (in this research, a learning episode consists of M interaction steps, and learning information is updated only at the end of these M steps), respectively, and N is the number of neighbors of the focal agent. An agent's neighborhood context C ∈ [−1, 1] indicates the extent of cooperativeness of the environment, with C = 1 indicating a fully cooperative environment, C = −1 a fully defective environment, and C = 0 a neutral environment. An agent's sense of fairness is then an evaluation of its own situation in such a neighborhood context, which is given by:

F = C \times \frac{n_c - n_d}{M},    (2)

where F ∈ [−1, 1] represents the agent's sense of social fairness, and n_c and n_d are the counts of C actions and D actions adopted by the agent itself in M steps. From Equ. 2, we can see that when the environment is cooperative (C > 0), the agent senses fairness if it cooperates more often (n_c > n_d) and unfairness if it defects more often (n_c < n_d). A similar analysis applies when the environment is uncooperative (C < 0).
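A direct transcription of Equs. 1 and 2 into Python is sketched below, assuming that each agent records the last M actions ('C' or 'D') of itself and of each neighbor; the function names are illustrative.

```python
def neighborhood_context(neighbor_histories, M):
    """Equ. 1: cooperativeness C of the neighborhood, in [-1, 1].
    neighbor_histories: one list of 'C'/'D' actions (length M) per neighbor."""
    total = 0.0
    for history in neighbor_histories:
        total += (history.count('C') - history.count('D')) / M
    return total / len(neighbor_histories)

def social_fairness(own_history, neighbor_histories, M):
    """Equ. 2: F = C * (n_c - n_d) / M, the agent's sense of fairness in [-1, 1]."""
    C = neighborhood_context(neighbor_histories, M)
    return C * (own_history.count('C') - own_history.count('D')) / M
```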

Appraisal of Individual Wellbeing. Agents also need to care about their own individual wellbeing in terms of maximizing utilities and achieving goals. Three different approaches are used to appraise an agent's individual wellbeing (a code sketch of all three approaches is given after Equ. 6).

– Absolute value-based approach: A straightforward way to appraise an agent's individual wellbeing is to use the agent's absolute wealth as an evaluation criterion, which is given by Equ. 3:


W = \frac{R_t}{M \times (T - S)},    (3)

where R_t is the accumulated reward collected during the M interaction steps of learning episode t, and T − S is the difference between the temptation and the sucker's payoff (refer to Table 1), which confines W to the range [−1, 1].

– Variance-based approach: It has been argued, however, that low wealth does not necessarily imply a negative reaction and high wealth does not imply a positive reaction of the agents [11]. This phenomenon can be observed in the real world, where rich people are not necessarily happier than poor people, but an increase in wealth often causes a positive reaction. We therefore define the individual wellbeing as the positive or negative variation of an agent's absolute wealth, given by Equ. 4:

W = \frac{R_{t+1} - R_t}{M \times (T - S)},    (4)

where R_t and R_{t+1} are the accumulated rewards collected at learning episode t and the following episode t + 1, respectively.

– Aspiration-based approach: In reality, people are satisfied if their intrinsic aspirations can be realized [8]. The aspiration-based approach thus appraises the agent's individual wellbeing by comparing the reward achieved with an adaptive aspiration level, as given by Equ. 5:

W = \tanh\left[h\left(\frac{R_t}{M} - A_t\right)\right],    (5)

where \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} is the hyperbolic tangent function, h > 0 is a scaling parameter, and A_t is the aspiration level, which is updated by:

A_{t+1} = (1 - \beta)A_t + \beta\,\frac{R_t}{M},    (6)

where β ∈ [0, 1] is a learning rate and A_0 = (R + T + S + P)/4.
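The sketch below implements the three wellbeing appraisals of Equs. 3–6 in Python; R_t is the accumulated reward of episode t, and the default values of h and β follow the experimental settings of Section 4 (the function names are ours).

```python
import math

def wellbeing_absolute(R_t, M, T=5, S=0):
    """Equ. 3: absolute wealth, normalized into [-1, 1] by M * (T - S)."""
    return R_t / (M * (T - S))

def wellbeing_variance(R_next, R_t, M, T=5, S=0):
    """Equ. 4: positive or negative variation of wealth between two episodes."""
    return (R_next - R_t) / (M * (T - S))

def wellbeing_aspiration(R_t, A_t, M, h=10):
    """Equ. 5: satisfaction of the average reward relative to aspiration level A_t."""
    return math.tanh(h * (R_t / M - A_t))

def update_aspiration(A_t, R_t, M, beta=0.5):
    """Equ. 6: A_{t+1} = (1 - beta) * A_t + beta * R_t / M, with A_0 = (R+T+S+P)/4."""
    return (1 - beta) * A_t + beta * R_t / M
```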

3.3 Derivation of Emotions

This research aims to study the effects of various structural relationships between social fairness and individual wellbeing on the derivation of emotions, and how the resulting emotional reactions affect agent learning behaviors in social dilemmas. To this end, we first differentiate the roles of the two appraisal variables in determining the final emotional states. Smith and Lazarus [22] proposed a structural model of emotion appraisal to explain the relation between appraisals and the emotions they elicit. The appraisal process is broken up into two categories, primary appraisal and secondary appraisal, with primary appraisal concerning whether and how the encounter is relevant to the person's motivational goals and secondary appraisal concerning the person's resources and options for coping with the encounter [22].


The structural model of appraisal allows researchers to formulate which emotions will be elicited by a certain set of circumstances by examining an individual's appraisal of a situation, and therefore allows researchers to define different appraisal processes that lead to different emotions.

Based on this differentiation of the emotion appraisal process, we define an appraisal variable to be either a core appraisal variable (denoted as c) or a secondary appraisal variable (denoted as s). The core appraisal variable determines the desirability of an emotion through the agent's evaluation of its situation, while the secondary appraisal variable indicates the intensity of such an emotion based on the agent's evaluation of its coping ability. An emotion derivation function of emotion x can then be formally defined as follows:

Definition 1. Let c, s be the core and secondary appraisal variables, respectively, let 0 ≤ D_x ≤ 1 be the desirability of emotion x, and let −1 ≤ I_x ≤ 1 be the intensity of emotion x. An emotion derivation function is defined as E_x(c, s) = f(D_x) · g(I_x), where 0 ≤ E_x(c, s) ≤ 1 is the overall state of emotion x, 0 ≤ f(D_x) ≤ 1 is the core derivation function, which increases monotonically with the desirability D_x of emotion x, and 0 ≤ g(I_x) ≤ 1 is the secondary derivation function, which increases monotonically with the intensity I_x of emotion x.

In reality, people react to things differently. Even when presented with the same or a similar situation, people react in slightly different ways based on their appraisals of the situation, and these appraisals elicit emotions that are specific to each person. Based on Definition 1, two different kinds of emotion derivation functions can be defined:

– Fairness-Wellbeing (FW) Emotion Derivation Function: An agent that derives its emotions using the FW function takes social fairness F as the core appraisal variable (c ← F) and then derives its emotions based on its sense of its own wellbeing W (s ← W). More formally:

((F ≥ 0 ⇒ D_joy = F) ∧ I_joy = W) ⇒ E_joy(F, W) = f(F) · g(W);    (7)

((F < 0 ⇒ D_fear = −F) ∧ I_fear = W) ⇒ E_fear(F, W) = f(−F) · g(W);    (8)

((F < 0 ⇒ D_anger = −F) ∧ I_anger = −W) ⇒ E_anger(F, W) = f(−F) · g(−W);    (9)

An agent using the FW function to derive its emotions is a socially aware agent that pays more attention to social fairness than to individual wellbeing in appraising its emotions. As the first priority of an FW agent is to pursue social fairness, when the agent senses that the environment is fair (F ≥ 0, Equ. 7), the agent will be in a positive emotional state of joy, because the situation is consistent with the agent's motivational goal. The fairer the environment, the higher the desirability of joy, so the desirability D_joy is equated with the core appraisal variable F. The intensity of joy is then based on the increase or decrease in the agent's sense of its own wellbeing. An increase in individual wellbeing in a fair environment indicates an easy coping situation: the socially aware agent is in a positive emotional state because it can serve its selfish interest while at the same time pursuing its core motivational goal of social fairness. The final state of the emotion of joy can thus be calculated as f(F) · g(W).

On the contrary, when an FW agent senses that the environment is unfair (F < 0), the agent will feel negative because the situation is inconsistent with its goal of pursuing social fairness. An unfair environment can arise for two reasons: the agent defects more often in a cooperative environment (Equ. 8), or the agent cooperates more often in a defective environment (Equ. 9). The socially aware agent will feel fear in the former case, because it is exploiting its neighbors by choosing defection, while in the latter case the agent will be in anger, because it is being exploited by its neighbors. In both cases, the desirability of the negative emotion is equated with −F. The secondary appraisal of fear considers one's expectations of change in the motivational congruence of a situation [22]. If the agent is in fear, the intensity of fear increases monotonically with the wellbeing state of the agent (I_fear = W), because the socially aware agent realizes that it is exploiting its neighbors for an increase in its own wellbeing; the higher the wellbeing, the higher the intensity of fear. The final state of fear can then be calculated as f(−F) · g(W). In contrast, if the agent is in anger, the intensity of anger varies inversely with the wellbeing state of the agent (I_anger = −W), because a lower wellbeing results in a higher intensity of anger. The final state of the emotion of anger can then be calculated as f(−F) · g(−W).

– Wellbeing-Fairness (WF) Emotion Derivation Function: In contrast to the FW function, an agent that derives its emotions using the WF function takes its own wellbeing W as the core appraisal variable (c ← W) and then derives its emotions based on the social fairness F (s ← F). More formally:

((W ≥ 0 ⇒ D_joy = W) ∧ I_joy = F) ⇒ E_joy(W, F) = f(W) · g(F);    (10)

((W < 0 ⇒ D_sadness = −W) ∧ I_sadness = F) ⇒ E_sadness(W, F) = f(−W) · g(F);    (11)

An agent using the WF function is more like an egoist that pays more attention to its own wellbeing than to social fairness in determining its emotions. Equ. 10 formulates how the emotion of joy is generated. As the first priority of a selfish WF agent is to pursue its own benefits, when the agent senses that its wellbeing is high (W ≥ 0), the agent will be in a positive state of joy because the situation is consistent with its motivational goal. The desirability D_joy is equated with the value of the core appraisal variable W. The intensity of joy is then based on the agent's sense of social fairness: a high social fairness (associated with a high individual wellbeing) indicates that the selfish agent is in a positive emotional state, because the agent can achieve fairness while at the same time pursuing its core motivational goal of staying in high wellbeing. The final state of the emotion of joy, therefore, can be calculated as f(W) · g(F). On the contrary, when a WF agent senses that its wellbeing is low (W < 0, Equ. 11), the agent will be in sadness because the situation is inconsistent with its goal of pursuing individual benefits. The desirability D_sadness is equated with the value of −W. The intensity of sadness is then based on the increase or decrease in the agent's sense of social fairness: a low wellbeing in a fair environment indicates a difficult coping situation for a selfish WF agent. The final state of the emotion of sadness can be calculated as f(−W) · g(F).

Agents interact concurrently with all of their neighbors. Each agent keeps a Q-value table Q_i(s_i, a_i), in which s_i and a_i are the local state and individual action of agent i, respectively. We use the stateless version of Q-learning, which does not consider transitions between agent states. At each learning episode, agent i chooses an action a_i (cooperation or defection) based on its Q-values and plays this action with each of its neighbors. Each pairwise interaction results in a reward for the agent. Agent i observes its neighbors' actions to appraise social fairness and relies on its average reward to appraise its sense of wellbeing. The agent then updates its Q-values using the intrinsic reward R_int, which is derived from the valenced emotions as follows:

R_int = E_x(e, t) if emotion x is positive, and R_int = −E_x(e, t) if emotion x is negative,    (12)

where an emotion is positive if it is joy and negative if it is fear, anger or sadness.
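Putting the pieces together, the sketch below shows one way to implement the FW and WF derivation functions (Equs. 7–11), the valenced intrinsic reward (Equ. 12) and the stateless Q-learning update described above. It assumes the linear mappings f(D) = D and g(I) = (1 + I)/2 used in the experiments, uses the sign of the neighborhood context C to separate fear (Equ. 8) from anger (Equ. 9), and all function names are ours.

```python
import random

def f(D):               # core derivation function (assumed linear, see Section 4)
    return D

def g(I):               # secondary derivation function: maps intensity [-1, 1] to [0, 1]
    return (1 + I) / 2

def derive_emotion_fw(F, W, C):
    """FW agent (Equs. 7-9): fairness is the core appraisal variable. When F < 0,
    the agent feels fear if it exploits a cooperative neighborhood (C > 0) and
    anger if it is being exploited in a defective one (C < 0)."""
    if F >= 0:
        return 'joy', f(F) * g(W)        # Equ. 7
    if C > 0:
        return 'fear', f(-F) * g(W)      # Equ. 8
    return 'anger', f(-F) * g(-W)        # Equ. 9

def derive_emotion_wf(F, W):
    """WF agent (Equs. 10-11): individual wellbeing is the core appraisal variable."""
    if W >= 0:
        return 'joy', f(W) * g(F)        # Equ. 10
    return 'sadness', f(-W) * g(F)       # Equ. 11

def intrinsic_reward(emotion, E_x):
    """Equ. 12: positive emotions give +E_x, negative emotions give -E_x."""
    return E_x if emotion == 'joy' else -E_x

def choose_action(Q, epsilon=0.1):
    """Epsilon-greedy selection over the stateless Q-values {'C': .., 'D': ..}."""
    if random.random() < epsilon:
        return random.choice(['C', 'D'])
    return max(Q, key=Q.get)

def update_q(Q, action, R_int, alpha=0.5):
    """Stateless Q-learning update with discount factor 0, as in the experiments."""
    Q[action] += alpha * (R_int - Q[action])
```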

4 Experiments

We use the typical payoff values in Table 1 to represent a social dilemma game. The Watts-Strogatz model [23] is used to generate small-world networks, and the Barabási-Albert model [16] is used to generate scale-free networks. For the Barabási-Albert model, we start with 5 agents and add a new agent with 1 edge to the network at every time step; this network evolves into a scale-free network following a power law with an exponent γ = 3. The learning rate α is 0.5 and the discount factor is 0. The exploration rate ε is 0.1 in the ε-greedy exploration strategy. The scaling parameter h in the aspiration-based approach is 10, and the learning rate β for updating the aspiration level is 0.5. We choose the linear functions f(D_x) = D_x and g(I_x) = (1 + I_x)/2 to map the values of D_x and I_x to [0, 1]. All results are averaged over 100 independent runs.

Fig. 2 shows the learning dynamics in network GR_100 using the variance-based approach to appraise the individual wellbeing. Fig. 2(a) shows the learning dynamics of the average population reward using different emotion derivation functions. In the figure, R Agents (Rational Agents) are agents using the learning approach based on the traditional MARL framework, while F Agents and W Agents are agents using the social fairness and the individual wellbeing, respectively, as the emotional intrinsic rewards.

[Fig. 2. Learning dynamics in the proposed emotional MARL framework: (a) impact of different kinds of agents — average reward per learning episode for FW Agents, WF Agents, F Agents, W Agents and R Agents; (b) impact of different memories — average reward for FW Agents with different frequencies M = 1, ..., 5 to update emotions.]

From the results, we can see that different kinds of agents learn distinct behaviors. As expected, R Agents end up with mutual defection, causing a final average reward around P = 1. W Agents learn a similar behavior to R Agents. Both the F Agents and the WF Agents can learn to achieve a certain level of cooperation, which is still much lower than that of the FW Agents. These results confirm that both social fairness and individual wellbeing are fundamental factors in the appraisal of emotions, but that to guarantee a high level of cooperation, social fairness must be treated as the core appraisal variable. The results mirror real-life phenomena in which people, as social beings, often care about social fairness more than their own interests in order to achieve mutually beneficial outcomes [21]. For example, in the Ultimatum Game, people usually refuse an unfair offer even though this decision causes them to receive nothing, and in the Public Goods Game, people are usually willing to punish free riders even though this punishment imposes a cost on themselves.

Fig. 2(b) gives the learning dynamics when FW Agents use different frequencies to update their emotions. We can see that the frequency of updating intrinsic emotional states has a significant impact on the emergence of cooperation among the agents. A higher updating frequency (smaller interaction step M) causes a more dynamic extrinsic environment, which slows the emergence of cooperation, while a very low updating frequency (large interaction step M) delays the agents in catching up with the changing environment, causing a slow convergence to cooperation among the agents.

We are also interested in the impact of network topological structures and of the different ways of appraising the individual wellbeing on the emergence of cooperation. We carried out extensive experiments by applying the proposed emotional learning approach to the three kinds of networks with different population sizes and found that the patterns of results do not differ greatly. As an illustration, we plot the learning dynamics in three different networks with 100 agents in Fig. 3. From the results, we can see that in all three kinds of networks, the emotional FW Agents can learn to achieve cooperation using any of the approaches for appraising the wellbeing.

[Fig. 3. Learning dynamics of using different approaches to appraise the wellbeing: (a) grid network GR_100, (b) small-world network SW_100^{4,0.4}, (c) scale-free network SF_100^{k,3}; each panel plots the average reward of absolute value-based, variance-based and aspiration-based FW Agents against that of R Agents.]

The variance-based and aspiration-based approaches, however, outperform the absolute value-based approach in terms of both a quicker convergence speed and a higher level of cooperation among the agents.

[Fig. 4. Dynamics of the appraisal variables (F, W) and the resultant overall emotional state (E) in grid network GR_100: (a) values averaged over all FW Agents; (b) values of a single FW Agent in one run.]

Fig. 4 shows the learning dynamics of the appraisal values and the resultant overall emotional state in grid network GR_100. Fig. 4(a) gives the values of all FW Agents averaged over 100 runs. As learning proceeds, the value of F increases because more and more agents choose to cooperate in a cooperative environment. The value of W increases dramatically at the beginning and then stabilizes around 0 (this is because the difference between the average reward of 2.25 and the final cooperation reward of 3 is very minor; this difference is then normalized by T − S = 5, causing a final value of W close to 0). The overall state of the emotion joy, E_joy, also increases during the learning process and finally stabilizes around 0.38, which means the agents cannot reach a fully joyful state (E_joy = 1). To reach the most joyful state (E_joy = 1), an agent must sense the highest fairness (F = 1) and remain in the highest wellbeing state (W = 1) at the same time. However, being very wealthy will result in others' defection, so that the agent will again feel unfairness (F < 0) and will be fearful of retaliation by its neighbors (refer to Equ. 8). The reciprocal altruistic behavior of cooperation requires agents to forsake their own short-term benefits for long-term group benefits.


The stabilized final level of the emotional state of joy, therefore, becomes an equilibrium among the agents (i.e., being moderately joyful in order to achieve mutually satisfactory outcomes), from which no one has an incentive to deviate. Note that each curve in Fig. 4(a) only indicates the overall variation of social fairness, wellbeing and overall emotional state across all the agents. To better understand the dynamics during learning, Fig. 4(b) shows the learning dynamics of a single FW Agent in one particular run. It can be seen that the agent dynamically updates its emotion appraisal variables and the consequent emotional states, and during this dynamic updating, the agent biases its learning behavior to achieve cooperation with the other co-learning FW Agents.

[Fig. 5. Learning dynamics using competitive emotion derivation functions in network GR_100: (a) dynamics of the average reward for non-emotional learning and for emotional learning with the absolute value-based, variance-based and aspiration-based wellbeing appraisals; (b) dynamics of the emotion-derivation-function selection strategy π(e), plotting π(FW) and π(WF) for each wellbeing appraisal approach.]

In reality, people usually react to an environment based on a number of different emotional cues. After a certain period of interaction, however, a certain type of cue will emerge as the primary cause of emotional reactions in an individual. To model this phenomenon, we let each agent derive its emotions based on both the FW and the WF function at the same time. These two functions compete with each other to dominate the derivation of an agent's emotions through a meta-level, inner-layer learning process: the emotional intrinsic rewards are not only used to update the Q-values associated with each function, but also used to update the strategy for selecting between the functions. The Weighted Policy Learner (WPL) algorithm [24], which adapts the selection probabilities with policy-dependent learning rates, is used for this inner-layer learning process (see the sketch at the end of this section).

Fig. 5(a) shows the learning dynamics in network GR_100. In the figure, non-emotional learning represents the traditional MARL approach used by rational agents, and the other three approaches represent learning approaches within the emotional MARL framework. As can be seen, different learning approaches produce different patterns of learning behaviors. The variance-based and aspiration-based approaches can greatly boost the emergence of cooperation among the agents, whereas the absolute value-based approach cannot bring about cooperation, producing a learning curve similar to that of the non-emotional learning approach.


This result confirms that the absolute wealth of an agent cannot reflect the agent's real emotional state regarding its own wellbeing.

Fig. 5(b) plots the dynamics of the strategy for selecting emotion derivation functions in network GR_100, in which π(e) denotes the probability of selecting emotion derivation function e (FW or WF) to derive the agents' emotions. As can be seen, as the learning process moves on, the selection strategies under the different wellbeing appraisals differ significantly. When agents adopt the aspiration-based or the variance-based approach, the FW function gradually emerges as the dominant function, with a probability close to 100% of being used to derive the agents' emotions. When agents adopt the absolute value-based approach, however, neither function comes to dominate the other; the probability of selecting the WF function is only slightly higher than that of selecting the FW function. The results in Fig. 5(b) indicate that, through the competition between the agents' emotion derivation functions in the inner-layer learning, the socially reciprocal behavior induced by the FW function can override the selfish, egoistic behavior induced by the WF function, which correspondingly facilitates cooperation among the agents.
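As a rough illustration of the inner-layer learning, the following WPL-style sketch updates the selection probabilities π(FW) and π(WF) with probability-weighted steps; this is our reading of the Weighted Policy Learner of [24], not the paper's exact implementation, and the value estimates and step size are illustrative.

```python
def wpl_select_update(pi, values, eta=0.05, floor=1e-3):
    """pi: dict of selection probabilities for 'FW' and 'WF' (summing to 1).
    values: estimated value of each derivation function, e.g., its recent
    average intrinsic reward. Returns the updated selection strategy."""
    expected = sum(pi[e] * values[e] for e in pi)   # expected value under pi
    new_pi = {}
    for e in pi:
        gradient = values[e] - expected
        # WPL-style weighting: slow down near the boundaries of the simplex by
        # scaling positive gradients with (1 - pi) and negative ones with pi.
        weight = (1 - pi[e]) if gradient > 0 else pi[e]
        new_pi[e] = pi[e] + eta * gradient * weight
    # Clip away from 0 and renormalize so that the probabilities stay valid.
    total = sum(max(p, floor) for p in new_pi.values())
    return {e: max(p, floor) / total for e, p in new_pi.items()}
```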

5 Related Work

MARL in social dilemmas has been of great interest to researchers in MASs for decades. Sandholm and Crites [25] pioneered this area by investigating a Q-learning agent playing the IPD game against an unknown opponent. They reported that mutual cooperation did not occur when the agents did not take their past actions into account, and that the exploration rates had a major impact on the level of cooperation. Vrancx et al. [26] used a combination of replicator dynamics and switching dynamics to model multi-agent learning automata in multi-state PD games. In addition, many other researchers have investigated MARL in social dilemmas based on aspiration-based approaches [8,27]. All these studies, however, focus on analyzing the learning dynamics between two agents and on understanding under which conditions naive RL agents can learn to achieve cooperation in social dilemmas. In contrast, our work incorporates an emotion mechanism into MARL in spatial social dilemmas to study the impact of local interactions on the emergence of cooperation.

Numerous studies [10,11,12] have incorporated emotion-based concepts and mechanisms into RL. Most of these studies, however, focus primarily on exploiting emotions to facilitate learning efficiency or to adapt a single agent to dynamic and complex environments. Several studies have examined the evolution of cooperation in the spatial IPD by implementing an emotion mechanism. For example, a rule-based system that enables the synthesis and generation of cognition-related emotions was proposed in order to improve the level of cooperation in the spatial IPD [13]. Szolnoki et al. [14] proposed an imitation mechanism that copies neighbors' emotional profiles and found that this imitation is capable of guiding the population towards cooperation in social dilemmas. All these studies, however, are based on rule-based emotional frameworks, in which the way of eliciting emotions must be predefined so that agents can adapt their cooperation behaviors directly according to their emotional states.


This is in contrast to our work, in which emotions are used as intrinsic rewards to bias agent learning during local repeated interactions in spatial social dilemmas. Bazzan et al. [15] used social attachments (i.e., belonging to a hierarchy or to a coalition) to lead learning agents in a grid to a certain level of cooperation. Although our work addresses the same problem, we focus on exploiting emotions to modify agent behaviors during learning, and we do not impose assumptions of hierarchical supervision or coalition affiliation on the agents.

6 Conclusion and Future Work

This paper studied the emergence of cooperation in social dilemmas by using emotional intrinsic rewards in MARL. The goal of this work was to investigate whether such emotional intrinsic rewards, derived through the appraisals of social fairness and individual wellbeing, can bias agents' rational learning so that cooperation can be achieved. Experimental results revealed that different structural relationships between the emotional appraisal variables can lead to distinct agent behaviors in the whole system, and that under certain circumstances, cooperation can be obtained among the agents.

This work also leaves several directions for future research. For example, the impact of varying topological structures on the emergence of cooperation under the proposed learning framework is not the focus of this paper and still needs to be investigated further. It would also be interesting to study social dilemmas in heterogeneous societies, in which each agent is endowed with a different emotion derivation function, to model real-life situations where people have different emotional reactions to the same environmental changes.

References

1. Hofmann, L., Chakraborty, N., Sycara, K.: The evolution of cooperation in self-interested agent societies: a critical study. In: The 10th International Conference on Autonomous Agents and Multiagent Systems, pp. 685–692 (2011)
2. Salazar, N., Rodriguez-Aguilar, J., Arcos, J., Peleteiro, A., Burguillo-Rial, J.: Emerging cooperation on complex networks. In: The 10th International Conference on Autonomous Agents and Multiagent Systems, pp. 669–676 (2011)
3. Nowak, M.: Five rules for the evolution of cooperation. Science 314(5805), 1560–1563 (2006)
4. Perc, M., Szolnoki, A.: Coevolutionary games – a mini review. BioSystems 99(2), 109–125 (2010)
5. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
6. Busoniu, L., Babuska, R., De Schutter, B.: A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. C Appl. Rev. 38(2), 156–172 (2008)
7. Conlisk, J.: Why bounded rationality? J. Econ. Lit. 34(2), 669–700 (1996)


8. Stimpson, J., Goodrich, M., Walters, L.: Satisficing and learning cooperation in the prisoner's dilemma. In: International Joint Conference on Artificial Intelligence, pp. 535–544. AAAI Press, California (2001)
9. Rumbell, T., Barnden, J., Denham, S., Wennekers, T.: Emotions in autonomous agents: comparative analysis of mechanisms and functions. J. Auton. Agents Multi-AG 25(1), 1–45 (2012)
10. Ahn, H., Picard, R.: Affective cognitive learning and decision making: The role of emotions. In: Proceedings of the 18th European Meeting on Cybernetics and Systems Research, pp. 1–6. North-Holland, Amsterdam (2006)
11. Salichs, M., Malfaz, M.: A new approach to modeling emotions and their use on a decision-making system for artificial agents. IEEE Trans. Affec. Comput. 3(1), 56–68 (2012)
12. Sequeira, P., Melo, F., Paiva, A.: Emotion-based intrinsic motivation for reinforcement learning agents. In: D'Mello, S., Graesser, A., Schuller, B., Martin, J.-C. (eds.) ACII 2011, Part I. LNCS, vol. 6974, pp. 326–336. Springer, Heidelberg (2011)
13. Bazzan, A., Bordini, R.: A framework for the simulation of agents with emotions. In: Proceedings of the 5th International Conference on Autonomous Agents, pp. 292–299. ACM, New York (2001)
14. Szolnoki, A., Xie, N., Wang, C., Perc, M.: Imitating emotions instead of strategies in spatial games elevates social welfare. Europhys. Lett. 96(3), 38002 (2011)
15. Bazzan, A., Peleteiro, A., Burguillo, J.: Learning to cooperate in the iterated prisoner's dilemma by means of social attachments. J. Braz. Comp. Soc. 17(3), 163–174 (2011)
16. Albert, R., Barabási, A.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–97 (2002)
17. Singh, S., Lewis, R., Barto, A.: Where do rewards come from? In: Proceedings of the Annual Conference of the Cognitive Science Society, pp. 2601–2606. Cognitive Science Society, Inc., Austin (2009)
18. Singh, S., Lewis, R., Barto, A., Sorg, J.: Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Trans. Auton. Mental Develop. 2(2), 70–82 (2010)
19. Marsella, S., Gratch, J., Petta, P.: Computational models of emotion. In: Blueprint for Affective Computing: A Sourcebook. Oxford University Press, Oxford (2010)
20. Ellsworth, P., Scherer, K.: Appraisal processes in emotion. Oxford University Press, New York (2003)
21. de Jong, S., Tuyls, K.: Human-inspired computational fairness. J. Auton. Agents Multi-AG 22(1), 103–126 (2011)
22. Smith, C.A., Lazarus, R.S.: Appraisal components, core relational themes, and the emotions. Cognition and Emotion 7(3-4), 233–269 (1993)
23. Watts, D., Strogatz, S.: Collective dynamics of 'small-world' networks. Nature 393(6684), 440–442 (1998)
24. Abdallah, S., Lesser, V.: Learning the task allocation game. In: Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 850–857 (2006)
25. Sandholm, T., Crites, R.: Multiagent reinforcement learning in the iterated prisoner's dilemma. Biosystems 37(1-2), 147–166 (1996)
26. Vrancx, P., Tuyls, K., Westra, R.: Switching dynamics of multi-agent learning. In: The 7th International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 307–313. ACM Press, New York (2008)
27. Tanabe, S., Masuda, N.: Evolution of cooperation facilitated by reinforcement learning with adaptive aspiration levels. J. Theor. Biol. 293, 151–160 (2011)