Bayesian Update of Recursive Agent Models

PIOTR J. GMYTRASIEWICZ, SANGUK NOH AND TAD KELLOGG
Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019-0015
{piotr, noh, [email protected]

(Received ..... ; Accepted in final form .....)

Abstract.

We present a framework for Bayesian updating of beliefs about models of agent(s) based on their observed behavior. We work within the formalism of the Recursive Modeling Method (RMM) that maintains and processes the models an agent may use to interact with other agent(s), the models the agent may think the other agent has of the original agent, the models the other agent may think the agent has, and so on. The beliefs about which model is the correct one are incrementally updated based on the observed behavior of the modeled agent and, as a result, the probability of the model that best predicted the observed behavior is increased. Analogously, the models on deeper levels of modeling can be updated; the models that the agent thinks another agent uses to model the original agent are revised based on how the other agent is expected to observe the original agent's behavior, and so on. We have implemented and tested our method in two domains, and the results show a marked improvement in the quality of interactions due to the belief update in both domains.

1. Introduction

A rational agent that interacts with other agents should have a way to represent the state of its knowledge about those agents, as well as a method for updating this knowledge based on new incoming information. The general goal we have in mind is to design a rational agent that is able to maintain its state of knowledge about the other agents it is interacting with, and to keep this knowledge consistent with its observations in a dynamic interactive environment. For this purpose, we use Bayesian updating, that is, a process of incorporating the evidence obtained by observing other agents' behavior one piece at a time, modifying the previously held beliefs about their beliefs, preferences, abilities and intentions (see, for example, the formulation proposed in Russell and Norvig, 1994). Our agents are intended to be able to coordinate their actions with the actions of the agents they interact with. The coordination involves the ability to choose an action appropriate to the expected action of the other agent.

This research was supported by the Office of Naval Research grant N00014-95-10775, by the National Science Foundation CAREER award IRI-9702132, and by a research initiation grant from the Computer Science and Engineering Department at the University of Texas at Arlington.


In our work on the Recursive Modeling Method (RMM) (Gmytrasiewicz and Durfee, 1995; Gmytrasiewicz, 1996; Noh and Gmytrasiewicz, 1997), this expectation is based on models which include the other agents' beliefs about the world, their preferences and goals, and their abilities for physical action. The other agents' beliefs include what they may know about the environment, as well as what they know about the agents they interact with. Thus, the models of agents include the models they may have of the other agents, which, in turn, may include the models they may think others have of others, and so on. Clearly, since the beliefs, preferences and abilities of other agents are not directly observable, an agent may be uncertain about these characteristics. Within our framework, the agent maintains a number of alternative models of other agents, and assigns to each a probability of being the actual one. For example, an agent modeling a human user may be uncertain about the user knowing about an imminent system shut-down, and could assign a probability of, say, 0.75 to the user's model in which she does know of the shut-down (see the situation described in Gmytrasiewicz (1996)), and a probability of 0.25 to the model according to which she does not. Or, during coordination between anti-air defense batteries (see Figure 1), Battery1 may be uncertain whether the other battery still has sufficient ammunition or whether it has been incapacitated. Thus, during its own coordination and decision-making, Battery1 will simultaneously maintain three alternative models of Battery2: fully functional, without sufficient ammunition, and incapacitated.

This paper describes the process of incrementally updating the probabilities associated with the alternative models based on the other agents' observed behavior. The intuition is to compare the behavioral prediction obtained from each model with the actually observed behavior, and to increase the probability associated with the model(s) that best predicted the observed behavior. As a result, an agent that initially had no information as to the proper way to model another agent, and started with a non-informative uniform probability distribution over all possible models, will learn the right way to model the other agents based on repeated observations. For example, Battery1's observing that Battery2 is not attempting to intercept the incoming threats should lead Battery1 to increase the likelihood it attaches to Battery2 being without ammunition or incapacitated.

An interesting complication arises when an agent admits the possibility that it is itself modeled by other agents. We describe how the probabilities associated with these models can be updated based on how the original agent thinks its behavior is observed by the other agents. Further, even deeper models may exist, and their probabilities can be updated analogously, resulting in a recursive Bayesian update.


[Figure 1 diagram: six incoming missiles labeled A through F above a ground site defended by units 1 and 2; legend: missile, interceptor, defense unit.]

Figure 1. An Air Defense Scenario.

An agent's updating of the deeper-level models usually leads to an increased confidence as to how other agents perceive the original agent. For example, while initially Battery1 may have been aware that Battery2 does not know whether Battery1 has ammunition, as Battery1 starts shooting it becomes more confident that its having ammunition is being recognized by the other agent.

Thus, while the methods for Bayesian update are relatively well known, the unique contribution of our work is to apply these methods to a well-defined and principled multiagent coordination and recursive modeling problem. Our work is related to a large body of work devoted to Bayesian learning and update, for example in Cooper and Herskovits (1992), Pearl (1988), Russell and Norvig (1994), and Spirtes et al. (1993). Recent work in the user modeling community, for example Jameson (1989), Jameson et al. (1995), Albrecht et al. (1997), Albrecht et al. (this volume), and Chiu and Webb (this volume), as well as in the uncertainty in AI community, Huber et al. (1994), is also closely related. The distinguishing feature of our work on update and modeling is that we use BDI (belief, desire, intention) models of the other agents. That is, we model what the other agents believe about the world, what it is that they desire, and, as a rational consequence, what behavior can be expected of them.


This can be contrasted with Albrecht et al. (1997), Albrecht et al. (this volume) and Huber et al. (1994), which contain no representation of the mental state of the modeled agent with which to predict behavior. The work by Jameson (1989) and Jameson et al. (1995) does contain a model of the mental state, but does not derive intentions by modeling the other agents' rationality. Other related work on learning in multiagent domains includes Friedman and Halpern (1994), Sen and Knight (1995) and, in game theory, Binmore (1982).

2. Models of Agents and Their Update

This section summarizes our approach to agent modeling, and introduces a general Bayesian update of modeling probabilities.

2.1. Intentional Modeling of Agents

The models of other agents we proposed in Gmytrasiewicz and Durfee (1995), Gmytrasiewicz (1996), and Noh and Gmytrasiewicz (1997) are intended to represent the decision-making situation the agent is in, and use the assumption of rationality to predict the behavior of the modeled agent, given its beliefs and preferences. [Footnote: Models of agents as irrational entities are permitted in our framework; they are called sub-intentional models.] This approach has been called intentional in the philosophical work of Dennett (1986), and it is closely related to the framework of belief, desire and intention (BDI) permeating much of the AI research (see Allen (1990) and other works in that volume). The principal contribution of the intentional, or purposeful, rendition of the BDI framework is that the agent's intentions, and therefore its actions, are the rational consequences of its beliefs and its desires.

The representation we use for intentional modeling is closely related to influence diagrams (Howard and Matheson, 1984), also pursued in Poh and Horvitz (1996), and to Bayesian networks (Pearl, 1988), also used in Jameson et al. (1995). A simplified influence diagram in Figure 2 depicts how the possible actions of an agent (or user) may, possibly probabilistically, influence the state of the environment, and how the environment and the action impact the agent's utility. Figure 2 contains the decision node, labeled Action, that lists the possible actions the agent can undertake. These actions can influence the state of the world, usually represented as a Bayesian network of interconnected chance nodes, but in Figure 2 simplified to a single node. The agent's action and the resulting state of the world both impact the agent's utility, that is, the degree to which the agent's desires, or goals, have been fulfilled.

The wide use of influence diagrams arises from their representational power.


[Figure 2 diagram: a decision node Action (with alternatives a1, a2, a3), a chance node State (with values s1, s2, s3), and a Utility node.]

Figure 2. A Simplified Influence Diagram.

They represent the agent's abilities by listing the possible actions, the agent's beliefs about the world and the way each of the available actions impacts the world state, and the agent's preferences. Further, influence diagrams represent the uncertainty present in the agent's knowledge, and they manage this uncertainty in a robust computational manner.

In our work, we chose the representation in the form of payoff matrices, also used widely in game theory (Myerson, 1991). They are closely related to influence diagrams in that the information contained in an influence diagram can be summarized by a unique payoff matrix. For example, the information contained in Figure 2 can be summarized in the payoff matrix below:

                         State
                   s1        s2        s3
          a1     $u_1^1$    $u_1^2$    $u_1^3$
 Actions  a2     $u_2^1$    $u_2^2$    $u_2^3$
          a3     $u_3^1$    $u_3^2$    $u_3^3$

where $u_i^j$ is the value of the utility node for the $i$th action and the $j$th state of the world. Thus, the payoff matrix, which summarizes the information in the influence diagram by hiding the details of the relationships among state variables, retains the dependence among the agent's actions, the state, and the utility.

As we mentioned, agents that interact with other agents can use models of others to predict their behavior for better coordination. [Footnote: These models can also be used for communication, as we outline in Gmytrasiewicz (1996).] This leads to the payoff matrices being labeled with the agents' alternative actions along all of their dimensions, which is their standard form in game theory (Myerson, 1991).
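For concreteness, the following minimal sketch (written in Python purely for illustration; the implementation described later in this paper is in Common LISP) shows how a single-chance-node influence diagram of the kind in Figure 2 could be stored as a payoff matrix $u_i^j$ and solved for the rational action. All variable names and numbers below are invented for this example and are not taken from the paper.

    import numpy as np

    actions = ["a1", "a2", "a3"]
    states = ["s1", "s2", "s3"]

    # Payoff matrix: U[i, j] is the value of the utility node for action a_i and
    # state s_j (the entries u_i^j above).  Invented numbers.
    U = np.array([[10.0, 2.0, 4.0],
                  [6.0, 8.0, 3.0],
                  [1.0, 5.0, 9.0]])

    # The arrow from Action to State in Figure 2: P_state[i, j] is the probability
    # that action a_i results in state s_j.  Invented numbers.
    P_state = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]])

    expected_utility = (P_state * U).sum(axis=1)   # one expected utility per action
    rational_action = actions[int(np.argmax(expected_utility))]
    print(dict(zip(actions, expected_utility.round(2))), "->", rational_action)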


[Figure 3 diagram: Level 1: A's model of its own situation (Agent A's intentional model). Level 2: A's three alternative intentional models of B, $M_B^{A,1}$, $M_B^{A,2}$ and $M_B^{A,3}$, weighted by $p(M_B^{A,1})$, $p(M_B^{A,2})$ and $p(M_B^{A,3})$. Level 3: how A thinks B may model A, including No-Information models and the models $M_A^{A,2:B,1}$, $M_A^{A,2:B,2}$ and $M_A^{A,2:B,3}$ weighted by the $p(M_A^{A,2:B,j})$'s. Level 4: how A thinks B may think that A models B (A's models of B's 1st, 2nd and 3rd models of A), terminating in No-Information models.]

Figure 3. Recursively Nested Models.

Hence, the decision-making situation of the original agent, together with the agent's view of the possible models of the decision-making situations of the other agents, as well as its view of the possible ways the others model other agents, can all be compactly represented as payoff matrices.

The influence diagrams, as well as the payoff matrices, can be solved for the optimal, or rational, action(s) of the agent in question. Intuitively, having represented an agent's decision-making situation as a matrix or a diagram, one can put oneself in the shoes of the agent and, assuming that it will attempt to make the best choice out of the ones available, derive the agent's expected behavior.

The nested models of agents form a hierarchical structure, called a recursive model structure (Gmytrasiewicz and Durfee, 1995), such as the one in Figure 3. Here, agent A has three alternative models of agent B (we will provide intuitive examples in the next section). On the next level deeper there are models that agent A thinks agent B may have of agent A.


Two of these are called No-Information models and represent, in this particular case, the fact that agent A has no knowledge as to what models agent B is using, if agent B is adequately described by the first or the third model. If the second model of B is correct, however, then agent A predicts that agent B could model A using three alternative models. The modeling can terminate on the fourth level with No-Information models if agent A has no modeling information nested deeper, or can continue if more knowledge is available (see Gmytrasiewicz and Durfee (1995) for further examples).

The recursive model structures, such as the one in Figure 3, can be solved for the rational coordinated action. Our implementation uses dynamic programming and recursively goes down the branches of the structure until the nesting bottoms out. Then the models are solved bottom-up, as the solution to the model at the top of the structure is constructed. To use dynamic programming, one clearly has to assume that the nesting of models terminates and actually bottoms out. This is a realistic assumption since, if the modeling information available to agent A is finite, then there will be a level of nesting on which the only proper model will be the No-Information model, and the bottom-up dynamic programming solution can propagate the information contained in the model structure, solving the models for rational actions as they are encountered.

The result of solving the models is an expectation of the agents' likely actions, expressed as probability distributions called conjectures. Thus, solving the models on the third level of the model structure in Figure 3 yields probability distributions over agent A's actions that represent what A thinks B expects of A in each of the alternative models. The alternative conjectures have to be combined with weights representing the likelihoods associated with the models (the $p(M_A^{A,2:B,i})$'s in Figure 3) and used to solve the models of B located on the second level of the modeling structure. The results are the conjectures of B's actions in each of the alternative models A has of B (we will provide examples of these computations in the following section). Given the conjectures of B's behavior in each of its alternative models, they can be combined with the probabilities assigned to each model (the $p(M_B^{A,j})$'s in Figure 3) to yield the overall expectation of B's behavior. [Footnote: This method of combining predictions of alternative models is identical to the one recommended in Russell and Norvig (1994).] Finally, given the conjecture as to B's behavior, A's decision-making situation can be solved for the optimal action that best coordinates with what is expected of B.
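A minimal sketch of this combination step follows. It is an illustrative Python rendering only; the conjectures, model probabilities and payoff entries are invented and do not correspond to any particular figure in the paper.

    import numpy as np

    # A's payoff matrix: rows are A's actions, columns are B's actions (invented numbers).
    payoff_A = np.array([[5.0, 1.0, 0.0],
                         [2.0, 4.0, 1.0],
                         [0.0, 2.0, 6.0]])

    # Conjectures about B's action obtained by solving each alternative model of B,
    # together with the probabilities A currently assigns to those models.
    conjectures = {"M_B_1": np.array([0.8, 0.1, 0.1]),
                   "M_B_2": np.array([0.1, 0.2, 0.7]),
                   "M_B_3": np.array([1/3, 1/3, 1/3])}   # e.g., a No-Information model
    model_probs = {"M_B_1": 0.4, "M_B_2": 0.4, "M_B_3": 0.2}

    # Overall expectation of B's behavior: conjectures weighted by model probabilities.
    p_B = sum(model_probs[m] * conjectures[m] for m in conjectures)

    # A's rational action: the one maximizing expected payoff against the combined conjecture.
    expected = payoff_A @ p_B
    print("combined conjecture about B:", p_B.round(3),
          "A's best action index:", int(np.argmax(expected)))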


2.2. Updating Probabilities of Agents' Models

As we described above, dynamic programming, in its bottom-up solution process, generates the conjectures of agents' behavior on each level of nesting. Thus, agent A, using the modeling structure in Figure 3, will have a conjecture as to B's behavior in each of the three alternative models of B's decision-making situation, and further will have alternative probability distributions that it thinks B may use to predict A's behavior, and so on. We will refer to the results of the bottom-up dynamic programming solution as prior conjectures. [Footnote: We use the term "prior" to describe the situation before the observation of behavior takes place. Our use of the term "conjecture" for a probability distribution over actions is consistent with the literature on game theory.] Thus, the prior conjecture that is the result of solving a model $M_B^{A,i}$ will assign a probability $p^{A,i}(a_{B,k} \mid M_B^{A,i})$ to each of agent B's actions $a_{B,k}$.

We should note that, in realistic cases, the predicted actions may be fairly high-level. For example, if the modeled agent is a computer user, the alternative actions may include the action of saving a given file (as in Gmytrasiewicz (1996)) and an action that does not include saving it; or, if the modeled agent is an air defense battery (as in Noh and Gmytrasiewicz (1997)), the alternatives may be the agent's attempts to intercept particular threats.

Naturally, the results of solving the alternative models in the modeling structure can be compared to observations of the actual behavior the agents are exhibiting, to update the likelihoods that each of the models is correct. The intuition is clear: agent A, for example, should update the probabilities assigned to the alternative models of agent B based on how closely the behavioral predictions of each alternative model match A's observations of B's behavior. Let us call A's observation of B $Obs_B^A$; it may be, for example, a sequence of Control-X key strokes of a computer user, or an action of loading a weapon and pointing it at a target in the air defense domain. The process of determining the likely high-level actions based on current observations, expressed, say, as a conditional probability of action given observation, $p(a_{B,k} \mid Obs_B^A)$, is, of course, the process of plan recognition, as discussed, for example, in Huber et al. (1994). Inversely, one can compute the probabilities of actual observable actions, given that the modeled agent is executing a high-level behavior, $p(Obs_B^A \mid a_{B,k})$. We are now ready to describe the process of updating the modeling probabilities using Bayesian updating as follows:

\[ p(M_B^{A,i} \mid Obs_B^A) = \frac{p(Obs_B^A \mid M_B^{A,i})\, p(M_B^{A,i})}{p(Obs_B^A)} \tag{1} \]

The above updates the probabilities assigned by agent A to alternative models of B given A's observation of B, $Obs_B^A$. Here $p(M_B^{A,i})$ is the prior probability assigned to the $i$th alternative model of B, $M_B^{A,i}$, and the probability $p(Obs_B^A)$ can be computed as a normalizing constant by demanding that the posterior probabilities associated with the alternative models add up to unity.


The conditional probability of an observation given an alternative model, $p(Obs_B^A \mid M_B^{A,i})$, can, in turn, be computed from the high-level behavioral conjectures obtained by solving the model as follows:

\[ p(Obs_B^A \mid M_B^{A,i}) = \sum_k p(Obs_B^A \mid a_{B,k})\, p(a_{B,k} \mid M_B^{A,i}) \tag{2} \]

The above expresses the likelihood of each particular observation, say, a sequence of key strokes or the loading of a weapon, as a sum of the probabilities that each high-level activity will result in the observation, weighted by the probability, $p(a_{B,k} \mid M_B^{A,i})$, of this activity according to the model, obtained from the dynamic programming solution.

The above formulation (Eq. 1), which allows the update of the likelihoods that agent A assigns to alternative models of agent B, can be extended to models nested deeper in the recursive model structure. Thus, while the probabilities of A's models of B are updated based on A's observation of B's moves, the models that A thinks B may use to model A can be updated based on how A thinks its own actions are being observed by B. Further, the models that A thinks B may think A has of B can be updated based on how A thinks B may think its behavior is being observed by A, and so on. Formally, the update on the second level of nesting, i.e., the update of the probabilities of the models $M_A^{A,i:B,j}$ that A thinks B may have of A, proceeds analogously to the update on the first level, as follows:

\[ p(M_A^{A,i:B,j} \mid Obs_A^{A,i:B}) = \frac{p(Obs_A^{A,i:B} \mid M_A^{A,i:B,j})\, p(M_A^{A,i:B,j})}{p(Obs_A^{A,i:B})} \tag{3} \]

where

\[ p(Obs_A^{A,i:B} \mid M_A^{A,i:B,j}) = \sum_l p(Obs_A^{A,i:B} \mid a_{A,l})\, p(a_{A,l} \mid M_A^{A,i:B,j}). \tag{4} \]

In the above, the conditional probabilities $p(a_{A,l} \mid M_A^{A,i:B,j})$ of A's actions $a_{A,l}$, given a model B may have of A, $M_A^{A,i:B,j}$, are the results of solving the models on the second level of nesting of the recursive model structure. The update of the probabilities associated with models nested deeper proceeds analogously. We should note that the nested observations, like $Obs_A^{A,i:B}$ that A attributes to B in its $i$th model of B, are well defined. This is so due to our earlier assumption that the model, in this case $M_B^{A,i}$, fully specifies an agent. This full specification of B includes attributes like B's current position, its sensing abilities, and so on, and it results in a unique specification of B's observation of A's actions.
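The first-level update of Eqs. 1 and 2 is simple enough to state as code. The sketch below is an illustrative Python rendering, not the authors' LISP implementation; the function and argument names are ours.

    import numpy as np

    def update_model_probs(prior, conjectures, p_obs_given_action, obs):
        """Bayesian update of the probabilities of the alternative models (Eqs. 1 and 2).

        prior[i]                 -- p(M_B^{A,i}) before the observation
        conjectures[i][k]        -- p(a_{B,k} | M_B^{A,i}) from the bottom-up solution
        p_obs_given_action[o][k] -- p(Obs_o | a_{B,k}), the observation model
        obs                      -- index of the observation actually made
        """
        prior = np.asarray(prior, dtype=float)
        # Eq. 2: likelihood of the observation under each alternative model.
        likelihood = np.array([np.dot(p_obs_given_action[obs], c) for c in conjectures])
        # Eq. 1: posterior; p(Obs) is obtained by normalization.
        posterior = likelihood * prior
        return posterior / posterior.sum()

When the modeled agent's actions are assumed to be directly observable, as in the air defense example of the next section, p_obs_given_action reduces to the identity matrix and the likelihood of an observation is simply the conjectured probability of the observed action.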


3. Examples

In this section we present applications of the formalism described above in two domains. In the air defense domain the agents are assumed to be responsible for defending a particular territory from an incoming missile attack and are unable to communicate with each other. Therefore, they have to rely on modeling each other to coordinate their defense activities. In the pursuit domain we have five agents, four of which, called predators, are trying to capture the fifth agent, called the prey. In each domain we first describe how the, possibly uncertain, modeling leads to behavioral predictions, and how the likelihoods associated with different models can be revised based on Bayesian updating, leading to improved coordination.

3.1. The Air Defense Domain

As we mentioned, the task of the automated defense units is to defend a specified territory and to coordinate their attempts to intercept the attacking missiles, given the characteristics of the threat and given what they can expect of the other defense units. To achieve the primary task successfully, an automated defense unit needs to model the other agents, either human or automated, that control the other defense batteries. Our approach to this coordinated decision-making problem is based on the assumption that the task of each of the defense units is to minimize the overall damage to the attacked territory. Further, under a realistic threat situation, friendly defense units cannot expect to have advance knowledge of the character of the incoming attack. In such cases, coordination requires an agent to recognize the current status, and to model the actions of the other agents to decide on its own next behavior.

Our anti-air defense scenario (see Figure 1) has six incoming missiles and two defending units in a 20 by 20 grid world. Each of the defense units independently decides to launch interceptors against the incoming missiles in the absence of communication. The incoming missiles fly straight down and attack the overall ground site on which the units are located. We will analyze the situation from Battery1's point of view and assume that it is uncertain about how to model Battery2. Namely, it is possible that Battery2 has run out of short range interceptors, or that it has been somehow incapacitated by the enemy fire altogether. Thus Battery2's decision-making situation could be modeled as one of three cases: Battery2 is fully operational and has both short and long range interceptors; Battery2 is operational but has only long range interceptors; or Battery2 has been damaged or incapacitated, in which case there is no information as to what action it would undertake. If Battery2 has only long range missiles it can only intercept the distant targets (A, B and C), even though it is faced with closer targets with larger threat values.


As we mentioned, each of the models has the form of a payoff matrix. Each of the payoffs, corresponding to the defense batteries' firings at incoming threats, is determined based on the threat of each missile, which is a combination of its size and distance to the target, and on the probability of threat interception. We calculate the missile threat, MT, using the following formula:

\[ MT_n = \frac{W_n}{A_n} \tag{5} \]

where
- $W_n$ is the warhead size of missile $n$
- $A_n$ is the altitude of missile $n$

The probability of interception is assumed to depend on the angle between a missile's direction of motion and the battery's line-of-sight. This probability is maximized when this angle is 0, as follows:

\[ P(HIT_{ij}) = e^{-\alpha \theta_{ij}} \tag{6} \]

where $\theta_{ij}$ is the angle between battery $i$'s line-of-sight and missile $j$'s direction of motion, and $\alpha$ is an interceptor-specific constant (assumed here to be 0.01).

Now, say that Battery1 is faced with $n$ missiles and that it targets missile $j$. The resulting threat will be reduced by the missile threat $MT_j$ multiplied by the probability of successful interception, denoted $P(HIT_{1j})$. If both batteries target missiles at the same time, the reduction of threat, and therefore the total payoff, i.e., the utility, is equal to the sum of the threats that each of them removes.

For the example scenario depicted in Figure 1, where the sizes of the threats are 470, 410, 350, 370, 420, and 450 for threats A through F, respectively, we arrive at the recursive model structure depicted in Figure 4. As we mentioned, the top level in Figure 4 represents the way that Battery1 observes the situation to make its own decision, shown as Battery1's payoff matrix. The second level depicts the models Battery1 has of Battery2's situation, and level 3 contains the models that Battery1 anticipates Battery2 may be using to model Battery1. The recursive modeling could continue into deeper levels, but in this case we assumed that the batteries have no further information. In other words, we are examining the reasoning of Battery1 in the particular case when it is equipped with a finite amount of information about Battery2, nested to the third level of modeling.
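As an illustration, the threat and interception model of Eqs. 5 and 6 can be written as follows. This is a sketch in Python, not the authors' code; the symbol used for the decay constant (here alpha) and the function names are ours.

    import math

    def missile_threat(warhead_size, altitude):
        """Eq. 5: MT_n = W_n / A_n."""
        return warhead_size / altitude

    def p_hit(angle_deg, alpha=0.01):
        """Eq. 6: probability of interception, decaying with the angle between the
        battery's line-of-sight and the missile's direction of motion."""
        return math.exp(-alpha * angle_deg)

    def threat_reduction(warhead_size, altitude, angle_deg):
        """Expected threat removed if a battery fires at this missile; the total
        payoff for a pair of firings is the sum of the two batteries' reductions."""
        return missile_threat(warhead_size, altitude) * p_hit(angle_deg)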


[Figure 4 diagram: Level 1 is Battery1's own 6x6 payoff matrix over the targets A through F for both batteries. Level 2 holds three alternative models of Battery2: with belief 0.4, Battery2 with both short and long range interceptors (a full 6x6 payoff matrix); with belief 0.4, Battery2 with only long range interceptors (rows A through C only); and with belief 0.2, a No-Information model with the uniform distribution [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]. Level 3 holds No-Information models of Battery1, ranging over all distributions from [1,0,0,0,0,0] to [0,0,0,0,0,1].]

Figure 4. Recursively Nested Models in Air Defense Scenario.

The three models on the second level of the structure in Figure 4 correspond to Battery1's belief that Battery2 has both short and long range interceptors with probability 40%, in which case it can target any of the six incoming threats; that Battery2 has only long range interceptors with probability 40%, in which case it can only target the distant threats labeled A, B, and C; and that Battery2 has malfunctioned with probability 20%, in which case there is no information as to what it will do.

The dynamic programming bottom-up solution of this structure starts at Level 2 and reveals that, if Battery2 has both short and long range interceptors, the probability distribution over Battery2's actions becomes [0.00, 0.00, 0.00, 0.08, 0.21, 0.71] for interception of missiles A through F, respectively (this computation is not important from the point of view of belief revision, and we refer the reader to Noh and Gmytrasiewicz (1997) for details of the logic sampling algorithm used). Thus, if Battery2 has all of the needed ammunition, it will most likely attempt to intercept one of the closer threatening missiles. The case of Battery2 lacking short range interceptors results in a [0.70, 0.22, 0.08, 0.00, 0.00, 0.00] distribution over Battery2's limited actions of intercepting missiles A through F, respectively. Thus, if Battery2 has no short range ammunition, it would be irrational for it to attempt to intercept any of the closer threats, and it will most likely try to intercept missile A.

The results obtained above can be used to update the probabilities of Battery2's alternative models.


[Figure 5 diagram: three snapshots (a), (b), (c) of the scenario, showing Battery1's own matrix $M_1$ and its beliefs over the models $M_2^{1,1}$, $M_2^{1,2}$, $M_2^{1,3}$: (a) initial beliefs [0.4, 0.4, 0.2], Battery1 targets missile D; (b) after Battery2's first launch, [0.89, 0.0, 0.11], Battery1 targets missile A; (c) after the second launch, [0.98, 0.0, 0.02], Battery1 targets missile B.]

Figure 5. Battery1's belief revision about Battery2. In this scenario, Battery2 has both short and long range interceptors.

Figure 5 illustrates this process, under the simplifying assumption that the firings of Battery2 are directly observable by Battery1 (this amounts to an assumption that A's observations $Obs_B^A$ and B's actions $a_{B,k}$ in Eq. 2 are identical). For example, the first update of the models, in response to Battery2 launching an interceptor (labeled 2) at missile F, proceeds as follows:

$p(M_2^{1,1} \mid F) = p(F \mid M_2^{1,1})\, p(M_2^{1,1}) / p(F) = (0.71 \times 0.40)/0.316 = 0.894$
$p(M_2^{1,2} \mid F) = p(F \mid M_2^{1,2})\, p(M_2^{1,2}) / p(F) = (0.00 \times 0.40)/0.316 = 0.000$
$p(M_2^{1,3} \mid F) = p(F \mid M_2^{1,3})\, p(M_2^{1,3}) / p(F) = (0.16 \times 0.20)/0.316 = 0.106$

where
- $M_2^{1,1}$ is Battery1's model of Battery2 equipped with short range and long range interceptors;
- $M_2^{1,2}$ is Battery1's model of Battery2 with only long range interceptors;
- $M_2^{1,3}$ is Battery1's model according to which Battery2 has been damaged or incapacitated;

and with the probability distributions representing Battery2's expected actions, [0.0, 0.0, 0.0, 0.08, 0.21, 0.71], [0.70, 0.22, 0.08, 0.0, 0.0, 0.0] and [0.16, 0.16, 0.16, 0.16, 0.16, 0.16], computed before from the models $M_2^{1,1}$ through $M_2^{1,3}$, respectively.

Figures 5 and 6 illustrate the process of Battery1's revising its beliefs in two example scenarios. In Figure 5, Battery2 is equipped with both short and long range interceptors, and has launched interceptors 2, 4 and 6. In Figure 6, Battery2 has only long range interceptors, also numbered 2, 4, and 6. $M_1$ is Battery1's own payoff matrix in each figure. $M_2^{1,1}$, $M_2^{1,2}$, and $M_2^{1,3}$ are the alternative models of Battery2, and Battery1's initial belief about Battery2 is shown in (a).
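The same update, written out for the numbers above, looks as follows. This is a sketch for illustration only; the firings are assumed to be directly observable, so the observation model is the identity.

    import numpy as np

    # Battery1's prior over its three models of Battery2, and the conjectures obtained
    # by solving them (the damaged/incapacitated model yields a uniform distribution).
    prior = np.array([0.40, 0.40, 0.20])
    conjectures = np.array([
        [0.00, 0.00, 0.00, 0.08, 0.21, 0.71],   # M_2^{1,1}: short and long range
        [0.70, 0.22, 0.08, 0.00, 0.00, 0.00],   # M_2^{1,2}: long range only
        [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]])        # M_2^{1,3}: no information

    obs = 5                               # Battery2 is observed firing at missile F
    likelihood = conjectures[:, obs]      # firings are assumed directly observable
    posterior = likelihood * prior
    posterior /= posterior.sum()
    print(posterior.round(3))   # approx. [0.895, 0.0, 0.105]; the text rounds
                                # p(F | M_2^{1,3}) to 0.16, giving 0.894 and 0.106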


[Figure 6 diagram: the corresponding snapshots when Battery2 has only long range interceptors: (a) initial beliefs [0.4, 0.4, 0.2], Battery1 targets missile D; (b) after Battery2's first launch, [0.0, 0.50, 0.50], Battery1 targets missile F; (c) after the second launch, [0.0, 0.61, 0.39], Battery1 targets missile E.]

Figure 6. Battery1's belief revision about Battery2. In this scenario, Battery2 has only long range interceptors.

Battery1's beliefs after the first interceptor launch, and after the second launch, are shown in (b) and (c), respectively. If Battery2 has both kinds of missiles, Battery1's belief in $M_2^{1,1}$ converged to almost 100% after two interceptors were launched. But if Battery2 was equipped with only long range missiles, Battery1's belief in $M_2^{1,2}$ changed less radically. In this case Battery1 remained unsure whether Battery2 had only long range missiles or whether it had malfunctioned. As we will show, the Bayesian belief update enhances the quality of coordination during subsequent interactions, by correctly recognizing the models of the other agents from their behavior.

3.1.1. Experiments

The anti-air defense simulator is written in Common LISP and built on top of the MICE simulator, running on a LINUX platform. In the experiments we ran, each of the two defense units could launch three interceptors, and the units were faced with an attack by six incoming missiles. We ran all of the trials under the following conditions. First, the initial positions of the missiles were randomly generated. Second, the performance of agents with different policies was compared using the same threat situation. Third, the size of each missile was constant during all of the experiments and the warhead sizes were comparable. Finally, although there was no communication between agents, each agent could see which threat was shot at by the other agent and use this information to make its next decision.

Our experiments were aimed at determining the quality of modeling and coordination achieved by the RMM agents in a team, when paired with human agents, and when compared to other strategies. To evaluate the quality of the agents' performance, the results were expressed in terms of (1) the number of selected targets, i.e., targets the defense units attempted to intercept, and (2) the total expected damage to friendly forces after all six interceptors were launched.


[Figure 7 is a bar chart of the average number of selected targets (scale 0 to 6) for the strategies: RMM, RMM with learning, RMM-Human, RMM-Ind., Ind., Human, RMM-Random, and Random.]

Figure 7. Average number of selected targets (over 100 runs).

These two measures reveal the quality of achieved coordination, and therefore of agent modeling: if the units miscoordinated and redundantly targeted the same threat, the threats not intercepted were free to penetrate the defenses and inflict high damages. The total expected damage was defined as the sum of the residual warhead sizes of the attacking missiles. The target selection strategies are as follows:
- Random: targets selected randomly.
- Independent, no modeling: each unit selects targets to minimize its own damage.
- Human: targets selected by a human.
- RMM: targets selected by RMM without belief update.
- RMM with learning: targets selected by RMM with belief update.

As shown in Figures 7 and 8, we found that the all-RMM team outperformed the human and independent teams. [Footnote: Our human subjects were CSE and EE graduate students; our efforts to compare the performance to experts in anti-air defense were obstructed by the fact that the anti-air defense doctrine remains classified.] The average number of selected targets by RMM after 100 trials is 5.49 ($\sigma_{\bar{X}} = 0.05$) [Footnote: $\sigma_{\bar{X}}$ denotes the standard error of the mean.], compared to 4.89 ($\sigma_{\bar{X}} = 0.08$) for the independent team and 4.77 ($\sigma_{\bar{X}} = 0.06$) for the all-human team. Further, the RMM-controlled coordinated defense resulted in a total expected damage of 488.0 ($\sigma_{\bar{X}} = 23.4$), which is much less than that of the independent team (732.0, $\sigma_{\bar{X}} = 37.1$) and that of the all-human team (772.0, $\sigma_{\bar{X}} = 36.3$).

The effect of the improvement due to Bayesian updating is shown as well. The all-RMM with learning team showed better performance than the all-RMM team.


[Figure 8 is a bar chart of the average total expected damage (scale 0 to 1200) for the same strategies.]

Figure 8. Average total expected damage (over 100 runs).

The average number of selected targets when the agents were updating the probabilities associated with each other's models was 5.72 ($\sigma_{\bar{X}} = 0.05$) and the total expected damage was 407.04 ($\sigma_{\bar{X}} = 19.1$).

In order to test whether the observed differences among the target selection strategies were due to chance, we used an analysis of variance with a 0.01 significance level. Here, the all-human team and the RMM-Human team were left out, because of the relatively small number of participating human subjects. [Footnote: In our preliminary experiment, there were 4 pairs each of the all-human team and the RMM-Human team. The results for humans are only suggestive and could not be evaluated statistically due to the small number of human participants. We will conduct more exhaustive experiments in future work.] In the experiment in which the number of selected targets was measured (Figure 7), F(5, 1) = 10.31, p < 0.01. Therefore, we can conclude that the differences among the five target selection strategies are not due to chance, with probability 99%. This result also holds for the experiment in which the total expected damage was measured. To test the significance of the observed superiority of coordination achieved by the RMM with learning team vs. the other non-human teams, t tests were performed. The results show that the RMM with learning team was better than any other team with probability 99% (0.01 level of significance).

3.2. The Pursuit Game Domain

In the pursuit domain four predator agents coordinate their actions to capture one prey agent by surrounding it on all four sides. In our simulations the prey agent was programmed to avoid capture by increasing its distance from the center of gravity of the four predator agents.


Figure 9. Simulation of the Pursuit Game Domain

Each of the predator agents was in turn attempting to minimize the distance between itself and the prey, as well as to increase the number of sides from which the prey was surrounded. These factors were then combined to form the payoffs of the payoff matrices, as outlined in Levy and Rosenschein (1992); a rough sketch of such a scoring appears below. The size of the payoff matrices involved precludes us from displaying them here. Due to the five agents taking part in the interaction, the payoff matrices were five-dimensional.

To experiment with Bayesian updating, we allowed for various measures of uncertainty as to the identity of the agents involved. Thus, the predator agents (labeled 1 through 4 in Figure 9) did not know with certainty which of the four other agents they were interacting with was the prey. The uncertainty as to the identity of the other agents led to four alternative models for each of the predator agents. In each of the models, a different one of the four other agents was hypothesized to be the prey, with each model consisting of four five-dimensional payoff matrices. The probabilities associated with each model were updated based on the observed behavior of the other agents. Since the prey was programmed to run away from the center of gravity of the predators (and was not bluffing), the probability of the model that correctly specified the prey agent increased during the course of each interaction.
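As a hedged illustration of how the two factors named above (distance to the prey and number of occupied sides) could be combined into a single payoff entry, consider the sketch below. The actual construction follows Levy and Rosenschein (1992); the weights, the Manhattan-distance choice and the function names here are invented.

    def predator_payoff(predator_pos, all_predator_pos, prey_pos,
                        w_distance=1.0, w_surround=5.0):
        """Score one predator's outcome for a joint configuration of positions:
        reward being close to the prey and having the prey's sides occupied."""
        px, py = predator_pos
        qx, qy = prey_pos
        distance = abs(px - qx) + abs(py - qy)        # Manhattan distance to the prey
        sides = [(qx, qy + 1), (qx, qy - 1), (qx + 1, qy), (qx - 1, qy)]
        occupied = sum(pos in all_predator_pos for pos in sides)
        return -w_distance * distance + w_surround * occupied

    # Example: predator at (3, 4), prey at (5, 4), another predator north of the prey.
    print(predator_payoff((3, 4), [(3, 4), (5, 5)], (5, 4)))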


[Figure 10 plots the time to capture in simulation steps (0 to 60) against the model information entropy for the first RMM level (0.0 to 1.0), for the non-learning and learning cases, with the second RMM level entropy fixed at 0.2.]

Figure 10. Capture Times for the Pursuit Game

To show the influence of learning on the quality of coordination we ran a series of experiments, varying the degree to which the predators were uncertain as to who was the prey, with Bayesian learning turned on and off. The results are in Figure 10. During the experiments we varied the initial uncertainty of the predator agents: at one extreme we equipped them with perfect knowledge; at the other extreme they had no information at all as to who was who and used a uniform probability distribution over all four alternative models. For each initial uncertainty we computed the entropy of the distribution over the alternative models, with the entropy of the distribution representing perfect knowledge being zero, and the entropy of the uniform distribution being scaled to unity.
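The normalization used for the horizontal axis of Figure 10 can be sketched as follows; the function name is ours.

    import numpy as np

    def normalized_entropy(model_probs):
        """Entropy of the distribution over alternative models, scaled so that the
        uniform distribution has entropy 1 and a point distribution has entropy 0."""
        p = np.asarray(model_probs, dtype=float)
        p = p[p > 0.0]
        return float(-(p * np.log(p)).sum() / np.log(len(model_probs)))

    print(normalized_entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0: perfect knowledge
    print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))   # 1.0: no information at all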


The upper line in Figure 10 shows how the time it took the predators to capture the prey varied with the predators' level of uncertainty, in ten randomly initialized runs for each level of uncertainty. Thus, if the predators had perfect knowledge, it took them on average a little over 21 time steps to capture the prey, but if they had no knowledge at all, they were unable to capture the prey at all (within the time limit of 50 steps on each simulation run). The lower curve in Figure 10 shows how predators that used Bayesian updating were able to improve their time to capture. As expected, the benefit of learning increases with the initial uncertainty among the predators. Predators starting with no information, which without learning were unable to capture the prey, were able to do so with learning, and achieved capture times close to those of predators with perfect knowledge. The relatively low variation of the capture time with increasing initial uncertainty in the learning case was likely due to the agents learning quite quickly from their initial observations, during which all of the predators were grouping together and the prey was escaping. This pattern was visible no matter how uncertain the predators were to start with. As a matter of comparison, we should remark that four human players equipped with perfect knowledge can capture a prey agent in about 18 time steps.

Our experiments in the pursuit domain also included Bayesian updates on the second level of nesting in the recursive model structure (during the runs summarized in Figure 10 the uncertainty on the second level remained unchanged, with an entropy of 0.2). The results of the deeper level updates, however, showed that learning on the second level has a significantly smaller effect on the capture times. In fact, in the experiments so far the difference is not statistically significant. On one hand, this result is intuitive in that humans rarely bother to reason on deeper levels of nesting, presumably due to its lower usefulness. On the other hand, this phenomenon may well be a feature of the domain. In other words, while updating second level models did not improve the quality of interactions in the pursuit domain on average, in some special cases it may be of crucial importance, as anecdotal evidence in human coordination illustrates (Clark and Marshall, 1981; Halpern and Moses, 1990).

4. Conclusions

We presented a framework for Bayesian updating of beliefs about models of agent(s) based on their observed behavior. We used the formalism of the Recursive Modeling Method, which was particularly suitable since it yields probabilistic predictions of agents' behavior given their alternative models. The update we use is relatively straightforward and well known, but our contribution lies in formulating models amenable to this update, and in applying it to recursively nested levels. Our work shows how the probabilities associated with the models evolve during repeated interactions.


Thus, while initially agents may start interacting in a state of complete uncertainty, as they have the chance to observe each other's behavior they can increase their confidence in each other's capabilities and intentions, and use this information to improve the quality of coordination during subsequent interactions. The experimental results we presented in two domains show the gains in interaction quality due to Bayesian updating, and suggest that this approach will be useful in other domains as well.

Our framework can be extended in several ways. Currently, its main limitation is that it describes how to update the probabilities of models that already exist, but it does not include the possibility of building, or hypothesizing, new models. Thus, if none of the possible intentional models predicts the observed behavior, the probability associated with all of them decreases, while the probability associated with the No-Information model on the same level increases. This, of course, is the way for the system to tell us that none of the other existing models is adequate. This effect should trigger a search for new hypothetical models, which is at the center of our current research efforts. Another obvious extension, also on our research agenda, is to apply the Bayesian update in other domains, for example to adaptive human-computer interaction, such as the settings described in Gmytrasiewicz (1996) and in Jameson et al. (1995).

References

Albrecht, D., Zukerman, I., and Nicholson, A. Bayesian models for keyhole plan recognition in an adventure game. In this volume.
Albrecht, D., Zukerman, I., Nicholson, A., and Bud, A. (1997). Towards a Bayesian model for keyhole plan recognition in large domains. In Proceedings of the Sixth International Conference on User Modeling, 363-365.
Allen, J. F. (1990). Two views of intention. In Cohen, P. R., Morgan, J., and Pollack, M. E., eds., Intentions in Communication. MIT Press.
Binmore, K. (1982). Essays on Foundations of Game Theory. Pitman.
Chiu, B. C., and Webb, G. I. Using decision trees for agent modelling: A study on improving prediction performance. In this volume.
Clark, H. H., and Marshall, C. R. (1981). Definite reference and mutual knowledge. In Joshi, A. K., Webber, B. L., and Sag, I. A., eds., Elements of Discourse Understanding. Cambridge, UK: Cambridge University Press. 10-63.
Cooper, G., and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9:309-347.
Dennett, D. (1986). Intentional systems. In Dennett, D., ed., Brainstorms. MIT Press.
Friedman, N., and Halpern, J. Y. (1994). A knowledge-based framework for belief change. Part I: Foundations. In Proceedings of the Sixth Conference on Theoretical Aspects of Reasoning about Knowledge.
Gmytrasiewicz, P. J., and Durfee, E. H. (1995). Formalization of recursive modeling. In Proceedings of the First International Conference on Multiagent Systems, ICMAS'95.
Gmytrasiewicz, P. J. (1996). An approach to user modeling in decision support systems. In Proceedings of the Fifth International Conference on User Modeling.
Halpern, J. Y., and Moses, Y. (1990). Knowledge and common knowledge in a distributed environment. Journal of the ACM 37(3):549-587.


Howard, R. A., and Matheson, J. E. (1984). Influence diagrams. In Howard, R. A., and Matheson, J. E., eds., Readings on Principles and Applications of Decision Analysis. Strategic Decisions Group, Menlo Park, CA. 721-762.
Huber, M. J., Durfee, E. H., and Wellman, M. P. (1994). The automated mapping of plans for plan recognition. In Proceedings of the 1994 Conference on Uncertainty in Artificial Intelligence.
Jameson, A., Shaefer, R., Simons, J., and Weis, T. (1995). Adaptive provision of evaluation-oriented information: Tasks and techniques. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1886-1893.
Jameson, A. (1989). But what will the listener think? Belief ascription and image maintenance in dialog. In Kobsa, A., and Wahlster, W., eds., User Models in Dialog Systems. Springer-Verlag.
Levy, R., and Rosenschein, J. S. (1992). A game theoretic approach to the pursuit problem. In Working Papers of the Eleventh International Workshop on Distributed Artificial Intelligence, 195-213.
Myerson, R. B. (1991). Game Theory: Analysis of Conflict. Harvard University Press.
Noh, S., and Gmytrasiewicz, P. J. (1997). Agent modeling in antiair defense. In Proceedings of the Sixth International Conference on User Modeling, 389-400.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Poh, K. L., and Horvitz, E. J. (1996). A graph-theoretic analysis of information value. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI96).
Russell, S., and Norvig, P. (1994). Artificial Intelligence: A Modern Approach. Prentice Hall.
Sen, S., and Knight, L. (1995). A genetic prototype learner. In Proceedings of the International Joint Conference on Artificial Intelligence, 84-89.
Spirtes, P., Glymour, C., and Scheines, R. (1993). Causation, Prediction and Search. Springer Verlag.
