A Competitive Approach to Game Learning

Christopher D. Rosin and Richard K. Belew
Cognitive Computer Science Research Group
CSE Department, University of California, San Diego
La Jolla, CA 92093-0114
{crosin,rik}@cs.ucsd.edu

Abstract

Machine learning of game strategies has often depended on competitive methods that continually develop new strategies capable of defeating previous ones. We use a very inclusive definition of game, and consider a framework within which a competitive algorithm makes repeated use of a strategy learning algorithm that can learn strategies which defeat a given set of opponents. We describe game learning in terms of sets H and X of first-player and second-player strategies, and connect the model with more familiar models of concept learning. We examine the performance of competitive algorithms that use both worst-case and randomized strategy learning algorithms. A new randomized competitive algorithm is given that solves games in a total number of strategy learning calls polynomial in lg(|H|), lg(|X|), and the specification number k. Its use is demonstrated with examples, including an application to concept learning with a new kind of counterexample oracle. We conclude with a complexity result for game learning and a list of open questions arising from this work.

1 Introduction

Empirical work has been done in a number of domains on machine-learning systems that learn to play games by using data from their own play. Typically, a series of strategies for the game are produced during learning, progressively getting stronger. Many of these game learning systems use a competitive approach that repeatedly produces new strategies capable of defeating older ones. Examples include Samuel's classic work on checkers [22], reinforcement learning of backgammon evaluation networks [27, 23, 28], and the genetic algorithm, which has used competition in domains such as evolving sorting networks [11], controller design [24], and games [5, 25].

Our definition of "game" in this paper is very inclusive. It allows us to consider much more than traditional discrete board games; for example, differential games can be treated using discrete approximations. The intuition from board games is used throughout, but the framework depends only on the existence of strategies and of learning algorithms for them.

The central component of the framework we consider is a strategy learning algorithm that is able to learn strategies which defeat a given set of opponents. A competitive algorithm then repeatedly uses the strategy learning algorithm to discover strong strategies for the game. We seek a competitive algorithm capable of learning perfect strategies for any game in polynomial time.

In Section 2 we give details of our model of game learning, describe its connection to familiar models of concept learning, and mention some related work. Section 3 motivates the consideration of both worst-case and randomized strategy learning algorithms, and gives the parameters necessary for measuring competitive algorithm performance in each case. We then examine several competitive algorithms motivated by those used in practice. Section 4 presents two simple competitive algorithms, and shows examples on which they can fail to learn perfect strategies in polynomial time. A competitive algorithm that meets our performance goals with both worst-case and randomized strategy learning algorithms is given in Section 5. Examples of its use are described, including an application to concept learning with a new kind of counterexample oracle. Section 6 explores the computational complexity of game learning, and Section 7 discusses several open problems.

2 Preliminaries


2.1 Definition of Games

A game is a function G which maps two inputs h and x (first-player and second-player strategies) to an outcome



G(h, x). The first-player strategy h comes from a set H of possible first-player strategies, and the second-player strategy x comes from a set X of possible second-player strategies. This is a simple, inclusive view of games: no structure of play (sequential presentation of moves, etc.) is assumed, and only one bit of outcome information is presented, identifying the winner. For simplicity, no ties are allowed, and the game is assumed to be deterministic.

The basic notation h > x indicates that strategy h defeats strategy x. (Strictly, > is only meaningful in the context of a particular game G and should be subscripted >G; whenever we use this notation the game is clear from context, so the subscript is dropped. This is also true for several other definitions.) The notation is extended to sets of strategies: A > B means ∀b ∈ B, ∃a ∈ A such that a > b, and a > B means ∀b ∈ B, a > b. A first-player strategy h is perfect if h > X, that is, if it defeats every possible second-player strategy; perfect second-player strategies are defined similarly. (Many board games are largely symmetric; the main reason for making a distinction between first-player and second-player strategies is the existence of games with a perfect strategy for one player but not the other.)
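To make the abstraction concrete, the following sketch encodes a game as a one-bit outcome function over finite explicit strategy sets. It is purely illustrative: the names (Game, defeats, is_perfect_first) and the tiny integer game are assumptions, not notation from this paper.

    class Game:
        """A game G maps a pair of strategies (h, x) to a one-bit outcome."""
        def __init__(self, H, X, outcome):
            self.H = list(H)            # set of possible first-player strategies
            self.X = list(X)            # set of possible second-player strategies
            self.outcome = outcome      # deterministic outcome function, no ties

        def defeats(self, h, x):
            """h > x : first-player strategy h defeats second-player strategy x."""
            return self.outcome(h, x) == 1

        def is_perfect_first(self, h):
            """h is perfect iff h > X, i.e. h defeats every possible x."""
            return all(self.defeats(h, x) for x in self.X)

    # Toy instance: integer "strategies" 0..3, h wins exactly when h >= x.
    G = Game(H=range(4), X=range(4), outcome=lambda h, x: int(h >= x))
    print([h for h in G.H if G.is_perfect_first(h)])    # -> [3]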

2.2 Game Learning Framework

2.2.1 Strategy Learning Algorithms

Strategies are produced in our framework by strategy learning algorithms, denoted L1 for the first player and L2 for the second player. Given a set A of second-player strategies, L1(A) returns a first-player strategy capable of defeating every member of A, so that L1(A) > A; L2 is defined in the same way, taking a set of first-player strategies as its input. A strategy learning algorithm fails when no such strategy exists. How L1 and L2 find their strategies is left unspecified; they may, for example, use domain-specific heuristic search or reinforcement learning, as in the systems mentioned in the introduction. For the most part we consider exact learning, which seems a good way to start and simplifies the analysis; extensions to approximate learning are possible. Throughout, we assume that a perfect first-player strategy exists, and the goal of the competitive algorithm is to find one.

2.2.2 Structure of the Competitive Algorithm

The competitive algorithm is an outer loop that repeatedly calls the strategy learning algorithms. More formally, let Fi and Si be the sets of first-player and second-player strategies, respectively, that are available to the competitive algorithm at the end of step i; F0 and S0 are initialized to the empty set. Step i + 1 of the competitive algorithm is:

1. Fi+1 is initialized to Fi, and Si+1 is initialized to Si.
2. Some subset AS ⊆ Si+1 is chosen.
3. L1 is called on AS, and the returned strategy is added to Fi+1.
4. Some subset AF ⊆ Fi+1 is chosen.
5. L2 is called on AF, and the returned strategy is added to Si+1.

How the subsets AS and AF are chosen is intentionally left open; different choices give the different competitive algorithms we consider. Termination occurs when a call to a strategy learning algorithm fails. In the above procedure this happens in step (5) whenever AF contains a perfect strategy, so termination is reached once a perfect strategy has been produced and chosen as an opponent.
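The sketch below follows the five-step loop above, with the subset-selection rules and the strategy learners passed in as parameters. The function names (competitive_loop, choose_AS, choose_AF), the failure convention (a learner returns None when no defeating strategy exists), and the step budget are illustrative assumptions layered on top of the framework.

    def competitive_loop(L1, L2, choose_AS, choose_AF, max_steps=1000):
        """Outer loop of the competitive algorithm (steps 1-5 above).

        L1(A): a first-player strategy defeating every member of A, or None on failure.
        L2(A): a second-player strategy defeating every member of A, or None on failure.
        choose_AS(S), choose_AF(F): the subset-selection rules left open by the model.
        """
        F, S = [], []                      # F0 and S0 are initialized to the empty set
        for _ in range(max_steps):
            F, S = list(F), list(S)        # step 1: F_{i+1} := F_i, S_{i+1} := S_i
            AS = choose_AS(S)              # step 2: choose AS, a subset of S_{i+1}
            h = L1(AS)                     # step 3: call L1 on AS
            if h is not None:
                F.append(h)                #         add the returned strategy to F_{i+1}
            AF = choose_AF(F)              # step 4: choose AF, a subset of F_{i+1}
            x = L2(AF)                     # step 5: call L2 on AF
            if x is None:                  # L2 fails: AF contains a perfect strategy
                return AF                  # termination: a perfect strategy has been found
            S.append(x)                    #         otherwise add it to S_{i+1}
        return None                        # step budget exhausted without termination

For example, letting choose_AS and choose_AF return only the single most recently added strategy gives the simple algorithm of Section 4.1.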

2.2.3 Notes on the Learning Model

Several points of the model are intentionally left unspecified so that it can describe the variety of game learning systems used in practice. The competitive algorithm is free to choose which of the available opponents each strategy learning call must defeat, and the strategy learning algorithms may work in whatever way is appropriate for the game.

As a concrete example of how existing work can be viewed in this framework, consider Samuel's original work on checkers [22], in which evaluation functions were learned through self-play. Games were played between a fixed player (Beta) and a learning player (Alpha). Alpha would learn from these games via Samuel's reinforcement method; this corresponds to the strategy learning algorithm. When Alpha was finally able to defeat Beta, Beta was replaced by Alpha; this replacement is the job of the competitive algorithm, which then trains against the new strategy. In other empirical work the competitive algorithm is not as explicit. Tesauro's backgammon system, for example, uses reinforcement learning during self-play, so that new strategies are in effect trained against the current strategy itself, making Beta equal to Alpha [20].

2.3 Correspondence with Concept Learning

Though it was not designed explicitly to do so, our framework for game learning shares several similarities with familiar models of concept learning. The strategy space H corresponds to concept learning's hypothesis space, and the second-player strategies X correspond to examples. The assumption that a perfect first-player strategy exists corresponds to the assumption that the hypothesis space contains a hypothesis consistent with a fixed target concept: a hypothesis h defeats an example x when h classifies x the same way the target does, and a perfect strategy is then a consistent hypothesis. In this view a call to L2 acts as a kind of counterexample oracle: presented with a set of hypotheses, it returns an example that all of them misclassify, and with a single hypothesis this is essentially the counterexample returned by an equivalence query. The competitive algorithms studied here can therefore also be read as concept learning protocols; Section 5 describes an application of this correspondence using a new kind of counterexample oracle.
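As an illustration of this correspondence, the sketch below encodes a small concept learning problem as a game in the sense of Section 2.1: hypotheses are first-player strategies, examples are second-player strategies, and h defeats x exactly when h agrees with the target on x. The threshold hypothesis class, the fixed target, and the helper names are illustrative assumptions.

    # Hypotheses and examples: threshold concepts over the domain 0..9.
    DOMAIN = list(range(10))
    H = [(lambda z, t=t: int(z >= t)) for t in range(11)]  # first-player strategies
    X = DOMAIN                                             # second-player strategies
    target = lambda z: int(z >= 7)                         # fixed target concept

    def defeats(h, x):
        """h > x : hypothesis h classifies example x the same way the target does."""
        return h(x) == target(x)

    def L2(hypotheses):
        """Counterexample oracle on a *set* of hypotheses: return one example that
        every hypothesis in the set misclassifies, or None if no such example exists."""
        for x in X:
            if all(not defeats(h, x) for h in hypotheses):
                return x
        return None

    # With a single hypothesis this behaves like an ordinary equivalence query:
    print(L2([H[0]]))   # -> 0, a counterexample for the all-positive hypothesis
    print(L2([H[7]]))   # -> None: H[7] is consistent with the target (perfect)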

2.4 Related Work

2.4.1 Experimental Work

A motivating factor for the model presented here is the promise that experimental work on game learning has shown in new domains, where extensive domain knowledge and systematic methods for producing good strategies are unavailable. Reinforcement learning with neural network evaluation functions, trained through self-play, has been very successful for backgammon [27, 23, 28]. Genetic algorithms that base fitness on competition within a population have been applied to problems such as sorting network design [11] and to games, including a pursuer-evader game between simulated 3-D robots [5, 25]. Pell has described work with novel games [19], and Epstein has studied a system that learns to play several different games. The empirical success of these methods is mixed, however: for some games, learning through self-play was found to be much less effective than training against an expert, and it is not always clear when continual improvement can be expected. Part of our aim is to explain some of the difficulties such systems encounter and to give conditions under which competitive learning must succeed.

2.4.2 Theoretical Work

Learning in games has also been studied theoretically [13, 8], including learning in repeated games [15] and learning of certain restricted kinds of games [10]. The goal there is typically to learn enough about a particular opponent to do well against it; this is largely orthogonal to our goal of learning strategies that are robust against a large class of opponents. Convergence results for reinforcement learning are usually proven for simple lookup table representations, whereas strategies for complex games must be represented compactly, for example by neural network value functions, and few of these results carry over. Moreover, the number of states in the games of interest is vast, so time bounds polynomial in the number of states are of limited practical use. Finally, game learning differs from concept learning in that there is no single fixed target: the opponents a new strategy must defeat change as learning progresses.

3 Competitive Algorithm Performance

In this section we establish the parameters used to measure competitive algorithm performance. "Time" for a competitive algorithm refers to the total number of calls it makes to the strategy learning algorithms, not to actual clock-time; this is meaningful as long as each strategy learning call can itself be carried out in a reasonable amount of time. The strategy set sizes enter through lg(|H|) and lg(|X|): these parameters are meaningful when H and X are restricted to strategies representable in some particular compact form (for example, neural net evaluation functions of bounded size), in which case lg(|H|) corresponds to the size of a representation. We seek competitive algorithms whose total time is polynomial in lg(|H|), lg(|X|), and an additional parameter k defined below; as shown below, some dependence on k is necessary.

3.1 Specification Number

Define a teaching set for a game G to be a set T ⊆ X such that every h ∈ H that is not perfect is defeated by some x ∈ T; equivalently, any first-player strategy that defeats every member of T must be perfect. Define the specification number k of G to be the size of the smallest teaching set. These definitions follow the corresponding ones for concept learning [2, 9]. The specification number can be as large as the strategy sets themselves: for example, there are games with n first-player strategies and n - 1 second-player strategies whose specification number is k = n - 1.
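The following brute-force sketch computes the specification number of a small explicit game by searching subsets of X for the smallest teaching set. It is purely illustrative (exponential in |X|), and the function names are assumptions rather than this paper's notation.

    from itertools import combinations

    def is_perfect(G, H, X, h):
        return all(G(h, x) for x in X)            # G(h, x) true means h defeats x

    def is_teaching_set(G, H, X, T):
        """T teaches G: every imperfect h in H is defeated by some x in T."""
        return all(any(not G(h, x) for x in T)
                   for h in H if not is_perfect(G, H, X, h))

    def specification_number(G, H, X):
        """Size of the smallest teaching set, by brute force over subsets of X."""
        for k in range(len(X) + 1):
            for T in combinations(X, k):
                if is_teaching_set(G, H, X, T):
                    return k
        return None

    # Toy game: strategies are integers, h defeats x iff h >= x; only h = 3 is perfect.
    H, X = [0, 1, 2, 3], [0, 1, 2, 3]
    G = lambda h, x: h >= x
    print(specification_number(G, H, X))   # -> 1, since T = {3} defeats every imperfect h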

Unfortunately, competitive algorithms whose time is polynomial in lg(|H|) and lg(|X|) alone are not possible; some dependence on k is necessary, even when the strategy learning algorithms are not adversarial. The following lemma shows this.

Lemma 1 For any k there exist games G with specification number k and at most c (a constant) perfect first-player strategies, together with a randomized strategy learning algorithm, such that any competitive algorithm using this strategy learning algorithm requires expected time Ω(k) to learn a perfect strategy.

The proof bounds the probability that a perfect strategy has been produced after t < k calls to the strategy learning algorithms. This shows the necessity of bounding the number of perfect strategies by a constant: if c were allowed to grow with k, this probability might always be large.

3.2 Worst-Case Strategy Learning

With worst-case (adversarial) strategy learning algorithms, even more can be demanded of the competitive algorithm. The following definition is needed. A transitive chain of length l in a game G is a sequence of pairs (hi, Xi), i = 1, 2, ..., l, with hi ∈ H and Xi ⊆ X, such that:

1. ∀i > j, hi > Xj
2. ∀i ≥ j, Xi > {hj}

That is, each hi defeats every member of every earlier set Xj, while each set Xi contains a strategy defeating every hj with j ≤ i. A transitive chain measures how far adversarial strategy learning algorithms can force a competitive algorithm to climb one step at a time.

Lemma 2 For games G with a transitive chain of length l, there exist strategy learning algorithms L1 and L2 such that any competitive algorithm using these strategy learning algorithms requires Ω(l) time to learn a perfect strategy.

Proof sketch: L1 and L2 are defined adversarially. Given opponents drawn from the chain, each returns only strategies at the chain position one beyond the highest position present among its opponents. Each call to L1 or L2 therefore advances the competitive algorithm along the transitive chain by at most one position, and none of the chain strategies returned before the end of the chain is reached is perfect, so at least l calls are required.

3.3 Randomized Strategy Learning Algorithms

Due to the above result, upper bounds for competitive algorithms that use worst-case strategy learning algorithms must depend on the length of the longest transitive chain, which can be very large. In practice it seems unlikely that strategy learning algorithms will be adversarial, producing uninformative near-worst-case strategies. To obtain positive results beyond what worst-case strategy learning allows, we also consider randomized strategy learning algorithms, in which the returned strategy is chosen at random (for example, uniformly) from among the strategies that defeat the given set of opponents. One natural condition under which such randomization helps, used in examples later in the paper, is the existence of classes of strategies that identify themselves by communicating a label via an appropriate sequence of "throwaway" moves. Note that the proof of Lemma 1 used a randomized strategy learning algorithm, so the dependence on k remains necessary even in this setting.
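A minimal sketch of the randomized strategy learning idea, assuming finite explicit strategy sets: the learner draws uniformly at random from the strategies that defeat all given opponents, rather than returning an adversarially chosen one. The names and the uniform choice are illustrative assumptions.

    import random

    def make_randomized_L1(G, H):
        """Build a randomized first-player learner for game outcome G(h, x) -> bool."""
        def L1(A):
            # All first-player strategies that defeat every opponent in A.
            candidates = [h for h in H if all(G(h, x) for x in A)]
            if not candidates:
                return None                    # failure: A contains a perfect second-player strategy
            return random.choice(candidates)   # uniform choice, not worst-case
        return L1

    # Usage with the toy game from the earlier sketches: h defeats x iff h >= x.
    H, X = [0, 1, 2, 3], [0, 1, 2, 3]
    G = lambda h, x: h >= x
    L1 = make_randomized_L1(G, H)
    print(L1([0, 2]))   # some h in {2, 3}, chosen uniformly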

4 Two Simple Competitive Algorithms

Two simple, natural competitive algorithms are described in this section. We motivate the more complex competitive algorithm of Section 5 by exhibiting games on which each of these simple algorithms may fail to learn a perfect strategy in polynomial time, even with randomized strategy learning algorithms.

4.1 Defeating the Last Strategy

A simple competitive algorithm is the following: obtain an initial first-player strategy s, then find a second-player strategy t with t > s, then a first-player strategy s' with s' > t, then t' with t' > s', and so on. In the framework of Section 2.2.2, the chosen subset is always the single most recently added strategy. This is essentially the competitive algorithm used in Samuel's checkers learning system [22], and it is very similar to that used in a recent backgammon learning system [20].
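The sketch below implements this alternating scheme on a toy intransitive game (rock-paper-scissors style) to make the failure mode concrete. This toy game has no perfect strategy, so it sits outside the paper's standing assumption, but it isolates the cycling behaviour; the game, the learner, and the helper names are illustrative assumptions.

    # Rock-paper-scissors-like intransitive game: each strategy beats exactly one other.
    BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}

    def defeats(a, b):
        return BEATS[a] == b

    def learner(opponent):
        """Return some strategy defeating the single opponent (both players share the same strategy set here)."""
        return next(s for s in BEATS if defeats(s, opponent))

    def defeat_the_last(initial, steps=9):
        """Alternate: always train against the single most recently added strategy."""
        history = [initial]
        for _ in range(steps):
            history.append(learner(history[-1]))
        return history

    print(defeat_the_last("rock"))
    # ['rock', 'paper', 'scissors', 'rock', 'paper', ...]: the learner cycles and never
    # settles, mirroring the intransitivity problem discussed below.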

The main problem with this competitive algorithm is that intransitivities may exist among strategies: the learner may simply keep producing strategies that defeat the most recent opponent while losing to earlier ones, so the competitive algorithm may get stuck in a cycle for a long time. Intransitivity of this kind has been observed when using this algorithm for backgammon [20]. The following example demonstrates a class of games on which this competitive algorithm fails, even with randomized strategy learning algorithms.
Example 1 (Small Game Trees) This example considers games represented by small game trees; the number of nodes in the tree is polynomial in the size of the strategies the learning algorithms must handle, so such games could reasonably be expected to be learnable in polynomial time. Let Td be the complete binary tree of depth d; Td has n = 2^(d+1) - 1 nodes and l = 2^d leaves. Let Gd be the set of games with game tree Td whose leaves are labelled with binary outcomes: an outcome of 1 indicates a first-player win and an outcome of 0 indicates a second-player win, and all possible labellings of the leaves are included. Players alternate choices of branches down the tree, as in a simple game like tic-tac-toe, with the first player choosing at the root. Let H consist of all possible first-player strategies (all possible choices at all first-player nodes) and X of all possible second-player strategies; lg(|H|) and lg(|X|) are O(n). Since only one bit of outcome information is presented, each game in Gd must contain either a winning first-player strategy or a winning second-player strategy.

Against a single opponent, only the responses along the one line of play that opponent produces actually matter; all remaining bits of a strategy are irrelevant to the outcome, and a strategy learning algorithm may set them arbitrarily, for example by guessing randomly. The strategies produced by the competitive algorithm of this section therefore tend to defeat only their most recent opponent, and with high probability no perfect strategy is produced within polynomially many steps. This competitive algorithm fails to solve games in Gd in time polynomial in lg(|H|), lg(|X|), and k, even with randomized strategy learning algorithms, and adding a memory of a few previous strategies to the algorithm is of limited usefulness.
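To make the counting in Example 1 concrete, the sketch below enumerates the strategies of a tiny instance of Gd (depth d = 2) and checks which games in the class have a perfect first-player strategy. The encoding of strategies as per-node branch choices and the variable names are illustrative assumptions.

    from itertools import product

    d = 2                                   # depth-2 complete binary tree: 3 internal nodes, 4 leaves
    H = [0, 1]                              # first-player strategies: branch chosen at the root
    X = list(product([0, 1], repeat=2))     # second-player strategies: branch at each depth-1 node

    def outcome(labels, h, x):
        """labels: tuple of 4 leaf outcomes (1 = first-player win). Play root, then depth-1 node."""
        leaf = 2 * h + x[h]                 # index of the leaf reached under (h, x)
        return labels[leaf]

    def has_perfect_first(labels):
        return any(all(outcome(labels, h, x) == 1 for x in X) for h in H)

    games = list(product([0, 1], repeat=4))   # all leaf labellings: the class G_d for d = 2
    wins = sum(has_perfect_first(g) for g in games)
    print(len(H), len(X), len(games))         # 2 first-player strategies, 4 second-player strategies, 16 games
    print(wins, "of", len(games), "games have a perfect first-player strategy")

The remaining games in the class have a perfect second-player strategy instead, matching the observation above that every game in Gd has a winner with only one bit of outcome information.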