A Competitive Approach to Game Learning

Christopher D. Rosin and Richard K. Belew
Cognitive Computer Science Research Group
CSE Department, University of California, San Diego
La Jolla, CA 92093-0114
{crosin,rik}@cs.ucsd.edu
Abstract

Machine learning of game strategies has often depended on competitive methods that continually develop new strategies capable of defeating previous ones. We use a very inclusive definition of "game" and consider a framework within which a competitive algorithm makes repeated use of a strategy learning algorithm that can learn strategies capable of defeating a given set of opponents. We describe game learning in terms of sets H and X of first-player and second-player strategies, and connect it to more familiar models of concept learning. Our central new result is a competitive algorithm that, with both worst-case and randomized strategy learning algorithms, solves games in a total number of strategy learning calls polynomial in lg(|H|), lg(|X|), and the specification number k of the game. Its use is demonstrated, including an application to concept learning with a new kind of counterexample oracle. We conclude with a complexity analysis of game learning and a list of open questions arising from this work.
1 Introduction
A number of machine-learning systems learn to play games by using data from their own play. Typically, a series of strategies for the game are produced during learning, with new strategies repeatedly capable of defeating older ones, so that the strategies produced grow progressively stronger. Empirical work using this competitive approach has been done in a number of domains. Examples range from Samuel's classic work on checkers [22], to reinforcement learning systems that use neural networks to approximate minimax evaluation functions [27, 23, 28] and have reached expert levels of play at backgammon [20, 21], to genetic algorithms used in domains such as evolving sorting networks [11], design problems [24], and pursuer-evader differential games in which a controller for one player is learned [5, 25].

The definition of "game" used in this paper is very inclusive, and allows us to consider much more than traditional discrete board games; the main intuition we retain from such games is that of two players whose strategies are tested against one another. The framework relies on the existence of a strategy learning algorithm that is able to learn strategies which defeat a given set of opponents. A competitive algorithm then repeatedly uses the strategy learning algorithm to discover strong strategies for the game. We seek a competitive algorithm capable of learning perfect strategies for any game in polynomial time, even when the strategy learning algorithm it uses is otherwise uninformed about the game. Our central new result (Theorem 4) is a competitive algorithm that meets this goal with both worst-case and randomized strategy learning algorithms, using a number of strategy learning calls polynomial in lg(|H|), lg(|X|), and the specification number k of the game. To study this kind of bootstrapping, which goes back to Samuel's work, we describe game learning in terms of sets of strategies and connect it to familiar models of concept learning; the specification number [9] and teaching sets [2] play an important role in this context.

In Section 2 we give details of our model of game learning, describe its connection to familiar models of concept learning, and mention some related work. Section 3 motivates the consideration of both worst-case and randomized strategy learning algorithms, and gives necessary parameters for measuring competitive algorithm performance in each case. We then examine several competitive algorithms motivated by those used in practice. Section 4 presents two simple competitive algorithms, and shows examples on which they can fail to learn perfect strategies in polynomial time. A competitive algorithm that meets our performance goals with both worst-case and randomized strategy learning algorithms is given in Section 5. Examples of its use are described, including an application to concept learning with a new kind of counterexample oracle. Section 6 explores the computational complexity of game learning, and Section 7 discusses several open problems.
2 Preliminaries
2.1 Definition of Games
A game is a function G which maps two inputs h and x (first and second player strategies) to an outcome
G(h, x). The set of possible first-player strategies is denoted H, and the set of possible second-player strategies is denoted X. (Many board games are largely symmetric; the main reason for making a distinction between first-player and second-player strategies is the existence of a perfect strategy for one player but not the other.) For most of our results, it is assumed for simplicity that the outcome is a single bit of information: the winner. No ties are allowed. This is a simple, very inclusive view of a game: no further structure (sequential play, the way a strategy is presented, etc.) is assumed, and a game consists of both players presenting a strategy and an outcome being obtained.

The notation a ≻ b indicates that strategy a defeats strategy b. (The relation ≻ is only meaningful in the context of a particular game G and should be subscripted ≻_G. Whenever we use this notation, the game is clear from context, so the subscript is dropped. This is also true for several other definitions.) The notation is extended to sets of strategies: A ≻ B means ∀b ∈ B, ∃a ∈ A such that a ≻ b, and a ≻ B means ∀b ∈ B, a ≻ b.
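To make the notation concrete, the following minimal sketch (ours, not the paper's; the "counting" game and all names are illustrative) expresses these definitions in Python, with the outcome function returning 1 when the first player wins.

    # A game is an outcome function: 1 if the first-player strategy h defeats
    # the second-player strategy x, and 0 otherwise (no ties).
    def outcome(h, x):
        # Illustrative counting game: integer strategies, larger first-player number wins.
        return 1 if h > x else 0

    def defeats(h, x):
        # h ≻ x in the paper's notation: first-player strategy h defeats x.
        return outcome(h, x) == 1

    def set_defeats(H_sub, X_sub):
        # H_sub ≻ X_sub: every x in X_sub is defeated by some h in H_sub.
        return all(any(defeats(h, x) for h in H_sub) for x in X_sub)

    assert defeats(3, 2) and not defeats(2, 2)
    assert set_defeats({1, 3}, {0, 2})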
2.2 Framework

The framework we consider has two main components: strategy learning algorithms and a competitive algorithm. A strategy learning algorithm, when given a set of opposing strategies, must return a strategy (or strategies) capable of defeating every member of that set. The strategy learning algorithms are denoted L1 for the first player and L2 for the second player. The competitive algorithm repeatedly calls the strategy learning algorithms, choosing the sets of opponents it passes them from the strategies obtained so far; no other knowledge of the game is assumed to be available to it.

2.2.1 Exact Learning

For most of our results, it is assumed that the game has a perfect strategy: a strategy (either for the first or the second player) that defeats all possible opposing strategies. The goal of the competitive algorithm is to find a perfect strategy. For notational convenience we usually take the perfect strategy to be a first-player strategy; it is easy to modify the definitions for the other case. Consideration of exact learning seems a good way to start, since it simplifies matters; it is not clear what the best way to define approximate success would be, and we leave the extension to approximate learning to later work.
2.2.2 Structure of the Competitive Algorithm

More formally, let F_i and S_i be the sets of first-player and second-player strategies, respectively, that are available to the competitive algorithm at the end of step i. F_0 and S_0 are initialized to the empty set. Step i+1 of the competitive algorithm consists of the following:

1. F_{i+1} is initialized to F_i, and S_{i+1} is initialized to S_i.
2. Some subset A_S ⊆ S_{i+1} is chosen.
3. L1(A_S) is called, and the returned strategy is added to F_{i+1}.
4. Some subset A_F ⊆ F_{i+1} is chosen.
5. L2(A_F) is called, and the returned strategies are added to S_{i+1}.

Several components of this procedure are left unspecified, such as the way the subsets A_S and A_F are chosen; these choices distinguish the competitive algorithms we consider, and some of them may use domain-specific knowledge. Multiple calls to L1 and L2 may be made within a single step. A call to a strategy learning algorithm fails when there is no strategy capable of defeating every member of the set it was given; in this model such a failure can only occur when the set contains a perfect strategy, so termination occurs when a failure occurs, since a perfect strategy has then been found.
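The step structure above can be written as a short loop. The following sketch is ours, not the paper's: L1 and L2 stand for the strategy learning algorithms, choose_AS and choose_AF for the unspecified subset-selection rules, and a learner returning None signals the failure case described above.

    def competitive(L1, L2, choose_AS, choose_AF, max_steps):
        F, S = [], []          # first- and second-player strategies found so far
        for _ in range(max_steps):
            A_S = choose_AS(S)             # subset of available second-player strategies
            h = L1(A_S)                    # must defeat every member of A_S
            if h is None:                  # failure: A_S already contains a perfect strategy
                return ("second-player perfect strategy in", A_S)
            F.append(h)
            A_F = choose_AF(F)             # subset of available first-player strategies
            xs = L2(A_F)                   # strategies defeating every member of A_F
            if xs is None:                 # failure: A_F contains a perfect first-player strategy
                return ("first-player perfect strategy in", A_F)
            S.extend(xs)
        return None                        # no perfect strategy found within max_steps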
2.2.3 Notes on the Learning Model

As a concrete example of how this framework might be applied, consider Samuel's original work on learning evaluation functions for checkers through self-play [22]. Games were played between a fixed player (Beta) and a learning player (Alpha); Alpha would learn from these games via Samuel's reinforcement learning method. When Alpha was finally able to defeat Beta, Beta was replaced by Alpha. Viewed in our framework, this corresponds to a competitive algorithm that uses the strategy learning algorithm (Alpha's learning protocol) to find a new strategy capable of defeating the single strategy currently used as the opponent (Beta), and then moves to this new strategy, making it the next opponent to be defeated.

Some points about this model should be noted. First, the internal details of the strategy learning algorithms are left unspecified; they might use heuristic search, reinforcement learning from games played against the given opponents, or some other appropriate method. Second, the set A of opponents presented to a strategy learning algorithm is chosen by the competitive algorithm from the strategies available so far, and these opponents may be quite uninformative about the game as a whole. At one extreme, a player's strategies could intentionally be made weak in an effort to inform the other player's learning; we do not assume this kind of cooperation, but neither do we assume that the opponents are chosen to slow learning.
2.2.4 Bootstrapping

This sort of bootstrapping, in which new strategies are learned by competition against strategies that were themselves produced earlier, has met with success in a number of empirical systems [6, 23, 26], and part of our goal is to explain when it can be expected to succeed. In our framework the competition is explicit: each new strategy is required to defeat a given set of previously obtained strategies. In some empirical work the competition is not as explicit. For example, in Tesauro's backgammon reinforcement learning system [27] it was not explicitly checked whether new strategies would defeat old ones; but given the continual improvement in performance that was observed, it is likely that new strategies did defeat old ones implicitly (a similar condition was checked explicitly for the backgammon strategies of [20]).
2.3
to
Correspondence
with
Concept
strategy
(makes
Beta
on game
learning,
the
defined.
For example,
egy space
% corresponds
to concept
uses
sis space;
the assumption
that
Learning
Alpha). Our In some tive
empirical
algorithm
work
is not
as explicitly
Tesauro’s
backgammon
forcement
learning
new
strategies
rithm
would
defeat
(similar
of several So, our
games
of this
the
most
part,
computable
cost
empirical
well
The
This
is the
as a counterexample pothesis equivalence
The
framework
paper even
of the easily,
theoretical
of techniques
this
even
question allows
we us to
through
and
Strategy
Set
and
for
number
of strategies
for
a competitive
the
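As a concrete illustration of this correspondence (ours, with an invented target and hypothesis class), a consistency problem can be phrased directly as a game: the outcome records whether the hypothesis classifies the example the same way the target does, so a perfect first-player strategy is a hypothesis consistent with every example, and L2 acts as a counterexample oracle.

    # Instance space and an (illustrative) target concept: x is positive iff x % 3 == 0.
    X_SPACE = range(12)
    def target(x):
        return x % 3 == 0

    # Hypothesis class: h = (a, b) labels x positive iff x % a == b.
    H_SPACE = [(a, b) for a in range(1, 5) for b in range(a)]

    def outcome(h, x):
        # 1 iff hypothesis h agrees with the target on example x.
        a, b = h
        return 1 if (x % a == b) == target(x) else 0

    def L2(hypotheses):
        # Counterexample oracle: an example on which every given hypothesis is wrong.
        # For a single hypothesis this behaves like an equivalence query: None means
        # the hypothesis agrees with the target on every example in X_SPACE.
        for x in X_SPACE:
            if all(outcome(h, x) == 0 for h in hypotheses):
                return [x]
        return None

    print(L2([(2, 0)]))   # [2]: (2, 0) labels 2 positive but the target labels it negative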
2.4 Related Work

2.4.1 Theoretical Results

Work on learning in games is vast, but much of it is concerned with goals different from ours. A number of papers have discussed learning to play repeated games of certain kinds [13, 8, 10]. Recent results in reinforcement learning prove polynomial bounds on convergence time [15], but these bounds are typically polynomial in the number of states and rely on simple lookup-table representations; for the complex games we are considering, the number of states is vast, and function approximators such as neural networks are used instead, without proven bounds [7]. Much of this work is also aimed at learning enough about a particular opponent to do well against it in later play, a goal that is largely orthogonal to ours of learning strategies that are robust against a large class of opponents.
2.4.2 Experimental Work

A motivating factor for the model presented here is the promise shown by experimental work on game learning: heuristic game learning methods have been applied in a variety of domains, often without extensive domain knowledge.
Epstein has studied a system that learns to play a variety of different games, and Pell has described a setup in which learning systems can be tested on new, novel games [19]. Learning has also been used to guide search in general game-playing systems [28, 23]. Genetic algorithms that rely on competition between members of the population (a form of coevolution) have been applied successfully to pursuer-evader differential games [5, 25], including contests between simulated 3-D robots, and to problems such as sorting network design [11]. Reinforcement learning through self-play has been used to train neural network evaluation functions, most notably for backgammon [27, 23, 28]. Most of this empirical work uses some form of self-play, but several forms of opposition have been used; for backgammon it was observed that training entirely through self-play could result in poor performance on parts of the game, and that mixed training, including games against an expert player, was found to be much more effective [6]. It is important to be able to explain or predict when such methods will succeed or fail, and to suggest improvements; the model described in this paper is an idealization of these systems, intended to identify features that are crucial to their success.
3 Specification of Competitive Algorithm Performance

In this section we define the parameters in terms of which competitive algorithm performance will be measured, for both worst-case and randomized strategy learning algorithms. "Time" for a competitive algorithm refers to the total number of calls it makes to the strategy learning algorithms; the algorithm's own processing between calls is also required to be polynomial, but actual clock-time depends in addition on the time taken by the strategy learning algorithms themselves, which we do not attempt to bound. We would like competitive algorithms whose time is polynomial in lg(|H|) and lg(|X|). Bounds of this type are the reasonable ones to seek because the strategy sets of the games we are interested in are very large: for example, H might include all strategies representable in some particular form, such as neural net evaluation functions used with a fixed search scheme, so that individual strategies can be represented compactly even though |H| itself is enormous. Unfortunately, time polynomial in lg(|H|) and lg(|X|) alone cannot always be achieved, and additional parameters are needed; they should be fairly small for games of practical interest, and are described below.

3.1 Specification Number

For a game G with a perfect first-player strategy, define a teaching set T for G to be a subset of X such that for any imperfect strategy h ∈ H, there exists x ∈ T with x ≻ h. Define the specification number k for G to be the size of the smallest teaching set. These definitions follow the corresponding ones for concept learning [2, 9]. As an example, consider a game with n first-player strategies, exactly one of which is perfect, and n − 1 second-player strategies, each of which defeats exactly one of the imperfect first-player strategies. The smallest teaching set must contain all of the second-player strategies, so the specification number is k = n − 1.
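For intuition, the specification number of a small, fully enumerated game can be computed by brute force. The sketch below is ours (the encoding of the example game is an assumption made for illustration); it checks subsets of X in order of increasing size.

    from itertools import combinations

    def specification_number(H, X, beats, perfect):
        # Size of the smallest T, a subset of X, such that every imperfect h is
        # defeated by some x in T (beats(x, h): second-player strategy x defeats h).
        imperfect = [h for h in H if h not in perfect]
        for size in range(len(X) + 1):
            for T in combinations(X, size):
                if all(any(beats(x, h) for x in T) for h in imperfect):
                    return size
        return None

    # The example game above: h_{n-1} is perfect, and x_i defeats exactly h_i.
    n = 5
    H, X = list(range(n)), list(range(n - 1))
    beats = lambda x, h: x == h
    print(specification_number(H, X, beats, perfect={n - 1}))   # prints 4, i.e. n - 1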
The following lemma shows that a competitive algorithm cannot always succeed in time polynomial in lg(|H|) and lg(|X|) alone, and that dependence on k is necessary, even when the strategy learning algorithms are not adversarial.

Lemma 1. For games G with specification number k and at most c (a constant) perfect first-player strategies, there exist strategy learning algorithms L1 and L2 such that any competitive algorithm using them requires expected time Ω(k) to learn a perfect strategy.

Proof. Let T be a minimal teaching set for G, so |T| = k. Each member of T must defeat some imperfect member of H that no other member of T defeats (otherwise T would not be minimal). Let L2 answer each call by returning, for each strategy it is given, a member of T defeating it, so that the second-player strategies available to the competitive algorithm are always members of T. Let L1(A) return a strategy chosen uniformly at random from among the members of H that defeat every member of A. As long as |A| < k, at least k − |A| imperfect strategies defeat every member of A, so the probability that a single call to L1 returns one of the at most c perfect strategies is at most c/(k − |A|). Since |A| can grow by at most one for each strategy the competitive algorithm has obtained, the probability that a perfect strategy has been returned within t ≤ k/2 calls is at most 2ct/k, and the expected number of calls needed is Ω(k). □

Note that it is necessary here to bound the number of perfect strategies by a constant: if c were allowed to grow with k, this probability might always be large.
3.2 Worst-Case Strategy Learning Algorithms

For worst-case results, we assume only that a strategy learning algorithm returns some strategy capable of defeating the set of opponents it is given; the competitive algorithm must succeed no matter which such strategy is returned, so the strategy learning algorithms may be treated as adversarial. In this case even dependence on k is not enough, and a further definition is needed. A transitive chain of length ℓ in a game G is a sequence of pairs (h_i, X_i), i = 1, 2, ..., ℓ, with h_i ∈ H and X_i ⊆ X, such that:

1. ∀i > j, h_i ≻ X_j.
2. ∀i ≥ j, X_i ≻ {h_j}.

The second condition implies that none of the h_i is perfect. The following lemma shows that worst-case bounds must depend on the length of the transitive chains present in the game.
Lemma 2. For games G with a transitive chain of length ℓ, there exist strategy learning algorithms L1 and L2 such that any competitive algorithm using these requires Ω(ℓ) time to learn a perfect strategy.

Proof. If L1 is called with the empty set, let it return h_1. Otherwise, assume L1 is called with a set A whose members all belong to sets of the transitive chain, and let i_1, ..., i_m be the indices of the chain sets X_{i_j} that the members of A belong to; let L1 return h_{(max_j i_j)+1}, or a perfect strategy if max_j i_j = ℓ. Similarly, if L2 is called with the empty set let it return X_1, and if it is called with a set B = {h_{i_1}, ..., h_{i_m}} of chain members let it return X_{max_j i_j}. The conditions on the transitive chain guarantee that these answers defeat the sets they are asked about. The first call to L1 or L2 must pass the empty set, since no strategies are available yet, so by induction the competitive algorithm only ever has strategies from the transitive chain available, and each call advances the largest available chain index by at most 1. Since no member h_i of the chain is perfect, the competitive algorithm cannot learn a perfect strategy in fewer than ℓ steps. □
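These adversarial learners are easy to simulate. The sketch below is ours: it builds L1 and L2 for an explicitly given chain and runs them on a small counting game (h defeats x exactly when h > x), where the chain (0, {0}), (1, {1}), ... forces one step of progress per round.

    def make_chain_learners(chain):
        # chain = [(h_1, X_1), ..., (h_l, X_l)]: worst-case learners that advance
        # along the transitive chain by at most one index per call.
        hs = [h for h, _ in chain]
        xsets = [Xi for _, Xi in chain]

        def index_of_x(x):
            return max(i for i, Xi in enumerate(xsets) if x in Xi)

        def L1(A):                  # A: second-player strategies from the chain
            if not A:
                return hs[0]
            top = max(index_of_x(x) for x in A)
            return hs[top + 1] if top + 1 < len(hs) else "PERFECT"   # marker for a perfect strategy

        def L2(B):                  # B: first-player strategies from the chain
            if not B:
                return list(xsets[0])
            top = max(hs.index(h) for h in B)
            return list(xsets[top])

        return L1, L2

    # Counting-game chain (illustrative): h_i = i, X_i = {i}; h defeats x iff h > x.
    L1, L2 = make_chain_learners([(i, {i}) for i in range(6)])
    h = L1([])            # 0
    for _ in range(6):    # each round advances one step along the chain
        xs = L2([h]); h = L1(xs)
    print(h)              # 'PERFECT' only after traversing the whole chain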
Due to the above result, upper bounds for competitive algorithms using worst-case strategy learning algorithms will depend on ℓ, which we use to denote the length of the longest transitive chain in a game, and there is no competitive algorithm using worst-case strategy learning algorithms that always learns a perfect strategy in time polynomial in lg(|H|), lg(|X|), and k. As an example, consider the game in which the first-player strategies are the numbers 0 ... n, the second-player strategies are the numbers 0 ... n − 1, and h ≻ x iff h > x. Here (0, {0}), (1, {1}), ..., (n − 1, {n − 1}) is a transitive chain of length n, although k = 1 and lg(|H|) and lg(|X|) are only O(lg n), so ℓ can be exponentially larger than the other parameters.

3.3 Randomized Strategy Learning Algorithms

In practice, it seems unlikely that strategy learning algorithms will behave adversarially, producing uninformative near-worst-case strategies. To obtain positive results that go beyond the dependence on ℓ, we also consider randomized strategy learning algorithms: when passed a set A of opponents, such an algorithm returns a strategy drawn from some distribution over the strategies that defeat every member of A. The most natural restriction is for this distribution to be uniform; Theorem 4 relaxes this somewhat. Note that the strategy learning algorithm used in the proof of Lemma 1 was randomized, so the dependence on k remains necessary even in this case. A sufficient condition involving games in which strategies can identify themselves, by communicating a label through a sequence of "throwaway" moves that do not affect the outcome, is given later in the paper.
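When the strategy sets are small enough to enumerate, a randomized strategy learning algorithm of this kind can be simulated directly; this sketch (ours, using the counting game above) draws uniformly from the strategies that defeat the given opponents.

    import random

    def make_randomized_L1(H, outcome):
        # Uniform randomized learner: any strategy defeating every given opponent.
        def L1(A):
            valid = [h for h in H if all(outcome(h, x) == 1 for x in A)]
            # None signals failure: no strategy defeats all of A (in this model,
            # that means A already contains a perfect strategy).
            return random.choice(valid) if valid else None
        return L1

    # Counting game with first-player strategies 0..10: h defeats x iff h > x.
    L1 = make_randomized_L1(range(11), lambda h, x: 1 if h > x else 0)
    print(L1({3, 7}))   # one of 8, 9, 10, chosen uniformly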
4 Two Simple Competitive Algorithms

Two simple competitive algorithms are described in this section. We motivate the demands that our main algorithm makes on the strategy learning algorithms by showing that these simpler, natural competitive algorithms may fail to solve games in time polynomial in lg(|H|), lg(|X|), and k, even with randomized strategy learning algorithms, and even on games where a more complex competitive algorithm succeeds.
4.1 Defeating the Last Strategy

A simple competitive algorithm is the following: obtain an initial first-player strategy s, then find a second-player strategy t with t ≻ s; then find a first-player strategy s' with s' ≻ t, and so on.
k.
1 as well
is the existence by communicating
produce
of this
these
of the longest
depend
natural
traditional
(for
is a transitive
and
algorithm”
of the
algorithms,
solves
helpful
learning
4 relaxes
a
sets)
k.
Unfortunately,
chain
there
to be most be uniform.
from
from
strategy
is a perfect
,.. (n – l,{n–
learning that
consists
choose
algorithm
algorithms
domized
Due to the above result, upper bounds for algorithms using worst-case strategy learn-
]),and
initial
1. But, (1, {1}),
in lg(]?fl),
algorithms
then
X
choose
to
Theorem
[16].
algorithms
(or
tends
competitive
worst-case
algorithms
learning
somewhat.
lg( /?f /), lg( [X /), and
of games:
There
beyond
learning
it does not
always
the
Competitive
last
the perfect
class
h > r.
strategy
by.4 the length
in a game. competitive lg(l,l?
k =
algorithm
polynomial
We denote
ing
following
to
make
ing
1 . . . n and
(0, {0}),
is no competitive time
the
although
produces
in
the
❑
O. . .n – 1. h > z iff
player
chain
consider
going
strategy
on the
still As
for
access
immediately
is required
randomization that
unlimited
Randomization
to below,
we
has a first-player
set of strategies
“randomized
to
Since
a solution
strategy
distribution
algorithms
transi-
strategies
except
tion
re-
A.
this
the
that
uninformative
a set A of opponents,
over
defeat
refer
compet/ calls
chain,
distance
passed
some
algorithms
algorithms.
that
randomized
without in
algorithm
with
approach
stratproducTo ob-
restrict,
be
learning
is to use randomized
is what
available
the next
the transitive
increasing
set (since
strategies
available At
competitive
strategy
no
strategies
are J51 or Xl.
successive
the
pass the empty
would
algorithm
Another
to
learning
it
game
natural
algorithms
need
strategy
learning
any
strategy.
that
set.
we
that
strategy
distribution
Otherwise,
competitive
strategy
a perfect
when
one,
Note
algorithms
Similarly,
the
best-case
strategy
chain,
. . .xil~l}
strategy.
{hi,,
way,
we consider.
are passed
for
1 as a parameter,
meaningful
respectively.
contains
let LI(A) i =
G
(h~, x~).
and
and
If A
Assume
results
simple,
‘(adversarial”, strategies.
the
on /.
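A sketch of this defeat-the-last-strategy loop (ours; the toy game, its "star" strategy, and the learner's preference order are illustrative, not taken from the systems cited above) shows how the loop behaves when the game is symmetric, so one strategy set serves both players, and contains an intransitive cycle.

    # Toy symmetric game: a cycle rock/paper/scissors plus "star", which beats the other three.
    BEATS = {"rock": {"scissors"}, "paper": {"rock"}, "scissors": {"paper"},
             "star": {"rock", "paper", "scissors"}}
    def wins(a, b):                      # does strategy a defeat strategy b?
        return b in BEATS[a]

    def learner(opponent):               # a worst-case learner: returns some strategy beating
        for s in ["rock", "paper", "scissors", "star"]:   # the opponent, preferring cycle members
            if wins(s, opponent):
                return s

    h = "rock"                           # initial strategy
    seen = []
    for _ in range(6):                   # defeat-the-last loop: opponent is always the newest strategy
        h = learner(h)
        seen.append(h)
    print(seen)   # ['paper', 'scissors', 'rock', 'paper', ...] -- cycles, never finds 'star'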
The main problem with this competitive algorithm is that such intransitivity may exist in the game: a particular learner may simply keep choosing strategies that defeat the most recent one, and get stuck in a cycle. Intransitivity has been observed when using this algorithm for backgammon [20]. Even when the learner does not remain stuck, it may fail to make progress for a long time. The following example demonstrates this.
Example 1 (Small Game Trees). This example considers games represented by game trees, the class familiar from traditional board games like tic-tac-toe. For such games, the time allowed to the learning algorithms could reasonably be extended to be polynomial in n, the number of nodes in the game tree, so a competitive algorithm should solve these games in time polynomial in n.

Let Td be the complete binary tree of depth d; Td has n = 2^(d+1) − 1 nodes and l = 2^d leaves. Let Gd be the set of games with game tree Td in which each leaf is labelled with a binary outcome: an outcome of 1 indicates a first-player win and an outcome of 0 indicates a second-player win. The first player chooses the branch taken at even depths and the second player at odd depths. Let H consist of all possible first-player strategies (a choice of branch at every first-player node) and X of all possible second-player strategies, so that lg(|H|) and lg(|X|) are O(n). Since the outcome is a win for one player or the other, each game in Gd must contain a winning strategy for either the first or the second player; let G'd ⊆ Gd be the set of games that have a first-player win. A perfect first-player strategy for a game in G'd must make the correct choice at each of the 2^⌈d/2⌉ − 1 first-player nodes that remain reachable when it is played, so that every reachable leaf is labelled 1, while the remaining bits of the strategy are irrelevant to its play. Note that the number of games in Gd is doubly exponential in d.

Now consider a randomized strategy learning algorithm that, given a set of opponents, fixes only the choices needed to defeat those opponents and sets each remaining bit to 0 or 1 with probability 1/2. Playing against a single opponent constrains only the choices along a single line of play. Considering the games in G'd in which only the leaves under a single first-player strategy are labelled 1, any strategy produced by the simple competitive algorithm must correctly guess the exponentially many other relevant choices at random, and the probability of correctly guessing all of them is doubly exponentially small in d. The simple competitive algorithm therefore fails to learn a perfect strategy for these games in time polynomial in lg(|H|), lg(|X|), and k, even with randomized strategy learning algorithms. Adding a memory of previously produced strategies to this competitive algorithm is of limited usefulness here.
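To make the construction concrete, the sketch below (ours; the indexing conventions and the tiny instance are illustrative) evaluates the outcome of a game in Gd by following the two players' bit-vector strategies down the tree.

    def play(leaves, h_bits, x_bits, depth):
        # Outcome (1 = first-player win) of a game in G_d.
        # leaves: 0/1 labels for the 2**depth leaves of the complete binary tree T_d.
        # h_bits / x_bits: branch choice (0 = left, 1 = right) for every internal node,
        # indexed heap-style (root = 1, children of node i are 2i and 2i+1).
        node = 1
        for level in range(depth):
            bits = h_bits if level % 2 == 0 else x_bits   # first player chooses at even depths
            node = 2 * node + bits[node]
        return leaves[node - 2 ** depth]

    # Tiny instance: depth 2, with both leaves under the root's right branch labelled 1,
    # so the first player forces a win by choosing that branch.
    depth = 2
    leaves = [1, 0, 1, 1]                 # leaf labels, left to right
    h_bits = {1: 1}                       # first player's choice at the root
    x_bits = {2: 0, 3: 0}                 # second player's choices at depth 1
    print(play(leaves, h_bits, x_bits, depth))   # 1: a first-player win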