Data Biased Robust Counter Strategies
Michael Johanson, Michael Bowling

April 14, 2009

University of Alberta Computer Poker Research Group

Introduction

Data Biased Robust Counter Strategies

In a competitive two-player game, we have a tradeoff:
- Playing to win: taking advantage of our opponent's mistakes
- Playing well: being robust against any opponent

Our technique:
- Partially satisfies both goals
- Requires only observations of the opponent's actions

Demonstrated in the challenging game of 2-player Limit Texas Hold’em Poker

Outline

- Motivation: Playing strong Poker
- Background: Games as POMDPs
- Strategies for playing games
- Existing techniques and their problems
- Building counter-strategies from data
- Results against human opponents
- Conclusion

Motivation

University of Alberta Computer Poker Research Group
- Polaris: the world's strongest program for playing Two-Player Limit Texas Hold'em Poker
- In 2008, won competitions against top human and computer players

Our approach revolves around finding Nash equilibrium strategies for large abstractions of Poker. Nash equilibrium strategies are guaranteed not to lose, but they don't exploit opponent errors.

Games as POMDPs

Games like Poker can be thought of as repeated POMDPs:
- When cards are dealt, state transitions are drawn from a known distribution
- When the opponent acts, state transitions depend on the opponent's strategy and the opponent's cards (which we can't see)
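To make the hidden-card dependence concrete (notation is ours, not from the talk): the transition we observe when the opponent acts marginalizes their strategy over the cards we cannot see,

$$\Pr(a \mid h) \;=\; \sum_{c} \Pr(c \mid h)\,\sigma_{\text{opp}}(a \mid h, c),$$

where $h$ is the public history, $c$ ranges over the opponent's possible private cards, and $\sigma_{\text{opp}}$ is their strategy.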

Nash Equilibrium: a strategy that does well no matter how the state transitions at opponent nodes are set. But what if we have observations of the opponent's state transitions? We can do better.

Types of Strategies

Performance against a static Poker opponent, in millibets per game (mb/g):
- Game theory: Nash equilibrium. Low exploitiveness, low exploitability.
- Decision theory: Best response. High exploitiveness, high exploitability.

[Figure: Exploitation of Opponent (mb/g, 0-60) vs. Worst Case Performance (mb/g, 0-6000). The Best Response sits at high exploitation but very high worst-case loss; the Nash Equilibrium sits at low exploitation and near-zero worst-case loss.]
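To pin down the two axes (standard definitions, stated in our notation): against a fixed opponent strategy $\sigma_o$, our exploitation is our expected winnings, and our exploitability, the worst-case performance, is what a best-responding adversary wins from us:

$$\text{exploitation}(\sigma) = u(\sigma, \sigma_o), \qquad \varepsilon(\sigma) = \max_{\sigma'} u_{\text{opp}}(\sigma, \sigma').$$

A best response $\mathrm{BR}(\sigma_o) = \arg\max_{\sigma} u(\sigma, \sigma_o)$ maximizes the first quantity; a Nash equilibrium minimizes the second.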

Types of Strategies

[Figure: the same exploitation vs. worst-case performance axes, with a region of other possible strategies shown between the Best Response and Nash Equilibrium points.]

Types of Strategies

[Figure: the same axes, highlighting the Restricted Nash Response strategies: the frontier of best possible tradeoffs between exploitation and worst-case performance, running from Nash Equilibrium to Best Response.]

Restricted Nash Response

(Johanson, Zinkevich, Bowling – NIPS 2007)

Choose a value p and solve an unusual game: the opponent is forced to follow a model with probability p, and plays the game freely with probability 1-p.

[Diagram: our strategy plays the game tree against a restricted opponent who, with probability p, must follow the model and, with probability 1-p, may play any strategy in the game tree.]
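In symbols (our paraphrase of the NIPS 2007 formulation): with model $\sigma_{\text{fix}}$ and full strategy set $\Sigma$, the restricted opponent may only play mixtures

$$\Sigma_p = \{\, p\,\sigma_{\text{fix}} + (1-p)\,\sigma' \;:\; \sigma' \in \Sigma \,\},$$

and the Restricted Nash Response is our side of an equilibrium of this restricted game. At p = 0 it reduces to a Nash equilibrium; at p = 1 it is a best response to the model.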

Restricted Nash Response

Performance against a static Poker opponent, in millibets per game.

[Figure: exploitation (mb/g) vs. worst-case performance (mb/g) for Restricted Nash Response counter-strategies. Each datapoint shows a different value of p: p = 0 is the Nash equilibrium at near-zero exploitability, and as p rises through 0.5, 0.7, 0.8, 0.9, 0.93, 0.97, and 0.99 to 1, exploitation climbs toward the Best Response at the cost of much higher exploitability.]

Restricted Nash Response with Models

Restricted Nash requires the opponent's actual strategy. Can we use a model instead? Observe 1 million games played by the opponent.

[Diagram: a game tree with observed action frequency counts at each opponent information set, e.g. 4/10 vs. 6/10 holding 2♦2♥, 1/4 vs. 3/4 holding K♦K♥, and 0/0 at unobserved nodes.]

- Do frequency counts on opponent actions taken at information sets
- The model assumes the opponent takes actions with the observed frequencies
- Need a default policy when there are no observations (see the sketch below)
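A minimal sketch of this kind of frequency-count model (class, method, and infoset-key names are ours, as is the uniform default policy; the talk only notes that some default is needed):

```python
from collections import defaultdict

class FrequencyModel:
    """Frequency-count opponent model over observed information sets."""

    def __init__(self, actions=("fold", "call", "raise")):
        self.actions = actions
        # counts[infoset][action] = times the opponent took `action` there
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, infoset, action):
        """Record one observed opponent action at an information set."""
        self.counts[infoset][action] += 1

    def policy(self, infoset):
        """Modeled action distribution: observed frequencies, or a default."""
        total = sum(self.counts[infoset].values())
        if total == 0:
            # Default policy for unobserved states -- uniform is our
            # assumption here, and the choice matters in practice.
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: self.counts[infoset][a] / total for a in self.actions}

# Example: after observing 4 calls and 6 raises at one information set,
# the model plays call 0.4 and raise 0.6 there.
model = FrequencyModel()
for _ in range(4):
    model.observe("2♦2♥|preflop", "call")
for _ in range(6):
    model.observe("2♦2♥|preflop", "raise")
assert model.policy("2♦2♥|preflop")["raise"] == 0.6
```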

Problems with Models

Problem 1: Overfitting to the model.

[Figure: exploitation (mb/g, 0-50) vs. worst-case performance (mb/g, 0-1000). The curve measured against the opponent model lies far above the curve measured against the actual opponent: the counter-strategy overfits the model.]

Problems with Models

Problem 2: Requires a lot of training data.

[Figure: exploitation (mb/g) vs. exploitability (mb/g) for models trained on 100, 1k, 10k, 100k, and 1m observed games. Counter-strategies built from the smaller datasets lose (exploitation well below zero); only the larger datasets yield positive exploitation.]

Problems with Models

Problem 3: The source of observations matters.

[Figure: exploitation (mb/g) vs. exploitability (mb/g) for models built from random samples of the opponent's play and from the opponent's self-play games. The random-sample model exploits the opponent; the self-play model loses badly.]

Data Biased Response

Restricted Nash had one parameter p that represented model accuracy, but:
- The model wasn't accurate in states we never observed
- The model was more accurate in some states than in others

New approach: use the model's accuracy as part of the restricted game

Data Biased Response

Let's set up another restricted game. Instead of one p value for the whole tree, we'll set one p value for each choice node, p(i):
- More observations → more confidence in the model → higher p(i)
- Set a maximum p(i) value, Pmax, which we vary to produce a range of strategies
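Written out (our paraphrase, mirroring the single-p mixture above): at each opponent choice node i, the restricted opponent plays

$$\sigma_{\text{opp}}(\cdot \mid i) = p(i)\,\sigma_{\text{fix}}(\cdot \mid i) + (1 - p(i))\,\sigma'(\cdot \mid i),$$

so the model is trusted exactly where the data supports it.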

Data Biased Response

Three examples (sketched in code below):
- 1-Step: p(i) = 0 if 0 observations; p(i) = Pmax otherwise
- 10-Step: p(i) = 0 if fewer than 10 observations; p(i) = Pmax otherwise
- 0-10 Linear: p(i) = 0 at 0 observations, p(i) = Pmax at 10 or more, growing linearly in between

By setting p(i) = 0 in unobserved states, our prior is that the opponent will play as strongly as possible
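A minimal sketch of these three confidence schedules (function and parameter names are ours):

```python
def p_value(n_observations, p_max, scheme="0-10 linear"):
    """Per-node confidence p(i) in the model, given the observation count.

    The three schedules from the talk; unobserved nodes always get p(i) = 0,
    encoding the prior that the opponent there plays as strongly as possible.
    """
    n = n_observations
    if scheme == "1-step":
        return p_max if n >= 1 else 0.0
    if scheme == "10-step":
        return p_max if n >= 10 else 0.0
    if scheme == "0-10 linear":
        return p_max * min(n, 10) / 10.0
    raise ValueError(f"unknown scheme: {scheme}")

# Example: with Pmax = 1.0, five observations give full, zero, or half
# confidence depending on the schedule.
assert p_value(5, 1.0, "1-step") == 1.0
assert p_value(5, 1.0, "10-step") == 0.0
assert p_value(5, 1.0, "0-10 linear") == 0.5
```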

DBR doesn't overfit to the model

RNR and several DBR curves:

[Figure: exploitation (mb/h) vs. exploitability (mb/h) for RNR and the 1-Step, 10-Step, and 0-10 Linear DBR variants. The 0-10 Linear and 10-Step DBR curves achieve the strongest exploitation at low exploitability; the RNR and 1-Step curves fare worse against the actual opponent.]

DBR works with fewer observations

0-10 Linear DBR curve:

[Figure: exploitation (mb/g) vs. exploitability (mb/g) for the 0-10 Linear DBR trained on 100, 1k, 10k, 100k, and 1m observed games. Exploitation stays non-negative even with few observations and grows with more data, in contrast to the RNR curves of Problem 2.]

DBR accepts any type of observation data

0-10 Linear DBR curve:

[Figure: exploitation (mb/g) vs. exploitability (mb/g) for the 0-10 Linear DBR trained on random samples and on self-play observations. Both data sources yield counter-strategies with positive exploitation, in contrast to the self-play failure of Problem 3.]

Experiments against humans

DBR trained on 1.8 million hands of human observations, playing on Poker-Academy.com:

[Figure: total winnings vs. games played over about 35,000 games. The Data Biased Response strategy's total winnings grow substantially faster than the Equilibrium strategy's.]

Conclusion

Data Biased Response technique:
- Generate a range of strategies, trading off exploitation and worst-case performance
- Take advantage of observed information
- Avoid overfitting to parts of the model that may be inaccurate

Questions?