Data Biased Robust Counter Strategies
Michael Johanson, Michael Bowling
April 14, 2009
University of Alberta Computer Poker Research Group
Introduction
Data Biased Robust Counter Strategies
In a competitive two-player game, we have a tradeoff:
- Playing to win: taking advantage of our opponent's mistakes
- Playing well: being robust against any opponent
Our technique:
- Partially satisfies both goals
- Requires only observations of the opponent's actions
Demonstrated in the challenging game of 2-player Limit Texas Hold’em Poker
Outline
- Motivation: Playing strong Poker
- Background: Games as POMDPs
- Strategies for playing games
- Existing techniques and their problems
- Building Counter-Strategies from Data
- Results against human opponents
- Conclusion
Motivation
University of Alberta Computer Poker Research Group
- Polaris: the world's strongest program for playing Two-Player Limit Texas Hold'em Poker
- In 2008, won competitions against top human and computer players
Our approach:
- Revolves around being able to find Nash equilibrium strategies for large abstractions of Poker
- Nash equilibrium strategies are guaranteed (in expectation) to not lose, but don't exploit opponent errors
Games as POMDPs
Games like Poker can be thought of as repeated POMDPs:
- When cards are dealt, state transitions are drawn from a known distribution
- When the opponent acts, state transitions depend on:
  - Opponent's strategy
  - Opponent's cards (that we can't see)
- Nash Equilibrium: a strategy that does well no matter how the state transitions at opponent nodes are set
- But what if we have observations of the opponent's state transitions? We can do better.
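To make the contrast concrete, here is a minimal Python sketch (illustrative only, not code from the talk) of the two ways to evaluate an opponent decision node: the worst-case evaluation behind a Nash-style strategy, and an evaluation that weights the opponent's actions by observed frequencies.

```python
# Illustrative sketch (not the authors' code): evaluating an opponent
# decision node two ways.  "children" maps each opponent action to the
# value we receive if that action is taken.

def worst_case_value(children):
    """Nash-style: assume the opponent chooses whatever hurts us most."""
    return min(children.values())

def model_value(children, observed_counts):
    """Counter-strategy: weight each action by its observed frequency."""
    total = sum(observed_counts[a] for a in children)
    return sum(observed_counts[a] / total * v for a, v in children.items())

# Example: the opponent folded 7 times and raised 3 times at this node.
children = {"fold": +1.0, "raise": -2.0}
counts = {"fold": 7, "raise": 3}
print(worst_case_value(children))     # -2.0: prepare for the worst
print(model_value(children, counts))  #  0.1: exploit the observed tendency
```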
Types of Strategies
Performance against a static Poker opponent, in millibets per game:
- Game Theory: Nash equilibrium. Low exploitiveness, low exploitability
- Decision Theory: Best response. High exploitiveness, high exploitability
[Plot: Exploitation of Opponent (mb/g) vs. Worst Case Performance (mb/g); Nash Equilibrium sits at low exploitation and low exploitability, Best Response at high exploitation and high exploitability]
Types of Strategies
Performance against a static Poker opponent, in millibets per game:
[Plot: same axes; the region between Nash Equilibrium and Best Response contains other possible strategies]
Types of Strategies
Performance against a static Poker opponent, in millibets per game:
[Plot: same axes; Restricted Nash Response strategies form the set of best possible tradeoffs between Nash Equilibrium and Best Response]
Restricted Nash Response
(Johanson, Zinkevich, Bowling – NIPS 2007)
- Choose a value p and solve an unusual game
- Opponent is forced to follow a model with probability p
[Diagram: our strategy plays the game tree against a restricted opponent that follows the fixed model with probability p and plays its own unrestricted strategy over the game tree with probability 1-p]
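Here is a minimal sketch of the restricted game on a toy matrix game, assuming NumPy and illustrative values for p and the model (this is not the authors' solver, which works on the full poker game tree): with probability p our payoff is evaluated against the fixed model, and with probability 1-p against an opponent free to punish us, and we choose our strategy to maximize that mixed value.

```python
import numpy as np

# Toy zero-sum matrix game (rows: our actions, cols: opponent actions).
# Payoffs are to us.
A = np.array([[ 1.0, -2.0],
              [-1.0,  3.0]])

model = np.array([0.8, 0.2])   # observed opponent action frequencies (assumed)
p = 0.75                       # illustrative confidence in the model

def value_of_row_strategy(x, p, A, model):
    """Value of our mixed strategy x against a restricted opponent:
    with prob. p the opponent plays the model, with prob. 1 - p it plays
    its best response to x (the column minimizing our value)."""
    vs_model = x @ A @ model
    vs_worst = min(x @ A)
    return p * vs_model + (1 - p) * vs_worst

# Crude search over our mixing weight; a real solver (e.g. CFR or an LP)
# would be used in the full game.
value, weight = max(
    (value_of_row_strategy(np.array([w, 1 - w]), p, A, model), w)
    for w in np.linspace(0, 1, 101)
)
print(f"restricted-game value {value:.3f} with weight {weight:.2f} on our first action")
```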
Restricted Nash Response
Performance against a static Poker opponent, in millibets per game:
[Plot: Exploitation of Opponent (mb/g) vs. Worst Case Performance (mb/g); each datapoint shows a different value of p, from p = 0 (Nash Equilibrium) through 0.5, 0.7, 0.8, 0.9, 0.93, 0.97, 0.99 to p = 1 (Best Response)]
Restricted Nash Response with Models
- Restricted Nash requires the opponent's strategy. Can we use a model instead?
- Observe 1 million games played by the opponent
- Do frequency counts on opponent actions taken at information sets (e.g., 4/10 and 6/10 holding 2♦2♥, 1/4 and 3/4 holding K♦K♥, 0/0 at unobserved sets)
- Model assumes opponent takes actions with observed frequencies
- Need a default policy when there are no observations (a sketch follows below)
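As a rough illustration of the frequency-count model just described (the class, the information-set labels, and the uniform default policy are assumptions for the sketch, not the authors' implementation):

```python
from collections import defaultdict

class FrequencyModel:
    """Opponent model built from observed (information set, action) pairs."""

    def __init__(self, actions, default_policy=None):
        self.actions = actions
        self.counts = defaultdict(lambda: {a: 0 for a in actions})
        # Fallback for information sets that were never observed.
        self.default_policy = default_policy or {a: 1.0 / len(actions) for a in actions}

    def observe(self, infoset, action):
        self.counts[infoset][action] += 1

    def policy(self, infoset):
        c = self.counts.get(infoset)
        if c is None or sum(c.values()) == 0:
            return dict(self.default_policy)
        total = sum(c.values())
        return {a: n / total for a, n in c.items()}

# Example observations at one information set: 4 calls and 6 raises out of 10.
model = FrequencyModel(actions=["fold", "call", "raise"])
for _ in range(4):
    model.observe("preflop:2d2h", "call")
for _ in range(6):
    model.observe("preflop:2d2h", "raise")
print(model.policy("preflop:2d2h"))    # {'fold': 0.0, 'call': 0.4, 'raise': 0.6}
print(model.policy("unseen-infoset"))  # default policy: uniform over actions
```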
Problems with Models
Problem 1: Overfitting to the model
[Plot: Performance against opponent (mb/g) vs. Worst case performance (mb/g); performance against the Opponent Model keeps rising as worst-case performance is traded away, but performance against the Actual Opponent does not follow]
Problems with Models
Problem 2: Requires a lot of training data
[Plot: Exploitation (mb/g) vs. Exploitability (mb/g) for models trained on 100, 1k, 10k, 100k, and 1m observed games; only the largest datasets yield positive exploitation, and the smaller ones lose to the actual opponent]
Problems with Models
Problem 3: Source of observations matters
[Plot: Exploitation (mb/g) vs. Exploitability (mb/g); a model trained on Random Samples performs far better than one trained on Self-Play observations]
Data Biased Response
Restricted Nash had one parameter p that represented model accuracy, but...
- Model wasn't accurate in states we never observed
- Model was more accurate in some states than in others
New approach: Use model’s accuracy as part of the restricted game
Data Biased Response
Let's set up another restricted game:
- Instead of one p value for the whole tree, we'll set one p value for each choice node, p(i)
- More observations → more confidence in the model → higher p(i)
- Set a maximum p(i) value, Pmax, that we vary to produce a range of strategies
Data Biased Response
Three examples:
- 1-Step: p(i) = 0 if 0 observations, p(i) = Pmax otherwise
- 10-Step: p(i) = 0 if fewer than 10 observations, p(i) = Pmax otherwise
- 0-10 Linear: p(i) = 0 if 0 observations, p(i) = Pmax if 10 or more, and p(i) grows linearly in between
By setting p(i) = 0 in unobserved states, our prior is that the opponent will play as strongly as possible
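A minimal sketch of the three confidence schedules above, assuming p(i) depends only on the number of observations at information set i (function names and the example Pmax are illustrative):

```python
def p_one_step(n_obs, p_max):
    """1-Step: trust the model fully (up to Pmax) after a single observation."""
    return 0.0 if n_obs == 0 else p_max

def p_ten_step(n_obs, p_max):
    """10-Step: require at least 10 observations before trusting the model."""
    return 0.0 if n_obs < 10 else p_max

def p_linear_0_10(n_obs, p_max):
    """0-10 Linear: confidence grows linearly from 0 to Pmax over 10 observations."""
    return p_max * min(n_obs, 10) / 10.0

# Example: the weight each schedule gives the model at a node observed 4 times.
for f in (p_one_step, p_ten_step, p_linear_0_10):
    print(f.__name__, f(4, p_max=0.8))
```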
DBR doesn't overfit to the model
RNR and several DBR curves:
[Plot: Exploitation (mb/h) vs. Exploitability (mb/h) for RNR and for 1-Step, 10-Step, and 0-10 Linear DBR; the 0-10 Linear and 10-Step DBR curves achieve the highest exploitation, while the RNR and 1-Step curves trail]
DBR works with fewer observations
0-10 Linear DBR curve:
[Plot: Exploitation (mb/g) vs. Exploitability (mb/g) for models trained on 100, 1k, 10k, 100k, and 1m observed games; even the smallest datasets yield positive exploitation]
DBR accepts any type of observation data
0-10 Linear DBR curve:
[Plot: Exploitation (mb/g) vs. Exploitability (mb/g); curves trained on Random Samples and on Self-Play observations both achieve positive exploitation]
Experiments against humans
DBR trained on 1.8 million hands of human observations, playing on Poker-Academy.com:
[Plot: Total winnings vs. games played (0 to 35,000) for the Data Biased Response counter-strategy and an Equilibrium strategy; the DBR strategy's total winnings grow faster]
Conclusion
Data Biased Response technique:
- Generate a range of strategies, trading off exploitation and worst-case performance
- Take advantage of observed information
- Avoid overfitting to parts of the model that may be inaccurate
Questions?