Incentivizing Exploration
Peter Frazier, Senior Data Scientist, Uber; Associate Professor, Cornell University
Joint work with David Kempe, Bobby Kleinberg & Jon Kleinberg
Yelp wants users to explore
[Figure: a sequence of animation frames showing per-restaurant counts ticking upward as users choose where to eat.]
• Yelp wants to maximize the aggregate quality of many meals.
• It wants users to explore to find good restaurants.
• Each customer cares about only his/her own meal.
• Each wants to exploit current estimates, and go to the restaurant currently estimated to be best.
• Incentives are misaligned.
Yelp could align them with payments.
Amazon wants users to explore
• Each customer only wants to buy one good item.
• Amazon wants to maximize the total happiness of all customers.

YouTube wants users to explore
[Figure: view counts for a handful of videos.]
• Each visitor only wants to look at a few stories/videos.
• YouTube wants to maximize overall enjoyment.
We focus on these common elements
• Different options of unknown quality.
• The principal wants to maximize the total reward of all agents.
• Each agent cares only about his own immediate reward.
• The principal can use money to incentivize agents toward actions better for everyone.
Our model builds on the standard Bayesian multi-armed bandit (MAB)
• Each option (restaurant, camera, cat video) is called an "arm".
• Each arm i produces iid rewards from a distribution Γ_i.
• No one knows the Γ_i, but everyone knows that each is drawn from a known Bayesian prior distribution.
• Arm i pulled at time t produces a non-negative reward r_{t,i} ~ Γ_i.
• This reward is obtained by the agent who pulled the arm and by the principal.
• It also helps everyone (principal, agents) learn about Γ_i by updating their Bayesian posterior distributions.
• The principal wants the expected time-discounted reward, R = E[Σ_t γ^t r_{t,i(t)}], to be large.
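To make the model concrete, here is a minimal sketch of the bandit dynamics, assuming (beyond what the slides specify) Bernoulli rewards with independent Beta priors; the instance size, discount factor, and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance: 3 arms, each true mean drawn from a Beta(1, 1) prior.
K, gamma, T = 3, 0.9, 50
true_means = rng.beta(1, 1, size=K)        # the unknown Gamma_i (here: Bernoulli means)
alpha, beta = np.ones(K), np.ones(K)       # shared Beta posterior over each arm's mean

discounted_reward = 0.0
for t in range(T):
    i = int(np.argmax(alpha / (alpha + beta)))   # placeholder arm choice (here: myopic)
    r = rng.random() < true_means[i]             # non-negative reward r_{t,i} ~ Gamma_i
    discounted_reward += gamma**t * r            # contributes to R = E[sum_t gamma^t r_{t,i(t)}]
    alpha[i] += r                                # Bayesian posterior update from this pull
    beta[i] += 1 - r
```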
If the principal pulled the arms, the Gittins index policy would be optimal
• The Gittins index policy (Gittins & Jones 1974) computes an index for each arm i, which depends on the discount factor γ and on our posterior belief about Γ_i.
• The Gittins index policy then pulls the arm with the largest index.
• This would maximize R = E[Σ_t γ^t r_{t,i(t)}] in a standard Bayesian MAB.
• The index is larger for arms about which we have more uncertainty (exploration), and which produce large rewards according to the current posterior (exploitation).
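The slides do not spell out how the index is computed; as a rough illustration, here is one standard way to approximate the Gittins index of a single Beta-Bernoulli arm, bisecting on the value of a retirement option. The Bernoulli/Beta assumption and the truncation depth are mine, not from the talk.

```python
from functools import lru_cache

def gittins_index_bernoulli(alpha, beta, gamma, depth=100, tol=1e-4):
    """Approximate Gittins index of a Bernoulli arm with a Beta(alpha, beta)
    posterior: the index is the per-period retirement payment lam at which we
    are indifferent between retiring and pulling once more, then acting optimally."""

    def pull_then_continue(lam):
        # Value of pulling once, then behaving optimally against a retirement
        # option paying lam forever; the recursion is truncated at `depth` pulls.
        @lru_cache(maxsize=None)
        def V(a, b, d):
            retire = lam / (1.0 - gamma)
            if d >= depth:
                return retire
            p = a / (a + b)
            cont = p * (1.0 + gamma * V(a + 1, b, d + 1)) + \
                   (1.0 - p) * gamma * V(a, b + 1, d + 1)
            return max(retire, cont)
        p = alpha / (alpha + beta)
        return p * (1.0 + gamma * V(alpha + 1, beta, 1)) + \
               (1.0 - p) * gamma * V(alpha, beta + 1, 1)

    lo, hi = 0.0, 1.0   # Bernoulli rewards live in [0, 1]
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if pull_then_continue(lam) > lam / (1.0 - gamma):
            lo = lam    # pulling still beats retiring: the index exceeds lam
        else:
            hi = lam
    return 0.5 * (lo + hi)
```

For a uniform Beta(1, 1) prior and γ = 0.9, this comes out well above the arm's posterior mean of 0.5, which is exactly the exploration bonus the slide describes.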
The principal doesn't pull the arms. Selfish myopic agents do.
• An agent arrives at time t, pulls one arm, gets the reward, and leaves forever.
• H_{t,i}: history of outcomes of pulls of arm i up to time t.
• The principal can offer payments c_{t,i} for pulling specific arms.
• The agent pulls the arm maximizing E[r_{t,i} | H_{t,i}] + c_{t,i}.
• To get the agent to pull arm j, the principal offers c_{t,j} = max_i E[r_{t,i} | H_{t,i}] - E[r_{t,j} | H_{t,j}] for arm j, and c_{t,i} = 0 for all other i.
• The principal wants the expected total discounted incentive payment, C = E[Σ_t γ^t c_{t,i(t)}], to be small.
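A minimal sketch of this interaction in code (the array names are mine; ties are broken arbitrarily):

```python
import numpy as np

def incentive_for_arm(posterior_means, j):
    """Payment that makes a myopic agent willing to pull arm j:
    c_{t,j} = max_i E[r_{t,i} | H_{t,i}] - E[r_{t,j} | H_{t,j}]."""
    return max(posterior_means) - posterior_means[j]

def myopic_pull(posterior_means, payments):
    """A selfish agent pulls the arm maximizing posterior mean plus offered payment."""
    return int(np.argmax(np.asarray(posterior_means) + np.asarray(payments)))

# Example: to steer the agent toward arm 1, the principal pays the gap in posterior means.
means = [0.6, 0.4, 0.3]
c = [0.0, incentive_for_arm(means, 1), 0.0]   # = [0.0, 0.2, 0.0]
assert myopic_pull(means, c) in (0, 1)        # the agent is exactly indifferent at this payment
```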
We answer three business questions
1. "Right now we are not incentivizing exploration. Does the extra reward we'll get justify worrying about this?"
2. "If I decide to incentivize exploration, how much will we have to pay out?"
3. "If I decide to incentivize exploration, what is a simple strategy we can use that won't be too much effort, and will hit your promised reward / payout?"
We answer three business questions
1. Given an incentive payment budget, how much reward does an optimal strategy generate?
2. Given a required level of reward, how much incentive will I need under an optimal strategy?
3. Find a simple (not necessarily optimal) strategy that hits these targets.
The answers depend on the problem instance
We develop bounds that apply over all problem instances
1. We provide a tight lower bound on the reward gained by an optimal strategy, as a function of the incentive payment budget.
2. We provide a tight upper bound on the incentive payment required by an optimal strategy, as a function of the reward obtained.
3. We give an implementable strategy that achieves these bounds in all problems.
Let's define "achievable" before seeing the main theorem
Definition: Let OPT be the value of the optimal policy (the Gittins index policy) if the principal itself pulled the arms. OPT depends on the problem instance.
Definition: We say that the principal's policy π achieves loss pair (a,b) for a problem instance if:
• Opportunity Cost = (OPT - R)/OPT ≤ a, i.e., Reward R ≥ (1-a)*OPT
• Incentive Cost = C/OPT ≤ b, i.e., Payment C ≤ b*OPT
Definition: (a,b) is achievable if, for every instance, there is a policy that achieves loss pair (a,b).
[Plot: axes a = Opportunity Cost = 1 - Reward/OPT and b = Incentive Cost = Payment/OPT, each ranging from 0 to 1.]
Our main theorem characterizes the achievable loss pairs
Main Theorem: Loss pair (a,b) is:
• achievable if √a + √b > √γ;
• unachievable if √a + √b < √γ.
[Plot: in the unit square with a = Opportunity Cost = 1 - Reward/OPT and b = Incentive Cost = Payment/OPT, the curve √a + √b = √γ separates the achievable region (away from the origin) from the unachievable region (near the origin).]
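The theorem's condition is a one-line check; here is a small sketch (the function name is mine):

```python
import math

def is_achievable(a, b, gamma):
    """True if loss pair (a, b) is achievable for discount factor gamma,
    i.e. sqrt(a) + sqrt(b) > sqrt(gamma); the boundary case is left open
    by the theorem, so it is not counted as guaranteed here."""
    return math.sqrt(a) + math.sqrt(b) > math.sqrt(gamma)

print(is_achievable(0.25, 0.25, 0.99))  # True: sqrt(.25) + sqrt(.25) = 1 > sqrt(.99)
print(is_achievable(0.10, 0.50, 1.00))  # True: ~1.02 > 1
```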
Here's how you use the theorem to answer business question #1
Step 1: Calculate OPT (look at historical data over items, estimate a prior and likelihood for the arm reward distributions, and run the Gittins index policy in simulation), or take a guess at what it would be if you calculated it carefully.
Step 2: b = Incentive Budget / OPT
Step 3: a = (√γ - √b)² + ε (in practice, set ε = 0)
Step 4: The theorem tells us (a,b) is achievable, so there is a policy respecting the budget that has reward (1-a)*OPT.
(A code sketch covering this recipe and the next one appears after business question #2.)
Here's how you use the theorem to answer business question #2
Step 1: Calculate OPT as in question #1, or take a guess at what it would be if you calculated it carefully.
Step 2: a = 1 - Required Reward / OPT
Step 3: b = (√γ - √a)² + ε (in practice, set ε = 0)
Step 4: The theorem tells us (a,b) is achievable, so there is a policy that attains the required reward (1-a)*OPT while paying out at most b*OPT.
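The two recipes above come down to the formulas in Steps 2-3. Here is a minimal sketch with ε = 0, clamping at zero when the budget (or required reward) already exceeds what is needed; the function names and example numbers are mine:

```python
import math

def reward_guarantee(opt, gamma, incentive_budget):
    """Business question #1: reward an optimal strategy can guarantee,
    given an incentive payment budget (same units as OPT)."""
    b = incentive_budget / opt
    a = max(math.sqrt(gamma) - math.sqrt(b), 0.0) ** 2   # Step 3 with epsilon = 0
    return (1.0 - a) * opt

def payment_bound(opt, gamma, required_reward):
    """Business question #2: incentive payment sufficient to guarantee a required reward."""
    a = max(1.0 - required_reward / opt, 0.0)
    b = max(math.sqrt(gamma) - math.sqrt(a), 0.0) ** 2   # Step 3 with epsilon = 0
    return b * opt

# Example with made-up numbers: OPT = 100, gamma = 0.95.
print(reward_guarantee(100.0, 0.95, incentive_budget=25.0))  # ~77.5
print(payment_bound(100.0, 0.95, required_reward=90.0))      # ~43.4
```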
If you don't like formulas, we also have sound bites.
The worst-case γ is γ = 1. If you check the formula, you'll see that:
• (0.25, 0.25) is achievable for all γ (in practice): √0.25 + √0.25 = 1 ≥ √γ, with equality only at γ = 1.
• (0.1, 0.5) is achievable for all γ (in theory and in practice): √0.1 + √0.5 ≈ 1.02 > 1 ≥ √γ.
This means:
• You can always get a reward of 75% of OPT if you pay 25% of OPT.
• You can always get a reward of 90% of OPT if you pay 50% of OPT.
Let's look at the proof. It has two parts.
• Part 1: If √a + √b < √γ, then (a,b) is not achievable.
  - Proof: Look at worst-case instances, called "Diamonds in the Rough".
• Part 2: If √a + √b > √γ, then (a,b) is achievable.
  - Proof: Analyze "time-expanded" policies.
  - Part 2 answers business question #3.
Part 1 uses a worst-case instance: Diamonds in the Rough
• One arm (the "safe" arm) has known constant reward c ≥ (1-γ)².
• Infinitely many other arms are "collapsing": the reward is a constant M(1-γ)², with M → ∞, with tiny probability 1/M ("diamonds"), and 0 otherwise ("duds").
• The optimal policy plays collapsing arms until it finds a diamond. Total reward is 1.
• The myopic policy plays the safe arm. Total reward is c/(1-γ).
[Figure: bar chart of per-pull rewards, c for the safe arm vs. M(1-γ)² for a diamond.]
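As a sanity check on the last two bullets, here is a small sketch evaluating the two values in closed form; the geometric-series derivation for the exploring policy is mine, not from the talk:

```python
def myopic_value(c, gamma):
    """Discounted value of always playing the safe arm with known reward c."""
    return c / (1.0 - gamma)

def exploring_value(gamma, M):
    """Discounted value of pulling fresh collapsing arms until a diamond is found,
    then playing it forever.  Each fresh arm is a diamond w.p. 1/M and then pays
    M*(1-gamma)^2 per period; summing the geometric series over the (discounted)
    time until the first diamond gives the closed form below."""
    value_once_found = M * (1.0 - gamma) ** 2 / (1.0 - gamma)    # = M * (1 - gamma)
    return (1.0 / M) * value_once_found / (1.0 - gamma * (1.0 - 1.0 / M))

gamma = 0.95
print(myopic_value((1 - gamma) ** 2, gamma))   # = 1 - gamma = 0.05 when c is at its minimum
print(exploring_value(gamma, M=10**6))         # -> 1 as M grows, matching "total reward is 1"
```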
Part 1 uses a worst-case instance: Diamonds in the Rough
Want to show: loss pair (a,b) is not achievable if √a + √b < √γ.
• Pick a value of c.
• Calculate IncentiveCost(Optimal) and OpportunityCost(Myopic).
• Let L(π,λ) = IncentiveCost(π) + λ * OpportunityCost(π), where λ is chosen so that L(Optimal,λ) = L(Myopic,λ).
• Consider min_π L(π,λ) = L*. The minimum is attained by randomizations between the only two non-dominated stationary policies, Optimal and Myopic; these randomizations all have value L*.
• No policy achieves a value of L(π,λ) below L*, so no loss pair (a,b) with b + λa < L* is achievable; this is the white region below the corresponding line in the plot of b = Incentive Cost against a = Opportunity Cost.
Part 1 uses a worst-case instance: Diamonds in the Rough
• Do this for many values of c.
• Each value of c identifies a different subregion of the unachievable region.
• Direct calculation shows that the union of these unachievable subregions is {(a,b) : √a + √b < √γ}.
Part 2 also uses a Lagrangian relaxation
Want to show: loss pair (a,b) is achievable if √a + √b > √γ.
• Consider an arbitrary problem instance.
• Suppose, for contradiction, that (a,b) is unachievable and √a + √b > √γ.
• The achievable region is convex, so there is a line through (a,b) such that the achievable region lies on one side of it.
• The lines parallel to it are the level curves of L(π,λ), for some λ.
• We saw that in Diamonds in the Rough with the corresponding λ, we could achieve L*(λ) = b(a) + λa, where √a + √(b(a)) = √γ.
• Can any problem have a worse (bigger) L* than this?
Part 2 studies the relaxed problem using time-expanded policies
• To complete the proof, we show that for an arbitrary problem instance, min_π L(π,λ) = min_π IncentiveCost(π) + λ * OpportunityCost(π) is no worse than L*(λ), the value of the Diamonds in the Rough instance corresponding to λ.
• To do so, we define time-expanded policies, and show that a particular time-expanded policy achieves L(π,λ) no worse than L*(λ).
• This also answers business question #3: to find a simple policy achieving a desired (a,b) on the efficient frontier, use the time-expanded policy with the corresponding λ. [We also need to show that it satisfies the budget constraint.]
Part 2 studies the relaxed problem using time-expanded policies
• Want to show: min_π IncentiveCost(π) + λ * OpportunityCost(π) is no worse than in Diamonds in the Rough.
• The time-expansion of policy π with parameter λ, TE(π,λ), plays as follows on each iteration (a code sketch follows):
  - With probability p = p(λ), offer no incentive payment, so the agent plays myopically. Store the observation for later, but do not let π look at it.
  - With probability 1 - p, incentivize the agent to play the arm recommended by π. Give π the next observation from that arm [it may be from a previous pull].
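Here is a minimal sketch of one round of TE(π,λ). The interface is mine, not the talk's: `pi` is a recommender with `recommend()` and `observe(arm, reward)`, `posterior_means` holds the current per-arm posterior means, `pending[i]` is a queue (e.g. a `collections.deque`) of observations of arm i not yet shown to `pi`, and `p_lambda` stands for the mixing probability p(λ), whose formula the slides do not give.

```python
import random

def te_round(pi, posterior_means, pending, p_lambda, pull_arm):
    """One iteration of the time-expanded policy TE(pi, lambda).

    pull_arm(i) performs the actual pull of arm i and returns the realized reward.
    Returns the incentive payment made this round."""
    if random.random() < p_lambda:
        # Myopic round: no payment, so the agent picks the arm with the highest
        # posterior mean; store the observation but do not show it to pi.
        i = max(range(len(posterior_means)), key=lambda k: posterior_means[k])
        pending[i].append(pull_arm(i))
        return 0.0
    # Incentivized round: pay the agent enough to pull pi's recommended arm, then
    # feed pi the next unseen observation from that arm (possibly one that was
    # stored earlier, during a myopic round).
    j = pi.recommend()
    payment = max(posterior_means) - posterior_means[j]
    pending[j].append(pull_arm(j))
    pi.observe(j, pending[j].popleft())
    return payment
```

Posterior means would be updated after every pull; per the next slide, p(λ) is chosen so that the rewards collected on the myopic rounds cancel the incentive payments made on the π rounds.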
Here's how we analyze time-expanded policies
• p(λ) is constructed so that the rewards from myopic pulls cancel the incentive payments from the π pulls.
• The Lagrangian value L(TE(π,λ)) equals -Reward_η(π) (up to a linear transformation), the reward computed under a different discount factor η ≤ γ.
• To minimize L(TE(π,λ)), we maximize Reward_η(π). We choose π to be the Gittins index policy for the discount factor η.
• The crux of this analysis is bounding OPT_η / OPT_γ.
Here's how we analyze time-expanded policies
• Theorem: For any MAB instance and any η ≤ γ, OPT_η / OPT_γ ≥ (1 - γ)² / (1 - η)², with equality for the Diamonds in the Rough instance.
• This says: the Gittins index policy degrades slowly with the discount factor.
• After some pencil and paper calculation, we get…
• Corollary: L(TE(OPT_η, λ)) ≤ L*(λ).
This completes the proof of part 2
• We showed that the time-expanded policy's Lagrangian value satisfies L(TE(OPT_η, λ)) ≤ L*(λ), and so is no worse than the Diamonds in the Rough example.
• Thus, there is an achievable point below the supposedly unachievable line.
• The contradiction shows all (a,b) with √a + √b > √γ are achievable.
Let me summarize the theory, and comment on practice
• We characterized the set of Incentive Costs (b*OPT) and Opportunity Costs (a*OPT) for which we can guarantee the existence of a policy that achieves them, regardless of the problem instance.
• Loss pair (a,b) is achievable if √a + √b ≥ √γ.
• We also provided an implementable policy that achieves this.
• What else do we need to do to successfully incentivize exploration in practice?
Here are some applications in which we could consider incentivizing exploration
These things happen in the real world, but are missing from our model
• Agents may have heterogeneous preferences, priors, and utility for money.
• Preferences, priors, and utility for money may be unknown.
• Agents may have repeated interactions.
• Agents may observe aspects of the interaction unobserved by the principal.
• Information sharing may be incomplete.
• We may not be able to incentivize with money.
• Quality may vary over time.
• Random constraints on the agents' behavior may lead them to explore.
• The items being explored (restaurants, products) may also have interests, and may promote themselves.
There are lots of interesting questions out there!
Thanks!