Incentivizing Exploration

Peter Frazier
Senior Data Scientist, Uber
Associate Professor, Cornell University

joint work with: David Kempe, Bobby Kleinberg & Jon Kleinberg

Yelp wants users to explore

10
 35
 14
 2
 4

9
 25
 18
 1
 1

1
 0
 1
 1
 0

Yelp wants users to explore

13
 35
 14
 2
 1

9
 25
 18
 1
 1

1
 0
 1
 1
 0

Yelp wants users to explore

13
 35
 15
 2
 1

9
 25
 18
 1
 1

1
 0
 1
 1
 0

Yelp wants users to explore

13
 35
 15
 2
 1

9
 25
 18
 1
 1

1
 0
 1
 1
 0

Yelp wants users to explore

13
 36
 15
 2
 1

9
 25
 18
 1
 1

1
 0
 1
 1
 0

Yelp wants users to explore

13
 36
 15
 2
 1

9
 25
 18
 1
 1

1
 0
 1
 1
 0

Yelp wants users to explore

13
 36
 15
 2
 1

9
 25
 18
 1
 1

1
 0
 1
 1
 0

Yelp wants users to explore

13
 36
 15
 2
 1

9
 25
 18
 1
 1

2
 0
 1
 1
 0

Yelp wants users to explore

• Yelp wants to maximize the aggregate quality of many meals.
• It wants users to explore to find good restaurants.
• Each customer cares only about his/her own meal.
• Each wants to exploit current estimates and go to the restaurant currently estimated to be best.
• Incentives are misaligned.

Yelp could align them with payments.

Amazon wants users to explore

• Each customer only wants to buy one good item.
• Amazon wants to maximize the total happiness of all customers.

YouTube wants users to explore

[Figure: video thumbnails with view counts.]

• Each visitor only wants to look at a few stories/videos.
• YouTube wants to maximize overall enjoyment.

We focus on these common elements

• Different options of unknown quality.
• The principal wants to maximize the total reward of all agents.
• Each agent cares only about his own immediate reward.
• The principal can use money to incentivize agents toward actions better for everyone.

Our model builds on the standard Bayesian multi-armed bandit (MAB)

• Each option (restaurant, camera, cat video) is called an "arm".
• Each arm i produces iid rewards from a distribution Γ_i.
• No one knows the Γ_i. But everyone knows that each is drawn from a known Bayesian prior distribution.
• Arm i pulled at time t produces a non-negative reward r_{t,i} ~ Γ_i.
• This reward is obtained by the agent and the principal.
• It also helps everyone (principal, agents) learn about Γ_i by updating their Bayesian posterior distributions.

The principal wants the expected time-discounted reward, R = E[Σ_t γ^t r_{t,i(t)}], to be large.
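To make the model concrete, here is a minimal sketch (not from the talk) of such a bandit, assuming Bernoulli arms with Beta priors; the class name, the uniform Beta(1,1) prior, and the horizon are my own illustrative choices. The later sketches plug into this interface.

```python
import random

class BetaBernoulliBandit:
    """Minimal Bayesian MAB sketch: Bernoulli arms with Beta priors.
    The true means (the Gamma_i) are drawn from the prior and hidden;
    everyone shares the same posterior, updated after each pull."""

    def __init__(self, n_arms, gamma, seed=0):
        self.rng = random.Random(seed)
        self.gamma = gamma                                  # discount factor
        self.true_means = [self.rng.betavariate(1, 1) for _ in range(n_arms)]
        self.posteriors = [[1, 1] for _ in range(n_arms)]   # Beta(successes+1, failures+1)

    def posterior_mean(self, i):
        a, b = self.posteriors[i]
        return a / (a + b)

    def pull(self, i):
        r = 1.0 if self.rng.random() < self.true_means[i] else 0.0
        a, b = self.posteriors[i]
        self.posteriors[i] = [a + r, b + (1 - r)]
        return r

def discounted_reward(bandit, choose_arm, horizon=200):
    """R = E[sum_t gamma^t r_{t, i(t)}], estimated along one sample path;
    choose_arm(bandit, t) is whatever policy is under study."""
    total, discount = 0.0, 1.0
    for t in range(horizon):
        i = choose_arm(bandit, t)
        total += discount * bandit.pull(i)
        discount *= bandit.gamma
    return total
```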

If the principal pulled the arms, the Gittins index policy 
 would be optimal

• The Gittins index policy (Gittins & Jones 1974) computes an index for each arm i, which depends on the discount factor γ and our posterior belief on Γ_i.
• The Gittins index policy then pulls the arm with the largest index.
• This would maximize R = E[Σ_t γ^t r_{t,i(t)}] in a standard Bayesian MAB.
• The index is larger for arms about which we have more uncertainty (exploration), and which produce large rewards according to the current posterior (exploitation).
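As a concrete illustration (my sketch, not the talk's code), the index of a Bernoulli arm with a Beta posterior can be approximated by the standard calibration idea: find the known safe reward λ at which we are indifferent between retiring to the safe arm and playing the risky arm, here via bisection over a depth-truncated dynamic program.

```python
def gittins_index_bernoulli(a0, b0, gamma, depth=60, tol=1e-4):
    """Approximate Gittins index (per-pull reward rate) of a Bernoulli arm with
    Beta(a0, b0) posterior and discount factor gamma < 1, by calibration against
    a safe arm of known reward lam.  The lookahead is truncated at `depth` pulls,
    so this is an approximation."""

    def prefers_risky(lam):
        retire = lam / (1.0 - gamma)      # value of switching to the safe arm forever
        memo = {}

        def value(a, b, d):
            # Optimal value with Beta(a, b) posterior and d lookahead pulls left,
            # when we may retire to the safe arm at any time.
            if d == 0:
                return retire
            key = (a, b, d)
            if key not in memo:
                p = a / (a + b)           # posterior mean of the risky arm
                cont = p * (1.0 + gamma * value(a + 1, b, d - 1)) + \
                       (1.0 - p) * gamma * value(a, b + 1, d - 1)
                memo[key] = max(retire, cont)
            return memo[key]

        return value(a0, b0, depth) > retire + 1e-12

    lo, hi = 0.0, 1.0                     # Bernoulli rewards lie in [0, 1]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prefers_risky(mid):
            lo = mid                      # risky arm still preferred: index exceeds mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, `gittins_index_bernoulli(1, 1, 0.9)` approximates the index of a fresh arm under a uniform prior; the index policy then pulls the arm whose index is largest.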

The principal doesn’t pull the arms. Selfish myopic agents do.

• An agent arrives at time t, pulls one arm, gets the reward, and leaves forever.
• H_{t,i}: history of outcomes of arm-i pulls up to time t.
• The principal can offer payments c_{t,i} for pulling specific arms.
• The agent pulls the arm maximizing E[r_{t,i} | H_{t,i}] + c_{t,i}.
• To get the agent to pull arm j, the principal offers c_{t,j} = max_i E[r_{t,i} | H_{t,i}] - E[r_{t,j} | H_{t,j}] for arm j, and c_{t,i} = 0 for all other i.
• The principal wants the expected total discounted incentive payment, C = E[Σ_t γ^t c_{t,i(t)}], to be small.
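The payment rule on this slide transcribes directly into code; a tiny sketch using the posterior means from the BetaBernoulliBandit sketch above as E[r_{t,i} | H_{t,i}]:

```python
def payment_to_steer(bandit, j):
    """Smallest payment that makes a myopic agent willing to pull arm j:
    c_{t,j} = max_i E[r_{t,i} | H_{t,i}] - E[r_{t,j} | H_{t,j}]
    (zero if arm j is already the myopically best arm)."""
    means = [bandit.posterior_mean(i) for i in range(len(bandit.posteriors))]
    return max(0.0, max(means) - means[j])
```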

We answer three business questions

1. "Right now we are not incentivizing exploration. Does the extra reward we'll get justify worrying about this?"
2. "If I decide to incentivize exploration, how much will we have to pay out?"
3. "If I decide to incentivize exploration, what is a simple strategy we can use that won't be too much effort, and will hit your promised reward / payout?"

We answer three business questions

1. Given an incentive payment budget, how much reward does an optimal strategy generate?
2. Given a required level of reward, how much incentive will I need under an optimal strategy?
3. Find a simple (not necessarily optimal) strategy that hits these targets.

The answers depend on the problem instance

1. Given an incentive payment budget, how much reward does an optimal strategy generate?
2. Given a required level of reward, how much incentive will I need under an optimal strategy?
3. Find a simple (not necessarily optimal) strategy that hits these targets.

We develop bounds that apply over all problem instances

1. We provide a tight lower bound on the reward gained by an optimal strategy, as a function of the incentive payment budget.
2. We provide a tight upper bound on the incentive payment required by an optimal strategy, as a function of the reward obtained.
3. We give an implementable strategy that achieves these bounds in all problems.

Let's define "achievable" before seeing the main theorem

Definition: Let OPT be the value of the optimal policy (the Gittins index policy), i.e., what the principal could earn if it pulled the arms itself. OPT depends on the problem instance.

Definition: We say that the principal's policy π achieves loss pair (a,b) for a problem instance if:
• Reward (R) ≥ (1-a)*OPT
• Payment (C) ≤ b*OPT.

Definition: (a,b) is achievable if, for every instance, there is a policy that achieves loss pair (a,b).

[Figure: the (a,b) plane, with a = Opportunity Cost = 1 - Reward/OPT on the horizontal axis and b = Incentive Cost = Payment/OPT on the vertical axis, each from 0 to 1.]

Let's define "achievable" before seeing the main theorem

Definition: Let OPT be the value of the optimal policy (the Gittins index policy), i.e., what the principal could earn if it pulled the arms itself. OPT depends on the problem instance.

Definition: We say that the principal's policy π achieves loss pair (a,b) for a problem instance if:
• Opportunity Cost = (OPT - R)/OPT ≤ a
• Incentive Cost = C/OPT ≤ b.

Definition: (a,b) is achievable if, for every instance, there is a policy that achieves loss pair (a,b).

[Figure: the (a,b) loss-pair plane, as above.]

Our main theorem characterizes the achievable loss pairs

Main Theorem: Loss pair (a,b) is:
  achievable if √a + √b > √γ;
  unachievable if √a + √b < √γ.

Definition: Let OPT be the value of the optimal policy (the Gittins index policy), i.e., what the principal could earn if it pulled the arms itself. OPT depends on the problem instance.

Definition: We say that the principal's policy π achieves loss pair (a,b) for a problem instance if:
• Opportunity Cost ≤ a
• Incentive Cost ≤ b.

Definition: (a,b) is achievable if, for every instance, there is a policy that achieves pair (a,b).

[Figure: the (a,b) plane split by the curve √a + √b = √γ; the region toward the origin is unachievable, the rest is achievable.]
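The theorem's condition transcribes directly (my helper; the boundary case √a + √b = √γ is not covered by the statement as given, so it is returned as None):

```python
import math

def is_achievable(a, b, gamma):
    """Main theorem: loss pair (a, b) is achievable if sqrt(a) + sqrt(b) > sqrt(gamma),
    and unachievable if the sum is smaller."""
    s = math.sqrt(a) + math.sqrt(b)
    if s > math.sqrt(gamma):
        return True
    if s < math.sqrt(gamma):
        return False
    return None  # exactly on the boundary: not covered by the theorem as stated
```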

Here's how you use the theorem to answer business question #1

Main Theorem: Loss pair (a,b) is achievable if √a + √b > √γ; unachievable if √a + √b < √γ.

Step 1: Calculate OPT (look at historical data over items, estimate a prior and likelihood for arm reward distributions, run the Gittins index policy in simulation), or take a guess at what it would be if you calculated it carefully.
Step 2: b = Incentive Budget / OPT.
Step 3: a = (√γ - √b)² + ε (in practice, set ε = 0).
Step 4: The theorem tells us (a,b) is achievable, so there is a policy respecting the budget that has reward (1-a)*OPT.

[Figure: the achievable region in the (a,b) plane.]
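The four steps collapse into a few lines; a sketch assuming you already have an OPT estimate (the max(0, ·) guard for budgets above γ·OPT is my addition):

```python
import math

def reward_guarantee(budget, opt, gamma):
    """Business question #1 (my sketch of the slide's recipe): given an incentive
    budget and an estimate of OPT, return the reward (1-a)*OPT that an optimal
    incentive scheme can guarantee, taking epsilon = 0 as the slide suggests."""
    b = budget / opt                                      # Step 2: incentive cost as a fraction of OPT
    a = max(0.0, math.sqrt(gamma) - math.sqrt(b)) ** 2    # Step 3
    return (1.0 - a) * opt                                # Step 4: guaranteed reward
```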

Here's how you use the theorem to answer business question #2

Main Theorem: Loss pair (a,b) is achievable if √a + √b > √γ; unachievable if √a + √b < √γ.

Step 1: Calculate OPT (look at historical data over items, estimate a prior and likelihood for arm reward distributions, run the Gittins index policy in simulation), or take a guess at what it would be if you calculated it carefully.
Step 2: a = 1 - Required Reward / OPT.
Step 3: b = (√γ - √a)² + ε (in practice, set ε = 0).
Step 4: The theorem tells us (a,b) is achievable, so there is a policy paying at most b*OPT that has reward (1-a)*OPT.

[Figure: the achievable region in the (a,b) plane.]
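The symmetric recipe for question #2, under the same assumptions:

```python
import math

def required_payment(required_reward, opt, gamma):
    """Business question #2 (my sketch of the slide's recipe): given a required
    reward level and an estimate of OPT, return an incentive payment b*OPT that
    suffices under an optimal scheme, again with epsilon = 0."""
    a = 1.0 - required_reward / opt                       # Step 2: opportunity cost as a fraction of OPT
    assert 0.0 <= a <= 1.0, "required reward must lie between 0 and OPT"
    b = max(0.0, math.sqrt(gamma) - math.sqrt(a)) ** 2    # Step 3
    return b * opt                                        # Step 4: sufficient incentive budget
```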

If you don't like formulas, we also have sound bites

Main Theorem: Loss pair (a,b) is achievable if √a + √b > √γ; unachievable if √a + √b < √γ.

The worst-case γ is γ = 1. If you check the formula, you'll see that:
• (0.25, 0.25) is achievable for all γ (in practice).
• (0.1, 0.5) is achievable for all γ (in theory and in practice).

This means:
• You can always get a reward of 75% of OPT if you pay 25% of OPT.
• You can always get a reward of 90% of OPT if you pay 50% of OPT.

[Figure: the achievable region in the (a,b) plane.]
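A quick check of the two sound bites against the is_achievable helper above; at the worst case γ = 1, (0.25, 0.25) sits exactly on the boundary, which is why the slide flags it as "in practice":

```python
print(is_achievable(0.25, 0.25, 1.0))   # None: sqrt(.25) + sqrt(.25) = 1 = sqrt(1), exactly on the boundary
print(is_achievable(0.10, 0.50, 1.0))   # True: sqrt(.1) + sqrt(.5) ≈ 1.02 > 1
```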

Let's look at the proof. It has two parts

• Part 1: If √a + √b < √γ, then (a,b) is not achievable.
  - Proof: Look at worst-case instances, called "Diamonds in the Rough".
• Part 2: If √a + √b > √γ, then (a,b) is achievable.
  - Proof: Analyze "time-expanded" policies.
  - Part 2 answers business question #3.

Part 1 uses a worst-case instance: 
 Diamonds in the Rough

• One arm (the "safe" arm) has known constant reward c ≥ (1-γ)².
• Infinitely many other arms are "collapsing": the reward is the constant M(1-γ)² (→ ∞ as M → ∞) with tiny probability 1/M ("diamonds"), and 0 otherwise ("duds").
• The optimal policy plays collapsing arms until we find a diamond. Total reward is 1.
• The myopic policy plays the safe arm. Total reward is c/(1-γ).

[Figure: a collapsing arm's reward distribution: probability 1/M at M(1-γ)², probability 1 - 1/M at 0; the safe arm's constant reward is c.]
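A Monte Carlo sketch of this instance (my construction from the bullets above; γ, M, the horizon, and the run count are arbitrary illustrative choices), comparing the exploring policy, whose value the slide says tends to 1, with the myopic value c/(1-γ):

```python
import random

def simulate_diamonds(gamma=0.95, c=None, M=10_000, horizon=2_000, n_runs=2_000, seed=0):
    """Monte Carlo estimate of discounted reward in the Diamonds-in-the-Rough
    instance for the exploring policy (try a fresh collapsing arm each period
    until a diamond is found, then pull it forever).  The myopic policy just
    plays the safe arm, worth exactly c / (1 - gamma)."""
    rng = random.Random(seed)
    if c is None:
        c = (1 - gamma) ** 2                 # smallest safe reward allowed by the instance
    diamond_reward = M * (1 - gamma) ** 2

    explore_total = 0.0
    for _ in range(n_runs):
        total, discount, found = 0.0, 1.0, False
        for _ in range(horizon):
            if found:
                total += discount * diamond_reward
            elif rng.random() < 1.0 / M:     # fresh collapsing arm turns out to be a diamond
                found = True
                total += discount * diamond_reward
            discount *= gamma
        explore_total += total
    return {"explore (approx OPT)": explore_total / n_runs,
            "myopic (safe arm)": c / (1 - gamma)}

# print(simulate_diamonds())  # exploring value is close to 1; myopic value is c/(1-gamma)
```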

Part 1 uses a worst-case instance: Diamonds in the Rough

Want to show: loss pair (a,b) is not achievable if √a + √b < √γ.

• Pick a value of c.
• Calculate IncentiveCost(Optimal) and OpportunityCost(Myopic).
• Let L(π,λ) = IncentiveCost(π) + λ * OpportunityCost(π), where λ is chosen so that L(Optimal,λ) = L(Myopic,λ).
• Consider min_π L(π,λ) = L*. The minimum is attained by randomizing between the only two non-dominated stationary policies, Optimal and Myopic. These randomizations all have value L*.
• No policy achieves a value of L(π,λ) below L*, so no loss pair (a,b) with b + λ*a < L* is achievable: this level line cuts off an unachievable region near the origin.

[Figure: the (a,b) plane with the level line of L(·,λ) through the Optimal and Myopic loss pairs, cutting off an unachievable region near the origin.]

Part 1 uses a worst-case instance: Diamonds in the Rough

Want to show: loss pair (a,b) is not achievable if √a + √b < √γ.

• Do this for many values of c.
• Each value of c identifies a different subregion of the unachievable region.
• Direct calculation shows that the union of these unachievable subregions is {(a,b) : √a + √b < √γ}.

[Figure: the (a,b) plane with the union of the unachievable subregions.]

Part 2 also uses a Lagrangian relaxation

Want to show: loss pair (a,b) is achievable if √a + √b > √γ.

• Consider an arbitrary problem instance.
• Suppose that (a,b) is unachievable and √a + √b > √γ.
• The achievable region is convex, so there is a line through (a,b) such that the achievable region is on one side.
• The lines parallel to it are the level curves of L(π,λ), for some λ.
• We saw that in Diamonds in the Rough with the corresponding λ, we could achieve L*(λ) = b(a) + λa, where √a + √b(a) = √γ.
• Can any problem have a worse (bigger) L* than this?

[Figure: the (a,b) plane with the separating line through (a,b) and the parallel level curves of L(·,λ).]

Part 2 studies the relaxed problem 
 using time-expanded policies



• To complete the proof, we show, for an arbitrary problem instance, that min_π L(π,λ) = min_π IncentiveCost(π) + λ * OpportunityCost(π) is no worse than L*(λ), the value of the Diamonds in the Rough instance corresponding to λ.
• To do so, we define time-expanded policies, and show that a particular time-expanded policy achieves L(π,λ) no worse than L*(λ).
• This also answers business question #3: to find a simple policy achieving a desired (a,b) on the efficient frontier, use the time-expanded policy with the corresponding λ. [We also need to show that it satisfies the budget constraint.]

[Figure: the (a,b) plane.]

Part 2 studies the relaxed problem using time-expanded policies

• Want to show: min_π IncentiveCost(π) + λ * OpportunityCost(π) is no worse than in Diamonds in the Rough.
• The time-expansion of policy π with parameter λ, TE(π,λ), plays as follows on each iteration:
  • With probability p = p(λ), offer no incentive payment, so the agent plays myopically. Store the observation for later, but do not let π look at it.
  • With probability 1-p, incentivize the agent to play the arm recommended by π. Give π the next observation from that arm [it may be from a previous pull].

Here’s how we analyze
 time-expanded policies

• p(λ) is constructed so that the rewards from myopic pulls cancel the incentive payments from π pulls.
• The Lagrangian value L(TE(π,λ)) equals -Reward_η(π) (up to a linear transformation), the reward computed under a different discount factor η ≤ γ.
• To minimize L(TE(π,λ)), we maximize Reward_η(π). We choose π to be the Gittins index policy for the discount factor η.
• The crux of this analysis is bounding OPT_η / OPT_γ.

Here’s how we analyze
 time-expanded policies

• Theorem: For any MAB instance and any η ≤ γ,
    OPT_η / OPT_γ ≥ (1-γ)² / (1-η)²,
  with equality for the Diamonds in the Rough instance.
• This says: the Gittins index policy degrades slowly with the discount factor.
• After some pencil-and-paper calculation, we get…
• Corollary: L(TE(OPT_η, λ)) ≤ L*(λ).
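For a sense of scale (my example numbers, not the talk's): with γ = 0.9 and η = 0.8 the bound reads OPT_0.8 / OPT_0.9 ≥ (1-0.9)² / (1-0.8)² = 0.01 / 0.04 = 0.25, so planning with the smaller discount factor still guarantees at least a quarter of OPT_0.9 on any instance.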

This completes the proof of part 2

• We showed that the time-expanded policy's Lagrangian value satisfies L(TE(OPT_η,λ)) ≤ L*(λ), and so is no worse than the Diamonds in the Rough example.
• Thus, there is an achievable point below the supposedly unachievable line.
• The contradiction shows that all (a,b) with √a + √b > √γ are achievable.

[Figure: the (a,b) plane.]

Let me summarize the theory, 
 and comment on practice

• We characterized the set of Incentive Costs (b*OPT) and Opportunity Costs (a*OPT) for which we can guarantee the existence of a policy that achieves them, regardless of the problem instance.
• Loss pair (a,b) is achievable if √a + √b ≥ √γ.
• We also provided an implementable policy that achieves this.
• What else do we need to do to successfully incentivize exploration in practice?

Here are some applications in which we could consider incentivizing exploration

These things happen in the real world, but are missing from our model

• Agents may have heterogeneous preferences, priors, and utility for money.
• Preferences, priors, and utility for money may be unknown.
• Agents may have repeated interactions.
• Agents may observe aspects of the interaction unobserved by the principal.
• Information sharing may be incomplete.
• We may not be able to incentivize with money.
• Quality may vary over time.
• Random constraints on the agents' behavior may lead them to explore.
• The items being explored (restaurants, products) may also have interests, and may promote themselves.

There are lots of interesting questions out there!

• Agents may have heterogeneous preferences, priors, and utility for money.
• Preferences, priors, and utility for money may be unknown.
• Agents may have repeated interactions.
• Agents may observe aspects of the interaction unobserved by the principal.
• Information sharing may be incomplete.
• We may not be able to incentivize with money.
• Quality may vary over time.
• Random constraints on the agents' behavior may lead them to explore.
• The items being explored (restaurants, products) may also have interests, and may promote themselves.

Thanks!