Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
Adith Swaminathan, Thorsten Joachims

Software: http://www.cs.cornell.edu/~adith/poem/

Learning frameworks

                      Online          Batch
  Full Information    Perceptron, …   SVM, …
  Bandit Feedback     LinUCB, …       ?

Logged bandit feedback is everywhere!
• Alice wants tech news, the system shows "SpaceX launch", Alice: "Cool!"
• Logged as: x1 = (Alice, tech), y1 = (SpaceX), δ1 = 1
• xi: query,  yi: prediction,  δi: feedback
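As a concrete picture of what gets stored, here is a minimal sketch of one logged interaction record; the field names are illustrative (not from the POEM software), and the record will later be extended with a propensity.

```python
from dataclasses import dataclass

@dataclass
class LoggedInteraction:
    """One record of logged bandit feedback (illustrative field names)."""
    x: tuple      # query/context, e.g. ("Alice", "tech")
    y: str        # prediction shown by the deployed system, e.g. "SpaceX"
    delta: float  # feedback observed for this (x, y) only

log = [
    LoggedInteraction(x=("Alice", "tech"), y="SpaceX", delta=1.0),
    LoggedInteraction(x=("Alice", "sports"), y="F1", delta=5.0),
]
```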

Goal
• Risk of h: 𝕏 ↦ 𝕐:  R(h) = 𝔼_x[ δ(x, h(x)) ]
• Find h* ∈ 𝓗 with minimum risk
• Can we find h* using
    𝒟 = { (x1 = (Alice, tech), y1 = (SpaceX), δ1 = 1),
          (x2 = (Alice, sports), y2 = (F1), δ2 = 5), … }
  collected from h0?

Learning by replaying logs?
• x ∼ Pr(𝕏), however the logged y = h0(x), not h(x)
    Logged:  x1 = (Alice, tech), y1 = (SpaceX), δ1 = 1
             x2 = (Alice, sports), y2 = (F1), δ2 = 5
    New h on x1 = (Alice, tech) predicts y1′ = Apple Watch. What would Alice do??
• Training/evaluation from logged data is counterfactual [Bottou et al]

Stochastic policies to the rescue!
• Stochastic h: 𝕏 ↦ Δ(𝕐),  y ∼ h(x)
• R(h) = 𝔼_x 𝔼_{y∼h(x)}[ δ(x, y) ]
[Figure: the distribution h0(·|x) over 𝕐, with "SpaceX launch" likely and other outputs unlikely]

Counterfactual risk estimators
• Basic importance sampling [Owen]: re-weight samples from the old system h0 to estimate the performance of the new system h
    𝔼_x 𝔼_{y∼h}[ δ(x, y) ] = 𝔼_x 𝔼_{y∼h0}[ (h(y|x) / h0(y|x)) · δ(x, y) ]
  where h(y|x) / h0(y|x) is the importance weight
• 𝒟 = { (x1, y1, δ1, p1), (x2, y2, δ2, p2), …, (xn, yn, δn, pn) }
• pi = h0(yi | xi) … the propensity [Rosenbaum et al]
• Estimator:  R̂_𝒟(h) = (1/n) Σ_{i=1}^n δi · h(yi|xi) / pi
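A minimal numpy sketch of the R̂_𝒟(h) estimator above, assuming each logged record already stores δi and the propensity pi, and that we can query h(yi|xi) for the policy being evaluated; the function and variable names are illustrative, not the POEM API.

```python
import numpy as np

def ips_estimate(deltas, propensities, new_policy_probs):
    """Importance-sampling estimate: R_hat_D(h) = (1/n) * sum_i delta_i * h(y_i|x_i) / p_i."""
    deltas = np.asarray(deltas, dtype=float)
    p = np.asarray(propensities, dtype=float)        # p_i = h0(y_i | x_i)
    h = np.asarray(new_policy_probs, dtype=float)    # h(y_i | x_i)
    return float(np.mean(deltas * h / p))

# Toy log in the slide's notation: feedback delta_i, logging propensity p_i,
# and the probability the *new* policy h assigns to the logged prediction.
print(ips_estimate(deltas=[1.0, 5.0],
                   propensities=[0.9, 0.9],
                   new_policy_probs=[0.3, 0.6]))
```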

Story so far
  Logged bandit data → Fix h vs. h0 mismatch → Control variance → Tractable bound


Importance sampling causes non-uniform variance!
• Logged data (all propensities 0.9):
    x1 = (Alice, sports), y1 = (F1),        δ1 = 5, p1 = 0.9
    x2 = (Alice, tech),   y2 = (SpaceX),    δ2 = 1, p2 = 0.9
    x3 = (Alice, movies), y3 = (Star Wars), δ3 = 2, p3 = 0.9
    x4 = (Alice, tech),   y4 = (Tesla),     δ4 = 1, p4 = 0.9
• Two candidate policies: R̂_𝒟(h1) = 1,  R̂_𝒟(h2) = 1.33
• Want: an error bound that captures the variance of importance sampling


Counterfactual Risk Minimization
• W.h.p. over 𝒟 ∼ h0, for all h ∈ 𝓗:
    R(h) ≤ R̂_𝒟(h) + O( √(Var_𝒟(h) / n) ) + O( 𝓝∞(𝓗) / n )
  (empirical risk + variance regularization + capacity control)
  *conditions apply; refer to [Maurer et al]
• Learning objective:
    ĥ_CRM = argmin_{h ∈ 𝓗}  R̂_𝒟(h) + λ √(Var_𝒟(h) / n)
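A minimal sketch of evaluating this objective on a log, assuming the per-example importance-weighted terms u_i = δi·h(yi|xi)/pi and using their sample variance for Var_𝒟(h); the value of λ, the exact variance convention, and the candidate probabilities below are illustrative choices, not the paper's.

```python
import numpy as np

def crm_objective(deltas, propensities, new_policy_probs, lam=1.0):
    """CRM objective on one logged dataset: R_hat_D(h) + lam * sqrt(Var_D(h) / n)."""
    u = (np.asarray(deltas, float) * np.asarray(new_policy_probs, float)
         / np.asarray(propensities, float))        # u_i = delta_i * h(y_i|x_i) / p_i
    n = len(u)
    return u.mean() + lam * np.sqrt(u.var(ddof=1) / n)

# Picking the best h from a small finite set of candidate policies
# (the probabilities each h assigns to the four logged predictions are made up).
deltas, props = [5, 1, 2, 1], [0.9, 0.9, 0.9, 0.9]
candidates = {"h1": [0.9, 0.1, 0.1, 0.1], "h2": [0.3, 0.3, 0.3, 0.3]}
best = min(candidates, key=lambda h: crm_objective(deltas, props, candidates[h]))
print(best)
```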


POEM: CRM algorithm for structured prediction
• CRF policy class: h_w ∈ 𝓗_lin,  h_w(y|x) = exp(w·ϕ(x, y)) / ℤ(x; w)
• Policy Optimizer for Exponential Models:
    w* = argmin_w  (1/n) Σ_{i=1}^n δi · h_w(yi|xi) / pi  +  λ √(Var(h_w) / n)  +  μ ‖w‖²
• Good: gradient descent searches over infinitely many w
  Bad:  not convex in w
  Ugly: resists stochastic optimization
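A compact numpy sketch of this objective for a tiny problem, with the joint feature map ϕ(x, y) and the candidate output set represented explicitly per example; the feature dimensions, records, and names here are assumptions for illustration, not the released POEM implementation.

```python
import numpy as np

def softmax_prob(w, phi_logged, phi_candidates):
    """h_w(y|x) = exp(w . phi(x,y)) / sum_y' exp(w . phi(x,y')) for the logged y."""
    scores = phi_candidates @ w          # one score per candidate output y'
    m = scores.max()                     # stabilize the softmax
    return float(np.exp(phi_logged @ w - m) / np.exp(scores - m).sum())

def poem_objective(w, log, lam=1.0, mu=0.1):
    """(1/n) sum_i delta_i * h_w(y_i|x_i)/p_i + lam*sqrt(Var/n) + mu*||w||^2 (sketch)."""
    u = np.array([r["delta"] * softmax_prob(w, r["phi_y"], r["phi_all"]) / r["p"]
                  for r in log])
    n = len(u)
    return u.mean() + lam * np.sqrt(u.var(ddof=1) / n) + mu * float(w @ w)

# Two made-up logged records: phi(x_i, y') for a 3-candidate output set, plus delta_i and p_i.
rng = np.random.default_rng(0)
log = [{"phi_all": rng.normal(size=(3, 4)), "delta": 1.0, "p": 0.9},
       {"phi_all": rng.normal(size=(3, 4)), "delta": 5.0, "p": 0.9}]
for r in log:
    r["phi_y"] = r["phi_all"][0]         # pretend the logged y was the first candidate
print(poem_objective(np.zeros(4), log))
```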


Stochastically optimize Var(h_w)? Taylor-approximate!
• Majorize the variance around the current iterate w_t (writing h_w^i for the i-th importance-weighted term δi · h_w(yi|xi) / pi):
    Var(h_w) ≤ A_{w_t} Σ_{i=1}^n (h_w^i)²  +  B_{w_t} Σ_{i=1}^n h_w^i  +  C_{w_t}
• During an epoch: Adagrad steps with per-example gradients
    ∇h_w^i + λ√n ( 2 A_{w_t} h_w^i ∇h_w^i + B_{w_t} ∇h_w^i )
• After each epoch: w_{t+1} ↤ w, recompute A_{w_{t+1}}, B_{w_{t+1}}
[Figure: the quadratic upper bound on Var(h_w) around the current iterate w_t, whose minimization yields w_{t+1}]
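Read as code, the scheme alternates: freeze the majorization coefficients at w_t, run one Adagrad epoch on the resulting per-example (decomposable) surrogate gradients, then recompute the coefficients. The sketch below is illustrative only: u_fn and grad_u_fn are assumed callbacks for the importance-weighted term and its gradient, and the A, B formulas are a stand-in obtained by linearizing √(Var/n) around w_t, not the exact constants derived in the paper.

```python
import numpy as np

def poem_epoch(w, log, u_fn, grad_u_fn, lam=1.0, lr=0.1):
    """One majorize-then-Adagrad epoch (illustrative sketch, not the released POEM code).

    u_fn(w, rec)      -> u_i = delta_i * h_w(y_i|x_i) / p_i for one logged record
    grad_u_fn(w, rec) -> gradient of u_i w.r.t. w (array with the same shape as w)
    """
    n = len(log)
    # Freeze the majorization coefficients at the current iterate w_t.
    # Placeholder constants (assumption): first-order expansion of sqrt(Var/n) around w_t,
    # with the mean and variance frozen; the exact A, B, C are derived in the POEM paper.
    u_t = np.array([u_fn(w, rec) for rec in log])
    var_t = max(u_t.var(ddof=1), 1e-12)
    A = 1.0 / (2.0 * n * (n - 1) * np.sqrt(var_t))
    B = -2.0 * A * u_t.mean()

    g_accum = np.zeros_like(w)                       # Adagrad accumulator
    for i in np.random.permutation(n):
        u_i = u_fn(w, log[i])
        g_i = grad_u_fn(w, log[i])
        # Per-example gradient of the decomposable surrogate:
        # empirical-risk term plus variance term with coefficients frozen at w_t.
        grad = g_i / n + lam * np.sqrt(n) * (2.0 * A * u_i + B) * g_i
        g_accum += grad ** 2
        w = w - lr * grad / (np.sqrt(g_accum) + 1e-8)
    return w                                         # caller sets w_{t+1} <- w and repeats
```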

Experiment
• Supervised → bandit conversion for multi-label data [Agarwal et al]
• δ(x, y) = Hamming(y*(x), y)  (smaller is better)
• LibSVM datasets:
  • Scene (few features, labels and data)
  • Yeast (many labels)
  • LYRL (many features and data)
  • TMC (many features, labels and data)
• Validate hyper-parameters (λ, μ) using R̂_{𝒟val}(h)
• Report expected Hamming loss on the supervised test set
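A sketch of the supervised → bandit conversion described above: run the logging policy h0 on each supervised example and keep only the sampled prediction, its Hamming loss against the true labels, and its propensity. The helper names h0_sample and h0_prob are assumptions for illustration, not the actual experiment code.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Number of label positions where two multi-label bit vectors disagree (smaller is better)."""
    return int(np.sum(np.asarray(y_true) != np.asarray(y_pred)))

def log_bandit_data(X, Y_true, h0_sample, h0_prob):
    """Convert supervised multi-label examples into logged bandit feedback.

    h0_sample(x)  -> label bit vector y sampled from h0(.|x)
    h0_prob(x, y) -> propensity p = h0(y|x)
    Only (x, y, delta, p) is stored; the true labels are discarded, as with real bandit feedback.
    """
    log = []
    for x, y_star in zip(X, Y_true):
        y = h0_sample(x)
        log.append({"x": x, "y": y, "delta": hamming_loss(y_star, y), "p": h0_prob(x, y)})
    return log
```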

Approaches
• Baselines
  • h0: supervised CRF trained on 5% of the training data
• Proposed
  • IPS (no variance penalty; extends [Bottou et al])
  • POEM
• Skylines
  • Supervised CRF (independent logit regression)

(1) Does variance regularization help?
[Bar chart: test-set expected Hamming loss δ on Scene, Yeast, LYRL, and TMC, comparing h0, IPS, POEM, and the supervised CRF skyline.]

(2) Is it efficient?

  Avg Time (s)   Scene   Yeast   LYRL     TMC
  POEM(B)        75.20   94.16   561.12   949.95
  POEM(S)         4.71    5.02   120.09   276.13
  CRF             4.86    3.28    62.93    99.18

• POEM(S) (stochastic) recovers the same performance as POEM(B) (batch L-BFGS) at a fraction of the cost
• POEM scales like the supervised CRF while learning only from bandit feedback

(3) Does generalization improve as n → ∞?
[Plot: test-set expected Hamming loss δ vs. replay count (1 to 256, doubling) for h0, POEM, and the supervised CRF.]

(4) Does stochasticity of h0 affect learning?
[Plot: test-set expected Hamming loss δ vs. parameter multiplier (0.5 = more stochastic h0, up to 32 = near-deterministic h0) for h0 and POEM.]

Conclusion
• CRM principle for learning from logged bandit feedback
  • Variance regularization
• POEM for structured output prediction
  • Scales like a supervised CRF while learning from bandit feedback
• Contact: [email protected]
• POEM available at http://www.cs.cornell.edu/~adith/poem/
• Long paper: Counterfactual Risk Minimization: Learning from Logged Bandit Feedback, http://jmlr.org/proceedings/papers/v37/swaminathan15.html
• Thanks!

References
1. Art B. Owen. 2013. Monte Carlo Theory, Methods and Examples.
2. Paul R. Rosenbaum and Donald B. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41-55.
3. Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research 14(1), 3207-3260.
4. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. Proceedings of the 31st International Conference on Machine Learning, 1638-1646.
5. Andreas Maurer and Massimiliano Pontil. 2009. Empirical Bernstein bounds and sample-variance penalization. Proceedings of the 22nd Conference on Learning Theory.
6. Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk minimization: Learning from logged bandit feedback. Proceedings of the 32nd International Conference on Machine Learning.