Counterfactual Risk Minimization: Learning from logged bandit feedback
Adith Swaminathan, Thorsten Joachims
Software: http://www.cs.cornell.edu/~adith/poem/
Learning frameworks

                     Online         Batch
  Full Information   Perceptron, …  SVM, …
  Bandit Feedback    LinUCB, …      ?
Logged bandit feedback is everywhere!

Example: Alice wants tech news, the system shows "SpaceX launch", and Alice says "Cool!"
  x_1 = (Alice, tech)   y_1 = (SpaceX)   δ_1 = 1

• x_i : Query
• y_i : Prediction
• δ_i : Feedback
Goal

• Risk of h: 𝕏 ↦ 𝕐:   R(h) = 𝔼_x[ δ(x, h(x)) ]
• Find h* ∈ ℋ with minimum risk
• Can we find h* using logged data 𝒟 collected from h0?
    x_1 = (Alice, tech)    y_1 = (SpaceX)   δ_1 = 1
    x_2 = (Alice, sports)  y_2 = (F1)       δ_2 = 5
    …
Learning by replaying logs?

• In the log, x ~ Pr(𝕏), but y = h0(x), not h(x)
    x_1 = (Alice, tech)    y_1 = (SpaceX)   δ_1 = 1
    x_2 = (Alice, sports)  y_2 = (F1)       δ_2 = 5
• Replay: for x_1 = (Alice, tech), a new system h predicts y_1' = Apple Watch. What would Alice do? The log does not tell us.
• Training/evaluation from logged data is counterfactual [Bottou et al]
Stochastic policies to the rescue!

• Stochastic h: 𝕏 ↦ Δ(𝕐), predictions sampled as y ~ h(x)
• R(h) = 𝔼_x 𝔼_{y~h(x)}[ δ(x, y) ]

[Figure: distributions of h and h0 over 𝕐; "SpaceX launch" is likely under one policy and unlikely under the other]
Counterfactual risk estimators

Basic importance sampling [Owen]:

  R(h) = 𝔼_x 𝔼_{y~h}[ δ(x, y) ]                               (performance of the new system)
       = 𝔼_x 𝔼_{y~h0}[ δ(x, y) · h(y|x) / h0(y|x) ]           (samples from the old system, reweighted by the importance weight h/h0)

• 𝒟 = { (x_1, y_1, δ_1, p_1), (x_2, y_2, δ_2, p_2), …, (x_n, y_n, δ_n, p_n) }
• p_i = h0(y_i | x_i) … the propensity [Rosenbaum et al]

IPS estimator:

  R̂_𝒟(h) = (1/n) Σ_{i=1}^n δ_i · h(y_i | x_i) / p_i
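To make the estimator concrete, here is a minimal sketch in plain NumPy (hypothetical function and variable names, not the released POEM API): it averages the logged losses reweighted by h(y_i|x_i)/p_i.

```python
import numpy as np

def ips_risk(new_probs, losses, propensities):
    """Importance-sampling (IPS) estimate of the risk of a new policy h.

    new_probs    : array of h(y_i | x_i), the new policy's probability of the logged prediction
    losses       : array of delta_i, the logged feedback (smaller is better)
    propensities : array of p_i = h0(y_i | x_i), the logging policy's propensity
    """
    weights = new_probs / propensities      # importance weights h / h0
    return np.mean(losses * weights)        # (1/n) * sum_i delta_i * h(y_i|x_i) / p_i

# Toy log in the spirit of the Alice example (hypothetical numbers):
losses = np.array([1.0, 5.0])               # delta_1, delta_2
propensities = np.array([0.9, 0.9])         # p_i under h0
new_probs = np.array([0.8, 0.1])            # h(y_i | x_i) under a candidate policy
print(ips_risk(new_probs, losses, propensities))
```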
Story so far

Logged bandit data → Fix h vs. h0 mismatch → Control variance → Tractable bound
Importance sampling causes non-uniform variance!

Logged data (all with p_i = 0.9):
  x_1 = (Alice, sports)  y_1 = (F1)         δ_1 = 5
  x_2 = (Alice, tech)    y_2 = (SpaceX)     δ_2 = 1
  x_3 = (Alice, movies)  y_3 = (Star Wars)  δ_3 = 2
  x_4 = (Alice, tech)    y_4 = (Tesla)      δ_4 = 1

Two candidate policies: R̂_𝒟(h_1) = 1 and R̂_𝒟(h_2) = 1.33, but the reliability of each estimate differs because the importance weights, and hence the variance, depend on h.

Want: an error bound that captures the variance of importance sampling.
Counterfactual Risk Minimization

• With high probability over 𝒟 ~ h0, for all h ∈ ℋ:

    R(h)  ≤  R̂_𝒟(h)  +  O( √( Var_𝒟(h) / n ) )  +  O( 𝒩_∞(ℋ) / n )      *conditions apply; see [Maurer et al]
           (empirical risk)  (variance regularization)   (capacity control)

• Learning objective:

    ĥ_CRM = argmin_{h ∈ ℋ}  R̂_𝒟(h) + λ · √( Var_𝒟(h) / n )
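A minimal sketch of this objective, assuming Var_𝒟(h) is the sample variance of the propensity-weighted losses and λ is given (hypothetical names, not the released POEM API):

```python
import numpy as np

def crm_objective(new_probs, losses, propensities, lam):
    """Variance-regularized (CRM) objective: empirical IPS risk + lam * sqrt(Var/n)."""
    z = losses * new_probs / propensities    # per-example propensity-weighted losses
    n = len(z)
    risk = z.mean()                          # empirical IPS risk
    var = z.var(ddof=1)                      # sample variance of the weighted losses
    return risk + lam * np.sqrt(var / n)
```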
POEM: a CRM algorithm for structured prediction

• CRF-style policies: h_w ∈ ℋ_lin with h_w(y|x) = exp( w · φ(x, y) ) / Z(x; w)

• Policy Optimizer for Exponential Models (POEM):

    w* = argmin_w  (1/n) Σ_{i=1}^n δ_i · h_w(y_i | x_i) / p_i  +  λ · √( Var(h_w) / n )  +  μ ||w||²

• Good: gradient descent, searches over infinitely many w
• Bad:  not convex in w
• Ugly: resists stochastic optimization (the variance term couples all examples)
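As a hedged illustration, the sketch below instantiates this objective for a linear softmax policy over a small discrete set of candidate y's (hypothetical feature map and names; the released POEM code handles structured multilabel outputs and is the authoritative implementation):

```python
import numpy as np

def softmax_policy(w, features):
    """h_w(y|x) = exp(w . phi(x, y)) / Z(x; w) for all candidate y.

    features : (num_candidates, dim) array whose row y is phi(x, y)
    """
    scores = features @ w
    scores -= scores.max()                   # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def poem_objective(w, logged, lam, mu):
    """POEM objective: IPS risk + lam * sqrt(Var/n) + mu * ||w||^2.

    logged : list of (features, y_index, loss, propensity) tuples
    """
    z = np.array([loss * softmax_policy(w, feats)[y] / p
                  for feats, y, loss, p in logged])
    n = len(z)
    return z.mean() + lam * np.sqrt(z.var(ddof=1) / n) + mu * np.dot(w, w)
```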
Stochastically optimize Var(h_w)? Taylor-approximate!

With h_w^i := δ_i · h_w(y_i | x_i) / p_i, a first-order Taylor expansion around the current iterate w_t gives

    √( Var(h_w) )  ≤  A_{w_t} Σ_{i=1}^n h_w^i  +  B_{w_t} Σ_{i=1}^n (h_w^i)²  +  C_{w_t}

which decomposes over examples.

[Figure: √Var and its Taylor approximations at successive iterates w_t, w_{t+1}]

• During an epoch: Adagrad with per-example gradients  ∇h_w^i + λ√n ( A_{w_t} ∇h_w^i + 2 B_{w_t} h_w^i ∇h_w^i )
• After an epoch:  w_{t+1} ↤ w, recompute A_{w_{t+1}}, B_{w_{t+1}}
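A sketch of how the per-epoch coefficients could be computed, assuming z_i = h_w^i are the per-example weighted losses at the current iterate w_t and writing Var as a function of S1 = Σ z_i and S2 = Σ z_i²; the exact constants and scaling in the released POEM code may differ:

```python
import numpy as np

def taylor_coefficients(z):
    """First-order Taylor coefficients of sqrt(Var) in (S1, S2) = (sum z_i, sum z_i^2),
    evaluated at the current iterate (z holds the per-example weighted losses at w_t)."""
    n = len(z)
    s1, s2 = z.sum(), (z ** 2).sum()
    var = (s2 - s1 ** 2 / n) / (n - 1)       # sample variance as a function of S1, S2
    sd = np.sqrt(var)
    A = -s1 / (n * (n - 1) * sd)             # d sqrt(Var) / d S1 at w_t
    B = 1.0 / (2 * (n - 1) * sd)             # d sqrt(Var) / d S2 at w_t
    C = sd - A * s1 - B * s2                 # intercept so the expansion matches at w_t
    return A, B, C
```

Once A_{w_t} and B_{w_t} are frozen, the regularized objective decomposes over examples, so each Adagrad step only needs the per-example gradient terms shown above; the coefficients are then refreshed at the end of the epoch.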
Experiment

• Supervised ↦ Bandit conversion for multilabel classification [Agarwal et al] (sketched below)
• δ(x, y) = Hamming(y*(x), y)   (smaller is better)
• LibSVM datasets:
  • Scene (few features, labels and data)
  • Yeast (many labels)
  • LYRL (many features and data)
  • TMC (many features, labels and data)
• Hyper-parameters (λ, μ) validated using R̂_{𝒟val}(h)
• Report: expected Hamming loss on the supervised test set
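A hedged sketch of this supervised-to-bandit conversion (hypothetical names; assuming, for illustration, that h0 predicts each label independently with some probability): sample y from h0, record its propensity and Hamming loss, and never show the learner the true labels.

```python
import numpy as np

def log_bandit_feedback(x, y_star, h0_probs, rng):
    """Simulate one logged-bandit record from a supervised multilabel example.

    y_star   : true binary label vector for x (never revealed to the learner)
    h0_probs : h0's per-label probabilities of predicting 1 (independent labels assumed)
    """
    y = (rng.random(len(h0_probs)) < h0_probs).astype(int)    # sample y ~ h0(.|x)
    propensity = np.prod(np.where(y == 1, h0_probs, 1 - h0_probs))
    loss = np.sum(y != y_star)                                 # Hamming(y*, y), smaller is better
    return x, y, loss, propensity

rng = np.random.default_rng(0)
record = log_bandit_feedback(x=None, y_star=np.array([1, 0, 1]),
                             h0_probs=np.array([0.7, 0.2, 0.5]), rng=rng)
```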
Approaches

• Baselines
  • h0: supervised CRF trained on 5% of the training data
• Proposed
  • IPS (no variance penalty; extends [Bottou et al])
  • POEM
• Skylines
  • Supervised CRF (independent logit regression)
(1) Does variance regularization help?

[Bar chart: test-set expected Hamming loss δ on Scene, Yeast, LYRL and TMC for h0, IPS, POEM, and the supervised CRF skyline; lower is better.]
(2) Is it efficient?

Avg. time (s)   Scene    Yeast    LYRL     TMC
POEM(B)         75.20    94.16    561.12   949.95
POEM(S)          4.71     5.02    120.09   276.13
CRF              4.86     3.28     62.93    99.18

• POEM(S), the stochastic variant, recovers the same performance at a fraction of the batch L-BFGS cost of POEM(B)
• Scales like a supervised CRF, but learns from bandit feedback
(3) Does generalization improve as n → ∞?

[Line plot: test-set expected Hamming loss δ vs. replay count (1 to 256, doubling) for h0, POEM, and the supervised CRF skyline.]
(4) Does stochasticity of h0 affect learning?

[Line plot: test-set expected Hamming loss δ vs. the h0 parameter multiplier (0.5 = more stochastic through 32 = near-deterministic) for h0 and POEM.]
Conclusion

• CRM principle for learning from logged bandit feedback
  • Variance regularization
• POEM for structured output prediction
  • Scales like a supervised CRF, but learns from bandit feedback
• Contact: [email protected]
• POEM available at http://www.cs.cornell.edu/~adith/poem/
• Long paper: Counterfactual risk minimization: Learning from logged bandit feedback, http://jmlr.org/proceedings/papers/v37/swaminathan15.html
• Thanks!
References

1. Art B. Owen. 2013. Monte Carlo theory, methods and examples.
2. Paul R. Rosenbaum and Donald B. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41-55.
3. Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research 14(1), 3207-3260.
4. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. 2014. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. Proceedings of the 31st International Conference on Machine Learning, 1638-1646.
5. Andreas Maurer and Massimiliano Pontil. 2009. Empirical Bernstein bounds and sample-variance penalization. Proceedings of the 22nd Conference on Learning Theory.
6. Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk minimization: Learning from logged bandit feedback. Proceedings of the 32nd International Conference on Machine Learning.