Combining Predictions for Accurate Recommender Systems [KDD 2010, July 25–28, 2010, Washington D.C., USA]

Michael Jahrer, Andreas Töscher, Robert Legenstein

1/27

Outline 

Motivation



Collaborative Filtering Algorithms



Blending Algorithms



Experimental Results on Netflix Data



Application example: KDD Cup 2010

2/27

Motivation 

Accurate recommendations may increase sales



Guide users to the products they want to purchase



Better cross-selling



Increasing user activity

3/27

Collaborative filtering 

All algorithms have been successfully applied to the Netflix Prize dataset

SVD – Singular Value Decomposition



KNN – K-Nearest Neighbors (item-item)



AFM – Asymmetric Factor Model



RBM – Restricted Boltzmann Machines



GE – Global Effects

4/27

SVD

Notation: u ... user, i ... item, \hat{r}_{ui} ... prediction, p_i ... item feature, q_u ... user feature

Very popular since the Netflix Prize

Accurate and good scaling properties

\hat{r}_{ui} = p_i^T q_u

[Figure: the prediction is the dot product of an item feature vector and a user feature vector]
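A minimal matrix-factorization sketch of this idea (not the authors' implementation): item and user feature vectors are learned by stochastic gradient descent on the observed ratings, and the prediction is their dot product. Learning rate, regularization, and feature count are illustrative values.

```python
# Minimal SVD-style matrix factorization sketch: learn p_i, q_u by SGD,
# predict r_ui = p_i^T q_u. Hyperparameters are illustrative.
import numpy as np

def train_svd(ratings, n_users, n_items, n_features=50, lr=0.01, reg=0.02, n_epochs=20):
    """ratings: list of (user, item, rating) triples."""
    rng = np.random.default_rng(0)
    p = 0.1 * rng.standard_normal((n_items, n_features))   # item features
    q = 0.1 * rng.standard_normal((n_users, n_features))   # user features
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - p[i] @ q[u]                           # residual r_ui - p_i^T q_u
            p_i, q_u = p[i].copy(), q[u].copy()
            p[i] += lr * (err * q_u - reg * p_i)            # gradient step on item feature
            q[u] += lr * (err * p_i - reg * q_u)            # gradient step on user feature
    return p, q

def predict_svd(p, q, u, i):
    return p[i] @ q[u]                                      # \hat{r}_{ui} = p_i^T q_u
```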

5/27

KNN

Notation: u ... user, i ... item, \hat{r}_{ui} ... prediction, R(u,i) ... item set, c_{ij} ... correlation between items i and j, r_{uj} ... user rating

Natural approach

Predict a rating:

Find the k best-correlating items

Make a weighted sum

\hat{r}_{ui} = \frac{\sum_{j \in R(u,i)} c_{ij} \, r_{uj}}{\sum_{j \in R(u,i)} |c_{ij}|}

Quadratic runtime for one prediction: O(N²), N = #items
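A minimal sketch of the prediction rule above; the item-item correlation matrix `corr` is assumed to be precomputed, and k is an illustrative choice.

```python
# Item-item KNN prediction sketch: weighted average of the user's ratings on the k items
# that correlate best with item i. `corr` and k are assumptions for illustration.
import numpy as np

def predict_knn(user_ratings, corr, i, k=20):
    """user_ratings: dict {item: rating} of user u; corr: N x N item correlation matrix."""
    rated = [j for j in user_ratings if j != i]
    # R(u,i): the k best-correlating items among those the user has rated
    neighbors = sorted(rated, key=lambda j: abs(corr[i, j]), reverse=True)[:k]
    num = sum(corr[i, j] * user_ratings[j] for j in neighbors)
    den = sum(abs(corr[i, j]) for j in neighbors)
    return num / den if den > 0 else np.nan
```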

6/27

AFM

Notation: u ... user, i ... item, \hat{r}_{ui} ... prediction, N(u) ... items rated by u, p_i ... item feature, q_j ... asymmetric item feature

Like SVD

A user is represented via his rated items N(u)

☺ New users can be integrated without re-training

\hat{r}_{ui} = p_i^T \frac{1}{|N(u)|} \sum_{j \in N(u)} q_j

[Figure: the item feature vector is multiplied with a virtual user feature, the average of the asymmetric item features q_j of the user's rated items]
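A sketch of the AFM prediction rule, assuming the feature matrices p and q have already been learned (e.g. by SGD as in the SVD sketch above).

```python
# AFM prediction sketch: the "virtual" user feature is the average of the asymmetric item
# features q_j over the items N(u) the user has rated, so new users need no retraining.
import numpy as np

def predict_afm(p, q, rated_items, i):
    """p: item features, q: asymmetric item features, rated_items: indices of N(u)."""
    virtual_user = q[rated_items].mean(axis=0)   # (1/|N(u)|) * sum_{j in N(u)} q_j
    return p[i] @ virtual_user                   # \hat{r}_{ui} = p_i^T * virtual user feature
```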

7/27

RBM

Notation: u ... user, i ... item, \hat{r}_{ui} ... prediction

Two-layer undirected graphical model



Learning is performed with ”contrastive divergence”



RBM reconstructs the visible units



Predictions rui are calculated over rating probabilites

[R. Salakhutdinov, A. Mnih, G. Hinton: Restricted Boltzmann Machines for Collaborative Filtering, ICML '07]
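The model in the cited paper uses softmax visible units over the five rating values with per-movie weight sharing; the highly simplified binary sketch below only illustrates one contrastive-divergence (CD-1) update, not that full model.

```python
# Simplified CD-1 update for a binary RBM, for one training vector v0 (illustrative only).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    # positive phase: hidden probabilities and a binary sample given the data
    h_prob0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h_prob0.shape) < h_prob0).astype(float)
    # negative phase: reconstruct the visible units, then re-infer hidden probabilities
    v_prob1 = sigmoid(h0 @ W.T + b_vis)
    h_prob1 = sigmoid(v_prob1 @ W + b_hid)
    # approximate gradient: data statistics minus reconstruction statistics
    W += lr * (np.outer(v0, h_prob0) - np.outer(v_prob1, h_prob1))
    b_vis += lr * (v0 - v_prob1)
    b_hid += lr * (h_prob0 - h_prob1)
    return W, b_vis, b_hid
```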

8/27

Global Effects 

Calculate ”hand-crafted” features for users and items



Equivalent to SVD with either fixed user or item features

[A. Töscher, M. Jahrer, R. Bell: The BigChaos Solution to the Netflix Grand Prize, 2009]
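The idea can be sketched as fitting simple, shrunken per-item and per-user estimates one after another on the residuals of the previous effect. The effect list and the shrinkage constant below are illustrative, not the exact ones from the cited report.

```python
# Illustrative sketch of the first "global effects": shrunken per-item and per-user means
# fitted sequentially on the residuals of the previous effect.
import pandas as pd

def shrunken_mean(df, key, alpha=25.0):
    grouped = df.groupby(key)["res"]
    return grouped.sum() / (grouped.count() + alpha)       # shrink toward 0 for sparse groups

def global_effects(ratings):
    """ratings: DataFrame with columns user, item, rating."""
    df = ratings.copy()
    mu = df["rating"].mean()                               # effect 0: global mean
    df["res"] = df["rating"] - mu
    item_eff = shrunken_mean(df, "item")                   # effect 1: item effect
    df["res"] -= df["item"].map(item_eff).fillna(0.0)
    user_eff = shrunken_mean(df, "user")                   # effect 2: user effect
    df["res"] -= df["user"].map(user_eff).fillna(0.0)
    return mu, item_eff, user_eff
```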

9/27

Blending

Notation: u ... user, i ... item, \hat{r}_{ui} ... prediction, N(u) ... ratings of u

Apply a supervised learner for combining predictions

Error: RMSE

Additional information: |N(u)| (the "support")

[Figure: the dataset feeds the CF algorithms (SVD, KNN, AFM, RBM, GE) and other info; their predictions go into the blender, which outputs \hat{r}_{ui}]

10/27

Evaluation schema 

Dataset for CF algorithms: Netflix (10^8 ratings, excluding probe)



Dataset for Blending: probe (1.4M ratings)



50/50 random split of probe: pTrain, pTest

Blending 

pTrain: training set



pTest: test set



qualifying: another test set

11/27

Used CF algorithms 

4x SVD



4x AFM



4x KNN



2x RBM



4x GE



log(support) as additional input

→ 19 predictors



Some are trained on residuals of others

12/27

Blending (supervised setup)

Notation: X ... train set (N x F matrix), x_{ij} ... feature value, i ... sample, j ... feature, y ... targets (ratings 1...5), p ... predictions, \Omega(x) ... model (the "blender")

F = 19, N = 704197

Error function:

RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( \Omega(x_i) - y_i \right)^2 }
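A short sketch of this setup: each training sample is the vector of the 19 predictor outputs (the 18 CF predictions plus log(support)) for one probe rating, the target is the true rating, and every blender is scored with RMSE. The variable names are illustrative.

```python
# X: N x F matrix of predictor outputs on pTrain (F = 19), y: true ratings in {1,...,5};
# X_test, y_test: the same on pTest. Any blender Omega is trained on (X, y) and
# evaluated via rmse(Omega.predict(X_test), y_test).
import numpy as np

def rmse(predictions, targets):
    """RMSE = sqrt( (1/N) * sum_i (Omega(x_i) - y_i)^2 )"""
    return np.sqrt(np.mean((np.asarray(predictions) - np.asarray(targets)) ** 2))
```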

13/27

What is inside X? (X = train set)

[Table: the first 20 of ~700k rows of X, one column per predictor]

14/27

Linear Regression

Model: \Omega(x) = x^T w

Training: w = (X^T X + \lambda I)^{-1} X^T y

☺ Fast

RMSE on pTest: 0.87525 → Baseline

\lambda = 0.000004, determined by cross-validation

[Figure: the fitted regression coefficients w_i]
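A minimal sketch of the ridge-regression blender in the closed form shown above; np.linalg.solve is used instead of an explicit matrix inverse.

```python
# Linear (ridge) regression blender: w = (X^T X + lambda*I)^{-1} X^T y,
# with lambda = 0.000004 chosen by cross-validation.
import numpy as np

def fit_linear_blender(X, y, lam=4e-6):
    F = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(F), X.T @ y)

def predict_linear_blender(X, w):
    return X @ w                     # Omega(x) = x^T w
```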

15/27

Binned Linear Regression (Lin. Reg. baseline: 0.8752)

Model: \Omega(x) = x^T w_b, where b is the bin and each bin has its own weight vector w_b

☺ Fast, more accurate than LR

3 binning types:

support: number of ratings per user

date: day of the rating

frequency: number of votes from user u on the day of the rating

→ support binning works best (5 bins)
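A sketch of binned linear regression using the support binning: samples are split into bins by the user's number of ratings and a separate weight vector w_b is fitted per bin. The bin edges here are illustrative; the talk found 5 support bins to work best.

```python
# Binned linear regression sketch: one ridge-regression weight vector per support bin.
import numpy as np

def fit_binned_blender(X, y, support, bin_edges, lam=4e-6):
    bins = np.digitize(support, bin_edges)                  # bin index b for every sample
    weights = {}
    for b in np.unique(bins):
        mask = bins == b
        Xb, yb = X[mask], y[mask]
        weights[b] = np.linalg.solve(Xb.T @ Xb + lam * np.eye(X.shape[1]), Xb.T @ yb)
    return weights

def predict_binned_blender(X, support, bin_edges, weights):
    bins = np.digitize(support, bin_edges)
    return np.array([x @ weights[b] for x, b in zip(X, bins)])
```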

16/27

Neural Network (Lin. Reg. baseline: 0.8752)

Stochastic gradient descent

Decrease the learning rate from its initial value during training

Bagging improves the accuracy

☺ Fast and accurate predictions

☹ Long training time
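A sketch of a neural-network blender, using scikit-learn as an illustrative stand-in for the authors' own implementation; the 19-30-1 architecture matches the network mentioned for the qualifying set later in the talk, the remaining settings are illustrative.

```python
# Small MLP blender (19 inputs -> 30 hidden units -> 1 output) trained with SGD and a
# decaying learning rate. X_train/y_train and the rmse helper are the ones sketched earlier.
from sklearn.neural_network import MLPRegressor

nn_blender = MLPRegressor(
    hidden_layer_sizes=(30,),
    solver="sgd",
    learning_rate="invscaling",      # decrease the learning rate from its initial value
    learning_rate_init=0.01,
    max_iter=200,
)
# nn_blender.fit(X_train, y_train)
# rmse(nn_blender.predict(X_test), y_test)
```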

17/27

Bagged Gradient Boosted Decision Trees (Lin. Reg. baseline: 0.8752)

Prediction is generated by:

Splits in a single tree are greedy (best RMSE)

Sum of trees (gradient boosting)

Averaging many chains (bagging)

Lower RMSE with:

Smaller learning rate

Larger bagging size

Dataset dependent:

Max. number of leaves

Subspace size

18/27
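A sketch of the bagged GBDT blender just described, with scikit-learn as an illustrative stand-in: each chain is a sum of greedily grown trees (gradient boosting), and several chains are averaged (bagging). All hyperparameter values are illustrative knobs, not the ones from the talk.

```python
# Bagged GBDT blender sketch: gradient-boosted chains averaged by bagging.
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor

gbdt_chain = GradientBoostingRegressor(
    learning_rate=0.05,      # smaller learning rate -> lower RMSE (more trees needed)
    n_estimators=300,        # number of trees summed in one chain
    max_leaf_nodes=16,       # max. number of leaves (dataset dependent)
    max_features=0.5,        # subspace size: fraction of features considered per split
)
bagged_gbdt = BaggingRegressor(gbdt_chain, n_estimators=10)   # average many chains
# bagged_gbdt.fit(X_train, y_train)
# rmse(bagged_gbdt.predict(X_test), y_test)
```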

Kernel Ridge Regression (Lin. Reg. baseline: 0.8752)

Cannot be applied to all 700k training samples:

O(N³) runtime, O(N²) memory

Average over smaller train sets (random x % subsets)

1% subset: 7k samples; 6% subset: 42k samples

RMSE: 0.874
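A sketch of this workaround: fit kernel ridge regression on several small random subsets and average the predictions. The kernel and hyperparameters below are illustrative, not the values from the talk.

```python
# KRR blender sketch: since KRR is O(N^3) time / O(N^2) memory, fit it on several
# small random subsets of the ~700k samples and average the test predictions.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def krr_blend(X, y, X_test, n_models=10, subset_size=7000, seed=0):
    rng = np.random.default_rng(seed)
    preds = np.zeros(len(X_test))
    for _ in range(n_models):
        idx = rng.choice(len(X), size=subset_size, replace=False)   # ~1% random subset
        model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.1)
        model.fit(X[idx], y[idx])
        preds += model.predict(X_test)
    return preds / n_models
```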

19/27

K-Nearest Neighbors Blending (Lin. Reg. baseline: 0.8752)

Cannot be applied to all 700k training samples:

O(N²) runtime, O(N²) memory

Does not work (worse RMSE)

RMSE: 0.883

20/27

Bagging with Neural Networks, Polynomial Regression and GBDT

[Figure: the dataset and the CF predictions (SVD, KNN, AFM, RBM, GE) plus other info feed several blenders; each blender outputs \hat{r}_{ui}, and these outputs are combined by a linear combination into the final prediction]

Many blenders are trained one after another
→ Error feedback to stop training: the RMSE of the linear combination
→ The linear combination is calculated on the out-of-bag estimate

21/27

Bagging with Neural Networks, Polynomial Regression and GBDT (Lin. Reg. baseline: 0.8752)

Stagewise optimization of a linear combination of different learners
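A sketch of the stagewise idea: blenders are trained one after another, and after each new model the linear combination of all models so far is re-fitted; its RMSE is the error feedback that decides when to stop. The talk computes this combination on the out-of-bag estimates; a plain holdout set is used below for simplicity, and the model list and stopping rule are illustrative.

```python
# Stagewise linear combination of blenders with RMSE feedback on held-out data.
import numpy as np

def stagewise_ensemble(models, X_train, y_train, X_hold, y_hold):
    preds_hold, best_rmse, best_w = [], np.inf, None
    for model in models:                                   # train blenders one after another
        model.fit(X_train, y_train)
        preds_hold.append(model.predict(X_hold))
        P = np.column_stack(preds_hold)
        w, *_ = np.linalg.lstsq(P, y_hold, rcond=None)     # re-fit the linear combination
        score = np.sqrt(np.mean((P @ w - y_hold) ** 2))    # RMSE feedback
        if score < best_rmse:
            best_rmse, best_w = score, w
        else:
            break                                          # stop adding blenders
    return best_w, best_rmse
```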

22/27

Results on the qualifying set, the "real" test set (Lin. Reg. baseline: 0.8681)

19-30-1 neural network: RMSE=0.8664



Bagging with 7 models: RMSE=0.8660



0.0021 improvement

Netflix Prize competitors use linear regression with meta features …



...

0.0020 improvement

[J. Sill, G. Takács, L. Mackey, D. Lin: Feature-Weighted Linear Stacking, 2009]

23/27

Summary 



The blend of many CF algorithms improves the accuracy!

A neural network (as blender) is the best tradeoff between training time and accuracy

[Figure: comparison of the blended algorithms with the individual collaborative filtering algorithms]

24/27

Software is Open Source! 



The data and the implementation can be found at: http://elf-project.sourceforge.net/

Many examples are provided there

Happy hacking ☺

25/27

Application example: KDD Cup 2010

This is one feature vector:

Blender train set: 141 features, 4M samples

f1...f36: predictors

[Figure: the remaining feature groups (f37...f141) cover knowledge component, problem hierarchy, unit, step and student encodings, problem view, log(support), and opportunity statistics (min/max/mean, std/cnt/sum)]

26/27

Thank you for your attention!

Michael Jahrer
commendo research & consulting GmbH
www.commendo.at

27/27