
A random walks perspective on maximizing satisfaction and profit
Matthew Brand
SIAM International Conference on Data Mining, April 21-23, 2005

May 31, 2005

Presented by Daniel Hsu (djhsu@cs) for CSE 254

Outline

- Motivation
  - Collaborative filtering
  - Basis for recommendation
- The random walk model
- Using the model
- Evaluation

Motivation: collaborative filtering

“Hello, Daniel Hsu. We have recommendations for you.”

Motivation: collaborative filtering

Everyone wins!

- Daniel is more likely to find products he will like.
- Amazon.com is more likely to sell products to Daniel.

How can Amazon.com achieve this glorious end? Should Amazon.com just recommend The Da Vinci Code to everyone?

Motivation: basis for recommendation

How can Amazon.com decide which recommendations to make?

- (Satisfaction) “Many customers who bought Debussy: Piano Works also bought Satie: Piano Works. . . People who like Debussy also like Satie. . . Daniel will like Satie: Piano Works.”
- (Profit) “Also, we (Amazon.com) make a huge profit margin on Satie: Piano Works, so let’s try to sell as many copies of this disc as possible.”

Outline

- Motivation
- The random walk model
  - Association graph and Markov chain
  - Expected hitting and commute times
  - Connection to resistive networks
  - Random walk correlation
- Using the model
- Evaluation

Model: association graph

Let W ∈ R_+^{n×n} be the weighted adjacency matrix of an association graph.

- Example 1: vertices are events, weight of edge (i, j) is how many times event j followed event i.
- Example 2: vertices are people and movies, weight of edge (i, j) is the rating person i gave movie j.

Let P = diag(W1)^{-1} W be the row-normalized version of W. Then, we can think of P as the transition matrix of a Markov chain.
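(Aside: a minimal NumPy sketch of this construction, with a made-up 4-node graph; an illustration only, not code from the paper.)

```python
import numpy as np

# Hypothetical association graph: symmetric, nonnegative weights, no self-loops.
W = np.array([[0., 2., 1., 0.],
              [2., 0., 3., 0.],
              [1., 3., 0., 1.],
              [0., 0., 1., 0.]])

# P = diag(W1)^{-1} W: row-normalize so each row is a probability distribution.
P = W / W.sum(axis=1, keepdims=True)
assert np.allclose(P.sum(axis=1), 1.0)
```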

An example association graph

[Figure: a small association graph linking individuals (individual 314, individual 315) to movies (Star Wars Ep. 3, Napoleon Dynamite, Sleepless in Seattle, The Shawshank Redemption) by rating-weighted edges (ratings 1–5 on the edges), and to attribute nodes (female, 18-25 year old, student, bus driver) by membership edges.]

The main assumption

“a random walk on this Markov chain will mimic, over the short term, the behavior of individuals randomly drawn from this population.”

Further assumptions and consequences

Let {Xt : t ≥ 0} be an irreducible and aperiodic Markov chain with transition matrix P. Then, the chain has a unique stationary distribution π:

- πj ≥ 0 for each j
- Σj πj = 1
- π′P = π′
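(Aside: for the undirected graphs used later, πi = Wi / Σk Wk; a quick numerical check of stationarity, reusing W and P from the sketch above.)

```python
# Stationary distribution of the walk on the undirected toy graph.
pi = W.sum(axis=1) / W.sum()

assert np.allclose(pi @ P, pi)    # stationarity: pi'P = pi'
assert np.isclose(pi.sum(), 1.0)  # sums to one
```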

Expected hitting and commute times

Suppose the chain is in state i.

- Expected hitting time Hij: How long does it take, on average, to reach state j?
- Expected commute time Cij: How long does it take, on average, to reach state j and then return to state i?
  - Cij = Hij + Hji = Cji

Both Hij and Cij have been previously proposed as a basis for making recommendations. But how are they computed? Glimpse into the future: the newly proposed basis is also derived from the expected commute times.

A recurrence relation for expected hitting time

Let the random variable Tj|i be the time to reach state j starting in state i. If i ≠ j, then Tj|i = 1 + Tj|k when the first step is from i to k (any k with Pik > 0). Then, using conditional expectations,

    Hij = E[Tj|i] = 1 + Σ_{k: Pik > 0} Pr(next state is k | in state i) · E[Tj|k]
                  = 1 + Σ_{k: Pik > 0} Pik Hkj.
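(Aside: fixing the target state j, the recurrence determines the column (Hij)_{i≠j} as the solution of a linear system; a sketch of solving it directly, reusing np from above.)

```python
def hitting_times_to(P, j):
    """Solve H_ij = 1 + sum_k P_ik H_kj for all i != j, with H_jj = 0."""
    n = P.shape[0]
    idx = [i for i in range(n) if i != j]
    # Restricted to states != j: (I - P_sub) h = 1, since H_jj = 0 drops out.
    A = np.eye(n - 1) - P[np.ix_(idx, idx)]
    h = np.linalg.solve(A, np.ones(n - 1))
    H = np.zeros(n)
    H[idx] = h
    return H
```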

An identity for the frequency of a state

Now we’ll derive a direct expression for hitting times (adapted from Aldous and Fill). We’ll use the following lemma:

Lemma (“Occupation measure identity”). Consider the Markov chain {Xt : t ≥ 0} with stationary distribution π, started at state i. Let 0 < S < ∞ be a random stopping time such that XS = i and E[S | X0 = i] < ∞. Then for any state j,

    E[# of visits to state j before time S | X0 = i] = πj E[S | X0 = i].

For succinctness, write this as Ei[#j before S] = πj Ei[S]. We count visits at time 0 and exclude visits at time S.

Using the identity

Occupation measure identity: Ei[#j before S] = πj Ei[S].

Define Ti = min{t ≥ 0 : Xt = i} as the first hitting time of state i, and Ti+ = min{t ≥ 1 : Xt = i} as the first return time to state i. Note: Ti and Ti+ are the same unless X0 = i.

Warm-up: Let S = Ti+. Then Ei[#i before Ti+] = 1, so 1 = πi Ei[Ti+]. That is,

    Ei[Ti+] = 1/πi.    (1)

For S = Ti+ and j ≠ i, use the lemma and (1) to get

    Ei[#j before Ti+] = πj Ei[Ti+] = πj/πi.    (2)
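(Aside: a quick Monte Carlo sanity check of (1) on the toy chain from the earlier sketch; illustrative only.)

```python
rng = np.random.default_rng(0)

def mean_return_time(P, i, trials=5000):
    """Estimate E_i[T_i^+] by simulating the chain until it returns to i."""
    n, total = P.shape[0], 0
    for _ in range(trials):
        state, t = i, 0
        while True:
            state = rng.choice(n, p=P[state])
            t += 1
            if state == i:
                break
        total += t
    return total / trials

# mean_return_time(P, 0) should be close to 1 / pi[0].
```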

Using the identity

Occupation measure identity: Ei[#j before S] = πj Ei[S].

Let S = the first return to i after the first visit to j (j ≠ i). Then

    Ei[S] = Ei[Tj] + Ej[Ti]

and

    Ei[#j before S] = Ei[#j before Tj] + Ej[#j before Ti].

But Ei[#j before Tj] = 0, so

    Ej[#j before Ti] = πj (Ei[Tj] + Ej[Ti]).    (3)

Using the identity

Use the notation Eρ[·] for the expectation given that the state at time 0 is distributed according to ρ.

Let t0 ≥ 1 and let S be the time of the following:

1. wait time t0, then
2. wait until the chain hits i.

Let Vt be the random variable that indicates whether i is visited at time t. Then Σ_{t=0}^{t0−1} Vt is the number of visits to i before S. Now, using the identity,

    Σ_{t=0}^{t0−1} Ei[Vt] = Σ_{t=0}^{t0−1} (P^t)ii = πi (t0 + Eρ[Ti])

with ρk = Pr(X_{t0} = k | X0 = i).

Using the identity

Rearranging

    Σ_{t=0}^{t0−1} (P^t)ii = πi (t0 + Eρ[Ti])

to get

    Σ_{t=0}^{t0−1} [(P^t)ii − πi] = πi Eρ[Ti]

and letting t0 → ∞, we get

    Zii = πi Eπ[Ti],    (4)

where Zij = Σ_{t=0}^∞ [(P^t)ij − πj].

Using the identity

To actually get an expression for Ei[Tj], this time let S be the time of the following:

1. wait until the chain hits i,
2. then wait time t0 ≥ 1, and then
3. finally wait until the chain hits j.

The occupation measure identity says Ej[#j before S] = πj Ej[S]. Note that

    Ej[S] = Ej[Ti] + t0 + Eρ[Tj],  where ρk = Pr(X_{t0} = k | X0 = i),

and

    Ej[#j before S] = Ej[#j before Ti] + Σ_{t=1}^{t0−1} (P^t)ij.

Using the identity

Then, using (3) to substitute for Ej[#j before Ti], rearranging, and letting t0 → ∞ (note that (P^0)ij = 0 for i ≠ j, so the sum may start at t = 0):

    Ej[#j before Ti] + Σ_{t=0}^{t0−1} (P^t)ij = πj (Ej[Ti] + t0 + Eρ[Tj])
    πj (Ej[Ti] + Ei[Tj]) + Σ_{t=0}^{t0−1} (P^t)ij = πj (Ej[Ti] + t0 + Eρ[Tj])
    Σ_{t=0}^{t0−1} [(P^t)ij − πj] = πj (Eρ[Tj] − Ei[Tj])
    Zij = πj (Eπ[Tj] − Ei[Tj]).

Finally, using (4), we have

    Zjj − Zij = πj Ei[Tj] = πj Hij.    (5)

Computing the expected hitting time

In order to use (5) to compute Hij = Ei[Tj], we need to compute Z. Let Π = 1π′. Then

    Z = Σ_{t=0}^∞ (P^t − Π)
      = (P^0 − Π) + Σ_{t=1}^∞ (P^t − Π)
      = (I − Π) + Σ_{t=1}^∞ (P − Π)^t        (check by induction that P^t − Π = (P − Π)^t for t ≥ 1)
      = −Π + Σ_{t=0}^∞ (P − Π)^t
      = −Π + (I − (P − Π))^{-1}              (since P^t − Π → 0).

Note: Brand says Z = (I − P − Π)^{-1}, which is probably wrong.
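(Aside: combining the series for Z with identity (5) gives all hitting times at once; a sketch using the corrected formula above, not Brand's.)

```python
def hitting_time_matrix(P, pi):
    """H[i, j] = E_i[T_j], via Z = -Pi + (I - (P - Pi))^{-1} and identity (5)."""
    n = P.shape[0]
    Pi = np.outer(np.ones(n), pi)                  # Pi = 1 pi'
    Z = -Pi + np.linalg.inv(np.eye(n) - (P - Pi))
    # (5): Z_jj - Z_ij = pi_j H_ij  =>  H_ij = (Z_jj - Z_ij) / pi_j
    return (np.diag(Z)[None, :] - Z) / pi[None, :]

# Each column agrees with hitting_times_to(P, j) from the earlier sketch.
```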

A slight simplification

Recall that W is the weighted adjacency matrix of the association graph. From now on, assume W is symmetric (i.e. the graph is undirected). A random walk on such a graph has transition probabilities, for i ≠ j,

    Pij = Wij / Wi,  where Wi = Σj Wij.

A slight simplification

Also assume the graph is connected and not bipartite. Then, the stationary distribution of the random walk is

    πi = Wi / Σk Wk.

What about expected hitting and commute times? It turns out that the expected hitting and commute times are captured by the graph’s electrical resistance.

Connection to resistive networks

Put a resistor on each edge {i, j} with resistance Rij = 1/Wij (i.e. with conductance Wij). Now consider fixed nodes i and j.

- Inject Wi = Σk Wik current into node i.
- By Kirchhoff’s current law (Iin = Iout) and Ohm’s law (V = IR),

      Wi = Σ_{(i,k)∈E} Iik = Σ_{(i,k)∈E} Vik/Rik = Σ_{(i,k)∈E} Vik Wik.

- By Kirchhoff’s voltage law, Vij = Vik + Vkj.
- Get Wi = Σk Wik (Vij − Vkj). After rearranging, this is

      Vij = 1 + Σk (Wik/Wi) Vkj = 1 + Σk Pik Vkj.

Connection to resistive networks

The recurrence relation for the voltage Vij when Wi units of current are injected into node i is the same as the recurrence relation for the expected hitting time Hij. So identify Vij ≡ Hij.

To get an explicit formula for the expected commute time Cij = Hij + Hji, we’ll use the superposition property of linear equations (resistive networks are characterized by linear equations).

Deriving the expected commute time: four cases

Adapted from Karp (2003). Each row of the table is a current configuration and the voltages it induces. The second row is the first with i and j swapped; the third is the negation of the second; and, by superposition, the fourth row is the sum of the first and third.

    Current into i   | Current into j   | Current into k (k ≠ i, j) | Vij  | Vji
    -----------------+------------------+---------------------------+------+------
    Wi               | −(Σk Wk − Wj)    | Wk                        | Hij  | −Hij
    −(Σk Wk − Wi)    | Wj               | Wk                        | −Hji | Hji
    Σk Wk − Wi       | −Wj              | −Wk                       | Hji  | −Hji
    Σk Wk            | −Σk Wk           | 0                         | Cij  | −Cij

Using Ohm’s law, we have the expected commute time

    Cij = (Σk Wk) R̃ij = 2 Wtotal R̃ij,

where Wtotal is the total weight of the graph, and R̃ij is the effective resistance between nodes i and j.

Drawbacks of expected hitting and commute times

Using expected hitting and commute times as a basis for recommendation is natural. However, they can be dominated by the stationary distribution, so the same popular items are recommended to everyone. Idea: can still use expected hitting and commute times, but not directly.

Cosine correlation

Here is a popular idea from information retrieval. Suppose x and y are count vectors (e.g. word counts of a document). Similarity is measured by the dot product x · y.

Problem: longer similar documents get larger dot products than shorter similar documents.

Solution: just look at the angle θ between x and y:

    x · y = ‖x‖‖y‖ cos θ,  so  cos θ = (x · y) / (‖x‖‖y‖).

Can we use this cosine correlation with Hij or Cij?

Effective resistance as a metric

Recall: Cij = 2 Wtotal R̃ij. So any metric properties of R̃ will also hold for C. It is easy to check that

- R̃ij ≤ R̃ik + R̃kj,
- R̃ii = 0, and
- R̃ij = R̃ji.

So the effective resistance R̃ is a metric. But a general metric is not enough to talk about angles; we need a Euclidean metric. We will show that the square-root of effective resistance is a Euclidean metric.

The Laplacian matrix

The Laplacian matrix L of the graph is

    L = D − W

where D = diag(W1, W2, ..., Wn). Note that diag(W) = 0, so

    L = [  W1    −W12   ...  −W1n ]
        [ −W12    W2    ...  −W2n ]
        [  ...    ...   ...   ... ]
        [ −W1n   −W2n   ...   Wn  ]

Note that the sum of each row is 0, and so is the sum of each column (by symmetry of W).

The Laplacian matrix and its pseudoinverse

The Laplacian L is

- symmetric, because W is symmetric, and
- positive-semidefinite, because

      x′Lx = Σ_{i<j} Wij (xi − xj)^2 ≥ 0.

L has a pseudoinverse L+ given by

    L+ = (L − (1/n) 11′)^{-1} + (1/n) 11′.

L+ is also symmetric and positive-semidefinite (not obvious, but it’s true).
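(Aside: a numerical check of this formula on the toy graph, against NumPy's Moore-Penrose pseudoinverse.)

```python
L = np.diag(W.sum(axis=1)) - W    # Laplacian L = D - W
n = L.shape[0]
J = np.ones((n, n)) / n           # (1/n) 1 1'

Lplus = np.linalg.inv(L - J) + J  # the formula above
assert np.allclose(Lplus, np.linalg.pinv(L))
```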

A Euclidean metric from the Laplacian’s pseudoinverse

Furthermore,

    R̃ij = (ei − ej)′ L+ (ei − ej),

where ei is the ith elementary vector (1 in the ith entry, zero everywhere else). This comes from yet another, less intuitive derivation of expected hitting time. This is a squared Mahalanobis distance (L+ is symmetric positive-semidefinite), so its square-root is a Euclidean metric.

Cosine correlation for random walk

The square-root of effective resistance √R̃ij defines a Euclidean metric, so the “angle” θij between i and j is well-defined:

    R̃ij = (ei − ej)′ L+ (ei − ej) = L+ii − 2 L+ij + L+jj

    cos θij = L+ij / √(L+ii L+jj)
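(Aside: in matrix form, all effective resistances and cosine correlations come from the entries of L+; a sketch tying the pieces together, reusing the toy-graph variables from the earlier sketches.)

```python
d = np.diag(Lplus)
R = d[:, None] + d[None, :] - 2 * Lplus  # R~_ij = L+_ii - 2 L+_ij + L+_jj
cos = Lplus / np.sqrt(np.outer(d, d))    # cos(theta_ij)

# Commute-time check: C_ij = (sum_k W_k) R~_ij should equal H_ij + H_ji.
H = hitting_time_matrix(P, pi)
assert np.allclose(W.sum() * R, H + H.T)
```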

Interpreting cosine correlation for random walk

Identifying R̃ij ≡ ‖xi − xj‖^2, one can deduce

    ‖xi‖ = √(L+ii).

For large Markov chains, this is approximately the recurrence time 1/πi, a measure of “generic” popularity. If the embedded points xi are projected onto a unit hypersphere (thus removing all “generic” popularity), then

    cos θij = 1 − dij^2 / 2,

where dij is the resulting Euclidean distance between i and j.

Outline

- Motivation
- The random walk model
- Using the model
  - Making recommendations
  - Turning a profit
- Evaluation

Making recommendations

Recommendations are made with respect to a query state (e.g. customer, currently viewed product, search query). Given a query state i, rank the other states j according to cos θij and recommend the top hits.

The problem is similar to semi-supervised classification (learning with both labeled and unlabeled data), and the cosine correlation is a superior similarity measure compared to other proposed methods. . . (on one toy example).

Semi-supervised classification

[Figure: toy semi-supervised classification example; content not recoverable from the text.]

Turning a profit (at least in expectation)

Goal (from decision theory): “recommend the product (state) with the greatest expected profit, discounted over time.”

Let $ ∈ R^n be the vector of profit (or loss) for each state, and e^{-β} (β > 0) be the discount factor. Then, the expected discounted profit is

    v = Σ_{t=0}^∞ e^{-tβ} P^t $
      = (Σ_{t=0}^∞ (e^{-β} P)^t) $
      = (I − e^{-β} P)^{-1} $        (since (e^{-β} P)^t → 0).

To maximize expected discounted profit at query state i, choose

    j = argmax_{j: Pij > 0} Pij vj.

Outline

- Motivation
- The random walk model
- Using the model
- Evaluation
  - Data set and model
  - Maximizing satisfaction
  - Maximizing profit
  - Discussion

Experimental setup: data set and model

Data comes from the MovieLens database:

- Ratings on a 1–5 scale for 1682 movies by 943 individuals
- Each individual viewed 20–737 movies (106 on average)
- Each movie received 1–583 ratings (60 on average)
- The ratings table is 93.7% empty (i.e. most viewers have not seen most movies)
- Movies are classified into 19 genres
- Individuals are classified into 2 genders, 21 vocations, and 8 overlapping age groups

Constructed an n = 2657 node graph with

    Wij = 1     if i belongs in class j
    Wij = rij   if individual i rates movie j with rij
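(Aside: a sketch of assembling such a graph; the node indexing and input layout here are my assumptions, not specified in the slides.)

```python
def build_association_graph(n_nodes, ratings, memberships):
    """ratings: iterable of (person, movie, r_ij); memberships: iterable of
    (node, class_node). All identifiers are assumed to index 0..n_nodes-1."""
    W = np.zeros((n_nodes, n_nodes))
    for i, j, r in ratings:       # W_ij = r_ij for rated movies
        W[i, j] = W[j, i] = r
    for i, j in memberships:      # W_ij = 1 for class membership
        W[i, j] = W[j, i] = 1.0
    return W
```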

Task 1: recommending to maximize satisfaction

Randomly partition the data into training and test sets.

1. The test set contains 10 ratings from each viewer.
2. Take the 10 top-ranked movies not in the training set as the recommendations.
3. Score the recommendations by the sum of the individual’s held-out ratings for the recommended movies.

Compared different similarity measures:

- cosine correlation
- expected commute time, expected hitting time
- stationary distribution
- normalized hitting time, normalized commute time

Task 1 results

Compared average score across all 943 individuals and 500 trials.

- Scores range from 0 to 50.
- An omniscient oracle scores ≤ 35.3 on average (due to sparsity of the data and the low average rating).
- Random recommendations score 2.2 on average.

Task 1 results

[Results figure omitted.]

Task 2: recommending to maximize profit

Similar setup as before, except the scoring is changed:

1. A priori, randomly assign each movie j a profit pj ∼ N(0, 1).
2. For t = 1 to 10:
   a. Recommend a movie.
   b. If the movie is in the individual’s held-out set, receive profit e^{-tβ} pj.

Compared different recommenders:

- maximum expected discounted profit
- cosine correlation, but only movies with positive profit
- cosine correlation, allowing all movies
- expected commute time, expected hitting time, stationary distribution

Task 2 results

[Results figure omitted.]

Discussion

Cosine correlation performs significantly better than the stationary distribution. This suggests that it is sensitive to individual preferences.

Issues with the experimental study:

- Recommended movies not rated by an individual in the test set were given a score of 0.
- Only “random walk”-related similarity measures were considered.

References

D. Aldous and J. Fill. Reversible Markov Chains and Random Walks on Graphs. Monograph in preparation, www.stat.berkeley.edu/users/aldous/RWG/book.html.

M. Brand. A random walks perspective on maximizing satisfaction and profit. In Proceedings of the SIAM International Conference on Data Mining, 2005.

F. Fouss, A. Pirotte, J. Renders, and M. Saerens. A Novel Way of Computing Dissimilarities Between Nodes of a Graph, with Application to Collaborative Filtering. In Proceedings of the ECML Workshop on Statistical Approaches for Web Mining, 2004.

R. Karp. Lecture, U.C. Berkeley, November 12, 2003.