A Novel Distributed Collaborative Filtering Algorithm and Its Implementation on P2P Overlay Network* Peng Han, Bo Xie, Fan Yang, Jiajun Wang, and Ruimin Shen Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200030, China {phan,bxie,fyang,jjwang,rmshen}@sjtu.edu.cn
Abstract. Collaborative filtering (CF) has proved to be one of the most effective information filtering techniques. However, because their calculation complexity grows quickly in both time and space as the number of records in the user database increases, traditional centralized CF algorithms suffer from poor scalability. In this paper, we first propose a novel distributed CF algorithm called PipeCF through which both the user database management and the prediction task can be done in a decentralized way. We then propose two novel approaches, significance refinement (SR) and unanimous amplification (UA), to further improve the scalability and prediction accuracy of PipeCF. Finally, we give the algorithm framework and system architecture of an implementation of PipeCF on a Peer-to-Peer (P2P) overlay network through the distributed hash table (DHT) method, one of the most popular and effective routing algorithms in P2P. The experimental data show that our distributed CF algorithm has much better scalability than traditional centralized ones with comparable prediction efficiency and accuracy.
1 Introduction
Collaborative filtering (CF) has proved to be one of the most effective information filtering techniques since Goldberg et al. [1] published the first account of using it for information filtering. Unlike content-based filtering, the key idea of CF is that users will prefer those items that people with similar interests prefer, or even that dissimilar people don't prefer. The k-Nearest Neighbor (KNN) method is a popular realization of CF for its simplicity and reasonable performance, and many successful applications have been built on it, such as GroupLens [4] and Ringo [5]. However, because its computation complexity grows quickly in both time and space as the number of records in the database increases, the KNN-based CF algorithm suffers from poor scalability. One way to avoid the recommendation-time computational complexity of a KNN method is to use a model-based method, which uses the users' preferences to learn a model that is then used for predictions. Breese et al. utilize clustering and Bayesian networks for model-based CF algorithms in [3]. Their results show that the clustering-based method is more efficient but suffers from poor accuracy, while
___________________
* Supported by the National Natural Science Foundation of China under Grant No. 60372078
H. Dai, R. Srikant, and C. Zhang (Eds.): PAKDD 2004, LNAI 3056, pp. 106–115, 2004
© Springer-Verlag Berlin Heidelberg 2004
the Bayesian networks prove practical only for environments in which knowledge of user preferences changes slowly. Furthermore, all model-based CF algorithms still require a central database to keep all the user data, which is sometimes hard to achieve not only for technical reasons but also for privacy reasons. An alternative way to address the computational complexity is to implement the KNN algorithm in a distributed manner. As the Peer-to-Peer (P2P) overlay network gains more and more popularity for its advantage in scalability, some researchers have already begun to consider it as an alternative architecture [7,8,9] to the centralized CF recommender system. These methods increase the scalability of CF recommender systems dramatically. However, as they use a totally different mechanism to find appropriate neighbors than KNN algorithms, their performance is hard to analyze and may be affected by many other factors such as network conditions and the self-organization scheme. In this paper we solve the scalability problem of the KNN-based CF algorithm by proposing a novel distributed CF algorithm called PipeCF, which has the following advantages:
1. In PipeCF, both the user database management and the prediction computation task can be done in a decentralized way, which increases the algorithm's scalability dramatically.
2. PipeCF keeps all the other features of the traditional KNN CF algorithm, so that the system's performance can be analyzed both empirically and theoretically, and improvements on the traditional KNN algorithm can also be applied here.
3. Two novel approaches are proposed in PipeCF to further improve the prediction accuracy and scalability of the KNN CF algorithm and to reduce the calculation complexity in terms of M, the user number in the database, and N, the item number.
4.
By designing a heuristic user database division strategy, the implementation of PipeCF on a distributed-hash-table (DHT) based P2P overlay network is quite straightforward, obtaining efficient user database management and retrieval at the same time.
The rest of this paper is organized as follows. In Section 2, several related works are presented and discussed. In Section 3, we give the architecture and key features of PipeCF; two techniques, SR and UA, are also proposed in this section. We then give the implementation of PipeCF on a DHT-based P2P overlay network in Section 4. In Section 5, the experimental results of our system are presented and analyzed. Finally, we make a brief concluding remark and discuss future work in Section 6.
2 Related Works
2.1 Basic KNN-Based CF Algorithm
Generally, the task of CF is to predict the votes of an active user from the user database, which consists of a set of votes v_{i,j} corresponding to the vote of user i on item j. The KNN-based CF algorithm calculates this prediction as a weighted average of other users' votes on that item through the following formula:
$$P_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i) \qquad (1)$$

where $P_{a,j}$ denotes the prediction of the vote for active user $a$ on item $j$ and $n$ is the number of users in the user database. $\bar{v}_i$ is the mean vote for user $i$:

$$\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j} \qquad (2)$$

where $I_i$ is the set of items on which user $i$ has voted. The weights $w(a,i)$ reflect the similarity between the active user and user $i$ in the user database, and $\kappa$ is a normalizing factor that makes the absolute values of the weights sum to unity.
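As a concrete illustration, Equations (1) and (2) can be sketched in Python. This is a minimal, hypothetical rendering: the Pearson correlation used for the weight $w(a,i)$ is a common choice in KNN-based CF, though this section does not fix a particular similarity measure.

```python
import math

def pearson(u, v):
    """Pearson correlation over co-rated items, a common choice for w(a, i)."""
    common = set(u) & set(v)
    if len(common) < 2:
        return 0.0
    mu = sum(u[j] for j in common) / len(common)
    mv = sum(v[j] for j in common) / len(common)
    cov = sum((u[j] - mu) * (v[j] - mv) for j in common)
    var = math.sqrt(sum((u[j] - mu) ** 2 for j in common) *
                    sum((v[j] - mv) ** 2 for j in common))
    return cov / var if var else 0.0

def predict(active_votes, database, item):
    """Eq. (1): predict the active user's vote on `item` as the
    kappa-normalized weighted sum of mean-offset votes.
    `active_votes` and each entry of `database` map item -> vote."""
    v_a = sum(active_votes.values()) / len(active_votes)
    num, denom = 0.0, 0.0
    for votes in database:
        if item not in votes:
            continue
        v_i = sum(votes.values()) / len(votes)   # Eq. (2): mean vote of user i
        w = pearson(active_votes, votes)         # similarity weight w(a, i)
        num += w * (votes[item] - v_i)
        denom += abs(w)                          # kappa makes |weights| sum to unity
    return v_a + num / denom if denom else v_a
```

With no usable neighbors (all weights zero), the sketch falls back to the active user's mean vote.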
2.2 P2P System and DHT Routing Algorithm
The term "Peer-to-Peer" refers to a class of systems and applications that employ distributed resources to perform a critical function in a decentralized manner. Some of the benefits of a P2P approach include: improving scalability by avoiding dependency on centralized points, and eliminating the need for costly infrastructure by enabling direct communication among clients. As the main purpose of P2P systems is to share resources among a group of computers, called peers, in a distributed way, efficient and robust routing algorithms for locating a wanted resource are critical to the performance of P2P systems. Among these algorithms, the distributed hash table (DHT) algorithm is one of the most popular and effective, and is supported by many P2P systems such as CAN [10], Chord [11], Pastry [12], and Tapestry [13]. A DHT overlay network is composed of several DHT nodes, and each node keeps a set of resources (e.g., files, ratings of items). Each resource is associated with a key (produced, for instance, by hashing the file name), and each node in the system is responsible for storing a certain range of keys. Peers in the DHT overlay network locate a wanted resource by issuing a lookup(key) request, which returns the identity (e.g., the IP address) of the node that stores the resource with that key. The primary goals of DHT are to provide an efficient, scalable, and robust routing algorithm that reduces the number of P2P hops involved in locating a certain resource, and to reduce the amount of routing state that must be preserved at each peer.
3 PipeCF: A Novel Distributed CF Algorithm
3.1 Basic PipeCF Algorithm
The first step in implementing a CF algorithm in a distributed way is to divide the original centralized user database into fractions which can then be stored in distributed peers. For concision, we will use the term bucket to denote a distributed-stored fraction of the
user database in the remainder of this paper. Two critical problems must be considered here. The first is how to assign each bucket a unique identifier through which it can be efficiently located. The second is which buckets should be retrieved when we need to make a prediction for a particular user. We solve the first problem by proposing a division strategy which makes each bucket hold the records of a group of users who share a particular ⟨ITEM_ID, VOTE⟩ tuple. This means that users in the same bucket have voted on at least one common item with the same rating. The tuple is then used to generate a unique key as the identifier for the bucket in the network, which we describe in more detail in Section 4. To solve the second problem, we propose a heuristic bucket choosing strategy: we retrieve only those buckets whose identifiers match those generated from the active user's ratings. Figure 1 gives the framework of PipeCF. Details of the function lookup(key) and the implementation of PipeCF on a DHT-based P2P overlay network are described in Section 4. The bucket choosing strategy of PipeCF is based on the assumption that people with similar interests will rate at least one item with similar votes. So when making a prediction, PipeCF only uses the records of those users that are in the same bucket as the active user. As we can see in Figure 5 of Section 5.3.1, this strategy has a very high hit ratio. Moreover, through this strategy we reduce calculation by about 50% compared with the traditional CF algorithm while obtaining comparable predictions, as shown in Figures 6 and 7 in Section 5.
Algorithm: PipeCF
Input: rating record of the active user, target item
Output: predictive rating for the target item
Method:
For each ⟨ITEM_ID, VOTE⟩ tuple in the rating record of the active user:
1) Generate the key corresponding to the tuple through the hash algorithm used by DHT
2) Find the host which holds the bucket with identifier key through the function lookup(key) provided by DHT
3) Copy all ratings in bucket key to the current host
Use the traditional KNN-based CF algorithm to calculate the predictive rating for the target item.
Fig. 1. Framework of PipeCF
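The candidate-gathering phase of this framework can be sketched as follows. This is a simplified, hypothetical rendering: `lookup` stands in for the DHT lookup(key) function, and MD5 is an arbitrary stand-in for the DHT's hash algorithm.

```python
import hashlib

def bucket_key(item_id, vote):
    """Identifier for the bucket of users who gave `vote` to `item_id`
    (MD5 here is only a stand-in for the DHT's actual hash algorithm)."""
    return hashlib.md5(f"{item_id}:{vote}".encode()).hexdigest()

def pipecf_candidates(active_votes, lookup):
    """For each (ITEM_ID, VOTE) tuple of the active user, fetch the bucket
    with the matching identifier; the union of those buckets is the
    candidate neighbour set fed to the KNN prediction step."""
    candidates = {}
    for item_id, vote in active_votes.items():
        for user_id, votes in lookup(bucket_key(item_id, vote)):
            candidates[user_id] = votes  # de-duplicate users found in several buckets
    return candidates
```

The traditional KNN-based CF computation then runs unchanged over `candidates` instead of the full user database.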
3.2 Some Improvements
3.2.1 Significance Refinement (SR)
In the basic PipeCF algorithm, we return all users that share at least one bucket with the active user, and we find that the algorithm has an O(N) fetched user number, where N is the total user number, as Figure 7 shows. In fact, as Breese pointed out in [3] with the term inverse user frequency, universally liked items are not as useful as less common items in capturing similarity. So we introduce a new concept, significance refinement (SR), which reduces the returned user number of the basic PipeCF algorithm by limiting the number of returned users for each bucket. We term
the algorithm improved by SR "Return K", which means "for every item, the PipeCF algorithm returns no more than K users for each bucket". The experimental results in Figures 7 and 8 of Section 5.3.3 show that this method reduces the returned user number dramatically and also improves the prediction accuracy.
3.2.2 Unanimous Amplification (UA)
In our experiments with the KNN-based CF algorithm, we have found that some highly correlated neighbors have few items on which they voted the same rating as the active user. These neighbors frequently prove to have worse prediction accuracy than neighbors who share the same ratings with the active user but have relatively lower correlation. So we argue that we should give a special award to users who rated some items with the same votes by amplifying their weights, which we term Unanimous Amplification. We transform the estimated weights as follows:
$$w'_{a,i} = \begin{cases} w_{a,i} & N_{a,i} = 0 \\ w_{a,i} \cdot \alpha & 0 < N_{a,i} \le \gamma \\ w_{a,i} \cdot \beta & N_{a,i} > \gamma \end{cases} \qquad (3)$$

where $N_{a,i}$ denotes the number of items on which user $a$ and user $i$ have the same votes. Typical values in our experiments are $\alpha = 2.0$, $\beta = 4.0$, and $\gamma = 4$. The experimental result in Figure 9 of Section 5.3.4 shows that the UA approach improves the prediction accuracy of the PipeCF algorithm.
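The UA transformation of Equation (3) is a simple piecewise scaling; a minimal sketch using the typical parameter values stated above:

```python
def amplify(w, n_same, alpha=2.0, beta=4.0, gamma=4):
    """Unanimous Amplification (Eq. 3): reward a neighbour who cast
    n_same = N_{a,i} votes identical to the active user's."""
    if n_same == 0:
        return w
    if n_same <= gamma:
        return w * alpha
    return w * beta
```

The adjusted weights replace $w_{a,i}$ in the prediction formula of Equation (1).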
4 Implementation of PipeCF on a DHT-Based P2P Overlay Network
4.1 System Architecture
Figure 2 gives the system architecture of our implementation of PipeCF on the DHT-based P2P overlay network. Here, we view the users' ratings as resources, and the system generates a unique key for each particular ⟨ITEM_ID, VOTE⟩ tuple through the hash algorithm, where ITEM_ID denotes the identity of the item the user votes on and VOTE is the user's rating on that item. As different users may vote on a particular item with the same rating, each key corresponds to a set of users who share the same ⟨ITEM_ID, VOTE⟩ tuple. As we stated in Section 3, we call such a set of users' records a bucket. As we can see in Figure 2, each peer in the distributed CF system is responsible for storing one or several buckets. Peers are connected through a DHT-based P2P overlay network and can find their wanted buckets by their keys efficiently through the DHT-based routing algorithm. As we can see from Figure 1 and Figure 2, the implementation of PipeCF on the DHT-based P2P overlay network is quite straightforward except for two key pieces: how to store the buckets and how to fetch them back effectively in this distributed environment. We solve these problems through two fundamental DHT functions, put(key) and lookup(key), which are described in Figure 3 and Figure 4 respectively.
These two functions inherit the following merits from DHT:
− Scalability: the system must be designed to scale to several million nodes.
− Efficiency: similar users should be located reasonably quickly and with low overhead in terms of the message traffic generated.
− Dynamicity: the system should be robust to frequent node arrivals and departures in order to cope with the highly transient user populations characteristic of decentralized environments.
− Balanced load: in keeping with the decentralized nature, the total resource load (traffic, storage, etc.) should be roughly balanced across all the nodes in the system.
Fig. 2. System Architecture of Distributed CF Recommender System
5 Experimental Results
5.1 Data Set
We use the EachMovie data set [6] to evaluate the performance of the improved algorithm. The EachMovie data set is provided by the Compaq Systems Research Center, which ran the EachMovie recommendation service for 18 months to experiment with a collaborative filtering algorithm. The information gathered during that period consists of 72,916 users, 1,628 movies, and 2,811,983 numeric ratings ranging from 0 to 5. To speed up our experiments, we use only a subset of the EachMovie data set.
5.2 Metrics and Methodology
The metric we use here for evaluating accuracy is a statistical accuracy metric, which evaluates the accuracy of a predictor by comparing predicted values with user-provided values. More specifically, we use the Mean Absolute Error (MAE), a statistical
accuracy metric, to report prediction experiments, because it is most commonly used and easy to understand:
$$\mathrm{MAE} = \frac{\sum_{a \in T} |v_{a,j} - p_{a,j}|}{|T|} \qquad (4)$$

where $v_{a,j}$ is the rating given to item $j$ by user $a$, $p_{a,j}$ is the predicted value of user $a$ on item $j$, $T$ is the test set, and $|T|$ is the size of the test set.
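Equation (4) amounts to a one-line computation over (actual, predicted) pairs; a minimal sketch:

```python
def mae(pairs):
    """Mean Absolute Error (Eq. 4) over (actual, predicted) vote pairs."""
    return sum(abs(v - p) for v, p in pairs) / len(pairs)
```

Lower MAE means better prediction accuracy; for example, pairs (4, 5) and (2, 2) give an MAE of 0.5.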
Algorithm: DHT-based CF put — a peer P publishes its vote vector to the DHT overlay network
Input: P's vote vector
Output: NULL
Method:
For each ⟨ITEM_ID, VOTE⟩ tuple in P's vote vector:
1) P generates a unique 128-bit DHT key K_local (e.g., by hashing the system-unique username).
2) P hashes this tuple to key K, and routes it with P's vote vector to the neighbor Pi whose local key Ki_local is most similar to K.
3) When Pi receives the PUT message with K, it caches it. If the most similar neighbor is not Pi itself, it routes the message on to its neighbor whose local key is most similar to K.
Fig. 3. DHT Put(key) Function
Algorithm: lookup(key)
Input: identifier key of the targeted bucket
Output: targeted bucket (retrieved from other peers)
Method:
1) Route the key K of the targeted bucket to the neighbor Pi whose local key Ki_local is most similar to K.
2) When Pi receives the LOOKUP message with K, if Pi has enough cached vote vectors with the same key K, it returns the vectors to P; otherwise it routes the message on to its neighbor whose local key is most similar to K. Either way, P finally gets all the records in the bucket whose identifier is key.
Fig. 4. DHT Lookup(key) Function
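The put and lookup functions of Figures 3 and 4 can be simulated locally as follows. This is a toy, single-process sketch: "routing" is replaced by a direct scan for the node whose local key is numerically closest to K, whereas a real DHT resolves this in O(log N) hops without global knowledge.

```python
import hashlib

def hash_key(s):
    """128-bit key from an arbitrary string (MD5, as a stand-in hash)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Node:
    def __init__(self, name):
        self.local_key = hash_key(name)  # K_local, e.g. hash of the username
        self.store = {}                  # key K -> cached vote vectors

class ToyDHT:
    """Local stand-in for the overlay network of Figures 3 and 4."""
    def __init__(self, names):
        self.nodes = [Node(n) for n in names]

    def _nearest(self, key):
        # stands in for routing to the neighbor whose local key is most similar to K
        return min(self.nodes, key=lambda nd: abs(nd.local_key - key))

    def put(self, key, vote_vector):
        self._nearest(key).store.setdefault(key, []).append(vote_vector)

    def lookup(self, key):
        return self._nearest(key).store.get(key, [])
```

Because put and lookup resolve the same key to the same node, a bucket stored by one peer is found by any other peer issuing lookup(key) with the matching ⟨ITEM_ID, VOTE⟩-derived key.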
We select 2000 users and choose one user at a time as the active user, with the remaining users as candidate neighbors, because every user makes recommendations only locally. We use the mean prediction accuracy over all 2000 users as the system's prediction accuracy. For every user's recommendation calculation, our tests are performed using 80% of the user's ratings for training, with the remainder for testing.
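The per-user 80/20 split described above can be sketched as follows; the helper name and the fixed seed are our own choices for illustration.

```python
import random

def split_ratings(votes, train_frac=0.8, seed=42):
    """Split one user's ratings: `train_frac` of the items are visible to
    the recommender, the rest are held out for computing the MAE."""
    rng = random.Random(seed)
    items = list(votes)
    rng.shuffle(items)
    cut = int(len(items) * train_frac)
    train = {j: votes[j] for j in items[:cut]}
    test = {j: votes[j] for j in items[cut:]}
    return train, test
```

The system MAE is then the mean of the per-user MAEs computed on each held-out portion.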
5.3 Experimental Results
We design several experiments to evaluate our algorithm and analyze the effect of various factors (e.g., SR and UA) by comparison. All our experiments are run on a Windows 2000 based PC with an Intel Pentium 4 processor at 1.8 GHz and 512 MB of RAM.
5.3.1 The Efficiency of Neighbor Choosing
We used a data set of 2000 users and show in Figure 5 how many of the users chosen by the PipeCF algorithm are among the top-100 users. We can see from the data that when the user number rises above 1000, more than 80 of the users who have the most similarity with the active user are chosen by the PipeCF algorithm.
Fig. 5. How Many Users Chosen by PipeCF in Traditional CF’s Top 100
Fig. 6. PipeCF vs. Traditional CF
5.3.2 Performance Comparison
We compare the prediction accuracy of the traditional CF algorithm and the PipeCF algorithm while applying both top-all and top-100 user selection to them. The results are shown in Figure 6. We can see that the DHT-based algorithm has better prediction accuracy than the traditional CF algorithm.
5.3.3 The Effect of Significance Refinement
We limit the number of returned users for each bucket to 2 and to 5 and repeat the experiment of Section 5.3.2. The users for each bucket are chosen randomly. The number of users chosen and the prediction accuracy are shown in Figure 7 and Figure 8 respectively. The results show:
− "Return All" has an O(N) returned user number, and its prediction accuracy is also not satisfying;
− "Return 2" has the smallest returned user number but the worst prediction accuracy;
− "Return 5" has the best prediction accuracy, and its scalability is still reasonably good (the returned user number remains bounded by a constant as the total user number increases).
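The "Return K" cap amounts to random subsampling within each bucket; a minimal sketch, with "Return All" as the uncapped baseline:

```python
import random

def return_k(bucket_users, k=None):
    """Significance Refinement: return at most K randomly chosen users
    from a bucket ('Return K'); k=None means 'Return All'."""
    users = list(bucket_users)
    if k is None or len(users) <= k:
        return users
    return random.sample(users, k)
```

Capping each bucket bounds the total fetched user number by K times the active user's rating count, independently of the total user number N.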
Fig. 7. The Effect on Scalability of SR on PipeCF
Fig. 8. The Effect on Prediction Accuracy of SR on PipeCF Algorithm
Fig. 9. The Effect on Prediction Accuracy of Unanimous Amplification
5.3.4 The Effect of Unanimous Amplification
We adjust the weights for each user using Equation (3), setting α to 2.0, β to 4.0, and γ to 4, and repeat the experiment of Section 5.3.2. We use the top-100 and "Return All" selection methods. The results show that the UA approach improves the prediction accuracy of both the traditional and the PipeCF algorithm. From Figure 9 we can see that when the UA approach is applied, the two kinds of algorithms have almost the same performance.
6 Conclusion
In this paper, we solve the scalability problem of the KNN-based CF algorithm by proposing a novel distributed CF algorithm called PipeCF and giving its implementation on a DHT-based P2P overlay network. Two novel approaches, significance refinement (SR) and unanimous amplification (UA), have been proposed to improve
the performance of the distributed CF algorithm. The experimental data show that our algorithm has much better scalability than the traditional KNN-based CF algorithm with comparable prediction efficiency.
References
1. Goldberg, D., Nichols, D., Oki, B. M., Terry, D.: Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), pp. 61-70, Dec. 1992.
2. Herlocker, J. L., Konstan, J. A., Borchers, A., Riedl, J.: An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 230-237, 1999.
3. Breese, J., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43-52, 1998.
4. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pp. 175-186, Chapel Hill, North Carolina, October 1994.
5. Shardanand, U., Maes, P.: Social information filtering: algorithms for automating "word of mouth". In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210-217, Denver, Colorado, May 1995.
6. EachMovie collaborative filtering data set: http://research.compaq.com/SRC/eachmovie
7. Tveit, A.: Peer-to-peer based recommendations for mobile commerce. In Proceedings of the First International Mobile Commerce Workshop, pp. 26-29, ACM Press, Rome, Italy, July 2001.
8. Olsson, T.: Bootstrapping and decentralizing recommender systems. Licentiate Thesis 2003-006, Department of Information Technology, Uppsala University and SICS, 2003.
9. Canny, J.: Collaborative filtering with privacy. In Proceedings of the IEEE Symposium on Security and Privacy, pp. 45-57, Oakland, CA, May 2002.
10. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A scalable content-addressable network. In SIGCOMM, Aug. 2001.
11. Stoica, I., et al.: Chord: a scalable peer-to-peer lookup service for Internet applications. In ACM SIGCOMM, San Diego, CA, USA, pp. 149-160, 2001.
12. Rowstron, A., Druschel, P.: Pastry: scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM Middleware, Heidelberg, Germany, 2001.
13. Zhao, B. Y., et al.: Tapestry: an infrastructure for fault-tolerant wide-area location and routing. Tech. Rep. UCB/CSD-01-1141, UC Berkeley, EECS, 2001.