Sparse Learning-to-Rank via an Efficient Primal-Dual Algorithm

IEEE TRANSACTIONS ON COMPUTERS, VOL. , NO. , JULY 2011


Hanjiang Lai, Yan Pan, Cong Liu, Liang Lin, Jie Wu, Fellow, IEEE

Abstract—Learning-to-rank for information retrieval has gained increasing interest in recent years. Inspired by the success of sparse models, we consider the problem of sparse learning-to-rank, where the learned ranking models are constrained to have only a few non-zero coefficients. We begin by formulating the sparse learning-to-rank problem as a convex optimization problem with a sparse-inducing $\ell_1$ constraint. Since the $\ell_1$ constraint is non-differentiable, the critical issue arising here is how to efficiently solve the optimization problem. To address this issue, we propose a learning algorithm from the primal-dual perspective. Furthermore, we prove that, after at most $O(1/\epsilon)$ iterations, the proposed algorithm is guaranteed to obtain an $\epsilon$-accurate solution. This convergence rate is better than that of the popular sub-gradient descent algorithm, i.e., $O(1/\epsilon^2)$. Empirical evaluation on several public benchmark datasets demonstrates the effectiveness of the proposed algorithm: (1) Compared to the methods that learn dense models, learning a ranking model with sparsity constraints significantly improves the ranking accuracies. (2) Compared to other methods for sparse learning-to-rank, the proposed algorithm tends to obtain sparser models and has superior performance gain on both ranking accuracies and training time. (3) Compared to several state-of-the-art algorithms, the ranking accuracies of the proposed algorithm are very competitive and stable.

Index Terms—learning-to-rank, sparse models, ranking algorithm, Fenchel Duality.


1 INTRODUCTION

Ranking is a crucial task for information retrieval systems, in particular for web search engines. Learning-to-rank is a task that applies machine learning techniques to learn good ranking predictors for sorting a set of entities/documents. It has been drawing increasing interest in information retrieval and machine learning research. Many learning-to-rank algorithms have been proposed in the literature, such as [3], [4], [5], [6], [7], [8].

In many machine learning applications, such as computer vision and bioinformatics, it is desirable to learn a sparse model, that is, a model with only a few non-zero coefficients with respect to the input features. Models with sparsity constraints are also desirable in ranking. Firstly, some new data sets for ranking, such as the data sets for Yahoo!'s Learning-to-Rank Challenge (footnote 1) and Microsoft's data sets for large-scale learning-to-rank (footnote 2), contain high dimensional features. High dimensional features lead to the problem that the dense models learned are complicated and hard to interpret. Secondly, high dimensional features may be redundant or noisy, which results in poor generalization performance. Lastly, a sparse model has less computational cost in prediction.

• Corresponding author: Yan Pan. E-mail: [email protected].
• Hanjiang Lai and Cong Liu are with the School of Information Science and Technology, Sun Yat-sen University, Guangzhou, China, 510006.
• Yan Pan and Liang Lin are with the School of Software, Sun Yat-sen University, Guangzhou, China, 510006.
• Jie Wu is with the Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA.
1. http://learningtorankchallenge.yahoo.com/
2. http://research.microsoft.com/en-us/projects/mslr/

The following is an intuitive example to illustrate why sparse learning works: a few strong features can dominate the whole ranking performance. We sort the documents in the TD2004 dataset (in LETOR 3.0 [19]) using each of the 64 given features individually. The resulting values of the Normalized Discounted Cumulative Gain (NDCG)@10 evaluation metric (see Section 8.2 for more details about NDCG) are shown in Figure 1. For comparison, we generate a random predictor that uses all of the features as follows: we randomly initialize a weight vector $w = (w_1, w_2, \cdots, w_{64})$, with $w_i \ge 0$ for all $i$ and $\sum_{i=1}^{64} w_i = 1$, ten times; each time we sort the documents in decreasing order of the inner product $\langle w, x \rangle$, where $x$ denotes the feature vector of a document; and we then average the ten NDCG@10 values. We can observe in Figure 1 that only several strong features, i.e., BM25, language models, PageRank, HITS [19], have clearly higher NDCG@10 values than the average of the ten random sortings (the red line), while many are lower. Moreover, there are some poor features whose NDCG@10 values are very far below the average value of random sorting.

To sum up, there can be cases where the whole ranking performance is dominated by only a small number of strong features. In such cases, learning a sparse model with only a few non-zero coefficients is desirable for ranking, and it has the potential to achieve better performance. To obtain sparse ranking models, a natural way is to construct a ranking model with the smallest number


[Figure 1 plot: NDCG@10 (vertical axis, 0 to 0.3) against feature index (horizontal axis, "Features", 1 to 64). Legend: NDCG@10, Average Value.]

Fig. 1. The red line denotes the average NDCG@10 value over 10 random sortings. Each dot is the NDCG@10 value obtained with one of the 64 features. Only a few dots are above the red line.

of features. This problem is usually modeled using the $\ell_0$ penalty. However, the resulting optimization problem is an NP-hard combinatorial problem. A simple way to tackle the $\ell_0$ penalty is feature selection, which greedily chooses an additional feature at every step to reduce the training error. Although feature selection algorithms directly address the $\ell_0$ penalty, they are non-convex and hard to analyze. A more popular way is to use the $\ell_1$ penalty as a convex surrogate of the $\ell_0$ penalty. Convex optimization problems with the $\ell_1$ penalty not only lead to efficient learning algorithms but also allow comprehensive theoretical analysis.

In this paper, we focus on the sparse ranking problem with the $\ell_1$ penalty. Apart from the work in [9], no effort has been made to tackle the problem of learning a sparse model for ranking with the $\ell_1$ constraint. In [9], the authors proposed a reduction framework to reduce ranking to the problem of importance-weighted pairwise classification. Then, they used an $\ell_1$ regularized algorithm to learn a sparse ranking predictor. Despite the improvement they reported, the authors did not separate the individual contributions of the two parts of their solution: the reduction framework and the sparse learning algorithm.

This paper aims to answer the question of how the sparse learning algorithm contributes to improving ranking accuracy. We propose a convergence-provable primal-dual algorithm that optimizes the $\ell_1$ regularized pairwise ranking loss. Furthermore, we empirically show in our experiments that our algorithm, using the pairwise ranking loss with the sparse-inducing $\ell_1$ norm, can significantly outperform the algorithm using the same loss with the $\ell_2$ norm and can achieve state-of-the-art performance on several benchmark datasets.


Our contributions in this paper can be summarized as follows:
(1) We formulate the sparse learning-to-rank problem as a convex optimization problem by combining a pairwise ranking loss and a sparse-inducing $\ell_1$ norm. Since the $\ell_1$ norm is non-differentiable, this optimization problem is difficult to solve. We propose a learning algorithm for this optimization problem from the primal-dual perspective.
(2) We prove that, after $T$ iterations, our proposed algorithm is guaranteed to obtain a solution with optimization error $\epsilon = O(1/T)$. Our algorithm has a better convergence rate than the sub-gradient descent algorithm, which is a popular algorithm for convex and non-differentiable problems with a convergence rate of $O(1/\sqrt{T})$ [10].
(3) We empirically show that, compared to the methods that learn dense models, learning a ranking model with sparsity constraints can significantly improve the ranking accuracies. Experimental results show that our learning algorithm achieves state-of-the-art performance on several public benchmark datasets.
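To make the comparison of the two rates in contribution (2) concrete, the following short calculation inverts each rate to obtain the number of iterations needed for a target error. It is only an order-of-magnitude illustration that ignores the constants hidden in the $O(\cdot)$ notation; the value $\epsilon = 10^{-2}$ is an arbitrary choice for illustration.

```latex
% Iteration counts implied by the two rates, ignoring constant factors
\epsilon = O\!\left(\tfrac{1}{T}\right) \;\Rightarrow\; T = O\!\left(\tfrac{1}{\epsilon}\right),
\qquad
\epsilon = O\!\left(\tfrac{1}{\sqrt{T}}\right) \;\Rightarrow\; T = O\!\left(\tfrac{1}{\epsilon^{2}}\right).
% Example: \epsilon = 10^{-2} gives T on the order of 10^{2} versus 10^{4}.
```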

2 RELATED WORK

Recently, there has been a lot of research focusing on learning-to-rank in the machine learning community. Two main classes of methods for learning-to-rank have been explored in the last few years: pairwise methods and listwise methods [3], [4], [5], [6], [8], [11]. Our proposed method belongs to the first class.

The first class of methods is based on the so-called pairwise approach, in which learning-to-rank is viewed as the task of classifying the preference order within document pairs. Ranking SVM [6], RankBoost [5] and RankNet [3] are notable pairwise algorithms. The Ranking SVM algorithm adopts a large margin optimization approach like the traditional SVM [14]. It minimizes the number of incorrectly ordered instance pairs. Several extensions of Ranking SVM have also been proposed to enhance the ranking performance, such as [12], [13]. RankBoost is a boosting algorithm for ranking that uses pairwise preference data. RankNet is another well-known algorithm, which applies a neural network to ranking and uses cross entropy as its loss function. Recently, Chapelle and Keerthi [15] replaced the standard hinge loss in Ranking SVM with a differentiable squared hinge loss and thus proposed a Newton descent algorithm to efficiently learn the ranking predictor.

The second class contains the listwise methods, in which there are mainly two streams. (1) The first stream optimizes a loss function directly based on the IR evaluation metrics. SVMMAP [8] adopts structural Support Vector Machines to minimize a loss function that is an upper bound of the Mean Average Precision (MAP) evaluation metric (see Subsection 8.2 for more details about MAP). AdaRank [11] is a boosting algorithm that optimizes an exponential loss, which upper bounds the metrics of MAP and NDCG. (2) The second stream defines several listwise loss functions, which take the list of retrieved documents for the same query as a sample. ListNet [4] defines a loss function based on the KL-divergence between two permutation probability distributions. ListMLE [17] defines another listwise likelihood loss function based on the Luce Model [16].

Another aspect related to the work in this paper is sparse learning, which has been widely applied in computer vision, signal processing and bioinformatics. Many learning algorithms have been proposed for sparse classification/regression, such as decomposition algorithms [27], [26] and algorithms for $\ell_1$ constrained optimization problems [30], [29], [28] (see [26] for more discussion of sparse classification/regression algorithms). Since the pairwise approach reduces the ranking problem to a classification problem on document pairs, in principle, many algorithms for sparse classification can be applied to obtain sparse ranking models. However, few efforts have been made to tackle the problem of learning a sparse solution for ranking. Recently, Sun et al. [9] proposed a reduction framework to reduce ranking to importance-weighted pairwise classification and then used an $\ell_1$ regularized algorithm to learn a sparse ranking predictor. Despite its success, that work does not separate the individual contributions of its two parts, the reduction framework and the sparse learning algorithm. Sparse learning for ranking is a relatively new topic that needs more exploration.

3 NOTATIONS

We introduce the notations used throughout this paper. In the learning-to-rank problem, there is a labeled training set $S = \{(q_k, X_k, Y_k)\}_{k=1}^{n}$ and a test set $T = \{(q_k, X_k)\}_{k=n+1}^{n+u}$. Here $q_k$ denotes a query, $X_k = \{X_{k,i}\}_{i=1}^{n(q_k)}$ denotes the list of corresponding retrieved objects (i.e., documents) for $q_k$, and $Y_k = \{y_{k,i}\}_{i=1}^{n(q_k)}$ is the list of corresponding relevance labels provided by human annotators, where $y_{k,i} \in \{0, 1, 2, 3, 4\}$, $n(q_k)$ is the number of objects in the retrieved object list of query $q_k$, and $X_{k,i}$ is the $i$th object in that list. Each $X_{k,i} \in \mathbb{R}^m$ is an $m$-dimensional feature vector, and each attribute of $X_{k,i}$ is scaled to the range $[0, 1]$.

We define a set $P$ of comparable object pairs as follows: $(k, i, j) \in P$ if and only if $X_{k,i}$ and $X_{k,j}$ belong to the same query $q_k$ and $y_{k,i} \ne y_{k,j}$. We use $p$ to denote the number of pairs in $P$. In addition, we define an object pairwise comparison error matrix $K \in \mathbb{R}^{p \times m}$ as follows: each pair in $P$ corresponds to a row in $K$. Denote the $l$th pair in $P$ as $(k_l, i_l, j_l)$ and the $l$th row of $K$ as $K_l$. We define

$$K_l = y_{k_l, i_l, j_l} \, (X_{k_l, i_l} - X_{k_l, j_l}),$$

where $y_{k_l, i_l, j_l} = 1$ if $y_{k_l, i_l} > y_{k_l, j_l}$, and $y_{k_l, i_l, j_l} = -1$ otherwise. Since $X_{k,i} \in [0, 1]^m$ for all $k, i$, we have $K_l \in [-1, 1]^m$ for all $l$.

We use $\langle x, y \rangle$ to represent the inner product of two vectors $x$ and $y$. Let $r$ denote the radius of an $\ell_1$-ball: $\|w\|_1 \le r$. We introduce an indicator function $I_C(w)$: $I_C(w) = 0$ if the given vector $w$ satisfies condition $C$, and $I_C(w) = +\infty$ otherwise. The above notations are summarized in Table 1.

TABLE 1
List of notations

Notation                                  Meaning
$S = \{(q_k, X_k, Y_k)\}_{k=1}^{n}$       training set
$m$                                       dimension of data
$p$                                       number of pairs in set $P$
$r$                                       the radius of the $\ell_1$-ball: $\|w\|_1 \le r$
$K$                                       matrix in $\mathbb{R}^{p \times m}$ that contains the pairwise information
$I_C(w)$                                  $I_C(w) = 0$ if condition $C$ is satisfied, otherwise $I_C(w) = \infty$
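The construction of the pairwise matrix $K$ described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_pairwise_matrix`, the input layout (one `(X, y)` tuple per query) and the toy numbers are assumptions made for the example. It also assumes each unordered pair is listed once, which loses nothing because swapping $i$ and $j$ yields the same row.

```python
from itertools import combinations

def build_pairwise_matrix(queries):
    """Build the rows K_l = y_{k,i,j} * (X_{k,i} - X_{k,j}).

    queries: list of (X, y) tuples, one per query (hypothetical layout), where
    X is a list of feature vectors scaled to [0, 1] and y the relevance labels.
    """
    K = []
    for X, y in queries:
        # Only documents retrieved for the same query are compared.
        for i, j in combinations(range(len(y)), 2):
            if y[i] == y[j]:
                continue  # equal labels: the pair is not comparable
            sign = 1.0 if y[i] > y[j] else -1.0
            K.append([sign * (a - b) for a, b in zip(X[i], X[j])])
    return K

# Toy data (illustrative values only): two queries, two features.
queries = [
    ([[0.9, 0.1], [0.2, 0.4], [0.5, 0.5]], [2, 0, 0]),
    ([[0.3, 0.8], [0.6, 0.2]], [1, 3]),
]
K = build_pairwise_matrix(queries)
```

With this sign convention, a predictor $w$ orders the $l$th pair correctly exactly when $\langle K_l, w \rangle > 0$, which is why the loss functions in the next section are applied to the entries of $Kw$.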

4 PROBLEM STATEMENT

The learning-to-rank problem has a wide range of applications in information retrieval systems. We are given a labeled training set $S = \{(q_k, X_k, Y_k)\}_{k=1}^{n}$ and a test set $T = \{(q_k, X_k)\}_{k=n+1}^{n+u}$. The task of learning to rank is to construct a ranking predictor from the training data, and then sort the examples in the test set using the ranking predictor.

Following the common practice in learning-to-rank, in this paper we focus only on learning a linear ranking predictor $f(x) = \langle w, x \rangle$. Many existing learning-to-rank algorithms use this setting. The SVM methods, such as the recently proposed RankSVM-Struct [18] and RankSVM-Primal [15], are notable algorithms for learning linear ranking predictors, which achieve state-of-the-art performance on several benchmark datasets. These methods learn ranking models by minimizing the following form of regularized pairwise loss function:

$$\min_{w'} \; \frac{1}{2}\|w'\|_2^2 + C \sum_{(k,i,j) \in P} \ell\big(y_{k,i,j} \, w'^{T} (X_{k,i} - X_{k,j})\big) \tag{1}$$

where $\ell(x)$ can be the hinge loss $\ell(x) = \max(0, 1 - x)$ or the squared hinge loss $\ell(x) = \max(0, 1 - x)^2$, and $C$ is a parameter that controls the trade-off between training error and model complexity. Existing work [33] in learning-to-rank revealed that the classification-based pairwise loss function (i.e., hinge loss) is an upper bound of both 1-NDCG and 1-MAP.

When we take $\ell(x) = \max(0, 1 - x)$, the objective function given by (1) is the objective of Ranking SVM. There exist several algorithms, such as quadratic programming [25] or the cutting plane algorithm [18], which minimize this objective function. If $\ell(x) = \max(0, 1 - x)^2$, the function given by (1) becomes the objective of RankSVM-Primal [15], which is a convex and twice differentiable function that can be optimized directly via an efficient Newton descent algorithm.

Despite achieving state-of-the-art performance, a learning algorithm using these forms of objectives usually obtains dense solutions (most of the ranking predictor's coefficients are non-zero) because of the $\ell_2$ regularization term. Sparse models have proven effective in many applications, including computer vision, signal processing and bioinformatics. In this paper, we are interested in the particular problem of how the sparse learning algorithm can contribute to the improvement of the ranking accuracy. By replacing the $\ell_2$ norm with the sparse-inducing $\ell_1$ norm, we obtain the following optimization problem:

$$\min_{w'} \; \|w'\|_1 + C \sum_{(k,i,j) \in P} \frac{1}{p} \max\big(0, 1 - y_{k,i,j} \, w'^{T} (X_{k,i} - X_{k,j})\big)^2 \tag{2}$$

For any $C$ in the problem in Eq. (2), there exists a corresponding $r$ such that the problem in Eq. (2) is equivalent to the following optimization problem (see the explanation in [31], Section 1.2):

$$\min_{\|w'\|_1 \le r} \; \frac{1}{p} \sum_{(k,i,j) \in P} \max\big(0, 1 - y_{k,i,j} \, w'^{T} (X_{k,i} - X_{k,j})\big)^2 \;=\; \min_{w'} \; I_{\|w'\|_1 \le r}(w') + \frac{1}{p} \sum_{i=1}^{p} \max\big(0, 1 - (Kw')_i\big)^2 \tag{3}$$

The predictor $w'$ is an $m$-dimensional vector constrained to the $\ell_1$-ball of radius $r$. It is well known that $\ell_1$ constrained optimization problems like Eq. (3) usually lead to sparse solutions, but the $\ell_2$ regularized formulation like Eq. (1) does not (see [32], pages 14-15, for a detailed explanation based on a geometrical intuition).

For ease of analysis, we scale the radius of the $\ell_1$-ball to 1 and define $w = \frac{1}{r} w'$. The problem (3) can be rewritten as follows:

$$\min_{w} G(w) = \min_{w} \; I_{\|w\|_1 \le 1}(w) + \frac{r^2}{p} \sum_{i=1}^{p} \max\Big(0, \frac{1}{r} - (Kw)_i\Big)^2 \tag{4}$$

The objective function given by Eq. (4) is similar to that of RankSVM-Primal, except for a different regularization term.

Since the $\ell_1$ norm is not differentiable everywhere, it is challenging to optimize the objective in Eq. (4), and the Newton descent algorithm used in RankSVM-Primal cannot be applied. To minimize a convex but non-differentiable function, a straightforward way is to use the popular sub-gradient descent algorithm. However, the sub-gradient descent method has a slow convergence rate of $O(1/\epsilon^2)$, where $\epsilon$ is the expected optimization precision. In this paper, we propose an efficient and convergence-provable optimization algorithm for the problem in Eq. (4), with a faster convergence rate of $O(1/\epsilon)$.

5 OUR ALGORITHM

5.1 Overview

In this section, we present our algorithm to solve the sparse learning-to-rank problem in Eq. (4). Our algorithm is based on the theory of Fenchel Duality [23], which has been used in several machine learning algorithms such as the boosting variants in [1]. Our algorithm follows the generic algorithmic framework proposed in [1]. Since Fenchel Duality is the key ingredient in our design methodology, we call our algorithm "FenchelRank" for short.

Let $D(w) = -G(w)$. Then the problem in Eq. (4) is equivalent to the following optimization problem:

$$\max_{w} D(w) = \max_{w} \; -I_{\|w\|_1 \le 1}(w) - \frac{r^2}{p} \sum_{i=1}^{p} \max\Big(0, \frac{1}{r} - (Kw)_i\Big)^2 \tag{5}$$

In order to maximize the objective in Eq. (5), we propose an iterative algorithm that constructs a sequence of weight vectors $w_1 \to \cdots \to w_t \to w_{t+1} \to \cdots \to w_T$ such that $\{D(w_t)\}_{t=1}^{T}$ is a monotonically increasing sequence of function values: $D(w_1) \le \cdots \le D(w_t) \le D(w_{t+1}) \le \cdots \le D(w_T)$. Suppose $w^*$ is the best solution for $D(w)$ (i.e., the Bayes error is minimized), and $D^* = D(w^*)$. With the constructed sequence $\{D(w_t)\}_{t=1}^{T}$, we will prove that after $T$ iterations, $D(w_T)$ is guaranteed to be an $\epsilon$-accurate solution (i.e., $D^* - D(w_T) \le \epsilon$) with $\epsilon = O(1/T)$.

To improve efficiency in practice, we further define an early stopping criterion for the algorithm using the properties of Fenchel duality. Obviously, we could compare $D^* - D(w_T)$ with $\epsilon$ to determine whether the algorithm can be stopped. However, $D^*$ is unknown. To derive an upper bound of $D^* - D(w_t)$, we construct another sequence of function values $P(d_t)$ such that $P(d_t) \ge D^*$, where $P(d_t)$ is the primal form whose Fenchel dual form is $D(w_t)$. Therefore, we can use $P(d_t) - D(w_t) \ge D^* - D(w_t)$ to derive an early stopping criterion (see Subsections 5.3 and 5.4 for more details).

The skeleton of the proposed algorithm is shown in Algorithm 1. The input of the algorithm includes a data matrix $K$, a desired optimization tolerance $\epsilon$, a maximum number of iterations $T$ and the radius $r$ of an $\ell_1$-ball. In this algorithm, the sign function is $\mathrm{sign}(\alpha) = 1$ if $\alpha \ge 0$ and $\mathrm{sign}(\alpha) = -1$ otherwise; $0_m$ denotes the $m$-dimensional vector with all zeros, and $e_i$ is the vector with all zeros except the $i$th element being 1. The algorithm initializes $w$ to be $0_m$. It stops if the early stopping criterion is satisfied (Line 2), or when the maximal number of iterations, $T$, is reached.
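As a quick numerical check of the rescaling behind Eqs. (3)-(5), the sketch below evaluates the loss term of Eq. (3) at a feasible $w'$ and the loss term of Eq. (4) at $w = w'/r$, which should agree. The helper names `loss_eq3` and `loss_eq4` and the toy matrix are assumptions made for illustration; the indicator terms are omitted because both points are chosen inside their respective $\ell_1$-balls.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss_eq3(K, w_prime):
    """Loss term of Eq. (3): (1/p) * sum_i max(0, 1 - (K w')_i)^2."""
    p = len(K)
    return sum(max(0.0, 1.0 - dot(row, w_prime)) ** 2 for row in K) / p

def loss_eq4(K, w, r):
    """Loss term of Eq. (4): (r^2/p) * sum_i max(0, 1/r - (K w)_i)^2."""
    p = len(K)
    return (r * r / p) * sum(max(0.0, 1.0 / r - dot(row, w)) ** 2 for row in K)

# Toy pairwise matrix (illustrative values only, not taken from the paper).
K = [[0.7, -0.3], [0.4, -0.4], [0.3, -0.6]]
r = 2.0
w_prime = [1.5, -0.5]          # ||w'||_1 = 2 <= r, so feasible for Eq. (3)
w = [x / r for x in w_prime]   # w = w'/r lies in the unit l1-ball

g3 = loss_eq3(K, w_prime)
g4 = loss_eq4(K, w, r)         # equals g3 up to rounding
d5 = -g4                       # D(w) = -G(w) for feasible w, as in Eq. (5)
```

The agreement follows from $\max(0, 1 - r\,(Kw)_i)^2 = r^2 \max(0, \tfrac{1}{r} - (Kw)_i)^2$, which is where the factor $r^2$ in Eq. (4) comes from.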


Algorithm 1 FenchelRank algorithm
Input: pairwise data matrix $K$, desired accuracy $\epsilon$, maximal iteration number $T$ and the radius $r$ of the $\ell_1$-ball.
Output: linear ranking predictor $w$
Initialize: $w_1 = 0_m$
1. For $t = 1, 2, \cdots, T$ do
   // Check if the early stopping criterion is satisfied
2.   If $\|g_t\|_\infty + \langle d_t, -K w_t \rangle \le \epsilon$, return $w_t$ as ranking predictor $w$.
     Here $d_t = \nabla f^*(-K w_t) = \frac{\partial f^*(\theta)}{\partial \theta}\big|_{\theta = -K w_t}$ and $g_t = d_t^T K$
   // Greedily choose a feature to update
3.   Choose $j_t = \arg\max_j |(g_t)_j|$
   // Compute an appropriate step size
4.   Let $\mu_t = \arg\max_{0 \le \mu \le 1} D\big((1 - \mu) w_t + \mu \, \mathrm{sign}((g_t)_{j_t}) \, e_{j_t}\big)$
   // Update the model with the chosen feature and step size
5.   Update $w_{t+1} = (1 - \mu_t) w_t + \mu_t \, \mathrm{sign}((g_t)_{j_t}) \, e_{j_t}$
6. end For
7. return $w_T$ as ranking predictor $w$
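The loop of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the step size in Line 4 is found here by a golden-section search, which is a stand-in chosen for brevity and relies on $D$ being concave along the update segment; the gradient formula $d_i = \frac{2r^2}{p}\max(0, \frac{1}{r} - (Kw)_i)$ is obtained by differentiating the squared hinge term of Eq. (5); and the function names, default parameters and toy matrix are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dual_objective(K, w, r):
    """D(w) from Eq. (5), for w inside the unit l1-ball (indicator term is 0)."""
    p = len(K)
    return -(r * r / p) * sum(max(0.0, 1.0 / r - dot(row, w)) ** 2 for row in K)

def fenchel_rank(K, r, eps=1e-4, T=1000):
    p, m = len(K), len(K[0])
    w = [0.0] * m                                   # w_1 = 0_m
    for _ in range(T):
        # d_t = grad f*(-K w_t) and g_t = d_t^T K
        d = [(2.0 * r * r / p) * max(0.0, 1.0 / r - dot(row, w)) for row in K]
        g = [sum(d[i] * K[i][j] for i in range(p)) for j in range(m)]
        # Line 2: ||g_t||_inf + <d_t, -K w_t>, using <d_t, K w_t> = <g_t, w_t>
        if max(abs(x) for x in g) - dot(g, w) <= eps:
            break
        # Line 3: greedy feature choice
        j = max(range(m), key=lambda idx: abs(g[idx]))
        s = 1.0 if g[j] >= 0 else -1.0

        def step(mu):
            v = [(1.0 - mu) * x for x in w]
            v[j] += mu * s
            return v

        # Line 4: golden-section search for mu in [0, 1] (assumed line search)
        lo, hi, phi = 0.0, 1.0, (5 ** 0.5 - 1) / 2
        for _ in range(60):
            a, b = hi - phi * (hi - lo), lo + phi * (hi - lo)
            if dual_objective(K, step(a), r) < dual_objective(K, step(b), r):
                lo = a
            else:
                hi = b
        w = step((lo + hi) / 2)                     # Line 5
    return w

# Toy pairwise matrix (illustrative values only, not taken from the paper).
K = [[0.7, -0.3], [0.4, -0.4], [0.3, -0.6]]
w = fenchel_rank(K, r=2.0)
```

Each update is a convex combination of the current iterate and a signed unit vector, so every iterate stays inside the unit $\ell_1$-ball without an explicit projection, and at most one new coefficient becomes non-zero per iteration.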

In each iteration, the algorithm has three main steps: (1) checking the early stopping criterion (Line 2), (2) greedily choosing a feature to update (Line 3), and (3) finding an appropriate step size and updating the weights (Lines 4-5). In the following, we first review the properties of Fenchel duality. Then we present how to construct the sequences $\{D(w_t)\}_{t=1}^{T}$, $\{d_t\}_{t=1}^{T}$ and $\{P(d_t)\}_{t=1}^{T}$. After that, we specify the three main steps, respectively. Finally, we provide the theoretical analysis of the algorithm.

5.2 Properties of Fenchel Duality

The main properties of Fenchel Duality are the Fenchel conjugate (Definition 1) and the Fenchel Duality inequalities (Lemmas 1 and 2).

Definition 1. The Fenchel conjugate of a function $f$ is defined as $f^*(\theta) = \max_{x \in \mathrm{dom} f} \big(\langle \theta, x \rangle - f(x)\big)$.

Lemma 1. (Fenchel-Young inequality: [2], Proposition 3.3.4) Any points $\theta$ in the domain of function $f^*$ and $x$ in the domain of function $f$ satisfy the inequality:

$$f(x) + f^*(\theta) \ge \langle \theta, x \rangle \tag{6}$$

The equality holds if and only if $\theta \in \partial f(x)$.

Lemma 2. (Fenchel Duality inequality: [2], Theorem 3.3.5) Let $f : \mathbb{R}^p \to (-\infty, +\infty]$ and $g : \mathbb{R}^m \to (-\infty, +\infty]$ be two closed and convex functions, and let $K$ be a matrix in $\mathbb{R}^{p \times m}$. Then

$$\sup_{w} \; -f^*(-Kw) - g^*(w) \;\le\; \inf_{d} \; f(d) + g(d^T K) \tag{7}$$

The equality holds when $0 \in (\mathrm{dom}(g) - K^T \mathrm{dom}(f))$.
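As a small worked example of Definition 1 and Lemma 1, consider the scalar function $f(x) = \frac{1}{2}x^2$. This function is chosen only for illustration and is not one of the functions used by the algorithm.

```latex
% Conjugate of f(x) = x^2/2: the maximizer of \theta x - x^2/2 is x = \theta
f^*(\theta) = \max_{x} \left( \theta x - \tfrac{1}{2}x^2 \right) = \tfrac{1}{2}\theta^2 .
% Fenchel-Young inequality (6) then reads
\tfrac{1}{2}x^2 + \tfrac{1}{2}\theta^2 \ge \theta x
\quad\Longleftrightarrow\quad
\tfrac{1}{2}(x - \theta)^2 \ge 0 ,
% with equality exactly when \theta = x = f'(x), i.e. \theta \in \partial f(x).
```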

5.3 Constructing the Sequences $\{D(w_t)\}_{t=1}^{T}$, $\{P(d_t)\}_{t=1}^{T}$ and $\{d_t\}_{t=1}^{T}$

To construct the sequences, we define $g^*(w) = I_{\|w\|_1 \le 1}(w)$ and $f^*(\theta) = \frac{r^2}{p} \sum_{i=1}^{p} \big(\max(0, \frac{1}{r} + \theta_i)\big)^2$. We have $f^*(-Kw) = \frac{r^2}{p} \sum_{i=1}^{p} \max(0, \frac{1}{r} - (Kw)_i)^2$. The objective in Eq. (5) can be rewritten by combining $f^*$ and $g^*$:

$$\max_{w} D(w) = \max_{w} \; -f^*(-Kw) - g^*(w) \tag{8}$$

This is exactly the same as the left-hand side of Eq. (7). Accordingly, we can define the upper bound of Eq. (5) by the right-hand side of Eq. (7):

$$\min_{d} P(d) = \min_{d} \; f(d) + g(d^T K) \tag{9}$$

where $g$ and $f$ are the Fenchel conjugates of $g^*$ and $f^*$, respectively, which are given by the following lemma.

Lemma 3. The Fenchel conjugate of $f^*(-Kw) = \sum_{i=1}^{p} \frac{r^2}{p} \max(0, \frac{1}{r} - (Kw)_i)^2$ is $f(d) = \sum_{i=1}^{p} \big( \frac{p}{4r^2} d_i^2 - \frac{1}{r} d_i + I_{d_i \ge 0}(d_i) \big)$. The Fenchel conjugate of $g^*(w) = I_{\|w\|_1 \le 1}(w)$ is $g(d^T K) = \|d^T K\|_\infty$.

Proof: Let $x = -Kw$. According to Definition 1, we have $f(d) = \max_{x} \langle d, x \rangle - f^*(x) = \sum_{i=1}^{p} \max_{x_i} \big( d_i x_i - \frac{r^2}{p} \max(0, \frac{1}{r} + x_i)^2 \big)$. Let $h(d_i) = \max_{x_i} \big( d_i x_i - \frac{r^2}{p} \max(0, \frac{1}{r} + x_i)^2 \big)$, thus $f(d) = \sum_{i=1}^{p} h(d_i)$.

If $d_i < 0$, letting $x_i \to -\infty$, we have:

$$h(d_i) = d_i x_i - \frac{r^2}{p} \max\Big(0, \frac{1}{r} + x_i\Big)^2 = d_i x_i \to \infty \tag{10}$$

If $d_i > 0$, we respectively discuss the following two cases:

(1) If $(\frac{1}{r} + x_i) \ge 0$, then $h(d_i) = \max_{x_i \ge -\frac{1}{r}} \; d_i x_i - \frac{r^2}{p} (\frac{1}{r} + x_i)^2 = \max_{x_i \ge -\frac{1}{r}} \; -\frac{r^2}{p} \big(x_i + \frac{1}{r} - \frac{p}{2r^2} d_i\big)^2 + \frac{p}{4r^2} d_i^2 - \frac{1}{r} d_i$, which implies that $h(d_i) \le \frac{p}{4r^2} d_i^2 - \frac{1}{r} d_i$. The equality holds when $x_i + \frac{1}{r} - \frac{p}{2r^2} d_i = 0$.

(2) If $(\frac{1}{r} + x_i) \le 0$, then h(di ) = maxxi