Loss Bounds for Online Category Ranking
Koby Crammer¹ and Yoram Singer²,³
¹ Dept. of Computer and Information Science, Univ. of Pennsylvania, Philadelphia, PA 19104
² School of Computer Sci. & Eng., Hebrew University, Jerusalem 91904, Israel
³ Google Inc., 1600 Amphitheatre Parkway, Mountain View CA 94043, USA
[email protected], singer@{cs.huji.ac.il,google.com}
Abstract. Category ranking is the task of ordering labels with respect to their relevance to an input instance. In this paper we describe and analyze several algorithms for online category ranking where the instances are revealed in a sequential manner. We describe additive and multiplicative updates which constitute the core of the learning algorithms. The updates are derived by casting a constrained optimization problem for each new instance. We derive loss bounds for the algorithms by using the properties of the dual solution while imposing additional constraints on the dual form. Finally, we outline and analyze the convergence of a general update that can be employed with any Bregman divergence.
1 Introduction and Problem Setting
The task of category ranking is concerned with ordering the labels associated with a given instance in accordance with their relevance to the input instance. Category ranking often arises in text processing applications (see for instance [8]) in which the instances are documents and the labels constitute a list of topics that overlap with the subject matter of the document. The set of labels, or topics using the text processing jargon, is predefined and does not change along the run of the text processing and learning algorithm. A closely related problem studied by the machine learning community is called the multilabel classification problem.

Few learning algorithms have been devised for the category ranking problem. Some notable examples are a multiclass version of AdaBoost called AdaBoost.MH [12], a generalization of Vapnik's Support Vector Machines to the multilabel setting by Elisseeff and Weston [10], and a generalization of the Perceptron algorithm to category ranking [8]. This work employs hypotheses for category ranking that are closely related to the ones presented and used in [10, 8]. We generalize the algorithms presented in [10, 8] by providing both a more refined analysis as well as deriving and analyzing new algorithms for the same problem. First, we give online bounds for an additive algorithm which is a fusion of a generalization of the Perceptron for topic ranking [8] and the MIRA algorithm [9, 7]. We also derive a multiplicative algorithm and a general algorithm based on Bregman divergences that were not discussed in previous research papers. Last, but not least, previous work focused on feedback that takes a rather rigid structured form in which the set of labels is partitioned into relevant and non-relevant subsets. The framework presented here can be used with
a rather general feedback which takes the form of a partial order. Experimental results [6], which unfortunately we do not have room to include in this paper, indicate that the algorithms described here outperform the previously published algorithms for topic ranking. Our algorithmic framework thus presents a viable, practical, and provably correct alternative to previous learning algorithms for the category ranking task.

Let us now describe the formal ingredients of the category ranking problem. As in supervised learning problems, the learning algorithm is introduced to a set of instance-label pairs. For concreteness we assume that the instances are vectors in R^n and denote the instance received on round i by x^i. The labels that we examine in this paper may take a rather general form. Specifically, labels are preference relations over a set of k categories, denoted by C = {1, 2, . . . , k}. That is, a label y ⊂ C × C is a relation, where (r, s) ∈ y implies that category r is ranked above, or preferred over, category s. The only restriction we impose on a label y is that it does not contain any cycle. Put another way, we represent each label y as a graph. The set of vertices of the graph is defined as the set of categories in C. Each preference pair (r, s) ∈ y corresponds to a directed edge from the vertex r to the vertex s. Using this graph-based view, there is a one-to-one correspondence between relations which do not contain any cycle and directed acyclic graphs (DAGs). We refer to such relations as semi-orders.

A prediction function h maps instances x ∈ X to total orders over C, denoted by Ŷ. We restrict ourselves to mappings based on linear functions which are parameterized by a set of k weight vectors w_1, . . . , w_k, collectively denoted W. Formally, such mappings are defined as h(x) = (⟨w_1, x⟩, . . . , ⟨w_k, x⟩) ∈ R^k, where ⟨·,·⟩ designates the inner-product operation. A prediction ŷ ∈ Ŷ naturally induces a total order where category r is ranked above category s iff ⟨w_r, x⟩ > ⟨w_s, x⟩, and ties are broken arbitrarily. Throughout the paper, we overload the notation and denote by ŷ both a k-dimensional vector and the total order it induces.

Online algorithms work in rounds. On the ith round the algorithm receives an instance x^i and predicts a total order ŷ^i (∈ R^k). It then receives as feedback the semi-order y^i that is associated with x^i. We then suffer an instantaneous loss based on the discrepancy between the semi-order y^i and the total order ŷ^i. The goal of the online learning algorithm is to minimize a predefined cumulative loss. As in other online algorithms, the collection of k weight vectors W is updated after receiving the feedback y^i. Therefore, we denote by W^i the set of parameters used for ranking the categories on round i. For brevity, we refer to W^i itself as the ranker.

As in other learning algorithms, proper loss functions should be defined in order to assess the quality of the prediction functions that are learned. In the problem of binary classification we are usually interested in the event of a misclassification, which induces the so-called 0-1 loss. In the more complex category ranking problem there does not exist a unique and natural loss function. The lack of a natural loss function can be primarily attributed to the fact that the learning algorithm needs to take into consideration that some mistakes are less drastic than others.
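To make the prediction rule and the semi-order feedback concrete, the following is a minimal Python sketch (not part of the original paper; the names, sizes, and toy numbers are ours, and category indices are 0-based) of the linear hypothesis h(x) = (⟨w_1, x⟩, . . . , ⟨w_k, x⟩) and of a label given as a set of preference pairs.

```python
# A toy sketch of the category-ranking hypothesis and of a semi-order label.
# Nothing here is from the paper; names, sizes and numbers are illustrative.
import numpy as np

k, n = 5, 3                        # number of categories, input dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(k, n))        # one weight vector w_r per category
x = rng.normal(size=n)             # an instance

scores = W @ x                     # y_hat = (<w_1,x>, ..., <w_k,x>) in R^k
total_order = np.argsort(-scores)  # categories ranked from highest to lowest score

# Semi-order feedback: categories 0 and 1 are preferred over 2, 3 and 4
# (a bipartite DAG, as in the text).
y = {(r, s) for r in (0, 1) for s in (2, 3, 4)}

# A pair (r, s) is predicted correctly iff <w_r, x> > <w_s, x>.
violated = [(r, s) for (r, s) in y if scores[r] <= scores[s]]
print(total_order, violated)
```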
Nevertheless, the 0-1 loss can also be applied in category ranking problems, indicating whether the predicted total order is consistent with the preference relation represented by the semi-order received as feedback. This loss function is crude in the sense that it completely ignores how many preference pairs in y are actually mis-ordered. Moving to the other extreme, we can define a loss
function which is set to be equal to the number of preference relations in the semi-order that are not consistent with the predicted total order. In this paper we choose a general approach which includes the above two choices of loss functions as special cases. Let (x^i, y^i) be an instance-label pair. A loss function is parameterized by a partition, or a cover, χ̄ of the semi-order y into finite disjoint sets, namely, ∪_{χ∈χ̄} χ = y and ∀p ≠ q : χ_p ∩ χ_q = ∅. Let [[π]] denote the indicator function, that is, [[π]] is 1 if the predicate π is true and is 0 otherwise. The loss suffered by a ranker W for a cover χ̄ is defined to be,

   I(W; (x, y, χ̄)) = Σ_{χ∈χ̄} [[ {(r, s) ∈ χ : ⟨w_r, x⟩ ≤ ⟨w_s, x⟩} ≠ ∅ ]] .    (1)

Fig. 1. Illustrations of various covers. The target semi-order in the example consists of six pairs (relations). These pairs constitute a bipartite graph with the "good" categories {1, 2} on one side and the "bad" categories {3, 4, 5} on the other side. The predicted total order is ŷ = {1 > 3 > 4 > 2 > 5}. Four different covers are depicted. Each subset within a cover is designated by a box. Pairs of categories for which the predicted order agrees with the target are depicted with bold edges while inconsistencies are designated by dashed edges. Thus, the induced loss is the number of boxes in which there is at least one dashed edge. Top: the all-pairs cover in which each subset is a pair of categories. Bottom left: the 0-1 cover in which the entire set of edges resides in a single graph. Bottom center: this cover counts the number of "good" categories that do not dominate all the "bad" categories. Bottom right: this cover counts the number of "bad" categories that are not dominated by all of the "good" categories
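As a concrete companion to Eq. (1), the following sketch (ours, not from the paper; 0-based category indices) computes the cover loss for three of the covers depicted in Fig. 1, using scores chosen to induce the same predicted total order as in the figure.

```python
# A small sketch of the cover loss of Eq. (1): the number of cover elements
# that contain at least one mis-ordered pair.  Toy numbers are ours.
import numpy as np

def cover_loss(scores, cover):
    """I(W; (x, y, cover)) with scores[r] = <w_r, x>."""
    return sum(
        any(scores[r] <= scores[s] for (r, s) in chi) for chi in cover
    )

# In the spirit of Fig. 1 (0-based): "good" = {0, 1}, "bad" = {2, 3, 4};
# the six preference pairs of the semi-order.
pairs = [(r, s) for r in (0, 1) for s in (2, 3, 4)]
scores = np.array([5.0, 2.0, 4.0, 3.0, 1.0])   # induces 0 > 2 > 3 > 1 > 4

all_pairs_cover = [[p] for p in pairs]          # one pair per element
single_set_cover = [pairs]                      # the 0-1 cover
per_good_cover = [[(r, s) for s in (2, 3, 4)] for r in (0, 1)]

print(cover_loss(scores, all_pairs_cover))      # counts mis-ordered pairs
print(cover_loss(scores, single_set_cover))     # 0/1: any mistake at all?
print(cover_loss(scores, per_good_cover))       # "good" labels not on top
```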
An illustration of four different covers and their associated loss is given in Fig. 1. The effect of the specific cover χ̄ that is being used may be crucial: a cover consisting of a small number of (large) disjoint sets typically induces a loss function which is primarily sensitive to the existence of a mistake and is indifferent to the exact nature of the induced total order. In contrast, a cover which includes many sets, each of which has only a small number of elements, induces a loss that may be too detailed. The natural question that arises is what cover to use. Unfortunately, there is no general answer as the specific choice is domain and task dependent. For example, in the problem of optical character recognition we are merely interested in whether a prediction error occurred or not.
Fig. 2. The additive algorithm:

Parameters: γ ; C
Initialize: w^1_r = 0 (1 ≤ r ≤ k)
Loop: For i = 1, 2, . . . , m
  – Get a new instance: x^i ∈ R^n
  – Predict: ŷ^i = (⟨w^i_1, x^i⟩, . . . , ⟨w^i_k, x^i⟩)
  – Get a target y^i and its cover χ̄^i
  – Suffer loss: I(W^i; (x^i; y^i; χ̄^i))
  – Set α^i_{r,s} to be the solution of Eq. (9) and Eq. (10)
  – Set τ^i_r = Σ_{s=1}^k α^i_{r,s} for r = 1, . . . , k
  – Update for r = 1, . . . , k: w^{i+1}_r = w^i_r + τ^i_r x^i
Output: h(x) = (⟨w^{m+1}_1, x⟩, . . . , ⟨w^{m+1}_k, x⟩)

Fig. 3. The multiplicative algorithm:

Parameters: γ ; C
Initialize: w^1_{r,l} = 1/(nk) (1 ≤ r ≤ k, 1 ≤ l ≤ n)
Loop: For i = 1, 2, . . . , m
  – Get a new instance: x^i ∈ R^n
  – Predict: ŷ^i = (⟨w^i_1, x^i⟩, . . . , ⟨w^i_k, x^i⟩)
  – Get a target y^i and its cover χ̄^i
  – Suffer loss: I(W^i; (x^i; y^i; χ̄^i))
  – Set α^i_{r,s} to be the solution of Eq. (13) and Eq. (14)
  – Set τ^i_r = Σ_{s=1}^k α^i_{r,s} for r = 1, . . . , k
  – Update for r = 1, . . . , k and 1 ≤ l ≤ n:
      w^{i+1}_{r,l} = w^i_{r,l} e^{τ^i_r x^i_l} / Z^i ,  where Z^i = Σ_{s,l} w^i_{s,l} e^{τ^i_s x^i_l}
Output: h(x) = (⟨w^{m+1}_1, x⟩, . . . , ⟨w^{m+1}_k, x⟩)
In contrast, in the problem of document categorization, where each document is associated with a subset of relevant categories, it seems more natural to ask how many categories were misplaced by the total order. To underscore the dependency on the cover, we slightly extend our definitions and denote an example by a triplet (x, y, χ̄): an instance x, a target semi-order y, and a cover χ̄ of y. Thus the choice of a loss function is made part of the problem description and is not a sub-task of the learning algorithm. Since the loss functions are parameterized by a cover χ̄ of the target y we call them cover loss functions. To derive our algorithms, we use a generalization of the hinge loss which depends on a predefined insensitivity parameter γ and is defined as,

   H_γ(W; (x, y, χ̄)) = Σ_{χ∈χ̄} max_{(r,s)∈χ} [ γ − (⟨w_r, x⟩ − ⟨w_s, x⟩) ]_+ .    (2)
It is straightforward to verify the bound γ I(W; (x, y, χ̄)) ≤ H_γ(W; (x, y, χ̄)). Note that if the hinge loss is equal to zero then, independently of the specific cover being used, ⟨w_r, x⟩ − ⟨w_s, x⟩ ≥ γ for all (r, s) ∈ y.
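The generalized hinge loss of Eq. (2) and the bound γ I ≤ H_γ can be checked numerically with a short sketch (ours, with an arbitrary toy cover and scores):

```python
# A sketch of the generalized hinge loss of Eq. (2) and of the bound
# gamma * I <= H_gamma.  Toy scores and cover are ours.
import numpy as np

def cover_loss(scores, cover):
    return sum(any(scores[r] <= scores[s] for (r, s) in chi) for chi in cover)

def hinge_loss(scores, cover, gamma):
    """H_gamma(W; (x, y, cover)): per cover element, the worst margin violation."""
    return sum(
        max(max(gamma - (scores[r] - scores[s]), 0.0) for (r, s) in chi)
        for chi in cover
    )

scores = np.array([5.0, 2.0, 4.0, 3.0, 1.0])
cover = [[(r, s) for s in (2, 3, 4)] for r in (0, 1)]   # one element per "good" label
gamma = 1.0

I = cover_loss(scores, cover)
H = hinge_loss(scores, cover, gamma)
assert gamma * I <= H                                    # the bound stated above
print(I, H)
```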
2 An Additive Algorithm
In this section we present the first algorithm for category ranking, which is based on an additive update. The motivation for the algorithm, as well as its analysis, builds on
previous research, in particular the MIRA algorithm and the Passive-Aggressive algorithm [9, 7]. As discussed above, we generalize these algorithms by considering a general class of loss functions and provide tighter loss bounds by modifying the dual problem described in the sequel. Throughout the paper we denote the norm of a category ranker W by ‖W‖. This norm is defined as the norm of the vector obtained by concatenating the vectors w_r, W = (w_1, . . . , w_k). The core of the online algorithm is an update rule that receives the current ranker, denoted W^i, along with the newly observed instance x^i, a feedback y^i, and a cover χ̄^i. The next ranker W^{i+1} is set to be the solution of the following optimization problem,

   W^{i+1} = argmin_W  (1/2) ‖W − W^i‖²₂ + (C/(k−1)) H_γ(W; (x^i; y^i; χ̄^i)) ,    (3)
where C > 0. The new ranker is thus the solution to a problem that is composed of two opposing terms. The role of the first term is to keep the new ranker W^{i+1} as close as possible to the current one, W^i. The second term solely focuses on the hinge loss attained by the new ranker on the newest example. The constant C thus encapsulates the tradeoff between the two terms. Expanding the hinge loss, we rewrite the optimization problem of Eq. (3) as,

   min_W  (1/2) ‖W − W^i‖²₂ + C Σ_{χ∈χ̄^i} ξ_χ
   subject to:  ∀χ ∈ χ̄^i, ∀(r, s) ∈ χ :  ⟨w_r, x^i⟩ − ⟨w_s, x^i⟩ ≥ γ − ξ_χ
                ∀χ ∈ χ̄^i :  ξ_χ ≥ 0 ,    (4)
where the ξ_χ ≥ 0 are slack variables. To characterize the solution W^{i+1} we use the dual form of Eq. (3) and Eq. (4). We do so by introducing the Lagrangian of the problem,

   L(W, ξ; α, β) = (1/2) Σ_{r=1}^k ‖w_r − w^i_r‖² + C Σ_{χ∈χ̄^i} ξ_χ
                   + Σ_{χ∈χ̄^i} Σ_{(r,s)∈χ} α^i_{r,s} ( γ − ⟨w_r, x^i⟩ + ⟨w_s, x^i⟩ − ξ_χ )
                   − Σ_{χ∈χ̄^i} β_χ ξ_χ ,    (5)
where α^i_{r,s} ≥ 0 (defined only for pairs (r, s) ∈ y^i) are the Lagrange multipliers. Mundane calculus yields that,

   w_p = w^i_p + Σ_{s : (p,s)∈y^i} α^i_{p,s} x^i − Σ_{r : (r,p)∈y^i} α^i_{r,p} x^i    for 1 ≤ p ≤ k .    (6)
To simplify Eq. (6) and the form of the optimal solution we extend α^i_{r,s} to be defined over all pairs r, s. For each (r, s) ∈ y^i we define α^i_{s,r} = −α^i_{r,s}. We also set α^i_{r,s} = 0 for all other values of r and s. Using this extension of the Lagrange multipliers, Eq. (6) can be rewritten as,

   w^{i+1}_r = w^i_r + Σ_s α^i_{r,s} x^i    for 1 ≤ r ≤ k .    (7)
Finally, we define τ^i_r = Σ_s α^i_{r,s}, yielding the following update,

   w^{i+1}_r = w^i_r + τ^i_r x^i    for 1 ≤ r ≤ k .    (8)

Summing up, the resulting dual is,

   min_{α^i_{r,s}}  (1/2) ‖x^i‖² Σ_{r=1}^k ( Σ_s α^i_{r,s} )² + Σ_{(r,s)∈y^i} α^i_{r,s} ( ⟨w^i_r, x^i⟩ − ⟨w^i_s, x^i⟩ − γ )

   s.t.:  α^i_{r,s} ≥ 0 for (r, s) ∈ y^i ;  α^i_{s,r} = −α^i_{r,s} for (r, s) ∈ y^i ;  α^i_{r,s} = 0 otherwise ;
          ∀χ ∈ χ̄^i :  Σ_{(r,s)∈χ} α^i_{r,s} ≤ C/(k−1) .    (9)
The transformation from the primal to the dual form is standard. However, the resulting dual form given by Eq. (9) imposes a rather major difficulty whenever the optimal solution does not satisfy any of the inequality constraints with equality. In this case there is no way to distinguish between different covers from the values of the α^i_{r,s}. Since the proofs of our loss bounds are based on first bounding the cumulative sum of the α^i_{r,s}, this problem precludes the derivation of mistake bounds that are sensitive to the specific cover that is used. We illustrate this difficulty with the following toy example. Assume that there are only three different categories and the instance space is the reals, X = R. Assume further that the ith weight vectors are w^i_1 = −0.5, w^i_2 = 0, w^i_3 = 2.5. Let the ith example be x^i = 1 and y^i = {(1, 2), (1, 3)}. If we now set C = 3, we get that the optimal solution of the dual is the same for two different covers, χ̄^i = {{(1, 2)}, {(1, 3)}} and χ̄^i = {{(1, 2), (1, 3)}}. Thus, it is impossible to unravel from the α^i_{r,s} what cover was used, and any analysis must unify the two covers into a single loss bound. To overcome this problem, we impose a lower bound on the total sum of the Lagrange multipliers in each cover element. By construction, this lower bound depends on the particular cover that is being used. Specifically, we replace the constraints on the α^i_{r,s} with the following constraints that bound the total sum from above and below,

   ∀χ ∈ χ̄^i :  (c^i/(k−1)) [[ {(r, s) ∈ χ : (s, r) ∈ ŷ^i} ≠ ∅ ]]  ≤  Σ_{(r,s)∈χ} α^i_{r,s}  ≤  C/(k−1) ,    (10)
where c^i = min{C, 1/‖x^i‖²}. Put another way, if the predicted total order is consistent with all the pairs of a specific cover element χ ∈ χ̄^i, then the lower bound is kept at zero. Alas, if the order of some pair in a cover element is not predicted perfectly, then we aggressively set a lower bound on the sum of the Lagrange multipliers corresponding to that element. We briefly discuss the implications of this construction at the end of the next section. The pseudocode of the algorithm is given in Fig. 2.

To derive a loss bound for the additive algorithm we first bound the cumulative sum Σ_{i,r,s} |α^i_{r,s}|, as given in Lemma 1 below. We then draw a connection between this bound and the bound on the cumulative loss suffered by the algorithm.

Lemma 1. Let (x^1, y^1), . . . , (x^m, y^m) be an input sequence for the algorithm described in Fig. 2, where x^i ∈ R^n and y^i ⊂ Y × Y is a semi-order. Let W* ∈ R^{n×k} be a collection of k vectors and γ* > 0. Assume that the algorithm of Fig. 2 is run with a parameter C > 0. Fix γ > 0, and let α^i_{r,s} be the optimal solution of Eq. (9) with the modified set of constraints given in Eq. (10). Then, the following bound holds,

   Σ_{i,r,s} |α^i_{r,s}|  ≤  4 (1/(γ*)²) ‖W*‖²  +  4 (C/((k−1)γ*)) Σ_i H_{γ*}(W*; (x^i; y^i; χ̄^i)) .

The proof is omitted due to the lack of space. The skeleton of the proof is similar to the proof of Lemma 2 which is given in the next section. Before stating the main theorem of this section, we would like to make a few comments in passing. First, whenever the category ranking is perfectly consistent with the feedback on all examples, the right term of the bound above vanishes for a proper choice of W*. Second, the bound still holds when solving the optimization problem given by Eq. (9) without the additional constraints provided in Eq. (10). However, as discussed above, we incorporate the set of additional constraints since they enable us to cast the cumulative loss bound stated in the theorem below.

Theorem 1. Assume that all the instances reside in a ball of radius R (∀i : ‖x^i‖₂ ≤ R) and that C ≥ γ/R². Then, under the same terms stated in Lemma 1, the cumulative cover loss the algorithm suffers is upper bounded by,

   Σ_i I(W^i; (x^i; y^i; χ̄^i))  ≤  2(k−1) (R²/(γ(γ*)²)) ‖W*‖²  +  2 (R²C/(γγ*)) Σ_i H_{γ*}(W*; (x^i; y^i; χ̄^i)) .

Thm. 1 tells us that the cumulative loss of the algorithm with respect to a given cover is bounded by the hinge loss suffered by any fixed category ranker W* plus a term that depends on the norm of that ranker. The dependency on the number of different labels is distilled to a single factor: the multiplier of the ranker's norm, which is proportional to k. Furthermore, the dependency of the bound on the meta-parameters γ and C appears only through their ratio, and thus one of these parameters can be set to an arbitrary value; often we set γ = 1.
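For illustration only, here is a sketch of a single round in the spirit of Fig. 2. It does not solve the dual of Eq. (9)–(10) exactly; instead it uses the simple feasible assignment that also appears in the analysis, namely weight c^i/(k−1) on one violated pair per mis-ordered cover element, so it only demonstrates the form of the update w_r ← w_r + τ_r x^i. All names are ours.

```python
# A minimal sketch of one round of an additive update in the style of Fig. 2.
# The dual variables are chosen heuristically (feasible for Eq. (9)-(10)),
# not optimally; this only illustrates the update form.
import numpy as np

def additive_round(W, x, cover, gamma, C):
    k = W.shape[0]
    scores = W @ x
    c_i = min(C, 1.0 / (x @ x))                 # c^i as defined below Eq. (10)
    alpha = np.zeros((k, k))                    # antisymmetric multipliers
    for chi in cover:
        violated = [(r, s) for (r, s) in chi if scores[r] <= scores[s]]
        if violated:
            r, s = violated[0]                  # one violated pair per element
            alpha[r, s] += c_i / (k - 1)
            alpha[s, r] -= c_i / (k - 1)
    tau = alpha.sum(axis=1)                     # tau_r = sum_s alpha_{r,s}
    return W + np.outer(tau, x)                 # w_r <- w_r + tau_r * x

k, n = 5, 3
rng = np.random.default_rng(1)
W = np.zeros((k, n))                            # initialization of Fig. 2
x = rng.normal(size=n)
cover = [[(r, s) for s in (2, 3, 4)] for r in (0, 1)]
print(additive_round(W, x, cover, gamma=1.0, C=0.1))
```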
3 A Multiplicative Algorithm
In this section we describe a multiplicative algorithm for category ranking. As in the previous section, the algorithm maintains a collection of k weight vectors. In the case of the multiplicative update we add a constraint on the ranker by forcing the ℓ₁ norm of W^i to be one for all i. We further assume that all the components of W^i are non-negative. The resulting update incorporates these constraints for each new vector it constructs. On round i the new ranker W^{i+1} is again the minimizer of a constrained optimization problem which is similar to the one given in Eq. (3). The main difference is that we replace the Euclidean norm appearing in Eq. (3) with the Kullback-Leibler (KL) divergence [5]. The KL divergence, also known as the relative entropy, is used in information theory and statistics to measure the discrepancy between information sources. The resulting constrained optimization problem that yields the multiplicative update is,

   W^{i+1} = argmin_W  D_KL(W ‖ W^i) + (2C/(k(k−1))) H_γ(W; (x^i; y^i; χ̄^i))    s.t.  ‖W‖₁ = 1 .    (11)
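For concreteness, the KL divergence appearing in Eq. (11), taken between two rankers flattened to their kn entries, can be computed as in the following small sketch (ours; the ε-smoothing is only a numerical convenience):

```python
# A small sketch of the KL divergence used in Eq. (11), applied to two rankers
# whose entries are nonnegative and sum to one.
import numpy as np

def kl(W, V, eps=1e-12):
    """D_KL(W || V) over all k*n entries of the two rankers."""
    W, V = W.ravel(), V.ravel()
    return float(np.sum(W * np.log((W + eps) / (V + eps))))

k, n = 3, 4
W0 = np.full((k, n), 1.0 / (k * n))        # uniform initialization of Fig. 3
rng = np.random.default_rng(2)
W1 = rng.random((k, n)); W1 /= W1.sum()    # some other point on the simplex
print(kl(W1, W0), kl(W0, W0))              # the latter is (numerically) zero
```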
We show in the sequel that the resulting update has a multiplicative form. As with the additive update, the multiplicative update can be employed with any cover that satisfies the requirements listed above. The pseudocode of the algorithm is given in Fig. 3. Before proceeding to the derivation of the multiplicative update and its loss bound analysis we would like to underscore two important differences between the additive update of the previous section and the multiplicative update. First, setting C = ∞ puts all the emphasis on the empirical loss of the most recent example. In the additive case this results in a solution W^{i+1} such that H_γ(W^{i+1}; (x^i; y^i; χ̄^i)) = 0. However, due to the additional constraint that ‖W^i‖₁ = 1, the inner products ⟨w_r, x^i⟩ are upper bounded by ‖x^i‖_∞. Hence, depending on γ, it may be impossible to achieve a zero hinge loss with W^{i+1} even when C is arbitrarily large. Second, note that the loss term is weighed differently in the two algorithms: we use a factor of 1/(k−1) (Eq. (3)) for the additive algorithm and a factor of 2/(k(k−1)) (Eq. (11)) for the multiplicative one. This difference is due to the conversion phase, described in the sequel, of the bounds on the weights into loss bounds. To derive an update rule we use the dual form of Eq. (11). Similar to Eq. (3), we write the constraints explicitly, compute the corresponding Lagrangian, and get that the lth component of the optimal solution satisfies,

   log(w_{p,l}) = log(w^i_{p,l}) + Σ_{s : (p,s)∈y^i} α^i_{p,s} x^i_l − Σ_{r : (r,p)∈y^i} α^i_{r,p} x^i_l − β .    (12)
Taking the exponent of both sides of the above equation results in the multiplicative update described in Fig. 3, where e^β = Z^i. Similar to the line of derivation following Eq. (6), we simplify Eq. (12) by extending the definition of the Lagrange multipliers α^i_{r,s} to be defined over all r and s. The end result is the following dual problem,

   max_{α^i_{r,s}}  − log Z^i + γ Σ_{(r,s)∈y^i} α^i_{r,s}

   subject to:  α^i_{r,s} ≥ 0 for (r, s) ∈ y^i ;  α^i_{s,r} = −α^i_{r,s} for (r, s) ∈ y^i ;  α^i_{r,s} = 0 otherwise ;
                ∀χ ∈ χ̄^i :  Σ_{(r,s)∈χ} α^i_{r,s} ≤ 2C/(k(k−1)) .    (13)
Finally, as in the additive update we impose an additional set of constraints that cast a lower bound on the sum of the α^i_{r,s} within each cover element,

   ∀χ ∈ χ̄^i :  (2c^i/(k(k−1))) [[ {(r, s) ∈ χ : (s, r) ∈ ŷ^i} ≠ ∅ ]]  ≤  Σ_{(r,s)∈χ} α^i_{r,s} ,    (14)
where c^i depends on the ℓ_∞ norm of x^i and is equal to,

   c^i = min{ log(1 + γ/‖x^i‖_∞) / ‖x^i‖_∞ , C } .
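Analogously to the additive case, here is a sketch (ours, not the exact optimizer of Eq. (13)–(14)) of one round of the multiplicative update of Fig. 3, using the feasible assignment 2c^i/(k(k−1)) on one violated pair per mis-ordered cover element:

```python
# A minimal sketch of one round of a multiplicative update in the style of
# Fig. 3.  The dual variables are chosen heuristically (feasible for
# Eq. (13)-(14)), not optimally; this only illustrates the update
# w_{r,l} <- w_{r,l} * exp(tau_r * x_l) / Z.
import numpy as np

def multiplicative_round(W, x, cover, gamma, C):
    k = W.shape[0]
    scores = W @ x
    x_inf = np.max(np.abs(x))
    c_i = min(np.log(1.0 + gamma / x_inf) / x_inf, C)   # c^i as defined above
    alpha = np.zeros((k, k))
    for chi in cover:
        violated = [(r, s) for (r, s) in chi if scores[r] <= scores[s]]
        if violated:
            r, s = violated[0]
            alpha[r, s] += 2.0 * c_i / (k * (k - 1))
            alpha[s, r] -= 2.0 * c_i / (k * (k - 1))
    tau = alpha.sum(axis=1)
    W_new = W * np.exp(np.outer(tau, x))                 # w_{r,l} * exp(tau_r x_l)
    return W_new / W_new.sum()                           # divide by Z

k, n = 5, 3
W = np.full((k, n), 1.0 / (k * n))                       # initialization of Fig. 3
rng = np.random.default_rng(3)
x = rng.normal(size=n)
cover = [[(r, s) for s in (2, 3, 4)] for r in (0, 1)]
print(multiplicative_round(W, x, cover, gamma=0.1, C=1.0))
```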
The technique for deriving a loss bound for the multiplicative update is similar to the one used for the additive update, yet it is more involved. We first find a bound on the cumulative sum of the coefficients α^i_{r,s}. Then, we tie the cover loss to the value of the coefficients α^i_{r,s}, which enables us to derive a loss bound.

Lemma 2. Let (x^1, y^1), . . . , (x^m, y^m) be an input sequence for the algorithm described in Fig. 3, where x^i ∈ R^n and y^i ⊂ Y × Y is a semi-order. Assume that the algorithm is run with a parameter C ≥ 0 and a margin parameter γ > 0. Let {α^i_{r,s}} be the optimal solution of Eq. (13) with the additional constraints given in Eq. (14). Let W* ∈ R^{n×k} be any collection of k vectors such that ‖W*‖₁ = 1 and fix γ* > γ. Then, the cumulative sum of coefficients is upper bounded by,

   Σ_i Σ_{(r,s)} |α^i_{r,s}|  ≤  2 log(kn)/(γ* − γ)  +  4 (C/(k(k−1)(γ* − γ))) Σ_i H_{γ*}(W*; (x^i; y^i; χ̄^i)) .
Proof. Define ∆_i = D_KL(W* ‖ W^i) − D_KL(W* ‖ W^{i+1}). We prove the lemma by bounding Σ_{i=1}^m ∆_i from above and below. First note that Σ_{i=1}^m ∆_i is a telescopic sum and therefore,

   Σ_{i=1}^m ∆_i = D_KL(W* ‖ W^1) − D_KL(W* ‖ W^{m+1}) ≤ D_KL(W* ‖ W^1) .

Using the definition of D_KL and substituting the value of w^1_{r,l} with 1/(nk) we get,

   Σ_{i=1}^m ∆_i ≤ Σ_{r,l} w*_{r,l} log( w*_{r,l} / (1/(nk)) ) = log(nk) + Σ_{r,l} w*_{r,l} log w*_{r,l} ≤ log(nk) ,    (15)

where the last inequality holds since w*_{r,l} ≤ 1. This provides an upper bound on Σ_i ∆_i. In the following we prove a lower bound on ∆_i. Expanding ∆_i we get,

   ∆_i = D_KL(W* ‖ W^i) − D_KL(W* ‖ W^{i+1})
       = Σ_{r,l} w*_{r,l} log( w*_{r,l} / w^i_{r,l} ) − Σ_{r,l} w*_{r,l} log( w*_{r,l} / w^{i+1}_{r,l} )
       = Σ_{r,l} w*_{r,l} log( e^{Σ_s α^i_{r,s} x^i_l} / Z^i )
       = − log(Z^i) Σ_{r,l} w*_{r,l} + Σ_{r,s,l} α^i_{r,s} w*_{r,l} x^i_l
       = − log(Z^i) Σ_{r,l} w*_{r,l} + Σ_{r,s} α^i_{r,s} ⟨w*_r, x^i⟩ .    (16)

We rewrite the right term of the last equality above as,

   Σ_{r,s} α^i_{r,s} ⟨w*_r, x^i⟩ = Σ_{(r,s)∈y^i} α^i_{r,s} ( ⟨w*_r, x^i⟩ − ⟨w*_s, x^i⟩ ) .    (17)
Substituting Eq. (17) in Eq. (16) while using the constraint that ‖W*‖₁ = 1 we get,

   ∆_i = − log Z^i + Σ_{(r,s)∈y^i} α^i_{r,s} ( ⟨w*_r, x^i⟩ − ⟨w*_s, x^i⟩ )
       = [ − log Z^i + γ Σ_{(r,s)∈y^i} α^i_{r,s} ]    (18)
         + [ − γ Σ_{(r,s)∈y^i} α^i_{r,s} + Σ_{(r,s)∈y^i} α^i_{r,s} ( ⟨w*_r, x^i⟩ − ⟨w*_s, x^i⟩ ) ] .    (19)
We thus decomposed ∆_i into two parts, denoted by Eq. (18) and Eq. (19). Note that Eq. (18) is equal to the objective of the dual optimization problem given in Eq. (13). We now show that there exists a feasible assignment (not necessarily the optimal one) of the variables α^i_{r,s} for which

   − log Z^i + γ Σ_{(r,s)∈y^i} α^i_{r,s} ≥ 0 .    (20)

Therefore, the optimal solution of Eq. (13) also satisfies the inequality above. We hence get that ∆_i is lower bounded solely by the term given in Eq. (19). To describe a set of feasible values for the parameters α^i_{r,s}, consider the cover elements χ ∈ χ̄^i for which {(r, s) ∈ χ : (s, r) ∈ ŷ^i} is not empty, that is, the elements that contain a mis-ordered pair. For one such pair in each such element we set α^i_{r,s} = 2c^i/(k(k−1)); all the remaining values are set to zero. For brevity we denote b^i = ‖x^i‖_∞ Σ_{(r,s)∈y^i} α^i_{r,s}. We thus get

   | Σ_s α^i_{r,s} x^i_l | ≤ ‖x^i‖_∞ Σ_{(r,s)∈y^i} α^i_{r,s} = b^i .
We upper bound Z^i as follows,

   Z^i = Σ_{r,l} w^i_{r,l} e^{x^i_l Σ_p α^i_{r,p}}
       ≤ Σ_{r,l} w^i_{r,l} [ ((b^i + x^i_l Σ_p α^i_{r,p}) / (2b^i)) e^{b^i} + ((b^i − x^i_l Σ_p α^i_{r,p}) / (2b^i)) e^{−b^i} ]
       = ((e^{b^i} + e^{−b^i})/2) Σ_{r,l} w^i_{r,l} + ((e^{b^i} − e^{−b^i})/(2b^i)) Σ_{r,s,l} α^i_{r,s} w^i_{r,l} x^i_l
       = cosh(b^i) + (sinh(b^i)/b^i) Σ_{(r,s)∈y^i} α^i_{r,s} ( ⟨w^i_r, x^i⟩ − ⟨w^i_s, x^i⟩ )
       ≤ cosh(b^i) ,

where the first inequality follows from the convexity of the exponential function and the last inequality holds since either α^i_{r,s} > 0 and ⟨w^i_r, x^i⟩ − ⟨w^i_s, x^i⟩ < 0, or α^i_{r,s} = 0. Therefore we get that the objective function is lower bounded by,

   − log(cosh(b^i)) + γ Σ_{(r,s)∈y^i} α^i_{r,s} = − log(cosh(b^i)) + γ b^i / ‖x^i‖_∞ .

Our particular choice of the α^i_{r,s} implies that,

   b^i = ‖x^i‖_∞ Σ_{(r,s)∈y^i} α^i_{r,s} = (2 c^i ‖x^i‖_∞ / (k(k−1))) I(W^i; (x^i; y^i; χ̄^i))    (21)

for c^i ∈ [ 0, log(1 + γ/‖x^i‖_∞) / ‖x^i‖_∞ ]. It can be shown that log(cosh(b^i)) − γ b^i/‖x^i‖_∞ ≤ 0 both for c^i = 0 and for c^i = log(1 + γ/‖x^i‖_∞)/‖x^i‖_∞ (note that b^i is proportional
to c^i). From the convexity of f(b^i) = log(cosh(b^i)) − γ b^i/‖x^i‖_∞ it follows that log(cosh(b^i)) − γ b^i/‖x^i‖_∞ ≤ 0 for all feasible values of b^i. We thus proved that the value of Eq. (18) is lower bounded by 0. This yields the following lower bound on ∆_i,

   ∆_i ≥ Σ_{(r,s)∈y^i} α^i_{r,s} ( ⟨w*_r, x^i⟩ − ⟨w*_s, x^i⟩ ) − γ Σ_{(r,s)∈y^i} α^i_{r,s} .    (22)
We further develop the first term and get,

   Σ_{(r,s)∈y^i} α^i_{r,s} ( ⟨w*_r, x^i⟩ − ⟨w*_s, x^i⟩ )
     ≥ Σ_{(r,s)∈y^i} α^i_{r,s} ( γ* − [ γ* − ⟨w*_r, x^i⟩ + ⟨w*_s, x^i⟩ ]_+ )
     ≥ − Σ_{χ∈χ̄^i} ( max_{(r,s)∈χ} [ γ* − ⟨w*_r, x^i⟩ + ⟨w*_s, x^i⟩ ]_+ ) ( Σ_{(r,s)∈χ} α^i_{r,s} ) + γ* Σ_{χ∈χ̄^i} Σ_{(r,s)∈χ} α^i_{r,s} .    (23)
Finally, using the upper bound constraint of Eq. (13) we lower bound the left (negative) term in the last inequality by −(2C/(k(k−1))) H_{γ*}(W*; (x^i; y^i; χ̄^i)). Thus,

   ∆_i ≥ −2 (C/(k(k−1))) H_{γ*}(W*; (x^i; y^i; χ̄^i)) + (1/2)(γ* − γ) Σ_{r,s} |α^i_{r,s}| .    (24)

Substituting Eq. (24) in Eq. (15) we get,

   (1/2)(γ* − γ) Σ_i Σ_{(r,s)} |α^i_{r,s}| ≤ log(kn) + 2 (C/(k(k−1))) Σ_i H_{γ*}(W*; (x^i; y^i; χ̄^i)) ,
which yields the desired bound. As in the analysis of the additive update, the bound holds true also when the optimization problem given by Eq. (13) is not augmented with the additional constraints provided in Eq. (14). These additional constraints, however, are instrumental in the proof of the following theorem.

Theorem 2. Assume that all the instances lie in a cube of width R (‖x^i‖_∞ ≤ R) and that C ≥ log(1 + γ/R)/R. Under the assumptions of Lemma 2, the cumulative loss is bounded above by,

   Σ_i I(W^i; (x^i; y^i; χ̄^i)) ≤ R k(k−1) log(kn) / ( 2 log(1 + γ/R)(γ* − γ) ) + ( R C / ( log(1 + γ/R)(γ* − γ) ) ) Σ_i H_{γ*}(W*; (x^i; y^i; χ̄^i)) .
The lemma and the theorem state that for each value of γ used in the algorithm there is a feasible range of values for the margin parameter γ*. Furthermore, if the value of γ* is known to the algorithm in advance, then the value of the margin parameter γ can be set so as to yield the tightest upper bound, as follows. By the concavity of the log function, log(1 + x) ≤ x, so the condition C ≥ γ/R² implies the condition C ≥ log(1 + γ/R)/R of Thm. 2; approximating log(1 + γ/R) by γ/R, the bound on the loss stated in the theorem becomes,

   R k(k−1) log(kn) / ( 2 (γ/R)(γ* − γ) ) + ( R C / ( (γ/R)(γ* − γ) ) ) Σ_i H_{γ*}(W*; (x^i; y^i; χ̄^i)) .

The above bound is minimized by setting γ = γ*/2. Substituting this value in the bound we obtain,

   Σ_i I(W^i; (x^i; y^i; χ̄^i)) ≤ 2 R² k(k−1) log(kn) / (γ*)² + 4 (R² C/(γ*)²) Σ_i H_{γ*}(W*; (x^i; y^i; χ̄^i)) .

Substituting the smallest admissible value of C, which is γ*/(2R²), we finally obtain,

   Σ_i I(W^i; (x^i; y^i; χ̄^i)) ≤ 2 k(k−1) log(kn) R²/(γ*)² + (2/γ*) Σ_i H_{γ*}(W*; (x^i; y^i; χ̄^i)) .    (25)

Before proceeding to the next section let us summarize the results obtained thus far. Similar algorithms [9, 7] were designed for simpler prediction problems. As a consequence, the update schemes for those algorithms take closed forms, and their analyses in turn rely on the existence of an exact-form solution. In this paper we address the more complex problem of category ranking for which there is no closed form for the update. To analyze our algorithms, the optimization problems that constitute the infrastructure for the update were augmented with additional constraints. Mathematically, these constraints are equivalent to additional negative slack variables in the primal optimization problem. The semantics of these variables requires further investigation. Nonetheless, this construction forces the Lagrange multipliers to attain a minimal value and distinguishes between the solutions obtained for different covers.
4 Category-Ranking Based on Bregman Divergences
The additive and multiplicative algorithms described in the previous sections share a similar structure. On each iteration the online algorithm attempts to minimize the loss associated with the instantaneous category ranking task while attempting to keep the new ranker, designated by W^{i+1}, as "close" as possible to W^i. The additive algorithm uses the square of the Euclidean distance as the means for encapsulating quantitatively the notion of closeness, while the multiplicative algorithm uses the KL divergence for that purpose. In this section we overview a unified approach that is based on Bregman divergences [2]. We would like to note that while the use of Bregman divergences in the context of category ranking problems is new, Bregman divergences have been used extensively in other learning settings (see for instance [11, 1, 4]).
A Bregman divergence is defined via a strictly convex function F : X → R which is defined on a closed convex set X ⊆ R^n. A Bregman function F needs to satisfy a set of constraints; we omit the description of the specific constraints and refer the reader to [3]. We further require that F is continuously differentiable at all points of X_int (the interior of X), which is assumed to be nonempty. The Bregman divergence that is associated with F, applied to x ∈ X and w ∈ X_int, is defined to be

   B_F(x ‖ w) = F(x) − [ F(w) + ∇F(w) · (x − w) ] .

Thus, B_F measures the difference between two functions evaluated at x: the first is the function F itself and the second is the first-order Taylor expansion of F derived at w. The divergences we employ are defined via a single scalar convex function f such that F(x) = Σ_{l=1}^n f(x_l), where x_l is the lth coordinate of x. The resulting Bregman divergence between x and w is thus B_F(x ‖ w) = Σ_{l=1}^n B_f(x_l ‖ w_l). The two divergences described in the previous sections can be obtained by choosing f(x) = (1/2) x² (squared Euclidean distance) and f(x) = x log(x) − x (KL divergence). For the latter we also restrict X to the probability simplex ∆_n = { x | x_l ≥ 0 ; Σ_l x_l = 1 }.
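The construction can be checked numerically; the following sketch (ours) verifies that f(a) = a²/2 yields the squared Euclidean distance and that f(a) = a log a − a yields the unnormalized KL divergence, which coincides with the relative entropy when both arguments lie on the simplex:

```python
# A small check of B_F(x||w) = F(x) - F(w) - <grad F(w), x - w> for the two
# choices of f mentioned in the text.  All names are ours.
import numpy as np

def bregman(F, gradF, x, w):
    return F(x) - F(w) - float(gradF(w) @ (x - w))

F_sq = lambda v: float(np.sum(v * v) / 2.0)
grad_sq = lambda v: v
F_ent = lambda v: float(np.sum(v * np.log(v) - v))
grad_ent = lambda v: np.log(v)

rng = np.random.default_rng(4)
x = rng.random(6) + 0.1
w = rng.random(6) + 0.1
print(bregman(F_sq, grad_sq, x, w), 0.5 * np.sum((x - w) ** 2))
print(bregman(F_ent, grad_ent, x, w),
      np.sum(x * np.log(x / w) - x + w))       # unnormalized KL divergence
```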
We now describe an online category-ranking algorithm that can be applied with any Bregman divergence. However, the generality of the algorithm comes with a cost: the algorithm and its corresponding analysis are designed for the case where there exists a ranker W* which is consistent with all the semi-orders that are given as feedback. Equipped with this assumption, the new category ranker W^{i+1} is defined as the solution of the following problem,

   W^{i+1} = argmin_W  B_F(W ‖ W^i)    s.t.  H_γ(W; (x^i; y^i)) = 0 .    (26)

That is, W^{i+1} is chosen among all rankers which attain a zero hinge loss on the current instance-label pair. Due to our assumption this set is not empty. The ranker that is chosen is the one whose Bregman divergence with respect to the current ranker W^i is the smallest. Before providing the main result of this section let us elaborate on the form of the dual of Eq. (26) and the resulting solution. Writing explicitly the constraint given in Eq. (26) we get that,

   W^{i+1} = argmin_W  B_F(W ‖ W^i)    s.t.  ⟨w_r, x^i⟩ − ⟨w_s, x^i⟩ ≥ γ, ∀(r, s) ∈ y^i .    (27)
The corresponding Lagrangian of this optimization problem is,

   L(W; α) = Σ_{r=1}^k B_F(w_r ‖ w^i_r) + Σ_{(r,s)∈y^i} α^i_{r,s} ( γ − ⟨w_r, x^i⟩ + ⟨w_s, x^i⟩ ) ,    (28)

where α^i_{r,s} ≥ 0 (for (r, s) ∈ y^i) are Lagrange multipliers and we expanded W into its constituents w_r. To find a saddle point of L we first set to zero the derivative of L with respect to w_p for all p and get,

   ∇F(w_p) = ∇F(w^i_p) + Σ_{s : (p,s)∈y^i} α^i_{p,s} x^i − Σ_{r : (r,p)∈y^i} α^i_{r,p} x^i .    (29)
The last equation generalizes both Eq. (6) and Eq. (12), obtained for the Euclidean distance and the KL divergence, respectively. As before, we expand α^i_{r,s} to be defined over all r and s and get that,

   ∇F(w_r) = ∇F(w^i_r) + Σ_s α^i_{r,s} x^i = ∇F(w^i_r) + τ^i_r x^i .    (30)
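A small sketch (ours) of the implicit update of Eq. (30), carried out through ∇F and its inverse, shows how it reduces to the additive update for f(a) = a²/2 and to the exponentiated (multiplicative) update for f(a) = a log a − a, before the normalization to the simplex:

```python
# A sketch of the update grad F(w_r) <- grad F(w_r^i) + tau_r * x, made
# explicit through the inverse map grad F^{-1}.  All names are ours.
import numpy as np

def bregman_update(W, x, tau, gradF, gradF_inv):
    # apply grad F componentwise, add tau_r * x to row r, and map back
    return gradF_inv(gradF(W) + np.outer(tau, x))

rng = np.random.default_rng(5)
W = rng.random((4, 3)) + 0.1
x = rng.normal(size=3)
tau = rng.normal(size=4)

# Euclidean case: grad F is the identity, so the update is additive.
add = bregman_update(W, x, tau, lambda V: V, lambda V: V)
print(np.allclose(add, W + np.outer(tau, x)))            # True

# Entropic case: grad F = log, grad F^{-1} = exp, so the update is multiplicative.
mult = bregman_update(W, x, tau, np.log, np.exp)
print(np.allclose(mult, W * np.exp(np.outer(tau, x))))   # True
```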
Eq. (30) thus gives an implicit form of W^{i+1}. Substituting Eq. (30) back in the Lagrangian of Eq. (28) we obtain the dual problem,

   max_{α^i_{r,s}}  Σ_{r=1}^k B_F( ∇F^{−1}( ∇F(w^i_r) + τ^i_r x^i ) ‖ w^i_r )
                    + Σ_{(r,s)∈y^i} α^i_{r,s} ( γ − ⟨ ∇F^{−1}( ∇F(w^i_r) + τ^i_r x^i ), x^i ⟩ + ⟨ ∇F^{−1}( ∇F(w^i_s) + τ^i_s x^i ), x^i ⟩ )

   subject to:  α^i_{r,s} ≥ 0 for (r, s) ∈ y^i ;  α^i_{s,r} = −α^i_{r,s} for (r, s) ∈ y^i ;  α^i_{r,s} = 0 otherwise ,    (31)
where ∇F^{−1}(·) is the component-wise inverse of ∇F(·). It is well defined since F is strictly convex and thus ∇F is strictly monotone. It remains to describe how we set the initial ranker W^1. To be consistent with the choice of initial ranker made for the additive and multiplicative algorithms, we set W^1 = argmin_W F(W). The lemma below states that the cumulative sum of the dual parameters α^i_{r,s} is bounded. In turn, this lemma can be used to derive specific loss bounds for particular Bregman divergences.

Lemma 3. Let (x^1, y^1), . . . , (x^m, y^m) be an input sequence for the algorithm whose update rule is described in Eq. (26), where x^i ∈ R^n and y^i ⊂ Y × Y is a semi-order. Let W* ∈ R^{n×k} be a collection of k vectors which attains a positive margin γ* > 0 on the sequence, γ* = min_i min_{(r,s)∈y^i} { ⟨w*_r, x^i⟩ − ⟨w*_s, x^i⟩ } > 0. Let B_F be a Bregman divergence derived from a convex function F. Then, for any value of c > γ/γ* the cumulative sum of coefficients is bounded by,

   Σ_i Σ_{(r,s)} |α^i_{r,s}| ≤ ( F(cW*) − F(W^1) ) / ( cγ* − γ ) .
The proof is omitted due to lack of space; however, we would like to discuss specific choices of Bregman divergences. If the Bregman function F is p-homogeneous, F(ax) = a^p F(x) for p > 1, and W^1 = 0, then the bound is minimized by setting c = pγ/((p−1)γ*). In this case the bound becomes,

   ( p^p γ^{p−1} / ( (p−1)^{p−1} (γ*)^p ) ) F(W*) .
If the Bregman function F is 1-homogeneous, F(ax) = aF(x), then the bound is minimized by letting c → ∞, and we obtain that the bound on the cumulative sum of the coefficients is simply F(W*)/γ*.

To conclude the paper we would like to mention some open problems. First, comparing the bounds of the additive algorithm and the multiplicative algorithm we see that the bound of the additive update is k times smaller than that of the multiplicative update. This rather large gap between the two updates is not exhibited in other online prediction problems. We are not sure yet whether the gap is an artifact of the analysis technique or a property of the category ranking problem. Second, we have not yet been able to convert Lemma 3 into a mistake bound similar to Thm. 1 and Thm. 2. We leave this problem to future research. Third, it is straightforward to employ the online updates based on Eq. (9) and Eq. (13) in a batch setting. However, it is not clear whether the additional constraints on the dual variables given in Eq. (10) and Eq. (14) can be translated into a sensible batch paradigm. The lower bounds on the weights depend on the instantaneous loss the algorithm suffers, whereas in a batch setting this notion of temporal loss does not exist.

Acknowledgments. We are in debt to the chairs and members of the program committee of COLT'05 for their constructive and thoughtful comments. This research was funded by EU Project PASCAL and by the Israeli Science Foundation grant number 522/04. Most of this work was carried out at the Hebrew University of Jerusalem.
References

1. K.S. Azoury and M.W. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.
2. L.M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200–217, 1967.
3. Y. Censor and S.A. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, New York, NY, USA, 1997.
4. M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 47(2/3):253–285, 2002.
5. T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley, 1991.
6. K. Crammer. Online Learning for Complex Categorial Problems. PhD thesis, Hebrew University of Jerusalem, 2005. To appear.
7. K. Crammer, O. Dekel, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. In Advances in Neural Information Processing Systems 16, 2003.
8. K. Crammer and Y. Singer. A new family of online algorithms for category ranking. Journal of Machine Learning Research, 3:1025–1058, 2003.
9. K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991, 2003.
10. A. Elisseeff and J. Weston. A kernel method for multi-labeled classification. In Advances in Neural Information Processing Systems 14, 2001.
11. C. Gentile and M. Warmuth. Linear hinge loss and average margin. In Advances in Neural Information Processing Systems 10, 1998.
12. R.E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999.