On the Learnability and Design of Output Codes for Multiclass Problems


Koby Crammer and Yoram Singer
School of Computer Science & Engineering
The Hebrew University, Jerusalem 91904, Israel
{kobics,singer}@cs.huji.ac.il

Abstract

Output coding is a general framework for solving multiclass categorization problems. Previous research on output codes has focused on building multiclass machines given predefined output codes. In this paper we discuss for the first time the problem of designing output codes for multiclass problems. For the design problem of discrete codes, which have been used extensively in previous works, we present mostly negative results. We then introduce the notion of continuous codes and cast the design problem of continuous codes as a constrained optimization problem. We describe three optimization problems corresponding to three different norms of the code matrix. Interestingly, for the $l_2$ norm our formalism results in a quadratic program whose dual does not depend on the length of the code. A special case of our formalism provides a multiclass scheme for building support vector machines which can be solved efficiently. We give a time and space efficient algorithm for solving the quadratic program. Preliminary experiments we have performed with synthetic data show that our algorithm is often two orders of magnitude faster than standard quadratic programming packages.

1 Introduction

Many applied machine learning problems require assigning labels to instances where the labels are drawn from a finite set. This problem is often referred to as multiclass categorization or classification. Machine learning applications that include a multiclass categorization component include optical character recognition, text classification, phoneme classification for speech synthesis, medical analysis, and more. Some of the well known binary classification learning algorithms can be extended to handle multiclass problems (see for instance [4, 17, 18]). A general approach is to reduce a multiclass problem to multiple binary classification problems. Dietterich and Bakiri [8] described a general approach based on error-correcting codes which they termed error-correcting output coding (ECOC), or in short output coding. Output coding for multiclass problems is composed

of two stages. In the training stage we construct multiple (supposedly) independent binary classifiers, each of which is based on a different partition of the set of labels into two disjoint sets. In the second stage, the classification part, the predictions of the binary classifiers are combined to obtain a prediction of the original label of a test instance. Experimental work has shown that output coding can often greatly improve over standard reductions to binary problems [8, 9, 14, 1, 19, 7, 3, 2]. The performance of output coding was also analyzed in statistics and learning theoretic contexts [11, 13, 20, 2]. Most of the previous work on output coding has concentrated on the problem of solving multiclass problems using predefined output codes, independently of the specific application and the class of hypotheses used to construct the binary classifiers. Therefore, by predefining the output code we ignore the complexity of the induced binary problems. The output codes used in experiments were typically confined to a specific family of codes. Several families of codes have been suggested and tested so far, such as comparing each class against the rest, comparing all pairs of classes [11, 2], random codes [8, 19, 2], exhaustive codes [8, 2], and linear error correcting codes [8]. A few heuristics attempting to modify the code so as to improve the multiclass prediction accuracy have been suggested (e.g., [1]). However, they did not yield significant improvements and, furthermore, they lack any formal justification. In this paper we concentrate on the problem of designing a good code for a given multiclass problem. In Sec. 3 we study the problem of finding the first column of a discrete code matrix. Given a binary classifier, we show that finding a good first column can be done in polynomial time. In contrast, when we restrict the hypothesis class from which we choose the binary classifiers, the problem of finding a good first column becomes difficult. This result underscores the difficulty of the code design problem. Furthermore, in Sec. 4 we discuss the general design problem and show that given a set of binary classifiers the problem of finding a good code matrix is NP-complete. Motivated by the intractability results we introduce in Sec. 5 the notion of continuous codes and cast the design problem of continuous codes as a constrained optimization problem. As in discrete codes, each column of the code matrix divides the set of labels into two subsets which are labeled positive ($+$) and negative ($-$). The sign of each entry

in the code matrix determines the subset association ($+$ or $-$) and the magnitude corresponds to the confidence in this association. Given this formalism, we seek an output code with small empirical loss whose matrix norm is small. We describe three optimization problems corresponding to three different norms of the code matrix: $l_1$, $l_2$ and $l_\infty$. For $l_1$ and $l_\infty$ we show that the code design problem can be solved by linear programming (LP). Interestingly, for the $l_2$ norm our formalism results in a quadratic program (QP) whose dual does not depend on the length of the code. Similar to support vector machines, the dual program can be expressed in terms of inner-products between input instances, hence we can employ kernel-based binary classifiers. Our framework yields, as a special case, a direct and efficient method for constructing multiclass support vector machines. The number of variables in the dual of the quadratic problem is the product of the number of samples by the number of classes. This value becomes very large even for small datasets. For instance, in an English letter recognition problem the number of dual variables is 26 times the number of training examples, and the standard matrix representation of the dual quadratic problem would require more than 5 Giga bytes of memory. We therefore describe in Sec. 6 a memory efficient algorithm for solving the quadratic program for code design. Our algorithm is reminiscent of Platt's sequential minimal optimization (SMO) [15]. However, unlike SMO, our algorithm optimizes on each round a reduced subset of the variables that corresponds to a single example. Informally, our algorithm reduces the optimization problem to a sequence of small problems, where the size of each reduced problem is equal to the number of classes of the original multiclass problem. Each reduced problem can again be solved using a standard QP technique. However, standard approaches would still require a large amount of memory when the number of classes is large, and a straightforward solution is also time consuming. We therefore further develop the algorithm and provide an analytic solution for the reduced problems and an efficient algorithm for calculating the solution. The run time of the algorithm is poly-logarithmic and the memory requirements are linear in the number of classes. We conclude with simulation results showing that our algorithm is at least two orders of magnitude faster than a standard QP technique, even for a small number of classes.

2 Discrete codes

Let $S = \{(\bar{x}_1,y_1),\ldots,(\bar{x}_m,y_m)\}$ be a set of $m$ training examples where each instance $\bar{x}_i$ belongs to a domain $\mathcal{X}$. We assume without loss of generality that each label $y_i$ is an integer from the set $\mathcal{Y}=\{1,\ldots,k\}$. A multiclass classifier is a function $H:\mathcal{X}\rightarrow\mathcal{Y}$ that maps an instance $\bar{x}$ into an element $y$ of $\mathcal{Y}$. In this work we focus on a framework that uses output codes to build multiclass classifiers from binary classifiers. A discrete output code $M$ is a matrix of size $k\times l$ over $\{-1,+1\}$ where each row of $M$ corresponds to a class $y\in\mathcal{Y}$. Each column of $M$ defines a partition of $\mathcal{Y}$ into two disjoint sets. Binary learning algorithms are used to construct classifiers, one for each column $t$ of $M$. That is, the set of examples induced by column $t$ of $M$ is $\{(\bar{x}_1,M_{y_1,t}),\ldots,(\bar{x}_m,M_{y_m,t})\}$. This set is fed as training data to a learning algorithm that finds a hypothesis $h_t:\mathcal{X}\rightarrow\{-1,+1\}$. This reduction yields $l$ different binary classifiers $h_1,\ldots,h_l$. We denote the vector of predictions of these classifiers on an instance $\bar{x}$ by $\bar{h}(\bar{x})=(h_1(\bar{x}),\ldots,h_l(\bar{x}))$, and we denote the $r$th row of $M$ by $\bar{M}_r$. Given an example $\bar{x}$ we predict the label $y$ for which the row $\bar{M}_y$ is the "closest" to $\bar{h}(\bar{x})$. We use a general notion of closeness and define it through an inner-product function $K:\mathbb{R}^l\times\mathbb{R}^l\rightarrow\mathbb{R}$. The higher the value of $K(\bar{h}(\bar{x}),\bar{M}_r)$, the more confident we are that $r$ is the correct label of $\bar{x}$ according to the classifiers $\bar{h}$. An example of a closeness function is $K(\bar{u},\bar{v})=\bar{u}\cdot\bar{v}$. It is easy to verify that for vectors over $\{-1,+1\}$ this choice of $K$ is equivalent to picking the row of $M$ which attains the minimal Hamming distance to $\bar{h}(\bar{x})$.

Given a classifier $H(\bar{x})$ and an example $(\bar{x},y)$, we say that $H(\bar{x})$ misclassified the example if $H(\bar{x})\neq y$. Let $[[\pi]]$ be 1 if the predicate $\pi$ holds and 0 otherwise. Our goal is therefore to find a classifier $H(\bar{x})$ such that $\frac{1}{m}\sum_i [[H(\bar{x}_i)\neq y_i]]$ is small. When $l$ is small there might be more than one row of $M$ which attains the maximal value according to the function $K$. To accommodate such cases we relax our definition and define a classifier $H(\bar{x})$ based on a code $M$ to be the mapping $H:\mathcal{X}\rightarrow 2^{\mathcal{Y}}$ given by $H(\bar{x})=\{r : K(\bar{h}(\bar{x}),\bar{M}_r)=\max_{r'}K(\bar{h}(\bar{x}),\bar{M}_{r'})\}$. In this case we pick one of the labels in $H(\bar{x})$ uniformly at random, and use the expected error of $H(\bar{x})$,

$$\epsilon_S(M,\bar{h}) = \frac{1}{m}\sum_{i=1}^{m}\left( [[\,y_i\notin H(\bar{x}_i)\,]] + \left(1-\frac{1}{|H(\bar{x}_i)|}\right)[[\,y_i\in H(\bar{x}_i)\,]] \right) . \qquad (1)$$

In the context of output codes, a multiclass mapping $H(\bar{x})$ is thus determined by two parameters: the coding matrix $M$ and the set of binary classifiers $\bar{h}(\bar{x})$. Assume that the binary classifiers $h_1(\bar{x}),\ldots,h_l(\bar{x})$ are chosen from some hypothesis class $\mathcal{H}$. The following natural learning problems arise: (a) given a matrix $M$, find a set $\bar{h}$ which suffers small empirical loss; (b) given a set of binary classifiers $\bar{h}$, find a matrix $M$ which has small empirical loss; (c) find both a matrix $M$ and a set $\bar{h}$ which have small empirical loss. Previous work has focused mostly on the first problem. In this paper we mainly concentrate on the code design problem (problem b), that is, finding a good matrix $M$.
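As an illustration of the decoding scheme and of Eq. (1), the following minimal Python sketch (not from the paper; the code matrix, predictions and labels are toy placeholders) decodes with $K(\bar{u},\bar{v})=\bar{u}\cdot\bar{v}$ and evaluates the expected error:

```python
import numpy as np

def decode_set(M, h_x):
    """Return H(x): the rows of M attaining the maximal inner product with h(x)."""
    scores = M @ h_x                      # K(h(x), M_r) = h(x) . M_r for every row r
    return np.flatnonzero(scores == scores.max())

def expected_error(M, H_preds, y):
    """Expected error of Eq. (1) under uniform random tie-breaking."""
    err = 0.0
    for h_x, label in zip(H_preds, y):
        ties = decode_set(M, h_x)
        if label in ties:
            err += 1.0 - 1.0 / len(ties)  # correct label is drawn with prob. 1/|H(x)|
        else:
            err += 1.0                    # correct label is not among the maximizers
    return err / len(y)

# Toy illustration: k = 3 classes, l = 4 binary partitions.
M = np.array([[+1, -1, -1, +1],
              [-1, +1, -1, -1],
              [-1, -1, +1, -1]])
H_preds = np.array([[+1, -1, -1, +1],    # h(x_1)
                    [-1, +1, -1, +1]])   # h(x_2)
y = np.array([0, 1])                     # labels indexed 0..k-1 here
print(expected_error(M, H_preds, y))
```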



3 Finding the first column of an output code

Assume we are given a single binary classifier $h_1(\bar{x})$ and we want to find the first column of the matrix $M$ which minimizes the empirical loss $\epsilon_S(M,\bar{h})$. For brevity, let us denote by $\bar{u}\in\{-1,+1\}^k$ the first column of $M$. We now describe an efficient algorithm that finds $\bar{u}$ given $h_1(\bar{x})$. The algorithm's running time is polynomial in the size of the label set $k=|\mathcal{Y}|$ and the sample size $m$. First, note that in this case

$$H(\bar{x}_i)=\{r : u_r = h_1(\bar{x}_i)\} . \qquad (2)$$

Second, note that the sample can be divided into $2k$ equivalence classes according to the labels and the classification of $h_1(\bar{x})$. For $r=1,\ldots,k$ and $b\in\{-1,+1\}$, define $q_{r,b}=\frac{1}{m}|\{i : y_i=r,\ h_1(\bar{x}_i)=b\}|$ to be the fraction of the examples with label $r$ and classification $b$ (according to $h_1(\bar{x})$). Denote by $q_b=\sum_r q_{r,b}$, and let $n_b=|\{r : u_r=b\}|$ be the number of elements of $\bar{u}$ which are equal to $b$. (For brevity, we will often use $+$ and $-$ to denote the values $+1$ and $-1$ of $b$.) Let

$$\bar{\epsilon}_S(M,\bar{h}) = \frac{1}{m}\sum_{i=1}^{m}\frac{[[\,y_i\in H(\bar{x}_i)\,]]}{|H(\bar{x}_i)|} = 1-\epsilon_S(M,\bar{h}) , \qquad (3)$$

so that minimizing the empirical loss of Eq. (1) is equivalent to maximizing $\bar{\epsilon}_S$. We can assume without loss of generality that not all the elements of $\bar{u}$ are the same (otherwise $\epsilon_S(M,\bar{h})=1-\frac{1}{k}$, which is equivalent to random guessing). Hence, from Eq. (2), the size of $H(\bar{x})$ is

$$|H(\bar{x})| = n_{h_1(\bar{x})} = \begin{cases} n_+ & h_1(\bar{x})=+1 \\ n_- & h_1(\bar{x})=-1 \end{cases} . \qquad (4)$$

Using Eqs. (2) and (4), we rewrite Eq. (3),

$$\bar{\epsilon}_S(M,\bar{h}) = \frac{1}{m}\sum_{i=1}^{m}\frac{[[\,u_{y_i}=h_1(\bar{x}_i)\,]]}{n_{h_1(\bar{x}_i)}} = \sum_{r:u_r=+1}\frac{q_{r,+}}{n_+} + \sum_{r:u_r=-1}\frac{q_{r,-}}{n_-} . \qquad (5)$$

Using Eq. (5) we can now expand $\bar{\epsilon}_S$ further,

$$\bar{\epsilon}_S(M,\bar{h}) = \sum_{r:u_r=+1}\left(\frac{q_{r,+}}{n_+}-\frac{q_{r,-}}{n_-}\right) + \frac{q_-}{n_-} . \qquad (6)$$

For a particular choice of $n_+$ (and $n_-=k-n_+$), $\bar{\epsilon}_S$ is maximized (and $\epsilon_S$ is minimized) by setting $u_r=+1$ for the $n_+$ indices which attain the highest values of $\frac{q_{r,+}}{n_+}-\frac{q_{r,-}}{n_-}$, and setting the remaining $n_-=k-n_+$ indices to $-1$. This can be done efficiently in $O(k\log k)$ time using sorting. Therefore, the best choice of $\bar{u}$ is found by enumerating all the possible values $n_+\in\{1,\ldots,k-1\}$ and choosing the value of $n_+$ which achieves the maximal value for Eq. (6). Since it takes $m$ operations to calculate the values $q_{r,+}$ and $q_{r,-}$, the total number of operations needed to find the optimal choice for the first column is $O(m+k^2\log k)$. We have proven the following theorem.

Theorem 1  Let $S=\{(\bar{x}_1,y_1),\ldots,(\bar{x}_m,y_m)\}$ be a set of $m$ training examples, where each label is an integer from the set $\{1,\ldots,k\}$. Let $\mathcal{H}$ be a binary hypothesis class. Given a hypothesis $h_1\in\mathcal{H}$, the first column of an output code which minimizes the empirical loss defined by Eq. (1) can be found in polynomial time.

To conclude this section we use a reduction from SAT to demonstrate that if the learning algorithm (and its corresponding class of hypotheses from which $h_1$ is chosen) is of a very restricted form, then the resulting learning problem can be hard. Let $\phi$ be a boolean formula over $n$ variables $x_1,\ldots,x_n\in\{-1,+1\}$, where we interpret $x_j=-1$ as False and $x_j=+1$ as True. From $\phi$ one constructs a labeled sample of size $n+1$ whose labels are taken from $\mathcal{Y}=\{1,\ldots,n+1\}$, together with a very restricted learning algorithm $A_\phi$ for the binary problems induced by single columns: if the binary labeled sample handed to $A_\phi$ encodes a satisfying assignment of $\phi$, the algorithm returns a hypothesis which is consistent with that sample; otherwise it returns a constant hypothesis, $h\equiv +1$ or $h\equiv -1$, agreeing with the majority of the binary labels. Note that the learning algorithm is non-trivial in the sense that the hypothesis it returns has an empirical loss of less than $\frac{1}{2}$ on the binary labeled sample. A multiclass learning algorithm that minimizes the empirical loss $\epsilon_S$ over both the first column $\bar{u}$ and the hypothesis $h_1$ returned by $A_\phi$ can then be used to decide whether the formula $\phi$ is satisfiable: when $\phi$ is satisfiable, the consistent hypothesis together with an appropriate column $\bar{u}$ achieves the minimal possible value of $\epsilon_S$, whereas when $h_1$ is constant, applying Eq. (3) to the number $w$ of examples that $h_1$ classifies correctly shows that the loss is strictly larger. Thus the minimum of $\epsilon_S$ is achieved if and only if $\phi$ is satisfiable, and a learning algorithm for $h_1(\bar{x})$ and $\bar{u}$ can also be used as an oracle for the satisfiability of $\phi$.
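The column search described above is straightforward to implement. The sketch below (illustrative Python, not from the paper; it assumes labels indexed $0,\ldots,k-1$ and the score decomposition of Eq. (6) as written above) enumerates $n_+$ and sorts the per-class scores:

```python
import numpy as np

def best_first_column(y, h1, k):
    """Return u in {-1,+1}^k maximizing Eq. (6), given labels y and the predictions h1
    of a single binary classifier (both arrays of length m, with h1 in {-1,+1})."""
    m = len(y)
    q_plus = np.zeros(k)    # q_{r,+}: fraction of examples with label r and h1 = +1
    q_minus = np.zeros(k)   # q_{r,-}: fraction of examples with label r and h1 = -1
    for label, b in zip(y, h1):
        if b > 0:
            q_plus[label] += 1.0 / m
        else:
            q_minus[label] += 1.0 / m
    best_val, best_u = -np.inf, None
    for n_plus in range(1, k):                    # enumerate the number of +1 entries
        n_minus = k - n_plus
        scores = q_plus / n_plus - q_minus / n_minus
        top = np.argsort(-scores)[:n_plus]        # classes assigned +1 for this n_plus
        val = scores[top].sum() + q_minus.sum() / n_minus   # value of Eq. (6)
        if val > best_val:
            u = -np.ones(k)
            u[top] = +1.0
            best_val, best_u = val, u
    return best_u, 1.0 - best_val                 # the column and its empirical loss
```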

While the setting discussed in this section is somewhat superficial, these results underscore the difficulty of the problem. We next show that the problem of finding a good output code given a relatively large set of classifiers $\bar{h}(\bar{x})$ is intractable. We would like to note in passing that an efficient algorithm for finding a single column might be useful in other settings, for instance in building trees or directed acyclic graphs for multiclass problems (cf. [16]). We leave this for future research.

4 Finding a general discrete output code

In this section we prove that given a set of $l$ binary classifiers $\bar{h}(\bar{x})$, finding a code matrix which minimizes the empirical loss $\epsilon_S(M,\bar{h})$ is NP-complete. Given a sample $S=\{(\bar{x}_1,y_1),\ldots,(\bar{x}_m,y_m)\}$ and a set of classifiers $\bar{h}$, let us denote by $\tilde{S}=\{(\bar{h}_1,y_1),\ldots,(\bar{h}_m,y_m)\}$ the evaluation of $\bar{h}$ on the sample $S$, where $\bar{h}_i \stackrel{\mathrm{def}}{=} \bar{h}(\bar{x}_i)$. We now show that the problem is NP-complete even when $k=2$ and $K(\bar{u},\bar{v})=\bar{u}\cdot\bar{v}$. (Clearly, the problem remains NPC for $k>2$.) Following the notation of the previous sections, the output code matrix is composed of two rows $\bar{M}_1$ and $\bar{M}_2$, and the predicted class for an instance $\bar{x}_i$ is $H(\bar{x}_i)=\arg\max_{r\in\{1,2\}}\bar{M}_r\cdot\bar{h}_i$. For the simplicity of the presentation of the proof, we assume that both the code $M$ and the hypotheses' values $\bar{h}_i$ are over the set $\{0,1\}$ (instead of $\{-1,+1\}$). This assumption does not change the problem since there is a linear transformation between the two sets.

Theorem 2  The following decision problem is NP-complete. Input: a natural number $q$ and a labeled sample $\tilde{S}=\{(\bar{h}_1,y_1),\ldots,(\bar{h}_m,y_m)\}$, where $\bar{h}_i\in\{0,1\}^l$ and $y_i\in\{1,2\}$. Question: does there exist a matrix $M\in\{0,1\}^{2\times l}$ such that the classifier $H(\bar{x})$ based on the output code $M$ makes at most $q$ mistakes on $\tilde{S}$?

Proof: Our proof is based on a reduction technique introduced by Höffgen and Simon [12]. First, note that the problem is in NP, as we can pick a code matrix $M$ and check the number of classification errors it induces in polynomial time. To prove that the problem is NP-hard we show a reduction from Vertex Cover. Given an undirected graph $G=(V,E)$, we code the structure of the graph as follows.

The sample $\tilde{S}$ is composed of two subsets, $\tilde{S}_E$ and $\tilde{S}_V$, of size $2|E|$ and $|V|$ respectively, and we set $l=|V|+2$, that is, one coordinate per node of $G$ plus two auxiliary coordinates. Each edge $(v_i,v_j)\in E$ is encoded by two examples in $\tilde{S}_E$: the first vector has non-zero entries only in coordinate $i$, coordinate $j$ and the last coordinate, and the second vector has non-zero entries only in coordinate $i$, coordinate $j$ and the next-to-last coordinate. The label of each example in $\tilde{S}_E$ is $1$ (first class). Each node $v_i\in V$ is encoded by a single example in $\tilde{S}_V$ whose non-zero entries are coordinate $i$ and the last coordinate, and the label of each example in $\tilde{S}_V$ is $2$ (second class).

We now show that there exists a vertex cover $V'\subseteq V$ with at most $q$ nodes if and only if there exists a coding matrix $M\in\{0,1\}^{2\times l}$ that induces at most $q$ classification errors on the sample $\tilde{S}$.

($\Leftarrow$): Let $V'\subseteq V$ be a vertex cover with $|V'|\le q$. We show that there exists a code which makes at most $q$ mistakes on $\tilde{S}$. Let $\bar{u}\in\{0,1\}^{|V|}$ be the characteristic vector of $V'$, that is, $u_i=1$ if $v_i\in V'$ and $u_i=0$ otherwise. Define the output code matrix by building $\bar{M}_1$ from $\bar{u}$ (together with the two auxiliary coordinates) and $\bar{M}_2$ from $\bar{\neg u}$, where $\neg$ denotes the component-wise logical not operator. Since $V'$ is a cover, every edge has at least one endpoint in $V'$, and computing the inner products shows that for each $\bar{h}\in\tilde{S}_E$,

$$\bar{M}_1\cdot\bar{h} \;>\; \bar{M}_2\cdot\bar{h} .$$

Therefore, for all the examples in $\tilde{S}_E$ the predicted label equals the true label and we suffer no errors on these examples. For each example $\bar{h}\in\tilde{S}_V$ that corresponds to a node $v\in V'$ we have

$$\bar{M}_1\cdot\bar{h} \;\ge\; \bar{M}_2\cdot\bar{h} ,$$

and therefore these examples are misclassified (recall that the label of each example in $\tilde{S}_V$ is $2$). Analogously, for each example in $\tilde{S}_V$ which corresponds to a node $v\notin V'$ we get

$$\bar{M}_1\cdot\bar{h} \;<\; \bar{M}_2\cdot\bar{h} ,$$

and these examples are correctly classified. We have thus shown that the total number of mistakes according to $M$ is $|V'|\le q$.

($\Rightarrow$): Let $M$ be a code which achieves at most $q$ mistakes on $\tilde{S}$. We construct a subset $V'\subseteq V$ as follows. We scan $\tilde{S}_V$ and add to $V'$ all vertices $v_i$ corresponding to misclassified examples from $\tilde{S}_V$. Similarly, for each misclassified example from $\tilde{S}_E$ corresponding to an edge $(v_i,v_j)$, we pick either $v_i$ or $v_j$ at random and add it to $V'$. Since we have at most $q$ misclassified examples in $\tilde{S}$, the size of $V'$ is at most $q$. We claim that the set $V'$ is a vertex cover of the graph $G$. Assume by contradiction that there is an edge $(v_i,v_j)$ for which neither $v_i$ nor $v_j$ belongs to the set $V'$. Therefore, by construction, the two examples in $\tilde{S}_V$ corresponding to the vertices $v_i$ and $v_j$ are classified correctly, which gives two inequalities of the form

$$\bar{M}_1\cdot\bar{h} \;<\; \bar{M}_2\cdot\bar{h}$$

for the two node examples. Summing them and expanding the inner products coordinate-wise yields a linear inequality in the entries of $M$ indexed by $i$, $j$ and the auxiliary coordinates. (7)

In addition, the two examples in $\tilde{S}_E$ corresponding to the edge $(v_i,v_j)$ are classified correctly, implying two inequalities of the opposite form,

$$\bar{M}_1\cdot\bar{h} \;\ge\; \bar{M}_2\cdot\bar{h} ,$$

for the two edge examples; summing these and expanding them coordinate-wise yields a second linear inequality in the same entries of $M$. (8)

Comparing Eqs. (7) and (8) we get a contradiction, since the two summed inequalities cannot hold simultaneously. Hence every edge has an endpoint in $V'$, so $V'$ is a vertex cover of size at most $q$, which completes the proof.

5 Continuous codes

The intractability results of the previous sections motivate a relaxation of output codes. In this section we describe a natural relaxation where both the classifiers' output and the code matrix are over the reals. As before, the classifier $H(\bar{x})$ is constructed from a code matrix $M$ and a set of binary classifiers $\bar{h}(\bar{x})$. The matrix $M$ is of size $k\times l$ over $\mathbb{R}$, where each row of $M$ corresponds to a class $y\in\mathcal{Y}$. Analogously, each binary classifier $h_t\in\mathcal{H}$ is a mapping $h_t:\mathcal{X}\rightarrow\mathbb{R}$. A column $t$ of $M$ defines a partition of $\mathcal{Y}$ into two disjoint sets: the sign of each element of the $t$th column is interpreted as the set ($+1$ or $-1$) to which the class $r$ belongs, and the magnitude $|M_{r,t}|$ is interpreted as the confidence in the associated partition. Similarly, we interpret the sign of $h_t(\bar{x})$ as the prediction of the set ($+1$ or $-1$) to which the label of the instance $\bar{x}$ belongs, and the magnitude $|h_t(\bar{x})|$ as the confidence of this prediction. Given an instance $\bar{x}$, the classifier $H(\bar{x})$ predicts the label $y$ which maximizes the confidence function $K(\bar{h}(\bar{x}),\bar{M}_r)$, that is, $H(\bar{x})=\arg\max_r K(\bar{h}(\bar{x}),\bar{M}_r)$. In contrast to discrete codes, we can assume here without loss of generality that exactly one class attains the maximum value according to the function $K$.

We will concentrate on the problem of finding a good continuous code given a set of binary classifiers $\bar{h}$. The approach we take is to cast the code design problem as a constrained optimization problem. Borrowing the idea of soft margin [6], we replace the discrete 0-1 multiclass loss with the linear bound

$$\max_r\left[K(\bar{h}(\bar{x}_i),\bar{M}_r)+1-\delta_{y_i,r}\right] - K(\bar{h}(\bar{x}_i),\bar{M}_{y_i}) . \qquad (9)$$
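As a concrete reading of Eq. (9), the following minimal Python sketch (illustrative only; the code matrix and prediction vector are toy placeholders, and $K$ is taken to be the dot product) computes the per-example bound:

```python
import numpy as np

def multiclass_loss_bound(M, h_x, y_i):
    """Eq. (9): max_r [K(h(x_i), M_r) + 1 - delta(y_i, r)] - K(h(x_i), M_{y_i})."""
    conf = M @ h_x                    # confidences K(h(x_i), M_r) for all rows r
    margins = conf + 1.0
    margins[y_i] -= 1.0               # subtract delta(y_i, r) for r = y_i
    return margins.max() - conf[y_i]  # zero iff the correct class beats all others by >= 1

# A real-valued code with k = 3 rows and l = 2 columns, and one prediction vector.
M = np.array([[ 1.5, -0.5],
              [-1.0,  1.0],
              [-0.5, -0.5]])
print(multiclass_loss_bound(M, np.array([0.8, -0.3]), y_i=0))
```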

This formulation is also motivated by the generalization analysis of Schapire et al. [2]. The analysis they give is based on the margin of examples, where the margin is closely related to the definition of the loss as given by Eq. (9). Put another way, the correct label should have a confidence value which is larger by at least one than any of the confidences for the rest of the labels. Otherwise, we suffer a loss which is linearly proportional to the difference between the confidence of the correct label and the maximum among the confidences of the other labels. The bound on the empirical loss is then

$$\epsilon_S(M,\bar{h}) = \frac{1}{m}\sum_{i=1}^{m}[[\,H(\bar{x}_i)\neq y_i\,]] \;\le\; \frac{1}{m}\sum_{i=1}^{m}\left\{\max_r\left[K(\bar{h}(\bar{x}_i),\bar{M}_r)+1-\delta_{y_i,r}\right]-K(\bar{h}(\bar{x}_i),\bar{M}_{y_i})\right\} , \qquad (10)$$

where $\delta_{y_i,r}$ equals $1$ if $y_i=r$ and $0$ otherwise. We say that a sample $S$ is classified correctly using a set of binary classifiers $\bar{h}$ if there exists a matrix $M$ such that the above loss is equal to zero,

$$\forall i:\quad \max_r\left[K(\bar{h}(\bar{x}_i),\bar{M}_r)+1-\delta_{y_i,r}\right]-K(\bar{h}(\bar{x}_i),\bar{M}_{y_i}) = 0 . \qquad (11)$$

Denote by $b_{i,r}=1-\delta_{y_i,r}$. A matrix $M$ that satisfies Eq. (11) also satisfies the following constraints,

$$\forall i,r:\quad K(\bar{h}(\bar{x}_i),\bar{M}_{y_i}) - K(\bar{h}(\bar{x}_i),\bar{M}_r) \;\ge\; b_{i,r} . \qquad (12)$$
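To see why the right-hand side of Eq. (10) upper-bounds the empirical error, note that each summand is non-negative, since taking $r=y_i$ inside the maximum already gives at least $K(\bar{h}(\bar{x}_i),\bar{M}_{y_i})$. Moreover, if $H(\bar{x}_i)=r^\star\neq y_i$, then

$$\max_r\left[K(\bar{h}(\bar{x}_i),\bar{M}_r)+1-\delta_{y_i,r}\right] \;\ge\; K(\bar{h}(\bar{x}_i),\bar{M}_{r^\star})+1 \;\ge\; K(\bar{h}(\bar{x}_i),\bar{M}_{y_i})+1 ,$$

so the summand is at least $1$ whenever the example is misclassified.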

Motivated by [21, 2] we seek a matrix $M$ with a small norm which satisfies Eq. (12). Thus, when the entire sample can be labeled correctly, the problem of finding a good matrix $M$ can be stated as the following optimization problem,

$$\min_{M}\ \|M\|_p \qquad \text{subject to:}\quad \forall i,r:\ K(\bar{h}(\bar{x}_i),\bar{M}_{y_i})-K(\bar{h}(\bar{x}_i),\bar{M}_r) \ge b_{i,r} , \qquad (13)$$

where $p$ is an integer. Note that the $m$ constraints for $r=y_i$ are automatically satisfied; this changes in the following derivation for the non-separable case. In the general case a matrix $M$ which classifies all the examples correctly might not exist. We therefore introduce slack variables $\xi_i\ge 0$ and replace the "hard" constraints of Eq. (13) with "soft" constraints,

$$\min_{M,\bar{\xi}}\ \beta\|M\|_p + \sum_{i=1}^{m}\xi_i \qquad \text{subject to:}\quad \forall i,r:\ K(\bar{h}(\bar{x}_i),\bar{M}_{y_i})-K(\bar{h}(\bar{x}_i),\bar{M}_r) \ge b_{i,r}-\xi_i , \qquad (14)$$

for some constant $\beta>0$. The relation between the "hard" and the "soft" constraints and their formal properties is beyond the scope of this paper; for further discussion on the relation between the two problems see [21].

5.1 Design of continuous codes using linear programming

We now further develop Eq. (14) for the cases $p=1,2,\infty$. We deal first with the cases $p=1$ and $p=\infty$, which result in linear programs. For simplicity of presentation we assume that $K(\bar{u},\bar{v})=\bar{u}\cdot\bar{v}$.

For the case $p=1$ the objective function of Eq. (14) becomes $\beta\sum_{r,t}|M_{r,t}|+\sum_i\xi_i$. We introduce a set of auxiliary variables $\alpha_{r,t}\ge|M_{r,t}|$ to get a standard linear programming setting,

$$\min_{M,\bar{\alpha},\bar{\xi}}\ \beta\sum_{r,t}\alpha_{r,t}+\sum_{i}\xi_i \qquad \text{subject to:}\quad \forall i,r:\ (\bar{M}_{y_i}-\bar{M}_r)\cdot\bar{h}(\bar{x}_i) \ge b_{i,r}-\xi_i ;\qquad \forall r,t:\ -\alpha_{r,t}\le M_{r,t}\le\alpha_{r,t} ;\qquad \forall i:\ \xi_i\ge 0 .$$

To obtain its dual program (see also App. A) we define one variable for each constraint of the primal problem: a variable $\eta_{i,r}$ for each of the first set of constraints, and a pair of non-negative variables for each pair of constraints on $M_{r,t}$ and $\alpha_{r,t}$. The case of $p=\infty$ is similar. The objective function of Eq. (14) becomes $\beta\max_{r,t}|M_{r,t}|+\sum_i\xi_i$, and we introduce a single new variable $\alpha=\max_{r,t}|M_{r,t}|$ to obtain the primal problem

$$\min_{M,\alpha,\bar{\xi}}\ \beta\alpha+\sum_{i}\xi_i \qquad \text{subject to:}\quad \forall i,r:\ (\bar{M}_{y_i}-\bar{M}_r)\cdot\bar{h}(\bar{x}_i) \ge b_{i,r}-\xi_i ;\qquad \forall r,t:\ -\alpha\le M_{r,t}\le\alpha ;\qquad \forall i:\ \xi_i\ge 0 .$$

Following the same technique as for $p=1$, the dual program is obtained by introducing one dual variable per primal constraint. Both resulting programs (for $p=1$ and $p=\infty$) can now be solved using standard linear programming packages.

5.2 Design of continuous codes using quadratic programming

We now discuss in detail Eq. (14) for the case $p=2$. For convenience we use the square of the norm of the matrix (instead of the norm itself). The primal program therefore becomes

$$\min_{M,\bar{\xi}}\ \frac{1}{2}\beta\|M\|_2^2+\sum_{i=1}^{m}\xi_i \qquad \text{subject to:}\quad \forall i,r:\ \bar{M}_{y_i}\cdot\bar{h}(\bar{x}_i)-\bar{M}_r\cdot\bar{h}(\bar{x}_i) \ge b_{i,r}-\xi_i . \qquad (15)$$

We solve the optimization problem by finding a saddle point of the Lagrangian,

$$\mathcal{L}(M,\bar{\xi},\eta) = \frac{1}{2}\beta\sum_r\|\bar{M}_r\|^2 + \sum_i\xi_i + \sum_{i,r}\eta_{i,r}\left[\bar{M}_r\cdot\bar{h}(\bar{x}_i)-\bar{M}_{y_i}\cdot\bar{h}(\bar{x}_i)+b_{i,r}-\xi_i\right] ,\qquad \eta_{i,r}\ge 0 . \qquad (16)$$

The saddle point we are seeking is a minimum for the primal variables ($M,\bar{\xi}$) and a maximum for the dual ones ($\eta$). To find the minimum over the primal variables we require

$$\frac{\partial\mathcal{L}}{\partial\xi_i}=1-\sum_r\eta_{i,r}=0 \quad\Longrightarrow\quad \sum_r\eta_{i,r}=1 . \qquad (17)$$

Similarly, for $\bar{M}_r$ we require

$$\frac{\partial\mathcal{L}}{\partial\bar{M}_r} = \beta\bar{M}_r + \sum_i\eta_{i,r}\,\bar{h}(\bar{x}_i) - \sum_i\delta_{y_i,r}\Big(\sum_{r'}\eta_{i,r'}\Big)\bar{h}(\bar{x}_i) = 0 , \qquad (18)$$

which, using Eq. (17), gives

$$\bar{M}_r = \beta^{-1}\sum_i\left(\delta_{y_i,r}-\eta_{i,r}\right)\bar{h}(\bar{x}_i) . \qquad (19)$$

Eq. (19) implies that when the optimum of the objective function is achieved, each row of the matrix $M$ is a linear combination of $\bar{h}(\bar{x}_i)$. We say that an example $i$ is a support pattern for class $r$ if the coefficient $\delta_{y_i,r}-\eta_{i,r}$ of $\bar{h}(\bar{x}_i)$ in Eq. (19) is not zero. There are two settings in which an example can be a support pattern for class $r$. The first case is when the label $y_i$ of an example is equal to $r$; then the $i$th example is a support pattern if $\eta_{i,r}<1$. The second case is when the label $y_i$ of the example is different from $r$; then the $i$th pattern is a support pattern if $\eta_{i,r}>0$. Loosely speaking, since for all $i$ and $r$ we have $\eta_{i,r}\ge 0$ and $\sum_r\eta_{i,r}=1$, the vector $\bar{\eta}_i$ can be viewed as a distribution over the labels for each example. An example $i$ affects the solution for $M$ (Eq. (19)) if and only if $\bar{\eta}_i$ is not a point distribution concentrating on the correct label $y_i$. Thus, only the questionable patterns contribute to the learning process.

We now develop the Lagrangian using only the dual variables. Substituting Eqs. (17) and (19) into Eq. (16) and using various algebraic manipulations, we obtain that the target function of the dual program is

$$Q(\eta) = -\frac{1}{2\beta}\sum_{i,j}\left[\bar{h}(\bar{x}_i)\cdot\bar{h}(\bar{x}_j)\right]\left[(\bar{1}_{y_i}-\bar{\eta}_i)\cdot(\bar{1}_{y_j}-\bar{\eta}_j)\right] + \sum_{i,r}\eta_{i,r}\,b_{i,r} , \qquad (20)$$

where $\bar{1}_r$ denotes the vector with all components zero except for the $r$th component which is equal to one, and $\bar{1}$ denotes the vector whose components are all one. It is easy to verify that $Q$ is concave in $\eta$; since the constraints are linear, the above problem is a convex program and QP methods can be used to solve it. In Sec. 6 we describe a memory efficient algorithm for solving this special QP problem. To simplify the equations we denote by $\bar{\tau}_i=\bar{1}_{y_i}-\bar{\eta}_i$ the difference between the point distribution concentrating on the correct label and the distribution obtained by the optimization problem. With this notation Eq. (19) becomes

$$\bar{M}_r = \beta^{-1}\sum_i\tau_{i,r}\,\bar{h}(\bar{x}_i) . \qquad (21)$$

Since we look for the values of the variables which maximize the objective function (and not for the optimal value of $Q$ itself), we can omit constants and, using the vector notation above, write the dual problem given by Eq. (20) as (details are omitted due to the lack of space)

$$\max_{\tau}\ Q(\tau) = -\frac{1}{2\beta}\sum_{i,j}\left[\bar{h}(\bar{x}_i)\cdot\bar{h}(\bar{x}_j)\right](\bar{\tau}_i\cdot\bar{\tau}_j) + \sum_i\bar{\tau}_i\cdot\bar{1}_{y_i} \qquad \text{subject to:}\quad \forall i:\ \bar{\tau}_i\le\bar{1}_{y_i}\ \ \text{and}\ \ \bar{\tau}_i\cdot\bar{1}=0 . \qquad (22)$$

Finally, the classifier $H(\bar{x})$ can be written in terms of the variables $\tau$ as

$$H(\bar{x}) = \arg\max_r\left[\bar{M}_r\cdot\bar{h}(\bar{x})\right] = \arg\max_r\left[\sum_i\tau_{i,r}\,\bar{h}(\bar{x}_i)\cdot\bar{h}(\bar{x})\right] , \qquad (23)$$

since the positive factor $\beta^{-1}$ does not affect the $\arg\max$. As in support vector machines, the dual program and the classification algorithm depend only on inner products of the form $\bar{h}(\bar{x}_i)\cdot\bar{h}(\bar{x})$. Therefore, we can perform the calculations in some high dimensional inner-product space $\mathcal{Z}$ using a transformation $\phi:\mathbb{R}^l\rightarrow\mathcal{Z}$. We thus replace the inner-product in Eq. (22) and in Eq. (23) with a general inner-product kernel $K$ that satisfies Mercer's conditions [21]. The general dual program is therefore

$$\max_{\tau}\ Q(\tau) = -\frac{1}{2\beta}\sum_{i,j}K\!\left(\bar{h}(\bar{x}_i),\bar{h}(\bar{x}_j)\right)(\bar{\tau}_i\cdot\bar{\tau}_j) + \sum_i\bar{\tau}_i\cdot\bar{1}_{y_i} \qquad \text{subject to:}\quad \forall i:\ \bar{\tau}_i\le\bar{1}_{y_i}\ \ \text{and}\ \ \bar{\tau}_i\cdot\bar{1}=0 , \qquad (24)$$

and the classification rule $H(\bar{x})$ becomes

$$H(\bar{x}) = \arg\max_r\left[\sum_i\tau_{i,r}\,K\!\left(\bar{h}(\bar{x}_i),\bar{h}(\bar{x})\right)\right] . \qquad (25)$$

The general framework for designing output codes using the QP program described above also provides, as a special case, a new algorithm for building multiclass support vector machines. Assume that the instance space is the vector space $\mathbb{R}^n$ and define $\bar{h}(\bar{x})=\bar{x}$ (thus $l=n$). Then the primal program in Eq. (15) becomes

$$\min_{M,\bar{\xi}}\ \frac{1}{2}\beta\|M\|_2^2+\sum_{i=1}^{m}\xi_i \qquad \text{subject to:}\quad \forall i,r:\ \bar{M}_{y_i}\cdot\bar{x}_i-\bar{M}_r\cdot\bar{x}_i \ge b_{i,r}-\xi_i . \qquad (26)$$

Note that for $k=2$ Eq. (26) reduces to the primal program of SVM if we take $\bar{M}_1=-\bar{M}_2$. We would also like to note that this special case is reminiscent of the multiclass approach for SVMs suggested by Weston and Watkins [22]. Their approach compares the confidence $K(\bar{x},\bar{M}_{y_i})$ to the confidences of all the other labels $K(\bar{x},\bar{M}_r)$ and has $m(k-1)$ slack variables in the primal problem. In contrast, in our framework the confidence $K(\bar{x},\bar{M}_{y_i})$ is compared to $\max_{r\neq y_i}K(\bar{x},\bar{M}_r)$ and there are only $m$ slack variables in the primal program.

In Table 1 we summarize the properties of the programs discussed above. As shown in the table, the advantage of using $l_2$ in the objective function is that the number of variables in the dual problem is only a function of $k$ and $m$ and does not depend on the number of columns $l$ of $M$; the number of columns of $M$ only affects the evaluation of the inner-product kernel $K$.

Table 1: Summary of the sizes of the optimization problems for the different norms. (See Appendix A for the definitions of the constraint types used in linear programming.)

    Norm         Primal variables    Primal constraints    Dual variables    Dual constraints
    $l_1$        $2kl+m$             $mk+2kl$              $mk+2kl$          $2kl+m$
    $l_2$        $kl+m$              $mk$                  $mk$              $mk+m$
    $l_\infty$   $kl+m+1$            $mk+2kl$              $mk+2kl$          $kl+m+1$

The formalism given by Eq. (14) can also be used to construct the code matrix incrementally (column by column). We now outline this incremental (inductive) approach; note, however, that it only applies when $K(\bar{u},\bar{v})=\bar{u}\cdot\bar{v}$. In the first step of the incremental algorithm we are given a single binary classifier $h_1(\bar{x})$ and we need to construct the first column of $M$. Rewriting Eq. (14) in scalar form for a single column gives

$$\min_{\bar{M}^{1},\bar{\xi}}\ \beta\|\bar{M}^{1}\|_p+\sum_i\xi_i \qquad \text{subject to:}\quad \forall i,r:\ \left(M_{y_i,1}-M_{r,1}\right)h_1(\bar{x}_i) \ge b_{i,r}-\xi_i , \qquad (27)$$

where $\beta>0$ is a given constant and $b_{i,r}=1-\delta_{y_i,r}$ as before. For the remaining columns we assume inductively that $h_1(\bar{x}),\ldots,h_t(\bar{x})$ and the first $t$ columns of the matrix $M$ have been found, and that a new binary classifier $h_{t+1}(\bar{x})$ is provided for the next column. Substituting the new classifier and the previously found columns into the constraints of Eq. (14), the constraints on the new column take the form of Eq. (27), where we now redefine

$$b_{i,r} = 1-\delta_{y_i,r} - \sum_{s=1}^{t}\left(M_{y_i,s}-M_{r,s}\right)h_s(\bar{x}_i) .$$

It is straightforward to verify that this definition of $b_{i,r}$ results in a program of the same form as Eq. (27), so the same algorithms designed for the "batch" case can be applied. In the case of $l_1$ and $l_\infty$ this construction decomposes a single problem into $l$ sub-problems with fewer variables and constraints. However, for $l_2$ the size of the program remains the same while we lose the ability to use kernels. We therefore concentrate on the batch case, for which we need to find the entire matrix at once.

6 An efficient algorithm for the QP problem

The quadratic program presented in Eq. (24) can be solved using standard QP techniques. As shown in Table 1, the dual program depends on $mk$ variables and has $mk+m$ constraints all together. Converting the dual program in Eq. (24) to a standard QP form requires storing and manipulating a matrix with $(mk)^2$ elements. Clearly, this would prohibit applications of non-trivial size. We now introduce a memory efficient algorithm for solving the quadratic optimization problem given by Eq. (24).

First, note that the constraints in Eq. (24) can be divided into $m$ disjoint subsets $\{\bar{\tau}_i\le\bar{1}_{y_i},\ \bar{\tau}_i\cdot\bar{1}=0\}$, $i=1,\ldots,m$. The algorithm we describe works in rounds. On each round it picks a single index $p$ and modifies $\bar{\tau}_p$ so as to optimize the corresponding reduced optimization problem. The algorithm is reminiscent of Platt's SMO algorithm [15]. Note, however, that our algorithm optimizes the variables of one example on each round, and not of two examples as in SMO. Let us fix an example index $p$ and write the objective function only in terms of the variables $\bar{\tau}_p$. For brevity, let $K_{i,j}=K(\bar{h}(\bar{x}_i),\bar{h}(\bar{x}_j))$. Isolating $\bar{\tau}_p$ in $Q$ we obtain

$$Q(\bar{\tau}_p) = -\frac{1}{2}A_p\,(\bar{\tau}_p\cdot\bar{\tau}_p) - \bar{B}_p\cdot\bar{\tau}_p + C_p , \qquad (28)$$

where $A_p=K_{p,p}/\beta$, the constant $C_p$ collects the terms that do not depend on $\bar{\tau}_p$, and

$$\bar{B}_p = \frac{1}{\beta}\sum_{i\neq p}K_{i,p}\,\bar{\tau}_i - \bar{1}_{y_p} . \qquad (31)$$

For brevity, we omit the index $p$ and drop the constant (which does not affect the solution). The reduced optimization problem has $k$ variables and $k+1$ constraints,

$$\max_{\bar{\tau}}\ -\frac{1}{2}A\,(\bar{\tau}\cdot\bar{\tau}) - \bar{B}\cdot\bar{\tau} \qquad \text{subject to:}\quad \bar{\tau}\le\bar{1}_{y}\ \ \text{and}\ \ \bar{\tau}\cdot\bar{1}=0 . \qquad (32)$$

Although this program can be solved using a standard QP technique, such a solution still requires a large amount of memory when $k$ is large, and a straightforward solution is also time consuming. Furthermore, this problem constitutes the core and inner loop of the algorithm. We therefore develop a more efficient method for solving Eq. (32). Since $A>0$, completing the objective of Eq. (32) to a quadratic form and substituting $\bar{\nu}=\bar{\tau}+\bar{B}/A$ shows that Eq. (32) is equivalent to

$$\min_{\bar{\nu}}\ \|\bar{\nu}\|^2 \qquad \text{subject to:}\quad \bar{\nu}\le\bar{D}\ \ \text{and}\ \ \bar{\nu}\cdot\bar{1}=\bar{D}\cdot\bar{1}-1 , \qquad (33)$$

where $\bar{D}=\bar{1}_{y}+\bar{B}/A$. In Sec. 6.1 we discuss an analytic solution to Eq. (33) and in Sec. 6.2 we describe a time efficient algorithm for computing that solution.

6.1 An analytic solution

While the algorithmic solution we describe in this section is simple to implement and efficient, its derivation is quite complex. Before describing the analytic solution to Eq. (33) we give some intuition on our method. First note that $\bar{\nu}=\bar{D}$ is not a feasible point, since the equality constraint $\bar{\nu}\cdot\bar{1}=\bar{D}\cdot\bar{1}-1$ is then not satisfied. Hence, for any feasible point some of the constraints $\nu_r\le D_r$ are not tight, and the differences between the bounds $D_r$ and the variables $\nu_r$ sum to one. Second, if we induce a uniform distribution over the components of $\bar{\nu}$, then the variance of $\bar{\nu}$ is

$$V[\bar{\nu}] = \frac{1}{k}\|\bar{\nu}\|^2 - \left(\frac{1}{k}\sum_r\nu_r\right)^2 .$$

Since the expectation $\frac{1}{k}\sum_r\nu_r$ is constrained to a given value, the optimal solution of Eq. (33) is the feasible vector achieving the smallest variance; that is, the components of $\bar{\nu}$ should attain values which are as similar as possible, under the inequality constraints $\bar{\nu}\le\bar{D}$. Fig. 1 illustrates this motivation: it shows two feasible points for $\bar{D}=(1.0,0.2,0.6,0.8,0.6)$; in both plots the sum of the lengths of the "arrows" $D_r-\nu_r$ is one, and the point whose components are as equal as possible attains the smaller norm. We exploit this observation and seek a feasible vector $\bar{\nu}$ most of whose components are equal to some threshold $\theta$. Given $\theta$, we define the vector $\bar{\nu}(\theta)$ whose $r$th component equals the minimum between $\theta$ and $D_r$, so that the inequality constraints are satisfied,

$$\nu_r(\theta) = \min\{\theta, D_r\} , \qquad (34)$$

and we denote by

$$F(\theta) = \bar{D}\cdot\bar{1} - \bar{\nu}(\theta)\cdot\bar{1} = \sum_{r=1}^{k}\left(D_r-\min\{\theta,D_r\}\right) = \sum_{r=1}^{k}\max\{0,\,D_r-\theta\} . \qquad (35)$$

Figure 1: An illustration of two feasible points for the reduced optimization problem with $\bar{D}=(1.0,0.2,0.6,0.8,0.6)$. The x-axis is the index $r$ and the y-axis denotes the values of the components of $\bar{\nu}$. The first point is $\bar{\nu}=(0.0,0.2,0.6,0.8,0.6)$ with $\|\bar{\nu}\|^2=1.4$; the second is $\bar{\nu}=(0.5,0.2,0.5,0.5,0.5)$ with $\|\bar{\nu}\|^2=1.04$. The second point has a smaller variance, hence it achieves a better value for $Q$.

Figure 2: An illustration of the solution of the reduced QP problem using the inverse of $F$ for $\bar{D}=(1.0,0.2,0.6,0.8,0.6)$. The optimal value of $\theta$ is the solution of the equation $F(\theta)=1$, which here is $\theta=0.5$.

The vector $\bar{\nu}(\theta)$ is feasible whenever $F(\theta)=1$. Let us assume without loss of generality that the components of the vector $\bar{D}$ are given in descending order, $D_1\ge D_2\ge\cdots\ge D_k$ (this can be achieved in $O(k\log k)$ time by sorting), and define $D_{k+1}=-\infty$. To prove the main theorem of this section we need the following lemma.

Lemma 3  $F(\theta)$ is piecewise linear with slope $-r$ in each interval $(D_{r+1},D_r]$, $r=1,\ldots,k$, and $F(\theta)=0$ for $\theta\ge D_1$.

Proof: For $\theta\in(D_{r+1},D_r]$ exactly the $r$ largest components of $\bar{D}$ exceed $\theta$ and hence $F(\theta)=\sum_{s=1}^{r}(D_s-\theta)$, which is linear in $\theta$ with slope $-r$. For $\theta\ge D_1$ no component exceeds $\theta$ and the sum is empty. This completes the proof.

Corollary 4  There exists a unique $\theta^\star$ such that $F(\theta^\star)=1$.

Proof: From Eq. (35) and Lemma 3, the function $F$ is continuous, equals zero for $\theta\ge D_1$, is strictly decreasing for $\theta\le D_1$, and tends to infinity as $\theta\rightarrow-\infty$. Therefore $F$ restricted to $(-\infty,D_1]$ is a one-to-one mapping onto $[0,\infty)$, and the value $1$ is attained exactly once.

Theorem 5  Let $\theta^\star$ be the unique solution of $F(\theta)=1$. Then $\bar{\nu}(\theta^\star)$ is the optimum of the optimization problem stated in Eq. (33).

Proof: Corollary 4 implies that $\theta^\star$ exists and is unique, and by the definition of $F$ the vector $\bar{\nu}^\star=\bar{\nu}(\theta^\star)$ is a feasible point of Eq. (33). Let $\bar{\nu}$ be any feasible point and define $\bar{\epsilon}=\bar{\nu}^\star-\bar{\nu}$. Since both $\bar{\nu}$ and $\bar{\nu}^\star$ satisfy the equality constraint of Eq. (33) we have

$$\sum_{r=1}^{k}\epsilon_r=0 . \qquad (36)$$

Now,

$$\|\bar{\nu}\|^2-\|\bar{\nu}^\star\|^2 = \|\bar{\epsilon}\|^2 - 2\,\bar{\epsilon}\cdot\bar{\nu}^\star = \|\bar{\epsilon}\|^2 - 2\!\!\sum_{r:D_r\le\theta^\star}\!\!\epsilon_r D_r - 2\theta^\star\!\!\sum_{r:D_r>\theta^\star}\!\!\epsilon_r .$$

For every $r$ with $D_r\le\theta^\star$ we have $\nu^\star_r=D_r$ and, by feasibility of $\bar{\nu}$, $\nu_r\le D_r$, hence $\epsilon_r\ge 0$. Using Eq. (36), $\sum_{r:D_r>\theta^\star}\epsilon_r=-\sum_{r:D_r\le\theta^\star}\epsilon_r$, and therefore

$$\|\bar{\nu}\|^2-\|\bar{\nu}^\star\|^2 = \|\bar{\epsilon}\|^2 - 2\!\!\sum_{r:D_r\le\theta^\star}\!\!\epsilon_r\left(D_r-\theta^\star\right) \;\ge\; \|\bar{\epsilon}\|^2 \;\ge\; 0 ,$$

with equality if and only if $\bar{\epsilon}=\bar{0}$. Hence $\bar{\nu}^\star$ is the optimum of Eq. (33), which completes the proof.

6.2 Computing the analytic solution efficiently

Recall that the function $F$ is linear in each interval $(D_{r+1},D_r]$ with slope $-r$ (Lemma 3). To solve the equation $F(\theta)=1$ we therefore first find the interval which contains $\theta^\star$ and then solve a single linear equation inside it. Define the potential values

$$\Phi_r = F(D_r) = \sum_{s=1}^{r-1}\left(D_s-D_r\right) , \qquad r=1,\ldots,k . \qquad (38)$$

The sequence $\Phi_1=0\le\Phi_2\le\cdots\le\Phi_k$ is non-decreasing and can be computed in a single pass over the sorted components of $\bar{D}$ using the partial sums $S_r=\sum_{s\le r}D_s$. The solution $\theta^\star$ lies in the interval $(D_{r+1},D_r]$ for the smallest index $r$ such that $F(D_{r+1})\ge 1$ (with $F(D_{k+1})=\infty$), and within that interval $F(\theta)=S_r-r\theta$, hence

$$\theta^\star = \frac{S_r-1}{r} . \qquad (39)$$

Given the sorted components and their partial sums, the interval containing $\theta^\star$ can be located by a binary search over the potentials $\Phi_r$, and the memory requirements are linear in the number of classes $k$. Fig. 3 summarizes the resulting procedure, and Fig. 4 gives a skeleton of the complete algorithm, which iterates over the examples and solves one reduced problem per round.

Figure 3: The algorithm for finding the optimal solution of the reduced quadratic program (Eq. (33)).
    Input: $\bar{D}$.
    Sort the components of $\bar{D}$ in descending order and compute the partial sums $S_r$.
    Find the smallest $r$ such that $F(D_{r+1})\ge 1$ (with $D_{k+1}=-\infty$).
    Set $\theta^\star=(S_r-1)/r$ and output $\nu_r=\min\{\theta^\star,D_r\}$ for $r=1,\ldots,k$.

Figure 4: A skeleton of the algorithm for finding a classifier based on an output code by solving the quadratic program defined in Eq. (24).
    Initialize $\tau=0$, a feasible point for Eq. (24).
    Loop: choose an example $p$; compute $A_p$ and $\bar{B}_p$ (Eqs. (28) and (31)) and $\bar{D}=\bar{1}_{y_p}+\bar{B}_p/A_p$;
        solve the reduced problem (Eq. (33)) using the algorithm of Fig. 3; set $\bar{\tau}_p=\bar{\nu}-\bar{B}_p/A_p$.
    Output the final hypothesis given by Eq. (25).

We have performed preliminary experiments with synthetic data in order to check the actual performance of our algorithm. The code matrices we tested have four rows (classes). The examples were generated using the uniform distribution over $[-1,1]\times[-1,1]$, and the domain was partitioned into four quarters of equal size, $[-1,0]\times[-1,0]$, $[-1,0]\times[0,1]$, $[0,1]\times[-1,0]$ and $[0,1]\times[0,1]$, each quarter being associated with a different label. We varied the size of the training set, and for each sample size we ran the algorithm three times, each run using a different randomly generated training set. We compared the standard quadratic optimization routine available in Matlab with our algorithm, which was also implemented in Matlab. The average running time results are shown in Fig. 5; note that a log-scale is used for the run-time axis. The results show that the efficient algorithm can be two orders of magnitude faster than the standard QP package.

Figure 5: Run time comparison of two algorithms for code design using quadratic programming: Matlab's standard QP package and the proposed algorithm (denoted SPOC), as a function of the number of training examples. Note that a logarithmic scale is used for the run-time axis.
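The procedure of Figs. 3 and 4 can be sketched in a few lines of Python. The sketch below is illustrative only: it assumes the dual and update equations as reconstructed above (Eqs. (24)-(33)), uses a precomputed kernel matrix, replaces the stopping criterion by a fixed number of sweeps, and uses variable names chosen here rather than taken from the paper.

```python
import numpy as np

def solve_reduced(D):
    """Analytic solution of Eq. (33): min ||nu||^2 s.t. nu <= D and sum(nu) = sum(D) - 1.
    Returns nu_r = min(theta, D_r) with theta chosen so that F(theta) = 1 (Fig. 3)."""
    d = np.sort(D)[::-1]                  # components in descending order
    S = np.cumsum(d)                      # partial sums S_r
    for r in range(1, len(d) + 1):
        theta = (S[r - 1] - 1.0) / r      # solves S_r - r * theta = 1 on the r-th interval
        if r == len(d) or theta >= d[r]:  # theta falls inside (D_{r+1}, D_r]
            return np.minimum(theta, D)
    raise RuntimeError("unreachable")

def spoc(K, y, k, beta=1.0, epochs=50):
    """Iterative solver for the dual of Eq. (24) in the spirit of Fig. 4.
    K is the m x m kernel matrix K(h(x_i), h(x_j)); y holds labels in {0, ..., k-1}."""
    m = K.shape[0]
    tau = np.zeros((m, k))                # tau = 0 is feasible for Eq. (24)
    for _ in range(epochs):               # fixed number of sweeps instead of a stopping rule
        for p in range(m):
            A = K[p, p] / beta
            B = (K[p] @ tau - K[p, p] * tau[p]) / beta
            B[y[p]] -= 1.0                # B_p of Eq. (31)
            D = np.zeros(k)
            D[y[p]] = 1.0
            D += B / A                    # D = 1_{y_p} + B_p / A_p
            nu = solve_reduced(D)
            tau[p] = nu - B / A           # optimal tau_p for the reduced problem
    return tau

def predict(tau, K_test_train):
    """Classification rule of Eq. (25): H(x) = argmax_r sum_i tau_{i,r} K(h(x_i), h(x))."""
    return np.argmax(K_test_train @ tau, axis=1)
```

The inner routine touches only one row of the kernel matrix per round, which is what keeps the memory footprint linear in the number of classes once the kernel values are supplied on demand.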
7 Generalization properties of the algorithm

In this section we analyze the generalization properties of the multiclass SVM algorithm. We use another scheme for reducing multiclass problems to multiple binary problems, proposed by Platt et al. [16]. This method also contains two stages. In the training stage the set of all $k(k-1)/2$ binary classifiers is constructed, where each classifier is trained to distinguish between some pair of distinct labels. In the classification stage the algorithm maintains a list of all possible labels for a given test instance (initialized to the list of all labels). The algorithm runs in steps: in each step it picks two of the labels from the list and applies the binary classifier that distinguishes between these two labels; the label rejected by the binary classifier is deleted from the list. After $k-1$ such steps only one possible label remains in the list, and it is the prediction of the multiclass classifier. It is convenient to represent such a classifier using a rooted binary directed acyclic graph (DAG). Each node of the graph represents a list of possible labels, which is a subset of all the labels (we call such a list a state). Each node also contains the identity of the binary classifier to be applied to the instance when the state represented by that node is reached. The leaves of the DAG correspond to the $k$ singleton states (where only one label remains in the list). From each internal node there are two outgoing edges, defined as follows: given a node containing the state $(r_1,r_2,\ldots,r_s)$ and the binary classifier that distinguishes between the labels $r_1$ and $r_s$, the two possible states for the next step are $(r_2,\ldots,r_s)$ (if $r_1$ was rejected) and $(r_1,\ldots,r_{s-1})$ (if $r_s$ was rejected). The root of the DAG is the state corresponding to the list of all the labels. In the general case there can be up to $2^k$ states, but it is possible to use only $k(k-1)/2+k$ states, where $k(k-1)/2$ of them are internal nodes (one for each possible binary classifier) and the remaining $k$ are the singleton leaves. The structure of such a classifier is a DAG, since a state can be reached via different paths from the root. The binary classifiers used by Platt et al. were support vector machines.

Given a code matrix $M$ of size $k\times l$ over $\mathbb{R}$, where each row of $M$ corresponds to a class $y\in\mathcal{Y}$, the set of all pairwise binary classifiers can be constructed as follows. Assume we want to build a binary classifier that distinguishes between class $r_1$ and class $r_2$. Recall that the multiclass classifier is given by $H(\bar{x})=\arg\max_r\bar{M}_r\cdot\bar{x}$. Similarly, for the binary case, the correct label is not $r_2$ if $\bar{M}_{r_1}\cdot\bar{x}>\bar{M}_{r_2}\cdot\bar{x}$. Note that such a classifier rules out a label rather than pointing to the correct label, since the correct label may be neither $r_1$ nor $r_2$. Define $\bar{w}_{r_1,r_2}$ to be the (normalized) vector $\bar{M}_{r_1}-\bar{M}_{r_2}$; we then obtain the binary SVM $h_{r_1,r_2}(\bar{x})=\bar{w}_{r_1,r_2}\cdot\bar{x}$, where we interpret a positive output as rejecting the label $r_2$. Define the margin of the binary classifier $h_{r_1,r_2}$ to be $\gamma_{r_1,r_2}=\min_{i:\,y_i\in\{r_1,r_2\}}|\bar{w}_{r_1,r_2}\cdot\bar{x}_i|$, where $y_i$ is the label of the example $\bar{x}_i$. We are now ready to use Theorem 1 of Platt et al. [16].

Theorem 6  Suppose we are able to classify a random sample of $m$ labeled examples using a SVM DAG on $k$ classes containing $K=k(k-1)/2$ decision nodes (and $k$ leaves) with margin $\gamma_{r_1,r_2}$ at the node distinguishing classes $r_1$ and $r_2$. Then we can bound the generalization error, with probability greater than $1-\delta$, by

$$\frac{130\,R^2}{m}\left( D'\log(4em)\log(4m) + \log\frac{2(2m)^{K}}{\delta}\right) ,$$

where $D'=\sum_{r_1<r_2}1/\gamma_{r_1,r_2}^2$ and $R$ is the radius of a ball containing the support of the distribution.
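The label-elimination procedure described above is short to implement. The following sketch (illustrative Python; the tie-handling rule and the toy data are choices made here, not taken from [16] or from the paper) builds the pairwise classifiers from the rows of $M$ and runs the elimination loop:

```python
import numpy as np

def dag_predict(M, x):
    """Label-elimination (DAG) prediction built from the rows of a code matrix M,
    following the pairwise construction described above."""
    labels = list(range(M.shape[0]))      # initial state: all k labels
    while len(labels) > 1:
        r1, r2 = labels[0], labels[-1]    # compare the two extreme labels in the list
        w = M[r1] - M[r2]                 # pairwise classifier w_{r1,r2}
        if w @ x > 0:
            labels.pop()                  # positive output rejects r2
        else:
            labels.pop(0)                 # otherwise reject r1
    return labels[0]

# With this pairwise construction the DAG prediction coincides with argmax_r M_r . x
# whenever there are no ties, since the row with the maximal inner product is never rejected.
M = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, -1.0]])
x = np.array([0.2, 0.9])
assert dag_predict(M, x) == int(np.argmax(M @ x))
```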

We would like to thank Rob Schapire for numerous helpful discussions, to Vladimir Vapnik for his encouragement and support of this line of research, and to Nir Friedman for useful comments and suggestions.

References

[1] D. W. Aha and R. L. Bankert. Cloud classification using error-correcting output codes. In Artificial Intelligence Applications: Natural Science, Agriculture, and Environmental Science, volume 11, pages 13-28, 1997.
[2] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. In Machine Learning: Proceedings of the Seventeenth International Conference, 2000.
[3] A. Berger. Error-correcting output coding for text classification. In IJCAI'99: Workshop on Machine Learning for Information Filtering, 1999.
[4] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth & Brooks, 1984.
[5] V. Chvatal. Linear Programming. Freeman, 1980.
[6] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, September 1995.
[7] Thomas G. Dietterich and Ghulum Bakiri. Achieving high-accuracy text-to-speech with machine learning. In Data Mining in Speech Synthesis, 1999.
[8] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, January 1995.
[9] Tom Dietterich and Eun Bae Kong. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Oregon State University, 1995. Available via the WWW at http://www.cs.orst.edu:80/ tgd/cv/tr.html.
[10] R. Fletcher. Practical Methods of Optimization. John Wiley, second edition, 1987.
[11] Trevor Hastie and Robert Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(1):451-471, 1998.
[12] Klaus-U. Höffgen and Hans-U. Simon. Robust trainability of single neurons. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 428-439, Pittsburgh, Pennsylvania, July 1992.
[13] G. James and T. Hastie. The error coding method and PiCT. Journal of Computational and Graphical Statistics, 7(3):377-387, 1998.
[14] Eun Bae Kong and Thomas G. Dietterich. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning, pages 313-321, 1995.
[15] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.
[16] J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12. MIT Press, 2000. (To appear.)
[17] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[18] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. In David E. Rumelhart and James L. McClelland, editors, Parallel Distributed Processing - Explorations in the Microstructure of Cognition, chapter 8, pages 318-362. MIT Press, 1986.
[19] Robert E. Schapire. Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 313-321, 1997.
[20] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1-40, 1999.
[21] Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[22] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks, April 1999.
A  Linear programming

Using Chvatal's [5] notation, a linear program is given by

$$\max_{\bar{x}}\ \sum_j c_j x_j \qquad \text{subject to:}\quad \sum_j a_{ij}x_j \le b_i\ (i\in I),\quad \sum_j a_{ij}x_j = b_i\ (i\in E),\quad x_j\ge 0\ (j\in R),\quad x_j\ \text{unconstrained}\ (j\in F),$$

and its dual program is

$$\min_{\bar{y}}\ \sum_i b_i y_i \qquad \text{subject to:}\quad \sum_i a_{ij}y_i \ge c_j\ (j\in R),\quad \sum_i a_{ij}y_i = c_j\ (j\in F),\quad y_i\ge 0\ (i\in I),\quad y_i\ \text{unconstrained}\ (i\in E).$$

The non-negativity constraints $x_j\ge 0$ and $y_i\ge 0$ are referred to above as 0-constraints; the remaining variables are called unconstrained variables.

B  Legend

    Var. name        Description                                              Section
    $S$              Sample                                                   2
    $M$              Code matrix                                              2
    $m$              Sample size                                              2
    $k$              No. of classes (no. of rows of $M$)                      2
    $l$              No. of hypotheses (no. of columns of $M$)                2
    $i$              Index of an example                                      2
    $r$              Index of a class                                         2
    $y_i$            Correct label (class)                                    2
    $t$              Index of an hypothesis                                   2
    $\xi_i$          Slack variables in the optimization problems             5
    $\eta_{i,r}$     Dual variables in the quadratic problem                  5.2
    $\bar{\tau}_i$   Dual variables, $\bar{\tau}_i=\bar{1}_{y_i}-\bar{\eta}_i$  5.2
    $A_p$            Coefficient in the reduced optimization problem          6
    $\bar{B}_p$      Coefficient in the reduced optimization problem          6
    $\bar{D}$        Coefficient in the reduced optimization problem          6
    $\bar{\nu}$      Variables of the reduced optimization problem            6
    $\theta$         Threshold in the analytic solution                       6