On Hadamard-Type Output Coding in Multiclass Learning

Aijun Zhang^a, Zhi-Li Wu^b, Chun-Hung Li^b and Kai-Tai Fang^{a,*}

^a Department of Mathematics, ^b Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong

Abstract. The error-correcting output coding (ECOC) method reduces a multiclass learning problem to a series of binary classifiers. In this paper we consider dense ECOC methods, which combine an economical number of base learners. Under the criteria of row separation and column diversity, we suggest the use of Hadamard matrices to design output codes and show that they are better than other codes of the same size. Comparative experiments based on support vector machines are conducted on real datasets from the UCI machine learning repository.

Keywords. Multiclass learning, error-correcting output codes, Hadamard matrix, support vector machines.

1 INTRODUCTION

Many real-world classification problems are polychotomous and require multiclass supervised learning. Examples of multiclass learning applications include optical character recognition, text classification, fingerprint classification, speech categorization, and so on. Since most machine learning algorithms are developed for binary classification, they cannot be used directly for multiclass learning. An alternative approach, called the method of output coding, reduces the multiclass problem to a series of binary classification problems and then combines the binary outputs to make the polychotomous classification. Every output coding method aims to decompose the polychotomy of multiclass labels into multiple dichotomies of binary labels; for each dichotomy a base classifier is trained to make a binary classification. The simplest way is to compare each class against all the others, usually termed the one-per-class (OPC) output code. Another general output code, introduced by [7], predefines a binary code with error-correcting properties such that individual errors made by the binary classifiers can be corrected. Such error-correcting output codes (ECOC) have been widely studied in recent years and used in many multiclass learning applications as discussed above. Unlike the above dense output codes that use the full data to train each base learner, a different approach is suggested by [8] to compare all pairs of classes, i.e., each base learner is trained on only two classes of training data.

* Corresponding author. E-mail: [email protected]

Such pairwise output codes can be viewed as sparse. A unifying framework for both dense and sparse codes is given in [1]. In this paper we restrict ourselves to the dense ECOC framework and consider a new type of output code constructed from Hadamard matrices. In Section 2, a systematic study of ECOC methods is given, including Hamming decoding, goodness assessment and error analysis. In Section 3, we introduce Hadamard-type output codes and show that they are optimal in terms of row separation and column diversity. Finally, we report experimental results on public datasets from the UCI machine learning repository. Support vector machines are employed as our base learners because they are strong binary classifiers with good generalization performance.

2 ERROR-CORRECTING OUTPUT CODING

2.1 ECOC and Hamming Decoding

Suppose S = {(x_i, y_i)}_{i=1}^N is the set of N training examples, where each x_i ∈ X contains the attributes and y_i ∈ Y is categorical. For simplicity, we take the integer set Y = {1, 2, ..., K} to represent the K different classes. In the ECOC (error-correcting output coding) framework [7] for such a supervised learning problem, a multiclass classifier is built by combining multiple (say L) binary learners. Rather than resolving all K classes at once, each base learner makes a binary classification only. The ECOC method consists of two stages: decomposition and reconstruction. Denote an ECOC matrix by M = (M_{kl})_{K×L} with entries from {−1, 1}, which encodes the arrangement of the base learners. In the first stage, the original (multiclass-labeled) training data are partitioned into L pairs of superclasses, i.e., for each column M_{*l} the polychotomy {1, 2, ..., K} is partitioned into a unique two-superclass dichotomy: one superclass is formed by the classes with M_{kl} = 1 and the other by those with M_{kl} = −1. How to design these L decompositions is the main concern of this paper. At the reconstruction stage, there are different schemes to recover the multiclass labels, typically Hamming decoding (i.e., majority vote) [7, 10] and loss-based decoding [1]. The latter is shown to be advantageous for margin-based classifiers, employing different loss functions of the margin. In this paper we mainly discuss how different output codes work with Hamming decoding, but the results extend readily to loss-based decoding. Each class k in the polychotomy is uniquely represented by a row M_{k*} of the ECOC matrix, which is also called a codeword. For any two codewords w, u, their Hamming distance is defined by

$$ d_H(w, u) = \left|\{\, j : w_j \neq u_j,\ 1 \le j \le L \,\}\right|. \qquad (1) $$

In total there are L base classifiers f̂_1, ..., f̂_L, yielded from the different decompositions into two-superclass samples. Given any observation x, the vector of predictions f̂(x) = (f̂_1(x), ..., f̂_L(x)), with components in {−1, 1}, is called the target codeword. We predict x as coming from class k if f̂(x) and M_{k*} are closest in Hamming distance. Quite often more than one codeword in M attains the smallest Hamming distance from f̂(x), so we define the candidate set

$$ C(x) = \left\{\, k : d_H(\hat{f}(x), \mathbf{M}_{k*}) \text{ is minimized} \,\right\}, \qquad (2) $$

and then assign x to each element of C(x) with probability 1/|C(x)|. The above procedure is commonly referred to as Hamming decoding.

Remark 1. We usually assume that both the empirical and generalization error rates of all base learners are below 0.5. In case a subset Q of base learners have empirical (generalization) risks beyond 0.5, we should modify the Hamming distance (1) into

$$ d_H(w, u) = |\{\, j' : w_{j'} = u_{j'},\ j' \in Q \,\}| + |\{\, j : w_j \neq u_j,\ j \notin Q \,\}| $$

before Hamming decoding of the training (testing) data.
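To make the decoding rule concrete, here is a minimal sketch in Python/NumPy (our own illustrative helper, not code from the paper) that computes the candidate set C(x) of (2) and breaks ties uniformly at random.

```python
import numpy as np

def hamming_decode(M, F, rng=None):
    """Hamming decoding of ECOC predictions.

    M : (K, L) output code with entries in {-1, +1}.
    F : (N, L) target codewords, i.e. the binary predictions f_hat(x_i).
    Returns N predicted class indices in {0, ..., K-1}; ties in Hamming
    distance are broken uniformly at random over C(x), as in the text.
    """
    rng = np.random.default_rng(rng)
    # Hamming distances between every target codeword and every row of M.
    dist = (F[:, None, :] != M[None, :, :]).sum(axis=2)     # shape (N, K)
    y_pred = np.empty(len(F), dtype=int)
    for i, d in enumerate(dist):
        candidates = np.flatnonzero(d == d.min())           # the set C(x_i)
        y_pred[i] = rng.choice(candidates)
    return y_pred
```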

2.2 Goodness Assessment

Given an output code, two criteria, row separation and column diversity, are commonly suggested [7, 6] for assessing its goodness.

Row separation. Each codeword should be well separated in Hamming distance from every other codeword.

Column diversity. The base learner for each decomposition should be uncorrelated with the base learners for the other decompositions.

Since each base learner can make wrong decisions, there may be error bits in the target codeword. Under Hamming decoding, a small number of error bits need not result in a wrong multiclass label as long as the target codeword remains closest to the true label. Therefore, large row separation is desired in order to correct as many error bits as possible. Numerically, the focus is commonly placed on searching for codes with a large minimum Hamming distance,

$$ d_{\min} = \min_{1 \le i < k \le K} d_H(w_i, w_k), \qquad (3) $$

which is a common measure of quality for error-correcting codes. Furthermore, we quote a lemma from coding theory [9] that plays a key role in the ECOC framework.

Lemma 1. An ECOC matrix with minimum Hamming distance d_min can correct [(d_min − 1)/2] errors, where [x] denotes the greatest integer not exceeding x.

For example, a code with d_min = 5 can correct up to [(5 − 1)/2] = 2 erroneous binary predictions.

On the other hand, the criterion of column diversity is also essential for a good ECOC method. Based on the same type of binary learner, training data partitioned according to identical (or similar) columns would yield identical (or similar) learned base classifiers, thus resulting in redundancy. Therefore, the base learners

associated with their training data should be independent of each other. Setting aside the behavior of the base learners, which varies with the specific dataset, our concern is to find a good ECOC matrix and to assess how uncorrelated the decompositions (i.e., the columns of the ECOC matrix) are. In some works subsequent to [7] on searching for ECOC matrices, the algorithmic procedures typically maximize the minimum Hamming distance between columns, in the same manner as row separation. However, this does not amount to maximizing column uncorrelation. Instead, we should minimize the column-wise absolute correlations. For example, consider the decompositions v_A = (1, 1, −1, −1)', v_B = (−1, 1, 1, 1)' and v_C = (1, −1, −1, 1)'. Judging by d_H(v_A, v_B) = 3 > d_H(v_A, v_C) = 2, the pair (v_A, v_B) would appear more diverse than (v_A, v_C); judging instead by |Corr(v_A, v_B)| = 1/2 > |Corr(v_A, v_C)| = 0, we obtain the correct ordering, namely that (v_A, v_C) is more diverse than (v_A, v_B). Therefore, we define

$$ s_{\max}(\mathbf{M}) = \max_{1 \le j < l \le L} |\mathrm{Corr}(\mathbf{M}_{*j}, \mathbf{M}_{*l})| \qquad (4) $$

for gauging the worst column diversity in an ECOC, and use the minimax criterion min(s_max) to search for the optimal ECOC with the best column diversity.

Given the parameters K (cardinality of the polychotomy) and L (number of base learners), we now consider the optimality of the dense ECOC M_K(2^L). Denote by Ω(K, L) the set of K × L ECOC matrices with binary entries from {−1, 1}, i.e., Ω(K, L) = {−1, 1}^{K×L}. In terms of the criteria discussed above, we have the following definition of ECOC optimality.

Definition 1. An ECOC M*_K(2^L) is said to be optimal if

$$ \begin{cases} d_{\min}(\mathbf{M}^*) \ge d_{\min}(\mathbf{M}) \\ s_{\max}(\mathbf{M}^*) \le s_{\max}(\mathbf{M}) \end{cases} \quad \text{for all } \mathbf{M} \in \Omega(K, L). \qquad (5) $$

The requirement (5) can be split into two separate conditions: maximum d_min and minimum s_max. The definition of an optimal ECOC could then be relaxed to satisfying either of them, or to satisfying both in a sequential manner. In this paper we require both row separation and column diversity simultaneously, in the strong sense. In the next section we present a series of optimal ECOC matrices obtained from Hadamard designs.
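Both criteria are easy to evaluate numerically. The sketch below (illustrative helpers of our own; |Corr| is taken as the normalized column inner product, consistent with the ±1 example above) computes d_min of (3) and s_max of (4) for a candidate code in Ω(K, L).

```python
import numpy as np
from itertools import combinations

def d_min(M):
    """Minimum pairwise Hamming distance between the rows (codewords) of M."""
    return min((M[i] != M[k]).sum() for i, k in combinations(range(len(M)), 2))

def s_max(M):
    """Largest |Corr| over distinct columns, with Corr(u, v) = u'v / K."""
    K = M.shape[0]
    G = np.abs(M.T @ M) / K       # (L, L) matrix of absolute column correlations
    np.fill_diagonal(G, 0)        # ignore the trivial j == l case
    return G.max()

# Example: the Hadamard design H_4 (normalized H_4 with its all-ones column removed).
H4 = np.array([[ 1,  1,  1],
               [-1,  1, -1],
               [ 1, -1, -1],
               [-1, -1,  1]])
print(d_min(H4), s_max(H4))       # 2 and 0.0: d_min = K/2 and orthogonal columns
```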

2.3 Error Analysis

For each learned base classifier f̂_l(x), denote by ε_l and ρ_l the corresponding empirical and generalization losses, respectively. According to the Hamming decoding reviewed above, it is natural to define the empirical and generalization risks as

$$ R_{\mathrm{emp}}(\mathbf{M}, \hat{f}) = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{\mathbf{1}_{y_i \in C(x_i)}}{|C(x_i)|}\right); \qquad R_{\mathrm{gen}}(\mathbf{M}, \hat{f}) = \mathbb{E}\left(1 - \frac{\mathbf{1}_{y \in C(x)}}{|C(x)|}\right), \qquad (6) $$

where the expectation in R_gen is taken with respect to the distribution of an unseen observation.

Basically, the error-correcting property of ECOC is supported by the lemma quoted above [9]. More precisely, since every codeword differs from every other codeword in at least d_min positions, the Hamming balls of radius [(d_min − 1)/2] around the codewords are disjoint. Therefore, any target codeword with at most [(d_min − 1)/2] error positions can be corrected to the unique class with the smallest Hamming distance. From these facts, we derive the following theorems concerning both the empirical and generalization risks.

Theorem 1. For the ECOC M_K(2^L) with minimum Hamming distance d_min, the empirical risk of multiclass training is bounded above by

$$ R_{\mathrm{emp}}(\mathbf{M}, \hat{f}) \le \min\left\{1,\ \frac{\sum_{j=1}^{L}\varepsilon_j}{[(d_{\min}-1)/2] + 1}\right\}. \qquad (7) $$

Proof. Empirically, for each base learner f̂_j there are N ε_j error bits misclassified in the jth decomposition of the training data with virtual binary labels, so in total N Σ_{j=1}^{L} ε_j error bits are distributed over the N × L virtual binary labels. Symbolically, we can think of these error bits as the nontrivial entries of a sparse matrix E of size N × L. At the reconstruction stage by Hamming decoding, first consider a training example (x_i, y_i) with true label y_i ∈ {1, 2, ..., K}. The ith row E_{i*} of E corresponds to the error bits in the target codeword f̂(x_i). By Lemma 1, a multiclass reconstruction error is corrected whenever the number of error bits in E_{i*} is at most [(d_min − 1)/2], so such error bits are remedied with no increase in R_emp(M, f̂). An increase in R_emp(M, f̂) can occur only if the row has at least [(d_min − 1)/2] + 1 error bits. The upper bound on R_emp(M, f̂) is obtained by considering the worst-case distribution of errors in E, in which the number of problematic rows is maximized, i.e.,

$$ n_{\max} = \left[\frac{N\sum_{j=1}^{L}\varepsilon_j}{[(d_{\min}-1)/2]+1}\right] \le \frac{N\sum_{j=1}^{L}\varepsilon_j}{[(d_{\min}-1)/2]+1}. $$

Finally, by R_emp = n_error/N ≤ n_max/N and the fact that R_emp ≤ 1, we prove the claim. □

Theorem 2. For the ECOC M_K(2^L) with minimum Hamming distance d_min, the generalization risk of multiclass reconstruction satisfies

$$ R_{\mathrm{gen}}(\mathbf{M}, \hat{f}) \le 1 - \sum_{u}\ \prod_{r \in u}\rho_r \prod_{t \in u^c}(1 - \rho_t), \qquad (8) $$

in which u ranges over the subsets of {1, 2, ..., L} with cardinality |u| ≤ [(d_min − 1)/2] and u^c = {1, 2, ..., L} \ u, provided that the base learners associated with the column-diverse ECOC are independent.

Proof. We justify this claim from the opposite direction. For a new observation x, the jth element of its target codeword f̂(x) is an error bit with probability ρ_j. Lemma 1 tells us that any subset u of error bits with cardinality at most [(d_min − 1)/2] can be corrected, so the generalization accuracy is at least Σ_u Π_{r∈u} ρ_r Π_{t∈u^c}(1 − ρ_t), where the sum ranges over all |u| ≤ [(d_min − 1)/2]; this leads to (8). □

For the generalization risk bound, it is straightforward to obtain the following corollary, which is simple to compute and useful when the base learner behaves identically across decompositions. The corollary can be verified by collecting the identical terms in (8).

Corollary 1. Suppose all base learners have the same generalization loss ρ_1 = ... = ρ_L = ρ. For the ECOC M_K(2^L) with minimum Hamming distance d_min,

$$ R_{\mathrm{gen}}(\mathbf{M}, \hat{f}) \le 1 - B_\rho\!\left(L,\ \left[\tfrac{d_{\min}-1}{2}\right]\right), \qquad (9) $$

where B_ρ(L, t) denotes Σ_{j=0}^{t} C(L, j) ρ^j (1 − ρ)^{L−j}, provided that the base learners are independent.
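For a quick numerical check, the bounds of Theorem 1 and Corollary 1 can be evaluated directly from their definitions. The sketch below is our own illustration; the simplification Σ_j ε_j = L·ε assumes a common empirical loss ε across base learners.

```python
from math import comb

def emp_risk_bound(L, d_min, eps):
    """Theorem 1 bound with a common empirical loss eps for all L base learners."""
    t = (d_min - 1) // 2                       # number of correctable error bits
    return min(1.0, L * eps / (t + 1))

def gen_risk_bound(L, d_min, rho):
    """Corollary 1 bound 1 - B_rho(L, t) with t = floor((d_min - 1) / 2)."""
    t = (d_min - 1) // 2
    B = sum(comb(L, j) * rho**j * (1 - rho)**(L - j) for j in range(t + 1))
    return 1 - B

# e.g. L = 15 base learners with d_min = 8 (the Hadamard code for K = 16):
print(emp_risk_bound(15, 8, 0.05), gen_risk_bound(15, 8, 0.15))
```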

3 HADAMARD OUTPUT CODES

A square matrix H_n of order n with entries ±1 is called a Hadamard matrix if H'_n H_n = n I_n, where I_n is the identity matrix of order n. Since H'_n H_n = n I_n is preserved when any row or column of H_n is multiplied by −1, a Hadamard matrix H_n is often written in normalized form, with both the first row and the first column consisting of all +1's. Some examples of Hadamard matrices are shown below:

$$ H_2 = \begin{bmatrix} + & + \\ + & - \end{bmatrix}, \quad H_4 = \begin{bmatrix} + & + & + & + \\ + & - & + & - \\ + & + & - & - \\ + & - & - & + \end{bmatrix}, \quad H_8 = \begin{bmatrix} + & + & + & + & + & + & + & + \\ + & - & + & - & + & - & + & - \\ + & + & - & - & + & + & - & - \\ + & - & - & + & + & - & - & + \\ + & + & + & + & - & - & - & - \\ + & - & + & - & - & + & - & + \\ + & + & - & - & - & - & + & + \\ + & - & - & + & - & + & + & - \end{bmatrix}. $$

There are two main methods for constructing Hadamard matrices: the Sylvester construction and the Paley construction. The former is simple and builds Hadamard matrices of two-power order 2^r by the iteration

$$ H_{2^r} = \begin{bmatrix} H_{2^{r-1}} & H_{2^{r-1}} \\ H_{2^{r-1}} & -H_{2^{r-1}} \end{bmatrix}. \qquad (10) $$

Based on number theory, the Paley method can generate Hadamard matrices of order p + 1 (a multiple of four) for primes p ≡ 3 (mod 4). Interested readers may refer to the book by MacWilliams and Sloane [9] (pages 45–48). All Hadamard matrices discussed in this paper can be downloaded from Sloane's online collection [12].
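The Sylvester iteration (10) takes only a few lines; the following sketch (our own illustrative code) builds H_{2^r} and checks the defining property H'H = nI.

```python
import numpy as np

def sylvester_hadamard(r):
    """Hadamard matrix of order 2**r built by the Sylvester iteration (10)."""
    H = np.array([[1]])
    for _ in range(r):
        H = np.block([[H, H], [H, -H]])
    return H

H8 = sylvester_hadamard(3)
n = H8.shape[0]
assert np.array_equal(H8.T @ H8, n * np.eye(n, dtype=int))   # H'H = nI
```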

A necessary condition for the existence of a Hadamard matrix of order n > 2 is n ≡ 0 (mod 4), while the converse statement on sufficiency remains a famous open problem. Deleting the first column of a normalized Hadamard matrix yields a two-level Hadamard design H_n(2^{n−1}), which is useful in the field of experimental design [3]. In what follows we show that Hadamard designs are also useful as output codes for solving multiclass machine learning problems. Note that the Hadamard output codes discussed here are identical to the Hadamard designs used in industrial experimentation. For brevity, we write the Hadamard design and output code H_K(2^{K−1}) as H_K.

The important properties of a Hadamard output code H_K with K ≡ 0 (mod 4) are twofold: a) every pair of codewords has the same Hamming distance d_min = K/2, and b) every pair of columns is orthogonal, i.e., H'_{*j} H_{*l} = δ_{jl} K, where δ_{jl} = 1 for j = l and 0 otherwise. Then we have

Theorem 3. The Hadamard designs H_K for K = 4, 8, ... are optimal error-correcting output codes within the pool of K-class output codes that combine K − 1 base learners.

Proof. According to Definition 1 of an optimal ECOC, we need to show that the Hadamard designs satisfy the extreme conditions of both row separation and column diversity.

1. Maximum d_min in Ω(K, K − 1). By Plotkin's bound for any M ∈ Ω(K, L), well known in coding theory (see page 41 of [9]),

$$ K \le 2\left[\frac{d_{\min}}{2 d_{\min} - L}\right], $$

where L = K − 1 for Ω(K, K − 1), we have the bound

$$ d_{\min} \le \frac{L}{2}\cdot\frac{K}{K-1} = \frac{K}{2}. $$

Therefore, the minimum Hamming distance d_min = K/2 of H_K attains the above upper bound. In other words, d_min(H_K) ≥ d_min(M) for all M ∈ Ω(K, K − 1).

2. Minimum s_max in Ω(K, K − 1). Since every two columns of a Hadamard output code are orthogonal, we have

$$ s_{\max}(\mathbf{H}_K) = \max_{j \neq l} |\mathrm{Corr}(\mathbf{H}_{*j}, \mathbf{H}_{*l})| = \frac{1}{K}\max_{j \neq l} |\mathbf{H}'_{*j}\mathbf{H}_{*l}| = 0, $$

attaining the minimum value in the class Ω(K, K − 1). □
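As a sanity check on the two properties used in the proof, the sketch below (illustrative code of our own, reusing the Sylvester construction above) forms the Hadamard output code H_K by deleting the all-ones column of a normalized Hadamard matrix and verifies d_min = K/2 and the column orthogonality H'_{*j} H_{*l} = δ_{jl} K.

```python
import numpy as np
from itertools import combinations

def hadamard_code(H):
    """Output code H_K: drop the first (all-ones) column of a normalized Hadamard matrix."""
    return H[:, 1:]

def verify(M):
    K, L = M.shape
    dmin = min((M[i] != M[k]).sum() for i, k in combinations(range(K), 2))
    orthogonal = np.array_equal(M.T @ M, K * np.eye(L, dtype=int))
    return dmin, orthogonal

H = np.array([[1]])
for _ in range(3):                    # Sylvester iteration up to H_8
    H = np.block([[H, H], [H, -H]])
M = hadamard_code(H)                  # an 8 x 7 output code
print(verify(M))                      # (4, True): d_min = K/2, orthogonal columns
```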

The error bounds for multiclass reconstruction are decreased by using Hadamard output codes. Let ε = (1/L) Σ_{j=1}^{L} ε_j denote the global average empirical loss of the base learners, and let ρ be as defined in Corollary 1.

[Figure 1: curves of the empirical-risk bound and the generalization-risk bound (vertical axis "Error bounds", 0–1) against the minimum Hamming distance (0–11), with the values for H_20 marked.]

Fig. 1. Illustration of the empirical and generalization bounds in the ECOC class Ω(20, 19), assuming ε = 0.1 and ρ = 0.15. The risk bounds for Hadamard output codes are circled at the right ends.

With these definitions, the empirical and generalization risk bounds for Hadamard output codes become

$$ R_{\mathrm{emp}}(\mathbf{H}_K, \hat{f}) \le \min\left\{1,\ \frac{4(K-1)\varepsilon}{K}\right\}; \qquad R_{\mathrm{gen}}(\mathbf{H}_K, \hat{f}) \le 1 - B_\rho\!\left(K-1,\ \frac{K}{4}-1\right). \qquad (11) $$

Figure 1 illustrates the empirical and generalization bounds derived in Section 2.3 for the class Ω(20, 19); it shows that the Hadamard output codes greatly decrease both risk bounds within the pool Ω(K, K − 1).
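For a concrete feel of how these bounds scale with the number of classes, the sketch below evaluates (11) as reconstructed above under the Figure 1 assumptions ε = 0.1 and ρ = 0.15 (illustrative code of our own).

```python
from math import comb

def hadamard_bounds(K, eps=0.10, rho=0.15):
    """Empirical and generalization risk bounds (11) for H_K, K a multiple of 4."""
    L = K - 1                       # number of base learners
    t = K // 4 - 1                  # correctable errors: floor((d_min - 1)/2) with d_min = K/2
    r_emp = min(1.0, 4 * (K - 1) * eps / K)
    B = sum(comb(L, j) * rho**j * (1 - rho)**(L - j) for j in range(t + 1))
    return r_emp, 1 - B

for K in (4, 8, 12, 16, 20):
    print(K, hadamard_bounds(K))
```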

4 EXPERIMENTAL RESULTS

Support vector machines (SVMs) are used as the base learners in our experiments comparing different dense output codes, since SVMs with flexible kernels are strong enough to classify various types of dichotomous data while maintaining good generalization performance [5, 11]. Plugged into the ECOC framework, the jth base SVM can be written as

$$ f_j(x) = \mathrm{sign}\left\{\sum_{i=1}^{N} M_{y_i j}\, \alpha_i^{(j)} K_j(x_i, x) + b_j\right\}, \qquad (12) $$

where K_j(x, w) is a selected kernel, b_j is the offset term, and α^{(j)} = (α_1^{(j)}, ..., α_N^{(j)}) is obtained by

$$ \alpha^{(j)} = \arg\max_{\alpha}\ L_j(\alpha) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{r=1}^{N}\alpha_i \alpha_r M_{y_i j} M_{y_r j} K_j(x_i, x_r), \qquad (13) $$

subject to the constraint 0 ≤ α_i ≤ C_j. The constant C_j controls the trade-off between maximizing the margin and minimizing the training errors. In our experiments, we choose the Gaussian radial basis function (RBF) kernel K_j(x, w) = exp{−γ_j ‖x − w‖²}, with γ_j used to further tune the RBF kernel. For each base SVM learner, we employed the LIBSVM software [4] for two tasks: a) using the default svm-train program in C for solving the above optimization problem, and b) using the grid program in Python for tuning the parameters (C_j, γ_j) in order to select SVM models by cross-validation (10-fold in our setup); an illustrative sketch of this per-column training loop is given after Table 1.

Table 1. Description of the datasets from the UCI repository

Dataset        #Classes   #Train Data   #Test Data   #Attributes
dermatology        6           366            —            34
glass              6           214            —            13
ecoli              8           366            —             8
vowel             11           528            —            10
yeast             10          1484            —             8
letter            26         16000          4000            16
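The experiments above use LIBSVM's svm-train and grid tools. Purely for illustration, the sketch below shows the same per-column training loop with scikit-learn's SVC (the library choice and parameter defaults are our assumptions, not the authors' setup): each column of the output code relabels the training data into ±1 superclasses, one RBF-kernel SVM is fitted per column, and the stacked predictions form the target codewords. In practice (C_j, γ_j) would be tuned per column by 10-fold cross-validation as described above.

```python
import numpy as np
from sklearn.svm import SVC

def train_ecoc_svms(X, y, M, C=1.0, gamma='scale'):
    """Fit one RBF-kernel SVM per column of the output code M (shape K x L).

    y is assumed to take values 0, ..., K-1, so that M[y, l] gives the +/-1
    superclass label of each training example for the l-th dichotomy.
    """
    return [SVC(kernel='rbf', C=C, gamma=gamma).fit(X, M[y, l])
            for l in range(M.shape[1])]

def predict_codewords(classifiers, X):
    """Stack the L binary predictions into target codewords, one row per example."""
    return np.column_stack([clf.predict(X) for clf in classifiers])
```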

Table 2. Experimental results using different output codes based on support vector machines with Gaussian RBF kernels

                  Hadamard             One-per-class           Random
Dataset         Remp      Rgen        Remp      Rgen        Remp      Rgen
dermatology    0.0000    0.0121      0.0037    0.0633      0.0000    0.0170
glass          0.0498    0.1184      0.0265    0.1869      0.2329    0.2457
ecoli          0.0179    0.0159      0.0365    0.0450      0.0302    0.0387
vowel          0.0000    0.0013      0.0000    0.0121      0.0000    0.0028
yeast          0.0745    0.0685      0.2652    0.2873      0.1102    0.1203
letter         0.0000    0.0171      0.0089    0.1502      0.0000    0.0133

Our comparative experiments are carried out on real multiclass datasets from the UCI Repository [2]; summary statistics are given in Table 1. We compare three types of dense output codes with an economical number of decompositions (around the number of classes): 1) Hadamard-type, pruning if necessary the last row(s) (up to 3) of the Hadamard design so that the number of rows in the output code coincides with the cardinality of the polychotomy; 2) one-per-class, i.e., a square output code with diagonal entries all +1 and off-diagonal entries all −1; and 3) random ECOC. The random error-correcting output codes are generated as follows: generate 10,000 output codes in Ω(K, K) with entries drawn uniformly at random from {−1, 1}, then choose the code with the largest d_min among those without identical columns (see the sketch below). To demonstrate the different uses of the output codes, both the empirical and generalization risks of multiclass reconstruction by Hamming decoding are evaluated on the training and testing data, respectively. For datasets without a supplied test set, we again use cross-validation to estimate R_gen. For the letter dataset, we choose only 2600 training examples (100 per class) and 1300 test examples (50 per class) for demonstration purposes. The results are given in Table 2, which shows that in general the Hadamard-type output codes perform better than the random error-correcting output codes, which in turn perform better than the OPC codes. One may then ask: what is wrong with the traditional OPC method? Our theory reveals that the OPC output codes have no error-correcting ability at all, because their d_min slumps to as low as 2, so that [(d_min − 1)/2] = 0!
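A minimal sketch of the random-code selection procedure described above (our own rendering of those steps):

```python
import numpy as np
from itertools import combinations

def random_ecoc(K, n_trials=10000, seed=0):
    """Among random K x K codes with +/-1 entries, keep the one with the largest
    minimum row Hamming distance, skipping codes that contain identical columns."""
    rng = np.random.default_rng(seed)
    best, best_d = None, -1
    for _ in range(n_trials):
        M = rng.choice([-1, 1], size=(K, K))
        if len({tuple(col) for col in M.T}) < K:          # identical columns: reject
            continue
        d = min((M[i] != M[k]).sum() for i, k in combinations(range(K), 2))
        if d > best_d:
            best, best_d = M, d
    return best
```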

References

[1] Allwein, E.L., Schapire, R.E. and Singer, Y.: Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research 1 (2000) 113–141
[2] Blake, C.L. and Merz, C.J.: UCI Repository of machine learning databases. (1998) http://www.ics.uci.edu/~mlearn/MLRepository.html
[3] Box, G.E.P., Hunter, W.G. and Hunter, J.S.: Statistics for Experimenters. Wiley, New York (1978)
[4] Chang, C.C. and Lin, C.J.: LIBSVM: a library for support vector machines. (2001) Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[5] Cristianini, N. and Shawe-Taylor, J.: An Introduction to Support Vector Machines (and other kernel-based learning methods). Cambridge University Press (2000)
[6] Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J. and Roli, F. (eds.) First International Workshop on Multiple Classifier Systems. Lecture Notes in Computer Science 1857, Springer-Verlag (2000) 1–15
[7] Dietterich, T.G. and Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2 (1995) 263–286
[8] Hastie, T. and Tibshirani, R.: Classification by pairwise coupling. The Annals of Statistics 26(2) (1998) 451–471
[9] MacWilliams, F.J. and Sloane, N.J.A.: The Theory of Error-Correcting Codes. Elsevier Science Publishers (1977)
[10] Schapire, R.E.: The strength of weak learnability. Machine Learning 5 (1990) 197–227
[11] Schölkopf, B. and Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press (2002)
[12] Sloane, N.J.A.: A Library of Hadamard Matrices. AT&T (1999) http://www.research.att.com/~njas/hadamard/