ELM-based Multiple Classifier Systems

Dianhui Wang
Department of Computer Science and Computer Engineering
La Trobe University, Melbourne, VIC 3083, Australia
E-mail: [email protected]

Abstract— With random weights between the inputs and the hidden units of a three-layer feed-forward neural network (namely the Extreme Learning Machine (ELM)), favorable performance may be achieved for pattern classification in terms of both efficiency and effectiveness. This paper investigates properties of ELM-based multiple classifier systems (MCS). A protein database with ten classes of super-families is employed in this study. Our results indicate that (1) integrating base ELM classifiers with better learning performance may result in an MCS with better generalization power; (2) a smaller size of weights in ELM classifiers does not imply better generalization capability; and (3) under/over-fitting phenomena occur in classification when inappropriate network architectures are used.

I. INTRODUCTION

The main concerns with neural classifiers are their performance in terms of generalization capability and learning efficiency [1]. In the past, many efforts have been made to speed up the training process and to improve the generalization of neural classifiers [2]. Unfortunately, the performance of a neural classifier depends on many factors involved in the learning process, such as the network architecture, initial weights, learning rates and termination conditions, so it is hard to ensure favorable performance under such uncertainties. It is therefore useful to develop techniques that resolve these problems.

A Multiple Classifier System (MCS) consists of a set of individual classifiers and a fusion method. The final decision is made by combining the outputs of the base classifiers, which may reflect diverse perspectives on different aspects of the base classifiers. An MCS has been shown to perform better than a single classifier [3]. The neural MCS approach is therefore expected to give a solution that is robust with respect to architecture and/or parameter variations, and it is interesting to understand more about model selection so that a better MCS solution can be achieved [4]. There are many different fusion methods for MCS design, such as voting and neural network approaches [5]. Among these, the simplest is to take the average of the outputs obtained from the base classifiers. In this study we employ this simplest fusion technique in our simulations; the effect of the fusion technique on MCS performance is not addressed here.

For a Single-hidden Layer Feedforward neural Network (SLFN), one may randomly choose the input weights and the hidden neurons' biases and determine the output weights by calculating a generalized inverse of the hidden layer output matrix. Results reported in [6] showed that such an Extreme Learning Machine (ELM) tends to have good generalization performance for both regression and classification tasks. Moreover, this fast learning scheme avoids the difficulties of gradient-based optimization techniques for training SLFNs, such as stopping criteria, learning rates, learning epochs, and local minima. These merits motivate us to employ ELMs as base classifiers in the design of an MCS. The goal of this paper is to investigate the following:

1) How to select the base ELM classifiers to build an MCS with better generalization;
2) The correlation between the size of the weights in the SLFNs and the generalization capability for classification;
3) The sensitivity of the ELM classifiers to the architecture of the SLFN.

The ELM with sigmoidal activation function is employed as our base classifier. A protein data set with ten classes of super-families obtained from the PIR database (http://pir.Georgetown.edu) is used in this study. The remainder of this paper is organized as follows. Section 2 provides a brief description of the ELM. Section 3 presents a simple and practical approach for designing MCSs with base ELM classifiers. Experimental results are reported in Section 4. Conclusions from this study are summarized in Section 5.

II. BRIEF DESCRIPTION OF EXTREME LEARNING MACHINE FOR SLFNs

Unlike popular implementations such as Back-Propagation (BP) for training neural networks, in ELM one may randomly choose the input weights and the hidden neurons' biases and analytically determine the output weights of the SLFN [6]. For N arbitrary distinct samples (x_i, t_i), where x_i = [x_{i1}, x_{i2}, ..., x_{in}]^T ∈ R^n and t_i = [t_{i1}, t_{i2}, ..., t_{im}]^T ∈ R^m, standard SLFNs with Ñ hidden neurons and activation function g(x) are mathematically modeled as

    \sum_{i=1}^{\tilde{N}} \beta_i g(w_i \cdot x_j + b_i) = o_j, \quad j = 1, \cdots, N,    (1)
where w_i = [w_{i1}, w_{i2}, ..., w_{in}]^T is the weight vector connecting the ith hidden neuron and the input neurons, β_i = [β_{i1}, β_{i2}, ..., β_{im}]^T is the weight vector connecting the ith hidden neuron and the output neurons, and b_i is the threshold of the ith hidden neuron. w_i · x_j denotes the inner product of w_i and x_j.

That standard SLFNs with Ñ hidden neurons, each with activation function g(x), can approximate these N samples with zero error means that \sum_{j=1}^{N} \| o_j - t_j \| = 0, i.e., there exist β_i, w_i and b_i such that

    \sum_{i=1}^{\tilde{N}} \beta_i g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, \cdots, N.    (2)

The above N equations can be written compactly as

    H \beta = T    (3)

where

    H(w_1, \cdots, w_{\tilde{N}}, b_1, \cdots, b_{\tilde{N}}, x_1, \cdots, x_N) =
    \begin{bmatrix}
    g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_1 + b_{\tilde{N}}) \\
    \vdots & \cdots & \vdots \\
    g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_N + b_{\tilde{N}})
    \end{bmatrix}_{N \times \tilde{N}}    (4)

    \beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m}
    \quad \text{and} \quad
    T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m}    (5)

H is called the hidden layer output matrix of the neural network; the ith column of H is the ith hidden neuron's output with respect to the inputs x_1, x_2, ..., x_N. Mathematically, it is easy to show that when Ñ = N, the hidden layer output matrix H is invertible for any randomly given input weights and hidden neurons' biases, and there exist output weights β such that Hβ = T.

In most cases the number of hidden neurons is much less than the number of distinct training samples, Ñ << N, so H is a nonsquare matrix and there may not exist w_i, b_i, β_i (i = 1, ..., Ñ) such that Hβ = T. Thus, ŵ_i, b̂_i, β̂ (i = 1, ..., Ñ) are to be found such that they satisfy

    \| H(\hat{w}_1, \cdots, \hat{w}_{\tilde{N}}, \hat{b}_1, \cdots, \hat{b}_{\tilde{N}}) \hat{\beta} - T \| = \min_{w_i, b_i, \beta} \| H(w_1, \cdots, w_{\tilde{N}}, b_1, \cdots, b_{\tilde{N}}) \beta - T \|,    (6)

which is equivalent to minimizing the cost function

    E = \sum_{j=1}^{N} \Big( \sum_{i=1}^{\tilde{N}} \beta_i g(w_i \cdot x_j + b_i) - t_j \Big)^2.    (7)

It is interesting to note that, unlike most common learning approaches where all the parameters of the SLFN need to be adjusted, the input weights w_i and the hidden layer biases b_i need not be tuned at all, and the hidden layer output matrix H can remain unchanged once arbitrary values have been assigned to these parameters at the beginning of learning. For fixed input weights w_i and hidden layer biases b_i, equation (6) shows that training an SLFN is simply equivalent to finding a least-squares solution β̂ of the linear system Hβ = T:

    \| H(w_1, \cdots, w_{\tilde{N}}, b_1, \cdots, b_{\tilde{N}}) \hat{\beta} - T \| = \min_{\beta} \| H(w_1, \cdots, w_{\tilde{N}}, b_1, \cdots, b_{\tilde{N}}) \beta - T \|.    (8)

The unique smallest-norm least-squares solution of the above linear system is

    \hat{\beta} = H^{\dagger} T,    (9)

where H^† is the Moore-Penrose generalized inverse of the hidden layer output matrix H [7]. The special solution β̂ = H^†T is the smallest-norm least-squares solution of the general linear system Hβ = T, meaning that the smallest training error can be reached by this special solution. The three main steps involved in the original ELM algorithm can be summarized as follows.

ELM Algorithm [6]: Given a training set ℵ = {(x_i, t_i) | x_i ∈ R^n, t_i ∈ R^m, i = 1, ..., N}, an activation function g(x), and a hidden neuron number Ñ:

step 1 Assign arbitrary input weights w_i and biases b_i, i = 1, ..., Ñ.
step 2 Calculate the hidden layer output matrix H.
step 3 Calculate the output weights β: β = H^†T.
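For concreteness, the three steps above can be written in a few lines of NumPy. This is a minimal sketch under our own assumptions rather than the authors' implementation: the names elm_train and elm_predict are hypothetical, the sigmoid is used as g(x), and numpy.linalg.pinv supplies the Moore-Penrose pseudoinverse H^† of equation (9).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden, rng=None):
    """Steps 1-3 of the ELM algorithm for an SLFN.
    X: (N, n) inputs, T: (N, m) targets, n_hidden: number of hidden neurons (N~)."""
    rng = np.random.default_rng() if rng is None else rng
    n_features = X.shape[1]
    # step 1: assign arbitrary input weights w_i and biases b_i.
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # step 2: hidden layer output matrix H, with H[j, i] = g(w_i . x_j + b_i).
    H = sigmoid(X @ W + b)
    # step 3: beta = H† T, the smallest-norm least-squares solution of H beta = T
    #         (np.linalg.pinv computes the Moore-Penrose pseudoinverse via SVD).
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Network outputs o_j (one row per input sample)."""
    return sigmoid(X @ W + b) @ beta

# Toy usage: 100 samples, 8 features, 3 classes with one-hot targets.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
labels = rng.integers(0, 3, size=100)
W, b, beta = elm_train(X, np.eye(3)[labels], n_hidden=20, rng=rng)
predicted = elm_predict(X, W, b, beta).argmax(axis=1)
print("training recognition rate:", np.mean(predicted == labels))
```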

III. DESIGN OF ELM-BASED MCS

For neural network classifiers trained with the error back-propagation algorithm, the output values of the classifier can be interpreted as posterior probabilities of an unseen example belonging to the classes, provided the activation function at the output layer is the sigmoid. However, this is not the case for ELM classifiers, because their output values for unseen examples are not guaranteed to lie in the unit interval; some of them may be greater than 1 or less than 0. Comparing these unconstrained values to make a classification decision has no statistical interpretation, although it may still produce good results. To give the outputs a logical interpretation, we reorganize the training data by the following map:

    s = \ln\Big( \frac{t}{1-t} \Big),    (10)

where t takes a random real number from the range (0.001, 0.1), and its corresponding value of s becomes the new target value for the original target output 0. Similarly, t takes a random real number in (0.9, 0.999), and its corresponding value of s becomes the new target value for the original target output 1. In this way, the range of the ELM outputs will be in (0, 1).

Let EC = {EC_1, EC_2, ..., EC_m} be a collection of base ELM classifiers produced from a training data set, denoted by D_R. A validation data set, denoted by D_v, is used to evaluate the performance of the classifiers. In this paper, we suggest a simple and practical method to design the MCS, outlined as follows (a code sketch is given at the end of this section):

step 1 Set p = 0.
step 2 Using learning performance (e.g., recognition rate), rank the members of EC(p) in descending order, i.e., EC_{k_1}, EC_{k_2}, ..., EC_{k_{m_p}}, where m_p is the number of ELM classifiers.
step 3 Calculate the mean of these recognition rates, denoted by λ(p), and retain the classifiers whose recognition rate is higher than the mean value λ(p). Let BC(p) = {EC^p_{k_1}, EC^p_{k_2}, ..., EC^p_{k_{q_p}}}, q_p < m_p, be the collection of the retained classifiers.
step 4 Evaluate the generalization performance of the ELM-based MCS, denoted by PS_p, from BC(p) and D_v, where an unseen example is assigned to class r if the r-th average output takes the maximum value.
step 5 Generate (m_p - q_p) new candidate ELM classifiers, denoted by CM(p), and let MEC = EC(p) ∪ CM(p).
step 6 Set p = p + 1 and EC(p) = MEC.
step 7 Repeat step 2 - step 5 to generate BC(p) and calculate PS_p.
step 8 If PS_p > PS_{p-1}, set the final set of base classifiers as FBC = BC(p); otherwise set FBC = BC(p-1).
step 9 If p > N (a predefined number), stop; otherwise, go to step 2.

Note that our proposed approach for designing the ELM-based MCS uses the concept of evolutionary computing and is directly driven by the generalization performance of the classifier. The basis of this design is the fast generation of candidate classifiers using the ELM, together with the supporting results on model selection given in the next section.
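The sketch below is our own illustrative reading of the target map (10) and of steps 1-9, not the authors' implementation: the helper names (soft_targets, design_mcs, and so on) are hypothetical, a base classifier is represented as a callable returning a score matrix, and make_candidate stands for one call to an ELM training routine such as the Section II sketch.

```python
import numpy as np

def soft_targets(y, n_classes, rng=None):
    """Target map of Eq. (10): replace a 0/1 target by s = ln(t/(1-t)), with t
    drawn from (0.001, 0.1) for an original 0 and from (0.9, 0.999) for an
    original 1, before the target matrix T is handed to the ELM trainer."""
    rng = np.random.default_rng() if rng is None else rng
    onehot = np.eye(n_classes)[y]
    t = np.where(onehot == 1,
                 rng.uniform(0.9, 0.999, onehot.shape),
                 rng.uniform(0.001, 0.1, onehot.shape))
    return np.log(t / (1.0 - t))

def recognition_rate(scores, labels):
    """Fraction of samples whose largest output corresponds to the true class."""
    return float(np.mean(scores.argmax(axis=1) == labels))

def mcs_performance(classifiers, X_val, y_val):
    """Step 4: average the base classifiers' outputs and assign each example
    to the class r whose average output is the maximum."""
    avg = np.mean([clf(X_val) for clf in classifiers], axis=0)
    return recognition_rate(avg, y_val)

def design_mcs(make_candidate, m, X_tr, y_tr, X_val, y_val, n_rounds=10):
    """Sketch of steps 1-9. make_candidate() trains one base ELM and returns a
    callable clf with clf(X) -> (N, n_classes) score matrix."""
    pool = [make_candidate() for _ in range(m)]            # EC(0)
    prev_set, prev_ps, final_set = None, -np.inf, None
    for p in range(n_rounds + 1):                          # step 9: stop once p exceeds the limit
        # Steps 2-3: rank by learning performance and retain the above-average ones, BC(p).
        rates = [recognition_rate(clf(X_tr), y_tr) for clf in pool]
        lam = float(np.mean(rates))                        # lambda(p)
        kept = [c for c, r in zip(pool, rates) if r > lam] or list(pool)  # guard: all-equal case
        # Step 4: generalization performance PS_p of the ensemble on D_v.
        ps = mcs_performance(kept, X_val, y_val)
        # Step 8: keep BC(p) if it improves on BC(p-1), otherwise keep BC(p-1).
        final_set = kept if ps > prev_ps else (prev_set or kept)
        prev_set, prev_ps = kept, ps
        # Steps 5-6: generate (m_p - q_p) new candidates and add them to the pool.
        pool = pool + [make_candidate() for _ in range(len(pool) - len(kept))]
    return final_set                                       # FBC
```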

IV. EXPERIMENTAL RESULTS

A. Datasets

In this paper, we employ the database used in our earlier studies on protein sequence classification with neural networks [5], [8], [9]. A brief description of the feature extraction from a raw protein sequence follows. A protein sequence is made from combinations of variable length of the 20 amino acids Σ = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. The n-gram (or k-tuple) features are extracted as the input vector of the neural network classifier. The n-gram features are pairs of values (v_i, c_i), where v_i is feature i and c_i is the count of this feature in a protein sequence, for i = 1, ..., 20^n. In general, a feature is the number of occurrences of a combination of amino acid symbols in a protein sequence, and the features are all possible combinations of n letters from the set Σ. For example, the 2-gram features (400 in total) are (AA, AC, ..., AY, CA, CC, ..., CY, ..., YA, ..., YY). For the protein sequence VAAGTVAGT, the extracted 2-gram features are {(VA, 2), (AA, 1), (AG, 2), (GT, 2), (TV, 1)}.

The 6-letter exchange group is another commonly used feature. The 6-letter groups are combinations of the letters from the set Σ: A = {H, R, K}, B = {D, E, N, Q}, C = {C}, D = {S, T, P, A, G}, E = {M, I, L, V} and F = {F, Y, W}. For example, the protein sequence VAAGTVAGT mentioned above is transformed under the 6-letter exchange group into EDDDDEDDD, whose 2-gram features are {(DE, 1), (ED, 2), (DD, 5)}. We use e_n and a_n to represent n-gram features from the 6-letter group and the 20-letter set, respectively. Each set of n-gram features, i.e., e_n and a_n, from a protein sequence is scaled separately to avoid skew in the count values using the following formula:

    \bar{x} = \frac{x}{L - n + 1},    (11)

where x represents the count of a generic gram feature, \bar{x} is the normalized value of x, which is the input to the classifiers, L is the length of the protein sequence, and n is the size of the n-gram features.

In this study, the protein sequences covering ten super-families (classes) were obtained from the Protein Information Resource (PIR) databases, comprised of PIR1 and PIR2 (http://pir.Georgetown.edu). The ten super-families to be trained/classified (the two numbers in parentheses give the PIR1/PIR2 sequence counts) are: Cytochrome c (113/17), Cytochrome c6 (45/14), Cytochrome b (73/100), Cytochrome b5 (11/14), Triose-phosphate isomerase (14/44), Plastocyanin (42/56), Photosystem II D2 protein (30/45), Ferredoxin (65/33), Globin (548/204), and Cytochrome b6-f complex 4.2K (8/6). The 56 features comprised of e2 and a1 were used in this study.
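As a concrete illustration of this pipeline, the short sketch below reproduces the 2-gram counting, the 6-letter exchange-group mapping, and the length normalization of equation (11) for the example sequence VAAGTVAGT. It is our own sketch with hypothetical function names, not the code used in the paper.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20-letter amino acid alphabet

# 6-letter exchange groups: every amino acid is mapped to one of A..F.
EXCHANGE_GROUPS = {"A": "HRK", "B": "DENQ", "C": "C",
                   "D": "STPAG", "E": "MILV", "F": "FYW"}
EXCHANGE = {aa: grp for grp, members in EXCHANGE_GROUPS.items() for aa in members}

def ngram_counts(seq, n):
    """(v_i, c_i) pairs: every length-n window of seq and how often it occurs."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def ngram_features(seq, n, alphabet):
    """Fixed-length feature vector over all |alphabet|^n possible n-grams,
    normalized by Eq. (11): x_bar = x / (L - n + 1)."""
    counts, L = ngram_counts(seq, n), len(seq)
    return [counts[''.join(g)] / (L - n + 1) for g in product(alphabet, repeat=n)]

seq = "VAAGTVAGT"
print(ngram_counts(seq, 2))    # Counter({'VA': 2, 'AG': 2, 'GT': 2, 'AA': 1, 'TV': 1})
eseq = ''.join(EXCHANGE[aa] for aa in seq)
print(eseq)                    # EDDDDEDDD
print(ngram_counts(eseq, 2))   # Counter({'DD': 5, 'ED': 2, 'DE': 1})

# The 56 inputs used in the paper: e2 (6^2 = 36 exchange-group 2-grams)
# plus a1 (20 amino-acid 1-grams), each normalized by Eq. (11).
features = ngram_features(eseq, 2, "ABCDEF") + ngram_features(seq, 1, AMINO_ACIDS)
print(len(features))           # 56
```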

B. Correlation Analysis

In our simulation study, the 949 protein sequences selected from PIR1 and the 533 protein sequences selected from PIR2 were combined; at each trial we then randomly selected 949 protein sequences from the mixed dataset as the training dataset and used the remaining 533 protein sequences as the testing dataset. Ten such trials (i.e., a non-standard 10-fold cross-validation scheme) were conducted, and the averaged results are reported. With different random initial input weights, we generated 100 ELMs with 50, 80, 120 and 180 hidden units, respectively. We used the 2-norm to measure the size of the weights of the ELMs, where both the input weights and the output weights are included. Table I gives the correlation coefficients for ELM classifiers with architectures 56-80-10 and 56-120-10, respectively, where the columns correspond to the learning performance and the rows to the generalization performance.

TABLE I
PERFORMANCE CORRELATION COEFFICIENTS

                      NHU=80                 NHU=120
Corr. Coeff.     ORR        AFM         ORR       AFM        ||W||
ORR             -0.0724    -0.0792      0.0328    0.0396     0.0231
AFM             -0.0691    -0.0590      0.1107    0.0971     0.1063

From these results, we can see that the generalization performance shows almost no correlation with the learning performance for classification, and that the size of the weights is likewise uncorrelated with the generalization capability. This observation does not follow the results reported in the literature [10], [11]. In Table I, ORR and AFM denote the overall recognition rate and the average F-measure of the classifier (definition details can be found in [12]), ||W|| is the 2-norm of the weight matrix W, and NHU is the number of hidden units.
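A sketch of how such a correlation study could be run is given below; it is illustrative only. The training and prediction routines are passed in as callables (for instance, the elm_train/elm_predict sketch from Section II), np.corrcoef supplies the Pearson coefficients, and the weight norm is one reasonable reading of ||W|| (the 2-norm over all input and output weights).

```python
import numpy as np

def overall_recognition_rate(scores, labels):
    return float(np.mean(scores.argmax(axis=1) == labels))

def correlation_study(train_fn, predict_fn, X_tr, y_tr, T_tr, X_te, y_te,
                      n_hidden, n_models=100):
    """Train n_models ELMs that differ only in their random input weights and
    correlate learning performance (training ORR), generalization performance
    (testing ORR) and the size of the weights ||W||."""
    train_orr, test_orr, norms = [], [], []
    for seed in range(n_models):
        rng = np.random.default_rng(seed)
        W, b, beta = train_fn(X_tr, T_tr, n_hidden, rng)
        train_orr.append(overall_recognition_rate(predict_fn(X_tr, W, b, beta), y_tr))
        test_orr.append(overall_recognition_rate(predict_fn(X_te, W, b, beta), y_te))
        # ||W||: 2-norm over all input and output weights of one ELM.
        norms.append(float(np.linalg.norm(np.concatenate([W.ravel(), b.ravel(), beta.ravel()]))))
    # Pearson correlation coefficients, as reported in Table I.
    print("corr(training ORR, testing ORR):", np.corrcoef(train_orr, test_orr)[0, 1])
    print("corr(||W||,        testing ORR):", np.corrcoef(norms, test_orr)[0, 1])
```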

TABLE II
PERFORMANCE COMPARISON OF THREE GROUPS OF ELM-BASED MCS CLASSIFIERS

                             Best                 Random               Worst
Hidden Units  Performance    Mean (%)  Std. Dev.  Mean (%)  Std. Dev.  Mean (%)  Std. Dev.
NHU=50        ORR            91.71     0.0068     88.60     0.0022     86.27     0.0079
              AFM            95.16     0.0048     94.06     0.0015     92.93     0.0043
NHU=80        ORR            92.49     0.0024     91.77     0.0040     91.14     0.0087
              AFM            94.22     0.0019     94.22     0.0032     93.68     0.0065
NHU=120       ORR            93.22     0.0022     92.35     0.0035     91.30     0.0048
              AFM            94.76     0.0030     93.84     0.0026     93.55     0.0030
NHU=180       ORR            92.25     0.0040     92.66     0.0044     91.22     0.0073
              AFM            93.90     0.0020     93.74     0.0036     92.69     0.0046

Fig. 1. ORR vs the number of candidate classifiers

Fig. 2. AFM vs the number of candidate classifiers

C. Model Selection

Model selection for building an MCS with improved generalization is challenging but meaningful. In this paper, we are interested in whether learning performance can be used as a discriminant to select candidate classifiers. To this end, we formed three sets of ELM classifiers according to their learning performance: the first set consists of the r best candidate ELM classifiers, the second set of r randomly chosen candidates, and the last set of the r worst candidates. Figures 1 and 2 show the effect of the number of base classifiers on the classification performance, where the lines marked with *, +, o and x correspond to NHU = 50, 80, 120 and 180 hidden units, respectively. The results suggest that the generalization performance of the MCS tends to be stable once more than 10 base classifiers are used.
Table II reports the performance, averaged over r = 11, 13, 15, 17 and 19. For each number of hidden units, the table gives the means and standard deviations of the ORR and the AFM for the best-candidate, random-candidate and worst-candidate groups. It can be observed that the MCS built from the best-candidate group slightly outperforms the one built from the random-candidate group, while the MCS built from the worst-candidate group performs worse than the random group. This fact is useful for designing the MCS: the statistics show that it is more reliable to employ the candidate ELM classifiers with better learning performance as the base classifiers of the MCS. Indeed, this understanding is the basis of the MCS design proposed in Section III. The results listed in Table II also show that the ELM classifiers suffer from under/over-fitting when the number of hidden units is poorly estimated.
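The grouping experiment behind Table II can be expressed compactly as below. This is an illustrative sketch with hypothetical helper names, not the experimental code: base classifiers are again represented as callables returning a score matrix, and the fusion is the simple average-output rule described in Sections I and III.

```python
import numpy as np

def group_candidates(classifiers, X_tr, y_tr, r, rng=None):
    """Form the r best, r random and r worst candidates according to
    learning performance (training recognition rate)."""
    rng = np.random.default_rng() if rng is None else rng
    rates = [float(np.mean(clf(X_tr).argmax(axis=1) == y_tr)) for clf in classifiers]
    order = np.argsort(rates)                                   # ascending
    return {"Best":   [classifiers[i] for i in order[-r:]],
            "Random": [classifiers[i] for i in rng.choice(len(classifiers), r, replace=False)],
            "Worst":  [classifiers[i] for i in order[:r]]}

def ensemble_orr(members, X_te, y_te):
    """MCS decision rule: average the members' outputs, assign each example to
    the class with the largest average output, and report the ORR."""
    avg = np.mean([clf(X_te) for clf in members], axis=0)
    return float(np.mean(avg.argmax(axis=1) == y_te))

# Usage, assuming `pool` holds the candidate ELM classifiers:
# for r in (11, 13, 15, 17, 19):
#     groups = group_candidates(pool, X_tr, y_tr, r)
#     print(r, {name: ensemble_orr(g, X_te, y_te) for name, g in groups.items()})
```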

V. CONCLUSIONS

This paper investigates some properties of single ELM classifiers and presents a new approach for designing neural MCS systems using the concept of evolutionary computing. Experimental results, obtained on a protein sequence database with ten super-families, demonstrate that ELM classifiers with better learning performance can be recommended to form the base classifiers of an MCS. This study also reveals that there is no remarkable correlation between the size of the weights in ELM classifiers and their generalization capability. Finally, the classification performance of ELM classifiers depends on the number of hidden units; scientific and practical methods for determining this significant parameter are therefore highly desirable.

REFERENCES

[1] Y. H. Pao, Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, 1989.
[2] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1996.
[3] T. H. Ho, J. J. Hull, and S. Srihari, "Decision combination in multiple classifier system," IEEE Trans. on PAMI, vol. 16, no. 1, pp. 66-75, 1994.
[4] Z. H. Zhou, J. Wu, and W. Tang, "Ensembling neural networks: Many could be better than all," Artificial Intelligence, vol. 137, no. 1-2, pp. 239-263, 2002.

[5] D. H. Wang, N. K. Lee, T. S. Dillon, and N. J. Hoogenraad, "Protein sequences classification using modular RBF neural networks," Lecture Notes in Computer Science, vol. 2557, pp. 477-486, 2002.
[6] G. B. Huang, Q. Y. Zhu, and C. K. Siew, "Extreme learning machine: A new learning scheme of feedforward neural networks," in Proceedings of the International Joint Conference on Neural Networks (IJCNN 2004), Budapest, Hungary, 25-29 July 2004.
[7] D. Serre, Matrices: Theory and Applications. Springer-Verlag New York, Inc., 2002.
[8] D. H. Wang, N. K. Lee, and T. S. Dillon, "Extraction and optimization of fuzzy protein sequence classification rules using GRBF neural networks," Neural Information Processing - Letters and Reviews, vol. 1, no. 1, pp. 53-59, 2002.
[9] D. H. Wang and G. B. Huang, "Protein sequence classification using extreme learning machine," in Proceedings of the International Joint Conference on Neural Networks (IJCNN 2005), Montreal, Quebec, Canada, July 31-August 4, 2005.
[10] P. L. Bartlett, "The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network," IEEE Transactions on Information Theory, vol. 44, pp. 525-536, 1998.
[11] S. McLoone and G. Irwin, "Improving neural network training solutions using regularisation," Neurocomputing, vol. 37, no. 1-4, pp. 71-90, 2001.
[12] Z. Lu et al., "Predicting subcellular localization of proteins using machine-learned classifiers," Bioinformatics, vol. 20, pp. 547-556, 2004.