Bagging for Linear Classifiers

Marina Skurichina (1,2) and Robert P.W. Duin (1)

(1) Pattern Recognition Group, Department of Applied Physics, Faculty of Applied Sciences, Delft University of Technology, P.O. Box 5046, 2600 GA Delft, The Netherlands. E-mail: [email protected]

(2) Department of Data Analysis, Institute of Mathematics and Informatics, Akademijos 4, Vilnius 2600, Lithuania. E-mail: [email protected]

Abstract

Classifiers built on small training sets are usually biased or unstable. Different techniques exist to construct more stable classifiers. It is not clear which ones are good, and whether they really stabilize the classifier or merely improve the performance. In this paper bagging (bootstrapping and aggregating(1)) is studied for a number of linear classifiers. A measure for the instability of classifiers is introduced. The influence of regularization and bagging on this instability and on the generalization error of linear classifiers is investigated. In a simulation study it is shown that in general bagging is not a stabilizing technique. It is also demonstrated that the instability of the classifier can be used to predict how useful bagging will be. Finally, it is shown experimentally that bagging may improve the performance of the classifier only in very unstable situations.

Keywords: linear discriminant, generalization error, small sample size, regularization, bagging, instability, bias and variance.

1. Introduction

The main problem in building classifiers on small training sample sets is that it is impossible to estimate the parameters of the data distribution properly. Moreover, a small training sample set may represent the total data set incorrectly. Thus, classifiers built on small sample sets are biased and may have a large variance in the probability of misclassification.(2,3,4) For this reason they are unstable. In order to make a classifier more stable, a larger training sample set is needed or stabilizing techniques have to be used. In many applications the total amount of data is not large and therefore the training data set is limited. In this case one can try different techniques to get more stable solutions. It is still an open question which stabilizing techniques are good: do they really stabilize the solution, and do they improve the performance of the classifier? It is well known that the bootstrap estimate of the data distribution parameters is robust(5) and more accurate than the plug-in estimate(6) usually used in classification rules. So it might be promising to use bootstrapping and aggregating techniques ('bagging'(1)) to get a


better classifier(7,8) with a more stable solution. Bagging has mostly been investigated for regression problems, classification trees and the nearest neighbour classifier. In this paper we study bagging for linear classifiers in relation to the above questions. The data sets used in the experiments are presented in section 2. The classifiers studied are the Nearest Mean Classifier (NMC), the Fisher Linear Discriminant (FLD), the Pseudo Fisher Linear Discriminant (PFLD) and the Regularized Fisher Linear Discriminant (RFLD), which are discussed in section 3. Bagging, its use and a modification, which we call “nice” bagging, are discussed in section 4. Studying bagging for classification and regression trees, Breiman(1) noticed that the efficiency of bagging depends on the stability of the prediction or classification rule. We investigate the influence of regularization and bagging on the stability and on the generalization error of linear classifiers: whether they really stabilize the classifier and whether they improve its generalization error. For this study we need to measure the stability of classifiers, which also depends on the training sample size used to build the classifier. In section 5 a possible measure of the stability, which we call the instability, is introduced and investigated for different linear classifiers. The relation between the performance and the instability of bagged linear classifiers is investigated in section 6. Our simulation study shows that, in comparison with regularization, which really stabilizes the solution of the classifier, bagging is not a stabilizing technique: bagging improves the performance of the classifier only in very unstable situations. Conclusions are summarized in section 7.

2. Data

Two artificial data sets and one real data set are used for our experimental investigations. These data sets have a high dimensionality because we are interested in critical situations where classifiers have unstable solutions. The first set is a 30-dimensional correlated Gaussian data set constituted by two classes with equal covariance matrices. Each class consists of 200 vectors. The mean of the first class is zero for all features. The mean of the second class is equal to 3 for the first two features and

equal to 0 for all other features. The common covariance matrix is a diagonal matrix with a variance of 40 for the second feature and a unit variance for all other features. The intrinsic class overlap (Bayes error) is 0.064. In order to spread the separability over all features, this data set is rotated using a 30 × 30 rotation matrix which is

$$\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$$

for the first two features and

the identity matrix for all other features. Further we call this data set “Gaussian correlated data”. Its first two features are presented in Fig. 1.

Fig. 1. Scatter plot of a two-dimensional projection of the 30-dimensional Gaussian correlated data.

In order to have a non-Gaussian distributed problem, we choose for the second set a mixture of two 30-dimensional Gaussian correlated data sets. The first one consists of two strongly correlated Gaussian classes (200 vectors in each class) with equal covariance matrices. The common covariance matrix is a diagonal matrix with a variance of 0.01 for the first feature, a variance of 40 for the second feature and a unit variance for all other features. The mean of the first class is zero for all features. The mean of the second class is 0.3 for the first feature, 3 for the second feature and zero for all other features. As our aim is to construct non-Gaussian data, we mix the above described data set with another Gaussian data set constituted by two classes with equal covariance matrices. Each class also consists of 200 vectors. The common covariance matrix is a diagonal matrix with a variance of 0.01 for the first feature and a unit variance for all other features. The means of the

first two features are [10,10] and [-9.7,-7] for the first and second classes, respectively. The means of all other features are zero. As a result of the union of the first class of the first data set with the first class of the second data set, and of the second class of the first data set with the second class of the second data set, we obtain two strongly correlated classes (400 vectors in each class) whose weight centers are shifted to the points [10,10] and [-9.7,-7] in the first two features. The intrinsic class overlap (Bayes error) is 0.06. This data set was rotated in the same way as the first data set. Further these data are called “Non-Gaussian data”. Its first two features are presented in Fig. 2.

Fig. 2. Scatter plot of a two-dimensional projection of the 30-dimensional Non-Gaussian data. Left: entire data set; right: partially enlarged.

The last data set consists of real data collected through spot counting in interphase cell nuclei (see, for instance, Netten et al(9) and Hoekstra et al(10)). Spot counting is a technique to detect numerical chromosome abnormalities. By counting the number of colored chromosomes (‘spots’), it is possible to detect whether the cell has an aberration that indicates a serious disease. A FISH (Fluorescence In Situ Hybridization) specimen of cell nuclei was scanned using a fluorescence microscope system, resulting in computer images of the single cell nuclei. From these single cell images 16 × 16 pixel regions of interest were selected. These regions contain either background spots (noise), single spots or touching spots. From these regions we constructed two classes of data, the noisy background and single spots, omitting the regions

with touching spots. Each 16 × 16 region was treated as a feature vector of size 256. The first class of data (the noisy background) consists of 575 256-dimensional vectors and the second class (single spots) of 571 such vectors. We call these data “cell data” in the experiments. Training data sets with 3 to 100 samples per class are chosen randomly from the total set. The remaining data are used for testing. These and all other experiments are repeated 50 times. In all figures the results averaged over 50 repetitions are presented; this will not be mentioned further.
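To make the construction of the first artificial set concrete, a minimal sketch (in Python/NumPy; the function name, the seed handling and the absence of any normalization of the rotation block are our own choices, not the paper's code) of how the 30-dimensional Gaussian correlated data could be generated:

```python
import numpy as np

def gaussian_correlated_data(n_per_class=200, dim=30, seed=0):
    # Sketch of the 30-dimensional Gaussian correlated set described above.
    rng = np.random.default_rng(seed)
    cov = np.eye(dim)
    cov[1, 1] = 40.0                     # variance 40 for the second feature
    mean2 = np.zeros(dim)
    mean2[:2] = 3.0                      # class 2 is shifted by 3 in the first two features
    x1 = rng.multivariate_normal(np.zeros(dim), cov, n_per_class)
    x2 = rng.multivariate_normal(mean2, cov, n_per_class)
    rot = np.eye(dim)                    # rotate only the first two features
    rot[:2, :2] = np.array([[1.0, -1.0], [1.0, 1.0]])
    return x1 @ rot.T, x2 @ rot.T
```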

3. Linear Classifiers

The most popular and commonly used linear classifiers are the Fisher Linear Discriminant (FLD)(11,12) and the Nearest Mean Classifier (NMC) or Euclidean Distance Classifier. The FLD is defined as

$$g_F(x) = \left( x - \tfrac{1}{2}(X^{(1)} + X^{(2)}) \right)' S^{-1} (X^{(1)} - X^{(2)}), \qquad (1)$$

where S is the standard maximum likelihood estimate of the p × p covariance matrix Σ, x is a p-variate vector to be classified, and X(i) is the sample mean vector of the i-th class, i=1,2. The Nearest Mean Classifier (NMC) can be written as

$$g_{NM}(x) = \left( x - \tfrac{1}{2}(X^{(1)} + X^{(2)}) \right)' (X^{(1)} - X^{(2)}). \qquad (2)$$

It minimizes the distance between the vector x and the class mean X(i), i=1,2. Notice that (1) is the mean squared error solution for the linear coefficients (w, w0) in

$$g_F(x) = w \cdot x + w_0 = L, \qquad (3)$$

with x ∈ X and with L being the corresponding desired outcomes, 1 for class 1 and -1 for class 2. Direct computation is impossible when the number of features p exceeds the number of training vectors N.(13) With increasing feature size, the expected probability of misclassification rises dramatically.(14)
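As an illustration, the discriminants (1) and (2) are fully determined by the class means and the pooled sample covariance matrix. The sketch below (hypothetical helper names of our own, not code from the paper) returns the coefficients (w, w0) of g(x) = w · x + w0, with class 1 assigned for g(x) > 0:

```python
import numpy as np

def fld_coefficients(x1, x2):
    # Fisher Linear Discriminant, Eq. (1): w = S^{-1}(m1 - m2), threshold at the mid-point.
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    n1, n2 = len(x1), len(x2)
    # pooled maximum likelihood estimate of the common covariance matrix
    s = (np.cov(x1, rowvar=False, bias=True) * n1 +
         np.cov(x2, rowvar=False, bias=True) * n2) / (n1 + n2)
    w = np.linalg.solve(s, m1 - m2)
    return w, -0.5 * (m1 + m2) @ w

def nmc_coefficients(x1, x2):
    # Nearest Mean Classifier, Eq. (2): as (1) but with the covariance matrix replaced by I.
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    w = m1 - m2
    return w, -0.5 * (m1 + m2) @ w
```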

The NMC generates the perpendicular bisector between the class means and thereby yields the optimal linear classifier for classes with an identical normal distribution of features: the classes have a common covariance matrix and only the class means are shifted in the direction of one of the principal components of the distribution. The advantage of this classifier is that it is relatively insensitive to the number of training examples.(15) The NMC, however, does not take the differences in the variances and the covariances into account. A well-known technique to overcome problems with the inverse of an ill-conditioned covariance matrix when building the standard Fisher linear discriminant function (1) is to add a constant to the diagonal elements of the estimate of the covariance matrix,

$$S_R = S + \lambda I, \qquad (4)$$

where I is the p × p identity matrix and λ is a regularization parameter. The new estimate S_R is called the ridge estimate of the covariance matrix, a concept borrowed from regression analysis.(16) This approach is called regularized discriminant analysis. The modification (4) gives the ridge or regularized Fisher linear discriminant function (RFLD)

$$g_R(x) = \left( x - \tfrac{1}{2}(X^{(1)} + X^{(2)}) \right)' (S + \lambda I)^{-1} (X^{(1)} - X^{(2)}). \qquad (5)$$
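Continuing the sketch above (again with a hypothetical helper name of our own), the ridge estimate (4) turns into the discriminant (5) with a one-line change; λ = 0 recovers the FLD, while very large λ makes w proportional to the NMC direction:

```python
import numpy as np

def rfld_coefficients(x1, x2, lam):
    # Regularized Fisher Linear Discriminant, Eqs. (4)-(5): S_R = S + lambda * I.
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    n1, n2 = len(x1), len(x2)
    s = (np.cov(x1, rowvar=False, bias=True) * n1 +
         np.cov(x2, rowvar=False, bias=True) * n2) / (n1 + n2)
    s_r = s + lam * np.eye(s.shape[0])   # ridge estimate of the covariance matrix
    w = np.linalg.solve(s_r, m1 - m2)
    return w, -0.5 * (m1 + m2) @ w
```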

Equation (5) represents a whole family of linear classifiers: from the Fisher linear discriminant (1), for λ=0, to the Nearest Mean Classifier (2), for λ → ∞, in which case the covariance information is lost (see Fig. 3). The regularized linear discriminant function is analyzed by Friedman,(17) see also Peck et al.(18) In order to find the best value of the regularization parameter, he proposed to use a leave-one-out cross-validation technique. To reduce the computational burden he developed updating formulas for the inverse of the down-dated class sample covariance matrix. The optimal ridge estimate of the covariance matrix and its properties for the regularized Fisher linear discriminant are studied by Loh,(19) Barsov,(20) Serdobolskij,(21) and Raudys and Skurichina.(22) Another linear classifier is the so-called Pseudo Fisher linear discriminant (PFLD). Here a direct solution of (3) is obtained (using augmented vectors) by

$$g_{PF}(x) = (w, w_0) \cdot (x, 1) = (x, 1)\,(X, I)^{-1} L, \qquad (6)$$

where (x,1) is the augmented vector to be classified and (X,I) is the augmented training set. The inverse (X,I)^{-1} is the Moore-Penrose pseudo-inverse, which gives the minimum norm solution. Before the inversion the data are shifted such that they have zero mean. This method is closely related to the singular value decomposition. For N ≥ p the PFLD, maximizing the distance to all given samples, is equivalent to the FLD (1). For N < p, however, the Pseudo Fisher rule finds a linear subspace which covers all the data samples. On this plane the PFLD estimates the data means and the covariance matrix, and it builds a linear discriminant perpendicular to this subspace in all other directions, for which no samples are given. The behavior of the PFLD as a function of the sample size is illustrated by Duin.(23) For one sample per class this method is equivalent to the Nearest Mean and to the Nearest Neighbor method. If the total sample size is equal to or larger than the dimensionality, N ≥ p, the method is equivalent to the FLD. The generalization error of the PFLD shows a maximum at the point N=p (Fig. 4 - Fig. 6). Surprisingly, the error has a local minimum somewhere below the point N=p. This can be understood from the observation that the PFLD succeeds in finding hyperplanes with equal distances to all training samples until N=p.
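The minimum norm solution (6) is conveniently obtained with a pseudo-inverse routine; a sketch under the same conventions as the helpers above (the zero-mean shifting and the way the shift is folded back into the threshold are our implementation choices):

```python
import numpy as np

def pfld_coefficients(x1, x2):
    # Pseudo Fisher Linear Discriminant, Eq. (6): minimum norm least-squares solution.
    x = np.vstack([x1, x2])
    labels = np.hstack([np.ones(len(x1)), -np.ones(len(x2))])   # desired outcomes L
    shift = x.mean(axis=0)
    aug = np.hstack([x - shift, np.ones((len(x), 1))])          # augmented, zero-mean training set
    coef = np.linalg.pinv(aug) @ labels                         # Moore-Penrose pseudo-inverse
    w, w0 = coef[:-1], coef[-1]
    return w, w0 - w @ shift                                    # undo the mean shift in the threshold
```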

Fig. 3. The generalization error of the NMC and the RFLD with different values of the regularization parameter λ=L versus the training sample size for 30-dimensional Gaussian correlated data.

However, there is no need to do that for samples that are already classified correctly by a subset of the training set. Therefore we modified the PFLD with the following editing procedure, in which misclassified objects are iteratively added until all objects in the training set are correctly classified (a code sketch is given below):

1. Put all objects of the training set in set U. Create an empty set L.
2. Find in U the two objects, one from each class, that are the closest. Move them from U to L.
3. Compute the PFLD D for the set L.
4. Compute the distance of all objects in the set U to D.
5. If all objects are correctly classified, stop.
6. Move the misclassified object in U with the largest distance to D from U to L.
7. If the number of objects in L is smaller than p (the dimensionality), go to 3.
8. Compute the FLD for the entire set of objects, L ∪ U, and stop.

This procedure is called the Small Sample Size Classifier (SSSC).(23) It uses for the PFLD only those objects that lie in the area between the classes and that are absolutely needed to construct a linear discriminant that classifies all objects correctly. If this appears to be impossible, because the training set is too large relative to the class overlap, the FLD is used. The Small Sample Size Classifier is closely related to Vapnik's support vector classifier.(24)
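A direct transcription of this editing procedure might look as follows (a sketch only: the distance in step 4 is taken as the absolute value of the discriminant output, and pfld_coefficients / fld_coefficients are the hypothetical helpers sketched in the previous subsections):

```python
import numpy as np

def sssc_coefficients(x1, x2):
    # Sketch of the SSSC editing procedure listed above.
    p = x1.shape[1]
    x = np.vstack([x1, x2])
    y = np.hstack([np.ones(len(x1)), -np.ones(len(x2))])
    in_l = np.zeros(len(x), dtype=bool)                  # step 1: L starts empty
    # step 2: move the closest pair of objects from different classes into L
    d = np.linalg.norm(x1[:, None, :] - x2[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmin(d), d.shape)
    in_l[i] = in_l[len(x1) + j] = True
    while True:
        # step 3: PFLD on the objects currently in L
        w, w0 = pfld_coefficients(x[in_l & (y > 0)], x[in_l & (y < 0)])
        g = x @ w + w0
        wrong = ~in_l & (np.sign(g) != y)                # steps 4-5
        if not wrong.any():
            return w, w0
        # step 6: move the misclassified object farthest from the discriminant into L
        in_l[np.where(wrong)[0][np.argmax(np.abs(g[wrong]))]] = True
        if in_l.sum() >= p:                              # steps 7-8: fall back to the FLD
            return fld_coefficients(x1, x2)
```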

Fig. 4. The generalization error of the NMC, the PFLD, the SSSC, the KLLC and the RFLD (λ=3) versus the training sample size for 30-dimensional Gaussian correlated data.

The Karhunen-Loeve's Linear Classifier (KLLC), based on Karhunen-Loeve's feature selection,(25) is a popular linear classifier in the field of image classification. The KLLC with the joint covariance matrix of the data classes builds the Fisher linear classifier in the subspace of principal components corresponding to the, say n, largest eigenvalues of the data distribution. This classifier performs nicely when the most informative features have the largest variances. In our simulations the KLLC uses the 4 largest eigenvalues of the 30-dimensional data distribution.
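A sketch of the KLLC as just described (reusing the hypothetical fld_coefficients helper; the eigendecomposition of the joint data covariance matrix and the back-projection of the coefficients are our implementation choices):

```python
import numpy as np

def kllc_coefficients(x1, x2, n_components=4):
    # Karhunen-Loeve's Linear Classifier: FLD in the subspace of the n largest eigenvalues.
    x = np.vstack([x1, x2])
    joint_cov = np.cov(x, rowvar=False, bias=True)
    eigval, eigvec = np.linalg.eigh(joint_cov)
    v = eigvec[:, np.argsort(eigval)[::-1][:n_components]]   # n leading eigenvectors
    w_sub, w0 = fld_coefficients(x1 @ v, x2 @ v)              # FLD in the subspace
    return v @ w_sub, w0                                      # coefficients in the original space
```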

In Fig. 4 to Fig. 6 the behavior is shown of the Regularized Fisher Linear Discriminant (λ=3), the Nearest Mean Classifier, the Pseudo Fisher Linear Discriminant, the Small Sample Size Classifier and Karhunen-Loeve's Linear Classifier as functions of the training sample size for 30-dimensional Gaussian correlated data, 30-dimensional Non-Gaussian data and 256-dimensional cell data. For all data sets the PFLD shows a critical behavior with a high maximum of the generalization error around N=p. For Gaussian correlated data the NMC has a rather bad performance for all sizes of the training data set because this classifier does not take the covariances into account. Therefore it is outperformed by the other linear classifiers. The Non-Gaussian data set is constructed such that it is difficult to separate by linear classifiers: the data classes are strongly correlated, their means are shifted and the largest eigenvalues of the data distribution do not correspond to the most informative features. Thus all linear classifiers perform badly for small sample sizes. With increasing training sample sizes, however, the PFLD (which is equivalent to the FLD for N ≥ p) and the SSSC manage to exclude the influence of non-informative features with large variances and give three times smaller generalization errors than the other classifiers. For the cell data all linear classifiers outperform the PFLD for small and critical sizes of the training set.

Fig. 5. The generalization error of the NMC, the PFLD, the SSSC, the KLLC and the RFLD (λ=50) versus the training sample size for 30-dimensional Non-Gaussian data.

Fig. 6. The generalization error of the NMC, the PFLD, the SSSC and the RFLD (λ=0.2) versus the training sample size for 256-dimensional cell data.

4. Bagging

The main problem of building classifiers on small sample sets is that the parameters of the data distribution cannot be estimated properly. Very often a small training sample set represents the total data set incorrectly. Due to the small number of training objects, some objects of the training set may largely distort the distribution. In some sense these samples might be called ‘outliers’ (they are not necessarily real outliers) or ‘misleaders’. The bootstrap(6) is a sampling technique which allows one to get more accurate statistical estimators. Bootstrapping is based on random sampling with replacement. The random selection with replacement of N objects $X^b = (X_1^b, X_2^b, \ldots, X_N^b)$ from the set of N objects $X = (X_1, X_2, \ldots, X_N)$ is called a bootstrap replicate. So some objects may be represented in the new set once, twice or even more times, and some objects may not be represented at all. By taking a bootstrap replicate of the training sample set, one can avoid or get fewer ‘outliers’ in the bootstrap training set. For this reason the bootstrap estimate of the data distribution parameters is robust(5) and more accurate than the plug-in estimate(6) usually used in classification rules. Bagging is based on the bootstrapping and aggregating concepts and was presented by Breiman.(1) We implement bagging by averaging the parameters (coefficients) of the same linear classifier built on several bootstrap replicates. From each bootstrap replicate $X^b$ of the data set a ‘bootstrap’ version $C^b(X)$ of the classifier $C(X)$ is built. The average of these ‘bootstrap’ versions gives the ‘bagged’ classifier $C^{\beta}(X) = \frac{1}{B}\sum_{b=1}^{B} C^b(X)$, where B is the number of bootstrap replicates used. Aggregating (or averaging) classifiers actually means combining classifiers. Rather often a combined classifier gives better results than the individual classifiers, because the final solution combines the advantages of the individual classifiers (see, for instance, Kittler et al(26)). Therefore, it may be concluded that bootstrapping and aggregating techniques may be used to build a better classifier on training sample sets with misleaders.
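Since every linear classifier above is represented by its coefficients (w, w0), averaging the classifiers reduces to averaging these coefficients. A sketch (base_classifier stands for any of the hypothetical coefficient functions sketched in section 3; sampling the bootstrap replicate per class, so that both classes stay represented, is our own choice):

```python
import numpy as np

def bagged_coefficients(x1, x2, base_classifier, n_bootstraps=100, seed=0):
    # Bagging: average the coefficients of B bootstrap versions of the classifier.
    rng = np.random.default_rng(seed)
    ws, w0s = [], []
    for _ in range(n_bootstraps):
        b1 = x1[rng.integers(0, len(x1), len(x1))]   # bootstrap replicate of class 1
        b2 = x2[rng.integers(0, len(x2), len(x2))]   # bootstrap replicate of class 2
        w, w0 = base_classifier(b1, b2)
        ws.append(w)
        w0s.append(w0)
    return np.mean(ws, axis=0), float(np.mean(w0s))
```

For example, bagged_coefficients(x1, x2, nmc_coefficients, n_bootstraps=100) would correspond to the bagged NMC used in the example of Fig. 7 below.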

Bagging and its different modifications (e.g. Rao et al(27) and Taniguchi et al(28)) were investigated by a number of researchers. Mostly, bagging was studied for regression problems. Breiman(1) has shown that bagging can reduce the error of linear regression and the classification error of the nearest neighbour classifier and of classification trees. He noticed that bagging is useful for unstable procedures only; for stable ones it may even deteriorate the performance of the classifier. The performance of bagged classifiers was also investigated by Tibshirani(7) and by Wolpert and Macready,(8) who used the bias-variance decomposition to estimate the generalization error of “bagged” classifiers. We study the dependence of the efficiency of bagging on the stability of a classifier, and the stability of bagged classifiers, in linear discriminant analysis. Different modifications of bagging are possible. We introduce here just one of them, which we call “nice” bagging. “Nice” bagging is bagging in which only “nice” bootstrap versions of the classifier are averaged. “Nice” bootstrap versions of the classifier are the versions for which the error on the original (not bootstrapped) training set is not larger than the error of the original classifier on that same training set. In other words, “nice” bagging builds the classifier based on the apparent classification error. The properties of bagging and “nice” bagging for linear classifiers are investigated in section 6 of the current paper.
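“Nice” bagging thus only changes which bootstrap versions enter the average; a sketch of that filter, under the same assumptions as the previous sketch (the fallback to the original classifier when no bootstrap version qualifies is our own choice):

```python
import numpy as np

def apparent_error(w, w0, x1, x2):
    # Fraction of training objects misclassified by g(x) = w.x + w0 (class 1 for g > 0).
    return float(np.mean(np.hstack([(x1 @ w + w0) <= 0, (x2 @ w + w0) > 0])))

def nicely_bagged_coefficients(x1, x2, base_classifier, n_bootstraps=100, seed=0):
    # "Nice" bagging: average only the bootstrap versions whose apparent error on the
    # original training set is not larger than that of the original classifier.
    rng = np.random.default_rng(seed)
    w_ref, w0_ref = base_classifier(x1, x2)
    err_ref = apparent_error(w_ref, w0_ref, x1, x2)
    ws, w0s = [], []
    for _ in range(n_bootstraps):
        b1 = x1[rng.integers(0, len(x1), len(x1))]
        b2 = x2[rng.integers(0, len(x2), len(x2))]
        w, w0 = base_classifier(b1, b2)
        if apparent_error(w, w0, x1, x2) <= err_ref:
            ws.append(w)
            w0s.append(w0)
    if not ws:                                    # no "nice" versions found (our fallback)
        return w_ref, w0_ref
    return np.mean(ws, axis=0), float(np.mean(w0s))
```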

In Fig. 7 an example is presented in which the average of bootstrap versions of the classifier gives a solution (a discriminant function) that cannot be reached by the separate bootstrap versions of the classifier. In this example, two-dimensional Gaussian correlated data are used (Fig. 1). The Nearest Mean Classifier is bagged on 100 bootstrap replicates of a training data set consisting of 10 objects. The coefficients of the original Nearest Mean discriminant function, the coefficients of the bootstrap versions of the original classifier and the coefficients of the bagged classifier are normalized in such a way that all coefficients (w, w0) (see (3)) of each discriminant function are divided by the coefficient w0. In this way linear discriminants can be represented as points in a two-dimensional plot. A scatter plot of the Nearest Mean discriminant, its bootstrap versions and its bagged version in the space of the normalized coefficients w1/w0 and w2/w0 is shown in Fig. 7. This example shows that the bagged (aggregated) classifier gives a solution inside an empty space between the different bootstrap versions of the classifier. The separate bootstrap versions of the classifier can hardly reach the solution obtained by aggregating (averaging) the bootstrap versions of the classifier. The experiments described below show that this bagged version of the Nearest Mean Classifier yields a better expected performance.

Fig. 7. Scatter plot of the two-dimensional projection of the Nearest Mean discriminant in the space of its normalized coefficients w1/w0 and w2/w0 for 2-dimensional Gaussian correlated data.

5. Stability of Linear Classifiers

As we are interested in investigating stabilizing techniques, we need to measure the stability or the instability of classifiers. In this section we introduce one possible measure of the instability of a classifier(29) and consider the relation between the performance and the instability of linear classifiers for the examples presented in Fig. 4 - Fig. 6. We also study the instability of the Regularized Fisher Linear Discriminant and its relation with the generalization error for 30-dimensional Gaussian correlated data. We measure the instability of a classifier by calculating the changes in the classification of a test data set caused by training the classifier on a bootstrap replicate of the original learning data set instead of on the original set. Repeating this procedure several times and averaging the results, an estimate of the classifier instability is obtained. This will be called the instability measured by the test data set. In the same way the instability measured by the training data is computed, using the training data set itself instead of the test data set. Instabilities measured in this way and averaged over 50 repetitions are presented in Fig. 8 - Fig. 10.
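A sketch of this instability estimate (the number of repetitions, the per-class bootstrap sampling and the use of the original classifier as the reference labelling are our reading of the description above; passing the training set itself as the evaluation set gives the instability measured by the training data):

```python
import numpy as np

def instability(x1, x2, eval1, eval2, base_classifier, n_repeats=25, seed=0):
    # Average fraction of evaluation objects whose label changes when the classifier is
    # rebuilt on a bootstrap replicate of the original training set.
    rng = np.random.default_rng(seed)
    w, w0 = base_classifier(x1, x2)
    x_eval = np.vstack([eval1, eval2])
    ref = np.sign(x_eval @ w + w0)                    # labels given by the original classifier
    changes = []
    for _ in range(n_repeats):
        b1 = x1[rng.integers(0, len(x1), len(x1))]
        b2 = x2[rng.integers(0, len(x2), len(x2))]
        wb, w0b = base_classifier(b1, b2)
        changes.append(np.mean(np.sign(x_eval @ wb + w0b) != ref))
    return float(np.mean(changes))
```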

Fig. 8. The instability, measured by the test data set (left) and by the training data set (right), of the NMC, the PFLD, the SSSC, the KLLC and the RFLD (L=λ=3) for 30-dimensional Gaussian correlated data versus the training sample size.

Comparing Fig. 4 - Fig. 6 with the left plots in Fig. 8 - Fig. 10 (instabilities measured by the test data), one can notice that the generalization error and the stability of classifiers are related: more unstable classifiers perform worse. In general the instability of a classifier depends on many factors, such as the complexity of the classifier, the distribution of the data used to construct the classifier, and the sensitivity of the classifier to the size and the composition of the training data set. In Fig. 8 - Fig. 10 the instability estimates based on the test set and on the training set are compared. They behave differently as a function of the training sample size. For all three data sets all classifiers are the most unstable for small sizes of the training data set. After increasing the number of training samples above the critical value (the data dimensionality), the classifiers become more stable and their generalization error decreases. Only the Nearest Mean Classifier is an exception, due to its insensitivity to the number of training samples.(15) In spite of becoming more stable with an increasing training set, its generalization error does not change much (Fig. 4 - Fig. 6). The instability measured by the training data set (right plots in Fig. 8 - Fig. 10) shows the dependency of the classifier stability on the composition of the training set. The classifiers are the most unstable (the most sensitive to changes in the training set) if the size of the training set is comparable with the data dimensionality. On the other hand, the classifiers built on either very small training data sets or on large training sets are not very sensitive to changes in the composition of a particular training set. With very small training data sets it is not possible to estimate the parameters of the data distribution properly and to construct a good discriminant function. Any classifier built on a very small training set is bad. So, changes in the composition of the training data set do not change the bad performance and the high instability (left plots in Fig. 8 - Fig. 10) of the classifier. That is why the instability measured by the training data is not high for classifiers built on these training sample sizes. The classifiers constructed on large training sets are already stable. Therefore changes in the composition of the training set do not change the classifier much.

Fig. 9. The instability, measured by the test data set (left) and by the training data set (right), of the NMC, the PFLD, the SSSC, the KLLC and the RFLD (L=λ=50) for 30-dimensional Non-Gaussian data versus the training sample size.

Fig. 10. The instability, measured by the test data set (left) and by the training data set (right), of the NMC, the PFLD, the SSSC, the KLLC and the RFLD (L=λ=0.2) for 256-dimensional cell data versus the training sample size.

Let us now consider the instability of the Regularized Fisher Linear Discriminant built on 30-dimensional Gaussian correlated data. Fig. 11 and Fig. 12 nicely illustrate that the RFLD stabilizes the standard Fisher Linear Discriminant: regularization by the ridge estimate of the covariance matrix is a stabilizing technique. When the value of the regularization parameter increases from 0 to 3, the instability (Fig. 11, Fig. 12) and the generalization error (Fig. 3) of the RFLD decrease to their minimal values. Increasing the value of the regularization parameter further increases the instability and the generalization error of the RFLD up to the instability and the generalization error of the Nearest Mean Classifier. Thus, some values of the regularization parameter in the RFLD help to stabilize the classifier, and in this way the generalization error of the regularized classifier decreases as well. This example shows that the instability and the performance of this classifier are related. As the instability of the classifier measured by the training set illuminates the situations in which the classifier is the most sensitive to the composition of the training set, this instability could help to predict the possible use of bagging for a particular classifier, for certain data and certain sizes of the training set.

Fig. 11. The instability, measured by the test data set, of the RFLD with different values of the regularization parameter λ=L for 30-dimensional Gaussian correlated data versus the training sample size.

Fig. 12. The instability, measured by the training data set, of the RFLD with different values of the regularization parameter λ=L for 30-dimensional Gaussian correlated data versus the training sample size.

In the next section we investigate the instability measured by the training set and its relation to the usefulness of bagging and to the performance of bagged classifiers.

6. Instability and Performance of Bagged Linear Classifiers

In this section we present the results of investigating the use of bagging and “nice” bagging for linear classifiers built on the data described in section 2. Bagging has a different effect on the performance and the instability of different classifiers. This effect also depends on the data used. The usefulness of bagging can be predicted by considering the classifier instability measured by the training data set. Hereafter, the instability measured by the training set will simply be called the instability. The performance and the instability of the bagged and “nicely” bagged Nearest Mean Classifier are studied in section 6.1. Using the example of the NMC it is illustrated that bagging is not a stabilizing technique. The investigation of bagging and “nice” bagging for the Pseudo Fisher Linear Discriminant reveals the shifting effect of bagging. This effect, as well as the generalization error and the instability of the bagged and “nicely” bagged PFLD, is considered in section 6.2. The usefulness of bagging for the Regularized Fisher Linear Classifier is studied in section 6.3. The performance and the instability of the bagged and “nicely” bagged Small Sample Size and Karhunen-Loeve's Linear Classifiers are considered in sections 6.4 and 6.5, respectively.

6.1. Instability and Performance of the Bagged Nearest Mean Classifier

First the usefulness of bagging and “nice” bagging for the Nearest Mean Classifier is studied on the three data sets: the 30-dimensional Gaussian correlated data, the 30-dimensional Non-Gaussian data and the 256-dimensional cell data. The results averaged over 50 repetitions for different numbers of bootstraps are presented in Fig. 13. Afterwards, using the example of Gaussian correlated data, it is shown that bagging is not a stabilizing technique.

Fig. 13. The generalization error and the instability, measured by the training set, of the NMC and the “bagged” NMC for different numbers of bootstraps B versus the training sample size for 30-dimensional Gaussian correlated data (a,b), for 30-dimensional Non-Gaussian data (c,d) and for 256-dimensional cell data (e,f).

Let us first consider the effect of bagging on the Nearest Mean Classifier. For Gaussian correlated data (Fig. 13a,b) bagging reasonably improves the generalization of the NMC, with the exception of very small and very large sizes of the training set. This can be explained by considering the instability of the NMC. Like any classifier, the NMC built on very small training sample sizes has a bad expected performance. Changes in the composition of the training set do not help to improve the performance of the classifier, and its instability measured by the training set is very low. In this case bagging, which tries to improve on the original training set, cannot help and can even increase the generalization error of the NMC. The NMC on Gaussian correlated data is very unstable when the training sample size is comparable with the data dimensionality, which shows that the NMC is then very sensitive to changes in the composition of the training set. Due to the nice properties of the bootstrap(6) and the advantages of aggregating, bagging decreases the generalization error of the NMC for training sample sizes which are comparable with the data dimensionality (see section 4). If, however, the training sample size increases further, the NMC becomes more stable and bagging becomes less useful. For Non-Gaussian (Fig. 13c,d) and cell data (Fig. 13e,f) the NMC is more stable, without a high instability for critical sizes of the training set. So for these data sets the NMC is not very sensitive to changes in the composition of the training set, and therefore bagging is not useful here. Let us now consider the generalization error and the instability of the “nicely” bagged NMC. For Gaussian correlated data (Fig. 14a,b) the behavior of the “nicely” bagged and the bagged NMC is similar. However, the “nicely” bagged NMC is more stable than the simply bagged NMC and it has on average a 2-3% lower generalization error. So on these data “nice” bagging is preferable to simple bagging for the NMC.


Fig. 14. The generalization error and the instability, measured by the training set, of the NMC and the “nicely bagged” NMC for different numbers of bootstraps B versus the training sample size for 30-dimensional Gaussian correlated data (a,b), for 30-dimensional Non-Gaussian data (c,d) and for 256-dimensional cell data (e,f).

For Non-Gaussian data (Fig. 14c,d) “nice” bagging does not improve on the generalization error of the NMC achieved by simple bagging. As in the previous case, the “nicely” bagged NMC is more stable than the bagged NMC. The conclusions made above about the stability of the “nicely” bagged NMC are also valid for the cell data (Fig. 14e,f). Note that in Fig. 14e,f we have changed the scale for a better visibility of the details of the behavior of the instability of the “nicely” bagged NMC. The “nicely” bagged NMC is more stable and even gives a small improvement in the generalization error of the original classifier for small training sample sizes. However, strictly speaking, bagging does not stabilize the original classifier. The bagged classifier can be both less and more stable than the original classifier; it merely depends on the number of bootstrap replicates used to build the bagged classifier. When a bootstrap replicate of the original training set is used, the effective size of the training set becomes smaller, and the instability of a classifier built on these data increases. When the number of bootstrap replicates increases, the bagged classifier constructed on these bootstrap replicates becomes more stable. In Fig. 15 and Fig. 16 examples are given of the behavior of the generalization error and the instability of the bagged and “nicely” bagged NMC on the 30-dimensional Gaussian correlated data with training sample sizes of 3 and 10 per class versus the number of bootstrap replicates. In both cases, when the number of bootstrap replicates increases, the bagged and “nicely” bagged NMC become more stable. Bagging does not always stabilize the original classifier. Sometimes it is enough to use 10 bootstrap replicates to get a classifier that is more stable than the original one; sometimes 1000 bootstrap replicates are not enough to surpass the stability of the original classifier. Fig. 15 and Fig. 16 also demonstrate that the “nicely” bagged NMC is more stable than the bagged NMC. The fact that bagging and “nice” bagging are not stabilizing techniques is also confirmed by Fig. 13b,d,f and Fig. 14b,d,f. Fig. 15 and Fig. 16 further show that the generalization error and the instability of the bagged NMC are related. The correlation between the generalization error and the instability of the bagged and the “nicely” bagged classifier on the Gaussian correlated data for the training sample size N=3 per class is 0.38 and 0.68, respectively. For the training sample size N=10 per class these correlations are 0.97 and 0.98, respectively. However, when the bagged or “nicely” bagged classifier starts to improve the generalization error of the original classifier, the bagged classifier can be both less and more stable than the original classifier. One can notice that after a certain value, increasing the number of bootstrap replications does not help to stabilize the classifier any further, nor to improve its performance.

Fig. 15. The generalization error and the instability, measured by the training set, of the NMC, the “bagged” and “nicely bagged” NMC versus the number of bootstraps B for 30-dimensional Gaussian correlated data with the training sample size N=3 per class.

Fig. 16. The generalization error and the instability, measured by the training set, of the NMC, the “bagged” and “nicely bagged” NMC versus the number of bootstraps B for 30-dimensional Gaussian correlated data with the training sample size N=10 per class.

This means that when applying bagging to practical problems it is sufficient to choose some reasonable number of bootstrap replications (say from 10 to 100, depending on the classifier and the data used). It is necessary to remember that the use of additional bootstrap replications in bagging may take a large computation time and may yield only a small improvement in the generalization error, or even deteriorate it. Summarizing the results of the investigation of the peculiarities of the bagged and “nicely” bagged NMC, we can draw the following conclusions:

1. Bagging and “nice” bagging are not stabilizing techniques. The bagged classifier can be both more and less stable than the original classifier; this merely depends on the number of bootstrap replicates used.
2. The number of bootstrap replicates in bagging should be limited. Beyond a certain value, increasing the number of bootstrap replications in bagging does not decrease the generalization error and the instability of the classifier any further.
3. The instability of the classifier, measured by the training set, helps to predict the usefulness of bagging or “nice” bagging.
4. Both bagging and “nice” bagging improve the performance of the classifier when the classifier is extremely unstable. When the classifier is rather stable, bagging is not useful.
5. The “nicely” bagged NMC is more stable than the bagged NMC and gives a smaller generalization error.
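To make the experimental protocol behind curves like those in Fig. 15 and Fig. 16 concrete, a sketch of the outer loop (reusing the hypothetical helpers above; the split sizes, seeds and the set of B values are illustrative choices, not the exact settings of the paper):

```python
import numpy as np

def bagging_curve(x1, x2, n_train, b_values, base_classifier, n_repeats=50):
    # Generalization error of the bagged classifier versus the number of bootstraps B,
    # averaged over repeated random splits into training and test sets.
    errors = {b: [] for b in b_values}
    for rep in range(n_repeats):
        rng = np.random.default_rng(rep)
        i1, i2 = rng.permutation(len(x1)), rng.permutation(len(x2))
        tr1, te1 = x1[i1[:n_train]], x1[i1[n_train:]]
        tr2, te2 = x2[i2[:n_train]], x2[i2[n_train:]]
        for b in b_values:
            w, w0 = bagged_coefficients(tr1, tr2, base_classifier, n_bootstraps=b, seed=rep)
            test_error = np.mean(np.hstack([(te1 @ w + w0) <= 0, (te2 @ w + w0) > 0]))
            errors[b].append(test_error)
    return {b: float(np.mean(v)) for b, v in errors.items()}

# e.g. bagging_curve(x1, x2, n_train=10, b_values=[1, 2, 5, 10, 20, 50, 100],
#                    base_classifier=nmc_coefficients)
```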

6.2. Instability and Performance of the Bagged Pseudo Fisher Linear Classifier

In this section it is demonstrated that bagging exhibits a shifting effect. Bagging and “nice” bagging are also studied for the Pseudo Fisher Linear Classifier. When one uses bootstrap replicates of the training set in bagging, the number of actually used training objects is reduced: not all training objects are represented in a bootstrap replicate of the training set. Ignoring the repetition of objects in a bootstrap replicate, the training set becomes smaller (the training data lie in a subspace of lower dimensionality). The classifier built on a bootstrap replicate should therefore be similar to the classifier constructed on such a smaller training set, and the bagged classifier should also have some features of the classifier built on the smaller

training set. This phenomenon is most evident for the PFLD, due to its characteristic maximum of the generalization error for critical sizes of the training set (Fig. 4 - Fig. 6). The generalization error of the bagged PFLD is shifted with reference to the generalization error of the PFLD (Fig. 17a,c,e). In this way one can expect that bagging improves the performance of the PFLD for training sample sizes comparable with the data dimensionality. Thus, taking into account the phenomenon discussed above, one can conclude that the explanation of the behavior of the bagged classifier should be based on considering both the instability and the peculiarities of the classifier. As the NMC is insensitive to the number of training objects, it was sufficient to consider just its instability in order to explain the usefulness of bagging for the NMC. When investigating bagging for other linear classifiers, it is necessary to consider the peculiarities of the classifier as well. Let us now consider the usefulness of bagging for the PFLD on Gaussian correlated data (Fig. 17a,b). The generalization error of the PFLD has a local minimum when the number of training objects per class is equal to 5. So it is to be expected that the shifting effect of bagging discussed above is not helpful for small training sample sizes. Moreover, the PFLD is rather stable for these sizes of the training data set. Therefore bagging is useless for the PFLD for small training sample sizes. As the PFLD is rather unstable for critical sizes of the training set and as bagging has a positive shifting effect for these training sample sizes, bagging decreases the generalization error of the PFLD for training sample sizes comparable with the data dimensionality. For large training sample sizes the shifting effect of bagging becomes negative and the classifier is more stable, too. Therefore bagging does not improve the generalization error of the PFLD there. For 30-dimensional Non-Gaussian data (Fig. 17c,d) bagging is only useful for small and critical training sample sizes, up to the data dimensionality. The general behavior of the generalization error of the PFLD on these data shows that the shifting effect of bagging will be positive up to this size, so bagging decreases the generalization error of the classifier. For critical sizes of the training data the PFLD becomes more unstable. Bagging is useful up to the training

sample sizes equal to the data dimensionality. Then the shifting effect of bagging becomes negative, and bagging cannot help anymore. For 256-dimensional cell data (Fig. 17e,f) bagging has a positive shifting effect on the generalization error of the PFLD for small and critical training sample sizes. Therefore, bagging is useful for the PFLD built on these sizes of training data. “Nice” bagging takes only “nice” bootstrap replicates of the training sample set. The classifier built on a “nice” bootstrap replicate classifies the training set with the same or a smaller generalization error than the original classifier does. On the other hand, as the generalization error of almost all classifiers decreases when the number of training objects increases up to a certain limit, “nice” bagging perhaps tends to keep only those bootstrap replicates that contain the largest possible number of the training objects represented in the original training set. A “nice” bootstrap replicate of the training set thus uses more training objects than an ordinary bootstrap replicate. In this way “nice” bagging has a smaller shifting effect than simple bagging, and the shifting effect of “nice” bagging decreases when the training sample size increases. The generalization error and the instability of the “nicely” bagged PFLD are presented in Fig. 18a-f. For Gaussian correlated data “nice” bagging works similarly to bagging (Fig. 18a,b and Fig. 17a,b). Due to the smaller shifting effect of “nice” bagging, the improvement in the generalization error of the PFLD is smaller than the improvement obtained by bagging, in spite of the “nicely” bagged classifier being more stable than the bagged one. “Nice” bagging is not useful for very small and large training sample sizes, because the PFLD built on training sets of these sizes is rather stable. As the shifting effect of “nice” bagging is smaller than the shifting effect of bagging (Fig. 18c,d and Fig. 17c,d) and the PFLD is not very unstable on Non-Gaussian data, “nice” bagging is almost useless for these data. For the cell data “nice” bagging is only useful for the PFLD built on training sets of small and critical sizes (Fig. 18e,f and Fig. 17e,f). As “nice” bagging has a smaller shifting effect than bagging, “nice” bagging decreases the generalization error of the PFLD less than bagging does.

Fig. 17. The generalization error and the instability, measured by the training set, of the PFLD and the “bagged” PFLD for different numbers of bootstraps B versus the training sample size for 30-dimensional Gaussian correlated data (a,b), for 30-dimensional Non-Gaussian data (c,d) and for 256-dimensional cell data (e,f).

Fig. 18. The generalization error and the instability, measured by the training set, of the PFLD and the “nicely bagged” PFLD for different numbers of bootstraps B versus the training sample size for 30-dimensional Gaussian correlated data (a,b), for 30-dimensional Non-Gaussian data (c,d) and for 256-dimensional cell data (e,f).

Note that Fig. 17b,d,f and Fig. 18b,d,f also demonstrate that bagging and “nice” bagging are not stabilizing techniques. So, in addition to the conclusions made in section 6.1, one can infer the following:

1. Bagging has a shifting effect. “Nice” bagging has a smaller shifting effect than bagging.
2. In order to predict the usefulness of bagging or “nice” bagging, it is necessary to consider both the instability and the peculiarities of the classifier.
3. For the PFLD bagging is preferable to “nice” bagging.

6.3. Performance of the Bagged Regularized Fisher Linear Classifier

In this section the usefulness of bagging and “nice” bagging is studied for the Regularized Fisher Linear Discriminant with different values of the regularization parameter. It will be shown that for large training sample sizes “nice” bagging can correct an increase in the generalization error of the RFLD (with reference to the generalization error of the FLD) caused by using too large values of the regularization parameter in the RFLD. The RFLD represents a whole family of linear classifiers: from the standard Fisher Linear Discriminant (when the regularization parameter is equal to zero) and the Pseudo Fisher Linear Discriminant (when the regularization parameter is very small) to the Nearest Mean Classifier (when the regularization parameter approaches infinity). Therefore, bagging has a different effect on each of these regularized classifiers. The generalization error of the bagged and “nicely” bagged RFLD with different values of the regularization parameter λ is presented in Fig. 19a-j for Gaussian correlated data. We consider only this case because, from our point of view, it is the most interesting example demonstrating the relation between the stability of the classifier and the usefulness of bagging for this classifier. The usefulness of bagging and “nice” bagging for the RFLD can be nicely explained by considering the instability of the RFLD (see Fig. 12a,b). The RFLD with the regularization parameter λ=10^-10 is similar to the PFLD, so bagging and “nice” bagging behave in the same way as they do for the PFLD (Fig. 19a,b). The RFLD with the regularization parameter λ=0.1 is more stable than the RFLD with λ=10^-10. However, due to the positive shifting effect of bootstrapping, bagging and “nice” bagging are still useful for critical sizes of the training set (Fig. 19c,d). As this RFLD is still close to the PFLD, bagging gives better results than “nice” bagging. The RFLD with λ=3 is very stable; both bagging and “nice” bagging are useless (Fig. 19e,f). When the regularization parameter increases beyond the value λ=3 towards infinity, the RFLD approaches the Nearest Mean Classifier and thereby becomes more unstable, so bagging again becomes useful. Bagging decreases the generalization error of the RFLD with λ=20 for critical and large training sample sizes (Fig. 19g,h). As the shifting effect of bagging is negative for this classifier, the usefulness of bagging is not large. The RFLD with λ=200 is rather close to the NMC, and bagging works in the same way as for the NMC (Fig. 19i,j).

Fig. 19. The generalization error of the “bagged” and “nicely bagged” RFLD (L=λ) for different numbers of bootstraps B versus the training sample size for 30-dimensional Gaussian correlated data.

Surprisingly, “nice” bagging works very well for the RFLD with large values of the regularization parameter in the case of large training sample sizes: it decreases the generalization error of the RFLD by a factor of two or more. This phenomenon can be explained as follows. The RFLD with large values of the regularization parameter approaches the NMC, but it is still not the same classifier; it constructs the discriminant function in a different way than the NMC. On the other hand, “nice” bagging constructs the classifier by averaging the parameters of classifiers built on “nice” bootstrap replicates of the training set. These “nice” bootstrap versions of the classifier classify the training set with a smaller classification error than the original classifier built on the whole training set; that is, they are selected on the basis of the apparent classification error. The original training set is used as a validation set to select classifiers by their apparent error. Classifiers selected to be “nice” on this validation set are not necessarily nice for a test data set. When the number of objects in the validation set is small, relatively bad classifiers might be selected as “nice” bootstrap versions of the classifier, and the “nicely” bagged classifier constructed from these non-ideal classifiers will be bad as well. When the number of objects in the validation set increases, the validation set becomes more representative and the error estimated on it approaches the true classification error. The “nicely” bagged classifier then approaches the ideal classifier with the Bayes classification error. In this way, “nice” bagging constructs good classifiers for large training sample sizes.

When, for Gaussian correlated data, the FLD is stabilized by means of regularization with large values of the regularization parameter, the generalization error of the FLD decreases for small training sample sizes and increases for large training sample sizes. As “nice” bagging builds an ideal classifier for large training sample sizes, it neutralizes the negative effect of too strong regularization and decreases the generalization error of the regularized classifier to the level of the generalization error of the FLD, and even slightly improves on it.

Thus, the conclusions made in the two previous sections are valid for the RFLD as well. Bagging and “nice” bagging are more useful for less stable classifiers. If regularization deteriorates the generalization error of a classifier built on a large training set, “nice” bagging helps to neutralize this negative effect and decreases the generalization error of the regularized classifier.
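As a rough illustration of this construction, the sketch below builds B bootstrap versions of a linear classifier, keeps only those whose apparent error on the original training set is not larger than that of the classifier built on the full set, and averages their coefficients. The acceptance test (strict or non-strict inequality), the fallback when no replicate qualifies and the helper names are assumptions made for illustration; the exact definition of “nice” bagging is given in section 4.

```python
import numpy as np

def apparent_error(w, w0, X, y):
    """Fraction of training objects misclassified by the rule sign(X @ w + w0)."""
    return np.mean(np.sign(X @ w + w0) != y)

def nice_bagging(train, X, y, B=25, rng=None):
    """'Nice' bagging sketch: average the coefficients of bootstrap classifiers
    whose apparent error on the original training set does not exceed that of
    the classifier built on the full training set.

    train : callable (X, y) -> (w, w0), e.g. a wrapper around the rfld sketch;
    y     : class labels in {-1, +1}.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    w_ref, w0_ref = train(X, y)
    err_ref = apparent_error(w_ref, w0_ref, X, y)
    kept = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap replicate
        if len(np.unique(y[idx])) < 2:                   # need both classes present
            continue
        w_b, w0_b = train(X[idx], y[idx])
        if apparent_error(w_b, w0_b, X, y) <= err_ref:   # keep only the "nice" versions
            kept.append(np.append(w_b, w0_b))
    if not kept:                                         # fall back to the original classifier
        return w_ref, w0_ref
    coef = np.mean(kept, axis=0)                         # aggregate by averaging coefficients
    return coef[:-1], coef[-1]

# Hypothetical usage with the rfld sketch above, for labels y in {-1, +1}:
# train = lambda X, y: rfld(X[y == 1], X[y == -1], lam=20.0)
# w, w0 = nice_bagging(train, X, y, B=50)
```

Ordinary bagging corresponds to the same loop with the acceptance test removed, so that all B bootstrap classifiers are averaged.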


6.4. Instability and Performance of the Bagged Small Sample Size Classifier
In this section the usefulness of bagging for the Small Sample Size Classifier is studied for 30-dimensional Gaussian correlated data and 30-dimensional Non-Gaussian data. We do not consider the 256-dimensional cell data, because bagging on these data leads to a large computational burden. However, as the SSSC seems to be rather stable for these data (Fig. 10b), it is reasonable to conclude that bagging is useless for the SSSC in this case. Let us now consider the bagging experiments on the SSSC. For Gaussian correlated data (Fig. 20a,b) the SSSC is more stable than the NMC or the PFLD. However, bagging is still useful for critical training sample sizes.

[Fig. 20. The generalization error and the instability, measured by the training set, of the SSSC and the “bagged” SSSC for different numbers of bootstraps B (B = 2, 5, 20, 100) versus the training sample size, for 30-dimensional Gaussian correlated data (a,b) and 30-dimensional Non-Gaussian data (c,d).]

In spite of the increasing instability of the SSSC as the training sample size grows, bagging becomes less useful for large training sample sizes due to its negative shifting effect. For Non-Gaussian data (Fig. 20c,d) the SSSC is rather unstable for critical training sample sizes. The shifting effect of bagging is positive for small and critical training sample sizes up to N=15 per class and negative for training sets with more than 15 objects per class. Thus, bagging decreases the generalization error of the SSSC only for training sample sizes up to approximately N=20 per class, although the classifier is still rather unstable for training sample sizes up to 30-40 objects per class. Fig. 20b,d and Fig. 21b,d show once more that bagging does not stabilize the original classifier. As “nice” bagging builds non-ideal classifiers for small training sample sizes, and its shifting effect is smaller than that of ordinary bagging, it is less useful for Gaussian correlated data (Fig. 21a,b) and useless for Non-Gaussian data (Fig. 21c,d).

6.5. Instability and Performance of the Bagged Karhunen-Loeve’s Linear Classifier
The use of bagging for the Karhunen-Loeve’s Linear Classifier is studied in this section. As bagging seems to be useless for the KLLC built on the 256-dimensional cell data due to the high stability of this classifier (Fig. 10b), the simulation study of this case is skipped. For Gaussian correlated data the KLLC is rather stable for all training sample sizes (Fig. 22a,b); bagging is useless here and even deteriorates the generalization error of the original classifier. For Non-Gaussian data (Fig. 22c,d) the KLLC is unstable for critical sizes of the training set, so bagging is useful for these training sample sizes. Still, the effect of bagging is not large, because the KLLC builds its discriminant function in the subspace of principal components defined by some of the largest eigenvalues of the data distribution. In this way the KLLC ignores some informative features with small variances, and changes in the composition of the training set do not help much to build a better classifier. Bagging therefore cannot be very useful here. “Nice” bagging shows very similar results, and we do not present figures demonstrating its behavior.
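A minimal sketch of a Karhunen-Loeve-type linear classifier along these lines is given below: the data are projected onto the k eigenvectors of the estimated covariance matrix with the largest eigenvalues, and a Fisher-type rule is built in that subspace. The number of retained components k, the covariance estimate and the choice of rule in the subspace are assumptions made for illustration and need not coincide with the KLLC definition of section 3.

```python
import numpy as np

def kllc(X1, X2, k):
    """Karhunen-Loeve-type linear classifier (illustrative sketch).

    The data are projected onto the k eigenvectors of the pooled covariance
    estimate with the largest eigenvalues, and a Fisher-type discriminant is
    built in that subspace. Directions with small variance are discarded,
    which is why informative low-variance features can be lost.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Xc = np.vstack([X1 - m1, X2 - m2])                # class-centred training data
    S = Xc.T @ Xc / (len(Xc) - 2)                     # pooled covariance estimate
    vals, vecs = np.linalg.eigh(S)
    V = vecs[:, np.argsort(vals)[::-1][:k]]           # k leading principal directions
    S_k = V.T @ S @ V                                 # covariance in the subspace
    w_k = np.linalg.solve(S_k, V.T @ (m1 - m2))       # Fisher direction in the subspace
    w = V @ w_k                                       # back to the original feature space
    w0 = -0.5 * (m1 + m2) @ w
    return w, w0
```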

[Fig. 21. The generalization error and the instability, measured by the training set, of the SSSC and the “nicely bagged” SSSC for different numbers of bootstraps B (B = 2, 5, 20, 100) versus the training sample size, for 30-dimensional Gaussian correlated data (a,b) and 30-dimensional Non-Gaussian data (c,d).]

7. Conclusions

The bagging technique combines the benefits of bootstrapping and aggregating. The bootstrap method, its implementation on the computer, its usefulness for obtaining more accurate statistics, and its application to some real data analysis problems are nicely described by Efron and Tibshirani.(6) Bagging was introduced by Breiman(1) and investigated by a number of researchers. However, bagging was mainly studied for regression problems and much less in discriminant analysis (classification trees and the nearest neighbour classifier were considered). Breiman has found that bagging works well for unstable procedures. We have continued this investigation.

[Fig. 22. The generalization error and the instability, measured by the training set, of the KLLC and the “bagged” KLLC for different numbers of bootstraps B (B = 2, 10, 100 in a,b; B = 2, 5, 20, 100 in c,d) versus the training sample size, for 30-dimensional Gaussian correlated data (a,b) and 30-dimensional Non-Gaussian data (c,d).]

In the current paper bagging has been studied for linear classifiers such as the Nearest Mean Classifier, the Pseudo Fisher Linear Discriminant, the Regularized Fisher Linear Discriminant, the Small Sample Size Classifier and the Karhunen-Loeve’s Linear Classifier. We have also investigated the connection between the effect of bagging on the performance of classifiers and its effect on their stability. For this investigation a measure of the stability of the classifier, called the instability, was suggested. It is demonstrated that the performance and the instability of the classifier are related: the more stable the classifier, the smaller its generalization error. This has been shown nicely for the RFLD with different values of the regularization parameter (Fig. 3, Fig. 11 and Fig. 12).

It is shown that bagging is not a stabilizing technique, nor does it always improve the performance of the original classifier. The bagged classifier can be more as well as less stable than the original classifier; this depends on the number of bootstrap replicates used to construct the bagged classifier. However, the number of bootstrap replicates used should not be very large: increasing the number of bootstrap replications in bagging above a certain value no longer decreases the generalization error or the instability of the classifier. Therefore, when one applies bagging to linear classifiers, it is reasonable to use no more than 20-100 bootstrap replications (depending on the classifier and the data).

It was suggested to consider the instability of the classifier, measured by the training set, for predicting the possible usefulness of bagging. The instability measured by the training set reveals the situations in which the classifier is most sensitive to changes in the composition of the training set. Bagging uses bootstrap replicates of the training set, which change its composition. In this way, the instability measured by the training set and the usefulness of bagging are related. It is shown in a number of examples that bagging improves the performance of the classifier only when the classifier is extremely unstable. When the classifier is rather stable, bagging is useless.

Our simulation study revealed that bagging has a shifting effect. Because the number of training objects actually used is reduced when one uses bootstrap replicates of the training set, the bagged classifier is close to a classifier built on a smaller training sample set.

So, the generalization error of the bagged classifier is shifted, with reference to the generalization error of the original classifier, in the direction of the generalization error of the classifier built on a smaller training sample set. Therefore, in order to predict the usefulness of bagging for a classifier, it is necessary to take into consideration both the instability and the peculiarities of the classifier. As different linear classifiers have different stabilities and behave differently on different kinds of data and for different training sample sizes, bagging does not work in the same way in all situations. A review of the use of bagging for linear classifiers in different situations was given in section 6.

A modification of bagging, called “nice” bagging, has also been studied. It is shown that the “nicely” bagged classifier is more stable than the bagged classifier, and that the shifting effect of “nice” bagging is smaller. Sometimes “nice” bagging may be preferable to bagging. However, in general “nice” bagging does not give better results than bagging.
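The instability measure itself is defined in section 5; purely to illustrate how such a diagnostic might be used in practice, the sketch below computes a plausible proxy (the average disagreement, on the training set, between the original classifier and classifiers built on bootstrap replicates) and suggests applying bagging only when this proxy is large. The proxy, the threshold and the function names are assumptions and do not reproduce the paper's measure or decision rule.

```python
import numpy as np

def training_set_instability(train, X, y, B=20, rng=None):
    """Proxy for the instability measured by the training set: the average
    fraction of training objects on which a classifier built on a bootstrap
    replicate disagrees with the classifier built on the full training set.
    (Illustrative only; section 5 defines the measure actually used.)
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    w, w0 = train(X, y)
    labels = np.sign(X @ w + w0)                  # decisions of the original classifier
    disagreement = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap replicate of the training set
        if len(np.unique(y[idx])) < 2:            # skip degenerate replicates
            continue
        w_b, w0_b = train(X[idx], y[idx])
        disagreement.append(np.mean(np.sign(X @ w_b + w0_b) != labels))
    return float(np.mean(disagreement)) if disagreement else 0.0

# Hypothetical usage: bag only when the diagnostic is large.
# if training_set_instability(train, X, y) > 0.1:   # threshold is an arbitrary choice
#     w, w0 = nice_bagging(train, X, y, B=50)       # 20-100 replicates, as recommended above
```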

References
1. L. Breiman, Bagging predictors, Machine Learning 24, no. 2, 123-140 (1996).
2. A.K. Jain and B. Chandrasekaran, Dimensionality and sample size considerations in pattern recognition practice, Handbook of Statistics, vol. 2, ed. P.R. Krishnaiah and L.N. Kanal, North-Holland, Amsterdam, 835-855 (1987).
3. S. Raudys and A.K. Jain, Small sample size effects in statistical pattern recognition: Recommendations for practitioners, IEEE Transactions on Pattern Analysis and Machine Intelligence 13, no. 3, 252-264 (1991).
4. R.P.W. Duin, On the Accuracy of Statistical Pattern Recognizers, Thesis, Delft University of Technology (1978).
5. P.J. Rousseeuw and A.M. Leroy, Robust Regression and Outlier Detection, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, New York (1987).
6. B. Efron and R. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall, New York (1993).
7. R. Tibshirani, Bias, variance and prediction error for classification rules, Technical Report, University of Toronto, Canada (1996).
8. D.H. Wolpert and W.G. Macready, An efficient method to estimate bagging's generalization error, Technical Report, Santa Fe Institute, Santa Fe (1996).
9. H. Netten, I.T. Young, M. Prins, L.J. van Vliet, H.J. Tanke, J. Vrolijk and W. Sloos, Automation of fluorescent dot counting in cell nuclei, Proceedings of the 12th International Conference on Pattern Recognition, vol. 1, Jerusalem, 84-87 (1994).
10. A. Hoekstra, H. Netten and D. de Ridder, A neural network applied to spot counting, Proceedings of ASCI'96, the Second Annual Conference of the Advanced School for Computing and Imaging, Lommel, Belgium, 224-229 (1996).
11. R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7, no. 2 (1936).
12. R.A. Fisher, The precision of discriminant functions, Annals of Eugenics 10, no. 4 (1940).
13. C.R. Rao, On some problems arising out of discrimination with multiple characters, Sankhya 9, 343-365 (1949).
14. S. Raudys and V. Pikelis, On dimensionality, sample size, classification error and complexity of classification algorithm in pattern recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-2, no. 3, 242-252 (1980).
15. S. Raudys, Statistical Pattern Recognition. Small Design Sample Problems, Institute of Mathematics and Cybernetics, Vilnius, 1-460 (1984). A manuscript.
16. S.A. Aivazian, V.M. Buchstaber, I.S. Yenyukov and L.D. Mechalkin, Applied Statistics. Classification and Reduction of Dimensionality, Reference Edition, Finansy i Statistika, Moscow (1989).
17. J.H. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association 84, 165-175 (1989).
18. R. Peck and J. Van Ness, The use of shrinkage estimators in linear discriminant analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 4, 530-537 (1982).
19. W.-L. Loh, On linear discriminant analysis with adaptive ridge classification rules, Journal of Multivariate Analysis, no. 53, 264-278 (1995).
20. D.M. Barsov, Classification error minimization using biased discriminant functions, Statistics, Probability, Economics, Nauka, Moscow, 376-379 (1985). (In Russian)
21. V.I. Serdobolskij, About minimal probability of misclassification in discriminant analysis, Lectures of USSR Academy of Sciences 270, no. 5, 1066-1070 (1983).
22. S. Raudys and M. Skurikhina, Small sample properties of ridge-estimate of covariance matrix in statistical and neural net classification, New Trends in Probability and Statistics, vol. 3, Tartu, Estonia, 237-245 (1994).
23. R.P.W. Duin, Small sample size generalization, Proceedings of the 9th Scandinavian Conference on Image Analysis, Uppsala, Sweden (1995).
24. C. Cortes and V. Vapnik, Support-vector networks, Machine Learning 20, 273-297 (1995).
25. K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 400-407 (1990).
26. J. Kittler, M. Hatef and R.P.W. Duin, Combining classifiers, Proceedings of ICPR, Vienna, Austria, 897-901 (1996).
27. J.S. Rao and R. Tibshirani, The out-of-bootstrap method for model averaging and selection, Report, University of Toronto, Canada (1997).
28. M. Taniguchi and V. Tresp, Averaging regularized estimators, Neural Computation 9, 1163-1178 (1997).
29. M. Skurichina and R.P.W. Duin, Stabilizing classifiers for very small sample sizes, Proceedings of ICPR, Vienna, Austria, 891-896 (1996).
