Evolved Feature Weighting for Random Subspace Classifier






Loris Nanni and Alessandra Lumini


Abstract—The problem addressed in this letter concerns multiclassifier generation by the random subspace method (RSM). In the RSM, the classifiers are constructed in random subspaces of the data feature space. In this letter, we propose an evolved feature weighting approach: in each subspace, the features are multiplied by a weight factor chosen to minimize the error rate on the training set. An efficient method based on particle swarm optimization (PSO) is proposed for finding a set of weights for each feature in each subspace. The performance improvement with respect to state-of-the-art approaches is validated through experiments with several benchmark data sets.

Index Terms—Ensemble generation, feature weighting, nearest neighbor, particle swarm optimization (PSO).

I. INTRODUCTION

In several classification problems, in particular when the data are high dimensional and the number of training samples is small compared to the data dimensionality, it may be difficult to construct a good classification rule. In fact, a classifier constructed on a small training set usually has poor performance. The literature shows that a good approach to improving a weak classifier is to construct a strong multiclassifier that combines many weak classifiers with a powerful decision rule [10]–[12]. The most studied approaches for creating a multiclassifier are bagging, boosting, and random subspace. Bagging builds a multiclassifier in which each classifier is trained on a different subset of the training set, each subset being extracted randomly with replacement from the original training set [13]. Boosting is an iterative approach in which each new classifier is focused on the training patterns that are misclassified by the previous classifiers [14]. The random subspace method (RSM) is based on selecting a random feature subset for each classifier [15].

In machine learning, feature weighting before classification is performed with the purpose of revealing the ability of each feature to distinguish pattern classes (see [7] for a survey). The weights of the features are usually determined following two groups of approaches: through an iterative algorithm that evaluates, in a supervised manner, the performance of a classifier as feedback to select a new set of weights [5], [9]; or following a preexisting model's bias, e.g., conditional probabilities, class projection, and mutual information [6], [8].

In this letter, we investigate how particle swarm optimization (PSO) can be applied to generate an optimal set of feature weights for pattern classification, to be coupled with a random subspace approach. PSO [1] is a relatively novel evolutionary computation technique based on the analysis of the social behavior of biological organisms, such as the movement of bird flocks [2]. The basic idea for applying PSO to random subspace is the following: for each feature in each subspace, we initially assign a random weight between 0 and 1; in this way, the dimension of each particle of the PSO problem is (number of features in each subspace) × (number of subspaces).

Manuscript received January 29, 2007; revised April 13, 2007 and June 28, 2007; accepted July 15, 2007. The authors are with DEIS, IEIIT—CNR, Università di Bologna, 40136 Bologna, Italy (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this letter are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2007.910737
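For readers who want to see the optimizer in isolation, the following minimal sketch implements the canonical PSO velocity and position updates described in [1]; the inertia and acceleration values, the function names, and the clipping of positions to [0, 1] are illustrative assumptions, not the exact configuration used in this letter.

```python
import numpy as np

def pso_minimize(fitness, dim, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimizer (canonical update rule).

    fitness: callable mapping a position vector of shape (dim,) to a scalar cost.
    All hyperparameter defaults here are illustrative, not the letter's settings.
    """
    rng = np.random.default_rng(seed)
    pos = rng.uniform(0.0, 1.0, size=(n_particles, dim))   # positions in [0, 1]
    vel = np.zeros_like(pos)
    pbest = pos.copy()                                      # personal bests
    pbest_cost = np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_cost.argmin()].copy()               # global best

    for _ in range(n_iter):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)                  # keep weights in [0, 1]
        cost = np.array([fitness(p) for p in pos])
        improved = cost < pbest_cost
        pbest[improved], pbest_cost[improved] = pos[improved], cost[improved]
        gbest = pbest[pbest_cost.argmin()].copy()
    return gbest, pbest_cost.min()
```

In our setting, `fitness` would be the ensemble error rate described in Section II and `dim` the total number of feature weights.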




Fig. 1. System proposed.

Fig. 2. Numeric example of our idea.

Then, the weights of these features are adjusted using PSO¹ so that the classification error rate of the ensemble on the training set is minimized. The classifiers trained using this weighted random subspace are combined by the majority vote rule [4].

This letter is organized as follows. In Section II, the new technique for ensemble generation is described. In Section III, experimental results are presented. Finally, in Section IV, some concluding remarks are given.

II. SYSTEM PROPOSED

In the RSM, each individual classifier uses only a subset of all the features for training and testing. Given a training set T containing Q features, the RSM generates M new training sets T_1, ..., T_M; each new set T_i, which contains Q × K features (0 < K < 1), is used to train exactly one classifier (a 1-nearest neighbor). Finally, these classifiers are combined by the majority vote rule [4]. Our aim is to find an optimal set of feature weights for the random subspaces in order to calculate the distance between two patterns x and y as a weighted Euclidean distance

$$ d_{\mathbf{w}}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{Q} w(i)\,\bigl(x(i) - y(i)\bigr)^{2} $$

where w are the weights of the features. An initial set W_0 = {w_1, ..., w_{Q×K×M}} of weights is obtained by randomly extracting Q × K × M random weights between 0 and 1. Then, the values of W_0 are iteratively updated using PSO¹. We represent the particle's position as a string of length S = Q × K × M. In this letter, we run PSO with 20 particles for 100 iterations, and we adopt as fitness function the classification error rate of the ensemble estimated by fivefold cross validation on the training set.

A scheme of our method is reported in Fig. 1. In Fig. 2, a numeric example of our idea is shown: a simple example of a 4-D feature space from which we extract two bidimensional random subspaces. The first random subspace contains the first two features of the original space; the second contains the third and fourth original features. Given a pattern x_1 = [1 2 3 4], the projection of this pattern onto the first random subspace is [1 2] and the projection of x_1 onto the second random subspace is [3 4].

¹It is implemented as in the PSO Matlab Toolbox; it is available at http://www.psotoolbox.sourceforge.net
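To make the construction concrete, the sketch below (NumPy only) builds the M weighted random subspaces, classifies with a weighted 1-NN in each, and combines the decisions by majority vote; the helper names, the single validation split, and the assumption of nonnegative integer class labels are ours, whereas the letter estimates the fitness by fivefold cross validation on the training set.

```python
import numpy as np

def make_subspaces(Q, M, K=0.5, seed=0):
    """Draw M random feature subsets, each of size round(Q * K)."""
    rng = np.random.default_rng(seed)
    q = int(round(Q * K))
    return [rng.choice(Q, size=q, replace=False) for _ in range(M)]

def weighted_1nn_predict(X_train, y_train, X_test, w):
    """1-NN using the weighted (squared) Euclidean distance of Section II."""
    preds = []
    for x in X_test:
        d = np.sum(w * (X_train - x) ** 2, axis=1)
        preds.append(y_train[np.argmin(d)])
    return np.array(preds)

def ensemble_error(weights, subspaces, X_train, y_train, X_val, y_val):
    """Error rate of the majority-vote ensemble; this plays the role of the PSO fitness.

    `weights` is a flat vector with one weight per feature per subspace
    (the particle of the PSO). Class labels are assumed to be integers >= 0.
    """
    votes = []
    start = 0
    for feats in subspaces:
        w = weights[start:start + len(feats)]
        start += len(feats)
        votes.append(weighted_1nn_predict(X_train[:, feats], y_train,
                                          X_val[:, feats], w))
    votes = np.array(votes)                                 # shape (M, n_val)
    majority = np.array([np.bincount(col).argmax() for col in votes.T])
    return np.mean(majority != y_val)
```

A PSO run would then minimize `ensemble_error` over the flat weight vector, for instance by passing `lambda w: ensemble_error(w, subspaces, X_tr, y_tr, X_val, y_val)` as the fitness of an optimizer such as the one sketched in Section I.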

III. EXPERIMENTS

We performed experiments in order to compare the classification performance of our PSO-based approach with that of the standard RSM and of other nearest-neighbor-based classifiers. The experiments were conducted on ten benchmark data sets from the University of California at Irvine (UCI) Repository²: Heart, Sonar, Pima, Ionosphere, Breast, Vehicle, Vowel, WDBC, Creditg, and Wine. As suggested by many classification approaches, all the data sets have been normalized between 0 and 1. To reduce the bias that could be introduced by a particular training partition, the results on each data set have been averaged over ten experiments. For each experiment, we randomly resampled the learning set and the test set (each containing half of the patterns), maintaining the class distribution of the patterns.

Table I reports the performance obtained by the following:
— NN, the simple 1-nearest neighbor classifier;
— CNN, the center-based nearest neighbor [3], a recent nearest-neighbor-based classifier;

²http://www.ics.uci.edu/mlearn/MLRepository.html



TABLE I ERROR RATE OF THE METHODS TESTED

Fig. 3. Error rate obtained by varying the value of N in WS(20 × N).

— RS(M), a random subspace of NN, where K = 0.5;
— WS(M), our modified random subspace of NN, where K = 0.5; the parameters of PSO (see [2] for details) are Vmax = 0.25, c1 = 1, c2 = 3, Wstart = 1.5, and Wend = 0.5;
— WS(M × N), the combination by vote rule of N WS(M) classifiers.

The difference between WS(20 × 5) and WS(100) is that the search space in WS(20 × 5) contains 20 elements, while in WS(100) it contains 100 elements. The results reported in Table I prove that our ensemble generation method makes it possible to find a good discriminative set of classifiers. These results show the following:
— the same parameter configuration works well on several data sets;
— WS outperforms both RS and CNN (a recent state-of-the-art nearest-neighbor-based classifier);
— the only data set where WS does not work very well is the Heart data set; in our opinion, this is due to the low cardinality of this data set (the training set is too small for a trained method such as WS).

In Fig. 3, we plot the performance of WS(20 × N) for varying values of N. We report the results on the Heart data set [Fig. 3(a)] and on the Wisconsin Diagnostic Breast Cancer (WDBC) data set [Fig. 3(b)]. It is interesting to note that in both data sets the performance is similar for N ≥ 5; in both data sets, however, the best performance is obtained with N = 15.

As a further experiment, we ran the Wilcoxon signed-rank test [16] to compare the results of RS(100) and WS(20 × 5). The null hypothesis is that there is no difference between the accuracies of the two classifiers [17]. We reject the null hypothesis (level of significance 0.05) and accept that the two classifiers have significantly different accuracies: the test yields a value of z of −1.9868, and the null hypothesis can be rejected if z is smaller than −1.96. This result confirms that WS outperforms RS (a minimal sketch of this comparison is given after Section IV).

IV. CONCLUSION

The problem addressed in this letter is ensemble generation. We have proposed a method based on random subspace and feature weighting, where the weights are refined by the PSO algorithm.

The performance improvement with respect to other state-of-the-art approaches, in terms of error rate, is confirmed by a wide set of experiments. As future work, we plan to improve the effectiveness of our approach by combining bagging and random subspace. Moreover, we want to study different fitness functions: for example, we could consider not only the average error rate obtained using fivefold cross validation on the training set but also the standard deviation of the error rate.
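As a complement to the significance test of Section III, a Wilcoxon signed-rank comparison of two classifiers' per-dataset error rates can be reproduced along the following lines; the error-rate arrays below are placeholders and not the values of Table I.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-dataset error rates of RS(100) and WS(20 x 5);
# in the letter these would be the ten entries of Table I.
err_rs = np.array([0.21, 0.14, 0.27, 0.09, 0.03, 0.31, 0.02, 0.05, 0.28, 0.04])
err_ws = np.array([0.22, 0.12, 0.25, 0.08, 0.03, 0.29, 0.01, 0.04, 0.27, 0.03])

# Two-sided test of the null hypothesis "no difference in accuracy";
# reject at the 0.05 level when the p-value is below 0.05.
stat, p_value = wilcoxon(err_rs, err_ws)
print(f"W = {stat:.1f}, p = {p_value:.4f}, reject H0: {p_value < 0.05}")
```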

REFERENCES
[1] J. Kennedy and R. C. Eberhart, “Particle swarm optimization,” in Proc. IEEE Int. Conf. Neural Netw., Perth, Australia, 1995, pp. 1942–1948.
[2] X. Wang, J. Yang, X. Teng, W. Xia, and R. Jensen, “Feature selection based on rough sets and particle swarm optimization,” Pattern Recognit. Lett., vol. 28, no. 4, pp. 459–471, Mar. 2007.
[3] Q.-B. Gao and Z.-Z. Wang, “Center-based nearest neighbor classifier,” Pattern Recognit., vol. 40, no. 1, pp. 346–349, Jan. 2007.
[4] J. Kittler, M. Hatef, R. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[5] M. L. Raymer et al., “Dimensionality reduction using genetic algorithms,” IEEE Trans. Evol. Comput., vol. 4, no. 2, pp. 164–171, Apr. 2000.
[6] R. Paredes and E. Vidal, “A class-dependent weighted dissimilarity measure for nearest neighbor classification problems,” Pattern Recognit. Lett., vol. 21, no. 12, pp. 1027–1036, 2000.
[7] D. Wettschereck, D. W. Aha, and T. Mohri, “A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms,” Artif. Intell. Rev., vol. 11, no. 1–5, pp. 273–314, 1997.
[8] C. Domeniconi, J. Peng, and D. Gunopulos, “Locally adaptive metric nearest-neighbor classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1281–1285, Sep. 2002.
[9] M. A. Tahir, A. Buridane, and F. Kurugollu, “Simultaneous feature selection and feature weighting using hybrid tabu search/K-nearest neighbor classifier,” Pattern Recognit. Lett., vol. 28, pp. 438–446, 2007.
[10] D. Opitz and R. Maclin, “Popular ensemble methods: An empirical study,” J. Artif. Intell. Res., vol. 11, pp. 169–198, 1999.
[11] J. Kittler, “On combining classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[12] H. Altincay and M. Demirekler, “An information theoretic framework for weight estimation in the combination of probabilistic classifiers for speaker identification,” Speech Commun., vol. 30, no. 4, pp. 255–272, 2000.



[13] L. Breiman, “Bagging predictors,” Mach. Learn., vol. 24, pp. 123–140, 1996.
[14] R. E. Schapire, “The boosting approach to machine learning: An overview,” in Proc. MSRI Workshop Nonlinear Estim. Classif., Berkeley, CA, 2002, pp. 1–23.
[15] T. K. Ho, “The random subspace method for constructing decision forests,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 8, pp. 832–844, Aug. 1998.
[16] J. Demsar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., vol. 7, pp. 1–30, 2006.
[17] L. Kuncheva, Combining Pattern Classifiers. New York: Wiley, 2004.

Stability Analysis of Markovian Jumping Stochastic Cohen–Grossberg Neural Networks With Mixed Time Delays

Huaguang Zhang and Yingchun Wang

Abstract—In this letter, the global asymptotic stability analysis problem is considered for a class of Markovian jumping stochastic Cohen–Grossberg neural networks (CGNNs) with mixed delays including discrete delays and distributed delays. An alternative delay-dependent stability analysis result is established based on the linear matrix inequality (LMI) technique, which can easily be checked by utilizing the numerically efficient Matlab LMI toolbox. Neither system transformation nor free-weight matrices via the Newton–Leibniz formula are required. Two numerical examples are included to show the effectiveness of the result.

Index Terms—Cohen–Grossberg neural networks (CGNNs), delay-dependent criteria, linear matrix inequality (LMI), Markovian jumping, mixed delay.

I. INTRODUCTION

Since the Cohen–Grossberg neural network (CGNN) [1] was initially proposed by Cohen and Grossberg in 1983, it has attracted attention due to its wide application in various engineering and scientific fields, such as neurobiology, population biology, and computing technology. Based on the form of [1], many new CGNN models with time delays have been proposed, considering the finite switching speeds of the amplifiers and the transmission of signals in a network. Correspondingly, stability analysis problems have attracted increasing research interest, and many results that guarantee the asymptotic, exponential, or absolute stability of delayed CGNNs have been reported in the literature (see [2]–[12]). The distributed delays were studied in [10]–[12] due

Manuscript received March 14, 2007; revised July 25, 2007; accepted August 23, 2007. This work was supported by the National Natural Science Foundation of China under Grants 60534010, 60572070, 60728307, and 60774048, the Funds for Creative Research Groups of China under Grant 60521003, the Program for Chueng Kong Scholars and Innovative Research Team in University under Grant IRT0421, and the National High Technology Research and Development Program under Grant 2006AA04Z183. H. Zhang is with the Automatic Control Department, Northeastern University, Shenyang 110004, China and also with the Key Laboratory of Integrated Automation of Process Industry, Northeastern University, National Education Ministry, Shenyang 110004, China (e-mail: [email protected]; [email protected]). Y. Wang is with the School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning 110004, China (e-mail: [email protected]). Digital Object Identifier 10.1109/TNN.2007.910738

to the spatial nature of a neural network (NN), which in many cases has a multitude of parallel pathways with a variety of axon sizes and lengths. Moreover, in real nervous systems, synaptic transmission is a noisy process brought on by random fluctuations from the release of neurotransmitters and other probabilistic causes; it has been realized that an NN can be stabilized or destabilized by certain stochastic inputs [13], [14]. In particular, the stability criteria for stochastic delayed NNs have become an attractive research problem of prime importance [12], [15], [16].

On the other hand, the information latching phenomenon usually happens in NNs [17], which can be dealt with by extracting a finite state representation from the trained network. In other words, the NN may have finite modes, the modes may switch from one to another at different times, and it has been shown that the switching between different NN modes can be governed by a Markov chain. Such a kind of NN is of great significance; see the recent results [18]–[22], where [18]–[20] considered time delays simultaneously and the results given are all independent of the size of the time delay elements. Moreover, it should be pointed out that, up to now, the stability analysis problem for stochastic CGNNs with Markovian switching and time delays has received very little research attention, despite its practical importance.

In this letter, we deal with the delay-dependent global asymptotic stability analysis problem for a class of Markovian jumping stochastic CGNNs with mixed delays. To the best of the authors' knowledge, this is the first time that Markovian jumping stochastic CGNNs are introduced and studied. It is pointed out that delayed NNs have been widely studied in the literature, such as [23]–[29], where some free-weight matrices were introduced by the Newton–Leibniz formula in [23] and a descriptor system transformation approach was employed in [24] for obtaining delay-dependent results. However, there exist few delay-dependent results for stochastic delayed CGNNs based on LMIs. Here, we recast a new delay-dependent stability analysis approach by introducing an additional integral quadratic function in the Lyapunov functional and employing an integral inequality technique; neither system transformation nor free-weight matrices via the Newton–Leibniz formula are required. In the infinitesimal operator of the Lyapunov functional, more information of the system, namely the nonlinear activation terms, is retained, which leads to a simplification of the analysis process and a reduction of the conservativeness of the results. Two numerical examples are used to show the usefulness of the derived LMI-based stability conditions.

Throughout this letter, we let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, P)$ be a complete probability space with a filtration $\{\mathcal{F}_t\}_{t \ge 0}$ satisfying the usual conditions (i.e., the filtration contains all $P$-null sets and is right continuous). Let $h > 0$ and let $C([-h, 0]; \mathbb{R}^n)$ denote the family of all continuous $\mathbb{R}^n$-valued functions on $[-h, 0]$. Let $C_{\mathcal{F}_0}([-h, 0]; \mathbb{R}^n)$ be the family of all $\mathcal{F}_0$-measurable bounded $C([-h, 0]; \mathbb{R}^n)$-valued random variables $\xi = \{\xi(\theta) : -h \le \theta \le 0\}$. For real symmetric matrices $A$ and $B$, the notation $A \ge B$ (respectively, $A > B$) means that $A - B$ is positive semidefinite (respectively, positive definite). The superscript "$T$" represents the transpose. $E$ is the mathematical expectation operator; $|\cdot|$ denotes the Euclidean norm of a vector or the spectral norm of a matrix; and $\lambda_{\min}(\cdot)$ stands for the minimum eigenvalue of a matrix.
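The LMI-based conditions referred to above are checked numerically in the letter with the Matlab LMI toolbox. Purely as an illustration of how such a feasibility check looks in code, the sketch below tests a generic Lyapunov LMI for a delay-free linear system using the Python package cvxpy; the system matrix and tolerance are hypothetical, and this is not the delay-dependent condition derived in this letter.

```python
import numpy as np
import cvxpy as cp

A = np.array([[-2.0, 0.5], [0.3, -1.5]])   # hypothetical stable system matrix
n = A.shape[0]
eps = 1e-3

# Find P = P^T > 0 such that A^T P + P A < 0 (classic Lyapunov LMI).
P = cp.Variable((n, n), symmetric=True)
constraints = [P >> eps * np.eye(n),
               A.T @ P + P @ A << -eps * np.eye(n)]
prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve(solver=cp.SCS)
print("LMI feasible:", prob.status == cp.OPTIMAL)
```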
II. PROBLEM FORMULATION AND ASSUMPTIONS

In this letter, the CGNNs with mixed time delays are described as follows:

$$ \frac{du(t)}{dt} = -a(u(t)) \left[ b(u(t)) - A g_1(u(t)) - B g_2(u(t-h)) - C \int_{t-\tau}^{t} g_3(u(s))\, ds + U \right] \qquad (1) $$
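Although the letter analyzes (1) theoretically through LMIs, a quick way to build intuition for the mixed-delay dynamics is to integrate a small instance numerically. The sketch below uses forward Euler with a constant initial history; the functions a, b, g1, g2, g3, the matrices A, B, C, the input U, and the delay values are all hypothetical choices, not those of the numerical examples mentioned in the abstract.

```python
import numpy as np

# Hypothetical two-neuron instance of (1); every choice below is illustrative.
A = np.array([[0.3, -0.2], [0.1, 0.2]])
B = np.array([[-0.2, 0.1], [0.2, -0.3]])
C = np.array([[0.1, 0.0], [0.0, 0.1]])
U = np.array([0.5, -0.5])
h, tau = 0.5, 0.3                             # discrete and distributed delays
dt, T = 0.001, 10.0

a  = lambda u: 1.0 + 0.1 * np.tanh(u)         # amplification function a(u) > 0
b  = lambda u: 1.5 * u                        # behaved function b(u)
g1 = g2 = g3 = np.tanh                        # activation functions

n_h, n_tau = int(h / dt), int(tau / dt)
steps = int(T / dt)
u = np.zeros((steps + 1, 2))
u[0] = [0.8, -0.6]                            # constant initial history assumed

for k in range(steps):
    u_del = u[max(k - n_h, 0)]                           # u(t - h)
    hist  = u[max(k - n_tau, 0): k + 1]                  # u(s), s in [t - tau, t]
    integral = np.trapz(g3(hist), dx=dt, axis=0)         # ~ integral of g3(u(s)) ds
    rhs = b(u[k]) - A @ g1(u[k]) - B @ g2(u_del) - C @ integral + U
    u[k + 1] = u[k] + dt * (-a(u[k]) * rhs)              # forward Euler step

print("state at t =", T, ":", u[-1])
```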