Using kNN Model for Automatic Feature Selection

Gongde Guo1, Daniel Neagu1 and Mark T.D. Cronin2

1 Department of Computing, University of Bradford, Bradford, BD7 1DP, UK
{G.Guo, D.Neagu}@Bradford.ac.uk
2 School of Pharmacy and Chemistry, Liverpool John Moores University, L3 3AF, UK
[email protected]

Abstract. This paper proposes a kNN model-based feature selection method aimed at improving the efficiency and effectiveness of the ReliefF method by: (1) using a kNN model as the starter selection to choose a set of more meaningful representatives to replace the original data for feature selection; (2) integrating the Heterogeneous Value Difference Metric to handle heterogeneous applications – those with both ordinal and nominal features; and (3) presenting a simple method of difference function calculation based on the inductive information in each representative obtained by the kNN model. We have evaluated the performance of the proposed kNN model-based feature selection method on the Phenols toxicity dataset with two different endpoints. Experimental results indicate that the proposed feature selection method yields a significant improvement in classification accuracy on the trial dataset.

1 Introduction

The success of applying machine learning methods to real-world problems depends on many factors. One such factor is the quality of the available data. The more the collected data contain irrelevant or redundant information, or noisy and unreliable information, the more difficult it is for any machine learning algorithm to discover or obtain acceptable and practicable results. Feature subset selection is the process of identifying and removing as much of the irrelevant and redundant information as possible. Regardless of whether a learner attempts to select features itself, or ignores the issue, feature selection prior to learning has obvious merits [4]: (1) reduction of the size of the hypothesis space allows algorithms to operate faster and more effectively; (2) a more compact, easily interpreted representation of the target concept can be obtained; and (3) an improvement in classification accuracy can be achieved in some cases.

Feature selection methods are commonly divided into two broad categories: wrapper methods [6] and filter methods [2]. Wrapper methods usually employ a statistical re-sampling technique, using the actual target learning algorithm to estimate the accuracy of feature subsets. The wrapper model tends to give superior performance as it finds features better suited to the predetermined learning algorithm.

The main problems of this approach are its relatively low efficiency, especially for large datasets, and its dependency on the learning algorithm. Filters, on the other hand, operate independently of any learning algorithm; they make use of all the available training data only when commencing feature selection. When the training data become very large, the filter model is usually a good choice due to its computational efficiency and neutral bias toward any learning algorithm [9]. Many feature selection algorithms [12, 17, 18] have been developed to answer challenging research issues, from handling a huge number of instances and large dimensionality to dealing with data without class labels.

The aim of this study was to investigate an optimised approach for feature selection, termed kNNMFS (kNN Model-based Feature Selection), which augments the typical feature subset selection algorithm ReliefF [8]. The resulting algorithm was run on toxicity data for phenols to assess the effect of reducing the training data.

2 Related Work

The basic concept of Relief was introduced initially by Kira and Rendell in 1992 [7]. It is a feature weight-based algorithm inspired by instance-based learning algorithms. It estimates the quality of features according to how well their values distinguish between instances of the same and of different classes that are near each other [9]. The Relief family of algorithms, e.g. Relief, ReliefF and RReliefF, are feature subset selection methods that are applied in a pre-processing step before the model is learned, and are amongst the most successful such algorithms [10]. The majority of heuristic measures for estimating the quality of attributes assume conditional independence of the attributes (given the target variable) and are therefore less appropriate for problems that may involve much feature interaction. Relief algorithms do not make this assumption. They are efficient, aware of contextual information, and can estimate the quality of attributes correctly in problems with strong dependencies between attributes [5].

The Relief algorithm works by randomly sampling an instance and locating its nearest neighbour from the same and from the opposite class. The values of the features of these nearest neighbours are compared to those of the sampled instance and used to update the relevance score of each feature. This process is repeated for a user-specified number of instances. The rationale is that a useful feature should differentiate between instances from different classes and have the same value for instances from the same class [5]. A major limitation of Relief is that it does not help with reducing redundant features. Another limitation is that it works only for binary-class problems. These drawbacks are overcome by ReliefF.

ReliefF, an extension of Relief, aims to handle multi-class, noisy and incomplete datasets. It smooths the influence of noise in the data by averaging the contribution of the k nearest neighbours from the same and the opposite classes of each sampled instance, instead of the single nearest neighbour used by Relief, which ensures greater robustness of the algorithm with regard to noise. The user-defined parameter k controls the locality of the estimates. Multi-class datasets are handled by finding the nearest neighbours from each class other than that of the currently

sampled instance, and weighting their contribution by the prior probability of each class estimated from the training data [8].

One major drawback of ReliefF is its limited ability to deal with multi-valued attributes. ReliefF uses a numerically oriented similarity measure and copes with categorical attributes by a simple strategy: the difference function is assigned 0 for two categorical values that are the same and 1 for two categorical values that differ. This simple strategy cannot, however, appropriately measure the contribution of each discrete value of a categorical attribute to the class label. Another drawback of ReliefF is the setting of a suitable number of instances to sample from the dataset. The number of randomly selected instances is usually set empirically, or set to the size of the entire dataset.

RReliefF is a further extension of ReliefF to deal with regression problems, where the predicted value is continuous and therefore nearest hits and misses cannot be used. To overcome this difficulty, instead of requiring exact knowledge of whether two instances belong to the same class or not, a kind of probability that the predicted values of two instances are different is introduced [11]. This probability can be modelled with the relative distance between the predicted (class) values of the two instances.

FSSMC (Feature Selection via Supervised Model Construction) [5] is an attempt to deal with the problems of ReliefF described above. FSSMC chooses a set of more meaningful representatives (the centre of each cluster) to replace the whole dataset, and these serve as the basis for further feature selection. As the number of chosen representatives can be reduced to only 10 percent of the original dataset on average [3], it is computationally faster than ReliefF. Moreover, it applies a frequency-based encoding scheme to transform categorical data into numerical data in order to cope with multi-valued attributes. Although FSSMC improves the computational efficiency compared to ReliefF, there is no significant classification accuracy improvement of FSSMC on most trial datasets [5]. The problem probably lies in the manner in which it randomly chooses seeds for grouping clusters, which generates a set of less than optimal representatives for feature selection. Moreover, noise in the data affects the generated representatives in both quality and quantity, e.g. the number of representatives.

The basic idea of the kNN model-based classification method (kNNModel) [3] is to find a set of more meaningful representatives of the complete dataset to serve as the basis for further classification. Each chosen representative di is represented in the form <Cls(di), Sim(di), Num(di), Rep(di)>, whose elements respectively represent the class label of di; the similarity of di to the furthest instance among the instances covered by Ni; the number of instances covered by Ni; and a representation of instance di. The symbol Ni represents the area within which the distance to di is less than or equal to Sim(di). kNNModel can generate a set of optimal representatives by learning inductively from the dataset.
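To make the Relief update described above concrete, the following minimal Python sketch implements the basic binary-class, numeric-attribute version with Euclidean nearest-hit/nearest-miss search. It is a simplified illustration, not code from any of the cited papers; the function name relief_weights and the range-based normalisation are our own choices.

```python
import numpy as np

def relief_weights(X, y, n_samples=100, random_state=0):
    """Minimal Relief sketch: two classes, numeric attributes only.

    X: (n, p) feature matrix; y: (n,) class labels.
    Returns one weight per attribute; larger means more relevant.
    """
    rng = np.random.default_rng(random_state)
    n, p = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # normalises diff() to [0, 1]
    W = np.zeros(p)
    for _ in range(n_samples):
        i = rng.integers(n)
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                          # exclude the sampled instance itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dists, np.inf))    # nearest hit
        miss = np.argmin(np.where(~same, dists, np.inf))  # nearest miss
        diff_hit = np.abs(X[i] - X[hit]) / span
        diff_miss = np.abs(X[i] - X[miss]) / span
        W += (diff_miss - diff_hit) / n_samples    # reward separation between classes
    return W
```

ReliefF extends this update by averaging over the k nearest hits and misses and by weighting the misses from each class by its prior probability, as described above.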

3 kNN Model-based Feature Selection

3.1 A Modified kNN Model

For the convenience of calculating the difference function between any two representatives, and to make use of the information generated in the final model of kNNModel for further feature selection, we make a slight change to the original kNNModel by adding some inductive information, e.g. the nearest neighbour and the furthest neighbour covered by a representative. Each representative in the modified kNNModel is therefore represented as <Cls(di), Sim(di), Num(di), Rep(di), Rep(di1), Rep(di2)>, in which the additional elements Rep(di1) and Rep(di2) respectively represent the nearest neighbour and the furthest neighbour covered by this representative. As there is no significant change between the original and the modified kNNModel, for convenience we still refer to the modified version as kNNModel. Moreover, the Heterogeneous Value Difference Metric (HVDM) similarity measure [15], instead of a purely numerically oriented method, is used in kNNModel to deal with the multi-valued attributes problem. The modified kNNModel method is described as follows:

Algorithm kNNModel
Input: the entire training data D, parameter ε
Output: a set of automatically generated representatives M
1. For a given similarity measure, create a similarity matrix from the given training set D.
2. Set the tag of all instances to "ungrouped" and set M = Ø.
3. For each "ungrouped" instance, find its local ε-neighbourhood.
4. Among all the local ε-neighbourhoods obtained in step 3, find the global ε-neighbourhood Ni. Create a representative for all the instances covered by Ni, add it to M, and then set the tag of all the instances covered by Ni to "grouped".
5. Repeat steps 3 and 4 until all the instances in the training set have been set to "grouped".
6. Model M consists of all the representatives collected from the above learning process.

Fig. 1. Pseudo code of the modified kNNModel algorithm

In the algorithm above, a neighbourhood of a given instance is defined as a set of nearest neighbours around this instance; a local neighbourhood is a neighbourhood which covers the maximal number of instances with the same class label; a local ε-neighbourhood is a neighbourhood which covers the maximal number of instances with the same class label while allowing ε exceptions; and a global ε-neighbourhood is the local ε-neighbourhood which covers the largest number of instances among the set of obtained local ε-neighbourhoods.

3.2 kNN Model-based Feature Selection

A kNN model-based feature selection method, kNNMFS, is proposed in this study. It takes the output of the modified kNNModel as seeds for further feature selection. Given a new instance, kNNMFS finds the nearest representative for each class and then directly uses the inductive information of each representative generated by kNNModel for the feature weight calculation. This means that the k of ReliefF is varied in our algorithm: its value depends on the number of instances covered by each nearest representative used for the feature weight calculation.
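To make the construction in Fig. 1 concrete, the following Python sketch builds the modified representatives under simplifying assumptions: numeric attributes only and Euclidean distance in place of HVDM. It is an illustrative reading of the pseudocode, not the authors' implementation, and the names Representative and knn_model are ours.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Representative:
    cls: object           # Cls(di): class label of the representative
    sim: float            # Sim(di): distance to the furthest covered instance
    num: int              # Num(di): number of instances covered by Ni
    rep: np.ndarray       # Rep(di): the representative instance itself
    rep_near: np.ndarray  # Rep(di1): nearest covered neighbour
    rep_far: np.ndarray   # Rep(di2): furthest covered neighbour

def knn_model(X, y, eps=0):
    """Greedy epsilon-neighbourhood grouping, loosely following Fig. 1."""
    X, y = np.asarray(X, float), np.asarray(y)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    ungrouped = np.ones(n, dtype=bool)
    model = []
    while ungrouped.any():
        best = None  # (coverage, centre index, covered indices)
        for i in np.where(ungrouped)[0]:
            order = np.argsort(dist[i])          # neighbours of i, nearest first
            order = order[ungrouped[order]]      # consider ungrouped instances only
            same = y[order] == y[i]
            # local eps-neighbourhood: largest prefix with at most eps exceptions
            cut = int(np.searchsorted(np.cumsum(~same), eps, side="right"))
            covered = order[:cut][same[:cut]]    # same-class instances in that prefix
            if best is None or len(covered) > best[0]:
                best = (len(covered), i, covered)
        _, i, covered = best                     # global eps-neighbourhood
        nbrs = covered[covered != i]             # covered instances other than the centre
        d = dist[i, nbrs] if len(nbrs) else np.zeros(1)
        model.append(Representative(
            cls=y[i], sim=float(d.max()), num=int(len(covered)), rep=X[i],
            rep_near=X[nbrs[np.argmin(d)]] if len(nbrs) else X[i],
            rep_far=X[nbrs[np.argmax(d)]] if len(nbrs) else X[i]))
        ungrouped[covered] = False
        ungrouped[i] = False
    return model
```

Each Representative mirrors the tuple <Cls(di), Sim(di), Num(di), Rep(di), Rep(di1), Rep(di2)> described in Section 3.1.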

The detailed kNNMFS algorithm is described as follows:

Algorithm kNNMFS
Input: the entire training data D and parameter ε.
Output: the vector W of estimations of the qualities of attributes.
1. Set all weights W[Ai] = 0.0, i = 1, 2, …, p;
2. M := kNNModel(D, ε); m = |M|;
3. for j = 1 to m do begin
4.   Select representative Xj = <Cls(dj), Sim(dj), Num(dj), Rep(dj), Rep(dj1), Rep(dj2)> from M;
5.   for each class C ≠ Cls(dj), find its nearest miss dv(C) from M;
6.   for k = 1 to p do begin
7.     update the weight of attribute Ak:
       $$ W[A_k] \leftarrow W[A_k] - \frac{\bigl(\operatorname{diff}(A_k,d_j,d_{j1}) + \operatorname{diff}(A_k,d_j,d_{j2})\bigr)\,\dfrac{Sim(d_j)}{Num(d_j)}}{2m} + \sum_{C \neq Cls(d_j)} \frac{\dfrac{P(C)}{1-P(Cls(d_v))}\,\bigl(\operatorname{diff}(A_k,d_j,d_{v1}(C)) + \operatorname{diff}(A_k,d_j,d_{v2}(C))\bigr)\,\dfrac{Sim(d_v)}{Num(d_v)}}{2m} $$
8.   end;
9. end;

Fig. 2. Pseudo code of the kNNMFS algorithm

In the algorithm above, p is the number of attributes in the dataset, and m is the number of representatives obtained from kNNModel(D, ε) and used for feature selection. diff() uses HVDM [15] as the difference function for calculating the difference between two values of an attribute. Compared to ReliefF, kNNMFS speeds up the feature selection process by focusing on a few selected representatives instead of the whole dataset. These representatives are obtained by learning from the original dataset; each of them is an optimal representation of a local data distribution. Using these representatives as seeds for feature selection better reflects the influence of each attribute on different classes, and thus gives more accurate weights to the attributes. Moreover, a change was made to the original difference function so that kNNMFS can make use of the information generated for each representative, such as Sim(dj) and Num(dj), in the created kNNModel for the calculation of weights. This modification reduces the computational cost further.
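The weight update in step 7 can be sketched in Python as follows, reusing the Representative objects from the earlier kNNModel sketch. This is an illustrative reading of the formula rather than the authors' code: a range-normalised absolute difference stands in for HVDM, and the arguments priors and spans (class prior probabilities and per-attribute value ranges) are assumptions supplied by the caller.

```python
import numpy as np

def knnmfs_weights(model, priors, spans):
    """Sketch of the kNNMFS weight update (step 7 of Fig. 2).

    model:  list of Representative objects from knn_model().
    priors: dict mapping class label -> prior probability P(C).
    spans:  per-attribute value ranges, used to normalise diff() to [0, 1]
            (a simple stand-in for the HVDM difference function).
    """
    m = len(model)
    p = len(model[0].rep)
    W = np.zeros(p)

    def diff(a, b):                           # attribute-wise difference in [0, 1]
        return np.abs(a - b) / spans

    for rj in model:
        # "hit" part: the representative's own nearest and furthest covered neighbours
        hit = (diff(rj.rep, rj.rep_near) + diff(rj.rep, rj.rep_far)) \
              * rj.sim / max(rj.num, 1)
        W -= hit / (2 * m)
        # "miss" part: the nearest representative of every other class
        for c in {r.cls for r in model} - {rj.cls}:
            cands = [r for r in model if r.cls == c]
            rv = min(cands, key=lambda r: np.linalg.norm(r.rep - rj.rep))
            miss = (diff(rj.rep, rv.rep_near) + diff(rj.rep, rv.rep_far)) \
                   * rv.sim / max(rv.num, 1)
            W += priors[c] / (1.0 - priors[rv.cls]) * miss / (2 * m)
    return W
```

Substituting HVDM [15] for the simple diff() above gives the heterogeneous (ordinal plus nominal) behaviour used in the paper.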

4 Experiments and Evaluation

To evaluate the effectiveness of the newly introduced kNNMFS algorithm, we performed experiments on a dataset of toxicity values for approximately 250 chemicals, all of which contained a common chemical feature, namely a phenolic group. The toxicity of the phenols was assessed in the ciliated protozoan Tetrahymena pyriformis, according to [13]. The full toxicity dataset was reported originally by Cronin et al. [1] following experimental measurement of the effects by Schultz and co-workers (College of Veterinary Medicine, University of Tennessee, Knoxville TN, USA). The analysis of the Tetrahymena pyriformis toxicity data allowed for an evaluation of the performance of the kNNMFS algorithm and an assessment of its suitability for feature selection in the real-world application of predicting the toxicity and environmental effects of chemicals.

4.1 Toxicity Dataset for Phenols

The hydroxy-substituted aromatic compounds (phenols) form a large and structurally diverse group of chemicals. These are interesting from a toxicological point of view, since phenols are widely used organic compounds and they elicit a number of toxicities to different species [14]. Thus, there has been much interest in quantitative structure-activity relationships (QSARs) for phenols, due to their ubiquitous presence in the environment and the various toxicities they may have. One of the important tasks in the prediction of the toxicity of phenols using QSAR analysis is the examination of the relevance of the descriptors in the modelling paradigm. This is often a tedious task, considering the large number of descriptors and compounds to be studied. The algorithm proposed in this study is therefore a contribution to the area of analysing the correlations between chemical descriptors and the development of QSARs.

In this study, high quality toxicity data for a large number of phenols were collated from a historical source [14] and supplemented by those from further testing, providing data on 250 compounds for the development and validation of QSARs [1]. A total of 173 descriptors were calculated for each compound. These descriptors were calculated to represent the physico-chemical, structural and topological properties relevant to toxicity. An explanation of these chemical descriptors and the large variety of software tools used to calculate them is available from [1].

4.2 Evaluation Criteria

An optimal subset is always relative to a certain evaluation criterion. Evaluation criteria can be broadly categorised into two groups based on their dependency on the learning algorithm applied to the selected feature subset. Typically, an independent criterion, as in filter models, tries to evaluate the goodness of a feature, or a feature subset, without the involvement of a learning algorithm in this process. A dependent criterion, as in wrapper models, tries to evaluate the goodness of a feature, or a feature subset, by evaluating the performance of the learning algorithm applied to the selected subset. For the prediction of continuous class values, e.g. the toxicity values in the phenols dataset, the dependent criteria Correlation Coefficient (CC), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE) are chosen to evaluate the goodness of different feature selection algorithms in the experiments. The evaluation measures for continuous class value prediction are presented in Table 1. For the prediction of discrete classes, e.g. the Mechanism of Action in the phenols dataset, average classification accuracy and unbiased variance are used as evaluation criteria.

The unbiased variance is defined as

$$ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 , $$

where $\bar{x}$ is the sample mean. These evaluation measures are used frequently to compare the performance of different feature selection methods.

Table 1. Evaluation measures for continuous class value prediction

| Acronym | Full Name | Equation |
|---------|-----------|----------|
| CC | Correlation Coefficient | $r = \dfrac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\Bigl[n\sum_{i=1}^{n} x_i^2 - \bigl(\sum_{i=1}^{n} x_i\bigr)^2\Bigr]\Bigl[n\sum_{i=1}^{n} y_i^2 - \bigl(\sum_{i=1}^{n} y_i\bigr)^2\Bigr]}}$ |
| MAE | Mean Absolute Error | $E = \dfrac{1}{n}\sum_{i=1}^{n}\lvert x_i - y_i\rvert$ |
| RMSE | Root Mean Squared Error | $E = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2}$ |
| RAE | Relative Absolute Error | $E = \dfrac{\sum_{i=1}^{n}\lvert x_i - y_i\rvert}{\sum_{j=1}^{n}\lvert y_j - \bar{y}\rvert}$ |
| RRSE | Root Relative Squared Error | $E = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - y_i)^2}{\sum_{j=1}^{n}(y_j - \bar{y})^2}}$ |

The terms in Table 1 are defined for a set of n data points (x_i, y_i), where x_i represents the predicted value of y_i, y_i is the true class value, and $\bar{y}$ is the average defined by $\bar{y} = \frac{1}{n}\sum_{j=1}^{n} y_j$.
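For reference, the measures in Table 1 can be computed with a few lines of NumPy. This snippet is illustrative only and was not used to produce the results reported below; RAE and RRSE are returned as fractions here, whereas Table 2 reports them as percentages.

```python
import numpy as np

def regression_metrics(pred, true):
    """Table 1 measures for predicted values `pred` against true values `true`."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    mean_true = true.mean()
    cc = np.corrcoef(pred, true)[0, 1]                     # correlation coefficient
    mae = np.abs(pred - true).mean()                       # mean absolute error
    rmse = np.sqrt(((pred - true) ** 2).mean())            # root mean squared error
    rae = np.abs(pred - true).sum() / np.abs(true - mean_true).sum()
    rrse = np.sqrt(((pred - true) ** 2).sum() / ((true - mean_true) ** 2).sum())
    return {"CC": cc, "MAE": mae, "RMSE": rmse, "RAE": rae, "RRSE": rrse}
```

The unbiased variance s² quoted for the discrete-class experiments is computed analogously over the per-fold classification accuracies.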

4.3 Evaluation

[Experiment 1] In this experiment, eight feature selection methods, including ReliefF and kNNMFS, were applied to the phenols dataset with toxicity as the endpoint to choose a set of optimal subsets based on different evaluation criteria. Besides kNNMFS, which was implemented in our own prototype, the seven other feature selection methods are implemented in the Weka [16] software package. The experimental results obtained on the subsets selected by the different feature selection methods are presented in Table 2. In the experiments, a 10-fold cross validation method was used for evaluation. It is obvious that the proposed kNNMFS method performs better than any other feature selection method when evaluated by the linear regression algorithm on the phenols dataset. The performance on the subset selected by kNNMFS using the linear regression algorithm is significantly better than that on the original dataset. Compared to ReliefF and the other feature selection methods, kNNMFS obtains a higher correlation coefficient and lower error rates (MAE, RMSE, RAE and RRSE).

Table 2. Performance carried out on different subsets after feature selection, evaluated by the linear regression algorithm

| FSM | NSF | CC | MAE | RMSE | RAE | RRSE |
|---------|-----|--------|--------|--------|----------|----------|
| GR | 20 | 0.7722 | 0.4083 | 0.5291 | 60.7675% | 63.7304% |
| IG | 20 | 0.7662 | 0.3942 | 0.5325 | 58.6724% | 63.1352% |
| Chi | 20 | 0.7570 | 0.4065 | 0.5439 | 60.5101% | 65.5146% |
| ReliefF | 20 | 0.8353 | 0.3455 | 0.4568 | 51.4319% | 55.0232% |
| SVM | 20 | 0.8239 | 0.3564 | 0.4697 | 53.0501% | 56.5722% |
| CS | 13 | 0.7702 | 0.3982 | 0.5292 | 59.2748% | 63.7334% |
| CFS | 7 | 0.8049 | 0.3681 | 0.4908 | 54.7891% | 59.1181% |
| kNNMFS | 35 | 0.8627 | 0.3150 | 0.4226 | 46.8855% | 50.8992% |
| Phenols | 173 | 0.8039 | 0.3993 | 0.5427 | 59.4360% | 65.3601% |

The meaning of the column titles in Table 2 is as follows: FSM – Feature Selection Method; NSF – Number of Selected Features. The feature selection methods studied are: GR – Gain Ratio feature evaluator; IG – Information Gain ranking filter; Chi – Chi-squared ranking filter; ReliefF – ReliefF feature selection; SVM – SVM feature evaluator; CS – Consistency Subset evaluator; CFS – Correlation-based Feature Selection; kNNMFS – kNN Model-based Feature Selection; and Phenols – the original Phenols dataset with all 173 features.

[Experiment 2] In this experiment, we performed the same feature selection as in Experiment 1 on the phenols dataset with mechanism of action as the endpoint, and then carried out classification using weighted kNN (wkNN), which was implemented in our own prototype. The experimental results are presented in Table 3. They show that the proposed kNNMFS method performs better than any other feature selection method on the phenols dataset with mechanism of action as the endpoint. The average classification accuracy on the subset selected by kNNMFS using wkNN is higher than that achieved by any other feature selection method and that on the original dataset. Compared to the original Phenols dataset, kNNMFS achieves an 8.1% improvement in average classification accuracy and has a relatively small variance.

Table 3. Performance of the wkNN algorithm (10-fold cross validation, k = 5) on different phenols subsets

| FSM | NSF | Average Accuracy | Variance | Deviation |
|---------|-----|------------------|----------|-----------|
| GR | 20 | 89.32 | 1.70 | 1.31 |
| IG | 20 | 89.08 | 1.21 | 1.10 |
| Chi | 20 | 88.68 | 0.50 | 0.71 |
| ReliefF | 20 | 91.40 | 1.32 | 1.15 |
| SVM | 20 | 91.80 | 0.40 | 0.63 |
| CS | 13 | 89.40 | 0.76 | 0.87 |
| CFS | 7 | 80.76 | 1.26 | 1.12 |
| kNNMFS | 35 | 93.24 | 0.44 | 0.67 |
| Phenols | 173 | 86.24 | 0.43 | 0.66 |
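The authors' wkNN prototype is not described in detail; as a point of reference, a standard distance-weighted kNN classifier of the kind usually denoted wkNN can be sketched as follows (an assumption on our part, not the paper's exact implementation).

```python
import numpy as np

def wknn_predict(X_train, y_train, x, k=5):
    """Distance-weighted kNN classification for a single query instance x."""
    d = np.linalg.norm(np.asarray(X_train, float) - np.asarray(x, float), axis=1)
    nn = np.argsort(d)[:k]                      # indices of the k nearest neighbours
    w = 1.0 / (d[nn] + 1e-12)                   # closer neighbours get larger weights
    votes = {}
    for idx, weight in zip(nn, w):
        label = y_train[idx]
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)            # class with the largest weighted vote
```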

5 Conclusions

In this paper we present a novel solution to deal with the shortcomings of ReliefF. To solve the problem of choosing a set of seeds for ReliefF, we modified the original kNNModel method so that it chooses a few more meaningful representatives from the training set, augmented with some extra inductive information, to represent the whole training set, and we used these representatives as the starting reference for ReliefF. In the selection of each representative an optimal but varying k is used, decided automatically from the dataset itself. The representatives obtained can be used directly for feature selection. Experimental results showed that the performance on the subsets of the Phenols dataset with different endpoints selected by kNNMFS is better than that obtained using any other feature selection method evaluated. The improvement is significant compared to ReliefF and the other feature selection methods. The results obtained using the proposed algorithm for chemical descriptor analysis applied in predictive toxicology are encouraging and show that the method is worthy of further research. Further work will investigate the effects of choosing boundary data or cluster-centre data as seeds for kNNMFS.

Acknowledgment. This work was supported in part by the EPSRC project PYTHIA – Predictive Toxicology Knowledge Representation and Processing Tool Based on a Hybrid Intelligent Systems Approach, Grant Reference: GR/T02508/01.

References

1. Cronin, M.T.D., Aptula, A.O., Duffy, J.C. et al.: Comparative Assessment of Methods to Develop QSARs for the Prediction of the Toxicity of Phenols to Tetrahymena Pyriformis. Chemosphere 49 (2002), pp. 1201-1221
2. Fayyad, U.M. and Irani, K.B.: The Attribute Selection Problem in Decision Tree Generation. In Proc. of AAAI-92, the 9th National Conference on Artificial Intelligence (1992), pp. 104-110, AAAI Press/The MIT Press
3. Guo, G., Wang, H., Bell, D. et al.: kNN Model-based Approach in Classification. In Proc. of CoopIS/DOA/ODBASE 2003, LNCS 2888, Springer-Verlag, pp. 986-996 (2003)
4. Hall, M.A.: Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning. In Proc. of ICML'00, the 17th International Conference on Machine Learning, pp. 359-366 (2000)
5. Huang, Y., McCullagh, P.J., Black, N.D.: Feature Selection via Supervised Model Construction. In Proc. of the Fourth IEEE International Conference on Data Mining, pp. 411-414 (2004)
6. John, G.H., Kohavi, R. and Pfleger, K.: Irrelevant Features and the Subset Selection Problem. In Cohen, W.W. and Hirsh, H. (eds.), Machine Learning: Proc. of the Eleventh International Conference (1994), pp. 121-129, New Brunswick, N.J., Rutgers University
7. Kira, K. and Rendell, L.A.: A Practical Approach to Feature Selection. Machine Learning (1992), pp. 249-256
8. Kononenko, I.: Estimating Attributes: Analysis and Extension of Relief. In Proc. of ECML'94, the Seventh European Conference on Machine Learning (1994), Springer-Verlag, pp. 171-182
9. Liu, H., Yu, L., Dash, M. and Motoda, H.: Active Feature Selection Using Classes. In Proc. of PAKDD'03, pp. 474-485 (2003)
10. Robnik, M., Kononenko, I.: Machine Learning 53, Kluwer Academic Publishers (2003), pp. 23-69
11. Sikonja, M.R., Kononenko, I.: Theoretical and Empirical Analysis of ReliefF and RReliefF. Machine Learning 53 (2003), pp. 23-69
12. Søndberg-Madsen, N., Thomsen, C. and Peña, J.M.: Unsupervised Feature Subset Selection. In Proc. of the Workshop on Probabilistic Graphical Models for Classification (within ECML/PKDD 2003), pp. 71-82 (2003)
13. Schultz, T.W.: TETRATOX: The Tetrahymena Pyriformis Population Growth Impairment Endpoint – A Surrogate for Fish Lethality. Toxicol. Methods 7, pp. 289-309 (1997)
14. Schultz, T.W., Sinks, G.D., Cronin, M.T.D.: Identification of Mechanisms of Toxic Action of Phenols to Tetrahymena Pyriformis from Molecular Descriptors. In Chen, F., Schuurmann, G. (eds.), Quantitative Structure-Activity Relationships in Environmental Sciences – VII. SETAC Press, Pensacola, FL, USA, pp. 329-342 (1997)
15. Wilson, D.R. and Martinez, T.R.: Improved Heterogeneous Distance Functions. Journal of Artificial Intelligence Research (JAIR) 6, pp. 1-34 (1997)
16. Witten, I.H. and Frank, E.: Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann (2000), San Francisco
17. Biesiada, J. and Duch, W.: Feature Selection for High-Dimensional Data: A Kolmogorov-Smirnov Correlation-based Filter. In Proc. of CORES 2005, the 4th International Conference on Computer Recognition Systems (2005)
18. Sebban, M. and Nock, R.: A Hybrid Filter/Wrapper Approach of Feature Selection Using Information Theory. Pattern Recognition 35(4), pp. 835-846 (2002)