Monte Carlo feature selection and interdependency discovery in supervised classification

Michał Dramiński¹, Marcin Kierczak¹, Jacek Koronacki and Jan Komorowski

Abstract Applications of machine learning techniques in the Life Sciences are the main force behind a paradigm shift in the way these techniques are used. Rather than obtaining the best possible supervised classifier, the Life Scientist needs to know which features contribute best to classifying observations into distinct classes and what the interdependencies between the features are. To this end we significantly extend our earlier work [Dramiński et al. (2008)], which introduced an effective and reliable method for ranking features according to their importance for classification. We begin by adding a method for finding a cut-off between informative and non-informative features, and then develop a methodology and implement a procedure for determining interdependencies between informative features. The reliability of our approach rests on the construction of multiple tree classifiers. Essentially, each classifier is trained on a randomly chosen subset of the original data using only a fraction of all the observed features. This approach is conceptually simple yet computer-intensive. The methodology is validated on the large and difficult task of modelling HIV-1 reverse transcriptase resistance to drugs, which is a good example of the aforementioned paradigm shift. We construct a classifier, but the main interest lies in identifying the mutation points (i.e. features) and their combinations that model drug resistance.

Michał Dramiński and Jacek Koronacki
Institute of Computer Science, Polish Acad. Sci., Ordona 21, Warsaw, Poland, e-mail: [email protected], [email protected]

Marcin Kierczak
The Linnaeus Centre for Bioinformatics, Uppsala University and The Swedish University of Agricultural Sciences, Box 758, Uppsala, Sweden, e-mail: [email protected]

Jan Komorowski
The Linnaeus Centre for Bioinformatics, Uppsala University and The Swedish University of Agricultural Sciences, Box 758, Uppsala, Sweden; Interdisciplinary Centre for Mathematical and Computer Modelling, Warsaw University, Poland, e-mail: [email protected]

¹ These authors contributed equally.


1 Introduction

A major challenge in the analysis of many data sets, especially those presently generated by advanced biotechnologies, is their size: a very small number of records (samples, observations), of the order of tens, versus thousands of attributes or features per record. Typical examples include microarray gene expression experiments (where the features are gene expression levels) and data coming from next-generation DNA or RNA sequencing projects. Another obvious example is transactional data of commercial origin. In all these tasks supervised classification is quite different from a typical data mining problem, in which every class has a large number of examples. In the latter context, the main task is to propose a classifier of the highest possible quality of classification. In class prediction for typical gene expression data it is not the classifier per se that is crucial; rather, the selection of informative (discriminative) features and the discovered interdependencies among them give the Life Scientist a much desired possibility of interpreting the classification results. Given such data, all reasonable classifiers can be claimed to be capable of providing essentially similar results (if measured by error rate or similar criteria; cf. [Dudoit and Fridlyand (2003)]). However, since it is the rule rather than the exception that most features in the data are not informative, it is of utmost interest to select the few that are informative and that may form the basis for class prediction. Equally interesting is the discovery of interdependencies between the informative features.

Generally speaking, feature selection may be performed either prior to building the classifier or as an inherent part of this process. These two approaches are referred to as filter methods and wrapper methods, respectively. Currently, the wrapper methods are often divided into two subclasses: one retaining the name "wrapper methods" and the other termed embedded methods. Within this finer taxonomy, the former refers to classification methods in which feature selection is "wrapped" around the classifier construction, and the latter to those in which feature selection is built directly into the classifier construction. Significant progress in these areas of research has been achieved in recent years; for a brief account, up to 2002, see [Dudoit and Fridlyand (2003)], and for an extensive survey and later developments see [Saeys et al. (2007)]. Regarding the wrapper and embedded approaches, an early successful method, not mentioned by [Saeys et al. (2007)], was developed by Tibshirani et al. (see [Tibshirani et al. (2002), Tibshirani et al. (2003)]) and is called nearest shrunken centroids. Most recently, a Bayesian technique of automatic relevance determination, the use of support vector machines, and the use of ensembles of classifiers, either alone or in combination, have proved particularly promising. For further details see [Li et al. (2002), Lu et al. (2007), Chrysostomou et al. (2008)] and the literature therein.
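To make the filter / wrapper / embedded distinction concrete, here is a minimal Python sketch using scikit-learn on synthetic data; the data set, the choice of ten selected features and the particular selectors are illustrative assumptions of ours, not taken from the methods cited above.

```python
# Illustrative comparison of filter, wrapper and embedded feature selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

# Few samples, many features: the setting described above (synthetic data).
X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           random_state=0)

# Filter: rank features by a univariate score before (and independently of)
# building any classifier.
filter_sel = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print("filter   :", sorted(filter_sel.get_support(indices=True)))

# Wrapper: selection is "wrapped" around repeated classifier construction
# (here, recursive feature elimination around a decision tree).
wrapper_sel = RFE(DecisionTreeClassifier(random_state=0),
                  n_features_to_select=10).fit(X, y)
print("wrapper  :", sorted(wrapper_sel.get_support(indices=True)))

# Embedded: selection falls out of fitting the classifier itself
# (features with non-zero impurity-based importance in a single tree).
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("embedded :", sorted(i for i, imp in enumerate(tree.feature_importances_)
                           if imp > 0))
```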


In the context of feature selection, the most recent developments by the late Leo Breiman deserve special attention. In his Random Forests, he proposed to make use of the so-called variable (i.e. feature) importance for feature selection. Determination of variable importance is not necessary for random forest construction, but it is a subroutine performed in parallel to building the forest; cf. [Breiman and Cutler (2008)]. Ranking features by variable importance can thus be considered a by-product of building the classifier. While ranking variables according to their importance is a natural basis for a filter, nothing prevents one from using such importances within, say, the embedded approach; cf., e.g., [Diaz-Uriarte and de Andres (2006)]. In any case, feature selection by measuring variable importance in random forests should be seen as a very promising method, albeit under one proviso: variable importance as originally defined is biased towards variables with many categories and variables that are correlated; cf. [Strobl et al. (2007), Archer and Kimes (2008)]. Accordingly, proper debiasing is needed in order to obtain a true ranking of features; cf. [Strobl et al. (2008)].

One potential advantage of the filter approach is that it constructs a group of features that contribute the most, and are therefore informative, or "relatively important", to a given classification task regardless of the classifier that will be used. In other words, the filter approach should be seen as a way of providing an objective measure of the relative importance of each feature for a particular classification task. Of course, for this to be the case, a filter method used for feature selection should be capable of incorporating interdependencies between the features. Indeed, the fact that a feature may prove informative only in conjunction with some other features, but not alone, should be taken into account. Clearly, the aforementioned algorithms for measuring variable importance in random forests possess this capability.

Recently, a novel, effective and reliable filter method for ranking features according to their importance for a given supervised classification task was introduced by [Dramiński et al. (2008)]. The method is capable of incorporating interdependencies between features. It bears some remote similarity to the Random Forest methodology, but differs entirely in the way feature ranking is performed. In particular, our method does not require debiasing and is conceptually simpler. A more important and new result is that it provides explicit information about interdependencies among features. Within our approach, discovering interdependencies builds on identifying features which "cooperate" in determining that samples belong to the same classes. It is worth emphasizing that this is completely different from the usual approach, which aims at finding features that are similar in some sense.

The procedure from [Dramiński et al. (2008)] for Monte Carlo feature selection is briefly recapitulated in Section 2. Since the original aim was only to rank features according to their classification ability, no distinction was made between informative and non-informative features. In that section we introduce an additional procedure to find a cut-off separating informative from non-informative features in the ranking list. In Section 3, a way to discover interdependencies between features is provided. In Section 4, the application of the method is illustrated on HIV-1 resistance to Didanosine. Interpretation of the obtained results is provided in Subsection 4.1. We close with concluding remarks in Section 5.
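To make the earlier remarks on random forest variable importance and its bias more tangible, the short sketch below compares the impurity-based importances of a random forest with permutation importances computed on held-out data, a common (if only partial) remedy for that bias. This is generic scikit-learn usage on synthetic data with parameters of our own choosing; it is neither the conditional approach of [Strobl et al. (2008)] nor part of the MCFS method described next.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

rf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X_tr, y_tr)

# Impurity-based importance: a by-product of growing the forest, but known
# to be biased towards high-cardinality and correlated features.
impurity_rank = np.argsort(rf.feature_importances_)[::-1]

# Permutation importance on held-out data: a common (partial) remedy;
# Strobl et al. (2008) propose a conditional variant.
perm = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=1)
perm_rank = np.argsort(perm.importances_mean)[::-1]

print("top 5 (impurity)   :", impurity_rank[:5])
print("top 5 (permutation):", perm_rank[:5])
```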


Fig. 1 Block diagram of the main step of the MCFS procedure: out of d attributes, s subsets of m attributes each are drawn; t splits (trees) are evaluated for each subset and combined into relative importance (RI) scores.

2 Monte Carlo Feature Selection

The Monte Carlo feature selection (MCFS) part of the algorithm is conceptually simple, albeit computer-intensive. We consider a particular feature to be important, or informative, if it is likely to take part in the process of classifying samples into classes "more often than not". This "readiness" of a feature to take part in the classification process, termed the relative importance of a feature, is measured via intensive use of classification trees. The use of classification trees is motivated by the fact that they can be considered the most flexible classifiers within the family of all classification methods. In our method, however, the classifiers are used for measuring the relative importance of features, not for classification per se.

In the main step of the procedure, we estimate the relative importance of features by constructing thousands of trees for randomly selected subsets of the features. More precisely, out of all d features, we select s subsets of m features each, m being fixed and m ≪ d.
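The following sketch conveys the structure of this main step under our own simplifying assumptions: it uses scikit-learn decision trees rather than the original implementation, and it credits a feature with the tree's test-set accuracy whenever that feature is used in a split, which is only a crude surrogate for the relative importance (RI) score of [Dramiński et al. (2008)] (the exact RI formula involves additional per-node weights and is not reproduced in this excerpt).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=100, n_features=200, n_informative=8,
                           random_state=0)
d = X.shape[1]
m, s, t = 20, 300, 5   # m features per subset, s subsets, t splits per subset
ri = np.zeros(d)       # accumulated (simplified) relative importance

for _ in range(s):
    # draw a random subset of m out of all d features
    subset = rng.choice(d, size=m, replace=False)
    for _ in range(t):
        # one of the t random train/test splits of the samples (cf. Fig. 1)
        X_tr, X_te, y_tr, y_te = train_test_split(X[:, subset], y,
                                                  test_size=0.33)
        tree = DecisionTreeClassifier().fit(X_tr, y_tr)
        acc = tree.score(X_te, y_te)
        # credit every feature actually used in a split of this tree with the
        # tree's accuracy -- a crude surrogate for the authors' RI weighting
        used = tree.tree_.feature[tree.tree_.feature >= 0]
        ri[subset[np.unique(used)]] += acc

ranking = np.argsort(ri)[::-1]
print("top 10 features by (simplified) relative importance:", ranking[:10])
```

Features that repeatedly end up in accurate trees accumulate a high score, which is the intuition behind the RI ranking.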