Expert Systems with Applications 38 (2011) 11311–11320
A hybrid feature selection scheme for unsupervised learning and its application in bearing fault diagnosis

Yang Yang a,*, Yinxia Liao b, Guang Meng a, Jay Lee b
a State Key Laboratory of Mechanical System and Vibration, Shanghai Jiaotong University, Shanghai 200240, PR China
b NSF I/UCR Center for Intelligent Maintenance Systems, 560 Rhodes Hall, University of Cincinnati, Cincinnati, OH 45221, USA
Keywords: Feature selection; Unsupervised learning; Fault diagnostics
Abstract

With the development of condition-based maintenance techniques and the consequent requirement for good machine learning methods, new challenges arise in unsupervised learning. In real-world situations, because the relevant features that could reveal the real machine condition are often unknown a priori, condition monitoring systems based on unimportant features, e.g. noise, might suffer high false-alarm rates, especially when the characteristics of failures are costly or difficult to learn. Therefore, it is important to select the most representative features for unsupervised learning in fault diagnostics. In this paper, a hybrid feature selection scheme (HFS) for unsupervised learning is proposed to improve the robustness and the accuracy of fault diagnostics. It provides a general framework for feature selection based on significance evaluation and similarity measurement with respect to multiple clustering solutions. The effectiveness of the proposed HFS method is demonstrated by a bearing fault diagnostics application and by comparison with other feature selection methods.

© 2011 Elsevier Ltd. All rights reserved.
* Corresponding author. Tel.: +86 21 34206831x322. E-mail address: [email protected] (Y. Yang).
doi:10.1016/j.eswa.2011.02.181

1. Introduction

As sensing and signal processing technologies advance rapidly, an increasing number of features have become involved in condition monitoring systems and fault diagnosis. A challenge in this area is to select the parameters most sensitive to the various types of fault, especially when the characteristics of failures are costly or difficult to learn (Malhi & Gao, 2004). In reality, since the relevant or important features are often not available a priori, a large number of candidate features have been proposed to achieve a better representation of the machine health condition (Dash & Liu, 1997; Jardine, Lin, & Banjevic, 2006; Peng & Chu, 2004). Because of the irrelevant and redundant features in the original feature space, employing all features might lead to high complexity and low performance of fault diagnosis. Moreover, most unsupervised learning methods assume that all features have a uniform degree of importance during clustering operations (Dash & Koot, 2009). Even within the optimal feature set, it is also assumed that each feature has the same sensitivity throughout the clustering operations. In fact, it is known that an important feature facilitates creating clusters, while an unimportant feature, on the contrary, may jeopardize the clustering operation by blurring the clusters. Thereby, it is better to select only the most representative features (Xu, Xuan, Shi, & Wu, 2009) rather than simply reducing the number of features. Hence, it is of significance to develop a systematic and automatic feature selection method
that is capable of selecting the prominent features to achieve a better insight into the underlying machine performance.

Generally speaking, feature selection is one of the essential and frequently used techniques in machine learning (Blum & Langley, 1997; Dash & Koot, 2009; Dash & Liu, 1997; Ginart, Barlas, & Goldin, 2007; Jain, Duin, & Mao, 2000; Kwak & Choi, 2002). Its aim is to select the most representative features, which brings immediate benefits for mining performance such as predictive accuracy and solution comprehensibility (Guyon & Elisseeff, 2003; Liu & Yu, 2005). However, traditional feature selection algorithms for classification do not work for unsupervised learning since no class information is available. Dimensionality reduction or feature extraction methods are frequently used for unsupervised data, such as Principal Component Analysis (PCA), the Karhunen–Loeve transformation, or Singular Value Decomposition (SVD) (Dash & Koot, 2009). Malhi and Gao (2004) presented a PCA-based feature selection model for bearing defect classification in a condition monitoring system. Compared to using all features initially considered relevant to the classification results, it provided more accurate classifications for both supervised and unsupervised purposes with fewer feature inputs. The drawback, however, is the difficulty of understanding the data and the found clusters through the extracted features (Dash & Koot, 2009). Given sufficient computation time, feature subset selection investigates all candidate feature subsets and selects the optimal one that satisfies the cost function. Greedy search algorithms such as sequential forward feature selection (SFFS) (or backward search feature selection (BSFS)) and random feature selection
were commonly used. Oduntan, Toulouse, and Baumgartner (2008) developed a multilevel tabu search algorithm combined with a hierarchical search framework, and compared it to sequential forward feature selection, random feature selection and tabu search feature selection (Zhang & Sun, 2002). Feature subset selection requires intensive computational time and shows poor performance for non-monotonic indices. In order to overcome these drawbacks, feature selection methods tend to rank features or select a subset of the original features (Guyon & Elisseeff, 2003). Feature ranking techniques sort the features according to cost functions and select a subset from the ordered features. Hong, Kwong, and Chang (2008a) introduced an effective method, feature ranking from multiple views (FRMV). It scores each feature using a ranking criterion by considering multiple clustering results, and selects the first several features with the best quality as the "optimal" feature subset. However, FRMV emphasizes the importance of features for achieving better classification rather than the redundancy of the selected features, so the selected features are not necessarily the optimal subset.

Considering the approaches for evaluating the cost function, feature selection algorithms broadly fall into three categories: the filter model, the wrapper model and the hybrid model (Liu & Yu, 2005). The filter model discovers the general characteristics of the data and treats feature selection as a preprocessing step that is independent of any mining algorithm. Filter methods are less time-consuming but also less effective. The wrapper model incorporates one predetermined learning algorithm and selects the feature subset that improves its mining performance according to certain criteria. It is more time-consuming but more effective than the filter model. Moreover, the predetermined learning algorithm remains biased towards the shape of the clusters, that is, it is sensitive to the data structure according to its operating concept. The hybrid model takes advantage of the two models in different search stages according to different criteria. Mitra, Murthy, and Pal (2002) described a filter feature selection algorithm for high-dimensional data sets based on measuring similarity between features, whereby the redundancy therein was removed; a maximum information compression index was also introduced in their work to estimate the similarity between features. Li, Dong, and Hua (2008) proposed a filter feature selection algorithm through feature clustering (FFC), which groups the features into different clusters based on feature similarity and selects the representative features in each cluster to reduce feature redundancy. Wei and Billings (2007) introduced a forward orthogonal search feature selection algorithm that maximizes the overall dependency to find significant variables; it also provides a ranked list of selected features ordered according to their percentage contribution to representing the overall structure. Liu, Ma, Zhang, and Mathew (2006) presented a wrapper model based on the fuzzy c-means (FCM) algorithm for rolling element bearing fault diagnostics. Sugumaran and Ramachandran (2007) employed a wrapper approach based on decision trees, with information gain and entropy reduction as criteria, to select representative features that could discriminate bearing faults.
Hong, Kwong, and Chang (2008b) described a novel feature selection algorithm based on unsupervised learning ensembles and a population-based incremental learning algorithm. It searches among all candidate feature subsets for the one whose clustering result is most similar to that obtained by an unsupervised learning ensemble method. Huang, Cai, and Xu (2007a) developed a two-stage hybrid genetic algorithm to find a subset of features. In the first stage, the mutual information between the predicted labels and the true class labels served as the fitness function for the genetic algorithm to conduct a global search in a wrapper manner. In the second stage, the conditional mutual information served as an independent measure for feature ranking, considering both the relevance and the redundancy of the features.
As mentioned above, these techniques either require the available features to be independent initially, which is the opposite of realistic situations, or remain biased toward the shape of the clusters due to their fundamental concept. This paper introduces a hybrid feature selection scheme for unsupervised learning that can overcome those deficiencies. The proposed scheme generates two randomly selected subspaces for further clustering, combines different genres of clustering analysis to obtain a population of sub-decisions of feature selection based on significance measurement, and removes redundant features based on feature similarity measurement to improve the quality of the selected features. The effectiveness of the proposed scheme is validated by an application of bearing defect classification, and the experimental results illustrate that the proposed method is able to (a) identify the features that are relevant to the bearing defects, and (b) maximize the performance of unsupervised learning models with fewer features.

The rest of this paper is arranged as follows. Section 2 illustrates the proposed HFS scheme in detail. Section 3 discusses the application of the proposed feature selection scheme in bearing fault diagnosis. Finally, Section 4 concludes this paper.
2. Hybrid feature selection scheme (HFS) for unsupervised classification

It is time-consuming, or even a difficult mission for an experienced fault diagnosis engineer, to determine which feature among all available features is able to distinguish the characteristics of various failures, especially when no prior knowledge (class information) is available. To tackle the problem of absent class information, FRMV (Hong et al., 2008a) extended the feature ranking methodology to unsupervised data clustering. It offered a generic approach to boost the performance of clustering analysis. A stable and robust unsupervised feature ranking approach was proposed based on the ensemble of multiple feature rankings obtained from different views of the same data set. When conducting FRMV, data instances are first clustered in a randomly selected feature subspace to obtain a clustering solution, and then all features are ranked according to their relevance to the obtained clustering solution. These two steps iterate until a population of feature rankings is achieved. Thereby, all obtained feature rankings are combined by a consensus function into a single consensus ranking. However, FRMV clusters the data instances in a subspace that consists of a randomly selected half of the features each time. It is likely that some valuable features are ignored in the beginning, that is, some features may never be included in any iteration. Besides, FRMV only ensembles the results of one unsupervised learning algorithm, which overlooks the fact that a learning algorithm is likely to hold a bias toward the natural structure of the data, such as a hyper-spherical structure or a hierarchical structure (Frigui, 2008; Greene, Cunningham, & Mayer, 2008). As illustrated in Fig. 1, the data set consists of 11 points. If two clusters are contained in the data set, a classifier based on the hierarchical concept tends to assign points 1, 2, 3, 4 and 5 to one cluster and the remaining points to the other cluster, while a classifier based on the hyper-spherical concept tends to assign points 1, 2, 3, 4, 6, 7 and 8 to one cluster and the remaining points to the other cluster. Furthermore, in FRMV all features are assumed to be independent before the selection process, which is usually not the case in the real world. Thereby, highly ranked features are related to their neighbors with high probability; in other words, some top-ranked features might turn out to be redundant. Since the above-mentioned shortcomings of FRMV are obstacles to higher classification performance and constraints on wider applications in the real world, a hybrid feature selection
Fig. 1. An example of data structures in 2D; (a) hierarchical cluster, (b) spherical cluster.
scheme (HFS) is proposed to overcome these deficiencies. The remainder of this section presents the HFS scheme for unsupervised learning and introduces the criteria used in HFS.

2.1. Procedure of the hybrid unsupervised feature selection method

The HFS is inspired by FRMV, which ranks each feature according to the relevancy between the feature and the combined clustering solutions. Moreover, the HFS is developed to combine different genres of clustering analysis into a consensus decision and to rank the features according to both their relevancies to the consensus decision and the independencies between features. Generally, HFS involves two aspects: (1) significance evaluation, which determines the contribution of each feature on behalf of multiple clustering results; (2) redundancy evaluation, which retains the most significant and independent features with respect to the feature similarity. Some notations used throughout this paper are given as follows. The input vector x_i of the original feature space X with D candidate features is denoted as x_i = {x_i^1, x_i^2, ..., x_i^D} (i = 1, ..., M), in which i denotes the ith instance and M is the number of instances. Let RF(k) = {rank(k)(F_1), rank(k)(F_2), ..., rank(k)(F_D)} (1 ≤ rank(k)(F_i) ≤ D) be the kth sub-decision of the feature ranking, where rank(k)(F_i) denotes the rank of the ith feature F_i in the kth sub-decision. Assuming there are P sub-decisions of the feature ranking {RF(1), RF(2), ..., RF(P)}, a combine function determines a final decision by combining the P sub-decisions into a single consensus feature decision RF_pre-final, which is thereafter processed according to the feature similarity to obtain RF_final. Details of the scheme are described as follows.
Algorithm: Hybrid feature selection scheme for unsupervised learning
Input: feature space X, the number of clusters N, maximum iteration L
Output: decision of feature selection
(1) Iterate until a population of sub-decisions is obtained
    For k = 1:L, Do:
    (1.1) Randomly divide the original feature space into two subspaces X1, X2
    (1.2) Group the data with the first and second groups of unsupervised learning algorithms in subspaces X1 and X2 separately
    (1.3) Evaluate the significance of each feature based on the significance measurement to obtain the kth sub-decision of feature selection RF(k)
    End
(2) RFpre-final = combiner{RF(1), RF(2), . . . , RF(2L)}  // combine all rankings into a single consensus one
(3) Redundancy evaluation based on feature similarities
(4) Return RFfinal
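To make the procedure concrete, the following is a minimal Python sketch of steps (1) and (2), written under the assumption that scikit-learn's KMeans and AgglomerativeClustering stand in for the two genres of clustering and that the absolute linear correlation of Eq. (1) serves as the significance score; the function names are illustrative and do not reproduce the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def significance(X, labels):
    """Score every feature by its absolute linear correlation with the cluster labels (cf. Eq. (1))."""
    y = labels.astype(float)
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return np.nan_to_num(np.array(scores))

def hfs_pre_final_ranking(X, n_clusters, n_iter=50, seed=0):
    """Return feature indices ordered from most to least significant (RF_pre-final)."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    rank_sum = np.zeros(D)
    for _ in range(n_iter):
        perm = rng.permutation(D)
        subspaces = (perm[: D // 2], perm[D // 2:])                 # step 1.1: two random subspaces
        models = (KMeans(n_clusters=n_clusters, n_init=10),         # hyper-spherical genre
                  AgglomerativeClustering(n_clusters=n_clusters))   # hierarchical genre
        for subspace, model in zip(subspaces, models):
            labels = model.fit_predict(X[:, subspace])               # step 1.2: cluster in the subspace
            scores = significance(X, labels)                         # step 1.3: score all features
            ranks = np.empty(D)
            ranks[np.argsort(-scores)] = np.arange(1, D + 1)         # rank 1 = most significant
            rank_sum += ranks
    avg_rank = rank_sum / (2 * n_iter)                               # step 2: simple-average combiner (Eq. (9))
    return np.argsort(avg_rank)
```

The returned ordering corresponds to RF_pre-final; the redundancy evaluation of step (3) is discussed in Section 2.3.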
At the beginning, the original feature space X is randomly divided into two feature subspaces X1 and X2, in which the instances are clustered correspondingly. In step 1.2, two different genres of clustering analysis, e.g. hyper-spherical clustering and hierarchical clustering, are used to classify the data instances in the two subspaces respectively. Thereby, a clustering solution is obtained in each subspace. Then all features are ranked with respect to their relevancies to the obtained clustering solutions in step 1.3, named the significance evaluation. The above two steps iterate until a population of feature rankings, named sub-decisions, is achieved. In step 2, a consensus function is utilized to combine all sub-decisions into a pre-final decision. Thereafter, the final decision of feature selection is accomplished by re-ranking the pre-final decision according to the re-ranking scheme based on feature similarity in step 3, named the redundancy evaluation. The details of HFS are introduced in Sections 2.2 and 2.3. Fig. 2 illustrates the framework of the HFS.
Table 1 lists the differences between FRMV and the proposed HFS. First of all, in order to make sure that every feature in the original feature set is able to contribute to the decision making, HFS makes use of both randomly divided subspaces of the original feature space, instead of ignoring some features by randomly selecting half of the original feature space as in FRMV. Secondly, HFS considers the bias of each individual unsupervised learning algorithm; thereby, different genres of clustering methods are used to cluster the data in the subspaces. Moreover, HFS provides a redundancy evaluation according to the feature similarity and re-ranks the features. It is therefore more appropriate for real-world applications than FRMV.

2.2. Significance evaluation

The goal of unsupervised feature selection is to find as few features as possible that best uncover the "interesting natural" clusters in the data, which can be found by an unsupervised learning algorithm. Therefore, the relationship between the clustering solution and a feature is considered as the significance of the feature to the clustering solution. In step 1.3, the features are ranked with respect to their relevancies to the obtained clustering solutions, named the significance evaluation. The sub-decisions serve as the target and each feature is considered as a variable. In this research, the widely used linear correlation coefficient (LCC) (Hong et al., 2008a), the symmetrical uncertainty (SU) (Yu & Liu, 2004) and the Davies–Bouldin index (DB) (Davies & Bouldin, 1979) are used for significance evaluation. The details of each criterion are introduced as follows. For convenience, denote F_k and R_k as the kth feature and the kth sub-decision, respectively. First, the linear correlation coefficient measures the correlation between the variable and the target, and is calculated as follows:
$$\mathrm{LCC}(F_k, R_k) = \frac{\operatorname{cov}(F_k, R_k)}{\sigma(F_k)\,\sigma(R_k)}, \tag{1}$$
where σ(R_k) is the standard deviation of the kth target and cov(F_k, R_k) is the covariance between F_k and R_k. Secondly, the symmetrical uncertainty is defined as follows:
$$\mathrm{SU}(F_k, R_k) = 2\,\frac{\mathrm{IG}(F_k \mid R_k)}{H(F_k) + H(R_k)}, \tag{2}$$

with

$$\mathrm{IG}(F_k \mid R_k) = H(F_k) - H(F_k \mid R_k), \tag{3}$$

$$H(F_k) = -\sum_{F_k' \in X(F_k)} P(F_k')\,\log\bigl(P(F_k')\bigr), \tag{4}$$

$$H(F_k \mid R_k) = -\sum_{R_k' \in X(R_k)} P(R_k') \Biggl[\sum_{F_k' \in X(F_k)} P(F_k' \mid R_k')\,\log\bigl(P(F_k' \mid R_k')\bigr)\Biggr], \tag{5}$$

$$P(F_k') = \frac{\sum_{i=1}^{N} \delta(d_i, F_k')}{N}, \tag{6}$$

$$\delta(d_i, F_k') = \begin{cases} 1, & \text{if } d_i = F_k' \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

Fig. 2. Flowchart of HFS for unsupervised learning.
Table 1. Comparison between FRMV and HFS.
Subspace: FRMV randomly selects N/2 features; HFS randomly divides the feature space into two subspaces X1, X2.
Clustering analysis: FRMV remains biased towards a single data structure; HFS considers the bias of each individual algorithm towards the data structure.
Independence evaluation: FRMV has none; HFS re-ranks the features according to the similarity measurement.
Note: N is the number of all features.
where H(F_k) is the entropy of F_k and H(F_k | R_k) is the conditional entropy of F_k given R_k. X(F_k) denotes all possible values of F_k and X(R_k) all possible values of R_k. P(F_k') is the probability that F_k equals F_k', and P(F_k' | R_k') is the probability that F_k equals F_k' under the condition that the instances are assigned to the group R_k'. A symmetrical uncertainty SU(F_k, R_k) of 1 indicates that F_k is completely related to R_k, whereas a value of 0 means that F_k is entirely irrelevant to the target (Hong et al., 2008a; Shao & Nezu, 2000).
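As a rough illustration of how such a criterion can be evaluated in code, the sketch below computes the symmetrical uncertainty between a continuous feature (discretized into equal-width bins, a detail the paper does not specify) and a clustering sub-decision; LCC can be obtained directly from np.corrcoef, and the DB index is available, for instance, as sklearn.metrics.davies_bouldin_score.

```python
import numpy as np

def entropy(values):
    """Empirical entropy of a discrete variable."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def symmetrical_uncertainty(feature, labels, n_bins=10):
    """SU(F, R) = 2 * IG(F|R) / (H(F) + H(R)), cf. Eqs. (2)-(7)."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    f = np.digitize(feature, edges[1:-1])                  # discretize the continuous feature
    h_f, h_r = entropy(f), entropy(labels)
    h_f_given_r = sum((labels == r).mean() * entropy(f[labels == r])
                      for r in np.unique(labels))          # conditional entropy H(F|R)
    ig = h_f - h_f_given_r
    return 2.0 * ig / (h_f + h_r) if (h_f + h_r) > 0 else 0.0
```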
Thirdly, the DB index is a function of the ratio of the within-cluster scatter to the between-cluster separation, computed as follows:

$$\mathrm{DB} = \frac{1}{n}\sum_{i=1}^{n} \max_{j \neq i} \left\{ \frac{S_n(Q_i) + S_n(Q_j)}{S(Q_i, Q_j)} \right\}, \tag{8}$$
where n is the number of clusters, Q_i stands for the ith cluster, S_n(Q_i) denotes the average distance of all objects in the cluster to the cluster centre, and S(Q_i, Q_j) is the distance between the cluster centres. The DB index is small if the clusters are compact and far from each other; in other words, a small DB index means a good clustering.

2.3. Combination and similarity measurement

Besides the maximization of clustering performance, the other important purpose is the selection of features based on feature dependency, or similarity. Any feature carrying little or no additional information beyond that subsumed by the remaining features is redundant and should be eliminated (Mitra et al., 2002). That is, if a highly ranked feature carries valuable information and a lower-ranked feature is very similar to it, the latter should be eliminated because it carries no additional valuable information. Therefore, the similarities between features are taken as the reference for the redundancy evaluation.

In step 2, a consensus function, named the combiner, is utilized to combine all sub-decisions into a pre-final decision. A large number of combiners for combining classifier results were discussed in Dietrich, Palm, and Schwenker (2003). The most common combiners are the majority vote, the simple average and the weighted average. In the simple average, the average of the learning model results is calculated and the variable with the largest average value is selected as the final decision. The weighted average follows the same concept as the simple average except that the weights are selected heuristically. The majority vote assigns the kth variable rank j if more than half of the sub-decisions vote it to rank j. Practically, the determination of the weights in the weighted average combiner relies on experience. On the other hand, the majority vote could lead to confusion in decision making, e.g. one feature could be nominated for two ranks at the same time. Therefore, the simple average combiner is applied in this study to combine the sub-decisions, which is computed as follows:
$$\mathrm{AR}(j) = \frac{\sum_{k=1}^{M} \mathrm{rank}^{(k)}(j)}{M}, \tag{9}$$
where M is the number of sub-decisions and rank^(k)(j) is the significance-based rank of feature j in the kth sub-decision RF(k).

Thereafter, in step 3, in order to reduce the redundancy, those highly ranked but less independent features with respect to the obtained pre-final decision are eliminated. The similarity between features can be utilized to estimate the redundancy. Criteria for measuring the similarity between two random variables are broadly based on the linear dependency between them. The reason for choosing the linear dependency as the feature similarity measure is that, if the data are linearly separable in the original representation, they remain linearly separable when all but one of a set of linearly dependent features are eliminated. In this research, the most well-known measure of similarity between two random variables, the correlation coefficient, is adopted. The correlation coefficient ρ between two variables x and y is defined as
$$\rho(x, y) = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\operatorname{var}(y)}}, \tag{10}$$
where var(x) denotes the variance of x and cov(x, y) is the covariance between the two variables x and y.

The elimination procedure is then conducted according to the pre-final decision and the similarity measure between features. For example, the most significant (top-ranked) feature is retained; the features most related to it according to the similarity measure are regarded as redundant and removed, and the successive features are processed likewise until the remaining ranked features are linearly independent.
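A minimal sketch of this elimination step is given below, assuming ranked_idx is the pre-final ranking of feature indices (most significant first), X is the instance-by-feature matrix, and the correlation threshold of 0.95 is an illustrative choice rather than a value taken from the paper.

```python
import numpy as np

def remove_redundant(X, ranked_idx, threshold=0.95):
    """Walk down the pre-final ranking and drop any feature that is too strongly
    correlated (Eq. (10)) with a feature already retained."""
    corr = np.abs(np.corrcoef(X, rowvar=False))    # feature-by-feature correlation matrix
    kept = []
    for j in ranked_idx:
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)                         # j adds information beyond the retained features
    return kept                                    # final decision RF_final
```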
3. HFS's application in bearing fault diagnosis

This section applies the HFS to bearing fault diagnostics. The comparison results between HFS and other feature selection methods are demonstrated and discussed. To validate that the proposed feature selection scheme can improve the classification accuracy, a comparison between the proposed hybrid feature selection scheme and five other feature selection approaches was carried out. The eight algorithms are listed as follows:

(1) HFS with symmetrical uncertainty (HFS_SU);
(2) HFS with linear correlation coefficient (HFS_LCC);
(3) HFS with DB index (HFS_DB);
(4) PCA-based feature selection (Malhi & Gao, 2004);
(5) FRMV based on k-means clustering with symmetrical uncertainty (FRMV_KM) (Hong et al., 2008a);
(6) Forward search feature selection (SFFS) (Oduntan et al., 2008);
(7) Forward orthogonal search feature selection by maximizing the overall dependency (fosmod) (Wei & Billings, 2007);
(8) Feature selection through feature clustering (FFC) (Li, Hu, Shen, Chen, & Li, 2008).
The comparisons among them were in terms of classification accuracy. Following Hong et al. (2008a), the number of iterations of FRMV_KM was set to 100, k-means clustering was used to obtain the population of clustering solutions, and SU was adopted as the evaluation criterion. In order to obtain a comparable population of sub-decisions, the number of iterations of the proposed algorithm was set to 50. The threshold of fosmod was set to 0.2. Two commonly used clustering algorithms were adopted in the HFS: fuzzy c-means (FCM) clustering and hierarchical clustering. In this research, the result of FCM was defuzzified as follows:
$$R(k) = \begin{cases} 1, & \text{if } P(k) = \max(P) \\ 0, & \text{otherwise} \end{cases} \tag{11}$$

where P and P(k) denote the membership of an instance over all clusters and the membership of the instance in the kth cluster, respectively.
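The defuzzification of Eq. (11) amounts to a per-instance argmax over the membership matrix; a small sketch follows, assuming u is the cluster-by-instance membership matrix produced by an FCM implementation (a hypothetical variable, not part of the paper).

```python
import numpy as np

def defuzzify(u):
    """Assign each instance to the cluster with the largest membership (Eq. (11))."""
    return np.argmax(u, axis=0)

# toy membership matrix: 3 clusters x 4 instances
u = np.array([[0.7, 0.1, 0.2, 0.5],
              [0.2, 0.8, 0.3, 0.3],
              [0.1, 0.1, 0.5, 0.2]])
print(defuzzify(u))  # -> [0 1 2 0]
```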
The features discussed in this section for bearing defects include features extracted from the time domain, the frequency domain, the time-frequency domain and empirical mode decomposition (EMD). Firstly, in the time domain, statistical parameters were extracted directly from the waveform of the vibration signals. A wide set of statistical parameters, such as rms, kurtosis, skewness, crest
factor and normalized higher-order central moments, have been developed (Jack & Nandi, 2002; Lei, He, & Zi, 2008; Samanta & Nataraj, 2009; Samanta, Al-Balushi, & Al-Araimi, 2003). Secondly, the characteristic frequencies related to the bearing components were located, e.g. the ball spin frequency (BSF), the ball-pass frequency of the inner ring (BPFI), and the ball-pass frequency of the outer ring (BPFO). Besides, in order to interpret real-world signals effectively, the envelope technique applied to the frequency spectrum was used to extract features of the modulated carrier frequency signals (Patil, Mathew, & RajendraKumar, 2008). In addition, a signal feature proposed by Huang based on the envelope signal (Huang, Xi, & Li, 2007b), the power ratio of the maximal defective frequency to the mean (PMM for short), was calculated as follows:
$$\mathrm{PMM} = \frac{\max\bigl(p(f_{po}),\, p(f_{pi}),\, p(f_{bc})\bigr)}{\operatorname{mean}(p)}, \tag{12}$$
where p(f_po), p(f_pi) and p(f_bc) are the average powers at the defective frequencies of the outer-race, inner-race and ball defects, respectively, and mean(p) is the average of the overall frequency power. Thirdly, Yen introduced the wavelet packet transform (WPT) in Yen (2000) as follows:
$$e_{j,n} = \sum_{k} w_{j,n,k}^{2}, \tag{13}$$
where w_{j,n,k} is the packet coefficient, j is the scaling parameter, k is the translation parameter, and n is the oscillation parameter. Each
wavelet packet coefficient measures a specific sub-band frequency content. In addition, EMD was used to decompose the signal into several intrinsic mode functions (IMFs) and a residual. The EMD energy entropy in Yu, Yu, and Cheng (2006) was computed for the first several IMFs of the signal.

In this research, the self-organizing map (SOM) was used to validate the classification performance based on the selected features. The theoretical background of the unsupervised SOM has been extensively studied in the literature; a brief introduction of SOM for bearing fault diagnosis can be found in Liao and Lee (2009). With data from different bearing failure modes available, the SOM can be applied to build a health map in which different regions indicate different defects of a bearing. Each input vector is represented by a BMU (best matching unit) in the SOM. After training, the input vectors of a specific bearing defect are represented by a cluster of BMUs in the map, which forms a region indicating the defect. If the input vectors are labeled, each region can be defined to represent a defect.
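To indicate how a few of the listed candidate features can be computed, the sketch below extracts some time-domain statistics and the PMM of Eq. (12) from a raw vibration signal; the 5 Hz band half-width and the variable names are assumptions made for illustration, and the wavelet packet node energies of Eq. (13) and the EMD energy entropies would be computed analogously with a wavelet-packet or EMD library.

```python
import numpy as np
from scipy.signal import hilbert
from scipy.stats import kurtosis, skew

def time_domain_features(x):
    """A few of the time-domain statistics mentioned above."""
    rms = np.sqrt(np.mean(x ** 2))
    return {"rms": rms, "kurtosis": kurtosis(x), "skewness": skew(x),
            "crest_factor": np.max(np.abs(x)) / rms}

def pmm(x, fs, f_bpfo, f_bpfi, f_bsf, half_band=5.0):
    """Power ratio of the maximal defective frequency to the mean (PMM), on the envelope spectrum."""
    env = np.abs(hilbert(x))                            # envelope via the Hilbert transform
    spec = np.abs(np.fft.rfft(env - env.mean())) ** 2   # envelope power spectrum
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)

    def band_power(f0):
        band = (freqs > f0 - half_band) & (freqs < f0 + half_band)
        return spec[band].mean() if band.any() else 0.0

    return max(band_power(f) for f in (f_bpfo, f_bpfi, f_bsf)) / spec.mean()
```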
3.1. Experiments

In this research, two tests were conducted on two types of bearings, and the class information was considered unknown in both cases. In the first test, bearings were artificially seeded with a roller defect, an inner-race defect, an outer-race defect, and four different combinations of these single failures. In this case, an SKF 32208 bearing was tested, with an accelerometer installed in the vertical direction on its housing. The sampling rate of the vibration signal was 50 kHz. The BPFI, BPFO and BSF for this case were calculated as 131.73 Hz, 95.2 Hz and 77.44 Hz, respectively. Fig. 3 shows the vibration signals of all defects as well as the normal condition in the first test.

Fig. 3. Vibration signal of the first test, including the normal pattern and seven failure patterns (panels: Normal, Roller defect, Outer-race defect, Inner-race defect, Outer-race & Roller defect, Inner-race & Roller defect, Outer & Inner-race defect, Outer & Inner-race & Roller defect; axes: Time (s) vs. Acceleration (g)).

In the second test, a set of 6308-2R single-row deep-groove ball bearings were run to failure, resulting in roller defects, inner-race defects and outer-race defects (Huang et al., 2007b). In total, 10 bearings were involved in the experiment. The data sampling frequency was 20 kHz. The BPFI, BPFO and BSF in this case were calculated as 328.6 Hz, 205.3 Hz and 274.2 Hz, respectively. It should be pointed out that the beginning of the second test was not stable, after which the bearings entered a long normal period. Hence, two separate segments from the stable normal period were selected as baselines for training and testing, respectively. On the other hand, the data whose amplitude exceeded the mean value before the end of the test were regarded as potential failure patterns. Therefore, 70% of the faulty patterns and half of the good patterns were used for training the unsupervised learning model, while all the faulty patterns and the other half of the good patterns were used for testing. Fig. 4 shows part of the data segments of one bearing from the run-to-failure experiment in the second test.

Fig. 4. Vibration signal of one bearing of the second test; (1) unstable beginning of the test; (2) first stable segment; (3) second stable segment; (4) failure pattern (inner-race defect).
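The characteristic frequencies quoted above follow from the standard bearing kinematic formulas; a sketch is given below, where the geometry values in the example call are hypothetical and are not the SKF 32208 or 6308-2R parameters.

```python
import math

def bearing_fault_frequencies(n_rollers, d_roller, d_pitch, contact_angle_deg, shaft_hz):
    """Standard kinematic formulas for the bearing characteristic frequencies (Hz)."""
    ratio = (d_roller / d_pitch) * math.cos(math.radians(contact_angle_deg))
    bpfo = 0.5 * n_rollers * shaft_hz * (1.0 - ratio)                   # ball-pass frequency, outer race
    bpfi = 0.5 * n_rollers * shaft_hz * (1.0 + ratio)                   # ball-pass frequency, inner race
    bsf = (d_pitch / (2.0 * d_roller)) * shaft_hz * (1.0 - ratio ** 2)  # ball/roller spin frequency
    return bpfo, bpfi, bsf

# hypothetical geometry, for illustration only
print(bearing_fault_frequencies(n_rollers=8, d_roller=15.0, d_pitch=65.0,
                                contact_angle_deg=0.0, shaft_hz=25.0))
```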
3.2. Analysis and results

In the first test, a total of 24 features were computed, as follows: energies centered at 1xBPFO, 2xBPFO, 1xBPFI, 2xBPFI, 1xBSF and 2xBSF; 6 statistics of the raw signal (mean, rms, kurtosis, crest factor, skewness, entropy); 6 statistics of the envelope signal obtained by the Hilbert transform; and 6 statistics of the spectrum of the waveform obtained by FFT. Half of the data was used for training the SOM and the remaining part for testing.

Fig. 5 shows the results of the first test, with the x-axis representing the number of selected features fed into the unsupervised SOM for clustering and the y-axis representing the corresponding classification accuracy. The first 12 features selected by each algorithm are shown for convenience. Taking Fig. 5a as an example, the classification accuracies based on HFS_SU, HFS_LCC and HFS_DB with only the top-ranked feature as input were 92.11%, 92.11% and 85.59%, respectively. When the first three ranked features were used, accuracies of 97.19%, 99.77% and 97.03% were achieved. Compared with HFS_SU and HFS_DB, the features selected by HFS_LCC achieved the higher classification accuracy of 99.77%; in other words, HFS_LCC apparently selected the most representative features for this specific application. As shown in Fig. 5b, the highest classification accuracy for PCA was 99.38% with 5 features. In Fig. 5c, the classification accuracies based on HFS_SU, HFS_DB and HFS_LCC were higher than the results based on FRMV_KM; for FRMV_KM, the highest classification accuracy of 98.36% was achieved with 12 features. Fig. 5d compares SFFS and the three HFS methods: HFS_LCC selected the most representative features, and the accuracies reached by HFS were higher. For SFFS, the highest classification accuracy of 98.43% was achieved with 9 features. As shown in Fig. 5e, although the first 11 features selected by fosmod ultimately reached an accuracy of 99.14%, HFS not only obtained higher accuracy but also ranked the features with higher reliability. Compared with FFC, as shown in Fig. 5f, the features selected by HFS provided better classification accuracy with fewer features; for FFC, the highest accuracy of 98.43% was reached with 9 features.

Fig. 5a. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB.
Fig. 5b. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and the PCA-based method.
Fig. 5c. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FRMV_KM.
Fig. 5d. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and SFFS.
Fig. 5e. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and fosmod.
Fig. 5f. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FFC.
(All Fig. 5 panels plot the classification accuracy, validated by SOM, against the number of features according to the rankings.)

The performance improvement of the proposed model over FRMV_KM, SFFS, fosmod and FFC was mainly due to making use of every feature, combining the clustering solutions and evaluating independence, which overcomes the deficiencies of limited diversity among clustering solutions and of retaining mutually related features.

In order to illustrate the effect of feature-set redundancy on the classification performance and to demonstrate the robustness of HFS, more candidate features were involved in the second test. In total, 40 features were calculated, as follows: 10 statistics of the raw signal (var, rms, skewness, kurtosis, crest factor, 5th to 9th central moments); energies centered at 1xBPFO, 1xBPFI and 1xBSF for both the raw signal and the envelope; PMMs for both the raw signal and the envelope; 16 WPNs; and 6 IMF energy entropies.

The results in Fig. 6 show the classification accuracy of the second test. As shown in Fig. 6a, HFS_LCC reached the highest classification accuracy of 88.56% with the first 10 features, while HFS_DB and HFS_SU achieved their highest classification accuracies of 87.29% and 85.17% with the first 8 features and 11 features, respectively. Fig. 6b shows that, compared with the PCA-based feature selection method (highest accuracy 86.02%), HFS_LCC and HFS_DB achieved higher accuracy with the same number of features or fewer. In the comparison with FRMV_KM (highest accuracy 83.90%), shown in Fig. 6c, the HFS group showed apparently better classification accuracy with fewer features. As shown in Figs. 6d and 6e, the features selected by SFFS and fosmod resulted in accuracies of 84.75% and 85.17%, which were worse than that of the single feature selected by HFS_DB. Compared with FFC (as shown in Fig. 6f), HFS_LCC showed better performance, since an accuracy of 86.86% was reached by FFC with 6 selected features.

Fig. 6a. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB.
Fig. 6b. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and the PCA-based method.
Fig. 6c. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FRMV_KM.
Fig. 6d. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and SFFS.
Fig. 6e. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and fosmod.
Fig. 6f. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FFC.
(All Fig. 6 panels plot the classification accuracy, validated by SOM, against the number of features according to the rankings.)

From the results of the two tests, the conclusion can be drawn that the proposed HFS is robust and effective in selecting the most representative features, which maximizes the unsupervised classification performance. It should also be noted that, for both tests, FRMV_KM and HFS_SU shared the same evaluation criterion, but the decision provided by HFS_SU was always better than
that of FRMV_KM, which indicated that the proposed HFS scheme is superior to FRMV under the same evaluation criterion. Besides, it is worth noticing that, in both tests, the proposed HFS based on the three evaluation criteria, i.e. SU, LCC and the DB index, generated slightly different results. This suggested that the effectiveness of the features selected by the proposed HFS depends on the applied evaluation criterion, and LCC was considered more appropriate for both cases. Nonetheless, it is still appropriate to conclude that the overall performance based on the features selected by HFS was better than that of the other five methods.
4. Conclusion

This paper presented a hybrid unsupervised feature selection (HFS) approach for selecting the most representative features for unsupervised learning, and used two experimental bearing data sets to demonstrate the effectiveness of HFS. The performance of the HFS approach was compared with that of five other feature selection methods with respect to the accuracy improvement of the unsupervised learning algorithm SOM. The results showed that the proposed model could (a) identify the features that are relevant to the bearing defects, and (b) maximize the performance of unsupervised learning models with fewer features. Moreover, the results suggested that HFS relies on the evaluation criterion chosen for the particular application. Therefore, further research will focus on expanding HFS to broader applications and to online machinery defect diagnostics and prognostics.

Acknowledgement

The authors gratefully acknowledge the support of the 863 Program (No. 50821003), PR China, for this work.

References

Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 1–2, 245–271.
Dash, M., & Koot, P. W. (2009). Feature selection for clustering. In Encyclopedia of database systems (pp. 1119–1125).
Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1–4, 131–156.
Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 224–227.
Dietrich, C., Palm, G., & Schwenker, F. (2003). Decision templates for the classification of bioacoustic time series. Information Fusion, 2, 101–109.
Frigui, H. (2008). Clustering: Algorithms and applications. In 2008 1st international workshops on image processing theory, tools and applications, IPTA 2008. Sousse.
Ginart, A., Barlas, I., & Goldin, J. (2007). Automated feature selection for embeddable prognostic and health monitoring (PHM) architectures. In AUTOTESTCON (Proceedings), Anaheim, CA (pp. 195–201).
Greene, D., Cunningham, P., & Mayer, R. (2008). Unsupervised learning and clustering. Lecture Notes in Applied and Computational Mechanics, 51–90.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 1157–1182.
Hong, Y., Kwong, S., & Chang, Y. (2008a). Consensus unsupervised feature ranking from multiple views. Pattern Recognition Letters, 5, 595–602.
Hong, Y., Kwong, S., & Chang, Y. (2008b). Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm. Pattern Recognition, 9, 2742–2756.
Huang, J., Cai, Y., & Xu, X. (2007a). A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recognition Letters, 13, 1825–1844.
Huang, R., Xi, L., & Li, X. (2007b). Residual life predictions for ball bearings based on self-organizing map and back propagation neural network methods. Mechanical Systems and Signal Processing, 1, 193–207.
Jack, L. B., & Nandi, A. K. (2002). Fault detection using support vector machines and artificial neural networks, augmented by genetic algorithms. Mechanical Systems and Signal Processing, 2–3, 373–390.
Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 4–37.
Jardine, A. K. S., Lin, D., & Banjevic, D. (2006). A review on machinery diagnostics and prognostics implementing condition-based maintenance. Mechanical Systems and Signal Processing, 7, 1483–1510.
Kwak, N., & Choi, C. H. (2002). Input feature selection for classification problems. IEEE Transactions on Neural Networks, 1, 143–159.
Lei, Y. G., He, Z. J., & Zi, Y. Y. (2008). A new approach to intelligent fault diagnosis of rotating machinery. Expert Systems with Applications, 4, 1593–1600.
Li, G., Hu, X., Shen, X., et al. (2008). A novel unsupervised feature selection method for bioinformatics data sets through feature clustering. In IEEE international conference on granular computing, GRC 2008, Hangzhou (pp. 41–47).
Li, Y., Dong, M., & Hua, J. (2008). Localized feature selection for clustering. Pattern Recognition Letters, 10–18.
Liao, L., & Lee, J. (2009). A novel method for machine performance degradation assessment based on fixed cycle features test. Journal of Sound and Vibration, 326, 894–908.
Liu, X., Ma, L., Zhang, S., & Mathew, J. (2006). Feature group optimisation for machinery fault diagnosis based on fuzzy measures. Australian Journal of Mechanical Engineering, 2, 191–197.
Liu, H., & Yu, L. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 4, 491–502.
Malhi, A., & Gao, R. X. (2004). PCA-based feature selection scheme for machine defect classification. IEEE Transactions on Instrumentation and Measurement, 6, 1517–1525.
Mitra, P., Murthy, C. A., & Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3, 301–312.
Oduntan, I. O., Toulouse, M., & Baumgartner, R. (2008). A multilevel tabu search algorithm for the feature selection problem in biomedical data. Computers & Mathematics with Applications, 5, 1019–1033.
Patil, M. S., Mathew, J., & RajendraKumar, P. K. (2008). Bearing signature analysis as a medium for fault detection: A review. Journal of Tribology, 1.
Peng, Z. K., & Chu, F. L. (2004). Application of the wavelet transform in machine condition monitoring and fault diagnostics: A review with bibliography. Mechanical Systems and Signal Processing, 2, 199–221.
Samanta, B., Al-Balushi, K. R., & Al-Araimi, S. A. (2003). Artificial neural networks and support vector machines with genetic algorithm for bearing fault detection. Engineering Applications of Artificial Intelligence, 7–8, 657–665.
Samanta, B., & Nataraj, C. (2009). Use of particle swarm optimization for machinery fault detection. Engineering Applications of Artificial Intelligence, 2, 308–316.
Shao, Y., & Nezu, K. (2000). Prognosis of remaining bearing life using neural networks. Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, 3, 217–230.
Sugumaran, V., & Ramachandran, K. I. (2007). Automatic rule learning using decision tree for fuzzy classifier in fault diagnosis of roller bearing. Mechanical Systems and Signal Processing, 5, 2237–2247.
Wei, H. L., & Billings, S. A. (2007). Feature subset selection and ranking for data dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 162–166.
Xu, Z., Xuan, J., Shi, T., & Wu, B. (2009). Application of a modified fuzzy ARTMAP with feature-weight learning for the fault diagnosis of bearing. Expert Systems with Applications, 6, 9961–9968.
Yen, G. G. (2000). Wavelet packet feature extraction for vibration monitoring. IEEE Transactions on Industrial Electronics, 3, 650–667.
Yu, Y., Yu, D., & Cheng, J. (2006). A roller bearing fault diagnosis method based on EMD energy entropy and ANN. Journal of Sound and Vibration, 1–2, 269–277.
Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 1205–1224.
Zhang, H., & Sun, G. (2002). Feature selection using tabu search method. Pattern Recognition, 35, 701–711.