Expert Systems with Applications 38 (2011) 11311–11320
A hybrid feature selection scheme for unsupervised learning and its application in bearing fault diagnosis

Yang Yang a,*, Yinxia Liao b, Guang Meng a, Jay Lee b
a State Key Laboratory of Mechanical System and Vibration, Shanghai Jiaotong University, Shanghai 200240, PR China
b NSF I/UCR Center for Intelligent Maintenance Systems, 560 Rhodes Hall, University of Cincinnati, Cincinnati, OH 45221, USA
Keywords: Feature selection; Unsupervised learning; Fault diagnostics
Abstract

With the development of condition-based maintenance techniques and the consequent requirement for good machine learning methods, new challenges arise in unsupervised learning. In real-world situations, because the relevant features that could reveal the real machine condition are often unknown a priori, condition monitoring systems based on unimportant features, e.g. noise, might suffer high false-alarm rates, especially when the characteristics of failures are costly or difficult to learn. Therefore, it is important to select the most representative features for unsupervised learning in fault diagnostics. In this paper, a hybrid feature selection scheme (HFS) for unsupervised learning is proposed to improve the robustness and the accuracy of fault diagnostics. It provides a general framework for feature selection based on significance evaluation and similarity measurement with respect to multiple clustering solutions. The effectiveness of the proposed HFS method is demonstrated by a bearing fault diagnostics application and by comparison with other feature selection methods.

© 2011 Elsevier Ltd. All rights reserved.
* Corresponding author. Tel.: +86 21 34206831x322. E-mail address: [email protected] (Y. Yang).
doi:10.1016/j.eswa.2011.02.181

1. Introduction

As sensing and signal processing technologies advance rapidly, an increasing number of features have become involved in condition monitoring systems and fault diagnosis. A challenge in this area is to select the parameters most sensitive to the various types of fault, especially when the characteristics of failures are costly or difficult to learn (Malhi & Gao, 2004). In reality, since the relevant or important features are often not available a priori, a large number of candidate features have been proposed to achieve a better representation of the machine health condition (Dash & Liu, 1997; Jardine, Lin, & Banjevic, 2006; Peng & Chu, 2004). Because of the irrelevant and redundant features in the original feature space, employing all features might lead to high complexity and low performance of fault diagnosis. Moreover, most unsupervised learning methods assume that all features have a uniform degree of importance during clustering operations (Dash & Koot, 2009). Even within the optimal feature set, it is also assumed that each feature has the same sensitivity throughout the clustering operations. In fact, it is known that an important feature facilitates creating clusters, while an unimportant feature, on the contrary, may jeopardize the clustering operation by blurring the clusters. Thereby, it is better to select only the most representative features (Xu, Xuan, Shi, & Wu, 2009) rather than simply reducing the number of features. Hence, it is of significance to develop a systematic and automatic feature selection method
that is capable of selecting the prominent features to achieve a better insight into the underlying machine performance.

Generally speaking, feature selection is one of the essential and frequently used techniques in machine learning (Blum & Langley, 1997; Dash & Koot, 2009; Dash & Liu, 1997; Ginart, Barlas, & Goldin, 2007; Jain, Duin, & Mao, 2000; Kwak & Choi, 2002). Its aim is to select the most representative features, which brings immediate benefits for mining performance such as predictive accuracy and solution comprehensibility (Guyon & Elisseeff, 2003; Liu & Yu, 2005). However, traditional feature selection algorithms for classification do not work for unsupervised learning since no class information is available. Dimensionality reduction or feature extraction methods are frequently used for unsupervised data, such as Principal Component Analysis (PCA), the Karhunen–Loeve transformation, or Singular Value Decomposition (SVD) (Dash & Koot, 2009). Malhi and Gao (2004) presented a PCA-based feature selection model for bearing defect classification in a condition monitoring system. Compared to using all features initially considered relevant to the classification results, it provided more accurate classifications for both supervised and unsupervised purposes with fewer feature inputs. The drawback, however, is the difficulty of understanding the data and the found clusters through the extracted features (Dash & Koot, 2009). Given sufficient computation time, feature subset selection investigates all candidate feature subsets and selects the optimal one that satisfies the cost function. Greedy search algorithms such as sequential forward feature selection (SFFS) (or backward search feature selection (BSFS)) and random feature selection
were commonly used. Oduntan, Toulouse, and Baumgartner (2008) developed a multilevel tabu search algorithm combined with a hierarchical search framework, and compared it to sequential forward feature selection, random feature selection and tabu search feature selection (Zhang & Sun, 2002). Feature subset selection requires intensive computational time and shows poor performance for non-monotonic indices. In order to overcome these drawbacks, feature selection methods tend to rank features or select a subset of the original features (Guyon & Elisseeff, 2003). Feature ranking techniques sort the features according to cost functions and select a subset from the ordered features. Hong, Kwong, and Chang (2008a) introduced an effective method, feature ranking from multiple views (FRMV). It scores each feature using a ranking criterion by considering multiple clustering results, and selects the first several features with the best quality as the "optimal" feature subset. However, FRMV emphasizes the importance of features for achieving better classification rather than the redundancy of the selected features, so the selected features are not necessarily the optimal subset.

Considering the approaches for evaluating the cost function, feature selection algorithms broadly fall into three categories: the filter model, the wrapper model and the hybrid model (Liu & Yu, 2005). The filter model discovers the general characteristics of the data and treats feature selection as a preprocessing step that is independent of any mining algorithm. Filter methods are less time-consuming but also less effective. The wrapper model incorporates one predetermined learning algorithm and selects the feature subset that improves its mining performance according to certain criteria. It is more time-consuming but more effective than the filter model. Moreover, the predetermined learning algorithm remains biased towards the shape of the clusters, that is, it is sensitive to the data structure according to its operating concept. The hybrid model takes advantage of the two models in different search stages according to different criteria. Mitra, Murthy, and Pal (2002) described a filter feature selection algorithm for high-dimensional data sets based on measuring similarity between features, whereby the redundancy therein was removed; a maximum information compression index was also introduced in their work to estimate the similarity between features. Li, Dong, and Hua (2008) proposed a filter feature selection algorithm through feature clustering (FFC), which groups the features into different clusters based on feature similarity and selects the representative features in each cluster to reduce feature redundancy. Wei and Billings (2007) introduced a forward orthogonal search feature selection algorithm that maximizes the overall dependency to find significant variables; it also provides a ranked list of selected features ordered according to their percentage contribution to representing the overall structure. Liu, Ma, Zhang, and Mathew (2006) presented a wrapper model based on the fuzzy c-means (FCM) algorithm for rolling element bearing fault diagnostics. Sugumaran and Ramachandran (2007) employed a wrapper approach based on decision trees, with information gain and entropy reduction as criteria, to select representative features that could discriminate bearing faults.
Hong, Kwong, and Chang (2008b) described a novel feature selection algorithm based on unsupervised learning ensembles and a population-based incremental learning algorithm. It searches among all candidate feature subsets for the one whose clustering result is most similar to that obtained by an unsupervised learning ensemble method. Huang, Cai, and Xu (2007a) developed a two-stage hybrid genetic algorithm to find a subset of features. In the first stage, the mutual information between the predicted labels and the true class labels served as the fitness function for the genetic algorithm to conduct a global search in a wrapper manner. In the second stage, the conditional mutual information served as an independent measure for feature ranking, considering both the relevance and the redundancy of the features.
As mentioned above, these techniques either require the available features to be independent initially, which is the opposite of realistic situations, or remain biased toward the shape of the clusters due to their fundamental concept. This paper introduces a hybrid feature selection scheme for unsupervised learning that can overcome those deficiencies. The proposed scheme generates two randomly selected subspaces for further clustering, combines different genres of clustering analysis to obtain a population of sub-decisions of feature selection based on significance measurement, and removes redundant features based on feature similarity measurement to improve the quality of the selected features. The effectiveness of the proposed scheme is validated by an application of bearing defect classification, and the experimental results illustrate that the proposed method is able to (a) identify the features that are relevant to the bearing defects, and (b) maximize the performance of unsupervised learning models with fewer features.

The rest of this paper is arranged as follows. Section 2 illustrates the proposed HFS scheme in detail. Section 3 discusses the application of the proposed feature selection scheme in bearing fault diagnosis. Finally, Section 4 concludes this paper.
2. Hybrid feature selection scheme (HFS) for unsupervised classification

It is time-consuming, or even a difficult mission for an experienced fault diagnosis engineer, to determine which feature among all available features is able to distinguish the characteristics of various failures, especially when no prior knowledge (class information) is available. To tackle the problem of absent class information, FRMV (Hong et al., 2008a) extended the feature ranking methodology to unsupervised data clustering. It offered a generic approach to boost the performance of clustering analysis. A stable and robust unsupervised feature ranking approach was proposed based on the ensemble of multiple feature rankings obtained from different views of the same data set. When conducting FRMV, data instances are first clustered in a randomly selected feature subspace to obtain a clustering solution, and then all features are ranked according to their relevance to the obtained clustering solution. These two steps iterate until a population of feature rankings is achieved. Thereby, all obtained feature rankings are combined by a consensus function into a single consensus ranking. However, FRMV clusters the data instances in a subspace that consists of a randomly selected half of the features each time. It is likely that some valuable features are ignored in the beginning, that is, some features may never be included in any iteration. Besides, FRMV only ensembles the results of one unsupervised learning algorithm, which overlooks the fact that a learning algorithm is likely to hold a bias toward the natural structure of the data, such as a hyper-spherical structure or a hierarchical structure (Frigui, 2008; Greene, Cunningham, & Mayer, 2008). As illustrated in Fig. 1, the data set consists of 11 points. If two clusters are contained in the data set, a classifier based on the hierarchical concept tends to assign points 1, 2, 3, 4 and 5 to one cluster and the remaining points to the other cluster, while a classifier based on the hyper-spherical concept tends to assign points 1, 2, 3, 4, 6, 7 and 8 to one cluster and the remaining points to the other cluster. Furthermore, in FRMV all features are assumed to be independent before the selection process, which is usually not the case in the real world. Thereby, highly ranked features are related to their neighbors with high probability; in other words, some top-ranked features might turn out to be redundant. Since the above-mentioned shortcomings of FRMV are obstacles to higher classification performance and constraints on wider applications in the real world, a hybrid feature selection
Fig. 1. An example of data structures in 2D; (a) hierarchical cluster, (b) spherical cluster.
scheme (HFS) is proposed to overcome these deficiencies. The remainder of this section presents the HFS scheme for unsupervised learning and introduces the criteria used in HFS.

2.1. Procedure of the hybrid unsupervised feature selection method

The HFS is inspired by FRMV, which ranks each feature according to the relevancy between the feature and the combined clustering solutions. Moreover, the HFS is developed to combine different genres of clustering analysis into a consensus decision and to rank the features according to both their relevancies to the consensus decision and the independencies between features. Generally, HFS involves two aspects: (1) significance evaluation, which determines the contribution of each feature on behalf of multiple clustering results; (2) redundancy evaluation, which retains the most significant and independent features with respect to the feature similarity. Some notations used throughout this paper are given as follows. The input vector x_i of the original feature space X with D candidate features is denoted as x_i = {x_i^1, x_i^2, ..., x_i^D} (i = 1, ..., M), in which i denotes the ith instance and M is the number of instances. Let RF(k) = {rank(k)(F_1), rank(k)(F_2), ..., rank(k)(F_D)} (1 ≤ rank(k)(F_i) ≤ D) be the kth sub-decision of the feature ranking, where rank(k)(F_i) denotes the rank of the ith feature F_i in the kth sub-decision. Assuming there are P sub-decisions of the feature ranking {RF(1), RF(2), ..., RF(P)}, a combine function determines a final decision by combining the P sub-decisions into a single consensus feature decision RF_pre-final, which is thereafter processed according to the feature similarity to obtain RF_final. Details of the scheme are described as follows.
Algorithm: Hybrid feature selection scheme for unsupervised learning
Input: feature space X, the number of clusters N, maximum iteration L
Output: decision of feature selection
(1) Iterate until a population of sub-decisions is obtained
    For k = 1:L, Do:
    (1.1) Randomly divide the original feature space into two subspaces X1, X2
    (1.2) Group the data with the first and second groups of unsupervised learning algorithms in subspaces X1 and X2 separately
    (1.3) Evaluate the significance of each feature based on the significance measurement to obtain the kth sub-decision of feature selection RF(k)
    End
(2) RFpre-final = combiner{RF(1), RF(2), . . . , RF(2L)}  // combine all rankings into a single consensus one
(3) Redundancy evaluation based on feature similarities
(4) Return RFfinal
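To make the procedure concrete, the following is a minimal Python sketch of steps (1) and (2), written under the assumption that scikit-learn's KMeans and AgglomerativeClustering stand in for the two genres of clustering and that the absolute linear correlation of Eq. (1) serves as the significance score; the function names are illustrative and do not reproduce the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def significance(X, labels):
    """Score every feature by its absolute linear correlation with the cluster labels (cf. Eq. (1))."""
    y = labels.astype(float)
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return np.nan_to_num(np.array(scores))

def hfs_pre_final_ranking(X, n_clusters, n_iter=50, seed=0):
    """Return feature indices ordered from most to least significant (RF_pre-final)."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    rank_sum = np.zeros(D)
    for _ in range(n_iter):
        perm = rng.permutation(D)
        subspaces = (perm[: D // 2], perm[D // 2:])                 # step 1.1: two random subspaces
        models = (KMeans(n_clusters=n_clusters, n_init=10),         # hyper-spherical genre
                  AgglomerativeClustering(n_clusters=n_clusters))   # hierarchical genre
        for subspace, model in zip(subspaces, models):
            labels = model.fit_predict(X[:, subspace])               # step 1.2: cluster in the subspace
            scores = significance(X, labels)                         # step 1.3: score all features
            ranks = np.empty(D)
            ranks[np.argsort(-scores)] = np.arange(1, D + 1)         # rank 1 = most significant
            rank_sum += ranks
    avg_rank = rank_sum / (2 * n_iter)                               # step 2: simple-average combiner (Eq. (9))
    return np.argsort(avg_rank)
```

The returned ordering corresponds to RF_pre-final; the redundancy evaluation of step (3) is discussed in Section 2.3.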
At the beginning, the original feature space X is randomly divided into two feature subspaces X1 and X2, in which the instances are clustered correspondingly. In step 1.2, two different genres of clustering analysis, e.g. hyper-spherical clustering and hierarchical clustering, are used to classify the data instances in the two subspaces respectively. Thereby, a clustering solution is obtained in each subspace. Then all features are ranked with respect to their relevancies to the obtained clustering solutions in step 1.3, named the significance evaluation. The above two steps iterate until a population of feature rankings, named sub-decisions, is achieved. In step 2, a consensus function is utilized to combine all sub-decisions into a pre-final decision. Thereafter, the final decision of feature selection is accomplished by re-ranking the pre-final decision according to the re-ranking scheme based on feature similarity in step 3, named the redundancy evaluation. The details of HFS are introduced in Sections 2.2 and 2.3. Fig. 2 illustrates the framework of the HFS.
Table 1 lists the differences between FRMV and the proposed HFS. First of all, in order to make sure that every feature in the original feature set is able to contribute to the decision making, HFS makes use of both randomly divided subspaces of the original feature space, instead of ignoring some features by randomly selecting half of the original feature space as in FRMV. Secondly, HFS considers the bias of each individual unsupervised learning algorithm; thereby, different genres of clustering methods are used to cluster the data in the subspaces. Moreover, HFS provides a redundancy evaluation according to the feature similarity and re-ranks the features. It is therefore more appropriate for real-world applications than FRMV.

2.2. Significance evaluation

The goal of unsupervised feature selection is to find as few features as possible that best uncover the "interesting natural" clusters in the data, which can be found by an unsupervised learning algorithm. Therefore, the relationship between the clustering solution and a feature is considered as the significance of the feature to the clustering solution. In step 1.3, the features are ranked with respect to their relevancies to the obtained clustering solutions, named the significance evaluation. The sub-decisions serve as the target and each feature is considered as a variable. In this research, the widely used linear correlation coefficient (LCC) (Hong et al., 2008a), the symmetrical uncertainty (SU) (Yu & Liu, 2004) and the Davies–Bouldin index (DB) (Davies & Bouldin, 1979) are used for significance evaluation. The details of each criterion are introduced as follows. For convenience, denote F_k and R_k as the kth feature and the kth sub-decision, respectively. First, the linear correlation coefficient measures the correlation between the variable and the target, and is calculated as follows:
$$\mathrm{LCC}(F_k, R_k) = \frac{\operatorname{cov}(F_k, R_k)}{\sigma(F_k)\,\sigma(R_k)}, \tag{1}$$
where σ(R_k) is the standard deviation of the kth target and cov(F_k, R_k) is the covariance between F_k and R_k. Secondly, the symmetrical uncertainty is defined as follows:
$$\mathrm{SU}(F_k, R_k) = 2\,\frac{\mathrm{IG}(F_k \mid R_k)}{H(F_k) + H(R_k)}, \tag{2}$$

with

$$\mathrm{IG}(F_k \mid R_k) = H(F_k) - H(F_k \mid R_k), \tag{3}$$

$$H(F_k) = -\sum_{F_k' \in X(F_k)} P(F_k')\,\log\bigl(P(F_k')\bigr), \tag{4}$$

$$H(F_k \mid R_k) = -\sum_{R_k' \in X(R_k)} P(R_k') \Biggl[\sum_{F_k' \in X(F_k)} P(F_k' \mid R_k')\,\log\bigl(P(F_k' \mid R_k')\bigr)\Biggr], \tag{5}$$

$$P(F_k') = \frac{\sum_{i=1}^{N} \delta(d_i, F_k')}{N}, \tag{6}$$

$$\delta(d_i, F_k') = \begin{cases} 1, & \text{if } d_i = F_k' \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

Fig. 2. Flowchart of HFS for unsupervised learning.
Table 1. Comparison between FRMV and HFS.
Subspace: FRMV randomly selects N/2 features; HFS randomly divides the feature space into two subspaces X1, X2.
Clustering analysis: FRMV remains biased towards a single data structure; HFS considers the bias of each individual algorithm towards the data structure.
Independence evaluation: FRMV has none; HFS re-ranks the features according to the similarity measurement.
Note: N is the number of all features.
where H(F_k) is the entropy of F_k and H(F_k | R_k) is the conditional entropy of F_k given R_k. X(F_k) denotes all possible values of F_k and X(R_k) all possible values of R_k. P(F_k') is the probability that F_k equals F_k', and P(F_k' | R_k') is the probability that F_k equals F_k' under the condition that the instances are assigned to the group R_k'. A symmetrical uncertainty SU(F_k, R_k) of 1 indicates that F_k is completely related to R_k, whereas a value of 0 means that F_k is entirely irrelevant to the target (Hong et al., 2008a; Shao & Nezu, 2000).
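As a rough illustration of how such a criterion can be evaluated in code, the sketch below computes the symmetrical uncertainty between a continuous feature (discretized into equal-width bins, a detail the paper does not specify) and a clustering sub-decision; LCC can be obtained directly from np.corrcoef, and the DB index is available, for instance, as sklearn.metrics.davies_bouldin_score.

```python
import numpy as np

def entropy(values):
    """Empirical entropy of a discrete variable."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def symmetrical_uncertainty(feature, labels, n_bins=10):
    """SU(F, R) = 2 * IG(F|R) / (H(F) + H(R)), cf. Eqs. (2)-(7)."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    f = np.digitize(feature, edges[1:-1])                  # discretize the continuous feature
    h_f, h_r = entropy(f), entropy(labels)
    h_f_given_r = sum((labels == r).mean() * entropy(f[labels == r])
                      for r in np.unique(labels))          # conditional entropy H(F|R)
    ig = h_f - h_f_given_r
    return 2.0 * ig / (h_f + h_r) if (h_f + h_r) > 0 else 0.0
```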
Thirdly, the DB index is a function of the ratio of the within-cluster scatter to the between-cluster separation, computed as follows:

$$\mathrm{DB} = \frac{1}{n}\sum_{i=1}^{n} \max_{j \neq i} \left\{ \frac{S_n(Q_i) + S_n(Q_j)}{S(Q_i, Q_j)} \right\}, \tag{8}$$
where n is the number of clusters, Q_i stands for the ith cluster, S_n(Q_i) denotes the average distance of all objects in the cluster to the cluster centre, and S(Q_i, Q_j) is the distance between the cluster centres. The DB index is small if the clusters are compact and far from each other; in other words, a small DB index means a good clustering.

2.3. Combination and similarity measurement

Besides the maximization of clustering performance, the other important purpose is the selection of features based on feature dependency, or similarity. Any feature carrying little or no additional information beyond that subsumed by the remaining features is redundant and should be eliminated (Mitra et al., 2002). That is, if a highly ranked feature carries valuable information and a lower-ranked feature is very similar to it, the latter should be eliminated because it carries no additional valuable information. Therefore, the similarities between features are taken as the reference for the redundancy evaluation.

In step 2, a consensus function, named the combiner, is utilized to combine all sub-decisions into a pre-final decision. A large number of combiners for combining classifier results were discussed in Dietrich, Palm, and Schwenker (2003). The most common combiners are the majority vote, the simple average and the weighted average. In the simple average, the average of the learning model results is calculated and the variable with the largest average value is selected as the final decision. The weighted average follows the same concept as the simple average except that the weights are selected heuristically. The majority vote assigns the kth variable rank j if more than half of the sub-decisions vote it to rank j. Practically, the determination of the weights in the weighted average combiner relies on experience. On the other hand, the majority vote could lead to confusion in decision making, e.g. one feature could be nominated for two ranks at the same time. Therefore, the simple average combiner is applied in this study to combine the sub-decisions, which is computed as follows:
$$\mathrm{AR}(j) = \frac{\sum_{k=1}^{M} \mathrm{rank}^{(k)}(j)}{M}, \tag{9}$$
where M is the number of sub-decisions and rank^(k)(j) is the significance-based rank of feature j in the kth sub-decision RF(k).

Thereafter, in step 3, in order to reduce the redundancy, those highly ranked but less independent features with respect to the obtained pre-final decision are eliminated. The similarity between features can be utilized to estimate the redundancy. Criteria for measuring the similarity between two random variables are broadly based on the linear dependency between them. The reason for choosing the linear dependency as the feature similarity measure is that, if the data are linearly separable in the original representation, they remain linearly separable when all but one of a set of linearly dependent features are eliminated. In this research, the most well-known measure of similarity between two random variables, the correlation coefficient, is adopted. The correlation coefficient ρ between two variables x and y is defined as
$$\rho(x, y) = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\operatorname{var}(y)}}, \tag{10}$$
where var(x) denotes the variance of x and cov(x, y) is the covariance between the two variables x and y.

The elimination procedure is then conducted according to the pre-final decision and the similarity measure between features. For example, the most significant (top-ranked) feature is retained; the features most related to it according to the similarity measure are regarded as redundant and removed, and the successive features are processed likewise until the remaining ranked features are linearly independent.
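A minimal sketch of this elimination step is given below, assuming ranked_idx is the pre-final ranking of feature indices (most significant first), X is the instance-by-feature matrix, and the correlation threshold of 0.95 is an illustrative choice rather than a value taken from the paper.

```python
import numpy as np

def remove_redundant(X, ranked_idx, threshold=0.95):
    """Walk down the pre-final ranking and drop any feature that is too strongly
    correlated (Eq. (10)) with a feature already retained."""
    corr = np.abs(np.corrcoef(X, rowvar=False))    # feature-by-feature correlation matrix
    kept = []
    for j in ranked_idx:
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)                         # j adds information beyond the retained features
    return kept                                    # final decision RF_final
```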
3. HFS's application in bearing fault diagnosis

This section applies the HFS to bearing fault diagnostics. The comparison results between HFS and other feature selection methods are demonstrated and discussed. To validate that the proposed feature selection scheme can improve the classification accuracy, a comparison between the proposed hybrid feature selection scheme and five other feature selection approaches was carried out. The eight algorithms are listed as follows:

(1) HFS with symmetrical uncertainty (HFS_SU);
(2) HFS with linear correlation coefficient (HFS_LCC);
(3) HFS with DB index (HFS_DB);
(4) PCA-based feature selection (Malhi & Gao, 2004);
(5) FRMV based on k-means clustering with symmetrical uncertainty (FRMV_KM) (Hong et al., 2008a);
(6) Forward search feature selection (SFFS) (Oduntan et al., 2008);
(7) Forward orthogonal search feature selection by maximizing the overall dependency (fosmod) (Wei & Billings, 2007);
(8) Feature selection through feature clustering (FFC) (Li, Hu, Shen, Chen, & Li, 2008).
The comparisons among them were in terms of classification accuracy. Following Hong et al. (2008a), the number of iterations of FRMV_KM was set to 100, k-means clustering was used to obtain the population of clustering solutions, and SU was adopted as the evaluation criterion. In order to obtain a comparable population of sub-decisions, the number of iterations of the proposed algorithm was set to 50. The threshold of fosmod was set to 0.2. Two commonly used clustering algorithms were adopted in the HFS: fuzzy c-means (FCM) clustering and hierarchical clustering. In this research, the result of FCM was defuzzified as follows:
$$R(k) = \begin{cases} 1, & \text{if } P(k) = \max(P) \\ 0, & \text{otherwise} \end{cases} \tag{11}$$

where P and P(k) denote the membership of an instance over all clusters and the membership of the instance in the kth cluster, respectively.
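The defuzzification of Eq. (11) amounts to a per-instance argmax over the membership matrix; a small sketch follows, assuming u is the cluster-by-instance membership matrix produced by an FCM implementation (a hypothetical variable, not part of the paper).

```python
import numpy as np

def defuzzify(u):
    """Assign each instance to the cluster with the largest membership (Eq. (11))."""
    return np.argmax(u, axis=0)

# toy membership matrix: 3 clusters x 4 instances
u = np.array([[0.7, 0.1, 0.2, 0.5],
              [0.2, 0.8, 0.3, 0.3],
              [0.1, 0.1, 0.5, 0.2]])
print(defuzzify(u))  # -> [0 1 2 0]
```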
The features discussed in this section for bearing defects include features extracted from the time domain, the frequency domain, the time-frequency domain and empirical mode decomposition (EMD). Firstly, in the time domain, statistical parameters were extracted directly from the waveform of the vibration signals. A wide set of statistical parameters, such as rms, kurtosis, skewness, crest
factor and normalized higher-order central moments, have been developed (Jack & Nandi, 2002; Lei, He, & Zi, 2008; Samanta & Nataraj, 2009; Samanta, Al-Balushi, & Al-Araimi, 2003). Secondly, the characteristic frequencies related to the bearing components were located, e.g. the ball spin frequency (BSF), the ball-pass frequency of the inner ring (BPFI), and the ball-pass frequency of the outer ring (BPFO). Besides, in order to interpret real-world signals effectively, the envelope technique applied to the frequency spectrum was used to extract features of the modulated carrier frequency signals (Patil, Mathew, & RajendraKumar, 2008). In addition, a signal feature proposed by Huang based on the envelope signal (Huang, Xi, & Li, 2007b), the power ratio of the maximal defective frequency to the mean (PMM for short), was calculated as follows:
$$\mathrm{PMM} = \frac{\max\bigl(p(f_{po}),\, p(f_{pi}),\, p(f_{bc})\bigr)}{\operatorname{mean}(p)}, \tag{12}$$
where p(f_po), p(f_pi) and p(f_bc) are the average powers at the defective frequencies of the outer-race, inner-race and ball defects, respectively, and mean(p) is the average of the overall frequency power. Thirdly, Yen introduced the wavelet packet transform (WPT) in Yen (2000) as follows:
$$e_{j,n} = \sum_{k} w_{j,n,k}^{2}, \tag{13}$$
where w_{j,n,k} is the packet coefficient, j is the scaling parameter, k is the translation parameter, and n is the oscillation parameter. Each
wavelet packet coefficient measures a specific sub-band frequency content. In addition, EMD was used to decompose the signal into several intrinsic mode functions (IMFs) and a residual. The EMD energy entropy in Yu, Yu, and Cheng (2006) was computed for the first several IMFs of the signal.

In this research, the self-organizing map (SOM) was used to validate the classification performance based on the selected features. The theoretical background of the unsupervised SOM has been extensively studied in the literature; a brief introduction of SOM for bearing fault diagnosis can be found in Liao and Lee (2009). With data from different bearing failure modes available, the SOM can be applied to build a health map in which different regions indicate different defects of a bearing. Each input vector is represented by a BMU (best matching unit) in the SOM. After training, the input vectors of a specific bearing defect are represented by a cluster of BMUs in the map, which forms a region indicating the defect. If the input vectors are labeled, each region can be defined to represent a defect.
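To indicate how a few of the listed candidate features can be computed, the sketch below extracts some time-domain statistics and the PMM of Eq. (12) from a raw vibration signal; the 5 Hz band half-width and the variable names are assumptions made for illustration, and the wavelet packet node energies of Eq. (13) and the EMD energy entropies would be computed analogously with a wavelet-packet or EMD library.

```python
import numpy as np
from scipy.signal import hilbert
from scipy.stats import kurtosis, skew

def time_domain_features(x):
    """A few of the time-domain statistics mentioned above."""
    rms = np.sqrt(np.mean(x ** 2))
    return {"rms": rms, "kurtosis": kurtosis(x), "skewness": skew(x),
            "crest_factor": np.max(np.abs(x)) / rms}

def pmm(x, fs, f_bpfo, f_bpfi, f_bsf, half_band=5.0):
    """Power ratio of the maximal defective frequency to the mean (PMM), on the envelope spectrum."""
    env = np.abs(hilbert(x))                            # envelope via the Hilbert transform
    spec = np.abs(np.fft.rfft(env - env.mean())) ** 2   # envelope power spectrum
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)

    def band_power(f0):
        band = (freqs > f0 - half_band) & (freqs < f0 + half_band)
        return spec[band].mean() if band.any() else 0.0

    return max(band_power(f) for f in (f_bpfo, f_bpfi, f_bsf)) / spec.mean()
```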
3.1. Experiments

In this research, two tests were conducted on two types of bearings, and the class information was considered unknown in both cases. In the first test, bearings were artificially seeded with a roller defect, an inner-race defect, an outer-race defect, and four different combinations of these single failures. In this case, an SKF 32208 bearing was tested, with an accelerometer installed in the vertical direction on its housing. The sampling rate of the vibration signal was 50 kHz. The BPFI, BPFO and BSF for this case were calculated as 131.73 Hz, 95.2 Hz and 77.44 Hz, respectively. Fig. 3 shows the vibration signals of all defects as well as the normal condition in the first test.

Fig. 3. Vibration signal of the first test, including the normal pattern and seven failure patterns (panels: Normal, Roller defect, Outer-race defect, Inner-race defect, Outer-race & Roller defect, Inner-race & Roller defect, Outer & Inner-race defect, Outer & Inner-race & Roller defect; axes: Time (s) vs. Acceleration (g)).

In the second test, a set of 6308-2R single-row deep-groove ball bearings were run to failure, resulting in roller defects, inner-race defects and outer-race defects (Huang et al., 2007b). In total, 10 bearings were involved in the experiment. The data sampling frequency was 20 kHz. The BPFI, BPFO and BSF in this case were calculated as 328.6 Hz, 205.3 Hz and 274.2 Hz, respectively. It should be pointed out that the beginning of the second test was not stable, after which the bearings entered a long normal period. Hence, two separate segments from the stable normal period were selected as baselines for training and testing, respectively. On the other hand, the data whose amplitude exceeded the mean value before the end of the test were regarded as potential failure patterns. Therefore, 70% of the faulty patterns and half of the good patterns were used for training the unsupervised learning model, while all the faulty patterns and the other half of the good patterns were used for testing. Fig. 4 shows part of the data segments of one bearing from the run-to-failure experiment in the second test.

Fig. 4. Vibration signal of one bearing of the second test; (1) unstable beginning of the test; (2) first stable segment; (3) second stable segment; (4) failure pattern (inner-race defect).
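The characteristic frequencies quoted above follow from the standard bearing kinematic formulas; a sketch is given below, where the geometry values in the example call are hypothetical and are not the SKF 32208 or 6308-2R parameters.

```python
import math

def bearing_fault_frequencies(n_rollers, d_roller, d_pitch, contact_angle_deg, shaft_hz):
    """Standard kinematic formulas for the bearing characteristic frequencies (Hz)."""
    ratio = (d_roller / d_pitch) * math.cos(math.radians(contact_angle_deg))
    bpfo = 0.5 * n_rollers * shaft_hz * (1.0 - ratio)                   # ball-pass frequency, outer race
    bpfi = 0.5 * n_rollers * shaft_hz * (1.0 + ratio)                   # ball-pass frequency, inner race
    bsf = (d_pitch / (2.0 * d_roller)) * shaft_hz * (1.0 - ratio ** 2)  # ball/roller spin frequency
    return bpfo, bpfi, bsf

# hypothetical geometry, for illustration only
print(bearing_fault_frequencies(n_rollers=8, d_roller=15.0, d_pitch=65.0,
                                contact_angle_deg=0.0, shaft_hz=25.0))
```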
3.2. Analysis and results

In the first test, a total of 24 features were computed, as follows: energies centered at 1xBPFO, 2xBPFO, 1xBPFI, 2xBPFI, 1xBSF and 2xBSF; 6 statistics of the raw signal (mean, rms, kurtosis, crest factor, skewness, entropy); 6 statistics of the envelope signal obtained by the Hilbert transform; and 6 statistics of the spectrum of the waveform obtained by FFT. Half of the data was used for training the SOM and the remaining part for testing.

Fig. 5 shows the results of the first test, with the x-axis representing the number of selected features fed into the unsupervised SOM for clustering and the y-axis representing the corresponding classification accuracy. The first 12 features selected by each algorithm are shown for convenience. Taking Fig. 5a as an example, the classification accuracies based on HFS_SU, HFS_LCC and HFS_DB with only the top-ranked feature as input were 92.11%, 92.11% and 85.59%, respectively. When the first three ranked features were used, accuracies of 97.19%, 99.77% and 97.03% were achieved. Compared with HFS_SU and HFS_DB, the features selected by HFS_LCC achieved the higher classification accuracy of 99.77%; in other words, HFS_LCC apparently selected the most representative features for this specific application. As shown in Fig. 5b, the highest classification accuracy for PCA was 99.38% with 5 features. In Fig. 5c, the classification accuracies based on HFS_SU, HFS_DB and HFS_LCC were higher than the results based on FRMV_KM; for FRMV_KM, the highest classification accuracy of 98.36% was achieved with 12 features. Fig. 5d compares SFFS and the three HFS methods: HFS_LCC selected the most representative features, and the accuracies reached by HFS were higher. For SFFS, the highest classification accuracy of 98.43% was achieved with 9 features. As shown in Fig. 5e, although the first 11 features selected by fosmod ultimately reached an accuracy of 99.14%, HFS not only obtained higher accuracy but also ranked the features with higher reliability. Compared with FFC, as shown in Fig. 5f, the features selected by HFS provided better classification accuracy with fewer features; for FFC, the highest accuracy of 98.43% was reached with 9 features.

Fig. 5a. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB.
Fig. 5b. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and the PCA-based method.
Fig. 5c. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FRMV_KM.
Fig. 5d. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and SFFS.
Fig. 5e. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and fosmod.
Fig. 5f. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FFC.
(All Fig. 5 panels plot the classification accuracy, validated by SOM, against the number of features according to the rankings.)

The performance improvement of the proposed model over FRMV_KM, SFFS, fosmod and FFC was mainly due to making use of every feature, combining the clustering solutions and evaluating independence, which overcomes the deficiencies of limited diversity among clustering solutions and of retaining mutually related features.

In order to illustrate the effect of feature-set redundancy on the classification performance and to demonstrate the robustness of HFS, more candidate features were involved in the second test. In total, 40 features were calculated, as follows: 10 statistics of the raw signal (var, rms, skewness, kurtosis, crest factor, 5th to 9th central moments); energies centered at 1xBPFO, 1xBPFI and 1xBSF for both the raw signal and the envelope; PMMs for both the raw signal and the envelope; 16 WPNs; and 6 IMF energy entropies.

The results in Fig. 6 show the classification accuracy of the second test. As shown in Fig. 6a, HFS_LCC reached the highest classification accuracy of 88.56% with the first 10 features, while HFS_DB and HFS_SU achieved their highest classification accuracies of 87.29% and 85.17% with the first 8 features and 11 features, respectively. Fig. 6b shows that, compared with the PCA-based feature selection method (highest accuracy 86.02%), HFS_LCC and HFS_DB achieved higher accuracy with the same number of features or fewer. In the comparison with FRMV_KM (highest accuracy 83.90%), shown in Fig. 6c, the HFS group showed apparently better classification accuracy with fewer features. As shown in Figs. 6d and 6e, the features selected by SFFS and fosmod resulted in accuracies of 84.75% and 85.17%, which were worse than that of the single feature selected by HFS_DB. Compared with FFC (as shown in Fig. 6f), HFS_LCC showed better performance, since an accuracy of 86.86% was reached by FFC with 6 selected features.

Fig. 6a. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB.
Fig. 6b. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and the PCA-based method.
Fig. 6c. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FRMV_KM.
Fig. 6d. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and SFFS.
Fig. 6e. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and fosmod.
Fig. 6f. Comparison results of classification accuracy of HFS_LCC, HFS_SU, HFS_DB and FFC.
(All Fig. 6 panels plot the classification accuracy, validated by SOM, against the number of features according to the rankings.)

From the results of the two tests, the conclusion can be drawn that the proposed HFS is robust and effective in selecting the most representative features, which maximizes the unsupervised classification performance. It should also be noted that, for both tests, FRMV_KM and HFS_SU shared the same evaluation criterion, but the decision provided by HFS_SU was always better than
that of FRMV_KM, which indicated that the proposed HFS scheme is superior to FRMV under the same evaluation criterion. Besides, it is worth noticing that, in both tests, the proposed HFS based on the three evaluation criteria, i.e. SU, LCC and the DB index, generated slightly different results. This suggested that the effectiveness of the features selected by the proposed HFS depends on the applied evaluation criterion, and LCC was considered more appropriate for both cases. Nonetheless, it is still appropriate to conclude that the overall performance based on the features selected by HFS was better than that of the other five methods.
4. Conclusion

This paper presented a hybrid unsupervised feature selection (HFS) approach for selecting the most representative features for unsupervised learning, and used two experimental bearing data sets to demonstrate the effectiveness of HFS. The performance of the HFS approach was compared with that of five other feature selection methods with respect to the accuracy improvement of the unsupervised learning algorithm SOM. The results showed that the proposed model could (a) identify the features that are relevant to the bearing defects, and (b) maximize the performance of unsupervised learning models with fewer features. Moreover, the results suggested that HFS relies on the evaluation criterion chosen for the particular application. Therefore, further research will focus on expanding HFS to broader applications and to online machinery defect diagnostics and prognostics.

Acknowledgement

The authors gratefully acknowledge the support of the 863 Program (No. 50821003), PR China, for this work.

References

Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 1–2, 245–271.
Dash, M., & Koot, P. W. (2009). Feature selection for clustering. In Encyclopedia of database systems (pp. 1119–1125).
Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1–4, 131–156.
Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 224–227.
Dietrich, C., Palm, G., & Schwenker, F. (2003). Decision templates for the classification of bioacoustic time series. Information Fusion, 2, 101–109.
Frigui, H. (2008). Clustering: Algorithms and applications. In 2008 1st international workshops on image processing theory, tools and applications, IPTA 2008. Sousse.
Ginart, A., Barlas, I., & Goldin, J. (2007). Automated feature selection for embeddable prognostic and health monitoring (PHM) architectures. In AUTOTESTCON (Proceedings), Anaheim, CA (pp. 195–201).
Greene, D., Cunningham, P., & Mayer, R. (2008). Unsupervised learning and clustering. Lecture Notes in Applied and Computational Mechanics, 51–90.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 1157–1182.
Hong, Y., Kwong, S., & Chang, Y. (2008a). Consensus unsupervised feature ranking from multiple views. Pattern Recognition Letters, 5, 595–602.
Hong, Y., Kwong, S., & Chang, Y. (2008b). Unsupervised feature selection using clustering ensembles and population based incremental learning algorithm. Pattern Recognition, 9, 2742–2756.
Huang, J., Cai, Y., & Xu, X. (2007a). A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recognition Letters, 13, 1825–1844.
Huang, R., Xi, L., & Li, X. (2007b). Residual life predictions for ball bearings based on self-organizing map and back propagation neural network methods. Mechanical Systems and Signal Processing, 1, 193–207.
Jack, L. B., & Nandi, A. K. (2002). Fault detection using support vector machines and artificial neural networks, augmented by genetic algorithms. Mechanical Systems and Signal Processing, 2–3, 373–390.
Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 4–37.
Jardine, A. K. S., Lin, D., & Banjevic, D. (2006). A review on machinery diagnostics and prognostics implementing condition-based maintenance. Mechanical Systems and Signal Processing, 7, 1483–1510.
Kwak, N., & Choi, C. H. (2002). Input feature selection for classification problems. IEEE Transactions on Neural Networks, 1, 143–159.
Lei, Y. G., He, Z. J., & Zi, Y. Y. (2008). A new approach to intelligent fault diagnosis of rotating machinery. Expert Systems with Applications, 4, 1593–1600.
Li, G., Hu, X., Shen, X., et al. (2008). A novel unsupervised feature selection method for bioinformatics data sets through feature clustering. In IEEE international conference on granular computing, GRC 2008, Hangzhou (pp. 41–47).
Li, Y., Dong, M., & Hua, J. (2008). Localized feature selection for clustering. Pattern Recognition Letters, 10–18.
Liao, L., & Lee, J. (2009). A novel method for machine performance degradation assessment based on fixed cycle features test. Journal of Sound and Vibration, 326, 894–908.
Liu, X., Ma, L., Zhang, S., & Mathew, J. (2006). Feature group optimisation for machinery fault diagnosis based on fuzzy measures. Australian Journal of Mechanical Engineering, 2, 191–197.
Liu, H., & Yu, L. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 4, 491–502.
Malhi, A., & Gao, R. X. (2004). PCA-based feature selection scheme for machine defect classification. IEEE Transactions on Instrumentation and Measurement, 6, 1517–1525.
Mitra, P., Murthy, C. A., & Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3, 301–312.
Oduntan, I. O., Toulouse, M., & Baumgartner, R. (2008). A multilevel tabu search algorithm for the feature selection problem in biomedical data. Computers & Mathematics with Applications, 5, 1019–1033.
Patil, M. S., Mathew, J., & RajendraKumar, P. K. (2008). Bearing signature analysis as a medium for fault detection: A review. Journal of Tribology, 1.
Peng, Z. K., & Chu, F. L. (2004). Application of the wavelet transform in machine condition monitoring and fault diagnostics: A review with bibliography. Mechanical Systems and Signal Processing, 2, 199–221.
Samanta, B., Al-Balushi, K. R., & Al-Araimi, S. A. (2003). Artificial neural networks and support vector machines with genetic algorithm for bearing fault detection. Engineering Applications of Artificial Intelligence, 7–8, 657–665.
Samanta, B., & Nataraj, C. (2009). Use of particle swarm optimization for machinery fault detection. Engineering Applications of Artificial Intelligence, 2, 308–316.
Shao, Y., & Nezu, K. (2000). Prognosis of remaining bearing life using neural networks. Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, 3, 217–230.
Sugumaran, V., & Ramachandran, K. I. (2007). Automatic rule learning using decision tree for fuzzy classifier in fault diagnosis of roller bearing. Mechanical Systems and Signal Processing, 5, 2237–2247.
Wei, H. L., & Billings, S. A. (2007). Feature subset selection and ranking for data dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 162–166.
Xu, Z., Xuan, J., Shi, T., & Wu, B. (2009). Application of a modified fuzzy ARTMAP with feature-weight learning for the fault diagnosis of bearing. Expert Systems with Applications, 6, 9961–9968.
Yen, G. G. (2000). Wavelet packet feature extraction for vibration monitoring. IEEE Transactions on Industrial Electronics, 3, 650–667.
Yu, Y., Yu, D., & Cheng, J. (2006). A roller bearing fault diagnosis method based on EMD energy entropy and ANN. Journal of Sound and Vibration, 1–2, 269–277.
Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 1205–1224.
Zhang, H., & Sun, G. (2002). Feature selection using tabu search method. Pattern Recognition, 35, 701–711.