COMPARATIVE EVALUATION OF VECTOR MACHINE BASED HYPERSPECTRAL CLASSIFICATION METHODS

Ali Can Karaca, Alp Ertürk, M. Kemal Güllü, Sarp Ertürk
Kocaeli University Laboratory of Image and Signal Processing (KULIS)
Electronics and Telecomm. Eng. Dept., Umuttepe Campus, Kocaeli, Turkey
{alican.karaca1, alp.erturk, kemalg, sertur}@kocaeli.edu.tr

ABSTRACT

This paper presents a comparison of the classification performance of three vector machine based classification methods, namely Import Vector Machines (IVM), Support Vector Machines (SVM) and Relevance Vector Machines (RVM), for hyperspectral images. The evaluation is carried out in terms of the number of vectors and the classification accuracies. Furthermore, novel to this paper, the Discriminative Random Field method with the graph cut algorithm is applied to the probabilistic output of IVM based hyperspectral classification, and it is shown that this approach significantly increases classification accuracies.

Index Terms— Import Vector Machines, Support Vector Machines, Relevance Vector Machines, Hyperspectral Classification

1. INTRODUCTION

The choice of classifier to be used in remote sensing is an important research topic. Vector machine based classification methods are used extensively in the literature because of the high classification accuracies they provide. Support Vector Machines (SVM), proposed in [1], is the most popular vector machine method and has frequently been used for regression and classification tasks on hyperspectral data sets. Although SVM is a robust, high-accuracy classifier, it has some disadvantages: SVM does not provide probabilistic outputs because it is a margin classifier, cross validation must be applied to select the margin parameter correctly, and the number of support vectors increases linearly as the training data ratio is increased. Relevance Vector Machines (RVM) was proposed in [2] and has been applied to hyperspectral images in [3]. It is reported in [3] that RVM provides classification accuracies close to SVM with a faster computational time. Import Vector Machines (IVM), proposed in [4], is a recent method for hyperspectral classification. Similar to RVM, IVM has a computational advantage and provides sparser solutions with respect to SVM, while providing higher accuracy than RVM.
To improve classification results, the spectral information of each pixel and the spatial information extracted from its neighborhood are commonly combined. In spectral-spatial classification, Random Field approaches are frequently used to combine these two terms. In [5], Markov Random Fields are applied to the output of probabilistic SVM. In [6], Conditional Random Fields are applied after a Gaussian process classifier, and the change in classification accuracy is evaluated for different numbers of neighbors. Both references show that using Random Fields as post-processing increases classification accuracies. Discriminative Random Fields (DRF) are used in this paper to model the spatial interactions between pixels. Using DRF, the probabilistic output of IVM for each pixel and its neighborhood relations are combined. If the probability of a pixel is low and its neighbors belong to the same class, the label of the neighbors is assigned to the current pixel. This operation is carried out by the graph cut algorithm. In this paper, the performances of SVM, IVM and RVM for hyperspectral image classification are evaluated and compared. In addition, DRF is applied after IVM as a post-processing method and evaluated in terms of accuracy and classification maps.

2. VECTOR MACHINE BASED CLASSIFICATION METHODS

In supervised classification, a set of training data $X = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ is given for $n$ labeled samples $x_i$ with corresponding class labels $y_i \in C = \{1, 2, \ldots, K\}$. It is desired to learn a model of the dependency of the classes on the training input vectors, so that accurate predictions of the class identity can be made for previously unseen values of $x$. SVM provides a good classification performance for two-class classification, in which case $y_i \in \{-1, 1\}$, by making predictions based on a function of the form
$f(x; w) = \sum_{i=1}^{n} w_i K(x, x_i) + b$    (1)
where $K(\cdot, \cdot)$ gives an inner product in the transformed space and is referred to as a kernel function, effectively
defining one basis function for each sample in the training set. The main idea of SVM is to maximize the margin between the class boundary and the training patterns that are implicitly defined in the feature space by the kernel $K$. The input vectors that are geometrically near the boundary are assigned non-zero weights in the classifier model and are called Support Vectors (SVs).

The main idea of RVM is based on a Bayesian approach, and therefore RVM provides probabilistic outputs. In two-class classification, where the class labels are assigned as $y_i \in \{0, 1\}$ and the Bernoulli distribution is used as the likelihood function, the likelihood can be written as

$p(y \mid w) = \prod_{i=1}^{n} \sigma\{f(x_i; w)\}^{y_i} \left[1 - \sigma\{f(x_i; w)\}\right]^{1 - y_i}$    (2)
where $\sigma$ is the logistic sigmoid function and $w$ is the weight vector. $p(w \mid \alpha)$ is the prior probability of the Bayesian model, which has a Gaussian distribution and is conditioned on the hyperparameter $\alpha$ that controls the weight values. The posterior over the weights can be obtained once the likelihood and the prior are known. After RVM training, only some of the $w_i$ values are non-zero, just as in SVM. The corresponding vectors are called Relevance Vectors (RVs).
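For illustration, the Bernoulli likelihood in (2) can be evaluated directly. The following is a minimal NumPy sketch on toy data; the kernel, weights and labels are hypothetical placeholders, not values from the paper:

```python
import numpy as np

def bernoulli_likelihood(K, w, b, y):
    """Evaluate the two-class likelihood of Eq. (2) for labels y in {0, 1}.

    K : (n, n) kernel matrix, w : (n,) weights, b : bias, y : (n,) labels.
    """
    f = K @ w + b                      # f(x_i; w) as in Eq. (1)
    sigma = 1.0 / (1.0 + np.exp(-f))   # logistic sigmoid
    return np.prod(sigma**y * (1.0 - sigma)**(1 - y))

# Toy example (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1))  # RBF kernel
w = rng.normal(size=5)
y = np.array([0, 1, 1, 0, 1])
print(bernoulli_likelihood(K, w, b=0.0, y=y))
```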
Similar to RVM and unlike SVM, IVM can directly be used in multi-class classification using the output probabilities of the classes. The general form of SVM consists of a loss term and a penalty term. On the other hand, Kernel Logistic Regression (KLR), which is the basic principle behind IVM, uses the negative log-likelihood (NLL) of the binomial distribution, i.e. $\ln(1 + e^{-y_i f(x_i)})$, instead of the hinge loss $[1 - y_i f(x_i)]_+$. Unlike SVM, the weights of KLR are all non-zero, so the computational cost must be reduced. Therefore, IVM selects the important points of KLR and adds them to a subset $S$ of the data set; these points are called import vectors (IVs). The KLR objective to be minimized can be written in the finite-dimensional form

$H = \frac{1}{n} \sum_{i=1}^{n} \ln\!\left(1 + e^{-y_i (K_s \alpha)_i}\right) + \frac{\lambda}{2} \alpha^T K_r \alpha$    (3)

where $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}^T$ are the parameters of the posterior probability density function $p$, $K_s$ is an $(n \times S)$-dimensional regressor matrix which can be calculated from the inner products of the input vectors, and $K_r$ is an $(S \times S)$-dimensional regularization matrix. The $\alpha$ parameters can be found using the Newton-Raphson iteration on the gradient of $H$, in the form

$\alpha^{(k)} = \left(\frac{1}{n} K_s^T W K_s + \lambda K_r\right)^{-1} K_s^T W z$    (4)

$z = \frac{1}{n} \left(K_s \alpha^{(k-1)} - W^{-1} (p - y)\right)$    (5)

where $p$ denotes the probabilities $p = 1 / (1 + \exp(\alpha^T K_s))$. By rewriting the multi-class probabilities, the multi-class extension of IVM can be written as

$P(y_{nk} \mid k_n^s; \alpha) = \frac{\exp(\alpha_k^T k_n^s)}{\sum_{c} \exp(\alpha_c^T k_n^s)}$    (6)

where $y_{nk}$ is the class label and $k_n^s$ is a kernel vector given by a column of $K_s$.
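To make the KLR optimization concrete, the sketch below implements the Newton-Raphson step for the objective in (3) in its equivalent gradient/Hessian form, on toy data and with the full kernel used as both $K_s$ and $K_r$, i.e. without the greedy import vector selection of IVM; all names and values are illustrative:

```python
import numpy as np

def fit_klr(K, y, lam, n_iter=20):
    """Penalized kernel logistic regression (Eq. (3)) via Newton-Raphson.

    K   : (n, n) kernel matrix (used here as both regressor K_s and regularizer K_r)
    y   : (n,) labels in {-1, +1}
    lam : regularization parameter lambda
    """
    n = len(y)
    t = (y + 1) / 2.0                  # labels mapped to {0, 1}
    alpha = np.zeros(n)
    for _ in range(n_iter):
        f = K @ alpha
        p = 1.0 / (1.0 + np.exp(-f))   # P(y = +1 | x)
        W = np.diag(p * (1.0 - p))
        grad = K.T @ (p - t) / n + lam * K @ alpha
        hess = K.T @ W @ K / n + lam * K
        alpha -= np.linalg.solve(hess, grad)
    return alpha

# Toy two-class problem (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(2, 1, (20, 4))])
y = np.hstack([-np.ones(20), np.ones(20)])
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :])**2).sum(-1))
alpha = fit_klr(K, y, lam=1e-3)
pred = np.sign(K @ alpha)
print("training accuracy:", (pred == y).mean())
```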
3. DISCRIMINATIVE RANDOM FIELDS AND GRAPH CUT ALGORITHM

The Discriminative Random Fields (DRF) approach has been proposed in [7] and has been used for land cover classification in [8]. The DRF definition includes the smoothness term, which imposes a penalizing cost between a set of neighboring points, and the data term, which is obtained from the IVM output and gives the class probabilities of the current point. DRF is formulated as

$P(y \mid X) = \frac{1}{Z} \exp\!\left[\sum_{n \in G} \log P(y_n \mid x_n) + \beta \sum_{n \in G} \sum_{b \in B_n} \delta(y_n, y_b)\right]$    (7)
where $\delta$ is the Kronecker delta function, $y_n$ is the class label and $x_n$ the input point at site $n$, $G$ is the set of data points, $B_n$ is the neighborhood of point $n$, and $\beta$ is the interaction parameter. If DRF is applied directly, it is necessary and critical to estimate the weighting parameter $\beta$. In order to avoid this process, the graph cut algorithm is combined with DRF. Note that energy minimization using graph cuts has been proposed in [9]. This algorithm minimizes the total energy

$E(f) = E_{data}(f) + E_{smooth}(f)$    (8)

The graph cut algorithm mentioned above has two sub-algorithms. The main idea of the $\gamma$-$\varepsilon$ swap algorithm is, for a pair of labels $\gamma$ and $\varepsilon$, to successively segment the $\gamma$ pixels from the $\varepsilon$ pixels with graph cuts; the algorithm changes the $\gamma$ and $\varepsilon$ values at each iteration. The main idea of the $\gamma$-expansion algorithm is, for a label $\gamma$, to segment the $\gamma$ and non-$\gamma$ pixels with graph cuts.

4. EXPERIMENTAL RESULTS

The radial basis function (RBF) kernel is used in all of the experiments conducted in this work. LIBSVM, which uses the one-against-one approach in multi-class cases, is used for SVM classification. The MATLAB implementation of RVM available on M. Tipping's website is used. RVM is similarly applied in the one-against-one approach to enable comparative evaluation. The classification accuracies are calculated for different training data ratios (TDRs); e.g., a TDR of 2% means that 2% of the data is used in the training phase.
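As a rough illustration of this setup, scikit-learn's SVC (which wraps LIBSVM and uses the one-against-one scheme for multi-class problems) can be used as follows; the data, the TDR split and the parameter values are placeholders, not the exact configuration used in the experiments:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data standing in for a flattened hyperspectral cube:
# each row is one pixel's spectrum, labels are the ground-truth classes.
rng = np.random.default_rng(0)
pixels = rng.normal(size=(3000, 200))          # e.g. 200 spectral bands
labels = rng.integers(0, 9, size=3000)         # e.g. 9 classes

tdr = 0.02                                     # training data ratio of 2%
X_train, X_test, y_train, y_test = train_test_split(
    pixels, labels, train_size=tdr, stratify=labels, random_state=0)

# RBF-kernel SVM; probability=True yields per-pixel class probabilities
# (the paper feeds IVM probabilities to DRF-GC; SVC is shown only as illustration).
clf = SVC(kernel="rbf", C=100.0, gamma="scale", probability=True,
          decision_function_shape="ovo")
clf.fit(X_train, y_train)

print("overall accuracy:", clf.score(X_test, y_test))
print("number of support vectors per class:", clf.n_support_)
proba = clf.predict_proba(X_test)              # per-pixel class probabilities
```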
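Similarly, the DRF-GC post-processing of Section 3 combines per-pixel class probabilities with a spatial smoothness prior and minimizes the resulting energy (8) with graph cuts [9]. The sketch below builds the same energy on a toy probability map, but uses a few sweeps of iterated conditional modes (ICM) as a simple stand-in for the graph-cut minimizer; it is not the authors' implementation:

```python
import numpy as np

def drf_smooth(prob, beta=1.0, n_sweeps=5):
    """Smooth a per-pixel class-probability map using the DRF energy of Eq. (7)/(8).

    prob : (H, W, K) array of class probabilities (e.g. IVM outputs).
    beta : interaction (smoothness) weight.
    Uses ICM over 4-neighborhoods as a simple approximate minimizer
    (the paper uses graph cuts [9] instead).
    """
    H, W, K = prob.shape
    data_cost = -np.log(prob + 1e-12)          # data term: -log P(y_n | x_n)
    labels = prob.argmax(axis=2)               # initial labels from the classifier
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                cost = data_cost[i, j].copy()
                # smoothness term: pay beta for each disagreeing 4-neighbor
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        cost += beta * (np.arange(K) != labels[ni, nj])
                labels[i, j] = cost.argmin()
    return labels

# Toy probability map (illustrative only): two noisy regions.
rng = np.random.default_rng(0)
prob = np.full((40, 40, 2), 0.5) + rng.normal(0, 0.2, (40, 40, 2))
prob[:, :20, 0] += 0.3
prob[:, 20:, 1] += 0.3
prob = np.clip(prob, 1e-3, None)
prob /= prob.sum(axis=2, keepdims=True)
smoothed = drf_smooth(prob, beta=2.0)
print("label counts after smoothing:", np.bincount(smoothed.ravel()))
```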
4.1. Experiment 1

4.1.1. Data Set Description

Experimental results are provided for the Indian Pine hyperspectral image, which was acquired over the Indian Pine test site in northwest Indiana in June 1992. The data set has 220 spectral bands and 145 x 145 pixels. The number of bands is reduced to 200 by removing noisy bands. The data set originally contains 16 classes, but classes with a small number of pixels are removed, leaving 9 classes.
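A per-class random split driven by the TDR can be sketched as follows; the ground-truth array gt and the class-removal threshold are hypothetical placeholders for how the retained classes might be selected:

```python
import numpy as np

def tdr_split(gt, tdr, min_pixels=100, rng=None):
    """Pick training pixel indices per class, proportional to the TDR.

    gt         : 1-D array of ground-truth labels for all pixels
                 (0 is treated as unlabeled and ignored).
    tdr        : training data ratio, e.g. 0.05 for 5%.
    min_pixels : classes with fewer labeled pixels than this are discarded,
                 mimicking the removal of small classes (threshold is illustrative).
    """
    rng = np.random.default_rng(rng)
    train_idx = []
    for c in np.unique(gt):
        if c == 0:
            continue
        idx = np.flatnonzero(gt == c)
        if len(idx) < min_pixels:
            continue                      # drop small classes
        n_train = max(1, int(round(tdr * len(idx))))
        train_idx.append(rng.choice(idx, size=n_train, replace=False))
    return np.concatenate(train_idx)

# Toy ground truth with 3 classes (illustrative only).
gt = np.repeat([1, 2, 3], [500, 300, 50])
train = tdr_split(gt, tdr=0.05)
print("training pixels per class:", np.bincount(gt[train]))
```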
4.1.2. Experimental Setup

The SVM, RVM and IVM parameters are determined by five-fold cross validation. The kernel parameter is searched between 0.1 and 2.1 in steps of 0.1, and the margin parameter of SVM is searched between 10 and 1000 with a step size of 10. Greedy selection is used to determine the $\lambda$ parameter of IVM, searching from exp(0) to exp(-20) by dividing by exp(1) at each step.

4.1.3. Experimental Results

Classification results for IVM, SVM and RVM, and for DRF Graph Cut (DRF-GC) applied to the IVM probabilistic outputs, are shown in Table I for different TDRs. It can be observed that the classification accuracy of IVM is quite similar to that of SVM, but the accuracy of RVM is lower by about 3%. The accuracy difference between RVM and SVM is similar to the results reported in [3]. It can also be observed that DRF-GC significantly increases the classification accuracy of IVM, by about 5%.
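The cross-validation described above can be reproduced approximately with scikit-learn's GridSearchCV; note that the paper's kernel parameter is presumably the RBF width sigma, which corresponds to gamma = 1/(2*sigma^2) in scikit-learn, and the data below is a placeholder:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder training data (e.g. the TDR-selected pixels and their labels).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 200))
y_train = rng.integers(0, 9, size=300)

sigmas = np.arange(0.1, 2.1 + 1e-9, 0.1)       # kernel parameter: 0.1 ... 2.1, step 0.1
Cs = np.arange(10, 1001, 10)                   # margin parameter: 10 ... 1000, step 10
param_grid = {"gamma": 1.0 / (2.0 * sigmas**2), "C": Cs}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # five-fold CV
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
```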
TABLE I
CLASSIFICATION ACCURACIES IN INDIAN PINE

TDR     IVM     SVM     RVM     IVM-DRF-GC
2%      75.92   76.92   67.70   83.72
5%      83.85   83.63   80.21   88.70
10%     88.13   88.03   84.42   94.76
20%     90.07   91.82   87.76   95.56
At a TDR of 20%, the accuracy differences between SVM, RVM and IVM can be clearly observed. Table II shows the classification accuracies for each class for a TDR of 5%. It is again observed that DRF-GC provides significantly increased classification accuracies.

TABLE II
CLASS BASED CLASSIFICATION ACCURACIES IN INDIAN PINE DATA SET WHEN TDR = 5%

Class Names        IVM     RVM     SVM     IVM-DRF-GC
Corn-no till       77.46   77.68   78.49   78.48
Corn-min till      68.06   61.74   64.65   64.01
Grass-pasture      88.14   89.19   93.43   94.49
Grass-trees        96.76   90.69   92.24   97.88
Hay-windrowed      98.28   83.41   96.34   99.35
Soybean-no till    74.32   68.44   71.27   80.52
Soybean-min till   82.00   79.56   84.17   92.06
Soybean-clean      77.19   71.36   76.33   98.28
Woods              99.35   98.45   99.76   99.51
4.2. Experiment 2

4.2.1. Data Set Description

The second data set is the Pavia University data set, which has 103 bands, 610 x 340 pixels and 9 classes. Training data is selected randomly for each class, proportional to the training data ratio.

4.2.2. Experimental Setup

As in Experiment 1, the RBF kernel is used and the RBF parameter is selected by five-fold cross validation for each vector machine. The RBF kernel parameter of IVM is obtained as 0.45, and greedy selection is applied for $\lambda$ from exp(-1) to exp(-15) with a step size of exp(1). For RVM, the kernel parameter is found to be between 0.6 and 0.8 for the different TDRs.

4.2.3. Experimental Results

Classification results for the Pavia University data are presented in Table III. These results are the means of five runs performed with randomly selected training data in each case, for each TDR. It is seen that SVM provides the highest accuracies overall.
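The greedy search over $\lambda$ in 4.2.2 is not fully specified; the sketch below simply scans the stated grid exp(-1), ..., exp(-15) with five-fold cross validation, using logistic regression on kernel columns as a rough stand-in for kernel logistic regression (all data and names are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import cross_val_score

# Placeholder training pixels (e.g. 103-band spectra) and labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 103))
y_train = rng.integers(0, 9, size=200)

sigma = 0.45                                    # RBF parameter reported for IVM
K = rbf_kernel(X_train, gamma=1.0 / (2.0 * sigma**2))

best_lam, best_score = None, -np.inf
for lam in np.exp(-np.arange(1.0, 16.0)):       # exp(-1), exp(-2), ..., exp(-15)
    # KLR approximated by logistic regression on the kernel columns;
    # C is the inverse of the regularization strength n * lambda.
    clf = LogisticRegression(C=1.0 / (len(y_train) * lam), max_iter=2000)
    score = cross_val_score(clf, K, y_train, cv=5).mean()
    if score > best_score:
        best_lam, best_score = lam, score
print("selected lambda:", best_lam, "CV accuracy:", best_score)
```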
Fig. 1. Number of Vectors (NoV) in Indian Pine data set, for IVM, SVM and RVM at TDRs of 2%, 5%, 10% and 20%.
Figure 1 shows that the numbers of IVs and RVs are lower than the number of SVs, i.e., IVM and RVM provide sparser solutions.
Fig. 2. Number of Vectors in Pavia University data set
IVM's accuracies are close to those of SVM, with the maximum accuracy difference being 0.46%, which is not significant. For some TDRs, the accuracy difference between SVM and RVM can be significant. After the DRF-GC algorithm is applied, the accuracies increase by a maximum of 7%.

TABLE III
CLASSIFICATION ACCURACIES IN PAVIA
TDR     IVM     SVM     RVM     IVM-DRF-GC
2%      91.10   91.56   89.93   98.71
5%      93.15   93.37   92.51   99.47
10%     94.17   94.36   93.32   99.52
20%     94.87   94.91   94.56   99.65
The numbers of import, relevance and support vectors are given in Figure 2. It shows that when the TDR is increased, the number of support vectors increases linearly, whereas the numbers of relevance and import vectors do not. For a TDR of 20%, the SVs/IVs ratio is equal to 3.6 and the SVs/RVs ratio is equal to 5.3. These ratios show that RVM and IVM provide sparser solutions with respect to SVM. Classification of the Pavia University data set is performed with randomly selected training data, and IVM is used with a TDR of 2%. The classification maps of the IVM and IVM-DRF-GC outputs for the Pavia University data set, for a TDR of 2%, are provided in Figure 3.a and Figure 3.b, respectively. In this case, the classification error decreases by about 7.6% with DRF-GC post-processing.

(a) IVM result
(b) IVM-DRF-GC result
Fig. 3. Classification maps of Pavia data set at 2% TDR

5. CONCLUSION

In this paper, the classification performances of vector machines are evaluated for hyperspectral images. The variation of the number of vectors with respect to the TDR is analyzed and the classification results are interpreted. Furthermore, it is shown in this paper that by using DRF-GC together with the probabilistic outputs of IVM and the neighborhood relations, the classification accuracy can be increased by about 5% for the Indian Pine data set and nearly 6% for the Pavia data set. From the IVM-DRF-GC and IVM classification maps it is observed that DRF has corrected most of the falsely labeled pixels. Evaluating the classification accuracy, computational time and sparseness, as well as the direct applicability to multi-class classification, IVM stands out. DRF is proposed as a post-processing method, applied after IVM, to increase classification accuracy significantly.

6. ACKNOWLEDGEMENT

This work was supported by the Turkish Ministry of Development under project number 2011K120330. The authors would like to thank R. Roscher for providing the Import Vector Machine classifier C++ and MATLAB codes.

7. REFERENCES

[1] B. E. Boser, I. M. Guyon, and V. Vapnik, "A training algorithm for optimal margin classifiers," Proc. 5th Annu. ACM Workshop Comput. Learn. Theory, pp. 144–152, 1992.
[2] M. E. Tipping, "The relevance vector machine," Advances in Neural Information Processing Systems, vol. 12, S. A. Solla, T. K. Leen, and K.-R. Müller, Eds. Cambridge, MA: MIT Press, 2000.
[3] B. Demir and S. Ertürk, "Hyperspectral image classification using relevance vector machines," IEEE Geoscience and Remote Sensing Letters, vol. 4, pp. 586–590, 2007.
[4] J. Zhu and T. Hastie, "Kernel logistic regression and the import vector machine," J. Comput. Graph. Statist., vol. 14, no. 1, pp. 185–205, 2005.
[5] Y. Tarabalka, M. Fauvel, J. Chanussot, and J. A. Benediktsson, "SVM and MRF-based method for accurate classification of hyperspectral images," IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 4, pp. 736–740, Oct. 2010.
[6] F. Yao, Y. Qian, Z. Hu, and J. Li, "A novel hyperspectral remote sensing images classification using Gaussian processes with conditional random fields," Intelligent Systems and Knowledge Engineering (ISKE), pp. 197–202, Nov. 2010.
[7] S. Kumar and M. Hebert, "Discriminative random fields," IJCV, vol. 68, no. 2, pp. 179–201, 2006.
[8] R. Roscher, B. Waske, and W. Förstner, "Kernel discriminative random fields for land cover classification," 6th IAPR Workshop on Pattern Recognition in Remote Sensing, 2010.
[9] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," Seventh International Conference on Computer Vision (ICCV'99), pp. 377–384, 1999.