SEMI-SUPERVISED REMOTE SENSING IMAGE CLASSIFICATION BASED ON CLUSTERING AND THE MEAN MAP KERNEL

L. Gómez-Chova†, L. Bruzzone‡, G. Camps-Valls†, and J. Calpe-Maravilla†

† Dept. of Electronics Engineering, University of Valencia, 46100 Valencia, Spain.
‡ Dept. of Information Engineering and Computer Science, University of Trento, 38050 Trento, Italy.
[email protected]

ABSTRACT

This paper presents a semi-supervised support vector machine (SVM) method based on the combination of the expectation-maximization (EM) algorithm for Gaussian mixture models (GMM) and the mean map kernel. The proposed method uses the most reliable samples, in terms of maximum likelihood, to compute a kernel function that accurately reflects the similarity between clusters in the kernel space. The proposed method improves classification accuracy in situations where the available labeled information does not properly describe the classes in the test image.

1. INTRODUCTION

In many remote sensing image classification problems, it is difficult to collect a sufficient number of statistically significant ground-truth samples to define a complete training set for developing robust supervised classifiers. In this setting, extrapolating to unseen scenes is a challenging problem, since no training samples from the test image are available. In such situations, labeled data extracted from other images modeling similar problems can be used. However, this situation is not well posed, since the training samples might not properly describe the test data, which is known as the sample selection bias problem. In these problems, unlabeled samples from the test image can be used jointly with the available training samples to increase the reliability and accuracy of the learning.

Supervised support vector machines (SVMs) [1, 2] excel in using the labeled information for classification, being regularized maximum margin classifiers with an appropriate loss function [3]. These methods, nevertheless, ignore the potentially useful wealth of unlabeled data, and need to be reformulated to exploit it. Semi-supervised learning (SSL) is concerned with such situations [4, 5], and several approaches have been proposed in the context of remotely sensed image classification [6, 7, 8]. Essentially, two different classes of SSL algorithms are encountered in the literature: generative and discriminative models. This work is related to generative models, which estimate the conditional distribution by modeling the class-conditional distributions explicitly. The distribution of remote sensing data over natural land covers is usually smooth, and the spectra of pixels of the same land cover are similar, thus fulfilling the cluster assumption, by which samples in the same cluster are likely to belong to the same class. In this paper, we assume that data is organized into a number of groups or clusters according to a distance measure in some representation space. The key point is the definition of a kernel function that accurately reflects the similarity between clusters [9].

This paper has been partially supported by the Spanish Ministry for Education and Science under projects MEC/HI2005-0228, DATASAT ESP2005-07724-C05-03, and CONSOLIDER/CSD2007-00018.

This paper presents a semi-supervised version of the SVM method. The proposed method is based on two assumptions: 1) classes in the test image(s) present features inducing compact clusters, and 2) unlabeled data can help to model these clusters. The method combines the expectation-maximization (EM) algorithm for fitting Gaussian mixture models (GMM) [10], which has been extensively used in remote sensing image classification [11, 12, 13], and the mean map kernel, which computes the similarity between clusters in the kernel space. This alleviates the sample selection bias problem and produces improved classification accuracy by reinforcing that samples in the same cluster belong to the same class. For illustration purposes, we show results on the complex problem of detecting clouds from multispectral satellite images.

The paper is organized as follows. Section 2 reviews the notation and tools needed for developing the semi-supervised algorithms. Section 3 presents the proposed kernel method that combines clustering and the mean map kernel. Section 4 is devoted to analyzing the results of the proposed method in situations where the model assumptions are verified or broken. Section 5 concludes the paper.

2. NOTATION AND PRELIMINARIES

In supervised classification problems, we are given a set of $\ell$ labeled (training) samples $\{x_i, y_i\}_{i=1}^{\ell}$, where $x_i \in \mathbb{R}^d$ is defined in an input space $\mathcal{X}$, and $y_i \in \mathbb{N}$ belongs to the observation (output) space. SVMs attempt to separate samples belonging to different classes by tracing a maximum margin hyperplane in a high dimensional (possibly infinite) space $\mathcal{H}$, to which samples are mapped through a nonlinear mapping function $\phi$ [1, 2]. One can work in this feature space without even knowing the explicit form of the mapping $\phi$, but only the kernel function formed by the dot product between mapped samples:

$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle. \qquad (1)$$

In any kernel method, a proper definition of the structural form of $K$ reflecting signal relations is the crucial step to attain good results.

2.1. The Mean Map

Given a finite subset of training samples $S = \{x_i\}_{i=1}^{n}$ lying in the input space $\mathcal{X}$, let us now define $\Phi(S) = \{\phi(x_1), \ldots, \phi(x_n)\}$ as the image of $S$ under the map $\phi$. Hence, $\Phi(S)$ is a subset of the inner product space $\mathcal{H}$. In particular, the centre of mass of the set $S$ in the kernel space is the vector:

$$\phi_\mu(S) = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i), \qquad (2)$$

where $\phi_\mu(\cdot)$ denotes the mean map. We should stress that, in principle, there may not be an explicit vector representation of the centre of mass, since, in this case, there may also not exist a point in the input space $\mathcal{X}$ whose image under $\phi$ is $\phi_\mu(S)$. However, computing the mean in a richer high dimensional feature space can bring additional advantages.

2.2. Cluster Similarity and the Mean Map Kernel

Despite the apparent simplicity of the mean map function, significant information about the embedded data set $\Phi(S)$ can be obtained. Let us consider two finite subsets of samples $S_1 = \{a_1, \ldots, a_m\}$ and $S_2 = \{b_1, \ldots, b_n\}$ belonging to two different clusters $\omega_1$ and $\omega_2$, respectively. We are interested in defining a cluster similarity function that indicates the proximity between them. A straightforward kernel function reflecting the similarity between clusters is obtained by evaluating the kernel function between the means of the clusters in the input space $\mathcal{X}$:

$$K_\mu^{\mathcal{X}}(S_1, S_2) \equiv \langle \phi(\mu_1), \phi(\mu_2) \rangle = K(\mu_1, \mu_2). \qquad (3)$$

Unfortunately, by doing this, we lose the advantage of working with the samples in the kernel space $\mathcal{H}$ implicitly. The centres of mass of the sets $S_1$ and $S_2$ in the kernel space are the vectors $\phi_\mu(S_1) = \frac{1}{m}\sum_{i=1}^{m}\phi(a_i)$ and $\phi_\mu(S_2) = \frac{1}{n}\sum_{i=1}^{n}\phi(b_i)$. Despite the apparent inaccessibility of the points $\phi_\mu(S_1)$ and $\phi_\mu(S_2)$ in the kernel space $\mathcal{H}$, we can compute the cluster similarity in $\mathcal{H}$ using only evaluations of the sample similarity contained in the kernel matrix:

$$K_\mu^{\mathcal{H}}(S_1, S_2) \equiv \langle \phi_\mu(S_1), \phi_\mu(S_2) \rangle = \left\langle \frac{1}{m}\sum_{i=1}^{m}\phi(a_i), \frac{1}{n}\sum_{j=1}^{n}\phi(b_j) \right\rangle = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} K(a_i, b_j). \qquad (4)$$
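As a concrete illustration, the following is a minimal NumPy sketch of Eqs. (3) and (4), assuming an RBF base kernel; the function names and the choice of base kernel are illustrative and not part of the original formulation.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Base kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)) between rows of A and B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def input_space_kernel(S1, S2, sigma=1.0):
    # Eq. (3): kernel evaluated between the cluster means in the input space X.
    return rbf_kernel(S1.mean(0, keepdims=True), S2.mean(0, keepdims=True), sigma)[0, 0]

def mean_map_kernel(S1, S2, sigma=1.0):
    # Eq. (4): cluster similarity in H is the average of all pairwise sample similarities.
    return rbf_kernel(S1, S2, sigma).mean()
```

Note that Eq. (3) requires a single kernel evaluation per cluster pair, whereas Eq. (4) averages over all m·n pairwise evaluations between the two clusters.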

The concept of the mean map kernel has been recently extended to compare data distributions in the kernel space [14].

3. SEMI-SUPERVISED MEAN MAP KERNEL

This section describes the developed semi-supervised kernel-based method. SSL methods assume access to a set of unlabeled (test) samples and learn from both labeled and unlabeled samples. To fix notation, we are given a set of $n$ samples, of which $\ell$ are labeled samples, $\{x_i, y_i\}_{i=1}^{\ell}$, and $u$ are unlabeled samples, $\{x_i\}_{i=\ell+1}^{\ell+u}$. In order to obtain cluster information, we first apply a clustering algorithm, which provides for each sample $x_i$ a crisp or soft association, $h_{ik}$, to each cluster $\omega_k$, $k = 1, \ldots, c$.

3.1. Image Clustering

Throughout this work, we consider the dataset as a mixture of normal distributions, so the EM algorithm can be used to obtain the maximum likelihood estimate of the probability density function (pdf) of the Gaussian mixture. The EM algorithm estimates the mixture coefficient $\pi_k$, the mean $\mu_k$, and the covariance matrix $\Sigma_k$ of each component of the mixture. Then, the algorithm assigns each sample to the cluster with the maximum a posteriori probability (MAP), and the cluster membership $h_{ik}$ represents the estimate of the posterior probability; that is, memberships take values in $[0, 1]$ and sum to one, $\sum_k h_{ik} = 1$. Hence, the optimal cluster label for each sample is found as $h_i = \arg\max_k \{h_{ik}\}$, i.e., $h_i = k$ if sample $x_i$ is assigned to cluster $\omega_k$.
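For illustration, the memberships $h_{ik}$ could be obtained, for example, with scikit-learn's EM implementation of Gaussian mixtures; this is only a sketch under assumed data shapes, not necessarily the implementation used in the experiments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data: 1000 pixels described by 6 features (as in the MERIS experiments).
X_img = np.random.rand(1000, 6)
c = 5  # assumed number of clusters

# EM fit of the GMM: estimates pi_k, mu_k and Sigma_k for each component.
gmm = GaussianMixture(n_components=c, covariance_type='full', max_iter=200, random_state=0)
gmm.fit(X_img)

H = gmm.predict_proba(X_img)   # soft memberships h_ik: each row sums to one
labels = H.argmax(axis=1)      # crisp MAP assignment: h_i = argmax_k h_ik
```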

Table 1. Particular cases of the proposed method depending on the space used to compute cluster similarities (input space $\mathcal{X}$ or kernel space $\mathcal{H}$), and on how unlabeled samples contribute to each cluster (crisp, $K_\mu$, or soft, $K_{\mu_s}$, association).

Method              | Kernel                      | Mapping             | Eq.
$\mu$-SVM: SVM      | $K$                         | $\phi(x)$           | (1)
$\mu$-SVM in X      | $K_\mu^{\mathcal{X}}$       | $\phi(\mu)$         | (3)
$\mu$-SVM in H      | $K_\mu^{\mathcal{H}}$       | $\phi_\mu(S)$       | (4)
$\mu_s$-SVM in H    | $K_{\mu_s}^{\mathcal{H}}$   | $\phi_{\mu_s}(S)$   | (6)

The ease of use and the fast classification performance justify the selection of this algorithm, even though other clustering algorithms could equally be included in our method. Once the clustering is done, one should compute the cluster similarity either in the input or in the kernel space.

3.2. Sample Selection Bias and the Soft Mean Map

So far we have assumed that training and test data are independently and identically distributed (i.i.d.), drawn from the same pdf, but in practice the training and test set distributions may not match, which is known in the literature as sample selection bias [15] or covariate shift [16]. Obviously, if the training and the test data have nothing in common, there is no chance to learn anything. Thus, we assume that both follow the same conditional distribution $p(y|x)$, but that the input distributions $p(x)$ differ, yet not completely. We propose to account for the relative reliability of the training samples by weighting the contribution of each sample $x_i$ to the definition of the centre of mass of each cluster in the kernel space $\mathcal{H}$ with the EM-estimated posterior probabilities $h_{ik}$, that is:

$$\phi_{\mu_s}(S_k) = \frac{\sum_i h_{ik}\,\phi(x_i)}{\sum_i h_{ik}}, \qquad (5)$$

which we call the soft mean map. This mapping weights the most reliable samples, in terms of maximum likelihood in the test image, to compute the cluster centres in the kernel space. The corresponding kernel can be easily computed as:

$$K_{\mu_s}^{\mathcal{H}}(S_k, S_l) = \langle \phi_{\mu_s}(S_k), \phi_{\mu_s}(S_l) \rangle = \left\langle \frac{\sum_i h_{ik}\,\phi(x_i)}{\sum_i h_{ik}}, \frac{\sum_j h_{jl}\,\phi(x_j)}{\sum_j h_{jl}} \right\rangle = \frac{\sum_i \sum_j h_{ik}\,h_{jl}\,K(x_i, x_j)}{\sum_i h_{ik} \sum_j h_{jl}}, \qquad (6)$$

and now, when computing cluster similarities, all samples contribute to all clusters, but with different relative weights. Note that the mean map kernel in (4) is a particular case of the proposed soft mean map kernel in (6) when the training samples are associated with only one cluster (crisp association), i.e., when $h_{ik} = 1$ if $x_i$ belongs to cluster $\omega_k$ and $h_{ik} = 0$ otherwise. Table 1 summarizes the relationship between the proposed methods.
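As an illustration of Eq. (6), the following sketch computes the full $c \times c$ matrix of soft cluster similarities from a precomputed base kernel matrix and the EM memberships; the variable names are assumptions made for the example.

```python
import numpy as np

def soft_mean_map_kernel(K, H):
    # K: (n, n) base kernel matrix over all samples (labeled + unlabeled).
    # H: (n, c) posterior memberships h_ik from the EM algorithm (rows sum to one).
    # Returns the (c, c) matrix of Eq. (6): K_mus(S_k, S_l).
    W = H / H.sum(axis=0, keepdims=True)   # column k holds h_ik / sum_i h_ik
    return W.T @ K @ W                     # sum_ij h_ik h_jl K(x_i, x_j) / (sum_i h_ik sum_j h_jl)
```

With a crisp one-hot membership matrix H, this reduces to the mean map kernel of Eq. (4) computed between every pair of clusters.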

3.3. Remarks

The proposed method brings together the ideas of unsupervised clustering, SVM, and the mean map kernel in a simple and natural way. The method exploits the information contained in the unlabeled samples to improve performance in challenging scenarios. Essentially, the method tries: 1) to reinforce global consistency, and 2) to mitigate the sample selection bias. The final classification model is obtained by solving a standard SVM with a kernel learned from the similarities between clusters for the labeled training samples, as sketched below. Note that clustering the whole image allows us to take advantage of the wealth of information and the high degree of spatial and spectral correlation of the image pixels. Although the EM algorithm is applied to the entire image, the number of unlabeled samples $u$ used to describe the clusters can be selected by the user to control the computational cost.
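The following minimal sketch, building on the hypothetical soft_mean_map_kernel helper above, shows one plausible way to assemble such a model with a precomputed-kernel SVM: every labeled sample is compared with every other through the similarity of the clusters they belong to. The exact assembly used in the paper may differ.

```python
import numpy as np
from sklearn.svm import SVC

def train_mu_svm(K_clusters, cluster_idx_train, y_train, C=10.0):
    # K_clusters:        (c, c) cluster-similarity matrix, e.g. from soft_mean_map_kernel.
    # cluster_idx_train: (l,)   MAP cluster index of each labeled training sample.
    # y_train:           (l,)   class labels of the training samples.
    K_train = K_clusters[np.ix_(cluster_idx_train, cluster_idx_train)]
    clf = SVC(kernel='precomputed', C=C)
    clf.fit(K_train, y_train)
    return clf

# Example usage (names are illustrative):
# clf = train_mu_svm(K_clusters, cluster_idx_train, y_train)
# K_test = K_clusters[np.ix_(cluster_idx_test, cluster_idx_train)]
# y_pred = clf.predict(K_test)   # each test pixel inherits its cluster's similarities
```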

4. EXPERIMENTAL RESULTS

This section presents the experimental results. For illustration purposes, we focus on the challenging covariate-shift scenario of cloud identification from multispectral images.

Fig. 1. Average cloud classification results (kappa statistic, κ, versus number of labeled samples) for the 5 MERIS sample images, training the models with labeled samples from the other 4 images and 800 unlabeled samples from the image to be classified. Compared methods: ‘SVM on $x_i$’ ($K$), ‘SVM on $\mu_k$’ ($K$), ‘$\mu$-SVM’ ($K_\mu^{\mathcal{X}}$), ‘$\mu$-SVM’ ($K_\mu^{\mathcal{H}}$), and ‘$\mu$-SVM’ ($K_{\mu_s}^{\mathcal{H}}$).

4.1. Data Collection

Cloud screening constitutes a clear candidate for the proposed SSL method, since very few labeled cloud pixels are typically available, and cloud features change to a great extent depending on the cloud type, thickness, transparency, height, or background. We used data from the MEdium Resolution Imaging Spectrometer (MERIS) instrument on board the ESA ENVIronmental SATellite (ENVISAT). We used as inputs 6 physically-inspired features extracted from the 15 MERIS multispectral bands [13]. Experiments were carried out using five MERIS images taken over Spain, Tunisia, Finland, and France. In order to test the robustness of the algorithm to differences in the training and test distributions (sample selection bias), results were obtained following an image-fold cross-validation strategy, that is, each test image was classified with a model built with labeled samples from the other images and unlabeled samples coming from the same test image.

4.2. Experimental Setup and Model Development

We generated training sets consisting of $\ell = 400$ labeled samples (200 samples per class), and added $u = 800$ unlabeled samples, randomly selected from the analyzed test data, for the training of the SSL methods. We varied the rate of labeled samples, i.e., {2, 4, 7, 14, 27, 52, 100}% of the labeled samples of the training set were used to train the models in each experiment. In order to avoid skewed conclusions, for each value of $\ell$, the experiments were run for ten realizations using randomly selected training samples. Then, the kappa statistic κ was estimated in the classification of 5000 independent samples.

The proposed $\mu$-SVM classifiers are benchmarked against: 1) the standard SVM applied to the image pixels (denoted by ‘SVM on $x_i$’), which is used as a reference for supervised methods; and 2) the standard SVM applied to each cluster centre in order to obtain its class label, which is then propagated to all the samples belonging to that cluster (denoted by ‘SVM on $\mu_k$’), which is the standard approach in unsupervised classification problems. For all the experiments we used the radial basis function (RBF) kernel, $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$, where $\sigma \in \mathbb{R}^+$ is the kernel width. The free parameters ($C$ and $\sigma$ of the SVM) were selected through 10-fold cross-validation on the training set.
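As a sketch of this model selection step, assuming the hypothetical helpers defined earlier (rbf_kernel, soft_mean_map_kernel) and the cluster indices of the labeled samples, the free parameters could be chosen roughly as follows; note that with a cluster-similarity kernel the base kernel must be recomputed for every candidate σ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def select_parameters(X_all, H, cluster_idx_train, y_train,
                      sigmas=(0.1, 0.5, 1.0, 2.0), Cs=(1.0, 10.0, 100.0), n_folds=10):
    # Grid search over (sigma, C) by k-fold cross-validation on the labeled samples,
    # using the soft mean map kernel of Eq. (6) as a precomputed SVM kernel.
    best = (None, None, -np.inf)
    for sigma in sigmas:
        K_all = rbf_kernel(X_all, X_all, sigma)        # base kernel over all samples
        K_clusters = soft_mean_map_kernel(K_all, H)    # (c, c) cluster similarities
        K_train = K_clusters[np.ix_(cluster_idx_train, cluster_idx_train)]
        for C in Cs:
            scores = []
            for tr, va in StratifiedKFold(n_splits=n_folds).split(K_train, y_train):
                clf = SVC(kernel='precomputed', C=C)
                clf.fit(K_train[np.ix_(tr, tr)], y_train[tr])
                scores.append(clf.score(K_train[np.ix_(va, tr)], y_train[va]))
            if np.mean(scores) > best[2]:
                best = (sigma, C, np.mean(scores))
    return best  # (sigma, C, cross-validated accuracy)
```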

Figure 1 shows the average κ for all the methods. All methods provide poor results in ill-posed situations (low number of labeled samples). The standard SVM yields the worst classification results since it is critically affected by the sample selection bias problem. This problem is mitigated when using the standard SVM to directly classify centroids, as they are typically far away from the classification boundaries (even if they are biased). However, the main impediment is still the SVM classifier used to label the cluster centers, which is optimized to classify the training samples that might not represent the class distribution in test. The µ-SVM classifiers KµX , KµH , and KµHs give the best results with enough labeled samples, but with few labeled samples a whole cluster can be misclassified. It is worth to note that the proposed method is not equivalent to a simple segmentation of the image by classifying the centers of the clusters, µk . Among these three classifiers, KµH produce worse results. This can be explained since the selection of kernel free parameters is based on the classification accuracy in the training set, thus an inappropriate training set produces an inappropriate mapping (in terms of class separability in test). As a consequence, KµH is more affected by the sample selection bias since all the unlabeled samples in the training set are used to compute the cluster similarity in an inappropriate kernel space. On the other hand, KµX is more robust to the sample selection bias because it approximates the cluster similarity to the similarities of the cluster centers µk already defined in the input space, and thus it is less dependent on how the unlabeled samples representing the clusters are mapped to H. In this context, KµHs provides the overall best results, and it is also more robust to the sample selection bias problem. The soft mean map weights the contribution of each sample to the definition of the centre of mass of each cluster in the kernel space H with the EM estimated posterior probabilities, which is equivalent to eliminate the samples that do not properly represent the cluster in the input space. Hence, the estimation of the cluster center in H is less influenced by the selection of an inappropriate mapping. 4.4. Classification maps A quantitative and visual analysis of the corresponding classification maps of the test images was carried out. Figure 2 shows the comparison of the ‘SVM on µk ’ and the ‘µs -SVM’ methods against the reference cloud masks from [13]. The Barrax and France images are representative examples of the covariance shift in cloud screening

Fig. 2. Cloud maps for the MERIS images over Barrax and France, comparing the RGB composite, the ‘SVM on $\mu_k$’ ($K$) classification, and the ‘$\mu_s$-SVM’ ($K_{\mu_s}^{\mathcal{H}}$) classification. Discrepancies between the methods are shown in red when the proposed method detects cloud and in yellow when pixels are classified as cloud-free (agreement categories: Cloud/Cloud, Land/Cloud, Cloud/Land, Land/Land). Barrax: ‘SVM on $\mu_k$’ κ=0.40, OA=91.5%; ‘$\mu_s$-SVM’ κ=0.96, OA=99.5%. France: ‘SVM on $\mu_k$’ κ=0.51, OA=88.5%; ‘$\mu_s$-SVM’ κ=0.69, OA=93.5%.

5. CONCLUSIONS

A semi-supervised SVM classification method based on cluster similarity has been proposed in this work. In remote sensing applications, no training samples are typically available for the image to be classified, and thus labeled data is extracted from other images modeling similar problems. In this scenario, not all the training samples are equally reliable, and hence the presented soft mean map constitutes a suitable approach to combat the sample selection bias. The information from unlabeled samples of the test dataset is included in the standard SVM by means of a cluster similarity, which is directly computed in the kernel space with a dedicated kernel based on the means of the feature vectors in this space. The good results obtained for the cloud screening application suggest that, when a proper data assumption is made, the proposed semi-supervised method outperforms the standard supervised algorithm. Since the distribution of remote sensing data over natural land covers is usually smooth (spectra of pixels from the same land cover are similar), the cluster assumption imposed in the model is reasonable, and the EM algorithm with GMM implements it efficiently.

6. REFERENCES

[1] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, UK, 2004.

[2] G. Camps-Valls, J. L. Rojo-Álvarez, and M. Martínez-Ramón, Eds., Kernel Methods in Bioengineering, Signal and Image Processing, Idea Group Publishing, Hershey, PA, USA, Jan 2007.

[3] G. Camps-Valls and L. Bruzzone, “Kernel-based methods for hyperspectral image classification,” IEEE Trans. Geos. Rem. Sens., vol. 43, no. 6, pp. 1351–1362, Jun 2005.

[4] X. Zhu, “Semi-supervised learning literature survey,” Tech. Rep. 1530, Computer Sciences, University of Wisconsin-Madison, USA, 2005. Last modified June 24, 2007. Online: http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf

[5] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, MA, USA, 1st edition, 2006.

[6] L. Bruzzone, M. Chi, and M. Marconcini, “A novel transductive SVM for the semisupervised classification of remote-sensing images,” IEEE Trans. Geos. Rem. Sens., vol. 44, no. 11, pp. 3363–3373, 2006.

[7] G. Camps-Valls, T. V. Bandos Marsheva, and D. Zhou, “Semisupervised graph-based hyperspectral image classification,” IEEE Trans. Geos. Rem. Sens., vol. 45, no. 10, pp. 3044–3054, Oct 2007.



[8] L. Gómez-Chova, G. Camps-Valls, J. Muñoz-Marí, and J. Calpe-Maravilla, “Semisupervised image classification with Laplacian support vector machines,” IEEE Geos. Rem. Sens. Lett., vol. 5, no. 3, Jul 2008.

[9] S. K. Zhou and R. Chellappa, “From sample similarity to ensemble similarity: probabilistic distance measures in reproducing kernel Hilbert space,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 28, no. 6, pp. 917–929, Jun 2006.

[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1–38, 1977.

[11] B. M. Shahshahani and D. A. Landgrebe, “The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon,” IEEE Trans. Geos. Rem. Sens., vol. 32, no. 5, pp. 1087–1095, 1994.

[12] Q. Jackson and D. A. Landgrebe, “An adaptive classifier design for high-dimensional data analysis with a limited training data set,” IEEE Trans. Geos. Rem. Sens., pp. 2664–2679, Dec 2001.

[13] L. Gómez-Chova, G. Camps-Valls, J. Calpe, L. Guanter, and J. Moreno, “Cloud-screening algorithm for ENVISAT/MERIS multispectral images,” IEEE Trans. Geos. Rem. Sens., vol. 45, no. 12, pp. 4105–4118, Dec 2007.

[14] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola, “A kernel method for the two-sample-problem,” in NIPS 2006, Cambridge, MA, USA, Jan 2007, vol. 19, pp. 1–8, MIT Press.

[15] J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf, “Correcting sample selection bias by unlabeled data,” in NIPS 2006, Cambridge, MA, USA, Jan 2007, vol. 19, pp. 1–8, MIT Press.

[16] M. Sugiyama and K.-R. Müller, “Input-dependent estimation of generalization error under covariate shift,” Statistics & Decisions, vol. 23, no. 4, pp. 249–279, 2005.