Combination of One-Class Remote Sensing Image Classifiers

Jordi Muñoz-Marí, Gustavo Camps-Valls, Luis Gómez-Chova, and Javier Calpe-Maravilla
Dept. Enginyeria Electrònica, Universitat de València. C/ Dr. Moliner, 50. 46100 Burjassot, València, Spain.
[email protected], http://www.uv.es/jordi

Abstract— This paper presents simple but powerful combination methods of dedicated one-class classifiers (OCCs) for efficient remote sensing image classification. The mean and product combination rules are applied to the probabilistic outputs generated by OCCs, and the performance is illustrated in an urban monitoring application in which multi-sensor (optical and SAR) data and multi-source (spectral and contextual) features are available. Two OCCs are used as core parts: the classical mixture of Gaussians (MoG) and the support vector domain description (SVDD) classifier. The results obtained by combining SVDD classifier outputs show a clear improvement in accuracy, and greater robustness to high-dimensional samples, compared with both the MoG and stacked approaches.

I. INTRODUCTION

One-class classifiers (OCCs) are a special kind of classifier specialized in data domain description, i.e. detecting the objects belonging to a class of interest (targets) and rejecting the objects of the other classes (outliers). Different kinds of OCCs exist in the literature, such as the Mixture of Gaussians (MoG), Parzen windows, k-Nearest Neighbours, and the recently introduced Support Vector Domain Description (SVDD) [1], [2], which is the one-class implementation of the support vector machine (SVM) [3], [4]. In particular, the SVDD has been successfully presented in the context of remote sensing [5]–[7]. This classifier, being a maximum-margin regularized kernel method, has shown good robustness to ill-posed problems [8].

High-dimensionality problems are encountered in hyperspectral image classification, but also when dealing with multi-sensor and multi-source data. In such cases, the commonly adopted approach consists of stacking all available features to train the classifier. This strategy is clearly suboptimal, as the sample dimension is unnecessarily increased. A possible way to alleviate this problem is to sum kernels dedicated to each feature set [9]. In this paper, we present a simpler approach that combines already trained classifiers by applying the mean and product rules to the posterior probabilities of dedicated OCCs [10]. Application to challenging urban monitoring problems combining optical, radar, spectral and contextual features shows the suitability of the proposed combination methods.

The paper is outlined as follows. Section II briefly reviews the one-class classifiers used in this work, and Section III presents the proposed combination methods. Section IV shows the experiments and the obtained results, and finally Section V draws some conclusions and outlines further work.

II. ONE-CLASS CLASSIFIERS

In this section, we briefly review the formulations of the MoG and SVDD classifiers used in this paper.

A. Mixture of Gaussians

The MoG OCC consists of modeling the target class using a mixture of K Gaussians:

f(x) = \sum_{i=1}^{K} P_i \exp\left( -(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right),    (1)
where the parameters P_i, \mu_i and \Sigma_i represent the contribution, mean and sample covariance of each Gaussian in the mixture, respectively. These parameters can be optimized using the expectation-maximization (EM) algorithm. A regularization term is included in the training process to improve the performance in high-dimensional scenarios.

B. Support Vector Domain Description (SVDD)

Notationally, the SVDD considers a dataset {x_i ∈ R^N, i = 1, ..., n} belonging to a given class of interest [1]. The goal is to find a hypersphere (in a high-dimensional Hilbert feature space H, to which the samples are mapped through a non-linear transformation φ) of radius R > 0 and center a, with minimum volume, containing most of these data objects [1] (see Fig. 1).
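As a brief aside, the MoG one-class classifier of Section II-A can be sketched with scikit-learn's GaussianMixture, which is fitted with the EM algorithm; here the reg_covar parameter plays the role of the regularization term mentioned above, and the percentile-based rejection threshold is a hypothetical choice for illustration, not the paper's exact procedure:

```python
# Minimal sketch of a MoG one-class classifier on synthetic data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_target = rng.normal(loc=0.0, scale=1.0, size=(500, 4))  # target-class samples only

# EM-fitted mixture; reg_covar regularizes the covariances (high-dim. stability).
mog = GaussianMixture(n_components=3, covariance_type="full",
                      reg_covar=1e-3, random_state=0).fit(X_target)

# Hypothetical threshold: reject the 5% least likely training samples.
threshold = np.percentile(mog.score_samples(X_target), 5)

def is_target(x):
    """Accept a sample as 'target' if its log-likelihood exceeds the threshold."""
    return mog.score_samples(np.atleast_2d(x)) >= threshold

X_test = np.vstack([rng.normal(0, 1, (10, 4)),   # target-like samples
                    rng.normal(8, 1, (10, 4))])  # obvious outliers
print(is_target(X_test))
```

Samples far from the fitted mixture receive very low log-likelihoods and are rejected, while most in-class samples fall above the threshold.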


Fig. 1. The hypersphere containing the (colored) target data is described by the center a and radius R. Samples on the boundary are the support vectors (green), and samples outside the ball are assigned a positive slack ξi to deal with outliers.

Therefore, one has to minimize R^2 subject to \|\phi(x_i) - a\|^2 \le R^2, ∀i = 1, ..., n. In addition, since the (training) distribution may contain outliers, one introduces a set of slack variables ξ_i ≥ 0, as usual in the SVM framework, and the problem then becomes:

\min_{R, a, \xi} \; R^2 + C \sum_i \xi_i    (2)

subject to

\|\phi(x_i) - a\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad \forall i = 1, \ldots, n    (3)

where the parameter C controls the trade-off between the volume of the hypersphere and the permitted errors (regularization parameter). The primal problem (2) is usually solved through its Lagrangian dual functional, which constitutes a quadratic programming problem [1], [4], [11]. Here, all mappings φ appear in the form of inner products, which allows us to define a kernel function K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle; the nonlinear SVDD can thus be constructed using only the kernel function, without needing to know the mapping φ explicitly. After solving the dual problem, the decision function implemented by the classifier for any test vector x is given by

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{n} \alpha_i K(x_i, x) + b \right)    (4)

where b can be easily computed from the α_i that are neither 0 nor C. Sparse (simple) classifiers are obtained as many α_i tend to zero. Valid kernels for equation (4) are those fulfilling Mercer's theorem. In our tests, we used the Gaussian Radial Basis Function (RBF) kernel, K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2)), where σ ∈ R^+ is the Gaussian width. The RBF kernel is a one-parameter, flexible and universal kernel that includes other valid kernels as particular cases [12].

III. COMBINATION OF ONE-CLASS CLASSIFIERS

Several strategies are possible to combine the outputs of different classifiers. Assuming that the output of each classifier is an estimated posterior probability, \hat{p}(\omega_t | x_i), a simple combined decision is obtained by averaging the classifiers' outputs. Although not based on a solid foundation, this simple rule obtains good results [13]. Another possibility, adopting Bayes' theorem and assuming the independence of the classifiers' outputs, is to combine them as the product of these outputs [14].

When working with OCCs, the problem is that only information about the target class is known, p(x|ω_t) (assuming the training set is an i.i.d. sample from the target distribution), while the outlier class is unknown, i.e. p(x|ω_o) is not known, where ω_o denotes the whole set of unknown classes. According to Bayes' rule, the posterior probability for the target class should be computed as:

p(\omega_t | x) = \frac{p(x|\omega_t)\,p(\omega_t)}{p(x)} = \frac{p(x|\omega_t)\,p(\omega_t)}{p(x|\omega_t)\,p(\omega_t) + p(x|\omega_o)\,p(\omega_o)}    (5)

Given that the outlier distribution p(x|ω_o) is unknown, and that the prior probabilities p(ω_t) and p(ω_o) are hard (or impossible) to estimate, equation (5) cannot be used directly. A way around this problem is to assume that p(x|ω_o) is independent of x, i.e. to assume a uniform distribution for the outliers, and thus to use p(x|ω_t) in place of p(ω_t|x). The output of OCCs such as the mixture of Gaussians or Parzen windows classifiers is already an estimation of p(x|ω_t), and thus the mean or product rules can be applied directly. However, the output of OCCs such as k-NN and the SVDD is an estimation of the distance to a model, d(x|ω_t), and this distance must be transformed into a probability before applying the combination rules. Formally, the mean rule is defined as y(x) = \frac{1}{R} \sum_k \hat{P}_k(x|\omega_t), where R is the number of OCCs employed, and the product rule is defined as y(x) = \prod_k \hat{P}_k(x|\omega_t). Once combined, the output y(x) is treated as a standard OCC output, and thus a threshold has to be tuned on the training set (usually by means of the integration of the ROC curve) to define the final classifier.

IV. EXPERIMENTAL RESULTS

For our experiments, we used images collected in the Urban Expansion Monitoring (UrbEx) ESA-ESRIN DUP project [15]. The considered test sites are the cities of Rome and Naples (Italy), where images from the ERS2 SAR and Landsat TM sensors were acquired in 1999. For classification we considered the seven Landsat TM spectral bands L1–L7, the two SAR backscattering intensities (0–35 days), the coherence between the two SAR signals, and an additional feature computed by applying a multi-stage spatial filter to the coherence images (see more details in [16]). The experiments were conducted in a region of interest of 200×200 pixels, shown in Fig. 2 (left panel). The dataset contains three class labels: 'urban', 'non-urban' and 'unknown'.
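The mean and product combination rules of Section III can be sketched in a few lines; in this minimal example the per-classifier probability estimates and the fixed acceptance threshold are illustrative values (in practice the threshold would be tuned on the training set via the ROC curve):

```python
# Mean and product combination of one-class classifier outputs.
import numpy as np

def mean_rule(prob_list):
    """y(x) = (1/R) * sum_k P_k(x | target), over R classifiers."""
    return np.mean(prob_list, axis=0)

def product_rule(prob_list):
    """y(x) = prod_k P_k(x | target), assuming independent classifiers."""
    return np.prod(prob_list, axis=0)

# Outputs of three hypothetical dedicated OCCs on four samples:
p1 = np.array([0.9, 0.2, 0.8, 0.1])
p2 = np.array([0.8, 0.3, 0.7, 0.2])
p3 = np.array([0.7, 0.1, 0.9, 0.3])

y_mean = mean_rule([p1, p2, p3])
y_prod = product_rule([p1, p2, p3])

# The combined score is thresholded like any other OCC output;
# 0.5 here is a hypothetical fixed threshold, not a tuned one.
print(y_mean >= 0.5)
```

Note that the product rule penalizes disagreement much more strongly than the mean rule: a single near-zero output drives the combined product towards zero.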
Contextual and spectral features were extracted from both the Landsat and SAR images, as follows:
• Landsat: the seven available spectral bands, {x_w^l}, and 28 contextual features extracted by averaging reflectances in 3×3, 5×5, 7×7 and 11×11 spatial windows, {x_c^l}.
• SAR: the coherence between the two SAR signals and its filtered version, {x_w^r}, and eight contextual features extracted by averaging them in 3×3, 5×5, 7×7 and 11×11 spatial windows, {x_c^r}.

This pre-processing yields four different sets of features to build the following OCCs:

Name                                  Description
{I_c^l}, {I_c^r}, {I_w^l}, {I_w^r}    One OCC per set of features separately. We call these classifiers dedicated OCCs.
{I_c^{l,r}}                           One OCC with contextual features ({x_c^l}, {x_c^r}).
{I_w^{l,r}}                           One OCC with spectral features ({x_w^l}, {x_w^r}).
{I_{c,w}^{l,r}}                       One OCC with all features stacked.

The above OCCs were combined using the mean and product rules to obtain the following combination OCCs:

Fig. 2. Rome (top row) and Naples (bottom row) classification maps. From left to right: one Landsat band; true classification map (white: urban, black: non-urban, gray: unknown); for MoG, all features stacked and best combined result; for SVDD, all features stacked and best combined result (cf. Tables I and II).

Mean               Product            Description
{M_c^{l,r}}        {P_c^{l,r}}        Combination of the contextual-feature OCCs.
{M_w^{l,r}}        {P_w^{l,r}}        Combination of the spectral-feature OCCs.
{M_{c,w}^{l,r}}    {P_{c,w}^{l,r}}    Combination of the four dedicated OCCs.
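The dedicated-versus-stacked setup above can be sketched as follows. scikit-learn's OneClassSVM (a ν-formulation one-class SVM, closely related to the SVDD when an RBF kernel is used) stands in for the paper's SVDD; the feature blocks, their sizes, and the parameters ν and γ are illustrative, and the combination is done on raw decision scores rather than calibrated probabilities, which is a simplification of the paper's probabilistic combination:

```python
# One dedicated one-class model per feature block vs. one model on stacked features.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
n = 300
blocks = {"spectral": rng.normal(0, 1, (n, 7)),    # e.g. a {x_w}-style block
          "contextual": rng.normal(0, 1, (n, 8))}  # e.g. a {x_c}-style block

# Dedicated OCCs: one model trained per feature block.
dedicated = {name: OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X)
             for name, X in blocks.items()}

# Mean-rule combination of the dedicated models' decision scores.
scores = np.mean([model.decision_function(blocks[name])
                  for name, model in dedicated.items()], axis=0)

# Stacked approach: a single OCC on the concatenated features.
X_stacked = np.hstack(list(blocks.values()))
stacked = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_stacked)

print(scores.shape, stacked.decision_function(X_stacked).shape)
```

The dedicated models each see a low-dimensional input, which is the motivation the paper gives for combining them instead of training one classifier on the full stacked vector.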

A training set was made of 40% of the samples in the image; final results were obtained by testing the trained OCCs on the remaining 60% of the samples. We selected the best free parameters following a 5-fold cross-validation method on the training set. In order to measure the effectiveness of each OCC, the confusion matrix is built and the kappa coefficient, κ, and the overall accuracy, OA[%], are estimated [17]. The kappa coefficient fits our case very well, as it gives a good measure of the classifier's ability to accept samples of the target class and reject samples of the outlier class.

Table I shows the results obtained for the region of interest of the Rome site by the OCCs trained with the different feature sets and combination rules. We can see that, for the group of dedicated and stacked OCCs (left panel of the table), the best result is obtained by the OCC working with Landsat contextual features, {I_c^l}, and the worst by the OCC using Landsat spectral features, {I_w^l}. For this image, OCCs using contextual features exhibit better results, which suggests that, in this particular example, the contextual features contain richer information than the spectral ones. It should be noted, however, that the contextual features are in fact filtered versions of the spectral and intensity channels. Finally, an OCC using all features, contextual and spectral, stacked together, {I_{c,w}^{l,r}}, works worse than the OCC trained with contextual features only. This can be attributed to the well-known curse of dimensionality (i.e. the Hughes phenomenon [8]), by which the addition of (potentially redundant) features without an increase in the number of training samples leads to the risk of overfitting the training data, thus producing poor generalization results.

Table I (right) shows that both the mean and product rules noticeably improve the results obtained by the dedicated OCCs, and that they are considerably better than employing all features in a single classifier following the stacked approach.
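The kappa coefficient and overall accuracy mentioned above can be computed directly from the confusion matrix; the 2×2 matrix below (target vs. outlier) holds illustrative counts, not the paper's results:

```python
# Overall accuracy and Cohen's kappa from a 2x2 confusion matrix.
import numpy as np

cm = np.array([[850, 50],   # rows: true class (target, outlier)
               [30, 70]])   # columns: predicted class

n = cm.sum()
oa = np.trace(cm) / n                                  # overall accuracy
pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2    # expected chance agreement
kappa = (oa - pe) / (1 - pe)                           # kappa coefficient

print(round(oa, 3), round(kappa, 3))  # → 0.92 0.592
```

Kappa discounts the agreement expected by chance, which is why a classifier that trivially accepts everything scores high OA but near-zero κ on an imbalanced target/outlier split.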
It is also worth noting that OCCs working with spectral features can be drastically improved (cf. Table I, left). Finally, when using MoG, both the mean and product rules obtain improved results

compared to all dedicated OCCs. The same effect is observed with the SVDD, but with higher classification rates. These numerical results are also confirmed through visual inspection. Figure 2 (top row) shows the classification maps obtained for the Rome site. Although all classification maps look similar, smoother and better-defined boundaries are observed for the combination SVDD map (top-right of the figure).

Table II shows the results for the Naples site. In this case, the best results are obtained when using the SVDD with spectral features only, but no appreciable numerical differences are observed with contextual features. This is not the case for the MoG method, for which using only contextual features produces poor results, suggesting that for this particular case a non-Gaussian distribution underlies these features. In any case, the dedicated SVDD classifiers are much better than the MoG classifiers. This is especially noticeable when looking at the multi-feature approaches, {I_c^{l,r}} and {I_w^{l,r}}, where MoG suffers from the overfitting problem when working with these high-dimensional samples, even though it is a regularized method. The combined output classifiers yield a clear improvement of the overall results, thus confirming that this is a better strategy for model building than stacking features and developing a single OCC. This becomes more evident for the MoG classifier, which is more affected by the Hughes phenomenon. In this second image, the combination results show that the SVDD outperforms the MoG. The results obtained by the SVDD are additionally more uniform and robust to the dimensionality and combination, suggesting that this is a more appropriate classifier for operational scenarios. The classification maps for the Naples site are shown in Fig. 2 (bottom row). For the MoG method, the combined results outperform the stacked approach.
For the SVDD, both the stacked and combination classification maps are equally good, which is a consequence of the good robustness of the method to high-dimensional samples and, in turn, confirms the suitability of the proposed combination methods.

V. CONCLUSIONS

This paper presented a study of simple yet powerful combination strategies for one-class remote sensing image

TABLE I
Results (overall accuracy, OA[%], and kappa statistic, κ) for MoG (top rows) and SVDD (bottom rows) for the Rome image. On the left panel, results using dedicated and stacked OCCs with different feature sets are shown. On the right panel, results of the mean and product combination rules are shown. Best results are boldfaced.

              Contextual                       Spectral                       Stacked          Mean rule                              Product rule
         {I_c^l} {I_c^r} {I_c^{l,r}}  {I_w^l} {I_w^r} {I_w^{l,r}}  {I_{c,w}^{l,r}}  {M_c^{l,r}} {M_w^{l,r}} {M_{c,w}^{l,r}}  {P_c^{l,r}} {P_w^{l,r}} {P_{c,w}^{l,r}}
MoG  κ     0.60    0.48    0.64         0.37    0.59    0.61         0.61             0.70        0.65        0.75              0.67        0.63        0.71
     OA%   95.3    93.7    95.4         90.9    94.3    95.0         94.4             96.3        95.8        96.9              93.6        95.2        96.6
SVDD κ     0.59    0.57    0.70         0.26    0.54    0.68         0.64             0.77        0.55        0.74              0.76        0.56        0.74
     OA%   94.7    93.7    96.4         87.8    93.8    95.7         95.1             97.2        94.8        96.9              97.2        94.8        96.8
TABLE II
Results (overall accuracy, OA[%], and kappa statistic, κ) for MoG (top rows) and SVDD (bottom rows) for the Naples image. On the left panel, results using dedicated and stacked OCCs with different feature sets are shown. On the right panel, results of the mean and product combination rules are shown. Best results are boldfaced.

              Contextual                       Spectral                       Stacked          Mean rule                              Product rule
         {I_c^l} {I_c^r} {I_c^{l,r}}  {I_w^l} {I_w^r} {I_w^{l,r}}  {I_{c,w}^{l,r}}  {M_c^{l,r}} {M_w^{l,r}} {M_{c,w}^{l,r}}  {P_c^{l,r}} {P_w^{l,r}} {P_{c,w}^{l,r}}
MoG  κ     0.77    0.58    0.75         0.80    0.63    0.81         0.74             0.78        0.81        0.83              0.73        0.78        0.74
     OA%   88.7    79.6    87.7         90.5    82.0    90.7         87.0             89.2        91.0        91.6              86.8        89.5        87.3
SVDD κ     0.88    0.78    0.90         0.84    0.88    0.92         0.90             0.87        0.87        0.88              0.87        0.88        0.89
     OA%   94.3    89.2    95.4         92.4    90.4    96.0         95.1             93.9        93.5        94.4              93.8        94.2        94.7
classifiers. Essentially, we combine the outputs of dedicated classifiers trained on different feature sets using the mean and product rules on the classifiers' posterior probabilities. The results obtained from these combinations generally improve the accuracy obtained by the dedicated classifiers and by the classical stacked approach. Further work will consider testing different combination rules and one-class classifiers in a wider range of scenes.

ACKNOWLEDGMENTS

This paper has been partially supported by the Spanish Ministry for Education and Science under project DATASAT (ESP2005-07724-C05-03).

REFERENCES

[1] D.M.J. Tax and R.P.W. Duin. Support vector domain description. Pattern Recognition Letters, 20(11–13):1191–1199, 1999.
[2] D.M.J. Tax. One-class classification; Concept-learning in the absence of counter-examples. PhD thesis, Delft University of Technology, 2001.
[3] G. Camps-Valls and L. Bruzzone. Kernel-based methods for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 43(6):1351–1362, June 2005.
[4] G. Camps-Valls, J.L. Rojo-Álvarez, and M. Martínez-Ramón, editors. Kernel Methods in Bioengineering, Signal and Image Processing. Idea Group Publishing, Hershey, PA (USA), Jan 2007.
[5] X. Song, G. Cherian, and G. Fan. A ν-insensitive SVM approach for compliance monitoring of the conservation reserve program. IEEE Geoscience and Remote Sensing Letters, 2(2):99–103, Apr 2005.
[6] G. Mercier and F. Girard-Ardhuin. Partially supervised oil-slick detection by SAR imagery using kernel expansion. IEEE Transactions on Geoscience and Remote Sensing, 44(10):2839–2846, Oct 2006.

[7] J. Muñoz-Marí, L. Bruzzone, and G. Camps-Valls. A support vector domain description approach to supervised classification of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 45, 2007.
[8] G.F. Hughes. On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory, 14(1):55–63, 1968.
[9] G. Camps-Valls, L. Gómez-Chova, J. Muñoz-Marí, L. Alonso, J. Calpe-Maravilla, and J. Moreno. Multitemporal image classification and change detection with kernels. In SPIE International Symposium Remote Sensing XII, volume 6365, Stockholm, Sweden, Sep 2006.
[10] D.M.J. Tax and R.P.W. Duin. Combining one-class classifiers. In Multiple Classifier Systems, Proceedings Second International Workshop MCS 2001, volume 2096, pages 299–308, Cambridge, UK, Jul 2001. Springer Verlag, Berlin.
[11] B. Schölkopf and A. Smola. Learning with Kernels. Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.
[12] S. Sathiya Keerthi and Chih-Jen Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.
[13] M. Taniguchi and V. Tresp. Averaging regularized estimators. Neural Computation, 9(5):1163–1178, 1997.
[14] J. Benediktsson and P. Swain. Consensus theoretic classification methods. IEEE Transactions on Systems, Man and Cybernetics, 22(4):688–704, 1992.
[15] P. Castracane, F. Iavarone, S. Mica, E. Sottile, C. Vignola, O. Arino, M. Cataldo, D. Fernandez-Prieto, G. Guidotti, A. Masullo, and I. Pratesi. Monitoring urban sprawl and its trends with EO data. UrbEx, a prototype national service from a WWF-ESA joint effort. In 2nd GRSS/ISPRS Joint Workshop on Remote Sensing and Data Fusion over Urban Areas, pages 245–248, 2003.
[16] L. Gómez-Chova, D. Fernández-Prieto, J. Calpe, E. Soria, J. Vila, and G. Camps-Valls. Urban monitoring using multitemporal SAR and multispectral data. Pattern Recognition Letters, Special Issue on "Pattern Recognition in Remote Sensing", 27(4):234–243, Mar 2006.
[17] R.G. Congalton and K. Green. Assessing the Accuracy of Remotely Sensed Data: Principles and Practices. CRC Press, 1998.