
Feature Extraction from Remote Sensing Data using Kernel Orthonormalized PLS

Jerónimo Arenas-García* and Gustavo Camps-Valls†

* Dept. Signal Theory and Communications, University Carlos III, Spain. [email protected], http://www.tsc.uc3m.es/∼jarenas
† Dept. Electronics Engineering, Universitat de València, Spain. [email protected], http://www.uv.es/gcamps


Motivation

Feature selection/extraction is essential before classification or regression in remote sensing. A high number of correlated features leads to:
- Collinearity
- Overfitting
- The Hughes phenomenon

Projection-based algorithms are commonly deployed. Partial Least Squares (PLS) has received particular attention because:
1 It is better than PCA, since it considers the output label information.
2 Interpretability ∼ knowledge discovery.

Starting hypothesis

Standard linear PLS is suboptimal in the mean-square-error sense. Optimality is ensured with Orthonormalized PLS (OPLS) (Roweis, 1999). Unfortunately, real problems are commonly non-linear. How can a well-established linear algorithm be made non-linear? Kernel methods can do the job!


Objectives

Optimality: We focus on the OPLS.
Kernelization: We present the Kernel Orthonormalized PLS (KOPLS).
Scalability: We also make the method algorithmically feasible.
We analyze and characterize the method:
1 Theoretically: computational cost, memory, number of projections.
2 Experimentally: toy examples, remote sensing image classification, biophysical parameter estimation.


Notation preliminaries

Notation:
Data: $\{x_i, y_i\}_{i=1}^{l}$, $x_i \in \mathbb{R}^N$, $y_i \in \mathbb{R}^M$
Input data matrix: $X = [x_1, \ldots, x_l]^\top$
Label matrix: $Y = [y_1, \ldots, y_l]^\top$
Number of projections: $n_p$
Projected inputs: $X' = XU$
Projected outputs: $Y' = YV$
Projection matrices: $U$ ($N \times n_p$) and $V$ ($M \times n_p$)
Covariance: $C_{xy} = E\{(x - \mu_x)(y - \mu_y)^\top\} \approx \frac{1}{l} X^\top Y$
Frobenius norm of a matrix: $\|A\|_F^2 = \sum_{ij} a_{ij}^2$
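The following minimal NumPy sketch mirrors this notation on synthetic data; the variable names and the random data are illustrative assumptions, not code from the paper.

```python
# Sketch of the notation above with synthetic data (illustrative only).
import numpy as np

l, N, M, n_p = 100, 10, 3, 2           # samples, input dim, output dim, projections
X = np.random.randn(l, N)              # input data matrix (rows are x_i)
Y = np.random.randn(l, M)              # label matrix (rows are y_i)

Xc, Yc = X - X.mean(0), Y - Y.mean(0)  # centered data
Cxx = Xc.T @ Xc / l                    # N x N input covariance
Cxy = Xc.T @ Yc / l                    # N x M cross-covariance ~ (1/l) X^T Y

U = np.random.randn(N, n_p)            # input projection matrix
V = np.random.randn(M, n_p)            # output projection matrix
X_proj, Y_proj = X @ U, Y @ V          # projected inputs X' and outputs Y'
frob2 = np.sum(X ** 2)                 # squared Frobenius norm ||X||_F^2
```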


Linear feature extraction

Toy example

Imagine a classification problem in which labels matter (a lot!). “Blind” feature extraction is not a good choice. Let’s see what happens with different methods ...


Linear feature extraction

Principal Component Analysis (PCA)

“Find projections maximizing the variance of the data:”

PCA:
maximize: $\mathrm{Tr}\{(XU)^\top (XU)\} = \mathrm{Tr}\{U^\top C_{xx} U\}$
subject to: $U^\top U = I$
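A minimal sketch of this criterion: under the constraint $U^\top U = I$, the maximizer is given by the $n_p$ leading eigenvectors of the sample covariance. This is an illustrative implementation, not the paper's code.

```python
# Sketch: the maximizer of Tr{U^T Cxx U} under U^T U = I is given by the
# n_p leading eigenvectors of the sample covariance of the (centered) data.
import numpy as np

def pca_projections(X, n_p):
    Xc = X - X.mean(axis=0)                 # center the data
    Cxx = Xc.T @ Xc / X.shape[0]            # sample covariance
    eigval, eigvec = np.linalg.eigh(Cxx)    # ascending eigenvalues
    order = np.argsort(eigval)[::-1]        # largest variance first
    return eigvec[:, order[:n_p]]           # U, of size N x n_p
```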


Linear feature extraction

Partial Least Squares (PLS)

“Find directions of maximum covariance between the projected input and output data:”

PLS:
maximize: $\mathrm{Tr}\{(XU)^\top (YV)\} = \mathrm{Tr}\{U^\top C_{xy} V\}$
subject to: $U^\top U = V^\top V = I$
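A minimal sketch of this criterion: with orthonormal $U$ and $V$, the maximizers of $\mathrm{Tr}\{U^\top C_{xy} V\}$ are the leading left/right singular vectors of $C_{xy}$. This is the SVD view of the stated criterion, not the classical iterative PLS algorithm.

```python
# Sketch: with U^T U = V^T V = I, the maximizers of Tr{U^T Cxy V} are the
# n_p leading left/right singular vectors of the cross-covariance Cxy.
import numpy as np

def pls_projections(X, Y, n_p):
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Cxy = Xc.T @ Yc / X.shape[0]            # N x M cross-covariance
    Ux, s, Vt = np.linalg.svd(Cxy, full_matrices=False)
    return Ux[:, :n_p], Vt.T[:, :n_p]       # U (N x n_p), V (M x n_p)
```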


Linear feature extraction

Orthonormalized Partial Least Squares (OPLS)

“OPLS chooses the projection $U$ so that the projected data $X' = XU$ gives the best approximation of the labels $Y$ in a reduced-dimensionality space:”

OPLS:
find: $U = \arg\min_U \|Y - X' W\|_F^2$
where: $W = (X'^\top X')^{-1} X'^\top Y$


Linear feature extraction

Orthonormalized Partial Least Squares (OPLS)

“... which can be rewritten as:”

OPLS:
maximize: $\mathrm{Tr}\{U^\top C_{xy} C_{xy}^\top U\}$
subject to: $U^\top C_{xx} U = I$
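A minimal sketch of this formulation: the criterion corresponds to the generalized eigenproblem $C_{xy} C_{xy}^\top u = \lambda C_{xx} u$. The small ridge added to $C_{xx}$ is a numerical-stability assumption added here, not part of the formulation above.

```python
# Sketch: the OPLS criterion corresponds to the generalized eigenproblem
#   Cxy Cxy^T u = lambda Cxx u,  with eigenvectors satisfying U^T Cxx U = I.
import numpy as np
from scipy.linalg import eigh

def opls_projections(X, Y, n_p, ridge=1e-8):
    l = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc / l + ridge * np.eye(X.shape[1])   # small ridge for stability
    Cxy = Xc.T @ Yc / l
    eigval, eigvec = eigh(Cxy @ Cxy.T, Cxx)            # generalized eigenproblem
    order = np.argsort(eigval)[::-1]
    return eigvec[:, order[:n_p]]                      # U, with U^T Cxx U = I
```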


Remarks

Remarks on linear feature extraction for supervised problems

Feature extraction is important.
Labels play an important role in feature extraction.
Traditional PCA fails since the labels are ignored.
Traditional PLS does a good, yet suboptimal, job.
Orthonormalized PLS excels in linear feature extraction.


Linear vs. Non-linear feature extraction

Linear feature extraction. Advantages:
- Simplicity. Easy to understand.
- Leads to convex optimization problems.

Linear feature extraction. Drawbacks:
- Lack of expressive power.
- Unsuitable for non-linear problems.


Linear vs. Non-linear feature extraction

[Figure: toy example — original data vs. PCA projections]


Linear vs. Non-linear feature extraction

[Figure: toy example — original data vs. OPLS projections]


Kernel methods for non-linear feature extraction

Kernel methods

[Figure: the mapping Φ from the input feature space to the kernel feature space]

1 Map the data to a (possibly ∞-dimensional) feature space $\mathcal{H}$.
2 Solve a linear problem there.

Kernel trick

There is no need to know the (possibly ∞) coordinates of each mapped sample $\phi(x_i)$.
Kernel trick: “if an algorithm can be expressed in the form of dot products, its non-linear (kernel) version only needs the dot products among mapped samples, the so-called kernel function:”

$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$

Using this trick, we can implement K-PCA, K-PLS, K-OPLS, etc.
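A minimal sketch of the kernel function in practice, using the RBF kernel that appears later in the experiments; $\sigma$ is a free hyperparameter, and the function name is illustrative.

```python
# Sketch of an RBF kernel matrix computation (NumPy).
# K[i, j] = <phi(x_i), phi(x_j)> for the RBF feature map.
import numpy as np

def rbf_kernel(X1, X2, sigma):
    # Pairwise squared Euclidean distances between rows of X1 and X2
    d2 = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    d2 = np.maximum(d2, 0.0)            # guard against small negative round-off
    return np.exp(-d2 / (2.0 * sigma ** 2))
```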


Kernel PLS

Notation:
Data: $\{\phi(x_i), y_i\}_{i=1}^{l}$
Mapping: $\phi(x): \mathbb{R}^N \rightarrow \mathcal{H}$
Mapped inputs matrix: $\Phi = [\phi(x_1), \ldots, \phi(x_l)]^\top$
Output matrix: $Y = [y_1, \ldots, y_l]^\top$
Number of projections: $n_p$
Projections of mapped inputs: $\Phi' = \Phi U$
Projections of outputs: $Y' = Y V$
Projection matrices: $U$ ($\dim(\mathcal{H}) \times n_p$) and $V$ ($M \times n_p$)

Formulation

“The objective of KPLS is to find the directions of maximum covariance:”

KPLS:
maximize: $\mathrm{Tr}\{U^\top \tilde{\Phi}^\top \tilde{Y} V\}$
subject to: $U^\top U = V^\top V = I$

where $\tilde{\Phi}$ and $\tilde{Y}$ are centered versions of $\Phi$ and $Y$, respectively. Only a matrix of inner products of the patterns in $\mathcal{H}$ is needed (Shawe-Taylor, 2004).
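For completeness, a small sketch of the feature-space centering step, using the standard identity for centering a kernel matrix (see, e.g., Shawe-Taylor, 2004); this is an illustrative helper, not code from the paper.

```python
# Sketch: compute the centered kernel matrix K~ = (I - 11^T/l) K (I - 11^T/l)
# without ever forming the mapped samples phi(x_i).
import numpy as np

def center_kernel(K):
    l = K.shape[0]
    H = np.eye(l) - np.ones((l, l)) / l     # centering matrix
    return H @ K @ H
```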


Kernel Orthonormalized PLS

Formulation of the KOPLS

“The objective of KOPLS is:”

KOPLS:
maximize: $\mathrm{Tr}\{U^\top \tilde{\Phi}^\top \tilde{Y} \tilde{Y}^\top \tilde{\Phi} U\}$
subject to: $U^\top \tilde{\Phi}^\top \tilde{\Phi} U = I$

The features derived from KOPLS are optimal (in the MSE sense).

Kernel trick for the KOPLS

All projection vectors (the columns of $U$) can be expressed as a linear combination of the training data, $U = \tilde{\Phi}^\top A$. The maximization problem is reformulated as:

KOPLS:
maximize: $\mathrm{Tr}\{A^\top K_x K_y K_x A\}$
subject to: $A^\top K_x K_x A = I$

Centered kernel matrices: $K_x = \tilde{\Phi} \tilde{\Phi}^\top$ and $K_y = \tilde{Y} \tilde{Y}^\top$.
This is a generalized eigenproblem: $K_x K_y K_x \alpha = \lambda K_x K_x \alpha$.
$K_x$ and $K_y$ can be approximated without computing and storing the whole matrices.
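A minimal sketch of solving this dual problem with a generalized symmetric eigensolver; the small ridge added to $K_x K_x$ is a numerical-stability assumption added here, and the function name is illustrative.

```python
# Sketch: KOPLS dual coefficients from the generalized eigenproblem
#   Kx Ky Kx alpha = lambda Kx Kx alpha
# Kx, Ky are the centered l x l kernel matrices defined above.
import numpy as np
from scipy.linalg import eigh

def kopls_dual_coefficients(Kx, Ky, n_p, ridge=1e-8):
    l = Kx.shape[0]
    lhs = Kx @ Ky @ Kx                      # symmetric left-hand side
    rhs = Kx @ Kx + ridge * np.eye(l)       # constraint matrix, made positive definite
    eigval, eigvec = eigh(lhs, rhs)         # generalized symmetric eigenproblem
    order = np.argsort(eigval)[::-1]        # keep the n_p largest eigenvalues
    return eigvec[:, order[:n_p]]           # A, of size l x n_p

# Projections of the training data in feature space: Phi_proj = Kx @ A
```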


An illustrative example (cont’d)

[Figure: original data vs. PCA projections]


An illustrative example (cont’d)

[Figure: original data vs. OPLS projections]


An illustrative example (cont’d)

[Figure: original data vs. KPCA projections]


An illustrative example (cont’d)

[Figure: original data vs. KOPLS projections]


Remarks

Remarks on non-linear feature extraction

Linear methods such as PCA, PLS or OPLS are not suitable for non-linear classification/regression tasks.
Non-linear versions of these algorithms are readily obtained by applying the kernel trick.
KPLS and KOPLS consider labels for the derivation of the projection vectors, thus outperforming KPCA.
KOPLS inherits mean-square-error optimality from its linear counterpart.

Methods Characterization

(see paper in the proceedings)

              KOPLS                    KPLS
Kernel size   l × l                    l × l
Storage       O(l²)                    O(l²)
Max. n_p      min{rank(Φ), rank(Y)}    rank(Φ)


Experiment 1: Classification of LandSat images

Data collection

LandSat image, 82×100 pixels with a spatial resolution of 80 m × 80 m.
Six classes: red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble, and very damp grey soil.
Contextual information: stacking neighbouring pixels in 3×3 windows → high-dimensional and redundant feature vectors!
Training: 4435 samples. Testing: 2000 samples.

Experimental setup

Methods: linear OPLS, KPLS and KOPLS.
RBF kernel: $k(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$
10-fold cross-validation on the training set to estimate $\sigma$.
Classification procedure (see the sketch below):
1 Extract $n_p$ projections ($n_p < \mathrm{rank}(Y)$ for the KOPLS).
2 Project the test data.
3 Linear discriminant with the pseudoinverse of the projected data.
4 Winner-takes-all.
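A minimal sketch of steps 3 and 4, assuming the projected training/test data and a one-hot label matrix are already available; the function name is hypothetical.

```python
# Sketch of steps 3-4: least-squares (pseudoinverse) linear discriminant on the
# projected data, followed by a winner-takes-all decision. Xp_train/Xp_test are
# the projected training/test data, Y_train is a one-hot label matrix.
import numpy as np

def pinv_classifier(Xp_train, Y_train, Xp_test):
    W = np.linalg.pinv(Xp_train) @ Y_train   # discriminant weights
    scores = Xp_test @ W                     # class scores for test samples
    return np.argmax(scores, axis=1)         # winner-takes-all class labels
```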


Experiment 1: Classification of LandSat images

Accuracy and feature expression

[Figure: Overall accuracy (%) vs. number of features. Left panel: linear OPLS vs. KOPLS (1 to 5 features). Right panel: KOPLS vs. KPLS (number of features on a log scale).]

The non-linear method provides a better representation of the discriminative information.
KOPLS performance, with only 5 features, is 91%. KPLS needs 100 features to achieve similar performance.
Conclusions:
1 Non-linear OPLS methods provide much better results.
2 KOPLS yields features which contain more discriminative information.


Experiment 2: Oceanic chlorophyll concentration

Data collection

“Modeling the non-linear relationship between chlorophyll concentration and marine reflectance.”
SeaBAM dataset (O’Reilly, 1998): 919 in-situ pigment measurements around the United States and Europe.
Training: 460 samples. Testing: 460 samples.

Experimental setup

Methods: linear PLS, KPLS and KOPLS.
RBF kernel: $k(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$
Leave-one-out root mean square error (LOO-RMSE) to validate the model; $\sigma$ tuned in the range $[10^{-2}, 10^4]$ (see the sketch below).
$n_p = \mathrm{rank}(Y) = 1$ for the KOPLS.
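A minimal sketch of the $\sigma$ selection loop described above: a logarithmic grid over $[10^{-2}, 10^4]$ scored by LOO-RMSE. The `fit_and_predict` callable is a hypothetical stand-in for training the kernel model and predicting the held-out sample; it is not part of the paper.

```python
# Sketch: select sigma on a log grid in [1e-2, 1e4] by leave-one-out RMSE.
# `fit_and_predict(X_train, y_train, X_test, sigma)` is a hypothetical callable.
import numpy as np

def select_sigma(X, y, fit_and_predict, n_grid=13):
    sigmas = np.logspace(-2, 4, n_grid)
    loo_rmse = []
    for sigma in sigmas:
        errors = []
        for i in range(len(y)):
            train = np.delete(np.arange(len(y)), i)       # leave sample i out
            y_hat = fit_and_predict(X[train], y[train], X[i:i + 1], sigma)
            errors.append(np.ravel(y_hat)[0] - y[i])
        loo_rmse.append(np.sqrt(np.mean(np.square(errors))))
    return sigmas[int(np.argmin(loo_rmse))]               # sigma with lowest LOO-RMSE
```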


Experiment 2: Oceanic chlorophyll concentration

Accuracy and feature expression

Model             ME       RMSE    MAE     r
OPLS              -0.034   0.257   0.188   0.903
KPLS, n_p = 1      0.042   0.366   0.278   0.790
KPLS, n_p = 5     -0.013   0.189   0.140   0.947
KPLS, n_p = 10    -0.013   0.149   0.115   0.968
KPLS, n_p = 20    -0.009   0.138   0.106   0.972
KOPLS, n_p = 1    -0.015   0.154   0.111   0.967

Linear OPLS performs poorly as the linear assumption does not hold.
KPLS and the proposed KOPLS show a clear improvement in both accuracy and bias compared to linear OPLS.
KPLS and KOPLS show accuracy similar to SVR, and outperform it in bias. These results are obtained with a lower computational and storage burden.
The single feature extracted with KOPLS provides performance similar to the first 10 features from KPLS.


Conclusions

We studied the applicability of KOPLS for feature extraction and dimensionality reduction in hyperspectral imaging. Both classification and regression problems were considered.
Unlike KPLS, the proposed KOPLS is optimal in the sense of a minimum quadratic-error approximation of the label matrix.
The method produces results similar to SVM classification and regression machines, with much lower computational cost and memory requirements.
Sparsifying the solution leads to efficient scalability (see paper in the proceedings).

Further work

Theoretical: Inclusion of composite kernels in the formulation. Theoretical characterization of the method (bounds and Rademacher complexities).

Experimental: Use in large-scale remote sensing classification and regression scenarios. Toolbox for non-linear feature extraction in ENVI.


References

J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
G. Camps-Valls, J. L. Rojo and M. Martínez, Kernel Methods in Bioengineering, Signal and Image Processing, Idea Inc., 2007.
R. Rosipal and N. Krämer, “Overview and recent advances in partial least squares,” Subspace, Latent Structure and Feature Selection Techniques, 2006.
J. Arenas-García, K. B. Petersen, and L. K. Hansen, “Sparse kernel orthonormalized PLS for feature extraction in large data sets,” in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. C. Platt, and T. Hofmann, Eds., MIT Press, Cambridge, MA, 2007.
http://www.tsc.uc3m.es/∼jarenas
http://www.uv.es/gcamps/soft.htm
