Feature Extraction from Remote Sensing Data using Kernel Orthonormalized PLS
Jerónimo Arenas-García* and Gustavo Camps-Valls†
* Dept. Signal Theory and Communications, University Carlos III, Spain.
  [email protected], http://www.tsc.uc3m.es/∼jarenas
† Dept. Electronics Engineering, Universitat de València, Spain.
  [email protected], http://www.uv.es/gcamps
Motivation
Feature selection/extraction is essential before classification or regression in remote sensing. A high number of correlated features leads to:
- Collinearity
- Overfitting
- Hughes phenomenon
Projection-based algorithms are commonly deployed. Partial Least Squares (PLS) has received attention because:
1. It is better than PCA, since it considers the output label information.
2. Interpretability ∼ knowledge discovery.
Starting hypothesis
Standard linear PLS is suboptimal in the mean-square-error sense. Optimality is ensured with Orthonormalized PLS (OPLS) (Roweis, 1999). Unfortunately, real problems are commonly non-linear. How can a well-established linear algorithm be made non-linear? Kernel methods can do the job!
Objectives
- Optimality: we focus on the OPLS.
- Kernelization: we present the Kernel Orthonormalized PLS (KOPLS).
- Scalability: we also make the method algorithmically feasible.
We analyze and characterize the method:
1. Theoretically: computational cost, memory, number of projections.
2. Experimentally: toy examples, remote sensing image classification, biophysical parameter estimation.
Notation preliminaries
- Data: $\{x_i, y_i\}_{i=1}^{l}$, $x_i \in \mathbb{R}^N$, $y_i \in \mathbb{R}^M$
- Input data matrix: $X = [x_1, \ldots, x_l]^\top$
- Label matrix: $Y = [y_1, \ldots, y_l]^\top$
- Number of projections: $n_p$
- Projected inputs: $X' = XU$
- Projected outputs: $Y' = YV$
- Projection matrices: $U$ ($N \times n_p$) and $V$ ($M \times n_p$)
- Covariance: $C_{xy} = E\{(x - \mu_x)(y - \mu_y)^\top\} \approx \frac{1}{l} X^\top Y$
- Frobenius norm of a matrix: $\|A\|_F^2 = \sum_{ij} a_{ij}^2$
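A minimal NumPy sketch of these quantities (random data as a stand-in for a real scene; variable names are illustrative, not from the paper):

```python
import numpy as np

# Toy stand-ins for the notation above (l samples, N inputs, M outputs).
l, N, M = 100, 8, 3
X = np.random.randn(l, N)      # input data matrix, rows are x_i
Y = np.random.randn(l, M)      # label matrix, rows are y_i

# Sample estimate of the cross-covariance, C_xy ~ (1/l) X^T Y (after centering).
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)
C_xy = Xc.T @ Yc / l           # N x M matrix

# Squared Frobenius norm ||A||_F^2 = sum_ij a_ij^2.
frob2 = np.sum(C_xy ** 2)
```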
Linear feature extraction: Toy example
Imagine a classification problem in which labels matter (a lot!). “Blind” feature extraction is not a good choice. Let’s see what happens with different methods ...
Linear feature extraction: Principal Component Analysis (PCA)
“Find projections maximizing the variance of the data:”
PCA:
maximize:   $\mathrm{Tr}\{(XU)^\top (XU)\} = \mathrm{Tr}\{U^\top C_{xx} U\}$
subject to: $U^\top U = I$
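As a hedged sketch (not the authors' code), these projections are the top eigenvectors of the sample covariance matrix; `X` is assumed to be the $l \times N$ input matrix:

```python
import numpy as np

def pca_projections(X, n_p):
    """Return U (N x n_p) maximizing Tr{U^T C_xx U} subject to U^T U = I."""
    Xc = X - X.mean(axis=0)                  # center the data
    C_xx = Xc.T @ Xc / Xc.shape[0]           # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C_xx)  # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :n_p]         # top-n_p principal directions
```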
Linear feature extraction: Partial Least Squares (PLS)
“Find directions of maximum covariance between the projected input and output data:”
PLS:
maximize:   $\mathrm{Tr}\{(XU)^\top (YV)\} = \mathrm{Tr}\{U^\top C_{xy} V\}$
subject to: $U^\top U = V^\top V = I$
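A minimal sketch of this formulation via the SVD of $C_{xy}$ (the simultaneous-directions, PLS-SB style solution rather than the iterative NIPALS deflation often also called PLS); names are illustrative:

```python
import numpy as np

def pls_directions(X, Y, n_p):
    """U (N x n_p), V (M x n_p) maximizing Tr{U^T C_xy V} with U^T U = V^T V = I."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    C_xy = Xc.T @ Yc / Xc.shape[0]           # sample cross-covariance
    Uu, s, Vt = np.linalg.svd(C_xy, full_matrices=False)
    return Uu[:, :n_p], Vt[:n_p, :].T        # top-n_p singular vector pairs
```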
Linear feature extraction: Orthonormalized Partial Least Squares (OPLS)
“OPLS chooses the projection U so that X' provides the best approximation of Y in a reduced dimensionality space:”
OPLS:
find:  $U = \arg\min_U \|Y - X'W\|_F^2$
where: $W = (X'^\top X')^{-1} X'^\top Y$
Linear feature extraction: Orthonormalized Partial Least Squares (OPLS)
“... which can be rewritten as:”
OPLS:
maximize:   $\mathrm{Tr}\{U^\top C_{xy} C_{xy}^\top U\}$
subject to: $U^\top C_{xx} U = I$
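With the constraint $U^\top C_{xx} U = I$, this trace maximization is a generalized eigenvalue problem, $C_{xy} C_{xy}^\top u = \lambda C_{xx} u$. A hedged NumPy/SciPy sketch follows; the small ridge term is my own addition for numerical stability, not part of the formulation:

```python
import numpy as np
from scipy.linalg import eigh

def opls_projections(X, Y, n_p, ridge=1e-8):
    """U solving max Tr{U^T C_xy C_xy^T U} subject to U^T C_xx U = I."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    l = X.shape[0]
    C_xx = Xc.T @ Xc / l                       # input covariance
    C_xy = Xc.T @ Yc / l                       # input-output cross-covariance
    A = C_xy @ C_xy.T                          # N x N "signal" matrix
    B = C_xx + ridge * np.eye(X.shape[1])      # metric; small ridge for stability
    w, U = eigh(A, B)                          # generalized eigenproblem A u = w B u
    return U[:, ::-1][:, :n_p]                 # top-n_p generalized eigenvectors
```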
Remarks
Remarks on linear feature extraction for supervised problems
- Feature extraction is important.
- Labels play an important role in feature extraction.
- Traditional PCA fails since labels are ignored.
- Traditional PLS does a good, yet suboptimal, job.
- Orthonormalized PLS excels at linear feature extraction.
Linear vs. Non-linear feature extraction
Linear feature extraction: advantages
- Simplicity: easy to understand.
- Leads to convex optimization problems.
Linear feature extraction: drawbacks
- Lack of expressive power.
- Unsuitable for non-linear problems.
Linear vs. Non-linear feature extraction
[Figure: original data vs. PCA projections]
Linear vs. Non-linear feature extraction
[Figure: original data vs. OPLS projections]
Kernel methods for non-linear feature extraction

Kernel methods
[Figure: the map Φ takes data from the input feature space to the kernel feature space]
1. Map the data to an ∞-dimensional feature space, H.
2. Solve a linear problem there.
Kernel trick
There is no need to know the ∞-dimensional coordinates of each mapped sample $\phi(x_i)$. Kernel trick: “if an algorithm can be expressed in the form of dot products, its non-linear (kernel) version only needs the dot products among mapped samples, the so-called kernel function:”
$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$
Using this trick, we can implement K-PCA, K-PLS, K-OPLS, etc.
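For instance, with the RBF kernel used in the experiments later, the matrix of pairwise dot products in H is computed directly from the inputs. A small sketch (my own helper names), including the centering used by the kernel formulations below:

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    """K[i, j] = <phi(x1_i), phi(x2_j)> = exp(-||x1_i - x2_j||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                + np.sum(X2 ** 2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)              # pairwise squared distances
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def center_kernel(K):
    """Center a square training kernel, as if the phi(x_i) had zero mean in H."""
    l = K.shape[0]
    H = np.eye(l) - np.ones((l, l)) / l          # centering matrix
    return H @ K @ H
```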
Kernel PLS

Notation
- Data: $\{\phi(x_i), y_i\}_{i=1}^{l}$
- Mapping: $\phi(x): \mathbb{R}^N \to \mathcal{H}$
- Mapped inputs matrix: $\Phi = [\phi(x_1), \ldots, \phi(x_l)]^\top$
- Output matrix: $Y = [y_1, \ldots, y_l]^\top$
- Number of projections: $n_p$
- Projections of mapped inputs: $\Phi' = \Phi U$
- Projections of outputs: $Y' = YV$
- Projection matrices: $U$ ($\dim(\mathcal{H}) \times n_p$) and $V$ ($M \times n_p$)
Formulation
“The objective of KPLS is to find directions of maximum covariance:”
KPLS:
maximize:   $\mathrm{Tr}\{U^\top \tilde{\Phi}^\top \tilde{Y} V\}$
subject to: $U^\top U = V^\top V = I$
where $\tilde{\Phi}$ and $\tilde{Y}$ are centered versions of $\Phi$ and $Y$, respectively. Only a matrix of inner products of the patterns in $\mathcal{H}$ is needed (Shawe-Taylor, 2004).
Kernel Orthonormalized PLS

Formulation of the KOPLS
“The objective of KOPLS is:”
KOPLS:
maximize:   $\mathrm{Tr}\{U^\top \tilde{\Phi}^\top \tilde{Y} \tilde{Y}^\top \tilde{\Phi} U\}$
subject to: $U^\top \tilde{\Phi}^\top \tilde{\Phi} U = I$
The features derived from KOPLS are optimal (in the MSE sense).

Kernel trick for the KOPLS
All projection vectors (the columns of $U$) can be expressed as a linear combination of the training data, $U = \tilde{\Phi}^\top A$. The maximization problem is reformulated as:
KOPLS:
maximize:   $\mathrm{Tr}\{A^\top K_x K_y K_x A\}$
subject to: $A^\top K_x K_x A = I$
Centered kernel matrices: $K_x = \tilde{\Phi}\tilde{\Phi}^\top$ and $K_y = \tilde{Y}\tilde{Y}^\top$.
This is a generalized eigenproblem: $K_x K_y K_x \alpha = \lambda K_x K_x \alpha$.
$K_x$ and $K_y$ can be approximated without computing and storing the whole matrices.
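A hedged sketch of this dual problem using SciPy's generalized symmetric eigensolver; the ridge added to $K_x K_x$ is my own regularization for numerical stability and not part of the formulation:

```python
import numpy as np
from scipy.linalg import eigh

def kopls_coefficients(Kx, Ky, n_p, ridge=1e-8):
    """A (l x n_p) solving max Tr{A^T Kx Ky Kx A} subject to A^T Kx Kx A = I.

    Kx, Ky are the centered kernel matrices Kx = Phi~ Phi~^T and Ky = Y~ Y~^T.
    """
    left = Kx @ Ky @ Kx                              # Kx Ky Kx
    right = Kx @ Kx + ridge * np.eye(Kx.shape[0])    # Kx Kx, regularized
    w, A = eigh(left, right)                         # generalized eigenproblem
    return A[:, ::-1][:, :n_p]                       # top-n_p eigenvectors

# Projections of new data: given a centered test kernel K_test (n_test x l)
# against the training samples, the extracted features are K_test @ A.
```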
An illustrative example (cont’d)
[Figure: original data vs. PCA projections]
An illustrative example (cont’d)
[Figure: original data vs. OPLS projections]
An illustrative example (cont’d)
[Figure: original data vs. KPCA projections]
An illustrative example (cont’d)
[Figure: original data vs. KOPLS projections]
Remarks
Remarks on non-linear feature extraction
- Linear methods such as PCA, PLS or OPLS are not suitable for non-linear classification/regression tasks.
- Non-linear versions of these algorithms are readily obtained by applying the kernel trick.
- KPLS and KOPLS consider labels for the derivation of the projection vectors, thus outperforming KPCA.
- KOPLS inherits mean-square-error optimality from its linear counterpart.

Methods characterization (see paper in the proceedings)

         Kernel size   Storage   Max. n_p
KOPLS    l × l         O(l²)     min{rank(Φ), rank(Y)}
KPLS     l × l         O(l²)     rank(Φ)
Experiment 1: Classification of LandSat images

Data collection
- LandSat image, 82×100 pixels with a spatial resolution of 80 m × 80 m.
- Six classes: red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble and very damp grey soil.
- Contextual information: stack neighbouring pixels in 3×3 windows → high-dimensional and redundant feature vectors!
- Training: 4435 samples. Testing: 2000 samples.

Experimental setup
- Methods: linear OPLS, KPLS and KOPLS.
- RBF kernel: $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$
- 10-fold cross-validation on the training set to estimate $\sigma$.
- Classification procedure (a sketch of steps 3-4 follows below):
  1. Extract $n_p$ projections ($n_p <$ rank($Y$) for the KOPLS).
  2. Project test data.
  3. Linear discriminant with the pseudoinverse of the projected data.
  4. Winner-takes-all.
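A minimal sketch of steps 3-4, assuming the projected training and test features are already available and the labels are one-hot encoded (function and variable names are my own, not the authors' code):

```python
import numpy as np

def pinv_classifier(Zp_train, Y_train, Zp_test):
    """Linear discriminant via pseudoinverse + winner-takes-all decision.

    Zp_train / Zp_test: projected features (e.g. K_train @ A and K_test @ A for KOPLS).
    Y_train: 1-of-C (one-hot) label matrix of the training samples.
    """
    W = np.linalg.pinv(Zp_train) @ Y_train    # least-squares mapping to class scores
    scores = Zp_test @ W                      # class scores for the test samples
    return np.argmax(scores, axis=1)          # winner-takes-all class index
```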
Experiment 1: Classification of LandSat images
Accuracy and feature expression
[Figure: overall accuracy (%) vs. number of features. Left: linear OPLS vs. KOPLS (1 to 5 features). Right: KOPLS vs. KPLS (1 to 100 features, log scale).]
- The non-linear method provides a better representation of the discriminative information.
- KOPLS performance with only 5 features is 91%; KPLS needs 100 features to achieve similar performance.
Conclusions:
1. Non-linear OPLS methods provide much better results.
2. KOPLS yields features which contain more discriminative information.
Experiment 2: Oceanic chlorophyll concentration
Data collection
- “Modeling the non-linear relationship between chlorophyll concentration and marine reflectance.”
- SeaBAM dataset (O’Reilly, 1998): 919 in-situ pigment measurements around the United States and Europe.
- Training: 460 samples. Testing: 460 samples.

Experimental setup
- Methods: linear OPLS, KPLS and KOPLS.
- RBF kernel: $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$
- Leave-one-out root mean square error (LOO-RMSE) to validate the model; $\sigma$ tuned in the range $[10^{-2}, 10^{4}]$.
- $n_p = \mathrm{rank}(Y) = 1$ for the KOPLS.
Experiment 2: Oceanic chlorophyll concentration
Accuracy and feature expression

Model            ME       RMSE     MAE      r
OPLS             -0.034   0.257    0.188    0.903
KPLS, n_p = 1     0.042   0.366    0.278    0.790
KPLS, n_p = 5    -0.013   0.189    0.140    0.947
KPLS, n_p = 10   -0.013   0.149    0.115    0.968
KPLS, n_p = 20   -0.009   0.138    0.106    0.972
KOPLS, n_p = 1   -0.015   0.154    0.111    0.967
- Linear OPLS performs poorly, as the linear assumption does not hold.
- KPLS and the proposed KOPLS show a clear improvement in both accuracy and bias compared to linear OPLS.
- KPLS and KOPLS show accuracy similar to SVR and outperform it in bias, with a lower computational and storage burden.
- The single feature extracted with KOPLS provides performance similar to the first 10 features from KPLS.
Conclusions
- We studied the applicability of KOPLS for feature extraction and dimensionality reduction in hyperspectral imaging. Both classification and regression problems were considered.
- Unlike KPLS, the proposed KOPLS is optimal in the sense of a minimum quadratic error approximation of the label matrix.
- The method produces results similar to SVM classification and regression machines, with much lower computational cost and memory requirements.
- Sparsifying the solution leads to efficient scalability (see paper in the proceedings).

Further work
- Theoretical: inclusion of composite kernels in the formulation; theoretical characterization of the method (bounds and Rademacher complexities).
- Experimental: use in large-scale remote sensing classification and regression scenarios; a toolbox for non-linear feature extraction in ENVI.
References
- J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
- G. Camps-Valls, J. L. Rojo and M. Martinez, Kernel Methods in Bioengineering, Signal and Image Processing, Idea Inc., 2007.
- R. Rosipal and N. Kramer, “Overview and recent advances in partial least squares,” Subspace, Latent Structure and Feature Selection Techniques, 2006.
- J. Arenas-García, K. B. Petersen, and L. K. Hansen, “Sparse kernel orthonormalized PLS for feature extraction in large data sets,” in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. C. Platt, and T. Hofmann, Eds. MIT Press, Cambridge, MA, 2007.
- http://www.tsc.uc3m.es/∼jarenas
- http://www.uv.es/gcamps/soft.htm