International Journal of Wavelets, Multiresolution and Information Processing
Vol. 7, No. 6 (2009) 781–801
© World Scientific Publishing Company
REGULARIZED LEAST SQUARE REGRESSION WITH SPHERICAL POLYNOMIAL KERNELS
LUOQING LI
Faculty of Mathematics and Computer Science, Hubei University,
Wuhan 430062, P. R. China
[email protected]

Received 22 December 2008
Revised 6 August 2009

In commemoration of the 80th birthday of Professor Yongsheng Sun

This article considers regularized least square regression on the sphere. It develops a theoretical analysis of the generalization performance of the regularized least square regression algorithm with spherical polynomial kernels. Explicit bounds are derived for the excess risk. The learning rates depend on the eigenvalues of spherical polynomial integral operators and on the dimension of spherical polynomial spaces.

Keywords: Regularized least square regression; spherical polynomial kernel; reproducing kernel Hilbert space; de la Vallée-Poussin operator; learning rate.

AMS Subject Classification: 68T05, 62J02
1. Introduction

Learning from examples can be regarded as the regression problem of estimating an unknown functional dependency given only a finite (possibly small) number of instances. Much work has been done to understand how high-performance learning may be achieved.2,7–10,19 The seminal work of Vapnik23 shows that the key to solving this problem effectively is to control the complexity of the solution. In the context of statistical learning, this leads to techniques known as regularization networks14 or regularized kernel methods.20,23 The regularized least square (RLS) algorithms form an important part of learning theory and have already been extensively studied in the statistical learning literature.11,14 The error analysis and the choice of the regularization parameter for the RLS algorithms have undergone in-depth study in recent years.24,12 We refer to Ref. 13 for some recent developments in this respect.

We assume that the input space $X$ is a compact subset of $\mathbb{R}^n$ and the output $Y$ is contained in $[-M, M] \subset \mathbb{R}$. The assumption of compactness is made for technical reasons
and simplifies some of the proofs. We let $\rho$ be the unknown probability distribution on $Z = X \times Y$ describing the relation between $x \in X$ and $y \in Y$, and $\rho_X$ be the marginal probability distribution of $\rho$ on $X$. For every $x \in X$, let $\rho(y \mid x)$ be the conditional (with respect to $x$) probability distribution on $Y$. Notice that $\rho$, $\rho(y \mid x)$ and $\rho_X(x)$ are related via $\rho(x, y) = \rho(y \mid x)\rho_X(x)$. Given a measurable function $f : X \to Y$, the ability of $f$ to describe the distribution $\rho$ is measured by its expected risk, defined as

\[
\mathcal{E}(f) = \mathcal{E}_\rho(f) = \int_Z (y - f(x))^2 \, d\rho(x, y). \tag{1.1}
\]

The regression function $f_\rho$ of $\rho$ is defined as

\[
f_\rho(x) = \int_Y y \, d\rho(y \mid x), \qquad x \in X. \tag{1.2}
\]
It is well known that the regression function $f_\rho$ is the minimizer of the expected risk over the set of all measurable functions, and it always exists since $Y$ is compact.12,15,23 Usually, the regression function cannot be reconstructed exactly, since we are given only a finite, possibly small, set of random examples from $Z$. A well-known learning algorithm is the empirical risk minimization (ERM) algorithm.23 The central question of ERM theory is whether the expected risk of the minimizer of the empirical risk over the hypothesis space is close to the expected risk of $f_\rho$. One expects the minimizer to be a good approximation of $f_\rho$ in a certain sense. However, the problem of approximating a function from sparse data is ill-posed. To overcome this problem, classical regularization theory can be applied. Here we will consider the regularized least square algorithm.
2. The Regularized Least Square Algorithms

The hypothesis space we consider for the RLS algorithm is a reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel. We usually call a symmetric, positive semidefinite, continuous function $K : X \times X \to \mathbb{R}$ a Mercer kernel.1 Typical examples of Mercer kernels include the Gaussian, polynomial, and spline kernels. The RLS algorithm we investigate in this article is a Tikhonov regularization scheme associated with a polynomial Mercer kernel on the sphere.

The RKHS $\mathcal{H}_K$ associated with the kernel $K$ is defined to be the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_{x'} \rangle_K = K(x, x')$. The reproducing property

\[
f(x) = \langle f, K_x \rangle_K, \qquad \forall x \in X, \ f \in \mathcal{H}_K,
\]

follows from the definition. The reproducing property together with the Schwarz inequality yields that $|f(x)| \le \sqrt{K(x, x)}\, \|f\|_K$. Then $\|f\|_\infty \le \kappa \|f\|_K$, where $\kappa := \sup_{x \in X} \sqrt{K(x, x)}$. So $\mathcal{H}_K$ can be embedded into the space of bounded functions.
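As a small illustration (not part of the original paper), the pointwise bound $|f(x)| \le \sqrt{K(x,x)}\,\|f\|_K$ can be checked numerically for a function in the span of kernel sections, using $\|f\|_K^2 = c^\top \mathbf{K} c$ for $f = \sum_i c_i K_{x_i}$. The sketch below assumes NumPy; the Gaussian kernel, the sample points and all parameter values are arbitrary choices made only for this check.

import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(A, B, sigma=0.5):
    # K(x, x') = exp(-|x - x'|^2 / (2 sigma^2)), a Mercer kernel with K(x, x) = 1.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = rng.uniform(size=(30, 2))        # centers x_i
c = rng.normal(size=30)              # coefficients of f = sum_i c_i K(x_i, .)
K = gaussian_kernel(X, X)
f_norm = np.sqrt(c @ K @ c)          # ||f||_K = sqrt(c^T K c)

T = rng.uniform(size=(1000, 2))      # evaluation points
f_vals = gaussian_kernel(T, X) @ c   # f(t) = sum_i c_i K(x_i, t)
kappa = 1.0                          # kappa = sup_x sqrt(K(x, x)) = 1 for this kernel
print(np.abs(f_vals).max() <= kappa * f_norm + 1e-9)   # the reproducing-property bound holds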
Given a training sample set $z := \{z_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m \in Z^m$, the corresponding empirical error for the expected risk (1.1) takes the form

\[
\mathcal{E}_z(f) = \frac{1}{m} \sum_{i=1}^m (y_i - f(x_i))^2. \tag{2.1}
\]
The RLS algorithm associated with the RKHS $\mathcal{H}_K$ is defined to be the minimizer of the following optimization problem:

\[
f_z = \arg\min_{f \in \mathcal{H}_K} \{\mathcal{E}_z(f) + \lambda \|f\|_K^2\}, \tag{2.2}
\]

where $\lambda > 0$ is the regularization parameter.

We denote by $\Omega^n$ the unit spherical surface of $\mathbb{R}^n$, i.e. $\Omega^n := \{x = (x_1, \ldots, x_n) \in \mathbb{R}^n : x_1^2 + \cdots + x_n^2 = 1\}$. The polynomial kernel used in this article is defined by

\[
K_d(x, x') = (1 + x \cdot x')^d, \qquad x, x' \in \Omega^n, \tag{2.3}
\]
where $d$ is the degree of the kernel polynomial and $x \cdot x'$ is the Euclidean inner product of $x$ and $x'$ in $\mathbb{R}^n$. We know from Refs. 23 and 11 that $K_d$ is a Mercer kernel and the corresponding RKHS $\mathcal{H}_{K_d}$ is the set of polynomials of degree $d$ in $n$ variables. In this article, we only consider the input space $X = \Omega^n$. We focus on bounding the excess error

\[
\mathcal{E}(f_z) - \mathcal{E}(f_\rho) = \|f_z - f_\rho\|_{L^2_{\rho_X}}^2 \tag{2.4}
\]

for a minimizer $f_z$ of the optimization problem

\[
f_z = f_{z,\lambda,d} = \arg\min_{f \in \mathcal{H}_{K_d}} \left\{ \frac{1}{m} \sum_{i=1}^m (y_i - f(x_i))^2 + \lambda \|f\|_{K_d}^2 \right\} \tag{2.5}
\]

involving a set of random samples $z = \{(x_i, y_i)\}_{i=1}^m \in (\Omega^n \times Y)^m$ drawn independently according to $\rho$. The learning rates of the RLS algorithms were derived by Wu, Ying and Zhou in Ref. 24, and regularized classifiers (including the SVM) involving RKHSs associated with univariate and multivariate polynomial kernels were studied by Zhou and Jetter27 and by Tong, Chen and Peng,21 respectively. Concerning learning on the sphere, we refer to Ref. 18.

The rest of this article is organized as follows. Section 3 reviews some basic knowledge of spherical harmonics and the approximation behavior of the de la Vallée-Poussin operators on the sphere. Reproducing kernel Hilbert spaces on the sphere are introduced in Sec. 4. Sections 5 and 6 mainly concern bounding the excess approximation and sample errors. In Sec. 7, an upper bound for the excess error is presented and explicit learning rates are derived. A conclusion is given in Sec. 8.
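As a computational aside (not part of the original analysis), the minimizer (2.5) admits the standard kernel ridge regression form $f_z = \sum_{i=1}^m c_i K_d(x_i, \cdot)$ with coefficients obtained by solving the linear system $(\mathbf{K} + \lambda m I)c = y$, where $\mathbf{K}$ is the $m \times m$ kernel matrix. The following sketch assumes NumPy; the synthetic data, sphere dimension, kernel degree and regularization parameter are arbitrary choices made only for illustration.

import numpy as np

def spherical_poly_kernel(X1, X2, d):
    # K_d(x, x') = (1 + x . x')^d for points on the unit sphere (kernel (2.3)).
    return (1.0 + X1 @ X2.T) ** d

def rls_fit(X, y, d, lam):
    # Solve (K + lam * m * I) c = y, the linear system giving the minimizer of (2.5).
    m = X.shape[0]
    K = spherical_poly_kernel(X, X, d)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def rls_predict(X_train, c, X_new, d):
    # Evaluate f_z(x) = sum_i c_i K_d(x_i, x) at new points.
    return spherical_poly_kernel(X_new, X_train, d) @ c

rng = np.random.default_rng(0)
m, n, d, lam = 200, 3, 5, 1e-3
X = rng.normal(size=(m, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # project the inputs onto the unit sphere
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=m)   # noisy bounded targets
c = rls_fit(X, y, d, lam)
print(rls_predict(X, c, X[:5], d))                 # fitted values at the first five samples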
3. Spherical Harmonics and de la Vallée-Poussin Operators

Let $f$ be a harmonic homogeneous polynomial of degree $k$ in $n$ variables. The restriction of $f$ to $\Omega^n$ is called a spherical harmonic of degree $k$ in $n$ variables. We denote by $\mathcal{H}_k^n(\Omega^n)$ the set of spherical harmonics of degree $k$ in $n$ variables. By $C(\Omega^n)$ we denote the space of continuous, real-valued functions endowed with the uniform norm. By $L^\infty_\mu(\Omega^n)$ and $L^p_\mu(\Omega^n)$, $1 \le p < \infty$, we denote the space of all essentially bounded functions and the space of (the equivalence classes of) $p$-integrable functions on $\Omega^n$, endowed with the respective norms

\[
\|f\|_\infty := \operatorname*{ess\,sup}_{x \in \Omega^n} |f(x)| \quad \text{and} \quad \|f\|_{L^p_\mu} := \left( \int_{\Omega^n} |f(x)|^p \, d\mu(x) \right)^{1/p}, \qquad 1 \le p < \infty,
\]

where $d\mu$ denotes the surface measure element on $\Omega^n$. Let

\[
L^2_\mu(\Omega^n) = \bigoplus_{k=0}^\infty \mathcal{H}_k^n(\Omega^n)
\]

be the decomposition of the space $L^2_\mu(\Omega^n)$ into a direct orthogonal sum of the finite-dimensional $SO(n)$-invariant and -irreducible subspaces $\mathcal{H}_k^n(\Omega^n)$. The subspaces $\mathcal{H}_k^n(\Omega^n)$ are the eigenspaces of the Laplace–Beltrami operator $\Delta$ on the sphere corresponding to the eigenvalues $-k(k + 2\alpha)$, $2\alpha = n - 2$:

\[
\mathcal{H}_k^n(\Omega^n) = \{ f \in C^\infty(\Omega^n) : \Delta f = -k(k + 2\alpha) f \}.
\]

The orthogonal projection $Y_k : L^2_\mu(\Omega^n) \to \mathcal{H}_k^n(\Omega^n)$ is given by26

\[
Y_k(f; x) := \frac{\Gamma(\alpha)(k + \alpha)}{2\pi^{\alpha+1}} \int_{\Omega^n} P_k^\alpha(x \cdot x') f(x') \, d\mu(x'),
\]

with $P_k^\alpha$, $\alpha = (n-2)/2$, being the ultraspherical (or Gegenbauer) polynomials defined by the generating equation

\[
(1 - 2r\cos\theta + r^2)^{-\alpha} = \sum_{k=0}^\infty r^k P_k^\alpha(\cos\theta), \qquad 0 \le \theta \le \pi,
\]

and $x \cdot x'$ the Euclidean inner product of $x$ and $x'$ in $\mathbb{R}^n$. To a function $f$ in $L^1_\mu(\Omega^n)$ we associate its spherical harmonic expansion

\[
f \sim \sum_{k=0}^\infty Y_k(f).
\]
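As a numerical aside (not in the original paper), the generating equation above can be checked with SciPy's Gegenbauer polynomials $C_k^\alpha$; the truncation length and the values of $\alpha$, $r$, $\theta$ below are arbitrary choices.

import numpy as np
from scipy.special import eval_gegenbauer

# Verify (1 - 2 r cos(theta) + r^2)^(-alpha) = sum_k r^k C_k^alpha(cos(theta))
# for a sample choice of parameters (alpha = (n - 2)/2 with n = 4).
alpha, r, theta = 1.0, 0.3, 0.7
t = np.cos(theta)
lhs = (1.0 - 2.0 * r * t + r ** 2) ** (-alpha)
rhs = sum(r ** k * eval_gegenbauer(k, alpha, t) for k in range(60))
print(lhs, rhs)   # the truncated series matches the closed form to high accuracy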
For $f \in L^1_\mu(\Omega^n)$, by using the most common definition, we define the spherical means by the shift operator $S_\gamma$:

\[
S_\gamma f(x) := \frac{1}{|\Omega^{n-1}| \sin^{2\alpha}\gamma} \int_{x \cdot x' = \cos\gamma} f(x') \, d\mu(x'), \qquad 0 < \gamma < \pi.
\]
In this formula, $|\Omega^{n-1}|$ denotes the $(n-2)$-dimensional surface area of the unit sphere in $\mathbb{R}^{n-1}$, and we integrate over the set of points $x'$ in $\Omega^n$ whose spherical distance from the given point $x \in \Omega^n$ is equal to $\gamma$. Clearly, $d\mu(x')$ equals $\sin^{2\alpha}\gamma$ times the surface measure element of the sphere $\Omega^{n-1}$. Let $\Omega_x^\perp = \{x' \in \Omega^n : x \cdot x' = 0\}$ denote the equator in $\Omega^n$ with respect to $x$, $\Omega_x^\perp$ being isomorphic to $\Omega^{n-1}$. If $x'$ is the point in $\Omega_x^\perp$ where the arc of the great circle emanating from $x$ in the direction of $y$ intersects the equator, then the point $y$ can be written as $y = x\cos\gamma + x'\sin\gamma$, and consequently the spherical mean as

\[
S_\gamma f(x) = \frac{1}{|\Omega^{n-1}|} \int_{\Omega_x^\perp} f(x\cos\gamma + x'\sin\gamma) \, d\mu_{n-1}(x'), \qquad 0 < \gamma < \pi,
\]

where $d\mu_{n-1}(x')$ denotes the surface measure element of the sphere $\Omega^{n-1}$. The properties of the spherical means are well known.4,26 We note, in particular, the series expansion

\[
S_\gamma f \sim \sum_{k=0}^\infty \frac{P_k^\alpha(\cos\gamma)}{P_k^\alpha(1)} Y_k f, \qquad \forall f \in L^1_\mu(\Omega^n),
\]

and, for all $f \in L^p_\mu(\Omega^n)$,

\[
\|S_\gamma f\|_{L^p_\mu} \le \|f\|_{L^p_\mu} \quad \text{and} \quad \lim_{\gamma \to 0} \|S_\gamma f - f\|_{L^p_\mu} = 0.
\]

The spherical modulus of smoothness is then defined by

\[
\omega(f, \delta; L^p_\mu) := \sup_{0 < \gamma \le \delta} \|S_\gamma f - f\|_{L^p_\mu}.
\]

For $R > 0$, we denote by $B_R$ the closed ball of $\mathcal{H}_{K_d}(\Omega^n)$ with radius $R$ centered at the origin:

\[
B_R = \{ f \in \mathcal{H}_{K_d}(\Omega^n) : \|f\|_{K_d} \le R \}. \tag{6.3}
\]
From Lemma 6.1, we have the following conclusion.

Lemma 6.2. For $R > 0$ and $\eta > 0$, we have

\[
\log \mathcal{N}(B_R, \eta) \le N \log\frac{4R}{\eta}, \tag{6.4}
\]

where $N$ is the dimension of $\mathcal{H}_{K_d}(\Omega^n)$.

For the RKHS $\mathcal{H}_{K_d}(\Omega^n)$, we know from (4.1) that

\[
N = \dim(\mathcal{H}_{K_d}(\Omega^n)) \le 2d^n. \tag{6.5}
\]

In addition, for any $f \in B_R$, the reproducing property yields

\[
\|f\|_\infty \le \sup_{x \in \Omega^n} \sqrt{K_d(x, x)}\, \|f\|_{K_d} \le 2^{d/2} R. \tag{6.6}
\]

In order to bound $\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E(\xi_2)$, we need the one-sided Bernstein inequality.11
Lemma 6.3. Let $\xi$ be a random variable on a probability space $Z$ with mean $E(\xi)$, variance $\sigma^2(\xi) = \sigma^2$, and satisfying $|\xi(z) - E(\xi)| \le M_\xi$ for almost all $z \in Z$. Then for all $\varepsilon > 0$,

\[
\operatorname{Prob}_{z \in Z^m}\left\{ \frac{1}{m}\sum_{i=1}^m \xi(z_i) - E(\xi) \ge \varepsilon \right\} \le \exp\left( -\frac{m\varepsilon^2}{2\left(\sigma^2 + \frac{1}{3} M_\xi \varepsilon\right)} \right).
\]
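To illustrate Lemma 6.3 (this simulation is not part of the paper; the distribution and parameter values are arbitrary), one can compare an empirical tail probability with the one-sided Bernstein bound:

import numpy as np

rng = np.random.default_rng(1)
m, eps, trials = 200, 0.1, 20000
# xi uniform on [-1, 1]: E(xi) = 0, sigma^2 = 1/3, |xi - E(xi)| <= M_xi = 1.
samples = rng.uniform(-1.0, 1.0, size=(trials, m))
empirical = np.mean(samples.mean(axis=1) >= eps)   # Prob{ (1/m) sum xi - E(xi) >= eps }
sigma2, M_xi = 1.0 / 3.0, 1.0
bound = np.exp(-m * eps ** 2 / (2.0 * (sigma2 + M_xi * eps / 3.0)))
print(empirical, bound)   # the empirical tail probability stays below the Bernstein bound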
Lemma 6.4. For every $0 < \delta < 1$, with probability greater than $1 - \delta$, there holds

\[
\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E(\xi_2) \le \|V_d(f_\rho) - f_\rho\|_{L^2_{\rho_X}}^2 + \frac{56M^2}{3m}\log\frac{1}{\delta}.
\]

Proof. For simplicity we set $f_d = V_d(f_\rho)$. Observe that

\[
\xi_2 = (f_d(x) - y)^2 - (f_\rho(x) - y)^2 = (f_d(x) - f_\rho(x))\{(f_d(x) - y) + (f_\rho(x) - y)\}.
\]

Since $|f_\rho(x)| \le M$ almost everywhere, we have $\|f_d\|_\infty \le M$ and

\[
|\xi_2| \le (\|f_d\|_\infty + M)(\|f_d\|_\infty + 3M) \le 8M^2.
\]

Hence $|\xi_2 - E(\xi_2)| \le 16M^2$. Moreover we have

\[
E(\xi_2^2) = E\big( (f_d(x) - f_\rho(x))^2 \cdot ((f_d(x) - y) + (f_\rho(x) - y))^2 \big) \le 16M^2 \|f_d - f_\rho\|_{L^2_{\rho_X}}^2,
\]

which implies that $\sigma^2(\xi_2) \le E(\xi_2^2) \le 16M^2 \|f_d - f_\rho\|_{L^2_{\rho_X}}^2$.

Now we apply Lemma 6.3 to $\xi_2$. It asserts that for any $t > 0$,

\[
\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E(\xi_2) \le t
\]

with confidence at least

\[
1 - \exp\left( -\frac{mt^2}{2\left(\sigma^2(\xi_2) + \frac{1}{3}\,16M^2 t\right)} \right) \ge 1 - \exp\left( -\frac{mt^2}{32M^2\left(\|f_d - f_\rho\|_{L^2_{\rho_X}}^2 + \frac{t}{3}\right)} \right).
\]

Let $t^*$ be the unique positive solution of the quadratic equation

\[
-\frac{mt^2}{32M^2\left(\|f_d - f_\rho\|_{L^2_{\rho_X}}^2 + \frac{t}{3}\right)} = \log\delta.
\]

Then, with confidence $1 - \delta$, there holds

\[
\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E(\xi_2) \le t^*.
\]

An elementary calculation yields
\[
\begin{aligned}
t^* &= \frac{16M^2}{3m}\log\frac{1}{\delta} + \frac{1}{m}\sqrt{\left(\frac{16M^2}{3}\log\frac{1}{\delta}\right)^2 + 32mM^2\|f_d - f_\rho\|_{L^2_{\rho_X}}^2 \log\frac{1}{\delta}} \\
&\le \frac{16M^2}{3m}\log\frac{1}{\delta} + \frac{16M^2}{3m}\log\frac{1}{\delta} + \frac{1}{m}\sqrt{32mM^2\|f_d - f_\rho\|_{L^2_{\rho_X}}^2 \log\frac{1}{\delta}} \\
&= \frac{32M^2}{3m}\log\frac{1}{\delta} + \sqrt{\frac{32M^2}{m}\|f_d - f_\rho\|_{L^2_{\rho_X}}^2 \log\frac{1}{\delta}} \\
&\le \frac{32M^2}{3m}\log\frac{1}{\delta} + \frac{8M^2}{m}\log\frac{1}{\delta} + \|f_d - f_\rho\|_{L^2_{\rho_X}}^2 \\
&\le \frac{56M^2}{3m}\log\frac{1}{\delta} + \|f_d - f_\rho\|_{L^2_{\rho_X}}^2.
\end{aligned}
\]

This implies the desired estimate.

We now estimate the sample error involving the sample $z$ through $f_z$. We will use the idea of empirical risk minimization to bound the first term in (6.2) by means of a covering number.

Lemma 6.5. For all $\varepsilon > 0$ and $R \ge M$,

\[
\operatorname{Prob}_{z \in Z^m}\left\{ \sup_{f \in B_R} \frac{\{\mathcal{E}(\pi(f)) - \mathcal{E}(f_\rho)\} - \{\mathcal{E}_z(\pi(f)) - \mathcal{E}_z(f_\rho)\}}{\sqrt{\{\mathcal{E}(\pi(f)) - \mathcal{E}(f_\rho)\} + \varepsilon}} \le \sqrt{\varepsilon} \right\} \ge 1 - \mathcal{N}\left(B_1, \frac{\varepsilon}{2^{d/2}\,4MR}\right)\exp\left(-\frac{3m\varepsilon}{32M^2}\right). \tag{6.7}
\]
Proof. The following ratio probability inequality (Lemma 2 in Ref. 25) is a standard result in learning theory (see, e.g., Ref. 27). It deals with variances for a whole function class, since the Bernstein inequality takes care of the variance well only for a single random variable.

Let $\mathcal{G}$ be a set of functions on $Z$ such that, for some $c \ge 0$, $|g - E(g)| \le B$ almost everywhere and $E(g^2) \le c E(g)$ for each $g \in \mathcal{G}$. Then, for every $\varepsilon > 0$,

\[
\operatorname{Prob}_{z \in Z^m}\left\{ \sup_{g \in \mathcal{G}} \frac{E(g) - \frac{1}{m}\sum_{i=1}^m g(z_i)}{\sqrt{E(g) + \varepsilon}} \ge \sqrt{\varepsilon} \right\} \le \mathcal{N}(\mathcal{G}, \varepsilon)\exp\left( -\frac{m\varepsilon}{2c + 2B/3} \right), \tag{6.8}
\]

where $\mathcal{N}(\mathcal{G}, \varepsilon)$ denotes the covering number of the set $\mathcal{G}$ under the uniform norm.

We consider the set

\[
\mathcal{F}_R = \{ (\pi(f)(x) - y)^2 - (f_\rho(x) - y)^2 : f \in B_R \}, \qquad R > 0.
\]

Let $g \in \mathcal{F}_R$. Then $g$ has the form $g(z) = (\pi(f)(x) - y)^2 - (f_\rho(x) - y)^2$ for some $f \in B_R$. It is easy to see that

\[
E(g) = \mathcal{E}(\pi(f)) - \mathcal{E}(f_\rho) \ge 0, \qquad \frac{1}{m}\sum_{i=1}^m g(z_i) = \mathcal{E}_z(\pi(f)) - \mathcal{E}_z(f_\rho).
\]
Since $|\pi(f)| \le M$ and $|f_\rho(x)| \le M$ almost everywhere, we find that

\[
|g(z)| = |(\pi(f)(x) - f_\rho(x))((\pi(f)(x) - y) + (f_\rho(x) - y))| \le 8M^2.
\]

It follows that $|g(z) - E(g)| \le 16M^2$ almost everywhere and

\[
E(g^2) = E\big( (\pi(f)(x) - f_\rho(x))^2 ((\pi(f)(x) - y) + (f_\rho(x) - y))^2 \big) \le 16M^2 \|\pi(f) - f_\rho\|_{L^2_{\rho_X}}^2 = 16M^2 E(g).
\]

We apply (6.8) to the set of functions $\mathcal{F}_R$ and obtain that

\[
\sup_{f \in B_R} \frac{\{\mathcal{E}(\pi(f)) - \mathcal{E}(f_\rho)\} - \{\mathcal{E}_z(\pi(f)) - \mathcal{E}_z(f_\rho)\}}{\sqrt{\{\mathcal{E}(\pi(f)) - \mathcal{E}(f_\rho)\} + \varepsilon}} \le \sup_{g \in \mathcal{F}_R} \frac{E(g) - \frac{1}{m}\sum_{i=1}^m g(z_i)}{\sqrt{E(g) + \varepsilon}} \le \sqrt{\varepsilon}
\]

with confidence at least

\[
1 - \mathcal{N}(\mathcal{F}_R, \varepsilon)\exp\left( -\frac{3m\varepsilon}{32M^2} \right).
\]

Observe that for $g_1, g_2 \in \mathcal{F}_R$ there exist $f_1, f_2 \in B_R$ such that

\[
g_j(z) = (\pi(f_j)(x) - y)^2 - (f_\rho(x) - y)^2, \qquad j = 1, 2.
\]

It follows that

\[
\begin{aligned}
|g_1(z) - g_2(z)| &= |(\pi(f_1)(x) - y)^2 - (\pi(f_2)(x) - y)^2| \\
&= |\pi(f_1)(x) - \pi(f_2)(x)| \cdot |\pi(f_1)(x) + \pi(f_2)(x) - 2y| \\
&\le 4M \|\pi(f_1) - \pi(f_2)\|_\infty \le 4M \|f_1 - f_2\|_\infty \le 2^{d/2}\,4M\,\|f_1 - f_2\|_{K_d}.
\end{aligned}
\]

In the last inequality we use the fact $\|f\|_\infty \le 2^{d/2}\|f\|_{K_d}$ provided by (6.6). We see that for any $\varepsilon > 0$, an $\big(\varepsilon/(2^{d/2}\,4M)\big)$-covering of $B_R$ provides an $\varepsilon$-covering of $\mathcal{F}_R$. Therefore

\[
\mathcal{N}(\mathcal{F}_R, \varepsilon) \le \mathcal{N}\left( B_R, \frac{\varepsilon}{2^{d/2}\,4M} \right).
\]

Noting that we assume $R \ge M$, we have the confidence

\[
1 - \mathcal{N}(\mathcal{F}_R, \varepsilon)\exp\left( -\frac{3m\varepsilon}{32M^2} \right) \ge 1 - \mathcal{N}\left( B_R, \frac{\varepsilon}{2^{d/2}\,4M} \right)\exp\left( -\frac{3m\varepsilon}{32M^2} \right) \ge 1 - \mathcal{N}\left( B_1, \frac{\varepsilon}{2^{d/2}\,4MR} \right)\exp\left( -\frac{3m\varepsilon}{32M^2} \right),
\]

which provides (6.7). We thus complete the proof of the lemma.
For $R > 0$, we denote $\mathcal{W}(R) = \{ z \in Z^m : \|f_z\|_{K_d} \le R \}$. For $0 < \delta < 1$, we denote by $v^*(m, \delta) \equiv v^*(m, \delta, d, R)$ the unique positive solution (with respect to $\eta$) of the equation24

\[
\frac{3m\eta}{32M^2} - \log \mathcal{N}\left( B_1, \frac{\eta}{2^{d/2}\,4MR} \right) = \log\frac{2}{\delta}.
\]

Lemma 6.6. For all $0 < \delta < 1$ and $R > M$, there is a set $V_R \subset Z^m$ with $\rho(V_R) \le \delta$ such that for all $z \in \mathcal{W}(R)\setminus V_R$,

\[
\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_\rho) \le 2v^*(m, \delta) + 4\mathcal{D}(V_d(f_\rho), \lambda) + \frac{112M^2}{3m}\log\frac{2}{\delta}.
\]

Proof. Let $f_z$ be defined as in (2.2). If $f_\lambda \in \mathcal{H}_{K_d}(\Omega^n)$ satisfies $\|f_\lambda\|_\infty \le M$, then we have the error decomposition

\[
\begin{aligned}
\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_\rho) + \lambda\|f_z\|_{K_d}^2
&= \{\mathcal{E}(\pi(f_z)) - \mathcal{E}_z(\pi(f_z))\} + \{\mathcal{E}_z(f_\lambda) - \mathcal{E}(f_\lambda)\} \\
&\quad + \{\mathcal{E}_z(\pi(f_z)) + \lambda\|f_z\|_{K_d}^2 - (\mathcal{E}_z(f_\lambda) + \lambda\|f_\lambda\|_{K_d}^2)\} + \{\mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho) + \lambda\|f_\lambda\|_{K_d}^2\}.
\end{aligned}
\]

By the definition of $f_z$ we have that

\[
\mathcal{E}_z(\pi(f_z)) + \lambda\|f_z\|_{K_d}^2 \le \mathcal{E}_z(f_z) + \lambda\|f_z\|_{K_d}^2 \le \mathcal{E}_z(f_\lambda) + \lambda\|f_\lambda\|_{K_d}^2.
\]

Therefore we can bound the difference $\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_\rho)$ as follows:

\[
\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_\rho) \le \mathcal{E}(\pi(f_z)) - \mathcal{E}(f_\rho) + \lambda\|f_z\|_{K_d}^2 \le \mathcal{S}(f_\lambda) + \mathcal{D}(f_\lambda, \lambda), \tag{6.9}
\]

where $\mathcal{S}(f_\lambda) = \{\mathcal{E}(\pi(f_z)) - \mathcal{E}_z(\pi(f_z))\} + \{\mathcal{E}_z(f_\lambda) - \mathcal{E}(f_\lambda)\}$ is defined as in (6.1). In particular, when we choose $f_\lambda = f_d = V_d(f_\rho)$ in (6.9) we obtain

\[
\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_\rho) \le \mathcal{E}(\pi(f_z)) - \mathcal{E}(f_\rho) + \lambda\|f_z\|_{K_d}^2 \le \mathcal{S}(f_d) + \mathcal{D}(f_d, \lambda). \tag{6.10}
\]
Note that

\[
\sqrt{\{\mathcal{E}(\pi(f)) - \mathcal{E}(f_\rho)\} + \varepsilon}\;\sqrt{\varepsilon} \le \frac{1}{2}\{\mathcal{E}(\pi(f)) - \mathcal{E}(f_\rho)\} + \varepsilon.
\]

Choose $\varepsilon = v^*(m, \delta)$ in (6.7). It follows that there exists a set $V_R' \subset Z^m$ with measure at most $\delta/2$ such that for every $z \in \mathcal{W}(R)\setminus V_R'$ and every $f \in B_R$,

\[
(\mathcal{E}(\pi(f)) - \mathcal{E}(f_\rho)) - (\mathcal{E}_z(\pi(f)) - \mathcal{E}_z(f_\rho)) \le \frac{1}{2}(\mathcal{E}(\pi(f)) - \mathcal{E}(f_\rho)) + \varepsilon \le \frac{1}{2}(\mathcal{E}(\pi(f)) - \mathcal{E}(f_\rho)) + v^*(m, \delta).
\]

In particular, for $z \in \mathcal{W}(R)\setminus V_R'$ and $f_z \in B_R$, we have

\[
E(\xi_1) - \frac{1}{m}\sum_{i=1}^m \xi_1(z_i) = (\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_\rho)) - (\mathcal{E}_z(\pi(f_z)) - \mathcal{E}_z(f_\rho)) \le \frac{1}{2}(\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_\rho)) + v^*(m, \delta). \tag{6.11}
\]
Applying Lemma 6.4 with $\delta$ replaced by $\delta/2$, we may find a set $V_R'' \subset Z^m$ with measure at most $\delta/2$ such that for all $z \in \mathcal{W}(R)\setminus V_R''$,

\[
\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E(\xi_2) \le \|f_d - f_\rho\|_{L^2_{\rho_X}}^2 + \frac{56M^2}{3m}\log\frac{2}{\delta}. \tag{6.12}
\]

Set $V_R = V_R' \cup V_R''$. From estimates (6.11) and (6.12), we see that for every $z \in \mathcal{W}(R)\setminus V_R$ with $\rho(V_R) \le \delta$,

\[
\begin{aligned}
\mathcal{S}(f_d) &= (\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_d)) - (\mathcal{E}_z(\pi(f_z)) - \mathcal{E}_z(f_d)) \\
&= E(\xi_1) - \frac{1}{m}\sum_{i=1}^m \xi_1(z_i) + \frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E(\xi_2) \\
&\le \frac{1}{2}(\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_\rho)) + v^*(m, \delta) + \|f_d - f_\rho\|_{L^2_{\rho_X}}^2 + \frac{56M^2}{3m}\log\frac{2}{\delta}.
\end{aligned}
\]

This in connection with (6.10) yields, for every $z \in \mathcal{W}(R)\setminus V_R$ with $\rho(V_R) \le \delta$,

\[
\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_\rho) \le 2v^*(m, \delta) + 4\mathcal{D}(f_d, \lambda) + \frac{112M^2}{3m}\log\frac{2}{\delta}.
\]

Note that $f_d = V_d(f_\rho)$. The desired estimate for $\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_\rho)$ follows.

Theorem 6.1. Let $0 < \lambda < 1$ and let $f_z$ be defined by (2.2). Then, for any $0 < \delta < 1$, with confidence $1 - \delta$, we have

\[
\|\pi(f_z) - f_\rho\|_{L^2_{\rho_X}}^2 \le 2v^*\!\left(m, \delta, d, \frac{M}{\sqrt{\lambda}}\right) + 4\mathcal{D}(V_d(f_\rho), \lambda) + \frac{112M^2}{3m}\log\frac{2}{\delta}.
\]
Proof. The definition of $f_z$ tells us that, for $0 < \lambda < 1$, comparing with $f = 0$,

\[
\lambda\|f_z\|_{K_d}^2 \le \mathcal{E}_z(f_z) + \lambda\|f_z\|_{K_d}^2 \le \mathcal{E}_z(0) + 0 = \frac{1}{m}\sum_{i=1}^m (y_i - 0)^2 \le M^2.
\]

Therefore, $\|f_z\|_{K_d}^2 \le M^2/\lambda$ for almost all $z \in Z^m$. It follows that $\mathcal{W}(M/\sqrt{\lambda}) = Z^m$. This tells us that the ball $B_{M/\sqrt{\lambda}}$ contains $f_z$ for almost all $z \in Z^m$. When $0 < \lambda \le 1$, we take $R := M/\sqrt{\lambda} \ge M$ in Lemma 6.6, which implies the desired error bound.
7. Learning Rates

More quantitative decay estimates for $v^*(m, \delta, d, R)$ will lead to useful bounds for the difference $\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_\rho)$.
The dimension $N$ of $\mathcal{H}_{K_d}(\Omega^n)$ is bounded by $2d^n$ (see (6.5)). By Lemma 6.2, we know that, for $R > 0$ and $\varepsilon > 0$,

\[
\log \mathcal{N}\left( B_1, \frac{\varepsilon}{2^{d/2}\,4MR} \right) \le N \log\frac{4 \cdot 2^{d/2}\,4MR}{\varepsilon} \le 2d^n \log\frac{2^{d/2+4} MR}{\varepsilon}.
\]

Consequently, for all $\varepsilon > 0$ and $R = M/\sqrt{\lambda}$, $0 < \lambda \le 1$, we have

\[
1 - \mathcal{N}\left( B_1, \frac{\varepsilon\sqrt{\lambda}}{2^{d/2}\,4M^2} \right)\exp\left( -\frac{3m\varepsilon}{32M^2} \right) \ge 1 - \exp\left( 2d^n \log\frac{2^{d/2}\,16M^2}{\varepsilon\sqrt{\lambda}} - \frac{3m\varepsilon}{32M^2} \right).
\]

We define the function $h$ by

\[
h(\eta) := 2d^n \log\frac{2^{d/2}\,16M^2}{\eta\sqrt{\lambda}} - \frac{3m\eta}{32M^2}.
\]

Let $0 < \tau \le 1/(n+1)$, $d = m^\tau$ and $\lambda = \exp(-2nm^\tau)$. We have

\[
h(\eta) = 2m^{n\tau} \log\frac{2^{m^\tau/2}\,16M^2}{\eta\exp(-nm^\tau)} - \frac{3m\eta}{32M^2}
\]