Nonparametric regression estimation using penalized least squares

Michael Kohler¹ and Adam Krzyżak²

¹ Mathematisches Institut A, Universität Stuttgart, Pfaffenwaldring 57, D-70569 Stuttgart, Germany, email: [email protected]

² Department of Computer Science, Concordia University, 1455 De Maisonneuve West, Montreal, Canada H3G 1M8, email: [email protected]

Abstract

We present multivariate penalized least squares regression estimates. We use Vapnik–Chervonenkis theory and bounds on the covering numbers to analyze convergence of the estimates. We show strong consistency of the truncated versions of the estimates without any conditions on the underlying distribution.

Key words and phrases: penalized least squares, smoothing splines, regression estimate, strong consistency, Vapnik–Chervonenkis theory.

The first author's research was supported by DFG. The second author was supported by NSERC and the Alexander von Humboldt Foundation.


1 Introduction

Let $(X,Y)$, $(X_1,Y_1)$, $(X_2,Y_2)$, ... be independent identically distributed $\mathbb{R}^d \times \mathbb{R}$-valued random vectors with $EY^2 < \infty$. In regression analysis we want to estimate $Y$ after having observed $X$, i.e., we want to determine a function $f$ with $f(X)$ "close" to $Y$. If "closeness" is measured by the mean squared error, then one wants to find a function $f^*$ such that
$$
E\{f^*(X) - Y\}^2 = \min_f E\{f(X) - Y\}^2 . \qquad (1)
$$
Let $m(x) := E\{Y \mid X = x\}$ be the regression function and denote the distribution of $X$ by $\mu$. The well-known relation, which holds for each measurable function $f$,
$$
E\{f(X) - Y\}^2 = E\{m(X) - Y\}^2 + \int |f(x) - m(x)|^2 \, \mu(dx) \qquad (2)
$$
implies that $m$ is the solution of the minimization problem (1), and that for an arbitrary $f$ the $L_2$ error $\int |f(x) - m(x)|^2 \, \mu(dx)$ is the difference between $E\{f(X) - Y\}^2$ and $E\{m(X) - Y\}^2$, the minimum in (2). In the regression estimation problem the distribution of $(X,Y)$ (and consequently $m$) is unknown. Given a sequence $D_n = \{(X_1,Y_1), \ldots, (X_n,Y_n)\}$ of independent observations of $(X,Y)$, our goal is to construct an estimate $m_n(x) = m_n(x; D_n)$ of $m(x)$ such that the $L_2$ error $\int |m_n(x) - m(x)|^2 \, \mu(dx)$ is small. A sequence of estimators $(m_n)_{n \in \mathbb{N}}$ is called weakly (strongly) universally consistent if
$$
\int |m_n(x) - m(x)|^2 \, \mu(dx) \to 0 \quad \text{in } L_1 \ (\text{a.s.})
$$
for all distributions of $(X,Y)$ with $EY^2 < \infty$. Stone (1977) first pointed out that there exist weakly universally consistent estimators. Since then various results about weak and strong universal consistency of special estimators, e.g. kernel estimators, nearest neighbor estimators, histogram estimators and least squares estimators, have been published. See Devroye et al. (1994) for a list of papers on universal consistency and, in addition, Györfi and Walk (1996, 1997), Györfi et al. (1998), Kohler (1999) and Walk (1997).

The regression function minimizes the $L_2$ risk $E\{|f(X) - Y|^2\}$ (cf. (1)). This motivates the construction of an estimate of the regression function by minimization of an estimate of the $L_2$ risk, e.g. by minimization of the empirical $L_2$ risk
$$
\frac{1}{n} \sum_{i=1}^{n} |f(X_i) - Y_i|^2 . \qquad (3)
$$
If one minimizes (3) over all functions, then this leads (at least if $X_1, \ldots, X_n$ are distinct) to a function which interpolates the data. There are two different strategies to avoid this: For least squares estimates one minimizes the empirical $L_2$ risk only over some suitably chosen set $\mathcal{F}_n$ of functions, which depends on the sample size. For penalized least squares estimates one minimizes the sum of the empirical $L_2$ risk and a penalty term, which penalizes the roughness of the function $f$, over basically all functions.

Universal consistency of various least squares estimates has been shown in Lugosi and Zeger (1995) and Kohler (1997a, 1999). The proofs there use results of Vapnik–Chervonenkis theory together with a truncation argument. In this paper we use similar techniques to study penalized least squares estimates. A slightly different approach to deriving risk bounds for least squares estimates in the case of unbounded $Y$ is described in Chapter 5 of Vapnik (1998). The results there require the existence of a moment of order $p > 1$ of the random variable $|f(X) - Y|^2$. If we only assume $EY^2 < \infty$, this is not always satisfied; therefore those results are not applicable in the context of this paper.

Next, we give an exact definition of the penalized least squares estimates we will use. Let $k \in \mathbb{N}$ with $2k > d$ and denote by $W^k(\mathbb{R}^d)$ the Sobolev space containing all functions $f : \mathbb{R}^d \to \mathbb{R}$ whose derivatives of total order $k$ are in $L_2(\mathbb{R}^d)$. The condition $2k > d$ implies that the functions in $W^k(\mathbb{R}^d)$ are continuous and hence the evaluation of a function at a point is well defined. Set
$$
J_k^2(f) = \sum_{\substack{\alpha_1, \ldots, \alpha_d \in \mathbb{N} \\ \alpha_1 + \cdots + \alpha_d = k}} \frac{k!}{\alpha_1! \cdots \alpha_d!} \int_{\mathbb{R}^d} \left( \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x) \right)^2 dx .
$$
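For illustration, in the special case $d = 2$ and $k = 2$ the penalty is the classical thin plate spline functional
$$
J_2^2(f) = \int_{\mathbb{R}^2} \left( \frac{\partial^2 f}{\partial x_1^2}(x) \right)^2 + 2 \left( \frac{\partial^2 f}{\partial x_1 \partial x_2}(x) \right)^2 + \left( \frac{\partial^2 f}{\partial x_2^2}(x) \right)^2 dx ,
$$
since the multinomial weights $k!/(\alpha_1! \, \alpha_2!)$ equal $1, 2, 1$ for $(\alpha_1, \alpha_2) = (2,0), (1,1), (0,2)$.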

Then the penalized least squares estimate $\tilde{m}_n$ is defined by
$$
\tilde{m}_n(\cdot) = \arg \min_{f \in W^k(\mathbb{R}^d)} \left( \frac{1}{n} \sum_{i=1}^{n} |f(X_i) - Y_i|^2 + \lambda_n J_k^2(f) \right), \qquad (4)
$$
where $\lambda_n > 0$ is a parameter of the estimate. Here we do not require that the minimum be unique. Observe that $\tilde{m}_n$ depends on the data $D_n$ and that we have suppressed this in our notation.
Let $l = \binom{d+k-1}{d}$ and let $\phi_1, \ldots, \phi_l$ be all monomials $x_1^{\alpha_1} \cdots x_d^{\alpha_d}$ of total degree $\alpha_1 + \cdots + \alpha_d$ less than $k$. Define $R : \mathbb{R}_+ \to \mathbb{R}$ by
$$
R(u) =
\begin{cases}
u^{2k-d} \ln(u) & \text{if } 2k - d \text{ is even,} \\
u^{2k-d} & \text{if } 2k - d \text{ is odd,}
\end{cases}
$$
and denote the Euclidean norm of a vector $x \in \mathbb{R}^d$ by $\|x\|_2$. It follows from Section V in Duchon (1976) that there exists a function of the form
$$
\tilde{m}_n(x) = \sum_{i=1}^{n} a_i R(\|x - X_i\|_2) + \sum_{j=1}^{l} b_j \phi_j(x) \qquad (5)
$$

which achieves the minimum in (4), and that the coefficients $a_1, \ldots, a_n, b_1, \ldots, b_l \in \mathbb{R}$ of this function can be computed by solving a linear system of equations. Under some additional assumptions on $X_1, \ldots, X_n$ this is also shown in Section 2.4 of Wahba (1990).

Penalized least squares estimates have been studied by many authors; see e.g. the monographs Eubank (1988) and Wahba (1990) and the literature cited therein. Most of the results in the literature are derived for fixed design regression (where the $X_i$ are nonrandom) and cover the case $d = 1$ only. In the context of random design regression, consistency and rates of convergence of univariate penalized least squares estimates have been studied by means of empirical process theory by van de Geer (1987, 1988, 1990) and by Kohler (1997b). In all these papers a kind of boundedness assumption on $\tilde{m}_n$ is essential: For instance, the second assumption in Lemma 6.1 of van de Geer (1990) implies that $\tilde{m}_n$ is bounded (see van de Geer (1990), proof of Example 2.1 (ii)). It is easy to modify the estimate such that this assumption is no longer needed (cf. Kohler (1997b)), but this is only true for $d = 1$; unfortunately, the equivalent multivariate assumption is much more restrictive. Therefore we use a different approach to ensure boundedness of the estimate: we truncate it at some data-independent threshold tending to infinity as $n$ tends to infinity. We then use Vapnik–Chervonenkis theory to show strong consistency of the truncated estimate for all distributions of $(X,Y)$ with $\|X\|_2$ bounded a.s. and $EY^2 < \infty$. Furthermore, after a slight modification of the definition of the estimate, the assumption that $\|X\|_2$ is bounded a.s. is no longer necessary and the estimate is strongly universally consistent.
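To illustrate the representation (5) and the linear system mentioned above, the following minimal Python sketch fits the estimate (4) in the special case $d = 2$, $k = 2$, so that $R(u) = u^2 \ln(u)$ and the monomials are $1, x_1, x_2$. The particular form of the system, $(K + n \lambda_n I) a + P b = Y$ and $P^\top a = 0$ with $K_{ij} = R(\|X_i - X_j\|_2)$ and $P_{ij} = \phi_j(X_i)$, is the standard smoothing spline formulation (cf. Wahba (1990)) and is an assumption of this sketch, as are distinct design points; the function names are our own.

```python
import numpy as np

def thin_plate_R(r):
    # R(u) = u^2 ln(u) for d = 2, k = 2 (2k - d even); R(0) = 0 by continuity.
    out = np.zeros_like(r, dtype=float)
    pos = r > 0
    out[pos] = r[pos] ** 2 * np.log(r[pos])
    return out

def fit_penalized_ls(X, Y, lam):
    """Sketch of the penalized least squares estimate (4)/(5) for d = 2, k = 2,
    assuming the standard smoothing spline system (K + n*lam*I) a + P b = Y,
    P^T a = 0, and distinct design points X_1, ..., X_n."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    n = len(Y)
    K = thin_plate_R(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))
    P = np.hstack([np.ones((n, 1)), X])          # monomials 1, x1, x2 (degree < k)
    A = np.block([[K + n * lam * np.eye(n), P],
                  [P.T, np.zeros((P.shape[1], P.shape[1]))]])
    coef = np.linalg.solve(A, np.concatenate([Y, np.zeros(P.shape[1])]))
    a, b = coef[:n], coef[n:]

    def m_tilde(x):
        # evaluate the representation (5) at the rows of x
        x = np.atleast_2d(np.asarray(x, dtype=float))
        Kx = thin_plate_R(np.linalg.norm(x[:, None, :] - X[None, :, :], axis=-1))
        return Kx @ a + np.hstack([np.ones((len(x), 1)), x]) @ b

    return m_tilde
```

For instance, `fit_penalized_ls(X, Y, lam=len(Y) ** (-0.5))` returns a callable approximating $\tilde{m}_n$ for one admissible choice of the smoothing parameter.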

The main result is stated in Section 2; preliminary results, including a key covering lemma, are proved in Section 3 using Vapnik–Chervonenkis theory; and the proof of the main result is given in Section 4.

2 Main result

For $L > 0$ and $z \in \mathbb{R}$ set
$$
T_L z =
\begin{cases}
L & \text{if } z > L, \\
z & \text{if } -L \le z \le L, \\
-L & \text{if } z < -L.
\end{cases}
$$
For a function $f : \mathbb{R}^d \to \mathbb{R}$ define $T_L f : \mathbb{R}^d \to \mathbb{R}$ by $(T_L f)(x) = T_L(f(x))$ $(x \in \mathbb{R}^d)$. Let $\tilde{m}_n$ be defined by (4), and set
$$
m_n(x) = T_{\ln(n)} \tilde{m}_n(x) . \qquad (6)
$$

Theorem 1 Let $k \in \mathbb{N}$ with $2k > d$. Depending on the data choose $\lambda_n = \lambda_n(D_n) > 0$ such that
$$
\lambda_n \to 0 \quad (n \to \infty) \quad a.s. \qquad (7)
$$
and
$$
n \cdot \lambda_n \to \infty \quad (n \to \infty) \quad a.s. \qquad (8)
$$
Then
$$
\int |m_n(x) - m(x)|^2 \, \mu(dx) \to 0 \quad (n \to \infty)
$$
a.s. for every distribution of $(X,Y)$ with $\|X\|_2$ bounded a.s. and $EY^2 < \infty$.
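For example, the deterministic choice $\lambda_n = n^{-1/2}$ satisfies both (7) and (8), since $n^{-1/2} \to 0$ and $n \cdot n^{-1/2} = \sqrt{n} \to \infty$. The following short Python sketch of the truncation step (6) assumes a fitted function `m_tilde` as in (4) (e.g. from the solver sketched in Section 1); it is an illustration, not part of the theorem.

```python
import numpy as np

def truncated_estimate(m_tilde, n):
    """Apply the truncation (6): m_n(x) = T_{ln(n)} m_tilde(x), pointwise."""
    def m_n(x):
        L = np.log(n)                       # data-independent threshold ln(n)
        return np.clip(m_tilde(x), -L, L)   # T_L clips values into [-L, L]
    return m_n

# one admissible smoothing parameter: lam_n -> 0 and n * lam_n -> infinity
lam_n = lambda n: n ** (-0.5)
```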

Remark 1. If the regression function is bounded in absolute value by some known constant $L$, then $T_L \tilde{m}_n(x)$ is closer to $m(x)$ than $T_{\ln(n)} \tilde{m}_n(x)$ for all $x \in \mathbb{R}^d$ whenever $\ln(n) > L$, i.e., whenever $n$ is sufficiently large. This implies that under the conditions of Theorem 1
$$
\int |T_L \tilde{m}_n(x) - m(x)|^2 \, \mu(dx) \to 0 \quad (n \to \infty) \quad a.s.
$$
for all distributions of $(X,Y)$ with $\|X\|_2$ bounded a.s., $|m(x)| \le L$ for all $x \in \mathbb{R}^d$ and $EY^2 < \infty$. Hence in an application one can truncate the estimate at any known bound on the regression function. Furthermore, a similar argument implies that the $L_2$ error of the truncated estimate is smaller than the $L_2$ error of the estimate $\tilde{m}_n$.

Remark 2. We want to stress that in Theorem 1 there is no assumption on the underlying distribution of $(X,Y)$ besides boundedness of $\|X\|_2$. In particular it is not required that $X$ have a density with respect to the Lebesgue–Borel measure or that $m$ be (in some sense) smooth.

Remark 3. The assumption that $\|X\|_2$ is bounded a.s. may be dropped if we slightly modify the estimate. Define $\tilde{m}_n$ by

$$
\tilde{m}_n(\cdot) = \arg \min_{f \in W^k(\mathbb{R}^d)} \left( \frac{1}{n} \sum_{i=1}^{n} |f(X_i) - Y_i|^2 \cdot I_{[-\ln(n), \ln(n)]^d}(X_i) + \lambda_n J_k^2(f) \right)
$$

and set $m_n(x) = T_{\ln(n)} \tilde{m}_n(x) \cdot I_{[-\ln(n), \ln(n)]^d}(x)$. Using
$$
\int |m_n(x) - m(x)|^2 \, \mu(dx) = \int_{\mathbb{R}^d \setminus [-\ln(n), \ln(n)]^d} |m_n(x) - m(x)|^2 \, \mu(dx)
$$
$$
+ \, E\left\{ |m_n(X) - Y|^2 \cdot I_{[-\ln(n), \ln(n)]^d}(X) \,\big|\, D_n \right\} - E\left\{ |m(X) - Y|^2 \cdot I_{[-\ln(n), \ln(n)]^d}(X) \right\}
$$

in the proof of the theorem, it is easy to see that $m_n$ is strongly consistent for all distributions of $(X,Y)$ with $EY^2 < \infty$, provided $2k > d$ and (7)–(8) hold.

Remark 4. It is well known that one cannot derive a non-trivial rate of convergence result for the $L_2$ error of any estimate without restricting the class of distributions considered, e.g. by assuming some smoothness property of $m$ (see, e.g., Theorem 7.2 in Devroye, Györfi and Lugosi (1996) and Section 3 in Devroye and Wagner (1980)). Van de Geer (1990) and Kohler (1997b) derive rate of convergence results for the univariate estimate under the assumption $m \in W^k(\mathbb{R})$.
3 Application of Vapnik–Chervonenkis theory

In this section we use the Vapnik–Chervonenkis approach to show

Lemma 1 Let $L > 0$, let the estimate $m_n$ be defined by (4) and (6) and let (7)–(8) hold. Then
$$
E\left\{ |m_n(X) - T_L Y|^2 \,\big|\, D_n \right\} - \frac{1}{n} \sum_{i=1}^{n} |m_n(X_i) - T_L Y_i|^2 \to 0 \quad (n \to \infty)
$$
a.s. for every distribution of $(X,Y)$ with $X \in [0,1]^d$ a.s. and $EY^2 < \infty$.

In the proof of Lemma 1 we apply Pollard's inequality (Lemma 2 below) which uses the concept of covering numbers.

Definition 1 Let $\mathcal{F}$ be a class of functions $f : \mathbb{R}^{d+1} \to \mathbb{R}$. The covering number $N(\epsilon, \mathcal{F}, x_1^n)$ is defined for any $\epsilon > 0$ and $x_1^n = (x_1, \ldots, x_n) \in (\mathbb{R}^{d+1})^n$ as the smallest integer $k$ such that there exist functions $g_1, \ldots, g_k : \mathbb{R}^{d+1} \to \mathbb{R}$ with
$$
\min_{1 \le i \le k} \frac{1}{n} \sum_{j=1}^{n} |f(x_j) - g_i(x_j)| \le \epsilon
$$
for each $f \in \mathcal{F}$.

If $Z_1^n = (Z_1, \ldots, Z_n)$ is a sequence of $\mathbb{R}^{d+1}$-valued random variables, then $N(\epsilon, \mathcal{F}, Z_1^n)$ is a random variable with expected value $E \, N(\epsilon, \mathcal{F}, Z_1^n)$.
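As a concrete illustration of Definition 1 (not part of the paper's argument), the following Python sketch computes an upper bound on $N(\epsilon, \mathcal{F}, x_1^n)$ for a finite class, given the function values on $x_1, \ldots, x_n$ and a finite set of candidate centers, by greedily selecting centers whose empirical $L_1$ balls of radius $\epsilon$ cover the class; the names and the greedy strategy are our own choices.

```python
import numpy as np

def empirical_l1_covering_bound(F_vals, G_vals, eps):
    """Greedy upper bound on the covering number of Definition 1.
    F_vals: array of shape (|F|, n) with values f(x_1), ..., f(x_n) for each f in F.
    G_vals: array of shape (|G|, n) with values of candidate centers g_1, ..., g_k.
    Returns the size of a greedily chosen subset of the candidates whose
    empirical L1 balls of radius eps cover F."""
    F_vals, G_vals = np.asarray(F_vals, float), np.asarray(G_vals, float)
    # empirical L1 distance (1/n) sum_j |f(x_j) - g(x_j)| for every pair (f, g)
    dist = np.mean(np.abs(F_vals[:, None, :] - G_vals[None, :, :]), axis=-1)
    covered_by = dist <= eps            # (|F|, |G|) boolean matrix
    uncovered = np.ones(len(F_vals), dtype=bool)
    chosen = 0
    while uncovered.any():
        counts = covered_by[uncovered].sum(axis=0)
        best = int(np.argmax(counts))   # center covering most uncovered functions
        if counts[best] == 0:
            raise ValueError("candidates cannot cover the class at this radius")
        uncovered &= ~covered_by[:, best]
        chosen += 1
    return chosen
```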

Lemma 2 (Pollard (1984), Theorem 24, p. 25) Let $\mathcal{F}$ be a class of functions $f : \mathbb{R}^{d+1} \to [0, B]$, and let $Z_1^n = (Z_1, \ldots, Z_n)$ be $\mathbb{R}^{d+1}$-valued i.i.d. random variables. Then for any $\epsilon > 0$
$$
P\left\{ \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^{n} f(Z_i) - E f(Z_1) \right| > \epsilon \right\} \le 8 \, E\left\{ N\left( \frac{\epsilon}{8}, \mathcal{F}, Z_1^n \right) \right\} \exp\left( - \frac{n \epsilon^2}{128 B^2} \right) .
$$

Proof of Lemma 1. W.l.o.g. we assume $L \le \ln(n)$. By the definition of the estimate and the strong law of large numbers,
$$
\frac{1}{n} \sum_{i=1}^{n} |\tilde{m}_n(X_i) - Y_i|^2 + \lambda_n J_k^2(\tilde{m}_n) \le \frac{1}{n} \sum_{i=1}^{n} |0 - Y_i|^2 + \lambda_n \cdot 0 \to EY^2 \quad (n \to \infty) \quad a.s.,
$$
which implies together with (8) that with probability one we have for $n$ sufficiently large
$$
J_k^2(\tilde{m}_n) \le \frac{2 EY^2}{\lambda_n} = \frac{2 EY^2}{n \lambda_n} \cdot n \le n ,
$$
which in turn implies
$$
m_n \in \mathcal{F}_n = \left\{ T_{\ln(n)} f \, : \, f \in W^k(\mathbb{R}^d) \text{ and } J_k^2(f) \le n \right\} .
$$
Hence it suffices to show
$$
\sup_{g \in \mathcal{G}_n} \left| E\{g(X,Y)\} - \frac{1}{n} \sum_{i=1}^{n} g(X_i, Y_i) \right| \to 0 \quad (n \to \infty) \quad a.s., \qquad (9)
$$
where $\mathcal{G}_n = \left\{ g : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R} \, : \, g(x,y) = |f(x) - T_L y|^2 \text{ for some } f \in \mathcal{F}_n \right\}$.

If $g_j(x,y) = |f_j(x) - T_L y|^2$ $((x,y) \in \mathbb{R}^d \times \mathbb{R})$ for some function $f_j$ bounded in absolute value by $\ln(n)$ $(j = 1, 2)$, then
$$
\frac{1}{n} \sum_{i=1}^{n} |g_1(X_i, Y_i) - g_2(X_i, Y_i)| \le 4 \ln(n) \cdot \frac{1}{n} \sum_{i=1}^{n} |f_1(X_i) - f_2(X_i)| ,
$$
which implies
$$
N\left( \frac{\epsilon}{8}, \mathcal{G}_n, (X,Y)_1^n \right) \le N\left( \frac{\epsilon}{32 \ln(n)}, \mathcal{F}_n, X_1^n \right) .
$$
Using this, Lemma 2 and Lemma 3 below, one gets for arbitrary $0 < \epsilon < 1$
$$
P\left\{ \sup_{g \in \mathcal{G}_n} \left| E\{g(X,Y)\} - \frac{1}{n} \sum_{i=1}^{n} g(X_i, Y_i) \right| > \epsilon \right\}
\le 8 \left( \frac{32 \, c_1 \ln^2(n)}{\epsilon} \right)^{c_2 \left( \frac{32 \sqrt{n} \ln(n)}{\epsilon} \right)^{d/k} + c_3} \exp\left( - \frac{n \epsilon^2}{128 \, (4 \ln^2(n))^2} \right)
$$
$$
= 8 \exp\left( \left( c_2 \left( \frac{32 \sqrt{n} \ln(n)}{\epsilon} \right)^{d/k} + c_3 \right) \ln\left( \frac{32 \, c_1 \ln^2(n)}{\epsilon} \right) - \frac{n \epsilon^2}{128 \cdot 16 \cdot \ln^4(n)} \right)
\le 8 \exp\left( - \frac{1}{2} \cdot \frac{n \epsilon^2}{128 \cdot 16 \cdot \ln^4(n)} \right)
$$
for $n$ sufficiently large, where we have observed that $2k > d$ implies
$$
\frac{(\sqrt{n} \ln(n))^{d/k} \cdot \ln(\ln^2(n) \cdot n)}{n / \ln^4(n)} = \frac{\ln^{4 + d/k}(n) \cdot \ln(\ln^2(n) \cdot n)}{n^{1 - d/(2k)}} \to 0 \quad (n \to \infty) .
$$
Application of the Borel–Cantelli lemma concludes the proof. $\Box$

Remark 5. If one restricts $\mathcal{F}_n$ to contain only functions of the form (5), measurability of the above supremum follows from Appendix C in Pollard (1984).

In the proof of Lemma 1 we have used

Lemma 3 Let $L, c > 0$ and set
$$
\mathcal{F} = \left\{ T_L f \, : \, f \in W^k(\mathbb{R}^d) \text{ and } J_k^2(f) \le c \right\} .
$$
Then for any $0 < \epsilon < L$ and $x_1, \ldots, x_n \in [0,1]^d$
$$
N(\epsilon, \mathcal{F}, x_1^n) \le \left( \frac{c_1 L}{\epsilon} \right)^{c_2 \left( \frac{\sqrt{c}}{\epsilon} \right)^{d/k} + c_3} ,
$$
where $c_1, c_2, c_3 \in \mathbb{R}_+$ are constants which depend only on $k$ and $d$.

Proof of Lemma 3. We proceed similarly as in the proof of Lemma 3.3.1 in van de Geer (1987). We first construct a rectangular partition of the unit cube and then we use a family of piecewise polynomial functions on the given partition to obtain a bound on $N(\epsilon, \mathcal{F}, x_1^n)$.

Fix $f \in W^k(\mathbb{R}^d)$ with $J_k^2(f) \le c$. First we partition $[0,1]^d$ into $d$-dimensional rectangles $A_1, \ldots, A_K$ with the following properties:

i) $\displaystyle \int_{A_i} \sum_{\alpha_1 + \cdots + \alpha_d = k} \frac{k!}{\alpha_1! \cdots \alpha_d!} \left( \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x) \right)^2 dx \le c \, (\epsilon / \sqrt{c})^{d/k}$, $\quad i = 1, \ldots, K$;

ii) $\displaystyle \sup_{x, z \in A_i} \|x - z\|_\infty \le (\epsilon / \sqrt{c})^{1/k}$, $\quad i = 1, \ldots, K$;

iii) $K \le \left( (\sqrt{c}/\epsilon)^{1/k} + 1 \right)^d + (\sqrt{c}/\epsilon)^{d/k}$.

The partition $\{A_1, \ldots, A_K\}$ may be constructed as follows. Start by dividing $[0,1]^d$ into $\tilde{K} \le \left( (\sqrt{c}/\epsilon)^{1/k} + 1 \right)^d$ equi-volume cubes $B_1, \ldots, B_{\tilde{K}}$ of side length $(\epsilon / \sqrt{c})^{1/k}$. Then partition each cube $B_i$ into $d$-dimensional rectangles $B_{i,1}, \ldots, B_{i,l_i}$ such that
$$
\int_{B_{i,j}} \sum_{\alpha_1 + \cdots + \alpha_d = k} \frac{k!}{\alpha_1! \cdots \alpha_d!} \left( \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x) \right)^2 dx = c \, (\epsilon / \sqrt{c})^{d/k} , \quad j = 1, \ldots, l_i - 1 ,
$$
and
$$
\int_{B_{i,l_i}} \sum_{\alpha_1 + \cdots + \alpha_d = k} \frac{k!}{\alpha_1! \cdots \alpha_d!} \left( \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x) \right)^2 dx \le c \, (\epsilon / \sqrt{c})^{d/k} .
$$
This leads to a partition with $K = \sum_{i=1}^{\tilde{K}} l_i = \tilde{K} + \sum_{i=1}^{\tilde{K}} (l_i - 1)$ rectangles satisfying (i) and (ii). Since
$$
c \ge \int_{\mathbb{R}^d} \sum_{\alpha_1 + \cdots + \alpha_d = k} \frac{k!}{\alpha_1! \cdots \alpha_d!} \left( \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x) \right)^2 dx \ge \sum_{i=1}^{\tilde{K}} \sum_{j=1}^{l_i - 1} \int_{B_{i,j}} \sum_{\alpha_1 + \cdots + \alpha_d = k} \frac{k!}{\alpha_1! \cdots \alpha_d!} \left( \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x) \right)^2 dx = \sum_{i=1}^{\tilde{K}} (l_i - 1) \, c \, (\epsilon / \sqrt{c})^{d/k} ,
$$
we have
$$
K = \sum_{i=1}^{\tilde{K}} l_i = \tilde{K} + \sum_{i=1}^{\tilde{K}} (l_i - 1) \le \left( (\sqrt{c}/\epsilon)^{1/k} + 1 \right)^d + (\sqrt{c}/\epsilon)^{d/k} ,
$$
as required in (iii).

Next we approximate $f$ on each rectangle $A_i$ by a polynomial of total degree $k-1$. Fix $1 \le i \le K$. By the Sobolev integral identity, see Oden and Reddy (1976), Theorem 3.6, there exist a polynomial $p_i$ of total degree not exceeding $k-1$ and infinitely differentiable bounded functions $Q_\alpha(x,z)$ such that for all $x \in A_i$
$$
|f(x) - p_i(x)|^2 = \left( \int_{A_i} \|x - z\|_2^{k-d} \sum_{\alpha_1 + \cdots + \alpha_d = k} Q_\alpha(x,z) \, \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(z) \, dz \right)^2
$$
$$
\le \int_{A_i} \|x - z\|_2^{2(k-d)} dz \cdot \int_{A_i} \left( \sum_{\alpha_1 + \cdots + \alpha_d = k} Q_\alpha(x,z) \, \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(z) \right)^2 dz
$$
$$
\le \int_{A_i} \|x - z\|_2^{2(k-d)} dz \cdot \int_{A_i} \left( \sum_{\alpha_1 + \cdots + \alpha_d = k} |Q_\alpha(x,z)|^2 \right) \left( \sum_{\alpha_1 + \cdots + \alpha_d = k} \left( \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(z) \right)^2 \right) dz
$$
$$
\le \int_{A_i} \|x - z\|_2^{2(k-d)} dz \cdot c_0 \int_{A_i} \sum_{\alpha_1 + \cdots + \alpha_d = k} \frac{k!}{\alpha_1! \cdots \alpha_d!} \left( \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(z) \right)^2 dz .
$$
Using (i) and (ii) we get
$$
|f(x) - p_i(x)|^2 \le \left( \sqrt{d} \, (\epsilon / \sqrt{c})^{1/k} \right)^{2(k-d)} (\epsilon / \sqrt{c})^{d/k} \cdot c_0 \cdot c \, (\epsilon / \sqrt{c})^{d/k} = d^{k-d} c_0 \, \epsilon^2 .
$$

This implies
$$
N\left( \left( \sqrt{c_0 \, d^{k-d}} + 1 \right) \epsilon , \, \mathcal{F} , \, x_1^n \right) \le N\left( \epsilon , \, T_L \mathcal{G} , \, x_1^n \right) ,
$$
where $T_L \mathcal{G} = \{ T_L g : g \in \mathcal{G} \}$ and $\mathcal{G}$ is the set of all piecewise polynomials of total degree less than or equal to $k-1$ with respect to a rectangular partition of $[0,1]^d$ consisting of at most $K^* \le (2^d + 1)(\sqrt{c}/\epsilon)^{d/k} + 2^d$ rectangles.

In the last step we bound the covering number of $\mathcal{G}$. Notice that the partition above can be obtained by intersecting at most $2 d \, K^*$ hyperplanes, yielding
$$
\Delta_n \le (n^d)^{2 d K^*} ,
$$
where $\Delta_n$ is the number of partitions of $\{x_1, \ldots, x_n\}$ generated by intersections with the hyperplanes. Proposition 1 of Nobel (1996) and Corollary 29.2 of Devroye, Györfi and Lugosi (1996) imply the final result
$$
N(\epsilon, T_L \mathcal{G}, x_1^n) \le n^{2 d^2 \left( (2^d + 1)(\sqrt{c}/\epsilon)^{d/k} + 2^d \right)} \left( \frac{8 e L}{\epsilon} \right)^{2 \left( \binom{d+k-1}{d} + 1 \right) \left( (2^d + 1)(\sqrt{c}/\epsilon)^{d/k} + 2^d \right)} . \qquad \Box
$$

4 Proof of Theorem 1

W.l.o.g. we may assume $X \in [0,1]^d$ a.s. Let $L, \epsilon > 0$ be arbitrary, set $Y_L = T_L Y$ and $Y_{i,L} = T_L Y_i$ $(i = 1, \ldots, n)$. Choose $g \in W^k(\mathbb{R}^d)$ such that
$$
\int |m(x) - g(x)|^2 \, \mu(dx) < \epsilon \quad \text{and} \quad J_k^2(g) < \infty .
$$

We use the following error decomposition:
$$
\int |m_n(x) - m(x)|^2 \, \mu(dx) = E\{ |m_n(X) - Y|^2 \mid D_n \} - E\{ |m(X) - Y|^2 \}
$$
$$
= \Big[ E\{ |m_n(X) - Y|^2 \mid D_n \} - (1+\epsilon) E\{ |m_n(X) - Y_L|^2 \mid D_n \} \Big]
$$
$$
+ (1+\epsilon) \Big[ E\{ |m_n(X) - Y_L|^2 \mid D_n \} - \frac{1}{n} \sum_{i=1}^{n} |m_n(X_i) - Y_{i,L}|^2 \Big]
$$
$$
+ (1+\epsilon) \Big[ \frac{1}{n} \sum_{i=1}^{n} |m_n(X_i) - Y_{i,L}|^2 - \frac{1}{n} \sum_{i=1}^{n} |\tilde{m}_n(X_i) - Y_{i,L}|^2 \Big]
$$
$$
+ \Big[ (1+\epsilon) \frac{1}{n} \sum_{i=1}^{n} |\tilde{m}_n(X_i) - Y_{i,L}|^2 - (1+\epsilon)^2 \frac{1}{n} \sum_{i=1}^{n} |\tilde{m}_n(X_i) - Y_i|^2 \Big]
$$
$$
+ (1+\epsilon)^2 \Big[ \frac{1}{n} \sum_{i=1}^{n} |\tilde{m}_n(X_i) - Y_i|^2 - \frac{1}{n} \sum_{i=1}^{n} |g(X_i) - Y_i|^2 \Big]
$$
$$
+ (1+\epsilon)^2 \Big[ \frac{1}{n} \sum_{i=1}^{n} |g(X_i) - Y_i|^2 - E\{ |g(X) - Y|^2 \} \Big]
$$
$$
+ (1+\epsilon)^2 \Big[ E\{ |g(X) - Y|^2 \} - E\{ |m(X) - Y|^2 \} \Big]
$$
$$
+ \big( (1+\epsilon)^2 - 1 \big) E\{ |m(X) - Y|^2 \}
$$
$$
= \sum_{j=1}^{8} T_{j,n} .
$$

Because of $(a+b)^2 \le (1+\epsilon) a^2 + (1 + \tfrac{1}{\epsilon}) b^2$ $(a, b > 0)$ and the strong law of large numbers we get
$$
T_{1,n} = E\{ |(m_n(X) - Y_L) + (Y_L - Y)|^2 \mid D_n \} - (1+\epsilon) E\{ |m_n(X) - Y_L|^2 \mid D_n \} \le \left( 1 + \frac{1}{\epsilon} \right) E\{ |Y - Y_L|^2 \}
$$
and
$$
T_{4,n} \le (1+\epsilon) \left( 1 + \frac{1}{\epsilon} \right) \frac{1}{n} \sum_{i=1}^{n} |Y_i - Y_{i,L}|^2 \to (1+\epsilon) \left( 1 + \frac{1}{\epsilon} \right) E\{ |Y - Y_L|^2 \} \quad (n \to \infty) \quad a.s.
$$

By Lemma 1,
$$
T_{2,n} \to 0 \quad (n \to \infty) \quad a.s.
$$
Furthermore, if $x, y \in \mathbb{R}$ with $|y| \le \ln(n)$ and $z = T_{\ln(n)} x$, then $|z - y| \le |x - y|$, which implies $T_{3,n} \le 0$ for $n$ sufficiently large.

It follows from the definition of the estimate and (7) that
$$
T_{5,n} \le (1+\epsilon)^2 \left( \lambda_n J_k^2(g) - \lambda_n J_k^2(\tilde{m}_n) \right) \le (1+\epsilon)^2 \lambda_n J_k^2(g) \to 0 \quad (n \to \infty) \quad a.s.
$$

By the strong law of large numbers,
$$
T_{6,n} \to 0 \quad (n \to \infty) \quad a.s.
$$
Finally,
$$
T_{7,n} = (1+\epsilon)^2 \int |g(x) - m(x)|^2 \, \mu(dx) \le (1+\epsilon)^2 \epsilon .
$$
Using this one concludes
$$
\limsup_{n \to \infty} \int |m_n(x) - m(x)|^2 \, \mu(dx) \le (2+\epsilon) \left( 1 + \frac{1}{\epsilon} \right) E\{ |Y - Y_L|^2 \} + (1+\epsilon)^2 \epsilon + (2\epsilon + \epsilon^2) E\{ |m(X) - Y|^2 \} \quad a.s.
$$
With $L \to \infty$ and $\epsilon \to 0$ the result follows. $\Box$

Acknowledgement

The authors wish to thank Dominik Schäfer for many detailed and helpful comments.

References

[1] Devroye, L., Györfi, L., Krzyżak, A. and Lugosi, G. (1994). On the strong universal consistency of nearest neighbor regression function estimates. Ann. Statist. 22, 1371–1385.

[2] Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.

[3] Devroye, L. P. and Wagner, T. J. (1980). Distribution-free consistency results in nonparametric discrimination and regression function estimation. Ann. Statist. 8, 231–239.

[4] Duchon, J. (1976). Interpolation des fonctions de deux variables suivant le principe de la flexion des plaques minces. R.A.I.R.O. Analyse Numérique 10, 5–12.

[5] Eubank, R. L. (1988). Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York.

[6] Györfi, L., Kohler, M. and Walk, H. (1998). Weak and strong universal consistency of semi-recursive partitioning and kernel regression estimates. Statistics & Decisions 16, 1–18.

[7] Györfi, L. and Walk, H. (1996). On the strong universal consistency of a series type regression estimate. Mathematical Methods of Statistics 5, 332–342.

[8] Györfi, L. and Walk, H. (1997). On the strong universal consistency of a recursive regression estimate by Pál Révész. Statistics and Probability Letters 31, 177–183.

[9] Kohler, M. (1997a). On the universal consistency of a least squares spline regression estimator. Mathematical Methods of Statistics 6, 349–364.

[10] Kohler, M. (1997b). On optimal global rates of convergence for nonparametric regression with random design. Preprint 97-11, Mathematisches Institut A, Universität Stuttgart. Submitted for publication.

[11] Kohler, M. (1999). Universally consistent regression function estimation using hierarchical B-splines. J. Multivariate Anal. 67, 138–164.

[12] Lugosi, G. and Zeger, K. (1995). Nonparametric estimation via empirical risk minimization. IEEE Trans. Inform. Theory 41, 677–687.

[13] Nobel, A. (1996). Histogram regression estimation using data-dependent partitions. Ann. Statist. 24, 1084–1105.

[14] Oden, J. T. and Reddy, J. N. (1976). An Introduction to the Mathematical Theory of Finite Elements. Wiley, New York.

[15] Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.

[16] Stone, C. J. (1977). Consistent nonparametric regression. Ann. Statist. 5, 595–645.

[17] van de Geer, S. (1987). A new approach to least squares estimation, with applications. Ann. Statist. 15, 587–602.

[18] van de Geer, S. (1988). Regression Analysis and Empirical Processes. CWI Tract 45, Centre for Mathematics and Computer Science, Amsterdam.

[19] van de Geer, S. (1990). Estimating a regression function. Ann. Statist. 18, 907–924.

[20] Vapnik, V. N. (1998). Statistical Learning Theory. John Wiley & Sons.

[21] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.

[22] Walk, H. (1997). Strong universal consistency of kernel and partitioning regression estimates. Preprint 97-1, Mathematisches Institut A, Universität Stuttgart.