Consistent Nonparametric Regression. Charles J. Stone. The Annals of Statistics, Vol. 5, No. 4 (Jul., 1977), pp. 595-620. Stable URL: http://links.jstor.org/sici?sici=0090-5364%28197707%295%3A4%3C595%3ACNR%3E2.0.CO%3B2-O


The Annals of Statistics 1977, Vol. 5, No. 4, 595-620

CONSISTENT NONPARAMETRIC REGRESSION1

BY CHARLES J. STONE

University of California, Los Angeles

Let (X, Y) be a pair of random variables such that X is R^d-valued and Y is R^{d'}-valued. Given a random sample (X_1, Y_1), …, (X_n, Y_n) from the distribution of (X, Y), the conditional distribution P^Y(· | X) of Y given X can be estimated nonparametrically by P̂_n^Y(A | X) = Σ_i W_{ni}(X) I_A(Y_i), where the weight function W_n is of the form W_{ni}(X) = W_{ni}(X, X_1, …, X_n), 1 ≤ i ≤ n. The weight function W_n is called a probability weight function if it is nonnegative and Σ_i W_{ni}(X) = 1. Associated with P̂_n^Y(· | X) in a natural way are nonparametric estimators of conditional expectations, variances, covariances, standard deviations, correlations and quantiles and nonparametric approximate Bayes rules in prediction and multiple classification problems. Consistency of a sequence {W_n} of weight functions is defined and sufficient conditions for consistency are obtained. When applied to sequences of probability weight functions, these conditions are both necessary and sufficient. Consistent sequences of probability weight functions defined in terms of nearest neighbors are constructed. The results are applied to verify the consistency of the estimators of the various quantities discussed above and the consistency in Bayes risk of the approximate Bayes rules.

1. Introduction. Let (X, Y) be a pair of random variables such that X is R^d-valued and Y is R^{d'}-valued. An important concept in probability and statistics is that of the conditional distribution P^Y(· | X) of Y given X and quantities defined in terms of this conditional distribution: conditional expectations, variances, standard deviations, covariances, correlations and quantiles. There are simple formulas for these conditional quantities if the joint distribution P^{X,Y} of (X, Y) is a multivariate Gaussian distribution N(μ, Σ) with known mean μ and covariance matrix Σ. Typically in practice P^{X,Y} is not known exactly, but a random sample (X_1, Y_1), …, (X_n, Y_n) from P^{X,Y} is available. In the Gaussian case estimators μ̂ and Σ̂ of μ and Σ based on this data can be obtained and P^{X,Y} can be estimated as P̂_n^{X,Y} = N(μ̂, Σ̂). Then P^Y(· | X) can be estimated by P̂_n^Y(· | X), defined to be the conditional distribution of Y given X corresponding to the joint distribution P̂_n^{X,Y}. The various conditional quantities defined in terms of P^Y(· | X) can in turn be estimated by the corresponding quantities defined in terms of P̂_n^Y(· | X). This paper is concerned with the problem of estimating the conditional

Received October 1974; revised September 1976.

1 Research was supported by NSF Grant No. MPS 72-04591 and, through the Health Sciences Computer Facility at UCLA, by NIH special research grant RR-3.

AMS 1970 subject classifications. Primary 62G05; Secondary 62H30.

Key words and phrases. Regression function, conditional quantities, prediction, multiple classification, consistency in Bayes risk, approximate Bayes rules, nonparametric estimators, nearest neighbor rules.

distribution of Y given X and the various conditional quantities related to it when the joint distribution of X and Y is not assumed to be Gaussian or, in fact, to belong to any prespecified parametric family of distributions. In this context nonparametric methods of estimation are appropriate. If a number of the X_i's in the random sample are exactly equal to X, which can happen if X is a discrete random variable, P^Y(· | X) can be estimated by the empirical distribution of the Y_i's corresponding to the X_i's equal to X. If few or none of the X_i's are exactly equal to X, it is necessary to use Y_i's corresponding to X_i's near X. This leads to estimators P̂_n^Y(· | X) of the form

P̂_n^Y(A | X) = Σ_i W_{ni}(X) I_A(Y_i),

where W_{ni}(X) = W_{ni}(X, X_1, …, X_n), 1 ≤ i ≤ n, weights those values of i for which X_i is close to X more heavily than those values of i for which X_i is far from X. Set W_{ni}(X) = 0 for i > n. The weight function W_n is said to be normal if Σ_i W_{ni}(X) = 1, nonnegative if W_n ≥ 0, and a probability weight function if it is both normal and nonnegative. In the last case P̂_n^Y(· | X) is a probability distribution on R^{d'}. Let g be a Borel function on R^{d'} such that E|g(Y)| < ∞ and let E(g(Y) | X) denote the conditional expectation of g(Y) given X. Corresponding to W_n is the estimator Ê_n(g(Y) | X) of E(g(Y) | X) defined by

Ê_n(g(Y) | X) = Σ_i W_{ni}(X) g(Y_i).

Note that if A is a Borel set in R^{d'}, then P̂_n^Y(A | X) = Ê_n(I_A(Y) | X). Other conditional quantities defined in terms of P^Y(· | X) can again be estimated by the corresponding quantities defined in terms of P̂_n^Y(· | X). Observe that the estimators considered here are estimators of function values at specified points of the domain, not estimators of parameters of the function. Suppose, for example, that Y is real valued and E|Y| < ∞. The value E(Y | X = x) = ∫ y P^Y(dy | X = x) of the regression function of Y on X at the point x is estimated by

Ê_n(Y | X = x) = ∫ y P̂_n^Y(dy | X = x) = Σ_i W_{ni}(x) Y_i.

This setup differs from that of nonparametric linear regression models. There the regression function is assumed to belong to the parametric family of linear functions on R^d, but no parametric form is assumed for the distribution of the residuals. For such models the goal is to obtain robust estimators of the regression coefficients (see Adichie (1967), Jurečková (1971), Jaeckel (1972) and Bickel (1973)). Let X, X_1, X_2, … be a fixed sequence of independent and identically distributed (i.i.d.) R^d-valued random variables on a probability space Ω. It is assumed that there is a sequence of independent standard normal random variables on Ω which is independent of (X, X_1, X_2, …) (which fact is required to obtain the
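In modern terms, the estimator Ê_n(Y | X = x) = Σ_i W_{ni}(x) Y_i above is a linear smoother. A minimal sketch of the construction follows; the box-kernel weight function is an illustrative choice of ours, not one of the weight sequences studied in this paper, and all function names are hypothetical:

```python
import numpy as np

def estimate_conditional_mean(x, X, Y, weights):
    """Estimate E(Y | X = x) by sum_i W_ni(x) Y_i, where the weights
    may depend on x and on the whole sample X_1, ..., X_n."""
    w = weights(x, X)  # shape (n,)
    return float(w @ Y)

def box_kernel_weights(h):
    """A simple probability weight function: equal weight on the sample
    points within distance h of x, zero elsewhere (illustration only)."""
    def weights(x, X):
        near = np.linalg.norm(X - x, axis=1) <= h
        m = near.sum()
        # fall back to equal weights on all points if none lies within h
        return near / m if m > 0 else np.full(len(X), 1.0 / len(X))
    return weights
```

Since these weights are nonnegative and sum to one, the estimate is a probability-weighted average of the Y_i, which is exactly the estimator Ê_n(g(Y) | X) specialized to g(y) = y.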


necessity of (5) in Theorem 1 below). A sequence {W_n} of weights is said to be consistent if whenever (X, Y), (X_1, Y_1), (X_2, Y_2), … are i.i.d., Y is real valued, r ≥ 1, and E|Y|^r < ∞, then Ê_n(Y | X) → E(Y | X) in L^r (here Z_n → Z in L^r means that E|Z_n − Z|^r → 0).

In Theorem 1 of Section 2 sufficient conditions for {W_n} to be consistent are stated. If {W_n} is a sequence of probability weights, then, as noted in Corollary 1, the conditions simplify and become both necessary and sufficient for consistency. The conditions in Theorem 1 and Corollary 1 involve the unknown underlying distribution of X. A sequence {W_n} of weights is said to be universally consistent if it is consistent regardless of the distribution of X. Theorem 2 of Section 3 shows how to obtain universally consistent sequences of weights defined in terms of the ranks of the distances from X_1, …, X_n to X. The proof of Theorem 2 depends crucially on an inequality stated as Proposition 11 in Section 11. This inequality, which is interesting in itself, is the key to the truly nonparametric (distribution free) aspect of this paper, i.e., to the fact that results are obtained which are completely free of regularity conditions on the distribution of X or the joint distribution of (X, Y). In Section 4 a method is discussed for modifying a consistent sequence of weights to obtain another consistent sequence which hopefully yields more accurate estimators. Section 5 discusses "trend removal," which can take advantage of a fairly accurate linear approximation to E(Y | X). By definition a consistent sequence of weights yields consistent estimators of conditional expectations. Sections 6, 7 and 8 show respectively how to obtain consistent estimators of conditional second order quantities, conditional quantiles, and Bayes rules. The results in Sections 4-8 are all based on starting out with a consistent sequence of weights. They become truly nonparametric if the weights are assumed to be universally consistent. Related papers in the literature are briefly reviewed in Section 9. The results from Sections 2-8 are proved in Sections 10-13.

An experimental packaged program is currently being developed in cooperation with the Health Sciences Computer Facility at UCLA, which should make it easy to determine the performance of the estimators discussed in this paper on real and simulated data sets. Preliminary experience in using this program on simulated data sets shows the effectiveness of the modifications discussed in Sections 4 and 5.

2. Consistent sequences of weights. Let R^d denote d-dimensional Euclidean space with the usual inner product x · y and norm ‖x‖. For x and y in R let x ∨ y and x ∧ y denote respectively the maximum and minimum of x and y. For x ∈ R set x⁺ = x ∨ 0, x⁻ = −(x ∧ 0) and sign(x) = −1, 0, or 1 according as x < 0, x = 0, or x > 0. Given any set A, let #(A) denote the number of elements in A. All random variables considered in this paper are assumed to be defined on

the probability space Ω. Let Z_n, n ≥ 1, and Z be real valued random variables. Then Z_n → Z in probability if lim_n P(|Z_n − Z| > a) = 0 for all a > 0. For r ≥ 1, Z_n → Z in L^r if lim_n E|Z_n − Z|^r = 0. Note that Z_n → Z in L^r implies that Z_n → Z in probability. Finally Z_n is bounded in probability if lim_{M→∞} lim sup_n P(|Z_n| ≥ M) = 0. The following result will be proven in Section 10.

THEOREM 1. Let {W_n} be a sequence of weights. Suppose the following five conditions are satisfied:

(1) there is a C ≥ 1 such that for every nonnegative Borel function f on R^d, E Σ_i |W_{ni}(X)| f(X_i) ≤ C E f(X) for all n ≥ 1;

(2) there is a D ≥ 1 such that P(Σ_i |W_{ni}(X)| ≤ D) = 1 for all n ≥ 1;

(3) Σ_i |W_{ni}(X)| I_{{‖X_i − X‖ > a}} → 0 in probability for all a > 0;

(4) Σ_i W_{ni}(X) → 1 in probability; and

(5) max_i |W_{ni}(X)| → 0 in probability.

Then {W_n} is consistent. Suppose, conversely, that {W_n} is consistent. Then (4) and (5) hold. If W_n ≥ 0 for all n ≥ 1, then (3) holds, and if W_n ≥ 0 for all n ≥ 1 and (2) holds, then (1) holds.

If {W_n} is a sequence of probability weights, then (2) and (4) hold automatically and the three remaining conditions are necessary and sufficient for consistency. This result is summarized in the following corollary.

COROLLARY 1. Let {W_n} be a sequence of probability weights. It is consistent if and only if the following three conditions hold: there is a C ≥ 1 such that, for every nonnegative Borel function f on R^d, E Σ_i W_{ni}(X) f(X_i) ≤ C E f(X) for all n ≥ 1; Σ_i W_{ni}(X) I_{{‖X_i − X‖ > a}} → 0 in probability for all a > 0; and max_i W_{ni}(X) → 0 in probability.

The following consequence of Theorem 1 will be used in Section 4.

COROLLARY 2. Let {U_n} be a consistent sequence of probability weights, let {W_n} be a sequence of normal weights, and suppose that there is an M ≥ 1 such that |W_n| ≤ M U_n for all n ≥ 1. Then {W_n} is consistent.

3. Nearest neighbor weights. In this section consistent sequences of probability weights will be constructed. The weights will depend on the distances from X to X_1, …, X_n in terms of a suitable metric on R^d. The obvious metric on R^d to use is the Euclidean metric. This metric may well be appropriate if the various coordinates of X are measured in the same units, but it is most likely inappropriate otherwise.


When the individual coordinates are measured in dissimilar units, e.g., grams, centimeters, and seconds, it makes sense to transform them to be unit free before applying the Euclidean metric. Let s_n be a scale based on X_1, …, X_n, that is, a nonnegative function of the form s_{nj} = s_{nj}(X, X_1, …, X_n), 1 ≤ j ≤ d. The random (pseudo) metric ρ_n corresponding to this scale is defined by

(6) ρ_n(u, v) = (Σ_j ((u_j − v_j)/s_{nj})²)^{1/2},

where u = (u_1, …, u_d), v = (v_1, …, v_d), and the sum extends over all j, 1 ≤ j ≤ d, such that s_{nj} > 0. Let {s_n} be a sequence of scales and let {ρ_n} be the corresponding sequence of metrics determined by (6). In order to obtain a consistent sequence of weights, a number of assumptions need to be imposed on {s_n}. First, it is assumed that if 1 ≤ j ≤ d and the jth coordinate of X has a nondegenerate distribution, then lim_n P(s_{nj} > 0) = 1. Secondly, it is assumed that if 1 ≤ j, l ≤ d and the jth and lth coordinates of X both have nondegenerate distributions, then s_{nj}/s_{nl} is bounded in probability. Finally it is assumed that there are positive constants a and b ≥ a independent of n such that whenever n ≥ 1, 1 ≤ i ≤ n, 1 ≤ j ≤ d, and the jth coordinates of X_1, …, X_n do not coincide, then

(7) a s_{nj}(X_i, X_1, …, X, …, X_n) ≤ s_{nj}(X, X_1, …, X_n) ≤ b s_{nj}(X_i, X_1, …, X, …, X_n).

Here (X_i, X_1, …, X, …, X_n) denotes the sequence (X, X_1, …, X_n) with X and X_i interchanged. The last condition is obviously satisfied with a = b = 1 if s_{nj}(X, X_1, …, X_n) is a symmetric function of X, X_1, …, X_n. The condition allows for a certain amount of asymmetry. If {s_n} satisfies these assumptions it is said to be regular. If s_n ≡ 1 for all n ≥ 1, then {s_n} is obviously regular. If E‖X‖² < ∞ and s_{nj} is the sample standard deviation of the jth coordinate of X, X_1, …, X_n, then {s_n} is regular. From now on it is assumed that {s_n} is a regular sequence of scales and that {ρ_n} is the corresponding sequence of metrics. For 1 ≤ k ≤ n let I_{nk}(X) denote the collection of all indices i, 1 ≤ i ≤ n, such that fewer than k of the points X_1, …, X_n are strictly closer to X in the metric ρ_n than is X_i. Suppose, for example, that n = 4 and that ρ_4(X_3, X) < ρ_4(X_2, X) = ρ_4(X_4, X) < ρ_4(X_1, X).

Then I_{41}(X) = {3}, I_{42}(X) = I_{43}(X) = {2, 3, 4} and I_{44}(X) = {1, 2, 3, 4}. Clearly #(I_{nk}(X)) ≥ k, and #(I_{nk}(X)) = k for 1 ≤ k ≤ n if and only if the n numbers ρ_n(X_1, X), …, ρ_n(X_n, X) are distinct. The points i in I_{nk}(X) are called the k nearest neighbors of X. If W_n is a weight function such that W_{ni}(X) = 0 for i ∉ I_{nk}(X), it is called a k nearest neighbor (k-NN) weight function. Let c_{ni}, i ≥ 1, be such that c_{n1} ≥ … ≥ c_{nn} ≥ 0, c_{ni} = 0 for i > n, and c_{n1} + … + c_{nn} = 1. Associated with c_n is the probability weight function W_n defined as follows: for 1 ≤ i ≤ n,

W_{ni}(X) = (c_{nν} + … + c_{n,ν+λ−1})/λ,

where

ν = 1 + #({l : 1 ≤ l ≤ n, l ≠ i, and ρ_n(X_l, X) < ρ_n(X_i, X)})

and

λ = 1 + #({l : 1 ≤ l ≤ n, l ≠ i, and ρ_n(X_l, X) = ρ_n(X_i, X)}).

In particular, if X_i is the unique νth closest point among X_1, …, X_n to X in the metric ρ_n, then λ = 1 and hence W_{ni}(X) = c_{nν}. Since W_n ≥ 0 and Σ_i W_{ni}(X) = Σ_i c_{ni} = 1, W_n is indeed a probability weight function.

EXAMPLE 1 (uniform k-NN weight function). c_{ni} = 1/k for 1 ≤ i ≤ k and c_{ni} = 0 for i > k.

EXAMPLE 2 (triangular k-NN weight function). c_{ni} = (k − i + 1)/b_k for 1 ≤ i ≤ k and c_{ni} = 0 for i > k. Here b_k = k(k + 1)/2.

EXAMPLE 3 (quadratic k-NN weight function). c_{ni} = (k² − (i − 1)²)/b_k for 1 ≤ i ≤ k and c_{ni} = 0 for i > k. Here b_k = k(k + 1)(4k − 1)/6.

One expects Ê_n(g(Y) | X) to be a smoother function of X for triangular and quadratic k-NN weight functions than for uniform k-NN weight functions. The next result will be proven in Section 11.

THEOREM 2. For n ≥ 1 let W_n be the probability weight function corresponding to c_n. If lim_n Σ_{i>an} c_{ni} = 0 for all a > 0 and lim_n c_{n1} = 0, then {W_n} is consistent.
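A sketch of the nearest neighbor weight construction, using the ordinary Euclidean metric in place of ρ_n (the paper's scaled metric would first divide each coordinate by s_{nj}); the function names are ours:

```python
import numpy as np

def knn_weights(x, X, c):
    """Probability weights W_ni(x) = (c_v + ... + c_{v+lam-1}) / lam, where
    v - 1 sample points are strictly closer to x than X_i is, and lam counts
    X_i together with the points tied with it in distance.  The coefficients
    c must be nonincreasing, nonnegative and sum to 1."""
    c = np.asarray(c)
    d = np.linalg.norm(X - x, axis=1)
    w = np.empty(len(X))
    for i, di in enumerate(d):
        v = 1 + int(np.sum(d < di))   # one plus the number of strictly closer points
        lam = int(np.sum(d == di))    # X_i together with its distance ties
        w[i] = c[v - 1 : v - 1 + lam].sum() / lam
    return w

def triangular_c(k, n):
    """Coefficients of Example 2: c_ni = (k - i + 1)/b_k with b_k = k(k+1)/2."""
    c = np.zeros(n)
    i = np.arange(1, k + 1)
    c[:k] = (k - i + 1) / (k * (k + 1) / 2)
    return c
```

With distinct distances the k nearest neighbors of x receive the coefficients c_1, …, c_k in order of closeness; ties split the corresponding run of coefficients evenly, so the weights still sum to one.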

COROLLARY 3. For n ≥ 1, let W_n be the uniform, triangular, or quadratic k_n-NN probability weight function. If k_n → ∞ and k_n/n → 0 as n → ∞, then {W_n} is consistent.

4. Local linear weights. Assume through Section 9 that (X, Y), (X_1, Y_1), (X_2, Y_2), … is an i.i.d. sequence, where X is R^d-valued and Y is R^{d'}-valued. In this section it is assumed that d' = 1, so that Y is real valued. Let U_n be a probability weight function. A related weight function, corresponding to a different method for estimating E(Y | X), will now be constructed. Choose â_n ∈ R and b̂_n ∈ R^d to be values of a and b which minimize Σ_i U_{ni}(X)(Y_i − a − b · X_i)²,

and set Ê_n(Y | X) = â_n + b̂_n · X, where · denotes the usual inner product on R^d. This local linear regression estimator, in effect, uses weighted least squares, with the ith case having weight U_{ni}(X), to fit a linear regression function to the data and then evaluates this function at X. It can be written in the form Ê_n(Y | X) = Σ_i V_{ni}(X) Y_i, where

V_{ni}(X) = U_{ni}(X)(1 + (X − X̄) · C^{-1}(X)(X_i − X̄)).

Here

X̄ = Σ_i U_{ni}(X) X_i,  C_{lm}(X) = Σ_i U_{ni}(X)(X_{il} − X̄_l)(X_{im} − X̄_m),  1 ≤ l, m ≤ d,

and, for simplicity, the matrix (C_{lm}(X)) is assumed to be nonsingular with probability one (in implementing this procedure, a "tolerance" is used to avoid pivoting on small elements). The weight function V_n is called the (untrimmed) local linear weight function corresponding to U_n. It is normal but generally not a probability weight function.

Let {U_n} be a consistent sequence of probability weights. The corresponding sequence {V_n} of local linear weights is not necessarily consistent. It will now be shown how to trim V_n to obtain consistency. Choose A ≤ 1 and B ≥ 1 and set W_n^{(1)} = (V_n ∨ AU_n) ∧ BU_n. Then AU_n ≤ W_n^{(1)} ≤ BU_n. Now W_n^{(1)} is not necessarily normal. To guarantee normality one more trimming is necessary: if Σ_i W_{ni}^{(1)}(X) < 1, set

W_{ni}(X) = W_{ni}^{(1)}(X) ∨ (A_n(X) U_{ni}(X))  for i ≥ 1,

where A_n(X) ∈ (A, 1] is chosen so that Σ_i W_{ni}(X) = 1; if Σ_i W_{ni}^{(1)}(X) > 1, set

W_{ni}(X) = W_{ni}^{(1)}(X) ∧ (B_n(X) U_{ni}(X))  for i ≥ 1,

where B_n(X) ∈ [1, B) is chosen so that Σ_i W_{ni}(X) = 1; and if Σ_i W_{ni}^{(1)}(X) = 1, set W_{ni}(X) = W_{ni}^{(1)}(X) for i ≥ 1. The weight function W_n so defined is called the trimmed local linear weight function corresponding to U_n and the parameters A and B. By construction, W_n is normal and AU_n ≤ W_n ≤ BU_n. If A ≥ 0, then W_n is a probability weight function. If U_n is a k_n-NN weight function, then so is W_n. The following result follows immediately from Corollary 2.

COROLLARY 4. Let {U_n} be a consistent sequence of probability weights, let A ≤ 1 and B ≥ 1, and let {W_n} be the corresponding sequence of trimmed local linear weight functions. Then {W_n} is consistent.
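The untrimmed local linear estimate Ê_n(Y | X = x) = â_n + b̂_n · x can be computed directly as a weighted least squares fit evaluated at x. A sketch, assuming (as in the text) that the weighted normal equations are nonsingular; in practice a tolerance would guard the solve, and the function names are ours:

```python
import numpy as np

def local_linear_estimate(x, X, Y, U):
    """Fit Y_i = a + b . X_i by least squares with case weights U_ni(x),
    then evaluate the fitted line at x: returns a_hat + b_hat . x."""
    u = U(x, X)                               # probability weights, shape (n,)
    Z = np.hstack([np.ones((len(X), 1)), X])  # design matrix with rows (1, X_i)
    ZW = Z.T * u                              # Z' diag(u)
    beta = np.linalg.solve(ZW @ Z, ZW @ Y)    # (a_hat, b_hat)
    return float(beta[0] + beta[1:] @ x)
```

One consequence worth noting: if the data are exactly linear, Y_i = a + b · X_i, the weighted fit reproduces the line and the estimate is exact at every x, whatever nondegenerate weights are used. This is the sense in which local linear weights can exploit a linear trend that plain probability weights would only smooth.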

It follows from (2) and (3) that

lim_n E Σ_i |W_{ni}(X)| I_{{‖X_i − X‖ > a}} = 0.

Thus lim sup_n E Σ_i |W_{ni}(X)| |f(X_i) − f(X)|^r is bounded by a quantity which, by the choice of a, can be made arbitrarily small, so the conclusion of Proposition 1 holds.

PROPOSITION 2. Let {W_n} be a sequence of nonnegative weights. Suppose that (1)-(3) hold and that there are sequences {M_n} and {N_n} of nonnegative constants such that lim_n P(M_n ≤ Σ_i W_{ni}(X) ≤ N_n) = 1. Let f be a nonnegative Borel function on R^d such that Ef(X) < ∞. Then

lim inf_n E Σ_i W_{ni}(X) f(X_i) ≥ (lim inf_n M_n) Ef(X),

as desired.

PROPOSITION 11. Let W_n be the probability weight function corresponding to c_n and let a and b ≥ a satisfy (7). If f is a nonnegative Borel function on R^d such that Ef(X) < ∞, then E Σ_i W_{ni}(X) f(X_i) ≤ β(d, a/b) Ef(X).

PROOF. Now W_{ni}(X) = W_{ni}(X, X_1, …, X_n), where X, X_1, …, X_n are i.i.d. Thus X and X_i can be interchanged to obtain

E(W_{ni}(X, X_1, …, X_i, …, X_n) f(X_i)) = E(W_{ni}(X_i, X_1, …, X, …, X_n) f(X))  for 1 ≤ i ≤ n.

Set U_{ni}(X) = U_{ni}(X, X_1, …, X_n) = W_{ni}(X_i, X_1, …, X, …, X_n) for 1 ≤ i ≤ n and U_{ni}(X) = 0 for i > n. Then E(W_{ni}(X) f(X_i)) = E(U_{ni}(X) f(X)) and hence

(17) E Σ_i W_{ni}(X) f(X_i) = E((Σ_i U_{ni}(X)) f(X)).

Proposition 11 follows immediately from (17) and the next result.

PROPOSITION 12. Σ_i U_{ni}(X) ≤ β(d, a/b).

PROOF. Think of X, X_1, …, X_n as fixed points in R^d. Write ρ_n = ρ_{n,X,X_1,…,X_n} and set ρ_{ni} = ρ_{n,X_i,X_1,…,X,…,X_n} for 1 ≤ i ≤ n. It follows from the definitions of W_n and U_n that U_{ni}(X) = (c_{nν} + … + c_{n,ν+λ−1})/λ, where

ν = 1 + #({l : 1 ≤ l ≤ n, l ≠ i, and ρ_{ni}(X_l, X_i) < ρ_{ni}(X, X_i)})

and

λ = 1 + #({l : 1 ≤ l ≤ n, l ≠ i, and ρ_{ni}(X_l, X_i) = ρ_{ni}(X, X_i)}).

Assume first that (7) holds and that s_{nj} > 0 for 1 ≤ j ≤ d. Set I_0 = {i : 1 ≤ i ≤ n and X_i = X} and t = #(I_0). If i ∈ I_0, then ν = 1 and λ = t, so that U_{ni}(X) = (c_{n1} + … + c_{nt})/t. Thus

(18) Σ_{i∈I_0} U_{ni}(X) = Σ_{i=1}^t c_{ni}.

For 1 ≤ i ≤ n and 1 ≤ j ≤ d define s_{nji} by s_{nji} = s_{nj}(X_i, X_1, …, X, …, X_n). Consider the transformations T, T_1, …, T_n from R^d to itself defined as follows:

(Tu)_j = u_j / s_{nj}  and  (T_i u)_j = u_j / s_{nji}  for 1 ≤ j ≤ d,

where u = (u_1, …, u_d). Observe that ρ_n(u, v) = ‖Tu − Tv‖ and ρ_{ni}(u, v) = ‖T_i u − T_i v‖. Observe also that (T_i u)_j = b_{ji}(Tu)_j, where b_{ji} = s_{nj}/s_{nji}. It follows from (7) that a ≤ b_{ji} ≤ b for 1 ≤ j ≤ d. Choose V ∈ 𝒱(d, a/b). Set

I = {i : 1 ≤ i ≤ n, X_i ≠ X, and TX_i − TX ∈ V}

and p = #(I). Then I = {i_1, …, i_p}, where 0 < ‖TX_{i_1} − TX‖ ≤ … ≤ ‖TX_{i_p} − TX‖. Let 1 ≤ q < r ≤ p. Then TX_{i_q} − TX ∈ V, TX_{i_r} − TX ∈ V, and 0 < ‖TX_{i_q} − TX‖ ≤ ‖TX_{i_r} − TX‖. It now follows from Proposition 10 that ‖T_{i_r} X_{i_r} − T_{i_r} X_{i_q}‖ < ‖T_{i_r} X_{i_r} − T_{i_r} X‖, or equivalently that ρ_{n i_r}(X_{i_q}, X_{i_r}) < ρ_{n i_r}(X, X_{i_r}). Thus U_{n i_r}(X) = (c_{nν} + … + c_{n,ν+λ−1})/λ, where ν ≥ r and λ ≥ t + 1. Since c_{n1} ≥ c_{n2} ≥ …, it follows that U_{n i_r}(X) ≤ (c_{nr} + … + c_{n,r+t})/(t + 1) and hence that

(19) Σ_{i∈I} U_{ni}(X) ≤ Σ_{r=1}^p (c_{nr} + … + c_{n,r+t})/(t + 1).

Since R^d can be covered by β(d, a/b) elements V ∈ 𝒱(d, a/b), it now follows that

Σ_{i∉I_0} U_{ni}(X) ≤ β(d, a/b) Σ_{r≥1} (c_{nr} + … + c_{n,r+t})/(t + 1).

It follows from (18) and (19) and elementary algebra that

To verify the inequality of Proposition 12, it is necessary to show that the right side of the above inequality is bounded above by β(d, a/b). By elementary algebra and the formula Σ_{i=1}^n c_{ni} = 1, this reduces to showing that

Σ_{i=1}^t [β(d, a/b)(t + 1 − i) − (t + 1)] c_{ni} ≥ 0.

The last inequality follows easily from the observation that c_{n1} ≥ … ≥ c_{nn} ≥ 0, β(d, a/b) ≥ 2, and Σ_{i=1}^t [2(t + 1 − i) − (t + 1)] = 0. This shows that the inequality of Proposition 12 is valid whenever (7) holds and s_{nj} > 0 for 1 ≤ j ≤ d.

Consider now the general case. Let J denote the collection of all j such that 1 ≤ j ≤ d, s_{nj} > 0, and the jth coordinates of X_1, …, X_n do not coincide. Set d̄ = #(J). Suppose first that d̄ = 0. Then ρ_{ni}(X_l, X_i) = 0 for 1 ≤ i, l ≤ n. It follows easily that U_{ni}(X) = c_{nn} if ρ_{ni}(X, X_i) > 0 and U_{ni}(X) = 1/n if ρ_{ni}(X, X_i) = 0. In any case U_{ni}(X) ≤ 1/n for 1 ≤ i ≤ n and hence Σ_i U_{ni}(X) ≤ 1 < β(d, a/b). Suppose next that d̄ > 0. Let ρ̄_{ni} be the pseudometric obtained by setting

where u = (u_1, …, u_d) and v = (v_1, …, v_d). Note that ρ̄_{ni}(X_l, X_i) = ρ_{ni}(X_l, X_i) and ρ̄_{ni}(X, X_i) ≤ ρ_{ni}(X, X_i) for 1 ≤ i, l ≤ n. Set

ν̄ = 1 + #({l : 1 ≤ l ≤ n, l ≠ i, and ρ̄_{ni}(X_l, X_i) < ρ̄_{ni}(X, X_i)})

and

λ̄ = 1 + #({l : 1 ≤ l ≤ n, l ≠ i, and ρ̄_{ni}(X_l, X_i) = ρ̄_{ni}(X, X_i)}).

Then ν̄ ≤ ν and ν̄ + λ̄ ≤ ν + λ. Set Ū_{ni}(X) = (c_{nν̄} + … + c_{n,ν̄+λ̄−1})/λ̄. Then U_{ni}(X) ≤ Ū_{ni}(X) for 1 ≤ i ≤ n. Thus Σ_i U_{ni}(X) ≤ Σ_i Ū_{ni}(X) ≤ β(d̄, a/b) ≤ β(d, a/b).

Thus Proposition 12 holds in general, and hence Proposition 11 is valid.

PROOF OF THEOREM 2. Let W_n be the probability weight function corresponding to c_n and suppose that lim_n Σ_{i>an} c_{ni} = 0 for all a > 0. Proposition 11 implies that the first condition of Corollary 1 holds. To show that the second condition of Corollary 1 holds, choose a > 0 and ε > 0. For given α > 0 let A_n denote the event that fewer than αn of the points X_1, …, X_n lie within distance a of X. It follows from Proposition 9 that α can be chosen so that lim sup_n P(A_n) ≤ ε. Now

Σ_i W_{ni}(X) I_{{‖X_i − X‖ > a}} ≤ Σ_{i>αn} c_{ni} + I_{A_n}.

Since ε can be made arbitrarily small, the second condition of Corollary 1 is valid. Suppose also that c_{n1} → 0. Since max_i W_{ni}(X) ≤ c_{n1}, the third condition of Corollary 1 holds and hence {W_n} is consistent.
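As a concrete check of the hypotheses of Theorem 2 (and of Corollary 3), the sketch below evaluates c_{n1} and the tail sum Σ_{i>an} c_{ni} for uniform k_n-NN coefficients; the choice k_n = ⌈√n⌉ is ours, for illustration:

```python
import math

def uniform_c(n, k):
    """Coefficients c_ni of the uniform k-NN weight function (Example 1)."""
    return [1.0 / k if i < k else 0.0 for i in range(n)]

def tail_mass(c, a):
    """Sum of c_ni over i > a*n, which Theorem 2 requires to vanish."""
    n = len(c)
    return sum(c[i] for i in range(n) if i + 1 > a * n)

for n in [100, 10_000]:
    k = math.ceil(math.sqrt(n))        # k_n -> infinity while k_n / n -> 0
    c = uniform_c(n, k)
    print(n, c[0], tail_mass(c, 0.1))  # c_n1 -> 0 and the tail sum vanishes
```

For k_n = ⌈√n⌉ the tail sum is exactly 0 once k_n ≤ an, and c_{n1} = 1/k_n → 0, so both conditions of Theorem 2 hold, in line with Corollary 3.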

12. Proof of Theorem 3. In this section Theorem 3 of Section 7 will be proven using the notation of that section. Also, in various proofs, the abbreviated notations L(X) for L^Y(p | X), L_n(X) for L_n^Y(p | X), etc., will be used.

PROPOSITION 13. Let {W_n} be a consistent sequence of probability weights and let 0 < p < 1. Then for every ε > 0,

lim_n P(L_n^Y(p | X) ≥ L^Y(p | X) − ε) = 1  and  lim_n P(U_n^Y(p | X) ≤ U^Y(p | X) + ε) = 1.

PROOF. Only the first result will be proven, the proof of the second result being similar. Define the function f on R^d by f(x) = P(Y ≤ L(x) − ε/2 | X = x). Then 0 ≤ f < p. It follows from the consistency of {W_n} that

lim_n E|Σ_i W_{ni}(X) I_{{Y_i ≤ L(X_i) − ε/2}} − f(X)| = 0

and hence that

(20) lim_n P(Σ_i W_{ni}(X) I_{{Y_i ≤ L(X_i) − ε/2}} < p) = 1.

It follows from Proposition 4 that

(21) Σ_i W_{ni}(X) I_{{L(X_i) ≤ L(X) − ε/2}} → 0 in probability.

Equations (20) and (21) together imply that

lim_n P(Σ_i W_{ni}(X) I_{{Y_i ≤ L(X) − ε}} < p) = 1

and hence that lim_n P(L_n(X) ≥ L(X) − ε) = 1. Thus the first result of Proposition 13 is valid, as desired.

PROPOSITION 14. Let 0 < p < 1 and r ≥ 1. Then

E|L^Y(p | X)|^r ≤ E|Y|^r / (p ∧ (1 − p))  and  E|U^Y(p | X)|^r ≤ E|Y|^r / (p ∧ (1 − p)).

PROOF. Only the first result will be proven, the proof of the second result being similar. If L(X) ≤ 0, then E(|Y|^r | X) ≥ p|L(X)|^r. If L(X) ≥ 0, then E(|Y|^r | X) ≥ (1 − p)|L(X)|^r. Thus in general E(|Y|^r | X) ≥ (p ∧ (1 − p))|L(X)|^r and hence E|L(X)|^r ≤ E|Y|^r / (p ∧ (1 − p)), as desired.


PROPOSITION 15. Let W_n be a probability weight function satisfying (1) and let 0 < p < 1 and M > 0. Then a uniform (in n) moment bound holds for L_n^Y(p | X), and the same inequality holds with L_n^Y(p | X) replaced by U_n^Y(p | X).

PROOF. It is easily seen that |L_n(X)| is bounded by a quantity to which (1) applies, and the stated bound then follows from (1). The same argument works if L_n(X) is replaced by U_n(X).

PROOF OF THEOREM 3. Let {W_n} be a consistent sequence of probability weights and let 0 < p < 1. It follows from Proposition 13 that (9) and (10) hold. It now follows from Propositions 14 and 15 that if r ≥ 1 and E|Y|^r < ∞, then in (9) and (10) convergence in probability can be replaced by convergence in L^r. This completes the proof of Theorem 3.

13. Proof of Theorem 4. When applied to Model 1, Theorem 4 follows immediately from the consistency of {W_n} and the formula for the Bayes risk of θ̂_n given in the discussion of Model 1. When applied to Model 2, Theorem 4 follows immediately from Theorem 3 and the inequality

Consider now Model 3 and let {W_n} be a consistent sequence of weights. Set

e_n(X) = max_y |Ê_n(ρ(y, Y) | X) − E(ρ(y, Y) | X)| = max_y |P̂_n^Y({y} | X) − P^Y({y} | X)|.

It will be shown that

(22) e_n(X) → 0 in L¹.

Observe that

E(ρ(Y, θ̂_n(X)) | X) ≤ Ê_n(ρ(Y, θ̂_n(X)) | X) + e_n(X) ≤ Ê_n(ρ(Y, θ(X)) | X) + e_n(X) ≤ E(ρ(Y, θ(X)) | X) + 2e_n(X)

and hence that

Eρ(Y, θ̂_n(X)) ≤ R + 2Ee_n(X).

Thus it follows from (22) that {θ̂_n} is consistent in Bayes risk. In the important special case that the distribution of Y has finite support, (22) follows immediately from the consistency of {W_n}. To prove the result in

general, set A_0 = {y : P(Y = y) = 0}. Then with probability one, no value of y ∈ A_0 occurs more than once among Y_1, Y_2, …, and hence

max_{y∈A_0} P̂_n^Y({y} | X) ≤ max_i W_{ni}(X).

It now follows from (2) and (5) that max_{y∈A_0} |P̂_n^Y({y} | X) − P^Y({y} | X)| → 0 in L¹. Set A_1 = {y : P(Y = y) > 0}. Choose ε > 0 and let A_2 and A_3 be disjoint sets whose union is A_1 and such that A_2 is finite and P(Y ∈ A_3) ≤ ε. It follows from the consistency of {W_n} that

lim_n E max_{y∈A_2} |P̂_n^Y({y} | X) − P^Y({y} | X)| = 0.

Clearly E max_{y∈A_3} P^Y({y} | X) ≤ ε, and it follows from (1) that E max_{y∈A_3} P̂_n^Y({y} | X) ≤ Cε. Since ε can be made arbitrarily small, (22) follows from the last four displayed results. This completes the proof of Theorem 4.

REFERENCES

[1] ADICHIE, J. N. (1967). Estimates of regression parameters based on rank tests. Ann. Math. Statist. 38 894-904.
[2] AIZERMAN, M. A., BRAVERMAN, E. and ROZONOER, L. (1970). Extrapolative problems in automatic control and the method of potential functions. Amer. Math. Soc. Transl. 87 281-303.
[3] BEATON, A. E. and TUKEY, J. W. (1974). The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics 16 147-192.
[4] BENEDETTI, J. (1974). Kernel estimation of regression functions. Ph.D. dissertation, Univ. of Washington.

[5] BENEDETTI, J. (1975). Kernel estimation of regression functions. Proc. of Computer Science and Statistics: 8th Annual Symposium on the Interface 405-408. Health Sciences Computing Facility, UCLA.
[6] BICKEL, P. J. (1973). On some analogues to linear combinations of order statistics in the linear model. Ann. Statist. 1 597-616.
[7] BREIMAN, L. and MEISEL, W. S. (1976). General estimates of the intrinsic variability of data in nonlinear regression models. J. Amer. Statist. Assoc. 71 301-307.
[8] BUTLER, G. A. (1975). Heuristic regression for large commercial problems. Proc. of Computer Science and Statistics: 8th Annual Symposium on the Interface 398-404. Health Sciences Computing Facility, UCLA.
[9] CLEVELAND, W. S. and KLEINER, B. (1975). A graphical technique for enhancing scatterplots with moving statistics. Technometrics 17 447-454.
[10] COVER, T. M. (1968). Estimation by the nearest neighbor rule. IEEE Trans. Information Theory IT-14 50-55.
[11] COVER, T. M. and HART, P. E. (1967). Nearest neighbor pattern classification. IEEE Trans. Information Theory IT-13 21-27.
[12] FERGUSON, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York.
[13] FISHER, L. and YAKOWITZ, S. (1976). Uniform convergence of the potential function algorithm. SIAM J. Control 14 95-103.


[14] FIX, E. and HODGES, J. L., JR. (1951). Discriminatory analysis, nonparametric discrimination, consistency properties. Randolph Field, Texas, Project 21-49-004, Report No. 4.
[15] FRIEDMAN, J. H. (1976). A variable metric decision rule for nonparametric classification. IEEE Trans. Comput., to appear.
[16] FRITZ, J. (1975). Distribution-free exponential error bound for nearest neighbor pattern classification. IEEE Trans. Information Theory IT-21 552-557.
[17] GORDON, L. and OLSHEN, R. A. (1975). Asymptotically efficient, computationally feasible solutions to the classification problem. Unpublished manuscript.
[18] JAECKEL, L. A. (1972). Estimating regression coefficients by minimizing the dispersion of the residuals. Ann. Math. Statist. 43 1449-1458.
[19] JUREČKOVÁ, J. (1971). Nonparametric estimate of regression coefficients. Ann. Math. Statist. 42 1328-1338.
[20] LIGGETT, T. M. (1977). Extensions of the Erdős-Ko-Rado theorem and a statistical application. J. Combinatorial Theory (A) 22, No. 3.
[21] MAJOR, P. (1973). On non-parametric estimation of the regression function. Studia Sci. Math. Hungar. 8 347-361.
[22] MORGAN, J. N. and SONQUIST, J. A. (1963). Problems in the analysis of survey data, and a proposal. J. Amer. Statist. Assoc. 58 415-434.
[23] NADARAYA, E. A. (1964). On estimating regression. Theor. Probability Appl. 9 141-142.
[24] NADARAYA, E. A. (1970). Remarks on nonparametric estimates for density functions and regression curves. Theor. Probability Appl. 15 134-137.
[25] PARZEN, E. (1962). On estimation of a probability density and mode. Ann. Math. Statist. 33 1065-1076.
[26] PRIESTLEY, M. B. and CHAO, M. T. (1972). Non-parametric function fitting. J. Roy. Statist. Soc. Ser. B 34 385-392.
[27] RAMAN, S. (1971). Contribution to the theory of Fourier estimation of multivariate probability density functions with application to data on bone age determinations. Ph.D. dissertation, Univ. of California, Berkeley.
[28] RÉVÉSZ, P. (1973). Robbins-Monro procedure in a Hilbert space and its application in the theory of learning processes. I. Studia Sci. Math. Hungar. 8 391-398.
[29] ROSENBLATT, M. (1956). Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27 832-837.
[30] ROSENBLATT, M. (1969). Conditional probability density and regression estimators. Multivariate Analysis II 25-31. Academic Press, New York.
[31] ROYALL, R. M. (1966). A class of nonparametric estimators of a smooth regression function. Ph.D. dissertation, Stanford Univ.
[32] SCHUSTER, E. F. (1968). Estimation of a probability density function with applications in statistical inference. Ph.D. dissertation, Univ. of Arizona.
[33] SCHUSTER, E. F. (1972). Joint asymptotic distribution of the estimated regression function at a finite number of distinct points. Ann. Math. Statist. 43 84-88.
[34] SONQUIST, J. A. and MORGAN, J. N. (1964). The Detection of Interaction Effects. Survey Research Center Monograph No. 35, Institute for Social Research, Univ. of Michigan, Ann Arbor.
[35] STONE, C. J. (1975). Nearest neighbor estimators of a nonlinear regression function. Proc. of Computer Science and Statistics: 8th Annual Symposium on the Interface 413-418. Health Sciences Computing Facility, UCLA.
[36] VAN RYZIN, J. (1966). Bayes risk consistency of classification procedures using density estimation. Sankhyā Ser. A 28 261-270.
[37] WAHBA, G. and WOLD, S. (1975). A completely automatic French curve: Fitting spline functions by cross-validation. Comm. Statist. 6 1-17.
[38] WATSON, G. S. (1964). Smooth regression analysis. Sankhyā Ser. A 26 359-372.
[39] WOLD, S. (1974). Spline functions in data analysis. Technometrics 16 1-11.

[40] YAKOWITZ, S. and FISHER, L. (1975). Experiments and developments on the method of potential functions. Proc. of Computer Science and Statistics: 8th Annual Symposium on the Interface 419-423. Health Sciences Computing Facility, UCLA.

DEPARTMENT OF MATHEMATICS
UNIVERSITY OF CALIFORNIA
LOS ANGELES, CALIFORNIA 90024

DISCUSSION

PETER J. BICKEL
University of California at Berkeley

As Professor Stone has pointed out, over the years a large variety of methods have been proposed for the estimation of various features of the conditional distributions of Y given X on the basis of a sample (X_1, Y_1), …, (X_n, Y_n). The asymptotic consistency of these methods has always been subject to a load of regularity conditions. In this elegant paper, Professor Stone has given a unified treatment of consistency under what seem to be natural necessary as well as sufficient conditions. His work really reveals the essentials of the problem. He has been able to do this by defining the notion of consistency properly from a mathematical point of view in terms of L^r convergence. However, the notions of convergence that would seem most interesting practically are pointwise notions. An example is uniform convergence on (x, y) compacts of the conditional density of Y given X = x. The study of this convergence necessarily involves more regularity conditions. At the very least there must be a natural, unique choice of the conditional density. However, such a study and its successors (studies of speed of asymptotic convergence, asymptotic normality of the estimates of the density at a point, asymptotic behavior of the maximum deviation of the estimated density from its limit (see [1] for the marginal case), etc.) would seem necessary to me, and to Professor Stone too! (He informed me, when I raised this question at a lecture he recently gave in Berkeley, that a student of his had started work on such questions.) One important question that could be approached by such a study is, how much is lost by using a nonparametric method over an efficient parametric one? If density estimation is a guide, the efficiency would be 0 at the parametric model for any of the nonparametric methods surveyed by Professor Stone.
However, even if this is the case, it seems clear that one can construct methods which are asymptotically efficient under any given parametric model and are generally consistent in Stone's sense. This could be done by forming a convex combination of the best parametric and a nonparametric estimate, with weights