Journal of Machine Learning Research 16 (2015) 1863-1877
Submitted 6/15; Published 9/15
On the Asymptotic Normality of an Estimate of a Regression Functional

László Györfi∗  [email protected]
Department of Computer Science and Information Theory
Budapest University of Technology and Economics
Magyar Tudósok körútja 2., H-1117 Budapest, Hungary

Harro Walk  [email protected]
Department of Mathematics
University of Stuttgart
Pfaffenwaldring 57, D-70569 Stuttgart, Germany
Editors: Alex Gammerman and Vladimir Vovk
Abstract

An estimate of the second moment of the regression function is introduced. Its asymptotic normality is proved such that the asymptotic variance depends neither on the dimension of the observation vector, nor on the smoothness properties of the regression function. The asymptotic variance is given explicitly.

Keywords: nonparametric estimation, regression functional, central limit theorem, partitioning estimate
1. Introduction

This paper considers a histogram-based estimate of the second moment of the regression function in multivariate problems. The interest in the second moment is motivated by the fact that by estimating it one obtains an estimate of the best achievable mean squared error, a quantity of obvious statistical interest. It is shown that the estimate is asymptotically normally distributed. It is remarkable that the asymptotic variance depends only on moments of the regression function, but neither on its smoothness nor on the dimension of the space. The proof relies on a Poissonization technique that has been used successfully in related problems.

Let $Y$ be a real valued random variable with $\mathbf{E}\{Y^2\}<\infty$ and let $X=(X^{(1)},\dots,X^{(d)})$ be a $d$-dimensional random observation vector. In regression analysis one wishes to estimate $Y$ given $X$, i.e., one wants to find a function $g$ defined on the range of $X$ so that $g(X)$ is "close" to $Y$. Assume that the main aim of the analysis is to minimize the mean squared error:
\[
\min_g \mathbf{E}\{(g(X)-Y)^2\}.
\]
∗. This research has been partially supported by the European Union and Hungary and co-financed by the European Social Fund through the project TÁMOP-4.2.2.C-11/1/KONV-2012-0004 - National Research Center for Development and Market Introduction of Advanced Information and Communication Technologies.

©2015 László Györfi and Harro Walk.
As is well known, this minimum is achieved by the regression function $m(x)$, which is defined by
\[
m(x) = \mathbf{E}\{Y \mid X = x\}. \tag{1}
\]
For each measurable function $g$ one has
\[
\mathbf{E}\{(g(X)-Y)^2\}
= \mathbf{E}\{(m(X)-Y)^2\} + \mathbf{E}\{(m(X)-g(X))^2\}
= \mathbf{E}\{(m(X)-Y)^2\} + \int |m(x)-g(x)|^2\,\mu(dx),
\]
where $\mu$ stands for the distribution of the observation $X$. It is of great importance to be able to estimate the minimum mean squared error
\[
L^* = \mathbf{E}\{(m(X)-Y)^2\}
\]
accurately, even before a regression estimate is applied: in a standard nonparametric regression design process, one considers a finite number of real-valued features $X^{(i)}$, $i\in I$, and evaluates whether these suffice to explain $Y$. In case they suffice for the given explanatory task, an estimation method can be applied on the basis of the features already under consideration; if not, more or different features must be considered. The quality of a subvector $\{X^{(i)}, i\in I\}$ of $X$ is measured by the minimum mean squared error
\[
L^*(I) := \mathbf{E}\Big\{\big(Y - \mathbf{E}\{Y \mid X^{(i)}, i\in I\}\big)^2\Big\}
\]
that can be achieved using these features as explanatory variables. $L^*(I)$ depends upon the unknown distribution of $(Y, X^{(i)}: i\in I)$. The first phase of any regression estimation process therefore heavily relies on estimates of $L^*$ (even before a regression estimate is picked). Concerning dimension reduction, the related testing problem is on the hypothesis $L^* = L^*(I)$. This testing problem can be managed such that we estimate both $L^*$ and $L^*(I)$, and accept the hypothesis if the two estimates are close to each other. (Cf. De Brabanter et al. (2014).)

Devroye et al. (2003), Evans and Jones (2008), Liitiäinen et al. (2008), Liitiäinen et al. (2009), Liitiäinen et al. (2010), and Ferrario and Walk (2012) introduced nearest neighbor based estimates of $L^*$, proved strong universal consistency and calculated the (fast) rate of convergence. Because of
\[
L^* = \mathbf{E}\{Y^2\} - \mathbf{E}\{m(X)^2\}
\]
and $\mathbf{E}\{Y^2\}<\infty$, estimating $L^*$ is equivalent to estimating the second moment $S^*$ of the regression function:
\[
S^* = \mathbf{E}\{m(X)^2\} = \int m(x)^2\,\mu(dx).
\]
In this paper we introduce a partitioning based estimator of $S^*$ and show its asymptotic normality. It turns out that the asymptotic variance depends neither on the dimension of the observation vector, nor on the smoothness properties of the regression function. The asymptotic variance is given explicitly.
2. A Splitting Estimate

We suppose that the regression estimation problem is based on a sequence $(X_1,Y_1), (X_2,Y_2), \dots$ of i.i.d. random vectors distributed as $(X,Y)$. Let $\mathcal{P}_n = \{A_{n,j}, j=1,2,\dots\}$ be a cubic partition of $\mathbb{R}^d$ of size $h_n>0$. The partitioning estimator of the regression function $m$ is defined as
\[
m_n(x) = \frac{\nu_n(A_{n,j})}{\mu_n(A_{n,j})} \quad \text{if } x\in A_{n,j} \tag{2}
\]
(interpreting $0/0=0$) with
\[
\nu_n(A) = \frac{1}{n}\sum_{i=1}^n I_{\{X_i\in A\}} Y_i
\quad\text{and}\quad
\mu_n(A) = \frac{1}{n}\sum_{i=1}^n I_{\{X_i\in A\}}.
\]
(Here $I$ denotes the indicator function.) If for the cubic partition
\[
h_n \to 0 \quad\text{and}\quad n h_n^d \to \infty \tag{3}
\]
as $n\to\infty$, then the partitioning regression estimate (2) is weakly universally consistent, which means that
\[
\lim_{n\to\infty} \mathbf{E}\int (m_n(x)-m(x))^2\,\mu(dx) = 0 \tag{4}
\]
for any distribution of $(X,Y)$ with $\mathbf{E}\{Y^2\}<\infty$, and for bounded $Y$ it holds that
\[
\lim_{n\to\infty} \int (m_n(x)-m(x))^2\,\mu(dx) = 0
\]
a.s. (Cf. Theorems 4.2 and 23.1 in Györfi et al. (2002).)

Assume splitting data $\mathcal{Z}_n = \{(X_1,Y_1),\dots,(X_n,Y_n)\}$ and $\mathcal{D}'_n = \{(X'_1,Y'_1),\dots,(X'_n,Y'_n)\}$ such that $(X_1,Y_1),\dots,(X_n,Y_n),(X'_1,Y'_1),\dots,(X'_n,Y'_n)$ are i.i.d. The splitting data estimate of $S^*$ is defined as
\[
S_n := \frac{1}{n}\sum_{i=1}^n Y'_i\, m_n(X'_i)
= \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^{\infty} I_{\{X'_i\in A_{n,j}\}}\, Y'_i\,\frac{\nu_n(A_{n,j})}{\mu_n(A_{n,j})}. \tag{5}
\]
Put
\[
\nu'_n(A) = \frac{1}{n}\sum_{i=1}^n I_{\{X'_i\in A\}} Y'_i,
\]
then $S_n$ has the equivalent form
\[
S_n = \sum_{j=1}^{\infty} \nu'_n(A_{n,j})\,\frac{\nu_n(A_{n,j})}{\mu_n(A_{n,j})}. \tag{6}
\]
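To make the estimator concrete, the following is a minimal Python sketch (ours, not from the paper) of the splitting estimate (6) on a cubic partition. The synthetic data model, the cell-width choice and the helper name `splitting_estimate` are illustrative assumptions only.

```python
import numpy as np

def splitting_estimate(X, Y, X_prime, Y_prime, h):
    """Splitting estimate S_n of E{m(X)^2} on a cubic partition of cell width h.

    (X, Y): first half of the sample, used to build the partitioning estimate m_n.
    (X_prime, Y_prime): independent second half, used to average Y'_i * m_n(X'_i).
    """
    # Assign every point to the cubic cell containing it (cells indexed by integer tuples).
    cells = [tuple(c) for c in np.floor(X / h).astype(int)]
    cells_prime = [tuple(c) for c in np.floor(X_prime / h).astype(int)]

    # Per-cell sums over the first half: n*nu_n(A) and n*mu_n(A).
    nu, mu = {}, {}
    for c, y in zip(cells, Y):
        nu[c] = nu.get(c, 0.0) + y
        mu[c] = mu.get(c, 0) + 1

    # m_n(x) = nu_n(A)/mu_n(A) on the cell A containing x, with 0/0 interpreted as 0.
    m_n = np.array([nu[c] / mu[c] if c in mu else 0.0 for c in cells_prime])

    # S_n = (1/n) sum_i Y'_i m_n(X'_i), cf. (5) and (6).
    return float(np.mean(Y_prime * m_n))

# Illustrative usage on synthetic data (not from the paper).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 5000, 2
    X, Xp = rng.uniform(0, 1, (n, d)), rng.uniform(0, 1, (n, d))
    m = lambda x: np.sin(2 * np.pi * x[:, 0]) + x[:, 1]   # illustrative regression function
    Y, Yp = m(X) + rng.normal(size=n), m(Xp) + rng.normal(size=n)
    h_n = n ** (-1.0 / (d + 2))                           # one choice with h_n -> 0, n*h_n^d -> infinity
    print("S_n =", splitting_estimate(X, Y, Xp, Yp, h_n))
```

The bandwidth $h_n = n^{-1/(d+2)}$ used above is only one possible choice satisfying condition (3).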
Theorem 1 Assume (3) and that $\mu$ is non-atomic and has bounded support. Suppose that there is a finite constant $C$ such that
\[
\mathbf{E}\{|Y|^3 \mid X\} < C. \tag{7}
\]
Then
\[
\sqrt{n}\,(S_n - \mathbf{E}\{S_n\})/\sigma \stackrel{D}{\to} N(0,1),
\]
where
\[
\sigma^2 = 2\int M_2(x)\,m(x)^2\,\mu(dx) - \left(\int m(x)^2\,\mu(dx)\right)^2 - \int m(x)^4\,\mu(dx),
\]
with $M_2(X) = \mathbf{E}\{Y^2\mid X\}$.

The estimation problem is motivated by the dimension reduction task mentioned above: one estimates $S^*$ both for the original observation vector and for the observation vector with some components left out. If the two estimates are "close" to each other, then we decide that the omitted components are ineffective. Theorem 1 concerns the random part of the estimates. Therefore there is a further need to study the difference of the biases of the estimates. Under (3) we have
\[
\lim_{n\to\infty} \mathbf{E}\{S_n\} = S^*
\]
and for Lipschitz continuous $m$ the rate of convergence can be of order $n^{-1/d}$ for a suitable choice of $h_n$. (Cf. Devroye et al. (2013).) Similarly to De Brabanter et al. (2014), we conjecture that this difference of the biases universally has a fast rate of convergence.

Obviously, there are several other possibilities for defining partitioning based estimates and proving their asymptotic normality, for example,
\[
\frac{1}{n}\sum_{i=1}^n m_n(X'_i)^2
\quad\text{or}\quad
\sum_{j=1}^{\infty} \frac{\nu_n(A_{n,j})^2}{\mu_n(A_{n,j})}.
\]
Notice that both estimates have larger bias and variance than our estimate (6) has. The proof of Theorem 1 works without any major modification for a consistent $k_n$-nearest neighbor ($k_n$-NN) estimate $m_n$ if $k_n\to\infty$ and $k_n/n\to 0$. A delicate and important research problem is the case of the non-consistent 1-NN estimate $m_n$, because for the 1-NN estimate the bias is smaller. We conjecture that even in this case one has a CLT. We prove Theorem 1 in the next section.
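Before turning to the proof, here is an informal Monte Carlo illustration of Theorem 1 (ours, not part of the paper): one compares the empirical spread of $\sqrt{n}(S_n - \mathbf{E}\{S_n\})$ with the explicit $\sigma$. The model below ($m(x)=x^{(1)}$, uniform $X$, unit Gaussian noise) and all parameters are assumptions chosen for the example; `splitting_estimate` refers to the sketch above.

```python
import numpy as np

def one_draw(n, d, h, rng):
    # One realization of S_n under the illustrative model m(x) = x^(1), X ~ Uniform[0,1]^d,
    # Y = m(X) + standard normal noise.  Assumes splitting_estimate (sketch above) is in scope.
    X, Xp = rng.uniform(0, 1, (n, d)), rng.uniform(0, 1, (n, d))
    Y = X[:, 0] + rng.normal(size=n)
    Yp = Xp[:, 0] + rng.normal(size=n)
    return splitting_estimate(X, Y, Xp, Yp, h)

rng = np.random.default_rng(1)
n, d = 2000, 1
h = n ** (-1.0 / (d + 2))
draws = np.array([one_draw(n, d, h, rng) for _ in range(500)])

# Theoretical sigma for this model: M_2(x) = x^2 + 1 and mu = Uniform[0,1], so
# sigma^2 = 2*int (x^2+1) x^2 dx - (int x^2 dx)^2 - int x^4 dx = 16/15 - 1/9 - 1/5.
sigma = np.sqrt(16 / 15 - 1 / 9 - 1 / 5)
print("empirical sd of sqrt(n)(S_n - E S_n):", np.sqrt(n) * draws.std())
print("theoretical sigma:", sigma)
```

For moderate $n$ the two numbers should be of comparable size; exact agreement is not expected at finite sample sizes.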
3. Proof of Theorem 1

Introduce the notations
\[
U_n = \sqrt{n}\,(S_n - \mathbf{E}\{S_n \mid \mathcal{Z}_n\})
\]
and
\[
V_n = \sqrt{n}\,(\mathbf{E}\{S_n \mid \mathcal{Z}_n\} - \mathbf{E}\{S_n\}),
\]
then
\[
\sqrt{n}\,(S_n - \mathbf{E}\{S_n\}) = U_n + V_n.
\]
We prove Theorem 1 by showing that for any $u,v\in\mathbb{R}$
\[
\mathbf{P}\{U_n\le u, V_n\le v\} \to \Phi\Big(\frac{u}{\sigma_1}\Big)\Phi\Big(\frac{v}{\sigma_2}\Big), \tag{8}
\]
where $\Phi$ denotes the standard normal distribution function, and
\[
\sigma_1^2 = \int M_2(x)\,m(x)^2\,\mu(dx) - \left(\int m(x)^2\,\mu(dx)\right)^2 \tag{9}
\]
and
\[
\sigma_2^2 = \int M_2(x)\,m(x)^2\,\mu(dx) - \int m(x)^4\,\mu(dx). \tag{10}
\]
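For orientation we note (this remark is ours, not spelled out at this point in the paper) that (9) and (10) add up exactly to the variance of Theorem 1,
\[
\sigma_1^2 + \sigma_2^2 = 2\int M_2(x)\,m(x)^2\,\mu(dx) - \left(\int m(x)^2\,\mu(dx)\right)^2 - \int m(x)^4\,\mu(dx) = \sigma^2,
\]
so that (8) indeed yields $\sqrt{n}\,(S_n-\mathbf{E}\{S_n\}) = U_n + V_n \stackrel{D}{\to} N(0,\sigma^2)$.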
Notice that $V_n$ is measurable with respect to $\mathcal{Z}_n$, therefore
\begin{align*}
\Big|\mathbf{P}\{U_n\le u, V_n\le v\} - \Phi\Big(\frac{u}{\sigma_1}\Big)\Phi\Big(\frac{v}{\sigma_2}\Big)\Big|
&= \Big|\mathbf{E}\big\{I_{\{V_n\le v\}}\mathbf{P}\{U_n\le u\mid \mathcal{Z}_n\}\big\} - \Phi\Big(\frac{u}{\sigma_1}\Big)\Phi\Big(\frac{v}{\sigma_2}\Big)\Big| \\
&\le \mathbf{E}\Big\{I_{\{V_n\le v\}}\Big|\mathbf{P}\{U_n\le u\mid \mathcal{Z}_n\} - \Phi\Big(\frac{u}{\sigma_1}\Big)\Big|\Big\}
+ \Phi\Big(\frac{u}{\sigma_1}\Big)\Big|\mathbf{P}\{V_n\le v\} - \Phi\Big(\frac{v}{\sigma_2}\Big)\Big| \\
&\le \mathbf{E}\Big\{\Big|\mathbf{P}\{U_n\le u\mid \mathcal{Z}_n\} - \Phi\Big(\frac{u}{\sigma_1}\Big)\Big|\Big\}
+ \Big|\mathbf{P}\{V_n\le v\} - \Phi\Big(\frac{v}{\sigma_2}\Big)\Big|.
\end{align*}
Thus, (8) is satisfied if
\[
\mathbf{P}\{U_n\le u\mid \mathcal{Z}_n\} \to \Phi\Big(\frac{u}{\sigma_1}\Big) \tag{11}
\]
in probability and
\[
\mathbf{P}\{V_n\le v\} \to \Phi\Big(\frac{v}{\sigma_2}\Big). \tag{12}
\]

Proof of (11). Let's start with the representation
\[
U_n = \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n \big(Y'_i m_n(X'_i) - \mathbf{E}\{Y'_i m_n(X'_i)\mid \mathcal{Z}_n\}\big)\right)
= \frac{1}{\sqrt{n}}\sum_{i=1}^n \big(Y'_i m_n(X'_i) - \mathbf{E}\{Y'_i m_n(X'_i)\mid \mathcal{Z}_n\}\big).
\]
Because of (7) and the Jensen inequality, for any $1\le s\le 3$ we get
\[
M_s(X) := \mathbf{E}\{|Y|^s\mid X\} = \big(\mathbf{E}\{|Y|^s\mid X\}^{1/s}\big)^s \le \big(\mathbf{E}\{|Y|^3\mid X\}^{1/3}\big)^s \le C^{s/3}, \tag{13}
\]
especially, for $s=1$,
\[
|m(X)| \le M_1(X) \le C^{1/3}
\]
and $\mathbf{E}\{|Y|^3\}\le C$. Next we apply a Berry-Esseen type central limit theorem (see Theorem 14 in Petrov (1975)). It implies that
\[
\left|\mathbf{P}\{U_n\le u\mid \mathcal{Z}_n\} - \Phi\left(\frac{u}{\sqrt{\mathbf{Var}(Y'_1 m_n(X'_1)\mid \mathcal{Z}_n)}}\right)\right|
\le \frac{c}{\sqrt{n}}\cdot \frac{\mathbf{E}\{|Y'_1 m_n(X'_1)|^3\mid \mathcal{Z}_n\}}{\sqrt{\mathbf{Var}(Y'_1 m_n(X'_1)\mid \mathcal{Z}_n)}^{\,3}}
\]
with the universal constant $c>0$. Because of
\[
\mathbf{E}\{Y'_1 m_n(X'_1)\mid \mathcal{Z}_n\} = \int m(x)\,m_n(x)\,\mu(dx),
\]
we get that
\begin{align*}
\mathbf{Var}(Y'_1 m_n(X'_1)\mid \mathcal{Z}_n)
&= \mathbf{E}\{{Y'_1}^2 m_n(X'_1)^2\mid \mathcal{Z}_n\} - \mathbf{E}\{Y'_1 m_n(X'_1)\mid \mathcal{Z}_n\}^2 \\
&= \int M_2(x)\,m_n(x)^2\,\mu(dx) - \left(\int m(x)\,m_n(x)\,\mu(dx)\right)^2 .
\end{align*}
Now (4), together with the boundedness of $M_2$ by (13), implies that
\[
\mathbf{Var}(Y'_1 m_n(X'_1)\mid \mathcal{Z}_n) \to \sigma_1^2
\]
in probability, where $\sigma_1^2$ is defined by (9). Further,
\[
\mathbf{E}\{|Y'_1 m_n(X'_1)|^3\mid \mathcal{Z}_n\} \le C\int |m_n(x)|^3\,\mu(dx).
\]
Put $A_n(x) = A_{n,j}$ if $x\in A_{n,j}$. Again, applying the Jensen inequality we get
\[
|m_n(x)|^3 \le \left(\frac{\sum_{i=1}^n I_{\{X_i\in A_n(x)\}}\,|Y_i|^{3/2}}{\sum_{i=1}^n I_{\{X_i\in A_n(x)\}}}\right)^2,
\]
the right-hand side of which is the square of the regression estimate where $Y$ is replaced by $|Y|^{3/2}$. Thus, (4) together with $\mathbf{E}\{|Y|^3\}<\infty$ implies that
\[
\int \left(\frac{\sum_{i=1}^n I_{\{X_i\in A_n(x)\}}\,|Y_i|^{3/2}}{\sum_{i=1}^n I_{\{X_i\in A_n(x)\}}}\right)^2 \mu(dx)
\to \mathbf{E}\big\{\mathbf{E}\{|Y|^{3/2}\mid X\}^2\big\} < C
\]
in probability. These limit relations imply (11).

Proof of (12). Assuming that the support $S$ of $\mu$ is bounded, let $l_n$ be such that $S\subset \cup_{j=1}^{l_n} A_{n,j}$. Also we re-index the partition so that $\mu(A_{n,j})\ge \mu(A_{n,j+1})$, with $\mu(A_{n,j})>0$ for $j\le l_n$, and $\mu(A_{n,j})=0$ otherwise. Then
\[
S_n = \sum_{j=1}^{l_n} \nu'_n(A_{n,j})\,\frac{\nu_n(A_{n,j})}{\mu_n(A_{n,j})} \tag{14}
\]
and
\[
l_n \le \frac{c}{h_n^d}.
\]
The condition $nh_n^d\to\infty$ implies that $l_n/n\to 0$. Because of (14) we have that
\begin{align*}
V_n &= \sqrt{n}\sum_{j=1}^{l_n}\left(\mathbf{E}\{\nu'_n(A_{n,j})\mid \mathcal{Z}_n\}\,\frac{\nu_n(A_{n,j})}{\mu_n(A_{n,j})} - \mathbf{E}\left\{\nu'_n(A_{n,j})\,\frac{\nu_n(A_{n,j})}{\mu_n(A_{n,j})}\right\}\right) \\
&= \sqrt{n}\sum_{j=1}^{l_n}\nu(A_{n,j})\left(\frac{\nu_n(A_{n,j})}{\mu_n(A_{n,j})} - \mathbf{E}\left\{\frac{\nu_n(A_{n,j})}{\mu_n(A_{n,j})}\right\}\right),
\end{align*}
where $\nu(A) = \mathbf{E}\{\nu_n(A)\}$. Observe that we have to show the asymptotic normality for a finite sum of dependent random variables. In order to prove (12), we follow the lines of the proof in Beirlant and Györfi (1998) and use a Poissonization argument. With this we introduce a modification $M_n$ of $V_n$ such that $\Delta_n := V_n - M_n \to 0$, the proof of which follows, starting from (23). Now we proceed arguing for $M_n$. Introduce the notation $N_n$ for a Poisson($n$) random variable independent of $(X_1,Y_1),(X_2,Y_2),\dots$. Moreover put
\[
n\tilde{\nu}_n(A) = \sum_{i=1}^{N_n} I_{\{X_i\in A\}} Y_i
\quad\text{and}\quad
n\tilde{\mu}_n(A) = \sum_{i=1}^{N_n} I_{\{X_i\in A\}}.
\]
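As a side note (ours, not in the paper): the effect of the Poissonization is that the cell counts $n\tilde{\mu}_n(A_{n,j})$ become independent Poisson$(n\mu(A_{n,j}))$ variables across cells, which is what makes the variance computations below tractable. The following small Python sketch, with purely illustrative parameters and an assumed uniform model, checks this empirically.

```python
import numpy as np

rng = np.random.default_rng(2)
n, h = 10000, 0.1                      # illustrative sample size and cell width
edges = np.arange(0.0, 1.0 + h, h)     # cubic partition of [0, 1] for d = 1

def poissonized_cell_counts():
    # Draw N_n ~ Poisson(n), then count the points per cell: these are the n*mu~_n(A_j).
    N = rng.poisson(n)
    X = rng.uniform(0, 1, size=N)
    return np.histogram(X, bins=edges)[0]

counts = np.array([poissonized_cell_counts() for _ in range(2000)])

# Each column should be approximately Poisson(n*mu(A_j)) (mean = variance = n*h here),
# and distinct columns should be nearly uncorrelated.
print("cell means:", counts.mean(axis=0)[:3], "target:", n * h)
print("cell variances:", counts.var(axis=0)[:3])
off_diag = np.abs(np.corrcoef(counts.T) - np.eye(len(edges) - 1))
print("max |correlation| between distinct cells:", off_diag.max())
```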
The key result in this step is the following property:

Proposition 2 (Beirlant and Mason (1995), Beirlant et al. (1994).) Put
\[
\tilde{M}_n = \sqrt{n}\sum_{j=1}^{l_n}\nu(A_{n,j})\left(\frac{\tilde{\nu}_n(A_{n,j})}{\tilde{\mu}_n(A_{n,j})} - \mathbf{E}\left\{\frac{\tilde{\nu}_n(A_{n,j})}{\tilde{\mu}_n(A_{n,j})}\right\}\right), \tag{15}
\]
and
\[
M_n = \sqrt{n}\sum_{j=1}^{l_n}\nu(A_{n,j})\left(\frac{\nu_n(A_{n,j})}{\mu_n(A_{n,j})} - \mathbf{E}\left\{\frac{\tilde{\nu}_n(A_{n,j})}{\tilde{\mu}_n(A_{n,j})}\right\}\right). \tag{16}
\]
Assume that
\[
\Phi_n(t,v) = \mathbf{E}\left\{\exp\left(it\tilde{M}_n + iv\,\frac{N_n-n}{\sqrt{n}}\right)\right\} \to e^{-(t^2\rho^2+v^2)/2}
\]
for a constant $\rho>0$, where $i=\sqrt{-1}$. Then
\[
M_n/\rho \stackrel{D}{\to} N(0,1).
\]

Put
\[
T_n = t\tilde{M}_n + v\,\frac{N_n-n}{\sqrt{n}},
\]
for which a central limit result is to hold:
\[
T_n \stackrel{D}{\to} N\big(0, t^2\rho^2 + v^2\big) \tag{17}
\]
as $n\to\infty$. Remark that
\[
\mathbf{Var}(T_n) = t^2\,\mathbf{Var}(\tilde{M}_n) + 2tv\,\mathbf{E}\left\{\tilde{M}_n\,\frac{N_n-n}{\sqrt{n}}\right\} + v^2.
\]
For a cell $A = A_{n,j}$ from the partition with $\mu(A)>0$, let $Y(A)$ be a random variable such that
\[
\mathbf{P}\{Y(A)\in B\} = \mathbf{P}\{Y\in B\mid X\in A\},
\]
where $B$ is an arbitrary Borel set. Introduce the notations
\[
q_{n,k} = \mathbf{P}\{n\mu_n(A)=k\} = \binom{n}{k}\mu(A)^k(1-\mu(A))^{n-k}
\]
and
\[
\tilde{q}_{n,k} = \mathbf{P}\{n\tilde{\mu}_n(A)=k\} = \frac{(n\mu(A))^k}{k!}\,e^{-n\mu(A)}.
\]
Concerning the expectation, with $(Y_1(A), Y_2(A), \dots)$ an i.i.d. sequence of random variables distributed as $Y(A)$, we find that
\begin{align}
\mathbf{E}\left\{\frac{\tilde{\nu}_n(A)}{\tilde{\mu}_n(A)}\right\}
&= \sum_{k=0}^{\infty}\mathbf{E}\left\{\frac{\tilde{\nu}_n(A)}{\tilde{\mu}_n(A)}\,\Big|\, n\tilde{\mu}_n(A)=k\right\}\mathbf{P}\{n\tilde{\mu}_n(A)=k\} \notag\\
&= \sum_{k=1}^{\infty}\mathbf{E}\left\{\frac{\sum_{i=1}^k Y_i(A)}{k}\right\}\tilde{q}_{n,k} \notag\\
&= \mathbf{E}\{Y_1(A)\}\,(1-\tilde{q}_{n,0})
= \frac{\nu(A)}{\mu(A)}\,(1-\tilde{q}_{n,0}), \tag{18}
\end{align}
further, by (24),
\[
\mathbf{E}\left\{\frac{\nu_n(A)}{\mu_n(A)}\right\}
= n\,\mathbf{E}\left\{\frac{I_{\{X_n\in A\}}\,Y_n}{1+(n-1)\mu_{n-1}(A)}\right\}
= \frac{\nu(A)}{\mu(A)}\,\big(1-(1-\mu(A))^n\big). \tag{19}
\]
Moreover,
\begin{align*}
\mathbf{E}\left\{\frac{\tilde{\nu}_n(A)^2}{\tilde{\mu}_n(A)^2}\right\}
&= \sum_{k=0}^{\infty}\mathbf{E}\left\{\frac{\tilde{\nu}_n(A)^2}{\tilde{\mu}_n(A)^2}\,\Big|\, n\tilde{\mu}_n(A)=k\right\}\mathbf{P}\{n\tilde{\mu}_n(A)=k\} \\
&= \sum_{k=1}^{\infty}\mathbf{E}\left\{\frac{\big(\sum_{i=1}^k Y_i(A)\big)^2}{k^2}\right\}\tilde{q}_{n,k} \\
&= \sum_{k=1}^{\infty}\frac{k\,\mathbf{E}\{Y_1(A)^2\} + k(k-1)\,\mathbf{E}\{Y_1(A)\}^2}{k^2}\,\tilde{q}_{n,k} \\
&= \mathbf{Var}(Y_1(A))\sum_{k=1}^{\infty}\frac{1}{k}\,\tilde{q}_{n,k} + \mathbf{E}\{Y_1(A)\}^2(1-\tilde{q}_{n,0}),
\end{align*}
and
\begin{align*}
\sum_{k=1}^{\infty}\frac{1}{k}\,\tilde{q}_{n,k}
&= \sum_{k=1}^{\infty}\frac{1}{k}\,\frac{(n\mu(A))^k}{k!}\,e^{-n\mu(A)} \\
&= \sum_{k=1}^{\infty}\frac{1}{k+1}\,\frac{(n\mu(A))^k}{k!}\,e^{-n\mu(A)} + \sum_{k=1}^{\infty}\frac{1}{k(k+1)}\,\frac{(n\mu(A))^k}{k!}\,e^{-n\mu(A)} \\
&\le \frac{1}{n\mu(A)}\,(1-\tilde{q}_{n,0}) + \frac{3}{n^2\mu(A)^2}\,(1-\tilde{q}_{n,0}).
\end{align*}
The independence of the Poisson masses over different cells leads to
\begin{align*}
\mathbf{Var}(\tilde{M}_n)
&= n\sum_{j=1}^{l_n}\nu(A_{n,j})^2\,\mathbf{Var}\left(\frac{\tilde{\nu}_n(A_{n,j})}{\tilde{\mu}_n(A_{n,j})}\right) \\
&\le n\sum_{j=1}^{l_n}\nu(A_{n,j})^2\bigg(\mathbf{Var}(Y_1(A_{n,j}))\Big(\frac{1}{n\mu(A_{n,j})}\big(1-e^{-n\mu(A_{n,j})}\big) + \frac{3}{n^2\mu(A_{n,j})^2}\big(1-e^{-n\mu(A_{n,j})}\big)\Big) \\
&\qquad\qquad + \mathbf{E}\{Y_1(A_{n,j})\}^2\big(1-e^{-n\mu(A_{n,j})}\big) - \mathbf{E}\{Y_1(A_{n,j})\}^2\big(1-e^{-n\mu(A_{n,j})}\big)^2\bigg) \\
&\le \sum_{j=1}^{l_n}\frac{\nu(A_{n,j})^2}{\mu(A_{n,j})^2}\,\mathbf{Var}(Y_1(A_{n,j}))\,\mu(A_{n,j})
+ \sum_{j=1}^{l_n}\frac{3\,\mathbf{Var}(Y_1(A_{n,j}))\,\nu(A_{n,j})^2}{n\mu(A_{n,j})^2}
+ n\sum_{j=1}^{l_n}\nu(A_{n,j})^2\,\mathbf{E}\{Y_1(A_{n,j})\}^2\,e^{-n\mu(A_{n,j})}
\end{align*}
such that the bounding error in these inequalities is of order $O(l_n/n)$. (4) together with the boundedness of $M_2$ and $m$ implies that
\begin{align*}
\sum_{j=1}^{l_n}\frac{\nu(A_{n,j})^2}{\mu(A_{n,j})^2}\,\mathbf{Var}(Y_1(A_{n,j}))\,\mu(A_{n,j})
&= \int \frac{\int_{A_n(x)} M_2(z)\,\mu(dz)}{\mu(A_n(x))}\left(\frac{\int_{A_n(x)} m(z)\,\mu(dz)}{\mu(A_n(x))}\right)^2\mu(dx)
- \int\left(\frac{\int_{A_n(x)} m(z)\,\mu(dz)}{\mu(A_n(x))}\right)^4\mu(dx) \\
&= \sigma_2^2 + o(1),
\end{align*}
where $\sigma_2^2$ is defined by (10). Moreover,
\[
\sum_{j=1}^{l_n}\frac{3\,\mathbf{Var}(Y_1(A_{n,j}))\,\nu(A_{n,j})^2}{n\mu(A_{n,j})^2} \le \frac{3C^{4/3}\,l_n}{n} \to 0.
\]
Then
\begin{align*}
n\sum_{j=1}^{l_n}\nu(A_{n,j})^2\,\mathbf{E}\{Y_1(A_{n,j})\}^2\,e^{-n\mu(A_{n,j})}
&= \sum_{j=1}^{l_n}\frac{\nu(A_{n,j})^2}{\mu(A_{n,j})^2}\,\mathbf{E}\{Y_1(A_{n,j})\}^2\, n\mu(A_{n,j})^2\,e^{-n\mu(A_{n,j})} \\
&\le C^{4/3}\sum_{j=1}^{l_n} n\mu(A_{n,j})^2\,e^{-n\mu(A_{n,j})} \\
&\le C^{4/3}\big(\max_{z>0} z^2 e^{-z}\big)\,l_n/n \to 0.
\end{align*}
So we proved that
\[
\mathbf{Var}(\tilde{M}_n)\to \sigma_2^2.
\]
To complete the asymptotics for $\mathbf{Var}(T_n)$, it remains to show that
\[
\mathbf{E}\left\{\tilde{M}_n\,\frac{N_n-n}{\sqrt{n}}\right\}\to 0 \quad\text{as } n\to\infty.
\]
Because of
\[
N_n = n\sum_{j=1}^{l_n}\tilde{\mu}_n(A_{n,j})
\quad\text{and}\quad
n = n\sum_{j=1}^{l_n}\mu(A_{n,j}),
\]
we have that
\begin{align*}
\mathbf{E}\left\{\tilde{M}_n\,\frac{N_n-n}{\sqrt{n}}\right\}
&= n\sum_{j=1}^{l_n}\mathbf{E}\left\{\nu(A_{n,j})\,\frac{\tilde{\nu}_n(A_{n,j})}{\tilde{\mu}_n(A_{n,j})}\big(\tilde{\mu}_n(A_{n,j})-\mu(A_{n,j})\big)\right\} \\
&= n\sum_{j=1}^{l_n}\nu(A_{n,j})\left(\mathbf{E}\{\tilde{\nu}_n(A_{n,j})\} - \mathbf{E}\left\{\frac{\tilde{\nu}_n(A_{n,j})}{\tilde{\mu}_n(A_{n,j})}\right\}\mu(A_{n,j})\right) \\
&= n\sum_{j=1}^{l_n}\nu(A_{n,j})\left(\nu(A_{n,j}) - \frac{\nu(A_{n,j})}{\mu(A_{n,j})}\big(1-e^{-n\mu(A_{n,j})}\big)\mu(A_{n,j})\right) \\
&= n\sum_{j=1}^{l_n}\nu(A_{n,j})^2\,e^{-n\mu(A_{n,j})} \\
&\le C^{2/3}\big(\max_{z>0} z^2 e^{-z}\big)\,l_n/n \to 0.
\end{align*}
To finish the proof of (17) by Lyapunov's central limit theorem, it suffices to prove that
\[
n^{3/2}\sum_{j=1}^{l_n}\mathbf{E}\left\{\left| t\,\nu(A_{n,j})\left(\frac{\tilde{\nu}_n(A_{n,j})}{\tilde{\mu}_n(A_{n,j})} - \mathbf{E}\left\{\frac{\tilde{\nu}_n(A_{n,j})}{\tilde{\mu}_n(A_{n,j})}\right\}\right) + v\big(\tilde{\mu}_n(A_{n,j})-\mu(A_{n,j})\big)\right|^3\right\}\to 0,
\]
or, by invoking the $c_3$ inequality $|a+b|^3\le 4(|a|^3+|b|^3)$, that
\[
n^{3/2}\sum_{j=1}^{l_n}\nu(A_{n,j})^3\,\mathbf{E}\left\{\left|\frac{\tilde{\nu}_n(A_{n,j})}{\tilde{\mu}_n(A_{n,j})} - \mathbf{E}\left\{\frac{\tilde{\nu}_n(A_{n,j})}{\tilde{\mu}_n(A_{n,j})}\right\}\right|^3\right\}\to 0 \tag{20}
\]
and
\[
n^{3/2}\sum_{j=1}^{l_n}\mathbf{E}\big\{|\tilde{\mu}_n(A_{n,j})-\mu(A_{n,j})|^3\big\}\to 0. \tag{21}
\]
In view of (20), because of (13) it suffices to prove
\[
D_n := n^{3/2}\sum_{j=1}^{l_n}\mu(A_{n,j})^3\,\mathbf{E}\left\{\left|\frac{\tilde{\nu}_n(A_{n,j})}{\tilde{\mu}_n(A_{n,j})} - \mathbf{E}\left\{\frac{\tilde{\nu}_n(A_{n,j})}{\tilde{\mu}_n(A_{n,j})}\right\}\right|^3\right\}\to 0. \tag{22}
\]
For a cell $A$, (18) implies that
\begin{align*}
\mathbf{E}\left\{\left|\frac{\tilde{\nu}_n(A)}{\tilde{\mu}_n(A)} - \mathbf{E}\left\{\frac{\tilde{\nu}_n(A)}{\tilde{\mu}_n(A)}\right\}\right|^3\right\}
&\le 4\,\mathbf{E}\left\{\left|\frac{\tilde{\nu}_n(A)}{\tilde{\mu}_n(A)} - \frac{\nu(A)}{\mu(A)}(1-\tilde{q}_{n,0})\,I_{\{\tilde{\mu}_n(A)>0\}}\right|^3\right\} \\
&\quad + 4\,\mathbf{E}\left\{\left|\frac{\nu(A)}{\mu(A)}(1-\tilde{q}_{n,0})\,I_{\{\tilde{\mu}_n(A)>0\}} - \frac{\nu(A)}{\mu(A)}(1-\tilde{q}_{n,0})\right|^3\right\}.
\end{align*}
On the one hand, (18), (13) and (25) imply that, for a constant $K$,
\begin{align*}
\mathbf{E}\left\{\left|\frac{\tilde{\nu}_n(A)}{\tilde{\mu}_n(A)} - \frac{\nu(A)}{\mu(A)}(1-\tilde{q}_{n,0})\,I_{\{\tilde{\mu}_n(A)>0\}}\right|^3\right\}
&= \sum_{k=0}^{\infty}\mathbf{E}\left\{\left|\frac{\tilde{\nu}_n(A)}{\tilde{\mu}_n(A)} - \frac{\nu(A)}{\mu(A)}(1-\tilde{q}_{n,0})\,I_{\{\tilde{\mu}_n(A)>0\}}\right|^3 \Big|\, n\tilde{\mu}_n(A)=k\right\}\mathbf{P}\{n\tilde{\mu}_n(A)=k\} \\
&= \sum_{k=1}^{\infty}\mathbf{E}\left\{\frac{\big|\sum_{i=1}^k \big(Y_i(A)-\mathbf{E}\{Y_i(A)\}\big)\big|^3}{k^3}\right\}\tilde{q}_{n,k} \\
&\le K\sum_{k=1}^{\infty}\frac{1}{k^{3/2}}\,\tilde{q}_{n,k} \\
&\le c_1\,\frac{1}{n^{3/2}\mu(A)^{3/2}},
\end{align*}
where we applied the Marcinkiewicz and Zygmund (1937) inequality for absolute central moments of sums of i.i.d. random variables. On the other hand,
\[
\mathbf{E}\left\{\left|\frac{\nu(A)}{\mu(A)}(1-\tilde{q}_{n,0})\,I_{\{\tilde{\mu}_n(A)>0\}} - \frac{\nu(A)}{\mu(A)}(1-\tilde{q}_{n,0})\right|^3\right\} \le C\,\tilde{q}_{n,0}.
\]
Therefore
\begin{align*}
D_n &\le n^{3/2}\sum_{j=1}^{l_n}\left(c_2\,\frac{1}{n^{3/2}\mu(A_{n,j})^{3/2}} + e^{-n\mu(A_{n,j})}\right)\mu(A_{n,j})^3 \\
&\le c_2\sum_{j=1}^{l_n}\mu(A_{n,j})^{3/2} + n^{3/2}\sum_{j=1}^{l_n}e^{-n\mu(A_{n,j})}\mu(A_{n,j})^3 \\
&\le c_2\sum_{j=1}^{l_n}\mu(A_{n,j})^{3/2}\Big(1 + \max_{z>0} z^{3/2}e^{-z}\Big) \\
&= c_3\int \mu(A_n(x))^{1/2}\,\mu(dx) \\
&\to 0,
\end{align*}
where we used the assumption that $\mu$ is non-atomic. Thus, (20) is proved.

The proof of (21) is easier. Notice that (21) means
\[
F_n := n^{-3/2}\sum_{j=1}^{l_n}\mathbf{E}\left\{\left|\sum_{i=1}^{N_n} I_{\{X_i\in A_{n,j}\}} - n\mu(A_{n,j})\right|^3\right\}\to 0.
\]
One has
\begin{align*}
\mathbf{E}\left\{\left|\sum_{i=1}^{N_n} I_{\{X_i\in A_{n,j}\}} - n\mu(A_{n,j})\right|^3\right\}
&\le 4\,\mathbf{E}\left\{\left|\sum_{i=1}^{N_n}\big(I_{\{X_i\in A_{n,j}\}} - \mu(A_{n,j})\big)\right|^3\right\} + 4\,\mathbf{E}\big\{|(N_n-n)\,\mu(A_{n,j})|^3\big\} \\
&\le c_4\left(\sum_{k=1}^{\infty} k^{3/2}\mu(A_{n,j})^{3/2}\,e^{-n}\frac{n^k}{k!} + \mathbf{E}\{|N_n-n|^3\}\,\mu(A_{n,j})^3\right) \\
&\le c_5\big(n^{3/2}\mu(A_{n,j})^{3/2} + n^{3/2}\mu(A_{n,j})^3\big).
\end{align*}
Therefore
\[
F_n \le 2c_5\sum_{j=1}^{l_n}\mu(A_{n,j})^{3/2}\to 0,
\]
and so (21) is proved, too.

The remaining step in the proof of (12) is to show that
\[
\Delta_n := V_n - M_n = n^{1/2}\sum_{j=1}^{l_n}\nu(A_{n,j})\left(\mathbf{E}\left\{\frac{\tilde{\nu}_n(A_{n,j})}{\tilde{\mu}_n(A_{n,j})}\right\} - \mathbf{E}\left\{\frac{\nu_n(A_{n,j})}{\mu_n(A_{n,j})}\right\}\right)\to 0. \tag{23}
\]
By (18) and (19) we have that
\begin{align*}
|\Delta_n| &= n^{1/2}\left|\sum_{j=1}^{l_n}\frac{\nu(A_{n,j})}{\mu(A_{n,j})}\big(e^{-n\mu(A_{n,j})} - (1-\mu(A_{n,j}))^n\big)\,\nu(A_{n,j})\right| \\
&= n^{1/2}\sum_{j=1}^{l_n}\frac{\nu(A_{n,j})^2}{\mu(A_{n,j})^2}\big(e^{-n\mu(A_{n,j})} - (1-\mu(A_{n,j}))^n\big)\mu(A_{n,j}) \\
&\le C^{2/3}\, n^{1/2}\sum_{j=1}^{l_n}\big(e^{-n\mu(A_{n,j})} - (1-\mu(A_{n,j}))^n\big)\mu(A_{n,j}).
\end{align*}
For $0\le z\le 1$, using the elementary inequalities $1-z\le e^{-z}\le 1-z+z^2$ we have that
\[
e^{-nz} - (1-z)^n = \big(e^{-z}-(1-z)\big)\sum_{k=0}^{n-1} e^{-kz}(1-z)^{n-1-k} \le nz^2 e^{-(n-1)z},
\]
and thus we get that
\begin{align*}
|\Delta_n| &\le C^{2/3}\, n^{1/2}\sum_{j=1}^{l_n}\big(e^{-n\mu(A_{n,j})} - (1-\mu(A_{n,j}))^n\big)\mu(A_{n,j}) \\
&\le C^{2/3}\, n^{1/2}\sum_{j=1}^{l_n} n\mu(A_{n,j})^3\, e^{-(n-1)\mu(A_{n,j})} \\
&\le \frac{C^{2/3}}{n^{1/2}}\sum_{j=1}^{l_n}\mu(A_{n,j})\,[n\mu(A_{n,j})]^2\, e^{-n\mu(A_{n,j})}\, e \\
&\le \frac{C^{2/3}}{n^{1/2}}\sum_{j=1}^{l_n}\mu(A_{n,j})\,\big(\max_{z\ge 0} z^2 e^{-z}\big)\, e \\
&\to 0.
\end{align*}
This ends the proof of (12), and so the proof of Theorem 1 is complete.

Next we give two lemmas, which are used above.

Lemma 3 If $B(n,p)$ is a binomial random variable with parameters $(n,p)$, then
\[
\mathbf{E}\left\{\frac{1}{1+B(n,p)}\right\} = \frac{1-(1-p)^{n+1}}{(n+1)p}. \tag{24}
\]

Lemma 4 If $Po(\lambda)$ is a Poisson random variable with parameter $\lambda$, then
\[
\mathbf{E}\left\{\frac{1}{Po(\lambda)^3}\, I_{\{Po(\lambda)>0\}}\right\} \le \frac{24}{\lambda^3}. \tag{25}
\]
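For the reader's convenience, here is a short derivation of the identity (24) in Lemma 3 (this computation is ours, added for completeness):
\begin{align*}
\mathbf{E}\left\{\frac{1}{1+B(n,p)}\right\}
&= \sum_{k=0}^{n}\frac{1}{k+1}\binom{n}{k}p^k(1-p)^{n-k}
 = \frac{1}{(n+1)p}\sum_{k=0}^{n}\binom{n+1}{k+1}p^{k+1}(1-p)^{(n+1)-(k+1)} \\
&= \frac{1}{(n+1)p}\sum_{m=1}^{n+1}\binom{n+1}{m}p^{m}(1-p)^{n+1-m}
 = \frac{1-(1-p)^{n+1}}{(n+1)p},
\end{align*}
using $\frac{1}{k+1}\binom{n}{k} = \frac{1}{n+1}\binom{n+1}{k+1}$ and the binomial theorem.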
References

J. Beirlant and L. Györfi. On the asymptotic $L_2$-error in partitioning regression estimation. Journal of Statistical Planning and Inference, 71:93–107, 1998.

J. Beirlant and D. Mason. On the asymptotic normality of $l_p$-norms of empirical functionals. Mathematical Methods of Statistics, 4:1–19, 1995.

J. Beirlant, L. Györfi, and G. Lugosi. On the asymptotic normality of the $l_1$- and $l_2$-errors in histogram density estimation. Canadian Journal of Statistics, 22:309–318, 1994.

K. De Brabanter, P. G. Ferrario, and L. Györfi. Detecting ineffective features for nonparametric regression. In J. A. K. Suykens, M. Signoretto, and A. Argyriou, editors, Regularization, Optimization, Kernels, and Support Vector Machines, pages 177–194. Chapman & Hall/CRC Machine Learning and Pattern Recognition Series, 2014.

L. Devroye, D. Schäfer, L. Györfi, and H. Walk. The estimation problem of minimum mean squared error. Statistics and Decisions, 21:15–28, 2003.
L. Devroye, P. Ferrario, L. Györfi, and H. Walk. Strong universal consistent estimate of the minimum mean squared error. In B. Schölkopf, Z. Luo, and V. Vovk, editors, Empirical Inference - Festschrift in Honor of Vladimir N. Vapnik, pages 143–160. Springer, Heidelberg, 2013.

D. Evans and A. J. Jones. Non-parametric estimation of residual moments and covariance. Proceedings of the Royal Society A, 464:2831–2846, 2008.

P. G. Ferrario and H. Walk. Nonparametric partitioning estimation of residual and local variance based on first and second nearest neighbors. Journal of Nonparametric Statistics, 24:1019–1039, 2012.

L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, 2002.

E. Liitiäinen, F. Corona, and A. Lendasse. On nonparametric residual variance estimation. Neural Processing Letters, 28:155–167, 2008.

E. Liitiäinen, M. Verleysen, F. Corona, and A. Lendasse. Residual variance estimation in machine learning. Neurocomputing, 72:3692–3703, 2009.

E. Liitiäinen, F. Corona, and A. Lendasse. Residual variance estimation using a nearest neighbor statistic. Journal of Multivariate Analysis, 101:811–823, 2010.

J. Marcinkiewicz and A. Zygmund. Sur les fonctions indépendantes. Fundamenta Mathematicae, 29:60–90, 1937.

V. V. Petrov. Sums of Independent Random Variables. Springer-Verlag, Berlin, 1975.