Estimation of Integrated Squared Density Derivatives

Peter Hall
Australian National University

J. S. Marron¹
Australian National University and University of North Carolina

AMS 1980 subject classification: primary 62G05, secondary 62G20.

Key words and phrases: Integrated squared derivative, kernel estimators, nonparametric estimation, rates of convergence.

¹ Research partially supported by NSF Grant DMS-8400602.
Abstract: Kernel density estimators are used for the estimation of integrals of various squared derivatives of a probability density. Rates of convergence in mean squared error are calculated, which show that appropriate values of the smoothing parameter are much smaller than those for ordinary density estimation. The rate of convergence increases with stronger smoothness assumptions; however, unlike ordinary density estimation, the parametric rate of $n^{-1}$ can be achieved even when only a finite amount of differentiability is assumed. The implications for data-driven bandwidth selection in ordinary density estimation are considered.
1. Introduction

The estimation of the integral of a squared probability density has long been important in the study of rank-based nonparametric statistics. See Sheather and Hettmansperger (1987) and Section 4.4 of Prakasa Rao (1983) for an account of the literature on this topic. One method of data-driven bandwidth selection for density estimation involves plugging estimates of integrated squared derivatives into an asymptotic representation for the optimal bandwidth. Under nonparametric assumptions it is natural to form estimates of these quantities based on a kernel estimate of the underlying density.

Section 2 describes two methods for doing this, and provides motivation for a slight modification of the estimators. Section 3 contains rate-of-convergence results in mean squared error of the type developed by Rosenblatt (1956, 1971) and Parzen (1962).
As for standard density estimation, the rates become faster when stronger smoothness assumptions are made. An optimality theory is developed in which variance and bias are balanced. Since integration is a smoothing operation, it is not surprising that the optimal bandwidth is much smaller for the integrated squared derivatives of a density than for the ordinary derivatives. A more surprising result is that, unlike the case of standard density estimation, the parametric rate of convergence of $n^{-1}$ may be achieved even when only a finite number of derivatives are assumed to exist for the underlying density. Section 4 has some remarks, including a discussion of the implications of the convergence rate results for automatic bandwidth selection in density estimation. All proofs are in the appendix.
2. The Estimators

Consider the problem of estimating, for some $m = 0, 1, \ldots$, the parameter
$$\theta_m = \int \{f^{(m)}(x)\}^2\,dx,$$
using a random sample $X_1, \ldots, X_n$ from a probability density $f$. An obvious first attempt at estimation is
$$\check\theta_m = \int \{\hat f^{(m)}(x)\}^2\,dx,$$
where $\hat f(x)$ is some reasonable estimator of $f(x)$. One candidate is the kernel estimator
$$\hat f_h(x) = n^{-1}\sum_{i=1}^n K_h(x - X_i),$$
where here and in the following a subscript $h$ means a rescaling of the type $K_h(\cdot) = h^{-1}K(\cdot/h)$, $K$ is called the kernel function, and the amount of smoothing is controlled by the bandwidth $h$. See Prakasa Rao (1983), Devroye and Györfi (1984), and Silverman (1986) for access to the large literature concerning $\hat f_h$.
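To make the preceding definition concrete, here is a minimal sketch of $\hat f_h$ (an addition, not part of the paper); the Gaussian choice of $K$ and the function name are illustrative assumptions only.

```python
import numpy as np

def kde(x_grid, data, h):
    """Kernel density estimate f_hat_h(x) = n^{-1} sum_i K_h(x - X_i),
    with K taken to be the standard Gaussian density (illustrative choice)."""
    u = (x_grid[:, None] - data[None, :]) / h   # (x - X_i)/h for every pair
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2.0 * np.pi))
```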
The fact that $\check\theta_m$ can be improved follows from the expansion (for a symmetric kernel)
$$\check\theta_m = (-1)^m\Bigl\{n^{-1}h^{-2m-1}\,K^{(m)}{*}K^{(m)}(0) + n^{-2}\sum_{i\neq j}\sum K_h^{(m)}{*}K_h^{(m)}(X_i - X_j)\Bigr\}, \qquad (2.1)$$
where $*$ denotes convolution. Note that the first term does not make use of the data, and hence may be thought of as adding a type of bias to the estimator. This motivates the estimator
$$\hat\theta_m = (-1)^m n^{-1}(n-1)^{-1}\sum_{i\neq j}\sum K_h^{(m)}{*}K_h^{(m)}(X_i - X_j).$$
The convergence rate methods described in Section 3 can be used to show that the bias introduced by the first term in (2.1) can actually dominate the mean squared error. The squared-error rate of convergence of $\hat\theta_m$ is never inferior to that of $\check\theta_m$, and so only $\hat\theta_m$ is treated here.
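As a concrete illustration (again an addition, not from the paper), the following sketch computes $\hat\theta_m$ for a Gaussian kernel, using the identity $K^{(m)}{*}K^{(m)} = (K{*}K)^{(2m)}$, the fact that $K*K$ is then the $N(0,2)$ density, and the Hermite representation $\phi^{(r)}(u) = (-1)^r\mathrm{He}_r(u)\phi(u)$; the kernel choice and function name are assumptions of the sketch.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def theta_hat(data, h, m):
    """U-statistic estimate of theta_m = int (f^(m))^2, Gaussian kernel.
    Uses K^(m)*K^(m) = (K*K)^(2m), with K*K the N(0, 2) density, and
    phi^(r)(u) = (-1)^r He_r(u) phi(u) (probabilists' Hermite polynomials)."""
    n = len(data)
    u = (data[:, None] - data[None, :]) / (h * np.sqrt(2.0))
    coef = np.zeros(2 * m + 1)
    coef[-1] = 1.0                # selects He_{2m} in hermeval
    # (K*K)^(2m)(t) = 2^{-m-1/2} He_{2m}(t/sqrt2) phi(t/sqrt2), here t = (X_i - X_j)/h
    kk = 2.0 ** (-m - 0.5) * hermeval(u, coef) * np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(kk, 0.0)     # omit the data-free i == j terms, as in the text
    return (-1) ** m * kk.sum() / (n * (n - 1) * h ** (2 * m + 1))
```

For $N(0,1)$ data the targets are known exactly (for example $\theta_2 = 3/(8\sqrt\pi) \approx 0.212$), which provides a quick sanity check of the sketch.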
Another estimate of $\theta_m$ is motivated by the fact that, under strong enough conditions,
$$\theta_m = (-1)^m\int f^{(2m)}(x)f(x)\,dx,$$
which can be estimated by
$$\tilde\theta_m' = (-1)^m n^{-1}\sum_{i=1}^n \hat f^{(2m)}(X_i).$$
The same argument used above to motivate $\hat\theta_m$ can be employed to show that a better version of $\tilde\theta_m'$ is
$$\tilde\theta_m = (-1)^m n^{-1}(n-1)^{-1}\sum_{i\neq j}\sum K_h^{(2m)}(X_i - X_j).$$
At first glance it might seem that $\tilde\theta_m$ will be inferior to $\hat\theta_m$, since $2m$ derivatives of $f$ appear to be used in the motivation of $\tilde\theta_m$, while only $m$ derivatives appear in the motivation of $\hat\theta_m$. The fact that this is not the case is demonstrated in Section 3, where it is seen that the two estimators have very similar properties, even when $f$ has fewer than $2m$ derivatives. Some idea of why this is the case is given by writing
$$\hat\theta_m = (-1)^m n^{-1}(n-1)^{-1}\sum_{i\neq j}\sum (K{*}K)_h^{(2m)}(X_i - X_j),$$
which has exactly the form of $\tilde\theta_m$, with the kernel $K$ replaced by $K*K$.
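A companion sketch for $\tilde\theta_m$ under the same illustrative Gaussian-kernel assumption; only $K^{(2m)}$ itself is needed here.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def theta_tilde(data, h, m):
    """Estimate theta_m via theta_m = (-1)^m int f^(2m) f, Gaussian kernel,
    using K^(2m)(u) = He_{2m}(u) phi(u)."""
    n = len(data)
    u = (data[:, None] - data[None, :]) / h
    coef = np.zeros(2 * m + 1)
    coef[-1] = 1.0
    k2m = hermeval(u, coef) * np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(k2m, 0.0)    # again omit the i == j terms
    return (-1) ** m * k2m.sum() / (n * (n - 1) * h ** (2 * m + 1))
```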
3. Rates of Convergence

In ordinary kernel density estimation, the rate of convergence is typically determined either by the smoothness of the underlying density or by the order of the kernel function. The density $f$ will be said to have smoothness of order $p > 0$ whenever there is a constant $M > 0$ so that, for all real $x$ and $y$,
$$|f^{(t)}(x) - f^{(t)}(y)| \le M|x - y|^{\alpha}, \qquad (3.1)$$
where $p = t + \alpha$ and $0 < \alpha \le 1$. The kernel $K$ will be said to have order $k$ when $\int K = 1$, $\int u^jK(u)\,du = 0$ for $j = 1, \ldots, k-1$, and $\int u^kK(u)\,du \neq 0$.

Lemma 3.1: If $f$ has smoothness of order $p > m$ and $K$ has order $k$, then as $n \to \infty$ and $h \to 0$:

(a) for $p > 2m$,
$$\mathrm{var}(\hat\theta_m) = 2n^{-2}h^{-4m-1}\Bigl(\int f^2\Bigr)\int\bigl(K^{(m)}{*}K^{(m)}\bigr)^2 + 4n^{-1}\Bigl\{\int\bigl(f^{(2m)}\bigr)^2 f - \theta_m^2\Bigr\} + o\bigl(n^{-2}h^{-4m-1} + n^{-1}\bigr);$$

(b) for $p \le 2m$,
$$\mathrm{var}(\hat\theta_m) = O\bigl\{n^{-2}h^{-4m-1} + n^{-1}h^{-2(2m-p)}\bigr\};$$

(c) for $p > k + m$,
$$\{E(\hat\theta_m) - \theta_m\}^2 = h^{2k}(k!)^{-2}\Bigl\{\int u^kK(u)\,du\Bigr\}^2\Bigl\{\int f^{(m)}f^{(m+k)}\Bigr\}^2 + o(h^{2k});$$

(d) for $p \le k + m$,
$$\{E(\hat\theta_m) - \theta_m\}^2 = O\bigl\{h^{2(p-m)}\bigr\}.$$
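A small Monte Carlo check (an illustrative addition, reusing `theta_hat` from the Section 2 sketch) makes the variance-bias tradeoff of Lemma 3.1 visible: for $N(0,1)$ data, $\theta_2 = 3/(8\sqrt\pi)$, so empirical bias and variance can be tabulated across bandwidths.

```python
import numpy as np

rng = np.random.default_rng(0)
theta2_true = 3.0 / (8.0 * np.sqrt(np.pi))  # theta_2 for the N(0,1) density

for h in (0.15, 0.3, 0.6):
    est = np.array([theta_hat(rng.standard_normal(200), h, m=2) for _ in range(200)])
    print(f"h = {h:.2f}: bias^2 = {(est.mean() - theta2_true)**2:.3g}, "
          f"var = {est.var():.3g}")
```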
The proof of Lemma 3.1 is in the appendix. The various special cases appearing in Lemma 3.1 may be combined into a general mean squared error result if we introduce the notation
$$\nu = \min(p - m,\,k).$$
Most cases allow statements only about the best exponent of convergence. These are summarized in:

Theorem 3.2: Under the assumptions of Lemma 3.1,

(a) when $\nu \le 2m + \tfrac12$,
$$E(\hat\theta_m - \theta_m)^2 = O\bigl(n^{-4\nu/(2\nu+4m+1)}\bigr),$$
by taking $h = n^{-2/(2\nu+4m+1)}$;

(b) when $\nu > 2m + \tfrac12$,
$$E(\hat\theta_m - \theta_m)^2 = O(n^{-1}),$$
by taking $h \in [n^{-1/(4m+1)},\,n^{-1/(2\nu)}]$.
When both $k$ and $p$ are sufficiently large, not only the best exponent of convergence, but also the best constants, may be given. First define
$$C_1 = 2\Bigl(\int f^2\Bigr)\int\bigl(K^{(m)}{*}K^{(m)}\bigr)^2, \qquad C_2 = (k!)^{-2}\Bigl\{\int u^kK(u)\,du\Bigr\}^2\Bigl\{\int f^{(m)}f^{(m+k)}\Bigr\}^2,$$
so that the leading mean squared error terms of Lemma 3.1 are $C_1n^{-2}h^{-4m-1} + C_2h^{2k}$.

Theorem 3.3:
Under the assumptions of Lemma 3.1, minimum mean squared errors are achieved as follows:

(a) For $k < 2m + \tfrac12$ and $k < p - m$,
$$E(\hat\theta_m - \theta_m)^2 = \frac{2k+4m+1}{4m+1}\,C_2\left\{\frac{(4m+1)C_1 n^{-2}}{2k\,C_2}\right\}^{2k/(4m+2k+1)} + o\bigl(n^{-4k/(4m+2k+1)}\bigr),$$
by taking
$$h = \left\{\frac{(4m+1)C_1 n^{-2}}{2k\,C_2}\right\}^{1/(4m+2k+1)};$$

(b) for $\nu > 2m + \tfrac12$,
$$E(\hat\theta_m - \theta_m)^2 = 4n^{-1}\Bigl\{\int\bigl(f^{(2m)}\bigr)^2 f - \theta_m^2\Bigr\} + o(n^{-1}),$$
by taking any $h$ which satisfies $hn^{1/(4m+1)} \to \infty$ and $hn^{1/(2\nu)} \to 0$.
The proofs of Theorems 3.2 and 3.3 are immediate from Lemma 3.1. Note that there are a number of "boundary cases", such as $k = 2m + \tfrac12$, that are not explicitly stated here, but may be handled with no additional work.
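The bandwidth in Theorem 3.3(a) is simply the minimizer of the two leading terms $C_1n^{-2}h^{-4m-1} + C_2h^{2k}$; the following sketch (an illustrative addition, with user-supplied constants) evaluates both the criterion and its closed-form minimizer.

```python
def mse_leading_terms(h, n, m, k, c1, c2):
    """Leading variance plus squared-bias terms from Lemma 3.1."""
    return c1 / (n**2 * h ** (4 * m + 1)) + c2 * h ** (2 * k)

def h_opt(n, m, k, c1, c2):
    """Closed-form minimizer, as in Theorem 3.3(a):
    h = {(4m+1) C1 / (2k C2 n^2)}^{1/(4m+2k+1)}."""
    return ((4 * m + 1) * c1 / (2 * k * c2 * n**2)) ** (1.0 / (4 * m + 2 * k + 1))
```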
4. Discussion

Remark 4.1: For rate of convergence results which include some special cases of those presented here, see Schweder (1975) and Sheather and Hettmansperger (1987). These papers also treat the important problem of how to choose the bandwidth, $h$.
Remark 4.2: A very important question is: are the rates obtained in Theorem 3.2 the best possible? We conjecture that they are, in the sense of Farrell (1972) and Stone (1980, 1982). In some as yet unpublished work in a closely related setting, L. Goldstein and K. Messer have established some interesting results of this type. Unfortunately that work does not extend to our case.

Remark 4.3: When $\nu > 2m + \tfrac12$, Theorem 3.3 still leaves a good deal of room for the choice of $h$. A slight extension of the expansion of Lemma 3.1 can be used to develop a second order optimality theory of the type sometimes called "deficiency". See Marron and Sheather (1987) for an account of the literature on this topic.

Remark 4.4: Another natural question is: how do the estimators $\hat\theta_m$ and $\tilde\theta_m$ compare? It is easily seen that $\hat\theta_m$ has smaller variance and $\tilde\theta_m$ has less bias. A means of comparison is to look at the minimum mean square error as given in (a) of Theorem 3.3. Note that $C_1$ and $C_2$ appear as a weighted geometric mean, so the question of which of $\hat\theta_m$ and $\tilde\theta_m$ is better can only be resolved for each specific $K$.
Remark 4.5: Lemma 3.1 can also be used to obtain a theory for optimal choice of $K$, such as the one studied by Epanechnikov (1969) and Gasser, Müller and Mammitzsch (1985). Note that the answer here is the same as that of Epanechnikov in the case of $\hat\theta_0$.
Remark 4.6: It is completely straightforward to extend the results of this paper to the case where $f(x)$ is a density on $\mathbb{R}^d$. For clarity of presentation, this case is not explicitly treated here.
Remark 4.7: Theorem 3.2 has important implications for automatic bandwidth selection of an ordinary kernel density estimator. Hall and Marron (1987a) have shown that, if $\hat h_C$ is the bandwidth chosen by least squares cross-validation, then for $k = 2$ and $p \ge 2$,
$$(\hat h_C - h_0)/h_0 = O_p(n^{-1/10}),$$
where $h_0$ is the bandwidth which minimizes mean integrated squared error. Scott and Terrell (1986) have proposed another bandwidth selector which gives similar performance when $k = 2$ and $p \ge 4$. Hall and Marron (1987b) describe a sense in which the rate $n^{-1/10}$ is the best possible for $p$ essentially no bigger than 2. When $k = 2$ and $p > 2$,
$$h_0 \sim h_0^* = n^{-1/5}\Bigl[\int K^2\Bigl\{\int x^2K(x)\,dx\Bigr\}^{-2}\theta_2^{-1}\Bigr]^{1/5};$$
see Rosenblatt (1971), for example. This motivates using the bandwidth
$$\hat h = n^{-1/5}\Bigl[\int K^2\Bigl\{\int x^2K(x)\,dx\Bigr\}^{-2}\hat\theta_2^{-1}\Bigr]^{1/5},$$
where either $\hat\theta_2$ or $\tilde\theta_2$ from Section 2 may be used. To compare this with $\hat h_C$, note that, by Theorem 3.2, for properly chosen $h$ and $k$ sufficiently large,
$$(\hat h - h_0^*)/h_0^* = O_p\bigl(n^{-2(p-2)/(2p+5)}\bigr) \text{ for } p \le 6.5, \qquad O_p\bigl(n^{-1/2}\bigr) \text{ for } p > 6.5.$$
Thus, if we ignore the difference between $h_0^*$ and $h_0$, $\hat h$ is better than $\hat h_C$ for $p \ge 2.5$. However, the important feature of this observation is not so much the accuracy, confirmed by the faster rate of convergence, but the stability: the plug-in bandwidth, with a relative error as small as $n^{-1/2}$, is much more robust against sampling fluctuations than is the cross-validatory bandwidth, with an error of $n^{-1/10}$.
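As an illustration of this plug-in recipe (an addition to the paper, reusing `theta_hat` from the Section 2 sketch), note that for the standard Gaussian kernel $\int K^2 = 1/(2\sqrt\pi)$ and $\int x^2K(x)\,dx = 1$; the pilot bandwidth below is an assumption the user must supply.

```python
import numpy as np

def plugin_bandwidth(data, pilot_h):
    """h = n^{-1/5} [ int K^2 {int x^2 K}^{-2} / theta_2 ]^{1/5} for a Gaussian
    kernel, with theta_2 = int (f'')^2 estimated at a pilot bandwidth."""
    n = len(data)
    t2 = theta_hat(data, pilot_h, m=2)   # Section 2 sketch, m = 2
    return (1.0 / (2.0 * np.sqrt(np.pi) * t2 * n)) ** 0.2
```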
Appendix

Proof of Lemma 3.1: First consider the bias. Note that
$$E(\tilde\theta_m) = (-1)^m h^{-2m-1}\iint K^{(2m)}\{(x-y)/h\}\,f(x)f(y)\,dx\,dy$$
$$= (-1)^m h^{-2m}\iint K^{(2m)}(u)\,f(x)f(x-hu)\,dx\,du$$
$$= (-1)^m h^{-m}\iint K^{(m)}(u)\,f(x)f^{(m)}(x-hu)\,dx\,du$$
$$= \iint K(u)\,f^{(m)}(x)f^{(m)}(x-hu)\,dx\,du.$$
Similarly,
$$E(\hat\theta_m) = \iint K{*}K(u)\,f^{(m)}(x)f^{(m)}(x-hu)\,dx\,du.$$
Part (c) now follows by a $k$-th order Taylor expansion of $f^{(m)}(x-hu)$, together with the fact that both $K$ and $K*K$ are of order $k$. Part (d) follows by a $(t-m)$-th order Taylor expansion of $f^{(m)}(x-hu)$, together with the Lipschitz condition (3.1).

For the variance component, note that
$$\mathrm{var}(\tilde\theta_m) = n^{-2}(n-1)^{-2}h^{-4m-2}\sum_{i\neq j}\sum\;\sum_{i'\neq j'}\sum\,\mathrm{cov}\bigl[K^{(2m)}\{(X_i-X_j)/h\},\,K^{(2m)}\{(X_{i'}-X_{j'})/h\}\bigr].$$
But the covariance vanishes unless the pairs $(i,j)$ and $(i',j')$ share at least one index, so that
$$\mathrm{var}(\tilde\theta_m) = 2n^{-1}(n-1)^{-1}h^{-4m-2}\,\mathrm{var}\bigl[K^{(2m)}\{(X_1-X_2)/h\}\bigr]$$
$$\qquad + 4(n-2)n^{-1}(n-1)^{-1}h^{-4m-2}\,\mathrm{cov}\bigl[K^{(2m)}\{(X_1-X_2)/h\},\,K^{(2m)}\{(X_2-X_3)/h\}\bigr]. \qquad (A.1)$$
Now
$$E\bigl[\bigl\{h^{-2m-1}K^{(2m)}\{(X_1-X_2)/h\}\bigr\}^2\bigr] = h^{-4m-1}\iint\{K^{(2m)}(u)\}^2 f(x)f(x-hu)\,dx\,du,$$
and for $p \ge 2m$,
$$E\bigl[h^{-4m-2}K^{(2m)}\{(X_1-X_2)/h\}\,K^{(2m)}\{(X_2-X_3)/h\}\bigr] = h^{-4m}\iiint K^{(2m)}(u)K^{(2m)}(v)\,f(y+hu)f(y)f(y-hv)\,du\,dy\,dv \qquad (A.2)$$
$$= \iiint K(u)K(v)\,f^{(2m)}(y+hu)f(y)f^{(2m)}(y-hv)\,du\,dy\,dv = \int\bigl(f^{(2m)}\bigr)^2 f + o(1).$$
Hence, since $E\bigl[h^{-2m-1}K^{(2m)}\{(X_1-X_2)/h\}\bigr] \to (-1)^m\theta_m$, we have for $p \ge 2m$,
$$h^{-4m-2}\,\mathrm{cov}\bigl[K^{(2m)}\{(X_1-X_2)/h\},\,K^{(2m)}\{(X_2-X_3)/h\}\bigr] = \Bigl\{\int\bigl(f^{(2m)}\bigr)^2 f - \theta_m^2\Bigr\} + o(1).$$
For the estimator $\tilde\theta_m$, part (a) now follows from (A.1). To modify this argument for part (b), the only change required is in (A.2), where fewer integrations by parts should be done and the Lipschitz condition (3.1) is again applied. The proof for $\hat\theta_m$ is entirely similar.
References

Devroye, L. and Györfi, L. (1984), Nonparametric Density Estimation: The L1 View, Wiley, New York.

Epanechnikov, V. A. (1969), "Non-parametric estimation of a multivariate probability density," Theory of Probability and its Applications, 14, 153-158.

Farrell, R. H. (1972), "On the best obtainable rates of convergence in estimation of a density function at a point," Annals of Mathematical Statistics, 43, 170-180.

Gasser, T., Müller, H. G. and Mammitzsch, V. (1985), "Kernels for nonparametric curve estimation," Journal of the Royal Statistical Society, Series B, 47, 238-252.

Hall, P. and Marron, J. S. (1987a), "Extent to which least-squares cross-validation minimises integrated square error in nonparametric density estimation," Probability Theory and Related Fields, to appear.

Hall, P. and Marron, J. S. (1987b), "The amount of noise inherent in bandwidth selection for a kernel density estimator," Annals of Statistics, to appear.

Hall, P. and Marron, J. S. (1986a), "Choice of kernel order in density estimation," unpublished manuscript.

Hall, P. and Marron, J. S. (1986b), "Variable window width kernel estimates of probability densities," unpublished manuscript.

Marron, J. S. and Sheather, S. J. (1987), "Kernel quantile estimators," manuscript in preparation.

Parzen, E. (1962), "On estimation of a probability density function and mode," Annals of Mathematical Statistics, 33, 1065-1076.

Prakasa Rao, B. L. S. (1983), Nonparametric Functional Estimation, Academic Press, New York.

Rosenblatt, M. (1956), "Remarks on some non-parametric estimates of a density function," Annals of Mathematical Statistics, 27, 832-837.

Rosenblatt, M. (1971), "Curve estimates," Annals of Mathematical Statistics, 42, 1815-1842.

Schweder, T. (1975), "Window estimation of the asymptotic variance of rank estimators of location," Scandinavian Journal of Statistics, 2, 113-126.

Scott, D. W. and Terrell, G. R. (1986), "Biased and unbiased cross-validation in density estimation," Rice University Tech. Report No. 87-02.

Sheather, S. J. and Hettmansperger, T. P. (1987), "A data-based algorithm for choosing the window width when estimating the integral of f²(x)," unpublished manuscript.

Silverman, B. W. (1986), Density Estimation for Statistics and Data Analysis, Chapman and Hall, New York.

Stone, C. J. (1980), "Optimal convergence rates for nonparametric estimators," Annals of Statistics, 8, 1348-1360.

Stone, C. J. (1982), "Optimal global rates of convergence for nonparametric regression," Annals of Statistics, 10, 1040-1053.