Estimation of Integrated Squared Density Derivatives

by

Peter Hall
Australian National University

J.S. Marron¹
Australian National University and University of North Carolina

AMS 1980 subject classification: primary 62G05, secondary 62G20.

Key words and phrases: Integrated squared derivative, kernel estimators, nonparametric estimation, rates of convergence.

¹Research partially supported by NSF Grant DMS-8400602.

Abstract:

Kernel density estimators are used for the estimation of integrals of various squared derivatives of a probability density. Rates of convergence in mean squared error are calculated, which show that appropriate values of the smoothing parameter are much smaller than those for ordinary density estimation. The rate of convergence increases with stronger smoothness assumptions; however, unlike ordinary density estimation, the parametric rate of $n^{-1}$ can be achieved even when only a finite amount of differentiability is assumed. The implications for data-driven bandwidth selection in ordinary density estimation are considered.

1. Introduction

The estimation of the integral of a squared probability density has long been important in the study of rank-based nonparametric statistics. See Sheather and Hettmansperger (1987) and Section 4.4 of Prakasa Rao (1983) for an account of the literature on this topic. One method of data-driven bandwidth selection for density estimation involves plugging estimates of integrated squared derivatives into an asymptotic representation for the optimal bandwidth. Under nonparametric assumptions it is natural to form estimates of these quantities based on a kernel estimate of the underlying density. Section 2 describes two methods for doing this, and provides motivation for a slight modification of the estimators. Section 3 contains rate-of-convergence results in mean squared error of the type developed by Rosenblatt (1956, 1971) and Parzen (1962). As for standard density estimation, the rates become faster when stronger smoothness assumptions are made. An optimality theory is developed in which variance and bias are balanced. Since integration is a smoothing operation, it is not surprising that the optimal bandwidth is much smaller for the integrated squared derivatives of a density than for the ordinary derivatives. A more surprising result is that, unlike the case of standard density estimation, the parametric rate of $n^{-1}$ may be achieved even when only a finite number of derivatives are assumed to exist for the underlying density. Section 4 has some remarks, including a discussion of the implications of the convergence rate results for automatic bandwidth selection in density estimation. All proofs are in the appendix.


2. The Estimators

Consider the problem of estimating, for some $m = 0, 1, \ldots$, the parameter

$$\theta_m = \int \{f^{(m)}(x)\}^2\,dx,$$

using a random sample, $X_1, \ldots, X_n$, from a probability density $f$. An obvious first attempt at estimation is

$$\hat\theta = \int \{\hat f^{(m)}(x)\}^2\,dx,$$

where $\hat f(x)$ is some reasonable estimator of $f(x)$. One candidate is the kernel estimator

$$\hat f_h(x) = n^{-1} \sum_{i=1}^{n} K_h(x - X_i),$$

where here and in the following a subscript $h$ means a rescaling of the type $K_h(\cdot) = h^{-1} K(\cdot/h)$. $K$ is called the kernel function, and the amount of smoothing is controlled by the bandwidth $h$. See Prakasa Rao (1983), Devroye and Györfi (1984), and Silverman (1986) for access to the large literature concerning $\hat f_h$.
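As a concrete illustration (ours, not the paper's), a minimal sketch of $\hat f_h$ for a standard normal kernel might look as follows; the function name and the choice of Gaussian $K$ are assumptions made here for definiteness.

```python
import numpy as np
from scipy.stats import norm

def f_hat(x, data, h):
    """Kernel density estimate f_hat_h(x) = n^-1 sum_i K_h(x - X_i),
    with K the standard normal density and K_h(u) = K(u/h)/h."""
    z = (np.asarray(x, dtype=float)[..., None] - data) / h   # (x - X_i)/h
    return norm.pdf(z).mean(axis=-1) / h
```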

The fact that $\hat\theta$ can be improved follows from the expansion

$$\hat\theta = (-1)^m \Big\{ n^{-1} h^{-2m-1}\, K^{(m)} * K^{(m)}(0) + n^{-2} \sum_{i \ne j} K_h^{(m)} * K_h^{(m)}(X_i - X_j) \Big\}, \qquad (2.1)$$

where $*$ denotes convolution. Note that the first term does not make use of the data, and hence may be thought of as adding a type of bias to the estimator. This motivates the estimator

$$\hat\theta_m = (-1)^m n^{-1} (n-1)^{-1} \sum_{i \ne j} K_h^{(m)} * K_h^{(m)}(X_i - X_j).$$

The convergence rate methods described in Section 3 can be used to show that the bias introduced by the first term in (2.1) can actually dominate the mean squared error, and so only $\hat\theta_m$ is treated here. The squared-error rate of convergence of $\hat\theta_m$ is never inferior to that of $\hat\theta$.
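To make the double-sum form of $\hat\theta_m$ concrete, the following sketch (ours, continuing the one above) uses the identity $K_h^{(m)} * K_h^{(m)} = (K * K)_h^{(2m)}$ for a standard normal kernel, where $K * K$ is the $N(0,2)$ density and derivatives of a normal density are Hermite polynomials times the density itself; the function name and the SciPy helpers are assumptions.

```python
from scipy.special import eval_hermitenorm   # probabilists' Hermite polynomials He_r

def theta_hat(data, h, m):
    """theta_hat_m = (-1)^m n^-1 (n-1)^-1 sum_{i!=j} K_h^(m)*K_h^(m)(X_i - X_j)
    for the standard normal kernel K, using
    (K*K)_h^(2m)(d) = s^(-2m-1) He_{2m}(d/s) phi(d/s) with s = sqrt(2) h."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    s = np.sqrt(2.0) * h
    z = (data[:, None] - data[None, :]) / s           # scaled pairwise differences
    w = eval_hermitenorm(2 * m, z) * norm.pdf(z) / s ** (2 * m + 1)
    np.fill_diagonal(w, 0.0)                          # drop the i = j terms of (2.1)
    return (-1) ** m * w.sum() / (n * (n - 1))
```

For $N(0,1)$ data and $m = 0$, the output should settle near $\theta_0 = \int \phi^2 = 1/(2\sqrt{\pi}) \approx 0.282$ for small $h$ and large $n$.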


Another estimate of $\theta_m$ is motivated by the fact that, under strong enough conditions,

$$\theta_m = (-1)^m \int f^{(2m)}(x) f(x)\,dx,$$

which can be estimated by

$$(-1)^m n^{-1} \sum_{i=1}^{n} \hat f^{(2m)}(X_i).$$

The same argument used above to motivate $\hat\theta_m$ can be employed to show that a better version of this estimator is

$$\tilde\theta_m = (-1)^m n^{-1} (n-1)^{-1} \sum_{i \ne j} K_h^{(2m)}(X_i - X_j).$$

At first glance it might seem that $\tilde\theta_m$ will be inferior to $\hat\theta_m$, since derivatives of order $2m$ appear to be used in the motivation of $\tilde\theta_m$, while only $m$ derivatives of $f$ appear in $\hat\theta_m$. The fact that this is not the case is demonstrated in Section 3, where it is seen that the two estimators have very similar properties, even when $f$ has fewer than $2m$ derivatives. Some idea of why this is the case is given by writing $\hat\theta_m$ in the same form as $\tilde\theta_m$, with $K$ replaced by $K * K$: in both estimators, all of the differentiation is applied to the kernel rather than to $f$.
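Continuing the sketches above (same caveats), $\tilde\theta_m$ differs only in applying $K^{(2m)}$ directly:

```python
def theta_tilde(data, h, m):
    """theta_tilde_m = (-1)^m n^-1 (n-1)^-1 sum_{i!=j} K_h^(2m)(X_i - X_j),
    again for the standard normal kernel K."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    z = (data[:, None] - data[None, :]) / h
    w = eval_hermitenorm(2 * m, z) * norm.pdf(z) / h ** (2 * m + 1)
    np.fill_diagonal(w, 0.0)
    return (-1) ** m * w.sum() / (n * (n - 1))
```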


3. Rates of Convergence

In ordinary kernel density estimation, the rate of convergence is typically determined either by the smoothness of the underlying density or by the order of the kernel function. The density $f$ will be said to have smoothness of order $p > 0$ whenever there is a constant $M > 0$ so that, for all real $x$ and $y$,

$$|f^{(\ell)}(x) - f^{(\ell)}(y)| \le M |x - y|^{\alpha}, \qquad (3.1)$$

where $p = \ell + \alpha$ and $0 < \alpha \le 1$. The kernel $K$ will be said to have order $k$ when $\int K = 1$, $\int u^j K(u)\,du = 0$ for $j = 1, \ldots, k-1$, and $\int u^k K(u)\,du \ne 0$.

Lemma 3.1: If $f$ has smoothness of order $p > m$ and $K$ has order $k$, then as $n \to \infty$ and $h \to 0$:

(a) for $p \ge 2m$,
$$\mathrm{var}(\tilde\theta_m) = 2 n^{-2} h^{-4m-1} \Big(\int f^2\Big) \int \big(K^{(2m)}\big)^2 + 4 n^{-1} \Big\{\int \big(f^{(2m)}\big)^2 f - \theta_m^2\Big\} + o\big(n^{-2} h^{-4m-1} + n^{-1}\big);$$

(b) for smoothness of $f$ of $p \le 2m$,
$$\mathrm{var}(\tilde\theta_m) = 2 n^{-2} h^{-4m-1} \Big(\int f^2\Big) \int \big(K^{(2m)}\big)^2 + o\big(n^{-2} h^{-4m-1}\big) + O\big(n^{-1} h^{-2(2m-p)}\big);$$

(c) for $p > k + m$,
$$\{E(\tilde\theta_m) - \theta_m\}^2 = h^{2k} (k!)^{-2} \Big\{\int u^k K(u)\,du\Big\}^2 \Big(\int f^{(m)} f^{(m+k)}\Big)^2 + o(h^{2k});$$

(d) for $p \le k + m$,
$$\{E(\tilde\theta_m) - \theta_m\}^2 = O\big(h^{2(p-m)}\big).$$

The same expansions hold for $\hat\theta_m$, with $K$ replaced throughout by $K * K$; note that $(K * K)^{(2m)} = K^{(m)} * K^{(m)}$ and that $K * K$ is also a kernel of order $k$.
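A Monte Carlo check of the variance and squared-bias behaviour described by Lemma 3.1 can be sketched as follows (ours; it reuses `theta_hat` from Section 2, and the sample size, replication count and $N(0,1)$ target are arbitrary choices). For the standard normal density, $\theta_1 = \int (\phi')^2 = 1/(4\sqrt{\pi})$.

```python
def bias_var_curve(m, h_grid, theta_true, n=200, reps=200, seed=0):
    """Estimate squared bias and variance of theta_hat_m over a bandwidth
    grid, for repeated N(0,1) samples of size n."""
    rng = np.random.default_rng(seed)
    est = np.array([[theta_hat(rng.standard_normal(n), h, m) for h in h_grid]
                    for _ in range(reps)])
    return (est.mean(axis=0) - theta_true) ** 2, est.var(axis=0)
```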

The proof of Lemma 3.1 is in the appendix.

The various special cases appearing in Lemma 3.1 may be combined into a general mean squared error result if we introduce the notation

$$\mu = \min(p - m,\, k).$$

Most cases allow statements only about the best exponent of convergence. These are summarized in:

Theorem 3.2: Under the assumptions of Lemma 3.1,

(a) when $\mu \le 2m + \frac{1}{2}$,
$$E(\hat\theta_m - \theta_m)^2 = O\big(n^{-4\mu/(2\mu+4m+1)}\big),$$
by taking $h$ of exact order $n^{-2/(2\mu+4m+1)}$;

(b) when $\mu > 2m + \frac{1}{2}$,
$$E(\hat\theta_m - \theta_m)^2 = O(n^{-1}),$$
by taking $h \in [n^{-1/(4m+1)},\, n^{-1/(2\mu)}]$.

The same statements hold for $\tilde\theta_m$.

When both $k$ and $p$ are sufficiently large, not only the best exponent of convergence, but also the best constants, may be given. First define

$$C_1 = 2 \Big(\int f^2\Big) \int \big(K^{(2m)}\big)^2, \qquad C_2 = (k!)^{-2} \Big\{\int u^k K(u)\,du\Big\}^2 \Big(\int f^{(m)} f^{(m+k)}\Big)^2$$

in the case of $\tilde\theta_m$; for $\hat\theta_m$, replace $K$ by $K * K$ in these definitions.

Theorem 3.3: Under the assumptions of Lemma 3.1, minimum mean squared errors are achieved as follows:

(a) when $k < 2m + \frac{1}{2}$ and $k < p - m$,

$$E(\tilde\theta_m - \theta_m)^2 = \frac{(2k+4m+1)\, C_2}{4m+1} \left[\frac{(4m+1)\, C_1\, n^{-2}}{2k\, C_2}\right]^{2k/(4m+2k+1)} + o\big(n^{-4k/(4m+2k+1)}\big),$$

by taking

$$h = \left[\frac{(4m+1)\, C_1\, n^{-2}}{2k\, C_2}\right]^{1/(4m+2k+1)} \{1 + o(1)\};$$

(b) when $\mu > 2m + \frac{1}{2}$,

$$E(\tilde\theta_m - \theta_m)^2 = 4 \Big\{\int \big(f^{(2m)}\big)^2 f - \theta_m^2\Big\}\, n^{-1} + o(n^{-1}),$$

by taking any $h$ which satisfies $h n^{1/(4m+1)} \to \infty$ and $h n^{1/(2\mu)} \to 0$.

The proofs of Theorems 3.2 and 3.3 are immediate from Lemma 3.1. Note that there are a number of "boundary cases", such as $k = 2m + \frac{1}{2}$, that are not explicitly stated here, but may be handled with no additional work.
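The rate and bandwidth prescriptions of Theorems 3.2 and 3.3 translate directly into code; the sketch below (ours) returns the best mean squared error exponent of Theorem 3.2 and the asymptotically optimal bandwidth of Theorem 3.3(a), with the constants $C_1$ and $C_2$ supplied by the user (in practice they involve unknown functionals of $f$ and would themselves have to be estimated).

```python
def mse_exponent(p, m, k):
    """Best exponent r with E(theta_hat_m - theta_m)^2 = O(n^-r), Theorem 3.2."""
    mu = min(p - m, k)
    return 1.0 if mu > 2 * m + 0.5 else 4 * mu / (2 * mu + 4 * m + 1)

def optimal_h(n, m, k, c1, c2):
    """Bandwidth minimizing the leading terms c1 n^-2 h^(-4m-1) + c2 h^(2k)
    of the mean squared error, as in Theorem 3.3(a)."""
    return ((4 * m + 1) * c1 / (2 * k * c2 * n ** 2)) ** (1.0 / (4 * m + 2 * k + 1))
```

For example, `mse_exponent(4, 2, 2)` gives $8/13$, the best rate for estimating $\theta_2$ from a density of smoothness $p = 4$ with a second-order kernel.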


4. Discussion

Remark 4.1: For rate of convergence results which include some special cases of those presented here, see Schweder (1975) and Sheather and Hettmansperger (1987). These papers also treat the important problem of how to choose the bandwidth, $h$.

Remark 4.2: A very important question is: are the rates obtained in Theorem 3.2 the best possible? We conjecture that they are, in the sense of Farrell (1972) and Stone (1980, 1982). In some as yet unpublished work in a closely related setting, L. Goldstein and K. Messer have established some interesting results of this type. Unfortunately that work does not extend to our case.

Remark 4.3: When $\mu > 2m + \frac{1}{2}$, Theorem 3.3 still leaves a good deal of room for the choice of $h$. A slight extension of the expansion of Lemma 3.1 can be used to develop a second order optimality theory of the type sometimes called "deficiency". See Marron and Sheather (1987) for an account of the literature on this topic.

Remark 4.4: Another natural question is: how do the estimators $\hat\theta_m$ and $\tilde\theta_m$ compare? It is easily seen that the variance constant $C_1$ is smaller for $\hat\theta_m$, while, since $\int u^k (K * K)(u)\,du = 2 \int u^k K(u)\,du$, the squared-bias constant $C_2$ is smaller for $\tilde\theta_m$. Hence $\hat\theta_m$ has smaller variance and $\tilde\theta_m$ has less bias. A means of comparison is to look at the minimum mean squared error as given in (a) of Theorem 3.3. Note that $C_1$ and $C_2$ appear there as a weighted geometric mean, so the question of which of $\hat\theta_m$ and $\tilde\theta_m$ is better can only be resolved for each specific $K$.

Remark 4.5: Lemma 3.1 can also be used to obtain a theory for optimal choice of $K$, such as the one studied by Epanechnikov (1969) and Gasser, Müller and Mammitzsch (1985). Note that the answer here is the same as that of Epanechnikov in the case of $\hat\theta_0$.

Remark 4.6: It is completely straightforward to extend the results of this paper to the case where $f(x)$ is a density on $\mathbb{R}^d$. For clarity of presentation, this case is not explicitly treated here.

Remark 4.7: Theorem 3.2 has important implications for automatic bandwidth selection for an ordinary kernel density estimator. Hall and Marron (1987a) have shown that, if $h_C$ is the bandwidth chosen by least squares cross-validation, then for $k = 2$ and $p \ge 2$,

$$(h_C - h_0)/h_0 = O_p(n^{-1/10}),$$

where $h_0$ is the bandwidth which minimizes mean integrated squared error. Scott and Terrell (1986) have proposed another bandwidth selector which gives similar performance when $k = 2$ and $p \ge 4$. Hall and Marron (1987b) describe a sense in which the rate $n^{-1/10}$ is the best possible for $p$ essentially no bigger than 2. When $k = 2$ and $p > 2$,

$$h_0 \approx h_0^* = n^{-1/5} \left[\int K^2 \Big\{\int x^2 K(x)\,dx\Big\}^{-2} \theta_2^{-1}\right]^{1/5};$$

see Rosenblatt (1971), for example. This motivates using the bandwidth

$$\hat h = n^{-1/5} \left[\int K^2 \Big\{\int x^2 K(x)\,dx\Big\}^{-2} \hat\theta_2^{-1}\right]^{1/5},$$

where $\hat\theta_2$ is either of the two estimators of $\theta_2$ considered above. To compare this with $h_C$, note that, by Theorem 3.2, for properly chosen pilot bandwidth $h$ and $k$ sufficiently large,

$$\frac{\hat h - h_0^*}{h_0^*} = \begin{cases} O_p\big(n^{-2(p-2)/(2p+5)}\big), & p \le 6.5, \\ O_p\big(n^{-1/2}\big), & p > 6.5. \end{cases}$$

Thus, if we ignore the difference between $h_0^*$ and $h_0$, $\hat h$ is better than $h_C$ for $p > 2.5$. However, the important feature of this observation is not so much the accuracy confirmed by the faster rate of convergence, but the stability: the plug-in bandwidth, with a relative error as small as $n^{-1/2}$, is much more robust against sampling fluctuations than is the cross-validatory bandwidth, with an error of $n^{-1/10}$.
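A sketch (ours) of this plug-in rule for the standard normal kernel, for which $\int K^2 = 1/(2\sqrt{\pi})$ and $\int x^2 K(x)\,dx = 1$; the pilot bandwidth `h0` used inside $\hat\theta_2$ is an assumption here, its choice being governed by Theorem 3.2.

```python
def plugin_bandwidth(data, h0):
    """h_hat = n^(-1/5) [ int K^2 * {int x^2 K}^(-2) / theta_2_hat ]^(1/5),
    with K the standard normal density."""
    n = len(data)
    t2 = theta_hat(data, h0, 2)       # estimate of theta_2 = int (f'')^2
    return (1.0 / (2.0 * np.sqrt(np.pi) * t2 * n)) ** 0.2
```

For $N(0,1)$ data, $\theta_2 = 3/(8\sqrt{\pi})$, and the formula reduces to the familiar $(4/3)^{1/5} n^{-1/5} \approx 1.06\, n^{-1/5}$ rule.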


Appendix

Proof of Lemma 3.1: First consider the bias. Note that

$$E(\tilde\theta_m) = h^{-2m-1} (-1)^m \iint K^{(2m)}\{(x-y)/h\}\, f(x) f(y)\,dx\,dy$$
$$= h^{-2m} (-1)^m \iint K^{(2m)}(u) f(x) f(x - hu)\,dx\,du$$
$$= h^{-m} (-1)^m \iint K^{(m)}(u) f(x) f^{(m)}(x - hu)\,dx\,du$$
$$= \iint K(u) f^{(m)}(x) f^{(m)}(x - hu)\,dx\,du,$$

the successive equalities following by substitution and repeated integration by parts. Similarly,

$$E(\hat\theta_m) = \iint (K * K)(u) f^{(m)}(x) f^{(m)}(x - hu)\,dx\,du.$$

Part (c) now follows by a $k$-th order Taylor expansion of $f^{(m)}(x - hu)$, together with the fact that both $K$ and $K * K$ are of order $k$. Part (d) follows by an $(\ell - m)$-th order Taylor expansion of $f^{(m)}(x - hu)$, together with the Lipschitz condition (3.1).

For the variance component, note that

$$\mathrm{var}(\tilde\theta_m) = n^{-2} (n-1)^{-2} h^{-4m-2}\, \mathrm{var}\Big[\sum_{i \ne j} K^{(2m)}\{(X_i - X_j)/h\}\Big],$$

and that the variance of the double sum decomposes into $2n(n-1)$ terms of the form $\mathrm{var}[K^{(2m)}\{(X_1 - X_2)/h\}]$ and $4n(n-1)(n-2)$ terms of the form $\mathrm{cov}[K^{(2m)}\{(X_1 - X_2)/h\},\, K^{(2m)}\{(X_2 - X_3)/h\}]$. But

$$E\big[h^{-2m-1} K^{(2m)}\{(X_1 - X_2)/h\}\big]^2 = h^{-4m-1} \iint \{K^{(2m)}(u)\}^2 f(x) f(x - hu)\,dx\,du, \qquad (A.1)$$

and, for $p \ge 2m$,

$$E\big[h^{-4m-2} K^{(2m)}\{(X_1 - X_2)/h\}\, K^{(2m)}\{(X_2 - X_3)/h\}\big] = \iiint h^{-4m} K^{(2m)}(u) K^{(2m)}(v) f(y + hu) f(y) f(y - hv)\,du\,dy\,dv \qquad (A.2)$$
$$= \iiint K(u) K(v) f^{(2m)}(y + hu) f(y) f^{(2m)}(y - hv)\,du\,dy\,dv,$$

again by integration by parts. Hence, since $E[h^{-2m-1} K^{(2m)}\{(X_1 - X_2)/h\}] \to (-1)^m \theta_m$, we have for $p \ge 2m$,

$$h^{-4m-2}\, \mathrm{cov}\big[K^{(2m)}\{(X_1 - X_2)/h\},\, K^{(2m)}\{(X_2 - X_3)/h\}\big] = \Big\{\int \big(f^{(2m)}\big)^2 f - \theta_m^2\Big\} + o(1).$$

For the estimator $\tilde\theta_m$, part (a) now follows from (A.1) and (A.2). To modify this argument for part (b), the only change required is in (A.2), where fewer integrations by parts should be done and the Lipschitz condition (3.1) is again applied. The proof for $\hat\theta_m$ is entirely similar.


References

Devroye, L. and Györfi, L. (1984), Nonparametric Density Estimation: The $L_1$ View, Wiley, New York.

Epanechnikov, V. A. (1969), "Non-parametric estimation of a multivariate probability density," Theory of Probability and its Applications, 14, 153-158.

Farrell, R. H. (1972), "On the best obtainable rates of convergence in estimation of a density function at a point," Annals of Mathematical Statistics, 43, 170-180.

Gasser, T., Müller, H. G. and Mammitzsch, V. (1985), "Kernels for nonparametric curve estimation," Journal of the Royal Statistical Society, Series B, 47, 238-252.

Hall, P. and Marron, J. S. (1987a), "Extent to which least-squares cross-validation minimises integrated square error in nonparametric density estimation," Probability Theory and Related Fields, to appear.

Hall, P. and Marron, J. S. (1987b), "The amount of noise inherent in bandwidth selection for a kernel density estimator," Annals of Statistics, to appear.

Hall, P. and Marron, J. S. (1986a), "Choice of kernel order in density estimation," unpublished manuscript.

Hall, P. and Marron, J. S. (1986b), "Variable window width kernel estimates of probability densities," unpublished manuscript.

Marron, J. S. and Sheather, S. (1987), "Kernel quantile estimators," manuscript in preparation.

Parzen, E. (1962), "On estimation of a probability density function and mode," Annals of Mathematical Statistics, 33, 1065-1076.

Prakasa Rao, B. L. S. (1983), Nonparametric Functional Estimation, Academic Press, New York.

Rosenblatt, M. (1956), "Remarks on some non-parametric estimates of a density function," Annals of Mathematical Statistics, 27, 832-837.

Rosenblatt, M. (1971), "Curve estimates," Annals of Mathematical Statistics, 42, 1815-1842.

Schweder, T. (1975), "Window estimation of the asymptotic variance of rank estimators of location," Scandinavian Journal of Statistics, 2, 113-126.

Scott, D. W. and Terrell, G. R. (1986), "Biased and unbiased cross-validation in density estimation," Rice University Technical Report No. 87-02.

Sheather, S. J. and Hettmansperger, T. P. (1987), "A data-based algorithm for choosing the window width when estimating the integral of $f^2(x)$," unpublished manuscript.

Silverman, B. W. (1986), Density Estimation for Statistics and Data Analysis, Chapman and Hall, New York.

Stone, C. J. (1980), "Optimal convergence rates for nonparametric estimators," Annals of Statistics, 8, 1348-1360.

Stone, C. J. (1982), "Optimal global rates of convergence for nonparametric regression," Annals of Statistics, 10, 1040-1053.