GENERALIZED NONPARAMETRIC REGRESSION VIA PENALIZED LIKELIHOOD
by
Dennis D. Cox Finbarr O'Sullivan
TECHNICAL REPORT No. 170 August 1989
Department of Statistics, GN-22 University of Washington Seattle, Washington 98195 USA
Generalized Nonparametric Regression via Penalized Likelihood
Dennis D. Cox¹, Department of Statistics, University of Illinois, Urbana, IL 61801.
Finbarr O'Sullivan², Department of Statistics, University of Washington, Seattle, WA 98195.
ABSTRACT
We consider the asymptotic analysis of penalized likelihood type estimators for generalized non-parametric regression problems in which the target parameter is a vector valued function defined in terms of the conditional distribution of a response given a set of covariates. A variety of examples, including ones related to generalized linear models and robust smoothing, are covered by the theory. Upper bounds on rates of convergence for penalized likelihood-type estimators are obtained by approximating estimators in terms of one-step Taylor series expansions.
AMS 1980 subject classifications. Primary 62G05; secondary 62J05, 41A35, 41A25, 47A53, 45L10, 45M05.
Key words and phrases. Maximum penalized likelihood, nonparametric regression, smoothing splines, rates of convergence.
¹ Research partially supported by National Science Foundation Grant No. MCS-86-03083.
² Research partially supported by National Science Foundation Grant No. MCS-840-3239 and by the Department of ...
In this paper we analyze penalized likelihood-type estimators by approximating them in suitable parameter spaces. These approximations not only give rates of convergence for the error of the estimator and its derivatives, but also provide insight into the structure of the estimation error, which can be approximately decomposed into the sum of a bias (deterministic) term and a variance (random) term. We anticipate the approximations will also prove useful for analysis of methods for choosing the smoothing parameter. We begin by describing the estimation methodology and giving some examples.
1.2. Penalized Likelihood for Regression Function Estimation

Suppose one observes a sample of n i.i.d. pairs (X₁, Y₁), (X₂, Y₂), ..., (Xₙ, Yₙ) from the joint distribution P_XY; the Xᵢ's are thought of as covariates and the Yᵢ's as responses. The conditional distribution of Y given X = x is denoted Law(Y | X = x) = P_{Y|X}(· | x). Let P^{(n)}_{XY} denote the joint empirical measure of the (Xᵢ, Yᵢ), i.e.

    P^{(n)}_{XY}(A) = n^{-1} Σ_{i=1}^{n} I_A(Xᵢ, Yᵢ),
where I_A denotes the indicator function of the set A, and 𝒴 (respectively 𝒳) denotes the range of the Yᵢ (respectively Xᵢ). The target parameter θ is defined through a pointwise criterion function ρ(y | x, θ(x)), as the following examples illustrate.

(i) If Y = θ(X) + ε, where the error ε is normal with mean zero and constant variance, a natural choice is ρ(y | x, θ) = (y − θ(x))²/2.

(ii) If the error variance in (i) is non-constant and the scale σ(x) depends smoothly on x as well, then natural choices would be θ(x) = (θ₁(x), θ₂(x)) and

    ρ(y | x, θ) = e^{−2θ₂(x)} (y − θ₁(x))²/2 + θ₂(x).
Here we have chosen the parameterization θ₂(x) = log[σ(x)] to avoid awkward positivity constraints.

(iii) If the errors in (i) are no longer assumed normal but to have density f, then one would naturally use ρ(y | x, θ) = −log[f(y − θ(x))]. One may wish to replace −log f by a more general function ρ, as in robust estimation of location; consult Huber [9] for further details. Again, one may incorporate scale estimation as well, as was done in (ii).

(iv)
If the response Y is binary (zero or one) with "pointwise" success probability p(x) = P[Y = 1 | X = x], then a natural choice is

    ρ(y | x, θ) = log(1 + e^{θ(x)}) − y θ(x),

with θ(x) = log[p(x)/(1 − p(x))], which gives a non-parametric logistic regression estimator.
(v) If Y is a count whose conditional mean E[Y | X = x] depends on x, then taking θ(x) = log E[Y | X = x] and ρ(y | x, θ) = e^{θ(x)} − y θ(x), a non-parametric log-linear Poisson regression estimator is obtained.
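The pointwise criterion functions of examples (i) through (v) can be collected in a short sketch. The code below is purely illustrative: the Huber cutoff c = 1.345 and the exact form of the scale-parameterized loss are conventional choices, not taken from this paper.

```python
import math

# Pointwise criterion ("rho") functions from examples (i)-(v); each is a
# negative log-likelihood (up to constants) in the pointwise parameter.

def rho_gaussian(y, theta):
    # Example (i): normal errors, rho = (y - theta)^2 / 2
    return 0.5 * (y - theta) ** 2

def rho_gaussian_scale(y, theta1, theta2):
    # Example (ii): theta2 = log(sigma), avoiding positivity constraints
    return 0.5 * math.exp(-2.0 * theta2) * (y - theta1) ** 2 + theta2

def rho_huber(y, theta, c=1.345):
    # Example (iii): replace -log f by a robust loss (Huber's rho);
    # the cutoff c is an illustrative choice.
    r = y - theta
    return 0.5 * r * r if abs(r) <= c else c * abs(r) - 0.5 * c * c

def rho_logistic(y, theta):
    # Example (iv): binary y, theta = logit of the success probability
    return math.log1p(math.exp(theta)) - y * theta

def rho_poisson(y, theta):
    # Example (v): count y, theta = log of the conditional mean
    return math.exp(theta) - y * theta
```

Each function is convex in theta, and its pointwise minimizer recovers the natural parameter (for instance, rho_poisson is minimized at theta = log y for y > 0).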
In each case the estimator θ_{nλ} is obtained by minimizing a penalized likelihood functional of the form

    l_{nλ}(θ) = ∫ ρ(y | x, θ(x)) P^{(n)}_{XY}(dx dy) + λ J(θ),

where the domain 𝒳 of the covariates (which is assumed to be a bounded subset of ℝ^d) carries a smoothness penalty J, typically an integral of squared partial derivatives of order m taken over 𝒳. Integration over all of ℝ^d in the penalty is possible, but we do not restrict attention to this situation. The smoothing parameter λ > 0 controls the trade-off between fidelity to the data and smoothness in the minimization.
Note that the target parameter θ₀ is determined by ρ and P_{Y|X} as indicated above. If ρ is obtained by taking the negative log of a (pointwise) likelihood, then we are not assuming that P_{Y|X}(· | x) is in the given parametric model. In this case, θ₀(x) is the value of the pointwise parameter which minimizes the Kullback-Leibler "distance" between P_{Y|X}(· | x) and the model.
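A toy numerical check of this remark (not from the paper): take the Poisson criterion of example (v) but let Y be exponentially distributed, so the parametric model is misspecified. The pointwise minimizer of the expected criterion is still log E[Y], the Kullback-Leibler projection onto the model.

```python
import numpy as np

# Y is exponential with mean 2 (not Poisson), yet the minimizer of the
# expected Poisson criterion E[exp(theta) - Y*theta] is theta_0 = log E[Y].
rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=200_000)

thetas = np.linspace(-1.0, 2.0, 2001)
risk = np.exp(thetas) - y.mean() * thetas      # estimate of E[rho(Y, theta)]
theta_hat = thetas[np.argmin(risk)]

print(theta_hat, np.log(y.mean()))             # both close to log(2)
```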
Only D(iv) represents a strong departure from Cramér's assumptions. It is used to deal with some of the problems that arise from the infinite-dimensional parameter. Note that all boundedness requirements on ρ or ψ only involve the pointwise parameter θ(x) restricted to bounded subsets. We will consider the examples introduced in section 1 in light of Assumption D. Examples (i), (ii), (iv), and (v) are exponential families, and so
    ρ(y | x, θ(x)) = −Σ_j T_j(y, x) η_j(x, θ(x)) + a(η(x, θ(x))),   j = 1, ..., r,

with respect to a dominating measure μ. If the η_j are continuously differentiable with derivatives bounded in x, and the T_j's have conditional second moments bounded in x, then Assumption D holds. For example (iv), ρ has this form with r = 1, T(y, x) = y, η(x, θ(x)) = θ(x) for x ∈ 𝒳, and μ counting measure on {0, 1}.
Define Z_{nλ} (the generalized score vector) as follows. The variational equation is defined by

    D l_{nλ}(θ) = Z_{nλ}(θ) = 0,

where D denotes differentiation (see section 3.1). Z_{nλ} is defined by

    Z_{nλ}(θ)φ = ∫ ψ(y | x, θ(x)) φ(x) P^{(n)}_{XY}(dx dy) + λ⟨Wθ, φ⟩   for φ ∈ Θ.

The limiting version of Z_{nλ}(θ) as n → ∞ will also be of interest. This is defined as

    Z_λ(θ)φ = ∫ ψ(y | x, θ(x)) φ(x) P_{XY}(dx dy) + λ⟨Wθ, φ⟩.

The existence of Z_{nλ}(θ) and Z_λ(θ) follows from the discussion in section 3.1. Let ‖·‖_b be the standard norm on W₂^{mb}(𝒳; ℝ^q) for b ∈ [0, 1]. We have the following asymptotic convergence result.
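A finite-dimensional sketch of the generalized score (an illustrative discretization, not the paper's computations): in a binned logistic example, ψ(y | x, θ(x)) = μ(x) − y with μ = (1 + e^{−θ})^{−1}, and at a root of the variational equation, Z_{nλ}(θ̂)φ vanishes for every direction φ. Grid size, λ, and the data-generating model below are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 500, 31
x = rng.uniform(size=n)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-2 * np.cos(3 * x)))).astype(float)
idx = np.minimum((x * k).astype(int), k - 1)     # bin observations to a grid

D2 = np.diff(np.eye(k), n=2, axis=0)
W = D2.T @ D2                                    # discrete penalty operator
lam = 1e-3

def score(theta):
    # Z_{n,lam}(theta) in the discretization: (1/n) sum_i psi_i phi_i + lam*W@theta
    mu = 1 / (1 + np.exp(-theta[idx]))
    g = np.bincount(idx, weights=mu - y, minlength=k) / n
    return g + lam * (W @ theta)

theta = np.zeros(k)
for _ in range(40):                              # Newton iterations to find a root
    mu = 1 / (1 + np.exp(-theta[idx]))
    w = np.bincount(idx, weights=mu * (1 - mu), minlength=k) / n
    H = np.diag(w) + lam * W + 1e-12 * np.eye(k)
    theta -= np.linalg.solve(H, score(theta))

phi = rng.standard_normal(k)                     # arbitrary test direction
print(abs(score(theta) @ phi))                   # ~ 0 at the root
```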
THEOREM 2.1. Suppose there is an α satisfying d/2m < α < (s/m − d/2m)/2, and a sequence λ_n → 0 satisfying

    n^{-1} λ_n^{-(2α + d/m)} → 0.                                        (2.8)

Then with probability approaching one there is a root θ_{nλ_n} of Z_{nλ_n}(θ) = 0 satisfying, for all b ∈ [0, α],

    ‖θ_{nλ_n} − θ₀‖_b → 0 in probability,                                (2.9)

    ‖θ_{nλ_n} − θ₀‖_b² ≤ M [ λ_n^{α−b} + n^{-1} λ_n^{-(b + d/2m)} ],     (2.10)

where M is a constant independent of θ₀, b, λ_n, and n.

PROOF. The results follow from Theorems 3.4 and 3.5 in section 3. □
These bounds characterize the convergence rates. The first term on the right-hand side of the bound in Theorem 2.1 is deterministic and bounds the squared systematic error (bias); the second term on the right-hand side bounds the random error (variance). Balancing the two terms, the choice of λ so obtained is λ_n ≍ n^{-1/(α + d/2m)}, and the convergence rate ‖θ_{nλ_n} − θ₀‖_b² = O_p(n^{-(α−b)/(α + d/2m)}) is obtained.

The norms ‖·‖_b are interpolation norms as defined in section 3.1, for b ∈ [0, 1]. From Assumption C, s is large enough that, by the equivalence of the Θ_b spaces with Sobolev spaces, we have the existence of an a such that θ₀ ∈ Θ_a and

    d/2m < a ≤ (s/m − d/2m)/2.                                           (3.5)

We will not be interested in a outside of the interval given in (3.5).
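The bias-variance balancing that determines the smoothing parameter can be written schematically. The exponents p and q below are generic stand-ins for those appearing in the bounds, so this is a template rather than the theorem's exact statement:

```latex
% Generic bound: squared bias + variance, with p, q > 0 schematic exponents.
\[
  \|\theta_{n\lambda}-\theta_0\|_b^2
  \;\le\; M\bigl(\lambda^{\,p} + n^{-1}\lambda^{-q}\bigr).
\]
% The bound is minimized (up to constants) when the two terms balance:
\[
  \lambda_n^{\,p} \asymp n^{-1}\lambda_n^{-q}
  \;\Longleftrightarrow\;
  \lambda_n \asymp n^{-1/(p+q)},
  \qquad
  \|\theta_{n\lambda_n}-\theta_0\|_b^2 = O_p\bigl(n^{-p/(p+q)}\bigr).
\]
```

Larger p (smoother targets) and smaller q (weaker norms, smaller dimension d) both improve the achievable rate.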
We work in a neighborhood of θ₀ in Θ_a: for some R > 0, N_{θ₀} ⊂ S(R; Θ_a).
Now consider Assumption A.3 of CO. Computing directly the third order derivative of l_n in the direction defined by functions u, v, w we obtain

    D³l_n(θ) u v w = ∫ Σ_{j,k,l} ψ_{jkl}(y | x, θ(x)) u_j(x) v_k(x) w_l(x) P^{(n)}_{XY}(dx dy).     (3.6)

Since evaluation is continuous in Θ_a, the operator D³l_n(θ) is the third Fréchet derivative of l_n in Θ_a at θ. We need to show that the map θ ↦ D³l_n(θ) is continuous for θ ∈ N_{θ₀}. The corresponding derivative of the limiting penalized likelihood is handled in the same way and is therefore also continuous for θ ∈ N_{θ₀}. For part (i) it suffices to show

    |D³l_n(θ) u v w| ≤ M(R) ‖u‖_C ‖v‖_C ‖w‖_{L²}                                                     (3.10)
                    ≤ M(R) ‖u‖_a ‖v‖_a ‖w‖₀,

where the first inequality follows from Assumption D(iii), Assumption B and Cauchy-Schwarz, and the second from Sobolev's imbedding theorem. For part (ii)
    |[D²l_n(θ) − D²l(θ)] u v|² = { Σ_{j,l} ∫ ψ_{jl}(y | x, θ(x)) u_j(x) v_l(x) [P^{(n)}_{XY} − P_{XY}](dx, dy) }².
We need estimates which are uniform in θ ∈ N_{θ₀}. This requirement is especially difficult, and it is here that we use Assumption D(iv). Let (3.11) define the quantities used below; (3.11), along with the defining relations for the φ_{*v}'s, is used for the fifth equality. Now the claim follows from the argument of Theorem 2.3 of Cox [4]; see also Lemma 5.4 of Cox [3]. Part (ii) follows by Markov's inequality. For θ* = θ_λ,
    ‖θ_{nλ} − θ_λ‖_b² ≤ M Σ_v (1 + γ_{*v}^b)(1 + λγ_{*v})^{-2} { ... }²,        (3.25)

where the terms in braces have mean 0 and variance bounded uniformly in v; the inequality uses the boundedness requirements of Assumption D. Collecting terms gives

    ‖θ_{nλ} − θ_λ‖_b ≤ O_p( n^{-1/2} λ^{-(b+σ+d/2m)/2} + n^{-1/2} λ^{-(b+σ+d/m)/2} + n^{-1} λ^{-σ} ).

Thus if λ_n is a sequence such that (3.18) holds then r_n(λ_n, σ) → 0. From here, Theorem 3.2 of CO gives the desired result. □
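The λ-exponents in bounds of this kind come from spectral sums: under the standard growth assumption γ_{*v} ≍ v^{2m/d}, sums of the form Σ_v (1 + γ_{*v}^b)(1 + λγ_{*v})^{-2} scale like λ^{-(b + d/2m)} as λ → 0. The following toy computation checks this numerically; m, d, and b are illustrative choices, not tied to any particular example in the paper.

```python
import numpy as np

# Eigenvalues growing like v^(2m/d); the sum should scale like
# lam^{-(b + d/2m)} as lam -> 0.
m, d, b = 2, 1, 0.5
v = np.arange(1, 50_001, dtype=float)
gamma = v ** (2 * m / d)

def spectral_sum(lam):
    return np.sum((1 + gamma ** b) / (1 + lam * gamma) ** 2)

# Estimate the scaling exponent from two small values of lam.
lam1, lam2 = 1e-4, 1e-5
slope = np.log(spectral_sum(lam2) / spectral_sum(lam1)) / np.log(lam1 / lam2)
print(slope, b + d / (2 * m))                    # both approximately 0.75
```

The variance bound in results like Theorem 2.1 inherits exactly this exponent, which is why the condition on λ_n involves b + d/2m.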
4. Appendix

4.1. Results for Spectral Norms

As indicated in section 3 we can consider the spectral decomposition of W relative to U(θ*) for θ* ∈ N_{θ₀}. This gives sequences {φ_{*v} : v = 1, 2, ...} of eigenfunctions and {γ_{*v}} of eigenvalues satisfying (by analogy with (3.2))

    ⟨φ_{*v}, U(θ*) φ_{*μ}⟩ = δ_{vμ},    ⟨φ_{*v}, W φ_{*μ}⟩ = γ_{*v} δ_{vμ}      (A.1)

for all pairs v, μ of positive integers, where δ_{vμ} is Kronecker's delta. Let
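A finite-dimensional sketch of this simultaneous diagonalization: for positive definite U and symmetric W, the generalized eigenproblem W φ = γ U φ produces directions that are U-orthonormal and W-orthogonal, mirroring (A.1). The matrices below are random illustrative stand-ins for U(θ*) and the penalty operator W.

```python
import numpy as np

rng = np.random.default_rng(3)
k = 8
A = rng.standard_normal((k, k))
U = A @ A.T + k * np.eye(k)                  # positive definite "U(theta*)"
D2 = np.diff(np.eye(k), n=2, axis=0)
W = D2.T @ D2                                # positive semidefinite "W"

# Whiten with respect to U, then diagonalize: W Phi = U Phi diag(gamma).
Linv = np.linalg.inv(np.linalg.cholesky(U))
gamma, Q = np.linalg.eigh(Linv @ W @ Linv.T)
Phi = Linv.T @ Q                             # generalized eigenvectors

print(np.allclose(Phi.T @ U @ Phi, np.eye(k)),       # <phi_v, U phi_mu> = delta
      np.allclose(Phi.T @ W @ Phi, np.diag(gamma)))  # <phi_v, W phi_mu> = gamma*delta
```

The Cholesky whitening step is the finite-dimensional analogue of passing to the inner product induced by U(θ*).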
    ‖θ‖_{*b}² = Σ_v (1 + γ_{*v}^b) ⟨θ, U(θ*) φ_{*v}⟩²,                      (A.2)

and let Θ_{*b} denote the Hilbert space obtained by completing {θ : ‖θ‖_{*b} < ∞} in this norm.
References

1. Adams, R. A., Sobolev Spaces, Academic Press, New York, 1975.
2. Chen, Z., "Interaction spline models and their convergence rates," Ann. Statist. (submitted), Statistics Dept., University of Wisconsin-Madison, 1989.
3. Cox, D. D., "Multivariate smoothing spline functions," SIAM J. Numer. Anal., vol. 21, pp. 789-813, 1984.
4. Cox, D. D., "Approximation of the method of regularization estimators," Ann. Statist., vol. 16, pp. 694-712, 1988.
5. Cox, D. D. and O'Sullivan, F., "Asymptotic analysis of penalized likelihood estimators," Ann. Statist., 1989 (tentatively accepted).
6. Cramér, H., Mathematical Methods of Statistics, Princeton University Press, Princeton, N.J., 1946.
7. Craven, P. and Wahba, G., "Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation," Numer. Math., vol. 31, pp. 377-403, 1979.
8. Good, I. J. and Gaskins, R. A., "Non-parametric roughness penalties for probability densities," Biometrika, vol. 58, pp. 255-277, 1971.
9. Huber, P. J., Robust Statistics, John Wiley & Sons, New York, 1981.
10. ..., T., ...
11. ..., G., "Commutator estimates ..."