GENERALIZED NONPARAMETRIC REGRESSION VIA PENALIZED LIKELIHOOD
by
Dennis D. Cox Finbarr O'Sullivan
TECHNICAL REPORT No. 170 August 1989
Department of Statistics, GN-22 University of Washington Seattle, Washington 98195 USA
Generalized Nonparametric Regression via Penalized Likelihood DennisD. Coxl Department of Statistics University Urbana. 1161801. Finbarr 0' Sullivan2
Department of Statistics University of Washington Seattle. WA 98195.
ABSTRACT
We consider the asymptotic analysis of penalized likelihood type estimators for generalized non-parametric regression problems in which the target parameter is a vector valued function defined in terms of the conditional distribution of a response given a set of covariates, A variety of examples including ones related to generalized linear models and robust smoothing are covered by the theory. Upper bounds on rates of convergence for penalized likelihood-type estimators are obtained by approximating estimators in terms of one-step Taylor series expansions.
AMS 1980 subject classifications. Primary, 62-G05, Secondary, 62J05, 41-A35, 41-A25, 47-A53, 45-LlO, 45-M05. Key words and phrases. Maximum Penalized Likelihood, Nonparametric Regression, Smoothing· Splines, Rates of Convergence.
Research partially supported
Foundationunder Gram No. MCS·86..Q3083. Foundation Grant No. MCS·840·3239 and by the Departrnent of
In
parameter spaces. These approximations give rates error of the estimator and its derivatives, but also provide insigat
estnnanon error which
can be approximately decomposed into the sum of a bias (deterministic) term and a variance (random) term. We anticipate the approximations will also prove useful for analysis of methods for choosing the smoothing parameter. We begin by describing the estimation methodology and giving some examples.
1.2. Penalized Likelihood for Regression Function Estimation Suppose one observes a sample of n i.i.d. pairs (X 1 , Y 1) , (X2, Y z)•...• (X It , YIt ) from the joint distribution P x:y
-
the Xi'S are thought of as covariates and the Yi 's as responses. The condi-
tional distribution of Y given X =x is denoted Law(Y IX =x) = P y Ix
(.
Ix) .
Let P1;) denote the joint empirical measure of the (Xi, Yi ), i.e.
where IA denotes the indicator function of the set A , and Y ( respectively
YiC respectively ,8
denotes the range
whenever one
"ti",h...o
Y
is non-constant
(i)
depends sInolDthly on x as
::-:'en natural
choices would be e(x) = (9 1(x ),e2(X» = p(ylx,e)
(11
[0'
= -e2(X)
Here we have chosen the parameterization e2(X) = log [O'(x)] to avoid awkward ('\'~icivity constraints. (iii) If the errors in (i) are no longer assumed normal but to have density
f . then one
would
naturally use p(y lx, e) = -log [f (y -e(x»] .
One may wish to replace -logf by a more general function p as in robust estimation of location; consult Huber(9] for further details. Again one may incorporate scale estimation as well, as was done in (ii). (iv)
If the response Y is binary (zero or one) with "pointwise" success probability
p (x)
= P [Y =11 X
then a natural choice is
p(y Ix,e) = log(1+e
a non-parametric logistic regression esnmator,
mean
lhat the
a nonparametric 10Q"-11!'lear Pni~:"nn reg]'esSlOfl estimator
is obtained
=
(wench is assumed to be t a bounded same. we
mllflthl~ri~tl'
or over all of lR as
X, integration over all of lRd in
penalty is
POSSIble, but
to this situation.
not A
the minimi,1:ati 0
Note that the target parameter 80 is determined by p and Py Ix as indicated above. If p is obtained by taking the negative log of a (point-wise)likelihood, then we are not assuming that Py Ix(' Ix) is in the given parametric model. In this case,So(x) is the value of the point-wise
parameter which minimizes Kullbak-Leibler "distance" between
Pyl x ( '
Ix) and the model.
Only D(iv) represents a strong departure from Cramer's assumptions. It is used to deal with some of the problems that arise from the infinite dimensional parameter. Note that all boundedness requirementsonp or 'If only involve the point-wise parameter t
bounded subsets
. We
will consider the examples introduced in section 1 in light of Assumption D. Examples (i), (ii), (iv), and (v) are exponential families, and so
p(y IX,8(x)) =
As
3
as
's have corldltional
(y ,x)l1j(x, Sex)) + a.(l1(x, Sex)~) .
C0I1rtinl110UISly differentiable with derivatives bounded in x second moments bounced in x, then Assumpdon D
that
1, note
= 1, ... .r ],
11 counting mea:sure.
E
X, and
geI1teraliZf:d score vector). The variational equation is defined by DlnJ",(8)
=2
111..,
where D
denotes differentiation (see section 3.1). 2 111.. is eenned
A.<W8,q» . for $ E 8. The limiting version of 2 111..(8) as n
-t
00
will also be of interest. This is defined as
= J'V(y I z , 8(x »$(x) P.xy (dxdy) + A.<W8,$> .
The existence of 2 111..(8) and 21..(8) follows from the discussion in section 3.1. Let 1I·ll b be the standard norm on W;F (X.;JRq) for b
E
[0,1]. We have the following asymptotic convergence
result
THEOREM 2.1. Suppose there is an cr. satisfying d/2m < cr. < (s/m -d/2m)l2, and a sequence A.n
-t
0 satisfying n- 1 A.;(2CJ.+dlm)
-t
0.
Then with probability approaching one there is a root 8111.." of 2 111..,,(8) = b
E
(2.8)
(2.9)
°
satisfying for all
[0, cr.]
(2.10)
where M is a constant independent 0/80, b , A.11 •and n . PROOF. The results follow from Theorems 3.4 and 3.5 in section 3. 0
convergence rates. an
the
on
of
on
second term on
of A. so obtained is
rate is obtained ::::
conver-
II
interpolanon as defined
Using
have P{111iV~llpl't norms for b
E
=
sets
[0,1]. From Assumption C above s1m >
so by the
equivalence of the eb spaces with Sobolev spaces, we have the existence of a such that
eo
ea.
E
and
dl2m < a;;" (stm -dI2m)l2.
(3.5)
We will not be interested in a outside of the interval given in (3.5).
eo in ea' For some R > 0, N 9
0
C
S (R ;ecJ.
Now consider Assumption A.3 of CO. Computing directly the third order derivative of In in the direction defined by functions u .v ,w we obtain
(3.6) Since evaluation is continuous in ea., the operator D 3In (fJ) is the third In in e a at e. We need to show that
the map
is coeanuous
j
,1.
XE
integral is a
=
+
is therefore a
limiting penaltzed lilc:elil:lood is
derivative is continuous for e E N 90 ,
suffices to show
to tYDilcalLy
,8(x)) ~M(R)lluICqllllvICqllllwIL211
s
(3.10)
M(R)lIuli a IIvli a Ilwllo
where the first inequality follows from Assumption D(iii) , Assumption B and Cauchy-Schwarz, and the second from Sebolev's imbedding theorem. For part (ii)
I
[D 21. (8)- D'[ (8)]Uv!'
={ 1~ fa,
'lfj'(Y l r , 8(x»Uj (x) v,(x )[P$)-P..,](dx
,dy)}' .
We need estimates which are uniform in 8 E N eo. This requirement is especially difficult, and it is here that we use Assumption D(iv). Let (3.11)
If d/2m ¢lav along with the defining relations for the ¢l* v's v
for the fifth equality. Now the claim follows from the argument of Theorem 2.3 of Cox[4}. See also Lemma 5.4 of Cox[3}. Part (ii) follows by Markov's inequality. For e* = 6A. II
en A. -e".lIt s M l: (1 +yiv)(1 +/"y*v)-2 v
(3.25)
braces are mean 0 and variance
inequaliry uses ness
=
1
r.o... b)
1
s o, . n- ~ 1,;-(' +o+dl2.on);' + n-7 A,;" (Ho+dlm);' + n- 1 A.-I'
=o, fl n -7 A.-("+dl2m) A(0-b)12 [ Al o +dl2m -0)12 + 1 + n
A
-OJ]
r Thus if An is a sequence such that (3.18) holds then rn (An, a) -t O. From here, Theorem 3.2 of
CO gives the desired result.
4. Appendix
4.1. Results for Spectral Norms As indicated in section 3 we can consider the spectral decomposition of W relative to
U (8*) for 8*
E
N eo. This gives sequences {$* v: v = 1,2, ... } of eigenfunctions and {Y* v} of
eigenvalues satisfying (by analogy with (3.2)) < $*v , U (8*)$* J.L > = OvJ.L = Y*vOvJ.L
(A. I)
for all pairs v, 11 of positive integers, where OvJ.L is Kronecker's delta. Let
(A.2)
U9U'b = {
and let 8*b eenete U'
complettng {8
norm,
1,
~
is posstble.
,
1.
..
- 30-
AOaInS,
R.,
;:;Ol')oU.~V "1-I U I.. C;..> ,
Acaaemic
New York, 1975.
convergence rates," Ann. Statist.
Chen, Z., "Interaction
ted), Statistics Dept., University of Wisconsin-Madison, 1989. 3.
Cox, D. D., "Multivariate smoothing spline functions," SIAM J. Numer. Anal., vol, 21, pp. 789-813, 1984.
4.
Cox, D. D., "Approximation of the method of regularization estimators," Ann. Statist., vol, 16,pp. 694-712,1988.
5.
Cox, D. D. and O'Sullivan, F., "Asymptotic analysis of penalized likelihood estimators,"
Ann. Statist., 1989 (tentatively accepted). 6.
Cramer, H., Mathematical Methods of Statistics, Princeton University Press, Princeton, N.J., 1946.
7.
Craven, P. and Wahba, G., "Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation," Numer. Math.,
vol, 31, pp. 377-403, 1979. 8.
Good, 1. J. and Gaskins, R. A., "Non-parametric roughness penalties for probability densities," Biometrika, vol. 58,
9.
J.J.UI.)l;;;J.,
P. 1., Robust Statistics,
T.
11.
255-277, 1971.
& Sons, New York,
G., "Cemmnrator estimates
1.