Computational Statistics & Data Analysis 42 (2003) 545–560
www.elsevier.com/locate/csda

Computational aspects of nonparametric smoothing with illustrations from the sm library

A.W. Bowman (a,*), A. Azzalini (b)

(a) Department of Statistics, University of Glasgow, G12 8QQ Glasgow, UK
(b) Department of Statistical Sciences, University of Padua, Padua, Italy

* Corresponding author. Tel.: +44-141-330-4046; fax: +44-141-330-4814. E-mail address: [email protected] (A.W. Bowman).

Received 1 January 2001; received in revised form 1 March 2002
Abstract

Smoothing techniques such as density estimation and nonparametric regression are widely used in applied work, and the basic estimation procedures can be implemented relatively easily in standard statistical computing environments. However, computationally efficient procedures quickly become necessary with large datasets, many evaluation points or more than one covariate. Further computational issues arise in the use of smoothing techniques for inferential, rather than simply descriptive, purposes. These issues are addressed in two ways: (i) by deriving efficient matrix formulations of nonparametric smoothing methods and (ii) by describing further simple modifications to these for the use of 'binned' data when sample sizes are large. The implications for other graphical and inferential aspects of the estimators are also discussed. These issues are dealt with in an algorithmic manner, to allow implementation in any programming environment, but particularly those which are geared towards vector and matrix representations of data. Specific examples of S-Plus code from the sm library of Bowman and Azzalini (Applied Smoothing Techniques for Data Analysis: the Kernel Approach with S-Plus Illustrations, Oxford University Press, Oxford, 1997) are given in an appendix as illustrations. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Binning; Density estimation; Graphics; Kernel methods; Large datasets; Nonparametric regression; Quadratic forms; sm library; Smoothing techniques; S-Plus
1. Introduction

Smoothing techniques such as density estimation and nonparametric regression have become established tools in applied statistics. There is now a wide variety of texts
which describe these methods and a huge literature of research papers. Recent texts include Green and Silverman (1994), Wand and Jones (1995), Fan and Gijbels (1996), Simonoff (1996), Bowman and Azzalini (1997) and Schimek (2000a). A broader framework for the case of regression, known as generalized additive models, is also described by Hastie and Tibshirani (1990). Some statistical computing environments such as S-Plus, Genstat and XLispStat include some general procedures for smoothing techniques, and Simonoff (1996) and Schimek (2000b) give extensive references to more specialist software. However, the simplest forms of nonparametric smoothing techniques can be implemented relatively easily through elementary programming techniques.

Modern statistical computing environments are generally geared towards vector and matrix representations of data. It is therefore a principal aim of this paper to provide simple matrix formulations of smoothing techniques which allow efficient implementation in this type of environment. A second aim of the paper is to address the computational issues which arise when nonparametric methods are applied to large datasets. This issue is also dealt with in the context of vector and matrix representations of the estimators.

Efficient matrix formulations of smoothing techniques are discussed in Section 2 of the paper, for both density estimation and regression. The issue of binning and its implementation within a matrix framework is described in Section 3. Computational issues also arise when techniques other than straightforward construction of the estimators for descriptive purposes are required. Bowman and Azzalini (1997, Chapters 4 and 5) describe a variety of testing procedures based on quadratic form calculations, and the implementation of these for binned data is described in Section 4. Some graphical issues and other topics are also discussed in that section. Further discussion is given in Section 5.

The computational methods discussed in the paper have been implemented in the sm library written to accompany the book by Bowman and Azzalini (1997). This library contains an extensive collection of functions for different data structures and statistical problems, and version 2 now includes the binning techniques described in the paper. Examples of S-Plus code from the core instructions of the two principal functions, sm.density and sm.regression, are given in an appendix to illustrate the ideas described in the paper. The principal emphasis is on the algebraic expressions provided in the main body of the paper, to allow readers to implement the ideas in other programming environments. In addition, they also provide specific guidance to those who wish to amend the existing sm library functions or to implement the techniques themselves in S-Plus.
2. Matrix formulations of smoothing techniques

2.1. Density estimation

The formula for computing a density estimate at the evaluation point $z$ from a sample $y = (y_1, \ldots, y_n)^T$ of univariate data is

$$\hat{f}(z) = \frac{1}{n} \sum_{j=1}^{n} w(z - y_j; h),$$
where the kernel function $w(x; h)$ is itself a density function which is symmetric with mean 0. The degree of smoothing applied to the data is controlled by the scale parameter $h$, which is conveniently parameterised as the standard deviation of the kernel function. The detailed shape of the kernel function is relatively unimportant and, for convenience, a normal density function will subsequently be used in the illustrative code in the appendix. The calculation of the estimate may be expressed in vector form as

$$\hat{f}(z) = \frac{1}{n}\, w^T 1_n,$$

where $w$ denotes the vector with elements $w(z - y_j; h)$ and $1_n$ denotes a vector of length $n$ which consists entirely of 1's. An estimate usually needs to be computed over a vector of evaluation points $(z_1, \ldots, z_k)^T$. The entire calculation can then be expressed in matrix form as

$$\hat{f} = \frac{1}{n}\, W 1_n, \tag{1}$$
where $\hat{f}$ denotes the vector of estimated values and $W$ denotes the matrix whose $(i,j)$th element is $w(z_i - y_j; h)$.

There are some occasions when different weights $p = (p_1, \ldots, p_n)^T$ are associated with the observations; in the sequel it is assumed that the weights are scaled to sum to 1. The estimate is then of the form

$$\hat{f}(z_i) = \sum_{j=1}^{n} p_j\, w(z_i - y_j; h)$$

and the matrix formulation becomes $\hat{f} = Wp$. Further generality can be obtained by the use of a different smoothing parameter $h_j$ in the kernel function around each observation. This is easily accommodated by defining the elements of $W$ as $w(z_i - y_j; h_j)$.

When the data are two-dimensional, $\{(y_{1j}, y_{2j});\ j = 1, \ldots, n\}$, the simplest form of the estimate at an evaluation point $(z_1, z_2)$ is

$$\hat{f}(z_1, z_2) = \frac{1}{n} \sum_{j=1}^{n} w(z_1 - y_{1j}; h_1)\, w(z_2 - y_{2j}; h_2).$$

This corresponds to the use of a kernel function which is a product of univariate kernels. In matrix form, the vector of estimates at the evaluation points $\{(z_{1i}, z_{2i});\ i = 1, \ldots, k\}$ can still be expressed in the form (1), where the $(i,j)$th element of $W$ is now $w(z_{1i} - y_{1j}; h_1)\, w(z_{2i} - y_{2j}; h_2)$. This can also be written as

$$\hat{f} = \frac{1}{n}\, (W_1 * W_2) 1_n,$$
where the $(i,j)$th elements of $W_1$ and $W_2$ are $w(z_{1i} - y_{1j}; h_1)$ and $w(z_{2i} - y_{2j}; h_2)$ respectively, and $*$ denotes the Hadamard product, namely elementwise multiplication. Where weights apply to the observations, this becomes $\hat{f} = (W_1 * W_2)p$.

With two-dimensional data, the evaluation points will often lie on a rectangular grid, denoted by $\{(z_{1r}, z_{2s});\ r, s = 1, \ldots, k\}$. For simplicity we will assume that the number of grid points in each margin is identical; different numbers can be accommodated easily, at the expense of a slightly more detailed notation. The form of the estimate at each grid point,

$$\hat{f}(z_{1r}, z_{2s}) = \frac{1}{n} \sum_{j=1}^{n} w(z_{1r} - y_{1j}; h_1)\, w(z_{2s} - y_{2j}; h_2),$$

allows the full set of estimates to be neatly computed in matrix form as

$$\hat{f} = \frac{1}{n}\, W_1 W_2^T,$$

where the $(i,j)$th elements of $W_1$ and $W_2$ are now defined as $w(z_{1i} - y_{1j}; h_1)$ and $w(z_{2i} - y_{2j}; h_2)$, respectively. This significantly reduces the dimensionality of the constituent matrices. Where weights apply, these can be incorporated as

$$\hat{f} = W_1 P W_2^T,$$

where $P$ is a diagonal matrix with $p$ as its diagonal vector.
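As an illustration of the grid formula, the following sketch in S (compatible with R) computes a two-dimensional density estimate over a $k \times k$ grid; the data vectors y1, y2, the grid vectors z1, z2 and the smoothing parameters h1, h2 are assumed to be supplied.

    # two-dimensional density estimate over a rectangular grid
    # (a minimal sketch; y1, y2, z1, z2, h1, h2 are assumed inputs)
    n    <- length(y1)
    W1   <- dnorm(outer(z1, y1, "-"), 0, h1)   # (i,j) element is w(z1_i - y1_j; h1)
    W2   <- dnorm(outer(z2, y2, "-"), 0, h2)   # (i,j) element is w(z2_i - y2_j; h2)
    fhat <- (W1 %*% t(W2)) / n                 # k x k matrix of estimates
    # with observation weights p, use W1 %*% diag(p) %*% t(W2) instead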
2.2. Nonparametric regression

Smoothing techniques allow observed data $\{(x_j, y_j);\ j = 1, \ldots, n\}$ to be used in investigating a regression relationship through the model $y = m(x) + \varepsilon$, where $m$ is assumed only to be smooth. The concept of smooth local averaging leads immediately to a 'local mean' approach to the construction of an estimate at an evaluation point $z$, as

$$\hat{m}(z) = \frac{\sum_{j=1}^{n} w(x_j - z; h)\, y_j}{\sum_{j=1}^{n} w(x_j - z; h)},$$

as proposed by Nadaraya (1964) and Watson (1964). The concept of 'local linear' estimation was proposed some time ago by Cleveland (1979) and others, and its attractive properties were clearly argued by Fan and Gijbels (1992) and Fan (1993). At a point of interest $z$, the estimate is computed as the intercept term $\hat{\alpha}$ in the least-squares problem

$$\min_{\alpha, \beta} \sum_{j=1}^{n} \{y_j - \alpha - \beta(x_j - z)\}^2\, w(x_j - z; h).$$
Some simple algebra shows that

$$\hat{m}(z) = \frac{1}{n} \sum_{j=1}^{n} \frac{\{s_2(z; h) - s_1(z; h)(x_j - z)\}\, w(x_j - z; h)\, y_j}{s_2(z; h)\, s_0(z; h) - s_1(z; h)^2},$$

where $s_r(z; h) = \{\sum_j (x_j - z)^r\, w(x_j - z; h)\}/n$. In matrix notation, for estimation over a vector of evaluation points $(z_1, \ldots, z_k)^T$, the starting point for both approaches is the matrix of basic kernel weights $W$ whose $(i,j)$th element is $w(x_j - z_i; h)$. The 'smoothing matrix' $S$ is defined by

$$S = W / (W 1_n 1_n^T), \tag{2}$$

where $/$ denotes elementwise division of two matrices. The local mean estimator can then be written in the simple form $\hat{m} = Sy$, where $\hat{m}$ denotes the vector of fitted values at the evaluation points.

The matrix representation of the local linear case is a little more complex. The first step is to define the matrix of differences $D$ whose $(i,j)$th element is $(x_j - z_i)$. The $k \times n$ matrices $S_0$, $S_1$ and $S_2$ can then be defined as $S_r = (W * D^r) 1_n 1_n^T$, where $D^r$ denotes elementwise exponentiation. The vector of fitted values is again given by $\hat{m} = Sy$, where now

$$S = (S_2 - D * S_1) * W / (S_2 * S_0 - S_1 * S_1). \tag{3}$$
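The smoothing matrix in (3) translates directly into vector–matrix code. The following sketch constructs the local linear fitted values for a single covariate; the data vectors x, y, the evaluation points z and the smoothing parameter h are assumed inputs.

    # local linear smoothing matrix, following (2) and (3)
    n    <- length(x)
    k    <- length(z)
    D    <- t(outer(x, z, "-"))                 # k x n, (i,j) element (x_j - z_i)
    W    <- dnorm(D, 0, h)                      # kernel weights w(x_j - z_i; h)
    ones <- matrix(1, nrow = n, ncol = 1)
    S0   <- (W %*% ones) %*% t(ones)            # k x n matrices of replicated row sums
    S1   <- ((W * D) %*% ones) %*% t(ones)
    S2   <- ((W * D^2) %*% ones) %*% t(ones)
    S    <- (S2 - D * S1) * W / (S2 * S0 - S1 * S1)
    mhat <- as.vector(S %*% y)                  # fitted values at the evaluation points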
When weights apply to the observations, these can be viewed as attached to the kernel functions $w(x_j - z_i; h)$. For example, in the local linear case the least-squares problem becomes

$$\min_{\alpha, \beta} \sum_{j=1}^{n} \{y_j - \alpha - \beta(x_j - z)\}^2\, w(x_j - z; h)\, p_j.$$

The definitions of the smoothing matrices given in (2) and (3) therefore remain valid, with $W$ replaced by $WP$.

When there are two covariates, $x_1$ and $x_2$, the local mean estimator is defined by

$$\hat{m}(z_1, z_2) = \frac{\sum_{j=1}^{n} w(x_{1j} - z_1; h_1)\, w(x_{2j} - z_2; h_2)\, y_j}{\sum_{j=1}^{n} w(x_{1j} - z_1; h_1)\, w(x_{2j} - z_2; h_2)}$$

and the local linear estimator as the value of $\hat{\alpha}$ from the least-squares problem

$$\min_{\alpha, \beta, \gamma} \sum_{j=1}^{n} \{y_j - \alpha - \beta(x_{1j} - z_1) - \gamma(x_{2j} - z_2)\}^2\, w(x_{1j} - z_1; h_1)\, w(x_{2j} - z_2; h_2). \tag{4}$$

The basic kernel weight matrix can be written as $W = W_1 * W_2$, where the $(i,j)$th elements of $W_1$ and $W_2$ are $w(x_{1j} - z_{1i}; h_1)$ and $w(x_{2j} - z_{2i}; h_2)$ respectively. The smoothing matrix for the local mean estimator is then given again by (2).
Furthermore, when the evaluation points lie on a regular grid defined by $\{(z_{1r}, z_{2s});\ r, s = 1, \ldots, k\}$, a matrix of fitted values can be computed as

$$\hat{m} = W_1 Y W_2^T / (W_1 W_2^T),$$

where $Y$ is a diagonal matrix with diagonal vector $y$ and the $(i,j)$th elements of $W_1$ and $W_2$ are $w(x_{1j} - z_{1i}; h_1)$ and $w(x_{2j} - z_{2i}; h_2)$, respectively. In the weighted case the corresponding expression is simply $W_1 Y P W_2^T / (W_1 P W_2^T)$.

In the local linear case, for estimation at the point $(z_1, z_2)$, let $X_z$ denote the $n \times 3$ matrix whose $i$th row is $(1, (x_{1i} - z_1), (x_{2i} - z_2))$, following the notation of Ruppert and Wand (1994). If $W_z$ denotes the diagonal matrix whose $(i,i)$th element is $w_i = w(x_{1i} - z_1; h_1)\, w(x_{2i} - z_2; h_2)$, then the solution of the weighted least-squares problem (4) is

$$(X_z^T W_z X_z)^{-1} X_z^T W_z y.$$

The local linear estimate is defined by the first element of this vector of length 3. The elements of the $3 \times 3$ matrix $A = (a_{ij}) = X_z^T W_z X_z$ are all of the form $\sum_i w_i (x_{1i} - z_1)^r (x_{2i} - z_2)^s$, where $r + s \le 2$. To obtain the first element of the least-squares solution we need only the first row of $(X_z^T W_z X_z)^{-1}$, denoted by $(b_1, b_2, b_3)$. By applying standard linear algebra results, reported for instance by Healy (1986, Section 3.4), these can be written as

$$b_1 = 1 \Big/ \left[ a_{11} - \frac{1}{d}\,\{(a_{12} a_{33} - a_{13} a_{23})\, a_{12} + (a_{13} a_{22} - a_{12} a_{23})\, a_{13}\} \right],$$
$$b_2 = \frac{b_1}{d}\,(a_{13} a_{23} - a_{12} a_{33}), \qquad b_3 = \frac{b_1}{d}\,(a_{12} a_{23} - a_{13} a_{22}), \tag{5}$$

where $d = a_{22} a_{33} - a_{23}^2$. When the vector $(b_1, b_2, b_3)$ is post-multiplied by $X_z^T W_z$ the result is a vector of length $n$ whose $i$th element is $w_i\{b_1 + b_2(x_{1i} - z_1) + b_3(x_{2i} - z_2)\}$. The inner product of this vector with $y$ produces the local linear estimate at $(z_1, z_2)$. This calculation can be vectorised easily.

A very important case is when the evaluation points lie on a rectangular grid. In order to describe an efficient computational method, it is helpful first to observe that $X_z^T W_z y$ can be written as

$$(c_1, c_2, c_3)^T = \left( \textstyle\sum_i y_i w_i,\ \sum_i y_i w_i (x_{1i} - z_1),\ \sum_i y_i w_i (x_{2i} - z_2) \right)^T$$

and that the local linear estimate can then be represented as $b_1 c_1 + b_2 c_2 + b_3 c_3$. For the rectangular grid $\{(z_{1r}, z_{2s});\ r, s = 1, \ldots, k\}$, the difference matrices $D_1, D_2$ can be defined to have $(i,j)$th elements $(x_{1j} - z_{1i})$ and $(x_{2j} - z_{2i})$, respectively. In this case, the $a_{ij}$ quantities can be computed in matrix form as

$$a_{11} = W_1 W_2^T, \quad a_{12} = (W_1 * D_1) W_2^T, \quad a_{13} = W_1 (W_2 * D_2)^T,$$
$$a_{22} = (W_1 * D_1^2) W_2^T, \quad a_{23} = (W_1 * D_1)(W_2 * D_2)^T, \quad a_{33} = W_1 (W_2 * D_2^2)^T,$$

where the basic kernel weight matrices $W_1, W_2$ are as defined just after (4). Matrices $d, b_1, b_2, b_3$ can be computed using elementwise operations in the expressions (5), where the 1 at the start of the expression for $b_1$ is replaced by $1_k 1_k^T$. Finally, the matrix of fitted values is computed as

$$c_1 = W_1 (W_2 * Y)^T, \quad c_2 = (W_1 * D_1)(W_2 * Y)^T, \quad c_3 = W_1 (W_2 * Y * D_2)^T,$$
$$\hat{m} = b_1 * c_1 + b_2 * c_2 + b_3 * c_3,$$

where, with a slight abuse of notation, $Y$ now denotes the $k \times n$ matrix $1_k y^T$, each of whose rows contains the response vector.

In the case of two covariates, care has to be taken when plotting the estimate to ensure that the smooth fitted surface is not plotted at points which extrapolate far beyond the covariate values. In the one covariate case the plotting region can easily be defined by the range of the data. In the two covariate case a simple analogue is to plot within the convex hull of the covariate values. The computational method described above produces a matrix of estimates over a rectangular grid. It remains to mask this grid to match the convex hull of the covariate values.

Denote the points on the regular grid by $(g_{1i}, g_{2i})$ and the vertices of the convex hull of the covariate values by $(e_{1j}, e_{2j})$, where the final vertex in the list coincides with the first. Let $a$ denote the vector difference $e_{j+1} - e_j$ between two consecutive vertices of the convex hull and $b$ denote the vector difference $g_i - e_j$, for particular values of $i$ and $j$. It is assumed that the vertices of the convex hull are arranged in clockwise order. A grid point lies within the convex hull if the angle between $a$ and $b$ is less than 180° for all values of $j$ defining the convex hull edges. The condition for this to happen is $a_1 b_2 \le a_2 b_1$. In the original co-ordinates this can be expressed as $g_{1i} a_{2j} - g_{2i} a_{1j} \ge e_{1j} a_{2j} - e_{2j} a_{1j}$. In matrix notation, this can be written as

$$G L F \ge J (E * LF), \qquad L = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix},$$

where the rows of the matrix $G$ contain the co-ordinates of the grid points, the columns of $E$ contain the vertices of the convex hull, the columns of $F$ contain the differences between the vertices of the convex hull and $J$ is a two-column matrix of 1's with number of rows equal to the number of grid points. The inequality sign must be read in an elementwise sense. Grid points which lie inside the convex hull are identified by rows of the matrices on the left- and right-hand sides where the inequality holds for all elements.
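The masking inequality can be implemented compactly. In the sketch below, x1, x2 are the covariate vectors, z1, z2 the grid vectors and mhat the $k \times k$ matrix of fitted values, all assumed available from the computations above; the built-in function chull is used to find the hull vertices.

    # mask grid points outside the convex hull of the covariate values
    hull <- chull(x1, x2)                  # hull vertex indices (documented as
                                           # clockwise in R; reverse with rev() if not)
    E    <- rbind(x1[hull], x2[hull])      # 2 x v matrix: vertices as columns
    Fd   <- cbind(E[, -1], E[, 1]) - E     # differences between consecutive vertices
    L    <- matrix(c(0, -1, 1, 0), 2, 2)   # the 2 x 2 matrix L above (column-major fill)
    G    <- as.matrix(expand.grid(z1, z2)) # grid points as rows, ordered to match mhat
    LF   <- L %*% Fd
    lhs  <- G %*% LF                            # left-hand side  G L F
    rhs  <- rep(1, nrow(G)) %o% colSums(E * LF) # right-hand side J (E * LF)
    inside <- apply(lhs >= rhs, 1, all)    # inequality must hold for every hull edge
    mhat[!inside] <- NA                    # suppress the surface outside the hull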
3. Binning for large datasets

The matrix formulae described in Section 2 provide compact computational expressions for nonparametric estimators. However, the size of many of the matrices is dependent on the sample size $n$ and this becomes inefficient as $n$ increases. A number of highly efficient smoothing algorithms are available, as discussed by Fan and Marron (1994), for example. In particular, the concept of binning, where the raw data are reduced to frequencies over a fine grid, is a practical solution. Härdle and Scott (1992) and Wand and Jones (1995, Appendix D) provide good introductions to different forms of this technique. Binning can be implemented with linear interpolation and other refinements, and fast Fourier transform ideas can also be used to great effect, as described by Silverman (1982) and Härdle (1987). However, if the number of grid points is set to be large then the accuracy of the simplest form of binning can be maintained at a high level without introducing complications into the computational formulae.

When a fine grid is placed over the sample space, the original data can be recoded as frequencies at grid locations and, for regression data, sample means and standard deviations of the response variable. Bins containing no observations should be omitted from the recoded list. The table below identifies the form of the recoded data explicitly and establishes notation.

    Estimate      Dimensionality   Original data               Recoded data
    Density       1                {y_i}                       {\bar{y}_i, n_i}
                  2                {y_{1i}, y_{2i}}            {\bar{y}_{1i}, \bar{y}_{2i}, n_i}
    Regression    1                {x_i, y_i}                  {\bar{x}_i, \bar{y}_i, s_i, n_i}
                  2                {x_{1i}, x_{2i}, y_i}       {\bar{x}_{1i}, \bar{x}_{2i}, \bar{y}_i, s_i, n_i}
The number of bins will be denoted by $b$. For density estimation, $\bar{y}_i$ denotes bin location. For regression, $\bar{x}_i$ denotes bin location while $\bar{y}_i$ and $s_i$ denote the mean and standard deviation of the responses for the data in each bin. In all cases, $n_i$ denotes the bin frequencies. It is sometimes convenient to be able to refer to the individual observations within each bin. A double subscript will be used so that, for example, $y_{ij}$ refers to observation $j$ which lies in bin $i$.
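A sketch of the recoding step for regression data with a single covariate follows, written in R-compatible S (argument names such as include.lowest may differ slightly across S-Plus versions); the vectors x, y and the bin count nbins are assumed inputs.

    # recode (x, y) into bin locations, response means, sds and frequencies
    breaks <- seq(min(x), max(x), length = nbins + 1)
    bin    <- cut(x, breaks, include.lowest = TRUE)
    ni     <- tapply(y, bin, length)
    keep   <- !is.na(ni)                    # omit bins containing no observations
    ni     <- ni[keep]
    xbar   <- tapply(x, bin, mean)[keep]    # bin locations
    ybar   <- tapply(y, bin, mean)[keep]    # bin means of the response
    si     <- tapply(y, bin, function(u) sqrt(var(u)))[keep]   # bin sds
    # (bins holding a single observation give NA here; their (ni - 1) * si^2
    #  contribution to later formulae is zero)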
For density estimation, it is immediately obvious from the definition of the estimators that all that is required is to replicate the kernel function over each bin, according to the bin frequencies, in order to mimic the form of the estimator from the original data. All of the matrix formulae then apply with the elements of the weight vector defined as $p_i = n_i / \sum_i n_i$. The definition of the local mean form of nonparametric regression makes it clear that replication is sufficient to deal with this case too. From the discussion of weights in Section 2.2, the smoothing matrix is then given by

$$(WP) / (WP 1_b 1_b^T),$$

where $P$ is the diagonal matrix with $p$ as its diagonal vector. In the local linear case, with a single covariate, the observations $y_{ij}$ lying within bin $i$ are all now regarded as having the same covariate value $\bar{x}_i$. The expansion

$$\sum_{ij} \{y_{ij} - \alpha - \beta(\bar{x}_i - z)\}^2\, w(\bar{x}_i - z; h) = \sum_i (n_i - 1) s_i^2\, w(\bar{x}_i - z; h) + \sum_i n_i \{\bar{y}_i - \alpha - \beta(\bar{x}_i - z)\}^2\, w(\bar{x}_i - z; h)$$
shows that the estimation of $\alpha$, and hence the construction of the nonparametric regression curve, can be based solely on the bin means, locations and frequencies. The same principle applies in the two covariate case. The matrix formulae discussed in Section 2 then remain valid, with the basic kernel weight matrix $W$ replaced by $WP$. However, the dimensionality of the matrices involved in these manipulations is now controlled by $k$ and $b$ rather than $k$ and $n$, which is a very considerable reduction.

Clearly, the process of constructing the nonparametric estimates changes very little in the presence of binning. However, the estimates themselves are often only the starting points for further analyses. The effects of binning on computations related to more inferential aspects of nonparametric models are explored in Section 4. In preparation for that discussion, the influence of binning on the problem of estimating the error variance in a nonparametric model is considered here.

With a single covariate, a simple nonparametric model is expressed as $y = m(x) + \varepsilon$, where the errors $\varepsilon$ have independent Normal distributions with constant variance $\sigma^2$. Although it is a relatively straightforward matter to compute an estimate of $\sigma^2$ without binning, some inferential uses of the estimates in the quadratic form methods to be described in Section 4 are impractical to implement without binning when the sample size is large. This is because it is necessary to work with the specific matrix which defines the estimator as a quadratic form, and this matrix is of size $n \times n$ in the unbinned case. Estimation of $\sigma^2$ in a two covariate model is much more problematic, even without binning, and there has so far been relatively little discussion of this problem in the research literature.

A simple, but very useful, estimator of $\sigma^2$ was proposed by Rice (1984) as

$$\hat{\sigma}^2 = \frac{1}{2(n-1)} \sum_{i=2}^{n} (y_i - y_{i-1})^2,$$

where it is assumed that the observations $(x_i, y_i)$ have been ordered by $x_i$. The behaviour of the estimator is made clear by considering

$$E(y_i - y_{i-1})^2 = E\{(y_i - m(x_i)) - (y_{i-1} - m(x_{i-1})) + (m(x_i) - m(x_{i-1}))\}^2 = 2\sigma^2 + \{m(x_i) - m(x_{i-1})\}^2.$$

The estimate $\hat{\sigma}^2$ is therefore inflated by a term which measures the variation in the underlying function $m$, but the effect of this will rapidly diminish if the covariate values spread more densely over the design space as the sample size increases.
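In S, with the responses ordered by the covariate, Rice's estimator is a one-line computation (a sketch, assuming vectors x and y):

    ord  <- order(x)
    ys   <- y[ord]                                       # responses ordered by x
    sig2 <- sum(diff(ys)^2) / (2 * (length(ys) - 1))     # Rice (1984) estimator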
If the means of binned data are used, the corresponding expansion is

$$E(\bar{y}_i - \bar{y}_{i-1})^2 = \mathrm{var}(\bar{y}_i) + \mathrm{var}(\bar{y}_{i-1}) + \{m(\bar{x}_i) - m(\bar{x}_{i-1})\}^2 = \left( \frac{1}{n_i} + \frac{1}{n_{i-1}} \right) \sigma^2 + \{m(\bar{x}_i) - m(\bar{x}_{i-1})\}^2.$$

An estimate of $\sigma^2$ can therefore be constructed as

$$\hat{\sigma}^2_b = \frac{1}{b-1} \sum_{i=2}^{b} \frac{(\bar{y}_i - \bar{y}_{i-1})^2}{1/n_i + 1/n_{i-1}}.$$
However, information on $\sigma^2$ is also present in the bin sample variances. These two sources of information can then be combined to form the estimate

$$\hat{\sigma}^2 = \frac{(b-1)\hat{\sigma}^2_b + \sum_{i=1}^{b} (n_i - 1) s_i^2}{n-1}.$$

The denominator arises by combining the $b-1$ 'degrees of freedom' from the bin means and the $\sum_i (n_i - 1)$ degrees of freedom from the bin variances.

An alternative approach to estimating $\sigma^2$ was proposed by Gasser et al. (1986). This involves the creation of 'pseudo-residuals'

$$\tilde{\varepsilon}_i = \frac{x_{i+1} - x_i}{x_{i+1} - x_{i-1}}\, y_{i-1} + \frac{x_i - x_{i-1}}{x_{i+1} - x_{i-1}}\, y_{i+1} - y_i$$

as the difference between observation $y_i$ and the linear interpolation from its two neighbours. If the coefficients of $y_{i-1}$ and $y_{i+1}$ in this expression are denoted by $a_i$ and $b_i$, respectively, then the estimate of error variance is

$$\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=2}^{n-1} \frac{\tilde{\varepsilon}_i^2}{a_i^2 + b_i^2 + 1},$$

where $(a_i^2 + b_i^2 + 1)$ is the coefficient of $\sigma^2$ in $\mathrm{var}(\tilde{\varepsilon}_i)$. This second difference approach should reduce the effect of the underlying function in inflating the estimate. When this approach is applied to binned data the coefficients must be amended to give

$$\hat{\sigma}^2_b = \frac{1}{b-2} \sum_{i=2}^{b-1} \frac{\tilde{\varepsilon}_i^2}{a_i^2/n_{i-1} + b_i^2/n_{i+1} + 1/n_i}.$$
This estimator can then be combined with the bin variances to produce

$$\hat{\sigma}^2 = \frac{(b-2)\hat{\sigma}^2_b + \sum_{i=1}^{b} (n_i - 1) s_i^2}{n-2}.$$
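The difference-based part of these calculations is straightforward in vector form. The sketch below implements the binned Rice-type estimate and its combination with the bin variances, assuming the quantities ybar, si and ni from the recoding sketch above:

    # binned difference-based estimate of the error variance
    b     <- length(ybar)
    n     <- sum(ni)
    sig2b <- sum(diff(ybar)^2 / (1/ni[-1] + 1/ni[-b])) / (b - 1)
    # combine with the information in the bin sample variances
    # (bins with a single observation contribute zero; replace their NA sds by 0)
    s2    <- ifelse(is.na(si), 0, si^2)
    sig2  <- ((b - 1) * sig2b + sum((ni - 1) * s2)) / (n - 1)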
4. Inferential issues in nonparametric regression

The standard error of a nonparametric regression curve at its evaluation points provides a very useful indication of the variability of the estimator. Since the vector of estimated values $\hat{m}$ can be expressed as $Sy$, the estimated variance is simply

$$\widehat{\mathrm{var}}(\hat{m}) = \mathrm{diag}(S P^{-1} S^T)\, \hat{\sigma}^2,$$

where $P$ is the diagonal matrix with the bin frequencies as its diagonal vector. This refers to the case of binned data. The formula which applies in the absence of binning is obtained by setting $b = n$ and $P = I$.

Bowman and Young (1996) and Bowman and Azzalini (1997) discuss the use of 'reference bands' as a graphical display of the level of agreement between a nonparametric curve estimate and a reference model of interest. Particular reference models of interest include 'no effect', represented by a horizontal line, and a simple linear regression. Since these are linear models they can be represented in the binned data case as $Hy$, where $H = X(X^T P X)^{-1} X^T P$ and $X$ is the appropriate design matrix. The widths of the reference bands are determined by the standard errors of $\hat{m} - Hy$ under the assumption that the linear model holds. This is easily derived from

$$\widehat{\mathrm{var}}(\hat{m} - Hy) = \mathrm{diag}\{(S - H) P^{-1} (S - H)^T\}\, \hat{\sigma}^2.$$
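These variance formulas involve only the smoothing matrix and the bin frequencies. In the sketch below, the binned smoothing matrix S is assumed to have been constructed as in Section 2 with the bin locations as evaluation points (so that it is square and conformable with H), together with ni, ybar and sig2 from earlier sketches; the multiplier 2 gives approximate 95% bands.

    # standard errors and a reference band for the 'no effect' model
    Pinv  <- diag(1 / ni)                                # P^{-1}
    se    <- sqrt(diag(S %*% Pinv %*% t(S)) * sig2)      # se of the curve estimate
    mhat  <- as.vector(S %*% ybar)
    # H projects onto the 'no effect' (constant mean) model
    X     <- matrix(1, length(ni), 1)
    P     <- diag(ni)
    H     <- X %*% solve(t(X) %*% P %*% X) %*% t(X) %*% P
    se0   <- sqrt(diag((S - H) %*% Pinv %*% t(S - H)) * sig2)
    fit0  <- as.vector(H %*% ybar)
    upper <- fit0 + 2 * se0                              # approximate reference band
    lower <- fit0 - 2 * se0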
A variation applies in the 'no effect' case when the response data are in fact residuals from a linear model with projection matrix $H_1$ and interest lies in detecting patterns with respect to a further explanatory variable not yet included in the model. In the unbinned case, the estimated variance is then

$$\widehat{\mathrm{var}}(\hat{m} - Hy) = (S - H)(I - H_1)(S - H)^T\, \hat{\sigma}^2.$$

In the binned case it becomes rather complex to compute the appropriate projection matrix, because of the need to track the design matrix entries of the binned residuals.

It is also possible to carry out formal tests which compare parametric and nonparametric models in a global manner, or which compare two nonparametric models. As an example, consider a comparison of a linear model with a nonparametric regression, when there are one or two covariates. A general treatment can be given if the comparison is formulated in terms of residuals from the linear model. The null hypothesis is that these residuals have mean 0, while the alternative proposes that the mean is a smooth function of the covariate(s), as a consequence of nonlinearity. By analogy with an F-test for comparing linear models, a natural test statistic is

$$\frac{\mathrm{RSS}_0 - \mathrm{RSS}_1}{\mathrm{RSS}_1},$$

where in this case $\mathrm{RSS}_0$ and $\mathrm{RSS}_1$ denote the sums-of-squares of the linear model residuals and the smoothed residuals, respectively. The residual sums-of-squares are quadratic forms and a p-value for the test can be calculated numerically, as described in Azzalini and Bowman (1993) and Bowman and Azzalini (1997, Chapter 5).

With binned data, it is convenient to adopt the notation $e_{ij} = y_{ij} - \hat{y}(\bar{x}_i)$, where $\hat{y}(\bar{x}_i)$ denotes the fitted value from the linear model, and also to define $\bar{e}_i = \bar{y}_i - \hat{y}(\bar{x}_i)$. We then have $e_{ij} = (y_{ij} - \bar{y}_i) + \bar{e}_i$ and in matrix notation the smoothed residuals are given by $\hat{m} = S\bar{e}$. The residual sums-of-squares can then be written as

$$\mathrm{RSS}_0 = \sum_{ij} e_{ij}^2 = \sum_i (n_i - 1)s_i^2 + \sum_i n_i \bar{e}_i^2 = \sum_i (n_i - 1)s_i^2 + \bar{e}^T P \bar{e},$$

$$\mathrm{RSS}_1 = \sum_{ij} (e_{ij} - \hat{m}_i)^2 = \sum_i (n_i - 1)s_i^2 + \sum_i n_i (\bar{e}_i - \hat{m}_i)^2 = \sum_i (n_i - 1)s_i^2 + \bar{e}^T (I - S)^T P (I - S)\bar{e}.$$
The p-value for the test is calculated as

$$p = P\left( \frac{\bar{e}^T P \bar{e} - \bar{e}^T A \bar{e}}{\bar{e}^T A \bar{e} + \sum_i (n_i - 1)s_i^2} \ge F_{\mathrm{obs}} \right),$$

where, for convenience, $A$ denotes the matrix $(I - S)^T P (I - S)$. The calculation can be reformulated as

$$p = P\left( \bar{e}^T P \bar{e} - (1 + F_{\mathrm{obs}})\,\bar{e}^T A \bar{e} - F_{\mathrm{obs}} \sum_i (n_i - 1)s_i^2 \ge 0 \right) = P\left( \bar{e}^T B \bar{e} - F_{\mathrm{obs}} \sum_i (n_i - 1)s_i^2 \ge 0 \right),$$
where $B$ denotes the matrix $P - (1 + F_{\mathrm{obs}})A$. Under the assumption that the data $y_{ij}$, and hence the residuals $e_{ij}$, follow a Normal distribution, the distributional calculations on $\bar{e}^T B \bar{e}$ follow from well-known results on quadratic forms. Exact calculations can be performed using the eigenvalues of $B$, while Kuonen (1999) describes a saddlepoint approximation approach. Johnson and Kotz (1972) give the details of how the cumulants, and hence the moments, of $\bar{e}^T B \bar{e}$ may be calculated, and this leads to simple but accurate distributional approximations. The error variance $\sigma^2$ may be assumed to be 1 because its effect is scaled out in the ratio form of the test statistic. The covariance matrix of $\bar{e}$ is also required; this is easily seen to be $(I - X(X^T P X)^{-1} X^T P) P^{-1}$, where $X$ is the design matrix for the linear model. The additional term $F_{\mathrm{obs}} \sum_i (n_i - 1)s_i^2$ introduced by binning is independent of $\bar{e}$, and $\sum_i (n_i - 1)s_i^2$ follows a $\chi^2_{n-b}$ distribution. The first three cumulants of this additional component are therefore $F_{\mathrm{obs}}(n - b)$, $F_{\mathrm{obs}}^2\, 2(n - b)$ and $F_{\mathrm{obs}}^3\, 8(n - b)$. These cumulants can be added to those of the quadratic form, with signs reflecting the subtraction, to obtain the moments of the whole expression. The probability calculation can then be completed by matching moments with a shifted and scaled $\chi^2$ distribution.
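A sketch of the moment-matching step follows. It uses the standard result that, for mean-zero normal data, the $r$th cumulant of the quadratic form is $2^{r-1}(r-1)!\,\mathrm{tr}\{(B\Sigma)^r\}$; the objects B, Sigma (the covariance matrix of the binned residuals), Fobs, ni, si, b and n are assumed available, and the shifted, scaled $\chi^2$ is parameterised here as $a\,\chi^2_{df} + c$.

    # p-value by matching three cumulants to a shifted, scaled chi-squared
    BS <- B %*% Sigma
    k1 <- sum(diag(BS))                   - Fobs   *     (n - b)  # odd cumulants change
    k2 <- 2 * sum(diag(BS %*% BS))        + Fobs^2 * 2 * (n - b)  # sign because the
    k3 <- 8 * sum(diag(BS %*% BS %*% BS)) - Fobs^3 * 8 * (n - b)  # chi2 term is subtracted
    a    <- k3 / (4 * k2)                 # solves k2 = 2 a^2 df and k3 = 8 a^3 df
    df   <- 8 * k2^3 / k3^2
    cc   <- k1 - a * df
    pval <- 1 - pchisq(-cc / a, df)       # P(a * chi2_df + cc >= 0), assuming a > 0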
Other inferential problems, such as the comparison of two or more nonparametric regression curves as described by Young and Bowman (1995), involve estimates of the error variance $\sigma^2$ in the appropriate test statistics, using the expressions provided at the end of Section 3. Similar quadratic form calculations, with appropriate adjustments for binning, then also apply.
5. Discussion

The methods described in this paper provide efficient vector–matrix implementations of standard smoothing procedures. There are, of course, many other types of smoothing procedure in common use. Local logistic regression for binary data (Fan et al., 1995) and quantile smoothing for survival data (Bowman and Wright, 2000) are particularly important cases. Vector–matrix implementations of the algorithms for these cases follow similar principles to those described above, but the details have not been included here.

An important issue in the application of smoothing techniques is the selection of appropriate values for the smoothing parameters. A wide variety of methods are described in the literature, but the implementations of these are rather specific to each procedure. This topic has therefore not been addressed in the general principles discussed in the paper.

Where binning is used to handle large datasets, a decision is necessary on the number of bins to be used. We have found that setting the number of bins to $8\log(n/p)$, where $p$ denotes the number of dimensions (1 or 2), provides an effective empirical rule, applied only when the sample size $n$ is larger than 500. In the two-dimensional case, the value produced by this rule refers to the number of bins in each dimension.
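In code, this empirical rule is a one-liner (a sketch; n and p as defined above):

    # number of bins, per dimension when p = 2; applied only when n > 500
    nbins <- round(8 * log(n / p))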
Software

The sm library is available from the following Web sites:

http://www.stats.gla.ac.uk/~adrian/sm
http://www.stat.unipd.it/~azzalini/Book_sm/

Version 2 of the library includes the binning procedures discussed in the paper.
Appendix A. Implementations in S-Plus

The code listed here gives an indication of how the formulae described in the paper can be implemented in S-Plus. The sm library for S-Plus is based on the same principles but contains more complex code to allow greater generality.

A.1. Density estimation in one dimension

The following code constructs a density estimate est from the vector of data y at the vector of evaluation points z, using a smoothing parameter h and a vector of observation weights p whose elements sum to 1.
    n   <- length(y)
    k   <- length(z)
    W   <- matrix(rep(z, n), ncol = n)                  # k x n, columns all equal to z
    W   <- W - matrix(rep(y, k), ncol = n, byrow = T)   # (i,j) element now z_i - y_j
    W   <- exp(-0.5 * (W / h)^2) / (h * sqrt(2 * pi))   # normal kernel weights
    est <- as.vector(W %*% p)                           # weighted form of estimate (1)
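As a brief usage illustration, the listing can be wrapped in a function and applied to simulated data; the wrapper name and the data below are illustrative choices, not part of the paper.

    sm.dens.sketch <- function(y, z, h, p = rep(1/length(y), length(y))) {
      n <- length(y)
      k <- length(z)
      W <- matrix(rep(z, n), ncol = n)
      W <- W - matrix(rep(y, k), ncol = n, byrow = T)
      W <- exp(-0.5 * (W / h)^2) / (h * sqrt(2 * pi))
      as.vector(W %*% p)
    }
    y   <- rnorm(200)                        # simulated data
    z   <- seq(-3, 3, length = 50)           # evaluation grid
    est <- sm.dens.sketch(y, z, h = 0.4)
    plot(z, est, type = "l")                 # display the density estimate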