DEPARTMENT OF STATISTICS
University of Wisconsin
1210 West Dayton St.
Madison, WI 53706
TECHNICAL REPORT NO. 952 (rev) December 22, 1995
Smoothing Spline ANOVA Fits for Very Large, Nearly Regular Data Sets, with Application to Historical Global Climate Data
by
Grace Wahba and
Zhen Luo
To appear in the Festschrift in Honor of Ted Rivlin, C. Micchelli, Ed., Baltzer Press, 1996. Research sponsored in part by NSF under Grant DMS-9121003 and NASA under Grant NAGW-2961.
Baltzer Journals
December 22, 1995

Smoothing Spline ANOVA Fits For Very Large, Nearly Regular Data Sets, With Application to Historical Global Climate Data

Grace Wahba and Zhen Luo
Department of Statistics, University of Wisconsin, Madison, WI 53706
E-mail: [email protected], [email protected]

We review smoothing spline analysis of variance (SS-ANOVA) methods for fitting a function of several variables to scattered, noisy data. The fitted function is obtained as a sum of functions of one variable (main effects), plus a sum of functions of two variables (two-factor interactions), and so forth. The terms are found as solutions to a variational problem in a reproducing kernel Hilbert space which is built up from tensor sums and products of Hilbert spaces of functions of fewer variables. These methods have found application in environmental and demographic data analysis problems, and provide an intuitively interpretable technique for examining responses as (smooth) functions of several variables. Matrix decomposition methods can be used to compute the SS-ANOVA fits while adaptively choosing multiple smoothing parameters by generalized cross validation (GCV), provided that matrix decompositions of size $n \times n$ can be carried out, where $n$ is the sample size. We review the randomized trace technique and the backfitting algorithm, and remark that they can be combined to solve the variational problem while choosing the smoothing parameters by GCV for data sets that are much too large to use matrix decomposition methods directly. Some intermediate calculations to speed up the backfitting algorithm are given which are useful when the data has a tensor product structure. We describe an imputation procedure which can take advantage of data with a (nearly) tensor product structure. As an illustration of an application we discuss the algorithm in the context of fitting and smoothing historical global winter mean surface temperature data and examining the main effects and interactions for time and space.

Subject classification: AMS(MOS) 41A15, 62H11, 62G07, 65D07, 65D10, 65D15, 62M30, 65K10, 62F15, 49J55.

Keywords: smoothing spline ANOVA, Gauss-Seidel algorithm, backfitting algorithm, randomized trace estimates, generalized cross validation, RKPACK, large environmental data sets.

This research was supported in part by NSF under Grant DMS-9121003 and in part by NASA Grant NAGW-2961.
1 The Smoothing Spline ANOVA Decomposition

Some of this section is reprised from [43]. Let $\mathcal{T}^{(\alpha)}$ be a measurable space, $t_\alpha \in \mathcal{T}^{(\alpha)}$, $\alpha = 1,\dots,d$, $(t_1,\dots,t_d) = t \in \mathcal{T} = \mathcal{T}^{(1)} \otimes \cdots \otimes \mathcal{T}^{(d)}$. For $f$ satisfying some measurability conditions on $\mathcal{T}$, a unique ANOVA decomposition of $f$ of the form

$$f(t_1,\dots,t_d) = \mu + \sum_\alpha f_\alpha(t_\alpha) + \sum_{\alpha<\beta} f_{\alpha\beta}(t_\alpha,t_\beta) + \cdots \qquad (1)$$

can always be defined as follows: Let $d\mu_\alpha$ be a probability measure on $\mathcal{T}^{(\alpha)}$ and define the averaging operator $\mathcal{E}_\alpha$ on $\mathcal{T}$ by

$$(\mathcal{E}_\alpha f)(t) = \int_{\mathcal{T}^{(\alpha)}} f(t_1,\dots,t_d)\, d\mu_\alpha(t_\alpha). \qquad (2)$$

Then the identity is decomposed as

$$I = \prod_\alpha \big(\mathcal{E}_\alpha + (I - \mathcal{E}_\alpha)\big) = \prod_\alpha \mathcal{E}_\alpha + \sum_\alpha (I - \mathcal{E}_\alpha)\prod_{\beta\neq\alpha}\mathcal{E}_\beta + \sum_{\alpha<\beta}(I - \mathcal{E}_\alpha)(I - \mathcal{E}_\beta)\prod_{\gamma\neq\alpha,\beta}\mathcal{E}_\gamma + \cdots + \prod_\alpha(I - \mathcal{E}_\alpha). \qquad (3)$$

The components of this decomposition generate the ANOVA decomposition of $f$ of the form (1) by

$$\mu = \Big(\prod_\alpha \mathcal{E}_\alpha\Big)f, \quad f_\alpha = \Big((I - \mathcal{E}_\alpha)\prod_{\beta\neq\alpha}\mathcal{E}_\beta\Big)f, \quad f_{\alpha\beta} = \Big((I - \mathcal{E}_\alpha)(I - \mathcal{E}_\beta)\prod_{\gamma\neq\alpha,\beta}\mathcal{E}_\gamma\Big)f, \qquad (4)$$
and so forth. The idea behind SS-ANOVA is to construct a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ of functions on $\mathcal{T}$ so that the components of the SS-ANOVA decomposition represent an orthogonal decomposition of $f$ in $\mathcal{H}$, and, given noisy data, to estimate (some of) these components. We suppose there are observations $y_i$ generated according to the model

$$y_i = f(t_1(i),\dots,t_d(i)) + \epsilon_i, \quad i = 1,\dots,n, \qquad (5)$$

where $\epsilon = (\epsilon_1,\dots,\epsilon_n)' \sim \mathcal{N}(0,\sigma^2 I_{n\times n})$ is a `Gaussian white noise' vector. Some (or all) of the components in the ANOVA decomposition of $f$ will be estimated by finding $f$ in (an appropriate subspace of) $\mathcal{H}$ to minimize

$$\sum_{i=1}^n (y_i - f(t(i)))^2 + \sum_\alpha \lambda_\alpha J_\alpha(f_\alpha) + \sum_{\alpha<\beta}\lambda_{\alpha\beta} J_{\alpha\beta}(f_{\alpha\beta}) + \cdots, \qquad (6)$$
where the $J_\alpha, J_{\alpha\beta}$ are `roughness penalties' and the series may be truncated at some point. This is done as follows: Let $\mathcal{H}^{(\alpha)}$ be an RKHS of (real valued) functions on $\mathcal{T}^{(\alpha)}$ with $\int_{\mathcal{T}^{(\alpha)}} f_\alpha(t_\alpha)\,d\mu_\alpha = 0$ for $f_\alpha \in \mathcal{H}^{(\alpha)}$, and let $[1^{(\alpha)}]$ be the one-dimensional space of constant functions on $\mathcal{T}^{(\alpha)}$. Construct $\mathcal{H}$ as

$$\mathcal{H} = \prod_{\alpha=1}^d \big(\{[1^{(\alpha)}]\} \oplus \{\mathcal{H}^{(\alpha)}\}\big) = [1] \oplus \sum_\alpha \mathcal{H}^{(\alpha)} \oplus \sum_{\alpha<\beta}[\mathcal{H}^{(\alpha)}\otimes\mathcal{H}^{(\beta)}] \oplus \cdots, \qquad (7)$$

where $[1^{(\alpha)}]$ is the constant functions on $\mathcal{T}^{(\alpha)}$ and $[1]$ is the constant functions on $\mathcal{T}$. With some abuse of notation, factors of the form $[1^{(\alpha)}]$ are omitted whenever they multiply a term of a different form. Thus $\mathcal{H}^{(\alpha)}$ is a shorthand for $[1^{(1)}]\otimes\cdots\otimes[1^{(\alpha-1)}]\otimes\mathcal{H}^{(\alpha)}\otimes[1^{(\alpha+1)}]\otimes\cdots\otimes[1^{(d)}]$ (which is a subspace of $\mathcal{H}$). By letting the square norm on $[1^{(\alpha)}]$ be $(\int f\,d\mu_\alpha)^2$, and using the induced tensor product norm, the components of the ANOVA decomposition will be in mutually orthogonal subspaces of $\mathcal{H}$. Next, $\mathcal{H}^{(\alpha)}$ is decomposed into a parametric part and a smooth part, by letting $\mathcal{H}^{(\alpha)} = \mathcal{H}_\pi^{(\alpha)}\oplus\mathcal{H}_s^{(\alpha)}$, where $\mathcal{H}_\pi^{(\alpha)}$ is finite dimensional (the "parametric" part) and $\mathcal{H}_s^{(\alpha)}$ (the "smooth" part) is the orthocomplement of $\mathcal{H}_\pi^{(\alpha)}$ in $\mathcal{H}^{(\alpha)}$. Elements of $\mathcal{H}_\pi^{(\alpha)}$ are not penalized, through the device of letting $J_\alpha(f_\alpha) = \|P_s^{(\alpha)}f_\alpha\|^2$, where $P_s^{(\alpha)}$ is the orthogonal projector onto $\mathcal{H}_s^{(\alpha)}$. $[\mathcal{H}^{(\alpha)}\otimes\mathcal{H}^{(\beta)}]$ is now a direct sum of four orthogonal subspaces: $[\mathcal{H}^{(\alpha)}\otimes\mathcal{H}^{(\beta)}] = [\mathcal{H}_\pi^{(\alpha)}\otimes\mathcal{H}_\pi^{(\beta)}]\oplus[\mathcal{H}_\pi^{(\alpha)}\otimes\mathcal{H}_s^{(\beta)}]\oplus[\mathcal{H}_s^{(\alpha)}\otimes\mathcal{H}_\pi^{(\beta)}]\oplus[\mathcal{H}_s^{(\alpha)}\otimes\mathcal{H}_s^{(\beta)}]$. By convention the elements of the finite dimensional space $[\mathcal{H}_\pi^{(\alpha)}\otimes\mathcal{H}_\pi^{(\beta)}]$ will not be penalized. Continuing this way results in an orthogonal decomposition of $\mathcal{H}$ into sums of products of unpenalized finite dimensional subspaces, plus main effects `smooth' subspaces, plus two-factor interaction spaces of the form parametric $\otimes$ smooth $[\mathcal{H}_\pi^{(\alpha)}\otimes\mathcal{H}_s^{(\beta)}]$, smooth $\otimes$ parametric $[\mathcal{H}_s^{(\alpha)}\otimes\mathcal{H}_\pi^{(\beta)}]$ and smooth $\otimes$ smooth $[\mathcal{H}_s^{(\alpha)}\otimes\mathcal{H}_s^{(\beta)}]$, and similarly for the three- and higher-factor subspaces. Now suppose that we have selected the model $\mathcal{M}$, that is, we have decided which subspaces will be included. Collect all of the included unpenalized subspaces into a subspace, call it $\mathcal{H}_0$, of dimension $M$, and relabel the other subspaces as $\mathcal{H}^\beta$, $\beta = 1,2,\dots,p$. $\mathcal{H}^\beta$ may stand for a subspace $\mathcal{H}_s^{(\alpha)}$, or one of the three subspaces in the decomposition of $[\mathcal{H}^{(\alpha)}\otimes\mathcal{H}^{(\beta)}]$ which contains at least one `smooth' component, or a higher order subspace with at least one `smooth' component. Collecting these subspaces as $\mathcal{M} = \mathcal{H}_0\oplus\sum_\beta\mathcal{H}^\beta$, the estimation problem becomes: Find $f$ in $\mathcal{M}$ to minimize

$$\sum_{i=1}^n(y_i - f(t(i)))^2 + \lambda\sum_{\beta=1}^p\theta_\beta^{-1}\|P^\beta f\|^2, \qquad (8)$$
where $P^\beta$ is the orthogonal projector in $\mathcal{M}$ onto $\mathcal{H}^\beta$, and $\lambda\theta_\beta^{-1} = \lambda_\beta$. The minimizer, call it $f_\lambda$ ($\lambda = (\lambda_1,\dots,\lambda_p)$), of (8) is known to have a representation ([40], Chapter 10) in terms of a basis $\{\phi_\nu\}$ for $\mathcal{H}_0$ and the reproducing kernels (RK's) $\{R^\beta(s,t)\}$ for the $\mathcal{H}^\beta$. Letting

$$Q_\theta(s,t) = \sum_{\beta=1}^p \theta_\beta R^\beta(s,t), \qquad (9)$$

it is

$$f_\lambda(t) = \sum_{\nu=1}^M d_\nu\phi_\nu(t) + \sum_{i=1}^n c_i Q_\theta(t(i),t) = \phi(t)'d + \xi(t)'c, \qquad (10)$$

where $\phi'(t) = (\phi_1(t),\dots,\phi_M(t))$ and $\xi'(t) = (Q_\theta(t(1),t),\dots,Q_\theta(t(n),t))$. $c_{n\times 1}$ and $d_{M\times 1}$ are vectors of coefficients which satisfy

$$(Q_\theta + \lambda I)c + Sd = y, \qquad S'c = 0, \qquad (11)$$

where here and below we are letting $Q_\theta$ be the $n\times n$ matrix with $ij$th entry $Q_\theta(t(i),t(j))$, and $S$ be the $n\times M$ matrix with $i\nu$th entry $\phi_\nu(t(i))$. This system will have a unique solution for any set of positive $\{\theta_\beta\}$ provided $S$ is of full column rank, which we will always assume. This condition on $S$ is equivalent to the uniqueness of least squares regression onto span $\{\phi_\nu\}$. Since the RK of a tensor product space is the product of the RK's of the component spaces, the computation of the $R^\beta$'s is straightforward. For example, the RK corresponding to the subspace $\mathcal{H}_\pi^{(\alpha)}\otimes\mathcal{H}_s^{(\beta)}$ is (in an obvious notation) $R_{\mathcal{H}_\pi^{(\alpha)}}(s_\alpha,t_\alpha)R_{\mathcal{H}_s^{(\beta)}}(s_\beta,t_\beta)$. Of course any positive definite function may in principle play the role of a reproducing kernel here. Conditionally positive definite functions [32], as in thin plate splines [44], may also be used. The point evaluation functionals $f \to f(t(i))$ may be replaced by bounded linear functionals on $\mathcal{H}$, and other functions can be added to $\mathcal{H}_0$ subject just to the uniqueness conditions, making this class of function estimates broadly useful in many applications. See [4] [5] [6] [7] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [29] [33] [34] [35] [37] [38] [39] [40] [41] [44].
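As a concrete illustration of how (9)-(11) can be computed directly when $n$ is moderate, here is a minimal sketch in NumPy (ours, not part of the original paper; the function and variable names are our own). It assembles $Q_\theta$ from given kernel matrices $R_\beta$ and solves the linear system (11) as one symmetric block system.

```python
import numpy as np

def solve_ssanova(R_list, theta, S, y, lam):
    """Sketch: solve (Q_theta + lam*I)c + S d = y,  S'c = 0, as in (11).

    R_list : list of p (n x n) kernel matrices R_beta with entries R^beta(t(i), t(j))
    theta  : length-p array of theta_beta
    S      : (n x M) matrix with entries phi_nu(t(i))
    y      : data vector of length n
    lam    : smoothing parameter lambda
    """
    n, M = S.shape
    Q = sum(th * R for th, R in zip(theta, R_list))      # Q_theta, as in (9)
    A = np.block([[Q + lam * np.eye(n), S],
                  [S.T, np.zeros((M, M))]])
    rhs = np.concatenate([y, np.zeros(M)])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n:]                               # c, d
```

The fitted values at the data points are then $Qc + Sd$, and an individual component $f_\beta$ can be evaluated at a new point $t$ as $\theta_\beta\sum_i c_i R^\beta(t(i),t)$.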
2 Backfitting

Hastie and Tibshirani [26], Section 5.2.3, discuss the backfitting (a.k.a. Gauss-Seidel) algorithm in the context of the general setup of SS-ANOVA problems as was described by [6]. Further discussion of the backfitting algorithm can be found in [3] and elsewhere. Referring to (8)-(11), let $f_0(t) = \sum_{\nu=1}^M d_\nu\phi_\nu(t)$ and let $f_\beta(t) = \theta_\beta\sum_{i=1}^n c_i R^\beta(t(i),t)$. Then $f_\lambda(\cdot) = f_0(\cdot) + \sum_{\beta=1}^p f_\beta(\cdot)$ with $c$ and $d$ satisfying (11) is the minimizer over $f$ in $\mathcal{M}$ of

$$\sum_{i=1}^n (y_i - f(t(i)))^2 + \sum_{\beta=1}^p \lambda_\beta\|P^\beta f\|^2, \qquad (12)$$
where we have set $\lambda = 1$ (so that $\lambda_\beta = \theta_\beta^{-1}$). Now, define $\tilde f_0(\cdot) = f_0(\cdot)$ and $\tilde f_\beta(\cdot) = \sum_{i=1}^n c_{\beta i}R^\beta(t(i),\cdot)$, for arbitrary $c_{\beta i}$. In what follows it will be useful to recall that $\|\tilde f_\beta\|^2 \equiv \|P^\beta\tilde f_\beta\|^2 = c_\beta'R_\beta c_\beta$, where $c_\beta = (c_{\beta 1},\dots,c_{\beta n})'$ and $R_\beta$ is the $n\times n$ matrix with $ij$th entry $R^\beta(t(i),t(j))$; see [40]. In [26] the authors consider minimizing (12) in the span of all functions of the form $f(\cdot) = \tilde f_0(\cdot) + \tilde f_1(\cdot) + \cdots + \tilde f_p(\cdot)$ and they discuss finding $d$ and $c_\beta$, $\beta = 1,\dots,p$, to minimize

$$\Big\|y - Sd - \sum_{\beta=1}^p R_\beta c_\beta\Big\|^2 + \sum_{\beta=1}^p \lambda_\beta c_\beta'R_\beta c_\beta. \qquad (13)$$

They note that the (vector) smooths corresponding to the minimizers, defined as $\tilde f_0 \equiv S\tilde d$ and $\tilde f_\beta \equiv R_\beta\tilde c_\beta$, $\beta = 1,\dots,p$, satisfy the backfitting equations

$$\tilde f_\beta = S_\beta\Big(y - \sum_{\gamma\neq\beta}\tilde f_\gamma\Big), \quad \beta = 0,1,\dots,p, \qquad (14)$$

with the smoother matrix $S_0$ given by $S_0 = S(S'S)^{-1}S'$ and the other smoother matrices $S_\beta$ given by $S_\beta = R_\beta(R_\beta + \lambda_\beta I)^{-1}$, $\beta = 1,\dots,p$. The backfitting algorithm solves for $\tilde f_0$ and $\tilde f_\beta$, $\beta = 1,\dots,p$, by cycling through $\tilde f_\beta = S_\beta(y - \sum_{\gamma\neq\beta}\tilde f_\gamma)$, $\beta = 0,1,\dots,p$. The backfitting algorithm is known to converge if the Frobenius norm of each product $S_\beta S_\gamma$ is less than 1. Since the minimizer of (12) is in the $(M+n)$-dimensional space spanned by $\{\phi_\nu(\cdot),\ \nu = 1,\dots,M\}\cup\{\sum_{\beta=1}^p R^\beta(t(i),\cdot),\ i = 1,\dots,n\}$, minimizing (12) in the larger space spanned by the $M + np$ functions $\{\phi_\nu(\cdot),\ \nu = 1,\dots,M\}\cup\{R^\beta(t(i),\cdot),\ \beta = 1,\dots,p,\ i = 1,\dots,n\}$ will result in a solution in the $(n+M)$-dimensional space. Setting $\tilde f_\beta = R_\beta c_\beta$, the last $p$ backfitting equations become
$$R_\beta\Big(\lambda_\beta c_\beta + \sum_{\gamma=1}^p R_\gamma c_\gamma\Big) = R_\beta(y - Sd), \quad \beta = 1,\dots,p, \qquad (15)$$

although $c_\beta$ will not be uniquely determined if $R_\beta$ is not of full rank. However, suppose $R_\beta c_\beta = \theta_\beta R_\beta c$ for some $c$, $\beta = 1,\dots,p$. Recalling that $\lambda_\beta\theta_\beta = 1$, this would give

$$R_\beta\Big(I + \sum_{\gamma=1}^p\theta_\gamma R_\gamma\Big)c = R_\beta(y - Sd), \quad \beta = 1,\dots,p, \qquad (16)$$

so that if $c$ satisfies $(I + \sum_{\gamma=1}^p\theta_\gamma R_\gamma)c = y - Sd$, then $c_\beta = \theta_\beta c$, $\beta = 1,\dots,p$, satisfies the backfitting equations. Thus, despite the apparently larger number $np + M$ of unknowns in (14) compared to the $n + M$ unknowns in (11), the backfitting solutions $\tilde f_\beta$, $\beta = 0,1,\dots,p$, are, at convergence, equivalent to solving

$$(Q_\theta + I)c + Sd = y \qquad (17)$$
$$S'c = 0 \qquad (18)$$

for $c$ and $d$ and setting $\tilde f_0 = Sd$, $\tilde f_\beta = \theta_\beta R_\beta c$. Equation (18) follows by observing that the first backfitting equation becomes $Sd = S_0(y - Q_\theta c)$. Substituting this into $(I + Q_\theta)c = (y - Sd)$ results in $c = (I - S_0)(y - Q_\theta c)$, which entails (18). Substituting back into (17) results in $c = y - \sum_{\beta=0}^p\tilde f_\beta$, which may then be used to compute $f_\beta(t)$ for general $t$ via $f_\beta(t) = \theta_\beta\sum_{i=1}^n c_i R^\beta(t(i),t)$. Ansley and Kohn [1] have a nice discussion of the backfitting algorithm in the context of von Neumann's alternating projection method.
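A minimal sketch (ours) of the backfitting cycle (14), assuming each smoother $S_\beta$ is available as a routine that can be applied to an $n$-vector; the names, convergence test, and tolerances below are our own choices and not part of the algorithm as stated above.

```python
import numpy as np

def backfit(smoothers, y, tol=1e-8, max_iter=200):
    """Cycle f_beta <- S_beta (y - sum_{gamma != beta} f_gamma), beta = 0, 1, ..., p.

    smoothers : list of callables; smoothers[beta](v) applies S_beta to a vector v
                (smoothers[0] should be least squares projection onto the columns of S)
    y         : data vector of length n
    """
    f = [np.zeros(len(y)) for _ in smoothers]
    for _ in range(max_iter):
        change = 0.0
        for beta, S_beta in enumerate(smoothers):
            z_beta = y - (sum(f) - f[beta])          # y minus all the other components
            f_new = S_beta(z_beta)
            change = max(change, np.max(np.abs(f_new - f[beta])))
            f[beta] = f_new
        if change < tol:
            break
    return f
```

Here `smoothers[0]` could be `lambda v: S @ np.linalg.lstsq(S, v, rcond=None)[0]` and, for $\beta \geq 1$, `smoothers[beta]` could apply $R_\beta(R_\beta + \lambda_\beta I)^{-1}$ to its argument.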
3 Choosing the Smoothing Parameters

Probably the most popular method for choosing $\lambda = (\lambda_1,\dots,\lambda_p) \equiv (\lambda\theta_1^{-1},\dots,\lambda\theta_p^{-1})$ is the GCV method [7] [14], which chooses $\lambda$ as the minimizer of

$$V(\lambda) = \frac{\|(I - A(\lambda))y\|^2}{[\mathrm{tr}(I - A(\lambda))]^2}, \qquad (19)$$

where $A(\lambda)$ is the $n\times n$ matrix satisfying

$$A(\lambda)y = (f_\lambda(t(1)),\dots,f_\lambda(t(n)))'. \qquad (20)$$

The matrix $I - A(\lambda)$ is known to have a representation

$$I - A(\lambda) = \lambda\Gamma_2\big(\Gamma_2'(Q_\theta + \lambda I)\Gamma_2\big)^{-1}\Gamma_2', \qquad (21)$$

where $\Gamma_2$ is any $n\times(n-M)$ orthogonal matrix satisfying $\Gamma_2'S = 0$ and $Q_\theta$ is, as before, the $n\times n$ matrix with $i,j$th entry $Q_\theta(t(i),t(j))$; see, for example, [40], p. 13. $A(\lambda)$ is a so-called `smoother' matrix, that is, a symmetric, non-negative definite matrix with all its eigenvalues in the interval $[0,1]$. If the variance $\sigma^2$ in Equation (5) is given, then the unbiased risk estimate of $\lambda$, given by the minimizer of

$$U(\lambda) = \|(I - A(\lambda))y\|^2 + 2\sigma^2\,\mathrm{tr}A(\lambda), \qquad (22)$$

can also be used. See [14], [7]. Other estimates are discussed in [40]. The code RKPACK ([15], [20], [23]) is designed to compute $\mathrm{tr}A(\lambda)$, find the minimizer of $V(\lambda)$, and solve (11), using matrix decomposition methods.
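For moderate $n$, (19)-(21) can be evaluated directly. The sketch below (ours; it uses a QR decomposition of $S$ to produce a matrix playing the role of $\Gamma_2$) computes $V(\lambda)$ for given $Q_\theta$, $S$ and $y$.

```python
import numpy as np

def gcv_score(Q_theta, S, y, lam):
    """Sketch: evaluate the GCV function V(lambda) of (19) via representation (21)."""
    n, M = S.shape
    Qfull, _ = np.linalg.qr(S, mode='complete')
    Gamma2 = Qfull[:, M:]                                     # n x (n - M), Gamma2' S = 0
    B = Gamma2.T @ (Q_theta + lam * np.eye(n)) @ Gamma2
    I_minus_A = lam * Gamma2 @ np.linalg.solve(B, Gamma2.T)   # I - A(lambda), eq. (21)
    resid = I_minus_A @ y
    return (resid @ resid) / np.trace(I_minus_A) ** 2
```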
Recently Girard [9] [10] [11] and Hutchinson [27] have proposed the randomized trace technique for estimating $\mathrm{tr}A(\lambda)$ for large $n$. This method is feasible for $n$ larger than allowable with matrix decomposition methods, and can be used whenever a `black box' is available for obtaining $\tilde f$, the $n$-vector with $i$th entry $f_\lambda(t(i))$, $i = 1,\dots,n$, given $y$. That is, it is assumed that some algorithm is available which produces (a good numerical approximation to) $\tilde f = A(\lambda)y$ for any $y$. The randomized trace estimate is based on the following fact: Let $\xi$ be an $n$-dimensional (pseudo-)random vector with mean zero and covariance matrix $I$. Then the expected value of $\xi'A(\lambda)\xi$ is $\mathrm{tr}A(\lambda)$. Furthermore, if $\xi$ is a Gaussian random vector then the standard deviation of $\frac{1}{n}\xi'A(\lambda)\xi$ is $\sqrt{\frac{2}{n}}\big[\frac{1}{n}\mathrm{tr}A^2(\lambda)\big]^{1/2}$ [10]. Since $A(\lambda)$ is a smoother matrix, $\frac{1}{n}\mathrm{tr}A(\lambda)\in[0,1]$ and the standard deviation of $\frac{1}{n}\xi'A(\lambda)\xi$ is no greater than $\sqrt{\frac{2}{n}}\big[\frac{1}{n}\mathrm{tr}A(\lambda)\big]^{1/2}$. In practice it is preferable to estimate $\frac{1}{n}\mathrm{tr}(I - A(\lambda))$. Letting $\tilde f(\xi)$ be $\tilde f$ with the data vector $y$ replaced by $\xi$, the estimate of $\frac{1}{n}\mathrm{tr}(I - A(\lambda))$ is $\frac{1}{n}\xi'[\xi - \tilde f(\xi)]$. The same $\xi$ should be used for all values of $\lambda$. This results in an estimate for $V(\lambda)$ which is a smooth function of $\lambda$ and appears to have the same shape as $V(\lambda)$ computed exactly. Excellent results with $n$ around 600 were reported in [42], for example, and have also been reported by other authors, see, for example, [13]. Since the (converged) backfitting algorithm produces $\tilde f = \sum_{\beta=0}^p\tilde f_\beta$, it can be used to compute the randomized trace estimate of $\mathrm{tr}(I - A(\lambda))$ for selected values of $\lambda$. Now, let $z_\beta = y - \sum_{\gamma\neq\beta}\tilde f_\gamma$. Note that at convergence, when (14) is satisfied,

$$\|y - \tilde f\|^2 = \Big\|\Big(y - \sum_{\gamma\neq\beta}\tilde f_\gamma\Big) - \tilde f_\beta\Big\|^2 \qquad (23)$$
$$= \|z_\beta - S_\beta z_\beta\|^2, \quad \beta = 0,1,\dots,p. \qquad (24)$$

This suggests that $\lambda_\beta$, $\beta = 1,\dots,p$, can be updated at each step as the backfitting proceeds, by considering $\lambda_\gamma$ to be fixed for $\gamma\neq\beta$. Let $A(\lambda_\beta)$ stand for $A(\lambda)$ with all the $\lambda_\gamma$ considered fixed except $\lambda_\beta$, and choose $\lambda_\beta$ to minimize

$$V_\beta(\lambda_\beta) = \frac{\|z_\beta - S_\beta z_\beta\|^2}{[\mathrm{tr}(I - A(\lambda_\beta))]^2}, \quad \beta = 1,\dots,p, \qquad (25)$$

and similarly for $U_\beta(\lambda_\beta)$. This is the BRUTO algorithm in [26], p. 262.
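The randomized trace estimate needs only the fitting `black box'. A minimal sketch (ours) follows; `fit_black_box` stands for any routine returning (a good approximation to) $A(\lambda)v$ for an input vector $v$, for example a converged backfit, and the probe vector should be held fixed as $\lambda$ is varied.

```python
import numpy as np

def randomized_gcv(fit_black_box, y, probe):
    """Sketch: estimate V(lambda) of (19) using the randomized trace of I - A(lambda).

    fit_black_box : callable v -> A(lambda) v (e.g. fitted values from a converged backfit)
    y             : data vector
    probe         : mean-zero probe vector with identity covariance; reuse the SAME probe
                    for every lambda so the estimated V(lambda) is smooth in lambda
    """
    trace_est = probe @ (probe - fit_black_box(probe))   # estimates tr(I - A(lambda))
    resid = y - fit_black_box(y)
    return (resid @ resid) / trace_est ** 2

# probe = np.random.default_rng(0).standard_normal(len(y))   # one possible choice
```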
4 Global Climate Data

The algorithm we describe is well suited to the analysis of certain kinds of global environmental data. Rather than give a completely general description, we will motivate the discussion with respect to a particular example of broad general interest. Generalizations to more complicated models will be fairly evident.
We have in mind monthly average surface temperature data that has been computed from daily observations of surface temperature collected at a large number of meteorological stations around the globe for varying periods of time. One of the oldest observing stations, at Trondheim, Norway, has been collecting such data since 1761. Records from around 1700 stations are available for the period 1961-90, with varying numbers of missing data. Just to give some concrete numbers, as an illustration consider data for the $n_1 = 30$ years 1961-1990, for, say, $n_2 = 1500$ stations that have mostly complete records for that period. Considering the occasional missing data point, there will then be (for a particular month) somewhat fewer than $n = n_1 n_2 = 45{,}000$ observations, indexed (after scaling) by $t_1\in\{t_1(1),\dots,t_1(n_1)\}\subset\mathcal{T}^{(1)} = [0,1]$, and $t_2\in S$, where $S$ is the unit sphere ($t_2$ takes on the values of (latitude, longitude) of the stations). For expository purposes and generality we let $\mu_\alpha$, $\alpha = 1,2$, be Lebesgue measure on the unit interval and on the sphere, respectively. In this example, if the $t_1(j)$ are equally spaced, $\mathcal{T}^{(1)}$ can (possibly more naturally) be taken as a set of $n_1$ points, with $\mu_1$ assigning mass $1/n_1$ to each point. In this problem, the choice of Lebesgue measure on the sphere is a natural one, since we will be interested in estimating the global average temperature. We remark parenthetically that although from a mathematical point of view this choice appears trivially obvious, among climatologists who have to deal with very irregular data of this type the choice of a commonly acceptable working definition of `global average surface temperature' is far from obvious.

If $\mathcal{T}^{(1)} = [0,1]$, we take for $\mathcal{H}^{(1)}$ the subspace of $W_2^2 = \{f: f, f'\ \text{abs. cont.},\ f''\in\mathcal{L}_2[0,1]\}$ of functions which satisfy $\int_0^1 f\,d\mu_1 = 0$. If we endow this space with the square norm $\|f\|^2 = (f(1) - f(0))^2 + \int_0^1(f''(u))^2du$, then we can let $\mathcal{H}_\pi^{(1)}$ be the one-dimensional space spanned by the `trend' function $\phi(u) = (u - \frac12)/\sqrt{12}$. Then $\mathcal{H}_s^{(1)}$ can be taken as the subspace of $W_2^2$ satisfying $\int_0^1 f(u)du = (f(1) - f(0)) = 0$ with the square norm $\|f\|^2_{\mathcal{H}_s^{(1)}} = \int_0^1(f''(u))^2du$. The RK $\tilde R_1(u,v)$ for $\mathcal{H}_s^{(1)}$ is given in [40] as

$$\tilde R_1(u,u') = k_2(u)k_2(u') - k_4([u - u']), \qquad (26)$$

where $k_\ell(u) = B_\ell(u)/\ell!$ with $B_\ell$ the $\ell$th Bernoulli polynomial and $[u - u']$ the fractional part of $u - u'$. Elements in $\mathcal{H}_s^{(1)}$ have a natural interpretation as having been `detrended'. If we take $\mathcal{T}^{(1)} = \{1,\dots,n_1\}$, and let $J_1(f) = \sum_{i=1}^{n_1-2}[f(i+2) - 2f(i+1) + f(i)]^2$, then $\mathcal{H}^{(1)}$ is the subspace of $E^{n_1}$ of vectors whose components sum to 0, the trend function is $\phi(u) = u - (n_1+1)/2$, and $\mathcal{H}_s^{(1)}$ is the $(n_1-2)$-dimensional subspace of $E^{n_1}$ of vectors perpendicular, in the Euclidean inner product, to the constant function and the trend function. The RK for $\mathcal{H}_s^{(1)}$ can easily be shown to be the $n_1\times n_1$ matrix obtained as follows: Let $L$ be the $(n_1-2)\times n_1$ matrix with 1 down the diagonal, $-2$ down the super-diagonal and 1 down the next super-diagonal, i.e. $J_1(f) = f'L'Lf$, where $f = (f(1),\dots,f(n_1))'$. Then $\tilde R_1(j,j')$ is the $jj'$th entry of $(L'L)^\dagger$, where $\dagger$ denotes the Moore-Penrose generalized inverse.
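For reference, a small sketch (ours) of the two versions of $\tilde R_1$ just described: the Bernoulli-polynomial kernel (26) on $[0,1]$ and the discrete $(L'L)^\dagger$ kernel on $\{1,\dots,n_1\}$.

```python
import numpy as np

def k2(u):
    return (u**2 - u + 1.0 / 6.0) / 2.0                   # B_2(u)/2!

def k4(u):
    return (u**4 - 2*u**3 + u**2 - 1.0 / 30.0) / 24.0     # B_4(u)/4!

def R1_tilde(u, v):
    """RK (26): k_2(u)k_2(v) - k_4([u - v]), with [.] the fractional part."""
    return k2(u) * k2(v) - k4((u - v) % 1.0)

def R1_tilde_discrete(n1):
    """Discrete analogue: Moore-Penrose inverse of L'L, L the second-difference matrix."""
    L = np.zeros((n1 - 2, n1))
    for i in range(n1 - 2):
        L[i, i:i + 3] = [1.0, -2.0, 1.0]
    return np.linalg.pinv(L.T @ L)
```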
Splines on the sphere have been discussed in [34], [35], where $J(f) = \int_S(\Delta^{m/2}f)^2\,dP$ with $\Delta$ the surface Laplacian on the sphere. Closed form expressions for RK's with a norm which is (topologically) equivalent to $J(\cdot)$ are given there for $m = 3/2, 2, 5/2, \dots, 6$. The parameter $m$ may be estimated from the data by minimizing $\min_\lambda V(\lambda)$ considered as a function of $m$; see [44]. Other positive definite functions on the sphere may be found in [45] and references cited there. See also [8] and [36], who obtain RK's from historical meteorological data. Among sufficiently differentiable functions on the sphere for which $J(f)$ is finite, the null space of $J(f)$ is just the constant functions, so we take $\mathcal{H}_\pi^{(2)}$ as empty and $\mathcal{H}^{(2)}$ as $\mathcal{H}_s^{(2)}$. In the example below we will choose the $m = 2$ case, which (from [34]) gives the following RK $\tilde R_2$ for $\mathcal{H}^{(2)}$: Letting $P, P'$ be two points on the sphere, and $z = \cos\gamma(P,P')$, where $\gamma(P,P')$ is the angle between $P$ and $P'$,

$$\tilde R_2(P,P') = \frac{1}{2\pi}\Big[\frac{1}{2}\,q_2(z) - \frac{1}{6}\Big], \qquad (27)$$

where

$$q_2(z) = \ln\Big(1 + \sqrt{\tfrac{2}{1-z}}\Big)\Big[12\Big(\tfrac{1-z}{2}\Big)^2 - 4\,\tfrac{1-z}{2}\Big] - 12\Big(\tfrac{1-z}{2}\Big)^{3/2} + 6\,\tfrac{1-z}{2} + 1. \qquad (28)$$
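As an illustration (ours), the sketch below converts station coordinates to $z = \cos\gamma(P,P')$ and transcribes (27)-(28) as written above; it is only a direct transcription of those displays, and a production version would need to treat the $z \to 1$ (coincident point) limit with care.

```python
import numpy as np

def cos_angle(lat1, lon1, lat2, lon2):
    """z = cos of the angle between two points given in degrees of (latitude, longitude)."""
    p1, l1, p2, l2 = np.radians([lat1, lon1, lat2, lon2])
    return np.sin(p1) * np.sin(p2) + np.cos(p1) * np.cos(p2) * np.cos(l1 - l2)

def q2(z):
    W = (1.0 - z) / 2.0
    A = np.log1p(np.sqrt(2.0 / (1.0 - z)))                # ln(1 + sqrt(2/(1-z)))
    return A * (12.0 * W**2 - 4.0 * W) - 12.0 * W**1.5 + 6.0 * W + 1.0

def R2_tilde(z):
    """Transcription of (27), the m = 2 spherical spline RK."""
    z = np.clip(z, -1.0, 1.0 - 1e-12)                     # avoid the z = 1 singularity
    return (0.5 * q2(z) - 1.0 / 6.0) / (2.0 * np.pi)
```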
In the remainder of this section we will relabel $t = (t_1,t_2)$ as $t = (x,P)$. Our RKHS of historical global temperature functions is $\mathcal{H} = [[1^{(1)}]\oplus[\phi]\oplus\mathcal{H}_s^{(1)}]\otimes[[1^{(2)}]\oplus\mathcal{H}_s^{(2)}]$, a collection of functions $f(x,P)$ on $[0,1]\otimes S$, where $\mathcal{H}$ and $f$ have the corresponding decompositions given below:

$$\mathcal{H} = [1]\oplus[\phi]\oplus[\mathcal{H}_s^{(1)}]\oplus[\mathcal{H}_s^{(2)}]\oplus[[\phi]\otimes\mathcal{H}_s^{(2)}]\oplus[\mathcal{H}_s^{(1)}\otimes\mathcal{H}_s^{(2)}]$$
$$f(x,P) = C + d\phi(x) + f_1(x) + f_2(P) + \phi(x)f_{\phi,2}(P) + f_{12}(x,P);$$

that is, the terms are, respectively, the mean, the global trend, the time main effect, the space main effect, the trend by space effect, and the space-time interaction. For the $\mathcal{T}^{(1)} = [0,1]$ case, the components of the ANOVA decomposition satisfy the side conditions

$$0 = \int_0^1\phi(x)\,dx,$$
$$0 = \int_0^1 f_1(x)\,dx = (f_1(1) - f_1(0)) = \int_0^1 f_{12}(x,P)\,dx = f_{12}(1,P) - f_{12}(0,P),$$
$$0 = \int_S f_2(P)\,dP = \int_S f_{\phi,2}(P)\,dP = \int_S f_{12}(x,P)\,dP,$$

the equalities involving $f_{12}$ holding for all $P$ and for all $x$. Here $M = 2$ and $\mathcal{H}_0$ is the span of the two functions $\phi_0(x,P)\equiv 1$ and $\phi_1(x,P)\equiv\phi(x)$.
For the $\mathcal{T}^{(1)} = \{1,\dots,n_1\}$ case, the integral over $x$ is replaced by the sum over $1,\dots,n_1$. In this case,
$$\sum_{j'=1}^{n_1}\phi(j')\equiv 0, \qquad \sum_{j'=1}^{n_1}\tilde R_1(j,j')\equiv 0, \qquad \sum_{j'=1}^{n_1}\tilde R_1(j,j')\phi(j')\equiv 0, \qquad j = 1,\dots,n_1. \qquad (29)$$
This will lead to some algorithmic simplifications, described below, in the regular data case. There are $p = 4$ subspaces whose components will be penalized, with RK's given below:

$\beta$   subspace                                                RK
1   $\mathcal{H}_s^{(1)}$                                        $R^1(x,P;x',P') = \tilde R_1(x,x')$
2   $\mathcal{H}_s^{(2)}$                                        $R^2(x,P;x',P') = \tilde R_2(P,P')$
3   $[\phi]\otimes\mathcal{H}_s^{(2)}$                           $R^3(x,P;x',P') = \phi(x)\phi(x')\tilde R_2(P,P')$
4   $\mathcal{H}_s^{(1)}\otimes\mathcal{H}_s^{(2)}$              $R^4(x,P;x',P') = \tilde R_1(x,x')\tilde R_2(P,P')$
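For perfectly regular data ordered with the time index $j$ varying slowest (so that vectors split into $n_1$ blocks of length $n_2$, as in the next section), the four RK matrices have Kronecker form. A sketch (ours):

```python
import numpy as np

def rk_matrices(R1t, R2t, phi):
    """Build the n x n matrices R_1, ..., R_4 for the four penalized subspaces above.

    R1t : (n1 x n1) matrix of R~_1(x_j, x_j')
    R2t : (n2 x n2) matrix of R~_2(P_k, P_k')
    phi : length-n1 vector of phi(x_j)
    """
    n1, n2 = len(phi), R2t.shape[0]
    R1 = np.kron(R1t, np.ones((n2, n2)))          # R~_1(x, x')
    R2 = np.kron(np.ones((n1, n1)), R2t)          # R~_2(P, P')
    R3 = np.kron(np.outer(phi, phi), R2t)         # phi(x) phi(x') R~_2(P, P')
    R4 = np.kron(R1t, R2t)                        # R~_1(x, x') R~_2(P, P')
    return R1, R2, R3, R4
```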
5 Matrix decompositions with backfitting for regular data

In this section we describe an approach combining backfitting and matrix decompositions of size $n_1\times n_1$ and $n_2\times n_2$ which can be used for the global climate data and other examples when the data are perfectly regular, that is, there is an observation for each pair $(x_j,P_k)$, $j = 1,\dots,n_1$, $k = 1,\dots,n_2$, where $j$ indexes time and $k$ indexes station. (Note: this does not mean that the data is regular on the sphere, just that each station is reporting at each time.) In the subsequent section we will show how data imputation may (safely) be used when there are (a small number of) missing data points. The results in this and the next section will appear in [31]. The results will become clear after we establish some notation. As before, let

$$z_\beta \equiv \Big(y - \sum_{\gamma\neq\beta}\tilde f_\gamma\Big). \qquad (30)$$

Here the vectors in (30) are all of dimension $n = n_1 n_2$; unless otherwise noted, we consider them partitioned into $n_1$ blocks of dimension $n_2$ and let $z$ be a generic vector of this form, with $z(j,k)$ the $k$th entry in the $j$th block of $z$. Let $P_o^{(1)}z$ be the $n_1$-vector with $\frac{1}{n_2}\sum_{k=1}^{n_2}z(j,k)$ in the $j$th position, $j = 1,\dots,n_1$, and $P_o^{(2)}z$ be the $n_2$-vector with $\frac{1}{n_1}\sum_{j=1}^{n_1}z(j,k)$ in the $k$th position, $k = 1,\dots,n_2$. Finally, let $z_j$ be the $n_2$-vector in the $j$th block of $z$. In the first backfitting equation ($\beta = 0$ in (14)), $\tilde f_0$ is the result of least squares regression of $z_0$ onto the columns of $S$. To solve the other backfitting equations, write the remaining equations in (14) as

$$(R_\beta + \lambda_\beta I)\tilde f_\beta = R_\beta z_\beta, \quad \beta = 1,\dots,p. \qquad (31)$$
Let $\tilde R_1$ be the $n_1\times n_1$ matrix with $jj'$th entry $\tilde R_1(x_j,x_{j'})$, $j,j' = 1,\dots,n_1$, and $\tilde R_2$ the $n_2\times n_2$ matrix with $kk'$th entry $\tilde R_2(P_k,P_{k'})$, $k,k' = 1,\dots,n_2$. Then, examining the $j$th block of (14) for $\beta = 2$ gives

$$\tilde R_2 P_o^{(2)}\tilde f_2 + \frac{\lambda_2}{n_1}\tilde f_{2j} = \tilde R_2 P_o^{(2)}z_2, \quad j = 1,\dots,n_1, \qquad (32)$$

which entails that $\tilde f_{21} = \cdots = \tilde f_{2n_1} = P_o^{(2)}\tilde f_2$, say. Defining the marginal smoother matrix $\tilde S_2(\lambda) = \tilde R_2(\tilde R_2 + \lambda I)^{-1}$ results in

$$P_o^{(2)}\tilde f_2 = \tilde S_2(\lambda_2/n_1)\,P_o^{(2)}z_2. \qquad (33)$$

A similar argument, interchanging the roles of $j$ and $k$, gives, for $\beta = 1$,

$$P_o^{(1)}\tilde f_1 = \tilde S_1(\lambda_1/n_2)\,P_o^{(1)}z_1, \qquad (34)$$

where $\tilde S_1(\lambda) = \tilde R_1(\tilde R_1 + \lambda I)^{-1}$. For $\beta = 3$ the $j$th block of (14) becomes

$$\sum_{j'=1}^{n_1}\phi(x_j)\phi(x_{j'})\tilde R_2\tilde f_{3j'} + \lambda_3\tilde f_{3j} = \sum_{j'=1}^{n_1}\phi(x_j)\phi(x_{j'})\tilde R_2 z_{3j'}, \quad j = 1,\dots,n_1. \qquad (35)$$

It can be seen that $\tilde f_{3j} = \phi(x_j)v$, $j = 1,\dots,n_1$, for some $n_2$-vector $v$. Letting $P_1^{(2)}z$ be the $n_2$-vector with $\sum_{j=1}^{n_1}\phi(x_j)z(j,k)$ in the $k$th position, and substituting this into (35), results in

$$\phi(x_j)\sum_{j'=1}^{n_1}\phi^2(x_{j'})\tilde R_2 v + \lambda_3\phi(x_j)v = \phi(x_j)\tilde R_2 P_1^{(2)}z_3. \qquad (36)$$

Therefore

$$\Big(\sum_{j'=1}^{n_1}\phi^2(x_{j'})\tilde R_2 + \lambda_3 I\Big)v = \tilde R_2 P_1^{(2)}z_3, \qquad (37)$$

which gives

$$v = \Big(\sum_{j'=1}^{n_1}\phi^2(x_{j'})\tilde R_2 + \lambda_3 I\Big)^{-1}\tilde R_2 P_1^{(2)}z_3. \qquad (38)$$

Letting $\sum_{j'=1}^{n_1}\phi^2(x_{j'}) = \kappa_2$ finally gives

$$\tilde f_{3j} = \big(\phi(x_j)/\kappa_2\big)\,\tilde S_2(\lambda_3/\kappa_2)\,P_1^{(2)}z_3. \qquad (39)$$

Considering $\beta = 4$,

$$\tilde f_4 = S_4(\lambda_4)z_4, \qquad (40)$$
where $S_4(\lambda) = R_4(R_4 + \lambda I)^{-1}$ with $R_4 = \tilde R_1\otimes\tilde R_2$. The $n_1\times n_1$ and $n_2\times n_2$ smoother matrices $\tilde S_1(\lambda)$ and $\tilde S_2(\lambda)$ appear repeatedly with varying values of $\lambda$. The approach we have taken in ongoing work, for $\max(n_1,n_2)$ not too large, is to calculate the eigenvalue-eigenvector decompositions of $\tilde R_1$ and $\tilde R_2$. Letting $\tilde R = UDU'$, $\tilde S(\lambda) = UD(D + \lambda I)^{-1}U'$, so that $\tilde S(\lambda)$ can be computed for varying values of $\lambda$. The eigenvalues and eigenvectors of $R_4$ are obtained as the tensor products of the eigenvalues and eigenvectors of $\tilde R_1$ and $\tilde R_2$, and $S_4(\lambda)$ can then be computed similarly. In general, if some components can be combined to reduce $p$, then the number of backfitting iterations required is likely to be smaller. In the present example, set $\tilde f_{1+4} = \tilde f_1 + \tilde f_4$ and $z_{1+4} = y - (\tilde f_2 + \tilde f_3)$ and let $S_{1+4}(\lambda_1,\lambda_4) = R_{1+4}(R_{1+4} + I)^{-1}$, where $R_{1+4} = \theta_1 R_1 + \theta_4 R_4 = \tilde R_1\otimes[\theta_1\mathbf{1}\mathbf{1}' + \theta_4\tilde R_2]$. Then the $\beta = 1$ and $\beta = 4$ backfitting steps can be replaced by the $1+4$ step, $\tilde f_{1+4} = S_{1+4}(\lambda_1,\lambda_4)z_{1+4}$. This may require repeated matrix decompositions of, say, $[\theta_1\mathbf{1}\mathbf{1}' + \theta_4\tilde R_2]$ as $\lambda_1,\lambda_4$ are varied, but may still represent a speedup. The speed of convergence of the backfitting algorithm generally depends on the magnitudes of the product matrices $S_\beta S_\gamma$, becoming faster as these products become `smaller'. If the products were all 0 for $\beta\neq\gamma$, then the backfitting iteration would converge in one step. In the example with $\mathcal{T}^{(1)} = \{1,\dots,n_1\}$, $\tilde R_1 = (L'L)^\dagger$, the conditions (29) lead to all of these products equal to 0 except $S_0S_2$, $S_0S_3$ and $S_1S_4$. In general the sizes of these product matrices depend on the magnitudes of the $\lambda_\beta$, decreasing as the $\lambda_\beta$ increase. The most efficient method for computing the $\tilde f_\beta$ for very large, perfectly regular data sets under various circumstances is under study. See [2], [12], Chapter 10.
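A sketch (ours) of the reuse of the eigendecompositions just described: with $\tilde R = UDU'$, the marginal smoother $\tilde S(\lambda)$ can be applied for many values of $\lambda$ at the cost of a single decomposition, and $S_4(\lambda_4)$ can be applied through the tensor structure of $R_4 = \tilde R_1\otimes\tilde R_2$ without ever forming an $n\times n$ matrix.

```python
import numpy as np

def eig_decompose(R_tilde):
    """One-time eigendecomposition R~ = U D U'."""
    D, U = np.linalg.eigh(R_tilde)
    return U, D

def apply_marginal_smoother(U, D, lam, v):
    """Apply S~(lam) = U D (D + lam I)^{-1} U' to a vector v."""
    return U @ (D / (D + lam) * (U.T @ v))

def apply_S4(U1, D1, U2, D2, lam4, z):
    """Apply S_4(lam4) = R_4 (R_4 + lam4 I)^{-1}, R_4 = R~_1 kron R~_2, to a vector z
    of length n1*n2 stored with the time index varying slowest."""
    Z = z.reshape(len(D1), len(D2))
    W = U1.T @ Z @ U2                      # rotate into the joint eigenbasis
    D4 = np.outer(D1, D2)                  # eigenvalues of R_4
    return (U1 @ (W * D4 / (D4 + lam4)) @ U2.T).ravel()
```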
6 Missing Data Imputation

In practice perfectly regular observational data, at least for climate data, is the exception rather than the rule. Unfortunately, a few data points missing from a regular set $\{x_j,P_k\}$, $j = 1,\dots,n_1$, $k = 1,\dots,n_2$, means that both the outer and inner loop backfitting equations would not all involve the same smoother matrices $\tilde S_1, \tilde S_2$, and the ability to use a common eigenvalue-eigenvector decomposition appears to be lost. We show how to get around this with an `imputation' loop. To demonstrate that the imputation loop is legitimate, we first need a slight variation of the leaving-out-one lemma in Craven and Wahba [7].
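As a concrete rendering (ours) of the imputation loop just described: missing entries are repeatedly filled in with the current fitted values, and the regular-grid machinery of the previous section is applied to the completed data; the lemma below shows that the fit based on the observed data alone is a fixed point of this iteration. The helper name `fit_regular` is hypothetical, standing for any routine (such as a converged backfit for perfectly regular data) that maps a complete data vector to its vector of fitted values.

```python
import numpy as np

def fit_with_imputation(fit_regular, y_obs, missing_mask, n_sweeps=20):
    """Sketch: alternate between imputing missing values and refitting on the full grid.

    fit_regular  : callable mapping a complete data vector on the regular grid to the
                   vector of fitted values at the grid points
    y_obs        : data vector with arbitrary placeholder entries at the missing positions
    missing_mask : boolean vector, True where the observation is missing
    """
    y_work = y_obs.copy()
    y_work[missing_mask] = np.mean(y_obs[~missing_mask])     # crude starting imputation
    for _ in range(n_sweeps):
        f_hat = fit_regular(y_work)
        y_work[missing_mask] = f_hat[missing_mask]           # impute with the current fit
    return fit_regular(y_work)
```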
Lemma 1 (The Leaving-Out-$K$ Lemma)
Let $\mathcal{H}$ be an RKHS with subspace $\mathcal{H}_0$ of dimension $M$ as before, and for $f\in\mathcal{H}$ let $\|Pf\|^2 = \sum_{\beta=1}^p\theta_\beta^{-1}\|P^\beta f\|^2$. Let $f^{[K]}$ be the solution to the variational problem:
Find $f\in\mathcal{H}$ to minimize

$$\sum_{\substack{i=1\\ i\notin S_K}}^{n}(y_i - f(t(i)))^2 + \lambda\|Pf\|^2, \qquad (41)$$

where $S_K = \{i_1,\dots,i_K\}$ is a subset of $1,\dots,n$ with the property that (41) has a unique minimizer, and let $y_i$, $i\in S_K$, be `imputed' values for the `missing' data, imputed as $y_i = f^{[K]}(t(i))$, $i\in S_K$. Then the solution to the problem: Find $f\in\mathcal{H}$ to minimize

$$\sum_{\substack{i=1\\ i\notin S_K}}^{n}(y_i - f(t(i)))^2 + \sum_{i\in S_K}(y_i - f(t(i)))^2 + \lambda\|Pf\|^2 \qquad (42)$$

is $f^{[K]}$.
Proof. Let $h = f^{[K]}$ and let $f$ be any element in $\mathcal{H}$ with $f\neq f^{[K]}$. Then:

$$\sum_{\substack{i=1\\ i\notin S_K}}^{n}(y_i - h(t(i)))^2 + \sum_{i\in S_K}(y_i - h(t(i)))^2 + \lambda\|Ph\|^2 =$$