Baltzer Journals

September 2, 1998

Generalization Bounds for Function Approximation from Scattered Noisy Data

Partha Niyogi (1)* and Federico Girosi (2)

(1) Artificial Intelligence Laboratory, 545 Technology Square, Massachusetts Institute of Technology, Cambridge, MA 02141, USA. E-mail: [email protected]

(2) Center for Biological and Computational Learning, 45 Carleton Street, E25-201, Massachusetts Institute of Technology, Cambridge, MA 02141, USA. E-mail: [email protected]

* Current address: Bell Labs, Lucent Technologies, Murray Hill, NJ 07974.

We consider the problem of approximating functions from scattered data using linear superpositions of non-linearly parameterized functions. We show how the total error (generalization error) can be decomposed into two parts: an approximation part that is due to the finite number of parameters of the approximation scheme used, and an estimation part that is due to the finite number of data available. We bound each of these two parts under certain assumptions and prove a general bound for a class of approximation schemes that includes radial basis functions and multilayer perceptrons.

1 Introduction

In this paper we investigate the problem of providing error bounds for the approximation of an unknown function from scattered, noisy data. This problem has particular relevance in the field of machine learning, where the unknown function represents the task that has to be learned and the scattered data represent the examples of this task. An obvious quantity of interest for us is the generalization error (a measure of how much the result of the approximation scheme differs from the unknown function), typically studied as a function of the number of data points. Since the data are randomly generated and noisy, the analysis of the generalization error necessarily involves statistical considerations in addition to the traditional approximation-theoretic ones.

In this paper, we show how the total generalization error can be decomposed into two parts: (1) an approximation component that exists due to the finite dimensionality of the manifold to which the approximating function belongs, and (2) an estimation component that is due to the finiteness of the randomly drawn data. Each of these two components has been investigated in separate contexts in the past. In classical approximation theory, a well known quantity is the degree of approximation, which depends only on the nature of the function being approximated and the class of approximating functions. As the dimensionality of the approximating family (parameterized in some fashion) increases, the approximant converges to the true function. In statistics, one typically assumes that the true function belongs to the same family as the approximating function, but is now known only through a randomly drawn data set. Estimates of this function are constructed from the data, and one typically studies the convergence of this estimate to the true function as the amount of data goes to infinity. In this paper, we discuss how approximating from scattered data requires us to deal with both aspects of the problem, and we provide explicit bounds for a class of approximation techniques. In particular, we show how one needs to grow the size of the approximating family as a function of the data in order to converge to the true target function. We will discuss these issues in greater detail as the paper progresses.

Before proceeding any further, it is worthwhile to introduce some terminology from machine learning that we will use at convenient points in the paper. The function (f) that has to be approximated is called the target function. The class F to which the target belongs is called the target class. The approximant itself is called the hypothesis function, and the class (H) to which it belongs is called the hypothesis class. The learner is exposed to noisy data (examples, or (x, y) pairs) from which it constructs an approximant (hypothesis). From this perspective, the learner is simply an algorithm that maps data to hypotheses, i.e., an approximation scheme. The problem of learning from examples is the problem of approximating the target function from randomly drawn (x, y) pairs using the approximating family H.

In this paper, we focus on a class of approximation problems where the approximating family (H) can be represented as a linear combination of functions drawn from some fixed finite-dimensional class 𝒢 whose elements can typically be represented by a finite number of parameters. Of course, the class H is infinite dimensional, and such approximating families include radial basis functions with free centers and multilayer perceptrons, two kinds of "neural network" learning schemes that have received much attention in the machine learning community. We are able to derive generalization error bounds for such schemes and focus in particular on cases where the rate of approximation can be shown to be independent of the number of input variables. We are motivated to look at such cases by the fact that machine learning problems (in computer vision, for example) typically involve a large number of variables, where the curse of dimensionality becomes a serious consideration.

The rest of the paper proceeds as follows: in section 2, we describe the approximation problem formally and discuss how the generalization error can be usefully viewed as composed of two parts. We present our main results in section 3, provide formal proofs in section 4, and conclude with some remarks and open problems in section 5.

2 Statement of the Problem

2.1 The General Problem

We discuss various components of the problem of approximating an unknown (target) function from scattered data (examples).

The Target Function

Let X and Y be arbitrary subsets of R^k and R respectively. Let there be a probability distribution P on the space X × Y according to which (x, y) pairs (labeled examples) are drawn in i.i.d. fashion and presented to the learner. In a least squares setting, the learner would ideally like to minimize the "mean squared error", or expected risk, over the class F. In other words, the learner would like to obtain:

$$f_0 = \arg\min_{f \in F} I[f] = \arg\min_{f \in F} E[(y - f(x))^2]$$

Here the expected risk I[f] is the functional E[(y - f(x))^2], where the expectation is taken with respect to the distribution P. The solution f_0 that minimizes the mean squared error (expected risk) is referred to as the regression function. It can easily be shown that

$$f_0 = E[y \,|\, x]. \quad (1)$$

This f_0 ∈ F represents the target function that the learner wishes to approximate with high accuracy and high confidence. Note that we have implicitly assumed that E[y|x] ∈ F.

The Learner's Estimate

The learner constructs an estimate of the target function f_0 by searching within a class of parameterized estimators. In general, there are two sources of error for the learner. First, it uses an estimator with a finite number of parameters. Second, it does not have access to the underlying distribution and therefore cannot compute I[f]. Instead, the learner samples the distribution, obtaining l data points, and constructs an empirical least squares estimate as follows. Let (x_i, y_i), i = 1, ..., l, be a data set of l sample points obtained by sampling X × Y in i.i.d. fashion according to the probability distribution P. The empirical risk I_emp[f] for any function f is given by:

$$I_{emp}[f] \equiv \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2.$$

Finally, let us denote by H_n a generic subset of F whose elements are parametrized by a number of parameters proportional to n. Moreover, let us assume that the sets H_n form a nested family, that is,

$$H_1 \subseteq H_2 \subseteq \dots \subseteq H_n \subseteq \dots \subseteq H.$$

The estimator is chosen from this class by minimizing the empirical risk over the set H_n. In other words, the learner's estimate is given by

$$\hat{f}_{n,l} \equiv \arg\min_{f \in H_n} I_{emp}[f]. \quad (2)$$

For example, H_n could be the set of polynomials in one variable of degree n - 1, Radial Basis Functions with n centers, multilayer perceptrons with n sigmoidal hidden units, multilayer perceptrons with n threshold units, and so on. In particular, if H_n is the class of functions which can be represented as f = Σ_{α=1}^{n} c_α G(x; w_α), then eq. (2) can be written as

$$\hat{f}_{n,l} \equiv \arg\min_{c_\alpha,\, w_\alpha} I_{emp}[f].$$
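To make eq. (2) concrete, the following minimal sketch (ours, not the authors' code) fits f = Σ_i c_i G(x; t_i) with Gaussian basis functions and free centers by plain gradient descent on the empirical risk; the function names, step sizes, and toy data are illustrative assumptions, and any other minimizer of I_emp would serve equally well for the analysis that follows.

```python
# Minimal sketch (not the authors' code): empirical risk minimization over
# H_n = { f(x) = sum_i c_i G(x; t_i) } with Gaussian G(x; t) = exp(-||x - t||^2 / (2 s^2)).
import numpy as np

def gaussian_basis(X, centers, s=1.0):
    # l x n matrix with entries G(x_j; t_i)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * s ** 2))

def fit_erm(X, y, n_centers, s=1.0, steps=2000, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    t = X[rng.choice(len(X), size=n_centers, replace=False)].copy()  # free centers t_i
    c = np.zeros(n_centers)                                          # coefficients c_i
    for _ in range(steps):
        G = gaussian_basis(X, t, s)
        r = G @ c - y                                 # residuals f(x_j) - y_j
        grad_c = 2.0 * G.T @ r / len(X)               # d I_emp / d c_i
        diff = X[:, None, :] - t[None, :, :]          # (x_j - t_i)
        grad_t = 2.0 * ((r[:, None] * c[None, :] * G / s ** 2)[..., None] * diff).sum(0) / len(X)
        c -= lr * grad_c                              # gradient step on I_emp
        t -= lr * grad_t
    return c, t

# Toy usage: l noisy samples of a one-dimensional target.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
c, t = fit_erm(X, y, n_centers=10)
```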

The distance (error) between the target f_0 and the learner's estimate f̂_{n,l} is due to (i) the finite number of parameters (n) and (ii) the finite number of data points (l). This paper bounds this distance under some specific conditions.

2.2 Approximation and Estimation Errors

At this stage it might be worthwhile to review and remark on some general features of the problem of learning from examples. Let us remember that our goal is to minimize the expected risk I[f] over the set F. If we were to use a finite number of parameters, then it is clear that the best we could possibly do is to minimize our functional over the set H_n, yielding the estimator f_n:

$$f_n \equiv \arg\min_{f \in H_n} I[f].$$

However, not only is the parametrization limited, but the data are also finite, and we can only minimize the empirical risk I_emp, obtaining as our final estimate the function f̂_{n,l}.

Our goal is to bound the distance of f̂_{n,l} (that is, our solution) from f_0, the "optimal" solution. If we choose to measure the distance in the L_2(P) metric, the quantity that we need to bound, which we will call the generalization error, is:

$$E[(f_0 - \hat{f}_{n,l})^2] = \int_X dx\, P(x)\, (f_0(x) - \hat{f}_{n,l}(x))^2 = \| f_0 - \hat{f}_{n,l} \|^2_{L_2(P)}$$

There are two main factors that contribute to the generalization error, and we are going to analyze them separately for the moment.

1. A first cause of error comes from the fact that we are trying to approximate an infinite-dimensional object, the regression function f_0 ∈ F, with a finite number of parameters. We call this error the approximation error, and we measure it by the quantity E[(f_0 - f_n)^2], that is, the L_2(P) distance between the best function in H_n and the regression function. The approximation error can be expressed in terms of the expected risk, using the decomposition (11) of Proposition 2, as

$$E[(f_0 - f_n)^2] = I[f_n] - I[f_0]. \quad (3)$$

Notice that the approximation error does not depend on the data set D_l, but depends only on the approximating power of the class H_n. In the following we will always assume that it is possible to bound the approximation error as follows:

$$E[(f_0 - f_n)^2] \le \varepsilon(n)$$

where ε(n) is a function that goes to zero as n goes to infinity if H = ∪_{n=1}^{∞} H_n is dense in F. In other words, as the number n of parameters gets larger, the representation capacity of H_n increases, allowing a better and better approximation of the regression function f_0. The approximation error has been studied extensively in approximation theory, for many different choices of H_n and F. More recently, several results have been proved for a class of hypothesis spaces called multilayer perceptrons, which seem to be important in practical applications [8, 18, 2, 3, 13, 28, 27].

2. Another source of error comes from the fact that, due to finite data, we minimize the empirical risk I_emp[f] and obtain f̂_{n,l}, rather than minimizing the expected risk I[f] and obtaining f_n. As the number of data points goes to infinity, we hope that f̂_{n,l} will converge to f_n, and convergence will take place if the empirical risk converges to the expected risk uniformly in probability [41]. The quantity

$$|I_{emp}[f] - I[f]|$$

is called the estimation error, and conditions for the estimation error to converge to zero uniformly in probability have been investigated by [42, 43, 41, 44], [36], [12], and [16]. Under a variety of different hypotheses it is possible to prove that, with probability 1 - δ, a bound of this form is valid:

$$|I_{emp}[f] - I[f]| \le \omega(l, n, \delta) \quad \forall f \in H_n. \quad (4)$$

The specific form of ω depends on the setting of the problem, but, in general, we expect ω(l, n, δ) to be a decreasing function of l. However, we also expect it to be an increasing function of n. The reason is that, if the number of parameters is large, then the expected risk is a very complex object, and more data will be needed to estimate it. Therefore, keeping the number of data points fixed and increasing the number of parameters will result, on average, in a larger distance between the expected risk and the empirical risk. The approximation and estimation errors are clearly two components of the generalization error, and it is interesting to notice, as shown in the next proposition, that the generalization error can be bounded by the sum of the two:

Proposition 1

The following inequality holds:

$$\| f_0 - \hat{f}_{n,l} \|^2_{L_2(P)} \le \varepsilon(n) + 2\,\omega(l, n, \delta). \quad (5)$$

Proof: Using the decomposition of the expected risk (11), the generalization error can be written as:

$$\| f_0 - \hat{f}_{n,l} \|^2_{L_2(P)} = E[(f_0 - \hat{f}_{n,l})^2] = I[\hat{f}_{n,l}] - I[f_0]. \quad (6)$$

A natural way of bounding the generalization error is as follows:

$$E[(f_0 - \hat{f}_{n,l})^2] \le |I[f_n] - I[f_0]| + |I[f_n] - I[\hat{f}_{n,l}]|. \quad (7)$$

In the first term on the right hand side of the previous inequality we recognize the approximation error (3). If a bound of the form (4) is known, it is simple to show that the second term can be bounded as

$$|I[f_n] - I[\hat{f}_{n,l}]| \le 2\,\omega(l, n, \delta).$$

To see this, let us assume that, with probability 1 - δ, a uniform bound has been established:

$$|I_{emp}[f] - I[f]| \le \omega(l, n, \delta) \quad \forall f \in H_n. \quad (8)$$

We want to prove that the following inequality also holds:

$$|I[f_n] - I[\hat{f}_{n,l}]| \le 2\,\omega(l, n, \delta). \quad (9)$$

This is easily established by using the fact that the bound is uniform (so that it holds for both f_n and f̂_{n,l}) and by noticing that, by definition of f_n and f̂_{n,l}, we have:

$$I[f_n] \le I[\hat{f}_{n,l}], \qquad I_{emp}[\hat{f}_{n,l}] \le I_{emp}[f_n]. \quad (10)$$

In fact, combining inequalities (8) and (10) we have:

$$I[\hat{f}_{n,l}] \le I_{emp}[\hat{f}_{n,l}] + \omega \le I_{emp}[f_n] + \omega.$$

However, we also have, by (8), that

$$I_{emp}[f_n] \le I[f_n] + \omega,$$

and combining the last two inequalities one obtains

$$I[\hat{f}_{n,l}] \le I_{emp}[f_n] + \omega \le I[f_n] + 2\,\omega,$$

and the result follows. Thus proposition 1 is proved. □

Thus we see that the generalization error has two components: one, bounded by ε(n), is related to the approximation power of the class of functions {H_n}, and is studied in the framework of approximation theory. The second, bounded by ω(l, n, δ), is related to the difficulty of estimating the parameters given finite data, and is studied in the framework of statistics. Consequently, results from both these fields are needed in order to provide an understanding of the problem of learning from examples. Figure 1 also shows a picture of the problem.
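The interplay between the two components can be illustrated numerically. In the toy sketch below (ours, not an experiment from the paper), the empirical risk keeps decreasing as n grows, while the true L_2(P) error, estimated on a large fresh sample, stops improving and eventually degrades once n becomes large relative to l. The target function, the fixed-center least squares fit, and all constants are assumptions made purely for illustration.

```python
# Toy illustration (ours) of the approximation/estimation trade-off for fixed l.
import numpy as np

rng = np.random.default_rng(0)
f0 = lambda x: np.sin(3 * x)                 # assumed regression function
l, noise = 50, 0.2
X = rng.uniform(-1, 1, size=(l, 1))
y = f0(X[:, 0]) + noise * rng.normal(size=l)
X_test = rng.uniform(-1, 1, size=(5000, 1))  # fresh sample to approximate the L2(P) error

def design(X, centers, s=0.15):
    # Gaussian features with fixed, evenly spaced centers (linear-in-coefficients model)
    return np.exp(-(X - centers[None, :]) ** 2 / (2 * s ** 2))

for n in (2, 5, 10, 25, 50):
    centers = np.linspace(-1, 1, n)
    G, G_test = design(X, centers), design(X_test, centers)
    c, *_ = np.linalg.lstsq(G, y, rcond=None)                 # minimize I_emp over coefficients
    emp = np.mean((G @ c - y) ** 2)                           # empirical risk
    true = np.mean((G_test @ c - f0(X_test[:, 0])) ** 2)      # ~ ||f0 - f_hat||^2_{L2(P)}
    print(f"n={n:3d}  I_emp={emp:.4f}  true_L2_error={true:.4f}")
```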

[Figure 1 appears here; the diagram is not reproduced.]

Figure 1: This figure shows a picture of the problem. The outermost circle represents the target space F. As a subset of F we find the hypothesis space H_n. f_0 is an arbitrary target function in F, f_n is the closest element of H_n, and f̂_{n,l} is the element of H_n which the learner hypothesizes on the basis of the data.

Proposition 2

The function that minimizes the expected risk

$$I[f] = \int_{X \times Y} P(x, y)\, dx\, dy\, (y - f(x))^2$$

is the regression function defined in eq. (1) as

$$f_0 = E[y \,|\, x].$$

Furthermore,

$$I[f] = \int_X dx\, P(x)\, (f_0(x) - f(x))^2 + I[f_0].$$

Proof: It is sufficient to add and subtract the regression function in the definition of the expected risk:

$$I[f] = \int_{X \times Y} dx\, dy\, P(x, y)\, (y - f_0(x) + f_0(x) - f(x))^2 =$$

$$= \int_{X \times Y} dx\, dy\, P(x, y)\, (y - f_0(x))^2 + \int_{X \times Y} dx\, dy\, P(x, y)\, (f_0(x) - f(x))^2 + 2 \int_{X \times Y} dx\, dy\, P(x, y)\, (y - f_0(x))(f_0(x) - f(x)).$$

By definition of the regression function f_0(x), the cross product in the last equation is easily seen to be zero, and therefore

$$I[f] = \int_X dx\, P(x)\, (f_0(x) - f(x))^2 + I[f_0]. \quad (11)$$

Since the last term of I[f] does not depend on f, the minimum is achieved when the first term is minimum, that is, when f(x) = f_0(x). □

Remark. In the case in which the data come from randomly sampling a function f in the presence of additive noise ε with probability distribution P_ε(ε) and zero mean, we have P(y|x) = P_ε(y - f(x)), and then

$$I[f_0] = \int_{X \times Y} dx\, dy\, P(x, y)\, (y - f_0(x))^2 = \int_X dx\, P(x) \int_Y dy\, (y - f(x))^2\, P_\epsilon(y - f(x)) = \int_X dx\, P(x) \int \epsilon^2\, P_\epsilon(\epsilon)\, d\epsilon = \sigma_\epsilon^2 \quad (12)-(14)$$

where σ_ε² is the variance of the noise. When data are noisy, therefore, even in the most favourable case we cannot expect the expected risk to be smaller than the variance of the noise.

2.3 Previous Work

Approximation error bounds have been extensively studied in approximation theory, for many different hypothesis and target spaces. Among recent results, it is noteworthy to mention the work of Mhaskar and Micchelli [25], who derived a number of results that apply to both Radial Basis Functions and Multilayer Perceptrons and to many different target spaces. Their results include some dimension-independent bounds, similar in spirit to the result we present in this paper. Dimension-independent bounds for approximation schemes of the Multilayer Perceptron type have been derived by a number of authors [20, 3, 7, 29, 19, 26, 32], while for Radial Basis Functions results similar to the ones presented in this paper were derived in [15].

Results which are specific to the estimation error are well known in statistics and in computational learning theory, and have been extensively studied in the work of Vapnik [42, 43, 44], Pollard [36] and Haussler [17]. Our work is based on the work of Pollard and on concepts like capacity and pseudo-dimension, rather than on the work of Vapnik, which concentrates around the concept of VC-dimension. Since the theory developed by Vapnik provides necessary and sufficient conditions for the uniform convergence in probability of the empirical risk to the expected risk, we suspect that better bounds than the ones we compute could be obtained using that framework. However, in order to use those results one needs to compute the VC-dimension of the hypothesis classes we are interested in, which is still an open problem. Barron [4] obtained estimation bounds for Multilayer Perceptrons which are better than ours, using a framework which differs from both Pollard's and Vapnik's. The main reason for his improved rate of convergence is that for the case he considers Bernstein's inequality can be used, while the starting point of many results of Pollard and Vapnik is the Hoeffding inequality.

3 Results

We have discussed how the total generalization error can be decomposed into an approximation component ε(n) and an estimation component ω(n, l, δ). In this section we present a general result for each of these two kinds of error. The conditions under which these two bounds hold are compatible, allowing us to use proposition 1 to derive bounds on the total generalization error. Our general results can be applied to some well known approximation techniques for scattered data. We show such applications for the example cases discussed later.

Theorem 3

Let P(x) be a probability measure and let G(x, t) be a function defined over R^k × R^d such that

$$\|G\|^2_{L_2(P)} = \int_{R^k} dx\, P(x)\, G^2(x, t) \le b^2 \quad \forall t \in R^d$$

for some b > 0, and let f be a function defined over R^k such that

$$f(x) = \int_{R^d} dt\, \lambda(t)\, G(x, t) \quad (15)$$

where λ(t) ∈ L_1(R^d) and ||λ||_{L_1} = 1. Then it is possible to find n Boolean parameters {c_i}_{i=1}^{n} with values in {-1, 1} and n vectors {t_i}_{i=1}^{n} such that:

$$\left\| f - \frac{1}{n} \sum_{i=1}^{n} c_i\, G(x, t_i) \right\|^2_{L_2(P)} \le \frac{b^2 - \|f\|^2_{L_2(P)}}{n}. \quad (16)$$

Theorem 3 gives a bound on the approximation error ε(n) when the approximating family happens to be of the form H_n = { Σ_{i=1}^{n} c_i G(x, t_i) } and the target space F is the set of functions that have an integral representation of the type (15). Notice that the bound (16), which is similar in spirit to the result of A. Barron on multilayer perceptrons [2, 3], is interesting because the rate of convergence does not depend on the dimension d of the input space. It is known, from the theory of linear and nonlinear widths [40, 33, 23, 24, 10, 9, 11, 27], that if the function to be approximated has d variables and a degree of smoothness s, we should not expect to find an approximation technique whose approximation error goes to zero faster than O(n^{-s/d}). Here "degree of smoothness" is a measure of how constrained the class of functions under consideration is, for example the number of derivatives that are uniformly bounded, or the number of derivatives that are integrable or square integrable. Therefore, from classical approximation theory, we expect that, unless certain constraints are imposed on the class of functions to be approximated, the rate of convergence will dramatically slow down as the number of dimensions increases, showing the phenomenon known as "the curse of dimensionality" [5]. For the target space F of functions of the form (15), it is obviously the presence of the signed measure λ in the integral representation that makes the rate O(1/√n), but it is not entirely obvious what its implications are for the smoothness of the functions in F, because that depends on the choice of the basis function G. Some examples will be given in section 3.1. In general, however, we expect the smoothness to increase with increasing dimension. We now state a bound on the estimation error.

Theorem 4

Let 𝒢 be a class of real functions defined on R^k with the following properties: (1) for all g ∈ 𝒢, ||g||_{L_∞} ≤ V; (2) elements of 𝒢 can be expressed as a composition of a fixed monotonic function h with functions belonging to a finite-dimensional vector space, i.e., 𝒢 ⊆ {g = h ∘ f | f ∈ G}, where G is a vector space of dimensionality d. Consider the following hypothesis class:

$$H_n = \left\{ f = \sum_{i=1}^{n} \alpha_i G_i \;\Big|\; \sum_{i=1}^{n} |\alpha_i| \le M \text{ and } G_i \in \mathcal{G} \right\}$$

and let P be a probability distribution on R^k × [-Q, Q] (Q being an arbitrary constant) according to which (x, y) pairs are drawn independently at random. Then, if

$$f_n = \arg\min_{h \in H_n} E[(y - h(x))^2]$$

and

$$\hat{f}_{n,l} = \arg\min_{h \in H_n} \sum_{i=1}^{l} (y_i - h(x_i))^2,$$

the following holds with probability at least 1 - δ:

$$|E[(y - f_n)^2] - E[(y - \hat{f}_{n,l})^2]| = |I[f_n] - I[\hat{f}_{n,l}]| \le O\!\left( \left[ \frac{nd \log(nl) - \log \delta}{l} \right]^{1/2} \right).$$

Here the constant in the O term depends only upon Q, M, and V.

Theorem 4 is a bound on ω(n, l, δ) when the class H_n can be expressed as a sum of basis functions drawn from 𝒢. The result can be generalized in many ways, e.g., to cover a variety of metrics, or to cover function classes that can be represented as successive compositions, and so on. We have presented the result in this form partly for simplicity, and partly to highlight the compatibility of the conditions under which the approximation and estimation bounds hold.
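Up to the unspecified constant, the order of the bound in Theorem 4 is straightforward to evaluate; the small sketch below (ours, constants omitted) does just that, making visible that the bound decreases with l and increases with n.

```python
# Order of the estimation bound of Theorem 4, constants omitted (illustration only).
import math

def estimation_bound_order(l, n, d, delta):
    return math.sqrt((n * d * math.log(n * l) - math.log(delta)) / l)

# The bound grows with n (more parameters) and shrinks with l (more data):
for l in (10**3, 10**4, 10**5):
    print([round(estimation_bound_order(l, n, d=5, delta=0.05), 3) for n in (5, 20, 80)])
```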

Note that if we let 𝒢 be some parameterized class of bounded functions, i.e.,

$$\mathcal{G} = \{ G(x, t) \;|\; t \in R^d \text{ and } |G(x, t)| \le V \} \quad (17)$$

such that 𝒢 also satisfies the conditions of theorem 4, then we can combine the approximation and estimation errors appropriately to prove the following theorem, a bound on the total generalization error.

Theorem 5

For a class 𝒢 as in eq. (17), let

$$H_n = \left\{ f = \sum_{i=1}^{n} \alpha_i G_i \;\Big|\; \sum_{i=1}^{n} |\alpha_i| \le M \text{ and } G_i \in \mathcal{G} \right\}.$$

Further, let

$$F = \left\{ f \;\Big|\; f = \int_{R^d} dt\, \lambda(t)\, G(x, t) \right\}$$

where λ is a function in L_1 such that ||λ||_{L_1} ≤ M. Finally, let P be a probability distribution on R^k × [-Q, Q] according to which (x, y) pairs are drawn independently at random. Defining

$$f_0 = \arg\min_{f \in F} E[(y - f(x))^2]$$

and

$$\hat{f}_{n,l} = \arg\min_{f \in H_n} \sum_{i=1}^{l} (y_i - f(x_i))^2,$$

then, with probability greater than 1 - δ,

$$\| f_0 - \hat{f}_{n,l} \|^2_{L_2(P)} \le O\!\left( \frac{1}{n} \right) + O\!\left( \left[ \frac{nd \log(nl) - \log \delta}{l} \right]^{1/2} \right).$$

Here the constant in the O(1/n) term depends only upon V, and the constant in the other O term depends only upon Q, M, and V.

Proof: From proposition 1, we immediately decompose the total generalization error ||f_0 - f̂_{n,l}||²_{L_2(P)} into two parts. Clearly, by eq. (17), we see that ||G||²_{L_2(P)} ≤ V². Since λ is bounded in L_1 norm by M, the conditions of Theorem 3 hold, implying that some choice of coefficients taking values in {-M/n, M/n} satisfies the approximation bound ε(n) = O(1/n). The conditions of Theorem 4 are clearly satisfied, allowing us to bound the estimation error ω(l, n, δ) by O([ (nd ln(nl) - ln δ)/l ]^{1/2}), and the theorem is proved. □
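As an aside (our back-of-the-envelope calculation, not part of the original proof), the l^{1/3} scaling mentioned below follows from balancing the two terms of the bound once the logarithmic factor is suppressed:

$$\|f_0 - \hat f_{n,l}\|^2_{L_2(P)} \;\lesssim\; \frac{a}{n} + b\sqrt{\frac{n}{l}}, \qquad \frac{d}{dn}\!\left( \frac{a}{n} + b\sqrt{\frac{n}{l}} \right) = -\frac{a}{n^2} + \frac{b}{2\sqrt{nl}} = 0 \;\Longrightarrow\; n^{3/2} \propto \sqrt{l} \;\Longrightarrow\; n^* \propto l^{1/3},$$

and at n = n^* both terms are of order l^{-1/3}, up to the suppressed log(nl) factor.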

Thus our main result is a bound on the total generalization error as a function of the order of the hypothesis class (n) and the number of examples (l). Two important observations need to be made: (1) as the order of the model (n) increases, the approximation error decreases, since we are using a larger model; however, the estimation error increases due to overfitting. For a certain value of n (proportional to l^{1/3}) the approximation-estimation trade-off is best maintained [1], and this is the optimal size of the model for the amount of data available. (2) For every choice of 𝒢, and correspondingly H_n, we obtain a class F to which the target must belong for the generalization error bounds to hold. Thus we are able to characterize the class of problems that can be solved "well" by the given approximation scheme H_n.

[1] Strictly speaking, we have not proved that n proportional to l^{1/3} is the optimal number of parameters, since the overall bound of theorem 3 may not be the best possible one. However, increasing the model complexity at this rate is sufficient to guarantee convergence to the target.

3.1 Example Cases

The theorem above bounds the generalization error for approximation schemes that utilize combinations of functions drawn from a certain fixed class 𝒢. We now show how, by different choices of 𝒢, we obtain some well known approximation schemes that satisfy the error bounds derived above.

Radial Basis Functions

Let the class 𝒢 be given by:

$$\mathcal{G} = \left\{ e^{-\frac{\|x - t\|^2}{\sigma^2}} \;\Big|\; t \in R^k, \; \sigma \in R \right\}.$$

This choice yields the Radial Basis Function approximation scheme [37, 30, 34, 31]. To see that 𝒢 satisfies the conditions of the theorem, notice that it is contained in the following set:

$$\mathcal{G} \subseteq \{ e^{-\beta} \;|\; \beta \in G \}$$

where

$$G = \text{span}\,\{1, x_1, x_2, \dots, x_k, \|x\|^2\}.$$

Clearly G is a vector space of dimension k + 2.

For such a class of approximating functions, the approximation bound of theorem 3 holds only if the target function belongs to the set of functions W(R^k) that have an integral representation of the form:

$$f(x) = \int_{R^{k+1}} dt\, d\sigma\, \lambda(t, \sigma)\, e^{-\frac{\|x - t\|^2}{\sigma^2}} \quad (18)$$

where λ is in L_1(R^{k+1}). A precise characterization of the set W(R^k) is lacking, but we can show that it contains an important subset. In order to do that, we first remind the reader that the isotropic Sobolev-Liouville space L^{m,p} [21], also known as the Bessel potential space [46], is defined as the set of functions f that can be written as f = λ * G_m, where * stands for the convolution operation, λ ∈ L^p, and G_m is the Bessel-Macdonald kernel, i.e., the function whose Fourier transform is:

$$\tilde{G}_m(s) = \frac{1}{(1 + 4\pi^2 \|s\|^2)^{m/2}}.$$

It is now easy to show the following:

Lemma 6

L^{m,1}(R^k) ⊆ W(R^k) if m > d.

This lemma is easily proved by noticing that the Bessel-Macdonald kernel has the following integral representation [39]:

$$G_m(x) = C_{m,k} \int d\sigma\, e^{-\sigma^2}\, \sigma^{m-d-1}\, e^{-\frac{\|x\|^2}{\sigma^2}} \quad (19)$$

where C_{m,k} is some positive constant that depends only on k and m. As a consequence, every function of L^{m,1}(R^k) has an integral representation of the form

$$f(x) = \int_{R^{k+1}} dt\, d\sigma\, \lambda(t)\, \mu_{m,d}(\sigma)\, e^{-\frac{\|x - t\|^2}{\sigma^2}}$$

with μ_{m,d}(σ) = e^{-σ²} σ^{m-d-1}, and therefore the function f belongs to W(R^k) if μ_{m,d}(σ) is integrable, that is, if m > d. Notice that, according to [39], if m is even, the space L^{m,1}(R^k) contains the Sobolev space H^{m,1} of functions whose derivatives up to order m are integrable.

Multi Layer Perceptrons

A non-linear approximation scheme that has become very popular is the multilayer perceptron scheme with sigmoidal computational units [38]. To apply our general result to such a case we let 𝒢 be:

$$\mathcal{G} = \{ \sigma(x \cdot t + \theta) \}$$

where σ is the so-called sigmoidal function. Clearly σ is monotonic and it operates on a vector space (span(1, x_1, x_2, ..., x_k)) of dimension k + 1. For such a class of approximating functions, the target must belong to a class with the following integral representation:

$$f(x) = \int_{R^{k+1}} dt\, d\theta\, \lambda(t, \theta)\, \sigma(x \cdot t + \theta) \quad (20)$$

where λ ∈ L_1(R^{k+1}). This represents a class of problems that can be solved by multilayer perceptrons with guaranteed generalization error bounds. This class is the set of functions in the closure of the convex hull of sigmoid functions, and A. Barron [3] showed that it is equivalent to the class whose Fourier transform f̃(s) has the property:

$$\int_{R^d} ds\, \|s\|\, |\tilde{f}(s)| < +\infty. \quad (21)$$

However, equation (21) above does not give a more precise characterization than eq. (20). Some more insight is gained if, following Girosi and Anzellotti [15], we reexamine eq. (21) outside the Fourier domain. Then one can show that functions of that class have an integral representation of the form

$$f(x) = \frac{1}{\|x\|^{d-1}} * \nu \quad (22)$$

where ν is any function whose Fourier transform is integrable. The appearance of the dimension d in the exponent in eq. (22) helps to understand why the rate of convergence is independent of the dimension, since it appears to impose a stronger and stronger constraint as the dimension d increases. However, it is not clear what the space of functions whose Fourier transform is integrable is, so this is not a full characterization of this class of functions.
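For concreteness, a member of the sigmoidal hypothesis class of eq. (20) and Theorem 5, f(x) = Σ_i α_i σ(x·t_i + θ_i) with Σ_i |α_i| ≤ M, can be written down in a few lines. The sketch below (ours, with arbitrary parameter values) merely evaluates such a network and checks the l_1 constraint on the outer weights; it does not train anything.

```python
# A member of the sigmoidal hypothesis class H_n of eq. (20) / Theorem 5:
# f(x) = sum_i alpha_i * sigma(x . t_i + theta_i), with sum_i |alpha_i| <= M.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp(X, alpha, T, theta, M=None):
    # X: (l, k) inputs; alpha: (n,) outer weights; T: (n, k) inner weights; theta: (n,) biases.
    if M is not None:
        assert np.abs(alpha).sum() <= M + 1e-9, "outer weights violate the l1 constraint"
    return sigmoid(X @ T.T + theta) @ alpha

rng = np.random.default_rng(0)
k, n = 4, 6
alpha = rng.normal(size=n)
alpha = alpha / np.abs(alpha).sum()           # enforce sum_i |alpha_i| = M = 1
T, theta = rng.normal(size=(n, k)), rng.normal(size=n)
print(mlp(rng.normal(size=(3, k)), alpha, T, theta, M=1.0))
```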

4 Proofs

We provide proofs of Theorems 3 and 4 in the following sections.

4.1 Bounding the approximation error (Proof of Theorem 3)

Proof: We start by considering the case in which the function λ is positive. We notice that, for an arbitrary choice of the vectors {t_i}_{i=1}^{n}, the following equality holds:

$$\left\| f - \frac{1}{n}\sum_{i=1}^{n} G(x, t_i) \right\|^2_{L_2(P)} = \|f\|^2_{L_2(P)} + \frac{1}{n^2} \sum_{i=1}^{n}\sum_{j=1}^{n} \int dx\, P(x)\, G(x, t_i) G(x, t_j) - \frac{2}{n} \sum_{i=1}^{n} \int dx\, P(x)\, f(x) G(x, t_i) =$$

$$= \|f\|^2_{L_2(P)} + \frac{1}{n^2} \sum_{i=1}^{n} \int dx\, P(x)\, G^2(x, t_i) + \frac{1}{n^2} \sum_{i \ne j} \int dx\, P(x)\, G(x, t_i) G(x, t_j) - \frac{2}{n} \sum_{i=1}^{n} \int dx\, P(x)\, f(x) G(x, t_i).$$

We now consider the variables {t_i}_{i=1}^{n} as random variables, which we assume identically distributed with probability distribution λ(t), and define the following expected value:

$$E[\,\cdot\,] = \int dt_1\, \lambda(t_1) \cdots \int dt_n\, \lambda(t_n)\; [\,\cdot\,].$$

Taking the expected value of the equality above and applying Fubini's theorem together with the integral representation of f, we obtain

$$E\!\left[ \left\| f - \frac{1}{n}\sum_{i=1}^{n} G(x, t_i) \right\|^2_{L_2(P)} \right] = \|f\|^2_{L_2(P)} + \frac{1}{n} \int dx\, dt\, P(x)\, \lambda(t)\, G^2(x, t) + \frac{1}{n^2}\, n(n-1)\, \|f\|^2_{L_2(P)} - \frac{2}{n}\, n\, \|f\|^2_{L_2(P)} = \frac{1}{n}\left( A^2 - \|f\|^2_{L_2(P)} \right)$$

where we have defined

$$A^2 = \int dx\, dt\, P(x)\, \lambda(t)\, G^2(x, t).$$

Therefore there exists at least one set of vectors {t_i}_{i=1}^{n} such that

$$\left\| f - \frac{1}{n}\sum_{i=1}^{n} G(x, t_i) \right\|^2_{L_2(P)} \le \frac{A^2 - \|f\|^2_{L_2(P)}}{n} \le \frac{b^2 - \|f\|^2_{L_2(P)}}{n}.$$

If the function λ is not positive we can always write

$$\lambda(t) = \text{sign}(\lambda(t))\, |\lambda(t)|$$

and therefore substitute G(x, t_i) with sign(λ(t_i)) G(x, t_i), from which the theorem follows. □
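The probabilistic argument above is easy to simulate. In the sketch below (ours), λ is taken to be the standard normal density on R, the t_i are drawn i.i.d. from it, and the squared L_2(P) error of (1/n) Σ_i G(x, t_i) is seen to decay roughly like 1/n, in line with eq. (16). The particular choices of G, λ, and P are assumptions made only for the illustration.

```python
# Monte Carlo illustration (ours) of the O(1/n) rate in Theorem 3.
# G(x, t) = exp(-(x - t)^2 / 2), lambda = standard normal density, P = N(0, 1).
# Then f(x) = int lambda(t) G(x, t) dt = exp(-x^2 / 4) / sqrt(2)  (Gaussian convolution).
import numpy as np

rng = np.random.default_rng(0)
G = lambda x, t: np.exp(-(x[:, None] - t[None, :]) ** 2 / 2.0)
f = lambda x: np.exp(-x ** 2 / 4.0) / np.sqrt(2.0)

x = rng.normal(size=20000)                       # samples of P(x) to estimate the L2(P) norm
for n in (1, 4, 16, 64, 256):
    errs = []
    for _ in range(20):                          # average over random draws of t_1..t_n
        t = rng.normal(size=n)                   # t_i drawn i.i.d. from lambda
        fn = G(x, t).mean(axis=1)                # (1/n) sum_i G(x, t_i)
        errs.append(np.mean((f(x) - fn) ** 2))   # ~ ||f - f_n||^2_{L2(P)}
    print(f"n={n:4d}   mean squared L2(P) error ~ {np.mean(errs):.5f}   n * error ~ {n * np.mean(errs):.3f}")
```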

4.2 Bounding the estimation error (Proof of Theorem 4)

In this part we attempt to bound the estimation error |I[f] - I_emp[f]|. In order to do that, we first need to introduce some basic concepts and notation. Let S be a subset of a metric space S̄ with metric d. In the discussion that follows, we will sometimes refer to objects like S as (S, d) to explicitly specify the metric and avoid confusion regarding multiple metrics that can be put on the same set. We say that an ε-cover with respect to the metric d is a set T ⊆ S such that for every s ∈ S there exists some t ∈ T satisfying d(s, t) ≤ ε. The size of the smallest ε-cover is N(ε, S, d) and is called the covering number of S. In other words,

$$\mathcal{N}(\epsilon, S, d) = \min_{T \subseteq S} |T|,$$

where T runs over all the possible ε-covers of S and |T| denotes the cardinality of T. A subset B of a metric space S is said to be ε-separated if for all x, y ∈ B, d(x, y) > ε. We define the packing number M(ε, S, d) as the size of the largest ε-separated subset of S. Thus

$$\mathcal{M}(\epsilon, S, d) = \sup_{B \subseteq S} |B|,$$

where B runs over all the ε-separated subsets of S. It is easy to show that the covering number is always less than the packing number, that is, N(ε, S, d) ≤ M(ε, S, d).
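A crude but constructive way to get a handle on covering numbers for a finite set of points (which is how the restriction A_ξ of eq. (23) below appears in practice) is a greedy cover. The sketch below (ours) returns a valid ε-cover under the empirical L^1 metric, so its size upper-bounds N(ε, ·, d_{L^1}); the example data are arbitrary.

```python
# Greedy construction (ours) of an epsilon-cover under the empirical L1 metric
# d(u, v) = (1/l) * sum_j |u_j - v_j|.  The returned set is an epsilon-cover of the
# finite point set S, so its size upper-bounds the covering number N(eps, S, d_L1).
import numpy as np

def d_l1(u, v):
    return np.mean(np.abs(u - v))

def greedy_cover(points, eps):
    cover = []
    for p in points:
        if all(d_l1(p, q) > eps for q in cover):   # p not yet covered: keep it as a center
            cover.append(p)
    return cover

# Example: restrictions of 500 "functions" to l = 30 sample points.
rng = np.random.default_rng(0)
S = rng.uniform(0, 1, size=(500, 30))
for eps in (0.3, 0.2, 0.1):
    print(eps, len(greedy_cover(S, eps)))
```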

Now let P(ξ) be a probability distribution defined on S, and let A be a set of real-valued functions defined on S such that, for any a ∈ A, 0 ≤ a(ξ) ≤ U² for all ξ ∈ S. Let also ξ = (ξ_1, ..., ξ_l) be a sequence of l examples drawn independently from S according to P(ξ). For any function a ∈ A we define the empirical and true expectations of a as follows:

$$\hat{E}[a] = \frac{1}{l} \sum_{i=1}^{l} a(\xi_i), \qquad E[a] = \int_S dP(\xi)\, a(\xi).$$

The difference between the empirical and the true expectation can be bounded by the following inequality, whose proof can be found in [36] and [16], and which will be crucial in order to prove our main theorem.

Lemma 7 [36], [16]

Let A and ξ be as defined above. Then, for all ε > 0,

$$P\!\left( \exists a \in A : |\hat{E}[a] - E[a]| > \epsilon \right) \le 4\, E\!\left[ \mathcal{N}\!\left( \tfrac{\epsilon}{16}, A_{\xi}, d_{L^1} \right) \right] e^{-\frac{\epsilon^2 l}{128 U^4}}.$$

In the above result, A_ξ is the restriction of A to the data set, that is:

$$A_{\xi} \equiv \{ (a(\xi_1), \dots, a(\xi_l)) : a \in A \}. \quad (23)$$

The set A_ξ is a collection of points belonging to the subset [0, U²]^l of the l-dimensional Euclidean space and is therefore totally bounded. Each function a in A is represented by a point in A_ξ, while every point in A_ξ represents all the functions that have the same values at the points ξ_1, ..., ξ_l. The distance metric d_{L^1} in the inequality above is the standard L^1 metric in R^l, that is,

$$d_{L^1}(x, y) = \frac{1}{l} \sum_{\mu=1}^{l} |x^{\mu} - y^{\mu}|$$

where x and y are points in the l-dimensional Euclidean space and x^μ and y^μ are their μ-th components respectively.

The above inequality is a result in the theory of uniform convergence of empirical measures to their underlying probabilities. This has been studied in great detail by Pollard and Vapnik, and similar inequalities can be found in the work of Vapnik [42, 43, 41], although they usually involve the VC dimension of the set A rather than its covering numbers.

Suppose now we choose S = X × Y, where X is an arbitrary subset of R^k and Y = [-Q, Q], as in the formulation of our original problem. The generic element of S will be written as ξ = (x, y) ∈ X × Y. We now consider the class of functions A defined as:

$$A = \{ a : X \times Y \to R \;|\; a(x, y) = (y - h(x))^2, \; h \in H_n(R^k) \}$$

where H_n(R^k) is the class of approximating functions defined on R^k that can be expressed as a linear combination of functions satisfying the conditions of theorem 4. Clearly,

$$|y - h(x)| \le |y| + |h(x)| \le Q + MV,$$

and therefore 0 ≤ a ≤ U², where we have defined U ≡ Q + MV. We notice that, by definition of Ê(a) and E(a), we have

$$\hat{E}(a) = \frac{1}{l} \sum_{i=1}^{l} (y_i - h(x_i))^2 = I_{emp}[h]$$

and

$$E(a) = \int_{X \times Y} dx\, dy\, P(x, y)\, (y - h(x))^2 = I[h].$$

Therefore, applying the inequality of lemma 7 to the set A, and noticing that the elements of A are essentially defined by the elements of H_n, we obtain the following result:

$$P\!\left( \forall h \in H_n, \; |I_{emp}[h] - I[h]| \le \epsilon \right) \ge 1 - 4\, E\!\left[ \mathcal{N}(\epsilon/16, A_{\xi}, d_{L^1}) \right] e^{-\frac{\epsilon^2 l}{128 U^4}}, \quad (24)$$

so that the inequality of lemma 7 gives us a bound on the estimation error. However, this bound depends on the specific choice of the probability distribution P(x, y), while we are interested in bounds that do not depend on P. Therefore it is useful to define some quantity that does not depend on P, and to give bounds in terms of that. We therefore introduce the concept of metric capacity of A, defined as

$$C(\epsilon, A, d_{L^1}) = \sup_P \{ \mathcal{N}(\epsilon, A, d_{L^1(P)}) \}$$

where the supremum is taken over all probability distributions P defined over S, and d_{L^1(P)} is the standard L^1(P) distance [2] induced by the probability distribution P:

$$d_{L^1(P)}(a_1, a_2) = \int_S dP(\xi)\, |a_1(\xi) - a_2(\xi)|, \qquad a_1, a_2 \in A.$$

The relationship between the covering number and the metric capacity is shown in the following lemma.

[2] Note that here A is a class of real-valued functions defined on a general metric space S. If we consider an arbitrary A defined on S and taking values in R^n, the L^1(P) norm is appropriately adjusted to be

$$d_{L^1(P)}(f, g) = \frac{1}{n} \sum_{i=1}^{n} \int_S |f_i(x) - g_i(x)|\, P(x)\, dx$$

where f(x) = (f_1(x), ..., f_n(x)) and g(x) = (g_1(x), ..., g_n(x)) are elements of A and P(x) is a probability distribution on S. Thus d_{L^1} and d_{L^1(P)} should be interpreted according to the context.

Lemma 8

$$E[\mathcal{N}(\epsilon, A_{\xi}, d_{L^1})] \le C(\epsilon, A, d_{L^1}).$$

Proof: For any sequence of points ξ in S, there is a trivial isometry between (A_ξ, d_{L^1}) and (A, d_{L^1(\hat{P}_\xi)}), where P̂_ξ is the empirical distribution on the space S given by (1/l) Σ_{i=1}^{l} δ(ξ - ξ_i). Here δ is the Dirac delta function, ξ ∈ S, and ξ_i is the i-th element of the data set. To see that this isometry exists, first note that for every element a ∈ A there exists a unique point (a(ξ_1), ..., a(ξ_l)) ∈ A_ξ. Thus a simple bijective mapping exists between the two spaces. Now consider any two elements g and h of A. The distance between them is given by

$$d_{L^1(\hat{P}_\xi)}(g, h) = \int_S |g(\xi) - h(\xi)|\, \hat{P}_\xi(\xi)\, d\xi = \frac{1}{l} \sum_{i=1}^{l} |g(\xi_i) - h(\xi_i)|.$$

This is exactly the distance, according to d_{L^1}, between the two points (g(ξ_1), ..., g(ξ_l)) and (h(ξ_1), ..., h(ξ_l)), which are elements of A_ξ. Thus there is a one-to-one correspondence between elements of A and A_ξ, and the distance between two elements of A is the same as the distance between their corresponding points in A_ξ. Given this isometry, for every ε-cover in A there exists an ε-cover of the same size in A_ξ, so that

$$\mathcal{N}(\epsilon, A_{\xi}, d_{L^1}) = \mathcal{N}(\epsilon, A, d_{L^1(\hat{P}_\xi)}) \le C(\epsilon, A, d_{L^1}),$$

and consequently E[N(ε, A_ξ, d_{L^1})] ≤ C(ε, A, d_{L^1}). □

The result above, together with eq. (24), shows that the following holds. (Note that we have not yet proved that C(ε, A, d_{L^1}) is a finite quantity. We will do so in the following lemmas.)

Lemma 9

$$P\!\left( \forall h \in H_n, \; |I_{emp}[h] - I[h]| \le \epsilon \right) \ge 1 - 4\, C(\epsilon/16, A, d_{L^1})\, e^{-\frac{\epsilon^2 l}{128 U^4}}. \quad (25)$$

Thus in order to obtain a uniform bound ω on |I_emp[h] - I[h]|, our task is reduced to computing the metric capacity of the functional class A which we have just defined. We will do this in several steps. In lemma 10, we first relate the metric capacity of A to that of the class of approximating functions H_n. Then lemmas 11 through 17 go through a computation of the metric capacity of H_n.

Lemma 10

$$C(\epsilon, A, d_{L^1}) \le C(\epsilon/4U, H_n, d_{L^1}).$$

Proof: Fix a distribution P on S = X × Y. Let P_X be the marginal distribution with respect to X. Suppose K is an ε/4U-cover for H_n with respect to this probability distribution P_X, i.e., with respect to the distance metric d_{L^1(P_X)} on H_n. Further, let the size of K be N(ε/4U, H_n, d_{L^1(P_X)}). This means that for any h ∈ H_n there exists a function h̄ belonging to K such that:

$$\int |h(x) - \bar{h}(x)|\, P_X(x)\, dx \le \epsilon/4U.$$

Now we claim that the set H(K) = {(y - h̄(x))² : h̄ ∈ K} is an ε-cover for A with respect to the distance metric d_{L^1(P)}. To see this, it is sufficient to show that

$$\int |(y - h(x))^2 - (y - \bar{h}(x))^2|\, P(x, y)\, dx\, dy \le \int |2y - h - \bar{h}|\, |h - \bar{h}|\, P(x, y)\, dx\, dy \le \int (2Q + 2MV)\, |h - \bar{h}|\, P(x, y)\, dx\, dy \le \epsilon,$$

which is clearly true. Now

$$\mathcal{N}(\epsilon, A, d_{L^1(P)}) \le |K| = \mathcal{N}(\epsilon/4U, H_n, d_{L^1(P_X)}) \le C(\epsilon/4U, H_n, d_{L^1}).$$

Taking the supremum over all probability distributions, the result follows. □

So the problem reduces to finding C(ε, H_n, d_{L^1}), i.e., the metric capacity of the class of appropriately defined Radial Basis Function networks with n centers. To do this we will decompose the class H_n into the composition of two classes defined as follows.

Definitions/Notations

H_I is a class of functions defined from the metric space (R^k, d_{L^1}) to the metric space (R^n, d_{L^1}). In particular, H_I = {g(x) = (G_1, G_2, ..., G_n)}, where each G_i is an element of 𝒢.

H_F is a class of functions defined from the metric space ([-V, V]^n, d_{L^1}) to the metric space (R, d_{L^1}). In particular,

$$H_F = \left\{ h : h(x) = \alpha \cdot x, \; x \in [-V, V]^n \text{ and } \sum_{i=1}^{n} |\alpha_i| \le M \right\}$$

where α ≡ (α_1, ..., α_n) is an n-dimensional vector satisfying the constraint Σ_i |α_i| ≤ M. Thus we see that

$$H_n = \{ h_F \circ h_I : h_F \in H_F \text{ and } h_I \in H_I \}$$

where ∘ stands for the composition operation, i.e., for any two functions f and g, f ∘ g = f(g(x)). We see that H_n as defined above maps R^k to R.

Lemma 11

$$C(\epsilon, H_I, d_{L^1}) \le 2^n \left( \frac{4eV}{\epsilon} \ln \frac{4eV}{\epsilon} \right)^{nd}.$$

Proof: Fix a probability distribution P on R^k. Consider the class 𝒢. Let K be an N(ε, 𝒢, d_{L^1(P)})-sized ε-cover for this class [3]. We first claim that T = {(h_1, ..., h_n) : h_i ∈ K} is an ε-cover for H_I with respect to the d_{L^1(P)} metric. Remember that the d_{L^1(P)} distance between two vector-valued functions g(x) = (g_1(x), ..., g_n(x)) and g*(x) = (g*_1(x), ..., g*_n(x)) is defined as

$$d_{L^1(P)}(g, g^*) = \frac{1}{n} \sum_{i=1}^{n} \int |g_i(x) - g^*_i(x)|\, P(x)\, dx.$$

In order to prove the claim, pick an arbitrary g = (g_1, ..., g_n) ∈ H_I. For each g_i, there exists a g*_i ∈ K which is ε-close in the appropriate sense for real-valued functions, i.e., d_{L^1(P)}(g_i, g*_i) ≤ ε. The function g* = (g*_1, ..., g*_n) is an element of T. Also, the distance between (g_1, ..., g_n) and (g*_1, ..., g*_n) in the d_{L^1(P)} metric is

$$d_{L^1(P)}(g, g^*) \le \frac{1}{n} \sum_{i=1}^{n} \epsilon = \epsilon.$$

Thus we obtain that

$$\mathcal{N}(\epsilon, H_I, d_{L^1(P)}) \le [\mathcal{N}(\epsilon, \mathcal{G}, d_{L^1(P)})]^n,$$

and taking the supremum over all probability distributions as usual, we get

$$C(\epsilon, H_I, d_{L^1}) \le (C(\epsilon, \mathcal{G}, d_{L^1}))^n.$$

Now we need to find the capacity of 𝒢. This is done in lemma 14. From this the result follows. □

[3] For a finite-sized cover to exist, the set 𝒢 needs to be totally bounded. This follows directly from the next lemma, but for the time being we assume finiteness.

Definitions/Notations

Before we proceed to the next step in our proof, some more notation needs to be defined. Let A be a family of functions from a set S into R. For any sequence ξ = (ξ_1, ..., ξ_m) of m points in S, let A_ξ be the restriction of A to the data set, as per our previously introduced notation. Thus A_ξ = {(a(ξ_1), ..., a(ξ_m)) : a ∈ A}. If there exists some translation of the set A_ξ such that it intersects all 2^m orthants of the space R^m, then ξ is said to be shattered by A. Expressing this a little more formally, let B be the set of all possible m-dimensional boolean vectors. If there exists a translation t ∈ R^m such that for every b ∈ B there exists some function a_b ∈ A satisfying a_b(ξ_i) - t_i ≥ 0 ⟺ b_i = 1 for all i = 1 to m, then the set (ξ_1, ..., ξ_m) is shattered by A. Note that the inequality could easily have been defined to be strict, and it would not have made a difference. The largest m such that there exists a sequence of m points which is shattered by A is said to be the pseudo-dimension of A, denoted by pdim(A).

In this context, there are two important theorems which we will need to use. We give these theorems without proof.

Theorem 12 (Dudley)

Let F be a k-dimensional vector space of functions from a set S into R. Then pdim(F) = k.

The following theorem is stated and proved in a somewhat more general form by Pollard. Haussler, using techniques from Pollard, has proved the specific form shown here.

Theorem 13 (Pollard, Haussler)

Let F be a family of functions from a set S into [M_1, M_2], where pdim(F) = d for some 1 ≤ d < ∞. Let P be a probability distribution on S. Then for all 0 < ε ≤ M_2 - M_1,

$$\mathcal{M}(\epsilon, F, d_{L^1(P)}) < 2 \left( \frac{2e(M_2 - M_1)}{\epsilon} \log \frac{2e(M_2 - M_1)}{\epsilon} \right)^d.$$

Here M(ε, F, d_{L^1(P)}) is the packing number of F according to the distance metric d_{L^1(P)}.

Lemma 14

$$C(\epsilon, \mathcal{G}, d_{L^1}) \le 2\left( \frac{4eV}{\epsilon} \ln \frac{4eV}{\epsilon} \right)^{d}.$$

Proof: Recall that the class 𝒢 can be represented as

$$\mathcal{G} = \{ h \circ f : f \in G \}$$

where h is a fixed, monotonic function and G is a vector space of dimensionality d. We claim that the pseudo-dimension of 𝒢, denoted by pdim(𝒢), fulfills the following inequality:

$$\text{pdim}(\mathcal{G}) \le \text{pdim}(G) = d.$$

To show that this inequality holds, it is enough to show that every set shattered by 𝒢 is also shattered by G. Suppose there exists a sequence (x_1, x_2, ..., x_m) which is shattered by 𝒢. This means that, by our definition of shattering, there exists a translation t ∈ R^m such that for every boolean vector b ∈ {0, 1}^m there is some function g_b = h ∘ f_b (f_b ∈ G) satisfying g_b(x_i) ≥ t_i if and only if b_i = 1, where t_i and b_i are the i-th components of t and b respectively. We now show that the set (x_1, x_2, ..., x_m) is shattered by G. To begin, let

$$m_i = \min\{ g_b(x_i) = h \circ f_b(x_i) \;|\; b_i = 1 \}$$

and

$$M_i = \max\{ g_b(x_i) = h \circ f_b(x_i) \;|\; b_i = 0 \}.$$

Since 𝒢 shatters the set (x_1, x_2, ..., x_m), by the definition of shattering we have

$$M_i < t_i \le m_i.$$

We first show how the result holds for monotonically increasing functions and then show how to extend it to monotonically decreasing ones. By construction, and by the monotonicity of h, it is always possible to find a t'_i such that h(t'_i) ∈ (M_i, m_i]; specifically, t'_i ∈ (h^{-1}(M_i), h^{-1}(m_i)] for monotonically increasing functions. Thus, we find that

$$g_b(x_i) = h \circ f_b(x_i) \ge h(t'_i) \iff b_i = 1.$$

Now, by monotonicity (increasing), we have

$$f_b(x_i) \ge t'_i \iff b_i = 1.$$

Since each f_b ∈ G, and we have found an appropriate translation t' = {t'_1, t'_2, ..., t'_m}, we see that the set (x_1, x_2, ..., x_m) is shattered by G. For monotonically decreasing h, we let t'_i = -τ_i, where τ_i ∈ [h^{-1}(m_i), h^{-1}(M_i)), and h_b = -f_b, where f_b is as before. Then it is straightforward to show that

$$h_b(x_i) \ge t'_i \iff b_i = 1.$$

Since G is a vector space, the h_b's belong to G, and clearly the set (x_1, x_2, ..., x_m) is again seen to be shattered by G.

Since G is a vector space of dimensionality d, an application of Dudley's Theorem [12] yields the value d for its pseudo-dimension. Further, functions in the class 𝒢 are in the range [-V, V]. Now we see (by an application of Pollard's theorem) that

$$\mathcal{N}(\epsilon, \mathcal{G}, d_{L^1(P)}) \le \mathcal{M}(\epsilon, \mathcal{G}, d_{L^1(P)}) \le 2\left( \frac{4eV}{\epsilon} \ln \frac{4eV}{\epsilon} \right)^{\text{pdim}(\mathcal{G})} \le 2\left( \frac{4eV}{\epsilon} \ln \frac{4eV}{\epsilon} \right)^{d}.$$

Taking the supremum over all probability distributions, the result follows. (This bound makes sense only for ε < 4V.) □

Lemma 15

$$C(\epsilon, H_F, d_{L^1}) \le 2\left( \frac{4eMV}{\epsilon} \ln \frac{4eMV}{\epsilon} \right)^{n}.$$

Proof: The proof of this runs in very similar fashion. First note that

$$H_F \subseteq \{ y : y = \alpha \cdot x, \; x, \alpha \in R^n \}.$$

The latter set is a vector space of dimensionality n and, by Dudley's theorem [12], its pseudo-dimension is n. Also, since H_F is contained in the latter vector space, every set of points that can be shattered by H_F can also be shattered by this latter vector space of functions. Thus, we have that pdim(H_F) ≤ n. To get bounds on the functions in H_F, notice that

$$\left| \sum_{i=1}^{n} \alpha_i x_i \right| \le \sum_{i=1}^{n} |\alpha_i|\, |x_i| \le V \sum_{i=1}^{n} |\alpha_i| \le MV.$$

Thus functions in H_F are bounded in the range [-MV, MV]. Now, using Pollard's result [16], [36], we have that

$$\mathcal{N}(\epsilon, H_F, d_{L^1(P)}) \le \mathcal{M}(\epsilon, H_F, d_{L^1(P)}) \le 2\left( \frac{4eMV}{\epsilon} \ln \frac{4eMV}{\epsilon} \right)^{n}.$$

Taking the supremum over all probability distributions, the result follows. □

Lemma 16

A uniform first-order Lipschitz bound of H_F is Mn, i.e., for all x, y ∈ R^n such that d_{L^1}(x, y) ≤ ε, the following holds:

$$\forall f \in H_F, \quad |f(x) - f(y)| \le Mn\epsilon.$$

Proof: Suppose we have x, y ∈ R^n such that d_{L^1}(x, y) ≤ ε. The quantity Mn is a uniform first-order Lipschitz bound for H_F if, for any element of H_F, parametrized by a vector α, the following inequality holds:

$$|x \cdot \alpha - y \cdot \alpha| \le Mn\epsilon.$$

Now, clearly,

$$|x \cdot \alpha - y \cdot \alpha| = \left| \sum_{i=1}^{n} \alpha_i (x_i - y_i) \right| \le \sum_{i=1}^{n} |\alpha_i|\, |x_i - y_i| \le M \sum_{i=1}^{n} |x_i - y_i| \le Mn\epsilon.$$

The result is proved. □

Lemma 17

$$C(\epsilon, H_n, d_{L^1}) \le C\!\left( \frac{\epsilon}{2Mn}, H_I, d_{L^1} \right) C\!\left( \frac{\epsilon}{2}, H_F, d_{L^1} \right).$$

Proof: Fix a distribution P on R^k. Assume we have an ε/(2Mn)-cover for H_I with respect to the probability distribution P and the metric d_{L^1(P)}. Let it be K, where

$$|K| = \mathcal{N}(\epsilon/2Mn, H_I, d_{L^1(P)}).$$

Now each function f ∈ K maps the space R^k into R^n, thus inducing a probability distribution P_f on the space R^n. Specifically, P_f can be defined as the distribution obtained from the measure μ_f defined so that any measurable set A ⊆ R^n has measure

$$\mu_f(A) = \int_{f^{-1}(A)} P(x)\, dx.$$

Further, there exists a cover K_f which is an ε/2-cover for H_F with respect to the probability distribution P_f. In other words,

$$|K_f| = \mathcal{N}(\epsilon/2, H_F, d_{L^1(P_f)}).$$

We claim that

$$H(K) = \{ g \circ f : f \in K \text{ and } g \in K_f \}$$

is an ε-cover for H_n. Further, we note that

$$|H(K)| \le \sum_{f \in K} |K_f| \le \sum_{f \in K} C(\epsilon/2, H_F, d_{L^1}) \le \mathcal{N}(\epsilon/(2Mn), H_I, d_{L^1(P)})\, C(\epsilon/2, H_F, d_{L^1}).$$

Thus we see that

$$\mathcal{N}(\epsilon, H_n, d_{L^1(P)}) \le \mathcal{N}\!\left( \epsilon/(2Mn), H_I, d_{L^1(P)} \right) C(\epsilon/2, H_F, d_{L^1}).$$

Taking the supremum over all probability distributions, the result follows.

To see that H(K) is an ε-cover, suppose we are given an arbitrary function h_F ∘ h_I ∈ H_n. There clearly exists a function h̄_I ∈ K such that

$$\int d_{L^1}(h_I(x), \bar{h}_I(x))\, P(x)\, dx \le \epsilon/(2Mn).$$

Now there also exists a function h̄_F ∈ K_{h̄_I} such that

$$\int_{R^k} |h_F \circ \bar{h}_I(x) - \bar{h}_F \circ \bar{h}_I(x)|\, P(x)\, dx = \int_{R^n} |h_F(y) - \bar{h}_F(y)|\, P_{\bar{h}_I}(y)\, dy \le \epsilon/2.$$

To show that H(K) is an ε-cover it is sufficient to show that

$$\int_{R^k} |h_F \circ h_I(x) - \bar{h}_F \circ \bar{h}_I(x)|\, P(x)\, dx \le \epsilon.$$

Now,

$$\int_{R^k} |h_F \circ h_I(x) - \bar{h}_F \circ \bar{h}_I(x)|\, P(x)\, dx \le \int_{R^k} \left\{ |h_F \circ h_I(x) - h_F \circ \bar{h}_I(x)| + |h_F \circ \bar{h}_I(x) - \bar{h}_F \circ \bar{h}_I(x)| \right\} P(x)\, dx$$

by the triangle inequality. Further, since h_F is Lipschitz bounded,

$$\int_{R^k} |h_F \circ h_I(x) - h_F \circ \bar{h}_I(x)|\, P(x)\, dx \le \int_{R^k} Mn\, d_{L^1}(h_I(x), \bar{h}_I(x))\, P(x)\, dx \le Mn\, (\epsilon/2Mn) = \epsilon/2.$$

Also,

$$\int_{R^k} |h_F \circ \bar{h}_I(x) - \bar{h}_F \circ \bar{h}_I(x)|\, P(x)\, dx = \int_{R^n} |h_F(y) - \bar{h}_F(y)|\, P_{\bar{h}_I}(y)\, dy \le \epsilon/2.$$

Consequently both terms are at most ε/2 and the total integral is at most ε. □

Having obtained the crucial bound on the metric capacity of the class H_n, we can now prove the following.

Lemma 18

With probability exceeding 1 - δ, and for all h ∈ H_n, the following bound holds:

$$|I_{emp}[h] - I[h]| \le O\!\left( \left[ \frac{nd \ln(nl) - \ln \delta}{l} \right]^{1/2} \right).$$

Proof: We know from the previous lemmas that

$$C(\epsilon, H_n, d_{L^1}) \le 2^{n+1} \left[ \frac{8MeVn}{\epsilon} \ln \frac{8MeVn}{\epsilon} \right]^{nd} \left[ \frac{8MeV}{\epsilon} \ln \frac{8MeV}{\epsilon} \right]^{n} \le \left[ \frac{16MeVn}{\epsilon} \ln \frac{16MeVn}{\epsilon} \right]^{n(d+1)}.$$

From lemma 9, we see that

$$P\!\left( \forall h \in H_n, \; |I_{emp}[h] - I[h]| \le \epsilon \right) \ge 1 - \delta \quad (26)$$

as long as

$$4\, C(\epsilon/16, A, d_{L^1})\, e^{-\frac{\epsilon^2 l}{128 U^4}} \le \delta,$$

which in turn is satisfied as long as (by lemma 10)

$$4\, C(\epsilon/64U, H_n, d_{L^1})\, e^{-\frac{\epsilon^2 l}{128 U^4}} \le \delta,$$

which implies

$$\left[ \frac{1024 MeVUn}{\epsilon} \ln \frac{1024 MeVUn}{\epsilon} \right]^{n(d+1)} e^{-\frac{\epsilon^2 l}{128 U^4}} \le \frac{\delta}{4}.$$

In other words (for small ε < An),

$$\left[ \frac{An}{\epsilon} \ln \frac{An}{\epsilon} \right]^{n(d+1)} e^{-\epsilon^2 l / B} \le \frac{\delta}{4}$$

for constants A, B. The latter inequality is satisfied as long as

$$(An/\epsilon)^{2n(d+1)}\, e^{-\epsilon^2 l / B} \le \frac{\delta}{4},$$

which implies

$$2n(d+1)\left( \ln(An) - \ln(\epsilon) \right) - \epsilon^2 l / B \le \ln(\delta/4)$$

and in turn implies

$$\epsilon^2 l > B \ln(4/\delta) + 2Bn(d+1)\left( \ln(An) - \ln(\epsilon) \right).$$

It is possible to show that

$$\epsilon = \left( \frac{B\left[ \ln(4/\delta) + 2n(d+1)\ln(An) + n(d+1)\ln(l) \right]}{l} \right)^{1/2}$$

satisfies this requirement. Putting the above value of ε into the inequality of interest, we need

$$\epsilon^2 (l/B) = \ln(4/\delta) + 2n(d+1)\ln(An) + n(d+1)\ln(l) \ge \ln(4/\delta) + 2n(d+1)\ln(An) + 2n(d+1)\, \tfrac{1}{2} \ln\!\left( \frac{l}{B\left[ \ln(4/\delta) + 2n(d+1)\ln(An) + n(d+1)\ln(l) \right]} \right).$$

In other words, we need

$$n(d+1)\ln(l) \ge n(d+1)\ln\!\left( \frac{l}{B\left[ \ln(4/\delta) + 2n(d+1)\ln(An) + n(d+1)\ln(l) \right]} \right).$$

Since

$$B\left[ \ln(4/\delta) + 2n(d+1)\ln(An) + n(d+1)\ln(l) \right] \ge 1,$$

the inequality is obviously true for this value of ε. Taking this value of ε in eq. (26) then proves our lemma. □

5 Conclusions

We have discussed the problem of approximating functions from scattered, noisy data using certain kinds of approximation schemes. We have seen that the total error can be decomposed into an approximation and an estimation component. Crucially, it is observed that one cannot reduce the upper bounds on both error components simultaneously. To make the approximation component small, we need many parameters; to make the estimation component small, we need few parameters (alternatively, more data). By obtaining error bounds on the two sources of error, one is able to derive a principled way to trade these two error components in an optimal fashion, analogous to Vapnik's principle of Structural Risk Minimization (Vapnik, 1982), as a model selection criterion.

This work suggests some directions of research that are worth exploring. First, notice that we have used concepts like metric entropy to bound the estimation component. It would be interesting to use the VC-dimension to bound this component. Unfortunately, we have not been able to get VC bounds on our approximation schemes yet. Second, we have only obtained upper bounds on the generalization error. An important direction to explore would be the derivation of lower bounds on the total generalization error. While lower bounds for the approximation and estimation components exist separately, it is not obvious how to combine them non-trivially into a tight lower bound on the total generalization error. Finally, we have been using uniform laws of large numbers that are derived from the Hoeffding inequality. It is likely that uniform laws derived from the Bernstein inequality will yield faster convergence rates. We intend to investigate these issues in future work.

References

[1] A.R. Barron. Approximation and estimation bounds for artificial neural networks. Technical Report 59, Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, March 1991.
[2] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. Technical Report 58, Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, March 1991.
[3] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, May 1993.
[4] A.R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115-133, 1994.
[5] R.E. Bellman. Adaptive Control Processes. Princeton University Press, Princeton, NJ, 1961.
[6] L. Breiman. Hinging hyperplanes for regression, classification, and function approximation. IEEE Transactions on Information Theory, 39(3):999-1013, May 1993.
[7] L. Breiman. Stacked regression. Technical report, University of California, Berkeley, 1993.
[8] G. Cybenko. Approximation by superposition of a sigmoidal function. Math. Control Systems Signals, 2(4):303-314, 1989.
[9] R. DeVore, R. Howard, and C. Micchelli. Optimal nonlinear approximation. Manuskripta Mathematika, 1989.
[10] R.A. DeVore. Degree of nonlinear approximation. In C.K. Chui, L.L. Schumaker, and D.J. Ward, editors, Approximation Theory, VI, pages 175-201. Academic Press, New York, 1991.
[11] R.A. DeVore and X.M. Yu. Nonlinear n-widths in Besov spaces. In C.K. Chui, L.L. Schumaker, and D.J. Ward, editors, Approximation Theory, VI, pages 203-206. Academic Press, New York, 1991.
[12] R.M. Dudley. Universal Donsker classes and metric entropy. Ann. Prob., 14(4):1306-1326, 1987.
[13] K. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2:183-192, 1989.
[14] F. Girosi. On some extensions of radial basis functions and their applications in artificial intelligence. Computers Math. Applic., 24(12):61-80, 1992.
[15] F. Girosi and G. Anzellotti. Rates of convergence for radial basis functions and neural networks. In R.J. Mammone, editor, Artificial Neural Networks for Speech and Vision, pages 97-113, London, 1993. Chapman & Hall.
[16] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Technical Report UCSC-CRL-91-02, University of California, Santa Cruz, 1989.
[17] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, 1992.
[18] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.
[19] K. Hornik, M. Stinchcombe, H. White, and P. Auer. Degree of approximation results for feedforward networks approximating unknown functions and their derivatives. Neural Computation, 6:1262-1275, 1994.
[20] L.K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for Projection Pursuit Regression and neural network training. The Annals of Statistics, 20(1):608-613, March 1992.
[21] L.D. Kudryavtsev and S.M. Nikol'skii. Spaces of differentiable functions of several variables. In S.M. Nikol'skii, editor, Analysis III. Springer-Verlag, Berlin, 1991.
[22] R.P. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, pages 4-22, April 1987.
[23] G.G. Lorentz. Metric entropy, widths, and superposition of functions. Amer. Math. Monthly, 69:469-485, 1962.
[24] G.G. Lorentz. Approximation of Functions. Chelsea Publishing Co., New York, 1986.
[25] H. Mhaskar and C. Micchelli. Degree of approximation by neural and translation networks with a single hidden layer. Advances in Applied Mathematics, 16:151-183, 1995.
[26] H.N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8:164-167, 1996.
[27] H.N. Mhaskar. Approximation properties of a multilayered feedforward artificial neural network. Advances in Computational Mathematics, 1:61-80, 1993.
[28] H.N. Mhaskar and C.A. Micchelli. Approximation by superposition of a sigmoidal function. Advances in Applied Mathematics, 13:350-373, 1992.
[29] H.N. Mhaskar and C.A. Micchelli. Dimension independent bounds on the degree of approximation by neural networks. IBM Journal of Research and Development, 38:277-284, 1994.
[30] C.A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11-22, 1986.
[31] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281-294, 1989.
[32] N. Murata. An integral representation of functions using three layered networks and their approximation bounds. Neural Networks, 9:947-956, 1996.
[33] A. Pinkus. N-widths in Approximation Theory. Springer-Verlag, New York, 1986.
[34] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), September 1990.
[35] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978-982, 1990.
[36] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, Berlin, 1984.
[37] M.J.D. Powell. Radial basis functions for multivariable interpolation: a review. In J.C. Mason and M.G. Cox, editors, Algorithms for Approximation. Clarendon Press, Oxford, 1987.
[38] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Parallel Distributed Processing. MIT Press, Cambridge, MA, 1986.
[39] E.M. Stein. Singular Integrals and Differentiability Properties of Functions. Princeton University Press, Princeton, NJ, 1970.
[40] A.F. Timan. Theory of Approximation of Functions of a Real Variable. Macmillan, New York, 1963.
[41] V.N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin, 1982.
[42] V.N. Vapnik and A.Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Th. Prob. and its Applications, 17(2):264-280, 1971.
[43] V.N. Vapnik and A.Ya. Chervonenkis. The necessary and sufficient conditions for the uniform convergence of averages to their expected values. Teoriya Veroyatnostei i Ee Primeneniya, 26(3):543-564, 1981.
[44] V.N. Vapnik and A.Ya. Chervonenkis. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3):283-305, 1991.
[45] H. White. Connectionist nonparametric regression: Multilayer perceptrons can learn arbitrary mappings. Neural Networks, 3:535-549, 1990.
[46] W.P. Ziemer. Weakly Differentiable Functions: Sobolev Spaces and Functions of Bounded Variation. Springer-Verlag, New York, 1989.