Error Bounds for Approximation with Neural Networks

Martin Burger and Andreas Neubauer

Institut für Industriemathematik, Johannes-Kepler-Universität, A-4040 Linz, Austria

Abstract. In this paper we prove convergence rates for the problem of approximating functions $f$ by neural networks and similar constructions. We show that the rates are the better the smoother the activation functions are, provided that $f$ satisfies an integral representation. We give error bounds not only in Hilbert spaces but in general Sobolev spaces $W^{m,r}(\Omega)$. Finally, we apply our results to a class of perceptrons and present a sufficient smoothness condition on $f$ guaranteeing the integral representation.

Key Words: Neural networks, error bounds, nonlinear function approximation.
AMS Subject Classifications: 41A30, 41A25, 92B20, 68T05

1. Introduction

The aim of this paper is to find error bounds for the approximation of functions by feed-forward networks with a single hidden layer and a linear output layer, which can be written as
$$ f_n(x) = \sum_{j=1}^n c_j\, \sigma(x; t_j)\,, \qquad (1.1) $$

where $c_j \in \mathbb{R}$ and $t_j \in P \subseteq \mathbb{R}^p$ are parameters to be determined. An important special case of (1.1) are the so-called ridge constructions, i.e.,
$$ f_n(x) = \sum_{j=1}^n c_j\, \sigma(a_j^T x + b_j)\,. \qquad (1.2) $$
The interest in such networks grew since Hornik et al. [5] showed that functions of the form (1.2) are dense in $C(\Omega)$ if $\sigma$ is a function of sigmoidal form. Another special case are radial basis function networks, where $\sigma(x; t) = \sigma(\|x - t\|)$ (cf. [7]). We consider the problem of approximating a function $f \in W^{m,r}(\Omega)$, where $W^{m,r}(\Omega)$ denotes the usual Sobolev space and $\Omega$ is a (not necessarily bounded) domain in $\mathbb{R}^d$. This problem can be written in the abstract form
$$ \inf_{g \in X_n} \|f - g\|_X\,, \qquad (1.3) $$

Supported by the Austrian Fonds zur Förderung der wissenschaftlichen Forschung under grant SFB F013/1308.


where $X_n$ denotes the set of all functions of form (1.1), i.e.,
$$ X_n = \Big\{\, g = \sum_{j=1}^n c_j\, \sigma(x; t_j) \;:\; t_j \in P \subseteq \mathbb{R}^p\,,\; c_j \in \mathbb{R} \,\Big\}\,. \qquad (1.4) $$
$\sigma$ is assumed smooth enough so that $X_n \subseteq X$; $P$ is a (usually bounded) domain. Usually, the convergence of solutions of (1.3), if they exist (note that $X_n$ is not a finite-dimensional subspace of $X$), is arbitrarily slow, since the approximation problem is asymptotically ill-posed, i.e., arbitrarily small errors in the observation can lead to arbitrarily large errors in the approximation as $n \to \infty$ (cf., e.g., [2, 3]). It was shown in [3] that the set of functions to which networks of the form (1.1) converge is just the closure of the range of the integral operator
$$ h \mapsto \int_P h(t)\, \sigma(\cdot; t)\, dt\,. $$
Rates are usually only obtained under additional conditions on $f$ (cf., e.g., [4]). A natural condition seems to be that $f$ is in the range of the above operator, i.e.,
$$ f(x) = \int_P h(t)\, \sigma(x; t)\, dt\,. \qquad (1.5) $$
It was shown in [6] that under this condition the rate
$$ \inf_{g \in X_n} \|f - g\|_{L^2(\Omega)} = O(n^{-\frac12}) \qquad (1.6) $$
is obtained if $\sigma$ is a continuous function. We improve this result under additional smoothness assumptions on the basis function $\sigma$ in the next section, with estimates also in $H^m(\Omega)$. Moreover, we will give error bounds in $W^{m,r}(\Omega)$ that depend on the dimension $p$ of $P$, where the analysis is based on finite-element theory. In Section 3, we apply the results to perceptrons and give sufficient conditions on $f$ for condition (1.5) to hold.
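Networks of form (1.1) and (1.2) are plain linear combinations of parametrized activations; the following minimal numerical sketch (my own illustration, not part of the paper) evaluates a ridge network of form (1.2) with a clipped linear sigmoid:

```python
import numpy as np

# Sketch (not from the paper): evaluating a ridge construction
#   f_n(x) = sum_j c_j * sigma(a_j^T x + b_j),   cf. (1.2),
# for a batch of points.  The activation is a clipped linear sigmoid.

def sigma(t):
    """Piecewise linear sigmoidal function: 0 for t < -1, 1 for t > 1."""
    return np.clip((t + 1.0) / 2.0, 0.0, 1.0)

def ridge_network(X, A, b, c):
    """Evaluate f_n at the rows of X.

    X: (N, d) evaluation points, A: (n, d) directions a_j,
    b: (n,) offsets b_j, c: (n,) outer coefficients c_j.
    """
    return sigma(X @ A.T + b) @ c

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(5, 2))   # 5 points in R^2
A = rng.uniform(-1.0, 1.0, size=(4, 2))   # n = 4 hidden neurons
b = rng.uniform(-1.0, 1.0, size=4)
c = rng.uniform(-1.0, 1.0, size=4)
y = ridge_network(X, A, b, c)
print(y.shape)  # one value per evaluation point
```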

2. Error Bounds

An inspection of the proof of (1.6) in [6] shows that the result can be improved if the activation function $\sigma$ is Hölder continuous. Moreover, rates can then be obtained in $H^m(\Omega)$:

Theorem 2.1. Let $X_n$ be defined as in (1.4) with $P \subseteq \mathbb{R}^p$ bounded and $\sigma$ such that
$$ \|\sigma(\cdot; t) - \sigma(\cdot; s)\|_{H^m(\Omega)} \le c\, \|t - s\|^\alpha\,, \qquad \alpha \in (0, 1]\,,\; c > 0\,,\; m \in \mathbb{N}_0\,. \qquad (2.1) $$
Moreover, let $f \in H^m(\Omega)$ satisfy (1.5) with $h \in L^\infty(P)$. Then we obtain the rate
$$ \inf_{g \in X_n} \|f - g\|_{H^m(\Omega)} = O\big(n^{-\frac12 - \frac{\alpha}{p}}\big)\,. $$

Proof. Let $\tilde P := \{\, t \in P : h(t) \ge 0 \,\}$ (note that $\tilde P$ is unique up to a set of measure zero) and $\tilde n := [\frac n2]$. Since $P$ is bounded, it is possible to find bounded measurable sets $P_j$ such that
$$ \tilde P = \bigcup_{j=1}^{\tilde n} P_j\,, \qquad P \setminus \tilde P = \bigcup_{j=\tilde n+1}^{n} P_j\,, \qquad P_i \cap P_j = \emptyset\,,\; i \ne j\,, $$
$$ \operatorname{diam}(P_j) = O\big(\tilde n^{-\frac1p}\big)\,, \qquad |P_j| = O\big(\tfrac1{\tilde n}\big)\,, \qquad i, j = 1, \dots, n\,. \qquad (2.2) $$
We now define coefficients
$$ c_j := \int_{P_j} h(t)\, dt $$
and probability measures
$$ \rho_j(t) := \begin{cases} \dfrac{1}{c_j}\, h(t)\,, & t \in P_j\,, \\ 0\,, & \text{otherwise}\,, \end{cases} \qquad \text{for } c_j \ne 0\,; \quad \rho_j \text{ is arbitrary for } c_j = 0\,. $$
Furthermore, we consider the variables $t_j \in P$ as random variables distributed with probability distribution $\rho_j$. The expected value of $z(t_1, \dots, t_n)$ is defined as
$$ E[z] := \int_P \cdots \int_P z(t_1, \dots, t_n)\, \rho_1(t_1) \cdots \rho_n(t_n)\, dt_1 \cdots dt_n\,. $$

With $c_j$ and $\rho_j$ as above and $f$ as in (1.5) we obtain that
$$ E\Big[\, \big\| f - \sum_{j=1}^n c_j\, \sigma(\cdot; t_j) \big\|_{H^m(\Omega)}^2 \Big] = \|f\|_{H^m(\Omega)}^2 - 2 \sum_{j=1}^n c_j\, \Big\langle f, \int_P \rho_j(t_j)\, \sigma(\cdot; t_j)\, dt_j \Big\rangle_{H^m(\Omega)} $$
$$ \qquad + \sum_{\substack{i,j=1 \\ i \ne j}}^n c_i c_j\, \Big\langle \int_P \rho_i(t_i)\, \sigma(\cdot; t_i)\, dt_i\,, \int_P \rho_j(t_j)\, \sigma(\cdot; t_j)\, dt_j \Big\rangle_{H^m(\Omega)} + \sum_{j=1}^n c_j^2 \int_P \rho_j(t_j)\, \|\sigma(\cdot; t_j)\|_{H^m(\Omega)}^2\, dt_j $$
$$ = \Big\| \int_P \Big[ h(t) - \sum_{j=1}^n c_j\, \rho_j(t) \Big] \sigma(\cdot; t)\, dt \Big\|_{H^m(\Omega)}^2 + \sum_{j=1}^n c_j^2 \Big[ \int_P \rho_j(t)\, \|\sigma(\cdot; t)\|_{H^m(\Omega)}^2\, dt - \Big\| \int_P \rho_j(t)\, \sigma(\cdot; t)\, dt \Big\|_{H^m(\Omega)}^2 \Big] $$
$$ = \sum_{j=1}^n c_j^2 \sum_{|\gamma| \le m} \int_\Omega \Big[ \int_P \rho_j(t) \Big( \tfrac{\partial^{|\gamma|}}{\partial x^\gamma} \sigma(x; t) \Big)^2 dt - \Big( \int_P \rho_j(t)\, \tfrac{\partial^{|\gamma|}}{\partial x^\gamma} \sigma(x; t)\, dt \Big)^2 \Big]\, dx $$
$$ = \sum_{j=1}^n c_j^2 \sum_{|\gamma| \le m} \int_\Omega \int_{P_j} \rho_j(t) \Big( \int_{P_j} \rho_j(s) \Big( \tfrac{\partial^{|\gamma|}}{\partial x^\gamma} \sigma(x; t) - \tfrac{\partial^{|\gamma|}}{\partial x^\gamma} \sigma(x; s) \Big)\, ds \Big)^2 dt\, dx\,. $$

Noting that $h \in L^\infty(P)$ and (2.2) imply that $c_j = O(\tfrac1{\tilde n})$, we now obtain together with (2.1) and (2.2) that
$$ E\Big[\, \big\| f - \sum_{j=1}^n c_j\, \sigma(\cdot; t_j) \big\|_{H^m(\Omega)}^2 \Big] \le \sum_{j=1}^n c_j^2 \sum_{|\gamma| \le m} \int_\Omega \int_{P_j} \rho_j(t) \int_{P_j} \rho_j(s) \Big( \tfrac{\partial^{|\gamma|}}{\partial x^\gamma} \sigma(x; t) - \tfrac{\partial^{|\gamma|}}{\partial x^\gamma} \sigma(x; s) \Big)^2 ds\, dt\, dx $$
$$ = \sum_{j=1}^n c_j^2 \int_{P_j} \rho_j(t) \int_{P_j} \rho_j(s)\, \|\sigma(\cdot; t) - \sigma(\cdot; s)\|_{H^m(\Omega)}^2\, ds\, dt = O\big(\tilde n \cdot \tilde n^{-2} \cdot \tilde n^{-\frac{2\alpha}{p}}\big) = O\big(n^{-1 - \frac{2\alpha}{p}}\big)\,. $$
Therefore, there exists a set of elements $t_j^* \in P$ such that
$$ \inf_{g \in X_n} \|f - g\|_{H^m(\Omega)} \le \Big\| f - \sum_{j=1}^n c_j\, \sigma(\cdot; t_j^*) \Big\|_{H^m(\Omega)} \le E\Big[\, \big\| f - \sum_{j=1}^n c_j\, \sigma(\cdot; t_j) \big\|_{H^m(\Omega)}^2 \Big]^{\frac12} = O\big(n^{-\frac12 - \frac{\alpha}{p}}\big)\,, $$

where $c_j$ is as above. $\Box$

We think that the proposition above is also true if $h \in L^2(P)$. However, the choice of the subsets $P_j$ in (2.2) would have to be more tricky, since $c_j = O(\tfrac1{\tilde n})$ will no longer hold in general.

We will now turn to other estimates in spaces $W^{m,r}(\Omega)$. The error bounds will depend on the dimension $p$ of $P \subseteq \mathbb{R}^p$. The proofs are based on the following results from finite-element theory: Let
$$ P := \prod_{i=1}^p [\underline p_i, \overline p_i] \qquad \text{and} \qquad P_{l_1 \cdots l_p} := \prod_{i=1}^p \Big[ \underline p_i + \tfrac{\overline p_i - \underline p_i}{\kappa}\, l_i\,,\; \underline p_i + \tfrac{\overline p_i - \underline p_i}{\kappa}\, (l_i + 1) \Big]\,, \qquad \kappa \in \mathbb{N}\,. $$
Then, obviously,
$$ P = \bigcup_{\substack{l_i = 0, \dots, \kappa - 1 \\ i = 1, \dots, p}} P_{l_1 \cdots l_p}\,. $$
Moreover, we define the knots
$$ t_{j_1 \cdots j_p} := (t_{j_1 \cdots j_p,1}, \dots, t_{j_1 \cdots j_p,p}) \in \mathbb{R}^p\,, \qquad t_{j_1 \cdots j_p,i} := \underline p_i + \tfrac{\overline p_i - \underline p_i}{k\kappa}\, j_i\,, \qquad j_i = 0, \dots, k\kappa\,. \qquad (2.3) $$
Then for all $k l_i \le \alpha_i \le k(l_i + 1)$ there exists a unique polynomial function $q_{\alpha_1 \cdots \alpha_p} \in Q_{k, l_1 \cdots l_p}$,
$$ Q_{k, l_1 \cdots l_p} := \Big\{\, q(t) = \sum_{0 \le j_i \le k,\; 1 \le i \le p} c_{j_1 \cdots j_p}\, t_1^{j_1} \cdots t_p^{j_p} \;:\; t = (t_1, \dots, t_p) \in P_{l_1 \cdots l_p} \,\Big\}\,, \qquad (2.4) $$
satisfying
$$ q_{\alpha_1 \cdots \alpha_p}(t_{j_1 \cdots j_p}) = \prod_{i=1}^p \delta_{\alpha_i j_i}\,, \qquad k l_i \le \alpha_i, j_i \le k(l_i + 1)\,. \qquad (2.5) $$
The function $u_I$, defined by
$$ u_I\big|_{P_{l_1 \cdots l_p}} := \sum_{k l_i \le j_i \le k(l_i + 1)} u(t_{j_1 \cdots j_p})\, q_{j_1 \cdots j_p}\,, \qquad (2.6) $$
interpolates $u \in C(P)$ at the knots $t_{j_1 \cdots j_p}$, $0 \le j_i \le k\kappa$, $1 \le i \le p$. Note that $u_I \in C(P) \cap H^1(P)$.

Proposition 2.2. Let $P \subseteq \mathbb{R}^p$ be rectangular. If $u \in H^k(P)$ with $k > \frac p2$, then there is a constant $c > 0$ such that for all multi-indices $\gamma$ with $|\gamma| = \mu < k$ and for all $l_i \in \{0, \dots, \kappa - 1\}$, $i = 1, \dots, p$, it holds that
$$ \|D^\gamma (u - u_I)\|_{L^2(P_{l_1 \cdots l_p})} \le c\, \kappa^{-(k - \mu)}\, |u|_{H^k(P_{l_1 \cdots l_p})}\,. \qquad (2.7) $$
If $u \in C^k(P)$, then there is a constant $c > 0$ such that for all multi-indices $\gamma$ with $|\gamma| = \mu < k$ and for all $l_i \in \{0, \dots, \kappa - 1\}$, $i = 1, \dots, p$, it holds that
$$ \|D^\gamma (u - u_I)\|_{L^\infty(P_{l_1 \cdots l_p})} \le c\, \kappa^{-(k - \mu)} \max_{|\gamma| = k} \|D^\gamma u\|_{L^\infty(P_{l_1 \cdots l_p})}\,. \qquad (2.8) $$


Proof. The proof follows from Theorems 3.1 and 3.3 in [8]. $\Box$
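Proposition 2.2 is a standard tensor-product interpolation estimate. The following one-dimensional sketch (my own check, $p = 1$, not from the paper) observes the mesh-refinement decay of (2.8) with $\mu = 0$; for the very smooth $u$ used here the empirical order is even $k + 1$ rather than the guaranteed $k$:

```python
import numpy as np

# Sketch (p = 1; my own check, not from the paper): piecewise polynomial
# interpolation of degree k on kappa cells of [0, 1].  Halving the cell width
# should shrink the sup-error by about 2**(k+1) for a C^{k+1} function,
# consistent with (and slightly better than) the kappa^{-k} bound in (2.8).

def interp_error(u, k, kappa, samples=2000):
    """Sup-norm error of degree-k piecewise Lagrange interpolation on [0, 1]."""
    err = 0.0
    for l in range(kappa):
        a, b = l / kappa, (l + 1) / kappa
        knots = np.linspace(a, b, k + 1)          # k+1 knots t_j inside the cell
        coeff = np.polyfit(knots, u(knots), k)    # interpolating polynomial q
        x = np.linspace(a, b, samples // kappa)
        err = max(err, np.max(np.abs(u(x) - np.polyval(coeff, x))))
    return err

u = np.sin
k = 3
e1 = interp_error(u, k, 4)
e2 = interp_error(u, k, 8)
print(e1, e2, e1 / e2)  # halving the mesh divides the error by roughly 2**(k+1)
```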

For our main result we need the following types of smoothness of $\sigma$: $\sigma \in W^{m,r}(\Omega; Y)$ with $Y = H^k(P)$ or $Y = C^k(P)$, equipped with the norms
$$ \|\sigma\|_{W^{m,r}(\Omega; Y)} := \begin{cases} \Big( \sum_{|\gamma| \le m} \int_\Omega \big\| \tfrac{\partial^{|\gamma|}}{\partial x^\gamma} \sigma(x; \cdot) \big\|_Y^r\, dx \Big)^{\frac1r}\,, & 1 \le r < \infty\,, \\ \max_{|\gamma| \le m}\; \operatorname{ess\,sup}_{x \in \Omega} \big\| \tfrac{\partial^{|\gamma|}}{\partial x^\gamma} \sigma(x; \cdot) \big\|_Y\,, & r = \infty\,. \end{cases} $$

Theorem 2.3. Let $X_n$ be defined as in (1.4) with $P \subseteq \mathbb{R}^p$ bounded and rectangular, and let $\sigma \in W^{m,r}(\Omega; Y)$ with $Y = H^k(P)$, $k > \frac p2$, or $Y = C^k(P)$. Moreover, let $f \in W^{m,r}(\Omega)$ satisfy (1.5) with $h \in L^2(P)$ if $Y = H^k(P)$ and $h \in L^1(P)$ if $Y = C^k(P)$. Then we obtain the rate
$$ \inf_{g \in X_n} \|f - g\|_{W^{m,r}(\Omega)} = O\big(n^{-\frac kp}\big)\,. $$

Proof. If we choose $c_j$ as
$$ c_j := \int_P h(t)\, \phi_j(t)\, dt\,, \qquad \phi_j \in L^\infty(P)\,, $$
with $h$ as in (1.5), then we obtain that
$$ \Big\| f - \sum_{j=1}^n c_j\, \sigma(\cdot; t_j) \Big\|_{W^{m,r}(\Omega)} = \Big\| \int_P h(t) \Big( \sigma(\cdot; t) - \sum_{j=1}^n \phi_j(t)\, \sigma(\cdot; t_j) \Big)\, dt \Big\|_{W^{m,r}(\Omega)}\,. $$
Let us define $\kappa := \lfloor n^{\frac1p}/k \rfloor - 1$ and $\tilde n := (k\kappa + 1)^p \le n$. Then we choose $t_j$ and $\phi_j$ as follows: For $j = \tilde n + 1, \dots, n$ let $t_j$ be arbitrary and $\phi_j \equiv 0$. For $j = 1, \dots, \tilde n$ let $t_j$ and $\phi_j$ be the appropriate knots and basis functions such that the sum above equals the interpolating function $\sigma_I(\cdot; t)$ (see (2.3)-(2.6)), i.e.,
$$ \Big\| f - \sum_{j=1}^n c_j\, \sigma(\cdot; t_j) \Big\|_{W^{m,r}(\Omega)} = \Big\| \int_P h(t) \big( \sigma(\cdot; t) - \sigma_I(\cdot; t) \big)\, dt \Big\|_{W^{m,r}(\Omega)}\,. $$
Note that this interpolation property also holds for all derivatives of $\sigma$ with respect to $x$. Applying (2.7) ($\mu = 0$) for $Y = H^k(P)$ and (2.8) ($\mu = 0$) for $Y = C^k(P)$ we obtain the estimates
$$ \Big\| f - \sum_{j=1}^n c_j\, \sigma(\cdot; t_j) \Big\|_{W^{m,r}(\Omega)} \le c_0\, \kappa^{-k}\, \|h\|_{L^2(P)}\, \|\sigma\|_{W^{m,r}(\Omega; H^k(P))} \qquad (2.9) $$
and
$$ \Big\| f - \sum_{j=1}^n c_j\, \sigma(\cdot; t_j) \Big\|_{W^{m,r}(\Omega)} \le c_0\, \kappa^{-k}\, \|h\|_{L^1(P)}\, \|\sigma\|_{W^{m,r}(\Omega; C^k(P))}\,, \qquad (2.10) $$
respectively. Now the assertion follows together with the fact that $\kappa \sim n^{\frac1p}$. $\Box$

Remark 2.4. The idea of choosing $c_j$, $t_j$, and $\phi_j$ as in the proof above was found in a paper by Wahba [9] for one-dimensional $P$. This idea was extended here to higher dimensions, i.e., $P \subseteq \mathbb{R}^p$. The following extensions of Theorem 2.3 are obvious from the proof:

- If $P$ is not rectangular but $\operatorname{supp}(h) \subseteq \tilde P \subseteq P$ with $\tilde P$ rectangular, then the results are still valid.

- If $Y = C^k(P)$, the condition (1.5) for $f$ with $h \in L^1(P)$ may be replaced by: $f$ is such that there exists a uniformly bounded sequence $h_l$ in $L^1(P)$ with
$$ \Big\| f - \int_P h_l(t)\, \sigma(\cdot; t)\, dt \Big\|_{W^{m,r}(\Omega)} \to 0 \quad \text{as } l \to \infty\,. $$

- Condition (1.5) may be generalized to
$$ f(x) = \sum_{|\gamma| \le \nu} \int_P h_\gamma(t)\, \frac{\partial^{|\gamma|}}{\partial t^\gamma} \sigma(x; t)\, dt\,, \qquad \nu < k\,. \qquad (2.11) $$
If the functions $\phi_j$ are chosen such that for each $\gamma$ they coincide with the appropriate derivative of the basis functions $q_{j_1 \cdots j_p}$ in $P_{l_1 \cdots l_p}$, we obtain together with Proposition 2.2 the rates
$$ \inf_{g \in X_n} \|f - g\|_{W^{m,r}(\Omega)} = O\big(n^{-\frac{k-\nu}{p}}\big)\,. $$

Finally, we want to mention that the rates above and in Theorem 2.3 decrease with increasing dimension $p$. There is no dimension-independent term like $n^{-\frac12}$ in (1.6) or Theorem 2.1. Since the estimates in the proof of Theorem 2.3 are based on a fixed choice of knots $t_j$, this dependence on $p$ is to be expected. We were not able to improve the rates for a possibly optimal choice of knots. However, since Proposition 2.2 is valid also for many other non-uniform choices of knots $t_j$, the rates in Theorem 2.3 are valid for many choices of $t_j$ (also non-optimal ones) if at least $c_j$ is chosen optimally.
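As a numerical sanity check on the constructive proof of Theorem 2.1, here is a sketch (entirely my own, with $p = 1$, $m = 0$, $\alpha = 1$, and an assumed density $h$; none of it is from the paper) that builds $c_j = \int_{P_j} h$ with one knot per cell $P_j$ and watches the grid $L^2$ error of the resulting network decay as $n$ grows:

```python
import numpy as np

# Sketch (my illustration): the construction in the proof of Theorem 2.1 in
# the simplest case p = 1, m = 0.  Target
#   f(x) = int_0^1 h(t) * s(x - t) dt            (cf. (1.5))
# with a Lipschitz activation s, so (2.1) holds with alpha = 1.

def s(u):
    return np.clip((u + 1.0) / 2.0, 0.0, 1.0)    # clipped linear sigmoid

def h(t):
    return 1.0 + np.sin(2.0 * np.pi * t)         # assumed density in (1.5)

xs = np.linspace(-2.0, 2.0, 401)                 # evaluation grid on Omega

def quad(fun, a, b, m=2000):                     # simple midpoint quadrature
    t = a + (b - a) * (np.arange(m) + 0.5) / m
    return (b - a) * fun(t).sum() / m

# reference values of f on the grid
f = np.array([quad(lambda t: h(t) * s(x - t), 0.0, 1.0) for x in xs])

def network_error(n):
    """Grid L2 error of sum_j c_j s(x - t_j), c_j = int_{P_j} h, t_j = midpoint."""
    edges = np.linspace(0.0, 1.0, n + 1)
    tj = 0.5 * (edges[:-1] + edges[1:])          # one knot per cell P_j
    cj = np.array([quad(h, a, b) for a, b in zip(edges[:-1], edges[1:])])
    fn = s(xs[:, None] - tj[None, :]) @ cj
    return np.sqrt(np.mean((f - fn) ** 2))

e8, e64 = network_error(8), network_error(64)
print(e8, e64)   # the error decreases as the number of neurons grows
```

The theorem only guarantees the existence of good knots; the deterministic midpoint choice above is one concrete selection for which the error is already small.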

3. Application to perceptrons

We now apply the results of the previous section to perceptrons with a single hidden layer, namely ridge constructions (cf. (1.2)), where $\sigma$ is a function of sigmoidal form, i.e.,
$$ X_n = \Big\{\, g = \sum_{j=1}^n c_j\, \sigma(a_j^T x + b_j) \;:\; a_j \in A \subseteq \mathbb{R}^d\,,\; b_j \in B \subseteq \mathbb{R} \,\Big\} $$
and $\sigma$ is piecewise continuous, monotonically increasing, and such that
$$ \lim_{t \to -\infty} \sigma(t) = 0 \qquad \text{and} \qquad \lim_{t \to +\infty} \sigma(t) = 1\,. $$
If $\sigma$ is such that
$$ \sigma(t) := \begin{cases} 1\,, & t > 1\,, \\ p(t)\,, & -1 \le t \le 1\,, \\ 0\,, & t < -1\,, \end{cases} \qquad (3.1) $$
with $p$ the unique polynomial of degree $2k + 1$ satisfying
$$ p(-1) = 0\,, \qquad p(1) = 1\,, \qquad p^{(i)}(-1) = p^{(i)}(1) = 0\,, \quad i = 1, \dots, k\,, \qquad (3.2) $$
then $\sigma \in C^k(\mathbb{R})$.

Example 3.1. We first consider the piecewise linear sigmoidal function
$$ \sigma(t) := \begin{cases} 1\,, & t > 1\,, \\ \tfrac{t+1}{2}\,, & -1 \le t \le 1\,, \\ 0\,, & t < -1\,, \end{cases} \qquad (3.3) $$
and let $A := \prod_{i=1}^d [-\underline a_i, \underline a_i]$ and $B := [-\underline b, \underline b]$ with $\underline a_i > 0$ and $\underline b > 0$ such that
$$ \forall\, a \in A\;\; \forall\, x \in \Omega: \quad |a^T x| \le \underline b - 1\,. $$
Since $\sigma(x; a, b) := \sigma(a^T x + b)$ satisfies (2.1) with $m = 0$ and $\alpha = 1$, Theorem 2.1 implies that
$$ \inf_{g \in X_n} \|f - g\|_{L^2(\Omega)} = O\big(n^{-\frac12 - \frac{1}{d+1}}\big) $$
if
$$ f(x) = \int_A \int_{-\underline b}^{\underline b} h(a, b)\, \sigma(a^T x + b)\, db\, da = \int_A \Big[ \int_{-1 - a^Tx}^{1 - a^Tx} h(a, b)\, \tfrac{1 + a^Tx + b}{2}\, db + \int_{1 - a^Tx}^{\underline b} h(a, b)\, db \Big]\, da \qquad (3.4) $$
for some $h \in L^\infty(A \times B)$.
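The polynomial $p$ of degree $2k + 1$ used in the smooth sigmoid above can be computed by solving a small linear system; a sketch (my own, assuming the gluing conditions $p(-1) = 0$, $p(1) = 1$, $p^{(i)}(\pm 1) = 0$ for $i = 1, \dots, k$, which make $\sigma$ a $C^k$ sigmoid):

```python
import numpy as np

# Sketch: the unique polynomial p of degree 2k+1 gluing 0 and 1 together in a
# C^k way, i.e. (assumed conditions) p(-1) = 0, p(1) = 1, and
# p^(i)(+-1) = 0 for i = 1..k.  We solve the (2k+2)x(2k+2) linear system.

def smooth_step_poly(k):
    """Coefficients of p (lowest degree first) for the C^k sigmoid."""
    deg = 2 * k + 1
    rows, rhs = [], []
    for t, val in ((-1.0, 0.0), (1.0, 1.0)):
        for i in range(k + 1):                   # i-th derivative condition at t
            row = np.zeros(deg + 1)
            for m in range(i, deg + 1):
                # d^i/dt^i of t^m evaluated at t: m*(m-1)*...*(m-i+1) * t^(m-i)
                row[m] = np.prod(np.arange(m - i + 1, m + 1)) * t ** (m - i)
            rows.append(row)
            rhs.append(val if i == 0 else 0.0)
    return np.linalg.solve(np.array(rows), np.array(rhs))

p1 = smooth_step_poly(1)                 # k = 1: the cubic step (2 + 3t - t^3)/4
print(np.polyval(p1[::-1], np.linspace(-1.0, 1.0, 5)))
```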

Example 3.2. We now consider the general case, where $\sigma$ is defined by (3.1), (3.2), and where $A$ and $B$ are as in Example 3.1. Since $\sigma(x; a, b) := \sigma(a^T x + b)$ satisfies $\sigma \in W^{m,\infty}(\Omega; C^{k-m}(A \times B))$ ($m \le k$) and $\sigma \in W^{m,\infty}(\Omega; H^{k+1-m}(A \times B))$ ($m \le k + 1$), we may apply Theorem 2.3 to obtain
$$ \inf_{g \in X_n} \|f - g\|_{W^{m,r}(\Omega)} = O\big(n^{-\frac{k-m}{d+1}}\big) $$
if $f \in W^{m,r}(\Omega)$ satisfies
$$ f(x) = \int_A \Big[ \int_{-1 - a^Tx}^{1 - a^Tx} h(a, b)\, p(a^T x + b)\, db + \int_{1 - a^Tx}^{\underline b} h(a, b)\, db \Big]\, da \qquad (3.5) $$
for some $h \in L^1(A \times B)$, and
$$ \inf_{g \in X_n} \|f - g\|_{W^{m,r}(\Omega)} = O\big(n^{-\frac{k+1-m}{d+1}}\big) $$
if $f \in W^{m,r}(\Omega)$ satisfies (3.5) for some $h \in L^2(A \times B)$ and $k + 1 - m > \frac{d+1}{2}$. Note that for $m = 0$ and $k > \frac{d+1}{2}$ the rate above is better than the one in Example 3.1.

From both examples we can see that the conditions (3.4) and (3.5) can only be satisfied if $f$ is several times differentiable. We will now give a sufficient condition on $f$ that guarantees (3.4): Let $\varepsilon_0 := 0$ and $\varepsilon_n := \frac{\pi}{2}(4 n^j - 3)$, $n \in \mathbb{N}$, for some $j \in \mathbb{N}$ to be specified later, and let $\delta_n := \varepsilon_n / \varepsilon_{n+1}$. We define the function $h$ as follows:

h(a; b) = where

n(a) :=

(

1

X

n=1

(n(a) cos(b"n) + n(a) sin(b"n))

(3.6)

?(2)? d "3n=f^(a"n) ; if a 2 Ann?1A ; 2

0; else ; ( (3.7) d ? "3 0, p > 1, and j > p p?1 , implies that 8

< 1;

k

 X

n=1

p p?1 1

"n? p?

Z

Z

b

A ?b

jh(a; b)jpdb da

1

= O

 X Z

= O

 X Z

"n(3+ )p jf^(a"n)jpda



n=1 Ann?1 A

1

n=1 "n An"n?1 A

"n(3+ )p?1 jf^(z)jpdz



if j is suciently large and = 0 for p = 1 and > 0 for p > 1 which we assume to hold in the following. Since

$$ \exists\, C > 0\;\; \forall\, z \in \varepsilon_n A \setminus \varepsilon_{n-1} A: \quad \varepsilon_n^{(3+\varsigma)p - d} \le C \big( 1 + |z|^{\,3 + \varsigma - \frac dp} \big)^p\,, $$
we finally obtain that
$$ \int_A \int_{-\underline b}^{\underline b} |h(a, b)|^p\, db\, da = O\Big( \sum_{n=1}^\infty \int_{\varepsilon_n A \setminus \varepsilon_{n-1} A} \big( 1 + |z|^{\,3 + \varsigma - \frac dp} \big)^p\, |\hat f(z)|^p\, dz \Big) = O\Big( \int_{\mathbb{R}^d} \big( 1 + |z|^{\,3 + \varsigma - \frac dp} \big)^p\, |\hat f(z)|^p\, dz \Big)\,. $$
This proves the assertion for $p < \infty$.

Let us now consider the case $p = \infty$: We assume that $\varsigma > 0$ and that $j > \frac1\varsigma$. Then we obtain for all $a \in \delta_k A \setminus \delta_{k-1} A$ that
$$ |h(a, b)| \le \sum_{n=1}^k \big( |\alpha_n(a)| + |\beta_n(a)| \big) = O\Big( \sum_{n=1}^k \varepsilon_n^3\, |\hat f(a \varepsilon_n)| \Big) = O\Big( \sum_{n=1}^k \big( 1 + (|a| \varepsilon_n)^{3+\varsigma} \big)\, \varepsilon_n^{-\varsigma}\, |\hat f(a \varepsilon_n)| \Big) = O\big( \big\| (1 + |\cdot|^{\,3+\varsigma})\, \hat f(\cdot) \big\|_{L^\infty(\mathbb{R}^d)} \big)\,. $$
This proves the assertion for $p = \infty$. $\Box$
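The construction hinges on the fact that $\sin(\varepsilon_n) = 1$ for every frequency $\varepsilon_n$, which is what lets the $b$-integrals in the proof of Proposition 3.4 collapse analytically. A quick numerical check (the form $\varepsilon_n = \frac{\pi}{2}(4 n^j - 3)$ is my reading of the definition above; any frequency of the form $\frac{\pi}{2}(4m - 3)$, $m \in \mathbb{N}$, works, since $4m - 3 \equiv 1 \pmod 4$):

```python
import math

# Quick check (assuming frequencies eps_n = (pi/2) * (4*n**j - 3)): every
# eps_n is pi/2 plus an integer multiple of 2*pi, hence sin(eps_n) = 1.

def eps(n, j):
    return 0.5 * math.pi * (4 * n ** j - 3)

vals = [math.sin(eps(n, j)) for n in range(1, 6) for j in (2, 3)]
print(vals)  # all values are (numerically) equal to 1
```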

Proposition 3.4. Let $f$, $A$, and $B$ satisfy the conditions in Lemma 3.3. Moreover, let $f$ be such that $(1 + |\cdot|)\, \hat f(\cdot) \in L^1(\mathbb{R}^d)$. Then $f$ has an integral representation (3.4) for some $h \in L^p(A \times B)$.

Proof. With the special choice of $h$ as in (3.6) and (3.7) we know from Lemma 3.3 that $h \in L^p(A \times B)$. We will now show that
$$ g(x) := \int_A \Big[ \int_{-1 - a^Tx}^{1 - a^Tx} h(a, b)\, \tfrac{1 + a^Tx + b}{2}\, db + \int_{1 - a^Tx}^{\underline b} h(a, b)\, db \Big]\, da $$
$$ = \sum_{k=1}^\infty \int_{\delta_k A \setminus \delta_{k-1} A} \sum_{n=1}^k \Big[ \alpha_n(a) \Big( \int_{-1 - a^Tx}^{1 - a^Tx} \cos(b \varepsilon_n)\, \tfrac{1 + a^Tx + b}{2}\, db + \int_{1 - a^Tx}^{\underline b} \cos(b \varepsilon_n)\, db \Big) $$
$$ \qquad + \beta_n(a) \Big( \int_{-1 - a^Tx}^{1 - a^Tx} \sin(b \varepsilon_n)\, \tfrac{1 + a^Tx + b}{2}\, db + \int_{1 - a^Tx}^{\underline b} \sin(b \varepsilon_n)\, db \Big) \Big]\, da $$
is identical to $f$ up to a constant. The integrals with respect to $b$ may be calculated analytically. Together with $\sin(\varepsilon_n) = 1$ this yields that
$$ g(x) = \sum_{k=1}^\infty \int_{\delta_k A \setminus \delta_{k-1} A} \sum_{n=1}^k \Big[ \alpha_n(a) \big( \varepsilon_n^{-1} \sin(\underline b\, \varepsilon_n) + \varepsilon_n^{-2} \sin(a^Tx\, \varepsilon_n) \big) + \beta_n(a) \big( -\varepsilon_n^{-1} \cos(\underline b\, \varepsilon_n) + \varepsilon_n^{-2} \cos(a^Tx\, \varepsilon_n) \big) \Big]\, da $$
$$ = (2\pi)^{-\frac d2} \sum_{k=1}^\infty \int_{\delta_k A \setminus \delta_{k-1} A} \sum_{n=1}^k \cdots $$