
Complexity of Gaussian Radial Basis Networks Approximating Smooth Functions


Paul C. Kainen^{a,1}, Věra Kůrková^{b,2,3}, Marcello Sanguineti^{c,*,3,4}

^a Department of Mathematics, Georgetown University, Washington, D.C. 20057-1233, USA
^b Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 182 07 Prague 8, Czech Republic
^c Department of Communications, Computer, and System Sciences (DIST), University of Genoa, Via Opera Pia 13, 16145 Genova, Italy

* Corresponding author.
Email addresses: [email protected] (Paul C. Kainen), [email protected] (Věra Kůrková), [email protected] (Marcello Sanguineti).
1 Collaboration of P. C. K. with V. K. and M. S. was partially supported by Georgetown University.
2 Partially supported by GA ČR grant 201/08/1744 and by the Institutional Research Plan AV0Z10300504.
3 Collaboration between V. K. and M. S. was partially supported by the 2007-2009 Scientific Agreement among the University of Genoa, the National Research Council of Italy, and the Academy of Sciences of the Czech Republic.
4 Partially supported by a PRIN grant from the Italian Ministry for University and Research, project "Models and Algorithms for Robust Network Optimization".

Abstract


The complexity of Gaussian radial-basis-function networks with varying widths is investigated. Upper bounds on the rates of decrease of approximation errors with an increasing number of hidden units are derived. The bounds are in terms of norms measuring smoothness (Bessel and Sobolev norms) multiplied by explicitly given functions $a(r, d)$ of the number $d$ of variables and the degree $r$ of smoothness. The estimates are proven using suitable integral representations in the form of networks with continua of hidden units computing scaled Gaussians and translated Bessel potentials. Consequences for the tractability of approximation by Gaussian radial-basis-function networks are discussed.

Key words: Gaussian radial-basis-function networks, rates of approximation, model complexity, variation norms, Bessel and Sobolev norms, tractability of approximation

1 Introduction

Radial-basis-function (RBF) networks with Gaussian computational units are known to approximate to arbitrary accuracy all continuous functions and all $L^2$ functions on compact subsets of $\mathbb{R}^d$ [10,23,27,30,31]. In such approximations, the number $n$ of RBF units serves as a measure of model complexity, and its size determines the feasibility of a network implementation.


Several authors have investigated rates of approximation by RBF networks with $n$ Gaussian units of fixed width. Girosi and Anzellotti [9] derived an asymptotic upper bound of order $n^{-1/2}$ on the approximation error, measured in the supremum norm, for band-limited functions with continuous derivatives up to order $r$, with $r > d/2$, where $d$ is the number of variables [9, p. 106]. Using results from statistical learning theory, Girosi [8] extended these bounds to more general classes of kernels. For Gaussians of varying widths, Kon, Raphael, and Williams [14, Corollary 3] obtained bounds on a weighted $L^\infty$-distance from the target function to a linear combination of Gaussians.


Some bounds improve on the exponent $-1/2$. Mhaskar [24,25] and Narcowich et al. [29] obtained bounds of order $n^{-r/2d}$, and in one special case Maiorov [21] found $n^{-r/(d-1)}$. Although the order with respect to $n$ improves, the remaining multiplicative factors in such bounds involve unknown constants, and the upper bounds increase as $d$ increases. Also, the bounds apply to different classes of target and approximating functions, their dependence on parameters may differ, and the approximation error is computed with respect to different metrics. Thus it is not easy to compare these bounds.

In this paper, we approximate smooth functions by Gaussian RBF networks with units of varying widths, using the $L^2$-distance with respect to Lebesgue measure. We derive upper bounds on rates of approximation in terms of the Bessel and Sobolev norms of the functions to be approximated. Bessel norms are defined in terms of convolutions with Bessel-potential kernels, while Sobolev norms use integrals of partial derivatives. Bessel and Sobolev norms of the same degree are equivalent, but the ratios between them depend on the number $d$ of variables.

Our estimates hold for all numbers $n$ of hidden units and all degrees $r > d/2$ of Bessel potentials. The estimates are of the form $n^{-1/2}$ times the Bessel norm $\|f\|_{L^{1,r}}$ of the function $f$ to be approximated, times a factor $k(r, d)$. For a fixed $c > 0$ and the degree $r_d = d/2 + c$, the factor $k(r_d, d)$ decreases to zero exponentially fast. We also derive estimates in terms of $L^2$ Bessel and Sobolev norms. Our results show that reasonably smooth functions can be approximated quite efficiently by Gaussian radial-basis networks. A preliminary version of the results appeared in [13].

The paper is organized as follows. Section 2 presents concepts, notation, and auxiliary results for studying approximation by Gaussian RBF networks. Section 3 derives upper bounds on rates of approximation of Bessel potentials by linear combinations of scaled Gaussians, in terms of variation norms obtained from integral representations of Bessel potentials and their Fourier transforms. In Section 4, for functions representable as convolutions with Bessel potentials, upper bounds are derived in terms of Bessel-potential norms; these are then combined with the estimates of variational norms from the previous section to obtain bounds on approximation by Gaussian RBFs in terms of Bessel norms. Section 5 uses the relationship between Sobolev and Bessel norms to obtain bounds in terms of Sobolev norms. In Section 6, we discuss consequences for the tractability of multivariate approximation by Gaussian radial-basis networks.

2 Approximation by Gaussian RBF networks


For $\Omega \subseteq \mathbb{R}^d$, $L^2(\Omega)$ denotes the space of real-valued functions on $\Omega$ with norm $\|f\|_{L^2(\Omega)} = \left( \int_\Omega |f(x)|^2\, dx \right)^{1/2}$. Two functions are identified if they differ only on a set of Lebesgue measure zero. When $\Omega = \mathbb{R}^d$, we omit it in the notation.


For nonzero $f \in L^2$, $f^o = f/\|f\|_{L^2}$ denotes the normalization of $f$; for convenience, we put $0^o = 0$. For $F \subset L^2$, $F|_\Omega$ denotes the set of functions from $F$ restricted to $\Omega$, $\hat{F}$ the set of Fourier transforms of functions in $F$, and $F^o$ the set of their normalizations. For $n \geq 1$, define
$$\mathrm{span}_n F := \left\{ \sum_{i=1}^n w_i f_i \;\middle|\; f_i \in F,\ w_i \in \mathbb{R} \right\}.$$

In this paper, we investigate accuracy measured by the $L^2$-norm with respect to Lebesgue measure $\lambda$ in approximation by Gaussian radial-basis-function networks.

A Gaussian radial-basis-function unit with $d$ inputs computes all scaled and translated Gaussian functions on $\mathbb{R}^d$. For $b > 0$, let $\gamma_b : \mathbb{R}^d \to \mathbb{R}$ denote the Gaussian function of width $b$ defined by
$$\gamma_b(x) = e^{-b\|x\|^2}. \tag{1}$$
A simple calculation shows that $\|\gamma_b\|_{L^2} = (\pi/2b)^{d/4}$. Let
$$G_0 = \{\gamma_b \mid b > 0\}$$
denote the set of Gaussians centered at 0 with varying widths. For $\tau_y$ the translation operator, defined for any $y \in \mathbb{R}^d$ and any $f$ on $\mathbb{R}^d$ as $(\tau_y f)(x) = f(x - y)$, let
$$G = \{\tau_y \gamma_b \mid y \in \mathbb{R}^d,\ b > 0\}$$
denote the set of all translations of the Gaussians with varying widths.
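For readers who want to check the normalization numerically, the following sketch (ours, not part of the original text; the function names are illustrative) confirms $\|\gamma_b\|_{L^2} = (\pi/2b)^{d/4}$ by exploiting the product structure of the $d$-dimensional integral:

```python
import numpy as np
from scipy.integrate import quad

def gaussian_l2_norm(b: float, d: int) -> float:
    """Closed form ||gamma_b||_{L^2} = (pi/(2b))^{d/4} for gamma_b(x) = exp(-b ||x||^2)."""
    return (np.pi / (2.0 * b)) ** (d / 4.0)

def gaussian_l2_norm_numeric(b: float, d: int) -> float:
    """Numerical check: gamma_b(x)^2 = prod_j exp(-2 b x_j^2), so the d-dimensional
    integral of gamma_b^2 factors into d identical one-dimensional integrals."""
    one_dim, _ = quad(lambda t: np.exp(-2.0 * b * t * t), -np.inf, np.inf)
    return one_dim ** (d / 2.0)  # square root of the product of d one-dim integrals

for b, d in [(0.5, 1), (1.0, 3), (2.0, 5)]:
    assert np.isclose(gaussian_l2_norm(b, d), gaussian_l2_norm_numeric(b, d))
print("norm formula ||gamma_b||_{L^2} = (pi/2b)^{d/4} verified")
```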


We investigate rates of approximation by networks with $n$ Gaussian RBF units and one linear output unit, which compute functions from the set $\mathrm{span}_n G$. We exploit properties of the Fourier transform of the Gaussian function. The $d$-dimensional Fourier transform is the operator $\mathcal{F}$ on $L^2 \cap L^1$ given by
$$\mathcal{F}(f)(s) = \hat{f}(s) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{i x \cdot s} f(x)\, dx, \tag{2}$$
where $\cdot$ denotes the Euclidean inner product on $\mathbb{R}^d$.

For every $b > 0$,
$$\hat{\gamma}_b(x) = (2b)^{-d/2}\, \gamma_{\frac{1}{4b}}(x) \tag{3}$$
(cf. [34, p. 43]). Thus
$$\mathrm{span}_n G_0 = \mathrm{span}_n \hat{G}_0. \tag{4}$$

Plancherel's identity [34, p. 31] asserts that the Fourier transform is an isometry on $L^2$, i.e., for all $f \in L^2$,
$$\|f\|_{L^2} = \|\hat{f}\|_{L^2}, \tag{5}$$
and directly by (1) we have
$$\|\gamma_b\|_{L^2} = \left(\frac{\pi}{2b}\right)^{d/4} = \|\hat{\gamma}_b\|_{L^2}. \tag{6}$$
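Identity (3) in the convention (2) can likewise be checked by quadrature; the snippet below (our illustration) does so in dimension $d = 1$, where (3) reads $\hat{\gamma}_b(s) = (2b)^{-1/2} e^{-s^2/(4b)}$:

```python
import numpy as np
from scipy.integrate import quad

def gamma_hat_numeric(s: float, b: float) -> float:
    # Fourier transform (2) in d = 1; the integrand is even, so e^{ist} -> cos(st).
    val, _ = quad(lambda t: np.cos(s * t) * np.exp(-b * t * t), -np.inf, np.inf)
    return val / np.sqrt(2.0 * np.pi)

def gamma_hat_closed(s: float, b: float) -> float:
    # Right-hand side of (3) with d = 1: (2b)^{-1/2} * gamma_{1/(4b)}(s).
    return (2.0 * b) ** (-0.5) * np.exp(-s * s / (4.0 * b))

for s in [0.0, 0.7, 2.5]:
    assert np.isclose(gamma_hat_numeric(s, b=1.3), gamma_hat_closed(s, b=1.3))
print("identity (3) verified numerically for d = 1")
```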

In a normed linear space $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$, for $f \in \mathcal{X}$ and $A \subset \mathcal{X}$,
$$\|f - A\|_{\mathcal{X}} = \inf_{g \in A} \|f - g\|_{\mathcal{X}}$$
denotes the distance from $f$ to $A$. The following proposition shows that, in estimating rates of approximation by linear combinations of scaled Gaussians centered at 0, one can switch between a function and its Fourier transform.


Proposition 2.1 For all positive integers $d, n$ and all $f \in L^2$,
$$\|f - \mathrm{span}_n G_0\|_{L^2} = \|f - \mathrm{span}_n \hat{G}_0\|_{L^2} = \|\hat{f} - \mathrm{span}_n \hat{G}_0\|_{L^2} = \|\hat{f} - \mathrm{span}_n G_0\|_{L^2}.$$

Proof. By Plancherel's identity (5), $\|f - \mathrm{span}_n G_0\|_{L^2} = \|\hat{f} - \mathrm{span}_n \hat{G}_0\|_{L^2}$, and by (4), $\mathrm{span}_n \hat{G}_0 = \mathrm{span}_n G_0$; hence all four distances coincide. $\Box$


To derive our estimates, we use a result on approximation by convex combinations of $n$ elements of a bounded subset of a Hilbert space, derived by Maurey [32], Jones [11], and Barron [2,3]. Let $F$ be a bounded subset of a Hilbert space $(\mathcal{H}, \|\cdot\|_{\mathcal{H}})$, and let
$$\mathrm{uconv}_n F = \left\{ \frac{1}{n} \sum_{i=1}^n f_i \;\middle|\; f_i \in F \right\}$$
denote the set of $n$-fold convex combinations of elements of $F$ with all coefficients equal. By the Maurey-Jones-Barron result [3, p. 934], for every function $h$ in $\mathrm{cl\,conv}(F \cup -F)$, i.e., in the closure of the symmetric convex hull of $F$, we have
$$\|h - \mathrm{uconv}_n F\|_{\mathcal{H}} \leq n^{-1/2} \left( s_F^2 - \|h\|_{\mathcal{H}}^2 \right)^{1/2}, \tag{7}$$
where $s_F = \sup_{f \in F} \|f\|_{\mathcal{H}}$. The bound (7) implies an estimate of the distance from $\mathrm{span}_n F$ that holds for any function in $\mathcal{H}$. The estimate is formulated in terms of a norm tailored to $F$, called $F$-variation, which was introduced in [15] as an extension of "variation with respect to half-spaces" defined in [2].
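To see the $n^{-1/2}$ rate of (7) at work, here is a small simulation (our sketch; the set $F$ of random unit vectors, the space dimension, and the greedy scheme are illustrative assumptions, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)
D, m = 200, 500
F = rng.standard_normal((m, D))
F /= np.linalg.norm(F, axis=1, keepdims=True)   # every element has norm 1, so s_F = 1
h = F.mean(axis=0)                              # h lies in conv(F)

bound_const = np.sqrt(1.0 - np.dot(h, h))       # (s_F^2 - ||h||^2)^{1/2} from (7)
approx = np.zeros(D)
for n in range(1, 51):
    # Greedy step: pick f in F minimizing ||h - ((n-1)*approx + f)/n||; the classical
    # averaging argument behind (7) guarantees the n^{-1/2} decay for this scheme.
    cand = ((n - 1) * approx + F) / n
    errs = np.linalg.norm(cand - h, axis=1)
    approx = cand[np.argmin(errs)]
    assert errs.min() <= bound_const / np.sqrt(n) + 1e-12
print("greedy n-term errors respect the Maurey-Jones-Barron bound (7)")
```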


For any bounded subset $F$ of a normed linear space $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$, $F$-variation is defined as the Minkowski functional of the closed convex symmetric hull of $F$ (where the closure is taken with respect to the norm $\|\cdot\|_{\mathcal{X}}$). The variational norm with respect to $F$ in $\mathcal{X}$ is denoted by $\|\cdot\|_{F,\mathcal{X}}$, i.e.,
$$\|h\|_{F,\mathcal{X}} = \inf \left\{ c > 0 \mid c^{-1} h \in \mathrm{cl\,conv}(F \cup -F) \right\}. \tag{8}$$
Note that $F$-variation can be infinite (when the set on the right-hand side is empty) and that it depends on the ambient space norm. When we consider variation with respect to the $L^2$-norm, we omit $L^2$ in the notation of the variational norm.

The Maurey-Jones-Barron estimate (7) implies that for any bounded subset $F$ of a Hilbert space $(\mathcal{H}, \|\cdot\|_{\mathcal{H}})$ and all positive integers $n$,
$$\|h - \mathrm{span}_n F\|_{\mathcal{H}} \leq n^{-1/2} \left( \|h\|_{F^o,\mathcal{H}}^2 - \|h\|_{\mathcal{H}}^2 \right)^{1/2} \tag{9}$$
(see [16]). To apply the upper bound (9) to approximation by Gaussian RBFs, we take advantage of properties of variational norms given in the remainder of this section. From the definitions, if $\psi$ is any linear isometry of $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$, then for any $f \in \mathcal{X}$, $\|f\|_{F,\mathcal{X}} = \|\psi(f)\|_{\psi(F),\mathcal{X}}$. In particular,
$$\|f\|_{G_0^o,\mathcal{X}} = \|\hat{f}\|_{G_0^o,\mathcal{X}}. \tag{10}$$

Variations with respect to two subsets satisfy the following inequality [19, Proposition 3(iii)].

Lemma 2.2 Let $F, H$ be nonempty, nonzero subsets of a normed linear space $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$ and $s_{H,F} := \sup_{h \in H} \|h\|_{F,\mathcal{X}}$. Then for every $f \in \mathcal{X}$,
$$\|f\|_{F,\mathcal{X}} \leq s_{H,F}\, \|f\|_{H,\mathcal{X}}.$$

The next lemma states that the variation of the limit of a sequence of functions is bounded by the limit of their variations (see [18, Lemma 7.2] and [12, Lemma 3.5]).

Lemma 2.3 Let $F$ be a nonempty bounded subset of a normed linear space $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$, $h \in \mathcal{X}$, and $\{h_i\} \subset \mathcal{X}$ be such that $\lim_{i\to\infty} \|h_i - h\|_{\mathcal{X}} = 0$. For all $i$, let $b_i = \|h_i\|_{F,\mathcal{X}} < \infty$ and suppose that $\lim_{i\to\infty} b_i = b$ exists. Then $\|h\|_{F,\mathcal{X}} \leq b$.


Variation with respect to a parameterized family of functions can be estimated for functions representable by a suitable integral formula, where the integration is with respect to the parameter. Let $\Omega \subseteq \mathbb{R}^d$ and $\phi : \Omega \times Y \to \mathbb{R}$. If $\phi(\cdot, y) \in L^2(\Omega)$ for all $y \in Y$, then we denote by $\Phi : Y \to L^2(\Omega)$ the mapping defined for every $y \in Y$ as $\Phi(y) = \phi(\cdot, y)$, and $\Phi(Y) := \{\phi(\cdot, y) : \Omega \to \mathbb{R} \mid y \in Y\}$.

The following theorem was proven in [12, Corollary 5.1] using properties of the Bochner integral of the mapping $\Phi$, together with the limit property of variational norms given in Lemma 2.3. We denote by $w\Phi : Y \to L^2(\Omega)$ the mapping defined for all $y \in Y$ as $(w\Phi)(y) = w(y)\Phi(y)$.

Theorem 2.4 Let $\Omega \subseteq \mathbb{R}^d$ be Lebesgue measurable and let $f \in L^2(\Omega)$ be such that for a.e. $x \in \Omega$,
$$f(x) = \int_Y w(y)\, \phi(x, y)\, dy,$$
where $Y$, $w$, and $\phi$ satisfy the following three conditions:
(i) $Y \subseteq \mathbb{R}^p$ is Lebesgue measurable, $p$ is a positive integer, and $Y \setminus Y_0 = \bigcup_{m=1}^\infty Y_m$, where $\lambda(Y_0) = 0$ and, for all positive integers $m$, $Y_m$ is compact and $Y_m \subseteq Y_{m+1}$;
(ii) $\Phi(Y)$ is a bounded subset of $L^2(\Omega)$, $w \in L^1(Y)$, and $w\Phi : Y \setminus Y_0 \to L^2(\Omega)$ is continuous;
(iii) $\phi : \Omega \times Y \to \mathbb{R}$ is Lebesgue measurable.
Then $\|f\|_{\Phi(Y)} \leq \|w\|_{L^1(Y)}$ and, for all positive integers $n$,
$$\|f - \mathrm{span}_n \Phi(Y)\|_{L^2(\Omega)}^2 \leq \frac{s_\Phi^2\, \|w\|_{L^1(Y)}^2 - \|f\|_{L^2(\Omega)}^2}{n},$$
where $s_\Phi = \sup_{y \in Y} \|\Phi(y)\|_{L^2(\Omega)}$.

Theorem 2.4 guarantees that if $f$ can be represented as a neural network with a continuum of hidden units computing functions from $\Phi(Y)$, then the $\Phi(Y)$-variational norm of $f$ is bounded by the $L^1$-norm of the weight function.

3 Approximation of Bessel potentials by Gaussian RBFs

In this section, we estimate rates of approximation by $\mathrm{span}_n G$ for certain special functions, called Bessel potentials, which are defined by means of their Fourier transforms. For $r > 0$, the Bessel potential of order $r$, denoted by $\beta_r$, is the function on $\mathbb{R}^d$ with Fourier transform
$$\hat{\beta}_r(s) = (1 + \|s\|^2)^{-r/2}.$$

The $L^2$-norm of $\beta_r$ can be calculated by switching to $\hat{\beta}_r$ and using Plancherel's equality (5). For every $r > d/2$,
$$\|\beta_r\|_{L^2} = \|\hat{\beta}_r\|_{L^2} = \lambda(r, d) := \pi^{d/4} \left( \frac{\Gamma(r - d/2)}{\Gamma(r)} \right)^{1/2}. \tag{11}$$
Indeed, using radial symmetry, $\|\hat{\beta}_r\|_{L^2}^2 = \int_{\mathbb{R}^d} (1 + \|x\|^2)^{-r}\, dx = \omega_d\, I$, where $\omega_d := 2\pi^{d/2}/\Gamma(d/2)$ is the area of the unit sphere in $\mathbb{R}^d$ [7, p. 303] and $I = \int_0^\infty (1 + \rho^2)^{-r} \rho^{d-1}\, d\rho$. Substituting $\sigma = \rho^2$, one gets $d\rho = (1/2)\sigma^{-1/2}\, d\sigma$; hence,
$$I = \frac{1}{2} \int_0^\infty \frac{\sigma^{d/2 - 1}}{(1 + \sigma)^r}\, d\sigma = \frac{\Gamma(d/2)\, \Gamma(r - d/2)}{2\, \Gamma(r)}$$
(see [6, p. 60] for the last equality).

To estimate the $G_0^o$-variations of $\beta_r$ and $\hat{\beta}_r$, we use Theorem 2.4 with representations of these two functions as integrals of scaled Gaussians.

For $r > 0$, it is known [33, p. 132] that $\beta_r$ is non-negative, radial, exponentially decreasing at infinity, analytic except at the origin, and belongs to $L^1$. It can be expressed by the integral formula (see [22, p. 296] or [33])
$$\beta_r(x) = c_1(r, d) \int_0^\infty e^{-t/(4\pi)}\, t^{-d/2 + r/2 - 1}\, e^{-(\pi/t)\|x\|^2}\, dt, \tag{12}$$
where
$$c_1(r, d) = (2\pi)^{d/2}\, (4\pi)^{-r/2} / \Gamma(r/2)$$
and $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt$ is the Gamma function. The factor $(2\pi)^{d/2}$ occurs because our choice of Fourier transform (2) includes the factor $(2\pi)^{-d/2}$. Combining (12) with (1), we get a representation of the Bessel potential as an integral of normalized scaled Gaussians.


Proposition 3.1 For every $r > 0$, $d$ a positive integer, and $x \in \mathbb{R}^d$,
$$\beta_r(x) = \int_0^\infty v_r(t)\, \gamma_{\pi/t}^o(x)\, dt,$$
where $v_r(t) = c_1(r, d)\, 2^{-d/4}\, e^{-t/(4\pi)}\, t^{-d/4 + r/2 - 1}$.

The next proposition estimates the $G_0^o$-variation of $\beta_r$.

Proposition 3.2 For $d$ a positive integer and $r > d/2$,
$$\|\beta_r\|_{G^o} \leq \|\beta_r\|_{G_0^o} \leq \int_0^\infty v_r(t)\, dt = k(r, d),$$
where $k(r, d) = \frac{(\pi/2)^{d/4}\, \Gamma(r/2 - d/4)}{\Gamma(r/2)}$.


Proof. As $G_0^o \subset G^o$, we get $\|\beta_r\|_{G^o} \leq \|\beta_r\|_{G_0^o}$. To estimate $\|\beta_r\|_{G_0^o}$, we apply Theorem 2.4, with $w = v_r$, $\phi(x, y) = \phi(x, t) = \gamma_{\pi/t}^o(x)$, $Y = (0, \infty)$, and $\Omega = \mathbb{R}^d$, to the integral representation from Proposition 3.1, getting
$$\|\beta_r\|_{G_0^o} \leq \int_0^\infty v_r(t)\, dt = c_1(r, d)\, 2^{-d/4} \int_0^\infty e^{-t/(4\pi)}\, t^{-d/4 + r/2 - 1}\, dt = (4\pi)^{-d/4 + r/2}\, c_1(r, d)\, 2^{-d/4} \int_0^\infty u^{-d/4 + r/2 - 1}\, e^{-u}\, du.$$
Hence, by the definition of the Gamma function, one has
$$\|\beta_r\|_{G_0^o} \leq c_1(r, d)\, 2^{-d/4}\, (4\pi)^{-d/4 + r/2}\, \Gamma(r/2 - d/4) = \frac{(\pi/2)^{d/4}\, \Gamma(r/2 - d/4)}{\Gamma(r/2)} = k(r, d). \quad \Box$$
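The closed form of $k(r, d)$ can be validated against direct quadrature of $\int_0^\infty v_r(t)\, dt$; the sketch below (ours) also tabulates $k(d/2 + c, d)$ for the arbitrary choice $c = 1$, anticipating the exponential decrease discussed after Proposition 3.4:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def log_k(r: float, d: int) -> float:
    """log k(r,d), with k(r,d) = (pi/2)^{d/4} Gamma(r/2 - d/4) / Gamma(r/2)."""
    return (d / 4.0) * np.log(np.pi / 2.0) + gammaln(r / 2.0 - d / 4.0) - gammaln(r / 2.0)

def k_numeric(r: float, d: int) -> float:
    """Quadrature of int_0^infty v_r(t) dt, with
    v_r(t) = c_1(r,d) 2^{-d/4} e^{-t/(4 pi)} t^{-d/4 + r/2 - 1}."""
    log_c1 = (d / 2.0) * np.log(2 * np.pi) - (r / 2.0) * np.log(4 * np.pi) - gammaln(r / 2.0)
    c = np.exp(log_c1) * 2.0 ** (-d / 4.0)
    val, _ = quad(lambda t: c * np.exp(-t / (4 * np.pi)) * t ** (-d / 4.0 + r / 2.0 - 1.0),
                  0.0, np.inf)
    return val

assert np.isclose(np.exp(log_k(5.0, 3)), k_numeric(5.0, 3), rtol=1e-6)
for d in [1, 5, 10, 50, 100]:
    print(d, np.exp(log_k(d / 2.0 + 1.0, d)))   # k(d/2 + c, d) shrinks rapidly with d
```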


The Fourier transform of the Bessel potential can also be expressed as an integral of normalized scaled Gaussians.

Proposition 3.3 For every $r > 0$, $d$ a positive integer, and $s \in \mathbb{R}^d$,
$$\hat{\beta}_r(s) = \int_0^\infty w_r(t)\, \gamma_t^o(s)\, dt,$$
where $w_r(t) = (\pi/2t)^{d/4}\, t^{r/2 - 1}\, e^{-t} / \Gamma(r/2)$.

Proof. First we show that $\hat{\beta}_r(s) = I/\Gamma(r/2)$, where
$$I = \int_0^\infty t^{r/2 - 1}\, e^{-t}\, e^{-t\|s\|^2}\, dt.$$
Indeed, putting $u = t(1 + \|s\|^2)$, so that $dt = (1 + \|s\|^2)^{-1}\, du$, we get
$$I = (1 + \|s\|^2)^{-r/2} \int_0^\infty u^{r/2 - 1}\, e^{-u}\, du = \hat{\beta}_r(s)\, \Gamma(r/2).$$
By (1), $\|\gamma_t\|_{L^2} = (\pi/2t)^{d/4}$, so $\hat{\beta}_r(s) = \int_0^\infty (\pi/2t)^{d/4}\, t^{r/2 - 1}\, e^{-t} / \Gamma(r/2)\; \gamma_t^o(s)\, dt$. $\Box$
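Since both sides of Proposition 3.3 are explicit — $\hat{\beta}_r(s) = (1 + \|s\|^2)^{-r/2}$, while the $(\pi/2t)^{d/4}$ factors of $w_r$ and of the normalization of $\gamma_t$ cancel inside the integral — the representation is easy to verify numerically (our sketch, with $r = 5$ chosen arbitrarily):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def beta_hat_via_gaussians(s_norm: float, r: float) -> float:
    """Integral representation of Proposition 3.3; the integrand reduces to
    t^{r/2-1} e^{-t} e^{-t ||s||^2} / Gamma(r/2) after the norms cancel."""
    integrand = lambda t: t ** (r / 2.0 - 1.0) * np.exp(-t * (1.0 + s_norm ** 2))
    val, _ = quad(integrand, 0.0, np.inf)
    return val / gamma(r / 2.0)

for s_norm in [0.0, 1.0, 3.0]:
    direct = (1.0 + s_norm ** 2) ** (-2.5)        # hat{beta}_r with r = 5
    assert np.isclose(beta_hat_via_gaussians(s_norm, r=5.0), direct)
print("Proposition 3.3 verified at sample points")
```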

The next proposition gives an upper bound on the $G_0^o$-variation of $\hat{\beta}_r$.

Proposition 3.4 For $d$ a positive integer and $r > d/2$,
$$\|\hat{\beta}_r\|_{G^o} \leq \|\hat{\beta}_r\|_{G_0^o} \leq \int_0^\infty w_r(t)\, dt = k(r, d),$$
where $k(r, d) = \frac{(\pi/2)^{d/4}\, \Gamma(r/2 - d/4)}{\Gamma(r/2)}$.

Proof. A straightforward calculation shows that the $L^1$-norm of the weighting function $w_r$ is the same as the $L^1$-norm of the weighting function $v_r$, and the upper bound follows from Theorem 2.4 as in Proposition 3.2, but with $\phi(x, y) = \phi(x, t) = \gamma_t^o(x)$. $\Box$


Because the Fourier transform is an isometry on $L^2$, by (10) the functions $\beta_r$ and $\hat{\beta}_r$ have the same variation with respect to $G_0^o$. Propositions 3.2 and 3.4 give the same upper bound $k(r, d)$ on this number. If, for some fixed $c > 0$, $r_d = d/2 + c$, then $k(r_d, d) \to 0$ exponentially fast as $d \to \infty$. As all elements of $G_{\beta_r}$ have the same $L^2$-norm, equal to $\lambda(r, d)$,
$$\|\cdot\|_{G_{\beta_r}^o} = \lambda(r, d)\, \|\cdot\|_{G_{\beta_r}}. \tag{13}$$

An application of (9) and (11) with Proposition 3.2 or 3.4 shows the following result:

Theorem 3.5 For $d, n$ positive integers and $r > d/2$,
$$\|\beta_r - \mathrm{span}_n G_0\|_{L^2} = \|\hat{\beta}_r - \mathrm{span}_n G_0\|_{L^2} \leq n^{-1/2} \left( k(r, d)^2 - \lambda(r, d)^2 \right)^{1/2}.$$
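All quantities in Theorem 3.5 are explicit, so the bound can be tabulated directly; the sketch below (ours) evaluates it for $n = 1$ and $r_d = d/2 + 1$, illustrating the remark that follows:

```python
import numpy as np
from scipy.special import gammaln

def k(r, d):    # (pi/2)^{d/4} Gamma(r/2 - d/4) / Gamma(r/2)
    return np.exp((d / 4) * np.log(np.pi / 2) + gammaln(r / 2 - d / 4) - gammaln(r / 2))

def lam(r, d):  # pi^{d/4} (Gamma(r - d/2) / Gamma(r))^{1/2}, equation (11)
    return np.exp((d / 4) * np.log(np.pi) + 0.5 * (gammaln(r - d / 2) - gammaln(r)))

for d in [2, 10, 40, 100]:
    r = d / 2 + 1
    bound_n1 = np.sqrt(k(r, d) ** 2 - lam(r, d) ** 2)  # Theorem 3.5 with n = 1
    print(f"d = {d:3d}: ||beta_r - span_1 G_0||_L2 <= {bound_n1:.3e}")
```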

As above, for $c > 0$ and $d$ large enough, the theorem shows that the Bessel potential of order $r_d = d/2 + c$ can be well-approximated by a network with just one Gaussian unit; hence, $\beta_{r_d}$ is close in the $L^2$-norm to a multiple of some $d$-dimensional Gaussian centered at the origin.

4 Approximation of smooth functions by Gaussian RBFs

In this section, we estimate rates of approximation by Gaussian RBF networks for functions in Bessel potential spaces. To obtain the estimates, we first derive upper bounds on variation with respect to the set of translated Bessel potentials and then combine them with the estimates of $G_0$-variation of Bessel potentials from the previous section. Let $h * g$ denote the convolution of two functions $h$ and $g$,
$$(h * g)(x) = \int_{\mathbb{R}^d} h(y)\, g(x - y)\, dy.$$

For $d$ a positive integer, $r > d/2$, and $q \in [1, \infty]$, the Bessel potential space (with respect to $\mathbb{R}^d$) [33, pp. 134-136], denoted by $(L^{q,r}, \|\cdot\|_{L^{q,r}})$, is defined as
$$L^{q,r} := \{f \mid f = w * \beta_r,\ w \in L^q\}$$
and
$$\|f\|_{L^{q,r}} := \|w\|_{L^q} \quad \text{for } f = w * \beta_r.$$
Since the Fourier transform (2) of a convolution is $(2\pi)^{d/2}$ times the product of the transforms, we have $\hat{w} = (2\pi)^{-d/2}\, \hat{f}/\hat{\beta}_r$. Thus $w = (2\pi)^{-d/2}\, (\hat{f}/\hat{\beta}_r)^\vee$ is uniquely determined by $f$, and so the Bessel potential norm is well-defined.


For $\tau_y$ the translation operator given by $(\tau_y f)(x) = f(x - y)$, let
$$G_{\beta_r} = \{\tau_y \beta_r \mid y \in \mathbb{R}^d\}$$
denote the set of translates of the Bessel potential of order $r$. For $r > d/2$, $\beta_r$ belongs to $L^2$; since translation does not change the $L^2$-norm, $G_{\beta_r} \subset L^2$.

Functions in the Bessel potential space are convolutions with $\beta_r$ and thus are given by integral formulas of the type required in Theorem 2.4. Hence we get the following upper bound:

Proposition 4.1 Let $d$ be a positive integer, $r > d/2$, $w : \mathbb{R}^d \to \mathbb{R}$ continuous except on a set $Z_0$ of measure zero, $w \in L^1$, and $f = w * \beta_r$. Then
$$\|f\|_{G_{\beta_r}} \leq \|w\|_{L^1} = \|f\|_{L^{1,r}}.$$

Proof. The bounds follow from Theorem 2.4, applied to the integral formula $f(x) = \int w(y)\, \beta_r(x - y)\, dy = \int w(y)\, \lambda(r, d)\, \beta_r^o(x - y)\, dy$, combined with (13). Take $Y = \mathbb{R}^d$ and $Y_0 = Z_0$, let $\phi(x, y) = \beta_r^o(x - y)$, and let $w(y)\lambda(r, d)$ be the weight function. The condition $r > d/2$ is needed to ensure that $G_{\beta_r} \subset L^2$. $\Box$

For $h : U \to \mathbb{R}$, $U$ a topological space, let
$$\mathrm{supp}\, h = \mathrm{cl}\, \{u \in U \mid h(u) \neq 0\}.$$

Proposition 4.2 Let $d$ be a positive integer, $r > d/2$, $w \in L^2$ continuous except on a set of measure zero, $\lambda(\mathrm{supp}\, w) = \nu < \infty$, and $f = w * \beta_r$. Then
$$\|f\|_{G_{\beta_r}} \leq \nu^{1/2}\, \|f\|_{L^{2,r}}.$$

Proof. By the Cauchy-Schwarz inequality, $\|w\|_{L^1} \leq \nu^{1/2} \|w\|_{L^2} = \nu^{1/2} \|f\|_{L^{2,r}}$. But by Proposition 4.1, $\|f\|_{G_{\beta_r}} \leq \|w\|_{L^1}$. $\Box$

These estimates of variations give an upper bound on rates of approximation by linear combinations of $n$ translates of the Bessel potential $\beta_r$.

Theorem 4.3 Let $d, n$ be positive integers, $r > d/2$, $w$ continuous except on a set of measure zero, $f = w * \beta_r$, and $\lambda(r, d) = \pi^{d/4} \left( \frac{\Gamma(r - d/2)}{\Gamma(r)} \right)^{1/2}$.
(i) For $w \in L^1$,
$$\|f - \mathrm{span}_n G_{\beta_r}\|_{L^2} \leq n^{-1/2} \left( \lambda(r, d)^2\, \|f\|_{L^{1,r}}^2 - \|f\|_{L^2}^2 \right)^{1/2}.$$
(ii) For $w \in L^2$ with $\nu = \lambda(\mathrm{supp}\, w) < \infty$,
$$\|f - \mathrm{span}_n G_{\beta_r}\|_{L^2} \leq n^{-1/2} \left( \nu\, \lambda(r, d)^2\, \|f\|_{L^{2,r}}^2 - \|f\|_{L^2}^2 \right)^{1/2}.$$

Proof. (i) By Proposition 4.1, (13), and (9). (ii) As in Proposition 4.2, $w \in L^2$ and $\lambda(\mathrm{supp}\, w) = \nu < \infty$ imply $w \in L^1$; the rest follows from Proposition 4.2. $\Box$

Composing the estimates of variations with respect to sets of translated Bessel potentials and Gaussians, we get an upper bound on rates of approximation by networks with $n$ Gaussian RBF units for functions from Bessel spaces.

Theorem 4.4 Let $d, n$ be positive integers, $r > d/2$, $w$ continuous except on a set of measure zero, $f = w * \beta_r$, and $k(r, d) = \frac{(\pi/2)^{d/4}\, \Gamma(r/2 - d/4)}{\Gamma(r/2)}$.
(i) For $w \in L^1$,
$$\|f - \mathrm{span}_n G\|_{L^2} \leq n^{-1/2} \left( k(r, d)^2\, \|f\|_{L^{1,r}}^2 - \|f\|_{L^2}^2 \right)^{1/2}.$$
(ii) For $w \in L^2$ and $\lambda(\mathrm{supp}\, w) = \nu < \infty$,
$$\|f - \mathrm{span}_n G\|_{L^2} \leq n^{-1/2} \left( k(r, d)^2\, \nu\, \|f\|_{L^{2,r}}^2 - \|f\|_{L^2}^2 \right)^{1/2}.$$

Proof. By (9), $\|f - \mathrm{span}_n G\|_{L^2} \leq \left( \|f\|_{G^o}^2 - \|f\|_{L^2}^2 \right)^{1/2} n^{-1/2}$. By Lemma 2.2 with $\mathcal{X} = L^2$, $F = G^o$, and $H = G_{\beta_r}$, using Proposition 3.2 and the fact that $G^o$ is closed under translations, we have
$$\|f\|_{G^o} \leq \sup\{\|\tau_y \beta_r\|_{G^o} \mid y \in \mathbb{R}^d\}\, \|f\|_{G_{\beta_r}} \leq k(r, d)\, \|f\|_{G_{\beta_r}}.$$
Thus $\|f - \mathrm{span}_n G\|_{L^2} \leq n^{-1/2} \left( k(r, d)^2\, \|f\|_{G_{\beta_r}}^2 - \|f\|_{L^2}^2 \right)^{1/2}$. The statements then follow from the upper bounds on $\|f\|_{G_{\beta_r}}$ given in Propositions 4.1 and 4.2, respectively. $\Box$

5 Upper bounds in terms of Sobolev norms
In this section, bounds on approximation by Gaussian RBFs are given in terms of Sobolev norms. Two norms are equivalent if each is bounded by a multiple of the other. For integer smoothness, the equivalence of Sobolev and Bessel potential norms is well known (e.g., [34] or [1, p. 252]). As constants of equivalence were not readily available, we derive one of them here in a special case.

Let $r$ be a positive integer and let $W^{2,r}$ denote the Sobolev space of functions with $t$-th order partial derivatives in $L^2$ for $t \in \{0, 1, \ldots, r\}$, with norm
$$\|f\|_{W^{2,r}} = \left( \sum_{|\alpha| \leq r} \|D^\alpha f\|_{L^2}^2 \right)^{1/2},$$
where $\alpha$ denotes a multi-index (i.e., a vector of non-negative integers), $D^\alpha$ denotes the corresponding partial derivative operator, and $|\alpha| = \alpha_1 + \cdots + \alpha_d$.


Let $d$ and $r$ be positive integers, $r > d/2$, $w \in L^2$, and $f = w * \beta_r$. Then
$$\|f\|_{L^{2,r}} \leq (2\pi)^{-d/2}\, (r!)^{1/2}\, \|f\|_{W^{2,r}}. \tag{14}$$
Indeed, since $f = w * \beta_r$, $\hat{f} = (2\pi)^{d/2}\, \hat{w}\, \hat{\beta}_r$, and so
$$\|f\|_{L^{2,r}} = (2\pi)^{-d/2}\, \|\hat{f}/\hat{\beta}_r\|_{L^2} = (2\pi)^{-d/2} \left( \int_{\mathbb{R}^d} |\hat{f}(s)|^2\, (1 + |s|^2)^r\, ds \right)^{1/2}.$$
Let $\binom{r}{\sigma}$ denote the multinomial coefficient $r!/(\sigma_1! \cdots \sigma_t!)$. Note that $(1 + |s|^2)^r = \sum_{|\sigma| = r} \binom{r}{\sigma} |u^{2\sigma}|$ for $u \in \mathbb{R}^{d+1}$ defined by $u_j = s_j$, $j = 1, \ldots, d$, and $u_{d+1} = 1$, where $\sigma = (\sigma_1, \ldots, \sigma_{d+1}) \in \mathbb{N}^{d+1}$ is a multi-index of length $d + 1$ and $|u^{2\sigma}| = |u_1^{2\sigma_1} \cdots u_{d+1}^{2\sigma_{d+1}}|$. Hence, we have
$$\int_{\mathbb{R}^d} |\hat{f}(s)|^2\, (1 + |s|^2)^r\, ds \leq C(r, d) \int_{\mathbb{R}^d} |\hat{f}(s)|^2 \sum_{|\alpha| \leq r} |s^{2\alpha}|\, ds,$$
where $C(r, d) = \max \left\{ \binom{r}{\sigma} \,\middle|\, |\sigma| = r \right\}$. It follows from basic properties of the Fourier transform that the integral on the right-hand side is the square of the Sobolev norm of $f$; see, e.g., [34, p. 162]. Clearly, $C(r, d) \leq r!$, and equality holds if and only if $r \leq d$. This establishes (14).
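For small $r$ and $d$, the constant $C(r, d)$ can be computed by brute force; the sketch below (ours, with arbitrary sample values) confirms the inequality $C(r, d) \leq r!$ used above:

```python
from itertools import product
from math import factorial

def C(r: int, d: int) -> int:
    """Brute-force C(r,d): the largest multinomial coefficient r!/(s_1! ... s_{d+1}!)
    over multi-indices s of length d+1 with |s| = r (feasible for small r, d only)."""
    best = 0
    for s in product(range(r + 1), repeat=d + 1):
        if sum(s) == r:
            coef = factorial(r)
            for si in s:
                coef //= factorial(si)
            best = max(best, coef)
    return best

for r, d in [(2, 3), (3, 3), (4, 2), (5, 2)]:
    print(f"C({r},{d}) = {C(r, d)},  r! = {factorial(r)}")   # C(r,d) <= r! throughout
```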

Thus the larger the dimension, the more the magnitudes of these two equivalent norms differ. We can now estimate the rate of approximation by scaled and translated Gaussians in terms of the Sobolev norm of the function to be approximated.

Theorem 5.1 Let $d, n, r$ be positive integers, $r > d/2$, $w$ continuous except on a set of measure zero, and $f = w * \beta_r$. For $w \in L^2$ and $\lambda(\mathrm{supp}\, w) = \nu < \infty$,
$$\|f - \mathrm{span}_n G\|_{L^2} \leq n^{-1/2} \left[ \left( \frac{1}{8\pi} \right)^{d/2} \left( \frac{\Gamma(r/2 - d/4)}{\Gamma(r/2)} \right)^2 \nu\, r!\, \|f\|_{W^{2,r}}^2 - \|f\|_{L^2}^2 \right]^{1/2}.$$

Proof. Using Theorem 4.4(ii) and (14), the $L^2$-distance from $f$ to $\mathrm{span}_n G$ is at most
$$n^{-1/2} \left( k(r, d)^2\, \nu\, (2\pi)^{-d}\, r!\, \|f\|_{W^{2,r}}^2 - \|f\|_{L^2}^2 \right)^{1/2},$$
and the result follows. $\Box$

6 Tractability of approximation by RBFs

Our results can be interpreted in terms of tractability (see below for the definition) of multivariable approximation by Gaussian radial-basis networks. For $A_d \subseteq L^2(\mathbb{R}^d)$ and $n$ a positive integer, let $e_n(A_d)$ denote the worst-case $L^2$-error in approximating the elements of $A_d$ by elements from $\mathrm{span}_n G$; i.e.,
$$e_n(A_d) = \sup_{f \in A_d}\ \inf_{g \in \mathrm{span}_n G} \|g - f\|_{L^2},$$
where $G$ is the family of all scaled and shifted Gaussians. Let $d$ be a positive integer, $r > d/2$, $\nu > 0$, and let $w$ be continuous except on a set of measure zero. For $A_d$ equal to either of the two subsets defined below, we have shown that
$$e_n(A_d) \leq n^{-1/2}\, a(r, d)$$

and $a(r, d)$ tends to zero exponentially fast as $d \to \infty$, as explained below. Define the following two subsets of $L^2$:
$$A_d^{(1)} = \{w * \beta_r \mid \|w\|_{L^1} \leq 1\}$$
and
$$A_d^{(2)} = \{w * \beta_r \mid \|w\|_{L^2} \leq 1,\ \lambda(\mathrm{supp}\, w) \leq \nu\}.$$

In Theorems 4.4 (i) and (ii), we have shown that
$$e_n(A_d^{(1)}) \leq n^{-1/2}\, \frac{(\pi/2)^{d/4}\, \Gamma(r/2 - d/4)}{\Gamma(r/2)} \tag{15}$$
and
$$e_n(A_d^{(2)}) \leq n^{-1/2}\, \frac{(\pi/2)^{d/4}\, \Gamma(r/2 - d/4)}{\Gamma(r/2)}\, \nu^{1/2}, \tag{16}$$
respectively.

Considering only the dependence on $n$, better rates have been obtained, under different hypotheses than ours on the functions to be approximated, by Gaussians without scaling and with constructive algorithms; see, e.g., [24-26]. But many estimates available in the literature for $e_n(A_d)$, where $A_d$ is a suitable set of functions of $d$ variables (see, e.g., [3,5,9,28] for sigmoidal neural networks and radial-basis-function networks), are of the form
$$e_n(A_d) \leq n^{-\delta}\, \kappa(d), \tag{17}$$

where $\delta > 0$ and $\kappa(d)$ is an increasing function of the number $d$ of variables. In some literature, the term $\kappa(d)$ in (17) is referred to as "a constant"; however, it is constant only with respect to $n$, not with respect to the number $d$ of variables.

For example, rates of approximation of functions on $\mathbb{R}^d$ with all $l$-th order partial derivatives uniformly bounded, for some positive integer $l$, were investigated in [36], but it was not specified how the multiplicative factors in the estimates depend on $l$.

The dependence of $\kappa(d)$ on $d$ may not be crucial for small values of $d$; however, for large values of $d$, the approximation error $e_n(A_d)$ can even grow exponentially with $d$ as a consequence of exponential growth in $\kappa(d)$ (see, e.g., [3, item 9, p. 940]); this is when the "curse of dimensionality" [4] strikes. However, if $\kappa(d) \leq C d^\alpha$ for some $C, \alpha > 0$ independent of $n$ and $d$, then
$$e_n(A_d) \leq C\, n^{-\delta}\, d^\alpha \tag{18}$$

and the approximation problem is said to be tractable in the number d of variables [35–37,26].


As remarked in [36], estimating the dependence of $e_n(A_d)$ on $d$ is in general much harder than estimating its dependence on $n$, and only a few results are available. For neural-network approximation, upper bounds that depend polynomially on $d$, hence ensuring tractability, were derived in [3,17,19,20]. In [26, Theorem 4.2], a tractability result for approximation by RBF networks was obtained in the supremum norm on $\mathbb{R}^d$, which also gives an upper bound of the form (18).

The bounds (15) and (16) are of the form $n^{-1/2}\, a_j(r, d)$, where $a_j(r, d)$ is given by
$$a_1(r, d) = \frac{(\pi/2)^{d/4}\, \Gamma(r/2 - d/4)}{\Gamma(r/2)} \tag{19}$$
and
$$a_2(r, d) = \frac{(\pi/2)^{d/4}\, \Gamma(r/2 - d/4)}{\Gamma(r/2)}\, \nu^{1/2}, \tag{20}$$

respectively. For $j = 1, 2$, $a_j(c + d/2, d) \to 0$ exponentially fast as $d \to \infty$. Also, $r' > r > d/2$ implies $a_j(r', d) < a_j(r, d)$.

AN U

Acknowledgement

The authors are grateful to one reviewer, who suggested we include the discussion of Section 6 and pointed out various references.

TE DM

References

[1] Adams, R. A., and Fournier, J. J. F.: Sobolev Spaces. Academic Press, Amsterdam (2003). [2] Barron, A. R.: Neural net approximation. Proc. 7th Yale Workshop on Adaptive and Learning Systems, K. Narendra, Ed., Yale University Press (1992) 69–72. [3] Barron, A. R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39 (1993) 930–945.

EP

[4] Bellman, R.: Dynamic Programming. Princeton University Press, Princeton, NJ (1957). [5] Burger, M., and Neubauer, A.: Error bounds for approximation with neural networks. Journal of Approximation Theory 112 (2001) 235-250.

AC C

[6] Carlson, B. C.: Special Functions of Applied Mathematics. Academic Press, New York (1977). [7] Courant, R.: Differential and Integral Calculus, vol. 2. Wiley, New York, 1936 (1964 edition, transl. E. J. McShane). [8] Girosi, F.: Approximation error bounds that use VC-bounds. In Proceedings of the International Conference on Neural Networks, Paris (1995) 295-302.

16

ACCEPTED MANUSCRIPT

T

[9] Girosi, F., and Anzellotti, G.: Rates of convergence for radial basis functions and neural networks. In: Mammone, R. J., Ed., Artificial Neural Networks for Speech and Vision. Chapman & Hall (1993) 97-113.

RI P

[10] Hartman, E. J., Keeler, J. D., and Kowalski, J. M.: Layered neural networks with Gaussian hidden units as universal approximations. Neural Computation 2 (1990) 210–215. [11] Jones, L. K.: A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Annals of Statistics 20 (1992) 608–613

SC

[12] Kainen, P.C., and K˚ urkov´a, V.: An integral upper bound for neural-network approximation. Submitted to Neural Computation. http://www.cs.cas.cz/research, Research Report V-1023, Institute of Computer Science, Prague (2008).

AN U

[13] P. C. Kainen, V. K˚ urkov´a, M. Sanguineti, “Estimates of approximation rates by Gaussian radial-basis functions”, Lecture Notes in Computer Science, vol. 4432 (B. Beliczynski, A. Dzielinski, M. Iwanowski, B. Ribeiro, Eds.), pp. 11-18. Springer, Berlin-Heidelberg, 2007. [14] Kon, M. A., Raphael, L. A., and Williams, D. A.: Extending Girosi’s approximation estimates for functions in Sobolev spaces via statistical learning theory. Journal of Analysis and Applications 3 (2005) 67-90.

TE DM

urkov´a, V.: Dimension–independent rates of approximation by neural [15] K˚ networks. In: Computer–Intensive Methods in Control and Signal Processing: The Curse of Dimensionality, K. Warwick and M. K´arn´ y, Eds., Birkh¨auser, Boston (1997) 261–270. urkov´a, V.: High-dimensional approximation and optimization by neural [16] K˚ networks. In: Advances in Learning Theory: Methods, Models and Applications, J. Suykens et al., Eds., IOS Press, Amsterdam (2003). Chapter 4, 69–88. [17] K˚ urkov´a, V.: Minimization of error functionals over perceptron networks. Neural Computation 20 (2008) 252–270.

EP

[18] K˚ urkov´a, V., Kainen, P. C., and Kreinovich, V.: Estimates of the number of hidden units and variation with respect to half-spaces. Neural Networks 10 (1997) 1061–1068. [19] K˚ urkov´a, V., and Sanguineti, M.: Comparison of worst case errors in linear and neural network approximation. IEEE Transactions on Information Theory 48 (2002) 264–275.

AC C

[20] V. K˚ urkov´a, P. Savick´ y, and K. Hlav´aˇckov´a: Representations and rates of approximation of real–valued Boolean functions by neural networks. Neural Networks 11 (1998) 651–659.

[21] V. E. Maiorov: Approximation by ridge functions. J. of Approx. Theory 99 (1999) 68–94.

17

ACCEPTED MANUSCRIPT

T

[22] Mart´ınez, C., and Sanz, M.: The Theory of Fractional Powers of Operators. Elsevier, Amsterdam (2001).

RI P

[23] Mhaskar, H. N.: Versatile Gaussian networks. Proc. IEEE Workshop of Nonlinear Image Processing (1995) 70-73.

[24] Mhaskar, H. N.: An Introduction to the Theory of Weighted Polynomial Approximaton. World Scientific, Singapore (1996). [25] Mhaskar, H. N.: When is approximation by Gaussian networks necessarily a linear process? Neural Networks 17 (2004) 989-1001.

SC

[26] Mhaskar, H. N.: On the tractability of multivariate integration and approximation by neural networks. Journal of Complexity 20 (2004) 561-590.

AN U

[27] Mhaskar, H. N., and Micchelli, C. A.: Approximation by superposition of a sigmoidal function and radial basis functions. Advances in Applied Mathematics 13 (1992) 350-373. [28] Mhaskar, H. N., and Micchelli, C. A.: Dimension-independent bounds on approximation by neural networks. IBM Journal of Research and Development 38 (1994) 277-284. [29] Narcowich, F. J., Ward, J. D., and Wendland, H.: Sobolev error estimates and a Bernstein inequality for scattered data interpolation via radial basis functions. Constructive Approximation 24 (2006) 175-186.

TE DM

[30] Park, J., and Sandberg, I. W.: Universal approximation using radial–basis– function networks. Neural Computation 3 (1991) 246-257. [31] Park, J., and Sandberg, I.: Approximation and radial basis function networks. Neural Computation 5 (1993) 305-316. [32] Pisier, G.: Remarques sur un resultat non publi´e de B. Maurey. In: ´ Seminaire d’Analyse Fonctionelle, vol. I(12). Ecole Polytechnique, Centre de Math´ematiques, Palaiseau 1980-1981. [33] Stein, E. M.: Singular Integrals and Differentiability Properties of Functions. Princeton University Press, Princeton, NJ (1970).

EP

[34] Strichartz, R.: A Guide to Distribution Theory and Fourier Transforms. World Scientific, NJ (2003). [35] Traub, J. F. and Werschulz, A. G.: Complexity and Information. Cambridge University Press (1999).

AC C

[36] Wasilkowski, G. W. and Wo´zniakowski, H.: Complexity of weighted approximation over Rd . Journal of Complexity 17 (2001) 722-740. [37] Wo´zniakowski, H.: Tractability and strong tractability of linear multivariate problems. Journal of Complexity 10 (1994) 96-128.

18