On the tractability of multivariate integration and approximation by neural networks

H. N. Mhaskar*
Department of Mathematics, California State University, Los Angeles, California, 90032, U.S.A.

Dedicated to Hans-Peter Blatt and Charles Micchelli in celebration of their 61st birthday

Abstract

Let q ≥ 1 be an integer, Q be a Borel subset of the Euclidean space R^q, µ be a probability measure on Q, and F be a class of real valued, µ-integrable functions on Q. The complexity problem of approximating ∫ f dµ using quasi-Monte Carlo methods is to estimate
\[
E_n(\mathcal{F},\mu) := \inf_{x_1,\dots,x_n\in Q}\ \sup_{f\in\mathcal{F}}\ \Big|\int f\,d\mu - \frac{1}{n}\sum_{k=1}^{n} f(x_k)\Big|.
\]
The problem is said to be tractable if there exist constants c, α, β, independent of q (but possibly dependent on µ and F), such that E_n(F, µ) ≤ c q^α n^{−β}. We explore different regions (including manifolds), function classes, and measures for which this problem is tractable. Our results include tractability theorems for integration with respect to non-tensor product measures, and over unbounded and/or non-tensor product subsets, including the unit spheres of R^q with respect to various norms. We discuss applications to approximation capabilities of neural and radial basis function networks.
AMS Classification: 65D30, 41A25, 41A55. Keywords: Tractability, neural networks, radial basis function networks, dimension independent bounds.
* The research of this author was supported, in part, by grant DMS-0204704 from the National Science Foundation and grant DAAD19-01-1-0001 from the U.S. Army Research Office.
1 Introduction
In many applications, one needs to approximate the multi-dimensional integral ∫_Q f(x) dµ(x), where Q is a Borel subset of a Euclidean space R^q (where q ≥ 1 is an integer), and µ is a probability measure supported on Q. Such problems arise, for example, in mathematical finance [17, 18], statistical learning theory [22], and approximation by neural and radial basis function networks [1, 8, 11, 12, 13, 14]. The quasi-Monte Carlo technique for this approximation is to choose appropriate points x_1^*, ..., x_n^* so that the average of the values f(x_k^*) approximates this integral. An important question in complexity theory is to estimate
\[
E_n(\mathcal{F},\mu) := \inf_{x_1,\dots,x_n\in Q}\ \sup_{f\in\mathcal{F}}\ \Big|\int_Q f(x)\,d\mu(x) - \frac{1}{n}\sum_{k=1}^{n} f(x_k)\Big|, \tag{1.1}
\]
for a suitable class F of functions. The problem is said to be tractable if
\[
E_n(\mathcal{F},\mu) \le \frac{c\,q^{\alpha}}{n^{\beta}} \tag{1.2}
\]
for some constants c, α, β > 0, independent of q. However, the constants may depend upon µ and F. In addition to E_n(F, µ), it is often customary to study the normalized error, defined by
\[
\hat{E}_n(\mathcal{F},\mu) = \frac{E_n(\mathcal{F},\mu)}{\sup_{f\in\mathcal{F}}\big|\int f\,d\mu\big|}. \tag{1.3}
\]
The denominator in the above fraction may be thought of as an "initial cost" or the "cost of doing nothing", and the normalized error measures the improvement of the quasi-Monte Carlo method over this initial cost. If F contains the function I, which is identically equal to 1 on Q, then clearly, sup_{f∈F} |∫ f dµ| ≥ 1. If, in addition, each function f in F satisfies |f(x)| ≤ 1 (x ∈ Q), then
\[
\hat{E}_n(\mathcal{F},\mu) = E_n(\mathcal{F},\mu). \tag{1.4}
\]
In many results on this subject, the tractability problem is studied for functions that can be represented in the form x ↦ σ(Φ(x, ·)), where Φ is a fixed kernel function (e.g., the reproducing kernel in a reproducing kernel Hilbert space), and σ varies over a suitable class of functionals. Novak and Woźniakowski have given two interesting surveys of this topic in [15, 16].

Next, we note an interesting connection between the tractability problem for multivariate integration and approximation theory. Suppose that F is the unit ball of some normed linear function space X, on which point evaluations as well as the functional µ*, given by f ↦ ∫ f dµ, are continuous linear functionals. If we denote the point evaluation functional at a point x by δ_x, then it is clear that E_n(F, µ) gives an estimate on the degree of approximation of µ* in the dual norm of the norm on X from the convex hull of {δ_x}. An important example of this line of thought, which includes both neural and radial basis function networks, is formulated in the following theorem.
Theorem 1.1 Let Q, Q_1 be Borel subsets of some Euclidean spaces, Φ : Q × Q_1 → R be a fixed, bounded, Borel measurable kernel function, and M be a class of signed measures on Q having total variation equal to 1. We define F_Σ to be the class of all functions on Q of the form
\[
x \mapsto \int_{Q_1} \Phi(x,t)\,d\sigma(t), \tag{1.5}
\]
where σ ranges over all signed measures on Q_1 having total variation 1, and F_M to be the class of all functions on Q_1 of the form t ↦ ∫_Q Φ(x, t) dµ(x), µ ∈ M. The following are equivalent.
(a) We have
\[
E_n(\mathcal{F}_\Sigma,\mu) \le \delta_n, \qquad \mu\in\mathcal{M}. \tag{1.6}
\]
(b) We have
\[
\inf_{x_1,\dots,x_n\in Q}\ \sup_{t\in Q_1}\ \Big| g(t) - \frac{1}{n}\sum_{j=1}^{n}\Phi(x_j,t)\Big| \le \delta_n, \qquad g\in\mathcal{F}_\mathcal{M}. \tag{1.7}
\]
Proof. Let (1.6) hold, µ ∈ M, and ε > 0 be arbitrary. Since Φ(·, t) ∈ F_Σ for every t ∈ Q_1, (1.6) implies that there exist x_1, ..., x_n ∈ Q (independent of t ∈ Q_1) such that
\[
\Big|\int \Phi(x,t)\,d\mu(x) - \frac{1}{n}\sum_{j=1}^{n}\Phi(x_j,t)\Big| \le \delta_n + \epsilon, \qquad t\in Q_1. \tag{1.8}
\]
In view of the definition of the class F_M, this estimate is equivalent to the estimate (1.7).

Conversely, let (1.7) hold, µ ∈ M, and ε > 0 be arbitrary. Then there exist points x_1, ..., x_n ∈ Q such that (1.8) holds. Let f(x) = ∫ Φ(x, t) dσ(t) for some signed measure σ on Q_1 with total variation equal to 1. Since Φ is a bounded function, we may use Fubini's theorem to conclude that
\[
\int\!\!\int \Phi(x,t)\,d\mu(x)\,d\sigma(t) = \int\!\!\int \Phi(x,t)\,d\sigma(t)\,d\mu(x) = \int f(x)\,d\mu(x).
\]
Therefore, (1.8) leads to
\[
\Big|\int f(x)\,d\mu(x) - \frac{1}{n}\sum_{j=1}^{n} f(x_j)\Big| \le \delta_n + \epsilon, \qquad f\in\mathcal{F}_\Sigma;
\]
i.e., E_n(F_Σ, µ) ≤ δ_n. This proves (1.6). □
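The following Python sketch illustrates the quantity appearing in (1.7): the kernel average g(t) = ∫ Φ(x, t) dµ(x) is approximated by the equal-weight sum (1/n) Σ Φ(x_j, t). All concrete choices here (the Gaussian kernel, µ = uniform measure on the cube, points x_j drawn at random from µ, the sample sizes) are ours and serve only as an illustration; they are not the optimal points in the definition of E_n.

```python
import numpy as np

rng = np.random.default_rng(0)
q, n = 5, 2000

def Phi(x, t):
    # A bounded kernel, Phi(x, t) = exp(-||x - t||_2^2); an illustrative choice.
    return np.exp(-np.sum((x - t) ** 2, axis=-1))

# mu = uniform probability measure on [0, 1]^q.
# g(t) = integral of Phi(., t) d(mu), estimated here by a large Monte Carlo sample.
x_large = rng.random((200_000, q))   # stands in for the exact integral against mu
x_qmc = rng.random((n, q))           # the n points x_1, ..., x_n

t_test = rng.random((50, q))         # a few test parameters t in Q_1
g_ref = np.array([Phi(x_large, t).mean() for t in t_test])
g_net = np.array([Phi(x_qmc, t).mean() for t in t_test])

# max over the test t's of |g(t) - (1/n) sum_j Phi(x_j, t)|, the quantity bounded in (1.7)
print("worst-case error over test t's:", np.abs(g_ref - g_net).max())
```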
We observe that one needs an estimate of the form (1.6) uniformly for a large class of measures to make the approximation estimate (1.7) interesting. Our first aim in this paper is to explore a general framework that enables us to analyse different regions, manifolds, function classes, and classes of measures for which an estimate of the form (1.2) can be obtained. Many of the known results on the tractability problem deal with "tensor product" function classes and measures. Our results include tractability theorems for integration with respect to non-tensor product measures, and over unbounded and/or non-tensor product subsets, including the unit spheres of R^q with respect to various norms.

Our second aim in this paper is to obtain bounds of the form cq^α/n^β on the degree of approximation by neural and radial basis function networks, where n is the number of "neurons" in the network (cf. Section 4 for the definition), and c, α, β are independent of n and q. Typically, the known estimates in this theory are of the form c(q)/n^β, where β is independent of q, but often without the requirement that c(q) be polynomially dependent on q. To give a preview of one of the novelties of our results in this paper, we recall, for example, that a radial basis function (RBF) network with activation function φ : [0, ∞) → R and n neurons (and norm ‖·‖) is a function of the form x ↦ Σ_{j=1}^n a_j φ(‖x − y_j‖), where y_j ∈ R^q and a_j ∈ R, 1 ≤ j ≤ n. Most results on the degree of approximation by radial basis function networks assume the norm ‖·‖ to be the usual Euclidean norm. One novelty of our results is that we are able to supply some bounds in the case of any absolute norm (cf. Section 2.2 for the definition) in the argument of the activation function.

In the next section, we develop some basic concepts which will be needed in formulating our results. The main theorems concerning integration are stated in Section 3. Section 4 describes some applications to the theory of approximation by neural and radial basis function networks. The proofs are given in Section 5.

The paper was originally intended to be a joint work with Professor Steven Damelin. In particular, he encouraged me to pursue this research, which had lain dormant in my mind for several years. In a series of emails, especially related to Theorem 5.1, Proposition 5.2, and Proposition 5.4, he asked many questions, thus highlighting what parts of the first draft might be obscure to some readers. I am thankful to him for his efforts. I am grateful to Professor Grzegorz Wasilkowski for a careful reading of another draft and for making many useful suggestions, resulting in a substantial improvement of the results in that draft. Finally, I am grateful both to Professor Grzegorz Wasilkowski and Professor Henryk Woźniakowski for their keen interest in this work.
2 Preparatory concepts

2.1 Measures
In this section, we introduce certain classes of measures which will be needed in the statement of our theorems. We denote by λ_q the q-dimensional Lebesgue measure. For 1 ≤ p ≤ ∞ and x ∈ R^q, we define
\[
\|x\|_p := \|x\|_{q,p} := \begin{cases} \Big(\sum_{k=1}^{q} |x_k|^p\Big)^{1/p}, & \text{if } 1\le p<\infty,\\[1ex] \max_{1\le k\le q}|x_k|, & \text{if } p=\infty. \end{cases} \tag{2.1}
\]

Definition 2.1 Let µ be a probability measure on R^q, L, M, β, γ ≥ 0.
(a) The measure µ is said to satisfy a decay condition (with parameters (L, β)) if for all δ ∈ (0, 1],
\[
\mu\big(\mathbb{R}^q \setminus [-L\delta^{-\beta}, L\delta^{-\beta}]^q\big) \le \delta. \tag{2.2}
\]
(b) The measure µ is said to satisfy a continuity condition (with parameters (M, γ)) if for all Borel sets S ⊆ R^q,
\[
\mu(S) \le (M\lambda_q(S))^{\gamma}. \tag{2.3}
\]
(c) The measure µ is said to be regular (with parameters (L, β, M, γ)) if it satisfies both (2.2) and (2.3).
(d) A signed measure σ will be called regular if the total variation measure |σ| is regular in the sense of part (c). Similar terminology will apply for σ to satisfy a decay condition or a continuity condition.

Next, we give some examples of regular, non-tensor product measures.

Example 1. Let K be a compact subset of R^q, 1 < p ≤ ∞, f : K → [0, ∞) be Lebesgue measurable with a finite L^p norm N (with respect to the Lebesgue measure) on K, and ∫_K f dλ_q = 1. The measure defined by µ_f(S) = ∫_{S∩K} f dλ_q is a regular measure with parameters (max_{x∈K} ‖x‖_∞, 0, N^{p'}, 1/p'), where 1/p + 1/p' = 1. □

Example 2. We give two examples of non-tensor product measures supported on the whole space. Let
\[
\mu_{\exp}(\alpha; S) := \lambda_{\exp,\alpha}\int_S \exp(-\|x\|_2^{\alpha})\,dx, \tag{2.4}
\]
where α > 0, and
\[
\lambda_{\exp,\alpha} := \frac{\alpha\,\Gamma(q/2)}{2\pi^{q/2}\,\Gamma(q/\alpha)} \tag{2.5}
\]
is chosen to make µ_exp(α; R^q) = 1 (cf. (5.33) below). Another set of examples is given by
\[
\mu_{\mathrm{pow}}(\alpha; S) := \lambda_{\mathrm{pow},\alpha}\int_S \frac{dx}{1+\|x\|_2^{\alpha}}, \tag{2.6}
\]
where α > q, and
\[
\lambda_{\mathrm{pow},\alpha} := \frac{\alpha\,\Gamma(q/2)\,\sin(\pi q/\alpha)}{2\pi^{(q+2)/2}} \tag{2.7}
\]
is chosen to make µ_pow(α; R^q) = 1.

Proposition 2.1 (a) Let α > 0. The measure µ_exp(α) satisfies the continuity condition with γ = 1 and M = λ_{exp,α}. It satisfies the decay condition with any β > 0 and corresponding L given by
\[
L_{\exp} := \left\{\frac{2}{\Gamma(q/\alpha)}\,\frac{1+|q-\alpha|\beta}{\alpha\beta e}\right\}^{\frac{1+|q-\alpha|\beta}{\alpha\beta}} + \frac{2|q-\alpha|}{\alpha} + 1. \tag{2.8}
\]
(b) Let α > q. The measure µ_pow(α) satisfies the continuity condition with γ = 1 and M = λ_{pow,α}. It satisfies the decay condition with β_pow = 1/(α − q) and
\[
L_{\mathrm{pow}} := \left\{\frac{\alpha\sin(\pi q/\alpha)}{\pi(\alpha-q)}\right\}^{1/(\alpha-q)}. \tag{2.9}
\]

By restricting and renormalizing these measures to different Borel sets, one can easily generate examples of regular non-tensor product measures supported on Borel sets other than the whole space, including sets that are both non-tensor product and unbounded. □
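The decay condition (2.2) for µ_exp(α) can also be checked numerically. The sketch below samples from µ_exp(α) using the standard representation ‖x‖_2 = T^{1/α} with T ~ Gamma(q/α, 1) and a uniformly distributed direction, and compares the empirical tail mass outside a cube with δ. The trial constants L and β here are our own illustrative guesses, not the constant L_exp of (2.8).

```python
import numpy as np

rng = np.random.default_rng(1)
q, alpha = 8, 1.5
n_samples = 500_000

# Sample from mu_exp(alpha): density proportional to exp(-||x||_2^alpha).
# Radius: r = T^(1/alpha) with T ~ Gamma(q/alpha, 1); direction: uniform on the sphere.
r = rng.gamma(shape=q / alpha, scale=1.0, size=n_samples) ** (1.0 / alpha)
u = rng.standard_normal((n_samples, q))
u /= np.linalg.norm(u, axis=1, keepdims=True)
x = r[:, None] * u

# Empirical check of (2.2): mu(R^q \ [-L d^-beta, L d^-beta]^q) <= d.
beta, L = 0.5, 6.0          # trial parameters chosen for illustration only
for d in (0.5, 0.1, 0.02):
    R = L * d ** (-beta)
    outside = np.mean(np.max(np.abs(x), axis=1) > R)
    print(f"delta={d:5.2f}  empirical tail={outside:.4f}  (should be <= {d})")
```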
2.2 Geometrical concepts
Let ‖·‖ be any absolute norm on R^q; i.e., we assume that ‖(x_1, ..., x_q)‖ = ‖(|x_1|, ..., |x_q|)‖ for all x ∈ R^q. It is known (cf. [6, Theorem 5.5.10]) that ‖·‖ is monotone; i.e., |x_j| ≤ |y_j|, 1 ≤ j ≤ q, implies ‖x‖ ≤ ‖y‖. Let e_j be the unit vector whose j-th component is 1 and other components are 0, κ_1^{−1} := min_{1≤j≤q} ‖e_j‖, and κ_2 := ‖(1, ..., 1)‖. Then the monotonicity of the norm leads to
\[
\kappa_1^{-1}\|x\|_\infty \le \|x\| \le \kappa_2\|x\|_\infty, \qquad x\in\mathbb{R}^q. \tag{2.10}
\]
In the sequel, we will adopt the following notation. If ⊕ is a binary operation on R and x, y ∈ R^q, then x ⊕ y will be the vector in R^q whose j-th component is x_j ⊕ y_j. If c ∈ R, then c ⊕ x := (c, ..., c) ⊕ x, and x ⊕ c := x ⊕ (c, ..., c). Conventions regarding the placement of the operator ⊕ will be continued as usual; for example, max(x, y) is the vector whose j-th component is max(x_j, y_j). Similar conventions are followed for binary relations. In particular, for x ∈ R^q and r ∈ [0, ∞]^q, we define the vector z = x/r by
\[
z_j = \begin{cases} \infty, & \text{if } r_j=0 \text{ and } x_j\neq 0,\\ 0, & \text{if } r_j=x_j=0,\\ 0, & \text{if } r_j=\infty,\\ x_j/r_j, & \text{otherwise.} \end{cases}
\]
If a component of z is infinity, we set ‖z‖ := ∞. For y ∈ R^q, r ∈ [0, ∞]^q, we define the ellipsoid
\[
B(\|\cdot\|, y, r) = \Big\{ x\in\mathbb{R}^q : \Big\|\frac{x-y}{r}\Big\| \le 1 \Big\}. \tag{2.11}
\]
We note that the values 0 and ∞ are both valid for the components of r in the above definition. If all components of r are equal to r, then the ellipsoid is the ball denoted by B(‖·‖, y, r) := B(‖·‖, y, (r, ..., r)). We denote B(‖·‖, 0, 1) by B_{‖·‖}, its volume by τ_{q,‖·‖}, its boundary by S^{q−1}_{‖·‖}, and the area of this boundary by ω_{q−1,‖·‖}. It is easy to see that
\[
\lambda_q(B(\|\cdot\|, y, r)) = \tau_{q,\|\cdot\|}\prod_{k=1}^{q} r_k, \tag{2.12}
\]
where 0 · ∞ := 0.

Next, we introduce some notation concerning strips. Let x·y denote the inner product of x and y. For y ∈ S^{q−1}_{‖·‖_2} and a, b ∈ R, a ≤ b, we write
\[
S(y, a, b) := \{x\in\mathbb{R}^q : x\cdot y\in[a,b]\}. \tag{2.13}
\]
Further, S(y, a, ∞) := ∪_{b∈(a,∞)} S(y, a, b) and S(y, −∞, b) := ∪_{a∈(−∞,b)} S(y, a, b). If a > b, we define S(y, a, b) to be the empty set.

Finally, we discuss some notation related to the sphere. A cap of radius r in S^{q−1}_{‖·‖} centered at y is defined by
\[
S^{q-1}_{\|\cdot\|, r}(y) := \{x\in S^{q-1}_{\|\cdot\|} : \|x-y\|\le r\}. \tag{2.14}
\]
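The componentwise quotient just introduced, together with the membership test for the ellipsoid B(‖·‖, y, r), can be expressed in a few lines of code. The following sketch uses the ℓ1 norm as one example of an absolute norm; the function names are ours.

```python
import numpy as np

def quotient(x, r):
    """Componentwise x/r with the conventions of Section 2.2:
    x_j/0 = inf for x_j != 0, 0/0 = 0, and x_j/inf = 0."""
    z = np.empty_like(x, dtype=float)
    for j, (xj, rj) in enumerate(zip(x, r)):
        if rj == 0.0:
            z[j] = 0.0 if xj == 0.0 else np.inf
        elif np.isinf(rj):
            z[j] = 0.0
        else:
            z[j] = xj / rj
    return z

def in_ellipsoid(x, y, r, norm=lambda v: np.sum(np.abs(v))):
    """Membership test for B(||.||, y, r) = {x : ||(x - y)/r|| <= 1}.
    If some component of (x - y)/r is infinite, the norm is set to infinity."""
    z = quotient(np.asarray(x, float) - np.asarray(y, float), np.asarray(r, float))
    return np.inf not in z and norm(z) <= 1.0

print(in_ellipsoid([0.5, 0.2], [0.0, 0.0], [1.0, np.inf]))  # True: second coordinate unconstrained
print(in_ellipsoid([0.5, 0.2], [0.0, 0.0], [1.0, 0.0]))     # False: r_2 = 0 forces x_2 = y_2
```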
2.3 Function classes
For a subset S ⊆ R^q, the characteristic function of S is defined by
\[
\chi(S; x) := \begin{cases} 1, & \text{if } x\in S,\\ 0, & \text{otherwise.}\end{cases} \tag{2.15}
\]
The constant function taking the value 1 everywhere on R^q will be denoted by I. An estimate on E_n(F, µ), where F consists of characteristic functions of certain sets, is usually called a discrepancy estimate. In Section 5, we will obtain discrepancy estimates for the following classes of characteristic functions. We start with the set of characteristic functions of ellipsoids:
\[
B(\|\cdot\|, R, R_1) := \{\chi(B(\|\cdot\|, y, r)) : y\in(-R,R)^q,\ r\in[0,R_1]^q\}\cup\{I\}, \qquad R, R_1\ge 0, \tag{2.16}
\]
where R or R_1 may also be infinity. In the case of the norm ‖·‖_∞, the ellipsoids are just cells in R^q, and we write
\[
R(R, R_1) := B(\|\cdot\|_\infty, R, R_1). \tag{2.17}
\]
(We recall that an open cell in R^q is a set of the form ∏_{k=1}^q I_k, where each I_k is an open interval in R. By a cell, we will mean the closure of an open cell.) Similarly, we define
\[
K_{\|\cdot\|} := \{\chi(S^{q-1}_{\|\cdot\|, r}(y)) : y\in S^{q-1}_{\|\cdot\|},\ r\ge 0\}, \tag{2.18}
\]
and
\[
S(R) := \{\chi(S(y,a,b)\cap B(\|\cdot\|_2, 0, R)) : y\in S^{q-1}_{\|\cdot\|_2},\ a,b\in\mathbb{R}\}\cup\{I\}. \tag{2.19}
\]
We note that the class K_{‖·‖} already contains the function I. Clearly, any estimate on E_n(F, µ) is valid also if F is replaced by its signed convex hull, i.e., the set of functions of the form Σ a_j f_j, where the sum is a finite sum, f_j ∈ F for each j, and the a_j's are real numbers with Σ|a_j| ≤ 1. We may write Σ a_j f_j in the form ∫ Φ(·, j) dσ(j), where for each j involved in the sum, Φ(·, j) := f_j, and σ is the signed measure that associates the mass a_j with the integer j. With this motivation in mind, we now proceed to define the notion of a generalized convex hull of F, denoted by conv(F). In the case when F is the set of characteristic functions defined above, conv(F) contains functions of the form (2.21) or (2.22) (described below), as well as some other sets of functions recently considered in the literature on tractability problems.

Let F be a class of functions on a subset Q of R^q, and Q_1 be a measure space. An F-valued process on Q_1 is a jointly measurable mapping Φ : Q × Q_1 → R such that for each t ∈ Q_1, Φ(·, t) ∈ F. The generalized convex hull of F with respect to Q_1, denoted by conv(F, Q_1), is defined to be the set of all functions of the form
\[
x \mapsto \int \Phi(x,t)\,d\sigma(t), \tag{2.20}
\]
where σ ranges over all signed measures on Q_1 having total variation not exceeding 1, and Φ ranges over all F-valued processes on Q_1. The class conv(F) consists of functions that are in conv(F, Q_1) for some measure space Q_1. (Here, we have tacitly assumed a "universal set" of all measure spaces of interest. In this paper, this universal set consists of all Borel measurable subsets of all finite dimensional Euclidean spaces.)

Next, we discuss some examples of the notion of generalized convex hulls. We recall (cf. [20, Chapter 8, Sections 12–21]) that there is a one-to-one correspondence between signed measures having bounded variation on R and functions having bounded variation on R. Thus, if φ : R → R is a right (respectively, left) continuous function having bounded variation, and φ(x) → 0 as x → −∞ (respectively, φ(x) → 0 as x → ∞), then there exists a unique signed measure µ_φ such that φ(x) = µ_φ((−∞, x]) (respectively, φ(x) = −µ_φ([x, ∞))) for all x ∈ R, and the total variation of this measure is the same as the total variation of φ. Similar representations hold for functions defined on subintervals of R, satisfying different one-sided continuity conditions and normalizations. Therefore, one usually thinks of φ itself as a signed measure, and writes dφ in place of dµ_φ, where µ_φ is the measure appropriate to the normalizations of φ. The corresponding total variation measure is usually denoted (in the context of integration) by |dφ|.

Example 3. Let φ be a left continuous function of bounded variation on [0, ∞), such that lim_{x→∞} φ(x) = 0, and the total variation of φ is equal to 1. If σ is a signed measure on R^q × [0, ∞)^q having total variation equal to 1, then a function of the form
\[
x \mapsto \int \phi\Big(\Big\|\frac{x-y}{r}\Big\|\Big)\,d\sigma(y,r) = -\int\!\!\int \chi(B(\|\cdot\|, x, ru); y)\,d\phi(u)\,d\sigma(y,r) \tag{2.21}
\]
is in conv(B(‖·‖, ∞, ∞)). □
Example 4. Let φ : R → [0, ∞) be a right continuous function of bounded variation with lim_{x→−∞} φ(x) = 0, and σ be a signed measure on R^q having total variation equal to 1. Then a function of the form
\[
x \mapsto \int \phi(x\cdot y)\,d\sigma(y) = \begin{cases} \phi(0)\,\sigma(\mathbb{R}^q), & \text{if } x=0,\\[1ex] \displaystyle\int\!\!\int \chi(S(x/\|x\|, u/\|x\|, \infty); y)\,d\phi(u)\,d\sigma(y), & \text{if } x\neq 0, \end{cases} \tag{2.22}
\]
is in conv(S(∞)). Functions of the forms described in this and the previous example are of interest in the theory of neural networks and radial basis function networks respectively. We will examine these in further detail in Section 4. □

Example 5. In this example, we discuss the set conv(R(R, R_1)) in some detail. In [5], Hickernell, Sloan, and Wasilkowski have studied the tractability of quasi-Monte Carlo approximation of an integral of the form ∫_Q F(x)W(x)dx, where Q is a (bounded or unbounded) cell in R^q, and W(x) = ∏_{k=1}^q W_k(x_k) for some weights W_k : R → [0, ∞). They have pointed out that by simple substitutions, this problem is equivalent to the problem of approximating ∫_D f(x)dx, where D = [−1/2, 1/2]^q. The problem is proved to be tractable for the class F_H of functions, defined as follows.

We fix an anchor c ∈ D. For a subset U ⊆ {1, ..., q}, let D_U := [−1/2, 1/2]^{|U|}. For x ∈ R^q, let x_U denote the vector of length |U| whose components are the components x_j of x for which j ∈ U, and (x_U, c) be the q-dimensional vector whose k-th component is x_k if k ∈ U, and c_k if k ∉ U. For a sufficiently smooth function f : D → R to allow the following differentiation, we write
\[
f_U'(x_U) = \frac{\partial^{|U|}}{\prod_{k\in U}\partial x_k}\, f(x_U, c). \tag{2.23}
\]
If U is the empty set, the corresponding f_U' is defined to be the constant function f(c). If U is the empty set, it is also convenient to define ‖f_U'‖_{1,D_U} := |f(c)|. The class F_H consists of all functions f : D → R for which f_U'(x_U) exists for each x ∈ [−1/2, 1/2]^q and for each U ⊆ {1, ..., q}, and
\[
\|f\|_{1,D} := \sum_{U\subseteq\{1,\dots,q\}} \int_{D_U} |f_U'(x_U)|\,dx_U \le 1. \tag{2.24}
\]
For f ∈ F_H and x ≥ c, we have the integral representation (cf. [5])
\[
f(x) = \sum_{U\subseteq\{1,\dots,q\}} \int_{[0,1/2]^{|U|}} \chi\big([(y_U, c), (1/2, \dots, 1/2)]; x\big)\,f_U'(y_U)\,dy_U. \tag{2.25}
\]
Similar representations hold in each of the 2^q quadrants of D defined by c. We choose the anchor c = 0, and observe that each of the cells involved in (2.25) (and its analogue in the other quadrants) has its center in [−1/2, 1/2]^q and ‖·‖_∞-radius not exceeding 1/4. Thus, the class F_H is seen to be a subset of conv(R(1/2, 1/4)). Hickernell, Sloan, and Wasilkowski have already made use of this observation in [5] to deduce an estimate on E_n(F_H, λ_q) from that on the class of characteristic functions of cells. The additional observation here is regarding the radii and locations of centers of the cells involved. □

The following proposition summarizes an observation which we will use extensively.

Proposition 2.2 Let µ be a probability measure on a Borel measurable subset Q of R^q, and F be a class of Borel measurable functions on Q, such that |f(x)| ≤ 1 for all f ∈ F and x ∈ Q. We have for integer n ≥ 1,
\[
E_n(\mathrm{conv}(\mathcal{F}),\mu) = E_n(\mathcal{F},\mu) = \inf_{x_1,\dots,x_n\in Q}\ \sup_{f\in\mathcal{F}}\ \Big|\int_Q f(x)\,d\mu(x) - \frac{1}{n}\sum_{k=1}^{n} f(x_k)\Big|. \tag{2.26}
\]
In particular, if ε > 0, one may choose points x_j depending only on ε, µ, and F, and independently of the measure spaces, processes, and measures needed to define functions in conv(F), such that
\[
\sup_{f\in\mathrm{conv}(\mathcal{F})}\ \Big|\int_Q f(x)\,d\mu(x) - \frac{1}{n}\sum_{k=1}^{n} f(x_k)\Big| \le E_n(\mathcal{F},\mu) + \epsilon.
\]
3 Tractability of integration
In this section, we will discuss a variety of theorems estimating E_n(F, µ) for different function classes and measures. In each case, F includes the function I, and |f| ≤ 1 for all f ∈ F. Hence, the normalized error Ê_n(F, µ) = E_n(F, µ) in each case. In the sequel, we write
\[
G := \frac{4}{3\log 3 - 2} \approx 3.0868, \tag{3.1}
\]
and, for κ, B > 0,
\[
\Delta_n(\kappa, B) := 2\sqrt{\frac{G}{n}\Big\{B + \frac{\kappa}{2}\log\frac{n}{GB}\Big\}}. \tag{3.2}
\]
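The quantity (3.2) is elementary to evaluate; the following short helper, a direct transcription of (3.1) and (3.2), may be used to check whether particular parameters yield a nontrivial bound. The sample values of κ and B are illustrative only.

```python
import math

G = 4.0 / (3.0 * math.log(3.0) - 2.0)        # the constant G of (3.1), approximately 3.0868

def delta_n(n, kappa, B):
    """The bound Delta_n(kappa, B) of (3.2); the estimates are stated for n >= G*B."""
    if n < G * B:
        raise ValueError("the estimate is stated only for n >= G*B")
    return 2.0 * math.sqrt((G / n) * (B + 0.5 * kappa * math.log(n / (G * B))))

# Example: decay of the bound in n for fixed kappa and B (illustrative values).
for n in (10**3, 10**4, 10**5):
    print(n, delta_n(n, kappa=20.0, B=50.0))
```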
In general, our estimates will have the form E_n(F, µ) ≤ ∆_n(κ, B), where the constants κ, B will be given explicitly, in terms of the various parameters defining the function classes and the decay/continuity conditions on the measures. It is perhaps possible to sharpen these results by removing the term log n, using ideas from the Vapnik–Chervonenkis (V–C) theory of probability, as in [5]. However, this is expected to give an unspecified constant depending on F and µ. We have decided to choose explicitly defined constants, even if they might not be the best ones, and to accept the slightly weaker result with the logarithmic term, in order to make it easier to determine whether our theorems imply tractability for particular measures and function classes, with absolute constants.

Our first theorem is an extension of estimate (21) in [5, Theorem 3] of Hickernell, Sloan, and Wasilkowski regarding the tractability of integration with respect to λ_q on [−1/2, 1/2]^q.

Theorem 3.1 Let 0 < R, R_1 < ∞, M, γ > 0, and L, β ≥ 0.
(a) Let
\[
2qM(4R_1)^{q-1}\min(3R_1/4, R) \ge 1. \tag{3.3}
\]
With B defined in (5.10), we have for n ≥ GB, and any measure µ satisfying a continuity condition with parameters (M, γ),
\[
E_n(\mathrm{conv}(R(R, R_1)),\mu) \le \Delta_n(2q/\gamma, B). \tag{3.4}
\]
(b) Let
\[
Mq\tau_{q,\|\cdot\|}(2R_1)^{q-1}\min\big(1,\ 3R_1/4,\ \sqrt{R}(1+\kappa_2+\kappa_2 R_1)\big) \ge 1. \tag{3.5}
\]
With B defined in (5.15), we have for n ≥ GB, and any measure µ satisfying a continuity condition with parameters (M, γ),
\[
E_n(\mathrm{conv}(B(\|\cdot\|, R, R_1)),\mu) \le \Delta_n(3q\gamma^{-1}, B). \tag{3.6}
\]
(c) Let qM(2^{2+β}L)^q ≥ 2. With the constant B as in (5.12), we have for n ≥ GB, and any regular measure µ with parameters (L, β, M, γ),
\[
E_n(\mathrm{conv}(R(\infty,\infty)),\mu) \le \Delta_n(2q(\beta q + 1/\gamma), B). \tag{3.7}
\]
Part (b) of this theorem is clearly a generalization of part (a), except for different constants. We present part (a) separately to allow a comparison with the result in [5] (Example 6 below). We note that the support of the measure µ in part (c) may well be an unbounded and non-tensor product set. In the most general cases, the value of B determines the tractability, and is O(q² log q), where the constant involved in O may depend upon µ, R, R_1, and the norm ‖·‖. In some special cases, however, the value of B is smaller. We illustrate this with a few examples.

Example 6. Theorem 3.1(a) may be applied to the case explained in Example 5. In this example only, let D = [−1/2, 1/2]^q, M ∈ [1, ∞), w : [−1/2, 1/2]^q → [0, M], and ∫_D w(x)dx = 1. The measure µ defined on D by dµ = w(x)dx satisfies a continuity condition with parameters (M, 1). Since F_H ⊂ conv(R(1/2, 1/4)), we take R = 1/2 and R_1 = 1/4. Since M ≥ 1, the condition (3.3) is satisfied if q ≥ 3. Part (a) of the above theorem therefore implies that for the class F_H (with the anchor fixed at 0), we have B = (4q + 1) log 2 + 2q log(qM), and
\[
E_n(\mathcal{F}_H,\mu) \le 2\left\{\frac{G}{n}\Big(B + q\log\frac{n}{GB}\Big)\right\}^{1/2}. \tag{3.8}
\]
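For concreteness, the right-hand side of (3.8) is easy to evaluate for given q, M and n. The values of q, M and n below are purely illustrative; the computation mirrors the helper given after (3.2).

```python
import math

# B from Example 6 (anchor at 0); the bound (3.8) is Delta_n(2q, B) with gamma = 1.
q, M = 10, 2.0                      # illustrative values only
B = (4 * q + 1) * math.log(2.0) + 2 * q * math.log(q * M)

G = 4.0 / (3.0 * math.log(3.0) - 2.0)
for n in (10**4, 10**6):
    bound = 2.0 * math.sqrt((G / n) * (B + q * math.log(n / (G * B))))
    print(f"q={q}, n={n:>8}: right-hand side of (3.8) = {bound:.4f}")
```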
We note that w does not need to be a tensor product function. In the case when w ≡ 1, we recover the corresponding result in [5] as far as the order of magnitude of the dependence on q and n is concerned, apart from the values of the different constants involved. □

Example 7. The purpose of this example is to illustrate Theorem 3.1(b) with a non-tensor product region of integration and a non-tensor product measure. We take ‖·‖ = ‖·‖_2, and omit the reference to this norm from the notations. To avoid conflict of notation, we write U = B_{‖·‖}. Let (in this example only) Φ denote the class of functions φ : [0, ∞) → [0, 1] which are decreasing on [0, ∞), with φ(0) = 1 and φ(x) = 0 if x ≥ 1/2, M be the set of all measures supported on U having total variation 1, and F be the class of functions of the form
\[
f(x) = \int \phi(\|x-y\|)\,d\sigma(y) = -\int\!\!\int \chi(B(\|\cdot\|, x, u); y)\,d\phi(u)\,d\sigma(y), \qquad x\in U,\ \phi\in\Phi,\ \sigma\in\mathcal{M}.
\]
We observe that F ⊂ conv(B(‖·‖, 1, 1/2)). Let g : [0, 1] → [0, ∞) be an integrable function with ∫_0^1 g(u)du = 1 and 0 ≤ g(u) ≤ N for u ∈ [0, 1], and µ be the measure defined on U by dµ(x) = τ_q^{−1} g(‖x‖^q) dλ_q(x). Then µ is a probability measure, satisfying a continuity condition with M = N/τ_q, γ = 1. The condition (3.5) is satisfied if q ≥ 3. The value of B in (5.15) is given by (4q + 1) log 2 + 3q log N + 2q log(2 + 3√q) + 3q log q. □

We observe that if a measure satisfies a decay condition with parameters (L, β), then it also satisfies a decay condition with parameters (L_1, β_1) for any L_1 ≥ L and β_1 ≥ β. Similarly, if it satisfies a continuity condition with parameters (M, γ), then it also satisfies a continuity condition with parameters (M_1, γ) for all M_1 ≥ M. Therefore, the condition (3.3) may be omitted by replacing M in (5.10) by max(M, {2q(4R_1)^{q−1} min(3R_1/4, R)}^{−1}). Similar remarks hold also for the condition (3.5), the lower bound condition in Theorem 3.1(c), and other similar conditions in the other theorems in this paper. However,
we feel that the formulations given here allow better clarity in the different formulas, as well as better flexibility in applying the results.

Our next theorem deals with tractability on the spheres.

Theorem 3.2 Let M_1 ≥ 1, γ_1 > 0, and the constant B be defined by (5.17). For any integer n ≥ GB and any measure µ, supported on S^{q−1}_{‖·‖} and satisfying a spherical continuity condition,
\[
\sup_{y\in S^{q-1}_{\|\cdot\|}} \mu\big(S^{q-1}_{\|\cdot\|, r}(y)\setminus S^{q-1}_{\|\cdot\|, \rho}(y)\big) \le (M_1(r-\rho))^{\gamma_1}, \qquad r\ge\rho\ge 0, \tag{3.9}
\]
we have
\[
E_n(\mathrm{conv}(K_{\|\cdot\|}),\mu) \le \Delta_n(q/\gamma_1, B). \tag{3.10}
\]

The constant B in (5.17) is O(q), although the constants involved may depend upon both µ and ‖·‖. We elaborate upon an example which we find especially interesting.

Example 8. Let K be any compact, convex subset of R^q, 0 be in the interior of K, and K be symmetric in the sense that x ∈ K if and only if (|x_1|, ..., |x_q|) ∈ K. The Minkowski functional for K is defined by
\[
\|x\|_K := \inf\{t>0 : t^{-1}x\in K\}. \tag{3.11}
\]
It is well known ([6, Theorem 5.5.8 and its proof]) that ‖·‖_K is an absolute norm, and K = B_{‖·‖_K}. Conversely, for any absolute norm ‖·‖, ‖·‖ = ‖·‖_{B_{‖·‖}}. In particular, if the unit vectors e_j are on the boundary of K, then
\[
\|x\|_\infty \le \|x\|_K \le \|x\|_1 \le q\|x\|_\infty. \tag{3.12}
\]
Thus, Theorem 3.2 applies to integration over sets lying on the boundary of sets K satisfying the properties mentioned above. The constant B in (3.10) in such cases is q log q + q log(7M_1) + 3 log 2. □

Finally, we state a theorem related to classes defined in terms of strips.

Theorem 3.3 Let R, γ, M > 0, L, β ≥ 0.
(a) Let 2Mτ_{q−1,‖·‖_{q−1,2}} R^q ≥ 1. With B as in (5.20), we have for integer n ≥ GB, and any measure µ satisfying a continuity condition with parameters (M, γ),
\[
E_n(\mathrm{conv}(S(R)),\mu) \le \Delta_n((q+1)/\gamma, B). \tag{3.13}
\]
(b) Let
\[
2^{\beta q+1}\,\tau_{q-1,\|\cdot\|_{q-1,2}}\,M L^q \ge 1. \tag{3.14}
\]
With B as in (5.22), we have for integer n ≥ GB, and any regular measure µ with parameters (L, β, M, γ),
\[
E_n(\mathrm{conv}(S(\infty)),\mu) \le \Delta_n((q+1)(q\beta + 1/\gamma), B). \tag{3.15}
\]

In the above theorem, as usual, B = O(q² log q), although the constants may depend on µ and R. An important class of functions for which Theorem 3.3 implies tractability in this sense is the class of all functions of the form
\[
x \mapsto \int \exp(-x\cdot y)\,d\sigma(y), \qquad x\ge 0,
\]
for a signed measure σ on [0, ∞)^q with total variation equal to 1. Every function in this class has an analytic extension to the right half plane with respect to each of the components of x. If σ is a probability measure, the function is completely monotone in each of its variables.
4 Applications
In this section, we discuss certain applications of the theorems in Section 3 to the theory of neural networks and radial basis function networks.
4.1 Neural networks
Let φ : R → R. A neural network with activation function φ, and having n neurons, is a function of the form x ↦ Σ_{j=1}^n c_j φ(x·w_j + b_j), where the output layer weights c_j ∈ R, the synaptic weights w_j ∈ R^q, and the thresholds b_j ∈ R. The theory of approximation by neural networks is quite well developed (cf. [13] for a survey in the context of approximation of classical Sobolev classes). In [2], Barron has studied functions on R^{q−1} which can be expressed in the form
\[
F(x) = \int_{S^{q-1}_{\|\cdot\|_2}\times[-1,1]} \chi([0,\infty); x\cdot y + r)\,G(y,r)\,d\sigma_1(y,r),
\]
where σ_1 is the product of the normalized area measure on S^{q−1}_{‖·‖_2} with the one-dimensional Lebesgue measure, and
\[
\int_{S^{q-1}_{\|\cdot\|_2}\times[-1,1]} |G(y,r)|\,d\sigma_1(y,r) = 1.
\]
For such functions, he proved that for any integer n ≥ 1, there exist a_k, b_k ∈ [−1, 1], y_k ∈ S^{q−1}_{‖·‖_2}, 1 ≤ k ≤ n, such that
\[
\max_{x\in B_{\|\cdot\|_2}}\ \Big|F(x) - \sum_{k=1}^{n} a_k\,\chi([0,\infty); x\cdot y_k + b_k)\Big| \le c\sqrt{\log n / n},
\]
where c is a constant depending only on q. Similar results have been proved by many authors [1, 7, 8, 11, 12, 14]. In particular, Kurkova and Sanguineti [7] have given bounds in a Hilbert space setting that depend polynomially on q. We observe that we may define a function on R^q by the formula
\[
f(x) = \int_{S^{q-1}_{\|\cdot\|_2}\times[-1,1]} \chi([0,\infty); x\cdot y)\,g(y)\,d\sigma_2(y),
\]
where g and σ_2 are just G and σ_1 expressed in a different notation. Then f((x_1, ..., x_{q−1}, 1)) = F((x_1, ..., x_{q−1})).

Motivated by this example, we define the following class of functions. Let φ be a function having bounded variation on R. The class F_N(φ, L, β, M, γ) consists of all functions of the form
\[
x \mapsto \int \phi(x\cdot y)\,d\mu(y),
\]
where µ is a regular, signed measure with parameters (L, β, M, γ).

Theorem 4.1 Suppose φ is a function having bounded variation on R, with the normalizations that φ is right continuous, lim_{x→−∞} φ(x) = 0, and the total variation of φ is 1. Let f ∈ F_N(φ, L, β, M, γ), where the condition (3.14) is satisfied, and B be the constant defined in (5.22). Then for integer n ≥ GB, there exist points y_j ∈ R^q, j = 1, ..., n (depending on f), such that
\[
\sup_{x\in\mathbb{R}^q}\ \Big|f(x) - \frac{1}{n}\sum_{j=1}^{n}\phi(x\cdot y_j)\Big| \le \Delta_n((q+1)(q\beta+1/\gamma), B). \tag{4.1}
\]
We note again that B = O(q² log q). In addition, all the output layer weights in our network are equal to 1/n. The proof of Theorem 4.1 is a simple consequence of Proposition 5.5(b). We are also able to place bounds on the synaptic weights y_j, provided the measure µ in the definition of the target function f, as well as the function φ, are compactly supported, and the approximation is desired on a compact set. This is done using Proposition 5.5(a), but no new ideas are needed. As far as we are aware, Theorem 4.1 is the first of its kind in which the degree of uniform approximation by neural networks on the whole Euclidean space is estimated.
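The form of the approximant in (4.1), a neural network with all output layer weights equal to 1/n, is illustrated by the following sketch. Here the target is a positive measure µ, the synaptic weights y_j are drawn at random from µ (a Monte Carlo stand-in, not the constructive points whose existence the theorem asserts), the logistic function plays the role of a bounded-variation φ with φ(−∞) = 0 and total variation 1, and the Gaussian measure is one example of a regular measure in the sense of Definition 2.1; all of these choices are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
q, n = 6, 4000

def phi(t):
    # The logistic function: continuous, nondecreasing, phi(-inf) = 0, total variation 1.
    return 1.0 / (1.0 + np.exp(-t))

# Target f(x) = integral of phi(x . y) d(mu)(y), with mu = N(0, I_q/q) (illustrative).
y_mc  = rng.standard_normal((50_000, q)) / np.sqrt(q)   # heavy sample, stands in for the integral
y_net = rng.standard_normal((n, q)) / np.sqrt(q)        # the n synaptic weights y_j

x_test = rng.standard_normal((100, q))
f_ref = phi(x_test @ y_mc.T).mean(axis=1)               # reference value of f at the test points
f_net = phi(x_test @ y_net.T).mean(axis=1)              # equal-weight network, all weights 1/n

print("max |f(x) - network(x)| on test points:", np.abs(f_ref - f_net).max())
```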
4.2 Radial basis function networks
Let φ : [0, ∞) → R. A radial basis function (RBF) network with activation function φ and n neurons (and norm ‖·‖) is a function of the form x ↦ Σ_{j=1}^n a_j φ(‖x − y_j‖), where the centers y_j ∈ R^q and the weights a_j ∈ R, 1 ≤ j ≤ n. Approximation by RBF networks has also been very popular in different applications, ranging from pattern recognition to the production of animated cartoons.

In this subsection, the function φ : [0, ∞) → R is a function having bounded variation on [0, ∞), with the normalizations that φ is left continuous, lim_{x→∞} φ(x) = 0, and the total variation of φ is 1. We assume further that φ satisfies the decay condition
\[
\int_{L\delta^{-\beta}}^{\infty} |d\phi(x)| \le \delta, \qquad 0<\delta\le 1. \tag{4.2}
\]
We now consider the class F_R(φ, ‖·‖, L, β, M, γ) consisting of functions of the form
\[
x \mapsto \int \phi(\|x-y\|)\,d\mu(y),
\]
where µ is a regular, signed measure with parameters (L, β, M, γ). We note that there is no loss of generality in assuming that the same L and β are used here as in (4.2). Requiring the two values to be different would only result in more elaborate book-keeping, not in new ideas. This class is analogous to the "native space" for the function φ.

Theorem 4.2 Let L, β, M, γ > 0. We define B_n = B as in (5.15) with R_1 = L(log n/n)^{−β/2}, R = κ_1(1 + κ_2)R_1, and 2^{1/γ}M in place of M. Let n be sufficiently large, so that n ≥ GB_n, log n/n ≤ 1/4, and the condition (3.5) is satisfied with these parameters. Then for f ∈ F_R(φ, ‖·‖, L, β, M, γ), there exist y_j = y_j(f) ∈ [−R_1, R_1]^q such that
\[
\sup_{x\in\mathbb{R}^q}\ \Big|f(x) - \frac{1}{n}\sum_{j=1}^{n}\phi(\|x-y_j\|)\Big| \le \Delta_n(3q\gamma^{-1}, B_n) + 7\sqrt{\frac{\log n}{n}}. \tag{4.3}
\]
We note that B_n = O(q² log(qn)). As far as we are aware, this is the first result of its kind in which uniform approximation bounds are obtained for RBF networks using a norm other than the Euclidean norm on R^q. It appears to be the first result of its kind proving a tractability result for uniform approximation by RBF networks on the entire Euclidean space. It is amusing to note that the weights in our networks are again all equal to 1/n. Our proof can be modified to yield analogous results where the centers y_j are restricted to a compact cell in R^q, and the approximation is also desired on a compact cell. We do not feel that this adds any new ideas.
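The next sketch shows the equal-weight RBF form appearing in (4.3) with a non-Euclidean absolute norm (the ℓ1 norm). As in the previous sketch, the centers y_j are simply sampled from µ as an illustration of the form of the approximant, not as the theorem's construction; the compactly supported activation, the uniform measure, and the sample sizes are our own choices.

```python
import numpy as np

rng = np.random.default_rng(3)
q, n = 4, 2000

def phi(t):
    # Nonincreasing on [0, inf), phi(0) = 1, supported on [0, 1]: total variation 1,
    # so a decay condition of the form (4.2) holds trivially.
    return np.clip(1.0 - t, 0.0, 1.0)

def l1(a, b):
    # Pairwise ||a_i - b_j||_1, an absolute norm other than the Euclidean one.
    return np.abs(a[:, None, :] - b[None, :, :]).sum(axis=-1)

# Target f(x) = integral of phi(||x - y||_1) d(mu)(y), mu = uniform on [-1, 1]^q.
y_mc  = rng.uniform(-1.0, 1.0, (20_000, q))   # stands in for the integral against mu
y_net = rng.uniform(-1.0, 1.0, (n, q))        # the n centers y_j, weights all 1/n

x_test = rng.uniform(-1.5, 1.5, (100, q))
f_ref = phi(l1(x_test, y_mc)).mean(axis=1)
f_net = phi(l1(x_test, y_net)).mean(axis=1)

print("max |f(x) - RBF network(x)|:", np.abs(f_ref - f_net).max())
```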
5 Proofs
For clarity of presentation, we postpone the proof of Proposition 2.1 until the end of this section. The results in Section 4 are simple applications of those in Section 3. Our strategy for proving the theorems in Section 3 is as follows. In light of Proposition 2.2, it is enough to estimate E_n(F, µ) when F is the set of characteristic functions of the sets involved in each theorem. We will use the geometrical properties of the sets and the notion of one-sided entropy ("entropy with brackets" in the terminology of the book [21] of van der Vaart and Wellner) to obtain a finite set Y of characteristic functions such that E_n(F, µ) can be estimated using E_n(Y, µ). This process is codified in Theorem 5.1 below. The problem of estimating E_n(F, µ) thus reduces to estimating the one-sided entropy of F. The details of this estimation depend heavily on the geometrical properties of the sets, and we had to present them in the form of different propositions, in spite of a common theme behind all these estimations. Before proving the other theorems, we prove Proposition 2.2.
Proof of Proposition 2.2. It is clear that F ⊆ conv(F), so that E_n(conv(F), µ) ≥ E_n(F, µ). In the proof of the reverse inequality, we note that E_n(F, µ) < ∞. Let ε > 0 be arbitrary and x_j be chosen so that
\[
\sup_{g\in\mathcal{F}}\ \Big|\int g(x)\,d\mu(x) - \frac{1}{n}\sum_{k=1}^{n} g(x_k)\Big| \le E_n(\mathcal{F},\mu) + \epsilon. \tag{5.1}
\]
Let f ∈ conv(F). There exist an F-valued process Φ and a signed measure σ of total variation 1 such that
\[
f(x) = \int \Phi(x,t)\,d\sigma(t), \qquad x\in Q.
\]
Using Fubini's theorem, we see that
\[
\int f(x)\,d\mu(x) - \frac{1}{n}\sum_{k=1}^{n} f(x_k) = \int\Big\{\int \Phi(x,t)\,d\mu(x) - \frac{1}{n}\sum_{k=1}^{n}\Phi(x_k,t)\Big\}\,d\sigma(t).
\]
Since Φ(·, t) ∈ F for each t, we conclude from (5.1) that
\[
\sup_{f\in\mathrm{conv}(\mathcal{F})}\ \Big|\int f(x)\,d\mu(x) - \frac{1}{n}\sum_{k=1}^{n} f(x_k)\Big| \le E_n(\mathcal{F},\mu) + \epsilon.
\]
Since ε is arbitrary, this completes the proof. □
We now begin the proofs of the theorems in Sections 3 and 4. Towards this end, we define the notion of one-sided entropy, and prove a general estimate for quantities of the form E_n(F, µ) in terms of this one-sided entropy.

Let Q be a measure space, µ be a probability measure defined on Q, F be a class of µ-integrable functions on Q, and δ > 0. A finite set Y of µ-integrable functions on Q is said to be a one-sided (µ, δ)-cover of F if for every f ∈ F, there exist g, h ∈ Y with g ≤ f ≤ h everywhere on Q, and ∫_Q (h − g) dµ ≤ δ. We observe that Y need not be a subset of F. If N(F, µ, δ) is the number of elements in a minimal one-sided (µ, δ)-cover of F, then we define the one-sided entropy H(F, µ, δ) to be the quantity log N(F, µ, δ), where we find it convenient to take the natural logarithm.

The starting point of our investigations is the following observation. It is probably known in the statistical literature, but we find it easier to prove than to find a reference.

Theorem 5.1 Let (Q, µ) be a probability space, and F be a set of real valued, µ-integrable functions on Q, such that |f(x)| ≤ 1 for all f ∈ F and x ∈ Q, and the one-sided entropy H(F, µ, ·) satisfies
\[
H(\mathcal{F},\mu,\delta) \le \log A - \kappa\log\delta, \qquad 0<\delta\le 1, \tag{5.2}
\]
for some positive constants A and κ depending on F, Q, and µ. Let B := log(2A). Then for any integer n ≥ GB, there exists a set T ⊆ Q, consisting of n points, such that
\[
\Big|\int f\,d\mu - \frac{1}{n}\sum_{t\in T} f(t)\Big| \le \Delta_n(\kappa, B) = 2\sqrt{\frac{G}{n}\Big\{B + \frac{\kappa}{2}\log\frac{n}{GB}\Big\}}, \tag{5.3}
\]
where G is the constant defined in (3.1).

The proof of Theorem 5.1 mimics an argument in [4]. The main ingredient is the following sharper version of Hoeffding's inequality (cf. [19, p. 191]). It is proved in [5], but not stated in this way.

Proposition 5.1 Let (Q, µ) be a probability space, n ≥ 1 be an integer, and {X_k}, k = 1, ..., n, be independent random variables on Q, each with range contained in a compact interval [a, b] and expectation equal to m. Then for any ε ∈ (0, (b − a)/2],
\[
\mathrm{Prob}\Big(\Big|\frac{1}{n}\sum_{k=1}^{n} X_k - m\Big| \ge \epsilon\Big) \le 2\exp\Big(-\frac{4n\epsilon^2}{G(b-a)^2}\Big). \tag{5.4}
\]

Proof. Let Z_j := (X_j − a)/(b − a), 1 ≤ j ≤ n. Then for 1 ≤ j ≤ n, we have 0 ≤ Z_j ≤ 1, the expected value of Z_j is (m − a)/(b − a), and the variance of Z_j can be estimated by
\[
\int Z_j^2\,d\mu - \Big(\int Z_j\,d\mu\Big)^2 \le \int Z_j\,d\mu - \Big(\int Z_j\,d\mu\Big)^2 \le 1/4.
\]
Following [5], we now recall the Bennett inequality [19, p. 192]. According to this inequality, if Y_j are independent random variables, each with mean 0, range in [−M, M], and variance σ_j², and V ≥ Σ_{j=1}^n σ_j², then for η > 0,
\[
\mathrm{Prob}\Big(\Big|\sum_{j=1}^{n} Y_j\Big| \ge \eta\Big) \le 2\exp\Big(-\frac{V}{M^2}\,g(M\eta/V)\Big), \tag{5.5}
\]
where, in this proof only, g(t) := (1 + t) log(1 + t) − t. We apply this estimate with Y_j = Z_j − (m − a)/(b − a). Then we may choose M = 1, V = n/4, and η = nε/(b − a). This leads to
\[
\mathrm{Prob}\Big(\Big|\frac{1}{n}\sum_{k=1}^{n} X_k - m\Big| \ge \epsilon\Big) = \mathrm{Prob}\Big(\Big|\sum_{k=1}^{n} Z_k - n(m-a)/(b-a)\Big| \ge \eta\Big) \le 2\exp\Big(-\frac{n}{4}\,g(4\eta/n)\Big). \tag{5.6}
\]
Using elementary calculus, one verifies (cf. [5]) that g(t) ≥ (3 log 3 − 2)t²/4 = t²/G if t ∈ [0, 2]. Hence, if 0 ≤ η ≤ n/2, i.e., ε ≤ (b − a)/2, then
\[
\mathrm{Prob}\Big(\Big|\frac{1}{n}\sum_{k=1}^{n} X_k - m\Big| \ge \epsilon\Big) \le 2\exp\Big(-\frac{4}{nG}\,\eta^2\Big) = 2\exp\Big(-\frac{4n\epsilon^2}{G(b-a)^2}\Big).
\]
This completes the proof. □
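Before turning to the proof of Theorem 5.1, we note that the tail bound (5.4) is easy to check numerically. The sketch below compares the bound with the observed tail frequency for i.i.d. variables uniform on [0, 1]; the parameters are chosen only so that both quantities are visible, and the comparison is purely a sanity check, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(4)
G = 4.0 / (3.0 * np.log(3.0) - 2.0)

n, trials = 200, 50_000
eps = 0.08                                   # any eps in (0, (b - a)/2] with a, b = 0, 1
X = rng.random((trials, n))                  # X_k uniform on [0, 1], so m = 1/2 and b - a = 1
deviation = np.abs(X.mean(axis=1) - 0.5)

empirical = np.mean(deviation >= eps)
bound = 2.0 * np.exp(-4.0 * n * eps**2 / G)  # right-hand side of (5.4)
print(f"empirical tail {empirical:.5f}  <=  bound {bound:.5f}")
```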
Proof of Theorem 5.1. If n ≤ G{B + (κ/2) log(n/(GB))}, then (5.3) is trivial. Therefore, in the remainder of this proof, we will assume that n > G{B + (κ/2) log(n/(GB))}, and write δ := ∆_n(κ, B)/2. Our assumption that n ≥ GB implies that n > GB and δ ∈ (0, 1). Let Y be a minimal one-sided (µ, δ)-cover for F. By replacing each g ∈ Y by the function
\[
g_1(x) := \begin{cases} g(x), & \text{if } |g(x)|\le 1,\\ 1, & \text{if } g(x)\ge 1,\\ -1, & \text{if } g(x)\le -1,\end{cases}
\]
we may assume without loss of generality that the functions g ∈ Y satisfy |g| ≤ 1 as well. Now, let f ∈ F. Then there exist g_1, g_2 ∈ Y such that g_1 ≤ f ≤ g_2 and ∫(g_2 − g_1)dµ ≤ δ. Then for any measure ν on Q, ∫g_1 dν ≤ ∫f dν ≤ ∫g_2 dν, and
\[
\int g_2\,d\mu - \int g_2\,d\nu - \delta \le \int f\,d\mu - \int f\,d\nu \le \int g_1\,d\mu - \int g_1\,d\nu + \delta.
\]
Consequently,
\[
\sup_{f\in\mathcal{F}}\ \Big|\int f\,d\mu - \int f\,d\nu\Big| \le \max_{g\in Y}\ \Big|\int g\,d\mu - \int g\,d\nu\Big| + \delta. \tag{5.7}
\]
Now, let g ∈ Y. Following [4], we take a random sample ξ_k from Q, distributed according to µ, and consider the random variable X_k = g(ξ_k). Then the expected value of X_k is ∫g dµ and |X_k| ≤ 1. Since δ ∈ (0, 1), Proposition 5.1 implies that
\[
\mathrm{Prob}\Big(\Big|\frac{1}{n}\sum_{k=1}^{n} g(\xi_k) - \int g\,d\mu\Big| \ge \delta\Big) \le 2\exp(-n\delta^2/G).
\]
Hence,
\[
\mathrm{Prob}\Big(\max_{g\in Y}\Big|\frac{1}{n}\sum_{k=1}^{n} g(\xi_k) - \int g\,d\mu\Big| \ge \delta\Big) \le 2|Y|\exp(-n\delta^2/G) = \exp(\log 2 + H(\mathcal{F},\mu,\delta) - n\delta^2/G)
\le \exp(\log(2A) - \kappa\log\delta - n\delta^2/G) = \exp(B - \kappa\log\delta - n\delta^2/G) = \exp\Big(-\frac{\kappa}{2}\log\Big(1 + \frac{\kappa}{2B}\log\frac{n}{GB}\Big)\Big) < 1. \tag{5.8}
\]
Therefore, there exist points ξ_k such that
\[
\max_{g\in Y}\ \Big|\frac{1}{n}\sum_{k=1}^{n} g(\xi_k) - \int g\,d\mu\Big| \le \sqrt{\frac{G}{n}\Big\{B + \frac{\kappa}{2}\log\frac{n}{GB}\Big\}} = \delta.
\]
Along with (5.7) (with ν being the measure that associates the mass 1/n with each ξ_k), this proves (5.3) (with T = {ξ_k}). □

We now begin the program of estimating the one-sided entropies of the different sets of characteristic functions described in Section 2.3. The following simple estimate will be used often in this process. In the sequel, µ will denote a probability measure on R^q.
Lemma 5.1 If η > 0, 0 < R < ∞, and max1≤j≤q max(|xj + η|, |xj |) ≤ R, then q q Y Y xj ≤ qηRq−1 . (xj + η) − j=1
(5.9)
j=1
Proof. The estimate (5.9) follows immediately from the identity q q q q q k−1 k Y Y X Y Y Y Y (xj + η) − xj = (xj + η) xj − (xj + η) xj j=1
j=1
k=1
j=k
q
q
= η
X Y
j=1
(xj + η)
k=1 j=k+1
k−1 Y
j=k+1
!
j=1
xj .
j=1
□

In order to prove Theorem 3.1, we first prove two propositions, Proposition 5.2 and Proposition 5.3.

Proposition 5.2 (a) Let 0 < R, R_1 < ∞, and µ be a measure satisfying a continuity condition with parameters (M, γ). Suppose that (3.3) is satisfied. With
\[
B = \log\Big\{2\,(2qM)^{2q}\Big(\frac{2R}{R_1}\Big)^{q}(4R_1)^{2q^2}\Big\}, \tag{5.10}
\]
we have for n ≥ GB,
(5.11)
(b) Let µ be a regular measure with parameters (L, β, M, γ), where qM(22+β L)q ≥ 2. With B = (2q 2 (β + 2) + (3 + 2/γ)q + 2) log 2 + 2q log(qMLq ),
(5.12)
we have, for n ≥ GB, En (R(∞, ∞), µ) ≤ ∆n (2q(βq + 1/γ), B).
(5.13)
Proof. First, we prove part (a). Let 0 < δ ≤ 1. In view of (3.3), there exists an integer m ≥ 3 in the interval [6qM(4R1 )q−1 Rδ −1/γ , 8qM(4R1 )q−1 Rδ −1/γ ]. We divide the cube [−R, R]q into mq congruent subcubes, and let (in this proof only) C denote the set of centers of these subcubes. Next, let m1 = R1 m/R. The condition (3.3) ensures that m1 ≥ 4. For z ∈ C and multi-integer k ≥ 1, let gz,k denote the characteristic function of the cell B(k · k∞ , z, kR1 /m1 ). If any component of k is not positive, we define gz,k = 0. The set consisting of I and the functions gz,k , z ∈ C, 0 ≤ k ≤ m1 + 2 (k ∈ Zq ) will be denoted by Yδ (R, R1 ). Now, if y ∈ [−R, R]q and r ∈ [0, R1 ]q , then there exist z ∈ C and multi-integer k with 0 ≤ k ≤ m1 , such that ky − zk∞ ≤ R/m, and 19
kR1 /m1 ≤ r ≤ (k + 1)R1 /m1 . Denoting the characteristic function of B(k · k∞ , y, r) by f , it is easy to verify that gz,k−1 ≤ f ≤ gz,k+2 . Further, Z (gz,k+2 − gz,k−1 )dλq # q "Y q q Y 2R1 ≤ (kj + 2) − max(0, kj − 1) m1 j=1 j=1 q 2R1 ≤ 3q (m1 + 2)q−1 ≤ 3q2q−1 (2R1 )q /m1 m1 = 3q2q (2R1 )q−1 R/m = 6qR(4R1 )q−1 /m ≤ δ 1/γ /M. Therefore, the continuity condition on µ implies that Z (gz,k+2 − gz,k−1 )dµ ≤ δ. Thus, the set Yδ (R, R1 ) is a one-sided (µ, δ)-cover of R(R, R1 ). Therefore, exp(H(R(R, R1 ), µ, δ)) ≤ |Yδ (R, R1 )| ≤ mq (m1 + 3)q + 1 ≤ mq (m1 + 3 + 1/m)q ≤ mq (m1 + 3 + 1/3)q ≤ 2q m2q (R1 /R)q 2q ≤ 2q (R1 /R)q 8qMR(4R1 )q−1 δ −1/γ q 2R 2 2q = (2qM) (4R1 )2q δ −2q/γ . (5.14) R1 In view of Theorem 5.1, this leads to (5.11). To prove part (b), we let h be the characteristic function of Rq \[−L(δ/2)−β , L(δ/2)−β ], R = R1 = L(δ/2)−β , and Y = Yδ/2 (R, R) ∪ {g + h : g ∈ Yδ/2 (R, R)}. Now, for any y ∈ Rq and r ≥ 0, B(k · k∞ , y, r) ∩ [−R, R]q is either empty or equal to B(k · k∞ , x, r1 ) for some x ∈ [−R, R]q and r1 ∈ [0, R]q . Thus, any f ∈ R(∞, ∞) can be expressed in the form f = f1 + f2 , where f1 ∈ R(R, R) ∪ {1 R − I}, and 0 ≤ f2 ≤ h. We may find g1 , g2 ∈ Yδ/2 (R, R) such that g1 ≤ f1 ≤ g2 and (g2 − g1 )dµ ≤ δ/2. Hence, g1 ≤ f ≤ g2 + h, and the decay condition for µ implies that Z (g2 + h − g1 )dµ ≤ δ. Thus, Y is a one-sided (µ, δ)-cover for R(∞, ∞). The cardinality of Y is at most twice that of Yδ/2 (R, R). We substitute the values of R = R1 in (5.14), and use δ/2 in place of δ to deduce that H(R(∞, ∞), µ, δ) ≤ (2(β + 2)q 2 + (3 + 2/γ)q + 1 log 2 + 2q log(qMLq ) − (2q(qβ + 1/γ)) log δ. This estimate and Theorem 5.1 leads to (5.13). 20
2
Proposition 5.3 Let µ be a probability measure satisfying a continuity condition with parameters (M, γ), 0 < R, R_1 < ∞, and (3.5) be satisfied. With
\[
B = \log\Big\{2\big[(4Mq\tau_{q,\|\cdot\|})^3\,R\,(1+\kappa_2+\kappa_2 R_1)^2\,(2R_1)^{3q-2}\big]^{q}\Big\}, \tag{5.15}
\]
we have for integer n ≥ GB, E_n(B(‖·‖, R, R_1), µ) ≤ ∆_n(3qγ^{−1}, B).
(5.16)
The next lemma supplies a detail required in the proof of this proposition. Lemma 5.2 Let r ∈ [0, R1 ]q , 0 < ≤ 1, kzk∞ ≤ 2 , and r0 ≥ r + (1 + κ2 + κ2 R1 ). Then B(k · k, y, r) ⊆ B(k · k, y + z, r0 ). Proof. Since k · k is monotone, B(k · k, y, r) ⊆ B(k · k, y, r + ). Further,
x − y x − y − z kzk
r + − r + ≤ ≤ κ2 . So, x ∈ B(k · k, y, r + ) implies that x ∈ B(k · k, y + z, (r + )(1 + κ2 )). Since (r + )(1 + κ2 ) ≤ r + + κ2 (R1 + ) ≤ r + (1 + κ2 + κ2 R1 ) ≤ r0 , the monotonicity of k · k implies that x ∈ B(k · k, y + z, r0 ).
2
Proof of Proposition 5.3. In this proof, we will denote τq,k·k by τq . We will estimate the one-sided√entropy H(B(k · k, R, R1 ), µ, δ), and use Theorem 5.1. Let δ ∈ (0, 1], and m ≥ max(3, R) be an integer in the range √ √ [3M qτq (2R1 )q−1 (1 + κ2 + κ2 R1 ) Rδ −1/γ , 4Mqτq (2R1 )q−1 (1 + κ2 + κ2 R1 ) Rδ −1/γ ]. (The condition (3.5) ensures that such an integer exists for every δ ∈ (0, 1].) We divide [−R, R]q into m2q congruent subcubes, and let (in this proof only) C denote the set of centers of these subcubes. Let R1 m √ . m1 = (1 + κ2 + κ2 R1 ) R Again, the condition (3.5) implies that m1 ≥ 4. For z ∈ C and multi-integer k with 1 ≤ k ≤ m1 +2, let gz,k denote the characteristic function of the ellipse B(k·k, z, kR1 /m1 ). If some component of a multi-integer k is not positive, we define gz,k = 0. The set consisting of I and the functions gz,k , 0 ≤ k ≤ m1 + 2 (k ∈ Zq ) will be denoted by Y . Let y ∈ [−R, R]q , r ∈ [0, R1 ]q , and f be the characteristic function of B(k · k, y, r). Then there exists z ∈ C and multi-integer k with 0 ≤ k ≤ m1 such that ky − zk∞ ≤ R/m2 and kR1 /m1 ≤ r ≤ (k + 1)R1 /m1 . Using Lemma 5.2, it is easy to verify that gz,k−1 ≤ f ≤ gz,k+2 . We observe that Z (gz,k+2 − gz,k−1 )dλq ! q Y q q Y R1 ≤ τq (kj + 2) − max(0, kj − 1) m1 j=1 j=1 q R1 (m1 + 2)q−1 ≤ 3qτq m1 √ ≤ 3qτq (2R1 )q−1 (1 + κ2 + κ2 R1 ) R/m ≤ δ 1/γ /M. 21
Since gz,k+2 − gz,k−1 is the characteristic function of a Borel measurable set, the continuity condition on µ implies that Z (gz,k+2 − gz,k−1 )dµ ≤ δ. Thus, the set Y is a one-sided (µ, δ)-cover for B(k · k, R, R1 ). The cardinality of Y is at most q √ m2q (m1 + 3)q + 1 ≤ m2q (2m1 )q = 2R1 ( R(1 + κ2 + κ2 R1 ))−1 m3q . √ Recalling that m ≤ 4qMτq (2R1 )q−1 (1 + κ2 + κ2 R1 ) Rδ −1/γ , the above estimate leads to H(B(k · k, R, R1 ), µ, δ) ≤ log (4Mqτq )3 R(1 + κ2 + κ2 R1 )2 (2R1 )3q−2
q
−
3q log δ. γ
Along with Theorem 5.1, this leads to (5.16).
2
Proof of Theorem 3.1. We recall (2.26). Parts (a) and (c) follow from Proposition 5.2, parts (a) and (b) respectively. Part (b) follows from Proposition 5.3. □

The proof of Theorem 3.2 requires the following proposition.

Proposition 5.4 Let µ be a probability measure on S^{q−1}_{‖·‖} satisfying the spherical continuity condition (3.9), where we assume further that M_1 ≥ 1. Let
\[
B = \log\Big(\frac{8q}{\kappa_1\kappa_2}\,(7\kappa_1\kappa_2 M_1)^q\Big). \tag{5.17}
\]
Then for integer n ≥ GB, En (Kk·k , µ) ≤ ∆n (q/γ1 , B).
(5.18)
q−1 , Proof. It is easy to verify that for x, y ∈ Sk·k ∞
kx − yk∞
x y 2
≤ 2κ1 κ2
kxk − kyk ≤ (2κ1 κ2 ) kx − yk∞ .
(5.19)
Let δ ∈ (0, 1]. Since M1 ≥ 1, we may find an integer m ≥ 6κ1 κ2 in the interval [6κ1 κ2 M1 δ −1/γ1 , 7κ1 κ2 M1 δ −1/γ1 ]. q−1 into mq−1 congruent subcubes, and let C be the set We divide each of the 2q faces of Sk·k ∞ q−1 . For integer `, let r` = 2κ1 κ2 `/m. of projections of the centers of these subcubes on Sk·k q−1 Let gz,` denote the characteristic function of the cap Sk·k,r` (z), z ∈ C, 1 ≤ ` ≤ m/(κ1 κ2 )+2, (` integer), and gz,` = 0, if ` ≤ 0. Let Y be the set of functions gz,` , z ∈ C, 0 ≤ ` ≤ m/(κ1 κ2 ) + 2. Now, let f be the characteristic function of Sqk·k,r (y). In view of (5.19), there exist z ∈ C and integer k with 0 ≤ k ≤ m/(κ1 κ2 ) such that ky − zk ≤ 2κ1 κ2 /m and rk ≤ r ≤ rk+1 . It is easy to verify R that gz,rk−1 ≤ f ≤ gz,rk+2 . Our choice of m and the continuity condition (3.9) lead to (gz,rk+2 − gz,rk−1 )dµ ≤ δ.
22
Thus, Y is a one-sided (µ, δ)-cover of Kk·k . The cardinality of Y does not exceed 2qmq−1
4q q 4q m + 3κ1 κ2 ≤ m ≤ (7κ1 κ2 M1 )q δ −q/γ1 . κ1 κ2 κ1 κ2 κ1 κ2
Therefore, H(Kk·k , µ, δ) ≤ log
4q (7κ1 κ2 M1 )q κ1 κ2
− (q/γ1 ) log δ,
Along with Theorem 5.1, this leads to (5.18).
2
Proof of Theorem 3.2. The theorem follows immediately from (2.26) and Proposition 5.4. □

The proof of Theorem 3.3 will follow from the following proposition.

Proposition 5.5 (a) Let µ be a probability measure satisfying a continuity condition with parameters (M, γ). Let R > 0 and 2Mτ_{q−1,‖·‖_{q−1,2}} R^q ≥ 1. Let
\[
B = \log\Big\{16\big(14\sqrt{q}\,\tau_{q-1,\|\cdot\|_{q-1,2}}\,M R^q\big)^{q+1}\Big\}. \tag{5.20}
\]
Then for integer n ≥ GB, E_n(S(R), µ) ≤ ∆_n((q + 1)/γ, B).
(5.21)
(b) Let µ be a regular measure with parameters (L, β, M, γ) satisfying (3.14). Let n o (q+1)/2 q qβ+1/γ q+1 . (5.22) L2 B = log 32 14τq−1,k·kq−1,2 Mq Then for integer n ≥ GB, En (S(∞), µ) ≤ ∆n ((q + 1)(qβ + 1/γ), B).
(5.23)
In order to prove this proposition, we first prove a simple lemma, estimating the volume of intersections of strips and spheres. q−1 , R > 0, −R ≤ a < b ≤ R. Then Lemma 5.3 Let y ∈ Sk·k 2
λq (S(y, a, b) ∩ B(k · k2 , 0, R)) ≤ τq−1,k·kq−1,2 Rq−1 (b − a).
(5.24)
Proof. Since λq is rotation-invariant, we may assume that y = (0, · · · , 0, 1). Let C(R, a, b) be the right cylinder with cross sections congruent to B(k · kq−1,2 , 0, R), base in the plane xq+1 = a and top in the plane xq+1 = b. Then S(y, a, b) ∩ B(k · kq,2 , 0, R) ⊆ C(R, a, b). The estimate (5.24) is now clear. 2 q−1 by Sq−1 and τq−1,k·kq−1,2 Proof of Proposition 5.5. In this proof, we will denote Sk·k 2 by τq−1 . Let δ ∈ (0, 1]. In view of the condition 2Mτq−1 Rq ≥ 1, there exists an integer √ m ≥ 6 q in the interval
√ √ [12 qτq−1 MRq δ −1/γ , 14 qτq−1 MRq δ −1/γ ]. 23
As in the proof of Proposition 5.4, we find a set C consisting of 2qmq−1 points on Sq−1 √ such that for any y ∈ Sq−1 , there exists z ∈ C with ky − zk2 ≤ 2 q/m. Let rk = −R + √ √ 2kR q/m, −2 ≤ k ≤ 2+m/ q (k integer). Let gz,`,k denote the characteristic function of S(z, r`, rk ) ∩ B(k · k2, 0, R), and Yδ (R) be the set consisting of I, 1 − I, and these functions. Now, let f be the characteristic function of S(y, a, b) ∩ B(k · k2 , 0, R) for some y ∈ Sq−1 , √ [a, b] ⊆ R. We may assume that [a, b] ⊆ [−R, R]. We find a z ∈ C with ky−zk2 ≤ 2 q/m, √ and integers `, k, 0 ≤ `, k ≤ m/ q such that [r`+1 , rk−1 ] ⊆ [a, b] ⊆ [r` , rk ]. It is easy to verify that gz,`+2,k−2 ≤ f ≤ gz,`−1,k+1 . In view of Lemma 5.3, we verify that Z (gz,`−1,k+1 − gz,`+2,k−2 )dλq Z ≤ (gz,`−1,`+2 + gz,k−2,k+1 )dλq √ 12 qτq−1 Rq . ≤ m The continuity condition on µ and our choice of m now lead to the estimate Z (gz,`−1,k+1 − gz,`+2,k−2 )dµ ≤ δ. Thus, Yδ (R) is a one-sided (µ, δ)-cover of S(R). Its cardinality does not exceed √ |Yδ (R)| ≤ 2qmq−1 (5 + m/ q)2 + 2 ≤ 8mq+1 √ ≤ 8 (14 qτq−1 MRq )q+1 δ −(q+1)/γ .
(5.25)
Theorem 5.1 now leads to (5.21). To prove part (b), we let R = L(δ/2)−β , h be the characteristic function of Rq \ [−R, R]q , and √ √ Y := Yδ/2 ( qR) ∪ {g + h : g ∈ Yδ/2 ( qR)}. As in the proof of Proposition 5.2(b), Y is a one-sided (µ, δ)-cover of S(∞), and |Y | ≤ √ √ 2|Yδ/2 ( qR)|. The estimate (5.25) with δ/2 in place of δ and qL(δ/2)−β in place of R then leads to an estimate on H(S(∞), µ, δ), which, along with Theorem 5.1 implies (5.23). 2 Proof of Theorem 3.3. The theorem follows immediately from (2.26) and Propositon 5.5. 2 Proof of Theorem 4.1. There is no loss of generality in assuming that φ is nondecreasing. Let Z f (x) = φ(x · y)dµ(y), x ∈ Rq , q R for a regular measure µ with parameters (L, β, M, γ) satisfying (3.14). Again, without loss of generality, we may assume that µ is a positive measure. In this proof, we will write 24
k · k in place of k · k2 . We observe that f (0) = φ(0). Let x ∈ Rq , x 6= 0, and X := x/kxk. We note that for y ∈ Rq , Z Z φ(x · y) = χ((−∞, x · y]; u)dφ(u) = χ((−∞, X · y]; u/kxk)dφ(u). R R Using Fubini’s theorem, we obtain the representation Z Z Z f (x) = φ(x · y)dµ(y) = χ(S(X, u/kxk, ∞); y)dµ(y)dφ(u). Rq R Rq
(5.26)
In view of Proposition 5.5 (and its proof via Theorem 5.1), for n ≥ GB, there exist points yj ∈ Rq such that Z n X 1 χ(S(X, u/kxk, ∞); y)dµ(y) − χ(S(X, u/kxk, ∞); yj ) ≤ ∆n (κ, B), (5.27) Rq n j=1 where, in this proof only, κ = (q + 1)(qβ + 1/γ). Now, we observe again that Z Z χ(S(X, u/kxk, ∞); yj )dφ(u) = χ((−∞, x · yj ]; u)dφ(u) = φ(x · yj ). R R Consequently, (5.26) and (5.27) lead to the estimate (4.1).
2
Proof of Theorem 4.2. Without loss of generality, we assume that φ is nonincreasing. Let Z f (x) = φ(kx − yk)dµ(y) Rq for a regular measure µ with parameters (L, β, M, γ), where we may assume without loss of generality that µ is a positive measure. Let x ∈ Rq . Writing dν(u) = −dφ(u), we note that Z ∞Z Z Z ∞ χ([kx − yk, ∞); u)dφ(u)dµ(y) = χ(B(k · k, x, u); y)dµ(y)dν(u). f (x) = − 0 Rq 0 Rq (5.28) Using (4.2) and the decay condition on µ, we derive that Z R1 Z f (x) − χ(B(k · k, x, u); y)dµ(y)dν(u) 0 [−R1 ,R1 ]q Z p dµ(y)dν(u) ≤ log n/n ≤ 1/2. (5.29) ≤ (u,y)∈[0,∞]×Rq \[0,R1 ]×[−R1 ,R1 ]q R First, we consider the case when kxk∞ ≤ R. Writing I = [−R1 ,R1 ]q dµ(y) and I1 = R R1 dν(u), we see that for any measurable function g : [0, ∞) → [−1, 1], 0 Z
∞ 0
1 g(u)dν(u) − I1
Z
R1 0
p g(u)dν(u) ≤ 3 log n/n,
25
(5.30)
and Z
R1 0
Z
χ(B(k · k, x, u); y)dµ(y)dφ(u) [−R1 ,R1 ]q
1 − II1
Z
p ≤ 3 log n/n.
R1
Z
0
[−R1 ,R1 ]
χ(B(k · k, x, u); y)dµ(y)dν(u) q
Therefore, Z R1 Z p 1 f (x) − ≤ 4 log n/n. χ(B(k · k, x, u); y)dµ(y)dν(u) II1 0 [−R1 ,R1 ]q
(5.31)
The measure 1I dµ(y), supported on [−R1 , R1 ]q satisfies the continuity condition with parameters (21/γ M, γ). Since condition (3.5) is satisfied, we may apply Proposition 5.3 (and its proof via Theorem 5.1) to obtain points yj ∈ [−R1 , R1 ]q such that for u ∈ [0, R1 ], Z n 1 1X χ(B(k · k, x, u); y)dµ(y) − χ(B(k · k, x, u); yj ) ≤ ∆n (κ, B), I [−R1 ,R1 ]q n j=1
where, in this proof only, κ = 3qγ −1 . Consequently, Z R1 Z 1 χ(B(k · k, x, u); y)dµ(y)dν(u) II1 0 [−R1 ,R1 ]q n Z 1 X R1 − χ(B(k · k, x, u); yj )dν(u) nI1 0 j=1
≤ ∆n (κ, B). In view of (5.30), (5.31), this leads to r n 1X log n . φ(kx − yj k) ≤ ∆n (κ, B) + 7 f (x) − n n j=1
This proves (4.3) in the case when kxk∞ ≤ R. Next, let kxk∞ > R. Then kxk > (1 + κ2 )R1 , and for (u, y) ∈ [0, R1 ] × [−R1 , R1 ]q , we obtain kx − yk ≥ kxk − kyk > (1 + κ2 )R1 − κ2 kyk∞ ≥ R1 ≥ u; i.e., χ(B(k · k, x, u); y) = 0. Therefore, for all y ∈ [−R1 , R1 ]q , Z ∞ |φ(kx − yk)| = χ(B(k · k, x, u); y)dν(u) 0 r Z ∞ log n ≤ , dν(u) ≤ n R1 26
and in particular,
r n 1 X log n φ(kx − yj k) ≤ . n n j=1 p Moreover, (5.29) implies that |f (x)| ≤ log n/n. Hence, r n X 1 log n . φ(kx − yj k) ≤ 2 f (x) − n j=1 n 2 Finally, we prove the remaining assertion of this paper, Proposition 2.1. Proof of Proposition 2.1. In this proof only, we will denote the surface area of the Euclidean unit sphere embedded in Rq by ωq−1 :=
2π q/2 . Γ(q/2)
(5.32)
Passing to spherical coordinates, we see that Z Z ∞ α −1 α λexp,α = exp(−kxk2 )dx = ωq−1 rq−1 e−r dr 0 Rq Z ωq−1 Γ(q/α) ωq−1 ∞ q/α−1 −t . t e dt = = α 0 α
(5.33)
Since exp(−kxkα2 ) ≤ 1, the assertion about the continuity condition is now clear. To prove the assertion regarding the decay condition, let (in this proof only) s := (q − α)/α, and s! := Γ(q/α). Passing to the spherical coordinates again, a little calculation as above leads to Z 1 ∞ s −t q q µexp (α; R \ [−R, R] ) ≤ t e dt, R > 0. (5.34) s! Rα Now, an integration by parts shows that if R ≥ 1, Z ∞ Z ∞ Z |s| ∞ s −t s −t αs α s−1 −t α|s| α t e dt = R exp(−R ) + s t e dt ≤ R exp(−R ) + α t e dt. R Rα Rα Rα It follows that if Rα > max(2|s|, 1), then 1 µexp (α; R \ [−R, R] ) ≤ s! q
q
Z
∞
ts e−t dt ≤ Rα
2 α|s| R exp(−Rα ). s!
(5.35)
Now, let δ > 0, and in this proof only, let = (s!/2)δ. Using elementary calculus, we verify that x − a log x ≥ a log(e/a) for every x > 0 and a > 0. Therefore, choosing |s| + 1/(αβ) log A := (|s| + 1/(αβ)) log , x = A−αβ , a = |s| + 1/αβ, αβ e we conclude that (A−αβ )|s| exp(−A−αβ ) ≤ . Therefore, with Rα ≥ max(1, 2|s|, A−αβ ), we see from (5.35) that µexp (α; Rq \ [−R, R]q ) ≤ δ. This completes the proof of part (a). 27
For the proof of part (b), we recall the identity (cf. [3, Chapter V, Example 2.12]) Z ∞ t−1 π x = , 0 < t < 1. 1+x sin πt 0 The remainder of the proof of part (b) using spherical coordinates is very elementary, and is omitted. 2
References

[1] A. R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Trans. Inform. Theory 39 (1993), 930–945.
[2] A. R. Barron, Neural net approximation, in "Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems" (K. S. Narendra, ed.), Yale University Press, New Haven, Connecticut, 1992, pp. 69–72.
[3] J. B. Conway, "Functions of One Complex Variable", Springer Verlag, New York, 1973.
[4] S. Heinrich, E. Novak, G. W. Wasilkowski, and H. Woźniakowski, The inverse of the star-discrepancy depends linearly on the dimension, Acta Arithmetica XCVI.3 (2001), 279–302.
[5] F. J. Hickernell, I. H. Sloan, and G. W. Wasilkowski, On tractability of weighted integration over bounded and unbounded regions in R^s, accepted for publication in Mathematics of Computation.
[6] R. A. Horn and C. R. Johnson, "Matrix Analysis", Cambridge University Press, Cambridge, 1999.
[7] V. Kurkova and M. Sanguineti, Bounds on rates of variable basis and neural network approximation, IEEE Trans. Inform. Theory 47 (2001), 2659–2665.
[8] V. Kurkova and M. Sanguineti, Comparison of worst case errors in linear and neural network approximation, IEEE Trans. Inform. Theory 48 (2002), 264–275.
[9] Y. Li and G. W. Wasilkowski, Worst case complexity of weighted approximation and integration over R^d, J. of Complexity 18 (2002), 330–345.
[10] G. G. Lorentz, M. v. Golitschek, and Y. Makovoz, "Constructive Approximation, Advanced Problems", Springer Verlag, New York, 1996.
[11] Y. Makovoz, Uniform approximation by neural networks, J. Approx. Theory 95 (1998), 215–228.
[12] Y. Makovoz, Random approximants and neural networks, J. Approx. Theory 85 (1996), 98–109.
[13] H. N. Mhaskar, Approximation theory and neural networks, in "Wavelet Analysis and Applications, Proceedings of the International Workshop in Delhi, 1999" (P. K. Jain, M. Krishnan, H. N. Mhaskar, J. Prestin, and D. Singh, eds.), Narosa Publishing, New Delhi, India, 2001, 247–289.
[14] H. N. Mhaskar and C. A. Micchelli, Dimension-independent bounds on approximation by neural networks, IBM J. of Research and Development 38 (1994), 277–284.
[15] E. Novak and H. Woźniakowski, When are integration and discrepancy tractable?, in "Foundations of Computational Mathematics" (R. DeVore, A. Iserles, and E. Süli, eds.), Cambridge University Press, 2001, 211–266.
[16] E. Novak and H. Woźniakowski, Intractability results for integration and discrepancy, Journal of Complexity 17 (2001), 388–441.
[17] S. H. Paskov, New methodologies for evaluating derivatives, in "Mathematics of Derivative Securities" (S. Pliska and M. Dempster, eds.), pp. 545–582, Cambridge University Press, 1997.
[18] S. H. Paskov and J. F. Traub, Faster valuation of financial securities, J. Portfolio Management 22 (1995), 113–120.
[19] D. Pollard, "Convergence of Stochastic Processes", Springer Verlag, New York, 1984.
[20] W. Rudin, "Real and Complex Analysis", McGraw Hill, New York, 1974.
[21] A. W. van der Vaart and J. A. Wellner, "Weak Convergence and Empirical Processes, with Applications to Statistics", Springer Verlag, New York, 1996.
[22] V. N. Vapnik, "Statistical Learning Theory", John Wiley and Sons, New York, 1998.