Constructive Function Approximation: Theory and Practice
D. Docampo, D.R. Hush, C.T. Abdallah
ABSTRACT In this paper we study the theoretical limits of finite constructive convex approximations of a given function in a Hilbert space using elements taken from a reduced subset. We also investigate the trade-off between the global error and the partial error at each step of the iterative solution. These results are then specialized to constructive function approximation using sigmoidal neural networks. The emphasis then shifts to the implementation issues associated with achieving a given approximation error when using a finite number of nodes and a finite data set for training.
1 Introduction
It has been shown that continuous functions on compact subsets of $\mathbb{R}^d$ can be uniformly approximated by linear combinations of sigmoidal functions [11, 20]. What was missing from that result is how the error in the approximation is related to the number of sigmoids used. This can be phrased in a more general way as the problem of approximating a given element (function) $f$ in a Hilbert space $H$ by means of an iterative sequence $f_n$, and it has an impact on establishing convergence results for projection pursuit algorithms [22], neural network training [5] and classification [12]. Moreover, the fact that one will have to achieve the approximation when only samples of $f$ are given has been largely forgotten by most papers which quote the results of [11, 20]. The approximation problem can be given a constructive solution in which the iterations involve computations in a reduced subset $G$ of $H$ [22, 5]. This leads to algorithms such as projection pursuit. Convergence of the classical projection pursuit regression techniques [13], however, has been shown to be very slow unless the iterate $f_{n+1}$ is chosen to be an optimal combination of the past iterate $f_n$ and a ridge function of elements of the subset $G$. The bound on the error in this approximation has been refined several times since the initial non-constructive proof given by Maurey, as reported in [23]. Jones [22] provided the first constructive
solution to the problem of finding finite convex approximations of a given function in a Hilbert space using elements taken from a reduced subset. His results have recently been refined by Barron [3] and Dingankar [12]. In this paper we show that the rate of convergence obtained in [22] and [5] is the maximum achievable and that, only under some restricted assumptions, the results in [12] can be derived as the optimal convex combination preserving the desired convergence rate. In the first part of the paper, we formulate the approximation problem in such a way that we can study the limits of the global error, obtain the best possible trade-off between global and partial errors, and give theoretical bounds for the global error when a prespecified partial error is fixed. We then concentrate on the implementation aspects of the problem, specifically the problem of achieving a certain approximation error using one approximating function at a time. We then discuss some specific sigmoidal functions and algorithms which have been shown to be efficient in solving a particular step of the approximation problem. The rest of the paper is organized as follows. We start out by reviewing some theoretical results in section 2, where we state the problem and highlight its practical implications. In section 3 we review the theoretical solutions to the problem and provide the framework under which those solutions can be derived. In section 4 we analyze the limits of the global error and its relation to the partial errors at each step of the iterative process. In section 5 we specialize the constructive functions to sigmoidal functions. Section 6 presents the practical issues associated with implementing a constructive algorithm with an eye towards neural network results. Finally, section 7 presents our conclusions.
2 Overview of Constructive Approximation
In this section, we state and present some theoretical results on the constructive approximation problem. In order to state the results in their full generality, let $G$ be a subset of a real or complex Hilbert space $H$, with norm $\|\cdot\|$, such that its elements, $g$, are bounded in norm by some positive constant $b$. Let $\overline{co}(G)$ denote the convex closure of $G$ (i.e. the closure of the convex hull of $G$ in $H$). The first global bound result, attributed to Maurey, concerning the error in approximating an element of $\overline{co}(G)$ using convex combinations of $n$ points in $G$, is the following:
Lemma 2.1 Let $f$ be an element of $\overline{co}(G)$ and $c$ a constant such that $c > b^2 - \|f\|^2 = b_f^2$. Then, for each positive integer $n$ there is a point $f_n$ in the convex hull of some $n$ points of $G$ such that:
$$\|f - f_n\|^2 \le \frac{c}{n}$$
The first constructive proof of this lemma was given by Jones [22] and refined by Barron [3]; the proof includes an algorithm to iterate the solution. In the next section, a review of the constructive proof will be presented. We will specifically prove the following in section 3.
Theorem 2.1 For each element $f$ in $\overline{co}(G)$, let us define the parameter $\gamma$ as follows:
$$\gamma = \inf_{\phi \in H}\,\sup_{g \in G}\left(\|g - \phi\|^2 - \|f - \phi\|^2\right)$$
Let now $\gamma'$ be a constant such that $\gamma' > \gamma$. Then, we can construct an iterative sequence $f_n$, with $f_n$ chosen as a convex combination of the previous iterate $f_{n-1}$ and an element $g_n \in G$, $f_n = (1-\alpha)f_{n-1} + \alpha g_n$, such that:
$$\|f - f_n\|^2 \le \frac{\gamma'}{n}$$
Proof: See section 3.
Note that this new parameter, $\gamma$, is related to Maurey's $b_f^2$: taking $\phi = 0$ in the definition of $\gamma$ shows that $\gamma \le b_f^2$. The relation between this problem and the universal approximation property of sigmoidal networks was clearly established in [22, 5]; specifically, under certain mild restrictions, continuous functions on compact subsets of $\mathbb{R}^d$ belong to the convex hull of the set of sigmoidal functions that one-hidden-layer neural networks can generate. Moreover, since the proofs are constructive, an algorithm to achieve the theoretical bounds is provided as well. Other nonlinear approximation techniques have also benefited from the solution to this problem: approximation by hinged hyperplanes [8], projection pursuit regression [28] and radial basis functions [17]. In all these related approximation problems the solution can always be constrained to fall in the closure of the convex hull of a subset of functions (e.g. hinged hyperplanes, ridge functions or radial basis functions in the examples mentioned above).
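One way to make this connection concrete (a sketch under the assumption that the dictionary consists of scaled sigmoidal units; the bound $b$ and the symbols $w_k$, $\theta_k$ are generic illustrations, not taken from the text) is to write
$$G = \left\{\, c\,\sigma(w^{\top}x + \theta) \;:\; |c| \le b,\ w \in \mathbb{R}^d,\ \theta \in \mathbb{R} \,\right\}, \qquad f_n(x) = \sum_{k=1}^{n} \lambda_k\, c_k\,\sigma(w_k^{\top}x + \theta_k), \quad \lambda_k \ge 0,\ \sum_{k=1}^{n}\lambda_k = 1,$$
so that a convex combination of $n$ elements of $G$ is exactly a one-hidden-layer network with $n$ sigmoidal nodes whose output weights $\lambda_k c_k$ sum to at most $b$ in absolute value.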
3 Constructive Solutions
For the sake of clarity and completeness, we include here the proof given in [5] and [12].
Lemma 3.1 Given $f \in \overline{co}(G)$, for each element $h$ of $co(G)$ and each $\alpha \in [0,1]$:
$$\inf_{g \in G}\|f - (1-\alpha)h - \alpha g\|^2 \le (1-\alpha)^2\|f - h\|^2 + \alpha^2\gamma \qquad (1.1)$$
Proof: The proof of the lemma will be carried out for $f \in co(G)$; it extends to elements in $\overline{co}(G)$ because of the continuity of all the terms involved in the inequalities [10]. Since $f \in co(G)$, there exists a convex combination of elements $g_k$ of $G$ such that $f = \sum_{k=1}^{m}\lambda_k g_k$. Let $g$ be a random vector taking values in $H$ with probabilities $P(g = g_k) = \lambda_k$. Then $E(g) = f$ and $var(g) = E(\|g - f\|^2) = E(\|g\|^2) - \|f\|^2 \le b_f^2$. Additionally, for $\phi \in H$, $var(g) = var(g - \phi) = E(\|g - \phi - (f - \phi)\|^2) = E(\|g - \phi\|^2) - \|f - \phi\|^2$. Thus, for all $\phi \in H$,
$$var(g) \le \sup_{g \in G}\left(\|g - \phi\|^2 - \|f - \phi\|^2\right) \;\Longrightarrow\; var(g) \le \inf_{\phi \in H}\,\sup_{g \in G}\left(\|g - \phi\|^2 - \|f - \phi\|^2\right) = \gamma.$$
Now, for $\alpha \in [0,1]$ and $d \in H$, $E(\|\alpha(g - f) + d\|^2) = \alpha^2 E(\|g - f\|^2) + \|d\|^2 \le \alpha^2\gamma + \|d\|^2$, and hence, for $\alpha \in [0,1]$,
$$\inf_{g \in G}\|f - (1-\alpha)h - \alpha g\|^2 \le E\left(\|(1-\alpha)h + \alpha g - f\|^2\right) = E\left(\|(1-\alpha)(h - f) + \alpha(g - f)\|^2\right) \le (1-\alpha)^2\|f - h\|^2 + \alpha^2\gamma,$$
which concludes the proof of Lemma 3.1.
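The probabilistic argument above is easy to check numerically on a finite-dimensional toy example. The sketch below is our own illustration (the dictionary size, dimension and weights are arbitrary, and $b_f^2$ is used in place of $\gamma$, which only weakens the bound); it verifies that the expectation sits between the infimum over $G$ and the right-hand side of (1.1).

import numpy as np

rng = np.random.default_rng(0)

# Toy dictionary G: a few fixed vectors in R^5 (arbitrary choice for illustration).
G = rng.normal(size=(8, 5))
lam = rng.dirichlet(np.ones(8))        # convex weights lambda_k
f = lam @ G                            # f is a convex combination of the g_k
h = rng.dirichlet(np.ones(8)) @ G      # another element of co(G)

b2 = max(np.sum(G**2, axis=1))         # b^2 >= ||g||^2 for all g in G
gamma_up = b2 - np.sum(f**2)           # b_f^2, an upper bound on gamma (phi = 0)

alpha = 0.3
# Exact expectation E||f - (1-alpha)h - alpha g||^2 over P(g = g_k) = lambda_k
expect = sum(l * np.sum((f - (1 - alpha) * h - alpha * gk)**2)
             for l, gk in zip(lam, G))
# Best single dictionary element (the infimum over g in G)
best = min(np.sum((f - (1 - alpha) * h - alpha * gk)**2) for gk in G)
bound = (1 - alpha)**2 * np.sum((f - h)**2) + alpha**2 * gamma_up

print(best <= expect <= bound)         # expected output: True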
We can now prove Theorem 2.1 using an inductive argument.
Proof: At step 1, find $g_1$ and $\delta_1$ so that $\|f - g_1\|^2 \le \inf_{g \in G}\|f - g\|^2 + \delta_1 \le \gamma'$; this is guaranteed by (1.1), for $\alpha = 1$ and $\delta_1 = \gamma' - \gamma$. Let now $f_n$ be our iterative sequence of elements in $co(G)$, and assume that, for $n \ge 2$, $\|f - f_{n-1}\|^2 \le \gamma'/(n-1)$. It is then possible to choose among different values of $\alpha$ and $\delta_n$ so that:
$$(1-\alpha)^2\|f_{n-1} - f\|^2 + \alpha^2\gamma \le \frac{\gamma'}{n} - \delta_n \qquad (1.2)$$
At step $n$, select $g_n$ such that:
$$\|f - (1-\alpha)f_{n-1} - \alpha g_n\|^2 \le \inf_{g \in G}\|f - (1-\alpha)f_{n-1} - \alpha g\|^2 + \delta_n \qquad (1.3)$$
Hence, using (1.3), (1.1) and (1.2) in turn,
$$\|f - f_n\|^2 \le \inf_{g \in G}\|f - (1-\alpha)f_{n-1} - \alpha g\|^2 + \delta_n \le (1-\alpha)^2\|f - f_{n-1}\|^2 + \alpha^2\gamma + \delta_n \le \frac{\gamma'}{n},$$
and that completes the proof of Theorem 2.1.
The values of $\alpha$ and $\delta_n$ in [5] and [12] are related to the parameter $\varepsilon = \gamma'/\gamma - 1$ in the following way:
$$[5]:\quad \alpha = \frac{\|f - f_{n-1}\|^2}{\gamma + \|f - f_{n-1}\|^2}, \qquad \delta_n = \frac{\gamma'\varepsilon}{n(n+\varepsilon)}$$
$$[12]:\quad \alpha = \frac{1}{n}, \qquad \delta_n = \frac{\gamma\varepsilon}{n^2}$$
It is easy to check that, in both cases, $\delta_1$ is equal to $\gamma' - \gamma$, as stated in the proof (setting $n = 1$ in either expression gives $\gamma\varepsilon = \gamma' - \gamma$). Given that the values of the constant $\alpha$ are different in the two cases, we first look for the values of $\alpha$ which make the problem solvable (i.e. feasible values for the constant $\alpha$). Admissible values of $\alpha$ have to satisfy inequality (1.2) for positive values of $\delta_n$; it is easy to show that those values fall in the following interval, centered at Barron's optimal value for $\alpha$:
$$\alpha \in \frac{\|f - f_{n-1}\|^2}{\gamma + \|f - f_{n-1}\|^2} \;\pm\; \frac{1}{\gamma + \|f - f_{n-1}\|^2}\sqrt{\frac{\gamma'}{n}\left(\gamma + \|f - f_{n-1}\|^2\right) - \gamma\,\|f - f_{n-1}\|^2}$$
To evaluate the possible choices for the bound $\delta_n$ we need to make use of the induction hypothesis; introducing it in inequality (1.2), the values of $\alpha$ should now satisfy
$$(1-\alpha)^2\frac{\gamma'}{n-1} + \alpha^2\gamma \le \frac{\gamma'}{n} - \delta_n.$$
In this case, admissible values of $\alpha$ for positive values of $\delta_n$ fall in the interval (which always contains the value $\alpha = 1/n$):
$$\alpha \in \frac{1+\varepsilon}{n+\varepsilon} \;\pm\; \frac{n-1}{n+\varepsilon}\sqrt{\frac{\varepsilon(1+\varepsilon)}{n(n-1)}}$$
In Figure 1 we show the bounds of this second interval for $\alpha$ as a function of $n$. The bounds are shown as solid lines, the center of the interval as a dotted line, and the value of $\alpha$ in [12] as a dash-dotted line. Note how the dash-dotted line approaches the limits of the interval, which results in a poorer value for $\delta_n$, as will be shown later.
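A short sketch (our own, with $\varepsilon = 1$ chosen only to mirror Figure 1) evaluates the endpoints of this second interval together with the choice $\alpha = 1/n$ of [12]:

import numpy as np

def alpha_interval(n, eps):
    """Admissible interval for alpha at step n, given eps = gamma'/gamma - 1."""
    center = (1 + eps) / (n + eps)
    half = (n - 1) / (n + eps) * np.sqrt(eps * (1 + eps) / (n * (n - 1)))
    return center - half, center + half

eps = 1.0
for n in range(2, 8):
    lo, hi = alpha_interval(n, eps)
    print(f"n={n}: [{lo:.3f}, {hi:.3f}]  center={(1 + eps) / (n + eps):.3f}  1/n={1 / n:.3f}")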
3.1 Discussion
Since the results presented so far achieve a bound of the global error of $O(1/n)$, and, to construct the solution, a partial error $\delta_n$ of $O(1/n^2)$ is the maximum allowed at each step, it is useful to formulate the following questions:
FIGURE 1. Admissible values of $\alpha$ for $\varepsilon = 1$.
1. Is there any possibility of achieving a further reduction in the global error using convex combinations of $n$ elements from $G$? What is the minimum bound for the global error assuming $\delta_n = 0$ for all $n$?
2. What is the optimal choice of $\alpha$ for a given bound, so that $\delta_n$ is maximum, making the quasi-optimization problem at each step easier to solve?
3. For the optimal choice of $\alpha$ and a prespecified partial error $\delta_n$, what is the bound for the global approximation problem?
Based on the assumptions made in Lemma 3.1, let us formulate the problem again in a more general way. Our objective is to look for a constructive approximation so that the overall error using $n$ elements from $G$ satisfies the following inequality:
$$\|f - f_n\|^2 \le \frac{\gamma'}{b(n)} \qquad (1.4)$$
$b(n)$ being a function of the parameter $n$ which indicates the order of our approximation (i.e. $b(n) = n$ both in [22] and [12]), and $\gamma'$ the parameter related to $\gamma$ as defined before. In what follows we will assume that the iterate $f_n$ is chosen as a convex combination of the previous iterate $f_{n-1}$ and a point $g_n$ in $G$; this introduces a loss of generality, since other constructive approaches could be devised in order to re-optimize the coefficients of the previous elements from $G$ at each step. The fact that $f_n$ is forced to be a convex combination of $n$ elements from $G$, together with the requirement that our algorithm be constructive, means that $f_n$ is in the convex hull of $\{g_1, g_2, \ldots, g_n\}$ and $f_{n-1}$ is in the convex hull of $\{g_1, g_2, \ldots, g_{n-1}\}$; but that does not imply that $f_n$ must be a convex combination of $f_{n-1}$ and $g_n$, as can easily be shown. We leave the more general problem for further investigation and concentrate here on the case
where constructiveness of the algorithm is taken, as in [22] and [12], to be equivalent to the constraint that, at each step, $f_n$ is in the convex hull of $\{f_{n-1}, g_n\}$. Before we try to answer the three questions posed at the beginning of this section, let us set up a framework in which the constructive results can be derived. Let $f_n = (1-\alpha)f_{n-1} + \alpha g_n$; then, in our approximation problem we want to find $\alpha$, $\delta_n$, and the function $b(n)$ so that:
$$\|f - f_n\|^2 \le \inf_{\alpha}\,\inf_{g \in G}\|f - (1-\alpha)f_{n-1} - \alpha g\|^2 + \delta_n$$
1. …; let $\gamma' = (1+\varepsilon)\gamma$.
2. Find $g_1 \in G$ so that $\|f - g_1\|^2 \le \gamma'$. Set $f_1 = g_1$.
3. For $n > 1$, evaluate:
(a) $\alpha_n = (1+\varepsilon)/(1+\varepsilon+b(n-1))$ from (1.9)
(b) Find $g_n \in G$ so that
$$\|f - (1-\alpha_n)f_{n-1} - \alpha_n g_n\|^2 \le \inf_{g \in G}\|f - (1-\alpha_n)f_{n-1} - \alpha_n g\|^2 + \delta_n$$
(c) Make $f_n = (1-\alpha_n)f_{n-1} + \alpha_n g_n$
(d) Compute $b(n)$ from (1.11)
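The loop above can be written down directly. The sketch below is our own illustration, not part of the original development: the finite dictionary, the target $f$ and the exhaustive search over $G$ (which makes the tolerance $\delta_n$ effectively zero) are simplifying assumptions, and the closed-form step size $\alpha_n = (1+\varepsilon)/(n+\varepsilon)$ given later in Section 4.3 is used instead of the $b(n)$-based rule of step 3(a), since (1.9) and (1.11) are not reproduced here.

import numpy as np

def constructive_approx(f, G, eps=1.0, n_steps=30):
    """Greedy convex approximation of f by elements of a finite dictionary G.

    G is an (m, d) array whose rows play the role of the set G; f is a length-d
    vector assumed to lie in the convex hull of the rows of G.
    """
    errs = np.sum((G - f) ** 2, axis=1)          # step 2: best single element
    fn = G[np.argmin(errs)].copy()
    iterates = [fn.copy()]
    for n in range(2, n_steps + 1):
        alpha = (1.0 + eps) / (n + eps)          # closed-form step size of Section 4.3
        residual = f - (1.0 - alpha) * fn
        cand = np.sum((residual - alpha * G) ** 2, axis=1)
        gn = G[np.argmin(cand)]                  # step 3(b): exhaustive search, delta_n = 0
        fn = (1.0 - alpha) * fn + alpha * gn     # step 3(c): convex update
        iterates.append(fn.copy())
    return iterates

# Usage example: a target built as a convex combination of the dictionary rows.
rng = np.random.default_rng(1)
G = rng.normal(size=(50, 10))
f = rng.dirichlet(np.ones(50)) @ G
errors = [float(np.sum((f - x) ** 2)) for x in constructive_approx(f, G)]
print([round(e, 4) for e in errors[::5]])        # Theorem 2.1 guarantees O(1/n) decay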
In order to make the appropriate comparisons with previous results, we will set $\delta_n = \gamma\varepsilon/n^2$, as in [12]. Then, again under the induction hypothesis,
$$\frac{1}{b(n)} = \frac{1}{1+\varepsilon+b(n-1)} + \frac{\varepsilon}{(1+\varepsilon)n^2}$$
To predict the asymptotic behavior of $b(n)$, let us assume that, at step $n-1$, $b(n-1) \ge (1+\varepsilon)\mu(n-1)$; we will then prove that, for some values of the constant $\mu$, we also have $b(n) \ge (1+\varepsilon)\mu n$. Since $b(n-1) \ge (1+\varepsilon)\mu(n-1)$, we have:
$$\frac{1}{b(n)} \le \frac{1}{(1+\varepsilon)\left(1+\mu(n-1)\right)} + \frac{\varepsilon}{(1+\varepsilon)n^2} \;\Longrightarrow\; \frac{1+\varepsilon}{b(n)} \le \frac{1}{1+\mu(n-1)} + \frac{\varepsilon}{n^2} \;\Longrightarrow\; b(n) \ge (1+\varepsilon)\,\frac{n^2\left(1+\mu(n-1)\right)}{n^2 + \varepsilon\left(1+\mu(n-1)\right)}$$
so that
$$b(n) \ge n(1+\varepsilon)\mu \;\Longleftrightarrow\; n\left(1+\mu(n-1)\right) \ge \mu\left(n^2 + \varepsilon\left(1+\mu(n-1)\right)\right) \;\Longleftrightarrow\; n(1-\mu) \ge \mu\varepsilon\left(1+\mu(n-1)\right).$$
This last inequality is asymptotically fulfilled (divide both sides by $n$ and let $n \to \infty$ to obtain $1-\mu \ge \varepsilon\mu^2$) for any value of $\mu$ such that:
$$0 \le \mu \le \frac{\sqrt{4\varepsilon+1}-1}{2\varepsilon}$$
Then, for the value of $\delta_n$ selected in [12], the asymptotic value for $b(n)$ is
$$b(n) = (1+\varepsilon)\,\frac{\sqrt{4\varepsilon+1}-1}{2\varepsilon}\,n,$$
which is a better rate than the one obtained in [12]. In Figure 2 we show $b(n)$ as a solid line, together with the straight line $l(n) = n$ corresponding to the rate in [12] (dotted line) and the predicted asymptotic behavior of $b(n)$. The figure clearly supports the asymptotic results, and shows that the constant $\alpha_n$ found in (1.9) always results in a better convergence rate than [12]. The gap between the two lines would be bigger for larger values of the constant $\varepsilon$; in other words, the larger the constant $\varepsilon$, the worse the convergence rate achieved using $\alpha = 1/n$.
FIGURE 2. Optimal convergence rate for $\varepsilon = 1$.
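A few lines reproduce the content of Figure 2 numerically (our own check; $\varepsilon = 1$ is an assumption made for the sketch, and $b(1) = 1$ follows from step 2, which guarantees $\|f - f_1\|^2 \le \gamma'$):

eps = 1.0
b = [1.0]                       # b(1) = 1
for n in range(2, 101):
    inv = 1.0 / (1.0 + eps + b[-1]) + eps / ((1.0 + eps) * n**2)
    b.append(1.0 / inv)

slope = (1.0 + eps) * ((4 * eps + 1) ** 0.5 - 1) / (2 * eps)   # predicted asymptote
print(b[-1] / 100, slope)       # b(100)/100 should approach the predicted slope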
4.3 Fixing the rate of convergence
The remaining problem, namely, given the optimal value of $\alpha$, to find the maximum $\delta_n$ for a fixed convergence rate, thus making the quasi-optimization problem at each step easier to solve, was already explicitly solved in (1.12). Again, to show how our results compare with [5] and [12], we will assume that our desired rate of convergence is given by $b(n) = n$. The value $\alpha_n = (1+\varepsilon)/(n+\varepsilon)$ solves the optimization problem, and:
$$\delta_n = \frac{\gamma'\varepsilon}{n(n+\varepsilon)} \qquad (1.13)$$
This is the best upper bound we can achieve for the partial error at each step of the iteration process. It is easy to show that it coincides with Barron's bound, and is always greater than the bound found in [12].
Now, in Figure 3 we show the bound $\delta_n$ for $n = 5$ and $\gamma = 1$, as a function of $\varepsilon$. The optimal bound is shown using a solid line, while the bound from [12] is shown using a dotted line.
FIGURE 3. $\delta_n$ for $n = 5$.
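For reference, a small comparison of the two bounds (our own sketch, with $n = 5$ and $\gamma = 1$ as in Figure 3):

n, gamma = 5, 1.0
for eps in (0.5, 1.0, 5.0, 25.0):
    optimal = gamma * (1 + eps) * eps / (n * (n + eps))   # bound (1.13)
    dingankar = gamma * eps / n**2                        # bound used in [12]
    print(f"eps={eps}: optimal={optimal:.3f}  [12]={dingankar:.3f}")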
5 The Sigmoidal Class of Approximators
When discussing neural networks, we are typically referring to a system built by linearly combining a large collection of simple computing devices (i.e., nodes), each of which performs a nonlinear transformation (in general a sigmoid function) on its inputs [18]. A sigmoid is defined here as a bounded function $\sigma(x)$. It is now known that a one-hidden-layer static network whose nodes are sigmoidal is capable of approximating an arbitrary (continuous) function. Many proofs of this result have appeared, of which we recall the ones in [11, 20]. Until recently, these proofs used the Stone-Weierstrass theorem and required the continuity or even differentiability of the sigmoid (or nonlinearities) in the neural net. Chen et al. [9], building on the research of Sandberg [25, 26, 27], have recently shown, however, that all that is needed is the boundedness of the sigmoidal building block. Table 1.1 is taken from [9] and summarizes some available results for the approximation of functions. The set $K$ denotes a compact subset of $\mathbb{R}^n$. Note that even those results labeled "constructive" still ignore the issues associated with the training algorithm and the available data. The set of popular sigmoids includes the hardlimiting threshold or Heaviside function shown in Figure 4(a):
$$H(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases} \qquad (1.14)$$
1. Constructive Function Approximation: Theory and Practice
13
Reference   Activation Function   Approximation In      Proof
[11]        Continuous Sigmoid    $C[K]$                Existential
[11]        Bounded Sigmoid       $L_p[K]$              Existential
[20]        Monotone Sigmoid      $C[K]$                Constructive
[9]         Bounded Sigmoid       $C[\mathbb{R}^n]$     Constructive
TABLE 1.1. Approximation Results
FIGURE 4. Typical nonlinearities: (a) hardlimiter nonlinearity $H(x)$; (b) sigmoid nonlinearities $S(x)$ for gains $\beta = 0.2, 1.0, 5.0$.
In order to derive certain learning techniques, a continuous nonlinear activation function is often required. For example, gradient descent techniques typically require that the sigmoid be differentiable [2]. Thus the threshold function is commonly approximated using the sigmoid function shown in Figure 4(b):
$$S(x) = \frac{1}{1 + e^{-\beta x}} \qquad (1.15)$$
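A minimal sketch (ours, not from the text) of the two nonlinearities of Figure 4, illustrating how $S(x)$ approaches $H(x)$ as the gain grows:

import numpy as np

def hardlimiter(x):
    """Heaviside threshold of equation (1.14)."""
    return (x > 0).astype(float)

def sigmoid(x, beta=1.0):
    """Logistic sigmoid of equation (1.15) with gain beta."""
    return 1.0 / (1.0 + np.exp(-beta * x))

x = np.linspace(-10, 10, 5)
for beta in (0.2, 1.0, 5.0, 50.0):
    print(beta, np.round(sigmoid(x, beta), 3))
print("H:", hardlimiter(x))   # the large-gain limit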
The gain of the sigmoid, $\beta$, determines the steepness of the transition region. Note that as the gain approaches infinity, the sigmoid approaches the hardlimiting threshold. Often the gain is set equal to one, and $\beta$ is omitted from the definition in equation (1.15). Later in this paper, we shall use the ramp function, which is another sigmoid, defined as $r(x) =$
8