Smoothing Regularizers for Projective Basis Function Networks

John E. Moody
Dept. of Computer Science and Engineering, Oregon Graduate Institute of Science and Technology, P.O. Box 91000, Portland, Oregon 97291-1000, U.S.A.
[email protected]

Thorsteinn S. Rögnvaldsson
Centre for Computer Systems Architecture, University of Halmstad, P.O. Box 823, 301 18 Halmstad, Sweden
[email protected]

Abstract: Smoothing regularizers for radial basis functions have been studied extensively, but no general smoothing regularizers for projective basis functions (PBFs), such as the widely-used sigmoidal PBFs, have heretofore been proposed. We derive new classes of algebraically-simple mth-order smoothing regularizers for networks of projective basis functions

\[ f(W;x) = \sum_{j=1}^{N} u_j\, g\!\left[ x^T v_j + v_{j0} \right] + u_0 , \]

with general transfer functions g[·]. These regularizers are:

\[ R_G(W,m) = \sum_{j=1}^{N} u_j^2 \|v_j\|^{2m-1} \qquad \text{(Global Form)} \]
\[ R_L(W,m) = \sum_{j=1}^{N} u_j^2 \|v_j\|^{2m} \qquad \text{(Local Form)} . \]

With appropriate constant factors, these regularizers bound the corresponding mth-order smoothing integral

\[ S(W,m) = \int d^D x\, \Omega(x) \left\| \frac{\partial^m f(W;x)}{\partial x^m} \right\|^2 . \]

In the above expressions, {v_j} are the projection vectors, W denotes all the network weights {u_j, u_0, v_j, v_{j0}}, and Ω(x) is a weighting function (not necessarily the input density) on the D-dimensional input space. The global and local cases are distinguished by different choices of Ω(x). These simple algebraic forms R(W,m) enable the direct enforcement of smoothness without the need for costly Monte Carlo integrations of S(W,m). The regularizers are tested on illustrative sample problems and compared to quadratic weight decay. The new regularizers are shown to yield better generalization errors than weight decay when the implicit assumptions in the latter are wrong. Unlike weight decay, the new regularizers distinguish between the roles of the input and output weights and capture the interactions between them. From a Bayesian standpoint, our results suggest that when the assumption of mth-order smoothness is appropriate, priors of the form exp[−R(W,m)] will yield PBF networks that generalize better than networks trained with gaussian priors.
1 Introduction: What is the right bias?

Regularization is a technique for reducing prediction risk by balancing model bias and model variance. A regularizer R(W) imposes prior constraints on the network parameters W. Using squared error as the most common example, the objective functional that is minimized during training is

\[ E = \frac{1}{2M} \sum_{i=1}^{M} \left[ y^{(i)} - f(W; x^{(i)}) \right]^2 + \lambda R(W) , \tag{1} \]
where y^{(i)} are target values corresponding to the inputs x^{(i)}, M is the number of training patterns, and the regularization parameter λ controls the importance of the prior constraints relative to the fit to the data. Several approaches can be applied to estimate λ (see for example Eubank (1988), Hastie & Tibshirani (1990) or Wahba (1990)) in order to minimize the prediction risk by optimizing the bias/variance tradeoff. Regularization reduces model variance at the cost of introducing some model bias. An important question arises: What is the right bias? (Geman, Bienenstock & Doursat 1992). A good choice of the regularizer R(W) will result in lower expected prediction error than will a poor choice.[1] Weight decay is often used effectively, but it is an ad hoc technique that controls weight values rather than the fit to the data directly. It is thus not necessarily optimal. Weight decay is not appropriate for arbitrary function parameterizations, since it will give very different results depending upon whether a function is parameterized, for example, as f(w; x) or as f(w^{-1}; x). Since many real world problems are intrinsically smooth, we propose that an appropriate bias to impose is to favor solutions with low mth-order curvature. Direct penalization of curvature is a parametrization-independent approach. The desired regularizer is the standard D-dimensional curvature functional of order m:
\[ S(W,m) = \int d^D x\, \Omega(x) \left\| \frac{\partial^m f(W;x)}{\partial x^m} \right\|^2 . \tag{2} \]

Here ‖·‖ denotes the ordinary euclidean tensor norm and ∂^m/∂x^m denotes the mth-order differential operator.
The weighting function Ω(x) ensures that the integral converges and determines the region over which we require the function to be smooth. Ω(x) is not required to be equal to the input density p(x), and will most often be different. The use of smoothing functionals S(W) like (2) has been extensively studied for smoothing splines (Eubank 1988, Hastie & Tibshirani 1990, Wahba 1990) and for radial basis function (RBF) networks (Powell 1987, Poggio & Girosi 1990, Girosi, Jones & Poggio 1995). However, no general class of smoothing regularizers that directly enforce smoothness S(W,m) for any m and projective basis functions (PBFs), such as the widely used sigmoidal PBFs, has hitherto been proposed. Only two special cases of m and Ω(x) have been treated previously for PBFs: m = 1 with global weighting[2] (van Vuuren 1994) and m = 2 with local weighting (Bishop 1993). Since explicit enforcement of smoothness using (2) in the general case requires costly, impractical Monte Carlo integrations,[3] we derive new, algebraically-simple regularizers R(W,m) that tightly bound S(W,m). In this paper, we focus on the ubiquitous PBF networks. We derive and test the corresponding regularizers for RBFs in the companion paper (Moody & Rögnvaldsson 1996).

[1] Regularizers can be viewed as being part of a more general class of biases called "hints" (Abu-Mostafa 1995). Hints can include soft or hard constraints that enforce symmetries, positivity, monotonicity, smoothness, and so on.
[2] Van Vuuren considers the norm of the gradient instead of the squared norm.
[3] Note that (2) is not just one integral. The number of independent terms in ‖∂^m f/∂x^m‖² is (D + m − 1)!/[(D − 1)! m!], which makes it extremely expensive to compute for large D and large m.
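To make footnote [3] concrete, the following short sketch (ours, not part of the original paper) counts the number of independent components of the mth-order derivative tensor for a few dimensionalities:

```python
from math import comb

def n_derivative_terms(D, m):
    # Independent m-th order partial derivatives in D dimensions:
    # multisets of size m drawn from D coordinates = (D + m - 1)! / [(D - 1)! m!]
    return comb(D + m - 1, m)

for D in (5, 10, 50):
    print(D, [n_derivative_terms(D, m) for m in (1, 2, 3)])
# D = 50, m = 3 already gives 22100 distinct terms, each of which would need
# its own Monte Carlo integral if S(W, m) were evaluated directly.
```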
2 Derivation of Simple Regularizers from Smoothing Functionals

We consider single hidden layer networks with D input variables, N_h nonlinear hidden units, and N_o linear output units. For clarity, we set N_o = 1 and drop the subscript on N_h (the derivation is trivially extended to the case N_o > 1). Thus, our network function is

\[ f(x) = \sum_{j=1}^{N} u_j\, g[\theta_j; x] + u_0 , \tag{3} \]
where g[·] are the nonlinear transfer functions of the internal hidden units, x ∈ R^D is the input vector,[4] θ_j are the parameters associated with internal unit j, and W denotes all parameters in the network.
For regularizers R(W), we will derive strict upper bounds for S(W,m). We desire the regularizers to be as general as possible so that they can easily be applied to different network models. Without making any assumptions about Ω(x) or g[·], we have the upper bound

\[ S(W,m) \le N \sum_{j=1}^{N} u_j^2 \int d^D x\, \Omega(x) \left\| \frac{\partial^m g[\theta_j; x]}{\partial x^m} \right\|^2 , \tag{4} \]

which follows from the inequality (Σ_{i=1}^{N} a_i)² ≤ N Σ_{i=1}^{N} a_i². We consider two possible options for the weighting function Ω(x). One is to require global smoothness, in which case Ω(x) is a very wide function that covers all relevant parts of the input space (e.g. a very wide gaussian distribution or a constant distribution). The other option is to require local smoothness, in which case Ω(x) approaches zero outside small regions around some reference points (e.g. the training data).
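For completeness, the inequality invoked above is just Cauchy-Schwarz applied componentwise (a one-line step not spelled out in the original):

\[ \Big( \sum_{i=1}^{N} a_i \Big)^2 = \Big( \sum_{i=1}^{N} 1 \cdot a_i \Big)^2 \;\le\; \Big( \sum_{i=1}^{N} 1^2 \Big) \Big( \sum_{i=1}^{N} a_i^2 \Big) = N \sum_{i=1}^{N} a_i^2 . \]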
2.1 Projective Basis Representations

Projective basis functions (PBFs) are of the form

\[ g[\theta_j; x] = g\!\left[ x^T v_j + v_{j0} \right] , \tag{5} \]

where θ_j = {v_j, v_{j0}}, v_j = (v_{j1}, v_{j2}, ..., v_{jD}) is the jth projection vector, the vector of weights connecting hidden unit j to the inputs, and v_{j0} is the bias, offset, or threshold. Denoting z_j(x) ≡ x^T v_j + v_{j0}, the most commonly used PBFs g[z] are sigmoids, such as tanh[z] and the logistic function g[z] = (1 + exp(−z))^{-1}. Other nonlinear transfer functions g[z] include erf[z], cos[z], and monomials z^p.[5] For PBFs, expression (4) simplifies to

\[ S(W,m) \le N \sum_{j=1}^{N} u_j^2 \|v_j\|^{2m} I_j(W,m) , \tag{6} \]

with

\[ I_j(W,m) \equiv \int d^D x\, \Omega(x) \left( \frac{d^m g[z_j(x)]}{dz_j^m} \right)^2 . \tag{7} \]

[4] Throughout, we use small letter boldface to denote vector quantities.
[5] Some examples of the practical application of non-sigmoidal g[z] are presented in Moody & Yarvin (1992). Non-sigmoidal PBFs can perform better than sigmoids for some highly nonlinear problems.
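The simplification from (4) to (6) follows from the chain rule for projective units: since z_j(x) is linear in x, each mth-order partial derivative of g[z_j(x)] factors into components of v_j times the scalar derivative of g (a worked step, added here for clarity):

\[ \frac{\partial^m g[z_j(x)]}{\partial x_{i_1} \cdots \partial x_{i_m}} = v_{j i_1} \cdots v_{j i_m} \, \frac{d^m g[z_j]}{dz_j^m} \quad\Longrightarrow\quad \left\| \frac{\partial^m g[z_j(x)]}{\partial x^m} \right\|^2 = \sum_{i_1, \ldots, i_m} v_{j i_1}^2 \cdots v_{j i_m}^2 \left( \frac{d^m g[z_j]}{dz_j^m} \right)^2 = \|v_j\|^{2m} \left( \frac{d^m g[z_j]}{dz_j^m} \right)^2 . \]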
2.2 Global weighting

For the global case, we select a gaussian form for the weighting function

\[ \Omega_G(x) = \left( \sqrt{2\pi}\,\sigma \right)^{-D} \exp\!\left[ \frac{-\|x\|^2}{2\sigma^2} \right] \tag{8} \]

and require σ to be large. The gaussian simplifies evaluation of the smoothing integral considerably, since it is both separable and spherically symmetric. Integrating out all dimensions, except the one associated with the projection vector v_j, we are left with

\[ I_j(W,m) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} dx\, e^{-x^2/2\sigma^2} \left( \frac{d^m g[z_j(x)]}{dz_j^m} \right)^2 . \tag{9} \]

Provided the resulting integral converges, we get an upper bound to (9) by replacing the exponential with one (see Appendix A for a discussion). This implies

\[ I_j(W,m) \le \frac{I(m)}{\|v_j\|} \quad \text{with} \quad I(m) \equiv \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} dz \left( \frac{d^m g[z]}{dz^m} \right)^2 , \tag{10} \]

where equality holds in the global limit σ → ∞. Defining the global regularizer to be

\[ R_G(W,m) \equiv \sum_{j=1}^{N} u_j^2 \|v_j\|^{2m-1} , \tag{11} \]

the bound of equation (6) becomes

\[ S(W,m) \le N\, I(m)\, R_G(W,m) , \tag{12} \]

where the index G emphasizes the fact that this upper bound is used in the global case of large σ. Since λ absorbs all multiplicative factors, we need only weigh expression (11) into the training objective function.
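As a quick illustration (our own sketch, not from the paper), the z-integral defining I(m), apart from the 1/(√(2π)σ) prefactor that is absorbed into λ, can be evaluated numerically for tanh units:

```python
import numpy as np

def I_m_unnormalized(m, z_max=20.0, n=200001):
    # \int (d^m tanh(z) / dz^m)^2 dz, computed by differentiating tanh m times
    # on a fine grid and applying the trapezoidal rule.
    z = np.linspace(-z_max, z_max, n)
    g = np.tanh(z)
    for _ in range(m):
        g = np.gradient(g, z)
    return np.trapz(g**2, z)

print(I_m_unnormalized(1))  # approx 4/3 (the exact value of the integral of sech^4)
print(I_m_unnormalized(2))  # also finite; both constants disappear into lambda
```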
2.3 Local weighting

For the local case, we consider weighting functions of the general form

\[ \Omega_L(x) = \frac{1}{M} \sum_{i=1}^{M} \Omega(x^{(i)}; \sigma) , \tag{13} \]

where the x^{(i)} are a set of points and Ω(x^{(i)}; σ) is a function, such as (√(2π) σ)^{-D} exp[−‖x − x^{(i)}‖²/2σ²], that decays rapidly for large ‖x − x^{(i)}‖. We require that lim_{σ→0} Ω(x^{(i)}; σ) = δ(x − x^{(i)}), where δ(·) is the delta function. Thus, when the x^{(i)} are the training data points, the limiting distribution of (13) is the empirical distribution.

In the local limit σ → 0, the integral I_j(W,m) of (7) simplifies, and (6) becomes

\[ S(W,m) \le \frac{N}{M} \sum_{j=1}^{N} u_j^2 \|v_j\|^{2m} \sum_{i=1}^{M} \left( \frac{d^m g[z_j(x^{(i)})]}{dz_j^m} \right)^2 . \tag{14} \]

In theory, we could compute the expression within parentheses in (14) for each input pattern x^{(i)} during training and use it as the regularization cost, which is the approach taken by Bishop (1993). However, this requires explicit design for each transfer function form and also becomes increasingly complicated as we go to higher m. To construct a simpler and more general form, we instead assume that the mth derivative of the transfer function is bounded and define

\[ C_{LP}(m) \equiv \max_z \left( \frac{d^m g[z]}{dz^m} \right)^2 \tag{15} \]

and

\[ R_L(W,m) \equiv \sum_{j=1}^{N} u_j^2 \|v_j\|^{2m} . \tag{16} \]

This gives the bound

\[ S(W,m) \le N\, C_{LP}(m)\, R_L(W,m) \tag{17} \]

for the maximum local curvature of the function (the additional index L emphasizes that it is an upper bound in the local limit). We propose using expression (16) as the smoothing regularization term.
We show in Appendix C that it is reasonable to ignore the exact values of the derivatives of the internal units for problems with many input variables, and use (17) instead of (14).
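Both penalties are computed from the weights alone, so they add essentially no cost to training. A minimal sketch (ours; the array shapes are assumptions) for a single-hidden-layer tanh PBF network:

```python
import numpy as np

def smoothing_penalties(u, V, m):
    """Global and local smoothing regularizers of eqs. (11) and (16).

    u : shape (N,)   -- hidden-to-output weights u_j
    V : shape (N, D) -- projection vectors v_j as rows
    m : order of the smoother
    """
    norms = np.linalg.norm(V, axis=1)            # ||v_j||
    R_global = np.sum(u**2 * norms**(2*m - 1))   # eq. (11)
    R_local  = np.sum(u**2 * norms**(2*m))       # eq. (16)
    return R_global, R_local

# lambda * penalty is simply added to the squared-error objective of eq. (1);
# the biases u_0 and v_{j0} do not enter either regularizer.
u = np.random.randn(12)
V = np.random.randn(12, 9)
print(smoothing_penalties(u, V, m=2))
```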
3 Simulation Studies on Synthetic Data

In this section we demonstrate the value of using smoothing regularizers on two simple synthetic problems which illustrate two key differences between smoothing and quadratic weight decay. Both problems have few input variables and more internal units than inputs, and are thus not examples of cases where we expect our regularizers to work optimally. However, considering low dimensional problems enables us to do extensive simulation studies. We use standard sigmoidal PBF networks for both problems, and study the effects of additive noise, sparse data sets, and the choice of order of the smoother. We train the networks using the RPROP algorithm (Riedmiller & Braun 1993), which was empirically found to be the learning algorithm that converged quickest and reached the lowest errors. The training data are i.i.d. for both the inputs and the additive noise in the targets. The input variables are sampled separately, and are hence uncorrelated with each other, and are normalized to have zero mean and unit variance.

The experimental procedure is as follows: First, we scan 8 orders of magnitude of λ, in steps of Δlog10(λ) = 0.5, in order to find the λ value that gives the lowest geometric[6] mean generalization error. For each value of λ, the mean is computed over 10 networks with different initial conditions and different randomly-generated training sets. An example of such a trace is shown in fig. 1. The generalization error is computed by integrating over a lattice of 201^D points of the noise-free target function. Secondly, we train 100 networks with different initial weights and different training sets, using the λ value that resulted in the lowest generalization error. The procedure is repeated for each regularization method (local/global smoothers of different orders and weight decay with or without including bias weights). Generalization performances of the different regularization methods are then compared pairwise, using a Wilcoxon rank test[7] (Kendall & Stuart 1972). Two methods are considered significantly different if the null hypothesis (equal average generalization performance) is rejected at the 95% confidence level. As a base-level comparison, we also fit 100 linear models and 100 unregularized neural networks to test if our regularized networks perform significantly better. In all experiments reported here, the regularized networks do significantly better than both a linear model and an unregularized model (using the test criteria described above).

[6] The generalization errors seem log-normally distributed, which is why we consider the geometric mean.
[7] We also did a paired t-test on the log of the generalization errors. This gave the same results as the Wilcoxon test, and we report only the latter.
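A schematic version of this λ-selection loop (a hypothetical sketch; `train_network` and `generalization_error` stand in for the RPROP training and lattice evaluation described above):

```python
import numpy as np

def scan_lambda(train_network, generalization_error, n_nets=10):
    # Scan 8 orders of magnitude of lambda in steps of 0.5 in log10 and keep the
    # value with the lowest geometric mean generalization error over n_nets runs.
    best_lam, best_err = None, np.inf
    for log10_lam in np.arange(-6.0, 2.0 + 1e-9, 0.5):
        lam = 10.0 ** log10_lam
        errs = [generalization_error(train_network(lam, seed)) for seed in range(n_nets)]
        geo_mean = np.exp(np.mean(np.log(errs)))   # errors are roughly log-normal
        if geo_mean < best_err:
            best_lam, best_err = lam, geo_mean
    return best_lam, best_err
```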
3.1 Sigmoidal Bump

The first problem is the one dimensional "bump problem" suggested by Wu & Moody (1996). The target function is

\[ t(x) = 0.5\,\tanh\!\left[ (x+5)/2 \right] - 0.5\,\tanh\!\left[ (x-5)/2 \right] , \tag{18} \]

which can be realized with a linear output neural network with one hidden layer of at least two logistic or tanh hidden units. For this problem, we generate training sets {(x_i, y_i), i = 1, ..., M} of sizes M ∈ {11, 21, 41} with noisy targets y_i = t(x_i) + ε_i, randomly sampled from −10 ≤ x ≤ 10, and add gaussian noise of variance s² ∈ {0.1, 0.5, 1.0}. These noise levels correspond to signal-to-noise ratios of {1.2, 0.55, 0.39}, defined as the ratio of the standard deviation of the target function, evaluated over the sampling region, to the standard deviation of the gaussian noise. Table 1 summarizes our results on this problem using networks with four hidden sigmoidal units (four internal units avoids some problems with local minima that occur when using only two internal units). The smoothing regularizers do significantly better than weight decay when the problems are noisy. However, when the bias weights are excluded from the weight decay, the difference essentially disappears.
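For concreteness, the training data for this problem can be generated as follows (our sketch; uniform sampling of the inputs over the interval is an assumption):

```python
import numpy as np

def bump_data(M, noise_var, seed=0):
    # Noisy targets from eq. (18): y_i = t(x_i) + eps_i, with x in [-10, 10]
    # and gaussian noise of variance noise_var.
    rng = np.random.default_rng(seed)
    x = rng.uniform(-10.0, 10.0, size=M)
    t = 0.5 * np.tanh((x + 5.0) / 2.0) - 0.5 * np.tanh((x - 5.0) / 2.0)
    y = t + rng.normal(0.0, np.sqrt(noise_var), size=M)
    return x, y

x, y = bump_data(M=21, noise_var=0.5)
```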
3.2 Bilinear Function

The second problem is the two dimensional bilinear function

\[ t(x_1, x_2) = x_1 x_2 . \tag{19} \]

This example was used by Friedman & Stuetzle (1981) to demonstrate projection pursuit regression. It is the simplest example of a function that has interactions between the input variables. The function can be well fitted by a one hidden layer network with four sigmoidal hidden units by expressing the function in the form t(x_1, x_2) = 0.25(x_1 + x_2)² − 0.25(x_1 − x_2)² and approximating the quadratic functions with a superposition of two tanh sigmoids. We generate training sets of sizes M ∈ {20, 40, 100}, randomly sampled from the space −1 ≤ {x_1, x_2} ≤ 1, and add gaussian noise with standard deviation s ∈ {0.1, 0.2, 0.5}, which corresponds to signal-to-noise ratios of {3.33, 1.67, 0.67}. The student networks have 8 hidden tanh units and one linear output.

Figure 1 illustrates, for the special case of s = 0.2 and a training set with 40 data points, how the generalization performance improves when higher order smoothing regularizers are used instead of weight decay or first order smoothers, which yield inferior solutions. As shown in table 2, this is true when the network is trained on few data points with lots of noise, as well as when it is trained on many data points with little noise. In contrast to the previous problem, excluding bias weights from the quadratic weight decay term does not help.
                                        Weight decay            w/out bias
                                        Signal/Noise            Signal/Noise
  Smoothing regularizer   Size of       0.39  0.55  1.2         0.39  0.55  1.2
                          training set
  Global, m = 1              41          +     0     0           0     0     0
                             21          +     +     0           0     0     0
                             11          0     0     0           0     0     0
  Local, m = 1               41          +     +     0           0     0     0
                             21          +     +     0           0     +     0
                             11          0     +     +           0     0     0
  Global, m = 2              41          +     +     0           +     0     0
                             21          +     +     0           0     +     0
                             11          0     +     0           0     0     0
  Local, m = 2               41          +     +     0           +     0     0
                             21          +     +     0           0     0     0
                             11          –     +     0           –     0     0
  Global, m = 3              41          +     +     0           +     0     –
                             21          +     0     0           0     0     0
                             11          0     +     0           –     0     0
  Local, m = 3               41          +     +     0           +     0     0
                             21          +     0     0           0     0     0
                             11          0     0     0           0     0     0

Table 1: Pairwise performance comparisons on the one dimensional bump problem defined in eq. (18). A '+' means that use of the smoother results in significantly lower generalization error than weight decay (with or without bias weights). A '–' means that the smoother results in significantly higher generalization error than weight decay. A '0' means that the difference is insignificant (i.e. less than required for rejection of the equal means hypothesis at a 95% confidence level).
3.3 Discussion

Weight decay performs poorly on the sigmoid bump problem because the magnitudes of the required bias weights are very different from the magnitudes of the other weights. The target network function thus fits poorly into the assumption that bias weights and non-bias weights all have the same gaussian prior. However, if weight decay is not applied to the bias weights, the problem fits weight decay much better, and the performance is improved. The equivalent treatment of bias and non-bias weights is thus not always the proper thing to do (whereas it is often quite reasonable to expect a function to be smooth). Simply removing the bias weights from the weight decay cost is no "cure for all ills". This is demonstrated in the bilinear problem, where weight decay performs poorly because it lacks any form of interaction between the weights. To fit the bilinear problem, the resulting function must be close to quadratic, which can be done with a neural network of the form
\[ f(W;x) = u_0 + u\,\tanh[vx + v_0] - u\,\tanh[vx - v_0] \tag{20} \]

by expanding it around x = 0, requiring that u_0 = −2u tanh[v_0] and 2uv² tanh[v_0](tanh²[v_0] − 1) = 1. This gives

\[ f(W;x) = x^2 + O(x^4 v^2) , \tag{21} \]

showing that constructing a good quadratic fit from two sigmoids requires that v → 0 and u → ∞. That is, the weights between inputs and hidden units must approach zero while the weights between hidden units and the output must approach infinity. This is completely contrary to weight decay, where weights in different layers are assumed independent.
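The expansion can be checked numerically. The sketch below (ours) uses the two conditions above to set u and u_0, and confirms that the construction approaches x² only as v → 0 while |u| grows like 1/v²:

```python
import numpy as np

def two_tanh_quadratic(x, v, v0=1.0):
    # Enforce 2*u*v^2*tanh(v0)*(tanh(v0)^2 - 1) = 1 and u0 = -2*u*tanh(v0), as in eq. (20).
    u = 1.0 / (2.0 * v**2 * np.tanh(v0) * (np.tanh(v0)**2 - 1.0))
    u0 = -2.0 * u * np.tanh(v0)
    return u0 + u * np.tanh(v * x + v0) - u * np.tanh(v * x - v0)

x = np.linspace(-1.0, 1.0, 201)
for v in (0.5, 0.1, 0.01):
    err = np.max(np.abs(two_tanh_quadratic(x, v) - x**2))
    print(v, err)   # the maximum error on [-1, 1] shrinks like v^2, per eq. (21)
```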
Figure 1: (a) Generalization errors on the x_1 x_2 problem, with 40 training data points and a signal-to-noise ratio of 2/3, for different values of the regularization parameter λ and different orders of the smoothing regularizer. For each value of λ, 10 networks have been trained and averaged (geometric average). The best generalization error decreases with increasing order of the smoothing regularizer. The shaded area shows the 95% confidence bands for the average performance of a linear model on the same problem. (b) Similar plot for the m = 3 smoother compared to the standard weight decay method. Error bars mark the estimated standard deviation of the mean generalization error of the 10 networks.
4 Two Real Data Sets: Predicting Flows in Icelandic Rivers

Here we test our smoothing regularizers versus weight decay on the task of predicting tomorrow's flow of water in two Icelandic rivers, Jökulsá and Vatnsdalsá, based on historical information about flow, precipitation in the region, and temperature. The problem is obviously nonlinear because of the difference between above freezing and below freezing temperatures. The data, plotted in figure 2, are listed and described in (Tong 1990). They cover the period 1972-1974, corresponding to 1096 points. We use data from 1972 and 1973 for training (731 examples) and data from 1974 for testing (365 examples). On both problems, we try two different representations: 3 lags and 8 lags of the three input variables (past flow, precipitation and temperature), corresponding to 9 inputs and 24 inputs respectively. These lags were chosen by optimizing the prediction performance of a linear model over the test period. We use networks with one internal layer with 12 units (tanh activation), 9 or 24 inputs depending on the representation, and one linear output. The experimental procedure is the same as described for the synthetic data sets, except for the fact that only 10 networks are trained using the optimal λ value. The results from using different regularizers are compared using the Wilcoxon test (Kendall & Stuart 1972). Two methods are considered different if the equivalence hypothesis is rejected with 95% confidence.
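As an illustration of the lagged input representation (a sketch under our own assumptions about how the series are stored; it is not the authors' code), the design matrix for one-step-ahead prediction can be built as:

```python
import numpy as np

def lagged_design(flow, precip, temp, n_lags):
    # Predict tomorrow's flow from n_lags past values of flow, precipitation and
    # temperature: n_lags = 3 gives 9 inputs, n_lags = 8 gives 24 inputs.
    series = np.column_stack([flow, precip, temp])
    X, y = [], []
    for t in range(n_lags - 1, len(flow) - 1):
        lags = series[t - n_lags + 1 : t + 1][::-1]   # days t, t-1, ..., t-n_lags+1
        X.append(lags.ravel())
        y.append(flow[t + 1])
    return np.asarray(X), np.asarray(y)
```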
Figure 2: The river flow data. The upper two panels show the daily average water flows (m³/s) in the rivers Jökulsá and Vatnsdalsá. The lower two panels show the daily average temperature (°C) and the accumulated daily precipitation (mm) in the area.
                                        Weight decay            w/out bias
                                        Signal/Noise            Signal/Noise
  Smoothing regularizer   Sample size   0.67  1.7   3.3         0.67  1.7   3.3
  Global, m = 1             100          –     –     –           –     –     –
                             40          +     –     –           –     –     –
                             20          +     –     –           0     –     –
  Local, m = 1              100          0     0     –           0     0     –
                             40          +     0     0           0     –     –
                             20          +     0     –           0     0     –
  Global, m = 2             100          0     +     +           0     +     +
                             40          +     +     +           +     0     0
                             20          +     +     +           0     +     +
  Local, m = 2              100          +     +     +           +     +     +
                             40          +     +     +           +     +     +
                             20          +     +     +           +     +     +
  Global, m = 3             100          +     +     +           +     +     +
                             40          +     +     +           +     +     +
                             20          +     +     +           0     +     +
  Local, m = 3              100          +     +     +           +     +     +
                             40          +     +     +           +     +     +
                             20          +     +     +           0     +     +
Table 2: Pairwise performance comparisons on the two dimensional x_1 x_2 problem defined in eq. (19). Notation follows table 1.

Table 3 summarizes the results on the river flow prediction. For this problem, a second order smoothing regularizer is preferable over weight decay. Smoothness-regularized networks also produce much better predictions than a committee of 10 networks that are trained using early stopping, where the early stopping time is determined by monitoring the performance on a validation data set (different training and validation data sets are used for the individual committee members).
5 Quality of the Regularizers: Approximations vs Bounds

With appropriate multiplicative factors, eqs. (11) and (16) are strict upper bounds to the smoothness functional S(W,m), eq. (2), in the global and local cases. However, questions arise as to how tight the bounds are and how well the regularizers R(W,m) track the curvature functional S(W,m). If the bounds are not sufficiently tight, then penalizing R(W,m) might not have the effect of penalizing S(W,m). In this section, we discuss the nature of the bounds, present approximations to S(W,m) that are proportional to the bounds, and present empirical results that demonstrate that penalizing R(W,m) does in fact have the effect of penalizing S(W,m) for networks of projective basis units.

Note that for the proposed regularizers R(W,m) to be effective in penalizing S(W,m), we need only have an approximate monotonic relationship between them. In fact, we will argue and demonstrate empirically that an approximate linear relationship between R and S holds.

Ignoring for a moment the effect of a nonzero and finite σ, our derivation involves two steps that can influence the tightness of the bounds. The first is the inequality (4). The second is the bound in (10) for the global form and the bound in (15) for the local form. In the global limit σ → ∞, the bound (10) becomes an equality.
  River        Smoothing regularizer    Weight decay       w/out bias         Committee (10)
                                        # inputs: 24   9   # inputs: 24   9   # inputs: 24   9
  Jökulsá      Global, m = 1                       –   0              0   –              +   +
               Global, m = 2                       +   +              +   0              +   +
               Global, m = 3                       +   +              +   –              +   +
  Vatnsdalsá   Global, m = 1                       –   –              +   +              +   +
               Global, m = 2                       +   +              +   +              +   +
               Global, m = 3                       +   –              +   –              +   +

Table 3: Pairwise performance comparisons on the Jökulsá (upper rows) and the Vatnsdalsá (lower rows) river flow prediction problem (single step prediction). The notation is the same as in table 1, with the addition that "Committee (10)" denotes the result from training a committee of 10 networks using early stopping with a validation set. The local smoothing regularizer was not tested on this problem.
5.1 The Uncorrelated Hidden Unit Approximation

The inequality (4) should not introduce significant error for problems in which the internal unit activities are uncorrelated, since the bound can be replaced by an approximation. A set of functions a_j(x) is uncorrelated if:

\[ \int d^D x\, \Omega(x)\, a_j(x)\, a_k(x) \approx 0 \quad \text{for } j \ne k . \tag{22} \]

This yields the following approximate relationship:

\[ \int d^D x\, \Omega(x) \Big( \sum_{j=1}^{N} a_j(x) \Big)^2 \approx \sum_{j=1}^{N} \int d^D x\, \Omega(x)\, \big( a_j(x) \big)^2 . \tag{23} \]

Under the uncorrelated internal unit assumption, the bound of equation (12) for the global case can be replaced by the approximation:

\[ S_G(W,m) \approx I(m)\, R_G(W,m) . \tag{24} \]

Note that the right hand side differs from that in equation (12) only by a factor of N, so this approximation is proportional to the bound.

In general, it is not always the case that the algebraic forms of an approximation and a strict upper bound differ only by a constant factor. For our regularizers, the constant factor N doesn't matter, since it can be absorbed into the regularization parameter λ (along with the value of the integral I(m)). Since λ is selected on a case by case basis, such constant factors are irrelevant. In practical terms then, there is no difference between using the upper bound (12) or the uncorrelated approximation (24).

How good is the uncorrelated internal unit assumption? Our empirical results presented below indicate that it is a good approximation for PBFs when the dimensionality of the input space gets large. The probability of having significant correlation between two internal units decreases exponentially with the input space dimension, and is very small already for moderate numbers of variables. Furthermore, even in low dimensions, the possible positive overlap between internal units decreases for many transfer functions (e.g. sigmoids or gaussians) with increasing order of the derivative. The accuracy of the approximation thus improves with increasing input dimension, and with increasing m. This is a very nice effect, since many real problems deal with many (10 or more) input variables.
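The geometric intuition behind this can be checked directly: random projection vectors in D dimensions have typical cosine overlap of order 1/√D, so hidden units decorrelate quickly as D grows. A quick numerical illustration (ours):

```python
import numpy as np

rng = np.random.default_rng(0)
for D in (2, 10, 50, 200):
    V = rng.standard_normal((1000, D))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # random unit projection vectors
    overlap = np.abs(V[:500] @ V[500:].T)           # |cosine| between pairs of vectors
    print(D, overlap.mean().round(3), round(1.0 / np.sqrt(D), 3))  # mean tracks ~1/sqrt(D)
```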
5.2 Quality of Other Approximations

For the global case, the bound (10) is fairly tight and approaches equality as σ gets large. For the special case of sigmoidal erf[·] units, I(W,m) can be evaluated exactly, and the approximation error in bound (10) can be shown to be O(σ^{-2}). This analysis is presented in Appendix B. Note that the numerical differences between the popular logistic units and appropriately scaled erf[·] units are small.

For the local case, the approximation error in the bound (15) depends on whether the averages

\[ \frac{1}{M} \sum_{i=1}^{M} \left( \frac{d^m g[z_j(x^{(i)})]}{dz^m} \right)^2 \tag{25} \]

for PBFs vary much with j or not. We show in Appendix C that they do not vary much.[8] Thus, the local forms behave well, almost as well as the global forms. This analysis is confirmed by extensive simulation results, which are reported below.
5.3 Empirical Comparisons of R(W,m) vs S(W,m)

For the regularizers R(W,m) to be effective in penalizing S(W,m), an approximate monotonically-increasing relationship must hold between them. The uncorrelated internal unit assumption implies that this relationship is linear. To test for such a linear scaling, we generated a large number of randomly selected networks. For each such network, we computed the values of R(W,m) and performed Monte Carlo integrations to compute S(W,m). For each experiment, we fit a linear model

\[ R(W,m) = \alpha + \beta\, S(W,m) \tag{26} \]

to the data, estimating the parameters α and β. The accuracy of the linear scaling is measured by the linear correlation ⟨RS⟩/√(⟨R²⟩⟨S²⟩). Under the assumption of a linear relationship, the quality of the regularizers can thus be measured. If the linear correlation is high, then using the regularizer R(W,m) will effectively penalize the smoothing functional S(W,m).

Figure 3 shows the correlation between the value of the true functional (2) and our regularizers for networks with 10 input units (D) and 10 internal units (N). The value of S(W,m) is estimated through Monte Carlo integration. That is, we sample 10^5 input data patterns from a gaussian distribution with zero mean and unit variance, and replace the integration with a summation over these points. This is repeated 500 times, picking new random weights each time but keeping the network architecture constant. The network weights are sampled from a uniform distribution of width 10. This ensures that we sample networks that can be very nonlinear.

[8] Note that for RBFs they can potentially vary a lot, due to different RBF widths. This renders the local forms for RBFs useless. Analysis and simulations are presented in Moody & Rögnvaldsson (1996).
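A condensed version of this experiment for m = 1 (a sketch with our own simplifications: bias weights are set to zero and only 100 networks are drawn) is:

```python
import numpy as np

def S_monte_carlo(u, V, n_samples=100_000, seed=0):
    # Monte Carlo estimate of S(W,1) = E_x ||df/dx||^2 with x ~ N(0, I), for
    # f(x) = sum_j u_j tanh(x . v_j); df/dx = sum_j u_j (1 - tanh^2(z_j)) v_j.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, V.shape[1]))
    H = 1.0 - np.tanh(X @ V.T) ** 2      # g'(z_j) for every sample and hidden unit
    grads = (H * u) @ V                  # gradient of f at each sample point
    return np.mean(np.sum(grads**2, axis=1))

rng = np.random.default_rng(1)
R, S = [], []
for _ in range(100):                      # the paper uses 500 random networks
    u = rng.uniform(-5.0, 5.0, size=10)               # width-10 uniform weights
    V = rng.uniform(-5.0, 5.0, size=(10, 10))
    R.append(np.sum(u**2 * np.linalg.norm(V, axis=1)))  # R_G(W,1), eq. (11)
    S.append(S_monte_carlo(u, V))
print(np.corrcoef(R, S)[0, 1])            # expected to be close to 1 (cf. figure 3)
```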
The correlation is very high for both the global and local forms, although the global form is slightly better. For the RBF case, only the global form is correlated with the true functional. To verify that this finding is not spurious, we repeat our Monte Carlo simulations for several different network architectures, using the same method for sampling weights. These results, which are shown in figure 4, show that the same conclusions hold as the number of hidden or input units increases or decreases. As anticipated, the regularizers are better estimates of S(W,m) when the number of inputs grows or when the order m is increased.
The effect of a nonzero or finite σ depends on the particular choice of weighting function. For a gaussian weighting function, the limiting bounds should be correct to second order (i.e. σ² or σ^{-2} for the local or global limits, respectively), which follows from the power series expansion of the gaussian. We show in Appendix B that this is correct for a gaussian weighting function Ω(x) and a network with sigmoidal projective basis units.
6 Weight Decay Type Approaches and Smoothing

6.1 Does Weight Decay Impose Smoothness?

There is no reason to expect a smoothing regularizer to be the best choice of regularizer for all problems, just as there is no reason to expect any other regularizer to always work better than a smoother. We have therefore refrained from benchmark tests comparing our smoothing regularizers to the extensive list of all hitherto proposed regularization terms, and instead chosen to compare only with the most common form, quadratic weight decay. There are, however, qualitative differences between smoothing regularization and any conventional type of weight decay cost that are worth noting.

Quadratic weight decay (Plaut, Nowlan & Hinton 1986) is essentially Hoerl & Kennard's (1970a, 1970b) "ridge regression" applied to nonlinear models. From a Bayesian viewpoint, weight decay imposes a zero mean, spherically symmetric, gaussian prior on the network weights (Lindley & Smith 1972, Buntine & Weigend 1991). The appropriateness of such an ad hoc prior has been questioned in the linear regression case (Smith & Campbell 1980), and it is equally questionable for nonlinear regression (e.g. neural networks). It is sometimes argued, as an alternative view, that quadratic weight decay corresponds to a smoothness constraint. As illustrated in figure 5, this is a misinterpretation if smoothness is measured by functionals of the form (2). Smoothness constraints necessarily create a coupling between the weights in different layers of the network,[9] similar to the one described for the bilinear problem. No such coupling exists in quadratic weight decay. Using weight decay with individual weight decay parameters for weights in different layers, or for individual weights, does not solve this problem since the weights nevertheless are treated as independent.[10]

From the Bayesian perspective, the implied prior when using weight decay with a single weight decay parameter is a spherically symmetric gaussian, as shown on the left in figure 6. The implied prior when using a smoothing regularizer, shown on the right in figure 6, is fundamentally different from this, since the interaction between weights is taken into account.

Including bias weights in the weight decay cost will result in a systematic distortion of the mean network output, which is undesirable. This is why we compare our regularizers to weight decay without bias terms in all our experiments. However, excluding the bias weights does not make weight decay better at smoothing, which is clearly demonstrated by the results on the bilinear problem.

[9] This is easily verified by constructing a neural network with sigmoidal internal units that reproduces e.g. a linear function.
[10] A non-trivial problem when using many individual weight decay parameters is the time spent in selecting good values for them. This would reduce the practical usefulness of such strategies.
[Figure 3: four scatterplots of the regularizer R(W,m) versus the Monte Carlo estimated value of S(W,m) for projection-based tanh units (first and second order; global and local forms), each with a fitted linear regression; the fitted linear correlations in the four panels are 0.992, 0.998, 0.991, and 0.980.]

Figure 3: Linear correlation between S(W,m) and R(W,m) for a neural network architecture with 10 input units, 10 internal tanh PBF units, and one linear output (each point corresponds to a different set of weights). The left column shows results for first order smoothing (m = 1) and the right column shows results for second order smoothing (m = 2). The top row shows the global form of the regularizer R_G(W,m) and the bottom row shows results for the local form R_L(W,m). Note that the correlation coefficients are very close to 1.0, confirming that penalizing R(W,m) effectively penalizes the curvature functional S(W,m).
[Figure 4: linear correlation between S(W,m) and R(W,m) plotted against the ratio N/D (number of hidden units / number of inputs) for networks with 4, 8, 12, and 16 inputs, using projection-based tanh units.]

Figure 4: Linear correlation between S(W,m) and R(W,m) for different network architectures, using tanh PBF units. The left column shows results for first order smoothing (m = 1) and the right column shows results for second order smoothing (m = 2). The top row shows the global form of the regularizer R_G(W,m), and the bottom row shows results for the local form R_L(W,m). In both the first and second order cases, the linear correlations are higher for the global than for the local smoothers. In all cases, however, the regularizers R(W,m) are highly correlated with the curvature functionals S(W,m). Thus, R(W,m) is proportional to S(W,m), and penalizing R(W,m) effectively penalizes S(W,m).
[Figure 5: scatterplots of the quadratic weight decay cost versus Monte Carlo estimates of S(W,m) for projection-based tanh units (fitted linear correlations 0.38 and 0.60), together with plots of the linear correlation coefficient against N/D for networks with 4, 8, 12, and 16 inputs.]

Figure 5: Linear correlation between S(W,m) and the quadratic weight decay cost, using tanh PBF units. The left column shows results for first order smoothness (m = 1) and the right column shows results for second order smoothness (m = 2). The top row shows the result of generating 500 random networks, with 10 input units, 10 internal units, and one linear output, and comparing the Monte Carlo estimated value of S(W,m) to the quadratic weight decay cost. The scatterplots exhibit substantially less correlation than those for the smoothing regularizers R(W,m) shown in figure 3. The bottom row shows how the linear correlation coefficient varies with the network architecture (each point corresponds to 500 random networks). The correlation coefficients are significantly less than those for R(W,m) shown in figure 4. It is quite clear that the weight decay cost ‖W‖² is not a good estimate of S(W,m).
Figure 6: The implied priors for Weight Decay (left hand side) and a first order global Smoothing Regularizer (right hand side). The weight decay prior is a symmetric gaussian, p(u, ‖v‖) ∝ exp[−u² − ‖v‖²]. In contrast, the prior implied by the first order global smoother is a combination of a laplacian and a gaussian with interaction between the weights, p(u, ‖v‖) ∝ exp[−u²‖v‖].
While our smoothing regularizers R(W,m) capture interactions between the input and output weights for each PBF unit, they do not explicitly capture interactions between units. However, our Monte Carlo simulation results described above indicate that the uncorrelated hidden unit assumption is approximately valid, and that interactions between hidden units can thus be safely ignored. The quality of the approximation improves as the ratio of input dimension to number of units, D/N, becomes larger, as a consequence of the fact that two vectors in a high dimensional space are very likely to be orthogonal.

Variations on the theme "weight decay" that have been presented in the neural network literature include weight elimination (Rumelhart 1988, Scalettar & Zee 1988, Chauvin 1990, Weigend, Rumelhart & Huberman 1990), Laplacian pruning (Williams 1995, Ishikawa 1996), and soft weight sharing (Nowlan & Hinton 1992). These all build on the concept of imposing a prior distribution on the model parameters, and their motivation vis-à-vis quadratic weight decay is that a non-gaussian prior on the parameters is more likely to be correct than a gaussian prior. None of these variations contain any form of coupling between different parameters in the model, and they can therefore not be said to correspond to smoothing.
6.2 Comparison to Smoothing Splines and Smoothing RBFs

Smoothing splines (Eubank 1988, Hastie & Tibshirani 1990, Wahba 1990) and smoothing radial basis functions (Powell 1987, Poggio & Girosi 1990, Girosi et al. 1995) impose smoothness by requiring the forms of the hidden units g[·] to be Greens functions of the smoothing operator S(·). The procedure for obtaining the forms of such splines and RBFs can be summarized by:

    Smoothing Functionals + Variational Principle  =>  Splines & RBFs.

In its pure form, the variational approach requires that one basis function be assigned to each data point. (The standard practice of relaxing this requirement, by using fewer basis functions than data points, results in networks that lose the best approximation property. This property is what makes the variational approach most attractive in its pure form.) The resulting network is then a linear combination of basis functions with the output weights obtained by solving a least squares problem. This approach can not be extended in a general way to networks of projective basis functions, since there are no known smoothing functionals for which PBFs are the corresponding Green's functions.

Our approach is substantially different from that of smoothing splines and smoothing RBFs, since we derive algebraically-simple smoothing regularizers for general classes of PBFs. In contrast to the spline/RBF approach, the procedure for obtaining our regularizers can be summarized by:

    PBF Networks + Smoothing Functionals  =>  Simple Regularizers.

Our approach does not require the assignment of one basis function per datum and has the advantage that it can be applied to the types of networks most often used in practice, namely networks of PBFs.
6.3 Other Approaches to Smoothing Neural Networks

Two proposals for smoothing regularizers in the first order case (m = 1) for PBF networks have previously been put forth. Wu & Moody (1996) have derived a smoothing regularizer for both two layer recurrent and feed forward PBF networks, by requiring the model to be robust with respect to small perturbations of the inputs. Van Vuuren (1994) has proposed a form for PBF networks based on a smoothing functional that uses the norm of the gradient, instead of the squared norm as in equation (4). Bishop (1995), Leen (1995), and Wu & Moody (1996) independently pointed out the correspondence between first order smoothing and training with noisy inputs. For second order smoothing (m = 2), Bishop (1993) has proposed an algorithm for imposing smoothness by using a local smoothness requirement, similar to that discussed in section 2.3. We are unaware of any previously presented and demonstrated methods for imposing general smoothness requirements of any order (m = 1, 2, ... and higher) for PBF networks.
7 Summary

Our regularizers R(W,m) are the first general class of mth-order smoothing regularizers to be proposed for projective basis function networks. The regularizers are simple algebraic forms for smoothing functionals of any order:

\[ R_G(W,m) = \sum_{j=1}^{N} u_j^2 \|v_j\|^{2m-1} \qquad \text{(Global Form)} \]
\[ R_L(W,m) = \sum_{j=1}^{N} u_j^2 \|v_j\|^{2m} \qquad \text{(Local Form)} . \tag{27} \]

These regularizers R(W,m) enable the direct enforcement of smoothness without the need for costly Monte Carlo integrations of the smoothness functional S(W,m) defined in eq. (2). The forms of the regularizers differ fundamentally from quadratic weight decay, in that they distinguish the roles of the input weights v_j and output weights u_j, and capture the interactions between them. Our regularizers apply to PBFs with large classes of transfer functions g[·], including sigmoids. Empirical results for both toy problems and real world data demonstrate the efficacy of R(W,m) relative to weight decay. Monte Carlo simulation results confirm that equations (27) are effective in penalizing the smoothness functional S(W,m). Both the global and local forms of R(W,m) are very good, with the global being slightly better. The Monte Carlo results show that weight decay does not directly impose smoothness as defined by S(W,m) for PBFs.

From a Bayesian standpoint, these results suggest that when the assumption of mth-order smoothness is appropriate, priors of the form exp[−R(W,m)] will yield PBF networks that generalize better than networks trained with gaussian priors. In a companion paper (Moody & Rögnvaldsson 1996), we derive and test corresponding regularizers for radial basis functions.
Acknowledgements

Both authors thank Steve Rehfuss and Dr. Lizhong Wu for stimulating input. John Moody thanks Volker Tresp for a provocative discussion at a 1991 Neural Networks Workshop sponsored by the Deutsche Informatik Akademie. We gratefully acknowledge support for this work from ARPA and ONR (grant N00014-92-J4062), NSF (grant CDA-9503968), the Swedish Institute, and the Swedish Research Council for Engineering Sciences (contract TFR-282-95-847).
Appendix A: Requirements for the Global Case

In this appendix, we discuss the necessary requirements on the internal activation functions g[·] for the global case to be valid. In general, the integral

\[ I_j(W,m) = \frac{1}{\sqrt{2\pi}\,\sigma\,\|v_j\|} \int_{-\infty}^{\infty} dz\, \exp\!\left[ -\frac{(z - v_{j0})^2}{2\sigma^2 \|v_j\|^2} \right] \left( \frac{d^m g[z]}{dz^m} \right)^2 \tag{A.1} \]

converges if (d^m g[z_j]/dz_j^m)² is nonsingular and[11]

\[ \left( \frac{d^m g[z_j]}{dz_j^m} \right)^2 \in o\!\left( \frac{\exp[z^2 / 2\sigma^2 \|v_j\|^2]}{|z|} \right) \quad \text{when } z \to \pm\infty . \tag{A.2} \]

The integral can be evaluated in some special cases (see e.g. Appendix B) and it depends on ‖v_j‖.

In the global limit, σ → ∞, we can simplify the integral by replacing the exponential with 1. The integral then converges if (d^m g[z_j]/dz_j^m)² is nonsingular and

\[ \left( \frac{d^m g[z_j]}{dz_j^m} \right)^2 \in o\!\left( \frac{1}{|z|} \right) \quad \text{when } z \to \pm\infty . \tag{A.3} \]

The resulting integral does not depend on ‖v_j‖ and can be treated as a multiplicative constant which gets absorbed into λ. The global limit bound is thus valid for all m ≥ 1 for all functions g[z] that decrease e.g. exponentially, or like (1 + |z|^a)^{-1} with a > 0, or grow like log[1 + |z|]. It is, however, not valid for all m if g[z] increases like |z|^a with a > 0.

The tightness of the upper bound depends on how quickly the exponential approaches zero compared to how quickly (d^m g[z_j]/dz_j^m)² approaches zero. In general, for the bound to be tight it is necessary that max[1, |v_{j0}|] ≪ σ‖v_j‖, up to positive constant factors that depend on the exact form of g[z]. For example, if (d^m g[z]/dz^m)² decays exponentially, say as exp[−α|z|], the exact integral equals the global-limit expression plus a series of correction terms (eq. (A.4)) that are negligible provided that ασ²‖v_j‖² ≫ max[1, |v_{j0}|].

We derive the exact form for the integral I_j(W,m) for the special case of sigmoidal g[z] in Appendix B.

[11] We use the notation o(·) to denote that the ratio of the expression on the left hand side to the expression on the right hand side approaches zero asymptotically.
Appendix B: Evaluation of I(W,m) for Sigmoidal PBFs

For the special case of sigmoidal PBFs in the internal layer of the network, we can use g(z) = erf(z) to compute the integral I_j(W,m) exactly and compare the behaviour of the smoothing functional with our bounds. With a gaussian weighting function of the form (8), some tedious but straightforward algebra yields closed-form expressions for I(W, 2n+1) and I(W, 2n) (eqs. (B.1) and (B.2)): each is I(W,1) multiplied by a finite double sum, over indices k and l, of factorial coefficients and powers of v_0²/(4σ²v² + 1),[12] where

\[ I(W,1) = \frac{4}{\pi \sqrt{4\sigma^2 v^2 + 1}} \exp\!\left[ -\frac{2 v_0^2}{4\sigma^2 v^2 + 1} \right] . \tag{B.3} \]

In the global limit, σ → ∞, we expand these expressions in inverse powers of σ²v² (which requires σ²v² ≫ max[1, v_0²]). Keeping terms of order σ^{-2}‖v‖^{-2} gives

\[ I(W,m) \approx \frac{2\,(2m-3)!!}{\pi \sigma v} \left[ 1 + \frac{1 - v_0^2 (6m - 10)}{8 \sigma^2 v^2 (2m - 3)} \right] , \tag{B.7} \]

where the first term corresponds to the global limit regularizer (11), which is thus correct to order σ^{-2}‖v‖^{-2}.

Note that the local limit, σ → 0, can be treated in a similar manner, which gives

\[ I(W,m) \approx \exp[-2 v_0^2] \left[ P^{(1)}_{2m-1}(v_0) + \sigma^2 \|v\|^2 P^{(2)}_{2m}(v_0) \right] , \tag{B.8} \]

where P^{(1)}_n(v_0) and P^{(2)}_n(v_0) are two different polynomials of order n. The exponential weighting, and the oscillatory nature of the polynomials, supports our decision to ignore the dependence on v_0. Thus, the local limit can be said to be correct to order σ²‖v‖².

[12] The index j on v_j is dropped.
Appendix C: Quality of the Local Form

To simplify the notation, we define

\[ G(m, z_j(x)) \equiv \left( \frac{d^m g[z_j(x)]}{dz_j^m} \right)^2 , \tag{C.1} \]

where

\[ z_j(x) \equiv x^T v_j + v_{j0} . \tag{C.2} \]

With this definition, the quality of the local form

\[ R_L(W,m) = \sum_{j=1}^{N} u_j^2 \|v_j\|^{2m} \tag{C.3} \]

depends upon how much the sum

\[ \frac{1}{M} \sum_{i=1}^{M} G(m, z_j(x^{(i)})) \tag{C.4} \]

varies with j. If it varies very little, then the local form is a good approximation. If it varies a lot, then the local form is a poor approximation. To simplify the evaluation of (C.4), we work in the limit M → ∞, such that

\[ \frac{1}{M} \sum_{i=1}^{M} G(m, z_j(x^{(i)})) \;\to\; \int d^D x\, p(x)\, G(m, z_j(x)) \equiv \bar{G}(m, j) , \tag{C.5} \]

where p(x) is the input data probability density function. To avoid unnecessary complications, we also assume that the input data follow a zero mean gaussian distribution

\[ p(x) = \left( \frac{1}{s\sqrt{2\pi}} \right)^D \exp\!\left[ \frac{-\|x\|^2}{2 s^2} \right] . \tag{C.6} \]

Sigmoidal Projective Basis Functions

Instead of the usual hyperbolic tangent, we employ g(z) = erf(z) as our sigmoid function (tanh and erf are quite similar functions). This simplifies the derivatives considerably and enables us to write

\[ G(m, z_j) = \frac{4}{\pi}\, H^2_{m-1}(z_j)\, \exp[-2 z_j^2] , \tag{C.7} \]

where H_n(z) are Hermite polynomials of order n. Plugging (C.7) into eq. (C.5), using the gaussian p(x), and integrating out all dimensions that are orthogonal to v_j results in

\[ \bar{G}(m, j) = \frac{4}{\pi \sqrt{2\pi}\, s \|v_j\|} \exp\!\left[ \frac{-v_{j0}^2}{2 s^2 \|v_j\|^2} \right] \int_{-\infty}^{\infty} dt\, H^2_{m-1}(t)\, \exp\!\left[ -t^2\, \frac{4 s^2 \|v_j\|^2 + 1}{2 s^2 \|v_j\|^2} + t\, \frac{v_{j0}}{s^2 \|v_j\|^2} \right] , \tag{C.8} \]

where the bar denotes averaging over the input distribution. Evaluating the integral gives a closed-form result (eq. (C.9)): a finite sum over k = 0, ..., m−1 of terms involving powers of s²‖v_j‖² and Hermite polynomials H_{2m−2k−2} evaluated at v_{j0}/√(8s⁴‖v_j‖⁴ + 6s²‖v_j‖² + 1), multiplied by a gaussian factor in v_{j0}.

Since (C.9) only depends on the magnitude ‖v_j‖, the expectation value of G̅(m,j), evaluated over the weight distribution, should vary little with j if the number of input variables D is large, as a consequence of the law of large numbers. For example, if the probability density of the v_{jk} weights is spherically symmetric (excluding the bias weights for the moment), then we can do the transformation

\[ \langle \bar{G}^n(m,j) \rangle = \int d^D v\, p(\|v\|)\, \bar{G}^n(m, \|v\|) = \frac{2 \pi^{D/2}}{\Gamma(D/2)} \int dv\, v^{D-1} p(v)\, \bar{G}^n(m, v) , \tag{C.10} \]

where v = ‖v‖ and ⟨·⟩ denotes averaging over the weight distribution. The index j is dropped on v since we are taking an expectation over the distribution of weights. If the weight distribution is gaussian and D is large (larger than 10), then the integrand will be sharply peaked so that ⟨G̅^n(m,j)⟩ ≈ ⟨G̅(m,j)⟩^n, resulting in a small variance w.r.t. the weights v_{jk}.

To get some insight into how expression (C.9) scales with the weights, and to see the effect of bias weights, we assume that s²‖v_j‖² > 1. Expression (C.9) then simplifies to a single exponential factor exp[−v_{j0}²/(2s²‖v_j‖²)] times a finite sum over k of terms of order (s²‖v_j‖²)^{−(m−k)} H_{2m−2k−2}(v_{j0}/(√8 s²‖v_j‖²)) (eq. (C.11)). Furthermore, if v_{j0} < s²‖v_j‖², then (ignoring factors that depend only on m) the weight-independent term dominates the sum, and the scaling is

\[ \bar{G}(m,j) \propto \frac{1}{s \|v_j\|} \exp\!\left[ \frac{-v_{j0}^2}{2 s^2 \|v_j\|^2} \right] , \tag{C.12} \]

which, as expected, resembles the scaling of the corresponding global case. If the bias weights v_{j0} follow a gaussian distribution with variance σ_0², then we have

\[ \langle \bar{G}^n(m,j) \rangle = \frac{1}{s^n \|v\|^n\, \sigma_0 \sqrt{2\pi}} \int dv_0\, \exp\!\left[ -v_0^2 \left( \frac{1}{2\sigma_0^2} + \frac{n}{2 s^2 \|v\|^2} \right) \right] = \frac{s^{1-n} \|v\|^{1-n}}{\sqrt{s^2 \|v\|^2 + n \sigma_0^2}} . \tag{C.13} \]

Since ‖v‖² ∝ D, it is likely that σ_0 ≪ s‖v‖, so the variance due to bias weights will supposedly be very small as well. Furthermore, the variance will decrease as D^{-1}.

In summary, the local smoothing regularizer for networks with projective basis functions is an acceptable approximation. This is mainly due to the fact that ‖v_j‖ has essentially the same value for all hidden units (the law of large numbers). The variation between units caused by different bias weight values should also, at least in reasonable cases, be small.
References

Abu-Mostafa, Y. (1995), 'Hints', Neural Computation 7(4), 639–671.

Bishop, C. (1993), 'Curvature-driven smoothing: A learning algorithm for feedforward networks', IEEE Trans. Neural Networks 4, 882–884.

Bishop, C. (1995), 'Training with noise is equivalent to Tikhonov regularization', Neural Computation 7(1), 108–116.

Buntine, W. L. & Weigend, A. S. (1991), 'Bayesian back-propagation', Complex Systems 5, 603–643.

Chauvin, Y. (1990), Dynamic behavior of constrained back-propagation networks, in D. Touretzky, ed., 'Advances in Neural Information Processing Systems 2', Morgan Kaufmann Publishers, San Francisco, CA, pp. 642–649.

Eubank, R. L. (1988), Spline Smoothing and Nonparametric Regression, Marcel Dekker, Inc.

Friedman, J. H. & Stuetzle, W. (1981), 'Projection pursuit regression', J. Amer. Stat. Assoc. 76(376), 817–823.

Geman, S., Bienenstock, E. & Doursat, R. (1992), 'Neural networks and the bias/variance dilemma', Neural Computation 4(1), 1–58.

Girosi, F., Jones, M. & Poggio, T. (1995), 'Regularization theory and neural networks architectures', Neural Computation 7, 219–269.

Hastie, T. J. & Tibshirani, R. J. (1990), Generalized Additive Models, Vol. 43 of Monographs on Statistics and Applied Probability, Chapman and Hall.

Hoerl, A. & Kennard, R. (1970a), 'Ridge regression: applications to nonorthogonal problems', Technometrics 12, 69–82.

Hoerl, A. & Kennard, R. (1970b), 'Ridge regression: biased estimation for nonorthogonal problems', Technometrics 12, 55–67.

Ishikawa, M. (1996), 'Structural learning with forgetting', Neural Networks 9(3), 509–521.

Kendall, M. G. & Stuart, A. (1972), The Advanced Theory of Statistics, third edn, Hafner Publishing Co., New York.

Leen, T. (1995), From data distributions to regularization in invariant learning, To appear in Neural Computation, 1995.

Lindley, D. V. & Smith, A. F. M. (1972), 'Bayes estimates for the linear model', Journal of the Royal Statistical Society B 34(302), 1–41.

Moody, J. E. & Yarvin, N. (1992), Networks with learned unit response functions, in J. E. Moody, S. J. Hanson & R. P. Lippmann, eds, 'Advances in Neural Information Processing Systems 4', Morgan Kaufmann Publishers, San Mateo, CA, pp. 1048–55.

Moody, J. & Rögnvaldsson, T. (1996), Smoothing regularizers for radial basis function networks, Manuscript in preparation.

Nowlan, S. & Hinton, G. (1992), 'Simplifying neural networks by soft weight-sharing', Neural Computation 4, 473–493.

Plaut, D., Nowlan, S. & Hinton, G. (1986), Experiments on learning by back propagation, Technical Report CMU-CS-86-126, Carnegie-Mellon University.

Poggio, T. & Girosi, F. (1990), 'Networks for approximation and learning', IEEE Proceedings 78(9).

Powell, M. (1987), Radial basis functions for multivariable interpolation: a review, in J. Mason & M. Cox, eds, 'Algorithms for Approximation', Clarendon Press, Oxford.

Riedmiller, M. & Braun, H. (1993), A direct adaptive method for faster backpropagation learning: The RPROP algorithm, in H. Ruspini, ed., 'Proc. of the IEEE Intl. Conference on Neural Networks', San Francisco, California, pp. 586–591.

Rumelhart, D. E. (1988), Learning and generalization, in 'Proc. of the IEEE Intl. Conference on Neural Networks', San Diego. (Plenary address).

Scalettar, R. & Zee, A. (1988), Emergence of grandmother memory in feed forward networks: learning with noise and forgetfulness, in D. Waltz & J. Feldman, eds, 'Connectionist Models and Their Implications: Readings from Cognitive Science', Ablex Pub. Corp.

Smith, G. & Campbell, F. (1980), 'A critique of some ridge regression methods', Journal of the American Statistical Association 75(369), 74–103.

Tong, H. (1990), Non-linear Time Series: A Dynamical System Approach, Clarendon Press, Oxford.

van Vuuren, S. H. J. (1994), Neural network correlates with generalization, Master's thesis, University of Pretoria. http://www.cse.ogi.edu/sarelv

Wahba, G. (1990), Spline models for observational data, CBMS-NSF Regional Conference Series in Applied Mathematics.

Weigend, A., Rumelhart, D. & Huberman, B. (1990), Back-propagation, weight-elimination and time series prediction, in T. Sejnowski, G. Hinton & D. Touretzky, eds, 'Proceedings of the connectionist models summer school', Morgan Kaufmann Publishers, San Mateo, CA, pp. 105–116.

Williams, P. M. (1995), 'Bayesian regularization and pruning using a Laplace prior', Neural Computation 7, 117–143.

Wu, L. & Moody, J. (1996), 'A smoothing regularizer for feedforward and recurrent networks', Neural Computation 8(2).