Neurocomputing 35 (2000) 73–90

Adaptive learning algorithms to incorporate additional functional constraints into neural networks

So-Young Jeong, Soo-Young Lee*

Department of Electrical Engineering & Brain Science Research Center, Computation & Neural Systems Laboratory, Korea Advanced Institute of Science and Technology, 373-1 Kusong-dong, Yusong-gu, Taejon 305-701, South Korea

Received 14 December 1998; revised 3 April 1999; accepted 10 April 2000

Abstract

In this paper, adaptive learning algorithms to obtain better generalization performance are proposed. We specifically designed cost terms for the additional functionality based on the first- and second-order derivatives of neural activation at hidden layers. In the course of training, these additional cost functions penalize the input-to-output mapping sensitivity and high-frequency components in training data. A gradient-descent method results in hybrid learning rules that combine error back-propagation, Hebbian rules, and simple weight-decay rules. However, the additional computational requirements beyond the standard error back-propagation algorithm are almost negligible. Theoretical justifications and simulation results are given to verify the effectiveness of the proposed learning algorithms. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Adaptive learning algorithm; Mapping sensitivity; Curvature smoothing; Time-series prediction

1. Introduction

Multi-layer neural networks have been successfully applied to complicated pattern classification and function approximation problems. However, a proper matching of the underlying problem complexity and the network complexity is crucial

* Corresponding author. Tel.: +82-42-869-3431; fax: +82-42-869-8570. E-mail address: [email protected] (S.-Y. Lee).
0925-2312/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0925-2312(00)00296-4


to good generalization capability [1]. While smaller networks are not capable of representing the problem accurately, networks with too many synaptic weights actually suffer from overfitting and result in poor generalization [3,9,11,12]. To avoid overfitting and to obtain better generalization, as shown in Fig. 1, one usually reduces the network complexity by synaptic weight decay, weight elimination, and weight-sharing methods [4,5,8,10,16,17]. On the other hand, one may increase the problem complexity by incorporating additional functional constraints. For robust classification, low input-to-output mapping sensitivity was enforced by an additional cost term with the first-order derivatives of sigmoid activation functions at hidden neurons [7]. The increased problem complexity may also be achieved, with much heavier computational requirements, by injecting noise into the training data [13]. In this paper, we further extend the low-sensitivity algorithm to time-series prediction problems. Also, another adaptive learning algorithm is proposed with an additional cost term based on the second-order derivatives of sigmoid activation functions at the hidden layer. The additional cost terms penalize the input-to-output mapping sensitivity and the high-frequency noisy components contained in training data, respectively. Learning algorithms derived from the gradient-descent method happen to consist of two popular learning algorithms, i.e., standard error back-propagation and Hebbian learning. Weight-decay terms also appear in the learning algorithm with the second derivatives. Only slight modifications are made to the standard back-propagation algorithm, and the additional computational requirements are almost negligible. In Sections 2 and 3, the two learning algorithms are described with theoretical justifications. Experimental results are given for two real-world time-series benchmark data sets in Section 4. Finally, a brief concluding remark follows in Section 5.

Fig. 1. Two methods to overcome overfitting problems.


2. Hybrid-I algorithm for low input-to-output sensitivity

For many supervised learning algorithms, the sum-of-squares error criterion has played a major role in the cost function. By minimizing this cost function, neural networks learn the input-to-output mapping defined by training data. In addition to this simple mapping, we would like to take into account the derivatives of the mapping function. To implement additional functionality such as low mapping sensitivity, the first-order derivatives are used here.

2.1. Additional cost term for Hybrid-I algorithm

To improve robustness and fault tolerance, one would like to reduce the sensitivity of output values to input values. Low output sensitivity is quite advantageous for robust classification and approximation even with noisy input data and/or malfunctioning synapses and neurons. In particular, for iterated time-series prediction problems, the predicted outputs at one time step are used as inputs for the following time steps. The prediction errors may accumulate at each iteration, which results in large prediction errors and occasionally causes instability. Therefore, one would like to reduce the error accumulation factor ∂y/∂x, i.e., the sensitivity of the input-to-output mapping, because low sensitivity reduces error accumulation. It also helps to avoid overfitting, which frequently shows up as high sensitivity in the mapping. By applying the chain rule to an L-th-layer feedforward neural network, one obtains the sensitivity as

$$\frac{\partial y_i}{\partial x_k} = \sum_{j_{L-1}} \cdots \sum_{j_1} w^{L,L-1}_{i j_{L-1}} w^{L-1,L-2}_{j_{L-1} j_{L-2}} \cdots w^{1,0}_{j_1 k}\, f'_L(\hat{y}_i)\, f'_{L-1}(\hat{h}^{L-1}_{j_{L-1}}) \cdots f'_1(\hat{h}^1_{j_1}), \tag{1}$$

where $x_k$ and $y_i$ denote the $k$-th element of the input vector and the $i$-th element of the output vector, respectively, and $w^{l,l-1}_{j_l j_{l-1}}$ is the synaptic interconnection from the $j_{l-1}$-th neuron at the $(l-1)$-th layer to the $j_l$-th neuron at the $l$-th layer. $f'_l(\cdot)$ is the derivative of the sigmoid nonlinear function $f_l(\cdot)$ at the $l$-th layer, $\hat{h}^l_{j_l} = \sum_{j_{l-1}}^{N_{l-1}} w^{l,l-1}_{j_l j_{l-1}} h^{l-1}_{j_{l-1}}$ is the post-synaptic neural activation of the $j_l$-th element at the $l$-th layer, and $h^l_{j_l} = f_l(\hat{h}^l_{j_l})$ is the corresponding neural activation. For simplicity, it is assumed that all neurons at the $l$-th layer have the same sigmoid function $f_l(\cdot)$.

Although the sum of squares of the sensitivities over all $k$ indices could be defined as an additional cost term, we adopt a much simpler term here for both numerical efficiency and easy interpretation. The derivative $f'_l(\hat{h}^l_{j_l})$ may become exponentially small for large values of $\hat{h}^l_{j_l}$, and plays a dominant role in reducing the sensitivity. Therefore, the sensitivity can be made smaller by forcing the hidden-layer activations to be saturated, i.e., $f'_l(\hat{h}^l_{j_l}) \ll 1$. Also, the hyperplanes represented by hidden-layer neurons move far from the training data, so slight changes in the synaptic weights and data do not make much difference, which contributes to robust generalization ability.
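The chain-rule sensitivity of Eq. (1) can be sketched numerically. The snippet below, a minimal sketch assuming a two-layer network with tanh hidden units and a linear output (the weights and sizes are made up, not the paper's), forms the Jacobian ∂y_i/∂x_k and checks it against central finite differences:

```python
import numpy as np

# Sketch of Eq. (1) for a two-layer network (tanh hidden layer, linear output).
# All names and sizes here are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(4, 3))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(2, 4))   # hidden -> output weights

def forward(x):
    h_hat = W1 @ x                        # post-synaptic values
    return W2 @ np.tanh(h_hat), h_hat

def sensitivity(x):
    """dy_i/dx_k by the chain rule: sum_j W2_ij * f'(h_hat_j) * W1_jk."""
    _, h_hat = forward(x)
    f_prime = 1.0 - np.tanh(h_hat) ** 2   # derivative of tanh
    return (W2 * f_prime) @ W1            # Jacobian, shape (2, 3)

x = np.array([0.3, -0.1, 0.8])
J = sensitivity(x)

# Verify each Jacobian column against central finite differences
eps = 1e-6
for k in range(3):
    dx = np.zeros(3)
    dx[k] = eps
    col = (forward(x + dx)[0] - forward(x - dx)[0]) / (2 * eps)
    assert np.allclose(J[:, k], col, atol=1e-6)
```

The same nested-sum structure extends to deeper networks by inserting one `(W_l * f_prime_l)` factor per layer.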

76

S.-Y. Jeong, S.-Y. Lee / Neurocomputing 35 (2000) 73}90

A new cost function is defined as

$$E = E_o + \sum_{l=1}^{L-1} c_l E^l = \frac{1}{M}\sum_s E^s_o + \sum_{l=1}^{L-1} c_l \frac{1}{M}\sum_s E^{sl}, \tag{2}$$

where $E^s_o = (1/2N_L)\sum_i (t^s_i - y^s_i)^2$ and $E^{sl} = (1/N_l)\sum_{j_l}^{N_l} f'_l(\hat{h}^{sl}_{j_l})$ denote the normalized output error and the additional hidden-layer penalty term at the $l$-th layer for the $s$-th training pattern, respectively. The $t^s_i$ and $y^s_i$ are the target and actual output values of the $i$-th output neuron for the $s$-th stored pattern, respectively, and $\hat{h}^{sl}_{j_l}$ is the corresponding post-synaptic value for the $j_l$-th element at the $l$-th layer. Here, $M$, $N_L$, and $N_l$ are the number of stored patterns, the number of output neurons, and the number of neurons at the $l$-th layer, respectively. If the neural activation of the hidden layer stays in the linear region of the sigmoid function, it is sensitive to post-synaptic values and a high hidden-layer error $E^{sl}$ is assigned. It is worth noting that both the output error $E^s_o$ and the hidden-layer error $E^{sl}$ are normalized to take similar values, i.e., around 0.5 for hyperbolic-tangent sigmoid functions with very small initial synapses. By minimizing $E^{sl}$, one may push the network into the nonlinear region for improved robustness. $c_l$ represents the relative significance of the hidden-layer cost over the output error.

The network is trained by a steepest-descent error minimization algorithm as usual. Although the last layer is not affected by the additional cost term, the partial derivative of the total cost $E$ with respect to each synaptic weight in the other layers now contains additional terms, and the weight update becomes

$$\Delta w^{l,l-1}_{j_l j_{l-1}} = -\eta_l \frac{\partial E}{\partial w^{l,l-1}_{j_l j_{l-1}}} = \eta_l\, h^{l-1}_{j_{l-1}} \delta^l_{j_l}, \tag{3}$$

where $\delta^l_{j_l}$ is the sensitivity of the total cost $E$ to the post-synaptic neural activation $\hat{h}^l_{j_l}$ at the $l$-th layer and may be calculated by back-propagation as

$$\delta^l_{j_l} \equiv -\frac{\partial E}{\partial \hat{h}^l_{j_l}} = \left[\sum_{j_{l+1}}^{N_{l+1}} \delta^{l+1}_{j_{l+1}} w^{l+1,l}_{j_{l+1} j_l} + \frac{c_l}{M N_l} h^l_{j_l}\right] f'_l(\hat{h}^l_{j_l}), \tag{4}$$

where the sensitivity $\delta^L_i$ is defined at the output layer as usual, i.e., $\delta^L_i \equiv (1/M N_L)(t_i - y_i) f'_L(\hat{y}_i)$. $\eta_l$ is the learning coefficient for the $l$-th-layer synapses, and $f' = (1 - f^2)/2$ is used for the bipolar sigmoid $f(x) = (1 - e^{-x})/(1 + e^{-x})$. Here the superscript $s$ has been removed for simplicity. From Eq. (4), one notices the two components of the weight updates, i.e., the back-propagated sensitivity (or error, in the literature) and the Hebbian-learning term $(c_l/M N_l)\, h^{l-1}_{j_{l-1}} h^l_{j_l} f'_l(\hat{h}^l_{j_l})$. We call this new algorithm "Hybrid-I".
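One Hybrid-I descent step can be sketched as follows. This is a single-pattern (M = 1), single-hidden-layer sketch; the sizes, seed, and variable names are illustrative assumptions, and the bipolar sigmoid is the one defined above. The combined cost of Eq. (2) should decrease after one small gradient step:

```python
import numpy as np

# One Hybrid-I update step (Eqs. (2)-(4)) for a single pattern, M = 1.
# Sizes, seed, and variable names are illustrative assumptions.
rng = np.random.default_rng(1)
N_in, N_h, N_out = 3, 5, 2
W1 = rng.normal(scale=0.3, size=(N_h, N_in))
W2 = rng.normal(scale=0.3, size=(N_out, N_h))
eta, c = 0.05, 0.01                      # learning rate, hidden-cost weight

f = lambda x: (1 - np.exp(-x)) / (1 + np.exp(-x))  # bipolar sigmoid
fp = lambda v: (1 - v**2) / 2                       # f' expressed through f

x = np.array([0.5, -0.2, 0.9])
t = np.array([0.8, -0.8])

def cost(W1, W2):
    h = f(W1 @ x); y = f(W2 @ h)
    E_o = np.sum((t - y)**2) / (2 * N_out)   # normalized output error
    E_1 = np.sum(fp(h)) / N_h                # hidden-layer penalty, Eq. (2)
    return E_o + c * E_1

h = f(W1 @ x)
y = f(W2 @ h)

delta2 = (t - y) * fp(y) / N_out                  # output-layer sensitivity
delta1 = (W2.T @ delta2 + (c / N_h) * h) * fp(h)  # Eq. (4): EBP + Hebbian term

E_before = cost(W1, W2)
W2 = W2 + eta * np.outer(delta2, h)               # Eq. (3)
W1 = W1 + eta * np.outer(delta1, x)
E_after = cost(W1, W2)
assert E_after < E_before   # one exact-gradient descent step reduces E
```

The only change relative to plain back-propagation is the `(c / N_h) * h` term inside `delta1`, which is the Hebbian component noted above.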
2.2. Robustness of Hybrid-I algorithm

In this section, we further extend previous discussions on the robustness and small input-to-output mapping sensitivity of the "Hybrid-I" learning algorithm [7]. We simplify the proof using the statistical expectation operator. Here it is assumed


that the network is a two-layer feedforward neural network with sigmoidal hidden neurons and linear output neurons, respectively. If the input vector $\mathbf{x}$ is modified by $\Delta\mathbf{x}$, the change of the output may be approximated as

$$\Delta y_i = y_i(\mathbf{x} + \Delta\mathbf{x}) - y_i(\mathbf{x}) \approx \sum_k \frac{\partial y_i}{\partial x_k}\,\Delta x_k = \sum_j W^2_{ij} f'(\hat{h}_j) \sum_k W^1_{jk}\,\Delta x_k. \tag{5}$$

For improved robustness and better generalization capability, smaller values of $\Delta y_i / y_i$ are preferred. Although the simple weight-decay algorithm tries to obtain smaller synaptic weights, smaller weights usually result in smaller $\hat{h}_j$ and larger sigmoid derivatives in general. The synaptic weights ($W^1$, $W^2$) and the sigmoid derivatives $f'(\hat{h}_j)$ are closely coupled in training, and should be considered together. By comparing Eq. (5) with $y_i = \sum_j W^2_{ij} f(\hat{h}_j)$, a smaller $\Delta y_i / y_i$ may be obtained from a smaller value of

$$S^x_j \equiv \frac{f'(\hat{h}_j)}{f(\hat{h}_j)} \sum_k W^1_{jk}\,\Delta x_k = \frac{f'(\hat{h}_j)}{f(\hat{h}_j)} \sum_k W^1_{jk}\, x_k\, r^x_k, \tag{6}$$

where $S^x_j$ is a newly defined sensitivity index for the input perturbation, and $r^x_k \equiv \Delta x_k / x_k$. This index denotes the amount of change in the output values when the inputs are perturbed, and thus it is desirable to reduce this quantity in order to obtain more robust outputs.

For bipolar binary inputs, $\Delta x_k = -2x_k$ or $0$, and $r^x_k = -2$ or $0$. If the probability of a non-zero perturbation ($r^x_k = -2$) is $d$, the expectation value of the random variable $r^x_k$ becomes $-2d$. Then Eq. (6) results in

$$\langle S^x_j \rangle = \frac{f'(\hat{h}_j)}{f(\hat{h}_j)} \sum_k W^1_{jk}\, x_k \langle r^x_k \rangle = R^x S_o(\hat{h}_j), \tag{7}$$

where $\langle\cdot\rangle$ is the expectation operator, $R^x \equiv \langle r^x_k \rangle$, and

$$S_o(\hat{h}_j) \equiv \frac{f'(\hat{h}_j)\,\hat{h}_j}{f(\hat{h}_j)} = \frac{2\hat{h}_j}{\exp(\hat{h}_j) - \exp(-\hat{h}_j)}. \tag{8}$$

Fig. 2 shows the functional shapes of $S_o(\hat{h})$ and $2f'(\hat{h})$. Both have similar Gaussian-like functional forms. $S_o(\hat{h})$ has a maximum at $\hat{h} = 0$ and two minima at $\hat{h} = \pm\infty$. For large values of $\hat{h}$, $S_o(\hat{h})$ becomes exponentially small. Therefore, a larger $\hat{h}$ results in better generalization capability. $S_o(\hat{h})$ shows a similar functional dependence to $f'(\hat{h})$, which justifies the proposed additional cost term in Eq. (2).
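The shape of the sensitivity index in Eq. (8) is easy to verify numerically; the small sketch below (function name `S0` is ours) confirms the properties described above: a maximum of 1 at the origin, symmetry, and decay for large |ĥ|.

```python
import numpy as np

def S0(h):
    """Sensitivity index of Eq. (8): S0(h) = 2h / (exp(h) - exp(-h))."""
    h = np.asarray(h, dtype=float)
    out = np.ones_like(h)              # limit value at h = 0 is 1
    nz = h != 0
    out[nz] = 2 * h[nz] / (np.exp(h[nz]) - np.exp(-h[nz]))
    return out

h = np.array([0.0, 0.5, 2.0, 4.0, -4.0])
s = S0(h)
assert np.isclose(s[0], 1.0)           # maximum at h = 0
assert s[3] < s[2] < s[1] < 1.0        # monotone decay for growing |h|
assert np.isclose(s[3], s[4])          # even function, minima at +/- infinity
```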

3. Hybrid-II algorithm for small effective frequency

For many function approximation problems one is not interested in the detailed high-frequency behavior of the output values, and curvature smoothing and/or low-pass filtering is usually performed at the post-processing stage. However, these


Fig. 2. Sensitivity index S_o(x) and the first-order derivative of the sigmoid function (2.0·f'(x)). The former and the latter are denoted by solid and dash-dotted lines, respectively.

characteristics may also be incorporated into the neural network as an additional cost term. Bishop suggested an additional cost term to explicitly constrain the magnitudes of the second-order derivatives of the network outputs over the inputs [2]. By penalizing large curvatures, one expects a more regularized network and hence better generalization. This constraint, however, cannot be directly applied within the standard back-propagation procedure, and it requires heavy computational overheads in proportion to the number of layers. Here we propose a somewhat different smoothness constraint by introducing the notion of frequency filtering, and derive a computationally efficient learning algorithm easily encoded with standard back-propagation learning.

3.1. Additional cost term for Hybrid-II algorithm

Let us note that a signal y(x) with frequency w, i.e., sin wx or cos wx, is a general solution of the second-order differential equation

$$\frac{d^2 y}{dx^2} + w^2 y = 0, \tag{9}$$

and an effective frequency of y(x) may be defined as

$$\Omega = \sqrt{-\frac{d^2 y/dx^2}{y}}. \tag{10}$$
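The definition in Eq. (10) can be checked numerically: for y(x) = sin(wx), the effective frequency recovers w. The sketch below uses a central finite difference for the second derivative (the evaluation point and step are arbitrary choices of ours):

```python
import numpy as np

# Numerical check of Eq. (10): for y(x) = sin(w x),
# Omega = sqrt(-y''/y) recovers the frequency w.
w, x0, eps = 3.0, 0.7, 1e-4                # x0 chosen so y(x0) != 0
y = np.sin(w * x0)
# central finite difference for y''(x0)
ypp = (np.sin(w * (x0 + eps)) - 2 * y + np.sin(w * (x0 - eps))) / eps**2
omega = np.sqrt(-ypp / y)
assert abs(omega - w) < 1e-3
```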

It is easy to derive the effective frequency of a single-layer neural network. Let us assume that $x_k$, $y_i$, and $w_{ik}$ denote the $k$-th input neuron, the $i$-th output neuron, and the synaptic weights, respectively. The nonlinear activation function of the output neuron is the bipolar sigmoid with the property $f'' = -f f'$. With this notation, we can calculate the effective frequency as follows:

$$\Omega^2 = -\frac{\partial^2 y_i/\partial x_k^2}{y_i} = -\frac{f''(\hat{y}_i)\sum_k [w_{ik}]^2}{y_i} = \frac{f'(\hat{y}_i)\, f(\hat{y}_i)\sum_k [w_{ik}]^2}{f(\hat{y}_i)} = f'(\hat{y}_i)\sum_k [w_{ik}]^2. \tag{11}$$

One notices that the squared effective frequency can be expressed as the product of the slope of the activation function and the squared synaptic weights. If one uses linear neurons instead of sigmoidal neurons, the effective frequency takes the usual form of a weight-decay term.

It is straightforward to extend this to multi-layer architectures. An extended structure can be regarded as a series of low-pass filters interconnected layer by layer. Hence, the layer with the smallest cut-off frequency dominates the overall characteristics of the low-pass filter. Considering the effective frequency at each layer, the following additional cost function is added as an extra term to the usual mean-squared errors:

$$E = \frac{1}{M}\sum_s E^s_o + \sum_l c_l \frac{1}{M}\sum_s E^{ls}_h, \tag{12}$$

where $E^{ls}_h = (1/N_l)\sum_{j_l}^{N_l} f'_l(\hat{h}^l_{j_l})\left[\sum_{j_{l-1}}^{N_{l-1}} w^2_{j_l j_{l-1}}\right]$ is the proposed error term and $c_l$ again denotes the relative significance between the two error terms. For computational simplicity, we omit the pattern index $s$ and separate the additional cost function into

$$E^l_h = E^l_a E^l_b, \tag{13}$$

where $E^l_a = (1/N_l)\sum_{j_l}^{N_l} f'_l(\hat{h}^l_{j_l})$ and $E^l_b = \sum_{j_{l-1}}^{N_{l-1}} w^2_{j_l j_{l-1}}$ denote the derivative of the post-synaptic neural activation and the synaptic weight energy, respectively. The amount of weight change for each epoch becomes

$$\Delta w^{l,l-1}_{j_l j_{l-1}} = \eta_l \left[\delta^l_{j_l} h^{l-1}_{j_{l-1}} - c_l f'_l(\hat{h}^l_{j_l})\, w^{l,l-1}_{j_l j_{l-1}}\right], \tag{14}$$

where $\delta^l_{j_l} = [\sum_{j_{l+1}}^{N_{l+1}} \delta^{l+1}_{j_{l+1}} w^{l+1,l}_{j_{l+1} j_l} + c_l h^l_{j_l} E^l_b]\, f'_l(\hat{h}^l_{j_l})$ is the negative derivative of the total cost $E$ with respect to the $j_l$-th neural activation at the $l$-th layer. From Eq. (14), one can see that the weight-update formula incorporates three popular learning algorithms, i.e., error back-propagation, Hebbian rules, and simple weight-decay learning. These learning rules compete with each other to optimize the synaptic weights in the course of learning, and we call this new algorithm "Hybrid-II".
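The structure of Eq. (14) can be sketched for one hidden layer as below. The back-propagated sensitivity `delta` is a random stand-in rather than a real error signal, and all names and sizes are illustrative assumptions; the point is the saturation-gated decay term, which shrinks every weight but shrinks the weights of saturated neurons (small f') more slowly.

```python
import numpy as np

# Sketch of the Hybrid-II weight update, Eq. (14), for one hidden layer.
# `delta` is a random stand-in for the back-propagated sensitivity.
rng = np.random.default_rng(2)
N_prev, N_l = 3, 4
W = rng.normal(scale=0.5, size=(N_l, N_prev))
h_prev = rng.normal(size=N_prev)          # activations of layer l-1
eta, c = 0.1, 0.05

h_hat = W @ h_prev                        # post-synaptic values
f = (1 - np.exp(-h_hat)) / (1 + np.exp(-h_hat))
f_prime = (1 - f**2) / 2                  # f' of the bipolar sigmoid

delta = 0.1 * rng.normal(size=N_l)        # stand-in back-propagated error
dW = eta * (np.outer(delta, h_prev)       # EBP term (Hebbian part lives in delta)
            - c * f_prime[:, None] * W)   # saturation-gated weight decay

# With delta = 0 the rule is pure decay scaled by f'(h_hat): every weight
# shrinks, but saturated neurons (small f') keep their weights longer.
W_decayed = W - eta * c * f_prime[:, None] * W
assert np.all(np.abs(W_decayed) <= np.abs(W))
assert dW.shape == W.shape
```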

80

S.-Y. Jeong, S.-Y. Lee / Neurocomputing 35 (2000) 73}90

3.2. Bayesian interpretation of Hybrid-II algorithm

From a Bayesian perspective, an additional cost function can be interpreted as the negative logarithm of the prior probability distribution of the weights [4,17]. Here we investigate how the weight distribution varies with respect to regularizing functions, particularly variants of weight-decay regularizers. The Hybrid-I algorithm mainly focuses on increasing the problem complexity rather than reducing the network complexity, but the Hybrid-II algorithm tries to obtain a more pruned network by introducing a weight-decay term. Thus, the Hybrid-II regularizer is compared with other weight-decay regularizers herein.

One of the simplest weight-decay forms is the sum of squared magnitudes of the weights (L2 norm):

$$E_c(\mathbf{w}) = \frac{1}{2}\sum_i w_i^2. \tag{15}$$

Eq. (15) can be derived by taking the negative logarithm of a Gaussian distribution over the weights; that is, the quadratic weight-decay regularizer corresponds to a Gaussian prior. Large weights are penalized and only small weights are encouraged. This weight-decay method has proved useful for improving generalization, but it has the drawback of decaying all weights at the same rate regardless of their size. Uniform weight decay may cause slow learning because essential large weights, as well as small weights, are forced to decay toward zero. Therefore, the cost function should be corrected so as to decay small weights more rapidly than large weights. Eq. (16) shows a possible alternative,

$$E_c(\mathbf{w}) = \sum_i \frac{w_i^2}{1 + w_i^2}. \tag{16}$$

A regularizing function as given in Eq. (16) takes a similar functional form to $\log(1 + w^2)$, and thus the prior weight distribution can be regarded as a Cauchy distribution [17]. A more general form, called weight elimination, was proposed by Weigend as follows [16]:

$$E_c(\mathbf{w}) = \sum_i \frac{w_i^2/w_0^2}{1 + w_i^2/w_0^2}, \tag{17}$$

where $w_0$ is a free parameter that determines whether many small weights or a few large weights are preferred. If $w_i^2/w_0^2$ is much larger than 1, the cost function approaches unity and prefers a small number of significant weights. On the contrary, if $w_i^2/w_0^2$ is much smaller than 1, many small weights are preferred [5]. Recently, another variant of weight decay, called the Laplace prior, has drawn much attention for its pruning property. The sum of absolute weight values (L1 norm) is used as the Laplace prior [4,17]:

$$E_c(\mathbf{w}) = \sum_i |w_i|. \tag{18}$$
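The four penalties of Eqs. (15)–(18) can be placed side by side on a weight grid; the sketch below (the value w0 = 2.0 is an arbitrary choice of ours) makes the qualitative difference concrete: the L2 and L1 penalties grow without bound, while the Cauchy and weight-elimination penalties saturate toward 1, so the marginal cost of enlarging an already-large weight vanishes.

```python
import numpy as np

# Eqs. (15)-(18) evaluated on a weight grid; w0 = 2.0 is illustrative.
w = np.linspace(-4, 4, 401)
l2     = 0.5 * w**2                            # Eq. (15), Gaussian prior
cauchy = w**2 / (1 + w**2)                     # Eq. (16), Cauchy prior
w0 = 2.0
welim  = (w**2 / w0**2) / (1 + w**2 / w0**2)   # Eq. (17), weight elimination
l1     = np.abs(w)                             # Eq. (18), Laplace prior

assert l2[-1] == 8.0 and l1[-1] == 4.0         # unbounded growth at w = 4
assert np.isclose(welim[-1], (16 / 4) / (1 + 16 / 4))  # saturating, = 0.8
assert cauchy[-1] < 1.0 and welim[-1] < 1.0    # both bounded by 1
```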


As described above, the Gaussian and Laplace priors try to minimize the magnitudes of the weights, while the Cauchy prior allows large weights as well as small weights. The main difference between the priors described above and Hybrid-II is the way large weights are allowed. For notational simplicity, the following Hybrid-II cost function is considered:

$$E_c = \frac{f'}{2}\sum_i w_i^2. \tag{19}$$

It favors large weights only when the corresponding hidden activation is saturated, i.e., when $f'$ is small. The derivative of the hidden activation, $f'$, can be regarded as a scaling parameter that controls whether weights are scaled up or down during the learning process. Fig. 3 shows the functional shapes of the weight-decay variants with respect to the weights. In quadratic weight decay (L2 norm) and absolute weight decay (L1 norm), only small weights are encouraged because there is only one extremum, at zero. In weight elimination, large weights are favored as well as small weights because there are three extrema, i.e., a minimum at zero and maxima at large positive and negative weights. Also, how strongly large weights are encouraged depends on the somewhat ad hoc parameter $w_0$. However, the Hybrid-II algorithm can systematically accommodate large weights according to the degree of hidden saturation.

3.3. Learning characteristics of Hybrid-II

We check the validity of the Hybrid-II algorithm applied to a noisy function approximation problem. Fig. 4(a) shows the function to be approximated; it has the

Fig. 3. Additional cost terms versus synaptic weights. Here, solid, dash-dotted, dashed, and dotted lines denote the L2-norm weight decay, the L1-norm weight decay, the weight elimination, and the Hybrid-II cost, respectively.


Fig. 4. 3-D function identification results with the Hybrid-II algorithm. (a) Original function; (b) noisy training function; (c) standard EBP algorithm; (d) quadratic weight-decay algorithm; (e) Hybrid-II algorithm.

form of a harmonic function, i.e., z = sin xy, consisting of 4096 points evaluated at equally spaced grid points in the x–y plane. After adding Gaussian noise with zero mean and 0.5 standard deviation to this desired function, we selected 256 points randomly and used them as training patterns, as shown in Fig. 4(b). We used

S.-Y. Jeong, S.-Y. Lee / Neurocomputing 35 (2000) 73}90

83

a (2-15-1)-sized multi-layer perceptron network and fully trained the network five times with different initial weights. For the selection of c, the focus should be more on reducing error misfits than on the hidden-layer cost in the initial stage of learning; the role of the additional term is then emphasized more and more while keeping the error misfits from increasing too much. With this guideline in mind, the parameter c was initially set to zero and slowly increased to 0.01 by the 10 000th epoch. This strategy is applied to all the simulations presented later. Comparing the generalization ability shown in Figs. 4(c)–(e), the proposed Hybrid-II algorithm approximates a smoother function due to the low-pass filtering of the high-frequency components, whereas the standard EBP algorithm learned the noisy components as well as the desired function and resulted in poor generalization. After the networks were fully trained, it is noticeable in Fig. 5 that the quadratic weight-decay algorithm allows only small weights while the standard EBP algorithm does not constrain the magnitudes of the weights. On the other hand, the Hybrid-II algorithm favors large weights, but only when the connected hidden activations are deeply saturated, as well as small weights.
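The training set for this experiment can be reproduced along the following lines. A 64 × 64 grid gives the stated 4096 points; the grid range and random seed are assumptions of ours, since the paper does not state them.

```python
import numpy as np

# Data-generation sketch for the Fig. 4 experiment.
# Grid range and seed are assumptions; 64 x 64 = 4096 points as stated.
rng = np.random.default_rng(0)
n = 64
xs = np.linspace(-2.0, 2.0, n)
X, Y = np.meshgrid(xs, xs)
Z = np.sin(X * Y)                                  # target surface z = sin(xy)
Z_noisy = Z + rng.normal(0.0, 0.5, size=Z.shape)   # zero-mean, sigma = 0.5 noise

idx = rng.choice(n * n, size=256, replace=False)   # 256 random training points
train_x = np.column_stack([X.ravel()[idx], Y.ravel()[idx]])
train_t = Z_noisy.ravel()[idx]
assert train_x.shape == (256, 2) and train_t.shape == (256,)
```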

Fig. 5. Derivative of hidden-neuron activation versus input-to-hidden synaptic weights for test patterns after training. (a) Standard EBP algorithm; (b) quadratic weight-decay algorithm; (c) Hybrid-II algorithm. Unlike the weight-decay algorithm, the Hybrid-II algorithm allows relatively larger synaptic weights for saturated hidden neurons.


To explicitly explain the role of the effective frequency in predicting time series, we chose the sine map and cosine map given by

$$y_{n+1} = \sin(w y_n), \qquad y_{n+1} = \cos(w y_n). \tag{20}$$

Since both equations have the same effective frequency, Ω = w, the meaning of reducing the effective frequency is revealed by investigating the role of the frequency w in the sine and cosine map equations, respectively. Fig. 6 shows the bifurcation diagrams with respect to changes in w. The chaotic regime starts at about w = 2.7 in the sine map, and at about w = 2 in the cosine map. It can be inferred that the smaller the effective frequency, the less deep the chaotic regime [6]. Therefore, it
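The qualitative behavior behind the bifurcation diagrams can be sketched by iterating the sine map of Eq. (20) directly. In the sketch below (initial condition and iteration counts are arbitrary choices of ours), a sub-critical w collapses to a fixed point while a super-critical w keeps the orbit spread out:

```python
import numpy as np

def sine_map_orbit(w, y0=0.4, n_transient=500, n_keep=200):
    """Iterate y_{n+1} = sin(w * y_n), Eq. (20), discarding transients."""
    y = y0
    for _ in range(n_transient):
        y = np.sin(w * y)
    orbit = np.empty(n_keep)
    for i in range(n_keep):
        y = np.sin(w * y)
        orbit[i] = y
    return orbit

# Below the reported onset of chaos (~w = 2.7) the orbit settles down;
# above it the orbit stays spread out.
calm = sine_map_orbit(2.0)
wild = sine_map_orbit(3.0)
assert np.std(calm) < 1e-8    # converged to a fixed point
assert np.std(wild) > 0.05    # spread-out, irregular orbit
```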

Fig. 6. Bifurcation diagrams. (a) Sine map; (b) cosine map. Both diagrams show that the effective frequency should be small to avoid chaotic behavior.


Fig. 7. Results with iterative predictions for the superimposed sine series. While the standard EBP algorithm may result in chaotic behavior, the Hybrid-II algorithm successfully avoids it.

would be possible to obtain a time series exhibiting less chaotic behavior if the effective frequency is reduced during the training process.

In the third example, Fig. 7 demonstrates the meaning of the effective frequency in a simple iterated-step prediction problem. Training data are generated from the following equation:

$$y(t) = \sin t + 0.1 \sin 10t. \tag{21}$$

A low-frequency-dominant sinusoidal signal is mixed with a high-frequency signal. The total number of training data points is 250, and the same number of data points is used for testing. Two networks were fully trained and tested with varying c in the Hybrid-II algorithm. It is noticeable that the predicted sine series oscillates beyond 150 time points when only the EBP algorithm is used, as shown in Fig. 7(a), but the network with reduced effective frequency does not exhibit this phenomenon. Fig. 7(b) clearly shows that the predicted series is an almost pure sinusoidal signal when the network is trained with the Hybrid-II algorithm.

The beauty of the proposed learning algorithms may reside in their simplicity and straightforward interpretation in terms of sensitivity and effective frequency. It is worth noting that the second terms in Eqs. (4) and (14) are the only modifications to the standard back-propagation algorithm, and the additional computational requirements are almost negligible. Other learning algorithms utilizing first- and second-order derivatives need extra signal feedforwards and/or back-propagations [9,2].
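The series of Eq. (21) and the iterated (closed-loop) prediction mechanic can be sketched as follows. The sampling step is an assumption of ours, and the "model" is a toy stand-in that merely repeats the last window value; the point is the feedback loop, in which every prediction error is fed back into the next input.

```python
import numpy as np

# Training series of Eq. (21) and the closed-loop prediction loop.
# Sampling step and window length are assumptions; the model is a toy.
t = np.arange(250) * 0.1
series = np.sin(t) + 0.1 * np.sin(10 * t)

def iterated_predict(model, seed_window, n_steps):
    """Feed each prediction back as the next input (iterated prediction)."""
    buf = list(seed_window)
    d = len(seed_window)
    preds = []
    for _ in range(n_steps):
        y = model(np.array(buf[-d:]))
        preds.append(y)
        buf.append(y)        # any error here accumulates in later steps
    return np.array(preds)

# Toy stand-in model that repeats the last window value.
preds = iterated_predict(lambda win: win[-1], series[:5], 10)
assert preds.shape == (10,)
assert np.allclose(preds, series[4])
```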

4. Real-world benchmark results

To demonstrate the improved generalization ability of the hybrid algorithms, we tested them on two real-world benchmark problems, i.e., the sunspot time series and the chaotic laser pulsation data from the Santa Fe competition data set A.


Predicting the sunspot number series has been widely used as a time-series benchmark. Sunspot data from the year 1700 to 1920 are used for training. After the year 1920, we divided the test data into four intervals, i.e., from 1921 to 1955, 1956 to 1979, 1980 to 1994, and finally 1921 to 1994. To compare the generalization ability with other methods, we used a (12-8-1)-sized network. Fig. 8 shows the single-step prediction results for EBP learning and hybrid learning, respectively. The prediction results are summarized in Table 1 for comparison with other methods. Among these, the fifth method, early stopping, represents the best achievable performance when the error back-propagation algorithm is used, and the Hybrid-II learning algorithm gives the best performance in this table.

Finally, the hybrid learning algorithms were applied to the iterated prediction of the chaotic laser pulsation data from the Santa Fe competition data set A. Although the standard EBP algorithm provides reasonable accuracy for single-step prediction, iterated prediction results suffer from accumulated errors. Besides, iterated prediction is the only way to predict beyond the available training data period. The first 1000
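For a (12-8-1) single-step predictor, each input vector is a sliding window of 12 consecutive past values. The sketch below shows this construction on a synthetic stand-in series (the real sunspot values are not reproduced here):

```python
import numpy as np

# Sliding-window construction for a (12-8-1) single-step predictor:
# 12 consecutive past values predict the next one.
def make_windows(series, d=12):
    X = np.array([series[i:i + d] for i in range(len(series) - d)])
    y = series[d:]
    return X, y

# 221 values, a stand-in for the 1700-1920 training record.
series = np.sin(0.3 * np.arange(221))
X, y = make_windows(series)
assert X.shape == (209, 12) and y.shape == (209,)
assert np.allclose(X[0], series[:12]) and y[0] == series[12]
```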

Fig. 8. Results with single-step prediction for the sunspot series data. (a) Standard EBP algorithm; (b) Hybrid-I algorithm; (c) Hybrid-II algorithm.


Table 1
Mean squared errors for the sunspot time-series data

Learning algorithm   Train        Test         Test         Test         Test
                     1700–1920    1921–1955    1956–1979    1980–1994    1921–1994
Trivial              0.292        0.416        0.94         0.785        0.661
LAR                  0.128        0.126        0.36         0.306        0.238
TAR                  0.097        0.097        0.28         0.306        0.197
EBP                  0.057        0.180        0.234        0.365        0.235
ES                   0.120        0.103        0.244        0.184        0.165
Wnet                 0.082        0.086        0.35         0.313        0.219
SSnet                —            0.077        —            —            —
Wan                  0.082        0.066        0.26         0.200        0.156
Hybrid I             0.058        0.124        0.224        0.126        0.157
Hybrid II            0.121        0.061        0.205        0.116        0.119

Note: For prediction values, [Trivial] makes use of the predicted result at the previous step, [LAR] is 12th-order linear auto-regression, and [TAR] [14] represents threshold auto-regression. [EBP] means error back-propagation, and [ES] is EBP with early stopping at 1000 epochs. [Wnet] [16] denotes EBP with the weight-decay algorithm, and [SSnet] [10] represents EBP with the soft weight-sharing algorithm. Finally, [Wan] [15] represents TAR combined with Wnet and a DC-rectified trained net.

step data points are used for training, and the following 200 steps are used for the iterated prediction test. Fig. 9 shows the iterated-step prediction results with standard back-propagation. Small prediction errors at time t introduce amplified errors at time t+1 and later times. By reducing the input-to-output mapping sensitivity, the hybrid learning algorithms greatly reduce these accumulated errors up to the 60th time step. At this point the pulsation mode changes, and it is almost impossible to predict correctly with a single predictor. Multiple predictors may be used, first to identify the pulsation mode and then to apply a proper predictor for the identified mode. Even so, high prediction accuracy within a single pulsation mode is essential. The results of the standard back-propagation algorithm in Fig. 9(a) do not predict this mode change. However, the hybrid learning algorithms are able to point out this mode change. Although large errors still exist, they even extract the essential tendency of the time series for a prolonged period, as shown in Figs. 9(b) and (c). The prediction results are summarized in Table 2. The results are better than many previous results, but worse than the best results in the Santa Fe competition, which predicted the mode change first and utilized different networks for each mode. Since we use only the standard multi-layer perceptron architecture, while the best results may come from a recurrent network, direct comparison of the errors may not be fair. Here we would like to point out that the developed hybrid learning algorithms greatly improve iterated prediction performance compared to the standard error back-propagation learning algorithm.


Fig. 9. Results with iterative prediction for the laser pulsation data. (a) Standard EBP algorithm; (b) Hybrid-I algorithm; (c) Hybrid-II algorithm.

Table 2
Mean squared errors for the chaotic laser pulsation data

MSE                  EBP       Hybrid I   Hybrid II
Up to 60th step      0.7797    0.1337     0.1532
Up to 100th step     1.3345    0.3659     0.3276
Up to 200th step     3.1009    0.8302     1.0718

5. Conclusion

In this paper, we proposed two adaptive learning algorithms for additional functionality such as input-to-output mapping sensitivity and curvature smoothing. In the course of learning, the "Hybrid-I" algorithm reduces the input-to-output mapping sensitivity, and the "Hybrid-II" algorithm penalizes high-frequency components. Theoretical and experimental results on time-series prediction benchmarks showed better generalization performance.


Acknowledgements

This research was supported by the Korean Ministry of Science and Technology as a Brain Science and Engineering Research Program (Braintech'21).

References

[1] E. Baum, D. Haussler, What size net gives valid generalization? Neural Comput. 1 (1989) 151–160.
[2] C.M. Bishop, Curvature-driven smoothing: a learning algorithm for feedforward networks, IEEE Trans. Neural Networks 4 (1993) 882–884.
[3] M. Cottrell, B. Girard et al., Neural modeling for time series: a statistical stepwise method for weight elimination, IEEE Trans. Neural Networks 6 (1992) 1355–1364.
[4] C. Goutte, L.K. Hansen, Regularization with a pruning prior, Neural Networks 10 (1997) 1053–1059.
[5] A. Gupta, S.M. Lam, Weight decay backpropagation for noisy data, Neural Networks 11 (1998) 1127–1137.
[6] R.C. Hilborn, Chaos and Nonlinear Dynamics, Oxford University Press, Oxford, 1994.
[7] D.-G. Jeong, S.-Y. Lee, Merging back-propagation and Hebbian learning rules for robust classifications, Neural Networks 9 (1996) 1213–1222.
[8] E.D. Karnin, A simple procedure for pruning back-propagation trained neural networks, IEEE Trans. Neural Networks 1 (1990) 239–242.
[9] Y. LeCun, J.S. Denker, S.A. Solla, Optimal brain damage, in: Advances in Neural Information Processing Systems, Vol. 2, Morgan Kaufmann, San Mateo, CA, 1990, pp. 598–605.
[10] S.J. Nowlan, G.E. Hinton, Simplifying neural networks by soft weight sharing, Neural Comput. 4 (1992) 473–493.
[11] T. Poggio, F. Girosi, Regularization algorithms for learning that are equivalent to multilayer networks, Science 247 (1990) 978–982.
[12] R. Reed, Pruning algorithms - a survey, IEEE Trans. Neural Networks 4 (1993) 740–747.
[13] W.S. Sarle, Stopped training and other remedies for overfitting, Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 1995, pp. 352–360.
[14] H. Tong, K. Lim, Threshold autoregression, limit cycles and cyclical data, J. Roy. Stat. Soc. 42 (1980) 245–292.
[15] E.A. Wan, Combining fossil and sunspot data: committee prediction, Proceedings of the International Conference on Neural Networks, Houston, 1997, pp. 2176–2180.
[16] A.S. Weigend, D.E. Rumelhart, B.A. Huberman, Generalization by weight-elimination applied to currency exchange rate prediction, Proceedings of the International Conference on Neural Networks, Seattle, 1991, pp. 837–841.
[17] P.M. Williams, Bayesian regularization and pruning using a Laplace prior, Neural Comput. 7 (1995) 117–143.

So-Young Jeong received the B.S. and M.S. degrees in Electrical Engineering from KAIST, Korea, in 1996 and 1998, respectively. Since 1998, he has been a Ph.D. candidate with the Computation and Neural Systems Laboratory at KAIST. His research interests include the development of learning algorithms, neural network modeling, and time-series prediction.


Soo-Young Lee received the B.S. degree in Electronics Engineering from Seoul National University, Korea, in 1975, the M.S. degree in Electrical Engineering from KAIST, Korea, in 1977, and the Ph.D. degree in Electrophysics from the Polytechnic Institute of New York, NY, in 1984. He was with the Taihan Engineering Company, Korea, from 1977 to 1980. From 1980 to 1982, he was a senior research fellow at the Microwave Research Institute, Polytechnic Institute of New York. In 1982, he joined the General Physics Corp., Columbia, MD, as a staff scientist. In 1986, he accepted an appointment as an Assistant Professor in the Department of Electrical Engineering at KAIST, where he is currently a Professor. Since 1997 he has also served as Director of the Brain Science Research Center, the main research organization of the Korean Brain Science and Engineering Research Program for the next 10 years. His current research interests reside in intelligent information processing systems based on biological brain information processing mechanisms. Speech processing and recognition are the main applications, and both theoretical and VLSI hardware issues are investigated. He is currently President-Elect of the Asia-Pacific Neural Network Assembly and Co-Chair of SIGINNS-Korea.