Applied Mathematics and Computation 174 (2006) 34–50 www.elsevier.com/locate/amc

Improved constrained learning algorithms by incorporating additional functional constraints into neural networks

Fei Han a,b,*, De-Shuang Huang a

a Intelligent Computing Lab., Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, P.O. Box 1130, Hefei, Anhui 230031, China
b Department of Automation, University of Science and Technology of China, Hefei 230027, China

Abstract

In this paper, two improved constrained learning algorithms that are able to guarantee better generalization performance are proposed. Both are substantially on-line learning algorithms. The additional cost term of the first improved algorithm is selected based on the first-order derivatives of the neural activations at the hidden layers, while that of the second improved algorithm is selected based on the second-order derivatives of the neural activations at the hidden layers and the output layer. In the course of training, these additional cost terms penalize the input-to-output mapping sensitivity or the high-frequency components included in the training data. Finally, theoretical justifications and simulation results are given to verify the efficiency and effectiveness of the two proposed learning algorithms.
© 2005 Elsevier Inc. All rights reserved.

* Corresponding author. Intelligent Computing Lab., Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, P.O. Box 1130, Hefei Anhui 230031, China. E-mail address: [email protected] (F. Han).

0096-3003/$ - see front matter © 2005 Elsevier Inc. All rights reserved. doi:10.1016/j.amc.2005.04.073


Keywords: Feedforward neural network; On-line constrained learning algorithm; Generalization; Mapping sensitivity; High-frequency components; Time series prediction

1. Introduction

Good generalization performance is one of the most important goals in designing a neural network model. A proper match between the complexity of the underlying problem and the complexity of the network is crucial for improving the network's generalization capability [1]. It is well known that although a network with too many synaptic weights can solve the involved problem exactly, it may suffer from overfitting and thus yield poor generalization performance. On the other hand, while a smaller network cannot solve the problem as accurately, it can achieve better generalization [2,3]. So, in practical applications, a proper compromise should be made between generalization capability and network complexity. To avoid overfitting, one usually has to reduce the network complexity by pruning synaptic weights, or increase the problem complexity by incorporating additional functional constraints into the network [4–10].

In literature [11], a learning algorithm was proposed that supplemented the training phase of feedforward neural networks (FNN) with architectural constraints and additional conditions representing desirable information about the learning process. This method can improve the convergence and generalization properties and efficiently avoid local minima through prompt activation of the hidden units, optimal alignment of successive weight vector offsets, elimination of excessive hidden nodes, and regulation of the magnitude of search steps in the weight space. In literatures [12–15], several new constrained learning algorithms (CLA) were proposed by coupling a priori information from the problem into the cost function defined at the network output layer. As a result, solutions for the involved problems of finding the roots of polynomials can be obtained very easily. In addition, in literature [16], the authors proposed a method of injecting noise into the training data, which can reduce the occurrence of overfitting. Although the approaches above can alleviate overfitting to some extent, they greatly increase the additional computational requirements. On the other hand, several pruning algorithms that reduce the network complexity were also proposed, including the synaptic weight elimination strategy [17], the weight decay method [18], and the weight sharing technique [7,19]. Although these algorithms can also avoid overfitting to some extent, they were not specifically designed for the purpose of achieving better generalization ability.


Usually, low input-to-output mapping sensitivity can be enforced by an additional cost term involving the first-order derivatives of the sigmoid activation functions at the hidden neurons [5]. The corresponding learning algorithm is referred to as the Hybrid-I method. In this algorithm, the additional cost terms based on the first-order derivatives of the neural activations at the hidden layers were designed to penalize the input-to-output mapping sensitivity in the course of training. By forcing the hidden layer neurons to work in saturating mode, this algorithm can increase the problem complexity and improve the generalization capability. Another generic method incorporated an additional cost term based on the second-order derivatives of the sigmoid activation functions at the hidden layers and the output layer into the cost function defined at the network output layer. As a result, a new CLA, referred to as the Hybrid-II method, was derived [20]. In this algorithm, the additional cost term penalizes the high-frequency components included in the training data in the course of training, so that the network generalization performance is also increased.

Nevertheless, these two learning algorithms are off-line learning ones, so they cannot be applied to on-line processing problems. Moreover, their computational requirements are large. In the course of one training epoch, first, the synaptic weight update for every stored pattern must be computed. Second, after computing the synaptic weight updates for all stored patterns, their mean values are calculated, and the new synaptic weights are obtained by adding these mean update values to the previous synaptic weights. This process is repeated in the next training epoch until the training target is satisfied. Indeed, experimental results confirmed that the computational requirements of these two learning algorithms are relatively large.

In this paper, we propose two improved constrained learning algorithms based on the Hybrid-I and Hybrid-II algorithms, referred to as the improved Hybrid-I and improved Hybrid-II algorithms, respectively. These two improved algorithms are capable of performing on-line learning: during training, when a stored pattern is input, the synaptic weights are updated immediately by adding the modification terms to the previous weight values, and this process is executed repeatedly until the training target is satisfied. The two improved algorithms inherit the features of the original Hybrid-I and Hybrid-II ones but have much simpler cost terms and, consequently, lower computational complexity. Through experiments, it can be found that the generalization performance of the two improved algorithms is significantly better than that of the two original Hybrid-I and Hybrid-II ones.

The rest of this paper is organized as follows. Section 2 describes the improved Hybrid-I algorithm, and Section 3 presents the improved Hybrid-II


algorithm. Experimental results for the benchmark data from a superimposed sine series and two real-world time series are given in Section 4. Finally, concluding remarks are included in Section 5.

2. Improved Hybrid-I algorithm with low input-to-output sensitivity

Most traditional learning algorithms for feedforward neural networks use the sum-of-squares error criterion. Their common characteristic is to adapt the synaptic weights until the activations of the network's output layer units match prespecified target values. However, these traditional learning algorithms do not consider the network structure and the properties of the involved problem, so their capabilities are limited. To obtain good generalization capability, literature [20] proposed incorporating the first-order derivatives of the neural activations at the hidden layers into the sum-of-squares error cost function to reduce the input-to-output mapping sensitivity. First, let us introduce the fundamental principle of this method.

2.1. Additional cost term for low input-to-output sensitivity

Consider an FNN with one input layer, L - 1 hidden layers, and one output layer. The units in each layer apart from the input layer receive input from all units in the previous layer. For simplicity, we adopt the same activation function for all neurons at the output and hidden layers, i.e., the tangent sigmoid transfer function

f(n) = \frac{1 - \exp(-2n)}{1 + \exp(-2n)}.   (1)

It can be deduced that this activation function has the following property:

f''(n) = -2 f(n) f'(n).   (2)

Before presenting the input-to-output sensitivity, we introduce the following mathematical notation. Assume that x_k and y_i denote the kth element of the input vector and the ith element of the output vector, respectively; w_{j_l j_{l-1}} denotes the synaptic weight from the j_l th hidden neuron at the lth layer to the j_{l-1} th hidden neuron at the (l-1)th layer; w_{i j_{L-1}} denotes the synaptic weight from the ith neuron at the output layer to the j_{L-1} th hidden neuron at the (L-1)th layer; w_{j_1 k} denotes the synaptic weight from the j_1 th hidden neuron at the first layer to the kth element of the input vector; f'_l(\cdot) is the derivative of the sigmoid function f_l(\cdot) at the lth layer; h_{j_l} = f_l(\hat{h}_{j_l}) is the activation of the j_l th element at the lth layer, with \hat{h}_{j_l} = \sum_{j_{l-1}} w_{j_l j_{l-1}} h_{j_{l-1}}; t_i and y_i denote the target and actual output values of the ith neuron at the output layer, respectively; N_l denotes the number of neurons at the lth layer. According to literature [5], for an L-layer feedforward neural network, the sensitivity of y_i to x_k can be defined as


\frac{\partial y_i}{\partial x_k} = \sum_{j_1,\ldots,j_{L-1}} w_{i j_{L-1}} w_{j_{L-1} j_{L-2}} \cdots w_{j_1 k} f'_L(\hat{y}_i) f'_{L-1}(\hat{h}_{j_{L-1}}) \cdots f'_1(\hat{h}_{j_1}).   (3)

From this equation, it can be deduced that as \hat{h}_{j_l} becomes larger, the derivative f'_l(\hat{h}_{j_l}) may decrease sharply. As a result, low input-to-output sensitivity will be achieved. To obtain a simpler and more effective additional cost term than that of the Hybrid-I algorithm, a new cost function including the additional input-to-output sensitivity term is defined as follows:

E = \frac{1}{2N_L} \sum_{i=1}^{N_L} (t_i - y_i)^2 + \sum_{l=1}^{L-1} \gamma_l E^h_l,   (4)

where E^h_l = \frac{1}{N_l} \sum_{j_l=1}^{N_l} f'(\hat{h}_{j_l}) denotes the additional hidden layer penalty term at the lth layer, and the gain \gamma_l represents the relative significance of the hidden layer cost over the output error. Assuming that the network is trained by a steepest-descent error minimization algorithm as usual, the synaptic weight update becomes

\Delta w_{j_l j_{l-1}} = -\eta_l \frac{\partial E}{\partial w_{j_l j_{l-1}}} = \eta_l \delta_{j_l} h_{j_{l-1}}, \quad l = 1,\ldots,L,   (5)

where \delta_{j_l} denotes the sensitivity of the total cost E to \hat{h}_{j_l} at the lth layer. The sensitivity of the total cost E to \hat{h}_{j_l} at a hidden layer, i.e., \delta_{j_l}, can be computed in back-propagation style as follows:



\delta_{j_l} = -\frac{\partial E}{\partial \hat{h}_{j_l}} = \left( \sum_{j_{l+1}=1}^{N_{l+1}} \delta_{j_{l+1}} w_{j_{l+1} j_l} \right) f'(\hat{h}_{j_l}) - \frac{\gamma_l}{N_l} f''(\hat{h}_{j_l}), \quad l = 1,\ldots,L-1.   (6)

The sensitivity of the total cost E to \hat{h}_{j_L} at the output layer, i.e., \delta_{j_L}, can be calculated as follows:

\delta_{j_L} = \frac{1}{N_L} f'(\hat{h}_{j_L}) (t_{j_L} - y_{j_L}).   (7)

From the above equations, it can be noted that the updating of the synaptic weights consists of two components, i.e., the back-propagation term and the Hebbian term. This method is referred to as the improved Hybrid-I algorithm.

2.2. An analysis of low input-to-output sensitivity for the improved Hybrid-I algorithm

In this section, we further discuss the robustness and small input-to-output mapping sensitivity of the improved Hybrid-I learning algorithm. For simplicity, we consider a single-layered neural network with tangent sigmoid neurons. If the input vector x is modified by \Delta x, the change \Delta y_i at the ith output neuron may be approximated as

\Delta y_i = y_i(x + \Delta x) - y_i(x) \approx \sum_k \Delta x_k \frac{\partial y_i}{\partial x_k} = \sum_k w_{ik} f'(\hat{y}_i) \Delta x_k.   (8)

Consequently, \Delta y_i / y_i can be computed as follows:

\frac{\Delta y_i}{y_i} = \frac{\sum_k w_{ik} f'(\hat{y}_i) \Delta x_k}{f(\hat{y}_i)} = \frac{f'(\hat{y}_i)}{f(\hat{y}_i)} \sum_k w_{ik} x_k \frac{\Delta x_k}{x_k} = \frac{f'(\hat{y}_i) \hat{y}_i}{f(\hat{y}_i)} \cdot \frac{\Delta x_k}{x_k}.   (9)

We define g(\hat{y}_i) as

g(\hat{y}_i) = \frac{f'(\hat{y}_i) \hat{y}_i}{f(\hat{y}_i)} = \frac{4 \hat{y}_i}{\exp(2\hat{y}_i) - \exp(-2\hat{y}_i)},   (10)

where g(\hat{y}_i) has a functional form similar to a Gaussian-type function [20]. g(\hat{y}_i) has a maximum at \hat{y}_i = 0 and decreases exponentially as the magnitude of \hat{y}_i becomes larger. Consequently, a larger value of \hat{y}_i brings about better generalization capability. Thus, the additional cost term in Eq. (4) results in better generalization performance.

Now, we consider the case of an L-layered feedforward neural network. According to the above results, the network obtains low input-to-output sensitivity in the first hidden layer, which means that changes of the input vector lead to smaller changes of the output values of the first hidden layer. In the same way, the smaller changes in the first hidden layer result in still smaller changes of the output values of the second hidden layer, because the output vector of the first hidden layer is the input vector of the second hidden layer. The remaining hidden layers and the output layer follow by analogy. Consequently, it can easily be deduced that the output values at the output layer change very little even when the input vector changes considerably.

3. Improved Hybrid-II algorithm with low-frequency components

In practical applications, one is not interested in the detailed high-frequency behavior of the output values of the hidden and output layers, nor in smoother curvature values of the error cost surface. Literature [20] adopted low-pass filtering to obtain the low-frequency components from the hidden and output layers. For time series prediction problems, it was found that if the high-frequency components are reduced in the course of training, the feedforward neural network obtains better generalization capability [20]. Jeong and Lee proposed a smoothed constraint method by introducing the notion of frequency filtering and derived a new constrained learning algorithm (CLA) combined with standard back-propagation learning [20]. In the following, based on this algorithm, we shall derive a simpler CLA with a smaller computational load and better generalization capability.
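Before turning to the Hybrid-II variant, the on-line update of the improved Hybrid-I algorithm of Section 2 (Eqs. (4)-(7)) can be sketched for a network with a single hidden layer. The layer sizes, learning rate eta, and gain gamma below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def hybrid1_online_step(x, t, W1, W2, eta=0.1, gamma=0.05):
    """One on-line update of the improved Hybrid-I algorithm for a
    network with a single hidden layer (L = 2), following Eqs. (4)-(7).
    eta (learning rate) and gamma (hidden-layer gain) are illustrative."""
    n_hid, n_out = W1.shape[0], W2.shape[0]

    # Forward pass with the tangent sigmoid of Eq. (1)
    h_hat = W1 @ x            # net inputs of the hidden layer
    h = np.tanh(h_hat)
    y_hat = W2 @ h            # net inputs of the output layer
    y = np.tanh(y_hat)

    fp_hid = 1.0 - h ** 2            # f'(h_hat)
    fp_out = 1.0 - y ** 2            # f'(y_hat)
    fpp_hid = -2.0 * h * fp_hid      # f''(h_hat) = -2 f f', Eq. (2)

    # Output-layer sensitivity, Eq. (7)
    delta_out = fp_out * (t - y) / n_out
    # Hidden-layer sensitivity with the additional (Hebbian) term, Eq. (6)
    delta_hid = (W2.T @ delta_out) * fp_hid - (gamma / n_hid) * fpp_hid

    # Immediate (on-line) steepest-descent updates, Eq. (5)
    W2 = W2 + eta * np.outer(delta_out, h)
    W1 = W1 + eta * np.outer(delta_hid, x)
    return W1, W2, y
```

Each stored pattern updates the weights immediately; the loop over patterns is repeated until the training target is met.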


3.1. Improved Hybrid-II algorithm with low-frequency components

The effective frequency can be expressed as the product of the slope of the activation function and the square of the synaptic weights in a single-layered neural network [20]. In a multilayered architecture, the neural network can be regarded as a series of low-pass filters interconnected in a layer-by-layer style. However, the algorithm in literature [20] was an off-line learning one, so it cannot be applied to on-line processing problems. Therefore, in this paper we generalize it to an on-line learning version with a simpler additional cost term than that of the Hybrid-II algorithm. Consequently, a new cost function including the low-frequency components is defined as follows:

E = \frac{1}{2N_L} \sum_{i=1}^{N_L} (t_i - y_i)^2 + \sum_{l=1}^{L} \gamma_l E^h_l,   (11)

where

E^h_l = \frac{1}{N_l} \sum_{j_l=1}^{N_l} f'(\hat{h}_{j_l}) \left[ \frac{1}{2} \sum_{j_{l-1}=1}^{N_{l-1}} (w_{j_l j_{l-1}})^2 \right]

is the proposed error term and the gain \gamma_l denotes the relative significance of the corresponding hidden or output layer cost over the output error. As in the improved Hybrid-I algorithm, assume that the same activation function, i.e., the tangent sigmoid transfer function, is adopted for all neurons at the output and hidden layers. The network is also trained by a steepest-descent error minimization algorithm, so the corresponding synaptic weight update can be written as follows:

\Delta w_{j_l j_{l-1}} = -\eta_l \frac{\partial E}{\partial w_{j_l j_{l-1}}} = \eta_l \delta_{j_l} h_{j_{l-1}} - \eta_l \frac{\gamma_l}{N_l} w_{j_l j_{l-1}} f'(\hat{h}_{j_l}), \quad l = 1,\ldots,L,   (12)

where \delta_{j_l} denotes the negative derivative of the total cost E with respect to \hat{h}_{j_l} at the lth layer. For the hidden layer case, \delta_{j_l} can be computed as

\delta_{j_l} = -\frac{\partial E}{\partial \hat{h}_{j_l}} = \left( \sum_{j_{l+1}=1}^{N_{l+1}} \delta_{j_{l+1}} w_{j_{l+1} j_l} \right) f'(\hat{h}_{j_l}) - \frac{\gamma_l}{N_l} f''(\hat{h}_{j_l}) \frac{1}{2} \sum_{j_{l-1}=1}^{N_{l-1}} (w_{j_l j_{l-1}})^2, \quad l = 1,\ldots,L-1.   (13)

For the output layer case, \delta_{j_L} can be computed as

\delta_{j_L} = \frac{1}{N_L} f'(\hat{h}_{j_L}) (t_{j_L} - y_{j_L}) - \frac{\gamma_L}{N_L} f''(\hat{h}_{j_L}) \frac{1}{2} \sum_{j_{L-1}=1}^{N_{L-1}} (w_{j_L j_{L-1}})^2.   (14)
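A hedged sketch of the update terms in Eqs. (12)-(14) for a network with one hidden layer is given below; the layer sizes, eta, and gamma are illustrative assumptions:

```python
import numpy as np

def hybrid2_online_step(x, t, W1, W2, eta=0.1, gamma=0.01):
    """One on-line update of the improved Hybrid-II algorithm for a
    network with a single hidden layer, following Eqs. (12)-(14).
    eta and gamma are illustrative values, not the paper's settings."""
    n_hid, n_out = W1.shape[0], W2.shape[0]

    # Forward pass with the tangent sigmoid of Eq. (1)
    h_hat = W1 @ x
    h = np.tanh(h_hat)
    y_hat = W2 @ h
    y = np.tanh(y_hat)

    fp_h, fp_y = 1.0 - h ** 2, 1.0 - y ** 2
    fpp_h, fpp_y = -2.0 * h * fp_h, -2.0 * y * fp_y   # Eq. (2)

    # Output-layer sensitivity, Eq. (14): BP term minus frequency penalty
    delta_out = fp_y * (t - y) / n_out \
        - (gamma / n_out) * fpp_y * 0.5 * np.sum(W2 ** 2, axis=1)
    # Hidden-layer sensitivity, Eq. (13)
    delta_hid = (W2.T @ delta_out) * fp_h \
        - (gamma / n_hid) * fpp_h * 0.5 * np.sum(W1 ** 2, axis=1)

    # Eq. (12): BP/Hebbian term plus the f'-scaled weight-decay term
    W2 = W2 + eta * np.outer(delta_out, h) \
        - eta * (gamma / n_out) * W2 * fp_y[:, None]
    W1 = W1 + eta * np.outer(delta_hid, x) \
        - eta * (gamma / n_hid) * W1 * fp_h[:, None]
    return W1, W2, y
```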

From the above equations, it can be noted that the updating of the synaptic weights consists of three components, i.e., the standard error back-propagation term, the Hebbian learning term, and a simple weight decay term. In the following, this method is referred to as the improved Hybrid-II algorithm.

3.2. An analysis of the improved Hybrid-II algorithm

From a Bayesian perspective, all the additional cost functions designed for the above constrained learning algorithms can be interpreted as the negative logarithm of a prior probability distribution of the weights [4,10]. In order to obtain good generalization capability, the improved Hybrid-II algorithm tries to reduce the network complexity by introducing a weight decay term. The weight decay form E_c(w) = \frac{1}{2} \sum_i w_i^2 can be derived by taking the negative logarithm of a Gaussian distribution of the weights. For the weight decay form E_c(w) = \frac{1}{2} \sum_i w_i^2 / (1 + w_i^2), the corresponding prior weight distribution can be regarded as a Cauchy distribution [10]. Another prior weight distribution is the Laplace prior, whose negative logarithm can be represented as E_c(w) = \sum_i |w_i| [4,10]. The Gaussian and Laplace priors generally try to minimize the magnitudes of the weights, while the Cauchy prior distribution allows large weights as well as small weights. For notational simplicity, the weight decay form in the improved Hybrid-II algorithm can be written as E_c(w) = \frac{f'}{2} \sum_i w_i^2. It favors large weights only when the corresponding hidden activation is saturated. The derivative of the hidden activation, f', can be regarded as a scaling parameter that controls whether the weights are scaled up or down during the learning process [20].

4. Experimental results

To demonstrate the improved generalization capability of the proposed hybrid algorithms, in the following we test the performance of the two improved algorithms on the superimposed sine series and two real-world benchmarks.

4.1. Single-step predictions for the superimposed sine series

In this experiment, the training data set was generated from the following equation:

y(t) = \sin t + 0.1 \sin 10t.   (15)


(Figure: panel (a) plots target and prediction vs. time; panel (b) plots the prediction error vs. time.)

The total number of training data is 251, sampled from [-5, 5] at equally spaced intervals. Likewise, 251 testing samples are sampled from [5, 15] at equally spaced intervals. A 5-10-1-sized network is adopted. The predicted results and the corresponding prediction errors for the BP algorithm and the two improved algorithms are shown in Figs. 1-3 and Table 1. From these results, it can be found that the prediction accuracies of the proposed two algorithms are obviously better than that of the BP algorithm. That is to say, the proposed two improved CLAs have better generalization capability than the BP algorithm. It can also be seen from this experiment that the improved Hybrid-II algorithm gives the best performance.
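Under the sampling described above, the data sets can be generated as follows. The windowing scheme (five consecutive values predicting the next one) is an assumption consistent with the 5-10-1 network size, since the paper does not spell it out:

```python
import numpy as np

# Superimposed sine series of Eq. (15)
def series(t):
    return np.sin(t) + 0.1 * np.sin(10.0 * t)

# 251 equally spaced training samples on [-5, 5] and
# 251 equally spaced testing samples on [5, 15]
t_train = np.linspace(-5.0, 5.0, 251)
t_test = np.linspace(5.0, 15.0, 251)
y_train, y_test = series(t_train), series(t_test)

# Single-step prediction with a 5-10-1 network: each input is a window
# of 5 consecutive values and the target is the next value (assumed).
def windows(y, width=5):
    X = np.array([y[i:i + width] for i in range(len(y) - width)])
    d = y[width:]
    return X, d

X_train, d_train = windows(y_train)
X_test, d_test = windows(y_test)
```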


Fig. 1. Results with single-step prediction for the superimposed sine series by using the BP algorithm.

(Figure: panel (a) plots target and prediction vs. time; panel (b) plots the prediction error vs. time.)

Fig. 2. Results with single-step prediction for the superimposed sine series by using the improved Hybrid-I algorithm.

(Figure: panel (a) plots target and prediction vs. time; panel (b) plots the prediction error vs. time.)

Fig. 3. Results with single-step prediction for the superimposed sine series by using the improved Hybrid-II algorithm.

Table 1
The comparison of mean squared errors for the superimposed sine series by the BP algorithm and the two improved algorithms

Learning algorithm      Training    Testing
BP                      0.0022      0.0082
Improved Hybrid-I       0.0029      0.0066
Improved Hybrid-II      0.0031      0.0060

4.2. Real-world benchmarks

In this section, we demonstrate the improved generalization capability of the proposed two learning algorithms on two real-world benchmark problems: the sunspot time series and the chaotic laser pulsation data from the Santa Fe competition data set A. To compare the generalization ability of our proposed two learning algorithms with that of the two original Hybrid ones, we used a 12-8-1-sized network for both real data sets. Sunspot data from the year 1700 to 1920 are used as the training set, and the data after the year 1920 are used as the testing set. We divided the testing data into four intervals: from the year 1921 to 1955, 1956 to 1979, 1980 to 1994, and finally 1921 to 1994. The predicted results are shown in Figs. 4-6 for the BP algorithm, the improved Hybrid-I algorithm, and the improved Hybrid-II algorithm, respectively. The corresponding prediction accuracies are summarized in Table 2. From these results, it can be seen that the two improved hybrid algorithms have better generalization capability than the BP algorithm as well as the two original Hybrid ones. Moreover, the improved Hybrid-I algorithm gives the best generalization.

(Figure: panel (a) plots target and prediction vs. time; panel (b) plots the prediction error vs. time.)

Fig. 4. Results with single-step prediction for sunspot time series by using the BP algorithm.

(Figure: panel (a) plots target and prediction vs. time; panel (b) plots the prediction error vs. time.)

Fig. 5. Results with single-step prediction for sunspot time series by using the improved Hybrid-I algorithm.

(Figure: panel (a) plots target and prediction vs. time; panel (b) plots the prediction error vs. time.)

Fig. 6. Results with single-step prediction for sunspot time series by using the improved Hybrid-II algorithm.


Table 2
The comparison of mean squared errors for the sunspot time series data by the five algorithms

Learning algorithm     Training      Testing       Testing      Testing      Testing
                       (1700-1920)   (1921-1955)   (1956-1979)  (1980-1994)  (1921-1994)
BP                     0.00084644    0.00057893    0.0095       0.0120       0.0056
Hybrid-I               0.0008613     0.00039882    0.0091       0.0042       0.0037
Hybrid-II              0.0017968     0.00019619    0.0083       0.0039       0.0035
Improved Hybrid-I      0.00051619    0.000094795   0.0049       0.0050       0.0022
Improved Hybrid-II     0.0019        0.00043912    0.0048       0.0038       0.0034

Below, we discuss the effects of the three parameters η1, η2, and γ1 of the improved Hybrid-I algorithm on the sunspot time series prediction performance.

Case I: η1 = 0.14 and η2 = 0.03 are kept unchanged, and γ1 is selected as 0.05, 0.07, 0.09, and 0.11, respectively. From the simulation results, it can be seen that the bigger γ1 is, the worse the generalization performance is.

Case II: η1 = 0.14 and γ1 = 0.05 are kept unchanged, and η2 is selected as 0.03, 0.05, 0.07, and 0.09, respectively. From the simulation results, it can be seen that the bigger η2 is, the worse the generalization performance is.

Case III: η2 = 0.03 and γ1 = 0.05 are kept unchanged, and η1 is selected as 0.08, 0.10, 0.12, and 0.14, respectively. From the simulation results, it can be seen that the bigger η1 is, the better the generalization performance is.

To conveniently discuss the effects of the four parameters η1, η2, γ1, and γ2 of the improved Hybrid-II algorithm on the sunspot time series prediction performance, we assume η1 = η2 and γ1 = γ2.

Case I: η1 = η2 = 0.3 is kept unchanged, and γ1 = γ2 is selected as 0.001, 0.003, 0.005, and 0.007, respectively. From the simulation results, it can be seen that the bigger γ1 and γ2 are, the worse the generalization performance is.

Case II: γ1 = γ2 = 0.001 is kept unchanged, and η1 = η2 is selected as 0.15, 0.20, 0.25, and 0.3, respectively. From the simulation results, it can be seen that the bigger η1 and η2 are, the better the generalization performance is.

All the above results are shown in Table 3.

Finally, the proposed two learning algorithms are also applied to the single-step prediction of the chaotic laser pulsation data from the Santa Fe competition data set A. Similarly, we used a 12-8-1-sized network for this problem. The first 1000 data steps are used for training, and the following 200 steps are used for the single-step prediction test. The predicted results are shown in Figs. 7-9 for the BP algorithm, the improved Hybrid-I algorithm, and the improved Hybrid-II algorithm, respectively. The corresponding prediction accuracies are summarized in Table 4. According to Table 4, it can be found that the generalization capability of the proposed two learning algorithms significantly surpasses that of the BP learning algorithm and the two original Hybrid algorithms. Moreover, the improved Hybrid-I learning algorithm is the best among the five.


Table 3
The effects of the parameters of the improved Hybrid-I and improved Hybrid-II algorithms on the sunspot time series prediction performance

Fixed parameters       Mean squared errors (1921-1994)
η1 = 0.14, η2 = 0.03   γ1 = 0.05: 0.0022        γ1 = 0.07: 0.0034        γ1 = 0.09: 0.0048        γ1 = 0.11: 0.0072
η1 = 0.14, γ1 = 0.05   η2 = 0.03: 0.0022        η2 = 0.05: 0.0028        η2 = 0.07: 0.0036        η2 = 0.09: 0.0042
η2 = 0.03, γ1 = 0.05   η1 = 0.08: 0.0041        η1 = 0.10: 0.0035        η1 = 0.12: 0.0031        η1 = 0.14: 0.0022
η1 = η2 = 0.3          γ1 = γ2 = 0.001: 0.0034  γ1 = γ2 = 0.003: 0.0046  γ1 = γ2 = 0.005: 0.0055  γ1 = γ2 = 0.007: 0.0064
γ1 = γ2 = 0.001        η1 = η2 = 0.15: 0.0052   η1 = η2 = 0.20: 0.0050   η1 = η2 = 0.25: 0.0047   η1 = η2 = 0.3: 0.0034

(Figure: panel (a) plots target and prediction vs. time; panel (b) plots the prediction error vs. time.)

Fig. 7. Results with single-step prediction for the laser pulsation data by using the BP algorithm.

Here we again discuss the effects of the three parameters η1, η2, and γ1 of the improved Hybrid-I algorithm on the chaotic laser pulsation data prediction performance.

Case I: η1 = 0.25 and η2 = 0.1 are kept unchanged, and γ1 is selected as 0.004, 0.006, 0.008, and 0.01, respectively. From the simulation results, it can be seen that the bigger γ1 is, the better the generalization performance is.

Case II: η1 = 0.25 and γ1 = 0.01 are kept unchanged, and η2 is selected as 0.1, 0.15, 0.2, and 0.25, respectively. From the simulation results, it can be seen that the bigger η2 is, the worse the generalization performance is.

Case III: η2 = 0.1 and γ1 = 0.01 are kept unchanged, and η1 is selected as 0.1, 0.15, 0.2, and 0.25, respectively. From the simulation results, it can be seen that the bigger η1 is, the better the generalization performance is.

Similarly, we can also discuss the effects

(Figure: panel (a) plots target and prediction vs. time; panel (b) plots the prediction error vs. time.)

Fig. 8. Results with single-step prediction for the laser pulsation data by using the improved Hybrid-I algorithm.

(Figure: panel (a) plots target and prediction vs. time; panel (b) plots the prediction error vs. time.)

Fig. 9. Results with single-step prediction for the laser pulsation data by using the improved Hybrid-II algorithm.

Table 4
The comparison of mean squared errors for the chaotic laser pulsation data by the five algorithms

Learning algorithm     Training (up to 200th step)    Testing (up to 200th step)
BP                     0.00026918                     0.0057
Hybrid-I               0.00009206                     0.0015
Hybrid-II              0.00009304                     0.0032
Improved Hybrid-I      0.00031271                     0.00074030
Improved Hybrid-II     0.00047027                     0.0022

of the four parameters η1, η2, γ1, and γ2 of the improved Hybrid-II algorithm on the chaotic laser pulsation data prediction performance. For simplicity, we


Table 5
The effects of the parameters of the improved Hybrid-I and improved Hybrid-II algorithms on the chaotic laser pulsation data prediction performance

Fixed parameters       Mean squared errors (up to 200th step)
η1 = 0.25, η2 = 0.1    γ1 = 0.004: 0.0013           γ1 = 0.006: 0.00095223     γ1 = 0.008: 0.00084734     γ1 = 0.01: 0.00074030
η1 = 0.25, γ1 = 0.01   η2 = 0.1: 0.00074030         η2 = 0.15: 0.00084957      η2 = 0.2: 0.0012           η2 = 0.25: 0.0065
η2 = 0.1, γ1 = 0.01    η1 = 0.1: 0.0022             η1 = 0.15: 0.0013          η1 = 0.2: 0.0005489        η1 = 0.25: 0.00074030
η1 = η2 = 0.06         γ1 = γ2 = 0.0001: 0.0022     γ1 = γ2 = 0.0006: 0.0026   γ1 = γ2 = 0.0011: 0.0041   γ1 = γ2 = 0.0016: 0.0045
γ1 = γ2 = 0.0001       η1 = η2 = 0.06: 0.0022       η1 = η2 = 0.08: 0.0027     η1 = η2 = 0.1: 0.0030      η1 = η2 = 0.12: 0.0033

make η1 = η2 and γ1 = γ2.

Case I: η1 = η2 = 0.06 is kept unchanged, and γ1 = γ2 is selected as 0.0001, 0.0006, 0.0011, and 0.0016, respectively. According to the simulation results, it can be seen that the bigger γ1 and γ2 are, the worse the generalization performance is.

Case II: γ1 = γ2 = 0.0001 is kept unchanged, and η1 = η2 is selected as 0.06, 0.08, 0.1, and 0.12, respectively. From the simulation results, it can be seen that the bigger η1 and η2 are, the worse the generalization performance is.

All the above results are shown in Table 5.

5. Conclusions

In this paper, we proposed two improved constrained learning algorithms with additional functional constraints, based on the Hybrid-I and Hybrid-II learning algorithms introduced in [20]. Both are essentially on-line learning algorithms: the additional cost term of the first is built from the first-order derivatives of the neural activations at the hidden layers, while that of the second is built from the second-order derivatives of the neural activations at the hidden layers and the output layer. In the course of training, these additional cost terms penalize the input-to-output mapping sensitivity or the high-frequency components contained in the training data. Because the cost terms of the two improved algorithms are significantly simpler than those of the Hybrid-I and Hybrid-II algorithms, their computational complexities are lower. Moreover, experimental results on benchmark data from sine series and time series prediction showed that the generalization ability of the two proposed algorithms is clearly better than that of the Hybrid-I, Hybrid-II and BP learning algorithms, and that the improved Hybrid-I algorithm achieves the best generalization performance among the five. In addition, we discussed the effects of the parameters of the two new learning algorithms on network performance. Future work will include applying these two on-line constrained learning algorithms to more practical problems.
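The first improved algorithm's additional cost term is built from the first-order derivatives of the hidden-layer activations: for a sigmoid unit, f'(net) = f(net)(1 - f(net)), so driving these derivatives down pushes the units toward saturation and makes the input-to-output mapping less sensitive to input perturbations. A minimal sketch of such a penalty term is given below, assuming sigmoid hidden units; the function names and the exact weighting are illustrative and are not the authors' precise formulation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_sensitivity_penalty(hidden_weights, x):
    """Sum of first-order derivatives f'(net_j) over the hidden units.

    `hidden_weights` is a list of per-hidden-unit weight vectors, each
    with the bias as its last component; `x` is one input pattern.
    Smaller values mean a less input-sensitive hidden representation.
    """
    penalty = 0.0
    for w in hidden_weights:
        net = sum(wi * xi for wi, xi in zip(w[:-1], x)) + w[-1]
        a = sigmoid(net)
        penalty += a * (1.0 - a)  # f'(net) for the sigmoid activation
    return penalty
```

In a constrained learning scheme of this kind, the penalty would be weighted by a coefficient (the role played by c1, c2 above) and added to the usual squared-error cost, so gradient descent trades fitting accuracy against mapping sensitivity.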

References

[1] E. Baum, D. Haussler, What size net gives valid generalization? Neural Comput. 1 (1989) 151–160.
[2] T. Poggio, F. Girosi, Regularization algorithms for learning that are equivalent to multilayer networks, Science 247 (1990) 978–982.
[3] M. Cottrell, B. Girard, et al., Neural modeling for time series: a statistical stepwise method for weight elimination, IEEE Trans. Neural Networks 6 (1992) 1355–1364.
[4] C. Goutte, L.K. Hansen, Regularization with a pruning prior, Neural Networks 10 (1997) 1053–1059.
[5] D.-G. Jeong, S.-Y. Lee, Merging back-propagation and Hebbian learning rules for robust classifications, Neural Networks 9 (1996) 213–222.
[6] E.D. Karnin, A simple procedure for pruning back-propagation trained neural networks, IEEE Trans. Neural Networks 1 (1990) 239–242.
[7] S.J. Nowlan, G.E. Hinton, Simplifying neural networks by soft weight sharing, Neural Comput. 4 (1992) 473–493.
[8] C.M. Bishop, Curvature-driven smoothing: a learning algorithm for feedforward networks, IEEE Trans. Neural Networks 4 (1993) 882–884.
[9] A.S. Weigend, D.E. Rumelhart, B.A. Huberman, Generalization by weight-elimination applied to currency exchange rate prediction, in: Proceedings of International Conference on Neural Networks, Seattle, 1991, pp. 837–841.
[10] P.M. Williams, Bayesian regularization and pruning using a Laplace prior, Neural Comput. 7 (1995) 117–143.
[11] D.A. Karras, An efficient constrained training algorithm for feedforward networks, IEEE Trans. Neural Networks 6 (1995) 1420–1434.
[12] D.S. Huang, Horace H.S. Ip, Zheru Chi, A neural root finder of polynomials based on root moments, Neural Comput. 16 (2004) 1721–1762.
[13] D.S. Huang, Zheru Chi, Finding roots of arbitrary high order polynomials based on neural network recursive partitioning method, Sci. China Ser. F Inform. Sci. 47 (2004) 232–245.
[14] D.S. Huang, A constructive approach for finding arbitrary roots of polynomials by neural networks, IEEE Trans. Neural Networks 15 (2004) 477–491.
[15] D.S. Huang, Horace H.S. Ip, Zheru Chi, H.S. Wong, Dilation method for finding close roots of polynomials based on constrained learning neural networks, Phys. Lett. A 309 (2003) 443–451.
[16] J. Sietsma, R.J.F. Dow, Creating artificial neural networks that generalize, Neural Networks 4 (1991) 67–79.
[17] Y. LeCun, J.S. Denker, S.A. Solla, Optimal brain damage, in: Advances in Neural Information Processing Systems, vol. 2, Morgan Kaufmann, San Mateo, CA, 1990, pp. 598–605.
[18] A.S. Weigend, D.E. Rumelhart, B.A. Huberman, Back-propagation, weight elimination and time-series prediction, in: Proceedings of the 1990 Connectionist Models Summer School, Morgan Kaufmann, San Mateo, CA, 1990, pp. 65–80.


[19] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K. Lang, Phoneme recognition using time-delay neural networks, IEEE Trans. Acoust. Speech Signal Process. 37 (1989) 328–339.
[20] S.Y. Jeong, S.Y. Lee, Adaptive learning algorithms to incorporate additional functional constraints into neural networks, Neurocomputing 35 (2000) 73–90.