Merging Echo State and Feedforward Neural Networks for Time Series Forecasting

Štefan Babinec¹ and Jiří Pospíchal²

¹ Department of Mathematics, Fac. of Chemical and Food Technologies, Slovak University of Technology, 812 37 Bratislava, Slovakia. Phone/FAX: +421 2 52495177, E-mail: [email protected]
² Institute of Applied Informatics, Fac. of Informatics and Information Technologies, Slovak University of Technology, 842 16 Bratislava, Slovakia. Phone: +421 2 60291548, FAX: +421 2 65420587, E-mail: [email protected]

Abstract. Echo state neural networks, which are a special case of recurrent neural networks, are studied from the viewpoint of their learning ability, with the goal of achieving greater prediction ability. The standard training of these neural networks uses a pseudoinverse matrix for one-step learning of the weights from hidden to output neurons. Here, this learning was replaced by the backpropagation of error learning algorithm, and the output neurons were replaced by a feedforward neural network. This approach was tested in temperature forecasting, and the prediction error was substantially smaller than the prediction error achieved either by a standard echo state neural network or by a standard multi-layered perceptron with backpropagation.
1 Introduction
From the point of view of information flow, neural networks can be divided into two types: feedforward neural networks and recurrent neural networks [1]. In feedforward neural networks, the input information proceeds from input neurons to output neurons. Such networks essentially implement input/output mappings (functions) and can serve as universal approximators. Recurrent neural networks, on the other hand, contain at least one cyclic path, where the same input information repeatedly influences the activity of the neurons on that path. Such networks are more closely related to biological neural networks, which are also mostly recurrent. From the mathematical point of view, recurrent neural networks are also universal approximators, but they implement dynamical systems. These networks are less common in technical applications because of both theoretical and practical problems with learning; their applications are hindered by a lack of effective algorithms for supervised training. This problem was solved by the so-called echo state neural networks [3], which use a very fast learning algorithm based on the calculation of a pseudoinverse matrix, a standard numerical task.
The trouble with this approach is that it offers nearly perfect learning of the training set for the given recurrent neural network, but the predictive ability of such a network is not very good. The advantage of "one-step" learning turns into a disadvantage when we want to improve the predictive ability of the trained network: the pseudoinverse approach does not help us in any direct way in this matter. In our previous work [5, 6] we explored the possibility of improving "one-step" learning by evolutionary approaches. In this paper, we have replaced the output layer with a feedforward neural network, and the original "one-step" learning algorithm was replaced by the backpropagation of error learning algorithm. The connection between "liquid state" computing, related to echo states, and backpropagation was mentioned previously in [4, 7]. With this approach we lose the advantage of the fast "one-step" optimization typical for echo state networks, but we gain flexibility and better quality of prediction.
2 Main Idea of Echo State Neural Networks
The core of echo state neural networks is a special approach to the analysis and training of recurrent neural networks. This approach leads to a fast, simple and constructive algorithm for supervised learning of recurrent neural networks. The basic idea of echo state neural networks is the exploitation of a big "reservoir" of neurons, recurrently connected with random weights, as a source of dynamic behavior of the neural network, from which the requested output is combined.

Under certain circumstances, the state $x(n) = (x_1(n), x_2(n), \ldots, x_N(n))$ of a recurrent neural network is a function of its previous inputs $u(n), u(n-1), \ldots$ Here $x_i(n)$ is the output of the $i$-th hidden neuron at time $n$ and $N$ is the number of hidden neurons; the input vector is $u(n) = (u_1(n), u_2(n), \ldots, u_K(n))$, where $u_i(n)$ is the input to the $i$-th input neuron at time $n$ and $K$ is the number of input neurons. Therefore, there exists a function $E$ such that

$$x(n) = E(u(n), u(n-1), \ldots). \qquad (1)$$
Metaphorically speaking, the state $x(n)$ of the neural network can be considered a so-called "echo" of its previous inputs.
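A minimal numerical sketch of this property may help (the reservoir size, the uniform random weights and the tanh activation are assumptions made only for this illustration, not parameters from this paper): two copies of the same reservoir, started from different initial states but driven by the same input sequence, quickly end up in the same state.

```python
import numpy as np

# Illustrative sketch of the "echo" property; sizes, tanh activation and the
# uniform random weights are assumptions made for this demo only.
rng = np.random.default_rng(0)
K, N = 1, 100                                    # input and reservoir sizes
W_in = rng.uniform(-1.0, 1.0, (N, K))
W = rng.uniform(-1.0, 1.0, (N, N))
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius below 1

def step(x, u):
    """One reservoir update: x(n+1) = f(W_in u(n) + W x(n))."""
    return np.tanh(W_in @ u + W @ x)

# Drive two copies of the reservoir, started from different states,
# with the same input sequence.
x_a = rng.standard_normal(N)
x_b = rng.standard_normal(N)
for _ in range(200):
    u = rng.uniform(-1.0, 1.0, K)
    x_a, x_b = step(x_a, u), step(x_b, u)

# The difference is (numerically) zero: the state depends only on the
# input history, i.e. x(n) = E(u(n), u(n-1), ...).
print(np.max(np.abs(x_a - x_b)))
```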
3 Combination of Echo State and Feedforward Neural Network
Our paper presents a combination of an echo state neural network and a feedforward neural network. In this approach, the output layer of the echo state network is replaced with a feedforward neural network, and the "one-step" training algorithm is replaced with the backpropagation of error learning algorithm (Fig. 1). In the original echo state neural networks there is no possibility to stop the training algorithm
to avoid the overfitting problem, and therefore such neural networks often have trouble with generalization. By using the backpropagation of error learning algorithm, we can stop training when the error on a validation set stops improving. The most complex and at the same time most important part of the echo state neural network, the so-called "dynamical reservoir", remains preserved. The main task of this big recurrent layer is to preprocess the input signal for the feedforward part of the whole neural network.
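The stopping criterion just mentioned can be sketched as follows. This is only an illustration: train_one_epoch, validation_error, copy_weights and set_weights are hypothetical placeholders, not functions defined in this paper.

```python
# Hedged sketch of validation-based early stopping for the feedforward part.
# train_one_epoch, validation_error, copy_weights and set_weights are
# hypothetical placeholders.
def fit_with_early_stopping(net, train_data, val_data, patience=50, max_epochs=10000):
    best_err = float("inf")
    best_weights = net.copy_weights()
    epochs_since_best = 0
    for _ in range(max_epochs):
        train_one_epoch(net, train_data)        # one backpropagation pass
        err = validation_error(net, val_data)   # e.g. MAPE on held-out samples
        if err < best_err:
            best_err, best_weights = err, net.copy_weights()
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:   # validation error stopped improving
                break
    net.set_weights(best_weights)               # keep the best weights found
    return best_err
```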
3.1 Description of Neural Network
The echo state part of the whole neural network consists of $K$ input, $N$ hidden and $L$ output neurons. The state of the input neurons at time $n$ is characterized by the vector $u(n) = (u_1(n), u_2(n), \ldots, u_K(n))$, the state of the output neurons at time $n$ by the vector $y(n) = (y_1(n), y_2(n), \ldots, y_L(n))$, and similarly for the hidden neurons $x(n) = (x_1(n), x_2(n), \ldots, x_N(n))$. The values of the input-hidden synaptic weights are stored in the matrix $W^{in} = (w^{in}_{ij})$, the hidden-hidden weights in the matrix $W = (w_{ij})$, and the hidden-output weights in the matrix $W^{out} = (w^{out}_{ij})$. The feedforward part of the whole neural network consists of $L$ input neurons, $M$ output neurons and $S$ hidden layers, which may have a different number of neurons in each layer. As can be seen from Fig. 1, the output layer of the echo state part is the same as the input layer of the feedforward part.
Fig. 1. The architecture used in this approach – combination of echo state neural network and feedforward neural network.
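As a rough orientation, the weight matrices implied by this description have the following shapes. The concrete sizes below are placeholders, not values taken from the paper's experiments, and the shape of W_out assumes the formulation of Eq. (3) below, where both the input and the reservoir state feed the output layer.

```python
import numpy as np

# Shapes of the weight matrices described above; the sizes themselves are
# placeholders, not values taken from the paper's experiments.
K, N, L = 1, 300, 4          # echo state part: input, reservoir ("DR"), output
M = 1                        # feedforward part: output neurons
hidden = [8]                 # S hidden layers of the feedforward part (example)

rng = np.random.default_rng(1)
W_in = rng.uniform(-1.0, 1.0, (N, K))        # input -> reservoir
W = rng.uniform(-1.0, 1.0, (N, N))           # reservoir -> reservoir
W_out = rng.uniform(-1.0, 1.0, (L, K + N))   # (input, reservoir) -> ESN output
                                             # (shape assumed from Eq. (3) below)

# The L outputs of the echo state part are, at the same time, the input layer
# of the feedforward part, which is the only trained part of the network.
ff_layer_sizes = [L, *hidden, M]
```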
3.2 The Learning Algorithm
The only weights which are trained in this combination of neural networks are the weights in the feedforward part. The whole algorithm for training and testing consists of two steps.

The first step: The first step creates an untrained echo state neural network with weights $W^{in}$, $W$, $W^{out}$ which, however, can produce so-called "echo" states. There are a number of ways to obtain a network with this property. We have used the following approach [3]:

– We randomly generated an internal weight matrix $W_0$.
– Then we created a normalized matrix $W_1$ with unit spectral radius from the matrix $W_0$ by putting $W_1 = \frac{1}{|\lambda_{max}|} W_0$, where $|\lambda_{max}|$ is the spectral radius of $W_0$.
– After that we scaled $W_1$ to $W = \alpha W_1$, where $\alpha < 1$, whereby $W$ obtains spectral radius $\alpha$.
– Then we randomly generated the input weights $W^{in}$ and the output weights $W^{out}$.

Now the untrained network is an echo state network, regardless of how $W^{in}$ and $W^{out}$ are chosen.

The second step: Now we can proceed to the training and testing of the whole neural network. As mentioned before, the only weights trained in this approach are the weights in the feedforward part. The learning algorithm used in this part is the well-known backpropagation of error learning algorithm, described in detail in [1]. What matters for our approach is how the input signal is propagated through the echo state part. The states of the hidden neurons in the "dynamical reservoir" are calculated from the formula

$$x(n+1) = f(W^{in} u(n) + W x(n)), \qquad (2)$$

where $f$ is the activation function of the hidden neurons (we used the sigmoidal function). The states of the output neurons are calculated by the formula

$$y(n+1) = f^{out}(W^{out}(u(n), x(n+1))), \qquad (3)$$

where $f^{out}$ is the activation function of the output neurons (we used the sigmoidal function).
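The reservoir construction and formulas (2) and (3) can be summarized in the following sketch. The concrete sizes, the logistic sigmoid and the uniform random initialization are assumptions consistent with, but not fully specified by, the description above.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here as the activation f and f_out."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
K, N, L, alpha = 1, 300, 4, 0.7              # example sizes; alpha < 1

# First step: build an untrained echo state part with "echo" states.
W0 = rng.uniform(-1.0, 1.0, (N, N))              # random internal weights
W1 = W0 / np.max(np.abs(np.linalg.eigvals(W0)))  # unit spectral radius
W = alpha * W1                                   # spectral radius alpha
W_in = rng.uniform(-1.0, 1.0, (N, K))
W_out = rng.uniform(-1.0, 1.0, (L, K + N))       # stays untrained in this approach

# Second step: propagate the input signal through the echo state part.
def run_reservoir(inputs):
    """inputs: sequence of K-dimensional vectors u(n)."""
    x = np.zeros(N)
    outputs = []
    for u in inputs:
        x = sigmoid(W_in @ u + W @ x)                    # Eq. (2)
        y = sigmoid(W_out @ np.concatenate([u, x]))      # Eq. (3)
        outputs.append(y)       # y(n+1) is the input of the feedforward part,
    return np.array(outputs)    # which is trained with backpropagation
```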
4 Prediction of Air Temperature Data
Most publications about prediction strive to achieve the best possible prediction, which is then compared with the results of other prediction systems on selected data. This paper has a different aim: its goal is to compare the results achieved by the original echo state neural network with those of our new approach.
Records of the average air temperature in Slovakia in the years 1999 to 2002 were chosen as the data for testing the quality of prediction. The training set was composed of a time sequence of 1096 samples of air temperature data from the years 1999, 2000 and 2001, and the testing set was composed of the next 31 samples, i.e. January of the year 2002. This task is basically a function approximation: the prediction of data from previous trends in the same data. The mean absolute percentage error (MAPE) was used to measure the prediction quality, where $P_i^{real}$ and $P_i^{calc}$ are the measured and predicted values, respectively, and $N$ is the number of pairs of values (the length of the predicted time series):

$$MAPE = \frac{\sum_{i=1}^{N} \left| \frac{P_i^{real} - P_i^{calc}}{P_i^{real}} \right|}{N} \times 100. \qquad (4)$$
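A direct transcription of Eq. (4) as a small sketch (NumPy arrays or lists of equal length are assumed as inputs, and the variable names in the usage comment are hypothetical):

```python
import numpy as np

def mape(real, predicted):
    """Mean absolute percentage error, Eq. (4)."""
    real = np.asarray(real, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((real - predicted) / real)) * 100.0

# Example use: mape(measured_january_2002, predicted_january_2002)
```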
5 Experiments
Experiments were divided into two parts. The task of the first part was to find the parameters of the echo state neural network which would be optimal for the quality of prediction on the testing air temperature set. A very simple feedforward neural network was used in this first part; the reason for such a simple network was computational demand. The network consisted of two layers, with 4 neurons in the first layer and 1 neuron in the second layer. The results of the experiments are in the following Table 1.

Table 1. Results of experiments in the first part: quality of the prediction for different parameter values.

Index  DR Size  α    Average MAPE  The best MAPE
1      200      0.7  42.981 %      33.986 %
2      200      0.8  41.549 %      32.475 %
3      250      0.7  44.857 %      35.451 %
4      250      0.8  44.557 %      35.251 %
5      300      0.7  39.528 %      29.123 %
6      300      0.8  40.254 %      30.256 %
DR Size is the size of the dynamical reservoir, and α is a parameter influencing the ability of the neural network to have echo states (the values used were chosen in agreement with those used by Jaeger, the author of echo state networks, see [2, 3]). For each value of the DR size and parameter α presented in Table 1, the values of the weights in the dynamical reservoir were generated randomly 50 times, and for every such initialization the prediction error on the testing set was calculated. Then the average error over the whole set of attempts was calculated (attribute Average MAPE in Table 1), and the best achieved error was recorded as well (attribute The best MAPE in Table 1). As we can see from Table 1, there is an evident correlation between these attributes: in the cases where a better Average MAPE was achieved, a better The best MAPE was achieved too. The best results were achieved for a DR consisting of 300 neurons and for the parameter α equal to 0.7.

The second part of the experiments was focused on finding the best parameters of the feedforward neural network and its backpropagation of error learning algorithm. The parameters of the dynamical reservoir and the initial synaptic weights were chosen in accordance with the results of the experiments in the first phase. Thereafter we trained the feedforward neural network on all samples from the training set except the last 7 samples. This last week of the year 2001 was chosen as a validation set, used to test the quality of prediction on samples which were not used during the training process. A considerable number of experiments was carried out, the representative results of which are given in the following Table 2.

Table 2. Results of representative experiments in the second part of neural network learning.
Index  Learning cycles  Number of neurons  Parameter γ  MAPE
1      3854             12 – 1             0.8          24.547 %
2      3214             12 – 1             0.9          22.654 %
3      5635             14 – 8 – 1         0.8          15.729 %
4      4250             14 – 8 – 1         0.9          16.925 %
5      4411             25 – 11 – 1        0.8          18.521 %
6      3953             25 – 11 – 1        0.9          17.443 %
The attribute Learning cycles specifies the number of learning cycles after which the best prediction error on the testing set was achieved. The attribute Number of neurons specifies the number of neurons in each layer, and the attribute Parameter γ specifies the value of the learning parameter in the backpropagation of error learning algorithm. Every training of the feedforward part started with the same values of the synaptic weights and other parameters of the echo state part. The attribute MAPE specifies the best prediction error reached on the testing set. In the following Table 3 we can see a comparison of the best errors achieved on the testing air temperature set with three different approaches, and a graphical representation of the two most important approaches is given in Figures 2 and 3. It is clear from this table and these figures that the combination of an echo state neural network and a feedforward neural network can considerably increase the quality of prediction.
Table 3. Comparison of three different approaches.

Approach                        MAPE
TD FFNN with BP                 29.833 %
ESN                             23.281 %
Comb. of ESN and FFNN with BP   15.729 %
The attribute MAPE in Table 3 specifies the best prediction error reached on the testing set, and the attribute Approach specifies the approach used for prediction. TD FFNN with BP is a time delay feedforward neural network with the backpropagation of error learning algorithm; its best error (MAPE 29.833 %) was achieved with the following network parameters: number of learning cycles 6386, learning parameter γ = 0.7, number of neurons in each layer 8 - 23 - 9 - 1. ESN is an echo state neural network with the "one-step" learning algorithm, and Comb. of ESN and FFNN with BP is the combination of an echo state neural network and a feedforward neural network with the backpropagation of error learning algorithm.
Fig. 2. Testing data: 31 records of air temperature and 31 values predicted by original echo state neural network with ”one-step” learning algorithm (DR Size 250, Alpha 0.8, MAPE 23.28 %).
Fig. 3. Testing data: 31 records of air temperature and 31 values predicted by echo state neural network combined with feedforward neural network (Experiment No. 3 from Table 2, DR Size 300, Alpha 0.7, MAPE 15.73 %).
6 Conclusions
Echo state neural networks belong to a group of relatively new approaches in the field of neural networks. Their biggest advantage is their closer relation to biological models, due to their recurrent nature and the use of a reservoir of dynamic behavior without weight setting. These networks perform extraordinarily well in learning a time sequence, which is essential, for example, for motor control in human beings or robots, or also for language processing tasks. Compared to other types of recurrent networks, echo state networks have a major advantage in their ability of "one-step learning", even though this approach is probably not very biologically plausible. Their disadvantage is a lower generalization ability and the absence of an approach which would be able to improve an already trained network. The problem of improving a trained network does not appear in common feedforward or recurrent neural networks, because when the network needs improvement, the trained network can be further trained by another batch of iterations of the classical backpropagation of error algorithm. This,
however, does not work in echo state networks, where the standard algorithm with a pseudoinverse matrix allows only an "all or nothing" approach: either we train the network or we do not, with nothing in between. A network trained by this approach cannot be trained any further. We have tried to solve the above-mentioned problem in this work, where the output layer was substituted by a feedforward neural network and the original "one-step" learning algorithm was replaced by the backpropagation of error learning algorithm. Because we did not want to work with artificially created examples, we chose real data, meteorological measurements of air temperature, to evaluate our algorithms. Our aim was to find out whether this approach is able to increase the prediction quality of echo state networks. From the results shown in the paper, it is clear that this aim has been accomplished: the combination of an echo state neural network and a feedforward neural network can increase the quality of the network's prediction.

Acknowledgement: This work was supported by the Scientific Grant Agency VEGA of the Slovak Republic under grant #1/1047/04 and by the Grant Agency APVT under grant APVT-20-002504.
References

1. Haykin, S.: Neural Networks - A Comprehensive Foundation. Macmillan Publishing, 1994.
2. Jaeger, H.: The "Echo State" Approach to Analysing and Training Recurrent Neural Networks. German National Research Center for Information Technology, GMD Report 148, 2001.
3. Jaeger, H.: Short Term Memory in Echo State Networks. German National Research Center for Information Technology, GMD Report 152, 2002.
4. Natschläger, T., Maass, W., Markram, H.: The "liquid computer": A novel strategy for real-time computing on time series. Special Issue on Foundations of Information Processing of TELEMATIK, 8(1):39-43, 2002.
5. Babinec, S., Pospichal, J.: Optimization in Echo state neural networks by Metropolis algorithm. In R. Matousek, P. Osmera (eds.): Proceedings of the 10th International Conference on Soft Computing, Mendel'2004. VUT Brno Publishing, 2004, pp. 155-160.
6. Babinec, S., Pospichal, J.: Two approaches to optimize echo state neural networks. In R. Matousek, P. Osmera (eds.): Proceedings of the 11th International Conference on Soft Computing, Mendel'2005. VUT Brno Publishing, 2005, pp. 39-44.
7. Goldenholz, D.: Liquid computing: A real effect. Technical report, Boston University Department of Biomedical Engineering, 2002.