Neurocomputing 55 (2003) 771 – 778
www.elsevier.com/locate/neucom
Letters
Parallel layer perceptron

Walmir M. Caminhas∗, Douglas A.G. Vieira, João A. Vasconcelos

Department of Electrical Engineering, Federal University of Minas Gerais, Belo Horizonte 30270-010, MG, Brazil
Abstract

In this paper, both the architecture and the learning procedure underlying the parallel layer perceptron are presented. This topology, different from previous ones, uses parallel layers of perceptrons to map nonlinear input–output relationships. Comparisons between the parallel layer perceptron, the multi-layer perceptron and ANFIS are included and show the effectiveness of the proposed topology.
© 2003 Elsevier B.V. All rights reserved.

Keywords: ANFIS; Multi-layer perceptron; Parallel layer perceptron
1. Introduction

In recent years, multi-layer perceptrons (MLP) have gained popularity in a vast range of applications due to their universal approximation capability [1,2] and the popularization of the backpropagation algorithm [7]. Another popular network is the adaptive-network-based fuzzy inference system (ANFIS) [4]. This network has an advantage over MLPs because its output depends linearly on the consequent parameters, so a more efficient learning algorithm can be used. However, the complexity of the ANFIS topology increases exponentially, since the rules are generated from all possible combinations of premises. The number of generated rules N for a system with n inputs and P premises is N = P^n (for instance, P = 3 premises and n = 6 inputs already yield 729 rules), so using ANFIS for problems with several variables is prohibitive. In this paper, a novel network called the parallel layer perceptron (PLP) is proposed, which aims to combine the advantages of both the MLP and ANFIS topologies.
∗ Corresponding author. Tel.: +55-31-34-99-4812; fax: +55-31-34-99-4810.
E-mail address: [email protected] (W.M. Caminhas).
0925-2312/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/S0925-2312(03)00440-5
Moreover, since learning usually demands considerable computational effort, training neural networks can be viewed as a natural application of parallelism. The topology proposed here is a natural extension of artificial neural networks to parallel environments. Several aspects of the PLP are treated in this paper. Firstly, the architecture of the proposed topology is presented, including a particular case in which the error surface with respect to half of the parameters is quadratic. A learning procedure is then discussed, followed by the universal approximation theorem. Lastly, some computational results are presented and discussed.
2. PLP architecture

The output y_t of the PLP, considering n inputs and m perceptrons per layer, is calculated as

y_t = \Gamma\Bigl( \sum_{j=1}^{m} \phi(a_{jt})\,\psi(b_{jt}) \Bigr),    (1)

where \Gamma(\cdot), \phi(\cdot) and \psi(\cdot) are activation functions (hyperbolic tangent, Gaussian, linear, etc.), a_{jt} = \sum_{i=0}^{n} p_{ji} x_{it} and b_{jt} = \sum_{i=0}^{n} v_{ji} x_{it}, p_{ji} and v_{ji} are components of the weight matrices P and V, x_{it} is the ith input of the tth sample (x_{0t} being the perceptron bias), and y_t is the tth entry of the output vector y. Fig. 1 shows the PLP topology. As in the traditional multi-layer perceptron, all network parameters can be adapted using the backpropagation method. However, some differences must be highlighted.
Fig. 1. PLP architecture.
Firstly, whereas the MLP builds its input–output mapping from compositions of functions, the PLP is mainly based on products of functions. Moreover, as can be seen in Fig. 1, the proposed topology is composed of parallel layers. This feature simplifies the implementation of the network on parallel machines or clusters. One particular case of the topology shown in Fig. 1 is obtained by taking both \Gamma(\cdot) and \phi(\cdot) as identity functions. In this case, the network output is computed as

y_t = \sum_{j=1}^{m} \bigl[ a_{jt}\,\psi(b_{jt}) \bigr] = \sum_{j=1}^{m} [L_j N_j].    (2)
It is important to note that the particular case described in Eq. (2) has some desirable characteristics. The error surface with respect to the p_{ji}, which in this case are linear parameters, is quadratic, so a more efficient learning algorithm can be used: a hybrid scheme that combines backpropagation and the least-squares estimate (LSE) is used to adapt the network parameters. In the next sections, only the network given by Eq. (2) is considered, and its learning algorithm is presented first. The learning procedure for the most general case is similar to traditional backpropagation and is therefore not covered in this paper.
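For concreteness, the following is a minimal sketch of a forward pass for the particular PLP of Eq. (2), written in Python/NumPy. The Gaussian choice for \psi(\cdot), the weight shapes and the convention of prepending a bias input x_{0t} = 1 are illustrative assumptions, not prescriptions taken from the text.

import numpy as np

def plp_forward(X, P, V):
    """Forward pass of the particular PLP of Eq. (2): y_t = sum_j a_jt * psi(b_jt).

    X : (T, n) input samples (a bias column of ones is prepended, i.e. x_0t = 1).
    P : (m, n+1) linear-layer weights (p_ji).
    V : (m, n+1) nonlinear-layer weights (v_ji).
    """
    T = X.shape[0]
    Xb = np.hstack([np.ones((T, 1)), X])   # prepend the bias input x_0t = 1
    A = Xb @ P.T                           # a_jt = sum_i p_ji x_it, shape (T, m)
    B = Xb @ V.T                           # b_jt = sum_i v_ji x_it, shape (T, m)
    psi = np.exp(-B**2)                    # assumed Gaussian activation psi(.)
    return np.sum(A * psi, axis=1)         # y_t = sum_j a_jt * psi(b_jt)

# toy usage: 2 inputs, m = 3 perceptrons per layer
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 2))
P = rng.normal(size=(3, 3))
V = rng.normal(size=(3, 3))
print(plp_forward(X, P, V))

The same routine generalizes to Eq. (1) by applying a further activation to the a_{jt} terms and wrapping the sum in an outer activation.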
3. Hybrid learning algorithm

When the particular network given by Eq. (2) is employed, a method that combines gradient descent and the LSE is more attractive. Since the output y_t is a linear function of the parameters p_{ji}, their optimum values can be computed with a simple algorithm based on linear algebra. This feature is similar to the role of the consequent parameters in ANFIS. To simplify the explanation of the LSE, let l_k = p_{ji}, where k = n(j − 1) + i, which simply rearranges the matrix P into a vector l with the same components. First, the outputs of the nonlinear perceptrons are calculated, and a matrix C combining the nonlinear outputs and the inputs is assembled, with components c_{tk} = x_{it}\,\psi(b_{jt}). In matrix notation, Eq. (2) becomes y = Cl. As in most supervised learning algorithms for neural networks, the aim of the learning process is to minimize the sum of squared errors over the training data, that is,

e = \tfrac{1}{2}\,(Cl - y_d)^{\mathrm T} (Cl - y_d),    (3)

where y_d is the desired output vector. The optimum value of l is obtained by setting the gradient of Eq. (3) to zero, which gives

l^{*} = (C^{\mathrm T} C)^{-1} C^{\mathrm T} y_d.    (4)
After l^{*} has been evaluated, its components are returned to the matrix form P. The output is then calculated and the nonlinear weights, matrix V, are adapted according to the classical backpropagation method, v_{ji}(\mathrm{iter}+1) = v_{ji}(\mathrm{iter}) - \eta\,\partial e/\partial v_{ji}, where \eta is the learning rate, computed as described by Jang [4], and iter is the iteration number.
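A minimal sketch of one iteration of this hybrid procedure, under the same Gaussian assumption for \psi(\cdot), is given below. The pseudo-inverse stands in for the explicit (C^T C)^{-1} C^T of Eq. (4) purely for numerical robustness, and the fixed learning rate eta is an illustrative simplification of the adaptive rate described by Jang [4].

import numpy as np

def hybrid_step(X, yd, P, V, eta=0.01):
    """One iteration of the hybrid LSE + gradient learning for the PLP of Eq. (2)."""
    T, m = X.shape[0], P.shape[0]
    Xb = np.hstack([np.ones((T, 1)), X])       # bias input x_0t = 1
    B = Xb @ V.T                               # b_jt
    psi = np.exp(-B**2)                        # assumed Gaussian psi(.)

    # LSE step, Eq. (4): c_tk = x_it * psi(b_jt), k enumerating the (j, i) pairs
    C = (psi[:, :, None] * Xb[:, None, :]).reshape(T, -1)
    l_star = np.linalg.pinv(C) @ yd            # pseudo-inverse solve of Eq. (4)
    P = l_star.reshape(m, Xb.shape[1])         # components back to matrix form P

    # gradient (backpropagation) step on the nonlinear weights V
    A = Xb @ P.T                               # a_jt with the updated P
    y = np.sum(A * psi, axis=1)
    err = y - yd
    dpsi = -2.0 * B * psi                      # psi'(b) for the Gaussian
    grad_V = (err[:, None] * A * dpsi).T @ Xb  # de/dv_ji
    V = V - eta * grad_V
    return P, V

In a full training run this step is repeated over the epochs, with C rebuilt after every update of V.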
4. PLP as a universal function approximator

One of the most important features desired in this kind of network is the ability to map a nonlinear input–output model. Consider the universal approximation theorem stated as follows [1,2]: Let \varphi(\cdot) be a nonconstant, bounded and monotone-increasing continuous function. Let I_p denote the p-dimensional unit hypercube, and let C(I_p) denote the space of continuous functions on I_p. Then, given any function f \in C(I_p) and \varepsilon > 0, there exist an integer M and sets of real constants \alpha_i, w_{ij} and b_i, where i = 1,\dots,M and j = 1,\dots,p, such that the function

F(x_1,\dots,x_p) = \sum_{i=1}^{M} \alpha_i\,\varphi\Bigl( \sum_{j=1}^{p} w_{ij} x_j + b_i \Bigr)    (5)
is an approximate realization of f(\cdot); that is, |F(x_1,\dots,x_p) - f(x_1,\dots,x_p)| < \varepsilon for all \{x_1,\dots,x_p\} in I_p. By inspection, Eq. (2) has the same form as Eq. (5). Since the logistic and hyperbolic tangent functions are nonconstant, bounded and monotone increasing, either can be used in the nonlinear layer to satisfy the conditions of the theorem. For radial basis functions, the corresponding approximation theorem can be stated as in [6]. A graphical interpretation can also be given to the PLP approximation property: the network can be understood as a linear combination of nonlinear functions, or vice versa. A network with one pair of parallel layers and two perceptrons per layer (m = 2) is capable of approximating two periods of a sine function. Fig. 2 shows the linear functions generated by the linear perceptrons, Fig. 3 shows the nonlinear functions generated when \psi(\cdot) is a Gaussian, Fig. 4 shows the linear–nonlinear products and, lastly, Fig. 5 shows the resulting approximation.
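The inspection can be made explicit with a short derivation: freezing the linear branch of Eq. (2) by choosing p_{j0} = \alpha_j and p_{ji} = 0 for i \ge 1 (a choice made here only for illustration, recalling that x_{0t} = 1) reduces the PLP exactly to the form of Eq. (5).

% Sketch: realizing Eq. (5) with the PLP of Eq. (2).
% With p_{j0} = \alpha_j and p_{ji} = 0 for i >= 1, the linear branch becomes
% constant, a_{jt} = \alpha_j, and Eq. (2) collapses to the form of Eq. (5):
\begin{align*}
  y_t = \sum_{j=1}^{m} a_{jt}\,\psi(b_{jt})
      = \sum_{j=1}^{m}\Bigl(\sum_{i=0}^{n} p_{ji}x_{it}\Bigr)
        \psi\Bigl(\sum_{i=0}^{n} v_{ji}x_{it}\Bigr)
      \;\longrightarrow\;
      \sum_{j=1}^{m} \alpha_j\,
      \psi\Bigl(\sum_{i=1}^{n} v_{ji}x_{it} + v_{j0}\Bigr),
\end{align*}
% which matches F(x_1,\dots,x_p) in Eq. (5) with M = m, \varphi = \psi,
% w_{ji} = v_{ji} and b_j = v_{j0}.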
Fig. 2. Linear parameters.
Fig. 3. Nonlinear parameters.
Fig. 4. Product of linear and nonlinear parameters.
Fig. 5. Approximated function.
5. Numerical problems

Two test problems were used to assess the computational performance of the PLP. The comparisons were made against ANFIS [4] and an MLP trained with the Levenberg–Marquardt algorithm [3].
All networks were trained for 50 epochs. The PLP activation function \psi(\cdot) used in the examples was the Gaussian. The first test problem is the mapping of a nonlinear input–output model: a sinc function of two variables ranging over [−10, 10] × [−10, 10]. The training data consisted of 121 equally spaced sample points. The results for this problem are shown in Table 1, where the number attached to each topology denotes, for the PLP, the number of parallel layers times the number of perceptrons per layer (2 × m); for the MLP, the number of perceptrons in its single hidden layer; and, for ANFIS, the number of rules. The second example is the chaotic Mackey–Glass delay differential equation [5], \dot{x}(t) = 0.2\,x(t-\tau)/(1 + x^{10}(t-\tau)) - 0.1\,x(t), a benchmark widely used in the neural and fuzzy communities [4]. The series was obtained with the fourth-order Runge–Kutta method, assuming x(0) = 1.2, \tau = 17 and x(t) = 0 for t < 0. From the Mackey–Glass series, 1000 input–output pairs were extracted for t = 118–1117; the first 500 pairs were used for training and the remaining 500 for validating the model. The results are presented in Table 2, whose third and fourth columns give the root mean squared error for the training (Trn) and validation (Val) data, respectively.
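The sinc training set described above can be reproduced with a few lines of NumPy, as sketched below; the 11 × 11 grid and the product form sinc(x)·sinc(y) with sinc(0) = 1 are assumptions, since the text does not state the exact sampling or the precise two-variable definition.

import numpy as np

def sinc(r):
    # sin(r)/r with the removable singularity sinc(0) = 1
    return np.where(r == 0, 1.0, np.sin(r) / np.where(r == 0, 1.0, r))

# assumed 11 x 11 equally spaced grid over [-10, 10] x [-10, 10]  ->  121 samples
x = np.linspace(-10, 10, 11)
y = np.linspace(-10, 10, 11)
X1, X2 = np.meshgrid(x, y)
# assumed product form sinc(x) * sinc(y); the text only says "sinc of two variables"
T = sinc(X1) * sinc(X2)

inputs  = np.column_stack([X1.ravel(), X2.ravel()])   # (121, 2) training inputs
targets = T.ravel()                                    # (121,) training targets
print(inputs.shape, targets.shape)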
Table 1
Sinc results

Topology        Time     RMSE
PLP (2 × 5)     1.63     8.1 × 10^-5
MLP (10)        7.25     0.03
PLP (2 × 16)    3.34     6.7 × 10^-7
MLP (32)        10.48    3.0 × 10^-6
ANFIS (16)      5.87     0.078
PLP (2 × 25)    6.73     1.1 × 10^-10
MLP (50)        17.35    5.15 × 10^-7
ANFIS (25)      12.65    0.060
Table 2
Mackey–Glass series results

Topology        Time     Trn (10^-3)    Val (10^-3)
PLP (2 × 16)    12.11    1.78           1.68
ANFIS (16)      19.67    1.84           1.71
MLP (32)        26.98    1.82           1.71
PLP (2 × 20)    20.61    1.29           1.17
MLP (40)        39.32    1.30           1.18
PLP (2 × 30)    47.14    1.15           1.10
MLP (60)        85.74    1.29           1.26
Comparing the simulations in Table 1 for equivalent networks (equivalence being taken here as the total number of functions used in each network), the PLP topology was faster and presented smaller errors, showing its efficiency. For instance, PLP (2 × 5) outperforms MLP (10) in both training time and error, and PLP (2 × 16) outperforms both MLP (32) and ANFIS (16). The same analysis applies to the other networks, confirming the advantage of the proposed topology. The analysis carried out for the first example also holds for the second test problem (Table 2): the PLP outperforms both ANFIS and MLP. Moreover, the exponential growth of the ANFIS rule base discussed in Section 1 already becomes a limitation in this example. Even though the validation error does not depend only on the topology, the PLP networks presented good results for unknown samples, as shown in the fourth column of Table 2.

6. Conclusions

In this work, a novel neural network topology, the PLP, was proposed, including its architecture and learning algorithms. A particular case was also proposed because its special characteristics enable a more efficient training algorithm. The network can be used in a wide variety of applications due to its universal function approximation capability, and the numerical results presented in this paper show the computational efficiency of the proposed topology. Although its parallel characteristics were not explored in depth here, they are an important feature of PLP networks and remain an open area for further research.

Acknowledgements

The authors would like to thank CNPq, CAPES, FINEP and FAPEMIG for the financial support.
References

[1] G. Cybenko, Approximation by superpositions of a sigmoidal function, Math. Control Signals Systems 2 (1989) 303–314.
[2] K. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural Networks 2 (1989) 183–192.
[3] M.T. Hagan, M.B. Menhaj, Training feedforward networks with the Marquardt algorithm, IEEE Trans. Neural Networks 5 (6) (1994) 989–993.
[4] J.S.R. Jang, ANFIS: adaptive-network-based fuzzy inference system, IEEE Trans. Systems Man Cybernet. 23 (3) (1993) 665–685.
[5] M.C. Mackey, L. Glass, Oscillation and chaos in physiological control systems, Science 197 (1977) 287–289.
[6] J. Park, I.W. Sandberg, Universal approximation using radial-basis-function networks, Neural Comput. 3 (1991) 246–257.
[7] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, in: D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Bradford Books, MIT Press, Cambridge, MA, 1986 (Chapter 8).

Walmir Matos Caminhas is an Adjunct Professor in the Department of Electrical Engineering at the Federal University of Minas Gerais, Brazil. He holds a Doctorate degree in Electrical Engineering obtained from the University of Campinas, São Paulo, Brazil, in 1997. His research interests include computational intelligence and the control of electrical drives.
Douglas Alexandre Gomes Vieira was born in Belo Horizonte, Brazil, in 1980. He is an electrical engineering student at the Federal University of Minas Gerais, Brazil. His research interests include computational intelligence, multi-objective optimization and design, and stochastic and deterministic optimization methods.
João Antônio de Vasconcelos was born in Monte Carmelo, Brazil. He obtained his Ph.D. in 1984 from the École Centrale de Lyon, France. He has been a Professor in the Electrical Engineering Department of the Federal University of Minas Gerais since 1985. His research interests include vector optimization (evolutionary multi-objective optimization) and design, computational intelligence, and computational electromagnetics (finite element methods, boundary integral equation methods and others).