Block-structured Recurrent Neural Networks
Simone Santini (Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0114)
Alberto Del Bimbo (Dipartimento di Sistemi e Informatica, Università di Firenze, Firenze, Italy)
Ramesh Jain (Department of Electrical and Computer Engineering, University of California San Diego)
Abstract
This paper introduces a new class of dynamic multilayer perceptrons, called Block Feedback Neural Networks (BFN). BFNs have been developed to provide a systematic way to build networks of high complexity, including networks with coupled loops, nested loops, and so on. BFNs are specified using a block notation. Any BFN can be seen as a block and connected to other BFNs using a fixed number of elementary connections. The result of such a connection can also be considered as a block and connected to other blocks, in a recursive fashion. We develop a cost-minimizing supervised training algorithm for this model. The algorithm is of the gradient-descent type and is tailored to the block structure of the model. Finally, we present some experiments with BFNs.
1 Introduction

The interest in dynamic neural networks has been steadily growing in the past few years, and a number of models have been proposed -- such as Williams and Zipser's Forward Propagation network (Williams and Zipser, 1989a), Kuhn's dynamic networks (Kuhn et al., 1989), and Robinson's three-layer dynamic network (Robinson, 1989). All these models are based on a fixed architecture: the way neurons are interconnected and the position of the delay units are embedded in the model. The designer can fit the network to the particular problem by suitably selecting the number of neurons in the various parts of the network. In this paper, we present a class of dynamic neural networks, called Block Feedback Neural Networks (BFN), that tries to overcome the limitations of a fixed architecture and to provide a framework in which a great number of different architectures can be specified and trained. The BFN family includes, as particular instances, all the models above as well as the backpropagation network (Werbos, 1974; Rumelhart and McClelland, 1987), allows them to be combined into more complex networks, and allows the design of architectures not included in
the above examples, such as architectures with multiple -- and possibly nested -- feedback loops. The BFN, therefore, is not a neural network, but rather a general model to describe and train neural architectures. Because of this, we do not start in the traditional way, giving a network architecture and then showing how learning can be done. Rather, we begin by giving a recipe. The recipe can be used to build a great number (actually, an infinite number) of network architectures of very different characteristics but, as long as the indications of the recipe have been followed, all these different networks will be dubbed BFNs. The recipe is based on the use of a block notation derived from (Narendra and Parthasarathy, 1990a). We consider as a BFN any block N made up of a multilayer perceptron that satisfies a suitable set of trainability conditions. Then, we give four connections that connect a further layer l to the block in such a way that the overall network N1 still satisfies the same trainability conditions. In this way, N1 can itself be considered as a block, and the same four connections can be used to attach more layers in a recursive way. The recipe for building BFNs can be paired with a parallel recipe for building the training algorithm. Just as blocks are nested one inside the other to build complex networks, so the training algorithm is made up of a series of procedures, one for each type of connection, that are executed recursively to obtain a learning step for the whole network. Note that, since the BFN model includes backpropagation, it enjoys the universal approximation (Hecht-Nielsen, 1989; Blum and Li, 1991) and Bayesian classification (Hampshire II and Pearlmutter, 1990) properties of backpropagation. Moreover, preliminary studies (Doya; Santini and Del Bimbo, b) indicate that BFNs extend these properties to the realm of dynamic systems. As a final note, we point out that in this introduction neural networks with delays were purposefully called "dynamic" rather than "recurrent". This word was meant to emphasize that we are talking about networks whose output evolves in time -- just like that of a dynamic system -- under the action of a time-varying input. We make this distinction to separate the networks we are talking about from those that work only at their equilibrium states, such as the Hopfield network (Hopfield, 1982; Hopfield, 1984). In the terms of (Pearlmutter, 1990), we deal with "non-fixpoint networks", as opposed to "fixpoint networks".
2 Block Description of Neural Networks

BFNs are described using a block notation -- derived from Narendra's (Narendra and Parthasarathy, 1990b) -- in which networks are enclosed in blocks that are connected together recursively to build further blocks. The basic unit of the notation is the layer (Fig. 1), which we regard as a weight matrix A followed by an array of output functions F. The layer is regarded as a simple block N. Simple blocks can be connected together to
Figure 1: A neuron layer can be represented as a "simple block". The layer (a) can be represented as an array of output functions F -- in general non-linear -- coupled with a weight matrix A, as in (b). This layer can be considered as a black box and connected to other blocks. From the point of view of a block to which N is connected, the internal details of N are irrelevant, and all that is necessary to know is its input/output behavior.

form more complex networks. These networks, in turn, can also be regarded as blocks and connected together to form still more complex networks, in a recursive fashion. In the BFN model, blocks can be connected together using four elementary connections, dubbed the cascade, the sum, the split, and the feedback connection (Fig. 2). Each connection is made of a number of embedded blocks (marked N1 and N2 in Fig. 2) and a connection layer (the matrices A and B, and the function array F in Fig. 2). This set of connections is somewhat arbitrary, and it results from a compromise between the complexity of the model -- and, hence, of the learning procedure -- and its flexibility. In particular, the cascade connection is such that the Backpropagation network is included as a particular instance of BFN. The cascade and the feedback connections together provide enough architectural flexibility to make the BFN model a universal computer (Santini and Del Bimbo, a) and a Bayesian classifier (Santini and Del Bimbo, b). The sum and the split connections, on the other hand, are useful modelling tools that allow the designer to separate and merge parallel lines of processing.
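To make the block notation concrete, the following sketch (ours, not part of the original paper; all class and function names are hypothetical) shows how simple blocks and two of the elementary connections -- the cascade and the feedback -- could be expressed as composable objects. Only the forward, time-stepped behavior is shown; the training of each connection layer is derived in Section 3.

```python
import numpy as np

class Layer:
    """A 'simple block': a weight matrix A followed by an array of output functions F."""
    def __init__(self, n_in, n_out, f=np.tanh, rng=np.random.default_rng(0)):
        self.A, self.f = 0.1 * rng.standard_normal((n_out, n_in)), f
    def forward(self, x):
        return self.f(self.A @ x)

class Cascade:
    """Connection layer feeding an embedded block N1 (Fig. 2a)."""
    def __init__(self, layer, n1):
        self.layer, self.n1 = layer, n1
    def forward(self, x):
        return self.n1.forward(self.layer.forward(x))

class Feedback:
    """Connection layer (matrices A, B and output function f) followed by an embedded
    block N1, whose output is fed back through a unit delay (Fig. 2d)."""
    def __init__(self, A, B, n1, n_y, f=np.tanh):
        self.A, self.B, self.n1, self.f = A, B, n1, f
        self.v = np.zeros(n_y)                # v(t) = y(t-1), initially zero
    def forward(self, x):
        z = self.A @ x + self.B @ self.v      # activation of the connection layer
        y = self.n1.forward(self.f(z))        # o = f(z) is the input of the embedded block
        self.v = y                            # unit delay: store y(t) for the next step
        return y

# A nested-loop BFN: an inner feedback block embedded inside an outer feedback block,
# itself driven through a cascaded input layer.
rng = np.random.default_rng(1)
inner = Feedback(0.1 * rng.standard_normal((4, 4)), 0.1 * rng.standard_normal((4, 3)),
                 Layer(4, 3), n_y=3)
outer = Feedback(0.1 * rng.standard_normal((4, 2)), 0.1 * rng.standard_normal((4, 3)),
                 inner, n_y=3)
net = Cascade(Layer(2, 2), outer)
print(net.forward(np.array([0.5, -0.2])))     # one time step of the whole network
```

Because a Feedback object is itself a block, nesting one inside another directly yields the double-loop architectures used in the experiments of Section 4.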
3 The Training Algorithm

To exploit the flexibility of the block notation, we must design an equally flexible training algorithm. Since in the block notation networks are defined recursively by connecting predefined blocks, it seems natural to develop a BFN training algorithm that also operates recursively, expressing the training of a block in terms of the training of its embedded blocks. The algorithm is applied to the network starting from the innermost blocks (which are simple blocks) and going outwards to the blocks they are embedded into, until the whole network is trained. For each block, the algorithm updates the weights of the connection layer following a gradient descent method. In this paper, we will use the steepest descent method. The algorithm, however, can easily be generalized to more effective schemes
Figure 2: Elementary connections used to build neural network architectures: the cascade of blocks (a); the split (b); the sum (c); and the feedback (d). The blocks marked N1 and N2 are embedded blocks, while the shaded area, including the matrices A and (for the sum and the feedback) B, is the connection layer.

(see, e.g., (Shanno, 1990)). To compute the derivatives of the cost with respect to the weights of a given block, some knowledge of the input/output behavior of the embedded blocks is needed. For instance, for the network of Fig. 2a, Backpropagation requires, in order to update the weight matrix W, the derivatives of the cost with respect to all the inputs of the block N1 (Rumelhart et al., 1986). If L is an objective function for N1, and y_j and x_i are the outputs and inputs of N1, respectively, then it is possible to compute \(\partial L/\partial x_i\) if we know \(\partial y_j/\partial x_i\), by applying the chain rule:
\[
\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}.
\]
It will be seen in the following that this condition is also necessary for a block embedded in a feedback connection. We introduce the following definition:
Definition 1 A block with inputs x_i, outputs y_i, and weight matrix w_{ij} is trainable if, for each instant t and for each C^1 function L(y, w) of the outputs and of the weights, the quantities
\[
\frac{\partial^+ L}{\partial w_{ij}}, \qquad \frac{\partial^+ y_i}{\partial x_j}, \qquad i = 1,\dots,N_y,\; j = 1,\dots,N_x \qquad (1)
\]
can be determined, where \(\partial^+\) indicates ordered derivatives (see the appendix and (Werbos, 1974)).
The knowledge of \(\partial^+ L/\partial w_{ij}\) ensures that the weights inside the block can be trained using a gradient descent algorithm. The knowledge of \(\partial^+ y_i/\partial x_j\) ensures that the chain rule can be used to propagate the derivatives of the cost backward through the blocks. If all the blocks embedded in a connection are trainable, then it is possible to compute the derivative of the cost with respect to the weights in the connection layer, and therefore to apply gradient descent to the connection layer. Moreover, if we ensure that the whole block resulting from the connection also satisfies the same trainability condition, it will be possible to embed it into another block (via one of the elementary connections), allowing that block to be trained. In other words, the trainability condition represents all the knowledge we need about a block in order to apply a gradient descent algorithm, and all the knowledge we need about its input/output behavior to embed it into a BFN trained according to a gradient descent algorithm. In the following subsections, we develop methods to apply the gradient descent algorithm to the connection layer of each of the four elementary connections. In each case, we make no hypothesis about the embedded blocks other than that they satisfy the trainability conditions, and in each case we develop the algorithm so as to ensure that the overall block also satisfies the trainability conditions. The concept of ordered derivatives is central to the derivation of the algorithm. Ordered derivatives were introduced in (Werbos, 1974) and are reviewed in Appendix A.
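As an illustration of what the trainability condition asks of a block, the following sketch (ours, not the authors'; names are hypothetical) implements it for a simple block, i.e., a single tanh layer: besides its forward map, the block exposes the ordered derivatives of a cost with respect to its own weights (given the gradient with respect to its outputs) and the Jacobian of its outputs with respect to its inputs, which is exactly what an enclosing connection layer needs to apply the chain rule.

```python
import numpy as np

class SimpleBlock:
    """Single layer y = f(W x), with f = tanh."""
    def __init__(self, n_in, n_out, rng=np.random.default_rng(0)):
        self.W = 0.1 * rng.standard_normal((n_out, n_in))
    def forward(self, x):
        self.x, self.z = x, self.W @ x      # keep activation for the backward pass
        self.y = np.tanh(self.z)
        return self.y
    def grad_weights(self, dL_dy):
        """dL/dW_ij = dL/dy_i f'(z_i) x_j  (trainability condition, part 1)."""
        return (dL_dy * (1 - self.y**2))[:, None] * self.x[None, :]
    def jacobian(self):
        """J_ij = dy_i/dx_j = f'(z_i) W_ij  (trainability condition, part 2)."""
        return (1 - self.y**2)[:, None] * self.W

# A block satisfying the condition can be embedded: the gradient of L with respect to
# the inputs of the block is J^T (dL/dy), which the enclosing connection layer uses to
# continue the backward sweep.
blk = SimpleBlock(3, 2)
y = blk.forward(np.array([0.2, -0.1, 0.4]))
dL_dy = y - np.array([0.0, 1.0])            # gradient of a quadratic cost
dL_dx = blk.jacobian().T @ dL_dy            # propagated to the block inputs
dL_dW = blk.grad_weights(dL_dy)
```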
3.1 Training the cascade connection
In this subsection, we develop the training algorithm for the cascade connection. This connection is noteworthy since it contains, as a particular case, the Backpropagation architecture, which is obtained by cascading two simple blocks. Consider the cascade connection of Fig. 3, where x, z, and o have sizes N_x, N_z, and N_o, respectively.
Figure 3: Cascade connection. The symbols reported in the figure are those used in the text for the derivation of the learning algorithm.

Let an objective function L(o, A) be given. We need to compute the ordered derivatives of this function with respect to the weights of the connection layer. To this end, the variables in Fig. 3 are ordered as
\[
[\, x \mid y \mid z \mid o \,] \qquad (2)
\]
so that functional dependencies are all "forward directed", according to the requirements of eq. (60) in the appendix.
Let the block N1 be trainable. From the expression for L, it is possible to compute
\[
\frac{\partial L}{\partial o_i}, \qquad i = 1,\dots,N_o. \qquad (3)
\]
Since N1 is trainable, we know the quantities
\[
\frac{\partial^+ o_k}{\partial z_h}. \qquad (4)
\]
Then, using the chain rule for ordered derivatives, it is possible to compute
\[
\frac{\partial^+ L}{\partial z_j} = \sum_{k=1}^{N_o} \frac{\partial L}{\partial o_k}\,\frac{\partial^+ o_k}{\partial z_j}. \qquad (5)
\]
Going backward, we can use these quantities to compute
\[
\frac{\partial^+ L}{\partial y_i} = \frac{\partial^+ L}{\partial z_i}\, f'(y_i) \qquad (6)
\]
and
\[
\frac{\partial^+ L}{\partial a_{ij}} = \frac{\partial^+ L}{\partial y_i}\, x_j. \qquad (7)
\]
Using the last equation, it is possible to make a single step of the update of the weights a_{ij} in the matrix A. If we use the steepest descent method, then we have
\[
\Delta a_{ij} = -\eta\, \frac{\partial^+ L}{\partial a_{ij}}. \qquad (8)
\]
These equations ensure that the weights of the connection layer can be adapted to minimize the cost. However, if the resulting block is to be included as an embedded block in another connection, we must ensure that all the conditions prescribed by the definition of trainability are satisfied. That is, we must compute
\[
\frac{\partial^+ o_i}{\partial x_j}. \qquad (9)
\]
We can use eq. (4) and the chain rule for ordered derivatives, obtaining
\[
J^{N}_{ij} = \frac{\partial^+ o_i}{\partial x_j} = \sum_{k=1}^{N_z} \frac{\partial^+ o_i}{\partial z_k}\, f'(y_k)\, a_{kj}. \qquad (10)
\]
Eqs. (8) and (10) make up a step of the learning process for the connection layer of Fig. 3.
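The following sketch (a minimal illustration, assuming a quadratic cost, tanh output functions, and a steepest-descent step of size eta; function names are not from the paper) performs one update of the cascade connection layer along eqs. (5)-(8) and returns the Jacobian (10) that makes the resulting block trainable in turn.

```python
import numpy as np

def cascade_step(A, x, n1_forward, n1_jacobian, target, eta=0.05):
    """One steepest-descent step for the cascade connection layer.
    n1_forward: z -> o; n1_jacobian: z -> (N_o, N_z) matrix of do_k/dz_h."""
    y = A @ x                          # connection-layer activation
    z = np.tanh(y)                     # connection-layer output, input of N1
    o = n1_forward(z)                  # output of the embedded block
    J1 = n1_jacobian(z)                # do/dz, available since N1 is trainable
    dL_do = o - target                 # quadratic cost  L = 0.5 ||o - target||^2
    dL_dz = J1.T @ dL_do               # eq. (5)
    dL_dy = dL_dz * (1.0 - z**2)       # eq. (6): f'(y) = 1 - tanh(y)^2 = 1 - z^2
    dL_dA = np.outer(dL_dy, x)         # eq. (7)
    A_new = A - eta * dL_dA            # eq. (8)
    J_block = (J1 * (1.0 - z**2)) @ A  # eq. (10): do_i/dx_j of the whole block
    return A_new, J_block
```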
Figure 4: The sum connection with the symbols used for the derivation of the algorithm.
3.2 Learning the split connection
Let x, z, and y be the input, activation, and output of the connection layer, respectively, and let y^1, y^2 be the outputs of the blocks N1 and N2 (Fig. 2b). Then
\[
J^1_{ij} \stackrel{\mathrm{def}}{=} \frac{\partial y^1_j}{\partial y_i} \qquad (11)
\]
and
\[
J^2_{ij} \stackrel{\mathrm{def}}{=} \frac{\partial y^2_j}{\partial y_i} \qquad (12)
\]
can be computed, since we make the hypothesis that N1 and N2 are trainable. From these we obtain
\[
\frac{\partial^+ L}{\partial y_i} = \sum_{k=1}^{N_{y^1}} \frac{\partial^+ L}{\partial y^1_k}\,\frac{\partial y^1_k}{\partial y_i}
+ \sum_{k=1}^{N_{y^2}} \frac{\partial^+ L}{\partial y^2_k}\,\frac{\partial y^2_k}{\partial y_i}. \qquad (13)
\]
The equation to update the weights w_{ij} of the connection layer is
\[
\frac{\partial^+ L}{\partial w_{ij}} = \frac{\partial^+ L}{\partial y_i}\, f'(z_i)\, x_j, \qquad (14)
\]
while the trainability condition of the overall block is ensured by the derivative matrix
\[
J \stackrel{\mathrm{def}}{=} \bigl[\, J^A \mid J^B \,\bigr] \qquad (15)
\]
with
\[
J^A_{ij} \stackrel{\mathrm{def}}{=} \frac{\partial^+ y^1_j}{\partial x_i} = \sum_k J^1_{kj}\, f'(z_k)\, w_{ki} \qquad (16)
\]
and
\[
J^B_{ij} \stackrel{\mathrm{def}}{=} \frac{\partial^+ y^2_j}{\partial x_i} = \sum_k J^2_{kj}\, f'(z_k)\, w_{ki}. \qquad (17)
\]
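A compact sketch of the split-connection computations, under the same assumptions as before (tanh output functions, quadratic steepest descent, illustrative names): it applies eqs. (13)-(14) to update the connection layer and assembles the Jacobian (15)-(17) of the overall block.

```python
import numpy as np

def split_step(W, x, J1, J2, dL_dy1, dL_dy2, eta=0.05):
    """J1[i, j] = dy1_j / dy_i and J2[i, j] = dy2_j / dy_i (trainability of N1, N2)."""
    z = W @ x
    y = np.tanh(z)
    fprime = 1.0 - y**2
    dL_dy = J1 @ dL_dy1 + J2 @ dL_dy2        # eq. (13)
    dL_dW = np.outer(dL_dy * fprime, x)      # eq. (14)
    W_new = W - eta * dL_dW
    # Trainability of the overall block: J = [J_A | J_B]  (eqs. (15)-(17))
    J_A = (fprime[:, None] * W).T @ J1       # dy1_j / dx_i
    J_B = (fprime[:, None] * W).T @ J2       # dy2_j / dx_i
    return W_new, np.hstack([J_A, J_B])
```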
3.3 Training the sum connection
We consider the sum connection of Fig. 4. Let x^1 and y^1 be the input and output vectors of block N1, and x^2 and y^2 be the input and output vectors of block N2. Moreover, let
\[
J^1_{ij} = \frac{\partial y^1_j}{\partial x^1_i} \qquad (18)
\]
and
\[
J^2_{ij} = \frac{\partial y^2_j}{\partial x^2_i}. \qquad (19)
\]
Finally, let x be the input of the connection layer, z its activation vector, and y its output vector. Since the connection layer is placed after the two embedded blocks, its training is carried out regardless of the presence of N1 and N2, as
\[
\frac{\partial^+ L}{\partial w_{ij}} = \frac{\partial^+ L}{\partial y_i}\, f'(z_i)\, x_j \qquad (20)
\]
and
\[
\frac{\partial^+ L}{\partial x_j} = \sum_i \frac{\partial^+ L}{\partial y_i}\, f'(z_i)\, w_{ij}. \qquad (21)
\]
The input of the overall block is made up of the vector
\[
q = \bigl[\, x^{1T} \mid x^{2T} \,\bigr]^T \qquad (22)
\]
and the corresponding Jacobian matrix is
\[
J = \begin{bmatrix} J^A \\ J^B \end{bmatrix}
\]
with
\[
J^A_{ik} = \frac{\partial^+ y_k}{\partial x^1_i} = \sum_j J^1_{ij}\,\frac{\partial^+ y_k}{\partial x_j} = \sum_j J^1_{ij}\, f'(z_k)\, w_{kj} \qquad (23)
\]
and
\[
J^B_{ik} = \frac{\partial^+ y_k}{\partial x^2_i} = \sum_j J^2_{ij}\,\frac{\partial^+ y_k}{\partial x_j} = \sum_j J^2_{ij}\, f'(z_k)\, w_{kj}, \qquad (24)
\]
where, in (23) and (24), the index j ranges over the components of x that are fed by y^1 and by y^2, respectively.
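Analogously, a sketch of the sum-connection step (again with tanh output functions and illustrative names): it applies eqs. (20)-(21) and stacks the contributions of N1 and N2 into the Jacobian (22)-(24) of the overall block. The gradient (21) with respect to the connection-layer input splits into the parts pertaining to y^1 and y^2, which is what the embedded blocks need for their own training steps.

```python
import numpy as np

def sum_step(W, y1, y2, J1, J2, dL_dy, eta=0.05):
    """J1[i, j] = dy1_j / dx1_i and J2[i, j] = dy2_j / dx2_i (trainability of N1, N2)."""
    x = np.concatenate([y1, y2])             # input of the connection layer
    z = W @ x
    y = np.tanh(z)
    fprime = 1.0 - y**2
    dL_dW = np.outer(dL_dy * fprime, x)      # eq. (20)
    dL_dx = W.T @ (dL_dy * fprime)           # eq. (21): gradient passed back to N1 and N2
    W_new = W - eta * dL_dW
    # eqs. (23)-(24): dy_k / dx_j = f'(z_k) w_kj, chained through J1 and J2
    dy_dx = fprime[:, None] * W              # shape (N_y, N_x)
    n1 = y1.size
    J_A = J1 @ dy_dx[:, :n1].T               # dy_k / dx1_i
    J_B = J2 @ dy_dx[:, n1:].T               # dy_k / dx2_i
    return W_new, np.vstack([J_A, J_B]), dL_dx
```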
Figure 5: Feedback block with the symbols used for the derivation of the algorithm.
3.4 Training the feedback connection
The feedback connection (Fig. 5) is a key element for building dynamic networks. We suppose that N1 is an arbitrary neural network, subject to the restriction of being trainable. To apply the chain rule for ordered derivatives, the block variables must be ordered so that the form (60) can be derived. To this end, the input and the output of the delay block are considered as different entities. Since v(t) = y(t-1), there is no direct influence of y(t) on v(t), while there is an influence of v(t) on y(t). Thus, the vector q(t) contains all the block variables at time t ordered in the following way:
\[
q(t) = [\, x(t) \mid v(t) \mid A \mid B \mid z(t) \mid o(t) \mid y(t) \,]. \qquad (25)
\]
In the following, for ease of notation, the dependency on t will be omitted unless required to emphasize temporal relationships. If N1 is trainable, we can compute
\[
J^{N1}_{ij} \stackrel{\mathrm{def}}{=} \frac{\partial^+ y_j}{\partial o_i}(t), \qquad i = 1,\dots,N_o,\; j = 1,\dots,N_y, \qquad (26)
\]
where N_o and N_y are the dimensions of o and y, respectively. To update the elements of matrix B, it is necessary to compute
\[
\frac{\partial^+ L}{\partial b_{ij}} = \sum_{k=1}^{N_z} \frac{\partial^+ L}{\partial z_k}\,\frac{\partial z_k}{\partial b_{ij}}
= \bigl[\nabla^+_z L\bigr]^T \nabla_{b_{ij}} z, \qquad (27)
\]
with
\[
\nabla^+_z L \stackrel{\mathrm{def}}{=} \left[ \frac{\partial^+ L}{\partial z_1}, \frac{\partial^+ L}{\partial z_2}, \dots, \frac{\partial^+ L}{\partial z_{N_z}} \right]^T \qquad (28)
\]
and
\[
\nabla_{b_{ij}} z \stackrel{\mathrm{def}}{=} \left[ \frac{\partial z_1}{\partial b_{ij}}, \frac{\partial z_2}{\partial b_{ij}}, \dots, \frac{\partial z_{N_z}}{\partial b_{ij}} \right]^T, \qquad (29)
\]
where N_z = N_o is the dimension of z.
To determine \(\nabla^+_z L\) and \(\nabla_{b_{ij}} z\), we proceed as follows. The ordered derivative of L with respect to o is
\[
\frac{\partial^+ L}{\partial o_i} = \sum_{k=1}^{N_y} \frac{\partial^+ L}{\partial y_k}\,\frac{\partial y_k}{\partial o_i}
= \sum_{k=1}^{N_y} \frac{\partial^+ L}{\partial y_k}\, J^{N1}_{ik}; \qquad (30)
\]
thus, the (ordered) gradient of L with respect to o is
\[
\nabla^+_o L = J^{N1}\, \nabla^+_y L. \qquad (31)
\]
The ordered derivative of L with respect to the generic element of z can now be computed as
\[
\frac{\partial^+ L}{\partial z_i} = \sum_{k=1}^{N_o} \frac{\partial^+ L}{\partial o_k}\,\frac{\partial o_k}{\partial z_i}
= f'\bigl(z_i(t)\bigr)\,\frac{\partial^+ L}{\partial o_i}. \qquad (32)
\]
Thus, it holds
\[
\nabla^+_z L = F'[z(t)]\, J^{N1}\, \nabla^+_y L, \qquad (33)
\]
where \(F'[z(t)] = \mathrm{diag}\bigl( f'(z_1(t)), \dots, f'(z_{N_o}(t)) \bigr)\).

In the following we will use the shorthands \(F'[t] \stackrel{\mathrm{def}}{=} F'[z(t)]\) and \(f'_j[t] \stackrel{\mathrm{def}}{=} f'(z_j(t))\). We must still compute \(\nabla_{b_{ij}} z\). To obtain it, we expand the derivative of z with respect to the b_{ij}'s:
\[
\frac{\partial z_k}{\partial b_{ij}} = \frac{\partial}{\partial b_{ij}} \left[ \sum_{h=1}^{N_y} b_{kh} v_h \right]
= \delta_{ki}\, v_j + \sum_{h=1}^{N_y} b_{kh}\, \frac{\partial^+ v_h}{\partial b_{ij}}, \qquad (34)
\]
where \(\delta_{ki}\) is the Kronecker delta (not to be confused with Rumelhart's delta, which has only one index). Since v(t) = y(t-1), it holds that
\[
\frac{\partial^+ v_r}{\partial b_{ij}}(t) = \frac{\partial^+ y_r}{\partial b_{ij}}(t-1)
= \sum_{s=1}^{N_o} \frac{\partial^+ y_r}{\partial o_s}(t-1)\, \frac{\partial^+ o_s}{\partial b_{ij}}(t-1). \qquad (35)
\]
Moreover, the derivative of o_s with respect to b_{ij} is related to the derivative of z by
\[
\frac{\partial^+ o_s}{\partial b_{ij}} = f'_s[t]\, \frac{\partial z_s}{\partial b_{ij}}. \qquad (36)
\]
This yields
\[
\frac{\partial^+ v_r}{\partial b_{ij}}(t) = \sum_{s=1}^{N_o} \frac{\partial^+ y_r}{\partial o_s}(t-1)\, f'_s[t-1]\, \frac{\partial z_s}{\partial b_{ij}}(t-1). \qquad (37)
\]
Thus eq. (34) becomes
\[
\frac{\partial z_k}{\partial b_{ij}}(t) = \delta_{ki}\, v_j(t) + \sum_{r=1}^{N_y} b_{kr} \sum_{s=1}^{N_o} \frac{\partial^+ y_r}{\partial o_s}(t-1)\, f'_s[t-1]\, \frac{\partial z_s}{\partial b_{ij}}(t-1). \qquad (38)
\]
In matrix form, eq. (37) can be rewritten as
\[
\nabla_{b_{ij}} v(t) = \bigl[ J^{N1}(t-1) \bigr]^T F'[t-1]\, \nabla_{b_{ij}} z(t-1), \qquad (39)
\]
and the recursive equation for \(\nabla_{b_{ij}} z(t)\) can be written as
\[
\nabla_{b_{ij}} z(t) = \bigl( e_j^T v(t) \bigr)\, \epsilon_i + B \bigl[ J^{N1}(t-1) \bigr]^T F'[t-1]\, \nabla_{b_{ij}} z(t-1), \qquad (40)
\]
where e_j is the j-th element of the natural basis of \(\mathbb{R}^{N_y}\) and \(\epsilon_i\) is the i-th element of the natural basis of \(\mathbb{R}^{N_z}\). The derivative of L with respect to the element b_{ij} of B at time t can thus be expressed as
\[
\frac{\partial^+ L}{\partial b_{ij}}(t) = \bigl[ F'[t]\, J^{N1}(t)\, \nabla^+_y L(t) \bigr]^T \nabla_{b_{ij}} z(t), \qquad (41)
\]
with \(\nabla_{b_{ij}} z(t)\) given by (40). With a similar procedure, we find \(\partial^+ L/\partial a_{ij}(t)\) as
\[
\frac{\partial^+ L}{\partial a_{ij}}(t) = \bigl[ F'[t]\, J^{N1}(t)\, \nabla^+_y L(t) \bigr]^T \nabla_{a_{ij}} z(t), \qquad (42)
\]
with \(\nabla_{a_{ij}} z(t)\) given by
\[
\nabla_{a_{ij}} z(t) = \bigl( e_j^T x(t) \bigr)\, \epsilon_i + B \bigl[ J^{N1}(t-1) \bigr]^T F'[t-1]\, \nabla_{a_{ij}} z(t-1). \qquad (43)
\]
Equations (41) and (42) allow the weights of the matrices A and B to be updated. If the resulting block is to be trainable according to our previous definition, we must also be able to propagate the derivatives of the outputs through the whole block N, that is, we must be able to compute the quantities
\[
J^N_{ij} \stackrel{\mathrm{def}}{=} \frac{\partial^+ y_j}{\partial x_i}(t), \qquad i = 1,\dots,N_x,\; j = 1,\dots,N_y. \qquad (44)
\]
This can be written as
\[
J^N_{ik} = \frac{\partial^+ y_k}{\partial x_i} = \sum_{h=1}^{N_o} \frac{\partial^+ y_k}{\partial o_h}\, f'_h[t]\, \frac{\partial z_h}{\partial x_i}
= \sum_{h=1}^{N_o} J^{N1}_{hk}\, f'_h[t]\, K_{ih}, \qquad (45)
\]
with
\[
K_{ij}(t) = \frac{\partial z_j}{\partial x_i}(t), \qquad j = 1,\dots,N_z,\; i = 1,\dots,N_x, \qquad (46)
\]
or, in matrix form,
\[
J^N(t) = K(t)\, F'[t]\, J^{N1}(t). \qquad (47)
\]
Since \(J^{N1}(t)\) and \(F'[t]\) are known, we only need to derive an expression for K(t). To this end, we have
\[
\frac{\partial z_k}{\partial x_i} = \frac{\partial}{\partial x_i} \left[ \sum_{r=1}^{N_x} a_{kr} x_r + \sum_{r=1}^{N_y} b_{kr} v_r \right]
= a_{ki} + \sum_{r=1}^{N_y} b_{kr}\, \frac{\partial^+ v_r}{\partial x_i}. \qquad (48)
\]
Since
\[
\frac{\partial^+ v_r}{\partial x_i}(t) = \frac{\partial^+ y_r}{\partial x_i}(t-1)
= \sum_{h=1}^{N_o} \frac{\partial y_r}{\partial o_h}(t-1)\, \frac{\partial o_h}{\partial x_i}(t-1)
= \sum_{h=1}^{N_o} \frac{\partial^+ y_r}{\partial o_h}(t-1)\, f'_h[t-1]\, \frac{\partial z_h}{\partial x_i}(t-1), \qquad (49)
\]
we have
\[
\frac{\partial z_k}{\partial x_i}(t) = a_{ki} + \sum_{r=1}^{N_y} b_{kr} \sum_{h=1}^{N_o} J^{N1}_{hr}(t-1)\, f'_h[t-1]\, \frac{\partial z_h}{\partial x_i}(t-1) \qquad (50)
\]
\[
= a_{ki} + \sum_{r=1}^{N_y} b_{kr}\, J^N_{ir}(t-1). \qquad (51)
\]
Eq. (51) can be rewritten in terms of K as:
\[
K(t) = A^T + J^N(t-1)\, B^T. \qquad (52)
\]
A recursive equation, in terms of \(J^N\) only, can be derived from (45) and (51):
\[
J^N_{ik}(t) = \sum_{h=1}^{N_o} J^{N1}_{hk}(t)\, f'_h[t] \left( a_{hi} + \sum_{r=1}^{N_y} b_{hr}\, J^N_{ir}(t-1) \right), \qquad (53)
\]
or, in matrix form,
\[
J^N(t) = \bigl[ A^T + J^N(t-1)\, B^T \bigr]\, F'[t]\, J^{N1}(t). \qquad (54)
\]
This completes the training of the feedback connection at time t. With the block notation, the fully connected network of Williams and Zipser (Williams and Zipser, 1989b) can easily be obtained as a single-layer feedback block, i.e., as the block in Fig. 5 with the embedded block N1 removed. In this case, the matrix \(J^{N1}\) reduces to the identity, and the equations presented here are equivalent (after a change in notation) to those presented in (Williams and Zipser, 1989b).
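The following sketch collects the feedback-connection step in code (ours, not the authors'; a quadratic cost, tanh output functions, and a single fixed tanh layer standing in for N1 are assumed so that the example is self-contained). It keeps the sensitivity vectors \(\nabla_{a_{ij}} z\) and \(\nabla_{b_{ij}} z\) of eqs. (40) and (43) as arrays Pa and Pb, updates A and B by eqs. (41)-(42), and propagates the block Jacobian of eq. (54). Only the connection layer is updated, consistently with the recursive scheme of the algorithm.

```python
import numpy as np

class FeedbackBlock:
    def __init__(self, n_x, n_z, n_y, eta=0.05, rng=np.random.default_rng(0)):
        self.A = 0.1 * rng.standard_normal((n_z, n_x))    # connection-layer input weights
        self.B = 0.1 * rng.standard_normal((n_z, n_y))    # connection-layer feedback weights
        self.C = 0.1 * rng.standard_normal((n_y, n_z))    # weights of the embedded block N1 (kept fixed here)
        self.v = np.zeros(n_y)                            # v(t) = y(t-1)
        self.Pa = np.zeros((n_z, n_x, n_z))               # Pa[i, j, :] = grad_{a_ij} z
        self.Pb = np.zeros((n_z, n_y, n_z))               # Pb[i, j, :] = grad_{b_ij} z
        self.JN = np.zeros((n_x, n_y))                    # block Jacobian J^N, eq. (54)
        self.JN1_prev = np.zeros((n_z, n_y))              # J^{N1}(t-1)
        self.fp_prev = np.zeros(n_z)                      # f'[t-1]
        self.eta = eta

    def step(self, x, target):
        z = self.A @ x + self.B @ self.v                  # connection-layer activation
        o = np.tanh(z); fp = 1.0 - o ** 2                 # o = f(z), and f'[t]
        y = np.tanh(self.C @ o)                           # output of the embedded block N1
        JN1 = (1.0 - y ** 2)[None, :] * self.C.T          # J^{N1}_{hk} = dy_k / do_h

        # eqs. (40) and (43): recursions for grad_{b_ij} z(t) and grad_{a_ij} z(t)
        M = self.B @ (self.JN1_prev.T * self.fp_prev[None, :])
        idx = np.arange(z.size)
        self.Pb = np.einsum('kl,ijl->ijk', M, self.Pb); self.Pb[idx, :, idx] += self.v[None, :]
        self.Pa = np.einsum('kl,ijl->ijk', M, self.Pa); self.Pa[idx, :, idx] += x[None, :]

        # eq. (54): Jacobian of the whole block, needed if it is embedded elsewhere
        self.JN = (self.A.T + self.JN @ self.B.T) @ (fp[:, None] * JN1)

        # eqs. (41)-(42): steepest-descent step on B and A for a quadratic cost
        g = fp * (JN1 @ (y - target))                     # = grad^+_z L, eq. (33)
        self.B -= self.eta * np.einsum('k,ijk->ij', g, self.Pb)
        self.A -= self.eta * np.einsum('k,ijk->ij', g, self.Pa)

        self.v, self.JN1_prev, self.fp_prev = y, JN1, fp  # advance the unit delay
        return y

# Example: one pass over a short training sequence.
blk = FeedbackBlock(n_x=2, n_z=4, n_y=1)
for t in range(100):
    u = np.array([np.sin(0.1 * t), np.cos(0.1 * t)])
    blk.step(u, target=np.array([np.sin(0.1 * (t + 1))]))
```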
4 Application Examples of Feedback Networks

In this section we present three examples of use of BFNs trained with the proposed algorithm. The first two examples address prediction problems. The first requires the prediction of the next sample in a short periodic sequence; the sequence is such that the last two samples are necessary for prediction. The second problem is the prediction of the output of a non-linear dynamic system. The third example shows the utility of dynamic networks for pattern classification when the training set and the test set are corrupted by noise. In this case, the collection of evidence from multiple presentations of the same pattern can improve estimation robustness. In the last two experiments we use two different BFN architectures: the first with a single feedback loop and the second with two nested loops. This allows us to analyze the possible impact on performance of the double-loop configuration. All the network architectures presented in this section were determined by trial-and-error and common sense. So far there are no sound criteria for the choice of BFN architectures and of the number of neurons in the layers.
4.1 Prediction of a simple sequence
The test bed for the first example is shown in Fig. 6. A series of 12 points, making up an "8" shape, is presented to the network in a given order.

Figure 6: Test bed for the next-sample prediction experiment.

For every point, the network is asked to predict the following point. This task cannot be accomplished by a static network, since the point at coordinates (0, 0) has two successors: point 5 and point 11. The network must decide the successor of (0, 0) based on its predecessor: if the predecessor is point 3, then the successor is point 5, while if the predecessor is point 9, the successor is point 11. We used the simple network in Fig. 7.

Figure 7: The network designed to predict the "8" figure.

The network contains two input neurons, which are activated with the two point coordinates, and two output neurons, whose
values represent the two coordinates of the predicted point. The network has two hidden layers with four neurons each. All the layers but the output have logistic non-linearities; the output layer has a linear output function. The network was trained by minimizing a quadratic cost. The training was run for approximately 20,000 epochs. The predictions of the network at 900, 5,000, and 20,000 training epochs are reported in Fig. 8.
Figure 8: Results of the prediction after 900, 5,000 and 20,000 training epochs.
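The actual coordinates of the 12 points of Fig. 6 are not listed in the paper; the short sketch below builds a hypothetical figure-eight sequence with the same key property -- the point (0, 0) occurs twice, with two different successors -- which is what makes the task impossible for a static network.

```python
import numpy as np

# Hypothetical 12-point figure-eight sequence (a lemniscate-like loop); the paper's
# actual point positions and numbering differ, only the crossing property matters.
t = np.arange(12) * 2 * np.pi / 12
eight = np.stack([np.sin(2 * t), np.sin(t)], axis=1).round(3)
inputs, targets = eight, np.roll(eight, -1, axis=0)             # predict the next point

# The crossing: two different rows of `inputs` are (0, 0) but have different targets.
crossings = np.where(np.all(np.isclose(inputs, 0.0), axis=1))[0]
print(crossings, targets[crossings])
```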
4.2 Non-linear system prediction
As a second experiment, we use two BFN architectures to predict the output of a non-linear system. The system we use for prediction is one of those presented in (Narendra and Parthasarathy, 1990b). It is non-linear and defined by the second-order difference equation
\[
y(k+1) = f[\, y(k), y(k-1) \,] + u(k), \qquad (55)
\]
where
\[
f[\, y(k), y(k-1) \,] = \frac{y(k)\, y(k-1)\, [\, y(k) + 2.5 \,]}{1 + y^2(k) + y^2(k-1)}.
\]
The input of the network is made up of the current system input and output. At every time instant, the network is requested to predict the output of the system two time steps later. The two network architectures in Fig. 9 have been used.

Figure 9: Two distinct network architectures used for the non-linear system output prediction.

The first network is made up of four layers with 2, 2, 2, 1 neurons, and a feedback loop originating and terminating in layer 3. The second network is identical to the first except for a second feedback loop from layer 3 to layer 2. As in the previous experiment, in both cases all the layers but the output have logistic non-linearities, and the output layer has a linear output function. During the training of both networks, the dynamical system was driven with three sine waves of different frequencies. Learning took about 8,000 iterations for the single-loop network and about 20,000 iterations for the double-loop network. Figs. 10 and 11 show the results of the network predictions when the dynamic system is driven with a sinusoid of a frequency different from those used for training. It is worth noticing that in this example the double loop in the network of Fig. 9b results in a sharper response.
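The plant of eq. (55) is easy to simulate; the sketch below generates training and test signals for it. The actual sine-wave frequencies and run lengths used in the experiment are not given in the paper, so the values here are arbitrary placeholders.

```python
import numpy as np

def f(y0, y1):
    """Non-linearity of eq. (55)."""
    return y0 * y1 * (y0 + 2.5) / (1.0 + y0**2 + y1**2)

def simulate(u, y0=0.0, y1=0.0):
    """Iterate y(k+1) = f[y(k), y(k-1)] + u(k) from the given initial conditions."""
    y = [y0, y1]
    for k in range(1, len(u)):
        y.append(f(y[k], y[k - 1]) + u[k])
    return np.array(y[:len(u)])                       # aligned with the input sequence

k = np.arange(500)
u_train = sum(np.sin(2 * np.pi * fr * k) for fr in (0.01, 0.025, 0.04))  # three sines
u_test = np.sin(2 * np.pi * 0.017 * k)                # a frequency not used for training
y_train, y_test = simulate(u_train), simulate(u_test)
# The network input is the current plant input and output, and the target is the plant
# output two steps ahead, as described in the text.
```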
4.3 Experiments in Noise Immunity
The most common application of recurrent networks is in connection with dynamic processes. However, the possibility of collecting evidence from successive presentations of the same input makes the BFN a useful tool for the classification of patterns corrupted by noise. The results presented in this section compare the outputs of two network architectures -- including one and two feedback blocks -- with a reference feedforward architecture. Experiments are made in the presence of noisy training samples.
Figure 10: Prediction of the non-linear system output. Single feedback network result.
Figure 11: Prediction of the non-linear system output. Double feedback network result.

The experiments used the ubiquitous T-C patterns (Fig. 12). Both patterns are made up of 3×3 pixels, and both have the same number of black pixels. One pattern per frame is placed on an 8×8 "board". Patterns are moved between successive frames by 1 pixel diagonally, bouncing at the borders of the board. Fig. 13 shows 5 successive frames in which the pattern bounces at the right border of the board. Between frames 4 and 5, the pattern changes from "T" to "C". The network is asked to determine which pattern is currently placed on the board, regardless of its position.

Figure 12: The T and C patterns used for the classification experiment.

Every pattern presentation is built as follows. The board is initialized at gray level 0 (corresponding, in the illustrations, to white). A pattern is then superimposed in the appropriate position, with all the "on" pixels set to a level k. Then we add to all the pixels a Gaussian noise with variance \(\sigma^2 k^2\), and clip the resulting gray level to the interval [0, 2k]. The result is an image with 2k gray levels in which the "on" pixels have an average gray level of k.

Figure 13: A "T" pattern moves and bounces on the board, eventually becoming a "C".

Changes in the pattern presented to the network are determined by a Poisson process. If a given pattern is presented in a frame, then the probability that the same pattern will be presented in the next frame is 1 - λ, and the probability that the pattern will change is λ. Pattern switches make up a Poisson process with constant λ, the average time between two switches -- i.e., the average time a pattern remains on the board -- being equal to 1/λ. The parameters of the process are (x, y, p), where x, y is the position of the upper-left corner of the pattern and p ∈ {T, C} is the type of the pattern. The parameters x and y are deterministic, while p is a stochastic process. For the experiments reported in this section we had k = 8 and λ = 0.05.

The network architectures used for the experiments are depicted in Fig. 14. All the networks have the same number of layers (4) and the same number of neurons per layer. We compare a feedforward network and two BFNs. The first BFN has a feedback loop around the second layer, while the second BFN has two feedback loops: one around the second layer and the other from the second to the first layer. In the output layer, one neuron is activated by pattern T, while the other is activated by pattern C.

Figure 14: Architectures of the networks for the T and C classification. They all have the same number of layers and of neurons, although they have different feedback structures. Dashed lines mark block boundaries.

Training is performed by generating pattern sequences with the prescribed probability distribution, corrupted with Gaussian noise with the prescribed σ. For all the networks, training is stopped when the average cost, computed over the last 100 samples, falls below \(5 \cdot 10^{-4}\). For testing, similar moving patterns are generated, also corrupted with Gaussian noise of a given variance \(\sigma^2 k^2\). We test several values of σ from 0 to 1 -- corresponding to a signal-to-noise ratio varying from infinity to 0 dB. For each value of σ, the network is required to perform 10,000 classifications, and the classification results are obtained by averaging over these presentations. The results presented take into account two distinct figures of performance:

1. The fraction of correctly recognized patterns. A pattern presented in input is considered correctly recognized when the corresponding neuron has a value greater than that of the other neuron; i.e., if o1 and o2 are the actual network outputs and t1, t2 the expected outputs, when (o1 - o2)(t1 - t2) > 0.

2. The average classification error (o1 - t1)^2 + (o2 - t2)^2: this represents a quality measure for the classification, unlike the previous figure, which only considers the relative ordering of the network outputs.

Classification results are shown in Figs. 15 and 16. In these experiments, the networks were trained with a training set corrupted by a Gaussian noise with variance \(\sigma_t = 0.7\), corresponding to a signal-to-noise ratio of 1.55 dB. Fig. 15 shows the number of correctly recognized samples for the three network architectures of Fig. 14 as a function of the variance of the noise in the test set. Note that, especially for a low noise level in the test samples, both BFNs perform significantly better than the feedforward network. In particular, the feedforward network appears to be almost completely fooled, since its fraction of recognized samples is around 0.5, which is the value we would expect from a random classification. On the other hand, both BFNs show fractions of recognized samples between 0.6 and 0.7 for low noise in the test set. Note also that this superiority holds until the σ in the test set reaches the value of σ used in the training set. When exposed to exceedingly high noise, all three networks are fooled. The double-loop network appears to behave better than the single-loop one. Fig. 16 shows the corresponding classification error, which has a behavior dual to that of the fraction of recognized patterns.
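For reference, the two figures of performance just described amount to the following computations (an illustrative sketch; the arrays of outputs and targets would come from the 10,000 test presentations):

```python
import numpy as np

def fraction_correct(o, t):
    """Fraction of correctly recognized patterns: (o1 - o2)(t1 - t2) > 0.
    o, t: arrays of shape (n_presentations, 2)."""
    return np.mean((o[:, 0] - o[:, 1]) * (t[:, 0] - t[:, 1]) > 0)

def average_error(o, t):
    """Average classification error (o1 - t1)^2 + (o2 - t2)^2."""
    return np.mean(np.sum((o - t) ** 2, axis=1))

# Example with synthetic outputs, only to show the intended call:
rng = np.random.default_rng(0)
t = np.eye(2)[rng.integers(0, 2, size=10000)]          # one-hot targets (T or C)
o = t + 0.5 * rng.standard_normal(t.shape)             # noisy "network" outputs
print(fraction_correct(o, t), average_error(o, t))
```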
Figure 15: Fraction of correctly classified patterns for a feedforward network and two feedback networks with different feedback structures, as a function of the input noise variance, for a training-set noise variance \(\sigma_t = 0.7\).
Figure 16: Average classification error for the feedforward and feedback networks as a function of the input noise variance, for a training-set noise variance \(\sigma_t = 0.7\).
Please note that, in all cases, all the networks exhibit rather poor behavior as classifiers. The experiments were purposely designed to exhibit such poor behavior. In particular, the input patterns, in spite of their simplicity, are relatively hard to classify, as noted in (Rumelhart and McClelland, 1987); the noise level is very high; and the network size is very limited. This was required in order to emphasize the differences between the three models.
5 Conclusions

This paper presents a block description for multi-layer discrete-time networks with unit-delay connections between layers. Networks are built by connecting blocks via a set of four "elementary connections": the cascade, the feedback, the split, and the sum connection. We derived an algorithm for this class of networks which adapts the connection layer of each block. The whole network can be trained by recursively applying the algorithm, starting from the innermost blocks and going outwards to the blocks they are embedded into. The algorithm is a block-matrix version of Forward Propagation, to which it reduces in the case of a single layer with feedback. In the presence of cascade-connected blocks with no feedback, the algorithm reduces to Backpropagation. As in Backpropagation, the algorithm computes the complete gradient of the cost. It is non-local, since the change in each weight depends on every weight in the connection layer. The computational requirements for the training of the whole network strongly depend on the network architecture and are definitely high when blocks with feedback connections of large size are used. The block framework and the corresponding block-matrix training algorithm provide a unified framework in which static and dynamic multi-layer network architectures can be represented and trained. Networks of almost any complexity can be defined, allowing the designer to more easily adapt the network architecture to the problem. The main limitation on the architectures that can be expressed in this notation is the impossibility of defining feedback paths that cross one another. We have reported examples in which network architectures of different complexity have been trained and used for different problems, dealing with time-varying inputs and outputs and with the classification of noisy samples based on multiple evidence. Further research is currently being carried out to analyze the performance and properties of different BFN architectures.
A Ordered derivatives
In gradient descent methods, given a function \(f: \mathbb{R}^n \to \mathbb{R}^m\),
\[
y_k = f_k(q), \qquad k = 1,\dots,m, \qquad (56)
\]
with \(q = [q_1, \dots, q_n]^T\), and given a cost \(L: \mathbb{R}^m \to \mathbb{R}\), the value of q that minimizes L(y(q)) can be determined by iteratively updating the q vector according to
\[
\Delta q = -\eta\, \nabla L, \qquad (57)
\]
with
\[
\nabla L = \left[ \frac{\partial L}{\partial q} \right]. \qquad (58)
\]
We can suppose that the vector q contains all the variables inside the network, both weight coefficients and internal variables. We note that, in this case, the q's are no longer independent variables. In fact, changing some internal quantity will, in general, affect other quantities. The network architecture imposes some constraints on this interdependency. In a layered feedforward network, the variables inside a layer are a function only of the variables in the previous layers, i.e., there are no cyclic dependencies. Thus, if we consider the vector q, the above property implies that it is possible to find a permutation \(P = \{p_1, p_2, \dots, p_n\}\) of the indices of q such that
\[
q_{p_k} = q_{p_k}(q_{p_1}, q_{p_2}, \dots, q_{p_{k-1}}), \qquad k = 1,\dots,n. \qquad (59)
\]
To avoid notational complexities, the vector q is supposed suitably ordered:
\[
q_k = q_k(q_1, q_2, \dots, q_{k-1}), \qquad k = 1,\dots,n. \qquad (60)
\]
To minimize L, it is necessary to compute its derivatives with respect to the elements of q. In general, L will be given as a function:
\[
L = L(q_1, \dots, q_n). \qquad (61)
\]
When the variable q_j is modified, the value of L changes, due to the variation of q_j and of all the q_s with s > j (which, according to (60), depend on q_j). This can be modeled by using the ordered derivative (Werbos, 1974; Werbos, 1988). When q_j varies, the elements of q describe the curve in \(\mathbb{R}^n\)
\[
\gamma_j(q_j) = \bigl[\, q_1, q_2, \dots, q_{j-1}, q_j, q_{j+1}(q_j), \dots, q_n(q_j, \dots, q_{n-1}) \,\bigr]^T; \qquad (62)
\]
the ordered derivative of L with respect to q_j is the derivative of L along the curve \(\gamma_j\). It is defined by
\[
\frac{\partial^+ L}{\partial q_j} = \lim_{\Delta q_j \to 0} \frac{L(\gamma_j(q_j + \Delta q_j)) - L(\gamma_j(q_j))}{\Delta q_j}. \qquad (63)
\]
Note that, since the curve \(\gamma_j\) is a function of q_j only, ordinary calculus yields
\[
dL = \frac{\partial^+ L}{\partial q_j}\, dq_j. \qquad (64)
\]
The curve \(\gamma_j\) is a smooth manifold of dimension 1. The differential of L, however, is a linear functional defined on the tangent space \(T_P \gamma_j\) of the curve at the point P where the derivative is computed. In particular,
\[
dL \in T^*_P \gamma_j \;\Rightarrow\; dL: T_P \gamma_j \to \mathbb{R}. \qquad (65)
\]
The tangent space \(T_P \gamma_j\) is spanned by the vectors
\[
(\partial_j, \dots, \partial_n) \equiv \left( \frac{\partial}{\partial q_j}, \dots, \frac{\partial}{\partial q_n} \right) \qquad (66)
\]
and has dimension 1, since the only independent variable is q_j, so that, for every q_k with k > j, the tangent vector \(\partial_k\) can be expressed in terms of \(\partial_j\). The dual space \(T^*_P \gamma_j\) is spanned by
\[
(dq_j, \dots, dq_n) \qquad (67)
\]
and is also of dimension 1. Note that dL is a covector in \(T^*_P \gamma_j\). We need to restrict the dependence of the \(dq_k\)'s to some of the previous dq's only. To this end, define
\[
L^{p,r}_{h,k} \stackrel{\mathrm{def}}{=} \bigl[\, 0, \dots, 0, dq_h(dq_p, \dots, dq_r), \dots, dq_k(dq_p, \dots, dq_r), 0, \dots, 0 \,\bigr]^T, \qquad (68)
\]
where the dependence is ordered, that is, \(dq_i(dq_j) = 0\) if j > i. Moreover, let
\[
\omega^{i,j}_{k,h} \stackrel{\mathrm{def}}{=} \mathrm{span}\bigl\{ L^{i,j}_{k,h} \bigr\}. \qquad (69)
\]
To explore the relations between the differentials \(dq_k\), let us get rid -- for the moment -- of the dependence on \(dq_j\). To this end, consider the space \(\omega^{j+1,n}_{j+1,n}\), and an element \(dq_h \in \omega^{j+1,n}_{j+1,n}\). In general, \(dq_h\) will be decomposed into two parts. Part of the vector will be expressed as a linear combination of the other vectors in \(\omega^{j+1,n}_{j+1,n}\), as
\[
\sum_{k=j+1}^{h-1} a_{kh}\, dq_k,
\]
with some appropriate \(a_{kh}\). However, there is another part which depends directly on \(dq_j\) and, since \(dq_j\) is not a vector in the space we are considering, it is independent of all the other vectors in \(\omega^{j+1,n}_{j+1,n}\). This part is given by
\[
\overline{dq}_h = \frac{\partial q_h}{\partial q_j}\, dq_j. \qquad (70)
\]
All the elements \(dq_{j+1}, \dots, dq_n\) being independent, the dimension of \(\omega^{j+1,n}_{j+1,n}\) is equal to n - j. In this space, we can define a partial inner product as
\[
\langle dq_i, dq_j \rangle =
\begin{cases}
\dfrac{\partial^+ q_i}{\partial q_j} & \text{if } i \ge j \\[2mm]
\dfrac{\partial^+ q_j}{\partial q_i} & \text{if } i < j.
\end{cases} \qquad (71)
\]
Note that this definition does not have all the properties of an inner product, but it holds that
\[
\langle dq_i, dq_j \rangle = \langle dq_j, dq_i \rangle \qquad (72)
\]
\[
\langle dq_i, dq_i \rangle = 1 \qquad (73)
\]
\[
\langle dq_i, dq_j \rangle = 0 \;\text{ iff } q_i \text{ and } q_j \text{ are independent} \qquad (74)
\]
\[
\langle dq_j + dq_k, dq_i \rangle = \langle dq_j, dq_i \rangle + \langle dq_k, dq_i \rangle \quad \text{if } i \le j,\; i \le k
\]
\[
\langle dq_i, dq_j + dq_k \rangle = \langle dq_i, dq_j \rangle + \langle dq_i, dq_k \rangle \quad \text{if } i < j,\; i < k,
\]
which is all we will need in the following. With this product, we can build an orthogonal basis for \(\omega^{j+1,n}_{j+1,n}\); the basis will also be orthonormal because of eq. (73). We do this by following the Gram-Schmidt procedure. Usually, the elements of the orthogonal basis are denoted by \(e_1, \dots, e_n\); for notational convenience, we will denote them by \(e_{j+1}, \dots, e_n\). Thus:
\[
e_{j+1} = dq_{j+1}
\]
\[
e_{j+2} = dq_{j+2} - \langle dq_{j+2}, dq_{j+1} \rangle\, e_{j+1} = dq_{j+2} - \frac{\partial^+ q_{j+2}}{\partial q_{j+1}}\, e_{j+1}
\]
\[
\vdots
\]
\[
e_k = dq_k - \sum_{h=j+1}^{k-1} \langle dq_k, dq_h \rangle\, e_h = dq_k - \sum_{h=j+1}^{k-1} \frac{\partial^+ q_k}{\partial q_h}\, e_h. \qquad (75)
\]
23
The expression for ek can thus be rewritten as: kX ?1 @ + q @qk dq k e dq k = dqk ? h= @qj j h=j +1 @qh or: kX ?1 @ + qk @qh k dq dqk = @q j+ @q @q @q dqj j
h=j +1
h
j
(77) (78)
If the function L is considered as the (n+1)-th variable of the series, we can write ?1 @ + L @q @L dq + kX h dL = @q dqj j @q @q j h=j +1 h j
(79)
From (79) and (64) we obtain the chain rule for the computation of the ordered derivative (Werbos, 1974) as: n @ + L @q @ +L = @L + X k (80) @qj @qj k=j+1 @qk @qj : A second chain rule, which can be derived in a similar way is: n @L @ + q @ +L = @L + X k (81) @qj @qj k=j+1 @qk @qj :
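A quick numerical check of the chain rule (80) on a toy ordered system (an illustrative sketch, not from the paper) may help fix the notation: the ordered derivative accounts both for the direct dependence of L on q_1 and for the indirect dependence through q_2 and q_3.

```python
import numpy as np

def system(q1):
    q2 = np.sin(q1)             # q2 = q2(q1)
    q3 = q1 * q2                # q3 = q3(q1, q2)
    return q2, q3

def L(q1, q2, q3):
    return q1**2 + q2 * q3

q1 = 0.7
q2, q3 = system(q1)

# Ordered derivative of L with respect to q1 by eq. (80):
#   d+L/dq1 = dL/dq1 + (d+L/dq2)(dq2/dq1) + (d+L/dq3)(dq3/dq1)
dL_dq3 = q2                                   # q3 is last, so d+L/dq3 = dL/dq3
dplus_L_dq2 = q3 + dL_dq3 * q1                # eq. (80) applied to q2 (dq3/dq2 = q1)
dplus_L_dq1 = 2*q1 + dplus_L_dq2*np.cos(q1) + dL_dq3*q2   # direct dq3/dq1 = q2

# Compare with a finite difference of L along the full forward computation:
eps = 1e-6
q2p, q3p = system(q1 + eps)
print(dplus_L_dq1, (L(q1 + eps, q2p, q3p) - L(q1, q2, q3)) / eps)
```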
References

Blum, E. K. and Li, L. K. (1991). Approximation theory and feedforward networks. Neural Networks, 4:511-515.

Doya, K. Universality of fully-connected recurrent neural networks. IEEE Transactions on Neural Networks. (Submitted).

Hampshire II, J. B. and Pearlmutter, B. A. (1990). Equivalence proofs for multi-layer perceptron classifiers and the Bayesian discriminant function. In Touretzky, D., Elman, J., Sejnowski, T., and Hinton, G., editors, Proceedings of the 1990 Connectionist Models Summer School, San Mateo, CA. Morgan Kaufmann.

Hecht-Nielsen, R. (1989). Theory of the backpropagation neural network. In Proceedings of the International Joint Conference on Neural Networks, pages I-593-I-605.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79:2554-2558.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81:3088-3092.

Kuhn, G., Watrous, R. L., and Ladendorf, B. (1989). Connected recognition with a recurrent network. In Neurospeech '89, Edinburgh, Scotland.

Narendra, K. S. and Parthasarathy, K. (1990a). Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1):4-27.

Narendra, K. S. and Parthasarathy, K. (1990b). Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1):4-27.

Pearlmutter, B. A. (1990). Dynamic recurrent neural networks. Technical Report CMU-CS-90-196, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213.

Robinson, A. J. (1989). Dynamic Error Propagation Networks. PhD thesis, Engineering Department, Cambridge University, Trumpington Street, Cambridge, England.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press.

Rumelhart, D. E. and McClelland, J. L., editors (1987). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1: Foundations. The MIT Press.

Santini, S. and Del Bimbo, A. Properties of feedback neural networks. Neural Networks. (To appear).

Santini, S. and Del Bimbo, A. Recurrent neural networks can be trained to be maximum a posteriori probability classifiers. Neural Networks. (To appear).

Shanno, D. F. (1990). Recent Advances in Numerical Techniques for Large-Scale Optimization. MIT Press, Cambridge, MA.

Werbos, P. J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University.

Werbos, P. J. (1988). Generalization of back propagation with application to a recurrent gas market model. Neural Networks, 1(4).

Williams, R. J. and Zipser, D. (1989a). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270-280.

Williams, R. J. and Zipser, D. (1989b). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270-280.