Neural Networks, Rational Functions and Realization Theory
Uwe Helmke Department of Mathematics University of Regensburg Regensburg 8400 Germany
Robert C. Williamson Department of Engineering Australian National University Canberra 0200 Australia
April 18, 1995

Abstract

The problem of parametrizing single hidden layer scalar neural networks with continuous activation functions is investigated. A connection is drawn between realization theory for linear dynamical systems, rational functions and neural networks that appears to be new. A result of this connection is a general parametrization of such neural networks in terms of strictly proper rational functions. Some existence and uniqueness results are derived. Jordan decompositions are developed which show how the general form can be expressed in terms of a sum of canonical second order sections. The parametrization may be useful for studying learning algorithms.
1 Introduction

Nonlinearly parametrized representations of functions f : ℝ → ℝ of the form

(1.1)    f(x) = Σ_{i=1}^n c_i σ(x − a_i),   x ∈ ℝ,

have attracted considerable attention recently in the neural network literature. Here σ : ℝ → ℝ is typically a sigmoidal function such as

(1.2)    σ(x) = 1/(1 + e^{−x}),
but other choices than (1.2) are possible and of interest. Sometimes more complex representations such as

(1.3)    f(x) = Σ_{i=1}^n c_i σ(b_i x − a_i)

or even compositions of these are considered. Yet more generally, functions f : ℝ^m → ℝ are also studied. Different choices of the "activation function" σ : ℝ → ℝ correspond to different representation problems. For example, if σ(x) = x^{−1} then (1.1) amounts to finding the partial fraction decomposition of a rational function f(x). The coefficients a_i and c_i arising in (1.1) here have the interpretation of the poles and residues respectively of f(x). In this example it is obvious also that complex coefficients a_i, c_i ∈ ℂ arise naturally, as a real rational function f : ℝ → ℝ may well have complex poles and residues. Another case of interest is where σ(x) = x^d, d ∈ ℕ, is a monomial. Then (1.1) is equivalent to finding a decomposition of a polynomial f(x) of degree d as a weighted sum of d-th powers of linear polynomials x − a_i. This is usually referred to as Waring's problem for binary forms, with early results going back to Sylvester [25] and Gundelfinger (1886). There is also an interesting connection with Hilbert's 17th problem, asking whether a positive polynomial f : ℝ^m → ℝ can be represented as a sum of squares of polynomials (or rational functions). In fact, if the coefficients c_i in (1.1) are all positive and the degree d of σ is even, then (1.1) is such a representation of a polynomial as a sum of squares of polynomials.

If σ : ℝ → ℝ is a sigmoidal function such as (1.2), then functions of the form (1.1) are described by one "hidden layer" neural networks with n hidden layer thresholds a_i and output weights c_i, but with no input weights. The task then is to find, or to "learn", an exact or approximate representation (1.1) of some function. There are now a number of results available describing the "universal approximation properties" of such classes of feedforward neural networks. Nearly all of these [5, 8, 14] are in the form of denseness results, saying that if one takes enough nodes, one can make an arbitrarily good approximation.
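The σ(x) = x^{−1} reading of (1.1) mentioned above can be sketched numerically. The fragment below is illustrative only: the polynomials p and q are hand-chosen examples, not taken from the text, and the residues are computed by the standard simple-pole formula c_i = p(a_i)/q′(a_i).

```python
import numpy as np

# Illustrative sketch of the sigma(x) = 1/x case: representation (1.1)
# becomes a partial fraction decomposition f(x) = sum_i c_i/(x - a_i),
# with poles a_i and residues c_i = p(a_i)/q'(a_i) for f = p/q.
p = np.poly1d([1.0, 2.0])          # p(x) = x + 2   (illustrative choice)
q = np.poly1d([1.0, 0.0, -1.0])    # q(x) = x^2 - 1, simple poles at +1, -1
poles = np.roots(q.coeffs)
residues = [p(a) / q.deriv()(a) for a in poles]

x = 2.5                            # any test point away from the poles
f_direct = p(x) / q(x)
f_sum = sum(c / (x - a) for c, a in zip(residues, poles))
assert abs(f_direct - f_sum) < 1e-12
```

For a q with complex roots the poles would appear in conjugate pairs, matching the remark that complex coefficients a_i, c_i arise naturally.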
Motivated by analogies with model reduction of linear control systems we became interested in finding best approximations of functions by neural network representations (1.1), with an upper bound on the number n of such nodes. Such questions are important in order to estimate the approximation-theoretic capabilities of learning algorithms for (1.1). For the special sigmoid function (1.2) an analysis has been presented in [26], where the problem has been shown to be deeply related to classical rational approximation theory. In order to approach such neural network approximation tasks, possibly valid for a large class of activation functions σ : ℝ → ℝ, it becomes important to study the parametrization problem for the class of functions described by (1.1). It has been shown in [26] that there often exists no best approximation of a function f : ℝ → ℝ by functions of the type (1.1). Thus the class of functions (1.1) is not rich enough to guarantee the convergence of general learning algorithms. Also, the different parametrizations of the class (1.1) may well have an impact on the transient behaviour of learning algorithms and are thus worth studying in more detail. (See [4] for some examples of the effect of different parametrizations on some very simple learning problems.)
1.1 What this paper is about

The purpose of this paper is to explore such parametrization issues regarding (1.1) (and to a lesser extent (1.3)), and in particular to show the close connection these representations have with the standard system-theoretic realization theory for rational functions. The main result of this paper is theorem 5.1. We first show how to define a generalization of (1.1) parametrized by (A, b, c), where A is a matrix over a field, and b and c are vectors. (This is made more precise below.) The parametrization involves (A, b, c) being used to define a rational function. The generalized representations are then defined in terms of the rational function. Representations (1.1) correspond to the case where A is diagonalizable. In that case, the thresholds a_i and the output weights c_i are interpreted as the poles and residues of the associated rational "transfer function" c(xI − A)^{−1} b. This connection allows us to use results available for rational functions in the study of neural-network representations such as (1.1). It will also lead to an understanding of the geometry of the space of functions.

That there is indeed a close connection between representations of the form (1.1) and rational functions was shown in [26] (and previously used in [23]). There it was shown that (1.1) can be written as e^x r(e^x) when σ(·) is given by (1.2). The function r(·) is a strictly proper rational function, the coefficients a_i in (1.1) correspond to the logarithm of the poles of r(·), and the c_i coefficients correspond to the residues of r(z) at z = e^{a_i}.

In the following section we shall show that representations of the form (1.1) in general can be parametrized by rational functions in a single variable. Furthermore, the more general representations (1.3) are shown to be parametrized by rational functions in two variables. They correspond to so-called separable 2-D systems, arising in two-dimensional image processing. Then, by using ideas originally from the theory of state-space realizations of linear dynamical systems, we give conditions under which a representation (1.1) exists for a given function f(·). More generally, the existence and uniqueness properties of representations f(x) = c σ(xI + A) b are investigated, where A is an n × n matrix, b is an n-vector and c is an n-covector. Such representations naturally extend representations (1.1), where A = diag(a_1, …, a_n) is diagonal.
2 Realizations Relative to a Function

In this section we explore the relationship between sigmoidal representations (1.1) of real analytic functions f : I → ℝ defined on an interval I ⊂ ℝ, real rational functions defined on the complex plane ℂ, and the well established realization theory for linear dynamical systems

ẋ(t) = Ax(t) + bu(t),   y(t) = cx(t) + du(t).

For standard textbooks on systems theory and realization theory we refer to [15, 17, 22]. Let K denote either the field ℝ of real numbers or the field ℂ of complex numbers. Let Ω ⊂ ℂ be an open and simply connected subset of the complex plane, containing I, and let σ : Ω → ℂ be an analytic function defined on Ω. For example, σ may be obtained by an analytic continuation of some sigmoidal function σ : ℝ → ℝ into the domain of holomorphy Ω of the complex plane. Let T : V → V be a linear operator on a finite-dimensional K-vector space V such that T has all its eigenvalues in Ω. Let Γ ⊂ Ω be a simple closed curve, oriented in the counter-clockwise direction, enclosing all the eigenvalues of T in its interior. More generally, Γ may consist of a finite number of simple closed curves Γ_k with interiors Ω′_k such that the union of the domains Ω′_k contains all the eigenvalues of T.
Definition 2.1 The matrix valued function σ(T) is defined as the contour integral [18, p. 44]

(2.1)    σ(T) := (1/2πi) ∮_Γ σ(z) (zI − T)^{−1} dz.
Note that for each linear operator T : V → V, σ(T) : V → V is again a linear operator on V that is independent of the choice of Γ. If we now make the substitution T := xI + A for x ∈ ℂ and A : V → V K-linear, then

σ(xI + A) = (1/2πi) ∮_Γ σ(z) ((z − x)I − A)^{−1} dz

becomes a function of the complex variable x, at least as long as Γ contains all the eigenvalues of xI + A. Using the change of variables λ := z − x we obtain

(2.2)    σ(xI + A) = (1/2πi) ∮_{Γ′} σ(x + λ) (λI − A)^{−1} dλ,

where Γ′ = Γ − x encircles all the eigenvalues of A. Given an arbitrary vector b ∈ V and a linear functional c : V → K we then achieve the representation

(2.3)    c σ(xI + A) b = (1/2πi) ∮_Γ σ(x + λ) c(λI − A)^{−1} b dλ.

Note that in (2.3) the simple closed curve Γ ⊂ ℂ is arbitrary, as long as it satisfies the two conditions

(2.4)    Γ encircles all the eigenvalues of A,

(2.5)    x + Γ = {x + λ | λ ∈ Γ} ⊂ Ω.
We will take (2.3) to be the definition of c σ(xI + A) b. Let f : I → ℝ be a real analytic function in a single variable x ∈ I, defined on an interval I ⊂ ℝ.
Definition 2.2 A quadruple (A, b, c, d) is called a finite-dimensional σ-realization of f : I → ℝ over a field of constants K if for all x ∈ I

(2.6)    f(x) = c σ(xI + A) b + d

holds, where the right-hand side is given by (2.3) and Γ is assumed to satisfy the conditions (2.4)–(2.5). Here d ∈ K, b ∈ V, and A : V → V, c : V → K are K-linear maps and V is a finite-dimensional K-vector space. If d ≡ 0, we will sometimes simply speak of a σ-realization (A, b, c).
Definition 2.3 The dimension (or degree) of a σ-realization is dim_K V. The degree of f, denoted δ(f), is the minimal dimension over all σ-realizations of f. A minimal σ-realization is a σ-realization of minimal dimension δ(f).
The above definition of a σ-realization is a rather straightforward extension of the familiar system-theoretic notion of a realization of a transfer function. In this paper we will address the following specific questions concerning σ-realizations.

Q1 What are the existence and uniqueness properties of σ-realizations?

Q2 How can one characterize minimal σ-realizations?

Q3 How can one compute δ(f)?

Q4 Given f, when does there exist a σ-realization (A, b, c, d) with A diagonalizable over K, and what is the minimal dimension of such a realization?
Examples of σ(·)

Important examples of activation functions σ : ℝ → ℝ are:

1. σ(x) = x^{−1}. In this case σ-realizations are just the standard realizations of analytic functions or formal power series in systems theory. Kalman's realization theorem [17] solves questions Q1–Q3 for rational functions and, in fact, for arbitrary formal power series f(x).

2. σ(x) = x^d, d ∈ ℕ. This is known as Waring's problem for binary forms (being a generalization of the number-theoretic question bearing that name [12]). A function f(x) admits a σ-realization (A, b, c) if and only if it is a polynomial of degree d. Over ℂ, and over ℝ with d even, the problem has been solved by Helmke [13].

3. σ(x) = (1 + e^{−x})^{−1}. We refer to this as the standard sigmoid case. This function σ(x) is widely used in feedforward neural networks. Other cases of interest include bump functions such as e.g. σ(x) = e^{−x²}.
Remarks

1. K is usually either the field ℝ of real numbers or the field ℂ of complex numbers. Even if σ(x) and f(x) are considered to be real valued functions, complex realizations may be of interest, i.e. where V is a ℂ-vector space, b ∈ V and A : V → V, c : V → ℂ are ℂ-linear maps. This topic will be of importance, for example, if realizations are sought where A is a diagonal matrix (see section 6.1 below). In fact, if A is diagonalizable with complex eigenvalues then the diagonalization of A leads to a finite sum sigmoidal representation of f(x). Such complex realizations may thus yield finite sum representations using complex coefficients, even in cases where exact finite sum representations with real coefficients do not exist. This is one reason why complex realizations may prove to be important.

2. The standard realizations arising in systems theory have σ(x) = x^{−1}. In this case, a function f(x) has a finite-dimensional σ-realization if and only if f(x) is rational with f(∞) ≠ ∞ [17].
2.1 Generalizations: Realizations Related to Σ_{i=1}^n c_i σ(b_i x − a_i)
In (2.3), the strictly proper "transfer function"

g(λ) := c(λI − A)^{−1} b

can be extended to an arbitrary rational function (and thus in particular to a polynomial). To see this we consider the integral transformation

(2.7)    f(x) = (1/2πi) ∮_Γ σ(x + λ) g(λ) dλ

for an arbitrary rational function g(λ) = p(λ)/q(λ) ∈ K(λ). Clearly (2.7) makes sense for any simple closed integration path Γ which encircles the poles of g(λ) such that x + Γ ⊂ Ω for all x ∈ I. Moreover, by the residue theorem,

(2.8)    (1/2πi) ∮_Γ σ(x + λ) g(λ) dλ = Σ_{g(z)=∞} res_{λ=z} [σ(x + λ) g(λ)].

When all the poles of g(λ) are distinct, (2.8) is just a weighted sum of residues, where the sum is over all finite poles of g(λ). We note in passing that formula (2.8)
is reminiscent of Gaussian quadrature formulae; see, e.g., [9, 20]. More generally one can consider σ-realizations (E, A, b, c, d), where E, A : V → V, c : V → K are K-linear maps satisfying det(xE − A) ≢ 0, b ∈ V, d ∈ K, such that

(2.9)    f(x) = c σ(xE + A) b + d

holds for all x ∈ I. Here we define

(2.10)    c σ(xE + A) b := (1/2πi) ∮_Γ σ(λ) c(λI − xE − A)^{−1} b dλ,

where, for x ∈ I fixed, Γ encircles all the roots of det(λI − xE − A) = 0 such that x + Γ ⊂ Ω for all x ∈ I. Note that the "transfer function" g(x, λ) = c(λI − xE − A)^{−1} b is a rational function in the two variables (λ, x). While such rational functions arise naturally in applications, such as two-dimensional image processing, the system-theoretic interpretation of such transfer functions in the context of neural networks is left as an open question. Of course commuting σ-realizations (E, A, b, c, d) with

(2.11)    EA = AE

are of obvious interest.
Lemma 2.4

1. Let S : V → V be an invertible K-linear map. Then (E, A, b, c, d) is a σ-realization of f if and only if (SES^{−1}, SAS^{−1}, Sb, cS^{−1}, d) is a σ-realization of f. In particular, for E = I_n (the n × n identity), if (A, b, c, d) is a σ-realization of f then (SAS^{−1}, Sb, cS^{−1}, d) is a σ-realization of f for all invertible transformations S : V → V.

2. If A = diag(λ_1, …, λ_n) is diagonal, then for each x,

(2.12)    σ(xI + A) = diag(σ(x + λ_1), …, σ(x + λ_n)).

Proof The proof is an immediate consequence of definition (2.3), once it is noticed that

c S^{−1} σ(xSES^{−1} + SAS^{−1}) Sb = c σ(xE + A) b

holds for all x ∈ I.
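For a diagonal A, (2.12) can be checked against the contour-integral definition (2.3) numerically. The sketch below is illustrative only: the diagonal A, the vectors b and c, the circular contour radius, and the trapezoidal quadrature size are hand-chosen, and the standard sigmoid plays the role of σ.

```python
import numpy as np

def sigma(z):
    # standard sigmoid; analytic for |Im z| < pi
    return 1.0 / (1.0 + np.exp(-z))

def c_sigma_b(x, A, b, c, radius=2.0, m=4000):
    # c sigma(xI + A) b via the contour integral (2.3):
    # (1/2*pi*i) \oint_Gamma sigma(x + lam) c (lam I - A)^{-1} b dlam,
    # Gamma a circle (illustrative radius) enclosing all eigenvalues of A,
    # discretized by the trapezoidal rule in m points.
    centre = np.mean(np.linalg.eigvals(A))
    n = A.shape[0]
    theta = 2 * np.pi * np.arange(m) / m
    lam = centre + radius * np.exp(1j * theta)
    dlam = radius * 1j * np.exp(1j * theta) * (2 * np.pi / m)
    total = 0.0 + 0.0j
    for l, dl in zip(lam, dlam):
        total += sigma(x + l) * (c @ np.linalg.solve(l * np.eye(n) - A, b)) * dl
    return total / (2j * np.pi)

# Illustrative diagonal A: by (2.12) the result must be sum_i c_i sigma(x + a_i).
A = np.diag([0.5, -0.3, 1.1])
b = np.ones(3)
c = np.array([2.0, -1.0, 0.5])
x = 0.7
lhs = c_sigma_b(x, A, b, c)
rhs = sum(ci * sigma(x + ai) for ci, ai in zip(c, np.diag(A)))
assert abs(lhs - rhs) < 1e-8
```

The contour must both enclose the eigenvalues of A (condition (2.4)) and keep x + Γ inside the domain of analyticity of the sigmoid (condition (2.5)); the radius above satisfies both.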
Remark In the case σ(x) = (1 + e^{−x})^{−1} it is of interest to compare the representation (2.7) with the representation used in [26], namely f(x) = e^x r(e^x), where r(·) is a strictly proper rational function with all its poles in ℂ − [0, ∞). (Such a representation was also used in [23].) It is easily verified (see [26] for details) that a representation (1.1) is equivalent to f(x) = e^x r(e^x), with

r(z) = Σ_{i=1}^n c_i/(z + e^{a_i}).

Let g(λ) be a strictly proper rational function and Γ an arc encircling the poles of g(λ). Equating the two representations we have

e^x r(e^x) = (1/2πi) ∮_Γ σ(z + x) g(z) dz = (1/2πi) ∮_Γ (e^x/(e^x + e^{−z})) g(z) dz.

Let y = e^x, and so x = log y. Then

r(y) = (1/2πi) ∮_Γ (1/(y + e^{−z})) g(z) dz.

Let z_k, k = 1, …, n, denote the poles of g(z). Then by the residue theorem,

r(y) = Σ_{k=1}^n res_{z=z_k} [g(z)/(y + e^{−z})].

Thus if g(z) has a pole at z = μ, r(z) has a pole at z = −e^{−μ}. Specifically, if g(z) = c(zI − A)^{−1} b, then

(2.13)    r(z) = c(zI + e^{−A})^{−1} b.
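Relation (2.13) is easy to test numerically. The sketch below takes an illustrative diagonal A together with hand-chosen b and c (none of these values come from the text), so that e^{−A} is simply the entrywise exponential of the diagonal and, by (2.12), c σ(xI + A) b is a finite sum of sigmoids.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative diagonal A, b, c.  With A diagonal, c sigma(xI + A) b is the
# finite sum of sigmoids from (2.12), and e^{-A} is diagonal as well.
a = np.array([0.2, -0.5, 1.0])
A = np.diag(a)
b = np.array([1.0, 1.0, 1.0])
c = np.array([0.7, -1.2, 0.4])

def f(x):
    # c sigma(xI + A) b = sum_i c_i b_i sigmoid(x + a_i)
    return c @ (sigmoid(x + a) * b)

def r(z):
    # r(z) = c (zI + e^{-A})^{-1} b, as in (2.13)
    return c @ np.linalg.solve(z * np.eye(3) + np.diag(np.exp(-a)), b)

# Check f(x) = e^x r(e^x) at a few sample points.
for x in [-1.0, 0.0, 0.5, 2.0]:
    assert abs(f(x) - np.exp(x) * r(np.exp(x))) < 1e-12
```

The check works because e^x r(e^x) = Σ c_i/(1 + e^{−x−a_i}), which is exactly the sigmoid sum.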
3 Existence of σ-realizations

We now consider the question of existence of σ-realizations. To set the stage, we consider the systems theory case σ(x) = x^{−1} first. Assume we are given a formal power series

(3.1)    f(x) = Σ_{i=0}^N (f_i/i!) x^i,   N ≤ ∞,

and that (A, b, c) is a σ-realization in the sense of definition 2.2. The Taylor expansion of c(xI + A)^{−1} b at 0 is (for A nonsingular)

(3.2)    c(xI + A)^{−1} b = Σ_{i=0}^∞ (−1)^i c A^{−(i+1)} b x^i.

Thus

f_i/i! = (−1)^i c A^{−(i+1)} b,   i = 0, …, N,

if and only if the expansions of (3.1) and (3.2) coincide up to order N. Observe [17] that

(3.3)    f(x) = c(xI + A)^{−1} b and dim V < ∞   ⟺   f(x) is rational with f(∞) = 0.
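The expansion (3.2) can be verified numerically; in the sketch below the matrix and vectors are illustrative, and the truncation length is chosen so the series tail is negligible at the test point.

```python
import numpy as np

# Illustrative check of (3.2): for nonsingular A, the Taylor coefficients
# of c(xI + A)^{-1} b at x = 0 are (-1)^i c A^{-(i+1)} b.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
b = np.array([1.0, 2.0])
c = np.array([1.0, -1.0])
Ainv = np.linalg.inv(A)

coeffs = [(-1) ** i * c @ np.linalg.matrix_power(Ainv, i + 1) @ b
          for i in range(30)]

x = 0.1   # well inside the radius of convergence (smallest |eigenvalue| is 2)
series = sum(co * x ** i for i, co in enumerate(coeffs))
direct = c @ np.linalg.solve(x * np.eye(2) + A, b)
assert abs(direct - series) < 1e-12
```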
The possibility of solving (3.3) is now easily seen as follows. Let V = ℝ^{N+1} = Map({0, …, N}, ℝ) be the finite or infinite (N + 1)-fold product space of ℝ. (Here Map(X, Y) denotes the set of all maps from X to Y.) If N is finite, let

(3.4)    A^{−1} = − [ 0 0 ⋯ 0 0 ; 1 0 ⋯ 0 0 ; 0 1 ⋱ ⋮ ⋮ ; ⋮ ⋱ ⋱ 0 0 ; 0 ⋯ 0 1 0 ] ∈ ℝ^{(N+1)×(N+1)},

b = (1, 0, …, 0)^T ∈ V,   c = (f_0, f_1, f_2/2!, …, f_N/N!).

For N = ∞ we may take, by an abuse of notation, A^{−1} : ℝ^ℕ → ℝ^ℕ as the shift operator

(3.5)    A^{−1} : (x_0, x_1, …) ↦ −(0, x_0, x_1, …),

and

b = (1, 0, …),   c = (f_0, f_1, f_2/2!, …).

Note that there may be finite-dimensional realizations as well. We will now consider more general σ(·). We then have
Definition 3.1 Let σ(x) be a meromorphic function and let

f(x) = Σ_{i=0}^N (f_i/i!) x^i,   N ≤ ∞,

be a, possibly finite, formal power series in the variable x. A triple (A, b, c) is called an N-th order σ-realization of f(x) if the following conditions are satisfied:

(a) σ(x) is analytic on a neighborhood of the spectrum of A in ℂ.

(b) The power series expansions of f(x) and c σ(xI + A) b at x = 0 coincide up to order N.
By condition (a) the i-th derivative of the function x ↦ σ(xI + A) at x = 0 exists and is equal to σ^{(i)}(A), where σ^{(i)}(x) denotes the i-th derivative of the function σ(x). Therefore condition (b) is equivalent to

(3.6)    f_i = c σ^{(i)}(A) b   for i = 0, …, N.

Observe that condition (a) is satisfied for σ(x) = x^{−1} if and only if A is invertible. Moreover, since the standard sigmoid σ(z) = (1 + e^{−z})^{−1} is analytic for any complex variable z ∈ ℂ with |Im z| < π, condition (a) is satisfied for the standard sigmoid if A has all its eigenvalues in {z ∈ ℂ : |Im z| < π/2}. If σ(x) = x^{−1} and A ∈ ℂ^{n×n} is invertible, then σ^{(i)}(−A) = −i!(A^{−1})^{i+1}. Consequently, in this case, a triple (A, b, c) is an N-th order σ-realization if and only if the partial realization condition (3.3) is satisfied. Using the terminology of linear systems theory we may thus say that (A, b, c), with A invertible, is an N-th order σ-realization for σ(x) = x^{−1} if and only if (F, g, h) := (−A^{−1}, A^{−1}b, c) is a partial realization of (f_0, f_1/1!, …, f_N/N!), i.e. if and only if

f_i/i! = h F^i g

holds for i = 0, …, N. The existence part of the realization question Q1 can now be restated as
Q5 Given a meromorphic function σ(x) and a sequence of real numbers (f_0, …, f_N), does there exist an N-th order σ-realization (A, b, c) with

(3.7)    f_i = c σ^{(i)}(A) b,   i = 0, …, N?

Thus the realization question Q1 is essentially an interpolation question (Loewner interpolation [2, 7]), which extends the partial realization task from linear system theory, where σ(x) = x^{−1}, to more general sigmoid functions σ(x). Let σ_ℓ := σ^{(ℓ)}(0) and τ_ℓ := c A^ℓ b, ℓ ∈ ℕ₀, and let

(3.8)    F_σ = [ σ_0 σ_1 σ_2 ⋯ ; σ_1 σ_2 σ_3 ⋯ ; σ_2 σ_3 σ_4 ⋯ ; ⋮ ⋮ ⋮ ⋱ ] = (σ_{i+j})_{i,j=0}^∞.

Write

(3.9)    τ] = (τ_0, τ_1, τ_2/2!, τ_3/3!, …)^T   and   f] = (f_0, f_1, f_2, …)^T.

Then (3.6) (for N = ∞) can formally be written as

(3.10)    f] = F_σ τ].

Of course, any meaningful interpretation of (3.10) requires that the infinite sums Σ_{j=0}^∞ σ_{i+j} τ_j/j!, i ∈ ℕ₀, exist. This happens, for example, if Σ_{j=0}^∞ σ_j² < ∞ and Σ_{j=0}^∞ |τ_j/j!|² < ∞. We have already seen that every finite or infinite sequence τ] has a realization (A, b, c). Thus we obtain
Corollary 3.2 Suppose that the infinite sums Σ_{j=0}^∞ σ_{i+j} τ_j/j!, i ∈ ℕ₀, exist. A function f(x) admits a σ-realization if and only if f] ∈ image(F_σ).

Corollary 3.3 Suppose that the infinite sums Σ_{j=0}^∞ σ_{i+j} τ_j/j!, i ∈ ℕ₀, exist. Let

(3.11)    H_τ = (τ_{i+j})_{i,j=0}^∞.

There exists a finite-dimensional σ-realization of f(x) if and only if f] = F_σ τ] with rank H_τ < ∞. In this case δ(f) = rank H_τ.
Proof This follows immediately from Kronecker's theorem and systems realization theory; see [15, 17].
Remark It would be desirable to have an explicit characterization of when F_σ is invertible or surjective via Hankel operator theory for analytic functions. Also it would be nice to know how to compute the inverse of F_σ, i.e. τ] = F_σ^{−1} f]. Furthermore, how can one characterize pairs of analytic functions (σ, f) such that (τ_{i+j})_{i,j=0}^∞ has finite rank? When rank H_τ < ∞, then Σ_{i=0}^∞ τ_i x^{−i−1} is rational and thus τ_i = c A^i b, i ∈ ℕ₀, for some finite-dimensional (A, b, c).
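The rank criterion in corollary 3.3 can be illustrated numerically: for moments τ_i = cA^i b generated by an n-dimensional triple, every finite Hankel matrix (τ_{i+j}) has rank at most n, with equality when the triple is minimal. The (A, b, c) below is an illustrative choice.

```python
import numpy as np

# Sketch of Kronecker's rank criterion (corollary 3.3), with an
# illustrative minimal triple of dimension n = 2.
A = np.array([[0.9, 1.0],
              [0.0, 0.5]])
b = np.array([1.0, 1.0])
c = np.array([1.0, -1.0])
n = A.shape[0]

m = 6  # Hankel size; any m >= n shows the rank saturating at n
tau = [c @ np.linalg.matrix_power(A, i) @ b for i in range(2 * m - 1)]
H = np.array([[tau[i + j] for j in range(m)] for i in range(m)])

rank = np.linalg.matrix_rank(H)
assert rank == n
```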
4 Uniqueness

In this section we consider the uniqueness of the representation (2.3). We require several definitions first.

Definition 4.1 A system {g_1, …, g_n} of continuous functions g_i : I → ℝ, defined on an interval I ⊂ ℝ, is said to be linearly independent if for every c_1, …, c_n ∈ ℝ with Σ_{i=1}^n c_i g_i(x) = 0 for all x ∈ I, it follows that c_1 = ⋯ = c_n = 0.

Remark The linear independence condition is implied by the stronger condition that

det [ g_1(x_1) ⋯ g_1(x_n) ; ⋮ ⋱ ⋮ ; g_n(x_1) ⋯ g_n(x_n) ] ≠ 0

for all distinct (x_i)_{i=1}^n in I. Equivalently, if Σ_{i=1}^n c_i g_i(x) has n distinct roots in I, then c_1 = ⋯ = c_n = 0.
Definition 4.2 A subset A of ℂ is called self-conjugate if a ∈ A implies ā ∈ A.

Let σ : ℝ → ℝ be a continuous function and define σ_{z_i}^{(j)}(x) := σ^{(j)}(x + z_i). Let ν := (ν_1, …, ν_m), where

Σ_{j=1}^m ν_j = n,   ν_j ∈ ℕ,  ν_j ≥ 1,  j = 1, …, m,

denote a combination of n of size m. For a given combination ν = (ν_1, …, ν_m) of n, let I_ν := {1, …, m} and let J_i := {1, …, ν_i}. Let Z_m := {z_1, …, z_m} and let

(4.1)    Σ(ν, Z_m) := {σ_{z_i}^{(j−1)} : i ∈ I_ν, j ∈ J_i}.

Definition 4.3 If for all m ≤ n, for all combinations ν = (ν_1, …, ν_m) of n of size m, and for any self-conjugate set Z_m of distinct points, the class of functions Σ(ν, Z_m) is linearly independent, then σ is said to be l.i. (linearly independent) generating of order n.

A condition similar to this was defined in [1]. They considered a uniqueness condition on σ(·) for representations of the form Σ_{i=1}^n c_i σ(b_i x − a_i) (IP) and Σ_{i=1}^n c_i σ(b_i x) (WIP).
Theorem 4.4 (Uniqueness) Let σ : ℝ → ℝ be l.i. generating of order at least 2n on I and let (A, b, c) and (Ã, b̃, c̃) be minimal σ-realizations of order n of functions f and f̃ respectively. Then the following equivalence holds:

(4.2)    c σ(xI + A) b = c̃ σ(xI + Ã) b̃  ∀x ∈ I   ⟺   c(λI − A)^{−1} b = c̃(λI − Ã)^{−1} b̃  ∀λ ∈ ℝ.

Conversely, if (4.2) holds for almost all order n triples (A, b, c), (Ã, b̃, c̃), then σ : ℝ → ℝ is l.i. generating on I of order n.
Proof (Equivalence) By hypothesis (A, b, c) and (Ã, b̃, c̃) are minimal realizations of f and f̃ (of order n). If f = f̃, then

c σ(xI + A) b = c̃ σ(xI + Ã) b̃.

Thus for

g(λ) := c(λI − A)^{−1} b   and   g̃(λ) := c̃(λI − Ã)^{−1} b̃

we have

(4.3)    (1/2πi) ∮_Γ σ(x + λ) h(λ) dλ = 0   ∀x ∈ I,

with h(λ) := g(λ) − g̃(λ) having degree 2n. By the residue theorem, condition (4.3) is equivalent to

(4.4)    Σ_{h(z_i)=∞} res_{z=z_i} [σ(x + z) h(z)] = 0,

where the sum is over all distinct poles {z_1, …, z_m} of h(z) (m ≤ 2n). Write the partial fraction expansion of h(z) as

(4.5)    h(z) = Σ_{i=1}^m Σ_{j=1}^{ν_i} h_{ij}/(z − z_i)^j,   Σ_{i=1}^m ν_i = 2n.

(It is always of the form (4.5) since h(z) is the difference of two strictly proper rational functions and hence is itself strictly proper.) Recall the general formula for residues of a product of an analytic and a meromorphic function. For the case of one pole (possibly of multiplicity greater than one) we have

res_{z=z_0} [ f(z) Σ_{j=1}^k a_j/(z − z_0)^j ] = Σ_{j=1}^k a_j f^{(j−1)}(z_0)/(j − 1)!

(see for example [21]). Thus we can express (4.4) as

(4.6)    Σ_{i∈I_ν} Σ_{j∈J_i} [h_{ij}/(j − 1)!] σ^{(j−1)}(x + z_i) = 0.

Since σ is l.i. generating of order at least 2n, (4.6) implies that h_{ij} = 0, i ∈ I_ν, j ∈ J_i. Hence h(z) = 0 and so g(z) = g̃(z), as required.
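The residue formula behind (4.6) can be spot-checked numerically. The sketch below takes a rational g with a single double pole (an illustrative z_0 and coefficients h_11, h_12, not from the text) and compares the contour integral (2.7), evaluated by trapezoidal quadrature on a circle, against h_11 σ(x + z_0) + h_12 σ′(x + z_0).

```python
import numpy as np

def sigma(z):
    # standard sigmoid; analytic for |Im z| < pi
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative double-pole data: g(lam) = h11/(lam - z0) + h12/(lam - z0)^2.
z0, h11, h12 = 0.3, 1.5, -0.8

def g(lam):
    return h11 / (lam - z0) + h12 / (lam - z0) ** 2

def contour_integral(x, radius=1.0, m=4000):
    # (1/2*pi*i) \oint sigma(x + lam) g(lam) dlam on a circle around z0
    theta = 2 * np.pi * np.arange(m) / m
    lam = z0 + radius * np.exp(1j * theta)
    dlam = radius * 1j * np.exp(1j * theta) * (2 * np.pi / m)
    return np.sum(sigma(x + lam) * g(lam) * dlam) / (2j * np.pi)

x = 0.4
lhs = contour_integral(x)
ds = sigma(x + z0) * (1.0 - sigma(x + z0))     # sigma' = sigma(1 - sigma)
rhs = h11 * sigma(x + z0) + h12 * ds           # residue-sum prediction
assert abs(lhs - rhs) < 1e-8
```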
(Converse) Assume σ is not l.i. generating of order at least n. That is, for some m ≤ n there is a combination ν = (ν_1, …, ν_m) of n of size m and a self-conjugate set of points Z_m = {z_1, …, z_m} of size m such that for some l ∈ I_ν = {1, …, m} and some j_l ∈ J_l = {1, …, ν_l} there is a sequence of coefficients (β_{ij}) such that

(4.7)    σ^{(j_l−1)}(x + z_l)/(j_l − 1)! = Σ_{(i,j)∈I(l)} [β_{ij}/(j − 1)!] σ^{(j−1)}(x + z_i),

where

I(l) := {(i, j) : (i ∈ I_ν − {l}, j ∈ J_i) or (i = l, j ∈ J_l − {j_l})}.

Now

(4.8)    c σ(xI + A) b = Σ_{g(z_i)=∞} res_{z=z_i} [σ(x + z) g(z)] = Σ_{i∈I_ν} Σ_{j∈J_i} [g_{ij}/(j − 1)!] σ^{(j−1)}(x + z_i),

where g_{ij} are the coefficients of the partial fraction expansion of g(z):

(4.9)    g(z) = Σ_{i∈I_ν} Σ_{j∈J_i} g_{ij}/(z − z_i)^j.

By (4.7) and (4.8) we can then write

c σ(xI + A) b = [g_{lj_l}/(j_l − 1)!] σ^{(j_l−1)}(x + z_l) + Σ_{(i,j)∈I(l)} [g_{ij}/(j − 1)!] σ^{(j−1)}(x + z_i)
    = Σ_{(i,j)∈I(l)} [g_{lj_l} β_{ij}/(j − 1)!] σ^{(j−1)}(x + z_i) + Σ_{(i,j)∈I(l)} [g_{ij}/(j − 1)!] σ^{(j−1)}(x + z_i)
    = Σ_{(i,j)∈I(l)} γ_{ij} σ^{(j−1)}(x + z_i),

where

(4.10)    γ_{ij} := (g_{ij} + g_{lj_l} β_{ij})/(j − 1)!,   (i, j) ∈ I(l).

Now |I(l)| = n − 1, but there are n independent parameters g_{ij}, i ∈ I_ν, j ∈ J_i. Thus, given β_{ij} and γ_{ij}, g_{lj_l} can be chosen arbitrarily (with the g_{ij} given by (4.10)) such that γ_{ij} remains invariant. Hence c σ(xI + A) b is unchanged. (In fact g_{lj_l} generates a linear subspace.) But g(λ) changes and, for different choices of g_{lj_l}, is not necessarily equal to g̃(λ).

The following result gives examples of activation functions σ : ℝ → ℝ which are l.i. generating.
Lemma 4.5 Let d ∈ ℕ₀. Then

1) The function σ(x) = x^{−d} is l.i. generating of arbitrary order.

2) The monomial σ(x) = x^d is l.i. generating of order d + 1.

3) The function e^{−x²} is l.i. generating of arbitrary order.

Proof For σ(x) = x^{−d} we have σ^{(j)}(x) = (−1)^j Π_{l=0}^{j−1} (d + l) x^{−d−j}. Let z_1, …, z_m ∈ ℂ be distinct complex numbers and let ν = (ν_1, …, ν_m) be a combination of n of size m. Then

Σ_{i=1}^m Σ_{j=0}^{ν_i−1} c_{ij} σ^{(j)}(x + z_i) = Σ_{i=1}^m Σ_{j=0}^{ν_i−1} c̃_{ij}/(x + z_i)^{d+j},

with c̃_{ij} = (−1)^j c_{ij} Π_{l=0}^{j−1} (d + l), which is a strictly proper rational function of degree n + m(d − 1). Therefore it vanishes identically on a nontrivial interval I if and only if the coefficients c_{ij} are all zero. This proves 1).

To prove 2), let us assume for simplicity that (ν_1, …, ν_m) = (1, …, 1), m = d + 1. The general case is treated analogously. Thus suppose there are c_1, …, c_{d+1} ∈ ℝ and d + 1 distinct complex numbers z_1, …, z_{d+1} given with Σ_{i=1}^{d+1} c_i σ(x + z_i) identically zero on the interval I ⊂ ℝ. Thus for all x ∈ I

0 = Σ_{i=1}^{d+1} c_i σ(x + z_i) = Σ_{i=1}^{d+1} c_i (x + z_i)^d = Σ_{j=0}^d [d!/(j!(d − j)!)] (Σ_{i=1}^{d+1} c_i z_i^{d−j}) x^j.

Thus for l = 0, …, d, Σ_{i=1}^{d+1} c_i z_i^l = 0. Equivalently,

[ 1 ⋯ 1 ; z_1 ⋯ z_{d+1} ; ⋮ ⋮ ; z_1^d ⋯ z_{d+1}^d ] (c_1, …, c_{d+1})^T = 0.

Since the Vandermonde matrix is invertible (by the assumption that z_1, …, z_{d+1} are distinct), it follows that c_1 = ⋯ = c_{d+1} = 0. Thus x^d is l.i. generating of order d + 1.

To prove 3), assume there are c_1, …, c_n such that for distinct complex numbers z_1, …, z_n we have

Σ_{i=1}^n c_i e^{−(x+z_i)²} = e^{−x²} Σ_{i=1}^n c_i e^{−z_i²} e^{−2z_i x} = 0

for all x ∈ I. Then by the identity theorem for analytic functions,

(4.11)    Σ_{i=1}^n c̃_i e^{−2z_i x} = 0

for all x ∈ ℝ, with c̃_i = c_i e^{−z_i²}. By differentiating (4.11) n − 1 times and evaluating at x = 0 we obtain

Σ_{i=1}^n c̃_i (−2z_i)^l = 0,   l = 0, …, n − 1.

Thus, using the invertibility of the Vandermonde matrix,

[ 1 ⋯ 1 ; z_1 ⋯ z_n ; ⋮ ⋮ ; z_1^{n−1} ⋯ z_n^{n−1} ] (c̃_1, …, c̃_n)^T = 0

implies that c̃_1 = ⋯ = c̃_n = 0 and hence c_1 = ⋯ = c_n = 0. Thus e^{−x²} is l.i. generating of order n. Since n is arbitrary, the result follows. The general case can be proved using a residue-type argument similar to the proof of theorem 1 in [1].
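The Vandermonde step in part 2) of the proof is easy to reproduce numerically; the degree and shift points below are illustrative choices.

```python
import numpy as np
from math import comb

# Illustrative check of the Vandermonde argument in lemma 4.5, 2): the d+1
# shifted monomials (x + z_i)^d are linearly independent exactly when the
# Vandermonde matrix of the distinct shifts z_i is invertible.
d = 4
z = np.array([-1.0, -0.3, 0.2, 0.9, 1.7])   # d+1 distinct shifts

# Row j holds the x^j-coefficients of the functions (x + z_i)^d,
# i.e. C(d, j) * z_i^(d-j); a dependency sum_i c_i (x + z_i)^d = 0 forces
# the Vandermonde-type conditions sum_i c_i z_i^l = 0, l = 0, ..., d.
M = np.array([[comb(d, j) * zi ** (d - j) for zi in z] for j in range(d + 1)])
V = np.vander(z, increasing=True).T         # rows 1, z_i, ..., z_i^d

assert np.linalg.matrix_rank(V) == d + 1    # distinct z_i => invertible
assert np.linalg.matrix_rank(M) == d + 1    # hence only c = 0 is possible
```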
Remark A simple example of a σ which is not l.i. generating of order ≥ 2 is σ(x) = e^x. In fact, in this case σ(x + z_j) = c_j σ(x + z_i) for c_j = e^{z_j − z_i}, j = 2, …, n.

Remark Similarly, the standard sigmoid function σ(x) = (1 + e^{−x})^{−1} is not l.i. generating of any order ≥ 2. In fact, by the periodicity of the complex exponential function we have σ(x + 2πi) = σ(x − 2πi), i = √−1, for all x. Thus the l.i. condition fails for Z_2 = {2πi, −2πi}. In particular, the above uniqueness result fails for the standard sigmoid case. In order to cover this case we need a further definition.
Definition 4.6 Let Θ = Θ̄ ⊂ ℂ be a self-conjugate subset of ℂ. A function σ : ℝ → ℝ is said to be l.i. generating of order n on Θ if for all m ≤ n, for all combinations ν = (ν_1, …, ν_m) of n of size m, and for any self-conjugate subset Z_m ⊂ Θ of distinct points of Θ, Σ(ν, Z_m) consists of linearly independent functions.

Of course, for Θ = ℂ this definition coincides with definition 4.3. We have the following extension of theorem 4.4. The proof is completely analogous to that of theorem 4.4 and is thus omitted.
Theorem 4.7 (Local Uniqueness) Let σ : ℝ → ℝ be analytic and let Θ ⊂ ℂ be a self-conjugate subset contained in the domain of holomorphy of σ. Let I be a nontrivial subinterval of Θ ∩ ℝ. Suppose σ : ℝ → ℝ is l.i. generating on Θ of order at least 2n, n ∈ ℕ. Then for any two minimal σ-realizations (A, b, c) and (Ã, b̃, c̃) of orders at most n the following equivalence holds:

(4.12)    c σ(xI + A) b = c̃ σ(xI + Ã) b̃  ∀x ∈ I   ⟺   c(λI − A)^{−1} b = c̃(λI − Ã)^{−1} b̃  ∀λ ∈ ℝ.
The next result shows that the standard sigmoid function is l.i. generating on a suitable subset Θ.

Lemma 4.8 Let Θ := {z ∈ ℂ : |Im z| < π}. Then the standard sigmoid function σ(x) = (1 + e^{−x})^{−1} is l.i. generating on Θ of arbitrary order.

Proof Let distinct points z_1, …, z_n ∈ Θ and c_1, …, c_n ∈ ℝ be given such that Σ_{i=1}^n c_i σ(x + z_i) = 0. (The general case is treated similarly, using the fact that σ′(x) = σ(x)(1 − σ(x)), so that all higher order derivatives of σ(x) can be expressed as polynomials in σ(x).) Thus for all x ∈ I ⊂ Θ

0 = Σ_{i=1}^n c_i σ(x + z_i) = Σ_{i=1}^n c_i e^x/(e^x + e^{−z_i}) = e^x r(e^x),

with r(z) = Σ_{i=1}^n c_i/(z + e^{−z_i}). Thus r(z) vanishes identically. As z_1, …, z_n ∈ Θ are pairwise distinct and as z ↦ e^z is injective on Θ, the poles −e^{−z_i}, i = 1, …, n, of r(z) are pairwise distinct. Thus r(z) is of degree n and identically zero. Therefore c_1 = ⋯ = c_n = 0 and the result follows.

Albertini et al. [1] showed that σ(x) = 1/(1 + e^{−x}) satisfies their uniqueness condition IP using both a residue argument (as above) and directly using Cauchy's formula for det(1/(a_i + b_j))_{ij} (see [6, lemma 11.3.1]).
5 Main Result

As a consequence of the uniqueness theorems 4.4 and 4.7 we can now state our main result on the existence of minimal σ-realizations of a function f(x). It extends a parallel result for standard transfer function realizations, where σ(x) = x^{−1}.

Theorem 5.1 (Realization) Let Θ ⊂ ℂ be a self-conjugate subset, contained in the domain of holomorphy of a real meromorphic function σ : ℝ → ℝ. Suppose σ is l.i. generating on Θ of order at least 2n and assume f(x) has a finite-dimensional σ-realization (A, b, c) of dimension at most n such that A has all its eigenvalues in Θ.

1. There exists a minimal σ-realization (A_1, b_1, c_1) of f(x) of degree δ(f) ≤ dim(A, b, c). Furthermore, there exists an invertible matrix S such that

(5.1)    SAS^{−1} = [ A_1 A_2 ; 0 A_3 ],   Sb = [ b_1 ; 0 ],   cS^{−1} = [ c_1 c_2 ]

holds.

2. If (A_1, b_1, c_1) and (A′_1, b′_1, c′_1) are minimal σ-realizations of f(x) such that the eigenvalues of A_1 and A′_1 are contained in Θ, then there exists a unique invertible matrix S such that

(5.2)    (A′_1, b′_1, c′_1) = (SA_1S^{−1}, Sb_1, c_1S^{−1})

holds.

3. A σ-realization (A, b, c) is minimal if and only if (A, b, c) is controllable and observable, i.e. if and only if (A, b, c) satisfies the generic rank conditions

rank(b, Ab, …, A^{n−1}b) = n,   rank [ c ; cA ; ⋮ ; cA^{n−1} ] = n,

for A ∈ K^{n×n}, b ∈ K^n, c^T ∈ K^n.

Proof The existence of minimal σ-realizations with δ(f) ≤ dim(A, b, c) is trivial: just pick any σ-realization (A_1, b_1, c_1), with eigenvalues of A_1 contained in Θ, from the set of all σ-realizations of f(x) with smallest possible dimension ≤ dim(A, b, c).

Let (A_1, b_1, c_1) be a minimal σ-realization of f(x) with eigenvalues of A_1 contained in Θ. By the uniqueness theorem 4.7,

c_1(λI − A_1)^{−1} b_1 = c(λI − A)^{−1} b   for all λ.

Thus (5.1) follows from the Kalman decomposition; see [15]. Moreover, statements 2 and 3 follow immediately from Kalman's realization theorem for strictly proper rational transfer functions [15].
Remark The use of the terms "observable" and "controllable" is solely for formal correspondence with standard systems theory. There are no dynamical systems actually under consideration here.

Remark For any σ-realization (A, b, c) of the form

A = [ A_11 A_12 ; 0 A_22 ],   b = [ b_1 ; 0 ],   c = [ c_1 c_2 ],

we have

σ(A) = [ σ(A_11) ∗ ; 0 σ(A_22) ]

and thus c σ(xI + A) b = c_1 σ(xI + A_11) b_1. Thus transformations of the above kind always reduce the dimension of a σ-realization.

The following results are immediate consequences of theorem 5.1 and lemma 4.5.
Corollary 5.2 Let $\sigma(x) = g^{(k)}(x)$ be a $k$-th derivative of a rational function $g(x)$. Then:

1. $f(x) = c(xI - A)^{-k-1} b$ for some $(A, b, c) \in K^{n \times n} \times K^n \times K^{1 \times n}$.

2. If $(A, b, c)$ and $(\tilde{A}, \tilde{b}, \tilde{c})$ are minimal realizations such that for all $x \in I$
\[
f(x) = c(xI - A)^{-k-1} b = \tilde{c}(xI - \tilde{A})^{-k-1} \tilde{b},
\]
then there exists a unique invertible matrix $S$ such that
\[
(\tilde{A}, \tilde{b}, \tilde{c}) = (SAS^{-1},\, Sb,\, cS^{-1}).
\]

3. A realization $(A, b, c) \in K^{n \times n} \times K^n \times K^{1 \times n}$ satisfying $f(x) = c(xI - A)^{-k-1} b$ is of minimal size $n$ if and only if $(A, b, c)$ is controllable and observable.
Proof. By lemma 4.5 the function $\sigma(x) = x^{-k}$ is l.i. generating of all orders. Applying theorem 5.1 to $\sigma(x) = x^{-k}$ and $\Omega = \mathbb{C} \setminus \{0\}$ completes the proof.

The next result is a special case of a more general result appearing in [13].
Corollary 5.3 Let $f(x)$ be a polynomial of degree $2n$. Suppose
\[
f(x) = \sum_{i=1}^{s} c_i (x - a_i)^{2n} = \sum_{i=1}^{s} c_i' (x - a_i')^{2n}
\]
are two sums of $2n$-th power representations of $f(x)$ of minimal length $s$ satisfying $s < n$. Then $(a_i', c_i') = (a_{\pi(i)}, c_{\pi(i)})$, $i = 1, \dots, s$, for a permutation $\pi : \{1, \dots, s\} \to \{1, \dots, s\}$.
Proof. By lemma 4.5 the function $\sigma(x) = x^{2n}$ is l.i. generating of order $2n + 1$. Then $(A, b, c)$ and $(A', b', c')$ defined by
\[
A = \operatorname{diag}(a_1, \dots, a_s), \quad b = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}, \quad c = (c_1, \dots, c_s),
\]
\[
A' = \operatorname{diag}(a_1', \dots, a_s'), \quad b' = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}, \quad c' = (c_1', \dots, c_s')
\]
are $\sigma$-realizations of $f(x)$ of order $s < n$. By minimality of $s$, $(A, b, c)$ and $(A', b', c')$ are controllable and observable. Applying theorem 5.1 to $\sigma(x) = x^{2n}$, $\Omega = \mathbb{C}$, yields $(A', b', c') = (SAS^{-1}, Sb, cS^{-1})$ for a unique invertible matrix $S$. Since $SAS^{-1} = A'$ with $A$ and $A'$ both diagonal with distinct entries, and $Sb = b'$, $S$ must be a permutation matrix. The result follows.
As a final consequence we obtain another proof of the following result (cf. [24], which contains a more general result).

Corollary 5.4 Let $\sigma(x) = (1 + e^{-x})^{-1}$ and let
\[
f(x) = \sum_{i=1}^{n} c_i\, \sigma(x - a_i) = \sum_{i=1}^{n} c_i'\, \sigma(x - a_i')
\]
be two minimal length $\sigma$-representations with
\[
|\Im a_i| < \pi, \quad |\Im a_i'| < \pi, \qquad i = 1, \dots, n.
\]
Then
\[
(a_i', c_i') = (a_{\pi(i)}, c_{\pi(i)})
\]
for a unique permutation $\pi : \{1, \dots, n\} \to \{1, \dots, n\}$. In particular, minimal length representations (1.1) with real coefficients $a_i$ and $c_i$ are unique up to a permutation of the summands.
Proof. By lemma 4.8, $\sigma(x) = (1 + e^{-x})^{-1}$ is l.i. generating on $\Omega = \{z \in \mathbb{C} : |\Im z| < \pi\}$ of arbitrary order. By minimality of the representations, $a_i \ne a_j$ and $a_i' \ne a_j'$ for $i \ne j$. Let $(A, b, c)$ and $(A', b', c')$ be defined by
\[
A = \operatorname{diag}(-\log a_1, \dots, -\log a_n), \quad b = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}, \quad c = (c_1, \dots, c_n),
\]
\[
A' = \operatorname{diag}(-\log a_1', \dots, -\log a_n'), \quad b' = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}, \quad c' = (c_1', \dots, c_n'),
\]
using the standard branch of the complex logarithm function. Then $(A, b, c)$ and $(A', b', c')$ are controllable and observable $\sigma$-realizations of $f(x)$. By theorem 5.1 applied to $\sigma(x) = (1 + e^{-x})^{-1}$ and $\Omega = \{z \in \mathbb{C} : |\Im z| < \pi\}$, the result follows as in the previous proof.
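A trivial numerical sanity check of the permutation statement (with made-up real parameters): reordering the pairs $(a_i, c_i)$ leaves $f$ unchanged.

```python
import numpy as np

def sigma(x):
    """The sigmoid activation of corollary 5.4."""
    return 1.0 / (1.0 + np.exp(-x))

def f(x, a, c):
    """f(x) = sum_i c_i * sigma(x - a_i)."""
    return sum(ci * sigma(x - ai) for ai, ci in zip(a, c))

# Hypothetical real parameters and a permutation of them.
a = [0.3, -1.2, 2.5]
c = [1.0, -0.7, 0.4]
perm = [2, 0, 1]
a2 = [a[i] for i in perm]
c2 = [c[i] for i in perm]

xs = np.linspace(-5, 5, 11)
ok = np.allclose(f(xs, a, c), f(xs, a2, c2))
print(ok)   # True: permuted representations agree
```

The corollary asserts the converse direction as well: a permutation is the *only* freedom in a minimal length representation.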
6 Jordan Canonical Forms

We now explore the connection between the Jordan canonical form for minimal realizations $(A, b, c)$ and $\sigma$-realizations. Consider the partial fraction decomposition of a transfer function $g(\lambda) = c(\lambda I - A)^{-1} b$:

(6.1)
\[
g(\lambda) = \sum_{i=1}^{m} \sum_{j=1}^{\nu_i} \frac{c_{ij}}{(\lambda - z_i)^j}.
\]

Then

(6.2)
\[
f(x) = \frac{1}{2\pi i} \oint_\Gamma \sigma(x + \lambda)\, g(\lambda)\, d\lambda
     = \sum_{i=1}^{m} \sum_{j=1}^{\nu_i} c_{ij} \operatorname*{res}_{\lambda = z_i} \frac{\sigma(x + \lambda)}{(\lambda - z_i)^j}
     = \sum_{i=1}^{m} \sum_{j=1}^{\nu_i} \frac{c_{ij}}{(j-1)!}\, \sigma^{(j-1)}(x + z_i).
\]
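Formula (6.2) can be checked numerically by discretizing the contour integral. In this sketch (pole locations and coefficients invented), $\sigma$ is the sigmoid, $g$ has one simple and one double pole, and the contour is a circle of radius $2 < \pi$, so the poles of $\sigma(x + \lambda)$, which lie at $\Im\lambda = \pm\pi$ or beyond, stay outside:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigma(z):
    s = sigma(z)
    return s * (1.0 - s)

# Hypothetical pole data: simple pole at z1, double pole at z2.
z1, c11 = -0.8, 1.5
z2, c21, c22 = 0.5, -0.6, 0.9

def g(lam):
    return c11 / (lam - z1) + c21 / (lam - z2) + c22 / (lam - z2) ** 2

def f_residues(x):
    # Closed form (6.2): sum_{i,j} c_ij / (j-1)! * sigma^{(j-1)}(x + z_i)
    return c11 * sigma(x + z1) + c21 * sigma(x + z2) + c22 * dsigma(x + z2)

def f_contour(x, R=2.0, N=4000):
    # (1 / 2 pi i) * integral over |lam| = R of sigma(x + lam) g(lam) d lam,
    # approximated by the trapezoid rule on the circle.
    t = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
    lam = R * np.exp(1j * t)
    vals = sigma(x + lam) * g(lam) * (1j * lam)   # integrand * d lam / dt
    return (vals.mean() / 1j).real

err = abs(f_contour(0.7) - f_residues(0.7))
print(err)   # tiny: contour integral matches the sum of residues
```

The trapezoid rule converges geometrically on a closed analytic contour, so a modest number of quadrature points already reproduces the residue sum to machine precision.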
By the realization theorem, any minimal $\sigma$-realization $(A, b, c)$ of $f(x)$ satisfying (2.7) is a controllable and observable realization of the transfer function $g(\lambda)$. Thus by the Jordan control canonical form [15], $(A, b, c)$ is similar to $(A_J, b_J, c_J) = (SAS^{-1}, Sb, cS^{-1})$ with

(6.3)
\[
A_J = \operatorname{diag}(z_1 I + N_1, \dots, z_m I + N_m),
\]
(6.4)
\[
N_i = \begin{bmatrix}
0 & 1 & & 0 \\
\vdots & \ddots & \ddots & \vdots \\
0 & & 0 & 1 \\
0 & \cdots & 0 & 0
\end{bmatrix} \in \mathbb{R}^{\nu_i \times \nu_i}, \qquad i = 1, \dots, m,
\]
(6.5)
\[
b_J = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_m \end{bmatrix}, \qquad
e_i = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix} \in \mathbb{R}^{\nu_i},
\]
(6.6)
\[
c_J = [c_{11} \dots c_{1\nu_1} \dots c_{m1} \dots c_{m\nu_m}] \in K^n,
\]

where $\nu_i$ are the multiplicities of the poles $z_i$ of $g(\lambda)$, and $c_{ij}$ are the coefficients from the partial fraction expansion (6.1). Thus for a $\sigma$-realization $(A, b, c)$ of $f(x)$ in Jordan control canonical form (6.3)-(6.6),

(6.7)
\[
c\,\sigma(xI + A)\,b = \sum_{i=1}^{m} \sum_{j=1}^{\nu_i} \frac{c_{ij}}{(j-1)!}\, \sigma^{(j-1)}(x + z_i).
\]
In particular, in the generic case where $(A, b, c)$ has $n$ distinct, possibly complex, poles,
\[
c\,\sigma(xI + A)\,b = \sum_{i=1}^{n} c_i\, \sigma(x + z_i).
\]
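For an entire activation this generic-case formula can be verified directly. A sketch with $\sigma(x) = e^x$, so that $\sigma(xI + A)$ is the matrix exponential (computed here with scipy, on invented random data): diagonalizing $A = T \operatorname{diag}(z) T^{-1}$ gives the residues $\tilde c_i = (cT)_i (T^{-1}b)_i$, and the two sides agree.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))   # generically has distinct eigenvalues
b = rng.standard_normal(n)
c = rng.standard_normal(n)

z, T = np.linalg.eig(A)           # A = T diag(z) T^{-1}
resid = (c @ T) * np.linalg.solve(T, b)   # residues c~_i = (cT)_i (T^{-1}b)_i

def lhs(x):
    """c sigma(xI + A) b with sigma = exp."""
    return c @ expm(x * np.eye(n) + A) @ b

def rhs(x):
    """sum_i c~_i sigma(x + z_i); imaginary parts cancel in conjugate pairs."""
    return np.sum(resid * np.exp(x + z)).real

err = max(abs(lhs(x) - rhs(x)) for x in np.linspace(-1.0, 1.0, 5))
print(err)   # tiny
```

Complex eigenvalues of the real matrix come in conjugate pairs, so the complex terms of the sum combine into real contributions, exactly the situation addressed by the next remark.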
Remark. When $f$ is real, the poles and residues always appear in complex conjugate pairs. Thus, by analogy with the systems theory case, it is always possible (in the generic case of no repeated poles) to write
\[
c\,\sigma(xI + A)\,b = \sum_{z_i \in \mathbb{R}} c_i\, \sigma(x + z_i) + \sum_{z_i \notin \mathbb{R}} \psi(c_i^R, c_i^I, z_i^R, z_i^I, x),
\]
where $z_i \in \operatorname{spect} A$, the $c_i^R, c_i^I, z_i^R, z_i^I$ are the real and imaginary parts of the residues and poles of $c(\lambda I - A)^{-1} b$, and
\[
\psi(c^R, c^I, z^R, z^I, x) := (c^R + i c^I)\, \sigma(x + (z^R + i z^I)) + (c^R - i c^I)\, \sigma(x + (z^R - i z^I))
\]
is the canonical second-order building block. Thus it is simple to cope with complex poles in $\sigma$-realizations using only real parameters. As an example, consider $\sigma(x) = (1 + e^{-x})^{-1}$. For simplicity write $\alpha = c^R$, $\beta = c^I$, $\gamma = z^R$ and $\delta = z^I$. We then obtain
\[
\psi(\alpha, \beta, \gamma, \delta, x)
= \frac{2\alpha\,(1 + e^{-x-\gamma} \cos\delta) - 2\beta\, e^{-x-\gamma} \sin\delta}{1 + 2 e^{-x-\gamma} \cos\delta + e^{-2x-2\gamma}}.
\]
The non-uniqueness of the parametrization (due to the periodicity of $e^z$) is explicitly apparent in the $\cos\delta$ and $\sin\delta$ terms. Another example is $\sigma(x) = e^{-x^2}$. In this case,
\[
\psi(\alpha, \beta, \gamma, \delta, x)
= 2 e^{-x^2 - 2\gamma x - \gamma^2 + \delta^2} \left( \alpha \cos(2\delta x + 2\gamma\delta) + \beta \sin(2\delta x + 2\gamma\delta) \right).
\]
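Both closed forms can be checked against the defining sum $\psi = (c^R + ic^I)\sigma(x+z) + (c^R - ic^I)\sigma(x+\bar z)$ using complex arithmetic; a sketch with arbitrary made-up parameters:

```python
import numpy as np

alpha, beta, gamma, delta = 0.7, -0.3, 0.4, 1.1   # made-up real parameters

def psi_complex(sigma, x):
    """Defining conjugate-pair sum: 2 Re[(alpha + i beta) sigma(x + z)]."""
    z = gamma + 1j * delta
    cc = alpha + 1j * beta
    return (cc * sigma(x + z) + np.conj(cc) * sigma(x + np.conj(z))).real

def psi_sigmoid_real(x):
    """Real closed form for sigma(x) = 1/(1 + e^{-x})."""
    u = np.exp(-x - gamma)
    num = 2.0 * alpha * (1.0 + u * np.cos(delta)) - 2.0 * beta * u * np.sin(delta)
    return num / (1.0 + 2.0 * u * np.cos(delta) + u * u)

def psi_gauss_real(x):
    """Real closed form for sigma(x) = e^{-x^2}."""
    env = 2.0 * np.exp(-x * x - 2.0 * gamma * x - gamma ** 2 + delta ** 2)
    phase = 2.0 * delta * x + 2.0 * gamma * delta
    return env * (alpha * np.cos(phase) + beta * np.sin(phase))

xs = np.linspace(-3.0, 3.0, 13)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
gauss = lambda z: np.exp(-z * z)
err1 = np.max(np.abs(psi_complex(sigmoid, xs) - psi_sigmoid_real(xs)))
err2 = np.max(np.abs(psi_complex(gauss, xs) - psi_gauss_real(xs)))
print(err1, err2)   # both tiny
```

This confirms that the second-order sections involve only the four real parameters $\alpha, \beta, \gamma, \delta$, with no complex arithmetic needed at evaluation time.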
Remark. Complex parametrizations for standard neural networks have been discussed in [10, 19]. The motivation there was to be able to use neural networks with complex inputs.
6.1 Diagonalizable $\sigma$-Realizations

A $\sigma$-realization $(A, b, c)$ of $f(x)$ is called $K$-diagonalizable if the operator $A : V \to V$ is diagonalizable over $K$. Since the set of degree $n$ rational functions $r(x) \in \mathbb{C}(x)$ with distinct poles is dense in the set of all rational functions of degree $n$, $\mathbb{C}$-diagonalizability of a $\sigma$-realization $(A, b, c)$ is a generic property. Over $\mathbb{R}$, a necessary and sufficient condition for a diagonalizable $\sigma$-realization is that the poles of the associated rational function $c(xI - A)^{-1} b$ are on the real axis and simple. Certainly this is not a generic property. Note that a sufficient condition for a real rational function $g(x) = \sum_{i=1}^{\infty} g_i x^{-i}$ of degree $n$ to have an $\mathbb{R}$-diagonalizable realization is that the $n \times n$ Hankel matrix $(g_{i+j-1})_{i,j=1}^{n}$ is positive definite. In this case the residues of the partial fraction decomposition of $g(x)$ are also positive. The following result provides an answer to question 4.
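The Hankel condition is easy to test numerically: for $g(x) = \sum_i c_i/(x - a_i)$ with distinct real poles, the expansion coefficients are $g_k = \sum_i c_i a_i^{k-1}$, and the Hankel matrix factors as $V \operatorname{diag}(c) V^T$ with $V$ Vandermonde, hence is positive definite exactly when all residues are positive. A sketch with invented data:

```python
import numpy as np

a = np.array([-1.0, 0.3, 2.0])   # distinct real poles (hypothetical)
c = np.array([0.5, 1.2, 0.8])    # positive residues
n = len(a)

def hankel_of_residues(a, c, n):
    """n x n Hankel (g_{i+j-1}) with g_k = sum_i c_i a_i^{k-1}."""
    g = np.array([np.sum(c * a ** (k - 1)) for k in range(1, 2 * n)])
    return np.array([[g[i + j] for j in range(n)] for i in range(n)])

H = hankel_of_residues(a, c, n)
eigs = np.linalg.eigvalsh(H)
print(eigs.min() > 0)   # True: positive residues give a positive definite Hankel

# Flipping one residue negative destroys positive definiteness.
H2 = hankel_of_residues(a, np.array([0.5, -1.2, 0.8]), n)
print(np.linalg.eigvalsh(H2).min() > 0)   # False
```

The factorization $H = V \operatorname{diag}(c) V^T$ makes $H$ congruent to $\operatorname{diag}(c)$, so its signature simply counts positive and negative residues.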
Theorem 6.1 Let $\sigma : \mathbb{R} \to \mathbb{R}$ be analytic, possibly only on an interval $I \subset \mathbb{R}$, and let $\Omega \supset I$ be a self-conjugate subset of $\mathbb{C}$ contained in the domain of holomorphy of $\sigma$. Assume that $\sigma$ is l.i. generating on $\Omega$ of order at least $2n$. The set of functions $f : \mathbb{R} \to \mathbb{R}$ with $\sigma$-degree $\delta_\sigma(f) = n$ which admit a diagonalizable $\sigma$-realization over $\mathbb{R}$ of length $n$ has the structure of an analytic manifold of dimension $2n$. It has exactly $2^n$ connected components. Each such $f(x)$ with $\sigma$-degree $\delta_\sigma(f) = n$ has a decomposition

(6.8)
\[
f(x) = \sum_{i=1}^{n} c_i\, \sigma(x - a_i)
\]

with real numbers $c_i \ne 0$, $a_i \in \Omega$, $a_i \ne a_j$ for $i \ne j$. The different connected components are characterized by $a_1 < \dots < a_n$ and $\operatorname{sign}(c_i) = \varepsilon_i$, $\varepsilon_i \in \{-1, 1\}$, $i = 1, \dots, n$. Moreover, the decomposition (6.8) is unique in the sense that if real numbers $a_1, \dots, a_n \in \Omega$, $c_1, \dots, c_n$ and $a_1', \dots, a_n' \in \Omega$, $c_1', \dots, c_n'$ satisfy (6.8), then $a_i' = a_{\pi(i)}$, $c_i' = c_{\pi(i)}$, $i = 1, \dots, n$, for a permutation $\pi : \{1, \dots, n\} \to \{1, \dots, n\}$.
Proof. Let $\Gamma_n \subset \Omega^n \times \mathbb{R}^n$ denote the open subset of $\mathbb{R}^{2n}$ defined by
\[
\Gamma_n := \{(a_1, \dots, a_n, c_1, \dots, c_n) \in \Omega^n \times (\mathbb{R} \setminus \{0\})^n : a_1 < \dots < a_n,\ a_i \in \Omega \cap \mathbb{R},\ c_i \ne 0\}.
\]
For any $(a, c) \in \Gamma_n$ let $g_{ac}(x)$ be the rational function defined by
\[
g_{ac}(x) := \sum_{i=1}^{n} \frac{c_i}{x - a_i}.
\]
Let $f_{ac}(x)$ be defined by
\[
f_{ac}(x) := \frac{1}{2\pi i} \oint_\Gamma \sigma(x + \lambda)\, g_{ac}(\lambda)\, d\lambda,
\]
where $\Gamma$ is a sufficiently large closed curve containing all the poles of $g_{ac}(x)$, $(a, c) \in \Gamma_n$. Then the map $(a, c) \mapsto f_{ac}(x)$ is, by theorem 4.7, an injective map on $\Gamma_n$. The image $M_n$ is exactly the class of functions described by (6.8). $\Gamma_n$ is a smooth analytic manifold of dimension $2n$ with $2^n$ connected components characterized by $\operatorname{sign}(c_i) = \varepsilon_i$, $\varepsilon_i \in \{-1, 1\}$, $i = 1, \dots, n$. Endow $M_n$ with the unique structure of an analytic manifold such that $(a, c) \mapsto f_{ac}$ is an analytic diffeomorphism. This completes the proof.
7 Conclusions and Related work

We have drawn a connection between the realization theory for linear dynamical systems and neural network representations. This is an exciting connection because it opens the way for the application of some of the machinery of realization theory to neural networks. A number of new open problems also arise; for example, there is the problem of partial realizations [11, 16]. We are currently exploring the application of the theory of Padé approximants and continued fractions to the neural networks considered here.

After this paper was substantially completed the authors became aware of the work of Barrar and Loeb [3]. They have considered parametrizations like (2.7) for general nonlinear families. We have considered a slightly more specific case, and we have obtained results not obtainable in their setup.

Finally, let us point out that the ability to parametrize general neural network representations in different ways could have a profound effect on learning algorithms: simply performing gradient descent in the different parameter spaces can be expected to yield different behaviour.
Acknowledgements This work was supported by the Australian Research Council, the Australian Telecommunications and Electronics Research Board, and the Boeing Commercial Aircraft Company (thanks to John Moore).
References

[1] F. Albertini, E. D. Sontag and V. Maillot, Uniqueness of Weights for Neural Networks, in Artificial Neural Networks for Speech and Vision, R. Mammone, ed., Chapman and Hall, London, 1993, pp. 115-125.
[2] A. C. Antoulas and B. D. O. Anderson, On the Scalar Rational Interpolation Problem, IMA Journal of Mathematical Control and Information, 3 (1986), pp. 61-88.
[3] R. B. Barrar and H. L. Loeb, On Extended Varisolvent Families, Journal d'Analyse Mathématique, 26 (1973), pp. 243-254.
[4] K. L. Blackmore, R. C. Williamson and I. M. Y. Mareels, Local Minima and Attractors at Infinity of Gradient Descent Learning Algorithms, to appear in Journal of Mathematical Systems, Estimation and Control, 1995.
[5] G. Cybenko, Approximation by Superpositions of a Sigmoidal Function, Mathematics of Control, Signals, and Systems, 2 (1989), pp. 303-314.
[6] P. J. Davis, Interpolation and Approximation, Dover, New York, 1975.
[7] W. F. Donoghue, Jr., Monotone Matrix Functions and Analytic Continuation, Springer-Verlag, Berlin, 1974.
[8] K.-I. Funahashi, On the Approximate Realization of Continuous Mappings by Neural Networks, Neural Networks, 2 (1989), pp. 183-192.
[9] W. Gautschi, A Survey of Gauss-Christoffel Quadrature Formulae, in E. B. Christoffel: The Influence of his Work on Mathematics and the Physical Sciences, P. Butzer and F. Fehér, eds., Birkhäuser, Basel, 1981, pp. 72-147.
[10] G. M. Georgiou and C. Koutsougeras, Complex Domain Backpropagation, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 39 (1992), pp. 330-334.
[11] W. B. Gragg and A. Lindquist, On the Partial Realization Problem, Linear Algebra and its Applications, 50 (1983), pp. 277-319.
[12] G. H. Hardy and E. M. Wright, An Introduction to the Theory of Numbers, Oxford University Press, Oxford, 1979.
[13] U. Helmke, Waring's Problem for Binary Forms, Journal of Pure and Applied Algebra, 80 (1992), pp. 29-45.
[14] K. Hornik, M. Stinchcombe and H. White, Multilayer Feedforward Networks are Universal Approximators, Neural Networks, 2 (1989), pp. 359-366.
[15] T. Kailath, Linear Systems, Prentice-Hall, Englewood Cliffs, 1980.
[16] R. E. Kalman, On Partial Realizations, Transfer Functions, and Canonical Forms, Acta Polytechnica Scandinavica, 31 (1979), pp. 9-32.
[17] R. E. Kalman, P. L. Falb and M. A. Arbib, Topics in Mathematical System Theory, McGraw-Hill, New York, 1969.
[18] T. Kato, Perturbation Theory for Linear Operators, Springer-Verlag, Berlin, 1966.
[19] H. Leung and S. Haykin, The Complex Backpropagation Algorithm, IEEE Transactions on Signal Processing, 39 (1991), pp. 2101-2104.
[20] C. Martin and M. Stamp, A Note on the Error in Gaussian Quadrature, preprint, 1992.
[21] R. A. Silverman, Introductory Complex Analysis, Dover, New York, 1972.
[22] E. D. Sontag, Mathematical Control Theory: Deterministic Finite Dimensional Systems, Springer-Verlag, New York, 1990.
[23] E. D. Sontag and H. J. Sussmann, Backpropagation can Give Rise to Spurious Local Minima Even for Networks Without Hidden Layers, Complex Systems, 3 (1989), pp. 91-106.
[24] H. J. Sussmann, Uniqueness of Weights for Minimal Feedforward Nets with a Given Input-Output Map, Neural Networks, 5 (1992), pp. 589-593.
[25] J. J. Sylvester, An Essay on Canonical Forms, Supplement to a Sketch of a Memoir on Elimination, in Collected Mathematical Papers I, paper 34, 1851.
[26] R. C. Williamson and U. Helmke, Existence and Uniqueness Results for Neural Network Approximations, to appear, IEEE Transactions on Neural Networks, 1994.