Bounds for the Computational Power and Learning Complexity of Analog Neural Nets (Extended Abstract)

Wolfgang Maass*
Institute for Theoretical Computer Science
Technische Universitaet Graz
Klosterwiesgasse 32/2, A-8010 Graz, Austria
e-mail: [email protected]

October 23, 1992
Abstract

It is shown that feedforward neural nets of constant depth with piecewise polynomial activation functions and arbitrary real weights can be simulated for boolean inputs and outputs by neural nets of a somewhat larger size and depth with heaviside gates and weights from $\{0,1\}$. This provides the first known upper bound for the computational power and VC-dimension of such neural nets. It is also shown that in the case of piecewise linear activation functions one can replace arbitrary real weights by rational numbers with polynomially many bits, without changing the boolean function that is computed by the neural net. In addition we improve the best known lower bound for the VC-dimension of a neural net with $w$ weights and gates that use the heaviside function (or other common activation functions such as $\sigma$) from $\Omega(w)$ to $\Omega(w \log w)$. This implies the somewhat surprising fact that the Baum-Haussler upper bound for the VC-dimension of a neural net with heaviside gates is asymptotically optimal. Finally it is shown that neural nets with piecewise polynomial activation functions and a constant number of analog inputs are probably approximately learnable (in Valiant's model for PAC-learning).

* Also affiliated with the University of Illinois at Chicago.
1 Introduction

We examine in this paper the computational power and learning complexity of analog feedforward neural nets $\mathcal{N}$, i.e. of circuits with analog computational elements in which certain parameters are treated as programmable variables. We focus on neural nets $\mathcal{N}$ of bounded depth in which each gate $g$ computes a function from $\mathbf{R}^m$ into $\mathbf{R}$ of the form $\langle y_1,\ldots,y_m\rangle \mapsto \gamma^g(Q^g(y_1,\ldots,y_m))$. We assume that for each gate $g$, $\gamma^g$ is some fixed piecewise polynomial activation function (also called response function). This function is applied to some polynomial $Q^g(y_1,\ldots,y_m)$ of bounded degree with arbitrary real coefficients, where $y_1,\ldots,y_m$ are the real valued inputs to gate $g$. The coefficients ("weights") of $Q^g$ are the programmable variables of $\mathcal{N}$, whose values may arise from some learning process.

We are primarily interested in the case where the neural net $\mathcal{N}$ computes (respectively learns) a boolean valued function. For that purpose we assume that the real valued output of the output gate $g_{out}$ of $\mathcal{N}$ is "rounded off". More precisely, we assume that there is an "outer threshold" $T_{out}$ (which belongs to the variable parameters of $\mathcal{N}$) such that the output of $\mathcal{N}$ is "1" whenever the real valued output $z$ of $g_{out}$ satisfies $z \geq T_{out}$, and "0" if $z < T_{out}$. In some results of this paper we also assume that the input $\langle x_1,\ldots,x_n\rangle$ of $\mathcal{N}$ is boolean-valued. It should be noted that this does not affect the capacity of $\mathcal{N}$ to carry out on its intermediate levels (i.e. in its "hidden units") computations over reals, whose real-valued results are then transmitted to the next layer of gates. Circuits of this type have rarely been considered in computational complexity theory, and they give rise to the principal question whether these intermediate analog computational elements will allow the circuit to compute more complex boolean functions than a circuit with a similar layout but digital computational elements. Note that circuits with analog computational elements have an extra source of potentially unlimited parallelism at their disposal, since they can execute operations on numbers of arbitrary bit-length in one step, and they can transmit numbers of arbitrary bit-length from one gate to the next.

One already knows quite a bit about the special case of such neural nets $\mathcal{N}$ where each gate $g$ is a "linear threshold gate". In this case each polynomial $Q^g(y_1,\ldots,y_m)$ is of degree 1 (i.e. a weighted sum), and each activation function $\gamma^g$ in $\mathcal{N}$ is the "heaviside function" (also called "hard limiter") $H$ defined by
\[ H(y) = \begin{cases} 0, & \text{if } y < 0 \\ 1, & \text{if } y \geq 0. \end{cases} \]
A classical result of Muroga et al. [M] shows that for boolean inputs the arbitrary real weights of a linear threshold gate can be replaced by integers of size $2^{O(n \log n)}$ (hence of bit-length $O(n \log n)$) without changing the computed boolean function. Polynomials $Q^g$ of degree 2 occur for example in gates where $Q^g(y_1,\ldots,y_m)$ computes the square of the euclidean distance of the input $\langle y_1,\ldots,y_m\rangle$ from some "center" $\langle c_1,\ldots,c_m\rangle$ (the latter may be determined through a learning process). Apparently Corollary 3.2 and Theorem 4.5 of this paper provide the first upper bounds for the computational power and learning complexity of neural nets with non-boolean activation functions that employ non-linear gate inputs $Q^g(y_1,\ldots,y_m)$ with arbitrary coefficients.

The power of feedforward neural nets with other activation functions besides $H$ has previously been investigated in [RM] (ch. 10), [S1], [S2], [H], [MSS], [DS], [SS]. It was shown in [MSS] for a very general class of activation functions $\gamma$ that neural nets $(\mathcal{N}_n)_{n \in \mathbf{N}}$ of constant depth and size $O(n^{O(1)})$ with real weights of size $O(n^{O(1)})$ and separation $\Omega(1/n^{O(1)})$ (between the un-rounded circuit-outputs for rejected and accepted inputs) can compute only boolean functions in $TC^0$. It follows from a result of Sontag [S2] that the assumptions on the weight-size and separation are essential for this upper bound: he constructed an arbitrarily smooth monotone function $\gamma$ (which can be made to satisfy the conditions on $\gamma$ in the quoted result of [MSS]) and neural nets $\mathcal{N}_n$ of size 2 (!) with activation function $\gamma$ such that $\mathcal{N}_n$ can compute with sufficiently large weights any boolean function $F_n : \{0,1\}^n \to \{0,1\}$ (hence $\mathcal{N}_n$ has VC-dimension $2^n$).
These results leave open the question about the computational power and learning complexity of neural nets with arbitrary weights that employ common analog activation functions $\gamma^g$, such as the piecewise linear function $\pi$ defined by
\[ \pi(y) = \begin{cases} 0, & \text{if } y \leq 0 \\ y, & \text{if } 0 \leq y \leq 1 \\ 1, & \text{if } y \geq 1 \end{cases} \]
([L] refers to a gate $g$ with $\gamma^g = \pi$ as a "threshold logic element"). It has already been shown in [MSS] that constant size neural nets of depth 2 with activation function $\pi$ and small integer weights can compute more boolean functions than constant size neural nets of depth 2 with linear threshold gates (and arbitrary weights). [DS] exhibits an even stronger increase in computational power for the case of quadratic activation functions. Hence even simple non-boolean activation functions provide more computational power to a neural net than the heaviside function. However it has been open by how much they can increase the computational power (in the presence of arbitrary weights). E. Sontag has pointed out that known methods do not even suffice to show for a constant depth neural net $\mathcal{N}_n$ of size $O(n^{O(1)})$ with $n$ inputs and activation function $\pi$ that there is any boolean function $F_n : \{0,1\}^n \to \{0,1\}$ that cannot be computed on $\mathcal{N}_n$ with a suitable weight-assignment. Correspondingly no better upper bound than the trivial $2^n$ could be given for the VC-dimension of such $\mathcal{N}_n$ (with $n$ boolean inputs).

From the technical point of view, this inability was caused by the lack of an upper bound on the amount of information that can be encoded in such a neural net by the assignment of weights. For the case of neural nets with heaviside gates this upper bound on the information-capacity of weights is provided by the quoted result of Muroga et al. [M]. But no such upper bound has been known for any activation function with non-boolean output.

We introduce in section 2 of this paper a method that allows us to prove an upper bound for the information-capacity of weights for neural nets with piecewise linear activation functions (hence in particular for $\pi$). It is shown that for the computation of boolean functions on neural nets $\mathcal{N}_n$ of constant depth and polynomially in $n$ many gates (where $n$ is the number of input variables) it is sufficient to use as weights rational numbers with polynomially in $n$ many bits. As a consequence one can simulate any such analog neural net by a digital neural net of constant depth and polynomial size with the heaviside activation function (i.e. linear threshold gates) and binary weights (i.e. weights from $\{0,1\}$). This result also implies that the VC-dimension of $\mathcal{N}_n$ can be bounded above by a polynomial in $n$.

In section 3 we introduce another proof-technique that allows us to derive the same two consequences also for neural nets with piecewise polynomial activation functions and non-linear gate-inputs $Q^g(y_1,\ldots,y_m)$ of bounded degree. These results show that in spite of the previously quoted evidence for the superiority of non-boolean activation functions in neural nets, there is some limit to their computational power as long as the activation functions are piecewise polynomial. On the other hand the polynomial upper bound on the VC-dimension of such neural nets may be interpreted as good news: it shows that neural nets of this type can in principle be trained with a sequence of examples that is not too long.

In section 4 we derive in Theorem 4.1 a new lower bound of $\Omega(w \log w)$ for the maximal possible VC-dimension of a neural net with $w$ weights. This result implies that the well-known upper bound of [BH] for neural nets with linear threshold gates is optimal up to constant factors.

We conclude section 4 with a positive result for learning on neural nets in Valiant's model [V] for probably approximately correct learning ("PAC-learning"). We consider the problem of learning on neural nets with a fixed number of analog (i.e. real valued) input variables. We exploit here the implicit "linearization" of the requirements for the desired weight-assignment that is achieved in the new proof-techniques from sections 2 and 3. In this way one can show that such neural nets are properly PAC-learnable in the case of piecewise linear activation functions, and PAC-learnable with a hypothesis class that is given by a somewhat larger neural net in the case of piecewise polynomial activation functions.

The following notion of a feedforward neural net (with linear gate inputs) is the main model of this paper. However Corollary 3.2 and Theorem 4.5 also hold for the more general case of arbitrary polynomials $Q^g(y_1,\ldots,y_m)$ of bounded degree as gate inputs.
Definition 1.1 A (feedforward) neural net $\mathcal{N}$ (with linear gate inputs) is a labelled acyclic directed graph $\langle V,E\rangle$. Its nodes of fan-in 0 are labelled by the input variables $x_1,\ldots,x_n$. Each node $g$ of fan-in $> 0$ is called a computation node (or gate), and is labelled by some activation function $\gamma^g : \mathbf{R} \to \mathbf{R}$. Furthermore $\mathcal{N}$ has a unique node of fan-out 0, which is called the output node of $\mathcal{N}$.

For $w := |V| + |E| + 1$ the neural net $\mathcal{N}$ has exactly $w$ variable parameters:
- a weight for each edge of the graph $\langle V,E\rangle$
- a bias for each gate of $\mathcal{N}$
- an outer threshold $T_{out}$ for the output gate $g_{out}$ of $\mathcal{N}$.

We assume that some numbering of all edges and gates in $\mathcal{N}$ has been fixed. Then each assignment $\alpha \in \mathbf{R}^w$ of reals to the variable parameters in $\mathcal{N}$ defines an analog circuit $\mathcal{N}^\alpha$, which computes a function $x \mapsto \mathcal{N}^\alpha(x)$ from $\mathbf{R}^n$ into $\{0,1\}$ in the following way: Assume that some input $x \in \mathbf{R}^n$ has been assigned to the input nodes of $\mathcal{N}$. If a gate $g$ in $\mathcal{N}$ has $m$ immediate predecessors in $\langle V,E\rangle$ which output $y_1,\ldots,y_m \in \mathbf{R}$, then $g$ outputs $\gamma^g\big(\sum_{i=1}^m \alpha_i^g y_i + \alpha_0^g\big)$, where $\alpha_1^g,\ldots,\alpha_m^g$ are the weights that have been assigned by $\alpha$ to the corresponding incoming edges of $g$, and $\alpha_0^g$ is the bias that has been assigned by $\alpha$ to the gate $g$. Finally, if $g_{out}$ is the output gate of $\mathcal{N}$ and $g_{out}$ gives the real valued output $z$ (according to the preceding inductive definition) we define
\[ \mathcal{N}^\alpha(x) := \begin{cases} 1, & \text{if } z \geq T_{out} \\ 0, & \text{if } z < T_{out}, \end{cases} \]
where $T_{out}$ is the outer threshold that has been assigned by $\alpha$ to $g_{out}$.
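To make Definition 1.1 concrete, here is a small Python sketch (not from the paper; the gates, weights and the XOR example are made up) that evaluates a feedforward net $\mathcal{N}^\alpha$ exactly as described above: each gate applies its activation function to a weighted sum of its predecessors' outputs plus its bias, and the real output of $g_{out}$ is compared with the outer threshold $T_{out}$.
\begin{verbatim}
# sketch of N^alpha(x) from Definition 1.1 (illustrative only)

def heaviside(y):
    return 1.0 if y >= 0 else 0.0

class Gate:
    def __init__(self, activation, predecessors):   # predecessors: input names or gates
        self.activation = activation
        self.predecessors = predecessors

def evaluate(output_gate, x, alpha):
    """alpha maps each gate to (weights, bias) and contains the outer threshold."""
    def out(node):
        if isinstance(node, str):                    # an input variable "x0", "x1", ...
            return x[int(node[1:])]
        weights, bias = alpha[node]
        s = sum(w * out(p) for w, p in zip(weights, node.predecessors)) + bias
        return node.activation(s)
    z = out(output_gate)                             # real valued output of g_out
    return 1 if z >= alpha["T_out"] else 0

# a depth-2 example: two heaviside gates feeding a heaviside output gate
g1 = Gate(heaviside, ["x0", "x1"])
g2 = Gate(heaviside, ["x0", "x1"])
g_out = Gate(heaviside, [g1, g2])
alpha = {g1: ([1.0, 1.0], -0.5), g2: ([-1.0, -1.0], 1.5),
         g_out: ([1.0, 1.0], -2.0), "T_out": 1.0}
print([evaluate(g_out, (a, b), alpha) for a in (0, 1) for b in (0, 1)])
# prints [0, 1, 1, 0]: these made-up weights happen to realize XOR
\end{verbatim}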
Definition 1.2 Assume that $\mathcal{N}$ is an arbitrary neural net with $n$ inputs and $w$ variable parameters, and $S \subseteq \mathbf{R}^n$ is an arbitrary set. Then one defines the VC-dimension of $\mathcal{N}$ over $S$ in the following way:
\[ \text{VC-dimension}(\mathcal{N}, S) := \max\{\, |S'| \mid S' \subseteq S \text{ has the property that for every function } F : S' \to \{0,1\} \text{ there exists a parameter assignment } \alpha \in \mathbf{R}^w \text{ such that } \forall x \in S'\ (\mathcal{N}^\alpha(x) = F(x)) \,\}. \]

Remark 1.3 "VC-dimension" is an abbreviation for "Vapnik-Chervonenkis dimension". It has been shown in [BEHW] (see also [BH], [A]) that the VC-dimension of a neural net $\mathcal{N}$ determines the number of examples that are needed to train $\mathcal{N}$ (in Valiant's model for probably approximately correct learning [V]). Sontag [S2] has shown that the VC-dimension of a neural net can be drastically increased by using activation functions with non-boolean output instead of the heaviside function $H$.
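The following brute-force sketch (hypothetical code, not from the paper) illustrates Definition 1.2 for a one-gate net with the heaviside activation: it searches a small grid of parameter assignments and checks which labellings of a finite set $S$ are realizable. A grid search can only certify shattering, not refute it in general; the second call returns false because no halfplane (and no constant function) realizes the XOR labelling of the four points.
\begin{verbatim}
from itertools import product

def net(x, alpha):                     # N^alpha: one heaviside gate, n = 2
    w1, w2, bias, t_out = alpha
    z = 1.0 if w1 * x[0] + w2 * x[1] + bias >= 0 else 0.0
    return 1 if z >= t_out else 0

def shatters(S, grid):
    """True if every labelling of S is realized by some alpha from the grid."""
    realizable = {tuple(net(x, a) for x in S) for a in product(grid, repeat=4)}
    return len(realizable) == 2 ** len(S)

grid = [-2, -1, -0.5, 0, 0.5, 1, 2]
S = [(0, 0), (0, 1), (1, 0)]
print(shatters(S, grid))               # True: this S of size 3 is shattered
print(shatters(S + [(1, 1)], grid))    # False: the XOR labelling is not realizable
\end{verbatim}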
Definition 1.4 A function $\gamma : \mathbf{R} \to \mathbf{R}$ is called piecewise polynomial if there are thresholds $t_1,\ldots,t_k \in \mathbf{R}$ and polynomials $P_0,\ldots,P_k$ such that $t_1 < \ldots < t_k$ and for each $i \in \{0,\ldots,k\}$: $t_i \leq x < t_{i+1} \Rightarrow \gamma(x) = P_i(x)$ (we set $t_0 := -\infty$ and $t_{k+1} := \infty$). If $k$ is chosen minimal for $\gamma$, we refer to $k$ as the number of polynomial pieces of $\gamma$, to $P_0,\ldots,P_k$ as the polynomial pieces of $\gamma$, and to $t_1,\ldots,t_k$ as the thresholds of $\gamma$. Furthermore we refer to $t_1,\ldots,t_k$ together with all coefficients in the polynomials $P_0,\ldots,P_k$ as the parameters of $\gamma$. The maximal degree of $P_0,\ldots,P_k$ is called the degree of $\gamma$. If the degree of $\gamma$ is $\leq 1$ then we call $\gamma$ piecewise linear, and we refer to $P_0,\ldots,P_k$ as the linear pieces of $\gamma$. Note that we do not require that $\gamma$ is continuous (or monotone).
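As a small illustration of Definition 1.4 (a sketch, not from the paper), a piecewise polynomial activation can be represented by its thresholds and the coefficient lists of its pieces; the function $\pi$ from the introduction is the special case with two thresholds and linear pieces.
\begin{verbatim}
from bisect import bisect_right

def make_piecewise(thresholds, pieces):
    """pieces[i] lists the coefficients of P_i (constant term first)."""
    def gamma(x):
        i = bisect_right(thresholds, x)          # t_i <= x < t_{i+1} selects P_i
        return sum(c * x ** j for j, c in enumerate(pieces[i]))
    return gamma

# the piecewise linear function pi: pieces 0, y, 1 with thresholds 0 and 1
pi = make_piecewise(thresholds=[0.0, 1.0], pieces=[[0.0], [0.0, 1.0], [1.0]])
assert [pi(v) for v in (-2.0, 0.25, 3.0)] == [0.0, 0.25, 1.0]
\end{verbatim}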
2 A Bound for the Information-Capacity of Weights in Neural Nets with Piecewise Linear Activation Functions

We define for arbitrary $a \in \mathbf{N}$
\[ \mathbf{Q}_a := \Big\{ r \in \mathbf{Q} \;\Big|\; r = s \cdot \sum_{i=-a}^{a-1} b_i 2^i \text{ for } b_i \in \{0,1\},\ i = -a,\ldots,a-1, \text{ and } s \in \{-1,1\} \Big\}. \]
Note that for any $r \in \mathbf{Q}_a$: $|r| \leq 2^a = 2^{2a} \cdot \min\{|r'| \mid r' \in \mathbf{Q}_a \text{ and } r' \neq 0\}$.
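A minimal sketch (not from the paper) of membership in $\mathbf{Q}_a$: a rational $r$ belongs to $\mathbf{Q}_a$ exactly if $|r| < 2^a$ and $r \cdot 2^a$ is an integer, i.e. $r$ has a signed binary representation with $a$ bits before and $a$ bits after the binary point.
\begin{verbatim}
from fractions import Fraction

def in_Q_a(r, a):
    r = Fraction(r)
    return abs(r) < 2 ** a and (r * 2 ** a).denominator == 1

assert in_Q_a(Fraction(5, 8), a=3)          # 0.101 in binary
assert not in_Q_a(Fraction(1, 3), a=10)     # 1/3 has no finite binary expansion
\end{verbatim}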
Theorem 2.1 Consider an arbitrary neural net $\mathcal{N}$ over a graph $\langle V,E\rangle$ with $n$ input nodes, in which every computation node has fan-out $\leq 1$. Assume that each activation function $\gamma^g$ in $\mathcal{N}$ is piecewise linear with parameters from $\mathbf{Q}_a$. Let $w := |V| + |E| + 1$ be the number of variable parameters in $\mathcal{N}$. Then for every $\alpha \in \mathbf{R}^w$ there exists a vector $\alpha' = \langle \frac{s_1}{t},\ldots,\frac{s_w}{t}\rangle \in \mathbf{Q}^w$ with integers $s_1,\ldots,s_w,t$ of absolute value $\leq (2w+1)! \cdot 2^{2a(2w+1)}$ such that $\forall x \in \mathbf{Q}_a^n\ \big(\mathcal{N}^\alpha(x) = \mathcal{N}^{\alpha'}(x)\big)$.

In particular $\mathcal{N}^{\alpha'}$ computes the same boolean function as $\mathcal{N}^\alpha$.
Remark 2.2 The condition of Theorem 2.1 that all computation nodes in $\mathcal{N}$ have fan-out $\leq 1$ is automatically satisfied for depth $d \leq 2$. For larger $d$ one can simulate any neural net $\mathcal{N}$ of depth $d$ with $s$ nodes by a neural net $\mathcal{N}'$ with $\frac{s}{s-1} \cdot s^{d-1} \leq \frac{3}{2} s^{d-1}$ nodes and depth $d$ that satisfies this condition. Hence this condition is not too restrictive for neural nets of a constant depth $d$. It should also be pointed out that there is in the assumption of Theorem 2.1 no explicit bound on the number of linear pieces of $\gamma^g$ (apart from the requirement that its thresholds are from $\mathbf{Q}_a$). For example these activation functions may consist of $2^a$ linear pieces.
Remark 2.3 Previously one had no upper bound for the computational power (or for the VC-dimension) of multi-layer neural nets $\mathcal{N}$ with arbitrary weights and analog computational elements (i.e. activation functions with non-boolean output). Theorem 2.1 implies that any $\mathcal{N}$ of the considered type can compute with the help of arbitrary parameter assignments $\alpha \in \mathbf{R}^w$ at most $2^{O(a w^2 \log w)}$ different functions from $\mathbf{Q}_a^n$ into $\{0,1\}$, hence VC-dimension$(\mathcal{N}, \mathbf{Q}_a^n) = O(a w^2 \log w)$ (see Remark 3.4 for a slightly better bound). Furthermore Theorem 2.1 implies that one can replace all analog computations inside $\mathcal{N}^\alpha$ by digital arithmetical operations on not too large integers (the proof gives an upper bound of $O(wa + w \log w)$ for their bit-length). It is well-known that each of these digital arithmetical operations (multiple addition, multiplication, division) can be carried out on a circuit of small constant depth with $O(a^{O(1)} w^{O(1)})$ MAJORITY-gates, hence also on a neural net of depth $O(1)$ and size $O(a^{O(1)} w^{O(1)})$ with heaviside gates and weights from $\{0,1\}$ ([CSV], [PS], [HMPST], [SR], [SBKH]). Thus one can simulate for inputs from $\{0,1\}^n$ any depth $d$ neural net $\mathcal{N}$ as in Theorem 2.1 with arbitrary parameter assignments $\alpha \in \mathbf{R}^w$ by a neural net of depth $O(d)$ and size $O(a^{O(1)} w^{O(1)})$ with heaviside-gates and weights from $\{0,1\}$. The same holds for inputs from $\mathbf{Q}_a^n$ if they are given to $\mathcal{N}$ in digital form.
Sketch of the proof of Theorem 2.1: In the special case where $\gamma^g = H$ for all gates in $\mathcal{N}$ this result is well known ([M]). It follows by applying separately to each gate in $\mathcal{N}$ the following result.

Lemma 2.4 (folklore; see [MT] for a proof) Consider a system $Ax \leq b$ of some arbitrary finite number of linear inequalities in $l$ variables. Assume that all entries in $A$ and $b$ are integers of absolute value $\leq a$. If this system has any solution in $\mathbf{R}^l$, then it has a solution of the form $\langle \frac{s_1}{t},\ldots,\frac{s_l}{t}\rangle$, where $s_1,\ldots,s_l,t$ are integers of absolute value $\leq (2l+1)! \cdot a^{2l+1}$.
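To illustrate the source of the bound in Lemma 2.4, the following Python sketch (not from the paper; the toy system is made up) solves a small integer system exactly by Cramer's rule: every coordinate of the solution is a quotient of two integer determinants, and a determinant of a matrix with $l$ rows and entries of absolute value $\leq a$ has absolute value at most $l! \cdot a^l$.
\begin{verbatim}
from fractions import Fraction

def det(M):
    # determinant of a small integer matrix by cofactor expansion along row 0
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def solve_cramer(A, b):
    # exact solution of A x = b: coordinate j equals det(A_j)/det(A), a quotient
    # of integer determinants (Lemma 2.4 applies this to a square subsystem)
    d = det(A)
    return [Fraction(det([r[:j] + [b[i]] + r[j + 1:] for i, r in enumerate(A)]), d)
            for j in range(len(A))], d

A = [[2, -1, 0], [1, 3, -2], [0, 1, 1]]      # entries of absolute value <= a = 3
b = [1, 2, 3]
x, d = solve_cramer(A, b)
print(x, d)                                   # numerators and d stay below 3! * 3**3
assert all(sum(r[j] * x[j] for j in range(3)) == b[i] for i, r in enumerate(A))
\end{verbatim}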
Sketch of the proof for Lemma 2.4: Let $k$ be the number of inequalities in $Ax \leq b$. One writes each variable in $x$ as a difference of 2 nonnegative variables, and one adds to each inequality a "slack variable". In this way one gets an equivalent system
\[ A'x' = b, \quad x' \geq 0 \tag{1} \]
over $l' := 2l + k$ variables, for some $k \times l'$ matrix $A'$. The $k$ columns of $A'$ for the $k$ slack-variables in $x'$ form an identity matrix. Hence $A'$ has rank $k$.

The assumption of the Lemma implies that (1) has a solution over $\mathbf{R}$. Hence by Caratheodory's Theorem (Corollary 7.1i in [Sch]) one can conclude that there is also a solution over $\mathbf{R}$ of a system
\[ A''x'' = b, \quad x'' \geq 0 \tag{2} \]
where $A''$ consists of $k$ linearly independent columns of $A'$. Since $A''$ has full rank, (2) has in fact a unique solution that is given by Cramer's rule: $x''_j = \det(A''_j)/\det A''$ for $j = 1,\ldots,k$, where $A''_j$ results from $A''$ by replacing its $j$th column by $b$. Since all except up to $2l$ columns of $A''$ contain exactly one 1 and else only 0's, we can bring each of the matrices $A''$, $A''_j$ by permutations of rows and columns into a form
\[ B = \begin{pmatrix} C & 0 \\ D & I \end{pmatrix} \]
where $C$ is a square matrix with $\leq 2l+1$ rows. Hence the determinant of $B$ is an integer of absolute value $\leq (2l+1)! \cdot a^{2l+1}$.

The difficulty of the proof of Theorem 2.1 lies in the fact that with analog computational elements one can no longer treat each gate separately, since intermediate values are no longer integers. Furthermore the total computation of $\mathcal{N}^\alpha$ can in general not be described by a system of linear inequalities, where the $w$ variable parameters of $\mathcal{N}$ are the variables in the inequalities (and the fixed parameters of $\mathcal{N}$ are the constants). This becomes obvious if one just considers the composition of two very simple analog gates $g_1$ and $g_2$ on levels 1 and 2 of $\mathcal{N}$, whose activation functions $\gamma_1, \gamma_2$ satisfy
$\gamma_1(y) = \gamma_2(y) = y$. Assume that $\tilde{x} = \sum_{i=1}^n \alpha_i x_i + \alpha_0$ is the input to gate $g_1$, and $g_2$ receives as input $\sum_{j=1}^m \alpha'_j y_j + \alpha'_0$, where $y_1 = \gamma_1(\tilde{x}) = \tilde{x}$ is the output of gate $g_1$. Then $g_2$ outputs
\[ \alpha'_1 \Big( \sum_{i=1}^n \alpha_i x_i + \alpha_0 \Big) + \sum_{j=2}^m \alpha'_j y_j + \alpha'_0 . \]
Obviously this term is not linear in the weights $\alpha'_1, \alpha_0, \alpha_1,\ldots,\alpha_n$. Hence if the output of gate $g_2$ is compared with a fixed threshold at the next gate, the resulting inequality is not linear in the weights of the gates in $\mathcal{N}$.
If the activation functions of all gates in $\mathcal{N}$ were linear (as in the example for $g_1$ and $g_2$), then there would be no problem because a composition of linear functions is linear. However for piecewise linear activation functions it is not sufficient to consider their composition, since intermediate results have to be compared with boundaries between linear pieces of the next gate. Hence the natural approach would be to describe the computations of $\mathcal{N}^\alpha$ by a system of polynomial inequalities, and to prove a suitable generalization of Lemma 2.4 to polynomial inequalities. Unfortunately there is no such generalization: the system $x^2 \leq 2,\ x^2 \geq 2$ has a solution, but no solution of short bit-length.

We introduce in this paper a new method in order to handle this difficulty. We simulate $\mathcal{N}^\alpha$ by another neural net $\hat{\mathcal{N}}[c]^{\hat\alpha}$ (which one may view as a "normal form" for $\mathcal{N}^\alpha$) that uses the same graph $\langle V,E\rangle$ as $\mathcal{N}$, but different activation functions and different values $\hat\alpha$ for its variable parameters. The activation functions of $\hat{\mathcal{N}}[c]$ depend on $|V|$ new parameters $c \in \mathbf{R}^{|V|}$, which we call scaling parameters in the following. Although this new neural net has the disadvantage that it requires $|V|$ additional parameters $c$, it has the advantage that we can choose in $\hat{\mathcal{N}}[c]$ all weights on edges between computation nodes to be from $\{-1,0,1\}$. Since these weights from $\{-1,0,1\}$ are already of the desired bit-length, we can treat them as constants in the system of inequalities that describes computations of $\hat{\mathcal{N}}[c]^{\hat\alpha}$. Thereby we can achieve that all variables that appear in the inequalities that describe computations of $\hat{\mathcal{N}}[c]^{\hat\alpha}$ (the variables for weights of gates on level 1, the variables for the biases of gates on all levels, the variable for the outer threshold, and the new variables for the scaling parameters $c$) appear only linearly in those inequalities. Hence we can apply Lemma 2.4 to the system of inequalities that describes the computations of $\hat{\mathcal{N}}^{\hat\alpha}$ for inputs from $\mathbf{Q}_a^n$, and thereby get a "nice" solution $\hat\alpha', c'$ for all variable parameters in $\hat{\mathcal{N}}$. Finally we observe that we can transform $\hat{\mathcal{N}}[c']^{\hat\alpha'}$ back into the original neural net $\mathcal{N}$ with an assignment of small "numbers" $\alpha'$ to all variable parameters in $\mathcal{N}$.
We will now fill in some of the missing details. Consider the gate function $\gamma$ of an arbitrary gate $g$ in $\mathcal{N}$. Since $\gamma$ is piecewise linear, there are fixed parameters $t_1 < \cdots < t_k$, $a_0,\ldots,a_k$, $b_0,\ldots,b_k$ in $\mathbf{Q}_a$ (which may be different for different gates $g$) such that with $t_0 := -\infty$ and $t_{k+1} := +\infty$ one has $\gamma(x) = a_i x + b_i$ for $x \in \mathbf{R}$ with $t_i \leq x < t_{i+1}$, $i = 0,\ldots,k$. For an arbitrary scaling parameter $c \in \mathbf{R}^+$ we associate with $\gamma$ the following piecewise linear activation function $\gamma^c$: the thresholds of $\gamma^c$ are $c \cdot t_1,\ldots,c \cdot t_k$ and its output is $\gamma^c(x) = a_i x + c \cdot b_i$ for $x \in \mathbf{R}$ with $c \cdot t_i \leq x < c \cdot t_{i+1}$, $i = 0,\ldots,k$ (set $c \cdot t_0 := -\infty$, $c \cdot t_{k+1} := +\infty$). Thus for all reals $c > 0$ the function $\gamma^c$ is related to $\gamma$ through the equality $\forall x \in \mathbf{R}\ (\gamma^c(c \cdot x) = c \cdot \gamma(x))$.

Assume that $\alpha \in \mathbf{R}^w$ is some arbitrary given assignment to the variable parameters in $\mathcal{N}$. We transform $\mathcal{N}^\alpha$ into a "normal form" $\hat{\mathcal{N}}[c]^{\hat\alpha}$ in which all weights on edges between computation nodes are from $\{-1,0,1\}$, such that $\forall x \in \mathbf{R}^n\ \big(\mathcal{N}^\alpha(x) = \hat{\mathcal{N}}[c]^{\hat\alpha}(x)\big)$. We proceed inductively from the output level towards the input level. Assume that the output gate $g_{out}$ of $\mathcal{N}$ receives as input $\sum_{i=1}^m \alpha_i y_i + \alpha_0$, where $\alpha_1,\ldots,\alpha_m, \alpha_0$ are the weights and the bias of $g_{out}$ (under the assignment $\alpha$) and $y_1,\ldots,y_m$ are the (real valued) outputs of the immediate predecessors $g_1,\ldots,g_m$ of $g_{out}$. For each $i \in \{1,\ldots,m\}$ with $\alpha_i \neq 0$ such that $g_i$ is not an input node we replace the activation function $\gamma_i$ of $g_i$ by $\gamma_i^{|\alpha_i|}$, and we multiply the weights and the bias of gate $g_i$ with $|\alpha_i|$. Finally we replace the weight $\alpha_i$ of gate $g_{out}$ by
\[ \mathrm{sgn}(\alpha_i) := \begin{cases} 1, & \text{if } \alpha_i > 0 \\ -1, & \text{if } \alpha_i < 0. \end{cases} \]
One continues this rescaling in the same way towards the input level. The computations of the resulting neural net $\hat{\mathcal{N}}[c]^{\hat\alpha}$ for inputs from $\mathbf{Q}_a^n$ can then be described by a system of inequalities in which the remaining variable parameters (the weights of the gates on level 1, the biases of all gates, the outer threshold, and the scaling parameters $c$) occur only linearly, and this system is satisfiable since the parameter assignment that was just constructed is a solution. Hence we can apply Lemma 2.4, which provides a solution $z'$ of the form $\langle \frac{s_i}{t}\rangle_{i=1,\ldots,w}$ with integers $s_1,\ldots,s_w, t$ of absolute value $\leq (2w+1)! \cdot 2^{2a(2w+1)}$.

Let $\hat{\mathcal{N}}[c']^{\hat\alpha'}$ be the neural net $\hat{\mathcal{N}}$ with this new assignment $\langle \hat\alpha', c'\rangle := z'$ of "small" parameters. By definition we have $\forall x \in \mathbf{Q}_a^n\ \big(\mathcal{N}^\alpha(x) = \hat{\mathcal{N}}[c']^{\hat\alpha'}(x)\big)$. We show that one can transform this neural net $\hat{\mathcal{N}}[c']^{\hat\alpha'}$ into a net $\mathcal{N}^{\alpha'}$ with the same activation functions as $\mathcal{N}$ but a new assignment $\alpha'$ of "small" parameters (that can easily be computed from $\hat\alpha', c'$). This transformation proceeds inductively from the input level towards the output level. Consider some gate $g$ on level 1 in $\hat{\mathcal{N}}$ that uses (for the new parameter assignment $c'$) the scaling factor $c > 0$ for its activation function $\gamma^c$. Then we replace the weights $\alpha_1,\ldots,\alpha_n$ and bias $\alpha_0$ of gate $g$ in $\hat{\mathcal{N}}[c']^{\hat\alpha'}$ by $\frac{\alpha_1}{c},\ldots,\frac{\alpha_n}{c}, \frac{\alpha_0}{c}$, and $\gamma^c$ by $\gamma$. Furthermore if $r \in \{-1,0,1\}$ was in $\hat{\mathcal{N}}$ the weight on the edge between $g$ and its successor gate $g'$, we assign to this edge the weight $c \cdot r$. Note that $g'$ receives in this way from $g$ the same input as in $\hat{\mathcal{N}}[c']^{\hat\alpha'}$ (for every circuit input). Assume now that $\beta_1,\ldots,\beta_m$ are the weights that the incoming edges of $g'$ get assigned in this way, that $\beta_0$ is the bias of $g'$ in the assignment $z' = \langle \hat\alpha', c'\rangle$, that $\bar{c} > 0$ is the scaling factor of $g'$ in $\hat{\mathcal{N}}[c']^{\hat\alpha'}$, and that $\gamma'$ is the activation function of $g'$ in $\mathcal{N}$. Then we assign the new weights $\frac{\beta_1}{\bar{c}},\ldots,\frac{\beta_m}{\bar{c}}$ and the new bias $\frac{\beta_0}{\bar{c}}$ to $g'$, and we multiply the weight on the outgoing edge from $g'$ by $\bar{c}$. By construction we have that $\forall x \in \mathbf{R}^n\ \big(\mathcal{N}^{\alpha'}(x) = \hat{\mathcal{N}}[c']^{\hat\alpha'}(x)\big)$, hence $\forall x \in \mathbf{Q}_a^n\ \big(\mathcal{N}^\alpha(x) = \mathcal{N}^{\alpha'}(x)\big)$.
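The rescaling step used in this proof can be checked numerically on a toy example. The following sketch (hypothetical code with made-up numbers, not from the paper) verifies for a single $\pi$-gate that replacing $\gamma$ by $\gamma^c$ with $c := |\text{outgoing weight}|$, multiplying the gate's weights and bias by $c$, and replacing the outgoing weight by its sign leaves the computed boolean function unchanged, which is exactly the identity $\gamma^c(c \cdot x) = c \cdot \gamma(x)$.
\begin{verbatim}
def pi(y):                      # the piecewise linear activation from Section 1
    return 0.0 if y <= 0 else (y if y <= 1 else 1.0)

def pi_scaled(y, c):            # pi^c: thresholds and constant pieces scaled by c
    return 0.0 if y <= 0 else (y if y <= c else c)

def original_net(x, w, bias, out_weight, t_out):
    z = out_weight * pi(w * x + bias)
    return 1 if z >= t_out else 0

def normal_form_net(x, w, bias, out_weight, t_out):
    c = abs(out_weight)          # scale the gate, replace outgoing weight by its sign
    z = (1 if out_weight > 0 else -1) * pi_scaled(c * (w * x + bias), c)
    return 1 if z >= t_out else 0

for x in [-2.0, -0.3, 0.0, 0.4, 0.9, 1.5, 3.0]:
    assert original_net(x, 2.5, -0.7, 4.0, 1.3) == normal_form_net(x, 2.5, -0.7, 4.0, 1.3)
\end{verbatim}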
3 Upper Bounds for Neural Nets with Piecewise Polynomial Activation Functions

Theorem 3.1 Consider an arbitrary array $(\mathcal{N}_n)_{n \in \mathbf{N}}$ of neural nets $\mathcal{N}_n$ of depth $O(1)$ with $n$ inputs and $O(n^{O(1)})$ gates, in which the gate function $\gamma^g$ of each gate $g$ is piecewise polynomial of degree $O(1)$ with $O(n^{O(1)})$ polynomial pieces, and where the parameters of $\gamma^g$ are arbitrary reals. Then there exists an array $(\tilde{\mathcal{N}}_n)_{n \in \mathbf{N}}$ of neural nets $\tilde{\mathcal{N}}_n$ of depth $O(1)$ with $n$ inputs and $O(n^{O(1)})$ gates, such that each gate $g$ in $\tilde{\mathcal{N}}_n$ uses as its activation function the heaviside function $H$ (i.e. $g$ is a linear threshold gate), and such that for each assignment $\alpha$ of reals to the variable parameters in $\mathcal{N}_n$ there is an assignment $\tilde\alpha$ of numbers from $\{0,1\}$ to the variable parameters in $\tilde{\mathcal{N}}_n$ such that
\[ \forall x \in \{0,1\}^n\ \big(\mathcal{N}_n^\alpha(x) = \tilde{\mathcal{N}}_n^{\tilde\alpha}(x)\big). \]
In particular VC-dimension$(\mathcal{N}_n, \{0,1\}^n) = O(n^{O(1)})$. Furthermore for any assignment $(\alpha_n)_{n \in \mathbf{N}}$ of real valued parameters the boolean functions that are computed by $(\mathcal{N}_n^{\alpha_n})_{n \in \mathbf{N}}$ are in $TC^0$.
Corollary 3.2 The claim of Theorem 3.1 remains valid if each gate $g$ in $\mathcal{N}_n$ passes to its activation function $\gamma^g$ some arbitrary polynomial $Q^g(y_1,\ldots,y_m)$ of bounded degree (instead of a weighted sum of $y_1,\ldots,y_m$). In this case all coefficients of the polynomials $Q^g$ are considered as variable parameters ("weights") of $\mathcal{N}_n$. The simulating neural nets $\tilde{\mathcal{N}}_n$ can still be chosen so that each gate in $\tilde{\mathcal{N}}_n$ is a linear threshold gate, and so that $(\tilde{\mathcal{N}}_n)_{n \in \mathbf{N}}$ also has all the other properties that are claimed in Theorem 3.1.

Idea of the proof of Corollary 3.2 from Theorem 3.1: One has $y \cdot z = \frac{1}{2}\big((y+z)^2 - y^2 - z^2\big)$. Hence one can eliminate nonlinear polynomials $Q^g$ as arguments of activation functions by introducing intermediate gates with linear gate inputs and quadratic activation functions. In this way one can transform the given array of neural nets into an array of neural nets according to Definition 1.1 that satisfies the assumptions of Theorem 3.1.
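The identity used in this proof idea can be checked directly; the sketch below (not from the paper) replaces a product gate input $y \cdot z$ by three squaring gates with purely linear gate inputs, followed by a weighted sum with weights $\frac{1}{2}, -\frac{1}{2}, -\frac{1}{2}$.
\begin{verbatim}
def square_gate(linear_input):      # a gate with activation gamma(u) = u^2
    return linear_input ** 2

def product_via_squares(y, z):
    s1 = square_gate(y + z)         # three hidden squaring gates, linear inputs
    s2 = square_gate(y)
    s3 = square_gate(z)
    return 0.5 * s1 - 0.5 * s2 - 0.5 * s3   # next level combines them linearly

assert product_via_squares(3.0, -1.5) == 3.0 * -1.5
\end{verbatim}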
Remark 3.3 Theorem 3.1 yields no bound for the computational power of neural nets with the activation function $\sigma(y) = 1/(1+e^{-y})$. However it provides bounds for the case where the activation functions are spline approximations to $\sigma$ of arbitrarily high degree $d$, provided that $d \in \mathbf{N}$ is fixed.
Remark 3.4 Sontag [S3] suggested using the "quasi-linearization" that is achieved in the proof of Theorem 3.1 in order to also get upper bounds for the VC-dimension over $\mathbf{R}^n$, by counting the number of components into which the weight space is partitioned by the hyperplanes that are defined by some arbitrary finite set $S \subseteq \mathbf{R}^n$ of inputs. By letting $\alpha_n$ vary and keeping the neural net $\mathcal{N}_n$ and the input $x \in S$ fixed one gets up to $2^{O(n^{O(1)})}$ different systems $L(\hat{\mathcal{N}}_n^{\hat\alpha_n}, x)$ in the proof of Theorem 3.1 (see the Appendix for details). Hence the total number $l_n$ of linear inequalities that arise in this way for different $x \in S$ and different parameters $\alpha_n$ is bounded by $|S| \cdot 2^{O(n^{O(1)})}$. Furthermore the total number $w_n$ of variables that occur in these $l_n$ inequalities is bounded by $O(n^{O(1)})$. Therefore the hyperplanes that are associated with these $l_n$ inequalities partition the range $\mathbf{R}^{w_n}$ of the variables into at most
\[ \sum_{k=0}^{w_n} \sum_{i=0}^{k} \binom{l_n}{k-i}\binom{w_n - i}{w_n - k} = |S|^{O(n^{O(1)})} \]
faces (Theorem 1.3 in [E]). Each $A \in \mathbf{R}^{w_n}$ gives rise to at most $2^{O(n^{O(1)})}$ different circuits $C^A(\hat{\mathcal{N}}_n^{\hat\alpha_n})$ when $\mathcal{N}_n$ and $S$ are kept fixed, but the parameters $\alpha_n$ (and hence the weights from $\{-1,0,1\}$ between computation nodes) vary. Thus each $A \in \mathbf{R}^{w_n}$ can be used to compute at most $2^{O(n^{O(1)})}$ different functions $S \to \{0,1\}$ on the resulting circuits. Furthermore, if $A$ and $\tilde{A}$ belong to the same face of the partition of $\mathbf{R}^{w_n}$ then for all $\alpha_n$ the circuits $C^A(\hat{\mathcal{N}}_n^{\hat\alpha_n})$ and $C^{\tilde{A}}(\hat{\mathcal{N}}_n^{\hat\alpha_n})$ compute the same function $S \to \{0,1\}$. Hence if $S$ is shattered by $\mathcal{N}_n$ (i.e. any function $S \to \{0,1\}$ can be computed by $\mathcal{N}_n^{\alpha_n}$ for suitable parameters $\alpha_n$) then $2^{|S|} \leq |S|^{O(n^{O(1)})} \cdot 2^{O(n^{O(1)})}$, hence $|S| = O(n^{O(1)})$. This implies that VC-dimension$(\mathcal{N}_n, \mathbf{R}^n) = O(n^{O(1)})$. Note that this estimate remains true if one allows in $\mathcal{N}_n$ also non-linear gate inputs as in Corollary 3.2.

One can apply in a similar fashion the "linearization" that is achieved in the proof of Theorem 2.1. Consider a neural net $\mathcal{N}$ over a graph $\langle V,E\rangle$ as in Theorem 2.1, but allow that each activation function $\gamma^g$ consists of up to $p$ linear pieces with arbitrary fixed real parameters. Then one can show that VC-dimension$(\mathcal{N}, \mathbf{R}^n) = O(w^2 \log p)$, where $w := |V| + |E| + 1$ is the number of variable parameters in $\mathcal{N}$. It is sufficient to observe that for different $x \in S$ and different initial assignments $\alpha$ altogether at most $|S| \cdot 2^{O(w \log p)}$ linear inequalities arise in the description of the computations of the associated nets $\hat{\mathcal{N}}[c]^{\hat\alpha}$ for input $x$. The associated hyperplanes partition the "weight space" $\mathbf{R}^w$ for the variable parameters $\hat\alpha, c$ into $|S|^{O(w)} \cdot 2^{O(w^2 \log p)}$ faces. The vectors from each face can be used to compute at most $2^{O(w)}$ different functions $S \to \{0,1\}$ (note that in general more than one function $S \to \{0,1\}$ can be computed because of different weights from $\{-1,0,1\}$ between computation nodes). Hence $2^{|S|} \leq |S|^{O(w)} \cdot 2^{O(w^2 \log p)} \cdot 2^{O(w)}$ if $S$ is shattered by $\mathcal{N}$, thus $|S| = O(w^2 \log p)$.
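The counting argument of Remark 3.4 can be mimicked numerically. The sketch below (illustrative code, not from the paper; the constants are made up and much cruder than those in the text) uses the classical bound on the number of cells of a hyperplane arrangement and searches for the largest sample size $s$ for which $2^s$ does not yet exceed the number of cells times $2^w$.
\begin{verbatim}
from math import comb

def max_cells(num_hyperplanes, dim):
    # classical upper bound on the number of full-dimensional cells of an
    # arrangement of hyperplanes in R^dim
    return sum(comb(num_hyperplanes, k) for k in range(dim + 1))

def largest_shatterable(w, p):
    # largest s with 2^s <= max_cells(s * p**w, w) * 2**w, mimicking
    # 2^{|S|} <= |S|^{O(w)} * 2^{O(w^2 log p)} * 2^{O(w)}
    s = 1
    while 2 ** s <= max_cells(s * p ** w, w) * 2 ** w:
        s += 1
    return s - 1

for w in [4, 6, 8]:
    print(w, largest_shatterable(w, p=3))   # grows polynomially, roughly like w^2 log p
\end{verbatim}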
Idea of the proof of Theorem 3.1: This proof is quite long and involved, even for the simplest nonlinear case where the activation functions consist of 2 polynomial pieces of degree 2. We first note that one can transform each given neural net $\mathcal{N}_n^{\alpha_n}$ into a normal form $\hat{\mathcal{N}}_n^{\hat\alpha_n}$ of constant depth and size $O(n^{O(1)})$ with all weights on edges between computation nodes from $\{-1,0,1\}$, in which all gates $g$ have fan-out $\leq 1$, and all gates $g$ use as activation functions $\gamma^g$ piecewise polynomial functions of the following special type: $\gamma^g$ consists of up to 3 pieces, of which at most one is not identically 0, and in which the nontrivial piece outputs a constant, or computes a power $y \mapsto y^k$ (where $k \in \mathbf{N}$ satisfies $k = O(1)$). We can choose $\hat\alpha_n$ such that one has "$s_1 + 1 \leq s_2$" for all strict inequalities "$s_1 < s_2$" that arise in $\hat{\mathcal{N}}_n^{\hat\alpha_n}$ for inputs from $\{0,1\}^n$ when one compares some intermediate term $s_1$ with the threshold $s_2$ of some gate, or with the outer threshold (analogously as in the proof of Theorem 2.1). This transformation can be done in such a way that $\forall x \in \{0,1\}^n\ \big(\mathcal{N}_n^{\alpha_n}(x) = \hat{\mathcal{N}}_n^{\hat\alpha_n}(x)\big)$.

Although we have now transformed $\mathcal{N}_n^{\alpha_n}$ into a similar "normal form" $\hat{\mathcal{N}}_n^{\hat\alpha_n}$ as in the proof of Theorem 2.1, a new source of non-linearity arises when one tries to describe the role of the variable parameters of $\hat{\mathcal{N}}_n$ by a system of inequalities. Assume for example that $g$ is a gate on level 1 with input $\alpha_1 x_1 + \alpha_2 x_2$ and activation function $\gamma^g(y) = y^2$. Then this gate $g$ outputs $\alpha_1^2 x_1^2 + 2\alpha_1\alpha_2 x_1 x_2 + \alpha_2^2 x_2^2$. Hence the variables $\alpha_1, \alpha_2$ will not occur linearly in an inequality which describes the comparison of the output of $g$ with some threshold of a gate at the next level.

We solve this new problem in the following way. We fix an arbitrary assignment $\hat\alpha_n$ of real numbers to the parameters of $\hat{\mathcal{N}}_n$. We introduce for the system of inequalities $L(\hat{\mathcal{N}}_n^{\hat\alpha_n}, \{0,1\}^n)$ (that describes the computations of $\hat{\mathcal{N}}_n^{\hat\alpha_n}$ for all inputs $x \in \{0,1\}^n$) new variables $v$ for all nontrivial parameters in $\hat{\mathcal{N}}_n^{\hat\alpha_n}$ (i.e. for the weights on edges between input nodes and computation nodes, for the bias of each gate $g$, for the outer threshold $T_{out}$ and for the thresholds $t_1^g, t_2^g$ of each gate $g$). In addition we introduce new variables for all products of such parameters that arise in the computation of $\hat{\mathcal{N}}_n^{\hat\alpha_n}$. We have to keep the inequalities linear in order to apply Lemma 2.4. Hence we cannot demand in these inequalities that the value of the variable $v_{v_1^g, v_2^g}$ (that represents the product of $\alpha_1^g$ and $\alpha_2^g$) is the product of the values of the variables $v_1^g$ and $v_2^g$ (that represent the weights $\alpha_1^g$ respectively $\alpha_2^g$). We solve this problem by describing in detail in the linear inequalities $L(\hat{\mathcal{N}}_n^{\hat\alpha_n}, \{0,1\}^n)$ which role the product of $\alpha_1^g$ and $\alpha_2^g$ plays in the computations of $\hat{\mathcal{N}}_n^{\hat\alpha_n}$ for inputs from $\{0,1\}^n$. It turns out that this can be done in such a way that it does not matter whether a solution $A$ of $L(\hat{\mathcal{N}}_n^{\hat\alpha_n}, \{0,1\}^n)$ assigns to the variable $v_{v_1^g, v_2^g}$ a value $A(v_{v_1^g, v_2^g})$ that is equal to the product of the values $A(v_1^g)$ and $A(v_2^g)$ (that are assigned by $A$ to the variables $v_1^g$ and $v_2^g$). In any case $A(v_{v_1^g, v_2^g})$ is forced to behave like the product of $A(v_1^g)$ and $A(v_2^g)$ in the computations of $\hat{\mathcal{N}}_n^{\hat\alpha_n}$.

In more abstract terms, one may view $A$ as a model of a certain "linear fragment" $L(\hat{\mathcal{N}}_n^{\hat\alpha_n}, \{0,1\}^n)$ of the theory of the role of the parameters $\hat\alpha_n$ in the computations of $\hat{\mathcal{N}}_n^{\hat\alpha_n}$ on inputs from $\{0,1\}^n$. Such a model $A$ (which will be given by Lemma 2.4) is some type of "nonstandard model" of the theory of computations of $\hat{\mathcal{N}}_n$, since it replaces products of weights by "nonstandard products". Such a nonstandard model $A$ does not provide a new assignment of (small) weights to the neural net $\hat{\mathcal{N}}_n$, only to a "nonstandard version" $C^A(\hat{\mathcal{N}}_n^{\hat\alpha_n})$ of the neural net $\hat{\mathcal{N}}_n^{\hat\alpha_n}$. However the linear fragment $L(\hat{\mathcal{N}}_n^{\hat\alpha_n}, \{0,1\}^n)$ can be chosen in such a way that $C^A(\hat{\mathcal{N}}_n^{\hat\alpha_n})$ computes the same boolean function as $\hat{\mathcal{N}}_n^{\hat\alpha_n}$. Furthermore, if $A$ consists of a solution with "small" values as given by Lemma 2.4, then $C^A(\hat{\mathcal{N}}_n^{\hat\alpha_n})$ can be simulated by a constant-depth polynomial-size boolean circuit whose gates $g$ are all MAJORITY-gates (i.e. $g(y_1,\ldots,y_m) = 1$ if $\sum_{i=1}^m y_i \geq m/2$, otherwise $g(y_1,\ldots,y_m) = 0$). This implies that the boolean functions that are computed by $(C^A(\hat{\mathcal{N}}_n^{\hat\alpha_n}))_{n \in \mathbf{N}}$ are in $TC^0$. However by construction these are the same boolean functions that are computed by $(\mathcal{N}_n^{\alpha_n})_{n \in \mathbf{N}}$. We refer to the Appendix for a more detailed proof.
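A minimal instance (not from the paper; all names are made up) of the linearization described above, for a single level-1 gate with activation $y \mapsto y^2$ and gate input $\alpha_1 x_1 + \alpha_2 x_2$: for boolean inputs the gate output is linear in the product variables representing $\alpha_1^2$, $\alpha_1\alpha_2$ and $\alpha_2^2$, so each comparison with a threshold of the next gate yields a linear inequality, and the reference products form a solution of the resulting system.
\begin{verbatim}
# For x in {0,1}^2 the output a1^2*x1 + 2*a1*a2*x1*x2 + a2^2*x2 (x_i^2 = x_i)
# is linear in the variables v11 ~ a1^2, v12 ~ a1*a2, v22 ~ a2^2.
from itertools import product

a1, a2, t_next = 1.7, -0.4, 1.0           # fixed reference parameters (made up)

system = []                                # one linear constraint per boolean input
for x1, x2 in product([0, 1], repeat=2):
    out = (a1 * x1 + a2 * x2) ** 2
    coeffs = (x1, 2 * x1 * x2, x2)         # coefficients of (v11, v12, v22)
    system.append((coeffs, t_next, out >= t_next))   # cf. the sets L^g(x)

v11, v12, v22 = a1 * a1, a1 * a2, a2 * a2  # the reference products
for (c11, c12, c22), t, at_least in system:
    value = c11 * v11 + c12 * v12 + c22 * v22
    assert (value >= t) == at_least        # the reference products solve the system
\end{verbatim}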
4 Further Results about Learning on Neural Nets

All existing lower bounds for the VC-dimension of neural nets with common activation functions such as $H$, $\pi$, and $\sigma$ (with $\sigma(y) := 1/(1+e^{-y})$) are at best linear in the number $e$ of edges (respectively weights) of the net ([BH], [B], [S2]). The following result provides the first superlinear lower bound.
Theorem 4.1 One can construct for arbitrarily large $n \in \mathbf{N}$ neural nets $\mathcal{N}_n$ of depth 4 with $n$ inputs and $33n$ edges such that VC-dimension$(\mathcal{N}_n, \{0,1\}^n) \geq n \log n$. These neural nets $\mathcal{N}_n$ can be realized with any of the common activation functions $H$, $\pi$, $\sigma$ at its gates.

Remark 4.2 This result solves an open problem of Baum and Haussler [BH]. It shows that their upper bound of $2w \log(eN) = O(w \log w)$ for the VC-dimension of an arbitrary neural net with linear threshold gates, $w$ weights, and $N$ nodes (see [C], [BH]) is optimal up to constant factors.

Idea of the proof of Theorem 4.1: One applies some Shannon-type upper bound on circuit size, which shows that all boolean functions $F : \{0,1\}^m \to \{0,1\}$ can be computed by some threshold circuit $C_F$ of depth 4 with $O\big(\frac{2^m}{m}\big)$ edges. One just has to verify in addition that these circuits $C_F$ use the same underlying graph. It is easy to show that these requirements are satisfied by the construction of Neciporuk [N], which was later improved by Lupanov [Lu]. We briefly sketch the idea of this construction.

Assume that $m$ is a power of 2. Fix some arbitrary function $F : \{0,1\}^m \to \{0,1\}$. Set $a := b := \frac{m - \log m}{2} + 1$ and $c := (\log m) - 2$. Thus $a + b + c = m$, $2^a = 2^b = 2\sqrt{2^m/m}$, and $2^c = m/4 \leq a$. Define $g : \{0,1\}^{a+b} \to \{0,1\}^a$ by defining for $j \leq 2^c$ the $j$th component of $g(x,y)$ to be $F(x,y,z)$, where $z$ is a binary representation of $j$. Call a function $q : \{0,1\}^{a+b} \to \{0,1\}^a$ special if $q(\cdot, y)$ is 1-1 for every $y \in \{0,1\}^b$. It is not difficult to show that for $g : \{0,1\}^{a+b} \to \{0,1\}^a$ there exist special functions $g_1, g_2, g_3, g_4$ such that $\forall x \in \{0,1\}^a\ \forall y \in \{0,1\}^b$:
\[ g(x,y) = \begin{cases} g_1(x,y) \oplus g_2(x,y), & \text{if } |x| < 2^{a-1} \\ g_3(x,y) \oplus g_4(x,y), & \text{if } |x| \geq 2^{a-1} \end{cases} \]
(we write $|x|$ for the number whose binary representation is $x$). Hence it is essentially sufficient to construct suitable circuits for the special functions $g_1,\ldots,g_4$.

Fix some $k \in \{1,\ldots,4\}$. Assume that one has already subcircuits that compute the $2^b$ characteristic functions $S_z : \{0,1\}^b \to \{0,1\}$ for singletons $z$ from $\{0,1\}^b$. Then one can compute with these subcircuits and with $3 \cdot 2^a + a$ additional linear threshold gates the special function $g_k$. For that purpose one first constructs a circuit with $3 \cdot 2^a$ additional gates that computes $h_k : \{0,1\}^{a+b} \to \{0,1\}^{2^a}$ defined by $h_k(x,y) := e_{|g_k(x,y)|}$ (i.e. $h_k(x,y)$ is the $|g_k(x,y)|$th unit vector of length $2^a$). Each component $h_{k,i}$ of $h_k$ ($i = 1,\ldots,2^a$) can be computed with 3 linear threshold gates: We have
\[ h_{k,i}(x,y) = 1 \iff \sum_{r=1}^{a} x_r 2^{r-1} = \sum_{j=0}^{2^b - 1} c_{ij} \cdot S_{\mathrm{bin}(j)}(y), \]
where $\mathrm{bin}(j)$ is the binary representation of $j$, $S_z : \{0,1\}^b \to \{0,1\}$ is the characteristic function of $z$, and $c_{ij} \in \mathbf{N}$ is defined by the requirement that $\mathrm{bin}(i) = g_k(\mathrm{bin}(c_{ij}), \mathrm{bin}(j))$. Obviously $h_{k,i}$ can be computed as the AND of two linear threshold gates. Obviously $g_k$ can be computed from $h_k$ with $a$ additional OR-gates. With a little bit of work one can show that this construction can be implemented by circuits $C_F$ over a fixed graph of depth 4 with $33 \cdot \frac{2^m}{m}$ edges. In order to get a linear size circuit one adds $2^m/m - m$ dummy inputs to $C_F$ and sets $n := 2^m/m$.

We now turn to the analysis of learning on analog neural nets in Valiant's model [V] for probably approximately correct learning ("PAC-learning"). More precisely we consider the common extension of this model to real valued domains due to [BEHW]. Unfortunately most results about PAC-learning on neural nets are negative (see [BR], [KV]). This could either mean that learning on neural nets is impossible, or that the common theoretical analysis of learning on neural nets is not quite adequate. We want to point here to one somewhat problematic aspect of the traditional asymptotic analysis of PAC-learning on neural nets. In analogy to the standard asymptotic analysis of the runtime of algorithms in terms of the number $n$ of input bits, one usually formalizes PAC-learning on neural nets in exactly the same fashion. However in contrast to the common situation for computer algorithms (which typically receive their input in digital form as a long sequence of $n$ bits), for many important applications of neural nets the input is given in analog form as a vector of a small number $n$ of analog real valued parameters. These relatively few input parameters may consist for example of sensory data, or they may be the relevant components of a longer feature vector (which were extracted by some other mechanism). If one analyzes PAC-learning on neural nets in this fashion, the relevant asymptotic question becomes a different one: can a given analog neural net with a fixed number $n$ of analog inputs approximate the target concept arbitrarily closely after it has been shown sufficiently many training examples?

We show that for those types of neural nets which were considered in the preceding sections the previously discussed PAC-learning problem has in fact a positive solution:
Theorem 4.3 Let $\mathcal{N}$ be an arbitrary neural net as in Theorem 2.1, where the fixed parameters of the piecewise linear activation functions may now be arbitrary reals. Let $\mathcal{C}_\mathcal{N} := \{C \subseteq \mathbf{R}^n \mid \exists \alpha \in \mathbf{R}^w\ \forall x \in \mathbf{R}^n\ (\chi_C(x) = \mathcal{N}^\alpha(x))\}$ be the associated concept class, where $\chi_C$ is the characteristic function of a concept $C$. Then $\mathcal{C}_\mathcal{N}$ is properly PAC-learnable. This means that there exists a learning algorithm $LA_\mathcal{N}$ for $\mathcal{N}$ such that for any distribution $Q$ over $\mathbf{R}^n$, any target concept $C_T \in \mathcal{C}_\mathcal{N}$, and any given $\varepsilon, \delta \in \mathbf{R}^+$ the learning algorithm $LA_\mathcal{N}$ with inputs $\varepsilon$ and $\delta$ carries out in $O\big((\frac{1}{\varepsilon})^{O(1)} \cdot (\frac{1}{\delta})^{O(1)}\big)$ computation steps (w.r.t. the uniform cost criterion on a RAM) the following task: It computes a suitable number $m$ and draws some sequence $S$ of $m$ examples for $C_T$ according to distribution $Q$. Then it computes from $S$ an assignment $\alpha_S \in \mathbf{R}^w$ for the variable parameters of $\mathcal{N}$ such that $Q\big[\{x \in \mathbf{R}^n \mid \chi_{C_T}(x) \neq \mathcal{N}^{\alpha_S}(x)\}\big] \leq \varepsilon$ with probability $\geq 1 - \delta$.
Idea of the proof: We have VC-dimension$(\mathcal{C}_\mathcal{N}) < \infty$ by Remark 3.4. Hence according to [BEHW] it suffices to show that for any given set $S$ of $m$ examples for $C_T$ one can compute from $S$, within a number of computation steps that is polynomial in $m, \frac{1}{\varepsilon}, \frac{1}{\delta}$, an assignment $\alpha_S \in \mathbf{R}^w$ to the variable parameters of $\mathcal{N}$ such that $\forall x \in S\ (\chi_{C_T}(x) = \mathcal{N}^{\alpha_S}(x))$. The construction in the proof of Theorem 2.1 implies that it is sufficient if one computes instead, with polynomially in $m, \frac{1}{\varepsilon}, \frac{1}{\delta}$ computation steps, an assignment $\hat\alpha_S, c_S$ of parameters for the associated neural net $\hat{\mathcal{N}}$ such that $\forall x \in S\ \big(\chi_{C_T}(x) = \hat{\mathcal{N}}[c_S]^{\hat\alpha_S}(x)\big)$. The latter task is easier because the role of the parameters $\hat\alpha, c$ in a computation of $\hat{\mathcal{N}}$ for a specific input $x$ can be described by linear inequalities (provided one knows which linear piece is used at each gate).

Nevertheless the following technical problem remains. Although we know which output $\hat{\mathcal{N}}[c_S]^{\hat\alpha_S}$ should give for an input $x \in S$, we do not know in which way this output should be produced by $\hat{\mathcal{N}}[c_S]^{\hat\alpha_S}$. More specifically, we do not know which particular piece of each piecewise linear activation function $\gamma^g$ of $\hat{\mathcal{N}}$ will be used for this computation. However this detailed information would be needed for each $x \in S$ and for all gates $g$ of $\hat{\mathcal{N}}$ in order to describe the resulting constraints on the parameters $\hat\alpha, c$ by a system of linear inequalities. However one can generate a set of polynomially in $m$ many systems of linear inequalities such that at least one of these systems provides for all $x \in S$ satisfiable and sufficient constraints for $\hat\alpha, c$.

By definition of $\mathcal{C}_\mathcal{N}$ we know that there are parameters $\hat\alpha_T, c_T$ such that $\hat{\mathcal{N}}[c_T]^{\hat\alpha_T}$ computes $C_T$. Consider any inequality $I(\hat\alpha, c, x) \leq 0$ (with $I(\hat\alpha, c, x)$ linear in $\hat\alpha, c$ for fixed $x$, and linear in $x$ for fixed $\hat\alpha, c$) as they were introduced in the proof of Theorem 2.1 in order to describe the comparison with a threshold at some gate $g$ of $\hat{\mathcal{N}}$. The hyperplane $\{x \in \mathbf{R}^n \mid I(\hat\alpha_T, c_T, x) = 0\}$ defines a partition of $S$ into $\{x \in S \mid I(\hat\alpha_T, c_T, x) \leq 0\}$ and $\{x \in S \mid I(\hat\alpha_T, c_T, x) > 0\}$. Hence it suffices to produce (e.g. with the algorithm of [EOS]) in polynomially in $m$ many computation steps all partitions of $S$ that can be generated by as many hyperplanes as there are linear inequalities $I(\hat\alpha, c, x) \leq 0$ in the proof of Theorem 2.1. One of these partitions will agree with the partition of $S$ that is defined by the hyperplanes $\{x \in \mathbf{R}^n \mid I(\hat\alpha_T, c_T, x) = 0\}$ for the "correct values" $\hat\alpha_T, c_T$ of the parameters. Each of these partitions corresponds to a "guess" which linear pieces of the activation functions $\gamma^g$ of $\hat{\mathcal{N}}$ are used for the different inputs $x \in S$, and hence it defines a unique system of linear inequalities in $\hat\alpha, c$ (with the inputs $x \in S$ as fixed coefficients). Furthermore it is guaranteed that one of these "guesses" is correct for $\hat\alpha_T, c_T$.

For each of the resulting polynomially in $m$ many systems of inequalities we apply the method of the proof of Lemma 2.4 (i.e. we reduce the solution of each system of inequalities to the solution of polynomially in $m$ many systems of linear equalities), or we apply Megiddo's polynomial time algorithm for linear programming in a fixed dimension [Me], in order to find values $\hat\alpha_S, c_S$ for which $\hat{\mathcal{N}}[c_S]^{\hat\alpha_S}$ gives the desired outputs for all $x \in S$. By construction, this algorithm will succeed for at least one of the selected systems of inequalities.
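A drastically simplified sketch of this learning procedure is given below (hypothetical code, not the paper's algorithm; it assumes SciPy's linprog is available and handles only a single $\pi$-gate with one real input and $w \geq 0$). In one dimension the "guesses" which linear piece each sample uses reduce to splits of the sorted sample, and each guess yields a linear feasibility problem in the parameters $(w, b, T)$.
\begin{verbatim}
import numpy as np
from scipy.optimize import linprog

EPS = 1e-3                                   # margin encoding strict inequalities

def constraints_for(x, label, piece):
    """Rows (A, rhs) of A @ [w, b, T] <= rhs for one sample under one guess."""
    rows, rhs = [], []
    if piece == "left":                      # w*x + b <= 0, gate outputs 0
        rows.append([x, 1.0, 0.0]); rhs.append(0.0)
        if label == 1: rows.append([0, 0, 1.0]); rhs.append(0.0)       # T <= 0
        else:          rows.append([0, 0, -1.0]); rhs.append(-EPS)     # T > 0
    elif piece == "middle":                  # 0 <= w*x + b <= 1, outputs w*x + b
        rows.append([-x, -1.0, 0.0]); rhs.append(0.0)
        rows.append([x, 1.0, 0.0]); rhs.append(1.0)
        if label == 1: rows.append([-x, -1.0, 1.0]); rhs.append(0.0)   # w*x+b >= T
        else:          rows.append([x, 1.0, -1.0]); rhs.append(-EPS)   # w*x+b < T
    else:                                    # "right": w*x + b >= 1, outputs 1
        rows.append([-x, -1.0, 0.0]); rhs.append(-1.0)
        if label == 1: rows.append([0, 0, 1.0]); rhs.append(1.0)       # T <= 1
        else:          rows.append([0, 0, -1.0]); rhs.append(-1.0 - EPS)  # T > 1
    return rows, rhs

def fit(samples):                            # samples: list of (x, label)
    samples = sorted(samples)
    m = len(samples)
    for i in range(m + 1):                   # first i samples on the left piece
        for j in range(i, m + 1):            # next j - i samples on the middle piece
            A, rhs = [], []
            for idx, (x, label) in enumerate(samples):
                piece = "left" if idx < i else ("middle" if idx < j else "right")
                r, rh = constraints_for(x, label, piece)
                A += r; rhs += rh
            res = linprog(c=[0, 0, 0], A_ub=A, b_ub=rhs,
                          bounds=[(None, None)] * 3)
            if res.success:
                return res.x                 # consistent (w, b, T) found
    return None                              # the w < 0 case is omitted in this sketch

print(fit([(-1.0, 0), (0.2, 0), (0.8, 1), (2.0, 1)]))
\end{verbatim}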
Remark 4.4 Assume $\mathcal{N}$ is some arbitrary neural net according to Definition 1.1 with arbitrary piecewise linear activation functions, and $\mathcal{N}$ does not satisfy the condition that all computation nodes of $\mathcal{N}$ have fan-out $\leq 1$. Then Theorem 4.3 does not show that $\mathcal{C}_\mathcal{N}$ is properly PAC-learnable. However it implies that $\mathcal{C}_\mathcal{N}$ is PAC-learnable, with $\mathcal{C}_{\mathcal{N}'}$ for a somewhat larger neural net $\mathcal{N}'$ of the same depth used as hypothesis class (see Remark 2.2 for the definition of $\mathcal{N}'$). Note that this result may lead towards a theoretical explanation of an effect that has been observed in many experiments: one often achieves better learning results on artificial neural nets if one uses a neural net with somewhat more units than necessary (i.e. necessary in order to compute the target concept on the neural net).
Theorem 4.5 Let $\mathcal{N}$ be an arbitrary neural net with arbitrary piecewise polynomial activation functions and arbitrary polynomial gate inputs $Q^g(y_1,\ldots,y_m)$. Then the associated concept class $\mathcal{C}_\mathcal{N}$ is PAC-learnable with a hypothesis class of the form $\mathcal{C}_{\tilde{\mathcal{N}}}$ for a somewhat larger neural net $\tilde{\mathcal{N}}$.

Idea of the proof: One can reduce this problem to the case of neural nets with linear gate inputs as indicated in Corollary 3.2. One uses as hypotheses sets which are defined by a neural net $\tilde{\mathcal{N}}$ of the same structure as the circuits $C^A(\mathcal{N})$ in the detailed proof of Theorem 3.1 in the Appendix. For this neural net $\tilde{\mathcal{N}}$ one can express the constraints on the assignment $A$ by linear inequalities. Remark 3.4 implies that VC-dimension$(\tilde{\mathcal{N}}, \mathbf{R}^n) < \infty$. One applies the method from the proof of Lemma 2.4, in a manner analogous to the proof of Theorem 4.3, or linear programming in a fixed dimension [Me], to polynomially in $m$ many systems of linear inequalities. There is one small obstacle in generating the associated partitions of $S$, since the corresponding inequalities are not linear in the circuit inputs $x$. One overcomes this difficulty by going to an input space of higher dimension.
5 Appendix

In the following we provide details for the proof of Theorem 3.1. Because of the somewhat abstract nature of this proof one has to give precise definitions for various auxiliary notions. In the preceding sketch of the proof of Theorem 3.1 (after Remark 3.4) we have already described how the given neural nets $\mathcal{N}_n^{\alpha_n}$ (with an arbitrary given weight assignment $\alpha_n$) can be transformed into neural nets $\hat{\mathcal{N}}_n^{\hat\alpha_n}$ in normal form. We assume now that $\alpha_n$ and $\hat\alpha_n$ have been fixed. We will give in the following a precise definition of a system $L(\hat{\mathcal{N}}_n^{\hat\alpha_n}, \{0,1\}^n)$ of linear inequalities. This system describes in a weak manner (weak because of the required linearity) the role which the currently fixed values $\hat\alpha_n$ of the parameters of $\hat{\mathcal{N}}_n$ play in the computations of $\hat{\mathcal{N}}_n^{\hat\alpha_n}$ for arbitrary inputs $x \in \{0,1\}^n$. As in the proof of Theorem 2.1, the inputs $x \in \{0,1\}^n$ occur as constants in this system of linear inequalities, whereas the variable parameters of $\hat{\mathcal{N}}_n$ (i.e. the weights on edges between input nodes and computation nodes, the biases and thresholds of all gates, and the outer threshold $T_{out}$) will be variables in this system of linear inequalities. We will exploit at the end of this proof (after Lemma 5.6) that this system has some solution, because the currently fixed values $\hat\alpha_n$ provide such a solution.

In order to simplify our notation, we will write $\mathcal{N}$ instead of $\hat{\mathcal{N}}_n^{\hat\alpha_n}$ in the following. We assume that some concrete real values (given by $\hat\alpha_n$) have been assigned to the variable parameters of this neural net $\mathcal{N}$. We define for each gate $g$ in $\mathcal{N}$ by induction on the depth of $g$:

- in Definition 5.1 a set $V^g$ of variables and a set $M^g$ of formal terms that are needed to describe the operation of gate $g$. [The intuition is here that one writes for any circuit input $x$ the output of $g$ as a sum of products (of weights, and of components of $x$). Which of these terms will occur for a specific circuit input $x$ will depend on the course of the computation in $\mathcal{N}$ up to gate $g$: for different inputs the involved gates may use different polynomial pieces of their activation function. The set $M^g$ contains a separate formal term for each product that may possibly occur in this sum. Each term in $M^g$ consists of a variable $w \in V^g$ (that represents a weight, a bias, a constant gate-output, or some product of weights and biases) and of a product $P \equiv x_1^{j_1} \cdots x_n^{j_n}$ of input variables $x_1,\ldots,x_n$.]

- in Definition 5.2 for any fixed circuit input $x \in \mathbf{R}^n$ a set $L^g(x)$ of linear inequalities associated with gate $g$, that hold for the computation of $\mathcal{N}$ on input $x$ if all formal terms $t \in M^g$ are replaced by their actual value $W(t,x)$ in $\mathcal{N}$; we also define in Definition 5.2 a set $S^g(x)$ of formal terms whose sum represents the input of $g$, and a set $T^g(x) \subseteq M^g$ of formal terms that represents the output of $g$ for circuit input $x$. [$L^g(x)$ specifies in particular which piece of $\gamma^g$ is used by gate $g$ for circuit input $x$.]

- in Definition 5.4 for any input set $S \subseteq \mathbf{R}^n$, any solution $A$ of the resulting system $L(\mathcal{N}, S) := \bigcup\{L^g(x) \mid x \in S$ and $g$ is a gate in $\mathcal{N}\}$ of linear inequalities, and any term $t \in M^g$ a circuit $C^{A,g}(t)$ that decides for any circuit input $x \in S$ whether $t$ is present in the output of $g$ in $\mathcal{N}$. [For any input $x \in S$ the circuits $(C^{A,g}(t))_{t \in M^g}$ together compute the characteristic function of the set $T^g(x) \subseteq M^g$. In this way one can replace in a recursive manner the analog computations in $\mathcal{N}$ by digital manipulations of formal terms, with "nonstandard products" in place of real products.]

One verifies in Lemma 5.3 that $L(\mathcal{N}, S)$ describes correctly the role of the parameters $\hat\alpha_n$ in the computations of $\mathcal{N} := \hat{\mathcal{N}}_n^{\hat\alpha_n}$ for inputs $x \in S$. Unfortunately $L(\mathcal{N}, S)$ does not provide a complete description of the properties of the parameters $\hat\alpha_n$ in these computations, since it represents only a "linear fragment" of their theory. Nevertheless one can prove with the help of Lemma 5.5 and Lemma 5.6 that for any solution $A$ of $L(\mathcal{N}, S)$ the digital circuits $C^{A,g}(t)$ carry out a truthful simulation of the corresponding subcircuits of $\mathcal{N}$.

We would like to point out a difference to the proof of Theorem 2.1 regarding the treatment of parameters. In the proof of Theorem 3.1 the variable parameters $\alpha_n$ of $\mathcal{N}_n^{\alpha_n}$ (the weights on edges, the biases of gates, and the outer threshold $T_{out}$) and the fixed parameters of the given neural nets $\mathcal{N}_n$ (the thresholds of the activation functions $\gamma^g$, and the coefficients of the polynomial pieces of $\gamma^g$) are all changed simultaneously in the transformation to $\hat{\mathcal{N}}_n^{\hat\alpha_n}$. Consequently $\hat\alpha_n$ denotes the values of all nontrivial parameters in $\hat{\mathcal{N}}_n^{\hat\alpha_n}$ (the weights on edges between input nodes and computation nodes, the biases of all gates, the outer threshold, and also the thresholds $t_1^g, t_2^g$ and the values of constant outputs $a^g$ of all gates $g$ in $\hat{\mathcal{N}}_n^{\hat\alpha_n}$). As a consequence of this treatment of parameters one can allow in the given neural nets $\mathcal{N}_n$ of Theorem 3.1 arbitrary reals as fixed parameters (i.e. for the thresholds and coefficients of the polynomial pieces of the given activation functions $\gamma^g$).

We refer to an analog neural net $\mathcal{N}$ with the properties of $\hat{\mathcal{N}}_n^{\hat\alpha_n}$ (as described in section 3) as a neural net in normal form. This means that all gates in $\mathcal{N}$ have fan-out $\leq 1$, all weights on edges between two computation nodes are from $\{-1,0,1\}$, and all gates $g$ in $\mathcal{N}$ use as activation function $\gamma^g$ a piecewise polynomial function that consists of $\leq 3$ pieces, of which at most one piece is not identically 0, and in which the nontrivial piece (if it exists) outputs a constant from $\mathbf{R}$ or computes a power $y \mapsto y^k$ for some $k \in \mathbf{N}$. The argument of $\gamma^g$ is as usual a weighted sum $\sum_{i=1}^m \alpha_i^g y_i + \alpha_0^g$, where $y_1,\ldots,y_m \in \mathbf{R}$ are the inputs to gate $g$. In order to simplify notation, we assume that for a neural net $\mathcal{N}$ in normal form the nontrivial piece of the activation function $\gamma^g$ of each gate $g$ is defined over a half-open interval $[t_1^g, t_2^g)$ with certain reals $t_1^g < t_2^g$. It is easy to see that the subsequent proof can also be carried out without this simplifying assumption. We also assume w.l.o.g. that $\mathcal{N}$ is levelled, i.e. each gate $g$ in $\mathcal{N}$ has the property that all paths in $\mathcal{N}$ from an input node to $g$ have the same length.
1
1
1
1
1
for these variables that are considered later (starting in De nition 5.2). For any set M
19
of terms we set
1 M := M (?1) M := fw (?P ) j w P 2 M; where w is a variableg 0 M := = : Hence we can also talk about sets of terms gi M , where gi is a weight on an edge between computation nodes in N (gi 2 f?1; 0; 1g, since N is assumed to be in normal form). We consider rst the case where g has depth 1. If g gives on its nontrivial piece g [t1 ; tg2 ) a constant ag as output, we set g g [ fv g ; v g g and M g := fv g g: V g := fv g ; : : :; vng g [ fvconst const I II 0
g ) := ag ; W (v g ) := tg ; and W (v g ) := tg We de ne W (vig ) := gi for i = 0; : : :; n; W (vconst 2 1 II I (g1 ; : : :; gn are the weights and g0 is the bias of gate g in N ; tg1 ; tg2 are the thresholds of the activation function g ). In the other case g computes a power y 7! y k on its nontrivial piece. Then we introduce for each k-tuple < w1 P1; : : :; wk Pk > 2 k Q (fv0g g [ fv1g x1 ; : : :; vng xn g)k a new variable vwg ;:::;wk in V g and a term vwg ;:::;wk Pi i=1 in M g . We assume here that a formal multiplication P P 0 for formal terms P; P 0 of the form xj1 : : : xjnn is de ned in the obvious way. Thus we get 1
1
1
V g := fv g ; : : :; vng g [ fvIg ; vIIg g [ fvwg ;:::;wk j < w ; : : :; wk > 2 fv g ; : : :; vng gk g : 0
1
1
0
k Q
We set W (vwg ;:::;wk ) := W (wi ); and we de ne W (v ) for the other variables as before. i=1 We de ne k M g := fvwg ;:::;wk Q Pi j < w1 P1; : : :; wk Pk > 2 (fv0g g [ fv1g x1; : : :; vng xng)kg : 1
i=1
1
[The terms in M g play the role of the summands that one gets from the output (g0 + m g P i xi)k of g by rewriting this output as a sum of products.] i=1
We now consider the case where g is a gate on level l + 1, with edges from the gates g1; : : :; gm on level l leading into g. Assume that g1 ; : : :; gn 2 f?1; 0; 1g are the weights on these edges in N and that g0 is the bias of g . If g outputs a constant ag on its nontrivial piece, we set g g [ fv g ; v g g and M g := fv g g: V g := fv g ; vconst const II I 0
g ) := ag ; W (v g ) := tg , and W (v g ) := tg . If g comWe set W (v0g ) := g0 ; W (vconst 1 2 I II putes the power y 7! y k on its nontrivial piece, we introduce for each k-tuple
< w P ; : : :; wk Pk > 2 (fv g g [ 1
1
0
a new variable vwg ;:::;wk in V g , and a term vwg ;:::;wk 1
1
20
m [ j =1
k Q i=1
gj M gj )k
Pi in M g . Thus we set
V g := fv g g [ fvIg ; vIIg g [ fvwg ;:::;wk j < w P ; : : :; wk Pk > 2 (fvg g [ 0
1
1
1
m [
0
gj M gj )k;
j =1 for arbitrary polynomial terms P1 ; : : :; Pk m [ and variables wi 2 (fv0g g [ V gj )g: j =1 k Q
We de ne W (vwg ;:::;wk ) := W (wi); W (v0g ) := g0 , W (vIg ) := tg1 ; W (vIIg ) := tg2 . i=1 We set k m M g := fvwg ;:::;wk Q Pi j < w1 P1; : : :; wk Pk >2 (fv0g g [ S gj M gj )kg: [The argument 1
i=1 j =1 m S g of g is a sum of 0 and of summands that are denoted by terms in gj M gj . Hence j =1 the terms in M g correspond to the summands that one gets by rewriting the output of
g as a sum of products.] 1
Finally, for the output gate $g_{out}$ of $\mathcal{N}$, we place into $V^{g_{out}}$ in addition the variable $v^{g_{out}}$. We define $W(v^{g_{out}})$ as the value of the outer threshold of $\mathcal{N}$.
De nition 5.2 Assume that N is a neural net in normal form with n input variables. Let x 2 Rn be a xed input for N . We de ne for each gate g in N by simultaneous induction on the depth of g - a set Lg (x) of inequalities (that are linear in the variables from V g ) - a set S g (x) of formal terms (whose sum represents the argument of g for circuit input x) - a set T g (x) M g (whose sum represents the output of g for circuit input x). Since x is now a xed element of Rn, one can assign a speci c value W (P; x) 2 R to each term P of the form xj1 : : : xjnn that occurs in a formal term of the preceding de nition. Hence one can assign to any formal term t = v P (that belongs to some M g ) a speci c value W (t; x) := W (v ) W (P; x): For a set S of formal terms we de ne W (S; x) := P W (t; x). For the case S = = we set W (= ; x) := 0. 1
t2S
If g has depth 1, then we de ne
S g (x) := fv g g [ fvig xi j i = 1; : : :; ng : 0
Assume that g is a gate on level l + 1 with edges from gates g1 ; : : :; gm on level l leading into g . Let g0 ; g1 ; : : :; gm be the bias and the weights of gate g in N . Then we set
S g (x) :=
fvg g 0
[
21
m [ j =1
gj T gj (x) :
We de ne Lg (x) and T g (x) as P follows for any gate g in N . If W (S g (x); x) < tg1 , then Lg (x) contains thePinequality S g (x) + 1 vIg . If W (S g (x); x) tg2 , then Lg (x) contains the inequality S g (x) vIIg . In either case we set T g (x) := =. If tg1 W (S g (x); x) < tg2 , then Lg (x) contains the inequalities vIg S g (x) and S g (x) + 1 vIIg . If g gives on its nontrivial piece a constant ag as output, we set in g g. If g computes on its nontrivial piece a power y 7! y k , we this case T g (x) := fvconst set in this case T g (x) := P
P
k Y
fvwg ;:::;wk Pi j < w P ; : : :; wk Pk > 2 (fvgg [ 1
i=1
1
1
m [
0
gj T gj (x))g:
j =1 g out Finally, if g is the output Pgate gout of N and W (T (x); x) < W (v gout ), we add out (x) + 1 v gout . If W (T gout (x); x) W (v gout ), we to Lgout (x) also the inequality T gP add to Lgout (x) also the inequality T gout (x) v gout .
We define
\[ L(\mathcal{N}, x) := \bigcup \{ L^g(x) \mid g \text{ is a gate of } \mathcal{N} \}, \]
and for $S \subseteq \mathbb{R}^n$
\[ L(\mathcal{N}, S) := \bigcup \{ L(\mathcal{N}, x) \mid x \in S \}. \]
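The following Python fragment sketches, under our own illustrative naming, how the sets $L^g(x)$ and $T^g(x)$ of Definition 5.2 could be generated for a single gate once $S^g(x)$ is known: one evaluates $W(S^g(x), x)$ with the original real parameters $W$ and records the corresponding inequalities purely symbolically. The argument `T_nontrivial` stands for the term set prescribed by Definition 5.2 for the nontrivial piece (either $\{v_{const}^g\}$ or the product terms built from the $T^{g_j}(x)$); it is passed in by the caller rather than recomputed here.

```python
def monomial_value(exps, x):
    """Evaluate the monomial x_1^{j_1}...x_n^{j_n} at the fixed input x."""
    val = 1.0
    for xi, ji in zip(x, exps):
        val *= xi ** ji
    return val

def process_gate(g, S_g, T_nontrivial, W, x, t1, t2):
    """Return (L^g(x), T^g(x)) for a gate g with thresholds t1 <= t2.
    S_g: formal terms (variable, exponents) whose sum denotes the argument of g;
    W:   dict mapping variables to the original real parameters of the net."""
    w_val = sum(W[v] * monomial_value(e, x) for v, e in S_g)   # W(S^g(x), x)
    L, T = [], set()
    if w_val < t1:
        L.append(f"sum S^{g}(x) + 1 <= v_I^{g}")    # T^g(x) stays empty
    elif w_val >= t2:
        L.append(f"sum S^{g}(x) >= v_II^{g}")       # T^g(x) stays empty
    else:                                           # t1 <= W(S^g(x), x) < t2
        L.append(f"v_I^{g} <= sum S^{g}(x)")
        L.append(f"sum S^{g}(x) + 1 <= v_II^{g}")
        T = set(T_nontrivial)                       # nontrivial piece: terms from Def. 5.2
    return L, T
```

Taking the union of these sets over all gates and over all inputs $x \in S$ yields $L(\mathcal{N}, S)$, exactly as in the display above.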
The following Lemma verifies that for any $x \in \mathbb{R}^n$ the system $L(\mathcal{N}, x)$ of inequalities provides a truthful description of the computation of $\mathcal{N}$ for input $x$.
Lemma 5.3 Assume that $\mathcal{N}$ is a neural net in normal form with $n$ input variables, and $x \in \mathbb{R}^n$ is an arbitrary concrete input. Then we have for any gate $g$ in $\mathcal{N}$: $W(S^g(x), x)$ is the input, and $W(T^g(x), x)$ is the output of gate $g$ in the computation of $\mathcal{N}$ for input $x$. Furthermore:
$W(S^g(x), x) < t_1^g \iff$ "$\sum S^g(x) + 1 \le v_I^g$" $\in L(\mathcal{N}, x)$,
$W(S^g(x), x) < t_2^g \iff$ "$\sum S^g(x) + 1 \le v_{II}^g$" $\in L(\mathcal{N}, x)$,
$W(S^g(x), x) \ge t_1^g \iff$ "$\sum S^g(x) \ge v_I^g$" $\in L(\mathcal{N}, x)$,
$W(S^g(x), x) \ge t_2^g \iff$ "$\sum S^g(x) \ge v_{II}^g$" $\in L(\mathcal{N}, x)$.
Proof: Induction on the depth of $g$.

Definition 5.4 Assume that $\mathcal{N}$ is a neural net of depth $d$ in normal form with $n$ inputs. Assume that $q$ is the maximal degree of a piecewise polynomial activation function in $\mathcal{N}$. Furthermore assume that $S \subseteq \mathbb{R}^n$ and $A : V^{\mathcal{N}} \to \mathbb{R}$ is an arbitrary solution of the system $L(\mathcal{N}, S)$ of inequalities with the variable set $V^{\mathcal{N}} := \bigcup \{ V^g \mid g \text{ is a gate in } \mathcal{N} \}$. We define by induction on the depth of gate $g$ in $\mathcal{N}$ for each term $t \in M^g$ a circuit $C^{A,g}(t)$. Together the circuits $(C^{A,g}(t))_{t \in M^g}$ mimic the subcircuit of $\mathcal{N}$ between the input and gate $g$. The circuit $C^{A,g}(t)$ consists of linear threshold gates and multiplication gates of fan-in 2 that compute the product of any two real numbers. For any circuit input $x \in S$ the output of the circuit $C^{A,g}(t)$ will be 1 if $t \in T^g(x)$, otherwise it will be 0. One associates with each circuit $C^{A,g}(t)$ for $t \in M^g$ of the form $t = v \cdot P$ another circuit $\tilde{C}^{A,g}(t)$ that outputs for any circuit input $x \in S$ the real number
\[ A(t, x) := \begin{cases} A(v) \cdot A(P, x), & \text{if } t \in T^g(x) \\ 0, & \text{if } t \in M^g - T^g(x), \end{cases} \]
where one defines for each formal term $P = x_1^{j_1} \cdots x_n^{j_n}$ a value $A(P, x) \in \mathbb{R}$ as $x_1^{j_1} \cdots x_n^{j_n}$ (i.e. formal variables $x_i$ are replaced by the corresponding components of $x \in S$). The extension from $C^{A,g}(t)$ to $\tilde{C}^{A,g}(t)$ is done in a canonical manner by adding $q^d + 1$ product gates (we assume for the moment that the constants $A(v)$ are available in analog form as separate inputs to the circuits).
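The extension step is essentially a chain of fan-in-2 multiplications. The following small Python sketch (our own illustrative naming, not the paper's construction) shows the value that $\tilde{C}^{A,g}(t)$ is meant to produce: the 0/1 output of $C^{A,g}(t)$ multiplied with $A(v)$ and with the factors of the monomial $P$.

```python
from functools import reduce

def tilde_output(c_out, A_v, exps, x):
    """Output of C~^{A,g}(t): the 0/1 value c_out of C^{A,g}(t), multiplied via a
    chain of fan-in-2 product gates with A(v) and with the factors of
    P = x_1^{j_1}...x_n^{j_n} (each x_i repeated j_i times)."""
    factors = [c_out, A_v] + [xi for xi, ji in zip(x, exps) for _ in range(ji)]
    return reduce(lambda a, b: a * b, factors, 1.0)
```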
The definition of a value $A(t, x)$ for each term $t$ and each $x \in S$ is extended in a canonical way to arbitrary sets $M$ of terms:
\[ A(M, x) := \sum_{t \in M} A(t, x), \qquad A(\emptyset, x) := 0. \]
We consider first the case where $g$ has depth 1. Let $H_1^g$ be a linear threshold gate that checks whether $A(v_I^g) \le A(S^g(x), x)$, and let $H_2^g$ be a linear threshold gate that checks whether $A(S^g(x), x) + 1 \le A(v_{II}^g)$. For each term $t \in M^g$ we define $C^{A,g}(t)$ to be the AND of $H_1^g$ and $H_2^g$. Assume then that $g$ is a gate on level $l+1$ with edges from the gates $g_1, \ldots, g_m$ on level $l$ leading into $g$. According to Definition 5.2 we have in this case $S^g(x) = \{v_0^g\} \cup \bigcup_{j=1}^m \alpha_j^g T^{g_j}(x)$ for every $x \in S$. By induction hypothesis we have already defined circuits $C^{A,g_j}(t)$, and hence also circuits $\tilde{C}^{A,g_j}(t)$, for all $t \in M^{g_j}$, $j = 1, \ldots, m$. For each term $t \in M^g$ the circuit $C^{A,g}(t)$ employs two linear threshold gates $H_1^g$ and $H_2^g$, which receive their inputs from the circuits $\tilde{C}^{A,g_j}(t)$ for $t \in M^{g_j}$, $j = 1, \ldots, m$. Their weights are from $\{-1, 0, 1\}$ as given by $\alpha_1^g, \ldots, \alpha_m^g$. The gate $H_1^g$ checks whether $A(v_I^g) \le A(S^g(x), x)$ and $H_2^g$ checks whether $A(S^g(x), x) + 1 \le A(v_{II}^g)$. If $g$ outputs a constant on its nontrivial piece, $C^{A,g}(v_{const}^g)$ is defined as the AND of $H_1^g$ and $H_2^g$ (together with their subcircuits $\tilde{C}^{A,g_j}(t)$).
If $g$ computes $y \mapsto y^k$ on its nontrivial piece, each $t \in M^g$ is of the form $v^g_{w_1, \ldots, w_k} \cdot \prod_{i=1}^k P_i$ for some $k$-tuple $\langle w_1 \cdot P_1, \ldots, w_k \cdot P_k \rangle \in \big( \{v_0^g\} \cup \bigcup_{j=1}^m \alpha_j^g M^{g_j} \big)^k$. In this case $C^{A,g}(t)$ is defined as the AND of $H_1^g$, $H_2^g$, and of the outputs of the circuits $C^{A,g_j}(w_i \cdot (\alpha_j^g \cdot P_i))$ for all $i \in \{1, \ldots, k\}$ and $j \in \{1, \ldots, m\}$ with $\alpha_j^g \ne 0$ and $w_i \cdot P_i \in \alpha_j^g M^{g_j}$. [One should recall that $\alpha_j^g \in \{-1, 0, 1\}$ since $\mathcal{N}$ is in normal form. Hence for $\alpha_j^g \ne 0$ one has $w_i \cdot P_i \in \alpha_j^g M^{g_j} \iff w_i \cdot (\alpha_j^g \cdot P_i) \in M^{g_j}$. One should also keep in mind that in general $A(v^g_{w_1, \ldots, w_k}) \ne \prod_{i=1}^k A(w_i)$.]

Finally we define the circuit $C^A(\mathcal{N})$ by using as subcircuits the circuits $\tilde{C}^{A,g_{out}}(t)$ for all $t \in M^{g_{out}}$. The output of $C^A(\mathcal{N})$ is given by a linear threshold gate $H$ that checks whether $A(M^{g_{out}}, x) \ge A(v^{g_{out}})$.
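As a rough companion to Definition 5.4, the sketch below evaluates in Python, with our own illustrative names, the two threshold gates $H_1^g$, $H_2^g$ that gate every circuit $C^{A,g}(t)$, and the final threshold gate $H$ of $C^A(\mathcal{N})$. Here $A$ is assumed to be given as a dictionary over the variables of $V^{\mathcal{N}}$, a term is again a pair (variable, exponents), and the recursive AND over the subcircuits $\tilde{C}^{A,g_j}(t)$ is left out.

```python
def A_value(term, A, x):
    """A(t, x) for an active term t = v * P: the constant A(v) times P evaluated at x."""
    v, exps = term
    val = A[v]
    for xi, ji in zip(x, exps):
        val *= xi ** ji
    return val

def gate_indicator(A, g, S_value):
    """AND of the threshold gates H_1^g and H_2^g that appear in every C^{A,g}(t):
    H_1^g tests A(v_I^g) <= A(S^g(x), x), H_2^g tests A(S^g(x), x) + 1 <= A(v_II^g)."""
    h1 = 1 if A[f"v_I^{g}"] <= S_value else 0
    h2 = 1 if S_value + 1 <= A[f"v_II^{g}"] else 0
    return h1 & h2

def final_output(A, g_out, active_terms, x):
    """Threshold gate H of C^A(N): compare A(M^{g_out}, x) with A(v^{g_out}).
    Inactive terms contribute 0, so only the active ones are summed."""
    total = sum(A_value(t, A, x) for t in active_terms)
    return 1 if total >= A[f"v^{g_out}"] else 0
```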
Lemma 5.5 Assume that $S \subseteq \mathbb{R}^n$ and $A$ is an arbitrary solution of $L(\mathcal{N}, S)$. Then the following holds for any gate $g$ in $\mathcal{N}$, for any term $t \in M^g$, and any input $x \in S$:
a) The output of the gate $H_1^g$ in $C^{A,g}(t)$ is 1 if and only if $L(\mathcal{N}, S)$ contains the inequality "$v_I^g \le \sum S^g(x)$". Similarly the output of the gate $H_2^g$ in $C^{A,g}(t)$ is 1 if and only if $L(\mathcal{N}, S)$ contains the inequality "$\sum S^g(x) + 1 \le v_{II}^g$".
b) $t \in T^g(x) \iff$ ($C^{A,g}(t)$ outputs 1 for circuit input $x$).
Proof: By induction on the depth of $g$.

Lemma 5.6 Assume that $\mathcal{N}$ is a neural net in normal form with $n$ input variables, $S \subseteq \mathbb{R}^n$ is an arbitrary set of inputs, and $A$ is an arbitrary solution of $L(\mathcal{N}, S)$. Then $\mathcal{N}$ and $C^A(\mathcal{N})$ compute the same function from $S$ into $\{0, 1\}$.

Proof: Follows from Lemma 5.3 and Lemma 5.5.

We are now in a position where we can complete the proof of Theorem 3.1. Assume that a given array $(\mathcal{N}_n)_{n \in \mathbb{N}}$ of neural nets satisfies the assumption of Theorem 3.1, and that $(\alpha^n)_{n \in \mathbb{N}}$ is an arbitrary array of real valued assignments to the variable parameters in $\mathcal{N}_n$. One can transform the given neural nets $(\mathcal{N}_n^{\alpha^n})_{n \in \mathbb{N}}$ into an array $(\hat{\mathcal{N}}_n^{\hat{\alpha}^n})_{n \in \mathbb{N}}$ of neural nets in normal form (with properties as specified in section 3) such that $\hat{\mathcal{N}}_n^{\hat{\alpha}^n}$ computes the same boolean function as $\mathcal{N}_n^{\alpha^n}$. We then apply the machinery from the definitions and Lemmas 5.1 to 5.6 to each neural net $\mathcal{N} := \hat{\mathcal{N}}_n^{\hat{\alpha}^n}$ with $S := \{0, 1\}^n$. By construction of $\hat{\mathcal{N}}_n^{\hat{\alpha}^n}$ the resulting system $L(\mathcal{N}, \{0, 1\}^n)$ of inequalities has some solution over $\mathbb{R}$. We exploit here in particular that $\hat{\alpha}^n$ was chosen so that all relevant strict inequalities "$s_1 < s_2$" in computations of $\hat{\mathcal{N}}_n^{\hat{\alpha}^n}$ on inputs $x \in \{0, 1\}^n$ were strengthened to "$s_1 + 1 \le s_2$". Since $|\bigcup \{ M^g \mid g \text{ gate in } \hat{\mathcal{N}}_n^{\hat{\alpha}^n} \}| = O(n^{O(1)})$, it follows that the number of gates in $C^A(\hat{\mathcal{N}}_n^{\hat{\alpha}^n})$ is bounded by $O(n^{O(1)})$. The number of variables in $L(\mathcal{N}, \{0, 1\}^n)$ is polynomial in $n$, and it only contains constants from $\{-1, 0, 1\}$. Hence by Lemma 2.4 there is a solution $A$ of $L(\mathcal{N}, \{0, 1\}^n)$ that consists of rationals of the form $\frac{s}{t}$ (with a common integer $t$) such that $s$ and $t$ are integers of size $2^{O(n^{O(1)})}$. By Lemma 5.6 the associated circuit $C^A(\mathcal{N})$ computes the same boolean function as $\mathcal{N}$. Furthermore all constants and parameters in $C^A(\mathcal{N})$ are from $\{-1, 0, 1\}$ or are given by $A$. Hence they are quotients of integers with polynomially in $n$ many bits. Thus (see [SBKH], [SR]) one can carry out all arithmetical operations in $C^A(\mathcal{N})$ by polynomial size digital subcircuits of constant depth with linear threshold gates (or equivalently: with MAJORITY-gates, see [CSV]). In the resulting circuit all parameters from $A$ are replaced by corresponding sequences of bits. Hence one gets in this way neural nets $\tilde{\mathcal{N}}_n$ which satisfy the claim of Theorem 3.1.
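The step where Lemma 2.4 is invoked can be pictured with a small linear-programming check. This is a modern numerical stand-in, not the paper's argument: Lemma 2.4 yields an exact rational solution with a common denominator of polynomially many bits, whereas the SciPy call below (an assumed, present-day tool) only returns a floating-point feasible point. The point being illustrated is that $L(\mathcal{N}, \{0,1\}^n)$ consists of non-strict inequalities with coefficients from $\{-1, 0, 1\}$, precisely because every relevant strict inequality was strengthened to "$s_1 + 1 \le s_2$", so it can be handed directly to a feasibility solver.

```python
import numpy as np
from scipy.optimize import linprog   # assumption: SciPy >= 1.6 with the "highs" solver

def feasible_point(A_ub, b_ub):
    """Return some y with A_ub @ y <= b_ub (all variables unbounded), or None."""
    n_vars = A_ub.shape[1]
    res = linprog(c=np.zeros(n_vars), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n_vars, method="highs")
    return res.x if res.success else None

# Toy system in the spirit of L(N, {0,1}^n):  y1 + 1 <= y2  and  y1 >= 0,
# rewritten in the standard form A_ub @ y <= b_ub.
A_ub = np.array([[1.0, -1.0],     # y1 - y2 <= -1
                 [-1.0, 0.0]])    # -y1     <=  0
b_ub = np.array([-1.0, 0.0])
print(feasible_point(A_ub, b_ub))
```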
Acknowledgements
We would like to thank Eduardo D. Sontag for drawing our attention to the problem of finding upper bounds for neural nets with $\sigma$-gates, and for his insightful comments. We thank Franz Aurenhammer, Eric Baum, David Haussler, Philip M. Long, Gyorgy Turan, and Gerhard Woeginger for various helpful discussions on this research. We also thank Gyorgy Turan for making his translation of the work of Neciporuk available to us.
References
[A] Y. S. Abu-Mostafa, "The Vapnik-Chervonenkis dimension: information versus complexity in learning", Neural Computation, vol. 1, 1989, 312 - 317
[B] P. L. Bartlett, "Lower bounds on the Vapnik-Chervonenkis dimension of multilayer threshold networks", preprint (July 1992)
[BH] E. B. Baum, D. Haussler, "What size net gives valid generalization?", Neural Computation, vol. 1, 1989, 151 - 160
[BR] A. Blum, R. L. Rivest, "Training a 3-node neural network is NP-complete", Proc. of the 1988 Workshop on Computational Learning Theory, Morgan Kaufmann (San Mateo, 1988), 9 - 18
[CSV] A. K. Chandra, L. Stockmeyer, U. Vishkin, "Constant depth reducibility", SIAM J. Computing, vol. 13 (2), 1984, 423 - 439
[C] T. M. Cover, "Capacity problems for linear machines", in: Pattern Recognition, L. Kanal ed., Thompson Book Co., 1988, 283 - 289
[DS] B. DasGupta, G. Schnitger, "Efficient approximations with neural networks: a comparison of gate functions", preprint (Aug. 1992)
[DR] R. Durbin, D. E. Rumelhart, "Product units: a computationally powerful and biologically plausible extension to backpropagation networks", Neural Computation, vol. 1, 1989, 133 - 142
[E] H. Edelsbrunner, "Algorithms in Combinatorial Geometry", Springer (Berlin, 1987)
[EOS] H. Edelsbrunner, J. O'Rourke, R. Seidel, "Constructing arrangements of lines and hyperplanes with applications", SIAM J. Comp., vol. 15, 1986, 341 - 363
[HMPST] A. Hajnal, W. Maass, P. Pudlak, M. Szegedy and G. Turan, "Threshold circuits of bounded depth", Proc. of the 28th Annual IEEE Symp. on Foundations of Computer Science, 1987, 99 - 110; full version to appear in J. Comp. Syst. Sci. 1992
[Has] J. Hastad, "On the size of weights for threshold gates", preprint (September 1992)
[H] D. Haussler, "Decision theoretic generalizations of the PAC model for neural nets and other learning applications", Tech. Report UCSC-CRL-91-02 (1989)
[Ho] J. J. Hopfield, "Neurons with graded response have collective computational properties like those of two-state neurons", Proc. Nat. Acad. of Sciences USA, 1984, 3088 - 3092
[J] D. S. Johnson, "A catalog of complexity classes", in: Handbook of Theoretical Computer Science, vol. A, J. van Leeuwen ed., MIT Press (Cambridge, 1990)
[KV] M. Kearns, L. Valiant, "Cryptographic limitations on learning boolean formulae and finite automata", Proc. of the 21st ACM Symposium on Theory of Computing, 1989, 433 - 444
[L] R. P. Lippmann, "An introduction to computing with neural nets", IEEE ASSP Magazine, 1987, 4 - 22
[Lu] O. B. Lupanov, "On circuits of threshold elements", Dokl. Akad. Nauk SSSR, vol. 202, 1288 - 1291; engl. translation in: Sov. Phys. Dokl., vol. 17, 1972, 91 - 93
[MSS] W. Maass, G. Schnitger, E. D. Sontag, "On the computational power of sigmoid versus boolean threshold circuits", Proc. of the 32nd Annual IEEE Symp. on Foundations of Computer Science, 1991, 767 - 776
[MT] W. Maass, G. Turan, "How fast can a threshold gate learn?", in: Computational Learning Theory and Natural Learning Systems: Constraints and Prospects, G. Drastal, S. J. Hanson and R. Rivest eds., MIT Press, to appear
[MR] J. L. McClelland, D. E. Rumelhart, "Parallel Distributed Processing", vol. 2, MIT Press (Cambridge, 1986)
[Me] N. Megiddo, "Linear Programming in linear time when the dimension is fixed", J. of the ACM, vol. 31, 1984, 114 - 127
[MP] M. Minsky, S. Papert, "Perceptrons: An Introduction to Computational Geometry", Expanded Edition, MIT Press (Cambridge, 1988)
[MD] J. Moody, C. J. Darken, "Fast learning in networks of locally-tuned processing units", Neural Computation, vol. 1, 1989, 281 - 294
[M] S. Muroga, "Threshold Logic and its Applications", Wiley (New York, 1971)
[N] E. I. Neciporuk, "The synthesis of networks from threshold elements", Probl. Kibern. No. 11, 1964, 49 - 62; engl. translation in: Autom. Expr., vol. 7, No. 1, 1964, 35 - 39
[Ni] N. J. Nilsson, "Learning Machines", McGraw-Hill (New York, 1971)
[PS] I. Parberry, G. Schnitger, "Parallel computation with threshold functions", Lecture Notes in Computer Science, vol. 223, Springer (Berlin, 1986), 272 - 290
[PG] T. Poggio, F. Girosi, "Networks for approximation and learning", Proc. of the IEEE, vol. 78 (9), 1990, 1481 - 1497
[R] F. Rosenblatt, "Principles of Neurodynamics", Spartan Books (New York, 1988)
[RM] D. E. Rumelhart, J. L. McClelland, "Parallel Distributed Processing", vol. 1, MIT Press (Cambridge, 1986)
[Sch] A. Schrijver, "Theory of Linear and Integer Programming", Wiley (New York, 1986)
[SS] H. T. Siegelmann, E. D. Sontag, "Neural networks with real weights: analog computational complexity", Report SYCON-92-05, Rutgers Center for Systems and Control (Oct. 1992)
[SBKH] K. Y. Siu, J. Bruck, T. Kailath, T. Hofmeister, "Depth efficient neural networks for division and related problems", to appear in IEEE Transactions on Inf. Theory
[SR] K. Y. Siu, V. Roychowdhury, "On optimal depth threshold circuits for multiplication and related problems", Tech. Report ECE-92-05, University of California, Irvine (March 1992)
[S1] E. D. Sontag, "Remarks on interpolation and recognition using neural nets", in: Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. Moody, D. S. Touretzky eds., Morgan Kaufmann (San Mateo, 1991), 939 - 945
[S2] E. D. Sontag, "Feedforward nets for interpolation and classification", J. Comp. Syst. Sci., vol. 45, 1992, 20 - 48
[S3] E. D. Sontag, private communication (July 1992)
[T] G. Turan, private notes (1989)
[V] L. G. Valiant, "A theory of the learnable", Comm. of the ACM, vol. 27, 1984, 1134 - 1142