On the learnability of recursive data
Barbara Hammer
Department of Mathematics/Computer Science, University of Osnabrück, Albrechtstr. 28, D-49069 Osnabrück, phone: +49(0)541/969-2488, fax: +49(0)541/969-2770, e-mail:
[email protected] Abstract: We establish some general results concerning PAC learning:
We find a characterization of the property that any consistent algorithm is PAC. It is shown that the shrinking width property is equivalent to PUAC learnability. By counterexample, PAC and PUAC learning are shown to be different concepts. We find conditions ensuring that any nearly consistent algorithm is PAC or PUAC, respectively. The VC dimension of recurrent neural networks and folding networks is infinite. For restricted inputs, however, bounds exist; these bounds are transferred to folding networks. We find conditions on the probability of the input space ensuring polynomial learnability: the probability of sequences or trees has to converge to 0 sufficiently fast with increasing length or height. Finally, we give an example of a concept class that requires exponentially growing sample sizes for accurate generalization.
Keywords: Recurrent Neural Networks, Folding Networks, Computational
Learning Theory, PAC Learning, VC Dimension.
1 Introduction
Neural networks are successfully used to deal with recursive data, e.g. for time series prediction, for modeling finite automata, for recognizing DNA sequences, or in system identification and control [M2, R, OG, NP, S1, S2]. They can naturally be generalized so that they can handle not only linear data but also trees. The so-called folding architecture is used for term classification tasks and can be included in more complex scenarios, e.g. controlling search heuristics in automatic theorem proving [SKG]. Recurrent networks are Turing universal [KS1, SS1]. Furthermore, folding networks are approximation complete if they are used for the approximation or classification of lists and trees [HS]. As a consequence, they are in principle well suited for these tasks. It remains to show that effective learning algorithms can be found. An effective learning algorithm should yield a well trained network in acceptable time, i.e. a network which approximates an unknown function up to a certain accuracy with high probability if some empirical data of the unknown function is given. Here, several problems occur: First, if a gradient descent method is used to select the optimal network parameters, numerical problems arise: Due to long-term dependencies the learning is either not stable or takes a prohibitive amount of time [BSF]. Additionally, a gradient descent may only find a local optimum in the weight space due to spurious minima of the error function, even for very simple architectures [SS2]. Second, learning is at least as complex as in the case of feedforward networks, where the loading problem, a decision problem correlated to the learning task, is NP-hard in some situations [BR, DSS, H2]. Third, if no further information about the probability of the examples used for training is available, an information theoretical barrier can prevent a guarantee for valid generalization. This is due to the infiniteness of the Vapnik-Chervonenkis (VC) dimension for standard function classes. In the case of time sequences or trees, the VC dimension of recurrent and folding networks restricted to inputs of length at most h grows with h, and therefore becomes infinite if arbitrary inputs are allowed. We will consider the question of accurate generalization in more detail. In section 2, we recall necessary and sufficient conditions for valid generalization in a distribution independent and in a distribution dependent setting [V]. The latter turns out to be more interesting in our case. Some results are interesting on their own, though: We establish a characterization of PUAC learnability and show that PUAC and PAC learnability are different
concepts. In particular, this answers problem 12.4 of [V]. We find a characterization of when any consistent algorithm is PAC. Furthermore, we find extensions of these characterizations ensuring that any nearly consistent algorithm is PAC or PUAC, respectively. In section 3, the network structures we will investigate are defined: recurrent neural networks and folding architectures as a generalization that can deal with trees [SKG]. Then we recall the bounds on the VC dimension of recurrent networks and transfer them to folding networks [KS2]. For arbitrary inputs and distributions, valid generalization cannot be guaranteed. However, conditions on the probability can be derived ensuring that the sample size necessary for valid generalization grows only polynomially in the generalization accuracy. On the other hand, one can find probabilities such that the function class is learnable, but only with exponentially growing sample size, answering problem 12.6 of [V]. We close with a conclusion and some open questions that naturally arise in this scenario.
2 Learnability of function classes
In this section we use the notation of [V]. As in the rest of this paper, any proofs and examples we present indicate that the corresponding fact is a new result. The term 'finitely characterizable', the scaled version of this term, and the scaled version of the shrinking width property are new definitions.
2.1 Fixed distribution learning
A distribution dependent setting consists of a measurable space (X, S, P) and a set F of functions from X to [0,1] that fulfill some measurability conditions. We assume that we want to learn an unknown function f ∈ F if m examples (x_1, f(x_1)), ..., (x_m, f(x_m)) are given. The x_i are i.i.d. with respect to P. A learning algorithm is a mapping h from the set of possible samples to F. We write h_m(f, x) for the image of (x_i, f(x_i))_{i=1}^m. We are interested in algorithms that, given a fixed number of examples, produce a function with a small distance from the unknown function; i.e. for growing m the error
$$d_P(f, h_m(f,x)) = \int_X |f(x) - h_m(f,x)(x)|\, dP$$
should be small with high probability. A concrete algorithm often simply minimizes the empirical error
$$\hat{d}_m(f, h_m(f,x), x) = \sum_{i=1}^{m} |f(x_i) - h_m(f,x)(x_i)|\,.$$
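As an illustration of these two quantities, the following minimal sketch (not from the paper; the threshold class, the uniform distribution, and all names are illustrative assumptions) implements a consistent learner that selects a hypothesis with minimal empirical error, and estimates the true error d_P by Monte Carlo sampling.

```python
import random

# Hypothetical finite class F: threshold functions f_t(x) = 1 if x >= t, else 0.
thresholds = [0.1 * k for k in range(11)]
F = [lambda x, t=t: 1.0 if x >= t else 0.0 for t in thresholds]

def empirical_error(f, sample):
    """Empirical error of hypothesis f on the sample (x_i, y_i)."""
    return sum(abs(y - f(x)) for x, y in sample)

def consistent_learner(sample):
    """Pick a hypothesis with minimal empirical error; since the target lies in F,
    the minimum is 0, so this learner is consistent."""
    return min(F, key=lambda f: empirical_error(f, sample))

def d_P(f, g, n_mc=20000, seed=0):
    """Monte Carlo estimate of d_P(f, g) for P = uniform distribution on [0, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_mc):
        x = rng.random()
        total += abs(f(x) - g(x))
    return total / n_mc

# Example run: learn the threshold 0.5 from m = 50 examples.
target = F[5]
sample = [(x, target(x)) for x in (random.random() for _ in range(50))]
h = consistent_learner(sample)
print(empirical_error(h, sample), d_P(h, target))
```

For this finite class the consistent choice also has a small true error; whether this happens in general is exactly the question the following definitions formalize.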
Since f ∈ F, a concrete algorithm can produce empirical error 0 for any function f and training sample x; such an algorithm is called consistent. We are interested in the question of whether, for a consistent algorithm, not only the empirical error but also the true error becomes small with increasing m. These questions can be formalized as follows. We write P^m for the induced product measure on X^m.
Definition 1 A function set F is said to be probably approximately correct or PAC learnable if an algorithm h exists (which is then called PAC, too) such that for any ε > 0
$$\sup_{f \in \mathcal{F}} P^m\bigl(x \mid d_P(f, h_m(f,x)) > \epsilon\bigr) \to 0 \quad (m \to \infty)\,.$$
F is called probably uniformly approximately correct or PUAC learnable if an algorithm h exists (which is then called PUAC, too) such that for any ε > 0
$$P^m\bigl(x \mid \sup_{f \in \mathcal{F}} d_P(f, h_m(f,x)) > \epsilon\bigr) \to 0 \quad (m \to \infty)\,.$$
F is said to be consistently PAC learnable if any consistent algorithm is PAC. F is said to be consistently PUAC learnable if any consistent algorithm is PUAC.
Note that in [V] the term 'consistently learnable' is the same as 'consistently PUAC learnable' as defined above. It would be desirable to obtain equivalent characterizations that do not refer to a learning algorithm and can be evaluated if only F is known. For a set X with pseudometric d, the covering number N(ε, X, d) denotes the smallest number n such that n points {x_1, ..., x_n} exist that cover X, i.e. for any x ∈ X an x_i exists with d(x, x_i) < ε. The packing number M(ε, X, d) is the largest number n such that n points {x_1, ..., x_n} exist that are ε-separated, i.e. d(x_i, x_j) > ε for i ≠ j. It can be shown that
$$M(2\epsilon, X, d) \le N(\epsilon, X, d) \le M(\epsilon, X, d)$$
[V] (Lemma 2.2).
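These quantities can be made concrete for a finite function class. The sketch below (an illustration under stated assumptions, reusing the threshold class F and the Monte Carlo estimate d_P from the sketch above) greedily builds a maximal ε-separated subset, whose size lies between N(ε, F, d_P) and M(ε, F, d_P).

```python
def greedy_net(functions, eps, d):
    """Greedily collect a maximal set of pairwise eps-separated functions.
    The result is an eps-packing (so its size is at most M(eps)) and, by
    maximality, also an eps-cover (so its size is at least N(eps))."""
    net = []
    for f in functions:
        if all(d(f, g) > eps for g in net):
            net.append(f)
    return net

# For the threshold class, d_P(f_s, f_t) = |s - t| under the uniform distribution,
# so the size of the net grows like 1/eps.
net = greedy_net(F, 0.25, d_P)
print(len(net))
```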
If the functions in F are binary valued, i.e. they map to {0,1} only, F is called a concept class. In this case learnability can be characterized completely in terms of the covering number, whereas for arbitrary F this condition is only sufficient.
Lemma 2 If the covering number N(ε, F, d_P) is finite for any ε > 0, then F is PAC learnable. If F is a concept class, this condition is equivalent to PAC learnability, and any algorithm h that is PAC up to accuracy ε requires at least log M(2ε, F, d_P) examples [V] (Theorems 6.3 and 6.5).
Up to now an equivalent characterization of PAC learnability for a function class is not known, but we have the following result:
Lemma 3 A function class is consistently PAC learnable if and only if it is finitely characterizable, i.e. for all ε > 0
$$\sup_{f \in \mathcal{F}} P^m\bigl(x \mid \exists g\, (\hat{d}_m(f,g,x) = 0 \wedge d_P(f,g) > \epsilon)\bigr) \to 0 \quad (m \to \infty)\,.$$
Proof: Suppose F is finitely characterizable and h is a consistent algorithm. Then for any f ∈ F the set {x | d_P(f, h_m(f,x)) > ε} is contained in the set {x | ∃g (\hat d_m(f,g,x) = 0 ∧ d_P(f,g) > ε)} because the algorithm is consistent. Therefore the probability we are interested in tends to 0. Suppose that F is not finitely characterizable, i.e. there exists a sequence n_1, n_2, ... tending to ∞ and functions f_1, f_2, ... in F such that P^{n_i}(x | ∃g (\hat d_{n_i}(f_i, g, x) = 0 ∧ d_P(f_i, g) > ε)) > δ for some positive δ. Then an algorithm can be constructed which, for sample size n_i, unknown function f_i, and any x in the above set, selects a function g with d_P(f_i, g) > ε and \hat d_{n_i}(f_i, g, x) = 0. This algorithm is consistent but not PAC. □
A class is finitely characterizable if, roughly speaking, the probability of samples that do not characterize a function uniquely up to accuracy ε tends to zero. This definition does not reflect the fact that an algorithm can use specific information about the function class and need not choose just any function with zero empirical error. Consequently, finite characterizability is not necessary for PAC learnability. There exist PAC learnable function classes in which not every consistent algorithm is PAC. Consider, e.g., the following scenario: X = [0,1], P the uniform distribution, F = {f : X → {0,1} | f(x) = 0 almost surely or
f(x) = 1 for all x}. Consider the algorithm which outputs the constant function 1 until a value x with image 0 appears in the sample. It is consistent
and PAC, because for any function except the constant function 1 the set with image 1 has measure 0. However, the consistent algorithm which outputs the function with image 1 only on the corresponding sample points and 0 at every other point is not PAC, because the constant function 1 cannot be learned. PUAC learnability is a stronger condition than PAC learnability: Consider as before X = [0,1], P the uniform distribution, and F = {f : X → {0,1} | f is almost surely constant}. Since F has a finite covering, it is PAC learnable. Assume there exists an algorithm h that is PUAC. Assume that for a sample x and function f the function h_m(f,x) is almost surely 0. Then for the function g which equals f on x and is almost surely 1, the distance d_P(g, h_m(g,x)) is 1, because h_m(f,x) = h_m(g,x). The same argument applies if h produces a function that is almost surely 1. Therefore P^m(x | sup_{f∈F} d_P(f, h_m(f,x)) > ε) = 1. This partially answers problem 12.4 in [V]. The following is a complete solution of this problem because it implies a characterization of PUAC learnability.
Lemma 4 The following three characterizations are equivalent:
F fulfills the shrinking width property, i.e. for any ε > 0
$$P^m\bigl(x \mid \exists f, g\, (\hat{d}_m(f,g,x) = 0 \wedge d_P(f,g) > \epsilon)\bigr) \to 0 \quad (m \to \infty)\,.$$
F is consistently PUAC learnable.
F is PUAC learnable.
Proof: The equivalence of the first two conditions can be found in [V]
(Theorem 6.2). We show the equivalence to the third characterization: Assume that the shrinking width property is violated. For an arbitrary learning algorithm h, a sample x and functions f and g with \hat d_m(f,g,x) = 0, it holds that d_P(f,g) ≤ d_P(f, h_m(f,x)) + d_P(g, h_m(g,x)). If d_P(f,g) > ε, at least one of the functions f and g has a distance of at least ε/2 from the function produced by the algorithm. Therefore {x | ∃f,g (\hat d_m(f,g,x) = 0 ∧ d_P(f,g) > ε)} ⊆ {x | sup_{f∈F} d_P(f, h_m(f,x)) > ε/2}; the probability of the latter set is at least as large as the probability of the first one, and since the first one does not tend to 0, h is not PUAC. □
In particular, this shows the remarkable property that if one PUAC algorithm exists, every consistent algorithm is PUAC. Obviously, the difference between PAC and PUAC learning also follows from this abstract result,
since PUAC and consistent PUAC learnability coincide. In contrast, in the non-uniform case finite characterizability is a stronger condition than PAC learnability.
The above characterizations are not fully satisfactory, since a concrete learning algorithm often only produces a solution which merely approximately minimizes the empirical error rather than bringing it to 0. This can, for example, be due to the complexity of the minimization problem. In the case of model free learning, which will be considered later, it is possible that a minimum simply does not exist. Therefore the above terms are weakened in the following definition.
Definition 5 An algorithm h is asymptotically ε-consistent if
$$\sup_{f\in\mathcal{F}} P^m\bigl(x \mid \hat{d}_m(f, h_m(f,x), x) > \epsilon\bigr) \to 0 \quad (m \to \infty)\,.$$
h is asymptotically uniformly ε-consistent if
$$P^m\bigl(x \mid \sup_{f\in\mathcal{F}} \hat{d}_m(f, h_m(f,x), x) > \epsilon\bigr) \to 0 \quad (m \to \infty)\,.$$
A function class F is finitely ε₁-ε₂-characterizable if
$$\sup_{f\in\mathcal{F}} P^m\bigl(x \mid \exists g\,(\hat{d}_m(f,g,x) \le \epsilon_1 \wedge d_P(f,g) > \epsilon_2)\bigr) \to 0 \quad (m \to \infty)\,.$$
F fulfills the ε₁-ε₂-shrinking width property if
$$P^m\bigl(x \mid \exists f, g\,(\hat{d}_m(f,g,x) \le \epsilon_1 \wedge d_P(f,g) > \epsilon_2)\bigr) \to 0 \quad (m \to \infty)\,.$$
F is ε₁-consistently PAC learnable up to accuracy ε₂ if any asymptotically ε₁-consistent algorithm is PAC up to accuracy ε₂. F is ε₁-consistently PUAC learnable up to accuracy ε₂ if any asymptotically uniformly ε₁-consistent algorithm is PUAC up to accuracy ε₂. In the case ε = 0 and ε₁ = 0 the previous definitions result. Now we
consider the question of whether an algorithm that minimizes the empirical error up to a certain degree is PAC up to a certain accuracy. Analogously to the case of a consistent algorithm, the following lemma holds.
Lemma 6 F is ε₁-consistently PAC learnable up to accuracy ε if and only if F is finitely ε₁-ε-characterizable. F is ε₁-consistently PUAC learnable up to accuracy ε if and only if F fulfills the ε₁-ε-shrinking width property.
Proof: Since
$$\sup_{f\in\mathcal{F}} P^m\bigl(x \mid d_P(h_m(f,x), f) > \epsilon\bigr) \le \sup_{f\in\mathcal{F}} P^m\bigl(x \mid \hat{d}_m(h_m(f,x), f, x) > \epsilon_1\bigr) + \sup_{f\in\mathcal{F}} P^m\bigl(x \mid \exists g\,(\hat{d}_m(g,f,x) \le \epsilon_1 \wedge d_P(g,f) > \epsilon)\bigr),$$
any asymptotically ε₁-consistent algorithm is PAC up to accuracy ε if the class is finitely ε₁-ε-characterizable. If, on the contrary, the class is not finitely ε₁-ε-characterizable, there exist numbers n_1, n_2, ... tending to ∞ and functions f_1, f_2, ... such that P^{n_i}(x | ∃g (\hat d_{n_i}(g, f_i, x) ≤ ε₁ ∧ d_P(g, f_i) > ε)) > δ for some positive δ. But then any algorithm which, for a sample x in the above set and unknown function f_i, chooses a function g such that the above inequalities hold can be extended to an asymptotically ε₁-consistent algorithm that is not PAC up to accuracy ε. Since
$$P^m\bigl(x \mid \sup_{f\in\mathcal{F}} d_P(h_m(f,x), f) > \epsilon\bigr) \le P^m\bigl(x \mid \sup_{f\in\mathcal{F}} \hat{d}_m(h_m(f,x), f, x) > \epsilon_1\bigr) + P^m\bigl(x \mid \exists f, g\,(\hat{d}_m(g,f,x) \le \epsilon_1 \wedge d_P(g,f) > \epsilon)\bigr),$$
any asymptotically uniformly ε₁-consistent algorithm is PUAC up to accuracy ε in case the ε₁-ε-shrinking width property holds. In contrast, if this property is violated, a sequence n_1, n_2, ... tending to ∞ with P^{n_i}(x | ∃f,g (\hat d_{n_i}(g,f,x) ≤ ε₁ ∧ d_P(g,f) > ε)) > δ for some positive δ leads to a uniformly ε₁-consistent algorithm which is not PUAC up to accuracy ε if, for any x in the above set and corresponding unknown function f, the algorithm produces a function g such that the above inequality holds. □
Since
$$\{x \mid \exists f,g\,(\hat{d}_m(f,g,x) \le \epsilon_1 \wedge d_P(f,g) > 3\epsilon_2)\} \subseteq \{x \mid \exists f,g\,(\hat{d}_m(f,g,x) \le \epsilon_1 \wedge d_P(h_m(f,x), h_m(g,x)) > \epsilon_2)\} \cup \{x \mid \sup_{f\in\mathcal{F}} d_P(f, h_m(f,x)) > \epsilon_2\},$$
F fulfills the ε₁-3ε₂-shrinking width property if one algorithm exists that is PUAC up to accuracy ε₂ and additionally the following stability condition holds for m → ∞:
$$P^m\bigl(x \mid \exists f,g\,(\hat{d}_m(f,g,x) \le \epsilon_1 \wedge d_P(h_m(f,x), h_m(g,x)) > \epsilon_2)\bigr) \to 0\,.$$
In other words, small deviations of the inputs lead to small deviations of the output function of this algorithm. There exist function classes that are PUAC learnable although the ε₁-ε₂-shrinking width property is violated. Consider, e.g., the class F = {f_l : [0,1] → [0,1] | l ∈ ℕ} (see Fig. 1) with
$$f_l(x) = \begin{cases} \dfrac{\epsilon_1}{2}\cdot\dfrac{1}{1+e^{-l}} & \text{if } x \in \bigcup_{i=0}^{2^{l-1}-1}\left[\dfrac{2i}{2^l}, \dfrac{2i+1}{2^l}\right], \\ 1 - \dfrac{\epsilon_1}{2}\cdot\dfrac{1}{1+e^{-l}} & \text{otherwise.} \end{cases}$$
These functions are PUAC learnable since the observed values uniquely determine the corresponding function. However, any two functions have a distance of at least (1 − ε₁)/2 if the uniform distribution on [0,1] is considered, whereas for any sample x two functions with an empirical distance of at most ε₁/2 exist.
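The following small sketch checks these two claims numerically; it relies on the reconstructed form of f_l given above, so the exact definition of f is an assumption, and ε₁ = 0.1 is an arbitrary illustrative choice.

```python
import math, random

EPS1 = 0.1   # illustrative choice for epsilon_1

def f(l, x):
    """Reconstructed f_l: a value of EPS1/2 * 1/(1+e^-l) on the union of the
    intervals [2i/2^l, (2i+1)/2^l], and 1 minus that value otherwise."""
    low = int(x * 2 ** l) % 2 == 0
    v = (EPS1 / 2.0) / (1.0 + math.exp(-l))
    return v if low else 1.0 - v

def d_uniform(l1, l2, n_mc=200000, seed=1):
    """Monte Carlo estimate of d_P(f_l1, f_l2) under the uniform distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_mc):
        x = rng.random()
        total += abs(f(l1, x) - f(l2, x))
    return total / n_mc

# Any two functions are far apart in the mean: roughly 1/2, hence > (1 - EPS1)/2.
print(d_uniform(1, 2), (1 - EPS1) / 2)

# For a finite sample, two functions with the same interval pattern on all sample
# points exist (pigeonhole over the 2^5 possible patterns); they differ on the
# sample by at most EPS1/2 pointwise.
sample = [random.random() for _ in range(5)]
pattern = lambda l: tuple(int(x * 2 ** l) % 2 for x in sample)
seen = {}
l1 = l2 = None
for l in range(1, 34):
    p = pattern(l)
    if p in seen:
        l1, l2 = seen[p], l
        break
    seen[p] = l
print(l1, l2, max(abs(f(l1, x) - f(l2, x)) for x in sample))
```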
Figure 1: Functions f_l for l = 1, 2, 3.
In fact an additional stability condition holds for any PUAC algorithm if F fulfills the ε₁-ε₂-shrinking width property, since
$$P^m\bigl(x \mid \exists f,g\,(\hat{d}_m(f,g,x) \le \epsilon_1 \wedge d_P(h_m(f,x), h_m(g,x)) > 3\epsilon_2)\bigr) \le P^m\bigl(x \mid \sup_{f\in\mathcal{F}} d_P(f, h_m(f,x)) > \epsilon_2\bigr) + P^m\bigl(x \mid \exists f,g\,(\hat{d}_m(f,g,x) \le \epsilon_1 \wedge d_P(f,g) > \epsilon_2)\bigr)\,.$$
2.2 Fixed distribution learning - the model free case
In the case of neural network learning we often know nothing about the function we want to learn; hence the assumption that the unknown function is contained in F is unrealistic. Assume the unknown function f_0 is in a set F_0, which may be different from F. Then we can try to find a function in F approximating f_0 best. The following notions are a special case of agnostic learning as introduced in [V] (3.3). The minimal error achievable in F is
$$J_P(f_0) = \inf_{f\in\mathcal{F}} d_P(f, f_0)\,.$$
Given a sample (x_i, f_0(x_i))_i, the minimal empirical error achievable in F is
$$\hat{J}_m(f_0, x) = \inf_{f\in\mathcal{F}} \sum_i |f_0(x_i) - f(x_i)|\,.$$
We define (F_0, F) to be PAC if an algorithm h exists with
$$\sup_{f_0\in\mathcal{F}_0} P^m\bigl(x \mid d_P(f_0, h_m(f_0,x)) - J_P(f_0) > \epsilon\bigr) \to 0\,.$$
PUAC learnability is characterized by
$$P^m\bigl(x \mid \sup_{f_0\in\mathcal{F}_0}\bigl(d_P(f_0, h_m(f_0,x)) - J_P(f_0)\bigr) > \epsilon\bigr) \to 0\,.$$
A consistent algorithm satisfies \hat d_m(f_0, h_m(f_0,x), x) − \hat J_m(f_0, x) = 0 for any f_0 and sample x. As before, finiteness of the covering number of the approximating class F ensures PAC learnability. If at least one consistent algorithm exists, the property
$$\sup_{f_0\in\mathcal{F}_0} P^m\bigl(x \mid \exists g\in\mathcal{F}\,(\hat{d}_m(f_0,g,x) = \hat{J}_m(f_0,x) \wedge d_P(f_0,g) - J_P(f_0) > \epsilon)\bigr) \to 0$$
for m → ∞ is equivalent to the fact that every consistent algorithm is PAC. On the other hand, a consistent algorithm does not exist unless for any f_0 ∈ F_0 a function that approximates f_0 best is an element of F. We can, of course, consider scale sensitive variants of these terms as defined in the last subsection. These enable us to deal with the fact that a concrete algorithm does not find a best solution but only one that is near optimal. If at least one algorithm exists which is PAC up to accuracy ε, then the scaled version of finite characterizability is equivalent to ε-consistent PAC learnability in this case, too.
2.3 Distribution free learning
In the distribution independent setting we know nothing about the underlying probability measure P. Assume that P is the set of possible probability measures on (X, S). As before, F denotes a set of functions from X to [0,1]. We prefix a sup_{P∈P} to the terms in Def. 1 that shall converge to 0. Then the notions of PAC and PUAC learnability extend to this case as well. If F maps to {0,1}, the Vapnik-Chervonenkis dimension VC(F) is the largest size of a set S = {x_1, ..., x_n} in X such that any mapping S → {0,1} can be written as the restriction f|_S for some f ∈ F. The pseudodimension PS(F) is the largest size of a set S = {x_1, ..., x_n} in X such that reference points {r_1, ..., r_n} exist in ℝ so that any mapping d : S → {0,1} can be written as d(x_i) = H(f(x_i) − r_i) for some f ∈ F, where H denotes the Heaviside function
$$H(x) = \begin{cases} 1 & x \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$
A set S is said to be shattered by F in both of the above cases, where each dichotomy can be implemented using F as specified above. It holds that PS(F) = VC({f_e : X × ℝ → {0,1} | f_e(x, y) = H(f(x) − y), f ∈ F}) [V] (p. 300). For d = PS(F) ≥ 2 and ε < 0.94 it is the case that M(ε, F, d_P) ≤ 2((2e/ε) ln(2e/ε))^d for an arbitrary P [V] (Theorem 4.2). Furthermore, for d = VC(F) and ε < 1/d a probability measure P exists such that M(ε, F, d_P) ≥ 2^d; take, e.g., the uniform distribution on d points that are shattered.
Definition 7 A set F of functions fulfills the property of uniform convergence of empirical means (UCEM property) if
$$q(m, \epsilon) := \sup_{P\in\mathcal{P}} P^m\Bigl(x \;\Big|\; \sup_{f\in\mathcal{F}} |d_P(f, 0) - \hat{d}_m(f, 0, x)| > \epsilon\Bigr) \to 0\,.$$
(Here, 0 denotes the constant function 0. The terms d_P(f, 0) and \hat d_m(f, 0, x) are the mean and the empirical mean of f, respectively.)
If the UCEM property is valid, any algorithm producing an empirical error smaller than ε is PAC up to accuracy 2ε [V] (Theorem 6.1). The convergence rate of the algorithm is limited by the convergence of the empirical means.
q(m, ε) can be bounded in terms of the pseudodimension or the VC dimension as follows: For d = PS(F) < ∞ and ε < 0.94 it holds that
$$q(m, \epsilon) \le 8\left(\frac{16e}{\epsilon}\ln\frac{16e}{\epsilon}\right)^d e^{-m\epsilon^2/32};$$
if d = VC(F) < ∞ then
$$q(m, \epsilon) \le 4\left(\frac{2em}{d}\right)^d e^{-m\epsilon^2/8};$$
i.e. finiteness of these dimensions ensures learnability [V] (Theorems 7.1 and 7.2). For a concept class, finiteness of VC(F) is equivalent to the UCEM property [V] (Theorem 7.9), whereas for a function class a weaker condition, finiteness of a scale sensitive version of the pseudodimension, is sufficient [ABCH]. Explicit bounds on the number of examples can be derived as follows: For PS(F) = d < ∞ any consistent algorithm is PUAC up to accuracy ε with confidence δ if O((d/ε) ln(1/ε) ln ln(1/ε) + (1/ε) ln(1/δ)) examples are used. If F is a concept class, the number of examples sufficient for a consistent algorithm to be PAC up to accuracy ε with confidence δ can be improved to O((d/ε) ln(1/ε) + (1/ε) ln(1/δ)) [V] (Theorems 7.5 and 7.6). Finiteness of the VC dimension is also necessary for PAC learnability. In the distribution independent learning of concept classes the following three characterizations coincide: F is PAC learnable, F is PUAC learnable, VC(F) < ∞. In the following we will consider the covering number of function classes and the VC or pseudodimension, respectively. Finiteness of the former ensures learnability, and is equivalent to learnability for concept learning; finiteness of the latter ensures worst case learnability (in particular, any consistent algorithm is PAC), and this condition is also necessary in concept learning.
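As a small numeric illustration of these rates, the sketch below takes the VC-dimension bound on q(m, ε) quoted above and doubles m until the bound drops below a desired confidence δ; the starting point and the doubling strategy are arbitrary implementation choices, not part of [V].

```python
import math

def q_bound(m, eps, d):
    """Bound 4 (2em/d)^d exp(-m eps^2 / 8) on q(m, eps) for VC dimension d."""
    return 4.0 * (2.0 * math.e * m / d) ** d * math.exp(-m * eps ** 2 / 8.0)

def sample_size(eps, delta, d):
    """Smallest power-of-two multiple of d for which the bound is below delta."""
    m = d
    while q_bound(m, eps, d) > delta:
        m *= 2          # the bound eventually decreases in m, so this terminates
    return m

# Example: d = 10, accuracy 0.1, confidence 0.05.
print(sample_size(0.1, 0.05, 10))
```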
3 Folding networks
Let σ : ℝ → ℝ denote an activation function. We will consider the Heaviside function or the sigmoidal function sgd(x) = 1/(1 + e^{-x}). As usual, a feedforward neural network consists of n input neurons n_1, ..., n_n and N computation units n_{n+1}, ..., n_{n+N}. Some of the neurons are connected by directed edges n_i → n_j so that the resulting graph is acyclic. The
neurons without a predecessor are exactly the input neurons. The neurons
n_{i_1}, ..., n_{i_m} without a successor are called output neurons. Each computation unit is assigned a bias b_i ∈ ℝ and each connection n_i → n_j is assigned a weight w_{ij} ∈ ℝ. W denotes the total number of weights and biases. A network computes the function f : ℝ^n → ℝ^m induced by the values of the output units, i.e. f(x) = (n_{i_1}(x), ..., n_{i_m}(x)), where n_i(x) is computed recursively as n_i(x) = x_i for input units, i.e. if i ≤ n, and n_i(x) = σ(Σ_{n_j → n_i} w_{ji} n_j(x) + b_i) for the other units.
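A minimal sketch of this computation (the data layout and names are illustrative, not the paper's notation): units are processed in a topological order, and each computation unit applies σ to its weighted sum plus bias. Both activation functions used below are included.

```python
import math

def sgd(x):
    """Sigmoidal activation 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def heaviside(x):
    """H(x) = 1 if x >= 0, else 0."""
    return 1.0 if x >= 0 else 0.0

def feedforward(inputs, units, sigma=sgd):
    """inputs: values x_1, ..., x_n of the input units.
    units: list of (bias, {predecessor_index: weight}) for the computation units,
           listed in a topological order (indices are 0-based).
    Returns the values of all units; the caller selects the output units."""
    values = list(inputs)
    for bias, in_weights in units:
        s = bias + sum(w * values[j] for j, w in in_weights.items())
        values.append(sigma(s))
    return values

# Example: two inputs, one hidden unit, one output unit computing an AND-like gate.
net = [(-1.5, {0: 1.0, 1: 1.0}),   # hidden unit 2
       (-0.5, {2: 1.0})]           # output unit 3
print(feedforward([1.0, 1.0], net, sigma=heaviside)[-1])   # 1.0
```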
Figure 2: Left: Folding network; Right: If a tree serves as an input the network can be unfolded formally according to the structure of the tree. In the resulting graph the output can be computed directly.
Definition 8 (ℝ^m)^* denotes the set of finite sequences with elements in ℝ^m. A function f : ℝ^{m+l} → ℝ^l and a vector y ∈ ℝ^l induce a function f̃_y : (ℝ^m)^* → ℝ^l as follows:
$$\tilde{f}_y((x_1, \ldots, x_i)) = \begin{cases} y & \text{if } i = 0 \\ f(\tilde{f}_y((x_1, \ldots, x_{i-1})), x_i) & \text{otherwise.} \end{cases}$$
A function f : (ℝ^m)^* → ℝ^n is computed by a recurrent neural network if there exist feedforward networks g : ℝ^{m+l} → ℝ^l and p : ℝ^l → ℝ^n such that f = p ∘ g̃_y for an initial context y.
By introducing a delay in the inputs we could restrict the network g to a network without hidden units; additionally, we could consider the network p to be part of the recurrent network. But this notation has the advantage
that a distinction is possible between recurrent parts and parts where no recurrence is needed; besides, this notation can easily be extended to other recursive structures, i.e. trees, as follows:
Definition 9 (ℝ^m)^{*k} denotes the set of trees with labels in ℝ^m in which any node except the empty tree nil has k successors. A function f : ℝ^{kl+m} → ℝ^l and a vector y ∈ ℝ^l induce a function f̃_y : (ℝ^m)^{*k} → ℝ^l as follows:
$$\tilde{f}_y(t) = \begin{cases} y & \text{if } t = \mathrm{nil} \\ f(\tilde{f}_y(t_1), \ldots, \tilde{f}_y(t_k), r) & \text{otherwise.} \end{cases}$$
In the second case r is the root label of t and t_1, ..., t_k are the subtrees of t. A function f : (ℝ^m)^{*k} → ℝ^n is computed by a folding network if there exist feedforward neural networks g : ℝ^{kl+m} → ℝ^l and p : ℝ^l → ℝ^n such that f = p ∘ g̃_y for an initial context y.
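The recursion of Definitions 8 and 9 is short enough to state directly as code. The following sketch is an illustrative implementation, with trees encoded as nested tuples, None standing for nil, and arbitrary callables g and p in place of the two feedforward networks.

```python
def fold(g, y, t):
    """The induced map g~_y: g~_y(nil) = y, and
    g~_y(t) = g(g~_y(t_1), ..., g~_y(t_k), label) for t = (label, [t_1, ..., t_k])."""
    if t is None:                  # the empty tree nil
        return y
    label, children = t
    contexts = [fold(g, y, child) for child in children]
    return g(*contexts, label)

def folding_network(g, p, y):
    """f = p o g~_y, the function computed by the folding network given by (g, p, y)."""
    return lambda t: p(fold(g, y, t))

# Example with k = 2: sum up all labels of a binary tree.
g_sum = lambda left, right, label: left + right + label
tree = (1.0, [(2.0, [None, None]), (3.0, [(4.0, [None, None]), None])])
print(folding_network(g_sum, p=lambda c: c, y=0.0)(tree))   # 10.0

# A sequence (x_1, ..., x_n) corresponds to the degenerate tree with root label x_n
# and the prefix (x_1, ..., x_{n-1}) as its single subtree, so recurrent networks
# are the special case k = 1.
```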
A folding network is depicted in Fig. 2. If the domain of the labels is restricted to a finite alphabet, any mapping from trees to a real vector space can be approximated arbitrarily well. It can be shown that in this case a folding network with a restricted number of neurons in the recursive part and one hidden layer in the second network p is sufficient [HS]. Note that in a concrete computation we can proceed as follows: For a fixed length of a sequence or a fixed structure of a tree, respectively, we can find an equivalent feedforward network in which some weights are shared according to the recursive structure; see Fig. 2. In the following we will drop the biases since they can be simulated by additional weights receiving constant input 1. The initial context can be simulated by an additional weight and input unit where the input is 1 at the first computation step and 0 afterwards, while any other input sequence is extended by an entry 0 at the beginning. Therefore, we will drop the subscript y and assume that the initial context is the zero vector. Note that for any height h a tree of height at most h can be extended by zero entries to an equivalent input of height exactly h. By an architecture we mean the set of functions computed by a folding network with a fixed grouping of neurons but arbitrarily chosen weights. In the remainder we will consider learnability with respect to such an architecture.
3.1 Worst case analysis
(ℝ^m)^{*k}, abbreviated X, denotes the set of trees where each node has k successors; for k = 1 these are sequences. We must define a σ-algebra on the trees. For a fixed structure we can consider the product algebra of the L labels, i.e. the Borel algebra on (ℝ^m)^L. X is the union of all these sets of trees with a fixed structure. We can simply use the σ-algebra generated by the direct sum of these algebras on (ℝ^m)^L. The only fact we will need in the following is that the sets of trees with a certain height are measurable. If X_h denotes the subset of trees of height at most h, then X_h ⊆ X_{h+1} and X = ∪_h X_h. For a fixed folding architecture F we define the restriction
F_h = {f|_{X_h} | f ∈ F}.
One can derive bounds on the VC dimension and pseudodimension of F_h as follows: Since any input can be expanded to an equivalent input in X_h with the maximal number of labels, we can restrict the input set to inputs with a fixed structure. For these inputs the architecture can be formally unfolded, resulting in a standard feedforward architecture. The upper bounds for this feedforward architecture transfer to the folding architecture. Note that according to the recursive structure some weights are shared in the unfolded network, which has consequences for the bounds. The following bounds are an immediate consequence of [KS2] (Theorems 1 and 2); the generalization of the proofs to k ≥ 2 is straightforward.
Theorem 10 Assume F is a folding architecture with N neurons and W weights. If the activation function is H then
$$VC(\mathcal{F}_h) = \begin{cases} O(NWk + Wh\log k + W\log W) & \text{if } k \ge 2 \\ O(NW + W\log(hW)) & \text{if } k = 1\,. \end{cases}$$
If the activation function is sgd then
$$PS(\mathcal{F}_h) = \begin{cases} O(W^2N^2k^{2h}) & \text{if } k \ge 2 \\ O(W^2N^2h^2) & \text{if } k = 1\,. \end{cases}$$
Additionally, from [KS2, H1] the lower bounds VC(F_h) = Ω(W log(Wh)) for σ = H and PS(F_h) = Ω(Wh) for σ = sgd follow. In the case of the Heaviside function one can improve the lower bound for k ≥ 2:
Theorem 11 If σ = H and k ≥ 2 then VC(F_h) = Ω(Wh + W log W).
Figure 3: Architecture shattering Ω(Wh + W log W) points.
Proof: We assume k = 2; the generalization to k > 2 follows directly. We write the 2^{h−1} dichotomies of h − 1 numbers as rows of a matrix and denote the i-th column by e_i. e_i gives rise to a tree of height h as follows: In the 2^{h−1} leaves we write the components of e_i + (2, 4, 6, ..., 2^h); the other labels are filled with 0. The function g : {0,1}² × ℝ → {0,1}, g(x_1, x_2, x_3) = x_1 ∨ x_2 ∨ (x_3 ∈ [2j + 1/2, 2j + 3/2]), can be computed by a feedforward network with activation H with 11 weights including biases. g induces a function g̃_0 which computes exactly the component j of e_i for the input tree corresponding to e_i. Since any dichotomy of the h − 1 trees can be found as a row of the matrix with columns e_i, the trees corresponding to the e_i can be shattered by this architecture. For any w ∈ ℕ there exists a feedforward architecture l with activation H and w weights which shatters Ω(w log w) points [M1]. We can trivially expand the input dimension of this architecture by 2 such that (denoting this architecture by the same symbol) the induced mapping l̃_0 coincides with the original mapping l on input trees of height 1. We want to combine several copies of g̃_0 and l̃_0 such that all sets shattered by each single architecture are shattered by one combined architecture. Generally, we can proceed as follows: For any feedforward architectures f¹ : ℝ^{2+n_1} → {0,1}, ..., fⁿ : ℝ^{2+n_n} → {0,1} with w_1, ..., w_n weights such that the corresponding induced mappings f̃¹_0, ..., f̃ⁿ_0 shatter t_1, ..., t_n input trees, a folding architecture with w_1 + ... + w_n + 9n + 1 weights can be constructed shattering t_1 + ... + t_n input trees with the same height as
before: Define f : ℝ^{2n+n_1+...+n_n+1} → {0,1}^n, f(x_1, ..., x_n, y_1, ..., y_n, y) = (f¹(x_1, y_1) ∧ y = b_1, ..., fⁿ(x_n, y_n) ∧ y = b_n) for pairwise different b_i ∈ ℝ. f can be implemented with w_1 + ... + w_n + 8n weights. If a tree t with labels in ℝ^{n_i} is expanded to a tree t' with labels in ℝ^{n_1+...+n_n+1} in such a way that each label x is expanded to (0, ..., 0, x, 0, ..., 0, b_i), then the induced mapping computes f̃_0(t') = (0, ..., 0, f̃^i_0(t), 0, ..., 0). Consequently, the folding architecture which combines f̃_0 with an OR-connection shatters the t_1 + ... + t_n input trees obtained by expanding the inputs shattered by the single architectures f¹, ..., fⁿ in the above way. The height of the trees remains the same, and this architecture has w_1 + ... + w_n + 9n + 1 weights. Now we take ⌊(W − 20)/40⌋ architectures induced by g, each shattering h − 1 trees of height h, and one architecture induced by l with ⌊W/2⌋ weights shattering Ω(W log W) points. The combined folding architecture as described above shatters Ω(Wh + W log W) trees of height at most h and has no more than W parameters. See Fig. 3 for the entire architecture. Note that the input dimension could be reduced because the shattered inputs have labels equal to 0 at most places. □
As a consequence, for inputs of unlimited height the worst case analysis does not lead to any bounds, since the VC and pseudodimension become infinite with increasing height. But of course for fixed h the covering numbers can be bounded for any measure P via N(ε, F_h, d_P) = O(((1/ε) ln(1/ε))^d) with d = VC(F_h) or d = PS(F_h), respectively [V] (Theorem 4.2). Here we have not considered the case where the activation function is the identity; bounds for the pseudodimension in the case k = 1 are derived in [DS].
3.2 Conditions for polynomial learnability
Let F denote the set of functions computed by a folding architecture, and let P denote a probability measure on X. We set p_h = P(X_h). Then p_h ≤ p_{h+1} and p_h → 1 (h → ∞). P_h denotes the probability that P induces on the subset X_h of X. It is possible to ensure learnability, and even learnability with a sample size growing only polynomially in 1/ε, if we make assumptions concerning p_h.
Theorem 12 F is learnable up to accuracy ε with confidence δ if at least
m examples are used with
$$m = \begin{cases} c_1\, PS(\mathcal{F}_h)\, \epsilon^{-2}\ln(\delta^{-1}\ln(\epsilon^{-1})) & \text{if } \mathcal{F} \text{ is a function class,} \\ c_2\, VC(\mathcal{F}_h)\, \epsilon^{-1}\ln(\delta^{-1}\ln(\epsilon^{-1})) & \text{if } \mathcal{F} \text{ is a concept class.} \end{cases}$$
c_1 and c_2 are constants and h is taken such that p_h > 1 − ε/2. In particular, F is polynomially learnable if 1 − p_h = O(d^{-α}), where d = VC(F_h) or d = PS(F_h) and α is a positive constant.
Proof: Assume ε/2 > 1 − p_h. Since P(X_h^C) = 1 − p_h, the distance d_P(f, g) is smaller than (1 − p_h) + p_h d_{P_h}(f|_{X_h}, g|_{X_h}) for f, g ∈ F. Therefore
$$M(\epsilon, \mathcal{F}, d_P) \le M\!\left(\frac{\epsilon + p_h - 1}{p_h}, \mathcal{F}_h, d_{P_h}\right) \le 2\left(\frac{2e\,p_h}{\epsilon + p_h - 1}\ln\frac{2e\,p_h}{\epsilon + p_h - 1}\right)^d$$
where d = VC(F_h) or d = PS(F_h). The minimum risk algorithm is PAC up to accuracy ε and confidence δ if
$$m \ge \begin{cases} \dfrac{8}{\epsilon^2}\ln\dfrac{8\,N(\epsilon/2, \mathcal{F}, d_P)}{\delta} & \text{if } \mathcal{F} \text{ is a function class,} \\ \dfrac{32}{\epsilon}\ln\dfrac{N(\epsilon/2, \mathcal{F}, d_P)}{\delta} & \text{if } \mathcal{F} \text{ is a concept class} \end{cases}$$
examples are used [V] (Theorems 6.3 and 6.4). That is,
$$m = \begin{cases} O\bigl(PS(\mathcal{F}_h)\,\epsilon^{-2}\ln(\delta^{-1}\ln(\epsilon^{-1}))\bigr) & \text{if } \mathcal{F} \text{ is a function class,} \\ O\bigl(VC(\mathcal{F}_h)\,\epsilon^{-1}\ln(\delta^{-1}\ln(\epsilon^{-1}))\bigr) & \text{if } \mathcal{F} \text{ is a concept class,} \end{cases}$$
examples are sufficient. This is valid for any h such that p_h > 1 − ε/2. Assume d ≤ g(h) for a strictly increasing function g. If this is polynomial in 1/ε, the bounds for m are polynomial, too. A sufficient condition is p_h > 1 − (g(h))^{-α}/2 for a number α > 0, since in this case p_h > 1 − ε/2 is valid for h > g^{-1}(ε^{-1/α}). □
If the number of examples ensuring that an arbitrary consistent algorithm is PAC up to accuracy ε with confidence δ has to be bounded, we can argue as follows: For p_h > 1 − ε/2,
$$\sup_{f\in\mathcal{F}} P^m\bigl(x \mid d_P(f, h_m(f,x)) > \epsilon\bigr) \le \sup_{f\in\mathcal{F}} P^m\bigl(x \mid d_{P_h}(f|_{X_h}, h_m(f,x)|_{X_h}) > \epsilon/2\bigr) \le P_h^{m/2}\bigl(x \mid \exists f, g \in \mathcal{F}_h\,(\hat{d}_{m/2}(f,g,x) = 0 \wedge d_{P_h}(f,g) > \epsilon/2)\bigr) + (2\epsilon)^{m/2},$$
because the probability that fewer than m/2 examples are contained in X_h can be bounded by (2ε)^{m/2}. The supremum of the first term over all probability measures satisfying p_h > 1 − ε/2 can be bounded using the UCEM property of F_h [V] (Example 5.5). As before, the number of examples is polynomial in 1/ε if the VC or pseudodimension of F_h, respectively, is polynomial in 1/ε, i.e. if 1 − p_h vanishes logarithmically, polynomially, or exponentially with h, respectively, as follows from Theorem 10.
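The recipe behind Theorem 12 can be stated as a short computation. In the sketch below the constant c and the simplified logarithmic factor are illustrative stand-ins for the unspecified constants of the theorem, and p(h) and dim(h) are user-supplied model assumptions.

```python
import math

def sample_size(eps, delta, p, dim, concept_class=True, c=8.0):
    """Choose the smallest height h with p(h) > 1 - eps/2, then plug the VC or
    pseudodimension of F_h into the sample size bound of Theorem 12."""
    h = 1
    while p(h) <= 1.0 - eps / 2.0:
        h += 1
    d = dim(h)
    if concept_class:
        m = c * d / eps * math.log(1.0 / (delta * eps))
    else:
        m = c * d / eps ** 2 * math.log(1.0 / (delta * eps))
    return h, math.ceil(m)

# Example: exponentially decaying tails p_h = 1 - 2^(-h) and dim(h) = 50 h
# (the Theta(Wh) regime for threshold activation) give h ~ log2(2/eps) and a
# sample size polynomial in 1/eps.
print(sample_size(0.05, 0.05, lambda h: 1.0 - 2.0 ** (-h), lambda h: 50 * h))
```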
3.3 Exponentially growing sample size
If we consider a distribution that assigns probability 1 to the set of trees of a certain height h, a sample size of order (d/ε) ln(1/ε), with d = VC(trees of height h), is sufficient for concept learning up to accuracy ε. For a distribution that selects trees of a larger height h, the VC dimension and consequently the bound increase with h. But note that, although this does not lead to a uniform bound that is polynomial in 1/ε, any single such distribution ensures polynomial learnability. Now we construct a folding architecture and a probability measure for which the sample size necessary for valid generalization grows more than polynomially in the accuracy. This in particular answers a question which Vidyasagar posed as problem 12.6 in [V]. Using Lemma 2 it is sufficient to show that the covering number of the function class that shall be learned grows more than exponentially in 1/ε. The idea of the construction is as follows: The input space X is a direct sum of subspaces, namely the trees of height h. On these subspaces we can find function sets with growing covering numbers. Then we can shift the probabilities of these subspaces so that trees with arbitrarily growing heights have to be considered to ensure a certain accuracy.
Figure 4: Functions f_l for l = 1, 2, 3.
Define F_h = {f_1, ..., f_h} with
$$f_l(x) = \begin{cases} 0 & \text{if } x \in \bigcup_{i=0}^{2^{l-1}-1}\left[\dfrac{2i}{2^l}, \dfrac{2i+1}{2^l}\right] \\ 1 & \text{otherwise.} \end{cases}$$
(See Fig. 4.) These are almost the same functions as in the example at the end of subsection 2.1; we have just added a Heaviside function at the output. For the uniform distribution P_u on [0,1] it holds that d_{P_u}(f_i, f_j) = 0.5 for i ≠ j. For x ∈ [0,1] define a tree t_h(x) of height h as follows: Leaf number i (i = 1, ..., 2^{h−1}) carries the label (x/(2i − 1), x/(2i)) ∈ ℝ², every other label is (0, 0). Any function f_l ∈ F_h can be computed as the composition of the injective mapping t_h with g̃_0, where g : {0,1}² × ℝ² → {0,1}, g(x_1, x_2, x_3, x_4) = x_1 ∨ x_2 ∨ (x_3 > 1/2^l ∧ x_4 < 1/2^l), and l is chosen according to the index of f_l. Let F denote the folding architecture corresponding to g. Then M(ε, F|_{X'_h}, d_{P_h}) ≥ h for ε < 0.5, if X'_h denotes the set of trees of exactly height h and P_h denotes the probability on X'_h induced by t_h and the uniform distribution P_u on [0,1]. Since X = ∪̇_h X'_h, we can define a probability P on X by setting P|_{X'_h} = P_h and choosing P(X'_h) = p_h with
$$p_h = \begin{cases} 6/(\pi^2 n^2) & \text{if } h = 2^{2^n} \text{ for } n \ge 1, \\ 0 & \text{otherwise.} \end{cases}$$
Since d_P(f, g) ≥ p_h d_{P_h}(f|_{X'_h}, g|_{X'_h}) it follows that
$$M(\epsilon, \mathcal{F}, d_P) \ge M\bigl(\sqrt{\epsilon}, \mathcal{F}|_{X'_h}, d_{P_h}\bigr) \ge h$$
for √ε < 1/2 and p_h ≥ √ε. Assume ε = 1/n. Then p_h ≥ √ε holds for h = 2^{2^m} with m = ⌊(√6/π) n^{1/4}⌋. As a consequence,
$$M(1/n, \mathcal{F}, d_P) \ge 2^{2^{c\,n^{1/4}}}$$
for a constant c; this grows more than exponentially in n.
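To make the construction concrete, the following sketch builds the tree t_h(x), evaluates the folding computation induced by g, and checks that it reproduces f_l(x). It uses the reconstructed forms of f_l and g given above, so the exact thresholds are assumptions, and boundary points of the intervals are ignored.

```python
import random

def f_l(x, l):
    """f_l(x) = 0 on the union of the intervals [2i/2^l, (2i+1)/2^l], 1 otherwise."""
    return 0 if int(x * 2 ** l) % 2 == 0 else 1

def t_h(x, h):
    """Tree of height h: leaf number i carries (x/(2i-1), x/(2i)), inner labels are (0, 0)."""
    def build(depth, leaf):
        if depth == h:
            return ((x / (2 * leaf - 1), x / (2 * leaf)), [None, None])
        half = 2 ** (h - depth - 1)
        return ((0.0, 0.0), [build(depth + 1, leaf), build(depth + 1, leaf + half)])
    return build(1, 1)

def g(c1, c2, x3, x4, l):
    """g(x1, x2, x3, x4) = x1 or x2 or (x3 > 1/2^l and x4 < 1/2^l)."""
    return 1 if (c1 or c2 or (x3 > 0.5 ** l and x4 < 0.5 ** l)) else 0

def fold_g(t, l):
    """The induced map g~_0 with initial context 0."""
    if t is None:
        return 0
    (x3, x4), (t1, t2) = t
    return g(fold_g(t1, l), fold_g(t2, l), x3, x4, l)

# g~_0(t_h(x)) agrees with f_l(x) for l <= h (boundary points excepted).
h = 5
mismatches = 0
for _ in range(1000):
    x = random.random()
    for l in range(1, h + 1):
        if fold_g(t_h(x, h), l) != f_l(x, l):
            mismatches += 1
print(mismatches)   # expected: 0
```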
4 Conclusion and open questions
Some new characterizations concerning learnability were established. In particular, it has been shown that PAC and PUAC learnability are not equivalent. But still a characterization of PAC learnability of a function class is
not available. Furthermore, it is not known whether a condition different from the entropy condition is useful in practice, i.e. under the assumption of noise [V]. The class of folding networks has infinite VC or pseudodimension; the same is valid for the scale sensitive version of the pseudodimension even in the case of limited weights and inputs [H1]. It is an interesting question whether this is due to the function class or a characteristic of the input set and its recursive structure. Do interesting function classes with finite VC dimension exist on trees? One function class to consider in this context is folding networks with activation H and only a finite number of different input labels; here the VC dimension is finite [KS2]. Can one transform these functions and input sequences in a natural way to standard feedforward networks and inputs without a recursive structure? We derived bounds ensuring polynomial learnability for folding networks. The argumentation uses estimates of the VC dimension. But these methods are rather crude. In particular, there is a large gap in the bounds for sigmoidal folding networks. Does a shattered set exist whose size is exponential in the height? Finally, a concept class with exponentially growing sample size was found in this context. We considered trees and could use the natural decomposition of the input and function space corresponding to trees of different height. In the distribution independent setting, either a class is learnable with polynomially growing sample size, or it is not learnable at all [V]. Assume a concept class F and a distribution P exist for which the sample size grows exponentially. It is an interesting question whether one can construct another distribution P' from P such that F is not learnable with respect to P'.
References
[ABCH] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler, Scale-sensitive dimensions, uniform convergence, and learnability, 34th IEEE Symposium on Foundations of Computer Science (1993), 292-301.
[BSF] Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, 5 (1994), 157-166.
[BR] A. Blum and R. Rivest, Training a 3-node neural network is NP-complete, Neural Networks, 5 (1992), 117-127.
[DSS] B. DasGupta, H. T. Siegelmann, and E. D. Sontag, On the complexity of training neural networks with continuous activation functions, IEEE Transactions on Neural Networks, 6 (1995), 1490-1504.
[DS] B. DasGupta and E. D. Sontag, Sample complexity for learning recurrent perceptron mappings, IEEE Transactions on Information Theory, 42 (1996), 1479-1587.
[H1] B. Hammer, On the generalization of Elman networks, 7th International Conference on Artificial Neural Networks, 1997, W. Gerstner et al. (eds.), Lecture Notes in Computer Science, Springer, Berlin, 409-414.
[H2] B. Hammer, Training a sigmoidal network is difficult, 6th European Symposium on Artificial Neural Networks (1998), 255-260.
[HS] B. Hammer and V. Sperschneider, Neural networks can approximate mappings on structured objects, 2nd International Conference on Computational Intelligence and Neuroscience (1997), 211-214.
[KS1] J. Kilian and H. T. Siegelmann, The dynamic universality of sigmoidal neural networks, Information and Computation, 128 (1996), 48-56.
[KS2] P. Koiran and E. D. Sontag, Vapnik-Chervonenkis dimension of recurrent neural networks, 3rd European Conference on Computational Learning Theory, 1997, S. Ben-David (ed.), Lecture Notes in Artificial Intelligence, 1208, Springer, Berlin, 223-237.
[M1] W. Maass, Neural nets with superlinear VC-dimension, Neural Computation, 6 (1994), 875-882.
[M2] M. Mozer, Neural net architectures for temporal sequence processing, Predicting the Future and Understanding the Past, A. Weigend and N. Gershenfeld (eds.), Addison-Wesley, Reading, MA, 1993, 143-164.
[NP] K. S. Narendra and K. Parthasarathy, Identification and control of dynamical systems using neural networks, IEEE Transactions on Neural Networks, 1 (1990), 4-27.
[OG] C. Omlin and C. L. Giles, Stable encoding of large finite-state automata in recurrent neural networks with sigmoid discriminants, Neural Computation, 8 (1996), 675-696.
[R] M. Reczko, Protein secondary structure prediction with partially recurrent neural networks, SAR and QSAR in Environmental Research, 1 (1993), 153-159.
[SKG] S. Schulz, A. Küchler, and C. Goller, Some experiments on the applicability of folding architectures to guide theorem proving, 10th International FLAIRS Conference (1997), 377-381.
[S1] E. D. Sontag, Recurrent neural networks: Some systems-theoretic aspects, Dealing with Complexity, M. Karny, K. Warwick, and V. Kurkova (eds.), Springer, London, 1997, 1-12.
[S2] E. D. Sontag, Neural networks for control, Essays on Control: Perspectives in the Theory and its Applications, H. L. Trentelman and J. C. Willems (eds.), Birkhäuser, Boston, 1993, 339-380.
[SS1] H. Siegelmann and E. D. Sontag, On the computational power of neural nets, Journal of Computer and System Sciences, 5 (1995), 132-150.
[SS2] E. D. Sontag and H. J. Sussmann, Backpropagation can give rise to spurious local minima even for networks without hidden layers, Complex Systems, 3 (1989), 91-106.
[V] M. Vidyasagar, A Theory of Learning and Generalization, Springer, Berlin, 1997.