Exploiting Random Walks for Learning

Paul Fischer†
Lehrstuhl Informatik II, Universität Dortmund, D-44221 Dortmund, Germany
[email protected]

Peter L. Bartlett*
Department of Systems Engineering, RSISE, Australian National University, Canberra 0200, Australia
[email protected]
Abstract

In this paper we consider an approach to passive learning. In contrast to the classical PAC model we do not assume that the examples are independently drawn according to an underlying distribution, but that they are generated by a time-driven process. We define deterministic and probabilistic learning models of this sort and investigate the relationships between them and with other models. The fact that successive examples are related can often be used to gain additional information similar to the information gained by membership queries. We show that this can be used to design on-line prediction algorithms. In particular, we present efficient algorithms for exactly identifying Boolean threshold functions, 2-term RSE, and 2-term DNF, when the examples are generated by a random walk on {0,1}^n.
1 INTRODUCTION

In the classical PAC model as introduced by Valiant in [14], information about the unknown target concept is available through labeled examples which are independently drawn. The assumption of independence is essential for almost every analysis of PAC learning algorithms. In practice this assumption is often violated. Often there is a time-driven process which generates the

---
* The author thanks the Australian Telecommunications and Electronics Research Board for their support.
† Supported in part by Deutsche Forschungsgemeinschaft grant We 1066/6-1.
‡ The author gratefully acknowledges the support of Bundesministerium für Forschung und Technologie grant 01IN102C/2.
Klaus-Uwe Högen‡
Lehrstuhl Informatik II, Universität Dortmund, D-44221 Dortmund, Germany
[email protected]
examples, and successive examples differ only slightly. For instance, observations of many physical processes, such as the trajectory of a robot, have this feature. Situations like these can be modeled by stochastic processes. On the other hand, stochastic processes are general enough to cover the sampling strategy of the PAC model as well. The first to consider PAC learning in such a setting were Aldous and Vazirani in [1]. In [8] Freund et al. defined a specific learning model for DFAs which essentially uses incremental changes. In their model a random walk follows the state graph of the automaton. It is shown that in this setting most DFAs are learnable without queries.

In this paper, we consider a general model where the examples are not necessarily independent and no queries are allowed. The examples are generated by a stochastic process, and we shall show that looking at the changes between successive examples can sometimes give information similar to the information of membership queries. Since the examples come sequentially we use an on-line mistake bound model. As has been shown by A. Blum (see [3]), this model is more demanding than the classical PAC model in the case of independent examples. We define several variants of this model in deterministic and stochastic settings. We compare these models and relate them to classical learning models. Moreover, we present applications of these models on the Boolean hypercube when the process generating the examples is a random walk, i.e., successive examples differ by no more than one bit. In particular, we develop efficient algorithms for exactly identifying Boolean threshold functions, 2-term Ring-Sum-Expansions (2-term RSE), and 2-term DNF. None of these classes is properly learnable in the PAC model, but all of them are in the query model. Hence we present for the first time algorithms for exactly identifying these classes in a passive learning model with a polynomial number of mistakes. The paper is structured as follows.
In Section 2 we introduce the notation and define the learning models. In Section 3 we discuss relationships between these models and with classical learning models. Sections 4, 5, and 6 present learning algorithms for Boolean threshold functions, 2-term RSE, and 2-term DNF, respectively.
2 DEFINITIONS

Let n be a positive integer and X_n = {0,1}^n. Suppose F_n and H_n are classes of {0,1}-valued functions defined on X_n. Let X = ∪_{n=1}^∞ X_n, F = ∪_{n=1}^∞ F_n and H = ∪_{n=1}^∞ H_n. We assume that the elements of X, F and H are represented using an appropriate language that is polynomial-time decidable, such that some polynomial-time algorithm can compute f(x) and h(x) for any x ∈ X, f ∈ F and h ∈ H. Denote S = X × {0,1} and S* = ∪_{t∈ℕ} S^t. For an infinite sequence x = (x_1, x_2, ...) of values in X_n, a function f in F_n, and a positive integer t, define the length t sample generated from x by f as

    sam_t(x, f) = (x_1, f(x_1), ..., x_t, f(x_t)).

A learning algorithm for F works as follows: at each time t it has an intermediate hypothesis h_t ∈ H. If the unlabeled example x_t is presented, then the algorithm predicts h_t(x_t). Afterwards the correct label f(x_t) is revealed and the algorithm may update its hypothesis. We shall call F and H the target and hypothesis class, respectively. A deterministic algorithm A defines a function from S* to H, such that the value of the function for a length t sample is the working hypothesis of the algorithm after f(x_t) was presented. We use the same symbol A to denote this function; the meaning is always clear from the context. In two of the models of learning defined below, the algorithm also takes as input a parameter δ ∈ (0,1). This parameter specifies the desired performance of the algorithm. We say that a learning algorithm is polynomial time if its running time for computing a single prediction is bounded by a polynomial in the size of the description of one example and (if applicable) in 1/δ. Suppose x = (x_1, x_2, ...) ∈ X^ℕ is an infinite sequence of values in X_n, f is a function in F_n, and A is a deterministic algorithm for F. If A takes as input a parameter δ, define the mistake indicator function

    M^t_{A,f}(δ, x) = 1 if A(δ, sam_{t−1}(x, f))(x_t) ≠ f(x_t), and 0 otherwise.
If A does not take an input parameter δ, M^t_{A,f}(x) is defined analogously. A randomized learning algorithm A for F is a learning algorithm that also takes as input a random bit string. The mistake indicator function M^t_{A,f}(δ, x) (or M^t_{A,f}(x), as appropriate) is defined as the probability over all random bit strings that the algorithm misclassifies x_t. We will restrict the sequences x to certain classes of infinite sequences. For each n ∈ ℕ, let 𝒳_n ⊆ X_n^ℕ be the set of legal sequences of values in X_n, and let 𝒳 = ∪_{n=1}^∞ 𝒳_n. Such a restriction might force successive examples to be similar in some way. For example, define 𝒲_n ⊆ X_n^ℕ to be the set of all walks through X_n = {0,1}^n with the usual topology. That is, for all x =
(x_1, x_2, ...) ∈ 𝒲_n, the Hamming distance between x_i and x_{i+1} is no more than 1 for all i ∈ ℕ. We consider three models of learning. The first is a worst-case deterministic model, in which we require the learning algorithm to make few mistakes for all samples generated by a function in F_n from any legal sequence x ∈ 𝒳_n.
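For concreteness, a walk in 𝒲_n can be generated as follows. This sketch is our illustration, not part of the paper; flipping one uniformly chosen bit per step is just one convenient way to satisfy the Hamming-distance condition.

```python
import random

def random_walk(n, steps, rng=random):
    """Generate a walk through {0,1}^n in the sense of W_n: successive
    points differ in Hamming distance at most 1.  Here one uniformly
    chosen bit is flipped per step; W_n itself only requires that
    consecutive points differ in at most one bit."""
    x = tuple(rng.randint(0, 1) for _ in range(n))  # arbitrary start point
    path = [x]
    for _ in range(steps - 1):
        i = rng.randrange(n)
        x = x[:i] + (1 - x[i],) + x[i + 1:]  # flip bit i
        path.append(x)
    return path

def hamming(a, b):
    """Hamming distance between two bit tuples."""
    return sum(u != v for u, v in zip(a, b))
```

A learner in this model observes the points of such a path one at a time, together with their labels.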
Definition 1 (Mistake bound model) Suppose A is an algorithm for F using hypothesis class H. For n ∈ ℕ, x ∈ 𝒳_n, f ∈ F_n, define N_{A,f}(x) as the number of mistakes that A makes on the sample generated by f from x, i.e.,
    N_{A,f}(x) = Σ_{t=1}^∞ M^t_{A,f}(x),

provided the sum converges. Define the mistake bound of A for F_n on a set 𝒳_n ⊆ X_n^ℕ as

    N̂_{A,F_n,𝒳_n} = max_{f∈F_n} max_{x∈𝒳_n} N_{A,f}(x),
provided the maxima exist. We say that F is mistake bound learnable from 𝒳 by H if there is a polynomial time algorithm A such that N̂_{A,F_n,𝒳_n} is bounded by a polynomial in n. In that case, define
    N̂_{F_n,𝒳_n} = min_A N̂_{A,F_n,𝒳_n},
where the minimum is over all polynomial time algorithms A. We say that F is exactly mistake bound learnable from 𝒳 if there is a hypothesis class H and a polynomial time algorithm A such that N̂_{A,F_n,𝒳_n} is bounded by a polynomial in n and the following condition holds: for each x ∈ 𝒳 and for each t ∈ ℕ the algorithm can determine after example t whether there is more than one concept in F which it cannot rule out as a possible target concept. Moreover, if the algorithm determines after example t that only one possible target concept f ∈ F is left, then it can compute the representation of f in polynomial time.
Notice that mistake bound learning from 𝒳_n = X_n^ℕ is equivalent to Littlestone's learning model presented in [12]. In Sections 4 and 5, we consider mistake bound learning from random walks on the Boolean cube, i.e., 𝒳_n = 𝒲_n. For exact mistake bound learning, the algorithms have to be more involved: they have to identify when the information gathered specifies the target concept f ∈ F, and in this case they output (the representation of) f. This distinguishes exact mistake bound learning from the classical mistake bound model. For applications the form of the representation of the hypothesis is often essential. For example, a function which is given by a multiple case distinction might be hard to implement as a (small) Boolean circuit. Even if F is exactly
learnable, there may be sequences of examples which hide some information from the algorithm. Then exact identification will never be possible, but the algorithm will recognize that it may be forced to make another mistake.

In the other two learning models which we consider, the sequences x are the sample paths of discrete-time stochastic processes. For n ∈ ℕ, let 𝒫_n be a class of stochastic processes with sample paths in 𝒳_n, and let 𝒫 = ∪_{n=1}^∞ 𝒫_n.

Definition 2 (Probabilistic mistake bound model) Suppose A is an algorithm for F using hypothesis class H that takes as input a parameter δ ∈ (0,1). For n ∈ ℕ, define

    N̂_{A,F_n,𝒫_n,δ} = min { m ∈ ℕ : sup_{f∈F_n} sup_{P∈𝒫_n} P{ x ∈ 𝒳_n : N_{A,f}(δ, x) > m } < δ },

if the minimum exists. We say that F is probably mistake bound learnable from 𝒫 by H if there is a polynomial time algorithm A such that N̂_{A,F_n,𝒫_n,δ} is bounded by a polynomial in n and 1/δ. Define N̂_{F_n,𝒫_n,δ} = min_A N̂_{A,F_n,𝒫_n,δ}, where the minimum is over all polynomial time algorithms A. We say that F is exactly probably mistake bound learnable from 𝒫 if there is a hypothesis class H and a polynomial time algorithm A such that N̂_{A,F_n,𝒫_n,δ} is bounded by a polynomial in n and 1/δ and the following condition holds: for each x ∈ 𝒳 and for each t ∈ ℕ the algorithm can determine at time t whether there is more than one concept in F which it cannot rule out as a possible target concept. Moreover, if the algorithm determines at time t that only one possible target concept f ∈ F is left, then it can compute the representation of f in polynomial time.
In the bounded mistake rate model, we require the learning algorithm to make mistakes only on a small proportion of the examples in a finite random sample.
Definition 3 (Bounded mistake rate model) Suppose A is an algorithm for F using hypothesis class H. Define

    M̂_{A,F_n,𝒫_n}(ε, t) = sup_{f∈F_n} sup_{P∈𝒫_n} E_{x∼P}[ M^t_{A,f}(ε, x) ].

We say that F is learnable from 𝒫 by H in the bounded mistake rate model if there is a polynomial time algorithm A such that, for all ε > 0 and all n ∈ ℕ, there is a t_0 ∈ ℕ such that for all t ≥ t_0 one has M̂_{A,F_n,𝒫_n}(ε, t) < ε, and t_0 is bounded by a polynomial in n and 1/ε. Define

    M̂_{F_n,𝒫_n}(t) = inf_A inf { ε ∈ (0,1) : M̂_{A,F_n,𝒫_n}(ε, t) < ε },

where the first infimum is over all polynomial time algorithms A.
3 RELATIONSHIPS BETWEEN THE LEARNING MODELS

In this section, we compare the learning models defined above, and relate them to other popular models. Clearly, if 𝒳 is a class of X-sequences and 𝒫 is a class of stochastic processes with sample paths in 𝒳, any algorithm A that learns from 𝒳 in the mistake bound model learns from 𝒫 in the probabilistic mistake bound model, and N̂_{A,F_n,𝒫_n,δ} ≤ N̂_{A,F_n,𝒳_n}. The following theorem shows that a polynomial time algorithm for probabilistic mistake bound learning can be used to construct a polynomial time algorithm for learning in the bounded mistake rate model, provided the stochastic processes in 𝒫 are stationary.
Definition 4 If 𝒳_n ⊆ X_n^ℕ and P is a discrete-time stochastic process on 𝒳_n, we say that P is stationary if, for all k ∈ ℕ, for all t_1, ..., t_k ∈ ℕ, and for all τ ∈ ℕ, the cumulative distribution functions under P of the random elements (x_{t_1}, x_{t_2}, ..., x_{t_k}) and (x_{t_1+τ}, x_{t_2+τ}, ..., x_{t_k+τ}) are identical.
Note that the sampling process of the PAC model is an i.i.d. stochastic process, and is clearly stationary. Also, the uniform random walk, for which the initial distribution on X_n is uniform (that is, Pr(x_1 = y) = 1/|X_n| for all y ∈ X_n), is stationary.
Theorem 5 If F is probably mistake bound learnable from 𝒫, where 𝒫 is a class of stationary stochastic processes, then F is learnable from 𝒫 in the bounded mistake rate model. Furthermore, M̂_{F_n,𝒫_n}(t) < ε for

    t ≥ ⌈ 2 N̂_{F_n,𝒫_n,ε/2} / ε ⌉.
Proof Suppose A_mb is a polynomial time algorithm for learning F in the probabilistic mistake bound model. Then for any P in 𝒫_n and f in F_n,

    P{ x ∈ 𝒳_n : N_{A_mb,f}(δ, x) > N̂_{A_mb,F_n,𝒫_n,δ} } < δ.

For x ∈ X^ℕ and i, j ∈ ℕ, let sub_{i,j}(x) = (x_i, x_{i+1}, ..., x_j).

We will construct a polynomial time randomized algorithm A_bmr that uses A_mb to learn F in the bounded mistake rate model. For any ε ∈ (0,1), A_bmr chooses m uniformly on {1, ..., m_0}, where

    m_0 = ⌈ 2 N̂_{A_mb,F_n,𝒫_n,ε/2} / ε ⌉.

For t ≥ m_0, the algorithm takes a sample sam_t(x, f), passes sub_{t−m+1,t}(sam_t(x, f)) to A_mb, and returns A_mb's hypothesis. Using the stationarity of P,

    E_{x∼P}[ M^t_{A_bmr,f}(ε, x) ]
      = (1/m_0) Σ_{m=1}^{m_0} ∫_{x∈𝒳_n} M^m_{A_mb,f}(ε/2, sub_{t−m+1,t}(x)) dP(x)
      = (1/m_0) Σ_{m=1}^{m_0} ∫_{x∈𝒳_n} M^m_{A_mb,f}(ε/2, sub_{1,m}(x)) dP(x)
      ≤ (1/m_0) ( (1 − ε/2) N̂_{A_mb,F_n,𝒫_n,ε/2} + m_0 ε/2 )
      < N̂_{A_mb,F_n,𝒫_n,ε/2} / m_0 + ε/2
      ≤ ε.  □

A slightly more direct proof shows that if 𝒫_n is a class of stationary distributions on 𝒳_n, then M̂_{F_n,𝒫_n}(t) < ε for t > N̂_{F_n,𝒳_n}/ε. Since any stochastic process in 𝒫_iid,n is stationary, this also shows that efficient learning in Littlestone's mistake bound model (that is, mistake bound learning with 𝒳_n = X_n^ℕ) is no easier than efficient learning in Haussler, Littlestone, and Warmuth's prediction model (that is, polynomial time bounded mistake rate learning from 𝒫_n = 𝒫_iid,n). Results in [9] and [11] imply this relationship for arbitrary X. However, the proof of
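The construction of A_bmr can be sketched in code. The `LastLabel` learner and the factory interface below are illustrative stand-ins of our own invention, not the paper's A_mb; the point is only the random-window mechanism of the proof.

```python
import random

class LastLabel:
    """Toy stand-in for a mistake bound learner A_mb (illustrative only):
    it simply predicts the most recently seen label."""
    def __init__(self):
        self.last = 0
    def update(self, x, label):
        self.last = label
    def predict(self, x):
        return self.last

def bmr_predict(mb_factory, history, x_t, m0, rng=random):
    """One prediction of the wrapper A_bmr from the proof of Theorem 5:
    draw a window length m uniformly from {1, ..., m0}, replay the last
    m - 1 labelled examples into a fresh copy of A_mb, and return its
    prediction on the current point x_t.  `history` is the list of
    (example, label) pairs seen so far."""
    m = rng.randint(1, m0)
    learner = mb_factory()
    replay = history[-(m - 1):] if m > 1 else []
    for x, label in replay:
        learner.update(x, label)
    return learner.predict(x_t)
```

The random offset is what lets stationarity convert the cumulative mistake bound of A_mb into a per-step mistake rate.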
Theorem 5 does not use the fact that X_n = {0,1}^n, and that theorem is tight, since there are function classes for which M̂_{F_n,𝒫_iid,n}(t) = Ω(VCdim(F_n)/t), but N̂_{F_n,X_n^ℕ} = VCdim(F_n), where VCdim(F_n) is the Vapnik-Chervonenkis dimension of F_n (see [12] and [10]). It follows from the results above that mistake bound learning is at least as hard as probabilistic mistake bound learning, which is at least as hard as bounded mistake rate learning from stationary stochastic processes. Furthermore, bounded mistake rate learning is possible with 𝒫_n = 𝒫_iid,n if and only if PAC learning is possible [9]. Blum [3] gives an example of a function class for which PAC learning is easier than mistake bound learning, given some cryptographic complexity assumptions. Therefore mistake bound learning is strictly harder than bounded mistake rate learning with 𝒫_n = 𝒫_iid,n.

The following theorem shows that any function class that is learnable in the bounded mistake rate model from the uniform distribution is learnable from a uniform random walk. Let U_n be the uniform distribution on X_n, so U_n^ℕ ∈ 𝒫_iid,n describes a sequence of independent uniformly distributed random variables.

Theorem 6 If M̂_{F_n,{U_n^ℕ}}(t) < ε/2, then M̂_{F_n,𝒫_walk,n}(t′) < ε for

    t′ = O( t n log(nt/ε) ).

Proof We use the fact that the distribution of periodic samples from a random walk is close to the uniform distribution in total variation distance.

Definition 7 If S is a countable set and P and Q are distributions on S, the total variation distance between P and Q is

    d_TV(P, Q) = 2 max_{A⊆S} | P(A) − Q(A) | = Σ_{x∈S} | P(x) − Q(x) |.
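Definition 7 is easy to state in code for distributions represented as dictionaries over a finite support (our illustration):

```python
def total_variation(P, Q):
    """d_TV(P, Q) = sum_x |P(x) - Q(x)|, the normalisation used in
    Definition 7 (equal to 2 * max over events A of |P(A) - Q(A)|)."""
    support = set(P) | set(Q)
    return sum(abs(P.get(x, 0.0) - Q.get(x, 0.0)) for x in support)
```

For example, the distance between the fair and a (3/4, 1/4)-biased coin is |1/2 − 3/4| + |1/2 − 1/4| = 1/2.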
Let U_n be the uniform distribution on X_n. If P is a stochastic process on 𝒳_n, k ∈ ℕ, and x ∈ X_n^ℕ satisfies

    P{ y ∈ X_n^ℕ : y_i = x_i, i = 1, ..., k } > 0,

then denote by P_{k|x} the following conditional distribution on X_n:

    P_{k|x}(b) = Pr_{y∼P}( y_{k+1} = b | y_i = x_i, i = 1, 2, ..., k ).

The stochastic process P is said to be β-close to uniform if, for all k ∈ ℕ and for all x ∈ X_n^ℕ, P_{k|x} is defined and

    d_TV(P_{k|x}, U_n) ≤ β.

We will use the following lemma; it is proved inductively using Lemma 3 in [2].

Lemma 8 If m ∈ ℕ, κ : X_n^m → [0,1], 0 < β < 1, and P is a stochastic process on 𝒳_n that is β-close to uniform, then
    | E_{x∼P}[ κ(sub_{1,m}(x)) ] − E_{x∼U_n^ℕ}[ κ(sub_{1,m}(x)) ] | ≤ mβ,
where sub_{1,m}(x) = (x_1, ..., x_m).
Lemma 9 Let 𝒫_{n,β} be the class of stochastic processes on 𝒳_n that are β-close to uniform. If M̂_{F_n,{U_n^ℕ}}(t) < ε/2 and β < ε/(2t), then M̂_{F_n,𝒫_{n,β}}(t) < ε.
Proof Fix an algorithm A that achieves the mistake bound M̂_{F_n,{U_n^ℕ}}(t) < ε/2. Fix any f ∈ F_n and P ∈ 𝒫_{n,β}. By Lemma 8, E_P[ M^t_{A,f} ] ≤ E_{U_n}[ M^t_{A,f} ] + tβ < ε. □

The following lemma follows trivially from the main result in [5].
Lemma 10 For any uniform random walk P in 𝒫_walk,n and 0 < β < 1, let Q_k be the stochastic process that corresponds to sampling the stochastic process P at every k time steps. That is, if (x_1, x_2, ...) ∈ X_n^ℕ is a sample path of P, then the corresponding sample path of Q_k is (y_1, y_2, ...) = (x_k, x_{2k}, ...). Then Q_k is β-close to uniform for

    k ≥ ((n + 1)/4) log( 2n/β² + 1 ).
Theorem 6 follows easily from Lemma 9 and Lemma 10. □

It follows that bounded mistake rate learning from 𝒫_walk is no harder than bounded mistake rate learning from 𝒫_iid.
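The subsampled process Q_k of Lemma 10 is straightforward to construct from a sample path (a sketch in Python; the walk generator is our illustration of the uniform random walk, not code from the paper):

```python
import random

def uniform_random_walk(n, steps, rng=random):
    """A uniform random walk on {0,1}^n: uniform starting point, one
    uniformly chosen bit flipped per step (an illustrative choice)."""
    x = [rng.randint(0, 1) for _ in range(n)]
    path = [tuple(x)]
    for _ in range(steps - 1):
        x[rng.randrange(n)] ^= 1
        path.append(tuple(x))
    return path

def subsample(path, k):
    """The process Q_k of Lemma 10: observe the walk every k steps,
    i.e. the subsequence (x_k, x_2k, ...)."""
    return path[k - 1::k]
```

Lemma 10 says that for k of order n log(n/β²) the subsampled points are nearly independent uniform draws, which is what allows a bounded-mistake-rate learner designed for the uniform distribution to be run on a walk.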
4 LEARNING BOOLEAN THRESHOLD FUNCTIONS

In this section we develop an algorithm for learning Boolean threshold functions in the mistake bound model. The idea behind the algorithms in this section and in Section 5 is as follows: the prediction of the label l of the next example is made in such a way that a mistake always increases our knowledge about the target concept. The examples are generated by a walk along the edges of the Boolean cube, i.e., 𝒳 = 𝒲_n. We consider the class BTF of Boolean threshold functions. This class contains concepts f_{w,θ} : X_n → {0,1}, with w = (w_1, ..., w_n) ∈ {0,1}^n and θ an integer, satisfying

    f_{w,θ}(x_1, ..., x_n) = 1 if w_1 x_1 + ... + w_n x_n ≥ θ, and 0 if w_1 x_1 + ... + w_n x_n < θ.

Let Y^t = (y_1^t, ..., y_i^t, ..., y_n^t) be the example of the walk presented at time t, let l̂^t be the prediction made by our algorithm, and let l^t be the correct label. The weights and threshold for the target function are w and θ. Given a set V of vectors, span(V) denotes the vector space spanned by V. Define Algorithm A as follows. For the first example, the algorithm predicts l̂^1 = 0. After that, it predicts
l̂^t = l^{t−1} until it makes a mistake. At this point we know that the following equations hold:

    w · Y^{t−1} − θ = l^{t−1} − 1,    w · Y^t − θ = −l^{t−1}.

So after this mistake, A initializes a set S of linearly independent vectors which correspond to these equations,

    S = { (Y^{t−1}, −1, 1 − l^{t−1}), (Y^t, −1, l^{t−1}) }.

For subsequent examples, the algorithm uses the equations corresponding to the elements of S to make its predictions. Each time it makes a mistake, a new vector is added to S. Specifically, for an example Y^t, Algorithm A predicts as follows.

    IF (Y^t, −1, l^{t−1}) ∈ span(S)
        THEN predict l̂^t = 1 − l^{t−1}
        ELSE predict l̂^t = l^{t−1}
    IF l^t ≠ l̂^t THEN add (Y^t, −1, l^{t−1}) to S.

The following result shows that the vectors in S correspond to a set of linearly independent equations in w and θ. Thus, if the set S contains n + 1 vectors, we can compute the corresponding unique solution for w and θ describing the target concept f_{w,θ} in BTF.

Lemma 11 For all (Y, −1, l) in S, (w, θ, 1) · (Y, −1, l) = 0. Furthermore, S is a linearly independent set.

Proof Clearly, the first two elements added to S are linearly independent, and after that any (Y^t, −1, l^{t−1}) added to S is not in span(S), so S is a linearly independent set. After the first mistake with t > 1, the algorithm adds (Y^{t−1}, −1, 1 − l^{t−1}) and (Y^t, −1, l^{t−1}) to S. Since it made a mistake, l^t ≠ l^{t−1}. Suppose l^{t−1} = 0. Then w · Y^{t−1} − θ + 1 = 0 and w · Y^t − θ = 0. That is, (w, θ, 1) · z = 0 for all z in S. Similarly for l^{t−1} = 1. For subsequent mistakes, if l̂^t = l^{t−1} but l^t ≠ l̂^t, it must be that w · Y^t − θ + l^{t−1} = 0. So any z added to S has (w, θ, 1) · z = 0. □
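Algorithm A can be implemented directly. The sketch below is our own code, not the paper's; it tests span membership exactly by Gaussian elimination over the rationals, which keeps the linear algebra free of floating-point error.

```python
from fractions import Fraction
import random

def in_span(vecs, v):
    """Exact test whether v lies in span(vecs), via Gaussian elimination
    over the rationals."""
    pivots = []                       # list of (pivot_column, reduced_row)
    for row in vecs:
        r = [Fraction(a) for a in row]
        for col, prow in pivots:
            if r[col]:
                f = r[col] / prow[col]
                r = [a - f * b for a, b in zip(r, prow)]
        lead = next((i for i, a in enumerate(r) if a), None)
        if lead is not None:
            pivots.append((lead, r))
    t = [Fraction(a) for a in v]
    for col, prow in pivots:
        if t[col]:
            f = t[col] / prow[col]
            t = [a - f * b for a, b in zip(t, prow)]
    return not any(t)

class AlgorithmA:
    """Algorithm A for Boolean threshold functions on a walk over {0,1}^n."""
    def __init__(self, n):
        self.S = []        # linearly independent vectors (Y, -1, l^{t-1})
        self.prev = None   # (previous example, previous label)
        self.mistakes = 0

    def step(self, y, label):
        """Predict the label of y, then learn from the revealed label."""
        if self.prev is None:
            pred = 0                               # first example: guess 0
        else:
            _, l_prev = self.prev
            v = list(y) + [-1, l_prev]
            pred = (1 - l_prev) if (self.S and in_span(self.S, v)) else l_prev
        if pred != label:
            self.mistakes += 1
            if self.prev is not None:
                y_prev, l_prev = self.prev
                if not self.S:
                    # first mistake with t > 1: two equations at once
                    self.S.append(list(y_prev) + [-1, 1 - l_prev])
                self.S.append(list(y) + [-1, l_prev])
        self.prev = (tuple(y), label)
        return pred
```

Running the learner on a random walk labelled by a threshold function should, by Lemma 12 below, never produce more than n + 1 mistakes.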
Lemma 12 Algorithm A makes no more than n + 1 mistakes.
Proof We first show that if (Y^t, −1, l^{t−1}) is in the span of S then l^t = 1 − l^{t−1}, so the algorithm predicts correctly in that case. Indeed, for an element of span(S) we can write (Y^t, −1, l^{t−1}) = Σ_{z∈S} α_z z, where the α_z are real numbers. It follows from the lemma above that (w, θ, 1) · (Y^t, −1, l^{t−1}) = Σ_{z∈S} α_z (w, θ, 1) · z = 0, so w · Y^t − θ = −l^{t−1}, and hence l^t = 1 − l^{t−1}. Now, if the algorithm makes a mistake on the first example, it does not change S and so gains no benefit from that mistake. After the first mistake with t > 1, it adds two elements to S. After that, each mistake increases the size of S by one. Since S is a linearly independent set of vectors from {(Y, −1, l) : Y ∈ {0,1}^n, l ∈ {0,1}} that are all orthogonal to (w, θ, 1) in ℝ^{n+2}, it can contain at most n + 1 elements. When S contains n + 1 elements, no further vector can be added, and the algorithm makes no more mistakes. Hence it can make a total of no more than n + 1 mistakes. □

To show that this algorithm properly learns BTF, we must show that l̂^t is generated by a Boolean threshold function for the points that are within Hamming distance one of Y^{t−1}.

Lemma 13 Let Y^0 ∈ X_n. Any Boolean function defined on the set S_Y = {Y : ham(Y^0, Y) ≤ 1} can be expressed as a threshold function f_{w,θ} with w ∈ {−1,0,1}^n and
θ an integer.
Proof Any Boolean function g defined on the set S = {0, e_1, ..., e_n} can be expressed as a threshold function of the required form, where e_i = (δ_{i,1}, δ_{i,2}, ..., δ_{i,n}) is the unit vector in the i-th direction and δ_{i,j} is the Kronecker delta function. Indeed, w_i = 2g(e_i) − 1 for i = 1, ..., n and θ = 1 − g(0) will suffice. Given a function defined on S, and a bijection b between S and S_Y, we can transform f_{w,θ} to f_{w′,θ′}, so that f_{w′,θ′}(x) = f_{w,θ}(b(x)) for all x ∈ S_Y. Furthermore, w′ ∈ {−1,0,1}^n and θ′ is an integer. To see this, notice that we can represent the bijection b as a composition of reflections b_{R,i} : {0,1}^n → {0,1}^n, where y = b_{R,i}(x) satisfies

    y_k = 1 − x_k if k = i, and y_k = x_k if k ≠ i,

and swappings b_{S,i,j} : {0,1}^n → {0,1}^n, where y = b_{S,i,j}(x) satisfies

    y_k = x_i if k = j,  y_k = x_j if k = i,  y_k = x_k otherwise.

It is easy to see that f_{w′,θ′}(x) = f_{w,θ}(b_{R,i}(x)) for all x ∈ S_Y if

    w′_k = −w_k if k = i,  w′_k = w_k if k ≠ i,  and θ′ = θ − w_i.

Similarly, f_{w′,θ′}(x) = f_{w,θ}(b_{S,i,j}(x)) for all x ∈ S_Y if

    w′_k = w_i if k = j,  w′_k = w_j if k = i,  w′_k = w_k otherwise,  and θ′ = θ.

Both transformations leave θ′ an integer and w′_k ∈ {−1,0,1}. □
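The base construction in the proof can be checked exhaustively in code; the snippet below (our illustration) verifies that w_i = 2g(e_i) − 1 and θ = 1 − g(0) reproduce any Boolean function g on S = {0, e_1, ..., e_n}.

```python
from itertools import product

def threshold_for_star(g, n):
    """Lemma 13's base construction on S = {0, e_1, ..., e_n}:
    w_i = 2*g(e_i) - 1 and theta = 1 - g(0) give f_{w,theta} = g on S."""
    zero = tuple([0] * n)
    e = lambda i: tuple(1 if j == i else 0 for j in range(n))
    w = [2 * g(e(i)) - 1 for i in range(n)]
    theta = 1 - g(zero)
    return w, theta

def check_all(n):
    """Verify the construction for every Boolean function on S."""
    zero = tuple([0] * n)
    e = lambda i: tuple(1 if j == i else 0 for j in range(n))
    S = [zero] + [e(i) for i in range(n)]
    for bits in product([0, 1], repeat=len(S)):
        table = dict(zip(S, bits))
        g = table.__getitem__
        w, theta = threshold_for_star(g, n)
        for x in S:
            f = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
            if f != g(x):
                return False
    return True
```

Note that the resulting weights lie in {−1, 1} and θ ∈ {0, 1}, so the reflection and swapping transformations of the proof indeed start from w ∈ {−1,0,1}^n with θ an integer.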
Theorem 14 Algorithm A learns BTF with hypothesis class BTF.

Proof Since the algorithm always guesses l̂^t = l^t when Y^t = Y^{t−1}, and it never guesses l̂^t ≠ l^{t−1} unless l̂^t = l^t, and l is a monotone function of Y, the function describing the algorithm's guesses is a monotone Boolean function. We will show that any monotone Boolean function on S_Y = {Y′ : ham(Y′, Y) ≤ 1} is a Boolean threshold function f_{w,θ}, from which it follows that the algorithm always guesses with a Boolean threshold function.
Now, given a monotone Boolean function defined on S_Y, Lemma 13 shows that it can be represented as a threshold function f_{w,θ} with θ an integer and w ∈ {−1,0,1}^n. We will show that it can be expressed as a Boolean threshold function. Define flip_i : {0,1}^n → {0,1}^n such that y = flip_i(x) satisfies

    y_j = 1 − x_j if j = i, and y_j = x_j otherwise.

Suppose w_i = −1 for some i. Consider the two points Y and flip_i(Y). Suppose Y_i = 1 (a similar argument applies for Y_i = 0). Then we must have f_{w,θ}(Y) ≥ f_{w,θ}(flip_i(Y)), by monotonicity. But w · Y − θ = w · flip_i(Y) − θ − 1, so we must have f_{w,θ}(Y) = f_{w,θ}(flip_i(Y)). Setting

    w′_j = 0 if j = i, and w′_j = w_j otherwise,

and θ′ = θ + 1 gives f_{w′,θ′}(Y′) = f_{w,θ}(Y′) for all Y′ ∈ S_Y with Y′_i = 1, and clearly gives f_{w′,θ′}(flip_i(Y)) = f_{w,θ}(flip_i(Y)). By performing a similar transformation for any other w_i = −1, we can represent the monotone Boolean function on S_Y as a Boolean threshold function. □
With an easy adversary strategy one can show that N̂_{BTF_n,𝒲_n} ≥ n. Thus, we have

Corollary 15 BTF is exactly mistake bound learnable by BTF from random walks 𝒲_n, and n ≤ N̂_{BTF_n,𝒲_n} ≤ n + 1.
5 EXACT MISTAKE BOUND LEARNING OF 2-TERM RSE

A 2-term RSE is the parity of two monotone monomials, e.g., (x_1 ∧ x_3) ⊕ (x_3 ∧ x_4 ∧ x_5). It is known that this class is not properly learnable in the PAC model but is learnable using a larger hypothesis class (see [6]). Our algorithm will use intermediate hypotheses which are not 2-term RSE. We will omit an explicit definition of this hypothesis class. Instead the hypotheses will be implicitly described in the algorithms. Basically they are nested case distinctions based on previously gathered information. We shall see that as the algorithm makes mistakes this information "increases" and, on the other hand, the number of possible target concepts decreases, until only one is left. Let X = {x_1, ..., x_n} be the set of variables. Let m_0 and m_1 be (monotone) monomials over X, and let c = m_0 ⊕ m_1 be the target concept. We identify a monotone monomial with the set of variables occurring in it. The variables in m_0 ∪ m_1 are called relevant. In order to extract information from each mistake, we define the following sets that will be dynamically updated:

E(i) is initialized to X for i = 1, ..., n. Then elements are removed, while the following conditions are always maintained: if x_i occurs in exactly one
monomial m_j of c, then E(i) ⊇ m_j; if x_i occurs in both monomials then E(i) ⊇ m_0 ∩ m_1. (E(i) is a superset of the variables that occur in every monomial that x_i occurs in.)

H(i) is initialized to X. Then elements are removed, while the following conditions are always maintained: if x_i occurs in exactly one monomial of c, say in m_j, then H(i) ⊇ m_{1−j}; if x_i occurs in both monomials then H(i) ⊇ m_0 ∩ m_1.

T is initialized to X. Then elements are removed, while the following condition is always maintained: T ⊇ (m_0 ∩ m_1).

R is initialized to X. Then elements are removed, while the following condition is always maintained: R ⊇ (m_0 ∪ m_1). So, R always includes the relevant variables.

V is initialized to ∅. Then elements are inserted, while the following condition is always maintained: V ⊆ (m_0 ∪ m_1). So, V is always a subset of the relevant variables.

S is initialized to ∅. Then elements are inserted, while the following condition is always maintained: S ⊆ (m_0 ∩ m_1).

Note that 0 ≤ |E(i)|, |H(i)|, |S|, |T|, |V|, |R| ≤ n. We say that E(i) (resp. H(i)) is satisfied by example Y^t if y_j = 1 for all x_j ∈ E(i) (resp. x_j ∈ H(i)). Our algorithm can face four situations S1, S2, S3, S4, described in Figure 1. They are distinguished by the label of the last example and the bit-flip that creates the next example. We now describe for each of these situations how the prediction of the next label is made. Later we show that every mistake leads to an increase in information about c. This is documented by an update of one of the above sets. In every case, if the variable that flips is not in R (i.e., definitely not relevant) we predict the previous label and cannot make a mistake. Let Q be one of the above sets and Y = (y_1, ..., y_n) ∈ {0,1}^n. Then we define Q ∩ Y = {x_i | x_i ∈ Q and y_i = 1}, i.e., the set of variables in Q which are satisfied by Y. We now show that every mistake results in an update.

S1. As c(Y^t) = 1 and y_i^t = 0, we know that Y^t satisfies exactly one monomial, say m_1, and x_i ∉ m_1.
If x_i is a member of the other monomial m_0, then E(i) ⊇ m_0. Now if E(i) is satisfied by Y^{t+1}, so is m_0, whence c(Y^{t+1}) = 0, and the prediction is correct. Thus a mistake at (1.1) can only occur if x_i ∉ m_0 ∪ m_1, and x_i can be removed from R. On the other hand, if a mistake occurs at (1.2) then E(i) is not satisfied by Y^{t+1}, but m_0 must have become satisfied, whence x_i ∈ m_0. Then E(i) is a proper superset of m_0, and we can remove from E(i) all variables not satisfied by Y^{t+1}. As all variables in m_0 are satisfied, E(i) remains a superset of m_0.

S2. As c(Y^t) = 1, we know Y^t satisfies exactly one monomial. We claim that prediction (2.1) is correct if x_i ∈ m_0 ∪ m_1. Indeed, if x_i ∈ m_0 ∩ m_1 then neither monomial is satisfied by Y^{t+1}. If (w.l.o.g.) x_i ∈
m_0 \ m_1, then E(i) ⊇ m_0. Example Y^t satisfies E(i) (and hence m_0) but not m_1. After switching x_i to 0, monomial m_0 is no longer satisfied, whence c(Y^{t+1}) = 0. Thus the prediction at (2.1) can only be wrong if x_i is not relevant. Now, if x_i is in exactly one monomial, say x_i ∈ m_0, then Y^{t+1} does not satisfy m_0, and H(i) ⊇ m_1. Thus if Y^{t+1} satisfies H(i), it also satisfies m_1 and one has c(Y^{t+1}) = 1. If x_i ∈ m_0 ∩ m_1 then Y^{t+1} cannot satisfy H(i), because then x_i ∈ H(i). Thus a mistake at (2.2) only occurs if x_i is not relevant. Now consider a mistake at (2.3). If x_i ∈ m_0 ∩ m_1 the prediction 0 is correct. If x_i is in exactly one monomial, say x_i ∈ m_0 \ m_1, this mistake happens if m_1 is satisfied by Y^{t+1}, although H(i) is not. But then H(i) is a proper superset of m_1 and we remove from it all variables not satisfied by Y^{t+1}. Observe that no variable from m_1 is removed at this point. If x_i is irrelevant we also remove those variables from H(i).

S3. For the cases (3.1), (3.2) and (3.3) we know that there is a possibly relevant variable x_j which is satisfied by neither Y^t nor Y^{t+1}. At (3.1), x_j can still be in m_0 ∩ m_1. A mistake at that point means that c(Y^{t+1}) = 1 and rules out this possibility. At (3.2) and (3.3) we definitely know that x_j is not in m_0 ∩ m_1. Consider (3.2) first and assume that x_j ∈ m_0 \ m_1. Then H(j) ⊇ m_1. Now as H(j) is satisfied by Y^{t+1}, so is m_1, but m_0 is not, as y_j^{t+1} = 0. Thus the prediction 1 can only be wrong if x_j is not relevant. At (3.3) H(j) is not satisfied by Y^{t+1}. Assume that x_j ∈ m_0 \ m_1, whence H(j) ⊇ m_1. As m_0 is not satisfied by either Y^t or Y^{t+1}, a mistake occurs only if Y^{t+1} satisfies m_1. As shown for (2.3), one can remove variables from H(j). Also, in case that x_j is not relevant, variables are removed from H(j), but in this case the set H(j) need not be a superset of a monomial, not even of the intersection of the monomials. In (3.4) we know for every j ≠ i that, if y_j^t = 0, then x_j is not relevant.
This means that Y^{t+1} satisfies all relevant variables, whence the prediction 0 is correct.

S4. Recall that S ⊆ m_0 ∩ m_1, whence we cannot make a mistake at (4.1). We claim that a mistake at (4.2) can only happen if x_i ∈ m_0 ∩ m_1. At this point we know that x_i is a relevant variable (x_i ∈ V) and that Y^t satisfies E(i). Assume that x_i is in exactly one monomial, say x_i ∈ m_0 \ m_1. Then E(i) ⊇ m_0, whence Y^t must satisfy m_0, and also m_1, because c(Y^t) = 0. Switching x_i to zero will leave m_1 satisfied, but not m_0. Hence c(Y^{t+1}) = 1 as predicted. Thus an error at (4.2) shows that x_i ∈ m_0 ∩ m_1, and we can add it to S. Now consider (4.3). We know that x_i is relevant. Clearly, if x_i ∈ m_0 ∩ m_1 then the prediction is correct. Thus, a mistake occurs only if (w.l.o.g.) x_i ∈ m_0 \ m_1, whence E(i) ⊇ m_0. As y_i^{t+1} = 0, we know that Y^{t+1} does not satisfy m_0. If c(Y^{t+1}) = 1, then Y^{t+1} satisfies m_1. But then Y^t must have satisfied both m_0 and m_1. On the other hand, Y^t did not satisfy E(i), whence E(i) contains variables which are not in m_0, and some of them are not satisfied by Y^t. Intersecting E(i) and Y^t removes
S1: c(Y^t) = 1, y_i^t = 0 and y_i^{t+1} = 1.
      IF E(i) is satisfied by Y^{t+1}
(1.1)     THEN predict 0; IF mistake: remove x_i from R
(1.2)     ELSE predict 1; IF mistake: E(i) := E(i) ∩ Y^{t+1}

S2: c(Y^t) = 1, y_i^t = 1 and y_i^{t+1} = 0.
      IF E(i) is satisfied by Y^t
(2.1)     THEN predict 0; IF mistake: remove x_i from R
      ELSE IF H(i) is satisfied by Y^{t+1}
(2.2)     THEN predict 1; IF mistake: remove x_i from R
(2.3)     ELSE predict 0; IF mistake: H(i) := H(i) ∩ Y^{t+1}

S3: c(Y^t) = 0, y_i^t = 0 and y_i^{t+1} = 1.
      IF there is a j ≠ i such that y_j^t = 0 and x_j ∈ R
          THEN IF x_j ∈ T
(3.1)         THEN predict 0; IF mistake: remove x_j from T
          ELSE IF H(j) is satisfied by Y^{t+1}
(3.2)         THEN predict 1; IF mistake: remove x_j from R
(3.3)         ELSE predict 0; IF mistake: H(j) := H(j) ∩ Y^{t+1}
(3.4) ELSE predict 0

S4: c(Y^t) = 0, y_i^t = 1 and y_i^{t+1} = 0.
      IF x_i ∈ S
(4.1)     THEN predict 0
      ELSE IF x_i ∈ V
          THEN IF E(i) is satisfied by Y^t
(4.2)         THEN predict 1; IF mistake: add x_i to S
(4.3)         ELSE predict 0; IF mistake: E(i) := E(i) ∩ Y^t
(4.4) ELSE predict 0; IF mistake: add x_i to V
Figure 1: Algorithm for learning 2-term RSE

those variables. Finally, consider (4.4). At this point we do not know whether x_i is relevant, since x_i ∉ V. Predicting the same value c(Y^t) for c(Y^{t+1}) can only result in a mistake if x_i is relevant, whence it can be added to V. Examining all updates one can see that the conditions on the sets are always maintained; e.g., one always has V ⊆ m_0 ∪ m_1 ⊆ R and x_i ∈ m_0 \ m_1 ⇒ [H(i) ⊇ m_1 and E(i) ⊇ m_0]. Note that every mistake results in an update of some set. Every update either removes at least one element from at least one of the sets R, T, E(i), and H(i), or adds at least one element to V or S. As the sizes of all these sets are bounded between 0 and n, and there are O(n) sets, there can only be O(n²) updates and hence O(n²) mistakes.
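The counting argument can be phrased as a potential function (our illustration, not code from the paper): the quantity below starts at 2n² + 4n, never goes below 0, and drops by at least 1 on every mistake.

```python
def potential(E, H, T, R, V, S, n):
    """Illustrative potential for the O(n^2) mistake bound: every update
    removes an element from some E(i), H(i), T, or R, or inserts one
    into V or S, so each mistake decreases this quantity by >= 1."""
    return (sum(len(e) for e in E.values())
            + sum(len(h) for h in H.values())
            + len(T) + len(R)
            + (n - len(V)) + (n - len(S)))

def initial_state(n):
    """Initial configuration: E(i) = H(i) = T = R = X, V = S = empty."""
    X = set(range(n))
    E = {i: set(X) for i in range(n)}
    H = {i: set(X) for i in range(n)}
    return E, H, set(X), set(X), set(), set()
```

Since the initial potential is n² + n² + n + n + n + n = 2n² + 4n, the learner makes O(n²) mistakes.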
Theorem 16 2-term RSE is exactly mistake bound learnable with O(n^2) mistakes.
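The case analysis above can be made concrete. The following is a minimal Python sketch of case S4 of Figure 1, checking the claim of the proof on one small instance; the names (sat, predict_S4, etc.) are ours rather than the paper's, and monomials are represented as sets of variable indices.

```python
def sat(y, mono):
    """A monomial (set of variable indices) is satisfied iff all its
    variables are 1 in the assignment y."""
    return all(y[v] == 1 for v in mono)

def c(y, m0, m1):
    return int(sat(y, m0) ^ sat(y, m1))   # 2-term RSE: m0 XOR m1

def predict_S4(i, y_t, S, V, E):
    """Prediction for Y^{t+1} when c(Y^t) = 0 and bit i flips 1 -> 0."""
    if i in S:                             # (4.1): S is a subset of m0 AND m1
        return 0
    if i in V:                             # i is known to be relevant
        return 1 if sat(y_t, E[i]) else 0  # (4.2) vs (4.3)
    return 0                               # (4.4)

m0, m1 = {0, 1}, {1, 2}                    # target c = m0 XOR m1
i = 0                                      # x1 lies in m0 only
y_t = [1, 1, 1]                            # satisfies both monomials, so c(Y^t) = 0
assert c(y_t, m0, m1) == 0
E = {i: m0 | {2}}                          # invariant E(i) >= m0 holds
# Y^t satisfies E(i), so case (4.2) predicts 1 ...
assert predict_S4(i, y_t, S=set(), V={i}, E=E) == 1
# ... and the prediction is correct: after the flip only m1 is satisfied.
y_t1 = [0, 1, 1]
assert c(y_t1, m0, m1) == 1
```

The sketch mirrors the argument for (4.2): since E(i) ⊇ m0 and c(Y^t) = 0, the example Y^t satisfies both monomials, and flipping a variable private to m0 leaves exactly one of them satisfied.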
6 PROBABILISTICALLY LEARNING 2-TERM DNF

In this section we consider probabilistic mistake bound learning from a uniform random walk on the Boolean cube. We show that 2-term DNF is learnable in this setting. Again we note that this class is not properly learnable in the PAC model [13], but it is with membership queries (see [4], e.g.) or under product distributions (see [7]). As above, we omit an explicit definition of the hypothesis class; instead the hypotheses are implicitly described in the algorithm. For the description of the algorithm we use the notation of the previous section. Let c = m0 ∨ m1 be the target concept, where m0 and m1 are monomials over X = {x1, x̄1, ..., xn, x̄n}, and x̄i denotes the negation of the variable xi. Again we define the following sets. E(i) is initialized to X for i = 1, ..., 2n. Then elements are removed, while the following conditions are always maintained: if xi occurs in exactly one monomial mj of c, then E(i) ⊇ mj; if xi occurs in both monomials, then E(i) ⊇ m0 ∩ m1. The same
S1: c(Y^t) = 0, y_i^t = 0, y_i^{t+1} = 1 and xi ∉ S.
      IF E(i) is satisfied by Y^t
(1.1)   THEN predict 1; IF mistake: S := S ∪ {xi}
(1.2)   ELSE predict 0; IF mistake: E(i) := E(i) ∩ Y^{t+1}

S2: c(Y^t) = 0, y_i^t = 0, y_i^{t+1} = 1 and xi ∈ S.
      IF E(j) is satisfied by Y^t for some j with xj ∉ S or x_{j−n} ∉ S
(2.1)   THEN predict 1; IF mistake: S := S ∪ {xj} resp. S := S ∪ {x_{j−n}}
(2.2)   ELSE predict 0; IF mistake: do nothing

S3: c(Y^t) = 1, y_i^t = 1, y_i^{t+1} = 0 and xi ∉ S.
      IF there is a j with E(j) satisfied by Y^t
(3.1)   THEN predict 1; IF mistake: S := S ∪ {xj}
(3.2)   ELSE predict 0; IF mistake: do nothing

S4: c(Y^t) = 1, y_i^t = 1, y_i^{t+1} = 0 and xi ∈ S.
      predict 0
Figure 2: Algorithm for probabilistically learning 2-term DNF

holds for x̄i and the set E(n + i). S is initialized to ∅. Then elements are inserted, while the following condition is always maintained: S ⊆ m0 ∩ m1. The prediction strategy of the probabilistic mistake bound algorithm L for determining the label of Y^{t+1} is given in Figure 2 (we describe it only for a flip of xi; the same holds for a flip of x̄i). The analysis of this prediction strategy often uses arguments analogous to those for learning 2-term RSE (see Section 5). Therefore we only sketch it; a more detailed description can be found in the full paper. The main idea of this strategy is that in each monomial of c there exists a literal li, i.e., xi or x̄i, which is contained in one monomial but not in the other. We call these literals the private literals of m0 resp. m1. Since the updates of E(i) for a private literal li always maintain that E(i) ⊇ mj (this can be shown using arguments similar to those in Section 5), after a sufficient number of updates the set E(i) is equal to the corresponding monomial mj. Moreover, updates of E(i) can only take place if the corresponding literal is relevant (i.e., belongs to m0 ∪ m1). Thus, for all literals not belonging to m0 ∪ m1 it must be that E(i) = X, and therefore no such literal can ever be inserted into S (in cases S1, S2 or S3). From the above it follows that S ⊆ m0 ∩ m1 is always maintained, and case S4 will never cause a mistake. We only have to show that the number of mistakes is bounded by a polynomial p(n, 1/δ) with probability at least 1 − δ. In contrast to the above algorithm for learning 2-term RSE, we do not always perform an update when a mistake occurs (see cases S2 and S3). But we can use the following probabilistic argument: if a mistake occurred in case S2 and we made no update, then there
is a probability of at least 1/(n+1)^3 that we will get some more information from subsequent examples. This follows from the fact that the target function in that case produces a positive label. Thus, we need at most 2 bit flips of private literals (one for each monomial) to get a negatively labeled example. In the next step there exists a flip which causes an update of a set E(i) or of the set S. As a result, if a mistake occurs in case S2 without an update, then with high probability there will be only O(n^3) mistakes before the next update. The same is true if a mistake occurs without an update in case S3. Clearly, after a sufficient number of updates of the sets E(i), some of these sets contain one of the monomials of the target concept, some of these sets are still equal to X, and the rest contain variables known to be in S = m1 ∩ m0. Thus, we have a probabilistic mistake bound algorithm which, with probability 1 − δ, makes at most p(n, 1/δ) mistakes (where p is a suitable polynomial) on a uniform random walk over the Boolean hypercube and then has exactly identified the target concept.
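The convergence of the sets E(i) claimed above can be illustrated in a few lines. The sketch below feeds a stream of positive examples of m0 to the update E(i) := E(i) ∩ Y and checks that E(i) always keeps m0 (the invariant E(i) ⊇ m0) and eventually equals it. The random-walk dynamics are deliberately simplified to an i.i.d. stream, and all names are ours, not the paper's.

```python
import random

# Sketch: why E(i) for a private literal of m0 converges to m0.  Every
# update intersects E(i) with an example known to satisfy m0, so the
# literals of m0 always survive, while every other literal is removed
# as soon as some example falsifies it.  Literal l < n stands for
# x_{l+1}; literal n + l stands for its negation.
rng = random.Random(1)
n = 6
m0 = {0, 1, n + 2}                    # x1 AND x2 AND (NOT x3)
E = set(range(2 * n))                 # E(i) is initialized to X

def example_satisfying(m, n, rng):
    """A random assignment forced to satisfy the monomial m."""
    y = [rng.randint(0, 1) for _ in range(n)]
    for l in m:
        if l < n:
            y[l] = 1
        else:
            y[l - n] = 0
    return y

for _ in range(200):
    y = example_satisfying(m0, n, rng)
    # the update E(i) := E(i) /\ Y keeps only the literals satisfied by y:
    E = {l for l in E if (y[l] == 1 if l < n else y[l - n] == 0)}
    assert m0 <= E                    # the invariant E(i) >= m0 never breaks

assert E == m0                        # after enough updates, E(i) equals m0
```

Each irrelevant literal survives a single update only with probability 1/2, so after a modest number of updates E(i) has collapsed to the monomial itself.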
Theorem 17 Let Pwalk = ∪n Pwalk,n be the class of uniform random walks on {0,1}^n, as defined in Section 2. Then 2-term DNF is exactly probably mistake bound learnable from Pwalk.
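For intuition, the example-generating process can be sketched as follows. We assume here the common "lazy" variant in which each step chooses uniformly among n + 1 actions (flip one of the n bits, or stay put), which matches the 1/(n+1)^3 factor used in the analysis above; the authoritative definition of Pwalk is the one in Section 2, and the function name below is ours.

```python
import random

def uniform_random_walk(n, steps, seed=0):
    """A uniform random walk on {0,1}^n: each step chooses uniformly
    among n + 1 actions -- flip bit i (for some i) or stay put.  This
    lazy variant is an assumption, not the paper's exact definition."""
    rng = random.Random(seed)
    y = [rng.randint(0, 1) for _ in range(n)]     # uniform start point
    walk = [tuple(y)]
    for _ in range(steps):
        i = rng.randrange(n + 1)                  # n + 1 equiprobable actions
        if i < n:
            y[i] ^= 1                             # flip the chosen bit
        walk.append(tuple(y))
    return walk

walk = uniform_random_walk(8, 100)
# successive examples differ in at most one bit:
assert all(sum(a != b for a, b in zip(u, v)) <= 1
           for u, v in zip(walk, walk[1:]))
```

A learner in this model sees the pairs (Y^t, c(Y^t)) in sequence, so any fixed three-step pattern of flips occurs with probability (1/(n+1))^3, which is the source of the polynomial mistake bound.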
7 CONCLUSION

We have introduced an extension of the PAC model to samples generated by stochastic processes. This is of great practical interest because in many applications the examples appear in a sequence in which successive observations are somehow related. In the examples of this paper we have shown how one can exploit such relations to gain information comparable to that gained by
membership queries. We have also shown that it is even possible to find a specific representation of the target concept. This is important in cases where the concept has to be "implemented" in some way. There are a number of natural extensions of this research. It would be worthwhile to investigate geometric learning problems in this setting. In the Boolean case, one can think of many other natural random walks to generate the examples, such as flips of a constant number of bits or coupled bit flips. There are also more complex concept classes which could be examined. Moreover, it seems worthwhile to further investigate the relationship between PAC and query learning from this point of view.
References
[1] D. Aldous and U. Vazirani. A Markovian extension of Valiant's learning model. In Proc. of the 31st Symposium on the Foundations of Comp. Sci., pages 392–396. IEEE Computer Society Press, Los Alamitos, CA, 1990.
[2] P. L. Bartlett. Learning with a slowly changing distribution. In Proc. 5th Annu. Workshop on Comput. Learning Theory, pages 243–252. ACM Press, New York, NY, 1992.
[3] A. Blum. Separating PAC and mistake-bound learning models over the Boolean domain. In Proc. 31st Annu. IEEE Sympos. Found. Comput. Sci., 1990.
[4] A. Blum and S. Rudich. Fast learning of k-term DNF formulas with queries. In Proc. 24th Annu. ACM Sympos. Theory Comput., pages 382–389. ACM Press, New York, NY, 1992.
[5] P. Diaconis, R. Graham, and J. Morrison. Asymptotic analysis of a random walk on a hypercube with many dimensions. Random Structures and Algorithms, 1:51–72, 1990.
[6] P. Fischer and H. Simon. On learning ring-sum expansions. SIAM J. Comput., 21:181–192, 1992.
[7] M. Flammini, A. Marchetti-Spaccamela, and L. Kucera. Learning DNF formulae under classes of probability distributions. In Proc. 5th Annu. Workshop on Comput. Learning Theory, pages 85–92. ACM Press, New York, NY, 1992.
[8] Y. Freund, M. Kearns, D. Ron, R. Rubinfeld, R. Schapire, and L. Sellie. Efficient learning of typical finite automata from random walks. In Proc. 25th Annu. ACM Sympos. Theory Comput., pages 315–324. ACM Press, New York, NY, 1993.
[9] D. Haussler, M. Kearns, N. Littlestone, and M. Warmuth. Equivalence of models for polynomial learnability. Information and Computation, 95:129–161, 1991.
[10] D. Haussler, N. Littlestone, and M. K. Warmuth. Predicting {0,1} functions on randomly drawn points. In Proc. of the 29th Annual IEEE Symposium on Foundations of Computer Science, pages 100–109. IEEE Computer Society Press, 1988.
[11] M. Kearns, L. Pitt, and L. Valiant. Recent results on Boolean concept learning. In Proc. 4th Workshop on Machine Learning, 1987.
[12] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.
[13] L. Pitt and L. Valiant. Computational limitations on learning from examples. J. ACM, 35:965–984, 1988.
[14] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, Nov. 1984.