Simple Learning Algorithms for Decision Trees and Multivariate Polynomials

Nader H. Bshouty
Department of Computer Science, University of Calgary, Calgary, Alberta, Canada

Yishay Mansour
Department of Computer Science, Tel-Aviv University, Tel-Aviv, Israel

Abstract

In this paper we develop a new approach for learning decision trees and multivariate polynomials via interpolation of multivariate polynomials. This new approach yields simple learning algorithms for multivariate polynomials and decision trees over finite fields under any constant bounded product distribution. The output hypothesis is a (single) multivariate polynomial that is an ε-approximation of the target under any constant bounded product distribution. The new approach demonstrates the learnability of many classes under any constant bounded product distribution using membership queries, such as j-disjoint DNFs and multivariate polynomials with bounded degree over any field. The technique shows how to interpolate multivariate polynomials with bounded term size from membership queries only. In particular, this gives a learning algorithm for O(log n)-depth decision trees from membership queries only and a new learning algorithm for any multivariate polynomial over sufficiently large fields from membership queries only. We show that our results for learning from membership queries only are the best possible.

1 Introduction

From the start of computational learning theory, great emphasis has been put on developing algorithmic techniques for various problems. It seems that the greatest progress has been made in learning using membership queries, especially for such functions as decision trees and multivariate polynomials. Generally speaking, three different techniques were developed for those tasks: the Fourier transform technique, the lattice based techniques and the Multiplicity Automata technique. All the techniques use membership queries (also called substitution queries for nonbinary fields). The Fourier transform technique is based on representing functions using a basis, where a basis function is essentially a parity of a subset of the input. Any function can be represented as a linear combination of the basis functions. Kushilevitz and Mansour [KM93] gave a general

technique to recover the significant coefficients. They showed that this is sufficient for learning decision trees under the uniform distribution. Jackson [J94] extended the result to learning DNF under the uniform distribution. The output hypothesis is a majority of parities. (Also, Jackson [J95] generalizes his DNF learning algorithm from the uniform distribution to any fixed constant bounded product distribution.) The lattice based techniques, at a very high level, perform a traversal of the binary cube, moving from one node to its neighbor in order to reach some goal node. Angluin [A88] gave the first lattice based algorithm for learning monotone DNF. Bshouty [Bs93] developed the monotone theory, which gives a technique for learning decision trees under any distribution. (The output hypothesis in that case is a depth 3 formula.) Schapire and Sellie [SS93] gave a lattice based algorithm for learning multivariate polynomials over a finite field under any distribution. (Their algorithm depends polynomially on the size of the monotone polynomial that describes the function.) Multiplicity Automata theory is a well studied field in Automata theory. Recently, some very interesting connections were given, connecting learning such automata and learning decision trees and multivariate polynomials. Ohnishi, Seki and Kasami [OSK94] and Bergadano and Varricchio [BV96] gave an algorithm for learning Multiplicity Automata. Based on this work, Catalano and Varricchio [BCV96] show that this algorithm learns disjoint DNF. Then Beimel et al. [BBB+96] gave an algorithm that is based on Hankel matrix theory for learning Multiplicity Automata and show that multivariate polynomials over any field are learnable in polynomial time. (In all the above algorithms the output hypothesis is a Multiplicity Automaton.) All techniques, the Fourier Spectrum, the Lattice based and the Multiplicity Automata algorithms, also give learnability of many other classes, such as learning decision trees over parities (nodes contain parities) under constant bounded product distributions, learning CDNF (poly size DNF that has a poly size CNF) under any distribution and learning j-disjoint DNF (DNF where the intersection of any j terms is 0). In this paper we develop a new approach for learning decision trees and multivariate polynomials via interpolation of multivariate polynomials over GF(2). This new approach leads to simple learning algorithms for decision trees over the uniform and constant bounded product distributions, where the output hypothesis is a multivariate polynomial (parity of monotone terms). The algorithm we develop gives a single hypothesis that approximates the target with respect to any constant bounded product distribution. In fact the hypothesis is a good hypothesis under any distribution that supports small terms, that is, any distribution D where for a term T of size ω(log n) we have Pr_D[T = 1] = 1/ω(poly(n)). Previous algorithms do not achieve this property. It is also known that any DNF is learnable with membership queries under constant bounded product distributions [J95], where the output hypothesis is a majority of parities. Our contribution for j-disjoint DNF is to use an output hypothesis that is a parity of terms and to show that the output hypothesis is an ε-approximation of the target against any constant bounded distribution. We also study the learnability of multivariate polynomials from membership queries only.
We give a learning algorithm for multivariate polynomials over n variables with maximal degree d < c|F| for each variable, where c < 1 is a constant, and with terms of size

k = O( (|F|/d)(log n + log d) ),

using only membership queries. This result implies learning decision trees of depth O(log n) with leaves from a field F from membership queries only. This result is a generalization of the results in [B95b] and [RB89], where the learning algorithm uses membership and equivalence queries in the former and only membership queries in the latter. The second result is a generalization of the result in [KM93] for learning Boolean decision trees from membership queries. The above result also gives an algorithm for learning any multivariate polynomial over fields of size q = Ω(nd/(log n + log d)) from membership queries only. This result is a generalization of the results in [BT88, CDG+91, Z90] for learning multivariate polynomials over any field. Previous algorithms for learning multivariate polynomials over finite fields F require asking membership queries with assignments in some extension of the field F [CDG+91]. In [CDG+91] it is shown that an extension of degree n of the field is sufficient to interpolate any multivariate polynomial (when membership queries with assignments from an extension field are allowed). The organization of the paper is as follows. In Section 2 we define the learning model and the concept classes. In Section 3 we give the algorithm for learning multivariate polynomials over the Boolean domain. In Section 4 we give some background for multivariate interpolation. In Section 5 we show how to reduce learning multivariate polynomials to zero testing and to other problems. Then in Section 6 we give the algorithm for zero testing and also give a lower bound for zero testing multivariate polynomials.

2 The Learning Model and Concept Classes

2.1 Learning Models

The learning criteria we consider are exact learning [A88] and PAC-learning [Val84]. In the exact learning model there is a function f called the target function, f : F^n → F, which is a member of a class of functions C defined over the variable set V_n = {x_1, …, x_n} for some field F. The goal of the learning algorithm is to output a formula h that is equivalent to f. The learning algorithm performs a membership query (also called a substitution query for nonbinary fields) by supplying an assignment a to the variables in V_n = {x_1, …, x_n} as input to a membership oracle and receives in return the value of f(a). For our algorithms we will regard this oracle as a procedure MQ_f(·); its input is an assignment a and its output is MQ_f(a) = f(a). The learning algorithm performs an equivalence query by supplying any function h as input to an equivalence oracle, with the oracle returning either "YES", signifying that h is equivalent to f, or a counterexample, which is an assignment b such that h(b) ≠ f(b). For our algorithms we will regard this oracle as a procedure EQ_f(h). We say the hypothesis class of the learning algorithm is H if the algorithm supplies the equivalence oracle with functions from H.

We say that a class of Boolean functions C is exactly learnable in polynomial time if for any f ∈ C over V_n there is an algorithm that runs in polynomial time, asks a polynomial number of queries (polynomial in n and in the size of the target function) and outputs a hypothesis h that is equivalent to f. The PAC learning model is as follows. There is a function f called the target function which is a member of a class of functions C defined over the variable set V_n = {x_1, …, x_n}. There is a distribution D defined over the domain F^n. The goal of the learning algorithm is to output a formula h that is ε-close to f with respect to some distribution D, that is,

Pr_D[f(x) = h(x)] ≥ 1 − ε.

The function h is called an ε-approximation of f with respect to the distribution D. In the PAC or example query model, the learning algorithm asks for an example from the example oracle and receives an example (a, f(a)), where a is chosen from {0,1}^n according to the distribution D. We say that a class of Boolean functions C is PAC learnable under the distribution D in polynomial time if for any f ∈ C over V_n there is an algorithm that runs in polynomial time, asks a polynomial number of queries (polynomial in n, 1/ε, 1/δ and the size of the target function) and with probability at least 1 − δ outputs a hypothesis h that is an ε-approximation of f with respect to the distribution D. It is known from [A88] that if a class is exactly learnable in polynomial time from equivalence queries and membership queries then it is PAC learnable with membership queries in polynomial time under any distribution D. Let D be a set of distributions. We say that C is PAC learnable under D if there is a PAC-learning algorithm for C such that for any distribution D ∈ D unknown to the learner and for any f ∈ C the learning algorithm runs in polynomial time and outputs a hypothesis h that is an ε-approximation of f under any distribution D′ ∈ D.

2.2 The Concept Classes and Distributions

A function over a field F is a function f : X → F for some set X. All classes considered in this paper are classes of functions where X = F^n. The elements of F^n are called assignments. We will consider the set of variables V_n = {x_1, …, x_n}, where x_i describes the value of the i-th projection of an assignment in the domain F^n of f. For an assignment a, the i-th entry of a will be denoted by a_i. A literal is a nonconstant polynomial p(x_i). A monotone literal is x_i^r for some nonnegative integer r. A term (monotone term) is a product of literals (monotone literals). A multivariate polynomial is a linear combination of monotone terms. A multivariate polynomial with nonmonotone terms is a linear combination of terms. The degree of a literal p(x_i) is the degree of the polynomial p. The size of a term p_{i_1}(x_{i_1}) ⋯ p_{i_k}(x_{i_k}) is k. Let MUL_F(n, k, t, d) be the set of all multivariate polynomials over the field F over n variables with at most t monotone terms, where each term is of size at most k and each monotone literal is of degree at most d. For the binary field B the degree is at most d = 1, so we will use MUL(n, k, t). MUL*_F(n, k, t, d) will be the set of all multivariate polynomials with nonmonotone terms with the above properties. We use MUL*(n, k, t) when the field is the binary field. Throughout the paper we will assume that t ≥ n. Since every term in MUL*_F(n, k, t, d) can be written as a multivariate polynomial in MUL_F(n, k, (d + 1)^k, d), we have

Proposition 1

MUL*_F(n, k, t, d) ⊆ MUL_F(n, k, t(d + 1)^k, d).
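For instance, over the binary field a single nonmonotone term of size k expands into at most 2^k monotone terms: (1 + x_1)(1 + x_2) = 1 + x_1 + x_2 + x_1x_2, so one term of size k = 2 becomes (d + 1)^k = 2^2 = 4 monotone terms.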

For the Boolean field B = {0, 1}, a DNF (disjunctive normal form) is a disjunction of terms. A j-disjoint DNF is a DNF in which the conjunction of any j terms is 0. A k-DNF is a DNF with terms of at most k literals. A decision tree (with leaves from some field F) over V_n is a binary tree whose nodes are labeled with variables from V_n and whose leaves are labeled with constants from F. Each decision tree T represents a function f_T : {0,1}^n → F. To compute f_T(a) we start from the root of the tree T: if the root is labeled with x_i then f_T(a) = f_{T_R}(a) if a_i = 1, where T_R is the right subtree of the root (i.e., the subtree rooted at the right child of the root with all its descendants). Otherwise (when a_i = 0), f_T(a) = f_{T_L}(a), where T_L is the left subtree of the root. If T is a leaf then f_T(a) is the label of this leaf. It is not hard to see that a Boolean decision tree of depth k can be represented in MUL*(n, k, 2^k) (each leaf in the decision tree defines a term and the function is the sum of all terms), and that a j-disjoint k-DNF of size t can be represented in MUL*(n, k(j − 1), t^{j−1}). (See for example [K94].) So for constant j and k = O(log n) the number of terms is polynomial. For a DNF or a multivariate polynomial f, we define size(f) to be the number of terms in f. For a decision tree the size will be the number of leaves in the tree. A product distribution is a distribution D that satisfies D(a_1, …, a_n) = Π_i D_i(a_i) for some distributions D_i on F. A product distribution is fixed constant bounded if there is a constant 0 < c < 1/2, independent of the number of variables n, such that for any variable x_i, c ≤ Prob[x_i = 1] ≤ 1 − c. A distribution D supports small terms if for every term T of size ω(log n) we have Pr_D[T = 1] = 1/ω(poly(n)), where n is the number of variables.

3 Simple Algorithm for the Boolean Domain

In this section we give an algorithm that PAC-learns with membership queries MUL*(n, n, t) under any distribution that supports small terms, in time polynomial in n and t. We remind the reader that we assume t ≥ n. All the algorithms in the paper run in polynomial time also when t < n.

3.1 Zero-testing MUL(n, k, t)

We first show how to zero-test elements of MUL(n, k, t) in time polynomial in n and 2^k, assuming k is known to the learner. The algorithm runs in polynomial time for k = O(log n). Let f ∈ MUL(n, k, t). Choose a term T = x_{i_1} ⋯ x_{i_j}, j ≤ k, of maximal size in f. Choose any values from {0, 1} for the variables not in T. The projection will not be the zero function because the term T stays alive in the projection. Since the projection is a nonzero function with j ≤ k variables, there is at least one assignment for x_{i_1}, …, x_{i_j} that gives value 1 for the function. This shows that for a random and uniform assignment a, f(a) = 1 with probability at least 1/2^j ≥ 1/2^k. So to zero-test a function f ∈ MUL(n, k, t), randomly and uniformly choose a polynomial number of assignments a_i. If f(a_i) is zero for all the assignments then with high probability we have f ≡ 0. From the above we have

Claim 2 For any f ∈ MUL(n, k, t) with f ≢ 0, the probability that m = 2^k log(1/δ) randomly chosen elements a_1, …, a_m in {0,1}^n satisfy f(a_1) = ⋯ = f(a_m) = 0 is at most δ.

This implies

Claim 3 For f ∈ MUL(n, O(log t), t), there is a polynomial time probabilistic zero testing algorithm that succeeds with high probability.
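The zero test of this subsection amounts to random evaluation. The following sketch is our illustration, not the paper's code; it assumes a membership oracle mq with mq(a) = f(a) for a ∈ {0,1}^n.

```python
import math
import random

def probably_zero(mq, n, k, delta):
    """Randomized zero test for f in MUL(n, k, t).

    Draws m = 2^k log(1/delta) uniform points (Claim 2); if f vanishes on all of
    them, reports that f is identically zero (correct with probability >= 1 - delta).
    """
    m = int(2 ** k * math.log(1 / delta)) + 1
    for _ in range(m):
        a = [random.randint(0, 1) for _ in range(n)]
        if mq(a) != 0:
            return False   # witness found: f is certainly nonzero
    return True            # no witness: f == 0 with high probability
```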

3.2 Learning MUL(n, k, t)

We now show how to reduce learning to zero-testing. Let f ∈ MUL(n, k, t). We first show how to find one term in f. If f(0) = 1 then we know that T = 1 is a term in f. If f(0) = 0 then let f_0 = f. Since we can zero-test, we can find the minimal i_1 such that f_0|_{x_1=0,…,x_{i_1}=0} ≡ 0. This implies that f_0|_{x_1=0,…,x_{i_1−1}=0} = x_{i_1} f_1(x_{i_1+1}, …, x_n) for some multivariate polynomial f_1. If f_1(0) = 1 then we know that T = x_{i_1} is a term in f. We continue recursively with f_1 = f_0|_{x_1=0,…,x_{i_1−1}=0, x_{i_1}=1} until f_j(0) = 1, in which case T = x_{i_1} ⋯ x_{i_j} is a term in f. After we find a term T we define f̂ = f + T. This removes the term T from f, and thus f̂ ∈ MUL(n, k, t − 1). We continue recursively with f̂ until we recover all the terms of f. Membership queries for f̂ can be simulated by membership queries for f because MQ_{f̂}(a) = MQ_f(a) + T(a). The complexity of the interpolation is nt calls to the zero testing procedure. This gives

Claim 4 For f ∈ MUL(n, k, t), there is an algorithm that, with probability at least 1 − δ, learns f with O(2^k · nt · log(nt/δ)) membership queries.

In particular this gives,

Claim 5 For f ∈ MUL(n, O(log t), t), there is a polynomial time probabilistic interpolation algorithm that, with high probability, succeeds in learning f from membership queries.
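The interpolation loop of this subsection can be sketched as follows; find_term follows the same restriction idea (force variables to 0 one by one, and to 1 whenever zeroing them kills the function), and the names and interfaces are our illustrative choices rather than the paper's.

```python
def find_term(zero_test, n):
    """Recover one monotone term of a nonzero f in MUL(n, k, t) over GF(2).

    zero_test(assign): probabilistic test (Subsection 3.1) of whether f is identically
    zero after fixing the variables listed in the dict `assign`.
    Returns the set of variable indices of the term.
    """
    assign, term = {}, set()
    for i in range(n):
        assign[i] = 0
        if zero_test(assign):   # zeroing x_i kills f, so every surviving term contains x_i
            assign[i] = 1
            term.add(i)
    return term

def interpolate(mq, zero_test_for, n, max_terms):
    """Recover all monotone terms of f: find a term T, replace f by f + T, repeat.

    zero_test_for(oracle, assign): zero test for the polynomial computed by `oracle`
    under the partial assignment `assign`.
    """
    terms = []
    def mq_hat(a):              # oracle for f plus the terms found so far (addition over GF(2))
        v = mq(a)
        for T in terms:
            v ^= int(all(a[i] == 1 for i in T))
        return v
    for _ in range(max_terms):
        if zero_test_for(mq_hat, {}):   # residual polynomial is (probably) zero: done
            break
        terms.append(frozenset(find_term(lambda assign: zero_test_for(mq_hat, assign), n)))
    return terms
```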

3.3 Learning MUL*(n, n, t)

We now give a PAC-learning algorithm that learns MUL*(n, n, t) under any distribution that supports small terms. We first give the idea of the algorithm; the formal proof follows Theorem 1.

Let f ∈ MUL*(n, n, t). To PAC-learn f we randomly choose an assignment a and define f′ = f(x + a). A term in f of size k will have on average k/2 monotone literals in f′, and terms with k = Ω(log t) variables will, with high probability, have Ω(k) monotone literals. We then perform a zero-restriction, i.e., for each i, with probability 1/2 we substitute x_i ← 0 in f′. Since a term of size k in f has on average k/2 monotone literals after the first shift f(x + a), in the second restriction this term will be zero with probability (about) 1 − 2^{−k}. This probability is greater than 1 − 1/poly(t) for k = Ω(log t). Therefore with high probability all the terms of size more than O(log t) will be removed by the second restriction. This ensures that with high probability the projection f″ is in MUL*(n, O(log t), t), and therefore by Proposition 1, f″ ∈ MUL(n, O(log t), poly(t)). Now we can use the algorithm of Subsection 3.2 to learn f″. Notice that for a multivariate polynomial h (with monotone terms), a zero-restriction only deletes some of the monotone terms from h; therefore, the monotone terms of f″ are monotone terms of f′. We continue to take zero-restrictions and collect terms of f′ until the sum of the terms that appear in at least one restriction defines a multivariate polynomial which is a good approximation of f′. We get a good approximation of f′ with respect to any distribution that supports small terms since we collect all the small (i.e., O(log t)) size terms.

Theorem 1 There is a polynomial time probabilistic PAC-learning algorithm with membership queries that learns MUL*(n, n, t) under any distribution that supports small terms.

We now prove that the algorithm sketched above PAC-learns with membership queries any multivariate polynomial with nonmonotone terms under distributions that support small terms. For the analysis of the correctness of the algorithm we first need to formalize the notion of distributions that support small terms. The following is one way to define this notion.

Definition 1 Let D_{c,t,ε} be the set of distributions that satisfy the following: for every D ∈ D_{c,t,ε} and any DNF f with t terms, each of size greater than c log(t/ε), we have Pr_D[f = 1] ≤ ε.

Notice that all the constant bounded product distributions D where 1 − d ≤ Pr_D[x_i = 1] ≤ d for all i are in D_{1/log(1/d), t, ε}. In what follows we will assume that c ≥ 2 and ε < 1/2. We will use the Chernoff bound (see [ASE]).

Lemma 6 Let X_1, …, X_m ∈ {0, 1} be independent random variables with Pr[X_i = 1] = 1/2. Then for any a we have

Pr[ X_1 + ⋯ + X_m ≤ m/2 − a ] ≤ 2^{−2a²/m}.

Let f = T_1 + ⋯ + T_t be a multivariate polynomial where T_1, …, T_t are terms and |T_1| ≤ |T_2| ≤ ⋯ ≤ |T_t|. Our algorithm starts by choosing a random assignment a and defines f′(x) = f(x + a). All terms of size s (in f′) will contain on average s/2 monotone literals. Therefore by the Chernoff bound we have

Proof. Let T be any term of size c log(t=). Let P (T ) be the number of monotone literals in T . We have    c  8 c t t ? 8  = t  t : Pr P (T ) < 4 c log   2 log

Since the number of terms of f 0 is t and  < 1=2 the result follows.2 With probability at least 1=2 all the terms of size more than 4c log(t=) will contain at least c log(t=) monotone literals and all terms of size 8c log(t=) will contain at least 2c log(t=) monotone literals. Now we split the function f 0 into 3 functions f , f and f . The function f = T +    + Tt1 will contain all terms that are of size at most 4c log(t=). The function f = Tt1 +    + Tt2 will contain all terms of size between 4c log(t=) and 8c log(t=) and the function f = Tt2 +    + Tt will contain all terms of size more than 8c log(t=). Since f 2 MUL? (n; 4c log(t=); t), by Proposition 1 we have f 2 MUL(n; 4c log(t=); t(t=) c). Similarly, f 2 MUL(n; 8c log(t=); t(t=) c). Our algorithm will nd all the terms in f , some of the terms in f and none of the terms in f . Therefore we will need the following claim. 1

1

2

2

3

1

+1

3

+1

1

8

2

4

1

1

2

3

Claim 8 Let g = f + h where h is a multivariate polynomial that contains some of the terms in f . Then for any D 2 Dc;t; we have Pr [g = 6 f ]  : D 1

2

Proof. The error is Pr [g 6= f ] = Pr [(f + h) + f = 1] = Pr [h + f + f = 1]: D D D 1

2

3

Let

f = T^t1 T~t1 +    + T^t2 T~t2 2 MUL? (n; 8c log(t=); t) where Ti = T^ti T~ti , T^ti is the part of the term that contains monotone literals and T~ti is the part that contains the nonmonotone literals. If T^t1 = xi1    xil and T~t1 = xj1    xjl then notice that when we change T^t1 T~t1 to sum of monotone terms we get T^t1 xjq : 2

+1

+1

+1

+1

+1

+1

X

Y

+1

S fj1 ;:::;jlg

q2S

So every monotone term in f will contain one of the terms T^i , t + 1  i  t . Therefore we can write f = T^t1 f ; +    + T^t2 f ;t2 ?t1 where f ;i are multivariate polynomial with monotone terms. Since h is a multivariate polynomial that contains some of the terms in f we have f + h = T^t1 h ; +    + T^t2 h ;t2 ?t1 . Since jT^ij  c log(t=) for t +1  i  t and jTij  2c log(t=) for i  t + 1, by the de nition of distribution that support small terms we have Pr[(h + f ) + f = 1]  Pr[T^t1 _    _ T^t2 _ Tt2 _    _ Tt]  : 2

2

+1 2 1

1

2

2

2

2

2

+1

21

2

1

2

2

2

3

+1

+1

2 8

The algorithm proceeds as follows. We choose

r = (t/ε)^{4c} ln( 4t(t/ε)^{4c} )

zero-restrictions p_1, …, p_r of f′. Recall that a zero-restriction p of f′ is a function f′(p) in which each x_i, with probability 1/2, is set to 0 and with probability 1/2 remains alive. We will show that with probability at least 1/2 we have the following:

(A) For every term in f_1 there is a restriction p_i such that f_1(p_i) contains this term.

(B) For every i = 1, …, r we have f_3(p_i) ≡ 0.

We will regard A and B as events. Let T_1 be the set of terms of f_1. We know that |T_1| ≤ t(t/ε)^{4c} and every term in T_1 is of size at most 4c log(t/ε). Let T_3 be the set of terms of f_3. We know that the number of terms in T_3 is at most t and every term has at least 2c log(t/ε) monotone literals. We have

Pr[not A] = Pr[(∃T ∈ T_1)(∀p_i) T(p_i) = 0]
          ≤ t(t/ε)^{4c} Pr[(∀p_i) T(p_i) = 0]
          ≤ t(t/ε)^{4c} (1 − 2^{−4c log(t/ε)})^r ≤ 1/4,

and, for c ≥ 2,

Pr[not B] = Pr[(∃i)(∃T ∈ T_3) T(p_i) ≠ 0] ≤ rt 2^{−2c log(t/ε)} ≤ 1/4.

2 log(

)

Therefore both events hold together with probability at least 1/2. This shows that with probability at least 1/2 all the projections f(p_i) contain only terms of size at most 8c log(t/ε). Therefore, the algorithm proceeds by learning each projection f(p_i) ∈ MUL(n, 8c log(t/ε), t(t/ε)^{8c}) using the previous algorithm and collecting all the terms of size at most 2c log(t/ε). □

The number of membership queries of the above algorithm is O((t/ε)^k n) for some constant k. For the uniform distribution k ≤ 19. The above algorithm can also be used to learn functions f : {0,1}^n → F of the form f = λ_1 T_1 + ⋯ + λ_t T_t, where λ_i ∈ F, the T_i are terms and + is the addition of the field F. These functions can be computed as follows: for an assignment a, f(a) = Σ_{T_i(a)=1} λ_i. This gives the learnability of decision trees with leaves that contain elements from the field F.
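To summarize Subsection 3.3, the overall procedure (random shift, then many zero-restrictions, each learned by the interpolation algorithm of Subsection 3.2) can be sketched as follows; learn_small stands for that interpolation routine, and the names and parameter choices are our illustrative ones, not the paper's.

```python
import random

def learn_mul_star(mq, n, num_restrictions, learn_small):
    """PAC-learn f in MUL*(n, n, t) under a distribution that supports small terms.

    mq: membership oracle for f.
    learn_small(oracle, n): interpolation algorithm for polynomials whose terms are
    short (Subsection 3.2); returns a collection of monotone terms.
    Returns the random shift a and the collected monotone terms of f'(x) = f(x + a).
    """
    a = [random.randint(0, 1) for _ in range(n)]          # random shift
    def mq_shift(x):                                      # oracle for f'(x) = f(x + a)
        return mq([xi ^ ai for xi, ai in zip(x, a)])
    collected = set()
    for _ in range(num_restrictions):
        zeroed = [random.random() < 0.5 for _ in range(n)]    # zero-restriction pattern
        def mq_restricted(x):
            return mq_shift([0 if z else xi for xi, z in zip(x, zeroed)])
        for term in learn_small(mq_restricted, n):        # w.h.p. only short terms survive
            collected.add(frozenset(term))
    # the hypothesis is the GF(2) sum of all collected terms of f', shifted back by a
    return a, collected
```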


4 Multivariate Interpolation

In this section we show how to generalize the above algorithm to any multivariate polynomial over any field. Let

f = Σ_{α∈I} a_α x_1^{α_1} ⋯ x_n^{α_n}

be a multivariate polynomial over the field F, where a_α ∈ F and α_1, …, α_n are integers. We will denote the class of all multivariate polynomials over the field F and over the variables x_1, …, x_n by F[x_1, …, x_n]. The number of terms of f is denoted by |f|. We have |f| = |I| when all a_α are nonzero. When f = 0 then |f| = 0, and when f = c ∈ F\{0} then |f| = 1. Let d be the maximal degree of the variables in f, i.e., I ⊆ [d]^n where [d] = {0, 1, …, d}. Suppose F′ = {α_0, …, α_d} ⊆ F are d + 1 distinct field constants, where α_0 = 0 is the zero of the field. A univariate polynomial f(x_1) ∈ F[x_1] over the field F of degree at most d can be interpolated from membership queries as follows. Suppose f(x_1) = ξ^{(d)}(f)x_1^d + ⋯ + ξ^{(1)}(f)x_1 + ξ^{(0)}(f), where ξ^{(i)}(f) is the coefficient of x_1^i in f in its polynomial representation. Then

f(α_0) = ξ^{(d)}(f)α_0^d + ⋯ + ξ^{(1)}(f)α_0 + ξ^{(0)}(f)
f(α_1) = ξ^{(d)}(f)α_1^d + ⋯ + ξ^{(1)}(f)α_1 + ξ^{(0)}(f)
   ⋮
f(α_d) = ξ^{(d)}(f)α_d^d + ⋯ + ξ^{(1)}(f)α_d + ξ^{(0)}(f).

(0)

This is a linear system of equations and can be solved for ξ^{(i)}(f) as follows:

ξ^{(i)}(f) = det[ α_j^d ⋯ α_j^{i+1}  f(α_j)  α_j^{i−1} ⋯ 1 ]_{j=0,…,d} / det V(α_0, …, α_d),      (1)

where the numerator is the determinant of the Vandermonde matrix with its column of i-th powers replaced by the values f(α_0), …, f(α_d), and V(α_0, …, α_d) is the Vandermonde matrix. If f is a multivariate polynomial then f can be written as

0

f(x_1, …, x_n) = ξ^{(d)}(f)x_1^d + ⋯ + ξ^{(1)}(f)x_1 + ξ^{(0)}(f),

where ξ^{(i)}(f) is a multivariate polynomial over the variables x_2, …, x_n. We can still use (1) to find ξ^{(i)}(f) by replacing each f(α_i) with f(α_i, x_2, …, x_n). Notice that from the first equation in the system, since α_0 = 0, we have

ξ^{(0)}(f) = f(0, x_2, …, x_n).      (2)
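The following sketch illustrates how a query to a coefficient ξ^{(i)}(f) is answered with d + 1 queries to f by solving the Vandermonde system of equation (1); it works over the rationals for concreteness, a field implementation would replace the arithmetic, and all names are our illustrative choices.

```python
from fractions import Fraction

def coefficient_query(mq, alphas, rest, i):
    """Return xi^(i)(f) at the point `rest`, i.e. the coefficient of x_1^i of f(x_1, rest).

    mq: membership oracle for f; alphas: d+1 distinct field elements with alphas[0] == 0.
    Uses d+1 membership queries f(alpha_j, rest) and solves the Vandermonde system.
    """
    d = len(alphas) - 1
    values = [Fraction(mq([a] + list(rest))) for a in alphas]
    A = [[Fraction(a) ** j for j in range(d + 1)] + [v] for a, v in zip(alphas, values)]
    n = d + 1
    for col in range(n):                       # Gauss-Jordan elimination
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                A[r] = [x - A[r][col] * y for x, y in zip(A[r], A[col])]
    coeffs = [A[r][n] for r in range(n)]       # coeffs[j] = xi^(j)(f) evaluated at `rest`
    return coeffs[i]
```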

From (1), a membership query for ξ^{(i)}(f) can be simulated using d + 1 membership queries to f. From (2), a membership query to ξ^{(0)}(f) can be simulated using one membership query to f. We now extend the ξ operators as follows: for i = (i_1, …, i_k) ∈ [d]^k,

ξ_i = ξ^{(i_k)} ξ^{(i_{k−1})} ⋯ ξ^{(i_1)}.

Here ξ always operates on the variable with the smallest index. So ξ^{(i_1)} operates on x_1 in f to give a function f′ that depends on x_2, …, x_n. Then ξ^{(i_2)} operates on x_2 in f′, and so on. We will also write x^i for the term x_1^{i_1} x_2^{i_2} ⋯ x_k^{i_k}. The weight of i, denoted by wt(i), is the number of nonzero entries in i. The operator ξ^{(i)}(f) gives the coefficient of x_1^i in f when f is represented in F[x_2, …, x_n][x_1]; the operator ξ_i(f) gives the coefficient of x^i when f is represented in

1

F[x_{k+1}, …, x_n][x_1, …, x_k]. Suppose I ⊆ [d]^k is such that ξ_i f ≠ 0 for all i ∈ I and ξ_i f = 0 for all i ∉ I; that is, the x^i for i ∈ I are the k-suffixes of all terms of f. Here the k-suffix of a term x_1^{i_1} ⋯ x_n^{i_n} is x_1^{i_1} ⋯ x_k^{i_k}. Since i ∈ I if and only if x^i is a k-suffix of some term in f, it is clear that |I| ≤ |f|, and we must have

f = Σ_{i∈I} (ξ_i f) x^i.

We now show how to simulate membership queries for (ξ_i f)(x_{k+1}, …, x_n), i ∈ I, using a polynomial number (in n and |f|) of membership queries to f. Suppose we want to find (ξ_i f)(c) for some c ∈ F^{n−k} using membership queries to f. We take r assignments β̂_1, …, β̂_r ∈ F^k and

1

X

+1

ask membership queries for (^ i; c) for all i = 1; : : : ; r. If f (^ i; c) = !i then

i2I (.i f )(c)^ i .. i2I (if )(c)^ ri

8 P > >
> : P

Now if I = fi ; : : : ; ir g and detjM[^ j ; ij ]j 6= 0 for

= ... =

1

! ... : !r 1

1

^i1 M[^ j ; ij ] = ...

^ri1 2 6 6 6 4

1

   ^ir ...

...    ^rir 1

3 7 7 7 5

then the above linear system of equations can be solved in time poly(r) = poly(jIj)  poly(jf j). The solution gives (i f )(c). The existence of ^i where the above determinant is not zero will be proven in the next section.

5 Reducing Learning to Zero-testing (for any Field) In this section we show how to use the results from the previous section to learn multivariate polynomials. Let MULF (n; k; t; d) be the set of all multivariate polynomials over the eld F over n variables with t terms where each term is of size k and the maximal degree of each variable is at most d. We would like to answer the following questions. Let f 2 MULF (n; k; t; d). 11

1. Is there a polynomial time algorithm that uses membership queries to f and decides whether f  0? 2. Given i  n. Is there a polynomial time algorithm that uses membership queries to f and decides whether f depends on xi ? 3. Given fi ; : : : ; ir g  [d]n where wt(ij )  k for all j and r  t. Is there an algorithm that runs in polynomial time and nds ; : : : ; r 2 F k such that 1

1



i    ir 1

...

ri1 1

...



... 6= 0?    rir 1

4. Is there a polynomial time algorithm that uses membership queries to f and identi es f ? When we say polynomial time we usually mean polynomial time in n; k; t and d but all the results of this section hold for any time complexity T if we allow a blow up of poly(n; t) in the complexity. We show that 1,2 and 4 are equivalent and 1 ) 3. Obviously 2 ) 1, 4 ) 1 and 4 ) 2. We will show 1 ) 2, 1 ) 3, and 1 + 2 + 3 ) 4. To prove 1 ) 2 notice that f 2 MULF (n; k; t; d) is independent of xi if and only if g = f jxi ? f jxi  0. Since g is the coecient of xi in f we have g 2 MULF (n; k; t; d). Therefore we can zero-test g in polynomial time. To prove 1 ) 3, let ; : : : ; s be a zero-test for functions in MULF (n; k; t; d), that is, run the algorithm that zero-test for the input 0 and take all the membership queries in the algorithm

; : : : ; s. We now have f 2 MULF (n; k; t; d) is 0 if and only if f ( i) = 0 for all i = 1; : : : ; s. Consider the s  r matrix with rows [ ji1 ; : : : ; jir ]. If this matrix have rank r then we choose r linearly independent rows. If the rank is less than r then its columns are dependent and therefore there are constants ci, i = 1; : : : ; r such that 1

0

1

1

r

X

i=1

ci jii = 0 for j = 1; : : : ; s:

This shows that the multivariate polynomial Pri cixii is 0 for all ; : : : ; s. Since Pri cixii is in MULF (n; k; t; d) we get a contradiction. Now we show that 1+2+3 ) 4. This will use results from the previous section. The algorithm rst checks whether f depends on x , and if yes it generates a tree with a root labeled with x that has d children. The ith child is the tree for i (f ). If the function is independent of x it builds a tree with one child for the root. The child is  (f ). We then recursively build the tree for the children. The previous section shows how to simulate membership queries at each level in polynomial time. This algorithm obviously works and it correctness follows immediately from the previous section and (1)-(3). The complexity of the algorithm is the size of the tree times the membership query simulation. The size of the tree at each level is bounded by the number terms in f , and the depth of the 1

=1

=1

1

1

0

12

1

tree is bounded by n, therefore, the tree has at most O(nt) nonzero nodes. The total number of nodes is at most a factor of d from the nonzero nodes. Thus the algorithm have complexity the same as zero testing with a blow up of poly(n; t; d) queries and time. Now that we have reduced the problem to zero testing we will investigate in the next section the complexity of zero testing of MULF (n; k; t; d).

6 Zero-test of

MUL

F (n; k; t; d)

In this section we will study the zero testing of MULF (n; k; ?; d) when the number of terms is unknown and might be exponentially large. The time complexity for the zero testing should be polynomial in n and d (we have k < n so it is also polynomial in k). We will show the following Theorem 2. The class MULF (n; k; ?; d), where d  cjFj, is zero testable in randomized polynomial time in n, d and t (here t is not the number of terms in the target) for some constant c < 1 if and only if ! jFj k = O d (log n + log d + log t) : The algorithm for the zero testing is simply to randomly and uniformly choose poly(n; d) points ai from F n and query f at ai , and receive f (ai). If for all the points ai , f is zero then with high probability f  0. This theorem implies Theorem 3. The class MULF (n; k; t; d) where d < cjFj for some constant c is learnable in randomized polynomial time (in n, d and t) from membership queries if

k = O jFj d (log n + log d + log t) : !

Proof of Theorem 2 Upper Bound. Let (n; k; d) the maximal possible number of roots

of a multivariate polynomial in MULF (n; k; ?; d). We will show the following facts 1. (n; k; d)  jFjn?k(k; k; d), and 2. (k; k; d)  jFjk ? (jFj ? d)k . 3. (1; 1; d) = d. Both facts implies that if f 6 0, when we randomly uniformly choose an assignment a 2 F n, we have Pr[f (a) 6= 0]  1 ? (n; k; d) a

jFjn

 1 ? (k;jFjk;k d) k (jFj ? d)k  1 ? jFj ?jFj k 13

d  1 ? jFj  e?O( jFjdk )

k

!

For d  cjFj we have that this probability is bounded by poly n;d;t . Therefore the expected running time to detect that f is not 0 is poly(n; d; t). It remain to prove conditions (1) and (2). To prove (1) let f 2 MULF (n; k; ?; d) with maximal number of roots. Let m be a term in f with a maximal number of variables. Suppose, without loss of generality, m = xi1    xikk . For any substitution ak ; : : : ; an of the variables xk ; : : : ; xn the term m will stay alive in the projection g = f jxi ai;i k ;:::;n because it is maximal in f . Since g has at most (k; k; d) roots the result (1) follows. The proof of (2) is similar to the proof of Schwartz [Sch80] and Zippel [Zip79]. Let f 2 MULF (k; k; ?; d). Write f as polynomial in F [x ; : : : ; xd ][x ], 1 (

)

+1

1

+1

= +1

2

1

f = fdxd + fd? xd? +    + f : 1

1

1

1

0

Let t be the number of roots of fd. Since fd 2 MULF (k ? 1; k ? 1; ?; d) we have

t  (k ? 1; k ? 1; d): For jFjk? ? t assignments a for x ; : : : ; xd we have fd(a) 6= 0. For those assignments we get a polynomial in x of degree d that has at most d roots for x . For t assignments a for x ; : : : ; xk we have fd is zero and then the possible values of x (to get a root for f ) is bounded by jFj. This implies 1

2

1

1

2

1

(k; k; d)  d(jFjk? ? t) + tjFj = djFjk? + (jFj ? d)t  djFjk? + (jFj ? d)(k ? 1; k ? 1; d): 1

1

1

2

The theorem follows by induction on k.

Proof of Theorem 2 Lower Bound Let A be a randomized algorithm that zero tests f 2 MULF (n; k; ?; d). Algorithm A asks membership queries to f and if f 6 0 it returns with probability at least 2=3 the answer \NO". If all the membership queries in the algorithm returns 0 the algorithm returns the answer \YES" indicating that f  0. We run the algorithm for f  0. Let D ; : : : ; Dl, l = (dnt) be the distributions that the membership assignments a ; : : : ; al are chosen to zero test f . Notice that if all membership queries answers are 0 while running the algorithm for f  0 it would again choose membership queries according to the distributions D ; : : : ; Dl . Now randomly and uniformly choose i;j 2 F , i = 1; : : : ; p; j = 1; : : : ; d and de ne 1

1

1

f? =

p d

Y Y

(xi ? i;j )

i=1 j =1

14

where p = 2 jFjd (ln n + ln d + ln t). Let I [f (ai)] = 1 if f ?(ai ) 6= 0 and 0 otherwise. Note that for any input a we have that

Ef

1 1 ? jFj

? [I [f ? (a)]] =

pd

!

:

Therefore

Ef ? E 8i ai 2Di f ; gn (

)

01

 Ef ? E 8i ai 2Di f ; gn (

)

"

"

_

i

#

I [f ? (ai)]

X

01

i

#

I [f ?(ai )]

 E 8i ai 2Di f ; gn [ Ef ? I [f ?(ai )]] X

(

2

= l4

 23 :

)

01

1 1 ? jFj

pd

!

i 3

0

5

This shows that there exists f ? 6 0 such that running algorithm A for f ? it will answer the wrong answer \YES" with probability more than 2=3. This is a contradiction. 2

References [A88] D. Angluin. Queries and concept learning. Machine Learning, 2(4):319{342, 1988. [ASE] N. Alon, J. H. Spencer. The probabilistic method. A Wiley-Interscience Publication, (1992). [BT88] M. Ben-Or, P. Tiwari. A deterministic algorithm for sparse multivariate polynomial interpolation In Proceedings of the 20th Annual ACM Symposium on Theory of Computing. pages 301{309, May 1988. [Bs93] N. H. Bshouty, Exact learning of boolean functions via the monotone theory. Information and Computation, 123, pp. 146-153, (1995). [B95b] N. H. Bshouty. A Note on Learning Multivariate Polynomials under the Uniform Distribution. In Proceedings of the Annual ACM Workshop on Computational Learning Theory. 1995. [BBB+96] A. Beimel, F. Bergadano, N. H. Bshouty, E. Kushilevitz, S. Varricchio. On the applications of multiplicity automata in learning. In Proceedings of the 37th Symposium on Foundations of Computer Science. pages 349{358, October 1996. 15

[BCV96] F. Bergadano, D. Catalano, S. Varricchio. Learning sat-k-DNF formulas from membership queries. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing, pages 126{130, 1996. [BV96] F. Bergadano and S. Varricchio. Learning behaviors of automata from multiplicity and equivalence queries. SIAM Journal on Computing, 25(6): 1268{1280, 1996. [CDG+91] M. Clausen, A. Dress, J. Grabmeier, M. Karpinski. On zero-testing and interpolation of k-sparse multivariate polynomials over nite elds. Theoretical Computer Science. 84. pages 151{164, 1991. [J94] J. Jackson. An ecient membership-query algorithm for learning DNF with respect to the uniform distribution. In Proceeding of the 35th Annual Symposium on Foundations of Computer Science, 1994. [J95] J. Jackson. On Learning DNF and related circuit classes from helpfull and not-so-helpful teachers, Ph.D. thesis, CMU, 1995. [K94] R. Khardon. On using the Fourier transform to learn disjoint DNF. Information Processing Letters, 49 (05), pp. 219-222 (1994). [KM93] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. SIAM J. Computing, 22(6):1331{1348, 1993. [Ma92] Y. Mansour. Randomized interpolation and approximation of sparse polynomials. In Automata, Languages and Programming: 19th International Colloquium. pages 261{272, July 1992. (Also: Siam J. on Computing, vol. 2, num. 4, 1995.) [OSK94] H. Ohnishi, H. Seki and T. Kasami. A polynomial time learning algorithm for recognizable series. IEIICE Transactions on Information and Systems, E77-D(10)(5):1077{1085, 1994. [RB89] M. Ron Roth and G. Benedek. Interpolation and approximation of sparse multivariate polynomials over GF(2). SIAM J. Computing, 20(2):291{314, 1991. [Sch80] J. T. Schwartz. Fast probabilistic algorithms for veri cation of polynomial identities. J. of ACM, 27:701{717,1980. [SS93] R. E. Schapire, L. M. Sellie. Learning sparse multivariate polynomial over a eld with queries and counterexamples. In Proceedings of the Sixth Annual ACM Workshop on Computational Learning Theory. July, 1993. [Val84] L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134{1142, November 1984. [Zip79] R. Zippel. Probabilistic algorithms for sparce polynomials. In Proceedings of EUROSAM 79, volume 72 of Lecture Notes in Computer Science, pages 216-226. Springer-Verlag, 1979. 16

[Z90] R. Zippel. Interpolating polynomials from their values. Journal of Symbolic Computation, 9. pages. 375{403. 1990.

17