An Efficient Membership-Query Algorithm for Learning DNF with Respect to the Uniform Distribution

Jeffrey Jackson*
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213

Abstract


We present a membership-query algorithm for efficiently learning DNF with respect to the uniform distribution. In fact, the algorithm properly learns the more general class of functions that are computable as a majority of polynomially-many parity functions. We also describe extensions of this algorithm for learning DNF over certain nonuniform distributions and from noisy examples, as well as for learning a class of geometric concepts that generalizes DNF. The algorithm utilizes one of Freund's boosting techniques and relies on the fact that boosting does not require a completely distribution-independent weak learner. The boosted weak learner is a nonuniform extension of a Fourier-based algorithm due to Kushilevitz and Mansour.

1 Introduction

Ever since Valiant introduced the Probably Approximately Correct (PAC) learning framework [36], there has been a great deal of interest in developing learning algorithms for the class DNF of polynomial-size Disjunctive Normal Form expressions. While a number of algorithms have been developed for learning various subclasses of DNF in a variety of models [36, 27, 6, 2, 25, 3, 1, 10, 31, 15], many of the results for unrestricted DNF have been negative. Angluin [5] showed that DNF is not efficiently learnable in the model of exact learning with proper equivalence queries (all models mentioned here are defined in the next section). Subsequently, Angluin and Kharitonov [7] showed that, given standard cryptographic assumptions, if DNF is learnable with membership queries then it is learnable without membership queries with respect to the class of distributions computable by polynomial-size circuits. This result provides substantial evidence that membership queries, which have proved useful in a number of learning algorithms, may not simplify distribution-independent DNF learning. Recently, Kharitonov [28] again used cryptographic assumptions to show that the class AC⁰ of constant-depth, polynomial-size {∧, ∨, ¬}-circuits is not even weakly learnable using membership queries with respect to the uniform distribution. While this result does not immediately apply to subclasses of AC⁰ such as DNF, it provided at least circumstantial evidence that DNF might also be hard to learn with respect to uniform.

In sharp contrast, we prove a positive DNF learning result: DNF is strongly learnable with respect to the uniform distribution using membership queries. This improves on two previous DNF learning algorithms in the same uniform-with-membership model: Mansour's quasi-polynomial-time algorithm [34] and a polynomial-time weak learner due to Blum et al. [9].

Our algorithm for DNF learning is largely the combination of two powerful tools. One of these is a beautiful Fourier-based technique due to Kushilevitz and Mansour [30]. Their KM algorithm uses membership queries to efficiently locate a subset A of the inputs of a function f such that the parity χ_A of the bits in A correlates well with f with respect to the uniform distribution. That is, KM can be used to find a χ_A that weakly approximates f with respect to uniform, assuming such a χ_A exists. Such a χ_A does exist for every DNF f, and therefore KM is a weak learning algorithm for DNF with respect to uniform [9]. We extend this idea, showing that for every DNF f and for every distribution D there is some χ_A that is a weak approximator for f with respect to D. We also show that the Kushilevitz/Mansour technique can be extended to efficiently find such a χ_A for many (but not necessarily all) distributions.
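For concreteness, here is a minimal Python sketch (not code from the paper) of how the correlation E[f·χ_A] between a DNF and a parity can be estimated by sampling uniform examples; the particular DNF, index set, and sample size are illustrative.

```python
import random

def chi(A, x):
    """Parity chi_A(x): +1 if the bits of x indexed by A have even parity, else -1."""
    return -1 if sum(x[i] for i in A) % 2 else 1

def f_dnf(x):
    """Toy 3-variable DNF target, f = (x0 AND x1) OR x2, as a +/-1-valued function."""
    return 1 if (x[0] and x[1]) or x[2] else -1

def estimate_correlation(f, A, n, samples=20000, rng=random.Random(0)):
    """Monte Carlo estimate of E[f(x) * chi_A(x)] under the uniform distribution."""
    total = 0
    for _ in range(samples):
        x = [rng.randint(0, 1) for _ in range(n)]
        total += f(x) * chi(A, x)
    return total / samples

# chi_{2} has large (negative) correlation with this f, so its negation is a weak
# approximator of f with respect to the uniform distribution.
print(estimate_correlation(f_dnf, {2}, n=3))   # roughly -0.75
```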

*This research was sponsored by the National Science Foundation under Grant No. CCR-9119319. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. E-mail: jcj@cs.cmu.edu


The other tool we use is hypothesis boosting [35, 19, 20]. In particular, we show how to apply one of Freund's boosting algorithms [19] to boost our weak learner into a strong learner for DNF with respect to uniform. A novel feature of our algorithm is that we apply a standard boosting technique to a weak learner that is not completely distribution-independent. The hypothesis output by our algorithm is in PT1, the class of functions expressible as a threshold of polynomially-many parity functions. In fact, we show that the DNF learning algorithm can also be used to properly learn PT1 with respect to uniform. PT1 is a rather rich class of functions; for example, it contains both DNF [29] and parity decision trees [30]. However, results of Bruck [12] and Bruck and Smolensky [13] show that our algorithm does not readily extend to either the class of depth-2 threshold circuits or the class of depth-3 {∧, ∨, ¬}-circuits.

Finally, we extend the basic algorithm in several ways. First, we generalize the algorithm to learn geometric concepts defined over non-Boolean domains, an area that has attracted substantial interest recently ([17] contains a nice summary of research in this area). Among other results, we show that the class UBOX (unions of axis-parallel rectangles defined over the domain {0, …, b − 1}^n) is efficiently learnable with respect to uniform for any constant b. This complements previous positive results for UBOX, in stronger learning models, that either restrict the class in some way or assume that n is constant [11, 33, 23, 16, 17]. We also show that DNF is learnable not only with respect to uniform but also over a variety of other distributions, including the constant-bounded product distributions. Finally, we show that DNF is learnable, with high probability, even if the target function is corrupted by random but persistent classification noise.
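For intuition about the hypothesis class mentioned above, the following sketch (an illustration, not code from the paper) shows how a PT1-style hypothesis, here a simple majority vote over parity functions, can be evaluated; the particular parities are invented.

```python
def chi(A, x):
    """Parity chi_A(x): +1 if the bits of x indexed by A have even parity, else -1."""
    return -1 if sum(x[i] for i in A) % 2 else 1

def majority_of_parities(parity_sets, x):
    """Evaluate a PT1-style hypothesis: the sign of an unweighted sum of parity votes."""
    vote = sum(chi(A, x) for A in parity_sets)
    return 1 if vote >= 0 else -1

# Hypothetical hypothesis built from three parities over a 4-bit instance.
hypothesis_parities = [{0}, {1, 2}, {0, 3}]
print(majority_of_parities(hypothesis_parities, [1, 0, 1, 1]))   # prints -1
```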

2 Definitions and notation

2.1 Functions and function classes

We will be interested in the learnability of sets (classes) of Boolean functions. The Boolean functions we consider are, unless otherwise noted, of the type f : {0,1}^n → {−1,+1} for fixed positive values of n, where an output of +1 denotes true and −1 false. We call {0,1}^n the instance space of f, an element x in the instance space an instance, and the pair (x, f(x)) an example of f. We denote by x_i the ith bit of instance x. Following standard practice, for any real-valued function g we define the norms L_∞(g) = max_x{|g(x)|}, L_1(g) = Σ_x |g(x)|, and L_2(g) = (Σ_x g^2(x))^{1/2}.

Intuitively, a learning algorithm should be allowed to run in time polynomial in the complexity of the function f to be learned; we will use the size of a function as a measure of its complexity. The size measure will depend on the function class to be learned. In particular, each function class F that we study implicitly defines a natural class R_F of representations of the functions in F. We define the size of a function f ∈ F as the minimum, over all r ∈ R_F such that r represents f, of the size of r, and we define below the size measure for each representation class of interest. A DNF expression is a disjunction of terms, where each term is a conjunction of literals and a literal is either a variable or its negation. The size of a DNF expression r is the number of terms in r. The DNF function class is the set of all functions that can be represented as a DNF expression of size polynomial in n. Following Bruck [12], we use PT1 to denote the class of functions on {0,1}^n expressible as a depth-2 circuit with a majority gate at the root and polynomially-many parity gates at the leaves. All gates have unbounded fan-in and fan-out one. The size of a PT1 circuit r is the number of parity gates in r.

2.2 Learning models

Before defining the learning models we will consider, we define several supporting concepts. Given a function f and probability distribution D on the instance space of f, we say that function h is an ε-approximator for f with respect to D if Pr_D[h = f] ≥ 1 − ε. An example oracle for f with respect to D (EX(f, D)) is an oracle that on request draws an instance x at random according to probability distribution D and returns the example (x, f(x)). A membership oracle for f (MEM(f)) is an oracle that, given any instance x, returns the value f(x). Let 𝒟_n denote a nonempty set of probability distributions on {0,1}^n. Any set 𝒟 = ∪_n 𝒟_n is called a distribution class. For some distributions that have obvious generalizations to distribution classes (such as the uniform distribution on {0,1}^n) we blur the distinction between distribution and distribution class. Now we formally define the Probably Approximately Correct (PAC) model of learnability [36]. Let ε and δ be positive values (called the accuracy and confidence of the learning procedure, respectively). Then we say that the function class F is (strongly) PAC-learnable if there is an algorithm A such that for any ε and δ, any f ∈ F (the target function), and any distribution D on the instance space of f (the target distribution), with probability at least 1 − δ algorithm A(EX(f, D), ε, δ) produces an ε-approximator for f with respect to D in time polynomial in n, the size of f, 1/ε, and 1/δ. We generally drop the "PAC" from "PAC-learnable" when the model of learning is clear from context.

We will consider a number of variations on the basic PAC model. Let M be any model of learning (e.g., PAC). If F is M-learnable by an algorithm A that requires a membership oracle then F is M-learnable using membership queries. If F is M-learnable for ε = 1/2 − 1/p(n, s), where p is a fixed polynomial and s is the size of f, then F is weakly M-learnable. We say that F is M-learnable by H if F is M-learnable by an algorithm A that always outputs a function h ∈ H. If F is M-learnable by F then we say that F is properly M-learnable. Finally, note that the PAC model places no restriction on the example distribution D, i.e., it is distribution-independent. If F is M-learnable for all distributions D in distribution class 𝒟 then F is M-learnable with respect to 𝒟.

We also at times refer to models of exact learning [4]; we now define several concepts related to these models. An equivalence oracle for f (EQ(f)) is an oracle that, given a hypothesis function h, returns an instance x such that f(x) ≠ h(x) if such an x exists; otherwise, the oracle returns "equivalent." A function class F is exactly learnable if there is an algorithm A such that for all f ∈ F, A(EQ(f)) runs in time polynomial in n and the size of f and returns a hypothesis h equivalent to f. Function class F is exactly learnable with proper equivalence queries if F is exactly learnable by some algorithm A such that for all of the hypotheses h posed by A to the equivalence oracle, h ∈ F.
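As an illustration of these definitions, the following sketch shows how the oracles EX(f, D) and MEM(f) and an ε-approximator check might look in code; the target, hypothesis, and sample size are toy choices of mine, not from the paper.

```python
import random

def make_example_oracle(f, draw, rng=random.Random(0)):
    """EX(f, D): each call draws an instance x according to D and returns (x, f(x))."""
    def oracle():
        x = draw(rng)
        return x, f(x)
    return oracle

def make_membership_oracle(f):
    """MEM(f): given any instance x, return the label f(x)."""
    return lambda x: f(x)

def estimate_error(h, f, draw, samples=20000, rng=random.Random(1)):
    """Estimate Pr_D[h != f]; h is an epsilon-approximator when this is at most epsilon."""
    errors = 0
    for _ in range(samples):
        x = draw(rng)
        if h(x) != f(x):
            errors += 1
    return errors / samples

# Illustration with the uniform distribution on {0,1}^3 and a toy target/hypothesis pair.
n = 3
uniform = lambda rng: [rng.randint(0, 1) for _ in range(n)]
target = lambda x: 1 if (x[0] and x[1]) or x[2] else -1
hypothesis = lambda x: 1 if x[2] else -1
print(estimate_error(hypothesis, target, uniform))   # about 0.125 for this pair
```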

2.3 The Fourier transform

For each set A ⊆ {1, …, n} we define the function χ_A : {0,1}^n → {−1,+1} as

$$\chi_A(x) = (-1)^{\sum_{i \in A} x_i}.$$

That is, χ_A(x) is the Boolean function that is 1 when the parity of the bits in x indexed by A is even and is −1 otherwise. With inner product defined by¹ ⟨f, g⟩ = E[fg] and norm defined by ‖f‖ = ⟨f, f⟩^{1/2}, {χ_A | A ⊆ {1, …, n}} is an orthonormal basis for the vector space of real-valued functions on the Boolean cube Z_2^n. That is, every function f : {0,1}^n → ℝ can be uniquely expressed as a linear combination of parity functions: f = Σ_A f̂(A) χ_A, where f̂(A) = E[f χ_A]. We call the vector of coefficients f̂ the Fourier transform of f. Note that for Boolean f, f̂(A) represents the correlation of f and χ_A with respect to the uniform distribution. Also note that f̂(∅) = E[f χ_∅] = E[f], since χ_∅ is the constant function +1. Parseval's identity states that for every function f, E[f^2] = Σ_A f̂^2(A). For Boolean f it follows that Σ_A f̂^2(A) = 1. More generally, it can be shown that for any functions f and g, E[fg] = Σ_A f̂(A) ĝ(A).
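To make the Fourier expansion concrete, here is a small Python sketch (an illustration, not code from the paper) that computes the coefficients f̂(A) = E[f χ_A] of a toy Boolean function by brute force and checks Parseval's identity; the example function is made up.

```python
import itertools

def chi(A, x):
    """Parity chi_A(x) = (-1)^(sum_{i in A} x_i)."""
    return -1 if sum(x[i] for i in A) % 2 else 1

def fourier_transform(f, n):
    """Exact Fourier coefficients hat{f}(A) = E[f * chi_A] under the uniform distribution."""
    points = list(itertools.product([0, 1], repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for A in itertools.combinations(range(n), k):
            coeffs[A] = sum(f(x) * chi(A, x) for x in points) / len(points)
    return coeffs

# Toy +/-1-valued function: f = (x0 AND x1) OR x2.
f = lambda x: 1 if (x[0] and x[1]) or x[2] else -1
coeffs = fourier_transform(f, 3)
print(coeffs[()])                                   # hat{f}(empty set) = E[f] = 0.25
print(sum(c * c for c in coeffs.values()))          # Parseval: sums to 1 for Boolean f
```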

3 Two tools

Our DNF learning algorithm is largely the combination of two important learning-theoretic tools: an algorithm for finding the large Fourier coefficients of a function [30] and a method for "boosting" weak hypotheses into stronger ones [35, 19, 20]. These algorithms form the basis for our algorithm, and it will also be necessary to tailor them somewhat to our needs, so we discuss each of them in some detail.

3.1 Hypothesis boosting

Schapire [35] first discovered the surprising fact that every function class that can be weakly learned (in a distribution-independent model) can be strongly learned. Subsequently, Freund [19, 20, 21] developed several improved boosting algorithms. All of these algorithms run the given weak learner multiple times and combine the resulting weak hypotheses in some manner to produce a strong hypothesis. Boosting occurs as a result of running the weak learner on a different (simulated) example oracle EX(f, D_i) each time, thus producing weak hypotheses that "do well" on different regions of the instance space. The DNF learning algorithm presented in this paper is based on one of Freund's boosting algorithms [19]; we will call this algorithm F1. We choose to analyze F1 rather than other boosting algorithms because the distributions D_i simulated by F1 all have simple closed-form representations, a fact that will facilitate our analysis.² As input, F1 is given positive ε and δ, a weak learner WL that produces (1/2 − γ)-approximate hypotheses for functions in F, and an example oracle EX(f, D) for some f ∈ F. The algorithm then steps sequentially through k stages. At each stage i ∈ [0, k − 1], F1 performs one run of WL and produces (with probability at least 1 − δ/k) a (1/2 − γ)-approximate hypothesis w_i. For the first stage, F1 runs WL on the given example oracle EX(f, D).

¹Expectations and probabilities here and elsewhere are with respect to the uniform distribution over the instance space unless otherwise indicated.

²Freund has also developed an asymptotically more efficient boosting algorithm [20, 21] that can be used to learn DNF, but for clarity of exposition we discuss only the simpler F1 algorithm.


For each of the succeeding stages i > 0, F1 runs WL on a simulated example oracle EX(f, D_i), where distribution D_i focuses weight on those instances x such that roughly half of the i weak hypotheses generated during previous stages are correct on x and roughly half are incorrect. Before defining the algorithm precisely we need some notation. Let

$$\beta_r^i = \begin{cases} B\!\left(\lfloor k/2 \rfloor - r;\; k - i - 1,\; \tfrac{1}{2} + \gamma\right) & \text{if } i - \lceil k/2 \rceil < r \le \lfloor k/2 \rfloor, \\ 0 & \text{otherwise,} \end{cases}$$

where $B(j; n, p) = \binom{n}{j} p^j (1-p)^{n-j}$ is the binomial formula. Also, let α_r^i = β_r^i / max_r{β_r^i}. Finally, let r_i(x) = |{0 ≤ j < i : w_j(x) = f(x)}|; that is, r_i(x) is the number of hypotheses w_j among those produced before stage i that are "right" on x.

During stage i > 0, when the weak learner requests an example from its simulated example oracle EX(f, D_i), F1 queries the example oracle EX(f, D) and receives an example (x, f(x)). With probability α_{r_i(x)}^i, F1 accepts this example. If F1 does not accept then it queries for another example, repeating this process until it does accept. Finally, F1 passes the accepted example to WL. Thus the distribution D_i simulated by F1 at stage i is

$$D_i(x) = \frac{D(x)\,\alpha^i_{r_i(x)}}{\sum_y D(y)\,\alpha^i_{r_i(y)}}. \tag{1}$$

Stage i is completed when WL outputs its hypothesis w_i. Nominally, F1 terminates after k = (1/2)γ^{-2} ln(4/ε) stages, producing a weak hypothesis at each stage. However, at each stage i the algorithm estimates the denominator of (1), which is the probability that F1 accepts an example while simulating the oracle EX(f, D_i). If at any stage i the algorithm detects that Σ_y D(y) α_{r_i(y)}^i ≤ c_1 ε for a fixed constant c_1 (c_1 = 1/13 is adequate [19]) then F1 terminates prematurely. Regardless of how the algorithm terminates, the hypothesis output by F1 is the majority function applied to hypotheses w_0, w_1, …, w_{k'}, where k' represents the last stage completed by F1. Freund [19] proves that this algorithm can be used to boost distribution-independent weak learners into strong learners:

Lemma 1 (Freund) Algorithm F1, given positive ε and δ, a (1/2 − γ)-approximate learner for function class F, and example oracle EX(f, D) for some f ∈ F and any distribution D, runs in time polynomial in n, s, γ^{-1}, ε^{-1}, and log(δ^{-1}) and produces, with probability at least 1 − δ, an ε-approximator for f with respect to D.
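To make the simulation of EX(f, D_i) concrete, here is a small Python sketch of the rejection-sampling step described above; the helper names and boundary conventions are my own reading of the text, not code from the paper.

```python
import math, random

def binom_pmf(j, n, p):
    """B(j; n, p) = C(n, j) p^j (1-p)^(n-j); taken to be zero outside 0 <= j <= n."""
    if j < 0 or j > n:
        return 0.0
    return math.comb(n, j) * p**j * (1 - p)**(n - j)

def alphas(i, k, gamma):
    """Acceptance probabilities alpha_r^i = beta_r^i / max_r beta_r^i for r = 0..i,
    with beta_r^i = B(floor(k/2) - r; k - i - 1, 1/2 + gamma), a sketch of the
    weighting scheme described above."""
    betas = [binom_pmf(k // 2 - r, k - i - 1, 0.5 + gamma) for r in range(i + 1)]
    m = max(betas) or 1.0
    return [b / m for b in betas]

def filtered_example(ex_oracle, prev_hyps, i, k, gamma, rng=random.Random(0)):
    """Simulate one draw from EX(f, D_i) by rejection sampling from EX(f, D):
    keep an example (x, y) with probability alpha_{r_i(x)}^i, where r_i(x) counts
    the previous weak hypotheses that are right on x."""
    a = alphas(i, k, gamma)
    while True:
        x, y = ex_oracle()
        r = sum(1 for w in prev_hyps if w(x) == y)
        if rng.random() <= a[r]:
            return x, y
```

For stage 0 the algorithm simply uses EX(f, D) directly, matching the description above.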

3.2 Finding large Fourier coefficients

To produce the weak hypotheses that will be boosted, we utilize a slightly modified version of an algorithm due to Kushilevitz and Mansour. Their KM algorithm was initially applied to learning parity decision trees with respect to the uniform distribution [30], and it has been the basis of other learning algorithms as well [34, 9]. The basic KM algorithm finds, with probability at least 1 − δ, close approximations to all of the large Fourier coefficients of a Boolean function f on {0,1}^n. Since on this domain a Fourier coefficient f̂(A) represents the correlation between a parity χ_A and the target function f, KM can be used to find all parities that correlate well with f. By "large" coefficients we mean coefficients of magnitude exceeding some threshold θ; the algorithm runs in time polynomial in n, log(δ^{-1}), and θ^{-1}. KM makes membership queries for f, but otherwise f is treated as a black box. KM is a recursive algorithm that is given as input a membership oracle MEM(f), threshold θ, confidence δ, and the specification of a set of Fourier coefficients. A set of coefficients is specified by two parameters, an integer k ∈ [0, n] and a set A ⊆ {1, …, k}; these parameters define the set C_{A,k} = {f̂(A ∪ B) | B ⊆ {k+1, …, n}}. Initially, KM is run on the set of all Fourier coefficients of f, C_{∅,0}. Each time KM is called, it begins by defining a partition of the input set C_{A,k} into the two equal-sized sets C_{A,k+1} and C_{A∪{k+1},k+1}. For notational convenience we let C_i denote either of the sets in the partition. After partitioning the input set, KM next tests to see whether or not a large coefficient can possibly be a member of one or both of the sets. It (conceptually) does this by computing the sum of squares of the Fourier coefficients in each set; we denote the sum for set C_i by L_2^2(C_i). If L_2^2(C_i) < θ^2 for either C_i then certainly none of the coefficients in that C_i exceeds the threshold θ, and the coefficients in that C_i can be ignored in subsequent processing. On the other hand, if L_2^2(C_i) ≥ θ^2 and |C_i| > 1 then KM recurses on C_i. The above-threshold singleton sets remaining at the end of this process are the desired large coefficients. Because the sum of squares of the Fourier coefficients of a Boolean function is 1 (by Parseval), the algorithm will recurse on at most θ^{-2} subsets at any level of the recursion. Thus this algorithm runs in time polynomial in the depth n of the recursion, in θ^{-1}, and in the maximum time required to compute L_2^2(C_i) for any C_i. Of course, this time could be exponentially large. The KM algorithm gets around this difficulty by estimating L_2^2(C_i) for each C_i in an ingenious way: for

every function f : {0,1}^n → ℝ, every k ∈ [0, n], and every A ⊆ {1, …, k},

$$L_2^2(C_{A,k}) = \mathbf{E}_{x,y,z}\!\left[ f(yx)\, f(zx)\, \chi_A(y)\, \chi_A(z) \right],$$

where |x| = n − k, |y| = |z| = k, and the expectation is uniform over x, y, and z. Thus L_2^2(C_{A,k}) can be estimated by using membership queries to sample f appropriately, and it can be shown by a Chernoff bound argument and an appropriate reduction of the threshold that the algorithm can achieve the claimed performance.
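The following sketch illustrates, under the conventions above and with invented helper names, how L_2^2(C_{A,k}) can be estimated from membership queries using this identity (indices are 0-based in the code).

```python
import random

def chi(A, bits):
    """Parity over the index set A of a 0/1 vector."""
    return -1 if sum(bits[i] for i in A) % 2 else 1

def estimate_l22(mem, n, k, A, samples=20000, rng=random.Random(0)):
    """Monte Carlo estimate of L_2^2(C_{A,k}) = E_{x,y,z}[f(yx) f(zx) chi_A(y) chi_A(z)],
    where y, z are uniform over {0,1}^k, x is uniform over {0,1}^(n-k), and yx denotes
    concatenation.  mem is a membership oracle returning f as a +/-1 value."""
    total = 0.0
    for _ in range(samples):
        x = [rng.randint(0, 1) for _ in range(n - k)]
        y = [rng.randint(0, 1) for _ in range(k)]
        z = [rng.randint(0, 1) for _ in range(k)]
        total += mem(y + x) * mem(z + x) * chi(A, y) * chi(A, z)
    return total / samples

# Example: for f = chi_{0} on 3 bits, the single large coefficient lies in C_{{0},1},
# so the estimate below should be close to 1.
f = lambda bits: chi({0}, bits)
print(estimate_l22(f, n=3, k=1, A={0}))
```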

Our learning algorithms will need to find the large coefficients of certain functions that are not Boolean valued. This leads us to extend KM slightly:

Lemma 2 There is an algorithm KM' such that, for any function f : {0,1}^n → ℝ, threshold θ > 0, and confidence δ > 0, KM'(n, MEM(f), θ, δ) returns (with probability at least 1 − δ) a set containing all of the Fourier coefficients f̂(A) such that |f̂(A)| ≥ θ. KM' uses membership queries and runs in time polynomial in n, θ^{-1}, log(δ^{-1}), and L_∞(f).

Proof: For the moment, assume that a value is known for L_∞(f). Then we can apply Hoeffding's inequality [26] to show that for each set C_{A,k}, L_2^2(C_{A,k}) can be estimated with sufficient confidence and accuracy in time polynomial in n, θ^{-1}, log(δ^{-1}), and L_∞(f). Furthermore, the fact that Σ_A f̂^2(A) = E[f^2] ≤ L_∞^2(f) means that the number of recursive calls is bounded by a polynomial in n, θ^{-1}, and L_∞(f). Finally, note that the algorithm can use a standard guess-and-double technique to eliminate the need to know L_∞(f). □

We turn now to the claim, made in the introduction, that for every DNF f and every distribution D some parity χ_A is a weak approximator for f with respect to D; in what follows, s denotes the number of terms of f.

Proof: There is at least one term T in f such that Pr_D[x satisfies T] ≥ Pr_D[x satisfies f]/s. Let T(x) be the Boolean function represented by T, i.e., T(x) = 1 when x satisfies T and T(x) = −1 otherwise. Also, assume without loss of generality that none of the literals in T are negated and let V represent the set of variables appearing in T. Then for all x,

$$\frac{T(x) + 1}{2} = \mathbf{E}_{A \subseteq V}\!\left[ (-1)^{|A|} \chi_A(x) \right],$$

where the expectation is uniform over the subsets A ⊆ V. Let T'(x) ≡ (T(x) + 1)/2. Then E_D[f·T'] = E_{A⊆V}[(−1)^{|A|} E_D[fχ_A]] ≤ E_{A⊆V}[|E_D[fχ_A]|]. Also, since T is a term of f, for any x such that f(x) = −1 we have T'(x) = 0. Thus E_D[f·T'] = E_D[T'] = Pr_D[T = 1]. Therefore, there is some A ⊆ V such that |E_D[fχ_A]| ≥ (E_D[f] + 1)/2s. Since E_D[f] = E_D[fχ_∅], the proof is completed by noting that the inequality above implies that either |E_D[fχ_A]| ≥ 1/(2s + 1) or E_D[f] ≤ −1/(2s + 1).

Thus for every DNF f and distribution D there is some χ_A such that |Pr_D[f = χ_A] − 1/2| = Ω(s^{-1}), and χ_A (or its inverse) can be used as a weak approximator for f with respect to D. Therefore a polynomial-time algorithm that could find such a parity for every DNF and every distribution could be boosted into a distribution-independent learning algorithm for DNF. While we do not know whether or not such an algorithm exists, we can show the following:
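As a numerical sanity check of the bound just derived (an illustration of mine, not part of the paper, and unrelated to the result referred to immediately above), the following sketch exhaustively computes E_D[fχ_A] for a toy monotone DNF under an arbitrarily chosen distribution and confirms that some parity achieves correlation at least 1/(2s + 1).

```python
import itertools, random

def chi(A, x):
    """Parity chi_A(x) = (-1)^(sum of the bits of x indexed by A)."""
    return -1 if sum(x[i] for i in A) % 2 else 1

def dnf(terms, x):
    """Evaluate a DNF given as a list of terms, each a set of (positive) variable indices."""
    return 1 if any(all(x[i] for i in term) for term in terms) else -1

n = 4
terms = [{0, 1}, {2, 3}, {1, 3}]          # a toy monotone 3-term DNF
s = len(terms)

# An arbitrary (here randomly chosen) distribution D on {0,1}^4.
rng = random.Random(0)
points = list(itertools.product([0, 1], repeat=n))
raw = [rng.random() for _ in points]
total = sum(raw)
D = {x: w / total for x, w in zip(points, raw)}

def correlation(A):
    """Exact E_D[f * chi_A] under the toy distribution D."""
    return sum(D[x] * dnf(terms, x) * chi(A, x) for x in points)

best = max(abs(correlation(set(A)))
           for k in range(n + 1)
           for A in itertools.combinations(range(n), k))
print(best >= 1 / (2 * s + 1))   # True: some parity meets the 1/(2s+1) bound
```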