
On Learning Correlated Boolean Functions Using Statistical Query

Ke Yang
Computer Science Department, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA. [email protected]

November 28, 2001

Abstract

In this paper, we study the problem of using statistical queries (SQ) to learn highly correlated boolean functions, namely, a class of functions where any pair agree on significantly more than a fraction 1/2 of the inputs. We give a limit on how well one can approximate all the functions without making any query, and then we show that beyond this limit, the number of statistical queries the algorithm has to make increases with the "extra" advantage the algorithm gains in learning the functions. Here the advantage is defined to be the probability the algorithm agrees with the target function minus the probability the algorithm doesn't agree. An interesting consequence of our results is that the class of booleanized linear functions over a finite field ($f_{\phi,\vec{a}}(\vec{x}) = 1$ iff $\phi(\vec{a} \cdot \vec{x}) = 1$, where $\phi : GF_p \to \{-1,+1\}$ is an arbitrary boolean function that maps elements of $GF_p$ to $\pm 1$) cannot be learned efficiently using statistical queries. This result is useful since the hardness of learning booleanized linear functions over a finite field is related to the security of certain cryptosystems ([B01]). In particular, we prove that the class of linear threshold functions over a finite field ($f_{\vec{a},b}(\vec{x}) = 1$ iff $\vec{a} \cdot \vec{x} \ge b$) cannot be learned efficiently using statistical queries. This contrasts with Blum et al.'s result [BFK+96] that linear threshold functions over the reals (perceptrons) are learnable in the SQ model. Finally, we describe a PAC-learning algorithm that learns a class of linear threshold functions in time within which statistical query algorithms provably cannot learn the class. With properly chosen parameters, this class of linear threshold functions becomes an example of PAC-learnable, but not SQ-learnable, functions that are not parity functions.

 An extended abstract to appear in the Proceedings of The Twelfth International Conference on Algorithmic Learning Theory (ALT'01), LNAI 2225.


1 Introduction

Pioneered by Valiant [V84], machine learning theory is concerned with problems like "What classes of functions can be efficiently learned under this learning model?". Among the different learning models are the Probably Approximately Correct model (PAC) by Valiant [V84] and the Statistical Query model (SQ) by Kearns [K98]. The SQ model is a restriction of the PAC model, where the learning algorithm doesn't see samples with their labels, but only gets the probabilities that a predicate is true: more precisely, the learning algorithm provides a predicate $g(x, y)$ and a tolerance, and an SQ oracle returns a real number $v$ that is within the tolerance of the expected value of $g(x, f(x))$ according to a distribution over $x$, where $f$ is the target function. While seemingly much weaker than the PAC model, the SQ model turns out to be very useful: in fact, many known PAC learning algorithms are actually SQ model algorithms, or can be converted to SQ model algorithms. The reader is referred to [K98] for a more comprehensive description.

One interesting feature of the SQ model is that there are information-theoretic lower bounds on the learnability of certain classes of functions. Kearns [K98] proved that parity functions cannot be efficiently learned in the SQ model. Blum et al. [BFJ+94] extended his result by showing that if a class of functions has "SQ-dimension" $d$ (informally, the maximum number of "almost uncorrelated" functions in the class, where the correlation between two functions is the probability these two functions agree minus the probability they disagree), then an SQ learning algorithm has to make $\Omega(d^{1/3})$ queries, each of tolerance $O(d^{-1/3})$, in order to weakly learn $F$. In [J00], Jackson further strengthened this lower bound by proving that $\Omega(2^n)$ queries are needed for an SQ-based algorithm to learn the class of parity functions over $n$ bits. This result can be extended to any class of completely uncorrelated functions: $\Omega(d)$ queries are needed for an SQ-based algorithm to learn a class of functions if this class contains $d$ functions that are completely "uncorrelated". Notice that this lower bound is tight: [BFJ+94] proved that there are weak-learning algorithms for such classes of functions using $O(d)$ queries.

In this paper, we study the problem of learning correlated functions. Suppose there is a class of boolean functions $F = \{f_1, f_2, ..., f_d\}$, where any pair of functions $f_i$, $f_j$ are highly correlated, namely $f_i$ and $f_j$ agree on a fraction $(1+\lambda)/2$ of the inputs, where $\lambda$ can be significantly larger than 0 (say, $\lambda = 1/3$). There are natural classes of correlated functions. An example is the "booleanized linear functions" over a finite field $GF_p$ defined in this paper. Informally, these functions are of the form $f_{\vec{a}}(\vec{x}) = \phi(\vec{a} \cdot \vec{x})$, where $\phi$ (called a "booleanizer" function) is an arbitrary function that maps an element of $GF_p$ to a boolean value ($+1$ or $-1$), and both $\vec{a}$ and $\vec{x}$ are vectors over $GF_p$. Booleanized linear functions can be viewed as natural extensions of parity functions (which are linear functions over $GF_2$), and intuitively, should be hard to learn by statistical queries (since parity functions cannot be efficiently learned by statistical queries). Actually they are (implicitly) conjectured to be hard to learn in general, and there are cryptosystems whose security is based on the assumption that booleanized linear functions are hard to learn.
One example is the "blind encryption scheme" proposed by Baird [B01]. Roughly speaking, this private-key crypto-scheme picks a random $f_{\vec{a}}$ as the secret key, and encrypts a `0' bit by a random $\vec{x}$ such that $f_{\vec{a}}(\vec{x}) = 1$, and a `1' bit by a random $\vec{x}$ such that $f_{\vec{a}}(\vec{x}) = -1$. Knowing the secret key, decryption is just an invocation of $f_{\vec{a}}$, which can be done very efficiently. Furthermore, it is (implicitly in [B01]) conjectured that, by only inspecting random plaintext-ciphertext pairs $\langle \vec{x}, f_{\vec{a}}(\vec{x}) \rangle$, it is hard to learn the function $f_{\vec{a}}$.[1] However, the results from [K98, BFJ+94, J00] don't immediately apply here, since these booleanized linear functions are indeed correlated, and the correlation can be very large (in particular, [BFJ+94] requires the correlation between any two functions to be $O(1/d^3)$, whereas for the booleanized linear functions the correlation is of order $\Omega(1/d)$, and can even be constant).

Notice that in the case of correlated functions, the notion of "weak learning" can become trivial: if any pair of functions $f_i$ and $f_j$ have correlation $\lambda$, i.e., they agree on a $(1+\lambda)/2$ fraction of the inputs, then by always outputting $f_1(x)$ on every input $x$, an algorithm can approximate any function $f_i$ with advantage at least $\lambda$ (the advantage of an algorithm is defined as the probability the algorithm predicts a function correctly minus the probability the algorithm predicts incorrectly). So if $\lambda$ is non-negligibly larger than 0, this algorithm "weakly learns" the function class without even making any query to the target function.

In the first part of this paper, we prove that if there are $d$ target functions $f_1, f_2, ..., f_d$, such that any pair $f_i$ and $f_j$ have almost the same correlation $\lambda$, then an algorithm can have at most advantage $\sqrt{(1+(d-1)\lambda)/d}$ in approximating all the target functions if no query is performed. Furthermore, we prove that in order to have any "extra" advantage $S$ in learning a target function, about $\sqrt{d} \cdot S/2$ queries are needed. The result shows an advantage-query complexity trade-off: the more advantage one wants, the more queries one has to make. One consequence of our result is that booleanized linear functions cannot be learned efficiently using statistical queries, and if the booleanizer is almost unbiased and the finite field $GF_p$ is large, one cannot even weakly learn this class of functions. Our result provides some positive evidence towards the security of the blind encryption scheme by Baird.

[1] This is not exactly what the "blind encryption scheme" does, but it is similar.

The technique we use in the proof, which could be of interest by itself, is to keep track of the "all-pair statistical distance" between the scenarios in which the algorithm is given different target functions; we denote this quantity by $\Delta$. We prove that: 1. Before the algorithm makes any queries, $\Delta = 0$. 2. After the algorithm finishes all the queries, $\Delta$ is "large". 3. Each query only increases $\Delta$ by a "small" amount. We then conclude that a lot of queries are needed in order to learn $F$ well.

One interesting consequence of our result is that the class of linear threshold functions over a finite field is not efficiently learnable. A linear threshold function over a finite field is defined as $f_{\vec{a},b}(\vec{x}) = 1$ if $\vec{a} \cdot \vec{x} \ge b$, and $-1$ otherwise, where $\vec{a} \in GF_p^n$ and $b \in GF_p$. These linear threshold functions over $GF_p$ are interesting, since their counterparts over the reals are well known as "perceptrons" and their learnability is well studied. Blum et al. [BFK+96] proved that there are statistical query algorithms that learn linear threshold functions over the reals in polynomial time, even in the presence of noise. It is interesting to see this stark contrast.

In the second part of this paper, we present a learning algorithm, BUILD-TREE, that learns a class of linear threshold functions over finite fields where the threshold $b$ is fixed to be $(p+1)/2$. Our algorithm uses a random example oracle, which produces a random pair $\langle \vec{x}, f_{\vec{a}}(\vec{x}) \rangle$ upon invocation. The algorithm's running time is $p^{O(n/\log n)}$, while the brute-force search algorithm takes time $p^{\Omega(n)}$, and any statistical query learning algorithm also has to take time $p^{\Omega(n)}$ to even weakly learn the functions. If we "pad" the input properly, we can make BUILD-TREE's running time polynomial in the input size, while still no SQ learning algorithm can learn the class efficiently. This gives an example of a PAC-learnable, but not SQ-learnable, class of functions. Previously, both [K98] and [BFJ+94] proved that the class of parity functions fits into this category, and later [BKW00] proved that a class of noisy parity functions also fits. Our example is the first class of functions in this category that are not parity functions. This result provides some insight towards a better understanding of SQ-learning algorithms.

The rest of the paper is organized as follows: Section 2 gives the notations and definitions used in this paper; Section 3 proves a lower bound for SQ-learning algorithms; Section 4 discusses the algorithm BUILD-TREE and its analysis. Due to space constraints, the proofs of most lemmas and theorems are moved to the appendix.

2 Notations and Definitions

We give the notations and definitions to be used in the paper.

2.1 Functions and Oracles

Throughout this paper we are interested in functions whose input domain is a finite set $\Omega$, where $|\Omega| = M$, and whose output domain is $\{-1,+1\}$. An input $x$ to a function $f$ is called a positive example if $f(x) = +1$, and a negative example if $f(x) = -1$. Sometimes, when the function $f$ is clear from the context, we call the value of $f(x)$ the label of $x$. In many cases, $\Omega$ takes a special form: $\Omega = GF_p^n$, where $p$ is a prime number and $n$ is a positive integer. In this case, we write an input in vector form, $\vec{x}$, and we use $x_i$ to denote its $i$-th entry, an element of $GF_p$.

We now define the notion of learning a function. The overall model is an algorithm $A$ with oracle access to a function $f$ that $A$ tries to learn (we call $f$ the target function). $A$ is given an input $X$ and makes queries to the oracle. Finally $A$ outputs a bit as its prediction of $f(X)$. We use an "honest SQ-oracle" model, which is similar to the definition of an "SQ-based algorithm" in [J00]:

Definition 1 An honest SQ-oracle for a function $f$ takes two parameters $g$ and $N$ as inputs, where $g : GF_p^n \times \{-1,+1\} \to \{-1,+1\}$ is a function that takes an input in $GF_p^n$ and a boolean input and outputs a binary value, and $N$ is a positive integer written in unary, called the "sample count". The oracle returns $\frac{1}{N}\sum_{i=1}^{N} g(\vec{x}_i, f(\vec{x}_i))$, where each $\vec{x}_i$ is a random variable independently chosen according to a pre-determined distribution $D$. We denote this oracle by $HSQ_f$.

Notice that this definition of an honest SQ-oracle is different from the commonly used definition of a "normal" SQ-oracle (sometimes denoted $STAT_f$) as in [AD98, BFJ+94, BFK+96, BKW00, K98]. Actually an honest SQ-oracle is stronger than a "normal" SQ-oracle. Kearns [K98] proved that one can simulate a $STAT_f$ oracle efficiently in the PAC learning framework, and Decatur [D95] extensively studied the problem of efficiently simulating a $STAT_f$ oracle. Both their results can easily be extended to show that an honest SQ-oracle can be used to efficiently simulate a "normal" SQ-oracle. Therefore a lower bound with respect to an honest SQ-oracle automatically translates to a lower bound with respect to a "normal" SQ-oracle.
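As an illustration, the following is a minimal sketch of an honest SQ-oracle in the sense of Definition 1, assuming a uniform distribution $D$ over a toy domain; the class and variable names are ours, not from the paper.

```python
import random

class HonestSQOracle:
    """Sketch of HSQ_f: answers a query (g, N) with (1/N) * sum_i g(x_i, f(x_i))."""

    def __init__(self, f, domain):
        self.f = f            # target function, maps an input to +1 or -1
        self.domain = domain  # list of all inputs; D is uniform in this sketch

    def query(self, g, N):
        total = 0.0
        for _ in range(N):
            x = random.choice(self.domain)   # x_i drawn independently from D
            total += g(x, self.f(x))
        return total / N

# Example: estimate the bias <f>_D of a toy +/-1 target by querying with g(x, y) = y.
domain = [(a, b) for a in range(2) for b in range(2)]
f = lambda x: 1 if (x[0] ^ x[1]) == 0 else -1
oracle = HonestSQOracle(f, domain)
print(oracle.query(lambda x, y: y, N=1000))  # close to <f>_D = 0 for this target
```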

2.2 Bias and Inner Products of Functions

We define the bias of a real-valued function $f$ over $\Omega$ to be the expected value of $f$ under a distribution $D$, and we denote it by $\langle f \rangle_D$:
$$\langle f \rangle_D = E_D[f(x)] = \sum_{x \in \Omega} D(x) f(x)$$

We define the inner product of two real-valued functions $f$ and $g$ over $\Omega$ to be the expected value of $f \cdot g$, denoted by $\langle f, g \rangle_D$:
$$\langle f, g \rangle_D = E_D[f(x)g(x)] = \sum_{x \in \Omega} D(x)\, f(x) g(x)$$

In the rest of the paper, we often omit the subscript $D$ if the distribution is clear from the context. We can also view the inner product as the "correlation" between $f$ and $g$. It is easy to verify that this definition of inner product is a proper one. Also it is important to observe that if $f$ is a boolean function, i.e., $f(x) \in \{-1,+1\}$, then $\langle f, f \rangle = 1$.
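For concreteness, here is a small numerical illustration of the bias and inner product under the uniform distribution on a toy domain (the domain and functions are our own examples, not from the paper).

```python
import numpy as np

domain = np.arange(16)
f = np.where(domain % 2 == 0, 1, -1)   # +1 on even inputs, -1 on odd inputs
g = np.where(domain % 4 == 0, 1, -1)   # +1 on multiples of 4

D = np.full(len(domain), 1 / len(domain))   # uniform distribution D over the domain

bias_f   = np.sum(D * f)         # <f>_D = E_D[f(x)]        -> 0.0
inner_fg = np.sum(D * f * g)     # <f, g>_D = E_D[f(x)g(x)]  -> 0.5
inner_ff = np.sum(D * f * f)     # <f, f>_D = 1 for any +/-1-valued f
print(bias_f, inner_fg, inner_ff)
```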

2.3 Approximating and Learning Functions

Given a function $f : \Omega \to \{-1,+1\}$ and an algorithm $A$ which maps elements of $\Omega$ to $-1$ or $+1$, we can measure how well $A$ approximates $f$. The algorithm could be randomized, and thus the output of $A$ on any input is a random variable. We define the characteristic function of algorithm $A$ to be a real-valued function over the same domain $\Omega$, $\chi_A : \Omega \to [-1,+1]$, such that
$$\chi_A(x) = 2 \cdot \Pr[A \text{ outputs } 1 \text{ on } x] - 1$$
where the probability is taken over the randomness $A$ uses and, if $A$ makes oracle queries, the randomness from the oracles. It is easy to verify that $\chi_A(x)$ is always within the range $[-1,+1]$. Given a probability distribution $D$ over $\Omega$, we define the advantage of algorithm $A$ in approximating function $f$ to be
$$\langle f, \chi_A \rangle = \Pr_{A,D}[A \text{ agrees with } f \text{ on input } x] - \Pr_{A,D}[A \text{ disagrees with } f \text{ on input } x]$$
where the probability is taken over the randomness from $A$ and the $x$ that is randomly chosen from $\Omega$ according to $D$. It is not hard to see that if $A$ always agrees with $f$, then $\chi_A = f$, and the advantage of $A$ in approximating $f$ is 1; if $A$ randomly guesses a value for each input, then $\chi_A \equiv 0$, and the advantage of $A$ is 0.

For a class of functions $F$ and an oracle algorithm $A$, we say $A$ approximates $F$ with advantage $\gamma$ if for every function $f \in F$, the advantage of $A$ in approximating $f$ is at least $\gamma$. In the case that $A$ queries an honest SQ-oracle $HSQ_f$ in order to approximate the target function $f$, we say $A$ learns $F$ with advantage $\gamma$ with respect to an honest SQ-oracle.

We note that the "advantage" measure for learning a function isn't very different from the more commonly used "accuracy/confidence" measure in PAC learning. Recall that an algorithm learns $F$ with accuracy $\epsilon$ and confidence $\delta$ if, for any $f \in F$, the algorithm $A$, using an oracle about $f$, with probability at least $1 - \delta$ agrees with $f$ with probability at least $1 - \epsilon$. It is easy to prove the following facts:

Lemma 1 Let $F$ be a class of boolean functions over $\Omega$, and let $A$ be an oracle algorithm. If $A$ learns $F$ with accuracy $\epsilon$ and confidence $\delta$, then $A$ learns $F$ with advantage at least $1 - 2\epsilon - 2\delta$. On the other hand, if $A$ learns $F$ with advantage at least $\gamma$, then $A$ learns $F$ with accuracy $\epsilon$ and confidence $\delta$ for any $(\epsilon, \delta)$ pair satisfying $\gamma \ge 1 - 2\epsilon\delta$.

Proof: First, if $A$ learns $F$ with accuracy $\epsilon$ and confidence $\delta$, then we know that with probability $1 - \delta$, the advantage of $A$ is at least $(1-\epsilon) - \epsilon = 1 - 2\epsilon$, and with probability $\delta$, the advantage of $A$ is at least $-1$. So the overall advantage of $A$ is at least
$$(1-\delta)(1-2\epsilon) - \delta \ge 1 - 2\epsilon - 2\delta$$

Second, if $A$ learns $F$ with advantage $\gamma$, we translate that into the PAC language. We define $A$ to be "lucky" if its advantage is at least $1 - 2\epsilon$. We denote the probability that $A$ is lucky by $p$, and then we have
$$\gamma \le p + (1-p)(1-2\epsilon)$$
or
$$\gamma \le 1 - 2\epsilon + 2\epsilon p$$
So if we have
$$\gamma \ge 1 - 2\epsilon\delta$$
then we have $p \ge 1 - \delta$, and $A$ is lucky with probability at least $1 - \delta$. Therefore $A$ learns $F$ with accuracy $\epsilon$ and confidence $\delta$.

Therefore, roughly speaking: if an algorithm $A$ learns $F$ with high confidence and high accuracy ($A$ "strongly" learns $F$), then the advantage of $A$ in learning $F$ is close to 1; if $A$ learns $F$ weakly, then the advantage of $A$ is non-negligibly higher than 0. On the other hand, if the advantage $A$ has in learning $F$ is close to 1, then $A$ (strongly) learns $F$. The reason that we use the advantage measure in this paper is that we want to show a continuous "trade-off" result between how many queries are needed and how "well" an algorithm learns $F$, and using one single parameter makes the discussion more convenient.
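As a quick numerical illustration of Lemma 1 (the numbers are ours): an algorithm with accuracy $\epsilon = 0.1$ and confidence $\delta = 0.1$ has advantage at least $1 - 2(0.1) - 2(0.1) = 0.6$; conversely, an algorithm with advantage $\gamma = 0.98$ satisfies $\gamma \ge 1 - 2\epsilon\delta = 1 - 2(0.1)(0.1) = 0.98$, and therefore learns with accuracy $0.1$ and confidence $0.1$.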

2.4 Booleanized Linear Functions and Linear Threshold Functions in Finite Fields

Suppose $p$ is a prime number and $n$ a positive integer. Given an arbitrary function that maps inputs from $GF_p$ to boolean values,
$$\phi : GF_p \to \{-1,+1\}$$

we define a class $F_\phi$ of booleanized linear functions as a collection of boolean functions:

$$F_\phi = \{ f_{\phi,\vec{a}}(\vec{x}) := \phi(\vec{a} \cdot \vec{x}) \mid \vec{a} \in GF_p^n \},$$

and we call the function $\phi$ the booleanizer. Booleanized linear functions can be viewed as natural extensions of parity functions (which are linear functions over $GF_2^n$). If the booleanizer function $\phi$ is a threshold function,
$$\phi_b(x) = \begin{cases} +1, & \text{if } x \ge b \\ -1, & \text{if } x < b \end{cases}$$
we call the corresponding class of booleanized linear functions linear threshold functions, and denote the functions by $f_{\vec{a},b}$. Here the comparison is performed by first mapping elements of $GF_p$ to the integers $\{0, 1, ..., p-1\}$ in the most straightforward way and then performing integer comparison.
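The following short sketch (our own code, with illustrative names) evaluates a booleanized linear function and the threshold booleanizer just defined over $GF_p$.

```python
p, n = 7, 4

def boolean_linear(a, x, phi):
    """f_{phi,a}(x) = phi(a . x mod p) for vectors a, x over GF_p."""
    s = sum(ai * xi for ai, xi in zip(a, x)) % p
    return phi(s)

def threshold_booleanizer(b):
    """phi_b(s) = +1 if s >= b else -1, comparing field elements as integers 0..p-1."""
    return lambda s: 1 if s >= b else -1

a = [3, 0, 5, 1]
x = [2, 6, 1, 4]
# b = (p+1)/2 is the threshold value used later in the paper for BUILD-TREE.
print(boolean_linear(a, x, threshold_booleanizer((p + 1) // 2)))   # a.x = 15 = 1 mod 7, so -1
```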

2.5 The Tensor Product and Statistical Distance

Given two probability distributions $D$ and $D'$ over spaces $\Omega$ and $\Omega'$, we define their tensor product $D \otimes D'$ to be a new distribution over $\Omega \times \Omega'$:
$$\Pr_{D \otimes D'}[(X, X') = (x, x')] = \Pr_D[X = x] \cdot \Pr_{D'}[X' = x']$$

Given a finite space $\Omega$ and distributions $D_1, D_2, ..., D_m$ over $\Omega$, we define the all-pair L2 statistical distance (abbreviated as $SD_2$) among $D_1, D_2, ..., D_m$ to be
$$SD_2(D_1, D_2, ..., D_m) = \left[ \sum_{i=1}^m \sum_{j=1}^m \sum_{x \in \Omega} \left( \Pr_{D_i}[X = x] - \Pr_{D_j}[X = x] \right)^2 \right]^{1/2}$$

Under this definition, it is easy to see that
$$SD_2(D, D) = 0$$
and
$$SD_2(D_1, D_2, ..., D_m) = \left[ \sum_{i=1}^m \sum_{j=1}^m SD_2(D_i, D_j)^2 \right]^{1/2}$$

One useful property of the all-pair L2 statistical distance is sub-additivity:

Lemma 2 Let $D_1, D_2, ..., D_m$ be distributions over $\Omega$ and $D_1', D_2', ..., D_m'$ be distributions over $\Omega'$. Then we have
$$SD_2(D_1 \otimes D_1', D_2 \otimes D_2', ..., D_m \otimes D_m') \le SD_2(D_1, D_2, ..., D_m) + SD_2(D_1', D_2', ..., D_m')$$

Proof: We define $P_{i,x} = \Pr_{D_i}[X = x]$ and $Q_{i,y} = \Pr_{D_i'}[Y = y]$. Then we have
$$\Pr_{D_i \otimes D_i'}[(X, Y) = (x, y)] = \Pr_{D_i}[X = x] \cdot \Pr_{D_i'}[Y = y] = P_{i,x} Q_{i,y}$$
We denote $SD_2(D_1 \otimes D_1', D_2 \otimes D_2', ..., D_m \otimes D_m')$ by $\Delta$. Then we have
$$\begin{aligned}
\Delta &= \left[ \sum_{i=1}^m \sum_{j=1}^m \sum_x \sum_y (P_{i,x} Q_{i,y} - P_{j,x} Q_{j,y})^2 \right]^{1/2} \\
&= \left[ \sum_{i=1}^m \sum_{j=1}^m \sum_x \sum_y (P_{i,x} Q_{i,y} - P_{i,x} Q_{j,y} + P_{i,x} Q_{j,y} - P_{j,x} Q_{j,y})^2 \right]^{1/2} \\
&\le \left[ \sum_{i=1}^m \sum_{j=1}^m \sum_x \sum_y (P_{i,x} Q_{i,y} - P_{i,x} Q_{j,y})^2 \right]^{1/2} + \left[ \sum_{i=1}^m \sum_{j=1}^m \sum_x \sum_y (P_{i,x} Q_{j,y} - P_{j,x} Q_{j,y})^2 \right]^{1/2} \\
&= \left[ \sum_{i=1}^m \sum_{j=1}^m \sum_y (Q_{i,y} - Q_{j,y})^2 \cdot \sum_x P_{i,x}^2 \right]^{1/2} + \left[ \sum_{i=1}^m \sum_{j=1}^m \sum_x (P_{i,x} - P_{j,x})^2 \cdot \sum_y Q_{j,y}^2 \right]^{1/2} \\
&\le \left[ \sum_{i=1}^m \sum_{j=1}^m \sum_y (Q_{i,y} - Q_{j,y})^2 \right]^{1/2} + \left[ \sum_{i=1}^m \sum_{j=1}^m \sum_x (P_{i,x} - P_{j,x})^2 \right]^{1/2} \\
&= SD_2(D_1', D_2', ..., D_m') + SD_2(D_1, D_2, ..., D_m)
\end{aligned}$$
where the first inequality is by the triangle inequality and the second inequality is because
$$\sum_x P_{i,x}^2 \le \left( \sum_x P_{i,x} \right)^2 = 1$$
(and similarly $\sum_y Q_{j,y}^2 \le 1$).

Since each random variable naturally induces a distribution, we can also define the all-pair L2 statistical distance among random variables: for random variables $X_1, X_2, ..., X_m$, their all-pair L2 statistical distance is defined to be the all-pair L2 statistical distance among the distributions induced by them. The sub-additivity property remains true: suppose we have random variables $X_1, X_2, ..., X_m$ and $Y_1, Y_2, ..., Y_m$, such that $X_i$ is independent of $Y_j$ for all $i, j \in \{1, 2, ..., m\}$; then
$$SD_2(X_1 \otimes Y_1, X_2 \otimes Y_2, ..., X_m \otimes Y_m) \le SD_2(X_1, X_2, ..., X_m) + SD_2(Y_1, Y_2, ..., Y_m)$$
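A small numerical sketch (our own code and example distributions) of $SD_2$ and the sub-additivity of Lemma 2:

```python
import numpy as np

def sd2(dists):
    """SD_2(D_1,...,D_m) = sqrt( sum_{i,j} sum_x (D_i(x) - D_j(x))^2 )."""
    P = np.asarray(dists)                    # shape (m, |Omega|)
    diffs = P[:, None, :] - P[None, :, :]    # all pairwise differences
    return np.sqrt(np.sum(diffs ** 2))

def tensor(p, q):
    """The tensor product D (x) D' as a flattened distribution over Omega x Omega'."""
    return np.outer(p, q).ravel()

rng = np.random.default_rng(1)
m, k = 5, 8
Ds  = rng.dirichlet(np.ones(k), size=m)      # m distributions over a k-point space
Dps = rng.dirichlet(np.ones(k), size=m)

lhs = sd2([tensor(Ds[i], Dps[i]) for i in range(m)])
rhs = sd2(Ds) + sd2(Dps)
print(lhs, "<=", rhs, lhs <= rhs)            # Lemma 2: sub-additivity
```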

2.6 Chernoff Bounds

We will be using Chernoff bounds in our paper; our version is from [MR95].

Theorem 1 Let $X_1, X_2, ..., X_n$ be a sequence of $n$ independent $\{0,1\}$ random variables. Let $S$ be the sum of the random variables and $\mu = E[S]$. Then, for $0 \le \delta \le 1$, the following inequalities hold:
$$\Pr[S > (1+\delta)\mu] \le e^{-\mu\delta^2/3}$$
and
$$\Pr[S < (1-\delta)\mu] \le e^{-\mu\delta^2/2}$$
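As a quick sanity check with our own numbers: for $n = 100$ independent fair coin flips, $\mu = E[S] = 50$; taking $\delta = 0.2$, Theorem 1 gives $\Pr[S > 60] \le e^{-50 \cdot 0.04/3} \approx 0.51$ and $\Pr[S < 40] \le e^{-50 \cdot 0.04/2} = e^{-1} \approx 0.37$.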

3 Statistical Query Model: Negative Results

In this section we present a negative result characterizing the Statistical Query model. Throughout this section, we use $\Omega$ to denote a finite set of size $M$, and we are interested in functions that take inputs from $\Omega$ and output $+1$ or $-1$.


3.1 Statistical Dimension and Fourier Analysis

Definition 2 Let $\Omega$ be a finite set, let $F$ be a class of boolean functions whose input domain is $\Omega$, and let $D$ be a distribution over $\Omega$. We define SQ-DIM$(F, D)$, the statistical query dimension of $F$ with respect to $D$, to be the largest natural number $d$ such that there exists a real number $\lambda$, satisfying $0 \le \lambda \le 1/2$, and $F$ contains $d$ functions $f_1, f_2, ..., f_d$ with the property that for all $i \ne j$, we have
$$|\langle f_i, f_j \rangle - \lambda| \le \frac{1}{d^3}$$

Notice that the definition of SQ-DIM in [BFJ+94] can be regarded as the special case where $\Omega = \{-1,+1\}^n$ with the restriction that $\lambda = 0$. Notice that although each of the functions $f_1, f_2, ..., f_d$ can be highly correlated with the others, the correlation is always almost the same, and we call it a "false correlation". As we will prove in the next lemma, we can "extract" $d$ new functions $\tilde{f}_1, \tilde{f}_2, ..., \tilde{f}_d$ from $f_1, f_2, ..., f_d$, such that the new functions are almost totally uncorrelated with each other.

Lemma 3 Let $\Omega$, $D$, $d$, $\lambda$, and $f_1, f_2, ..., f_d$ be as defined in Definition 2, with $\lambda > 0$. We define $d$ real-valued functions $\tilde{f}_1, \tilde{f}_2, ..., \tilde{f}_d : \Omega \to \mathbb{R}$ by
$$\tilde{f}_i(x) = \frac{1}{\sqrt{1-\lambda}} f_i(x) - \frac{1}{d}\left( \frac{1}{\sqrt{1-\lambda}} - \frac{1}{\sqrt{1+(d-1)\lambda}} \right) \sum_{j=1}^d f_j(x) \qquad (1)$$
Then we have
$$|\langle \tilde{f}_i, \tilde{f}_i \rangle - 1| \le \frac{8}{d^3}, \quad \forall i \qquad (2)$$
and
$$|\langle \tilde{f}_i, \tilde{f}_j \rangle| \le \frac{8}{d^3}, \quad \forall i \ne j \qquad (3)$$

Proof: One could directly substitute the definition of $\tilde{f}_1, ..., \tilde{f}_d$ into the formulas and check, but here is the reasoning. We first define a new function $\bar{f}$ which is the average of the functions $f_1, f_2, ..., f_d$:
$$\bar{f}(x) = \frac{1}{d} \sum_{i=1}^d f_i(x)$$
Then we can work out the inner product of $\bar{f}$ and $f_i$:
$$\langle \bar{f}, f_i \rangle = \frac{1}{d} + \frac{1}{d} \sum_{j \ne i} \langle f_i, f_j \rangle$$
So if we define $\gamma = \frac{1}{d} + \frac{d-1}{d}\lambda$, we have
$$|\langle \bar{f}, f_i \rangle - \gamma| \le \frac{1}{d^3}$$
and since $\bar{f}(x) = \frac{1}{d}\sum_{i=1}^d f_i(x)$, we also have
$$|\langle \bar{f}, \bar{f} \rangle - \gamma| \le \frac{1}{d^3}$$
Therefore we know that $\gamma \ge 0$, since $\langle \bar{f}, \bar{f} \rangle \ge 0$. We define $g_i(x) = f_i(x) - Z\bar{f}(x)$ for $i = 1, 2, ..., d$, where $Z$ is a constant to be decided. Now we compute the inner products of $g_i$ and $g_j$ for $i \ne j$:
$$\langle g_i, g_j \rangle = \langle f_i - Z\bar{f},\ f_j - Z\bar{f} \rangle = \langle f_i, f_j \rangle - Z\langle f_i, \bar{f} \rangle - Z\langle f_j, \bar{f} \rangle + Z^2\langle \bar{f}, \bar{f} \rangle = \lambda - 2\gamma Z + \gamma Z^2 + O\!\left(\frac{1}{d^3}\right)$$
Let
$$Z = 1 - \sqrt{\frac{\gamma - \lambda}{\gamma}} = 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}}$$
Then $Z$ is a solution to the equation
$$\lambda - 2\gamma Z + \gamma Z^2 = 0$$
and $0 \le Z < 1$. So we can show that
$$|\langle g_i, g_j \rangle| \le \frac{4}{d^3}$$
Now if we compute the inner product of $g_i$ with itself, we get
$$\langle g_i, g_i \rangle = \langle f_i - Z\bar{f},\ f_i - Z\bar{f} \rangle = \langle f_i, f_i \rangle - 2Z\langle f_i, \bar{f} \rangle + Z^2\langle \bar{f}, \bar{f} \rangle = 1 - 2\gamma Z + \gamma Z^2 + O\!\left(\frac{1}{d^3}\right)$$
So we have
$$|\langle g_i, g_i \rangle - (1-\lambda)| \le \frac{4}{d^3}$$
Finally we define
$$\tilde{f}_i(x) = \frac{1}{\sqrt{1-\lambda}}\, g_i(x) = \frac{1}{\sqrt{1-\lambda}} f_i(x) - \left( \frac{1}{\sqrt{1-\lambda}} - \frac{1}{\sqrt{1+(d-1)\lambda}} \right) \bar{f}(x) \qquad (4)$$
and we have
$$|\langle \tilde{f}_i, \tilde{f}_i \rangle - 1| \le \frac{8}{d^3}, \quad \forall i$$
and
$$|\langle \tilde{f}_i, \tilde{f}_j \rangle| \le \frac{8}{d^3}, \quad \forall i \ne j$$
because we have $0 \le \lambda \le 1/2$ and therefore $\frac{1}{1-\lambda} \le 2$.
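The following small numerical experiment (our own code, with synthetic functions rather than the booleanized linear functions of the paper) illustrates the effect of the transform in equations (1)/(4): pairwise correlations of roughly $\lambda$ are driven close to 0. Since the synthetic pairwise correlations only approximate $\lambda$ up to sampling noise, the residual off-diagonal inner products reflect that noise rather than the $8/d^3$ bound of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, lam = 4096, 20, 1 / 3     # domain size, number of functions, target correlation

# Build d +/-1 functions sharing a common "core" so that pairwise correlations are ~ lam.
q = np.sqrt(lam)                # copying the core independently with prob. q gives correlation ~ q^2
core = rng.choice([-1, 1], size=M)
fs = np.array([np.where(rng.random(M) < q, core, rng.choice([-1, 1], size=M))
               for _ in range(d)], dtype=float)

gram_before = fs @ fs.T / M     # empirical <f_i, f_j> under the uniform distribution

Z = 1 - np.sqrt((1 - lam) / (1 + (d - 1) * lam))
fbar = fs.mean(axis=0)
ftil = (fs - Z * fbar) / np.sqrt(1 - lam)       # the transform of equation (1)/(4)
gram_after = ftil @ ftil.T / M

off = ~np.eye(d, dtype=bool)
print("typical |<f_i,f_j>|  before:", np.abs(gram_before[off]).mean())   # ~ lam ~ 0.33
print("typical |<f~_i,f~_j>| after:", np.abs(gram_after[off]).mean())    # close to 0
print("max |<f~_i,f~_i> - 1|      :", np.abs(np.diag(gram_after) - 1).max())
```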

So we now have a group of functions $\tilde{f}_1, \tilde{f}_2, ..., \tilde{f}_d$ that are "almost" orthogonal. Next, we extend this group of functions to a basis and perform Fourier analysis over the basis. This part of the analysis is very similar to the proofs in [BFJ+94], but with different parameters and (sometimes) improved bounds.

Lemma 4 Let $\Omega$, $D$, $d$, $\lambda$, and $f_1, f_2, ..., f_d$ be as defined in Definition 2, and suppose that $D$ has full support, i.e., $\forall x \in \Omega,\ D(x) > 0$, and that $\lambda > 0$, $d > 16$. Let the functions $\tilde{f}_1, \tilde{f}_2, ..., \tilde{f}_d$ be as defined in Lemma 3. Then there exist functions $\tilde{f}_{d+1}, ..., \tilde{f}_M$ such that $\{\tilde{f}_1, ..., \tilde{f}_M\}$ form a basis for the vector space of all real functions over $\Omega$. Furthermore, the added functions are orthonormal, namely, for any $j > d$ and any $k \ne j$, $\langle \tilde{f}_j, \tilde{f}_k \rangle = 0$ and $\langle \tilde{f}_j, \tilde{f}_j \rangle = 1$.

Proof: We first prove that $\tilde{f}_1, \tilde{f}_2, ..., \tilde{f}_d$ are linearly independent. Otherwise, we may assume (without loss of generality) that $\tilde{f}_1 = \sum_{j>1} \alpha_j \tilde{f}_j$ for some $\alpha_2, ..., \alpha_d$. Then we have
$$0 = E_D\!\left[ \left( \tilde{f}_1 - \sum_{j>1} \alpha_j \tilde{f}_j \right)^2 \right] = E_D[\tilde{f}_1^2] - 2\sum_{j>1} \alpha_j E_D[\tilde{f}_1 \tilde{f}_j] + \sum_{j,k>1} \alpha_j \alpha_k E_D[\tilde{f}_j \tilde{f}_k]$$
We define $\alpha_{\max}$ to be
$$\alpha_{\max} = \max\{ |\alpha_j| : j > 1 \}$$
and thus we have
$$\sum_{j>1} |\alpha_j|^2 \ge \alpha_{\max}^2, \qquad \left| \sum_{j>1} \alpha_j E_D[\tilde{f}_1 \tilde{f}_j] \right| \le \frac{8}{d^2}\alpha_{\max}, \qquad \left| \sum_{j,k>1,\, j \ne k} \alpha_j \alpha_k E_D[\tilde{f}_j \tilde{f}_k] \right| \le \frac{8}{d}\alpha_{\max}^2$$
So we have
$$0 \ge \left(1 - \frac{8}{d^3}\right) + \alpha_{\max}^2\left(1 - \frac{8}{d^3}\right) - \frac{16}{d^2}\alpha_{\max} - \frac{8}{d}\alpha_{\max}^2$$
If $\alpha_{\max} \le 1$, the right-hand side is at least
$$1 - \frac{8}{d^3} - \frac{16}{d^2} - \frac{8}{d}$$
which is more than 0 when $d > 16$. If $\alpha_{\max} > 1$, the right-hand side is at least
$$1 - \frac{8}{d^3} + \alpha_{\max}^2\left(1 - \frac{8}{d^3} - \frac{16}{d^2} - \frac{8}{d}\right) \ge 1 - \frac{8}{d^3} > 0$$
So either way we have a contradiction. Now that we have $d$ linearly independent functions $\tilde{f}_1, ..., \tilde{f}_d$, we can use the Gram-Schmidt process to extend them to a basis $\tilde{f}_1, \tilde{f}_2, ..., \tilde{f}_d, \tilde{f}_{d+1}, ..., \tilde{f}_M$ for the vector space of all real functions over $\Omega$, and make sure the added functions are orthonormal: for any $j > d$ and any $k \ne j$, $\langle \tilde{f}_j, \tilde{f}_k \rangle = 0$ and $\langle \tilde{f}_j, \tilde{f}_j \rangle = 1$.

Now that we have a basis for the real functions over $\Omega$, we can extend the distribution $D$ to a distribution $\tilde{D}$ over $\Omega \times \{-1,+1\}$, and extend the basis to a basis for the real functions over $\Omega \times \{-1,+1\}$.

Lemma 5 Let $\Omega$, $D$, $d$, $\lambda$, and the functions $\tilde{f}_1, \tilde{f}_2, ..., \tilde{f}_M$ be as defined in Lemma 4. We define a new distribution $\tilde{D}$ over $\Omega \times \{-1,+1\}$ as
$$\tilde{D}(x, -1) = \tilde{D}(x, +1) = D(x)/2$$
In other words, $\tilde{D}$ is the tensor product of the distribution $D$ over $\Omega$ and the uniform distribution over $\{-1,+1\}$. We extend the definitions of $\tilde{f}_1, ..., \tilde{f}_M$ to the input domain $\Omega \times \{-1,+1\}$ by defining $\tilde{f}_i(x, y) = \tilde{f}_i(x)$ for $x \in \Omega$ and $y \in \{-1,+1\}$. We also define $M$ new functions $h_1, h_2, ..., h_M$ over $\Omega \times \{-1,+1\}$:
$$h_i(x, y) = \tilde{f}_i(x) \cdot y$$
for $x \in \Omega$ and $y \in \{-1,+1\}$. Then $\{\tilde{f}_1, ..., \tilde{f}_M, h_1, ..., h_M\}$ form a basis for the real functions over $\Omega \times \{-1,+1\}$.

Proof: It is easy to check that
$$\langle \tilde{f}_i, h_j \rangle = \frac{1}{2} E_D[\tilde{f}_i(x) \cdot h_j(x,-1)] + \frac{1}{2} E_D[\tilde{f}_i(x) \cdot h_j(x,+1)] = \frac{1}{2} E_D[-\tilde{f}_i(x)\tilde{f}_j(x)] + \frac{1}{2} E_D[\tilde{f}_i(x)\tilde{f}_j(x)] = 0$$
and
$$\langle h_i, h_j \rangle = \frac{1}{2} E_D[h_i(x,-1) h_j(x,-1)] + \frac{1}{2} E_D[h_i(x,+1) h_j(x,+1)] = E_D[\tilde{f}_i(x)\tilde{f}_j(x)] = \langle \tilde{f}_i, \tilde{f}_j \rangle$$

Now that we have a basis for the real functions over $\Omega \times \{-1,+1\}$, we can perform Fourier analysis on any function over $\Omega \times \{-1,+1\}$.

Definition 3 Let $\Omega$, $D$, $d$, and the functions $\tilde{f}_1, \tilde{f}_2, ..., \tilde{f}_M$, $h_1, ..., h_M$ be as defined in Lemma 5. Let $g$ be an arbitrary function over $\Omega \times \{-1,+1\}$; we can write $g$ (uniquely) as
$$g(x, y) = \sum_{i=1}^M \alpha_i \tilde{f}_i(x) + \sum_{i=1}^M \beta_i h_i(x, y) \qquad (5)$$
We call $(\alpha_1, ..., \alpha_M, \beta_1, ..., \beta_M)$ the Fourier coefficients of the function $g$.

Notice that the basis isn't an orthonormal one, but it is close. The following lemmas give upper bounds on the coefficients.

Lemma 6 Let $\Omega$, $D$, $d$, and the functions $\tilde{f}_1, \tilde{f}_2, ..., \tilde{f}_M$, $h_1, ..., h_M$ be as defined in Lemma 5. Let $g$ be an arbitrary function over $\Omega \times \{-1,+1\}$ such that $|g(x, y)| \le 1$ for all $x \in \Omega$ and $y \in \{-1,+1\}$. We write $g$ in its Fourier coefficients as in Definition 3. Then, for $i = 1, 2, ..., M$, $|\alpha_i| \le 1 + 10/d^2$ and $|\beta_i| \le 1 + 10/d^2$.

Proof: WLOG we assume that $|\alpha_1|$ is the largest coefficient. We look at the projection of $g$ on $\tilde{f}_1$:
$$\langle \tilde{f}_1, g \rangle\, \tilde{f}_1 = \left( \alpha_1 + \sum_{i>1} \alpha_i \langle \tilde{f}_1, \tilde{f}_i \rangle + \sum_{i} \beta_i \langle \tilde{f}_1, h_i \rangle \right) \tilde{f}_1 = \left( \alpha_1 + \sum_{i=2}^d \alpha_i \langle \tilde{f}_1, \tilde{f}_i \rangle \right) \tilde{f}_1$$
And since this is a projection, we know its norm is no greater than $\|g\|$, which is bounded by 1. So we have
$$1 \ge \|g\| \ge \left| \langle \tilde{f}_1, g \rangle \right| \cdot \|\tilde{f}_1\| \ge \left( |\alpha_1| - \sum_{i=2}^d |\alpha_i| \cdot \frac{8}{d^3} \right)\left( 1 - \frac{8}{d^3} \right) \ge |\alpha_1|\left( 1 - \frac{8}{d^2} \right)\left( 1 - \frac{8}{d^3} \right) \ge |\alpha_1| \Big/ \left( 1 + \frac{10}{d^2} \right)$$
when $d > 16$. The bound for the $\beta_i$ follows in the same way, projecting on $h_1$ instead of $\tilde{f}_1$.

Lemma 7 Let $\Omega$, $D$, $d$, the functions $\tilde{f}_1, \tilde{f}_2, ..., \tilde{f}_M$, $h_1, ..., h_M$ and $g$ be as defined in Lemma 6, where $d > 100$. Then we have
$$\sum_{i=1}^M \alpha_i^2 \le 1 + \frac{100}{d} \qquad (6)$$

and
$$\sum_{i=1}^M \beta_i^2 \le 1 + \frac{100}{d} \qquad (7)$$

Proof: We have
$$\begin{aligned}
1 \ge \|g\|^2 &= \left\| \sum_{i=1}^M \alpha_i \tilde{f}_i + \sum_{i=1}^M \beta_i h_i \right\|^2 \\
&\ge \sum_{i=1}^M \alpha_i^2 \langle \tilde{f}_i, \tilde{f}_i \rangle + \sum_{i=1}^M \beta_i^2 \langle h_i, h_i \rangle - 2\!\!\sum_{1 \le i < j \le d}\!\! |\alpha_i \alpha_j| \cdot |\langle \tilde{f}_i, \tilde{f}_j \rangle| - 2\!\!\sum_{1 \le i < j \le d}\!\! |\beta_i \beta_j| \cdot |\langle h_i, h_j \rangle| - \sum_{i,j} |\alpha_i \beta_j| \cdot |\langle \tilde{f}_i, h_j \rangle| \\
&> \left( 1 - \frac{8}{d^3} \right) \sum_{i=1}^M \alpha_i^2 - \frac{96}{d}
\end{aligned}$$
using $\langle \tilde{f}_i, h_j \rangle = 0$, the bounds $|\langle \tilde{f}_i, \tilde{f}_j \rangle|, |\langle h_i, h_j \rangle| \le 8/d^3$ for $i \ne j \le d$, and $|\alpha_i|, |\beta_i| \le 1 + 10/d^2$ from Lemma 6. So when $d > 100$, we have
$$\sum_{i=1}^M \alpha_i^2 \le \left( 1 + \frac{96}{d} \right) \Big/ \left( 1 - \frac{8}{d^3} \right) \le 1 + \frac{100}{d}$$
The proof of the other inequality is similar.

An immediate corollary of Lemma 7 is:

Lemma 8 Let $\Omega$, $D$, $d$, the functions $\tilde{f}_1, \tilde{f}_2, ..., \tilde{f}_M$, $h_1, ..., h_M$ and $g$ be as defined in Lemma 6, where $d > 100$. Then we have
$$\sum_{i=1}^d |\alpha_i| \le \sqrt{d} \cdot \left( 1 + \frac{50}{d} \right) \qquad (8)$$
and
$$\sum_{i=1}^d |\beta_i| \le \sqrt{d} \cdot \left( 1 + \frac{50}{d} \right) \qquad (9)$$

Proof: By the Cauchy-Schwarz inequality and Lemma 7,
$$\sum_{i=1}^d |\alpha_i| \le \left( d \sum_{i=1}^d |\alpha_i|^2 \right)^{1/2} \le \sqrt{d + 100} \le \sqrt{d} \cdot \left( 1 + \frac{50}{d} \right)$$

The proof of the other inequality is similar.

The next lemma gives a general bound on how the coefficients are related to the inner products.

Lemma 9 Let $\Omega$, $D$, $d$, and the functions $\tilde{f}_1, \tilde{f}_2, ..., \tilde{f}_M$ be as defined in Lemma 4. Let $g$ be a real-valued function over $\Omega$ such that $|g(x)| \le 1$ for all $x \in \Omega$. We write $g$ as
$$g(x) = \sum_{i=1}^M \alpha_i \tilde{f}_i(x) \qquad (10)$$
Then
$$\langle g, f_j \rangle \le \alpha_j \sqrt{1-\lambda} + \left( 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}} \right) \sqrt{\frac{1+(d-1)\lambda}{d}} + \frac{60}{d} \qquad (11)$$

Proof: By Lemma 3, we have
$$f_i(x) = \sqrt{1-\lambda}\, \tilde{f}_i(x) + \left( 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}} \right) \bar{f}(x)$$
for $i = 1, 2, ..., d$. Summing this expression over $i = 1, 2, ..., d$, we get
$$d\bar{f}(x) = \sum_{i=1}^d f_i(x) = \sqrt{1-\lambda} \sum_{i=1}^d \tilde{f}_i(x) + d\left( 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}} \right) \bar{f}(x)$$
or
$$\bar{f}(x) = \frac{\sqrt{1+(d-1)\lambda}}{d} \sum_{i=1}^d \tilde{f}_i(x)$$
Therefore
$$f_i(x) = \sqrt{1-\lambda}\, \tilde{f}_i(x) + \left( 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}} \right) \frac{\sqrt{1+(d-1)\lambda}}{d} \sum_{j=1}^d \tilde{f}_j(x)$$
We are only interested in the coefficients $\alpha_1, \alpha_2, ..., \alpha_d$, and thus we define
$$h(x) = \sum_{i=d+1}^M \alpha_i \tilde{f}_i(x)$$
Then we have $g(x) = h(x) + \sum_{i=1}^d \alpha_i \tilde{f}_i(x)$, and $\langle h, \tilde{f}_i \rangle = 0$ for $i = 1, 2, ..., d$. Now we compute the inner product of $g$ and $f_i$:
$$\begin{aligned}
\langle g, f_i \rangle &= \left\langle h(x) + \sum_{j=1}^d \alpha_j \tilde{f}_j(x),\ \sqrt{1-\lambda}\, \tilde{f}_i(x) + \left( 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}} \right) \frac{\sqrt{1+(d-1)\lambda}}{d} \sum_{k=1}^d \tilde{f}_k(x) \right\rangle \\
&= \alpha_i \sqrt{1-\lambda}\, \langle \tilde{f}_i, \tilde{f}_i \rangle + \sqrt{1-\lambda} \sum_{j \ne i} \alpha_j \langle \tilde{f}_i, \tilde{f}_j \rangle + \left( 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}} \right) \frac{\sqrt{1+(d-1)\lambda}}{d} \sum_{j=1}^d \sum_{k=1}^d \alpha_j \langle \tilde{f}_j, \tilde{f}_k \rangle \\
&\le \alpha_i \sqrt{1-\lambda} + \left( 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}} \right) \frac{\sqrt{1+(d-1)\lambda}}{d} \left( \sum_{j=1}^d \alpha_j \right) + \Delta
\end{aligned}$$
where the error term $\Delta$ collects the contributions of the inner products $\langle \tilde{f}_j, \tilde{f}_k \rangle$ with $j \ne k$ and of the deviations $|\langle \tilde{f}_j, \tilde{f}_j \rangle - 1|$, all of which are at most $8/d^3$ in absolute value; using $\sum_{j=1}^d |\alpha_j| \le \sqrt{d}(1 + 50/d)$ from Lemma 8, one gets $\Delta \le 10/d$ for $d > 100$. Therefore,
$$\begin{aligned}
\langle g, f_i \rangle &\le \alpha_i \sqrt{1-\lambda} + \left( 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}} \right) \frac{\sqrt{1+(d-1)\lambda}}{d} \left( \sum_{j=1}^d \alpha_j \right) + \frac{10}{d} \\
&\le \alpha_i \sqrt{1-\lambda} + \left( 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}} \right) \frac{\sqrt{1+(d-1)\lambda}}{d} \cdot \sqrt{d}\left( 1 + \frac{50}{d} \right) + \frac{10}{d} \\
&\le \alpha_i \sqrt{1-\lambda} + \left( 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}} \right) \sqrt{\frac{1+(d-1)\lambda}{d}} + \frac{60}{d}
\end{aligned}$$

3.2 Approximating a Function Without a Query

We give an upper bound on the advantage an algorithm $A$ can have in approximating a class of functions if $A$ doesn't make any queries.

Theorem 2 Let $\Omega$ be a domain of size $M$, and let $D$ be a probability distribution over $\Omega$. Let $F$ be a class of functions $F = \{f_1, f_2, ..., f_d\}$ such that $|\langle f_i, f_j \rangle - \lambda| \le 1/d^3$ for all pairs $i \ne j$, where $\lambda > 0$. Let $g : \Omega \to [-1,+1]$ be the characteristic function of an algorithm $A$ such that $\langle g, f_i \rangle \ge T$ for $i = 1, 2, ..., d$. Then we have
$$T \le \sqrt{\frac{1+(d-1)\lambda}{d}} + \frac{70}{d}$$
for $d > 100$.

Proof: We decompose $g$ over the basis as shown in equation (10) in Lemma 9. Then we have
$$T \le \langle g, f_i \rangle \le \alpha_i \sqrt{1-\lambda} + \left( 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}} \right) \sqrt{\frac{1+(d-1)\lambda}{d}} + \frac{60}{d}$$
for $i = 1, 2, ..., d$. Summing up these $d$ inequalities, we get
$$\begin{aligned}
dT &\le \sqrt{1-\lambda} \sum_{i=1}^d \alpha_i + d\left( 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}} \right) \sqrt{\frac{1+(d-1)\lambda}{d}} + 60 \\
&\le \sqrt{1-\lambda} \cdot \sqrt{d}\left( 1 + \frac{50}{d} \right) + d\left( 1 - \sqrt{\frac{1-\lambda}{1+(d-1)\lambda}} \right) \sqrt{\frac{1+(d-1)\lambda}{d}} + 60 \\
&= \sqrt{1-\lambda} \cdot \sqrt{d} + \sqrt{d}\sqrt{1+(d-1)\lambda} - \sqrt{d}\sqrt{1-\lambda} + \frac{50\sqrt{1-\lambda}}{\sqrt{d}} + 60 \\
&\le \sqrt{1+(d-1)\lambda} \cdot \sqrt{d} + 70
\end{aligned}$$
or
$$T \le \sqrt{\frac{1+(d-1)\lambda}{d}} + \frac{70}{d}$$
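For a sense of scale (our own numbers): with pairwise correlation $\lambda = 1/3$, the bound of Theorem 2 gives $T \le \sqrt{(1 + (d-1)/3)/d} + 70/d \approx \sqrt{1/3} \approx 0.577$ for large $d$. This is the no-query limit discussed in the introduction: the trivial strategy of always outputting $f_1$ already achieves advantage about $\lambda = 1/3$, and only advantage beyond the $\sqrt{(1+(d-1)\lambda)/d}$ level has to be paid for with queries.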

We next show that this bound is "almost tight", i.e., we give an example where $T = \sqrt{(1+(d-1)\lambda)/d}$.

Theorem 3 For any odd prime $p$ and any integer $n \ge 2$, there exists a class of $d = p^{n-1}$ boolean functions over $GF_p^n$, $F = \{f_1, f_2, ..., f_d\}$, and a distribution $D$, such that any pair of the functions has identical inner product $\lambda$, and the inner product of the constant function $g(x) \equiv 1$ with any $f_i$ is $\langle g, f_i \rangle_D = \sqrt{(1+(d-1)\lambda)/d}$.

Proof: We consider the linear threshold functions:

8