Lower Bounds for Perceptrons Solving Some Separation Problems and Oracle Separation of AM from PP*

Nikolai K. Vereshchagin†
Moscow Institute of New Technologies in Education
11 Kirovogradskaja Street, Moscow, Russia 113587
E-mail: [email protected]

April 29, 1994

* Theorem A.2 was presented at the Conference on Structure in Complexity Theory '92 [13].
† This research was done in part while visiting Rochester; that visit was supported by a grant from the NAS/NRC COBASE program and by NSF grant CCR-8957604. This research was also supported in part by a grant from the AMS and by a grant from the "Cultural Initiative" foundation.

Abstract

We prove that perceptrons separating Boolean matrices in which each row contains a one from matrices in which many rows contain no one must have large size or large order. This result partially strengthens the one-in-a-box theorem of Minsky and Papert [8], which states that perceptrons of small order cannot decide whether every row of a given Boolean matrix contains a one. As a consequence, we prove that AM ∩ co-AM ⊄ PP under some oracle. This contrasts with the fact that MA ⊆ PP under any oracle.

1 Introduction

A perceptron is a depth-2 circuit with a threshold gate at the root and AND-gates at the remaining level. Each input of the threshold gate is labeled by an integer called its weight. The order of a perceptron is the maximum fanin of its AND-gates. The weight of a perceptron is the maximum absolute value of the weights on the inputs to its threshold gate. The size of a perceptron is the sum of the weights on the inputs to its threshold gate.¹

Perceptrons have been studied by Minsky and Papert in [8]. They proved strong lower bounds on the order of perceptrons recognizing certain predicates. They proved that perceptrons computing the parity function of n variables must have order at least n. They also proved that any perceptron recognizing whether all rows of a given Boolean matrix of size n × 4n^2 contain a 1 has order at least n (the so-called one-in-a-box theorem). Beigel in [3] constructed a predicate of n variables that is computable by a perceptron of exponential size and order 1 but is not computable by perceptrons of quasipolynomial (2^{poly(log n)}) size and polylogarithmic (poly(log n)) order. More precisely, he proved the lower bound d^2·log s = Ω(n) for the order d and size s of perceptrons computing that predicate.

We extend Minsky and Papert's one-in-a-box theorem in the following direction. We consider separation problems instead of problems of predicate computation (i.e., decision problems). Let Π stand for the following separation problem: separate Boolean matrices in which every row contains a 1 from matrices in which many rows (say, a fraction 0.99 of all rows) contain only zeros. Obviously, any perceptron solving Π also recognizes whether a given matrix has a one in each row. We prove that the problem Π is not solvable by perceptrons having polylogarithmic order and quasipolynomial size (Corollary 3.5).

Quasipolynomial size and polylogarithmic order come up quite often for the following reason. When translating between nondeterministic Turing machine complexity and circuit complexity in the manner of Furst, Saxe, and Sipser [6], polynomial time translates into quasipolynomial size and polylogarithmic order. Relativizable upper bounds for nondeterministic Turing machines with a particular acceptance mechanism translate into upper bounds for depth-2 circuits with a corresponding gate at the root. (In other words, lower bounds for circuits translate into separations of Turing machine complexity classes via oracles.) In particular, PP-machines translate into perceptrons of polylogarithmic order and quasipolynomial size. So, due to Minsky and Papert's theorem on the parity function, we get that ⊕P ⊄ PP relative to some oracle. The one-in-a-box theorem implies that NP^NP ⊄ PP under some oracle. The above-mentioned result by Beigel implies that P^NP ⊄ PP under some oracle. Our result on the separation problem Π defined above shows that AM ⊄ PP relative to some oracle. A slight improvement of that result yields that AM ∩ co-AM ⊄ PP under some oracle.

The class PP is interesting for the following three reasons. First, this class has the following interpretation. A random input r of the probabilistic machine M that recognizes a language L can be regarded as a voter, and the output M(x, r) of M on the input word x and random input r can be regarded as the opinion of voter r about whether x is in L. From this point of view, PP is the class of all languages L such that membership of x in L can be determined via an election with 2^{poly(|x|)} voters, every voter being polynomial-time bounded. Second, as shown in [12], the class PP is surprisingly powerful: the polynomial hierarchy PH is Turing reducible to PP. Third, PP is closed under polynomial-time truth-table reductions (see [4] and [5]). Thus the class PP has a rather regular structure.

¹ We use a non-standard definition of size. According to the usual definition, the size of a perceptron is the number of AND-gates it contains. See [8].

2 Definitions

We consider languages over the binary alphabet B = {0, 1}. The set of all binary words of length n is denoted by B^n. Functions with binary values are called predicates. All Turing machines output 0 or 1.

Definition 2.1 A perceptron is a depth-2 circuit having a threshold gate at the bottom and AND-gates at the remaining level. The inputs of the AND-gates are Boolean variables or their negations. Each AND-gate is labeled by a natural number called the weight of that AND-gate. The size of a perceptron is the sum of the absolute values of the weights of all its AND-gates. The order of a perceptron is the maximal fanin of its AND-gates.

Let P be a perceptron, and let φ be an assignment of values to its variables. The weight of φ, written W_P(φ), is the sum of the weights of all AND-gates that are true on φ. The perceptron outputs 1 on input φ if W_P(φ) is greater than the threshold of its threshold gate, and 0 otherwise. The output value is denoted by P(φ).

Let M be a Boolean matrix having n rows and m columns. Any matrix of this size can be specified in the usual way by mn Boolean values. When we say that a perceptron P has such a matrix M as input, we mean that those Boolean values are assigned to its input variables. In this case we denote the output of P by P(M).

Let ε be a real in the segment [0, 1/2]. A matrix M is called ε-good if at least (1 − ε)n rows of M contain a one. A matrix M is called ε-bad if at least (1 − ε)n rows of M contain no ones. We say that a perceptron P separates ε-good matrices from δ-bad matrices of size n × m if P(M) = 1 for every ε-good matrix of size n × m and P(M) = 0 for every δ-bad matrix of size n × m.
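The following small Python sketch illustrates these definitions (it is only an illustration, not code from the paper; the function names and the gate encoding are mine). It evaluates a perceptron, given as a list of weighted AND-gates over the entries of a Boolean matrix listed row by row, and computes its order and size.

    from itertools import product

    # An AND-gate is (weight, [(variable_index, required_bit), ...]); a negated
    # variable is modeled by requiring the bit 0.
    def evaluate(gates, threshold, assignment):
        """Return (W_P(assignment), P(assignment)) for the perceptron given by gates."""
        w = sum(weight for weight, literals in gates
                if all(assignment[i] == b for i, b in literals))
        return w, int(w > threshold)

    def order(gates):
        return max(len(literals) for _, literals in gates)

    def size(gates):
        return sum(abs(weight) for weight, _ in gates)

    # Example: 2 x 2 matrices with entries x0 x1 / x2 x3; this perceptron (order 2,
    # size 2, threshold 0) outputs 1 exactly when some row of the matrix is all ones.
    gates = [(1, [(0, 1), (1, 1)]), (1, [(2, 1), (3, 1)])]
    for a in product((0, 1), repeat=4):
        _, out = evaluate(gates, 0, a)
        assert out == int(any(all(a[2*r + c] == 1 for c in range(2)) for r in range(2)))
    print(order(gates), size(gates))

Raising the threshold to 1 in this toy example would make the same gates accept only matrices in which both rows are all ones, which shows how the threshold interacts with the gate weights.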

Definition 2.2 A language L belongs to PP iff there is a polynomial-time probabilistic Turing machine T such that x ∈ L ⟺ Prob[T(x) = 1] > 1/2.

Definition 2.3 L ∈ MA iff there are a polynomial p and a polynomial-time computable predicate Q(x, r, s) such that

    x ∈ L ⟹ ∃s ∈ B^{p(|x|)}: Prob_r[Q(x, r, s)] > 2/3,
    x ∉ L ⟹ ∀s ∈ B^{p(|x|)}: Prob_r[Q(x, r, s)] < 1/3,

where the probability is with respect to the uniform distribution on B^{p(|x|)}.

Definition 2.4 L ∈ AM iff there exist a polynomial p and a polynomial-time computable predicate Q(x, r, s) such that

    x ∈ L ⟹ Prob_r[∃s ∈ B^{p(|x|)} Q(x, r, s)] > 2/3,
    x ∉ L ⟹ Prob_r[∃s ∈ B^{p(|x|)} Q(x, r, s)] < 1/3,

where the probability is with respect to the uniform distribution on B^{p(|x|)}.

In the paper [2] it is proven that MA ⊆ AM; the proof relativizes. In [13] it is proven that MA ⊆ PP; that proof relativizes, too.

3 Lower bounds for perceptrons

Let 1 ≥ a > b ≥ 0 and let n ∈ N. Denote by p(a, b, n) the probability of the event "fewer than bn ones occur among n independent trials of tossing a coin with probability a of a one".
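For concreteness, p(a, b, n) is just a binomial tail; the following Python sketch computes it directly (the function name mirrors the notation above and is otherwise my own).

    import math

    def p(a, b, n):
        """P[ #ones < b*n ] for n independent tosses of a coin with probability a of a one."""
        return sum(math.comb(n, i) * a**i * (1 - a)**(n - i)
                   for i in range(math.ceil(b * n)))     # i = 0, ..., ceil(b*n) - 1

    eps, n = 0.3, 50
    print(p(1 - eps / 2, 1 - eps, n))                    # the tail used throughout Section 3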

Theorem 3.1 Let 1 > ε > 4e^{-m/2}. If there exists a perceptron of order d and size s which separates ε-good matrices from ε-bad Boolean matrices of size n × m, then

    (d + 2)^3 > m·ε^{3/2} / (2√2·π^2·ln(4/ε))     (1)

or

    s ≥ (3·p(1 − ε/2, 1 − ε, n))^{-1}.     (2)

Proof. Let m, n be integers. Denote the set of Boolean matrices having n rows and m columns by M. By M_{ij} we denote the entry of a matrix M in the i-th row and j-th column. Let μ be a probability distribution on the set M. For a property S of matrices in M, we denote by Prob_{μ(M)}[S(M)] the probability that a matrix M taken at random with respect to μ satisfies S(M). Let us say that two probability distributions μ and ν on the set M are d-indistinguishable if

    Prob_{μ(M)}[M_{i_1 j_1} = b_1, …, M_{i_k j_k} = b_k] = Prob_{ν(M)}[M_{i_1 j_1} = b_1, …, M_{i_k j_k} = b_k]

for any sequence ⟨i_1, j_1⟩, …, ⟨i_k, j_k⟩ of indices of length at most d and for any sequence of bits b_1, …, b_k.

The proof easily follows from the following two lemmas.

Lemma 3.1 If (d + 2)^3 ≤ m·ε^{3/2} / (2√2·π^2·ln(4/ε)) and 1 > ε > 4e^{-m/2}, then there exist probability distributions μ and ν on M such that the following hold:

1) a matrix M taken at random with respect to μ is not ε-good with probability at most p(1 − ε/2, 1 − ε, n);
2) a matrix M taken at random with respect to ν is not ε-bad with probability at most p(1 − ε/2, 1 − ε, n);
3) μ and ν are d-indistinguishable.

Lemma 3.2 If there exist probability distributions μ and ν on M satisfying 1), 2) and 3), and there exists a perceptron of order d and size s separating ε-good matrices from ε-bad matrices in M, then inequality (2) is true.

Let us prove Lemma 3.2 first.

Proof of Lemma 3.2. Let 1), 2) and 3) be true. Let P be a perceptron of order d and size s separating ε-good matrices from ε-bad matrices in M. Let E_μ and E_ν stand for the expectation with respect to the distributions μ and ν. We claim that

    E_μ W_P(M) = E_ν W_P(M).     (3)

Let us prove this claim. Let C(M) stand for the Boolean function computed by an AND-gate C in P. Let l be the total number of AND-gates in P, the i-th gate being C_i and having weight w_i. Then E_μ W_P(M) = Σ_{i=1}^{l} w_i·E_μ C_i(M). Therefore, it suffices to prove that Prob_{μ(M)}[C(M) = 1] = Prob_{ν(M)}[C(M) = 1] for every AND-gate C in P. Let C be the conjunction ∧_{k=1}^{d} (M_{i_k j_k} = b_k), where the b_k are 0 or 1. Then

    Prob_{μ(M)}[C(M) = 1] = Prob_{μ(M)}[M_{i_1 j_1} = b_1, …, M_{i_k j_k} = b_k].

Thus d-indistinguishability of μ and ν implies (3).

Let t be the value of the threshold of the threshold gate. Obviously we may assume that |t| < s. Denote p(1 − ε/2, 1 − ε, n) by p. Condition 1) implies that E_μ W_P(M) ≥ (1 − p)(t + 1) − ps. Condition 2) implies that E_ν W_P(M) ≤ (1 − p)t + ps. Therefore, (1 − p)(t + 1) − ps ≤ (1 − p)t + ps, which implies (2). □

Proof of Lemma 3.1. Assume that the condition of Lemma 3.1 is fulfilled. The distributions μ and ν are constructed as follows. Let τ be a probability distribution on the segment [0, 1].

Let us associate with τ the following probability distribution on M, denoted by μ(τ). A matrix M taken at random with respect to μ(τ) is generated as follows. Pick independent random p_1, …, p_n in [0, 1] with respect to τ. Then, for each i ≤ n, take the i-th row of M to be a sequence of m Bernoulli trials with probability of a one equal to p_i. More formally, for any matrix (c_{ij}) ∈ M,

    Prob_{μ(τ)(M)}[M = (c_{ij})] = ∏_{i=1}^{n} ∫_0^1 ( ∏_{j=1}^{m} x_i ∘ c_{ij} ) dτ(x_i),

where x ∘ 0 = 1 − x and x ∘ 1 = x. Consider the first d moments of τ:

    m_1(τ) = ∫_0^1 x dτ(x),  m_2(τ) = ∫_0^1 x^2 dτ(x),  …,  m_d(τ) = ∫_0^1 x^d dτ(x).

We claim that

    Prob_{μ(τ)(M)}[M_{i_1 j_1} = b_1, …, M_{i_k j_k} = b_k]

is a polynomial in m_1(τ), m_2(τ), …, m_d(τ) for any sequence ⟨i_1, j_1⟩, …, ⟨i_k, j_k⟩ of indices of length at most d and for any sequence of bits b_1, …, b_k.

Let us prove this claim. Let ⟨i_1, j_1⟩, …, ⟨i_k, j_k⟩ be a sequence of indices of length at most d and b_1, …, b_k a sequence of bits. For each i ≤ n, denote by c_i the number of l ∈ {1, …, k} such that i_l = i and b_l = 1, and by e_i the number of l ∈ {1, …, k} such that i_l = i and b_l = 0. Then

    Prob_{μ(τ)(M)}[M_{i_1 j_1} = b_1, …, M_{i_k j_k} = b_k] = ∏_{i=1}^{n} ∫_0^1 x^{c_i} (1 − x)^{e_i} dτ(x).

Evidently, for any i, the value ∫_0^1 x^{c_i} (1 − x)^{e_i} dτ(x) is a linear combination of the values m_r(τ) = ∫_0^1 x^r dτ(x), r = 0, 1, …, c_i + e_i. As c_i + e_i ≤ d, the claim is proved.
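To illustrate the mechanism, the following toy Python sketch (my own example; the two distributions are not the ones constructed below) lifts two distributions on [0, 1] that agree on the first moment but not on the second. The induced matrix distributions then agree on every single-entry event but can be told apart by a conjunction of two entries taken from the same row.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, trials = 4, 4, 100_000

    def sample_matrix(draw_p):
        p_rows = draw_p(n)                               # one success probability per row
        return (rng.random((n, m)) < p_rows[:, None]).astype(int)

    draw_tau1 = lambda k: np.full(k, 0.5)                # tau1: point mass at 1/2
    def draw_tau2(k):                                    # tau2: uniform on {0, 1}
        return rng.integers(0, 2, size=k).astype(float)

    for name, draw in (("mu(tau1)", draw_tau1), ("mu(tau2)", draw_tau2)):
        single = pairs = 0
        for _ in range(trials):
            M = sample_matrix(draw)
            single += M[0, 0]
            pairs += M[0, 0] * M[0, 1]
        # single/trials is ~0.5 under both lifts (first moments agree);
        # pairs/trials is ~0.25 under mu(tau1) but ~0.5 under mu(tau2) (second moments differ).
        print(name, single / trials, pairs / trials)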

We shall define probability distributions σ and τ on [0, 1] such that

    m_i(σ) = m_i(τ) for i = 1, 2, …, d.     (4)

Then we shall take μ = μ(σ) and ν = μ(τ). By the claim just proved, μ = μ(σ) and ν = μ(τ) will be d-indistinguishable.

In order to satisfy 2) we shall take τ such that

    Prob_{τ(p)}[p = 0] ≥ 1 − ε/2.     (5)

Let us prove that (5) implies 2). Let τ satisfy (5) and let M be taken at random with respect to μ(τ). Then the probability q that a fixed row of M contains only zeros is at least 1 − ε/2. Since the rows of M are independent, the probability that fewer than (1 − ε)n rows contain only zeros, that is, that M is not ε-bad, is at most p(1 − ε/2, 1 − ε, n). This implies 2).

In order to satisfy 1) we shall take σ such that

    Prob_{σ(p)}[p ≥ δ] ≥ 1 − ε/4,     (6)

where δ = ln(4/ε)/m. By the condition of the lemma, δ < 1. Let us prove that (6) implies 1).

Suppose that σ satisfies (6). Let M be a random matrix with respect to μ(σ). Denote by q the probability that a fixed row of M contains a one. Obviously, q ≥ (1 − ε/4)(1 − (1 − δ)^m). Since 1 − δ < e^{-δ}, we have (1 − δ)^m < e^{-δm} = ε/4 and q ≥ (1 − ε/4)^2 > 1 − ε/2. As the rows are independent, the latter inequality implies 1). Thus it remains to prove the following lemma.

Lemma 3.3 There exist probability distributions σ and τ on [0, 1] having identical first d moments and satisfying conditions (5) and (6).

Proof. We shall define σ explicitly and τ implicitly, using a criterion for the existence of a measure on [0, 1] with given moments (a measure differs from a probability distribution in that the measure of the entire segment [0, 1] may be different from 1; thus a probability distribution can be defined as any measure μ such that μ([0, 1]) = ∫_0^1 1 dμ(x) = 1). The following theorem is due to M. Riesz. In the paper [9] this theorem is proved for infinite sequences of moments and measures on the set of reals; Riesz's proof works in our case as well. See also [10] (and [1] in Russian).

Let m = (m_0, m_1, …, m_d) be a sequence of real numbers. Let a(x) = Σ_{i=0}^{d} a_i x^i be a polynomial of degree at most d. Define (m, a(x)) to be m_0 a_0 + m_1 a_1 + … + m_d a_d.

Theorem 3.2 (Riesz [9]) The following two conditions are equivalent:

(i) There exists a measure μ on [0, 1] such that ∫_0^1 x^i dμ(x) = m_i for all i ∈ {0, 1, …, d}.
(ii) (m, a(x)) ≥ 0 for any polynomial a(x) of degree at most d that is nonnegative on [0, 1].

If d is even, then (ii) is equivalent to the condition

(iii) (m, b(x)^2) ≥ 0 for any polynomial b(x) of degree at most d/2, and (m, x(1 − x)c(x)^2) ≥ 0 for any polynomial c(x) of degree at most d/2 − 1.

The implication (ii)⇒(iii) is obvious. The implication (i)⇒(ii) is simple, and we will use its proof in the sequel; let us prove it. Assume that (i) is true and let μ be a measure satisfying (i). Let a(x) = Σ_{i=0}^{d} a_i x^i be a polynomial that is nonnegative on [0, 1]. Then

    (m, a(x)) = Σ_{i=0}^{d} a_i m_i = Σ_{i=0}^{d} a_i ∫_0^1 x^i dμ(x) = ∫_0^1 a(x) dμ(x) ≥ 0.

For the sake of completeness we also prove the implications (iii)⇒(ii) and (ii)⇒(i) in the Appendix. Let us now derive a corollary from Riesz's theorem and from the proof of the implication (i)⇒(ii).

Corollary 3.3 Let c be a real and let σ be a probability distribution on the segment [0, 1] such that

    ∫_0^1 g(x)^2 dσ(x) ≥ c for any polynomial g(x) of degree at most ⌈d/2⌉ whose constant term is 1.     (7)

Then there exists a probability distribution τ on the segment [0, 1] having the same first d moments as σ and such that Prob_{τ(p)}[p = 0] ≥ c.

Proof. Let σ satisfy the conditions of the corollary. Let k = ⌈d/2⌉. Define the sequence m = (m_0, …, m_{2k}) by the equalities m_0 = 1 − c, m_1 = m_1(σ), …, m_{2k} = m_{2k}(σ). We claim that the sequence m satisfies (iii).

Indeed, let b(x) be a polynomial of degree at most k. Then (m, b(x)^2) = ∫_0^1 b(x)^2 dσ(x) − c·b(0)^2 (because m_0 = 1 − c = m_0(σ) − c). If the constant term b(0) of b(x) is zero, then (m, b(x)^2) = ∫_0^1 b(x)^2 dσ(x) ≥ 0. Otherwise, applying (7) to g(x) = b(x)/b(0), we get (m, b(x)^2) = (∫_0^1 g(x)^2 dσ(x) − c)·b(0)^2 ≥ 0. Let c(x) be a polynomial of degree at most k − 1. Then

    (m, x(1 − x)c(x)^2) = ∫_0^1 x(1 − x)c(x)^2 dσ(x) ≥ 0

(because the polynomial x(1 − x)c(x)^2 has no constant term). By Riesz's theorem there exists a measure ρ on [0, 1] such that m_0(ρ) = 1 − c and m_i(ρ) = m_i(σ) for all i ∈ {1, 2, …, 2k}. Let f(x) be the distribution function of ρ, i.e., f(x) = ρ([0, x]). Consider the function h(x) = c + f(x). Then h(x) is the distribution function of some probability distribution τ on [0, 1]. Evidently τ satisfies the required conditions. □

Let us continue the proof of Lemma 3.3. Let c = 1 − ε/2. By Corollary 3.3 it remains to construct a probability distribution σ on [0, 1] such that (6) and (7) hold. Define σ as follows: Prob_{σ(p)}[p = δ] = 1 − ε/4, and for A ⊆ [0, 1] \ {δ},

    Prob_{σ(p)}[p ∈ A] = (ε/4) ∫_A r(x) dx,  where  r(x) = 2 / (π·√(1 − (1 − 2x)^2)).

In other words, σ is the probability distribution such that Prob_{σ(p)}[p ∈ A] = (1 − ε/4)·χ_A(δ) + (ε/4) ∫_0^1 χ_A(x) r(x) dx for all A ⊆ [0, 1], where χ_A stands for the characteristic function of A. The function r(x) is defined so that

    ∫_0^1 f(x) r(x) dx = π^{-1} ∫_0^π f((1 − cos t)/2) dt

for any function f(x). In particular, ∫_0^1 r(x) dx = 1, therefore σ is a probability distribution. Evidently (6) is true. Let us prove (7).

Let g(x) be a polynomial of degree at most k = ⌈d/2⌉ such that g(0) = 1. Then ∫_0^1 g(x)^2 dσ(x) = (1 − ε/4)·g(δ)^2 + (ε/4) ∫_0^1 g(x)^2 r(x) dx. We claim that either (1 − ε/4)·g(δ)^2 or (ε/4) ∫_0^1 g(x)^2 r(x) dx is greater than 1 − ε/2. Indeed, assume that

    (ε/4) ∫_0^1 g(x)^2 r(x) dx ≤ 1 − ε/2 ≤ 1.

Let us prove that then (1 − ε/4)·g(δ)^2 ≥ 1 − ε/2. Making the substitution 1 − 2x = cos t in the above integral, we obtain

    (ε/4)·π^{-1} ∫_0^π g((1 − cos t)/2)^2 dt ≤ 1.     (8)

We shall prove that g(δ) is close to g(0) = 1. Let T_i(x) stand for the i-th Chebyshev polynomial, i.e., T_i(cos t) = cos(it) for all t. It is well known that the polynomials T_0, T_1, …, T_k form a basis of the space of all polynomials of degree at most k. Since deg(g) ≤ k, there exist a_0, …, a_k such that g((1 − cos t)/2) = Σ_{i=0}^{k} a_i T_i(cos t) = Σ_{i=0}^{k} a_i cos(it).

Then (8) yields that π^{-1} ∫_0^π (Σ_{i=0}^{k} a_i cos(it))^2 dt = a_0^2 + (1/2) Σ_{i=1}^{k} a_i^2 ≤ 4/ε. Hence |a_0|, …, |a_k| ≤ √(8/ε). Let us deduce from this that |g(δ) − g(0)| is at most ε/8. Let β = arccos(1 − 2δ). We have |g(δ) − g(0)| = |Σ_{i=0}^{k} a_i (cos(iβ) − cos 0)| ≤ √(8/ε)·(k + 1)·max_{0≤i≤k} |cos(iβ) − 1|.

It is easy to prove that 1 − β^2/2 ≤ cos β ≤ 1 − 2(β/π)^2 for any β in the segment [−π, π]. Therefore β ≤ π√δ, and hence cos(iβ) ≥ 1 − i^2 β^2/2 ≥ 1 − i^2 π^2 δ/2. Thus we have

    |g(δ) − g(0)| ≤ √(8/ε)·(k + 1)·k^2·π^2·δ/2 ≤ √(1/(8ε))·π^2·δ·(d + 2)^3.

Since (1) is false, δ·(d + 2)^3 ≤ √(ε^3)/(2√2·π^2), which implies that |g(δ) − g(0)| ≤ ε/8; hence g(δ) ≥ 1 − ε/8 and g(δ)^2 ≥ 1 − ε/4. Thus we have (1 − ε/4)·g(δ)^2 ≥ (1 − ε/4)^2 ≥ 1 − ε/2. □
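The heart of this construction can be checked numerically. In the Python sketch below (a sanity check rather than a proof; the parameter values and helper names are my own), note that r(x) = 2/(π√(1 − (1 − 2x)^2)) = 1/(π√(x(1 − x))) is the arcsine density, whose r-th moment is C(2r, r)/4^r, and that the minimum of ∫ g^2 dσ over polynomials g of degree at most k with g(0) = 1 equals 1/(S^{-1})_{00}, where S is the moment matrix of σ.

    import math
    import numpy as np

    eps, m, d = 0.3, 10**6, 10          # chosen so that the hypothesis of Lemma 3.1 holds
    assert (d + 2)**3 <= m * eps**1.5 / (2 * math.sqrt(2) * math.pi**2 * math.log(4 / eps))
    delta = math.log(4 / eps) / m
    k = math.ceil(d / 2)

    def moment(r):                      # r-th moment of sigma = atom at delta + arcsine part
        return (1 - eps / 4) * delta**r + (eps / 4) * math.comb(2 * r, r) / 4**r

    S = np.array([[moment(i + j) for j in range(k + 1)] for i in range(k + 1)])
    minimum = 1.0 / np.linalg.inv(S)[0, 0]   # min of integral of g^2 dsigma over g(0) = 1
    print(minimum, ">=", 1 - eps / 2)
    assert minimum >= 1 - eps / 2            # inequality (7) with c = 1 - eps/2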

Corollary 3.4 Let 1 > ε > 4e^{-m/2} and n ≥ 6. If there exists a perceptron of order d and size s which separates ε-good matrices from ε-bad Boolean matrices of size n × m, then

    (d + 2)^3 > m·ε^{3/2} / (2√2·π^2·ln(4/ε))

or

    s ≥ (1/(18·√(n/ε)))·e^{nε(ln 2 − 0.5)}.

Proof. The corollary easily follows from the upper bound p(1 − ε/2, 1 − ε, n) ≤ 6·√(n/ε)·e^{−nε(ln 2 − 0.5)} proven in the Appendix. □

Let us see what the proven theorem states in the case when ε is constant and m = n varies.

Corollary 3.5 For any ε ∈ (0, 1/2) there exists δ > 0 such that the following holds. If there exists a perceptron of order d and size s separating ε-good matrices from ε-bad matrices of size n × n, then d > δn^{1/3} − 2 or s > 2^{δn}.

Proof. This corollary easily follows from the previous one. □

Question 3.1 Are there perceptrons of order n^{o(1)} (and of arbitrarily large size) separating ε-good matrices from ε-bad Boolean matrices of size n × n (for some fixed ε ∈ (0, 1/2))?

Now we proceed to lower bounds for perceptrons which separate Boolean matrices in which each row contains a one from Boolean matrices in which many rows contain no one; in other words, perceptrons separating 0-good matrices from ε-bad ones. The well-known one-in-a-box theorem of Minsky and Papert states that any perceptron deciding whether a given matrix of size n × 4n^2 is 0-good has order at least n. There are no non-trivial lower bounds on the order of perceptrons separating 0-good matrices from ε-bad ones (for a fixed ε < 1). Our method allows us to obtain the following lower bound on the order-size tradeoff of perceptrons separating 0-good matrices from ε-bad ones.

A lower bound for perceptrons separating 0-good matrices from ε-bad matrices can also be derived from Theorem 3.1 by using the technique of replacing the quantifier "for many" by the sequence of quantifiers ∀∃, due to [11, 7]. That technique allows one to reduce the problem of separating ε-good matrices of size n × m from ε-bad ones to the problem of separating 0-good matrices of size n′ × m′ from ε′-bad ones, for some n′, m′, ε′ depending on n, m, ε. However, a direct argument yields the much better lower bound stated in the following theorem.


Theorem 3.6 Let 0 < ε < 1. If there exists a perceptron of order d and size s separating 0-good matrices from ε-bad Boolean matrices of size n × m, then

    s·(2n·exp(−mε^3/(2^{11}(d + 3)^4)) + p(1 − ε/2, 1 − ε, n)) ≥ 1.     (9)

Proof. The proof goes like the proof of Theorem 3.1. Suppose there exists a perceptron of order d and size s separating 0-good matrices from ε-bad Boolean matrices of size n × m. We shall construct d-indistinguishable probability distributions μ = μ(σ) and ν = μ(τ) on the set M such that

    Prob_{τ(p)}[p = 0] ≥ 1 − ε/2     (10)

and

    Prob_{σ(p)}[p ≥ δ] = 1,     (11)

where δ = ε^3/(2^{11}(d + 3)^4). Inequality (11) implies that Prob_{μ(σ)(M)}[M is 0-good] ≥ 1 − n(1 − δ)^m ≥ 1 − n·e^{−δm}. Let p_1 = n·e^{−δm} and p_2 = p(1 − ε/2, 1 − ε, n). As μ and ν are d-indistinguishable, we can obtain, as in the proof of Lemma 3.2, the inequality (1 − p_1)(t + 1) − p_1 s ≤ (1 − p_2)t + p_2 s, where t stands for the value of the threshold of the threshold gate. We may assume that |t| < s. The last inequality implies that s(2p_1 + p_2) ≥ 1, which is inequality (9). Thus it remains to prove the following lemma.

Lemma 3.4 There exist probability distributions σ and τ on [0, 1] having the same first d moments and satisfying conditions (10) and (11).

Proof. We shall define σ explicitly and τ implicitly, as in the proof of the previous theorem. Let c = 1 − ε/2. By Corollary 3.3 it suffices to construct a probability distribution σ on [0, 1] such that (7) and (11) hold. Define σ as follows:

    Prob_{σ(p)}[p ∈ A] = (1 − ε/4)·χ_A(δ) + (ε/4) ∫_δ^1 χ_A(x) r(x) dx for all A ⊆ [0, 1],

where χ_A stands for the characteristic function of A and

    r(x) = 2 / (π·(1 − δ)·√(1 − ((1 + δ − 2x)/(1 − δ))^2)).

The function r(x) is defined so that

    ∫_δ^1 f(x) r(x) dx = π^{-1} ∫_0^π f((1 + δ − (1 − δ)cos t)/2) dt

for any function f(x). In particular, ∫_δ^1 r(x) dx = 1, therefore σ is a probability distribution. Evidently (11) is true. Let us prove (7).

Let g(x) be a polynomial of degree at most k = ⌈d/2⌉ such that g(0) = 1. Then ∫_0^1 g(x)^2 dσ(x) = (1 − ε/4)·g(δ)^2 + (ε/4) ∫_δ^1 g(x)^2 r(x) dx. We claim that either (1 − ε/4)·g(δ)^2 or (ε/4) ∫_δ^1 g(x)^2 r(x) dx is greater than 1 − ε/2. Indeed, assume that (ε/4) ∫_δ^1 g(x)^2 r(x) dx ≤ 1 − ε/2 ≤ 1. Let us prove that then (1 − ε/4)·g(δ)^2 ≥ 1 − ε/2. Making the substitution (1 + δ)/(1 − δ) − 2x/(1 − δ) = cos t in the above integral, we obtain

    (ε/4)·π^{-1} ∫_0^π g((1 + δ − (1 − δ)cos t)/2)^2 dt ≤ 1.     (12)

Let T_i(x) stand for the i-th Chebyshev polynomial. Let h(y) = g((1 + δ − (1 − δ)y)/2), and let a_0, …, a_k be such that h(y) = Σ_{i=0}^{k} a_i T_i(y). Then (12) yields that |a_0|, …, |a_k| ≤ √(8/ε). Let us deduce from this that |g(δ) − g(0)| is at most ε/8. As g(δ) = h(1) and g(0) = h((1 + δ)/(1 − δ)), we have to prove that |h(1) − h((1 + δ)/(1 − δ))| is at most ε/8.

It is easy to verify that T_i(ch α) = ch(iα) for any integer i, where ch α = 0.5(e^α + e^{−α}). Let α = ln((1 + √δ)/(1 − √δ)). The value of α is defined so that ch α = (1 + δ)/(1 − δ). We can estimate e^α as follows:

    e^α = (1 + √δ)/(1 − √δ) ≤ 1 + 4√δ.

The last inequality is true because δ ≤ 1/4. Then T_i((1 + δ)/(1 − δ)) = ch(iα) ≤ (1 + 4√δ)^i ≤ (1 + 4√δ)^k. Thus,

    |g(δ) − g(0)| = |h(1) − h((1 + δ)/(1 − δ))| = |Σ_{i=0}^{k} a_i (ch(iα) − 1)| ≤ √(8/ε)·(k + 1)·((1 + 4√δ)^k − 1).

We have

    (1 + 4√δ)^k ≤ e^{4k√δ} ≤ e^{2(d+1)√δ} ≤ e^{2(d+1)·ε^{3/2}/(2^{5.5}(d+3)^2)} ≤ e^{ε^{3/2}·2^{−4.5}·(d+3)^{−1}}.

It is easy to prove that e^a ≤ 1 + 2a for any a ≤ 1. Therefore,

    e^{ε^{3/2}·2^{−4.5}·(d+3)^{−1}} ≤ 1 + ε^{3/2}·2^{−3.5}·(d + 3)^{−1}.

Consequently,

    |g(δ) − g(0)| ≤ √(8/ε)·(k + 1)·ε^{3/2}·2^{−3.5}·(d + 3)^{−1} ≤ ε·((d + 3)/2)·2^{−2}·(d + 3)^{−1} = ε/8.

Hence g(δ) ≥ g(0) − ε/8 = 1 − ε/8 and g(δ)^2 ≥ 1 − ε/4, and we get the inequality (1 − ε/4)·g(δ)^2 ≥ 1 − ε/2. □

Corollary 3.7 Let 0 < ε < 1/2 and n ≥ 6. If there exists a perceptron of order d and size s separating 0-good matrices from ε-bad Boolean matrices of size n × m, then

    s·(2n·exp(−mε^3/(2^{11}(d + 3)^4)) + 6√(n/ε)·exp(−nε(ln 2 − 0.5))) ≥ 1.     (13)

Proof. It follows from Theorem 3.6 and from the inequality p(1 − ε/2, 1 − ε, n) ≤ 6√(n/ε)·e^{−nε(ln 2 − 0.5)} proven in the Appendix. □

Corollary 3.8 For any 0 < ε < 1 there exist C, δ > 0 such that the following holds. If there exists a perceptron of order d and size s separating 0-good matrices from ε-bad Boolean matrices of size n × n, then (d + 3)^4·ln(Csn) > δn.

Proof. It is easy to verify that −nε^3/(2^{11}(d + 3)^4) > −nε(ln 2 − 0.5). Therefore

    2n·exp(−nε^3/(2^{11}(d + 3)^4)) + 6√(n/ε)·exp(−nε(ln 2 − 0.5)) < (8n/ε)·exp(−nε^3/(2^{11}(d + 3)^4)).

Therefore, (13) implies that Csn·exp(−δn/(d + 3)^4) ≥ 1 for C = 8/ε and δ = ε^3/2^{11}. □

Acknowledgments The author is sincerely grateful to Vladimir Borisenko, Frederic Green, Lane Hemaspaandra, Andrey Muchnik, Alexander Razborov, Alexander Shen and Yuri Tyurin.

References

[1] A. I. Akhiezer. The classical problem of moments and some related topics in calculus. Moscow, Fizmatgiz, 1961. (Russian.)
[2] L. Babai. "Trading group theory for randomness", Proc. 17th ACM Symp. on Theory of Computing, 1985, pp. 421-429.
[3] R. Beigel. "Perceptrons, PP, and the Polynomial Time Hierarchy", Proc. of 7th Ann. Conf. on Structure in Complexity Theory, June 1992, pp. 14-19.
[4] R. Beigel, N. Reingold and D. Spielman. "PP is closed under intersection", Proc. of 23rd ACM Symp. on Theory of Computing, 1991, pp. 1-9.
[5] L. Fortnow, N. Reingold. "PP is closed under truth-table reductions", 6th IEEE Conf. on Structure in Complexity Theory, 1991, pp. 13-15.
[6] M. Furst, J. Saxe and M. Sipser. "Parity, Circuits and the Polynomial Time Hierarchy", Mathematical Systems Theory, 1984, v. 17, pp. 13-27.
[7] C. Lautemann. "BPP and the polynomial hierarchy", Information Processing Letters, 1983, v. 17, N 4, pp. 215-217.
[8] M. Minsky and S. Papert. Perceptrons. Cambridge, MA: MIT Press, 1988. (Expanded edition; the first edition appeared in 1967.)
[9] M. Riesz. "Sur le problème des moments. Troisième Note", Arkiv för matematik, astronomi och fysik, 1923, v. 17.
[10] F. Riesz et B. Sz.-Nagy. Leçons d'analyse fonctionnelle. Akadémiai Kiadó, Budapest, 1972, 6th ed. (There are English and Russian translations.)
[11] M. Sipser. "A Complexity Theoretic Approach for Randomness", 15th Annual ACM Symposium on Theory of Computing, 1983, pp. 330-335.
[12] S. Toda. "On the computational power of PP and ⊕P", Proc. of 30th Symp. on Foundations of Computer Science, 1989, pp. 514-519.
[13] N. Vereshchagin. "On the Power of PP", Proc. 7th Conference on Structure in Complexity Theory, 1992, pp. 138-143.


A Appendix

A.1 Applying lower bounds for perceptrons to the separation of AM ∩ co-AM from PP

The lower bound in Corollary 3.5 suffices to construct an oracle under which AM ⊄ PP. To construct an oracle under which AM ∩ co-AM ⊄ PP we need a lower bound for perceptrons solving another separation problem. Let M_n stand for the family of Boolean matrices of size n × n and let N_n = M_n × M_n. Let D = ⟨M_0, M_1⟩ be a pair of matrices in N_n. Say that D is of type 0 [type 1] if at least 2/3 of the rows of M_0 [M_1] contain a 1 and at least 2/3 of the rows of M_1 [M_0] contain no 1.

Theorem A.1 There exists δ > 0 such that the following holds for all large enough n. If there exists a perceptron of order d and size s separating the elements of N_n having type 0 from the elements of N_n having type 1, then d > δn^{1/3} − 2 or s > 2^{δn}.

Proof. Let n be an integer. Let P be a perceptron of order d and size s separating the elements of N_n having type 0 from the elements of N_n having type 1. Let us apply Lemma 3.1 with ε = 1/3 and m = n. Suppose that d ≤ δn^{1/3} − 2. If δ is small enough, then for sufficiently large n the condition of Lemma 3.1 is fulfilled. Therefore there exist probability distributions μ and ν on M_n such that the following hold:

1) a matrix M taken at random with respect to μ is not 1/3-good with probability at most p(5/6, 4/6, n);
2) a matrix M taken at random with respect to ν is not 1/3-bad with probability at most p(5/6, 4/6, n);
3) μ and ν are d-indistinguishable.

Denote p(5/6, 4/6, n) by p. Consider the following probability distributions μ′ and ν′ on N_n. To produce a random pair ⟨M_0, M_1⟩ of matrices with respect to μ′, take M_0 at random with respect to μ and take M_1 at random with respect to ν. To produce a random pair ⟨M_0, M_1⟩ with respect to ν′, take M_0 at random with respect to ν and take M_1 at random with respect to μ. Then

    Prob_{μ′(D)}[D has type 0] > 1 − 2p,  Prob_{ν′(D)}[D has type 1] > 1 − 2p.

As we have shown above, 3) implies that E_{μ′} W_P(D) = E_{ν′} W_P(D). Let t be the value of the threshold of the threshold gate. Obviously we may assume that |t| < s. We have E_{μ′} W_P(D) ≥ (1 − 2p)(t + 1) − 2ps and E_{ν′} W_P(D) ≤ (1 − 2p)t + 2ps. Therefore, (1 − 2p)(t + 1) − 2ps ≤ (1 − 2p)t + 2ps, which implies the inequality s > 1/(6p). By the Chernoff inequality, 1/(6p) > 2^{δn} if δ is small enough. □

Theorem A.2 [13] There is an oracle A such that AM^A ∩ co-AM^A ⊄ PP^A.

Proof. Let A be an oracle and let j ∈ N. We regard the values of A on the words of length 4j as a pair of Boolean matrices of size 2^j × 2^j. Denote that pair by A_j. Let us say that A_j is correct if A_j has type 0 or type 1. Associate with any oracle A the language L(A) = {1^j | A_j has type 0}. We will construct an oracle A such that A_j is correct for all j ∈ N and L(A) ∉ PP^A. From the former condition we can easily deduce that L(A) is in AM^A ∩ co-AM^A.

To ensure L(A) ∉ PP^A, let us enumerate all polynomial-time probabilistic machines and denote the i-th machine by PP_i. First define A in such a way that A_j is correct for all j ∈ N. We will then make steps numbered 0, 1, 2, …. On step i we will ensure that L(A) differs from the language recognized by PP_i^A. To this end we will change the value of A on a finite number of words in such a way that for some j ∈ N

    1^j ∈ L(A)  ⇎  Prob[PP_i^A(1^j) = 1] > 1/2.     (14)

After the change we will fix the value of A on all words on which the truth value of (14) depends. This means that on later steps we will not change the value of A on those words.

Let us describe the i-th step. Choose j such that no value of A (the oracle constructed at the (i − 1)-th step) on words of length 4j is fixed and j is sufficiently large (how large j should be we shall see later). Let n = 2^j. Then A_j is in N_n. For any D ∈ N_n denote by A[D] the oracle obtained from A by replacing A_j with D. Let us prove that there is a correct D ∈ N_n such that (14) is true for A[D].

Let d be the maximal number of queries that PP_i can make on input 1^j (we may assume that PP_i makes exactly d oracle queries on every computation path), and let q be the number of random strings used by PP_i on input 1^j. Evidently d ≤ poly(j) = poly(log n) and q ≤ 2^{poly(j)} = 2^{poly(log n)}. Let us construct a perceptron P of order d and size s ≤ 2^d·q such that

    P(D) = 1 ⟺ Prob[PP_i^{A[D]}(1^j) = 1] > 1/2.     (15)

Let r be a random string used by PP_i on input 1^j and let w = w(1)w(2)…w(d) be a binary string of length d such that PP_i accepts provided the oracle answers to its queries are respectively w(1), …, w(d). Denote by u_1, …, u_d the queries to the oracle made by PP_i on input 1^j with random string r provided the answers to the queries are respectively w(1), …, w(d). Associate with every such pair ⟨w, r⟩ the following AND-gate C: on an assignment D ∈ N_n, the gate C outputs 1 iff u_k ∈ A[D] ⟺ w(k) = 1 for all k ∈ {1, 2, …, d}. Let the weight of C be 1, and let the threshold of the perceptron P be q/2. It is easy to verify that

    W_P(D) = q·Prob[PP_i^{A[D]}(1^j) = 1],

which implies (15). Recall that the order of P is poly(j) = poly(log n) and the size of P is 2^{poly(j)} = 2^{poly(log n)}. By Theorem A.1, if j is large enough then P cannot separate pairs of type 0 from pairs of type 1 in N_n. Thus there exists a correct D ∈ N_n such that

    D has type 0  ⇎  Prob[PP_i^{A[D]}(1^j) = 1] > 1/2,

and we are done. □
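The following Python sketch is a toy illustration of this construction (the machine, the oracle, and all names are made up, so it is a sketch under those assumptions rather than the paper's construction verbatim). It builds one weight-1 AND-gate for every pair ⟨w, r⟩ of an accepting answer pattern and a random string, uses threshold q/2, and checks that W_P counts exactly the accepting random strings. The toy machine always asks exactly d adaptive queries, as assumed in the proof.

    from itertools import product
    import hashlib

    d, q = 3, 16                                    # queries per computation, number of random strings
    R = [format(r, "04b") for r in range(q)]

    def query_word(r, k, answers_so_far):
        # the k-th oracle query; it may depend on r and on the previous answers
        return f"{r}:{k}:{''.join(map(str, answers_so_far))}"

    def accepts(answers):
        return sum(answers) % 2 == 0                # toy acceptance rule: even parity of answers

    def oracle(word):                               # a fixed toy oracle A
        return hashlib.sha256(word.encode()).digest()[0] % 2

    # One weight-1 AND-gate per pair <w, r> with w an accepting answer pattern:
    # the gate requires oracle(u_k) == w[k] for the queries u_k asked under answers w.
    gates = []
    for r in R:
        for w in product((0, 1), repeat=d):
            if not accepts(w):
                continue
            words = [query_word(r, k, w[:k]) for k in range(d)]
            gates.append(list(zip(words, w)))

    def W_P(orc):
        return sum(all(orc(u) == bit for u, bit in gate) for gate in gates)

    def run(r, orc):                                # direct simulation of the machine on r
        answers = []
        for k in range(d):
            answers.append(orc(query_word(r, k, answers)))
        return accepts(answers)

    accepted = sum(run(r, oracle) for r in R)
    assert W_P(oracle) == accepted                  # W_P counts exactly the accepting random strings
    print("P(D) =", int(W_P(oracle) > q / 2), " Prob[accept] > 1/2:", accepted > q / 2)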

A.2 Proof of the implication (ii)⇒(i) in Riesz's theorem

Assume that (ii) is true. Let q_1, q_2, … be an enumeration of all rational numbers in [0, 1]. Let

    r_i(x) = 1 if 0 ≤ x ≤ q_i,  and  r_i(x) = 0 if q_i < x ≤ 1.

Consider the linear space L over R consisting of all functions f(x) on [0, 1] of the form

    f(x) = Σ_{i∈I} s_i r_i(x) + a(x),     (16)

where I is a finite set of natural numbers, s_i ∈ R, and a(x) is a polynomial of degree at most d. Let K be the set of all f ∈ L such that f is nonnegative on [0, 1].

Claim. There is a linear functional l defined on L such that l is nonnegative on K and l(a(x)) = (m, a(x)) for all polynomials a(x) of degree at most d.

Proof of the claim. Let L_i be the set of all functions f(x) of the form (16) with I ⊆ {1, 2, …, i}. Define l_0 to be the functional on L_0 (the set of all polynomials of degree at most d) given by l_0(a(x)) = (m, a(x)). Then (ii) means that l_0 is nonnegative on K ∩ L_0. By induction we shall prove that there is a sequence l_0, l_1, l_2, … of linear functionals such that l_i is defined on L_i, is nonnegative on L_i ∩ K, and extends l_{i−1}. Then as l we can take the union of all l_i, i ∈ N.

Suppose the functional l_i has already been defined in such a way that l_i is nonnegative on L_i ∩ K. Obviously, we only have to define the value of l_{i+1} on r_{i+1}(x). Suppose that this value equals v. One can easily verify that in this case l_{i+1} is nonnegative on L_{i+1} ∩ K iff v satisfies the following two conditions:

(a) v ≤ l_i(f(x)) for all f(x) ∈ L_i such that r_{i+1}(x) ≤ f(x) for all x ∈ [0, 1];
(b) l_i(g(x)) ≤ v for all g(x) ∈ L_i such that g(x) ≤ r_{i+1}(x) for all x ∈ [0, 1].

Let us prove that there is v ∈ R satisfying (a) and (b). Denote

    A = {l_i(f(x)) | f(x) ∈ L_i, ∀x ∈ [0, 1]: r_{i+1}(x) ≤ f(x)},
    B = {l_i(g(x)) | g(x) ∈ L_i, ∀x ∈ [0, 1]: g(x) ≤ r_{i+1}(x)}.

Evidently it is sufficient to prove that A ≠ ∅, B ≠ ∅, and v_1 ≥ v_2 for all v_1 ∈ A and v_2 ∈ B. As r_{i+1}(x) is bounded and L_0 contains all constant functions, we have A ≠ ∅ and B ≠ ∅. If v_1 ∈ A with v_1 = l_i(f(x)) and v_2 ∈ B with v_2 = l_i(g(x)), then (f(x) − g(x)) ∈ K, therefore v_1 = l_i(f(x)) ≥ l_i(g(x)) = v_2. □

Now consider the function g defined on [0, 1] by g(x) = inf{l(r_i) | x ≤ q_i, i ∈ N}. One can easily prove that g(x) is monotone and continuous from the right (lim_{y→x+0} g(y) = g(x)). Hence g is the distribution function of some measure μ on [0, 1], i.e., there is a measure μ on [0, 1] such that μ([0, x]) = g(x) for all x ∈ [0, 1]. Obviously, for all i, ∫_0^1 r_i(x) dμ(x) = μ([0, q_i]) = g(q_i) = l(r_i(x)). From this and the nonnegativity of l on K we can easily deduce that ∫_0^1 x^i dμ(x) = l(x^i) = m_i for all i ∈ {0, 1, …, d}. □

A.3 Proof of the implication (iii)⇒(ii) in Riesz's theorem

This implication easily follows from the fact that, for even d, any polynomial of degree at most d which is nonnegative on [0, 1] has the form a(x)^2 + x(1 − x)b(x)^2, where deg a(x) ≤ d/2 and deg b(x) ≤ d/2 − 1. The latter fact in turn follows from the fact that every polynomial that is nonnegative on the set {y ∈ R | y ≥ 0} has the form p(y)^2 + y·q(y)^2.

Indeed, suppose that the latter assertion is true. Let c(x) be a polynomial of degree at most d that is nonnegative on [0, 1]. Then the polynomial c(y/(1 + y))·(1 + y)^d is nonnegative on [0, +∞), therefore for some polynomials p(y) and q(y) we have c(y/(1 + y))·(1 + y)^d = p(y)^2 + y·q(y)^2. Evidently deg p ≤ d/2 and deg q ≤ d/2 − 1. Substituting y = x/(1 − x) we get c(x) = p(x/(1 − x))^2·(1 − x)^d + x(1 − x)^{d−1}·q(x/(1 − x))^2. Evidently

    a(x) = p(x/(1 − x))·(1 − x)^{d/2}  and  b(x) = q(x/(1 − x))·(1 − x)^{d/2 − 1}

are polynomials of degrees at most d/2 and d/2 − 1, respectively.

Thus it remains to prove that every polynomial r(y) which is nonnegative on [0, +∞) has the form r(y) = p(y)^2 + y·q(y)^2. Let P be the set of all polynomials having such a form, and let r(y) be a polynomial that is nonnegative on [0, +∞). Obviously, it is sufficient to prove the following two assertions: (a) r(y) can be represented as a product of polynomials from P, and (b) if r_1(y) ∈ P and r_2(y) ∈ P then r_1(y)·r_2(y) ∈ P.

To prove (a), let us decompose r(y) into a product of polynomials irreducible over R:

    r(y) = A·(y + a_1)^{i_1} ⋯ (y + a_n)^{i_n}·(y^2 + b_1 y + c_1)^{j_1} ⋯ (y^2 + b_m y + c_m)^{j_m}.

Evidently A > 0. Take an arbitrary k ≤ n. Then a_k ≥ 0 or i_k is even. If a_k ≥ 0 then y + a_k ∈ P, as y + a_k = (√a_k)^2 + y·1^2. If i_k is even then again (y + a_k)^{i_k} ∈ P. Take an arbitrary k ≤ m. Obviously c_k > 0, because s(y) = y^2 + b_k y + c_k is irreducible. We have s(y) = (y − √c_k)^2 + y(2√c_k + b_k). Since s(y) is irreducible, we have s(√c_k) = √c_k·(2√c_k + b_k) > 0, therefore s(y) ∈ P.

The assertion (b) follows from the identity

    (p(y)^2 + y·q(y)^2)(s(y)^2 + y·t(y)^2) = (p(y)s(y) − y·q(y)t(y))^2 + y·(p(y)t(y) + q(y)s(y))^2.  □
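As a quick symbolic check of this identity, and of the decomposition it yields for one concrete polynomial, here is a small sympy sketch (my own example, not from the paper):

    import sympy as sp

    y, p, q, s, t = sp.symbols('y p q s t')
    identity = (p*s - y*q*t)**2 + y*(p*t + q*s)**2 - (p**2 + y*q**2)*(s**2 + y*t**2)
    assert sp.expand(identity) == 0

    # Example: r(y) = (y + 1)(y^2 + y + 1) is nonnegative on [0, +infinity).
    # Its factors decompose as y + 1 = 1^2 + y*1^2 and y^2 + y + 1 = (y - 1)^2 + y*(sqrt(3))^2;
    # the identity combines them into a single representation p(y)^2 + y*q(y)^2.
    P, Q, S, T = sp.Integer(1), sp.Integer(1), y - 1, sp.sqrt(3)
    combined = (P*S - y*Q*T)**2 + y*(P*T + Q*S)**2
    assert sp.expand(combined - (y + 1)*(y**2 + y + 1)) == 0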

A.4 An upper bound for p(1 − ε/2, 1 − ε, n)

Lemma A.1 For any ε ∈ [0, 1/2] and n ≥ 6, the probability p_n of the event "fewer than (1 − ε)n ones occur among n independent trials of tossing a coin with probability 1 − ε/2 of a one" does not exceed 6√(n/ε)·e^{−nε(ln 2 − 0.5)}.
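Before the proof, here is a quick numerical spot-check of this bound (a Python sketch, not part of the argument; the parameter values are my own choice):

    import math

    def tail(eps, n):
        """P[ #ones < (1 - eps)*n ] for n tosses with probability 1 - eps/2 of a one."""
        a = 1 - eps / 2
        return sum(math.comb(n, i) * a**i * (1 - a)**(n - i)
                   for i in range(math.ceil((1 - eps) * n)))

    for eps in (0.1, 0.3, 0.5):
        for n in (6, 20, 100):
            exact = tail(eps, n)
            bound = 6 * math.sqrt(n / eps) * math.exp(-n * eps * (math.log(2) - 0.5))
            assert exact <= bound
            print(f"eps={eps} n={n}  exact={exact:.3e}  bound={bound:.3e}")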

Proof. Obviously,

    p_n = Σ_{i < (1−ε)n} C(n, i)·(1 − ε/2)^i·(ε/2)^{n−i}.