Improved Concentration Bounds for Count-Sketch

Gregory T. Minton (MIT)        Eric Price (MIT)
Abstract

We present a refined analysis of the classic Count-Sketch streaming heavy hitters algorithm [CCF02]. Count-Sketch uses $O(k \log n)$ linear measurements of a vector $x \in \mathbb{R}^n$ to give an estimate $\hat{x}$ of $x$. The standard analysis shows that this estimate satisfies $\|\hat{x} - x\|_\infty^2 < \|x_{[k]}\|_2^2 / k$, where $x_{[k]}$ is the vector containing all but the largest $k$ coordinates of $x$. Our main result is that most of the coordinates of $\hat{x}$ have substantially less error than this upper bound; namely, for any $c < O(\log n)$, we show that each coordinate $i$ satisfies
$$(\hat{x}_i - x_i)^2 < \frac{c}{\log n} \cdot \frac{\|x_{[k]}\|_2^2}{k}$$
with probability $1 - 2^{-\Omega(c)}$.
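For concreteness, the data structure under analysis can be sketched as follows (an illustrative Python sketch, not from the paper; the class name and parameters are ours): $R$ rows of $C$ counters, a fully random bucket index and sign per (row, coordinate), linear updates, and a median-over-rows estimator.

```python
import numpy as np

class CountSketch:
    def __init__(self, R, C, n, seed=0):
        rng = np.random.default_rng(seed)
        self.R, self.C = R, C
        # Fully random hashing, stored explicitly for clarity; a real streaming
        # implementation would use hash functions instead of O(Rn) stored values.
        self.bucket = rng.integers(0, C, size=(R, n))
        self.sign = rng.choice([-1, 1], size=(R, n))
        self.table = np.zeros((R, C))

    def update(self, i, delta):
        # Linear update: x_i += delta.
        for r in range(self.R):
            self.table[r, self.bucket[r, i]] += self.sign[r, i] * delta

    def estimate(self, i):
        # x_hat_i = median over rows of the signed counter that i hashes to.
        return np.median([self.sign[r, i] * self.table[r, self.bucket[r, i]]
                          for r in range(self.R)])
```

Each row's signed counter is an unbiased estimate of $x_i$; taking the median over the $R$ rows is what drives the exponential concentration analyzed below.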
$$\Pr\left[|\hat{x}_i - x_i| > \epsilon \cdot \frac{\|x_{[k]}\|_2}{\sqrt{k}}\right] < 2e^{-\Omega(\epsilon^2 R)}.$$
Setting $\epsilon = \sqrt{t/R}$ yields the desired result.
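This per-coordinate tail bound can be checked numerically. The following Python sketch (illustrative only; the test vector, parameters, and trial counts are assumptions, not from the paper) estimates a fixed coordinate across independent sketches and reports the empirical probability that the squared error exceeds $(t/R)\,\mu^2$ with $\mu^2 = \|x_{[k]}\|_2^2/C$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, R, C = 2_000, 50, 20, 500
x = np.arange(1, n + 1, dtype=float) ** -0.7   # an arbitrary test vector
mu2 = np.sum(np.sort(x**2)[:-k]) / C           # mu^2 = ||x_[k]||_2^2 / C
i = 777                                        # a fixed, non-heavy coordinate

errs2 = []
for _ in range(300):                           # independent sketches
    buckets = rng.integers(0, C, size=(R, n))
    signs = rng.choice([-1.0, 1.0], size=(R, n))
    table = np.zeros((R, C))
    for r in range(R):
        np.add.at(table[r], buckets[r], signs[r] * x)   # signed counters
    est_i = np.median(signs[:, i] * table[np.arange(R), buckets[:, i]])
    errs2.append((est_i - x[i]) ** 2)
errs2 = np.array(errs2)

for t in [1, 2, 4, 8]:
    print(f"t = {t}: empirical Pr[(x_hat_i - x_i)^2 > (t/R) mu^2] =",
          np.mean(errs2 > t * mu2 / R))
```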

5  Concentration for Sets

Theorem 4.1 shows that each individual error $(\hat{x}_i - x_i)^2$ has a constant chance of being less than $O(1/R)$ times the $\ell_\infty^2$ bound. In Theorem 5.1 we show that this is almost surely true on average over large sets.

Theorem 5.1. Fix a constant $\alpha$ and consider the estimate $\hat{x}$ of $x$ from Count-Sketch using $R$ rows and $C = ck$ columns, for a sufficiently large (depending on $\alpha$) constant $c$. For any set $S \subset [n]$ with $|S| \le k$ and $\log|S| \lesssim R$,
$$\Pr\left[\|\hat{x}_S - x_S\|_2^2 > |S| \cdot \frac{1}{R} \cdot \frac{\|x_{[k]}\|_2^2}{k}\right] \le \min\left(\alpha,\ O\left(\frac{1}{|S|} + \sqrt{\frac{R^5}{k}}\right)\right).$$

Proof. Define $Y_i = \hat{x}_i - x_i$ for $i \in S$ and $\mu^2 = \|x_{[k]}\|_2^2/C$. Throughout we condition on having $|Y_i| \lesssim \mu$ for all $i \in S$, which happens with probability $1 - |S| \cdot 2^{-\Omega(R)} = 1 - 2^{-\Omega(R)} = 1 - o(1)$. By Theorem 4.1, we have
$$\Pr\left[Y_i^2 > \frac{t}{R}\,\mu^2\right] < 2e^{-\Omega(t)}$$
for any $i$ and for $t < R$. Thus $E[Y_i^2] \lesssim \mu^2/R$, so $E[\|Y\|_2^2] \lesssim |S|\,\mu^2/R$. The quantity we want to bound is $\Pr[\|Y\|_2^2 > c\,|S|\,\mu^2/R]$; by Markov's inequality this is $O(1/c)$, which can be made smaller than $\alpha$. This gives the first bound.

The difficult part is the bound which decays as $|S|, k \to \infty$. By Chebyshev's inequality we have
$$\Pr[\|Y\|_2^2 - E[\|Y\|_2^2] > |S|\,\mu^2/R] \le \Pr[(\|Y\|_2^2 - E[\|Y\|_2^2])^2 > (|S|\,\mu^2/R)^2] \le \frac{E[\|Y\|_2^4] - E[\|Y\|_2^2]^2}{(|S|\,\mu^2/R)^2}, \qquad (3)$$

so we proceed to bound $E[\|Y\|_2^4] - E[\|Y\|_2^2]^2$.

By the exponential bound of $Y_i$ around $\mu/\sqrt{R}$, we have $E[Y_i^4] \lesssim \mu^4/R^2$. We now bound $E[Y_i^2 Y_j^2]$ for $i \ne j$. Let $p \le R/C$ be the probability that there is at least one row in which $i$ and $j$ collide, and let $E$ be the associated event. As we show in Lemma 5.2,
$$E[Y_i^2 Y_j^2 \mid \bar{E}] \le E[Y_i^2 \mid \bar{E}]\, E[Y_j^2 \mid \bar{E}] + O(\mu^4 \sqrt{R/C}) \le E[Y_i^2]\, E[Y_j^2]/(1-p)^2 + O(\mu^4 \sqrt{R/C}).$$
On the other hand, $Y_i^2, Y_j^2 \lesssim \mu^2$, so $E[Y_i^2 Y_j^2 \mid E] \lesssim \mu^4$. Putting these together,
$$E[Y_i^2 Y_j^2] - E[Y_i^2]\, E[Y_j^2] \le (1-p)\, E[Y_i^2 Y_j^2 \mid \bar{E}] + p\, E[Y_i^2 Y_j^2 \mid E] - E[Y_i^2]\, E[Y_j^2]$$
$$\le \frac{p}{1-p}\, E[Y_i^2]\, E[Y_j^2] + O(\mu^4 \sqrt{R/C}) + p\, E[Y_i^2 Y_j^2 \mid E] \lesssim \mu^4 (p + \sqrt{R/C}) \lesssim \mu^4 \sqrt{R/C}.$$

Therefore
$$E[\|Y\|_2^4] - E[\|Y\|_2^2]^2 = \sum_i \left(E[Y_i^4] - E[Y_i^2]^2\right) + \sum_{i \ne j} \left(E[Y_i^2 Y_j^2] - E[Y_i^2]\, E[Y_j^2]\right) \lesssim |S|\,\mu^4/R^2 + |S|^2 \mu^4 \sqrt{R/C}$$
and hence
$$\Pr[\|Y\|_2^2 - E[\|Y\|_2^2] > |S|\,\mu^2/R] \lesssim \frac{1}{|S|} + \sqrt{\frac{R^5}{C}}$$

by (3). We now get the desired result by noting that $E[\|Y\|_2^2] \lesssim |S|\,\mu^2/R$ and making $c = C/k$ sufficiently large to absorb the resulting constant.

Lemma 5.2. In the Count-Sketch table with $R$ rows and $k$ columns, define $Y_{i'} = \hat{x}_{i'} - x_{i'}$ and $\mu^2 = \|x_{[k]}\|_2^2/k$. Suppose $R \gtrsim \log k$. After conditioning on the event that coordinates $i$ and $j$ do not collide in any row, and the event that $Y_i^2, Y_j^2 \lesssim \mu^2$, we have $E[Y_i^2 Y_j^2] \le E[Y_i^2]\, E[Y_j^2] + O(\mu^4 \sqrt{R/k})$.

Proof. Without loss of generality, $x_i = x_j = 0$ and, for all $u$, $h_u(i) = i$ and $h_u(j) = j$. Then $Y_{i'} = \operatorname{median}_u y_{u,i'}$ for $i' \in \{i, j\}$. Consider hashing in the following manner. For each row $u$, place each element of $x$ into $y_{u,i}$ independently with probability $1/k$; call the set of such elements $S_i$. Then independently place each element of $x$ into $y_{u,j}$ with probability $1/(k-1)$; call this set $S_j$. Now consider a subset $S'$ of $S_j$ in which each element of $S_j$ appears in $S'$ with probability $1/k$ (this subset will be specified in a moment). Remove the contribution of $S'$ from $y_{u,j}$. We will refer to the value of $y_{u,j}$ before removing $S'$ by $y'_u$, and refer to its median by $Y'$. Let $v$ be the vector $v = y_{*,j} - y'$ and set $a = Y_j - Y'$.

We will use two particular choices of $S'$. The first is $S' = S_i \cap S_j$; note that this choice yields the correct Count-Sketch hashing. The second choice is a set chosen independent of $S_i$ (so that $y_{u,j}$ is independent of $y_{u,i}$). We will show that, after conditioning against $2^{-\Omega(R)}$ probability events, we have
$$E[Y_i^2 (Y' + a)^2] - E[Y_i^2]\, E[Y'^2] \lesssim \mu^4 \sqrt{R/k}. \qquad (4)$$
For the two choices of $S'$, $Y' + a$ is either $Y_j$ or a version of $Y_j$ independent of $Y_i$; combining the two bounds for these choices gives the desired result.

We now prove (4). Condition on $Y_i^2, Y_j^2, Y'^2 \le c^2 \mu^2$ for some constant $c$, as happens with probability $1 - 2^{-\Omega(R)}$. Having done so, the lemma statement is immediate if $R \ge k$, so we may suppose $R \le k$ whenever convenient. We will partition our analysis depending on whether $S'$ contains any of the heavy hitters $[k]$, as happens with probability $1/(k-1)$. To handle the case when this does happen, just apply the trivial bound $E[Y_i^2 (Y' + a)^2] - E[Y_i^2]\, E[Y'^2] \le 2c^4 \mu^4$. This shows that the contribution from this part of the mass is $O(\mu^4/k)$.

Now suppose $S'$ does not contain any heavy hitters. Then, conditioned on the locations of the heavy hitters, we still have that $Y_i$ and $Y'$ are independent, so
$$E[Y_i^2 (Y' + a)^2] - E[Y_i^2]\, E[Y'^2] \le E[2 Y_i^2 Y' a + Y_i^2 a^2] \le 2c^3 \mu^3\, E[|a|] + c^2 \mu^2\, E[a^2].$$
It now suffices to show that $E[a^2] \le \mu^2 R/k$, because this implies $E[|a|] \le \mu \sqrt{R/k}$ and then, using $R/k \le 1$, both terms above are $O(\mu^4 \sqrt{R/k})$.

We have
$$a^2 = \left((\operatorname{median}_u y_{u,j}) - (\operatorname{median}_u y'_u)\right)^2 \le \max_u (y_{u,j} - y'_u)^2 = \|v\|_\infty^2 \le \|v\|_2^2.$$
But $E[\|v\|_2^2] = R\mu^2/(k-1)$, because each of the $R$ terms of $v$ is a sum, with random signs, of a sample of the non-heavy-hitter elements of $x$, each included with probability $1/(k(k-1))$. Finally, we would like to avoid conditioning on $Y'^2$. This is no trouble; if we drop that condition but keep the conditions on $Y_i^2, Y_j^2$, then the additional $2^{-\Omega(R)}$ probability mass can only change $E[Y_i^2 Y_j^2] - E[Y_i^2]\, E[Y_j^2]$ by $O(2^{-\Omega(R)} \mu^4) \ll \mu^4 \sqrt{R/k}$. This completes the proof.
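Theorem 5.1 can likewise be illustrated by simulation. The Python sketch below (assumed test vector and parameters, not from the paper) compares the average squared error over a fixed set $S$ of size $k$ against the $\mu^2/R$ scale from the theorem and the standard worst-case bound $\|x_{[k]}\|_2^2/k$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, R, C = 10_000, 50, 15, 500               # R rows, C = 10k columns
x = np.arange(1, n + 1, dtype=float) ** -0.7   # an arbitrary test vector
tail2 = np.sum(np.sort(x**2)[:-k])             # ||x_[k]||_2^2
mu2 = tail2 / C                                # mu^2 as in the proof of Theorem 5.1
S = rng.choice(n, size=k, replace=False)       # a fixed set, |S| = k, chosen in advance

set_errs = []
for _ in range(20):                            # independent sketches
    buckets = rng.integers(0, C, size=(R, n))
    signs = rng.choice([-1.0, 1.0], size=(R, n))
    table = np.zeros((R, C))
    for r in range(R):
        np.add.at(table[r], buckets[r], signs[r] * x)
    x_hat = np.median(signs * table[np.arange(R)[:, None], buckets], axis=0)
    set_errs.append(np.mean((x_hat[S] - x[S]) ** 2))

print("standard bound ||x_[k]||^2/k :", tail2 / k)
print("Theorem 5.1 scale mu^2/R     :", mu2 / R)
print("observed mean error over S   :", np.mean(set_errs))
```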

6  Concentration for Compressible Signals

As an application to $\ell_2$ reconstruction, we consider recovery of signals with suitable decay: that is, where $|x_k| - |x_{2k}| \gtrsim \|x_{[k]}\|_2/\sqrt{k}$. This condition is satisfied by, for example, any power-law distribution $x_i = i^{-\alpha}$ with $\alpha > 0.5$. The idea is that while Theorem 5.1 only applies to fixed sets of indices, on such distributions the largest $k$ coordinates of $\hat{x}$ will with high probability be among the largest $2k$ coordinates of $x$.

Theorem 6.1. Suppose $|x_k| - |x_{2k}| \gtrsim \|x_{[k]}\|_2/\sqrt{k}$. Let $\hat{x}$ be the result of Count-Sketch using $R = \Theta(\log n)$ rows and $\Theta(k)$ columns, with fully random hash functions. Let $S \subset [n]$ be the locations of the largest $k$ entries of $\hat{x}$. Then
$$\|(\hat{x} - x)_S\|_2^2 < \frac{1}{\log n} \|x_{[k]}\|_2^2$$
with probability $\max\left(3/4,\ 1 - O\left(\sqrt{\log^5 n / k}\right)\right)$.

Proof. Let the number of columns be $ck$ for some constant $c$. Let $S \subset [n]$ contain the largest $k$ coordinates of $\hat{x}$. By the standard Count-Sketch bound we have, with $1 - n^{-\Theta(1)}$ probability, that $\|\hat{x} - x\|_\infty^2 < \|x_{[ck]}\|_2^2/(ck)$. Then for sufficiently large $c$, $|\hat{x}_i| > |\hat{x}_j|$ for all $i \in [k]$ and $j \notin [2k]$, so $S \subseteq [2k]$. Thus $\|\hat{x}_S - x_S\|_2^2 \le \|\hat{x}_{[2k]} - x_{[2k]}\|_2^2$. But by Theorem 5.1,
$$\|\hat{x}_{[2k]} - x_{[2k]}\|_2^2 \le \frac{1}{\log n} \|x_{[2k]}\|_2^2$$
with probability at least $\max\left(1 - \alpha,\ 1 - O\left(\sqrt{\log^5 n / k}\right)\right)$ for any constant $\alpha$.
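The setting of Theorem 6.1 is easy to reproduce numerically; the following sketch (assumed parameters and power-law exponent, not from the paper) recovers the top-$k$ of a power-law vector from Count-Sketch and compares $\|(\hat{x} - x)_S\|_2^2$ to $\|x_{[k]}\|_2^2/\log n$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 20_000, 100
R, C = int(np.log2(n)), 8 * k                  # R = Theta(log n) rows, Theta(k) columns
x = np.arange(1, n + 1, dtype=float) ** -0.8   # power law with alpha > 0.5

buckets = rng.integers(0, C, size=(R, n))
signs = rng.choice([-1.0, 1.0], size=(R, n))
table = np.zeros((R, C))
for r in range(R):
    np.add.at(table[r], buckets[r], signs[r] * x)
x_hat = np.median(signs * table[np.arange(R)[:, None], buckets], axis=0)

S = np.argsort(-np.abs(x_hat))[:k]             # locations of the k largest estimates
tail2 = np.sum(np.sort(x**2)[:-k])             # ||x_[k]||_2^2
print("||(x_hat - x)_S||^2 :", np.sum((x_hat[S] - x[S]) ** 2))
print("tail bound / log n  :", tail2 / np.log(n))
```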

7  Lower Bound on Point Queries

The following is an easy application of the proof technique of [PW11], using Gaussian channel capacity to bound the number of measurements required for a given error tolerance.

Theorem 7.1. For any $1 \le t \le \log(n/k)$ and any distribution on $O(Rk)$ linear measurements of $x \in \mathbb{R}^n$, there must be some vector $x$ and index $i$ for which the estimate $\hat{x}$ of $x$ satisfies
$$\Pr\left[(\hat{x}_i - x_i)^2 > \epsilon\, \frac{t}{R} \cdot \frac{\|x_{[k]}\|_2^2}{k}\right] > e^{-\Omega(t)}$$
for some constant $\epsilon > 0$.

Proof. Suppose without loss of generality that $n = k 2^t$ (by ignoring indices outside $[k 2^t]$) and that $t$ is larger than some constant. Partition $[n]$ into $k$ blocks of size $2^t$. Set $x = z + w$, where $z \in \{0, 1, -1\}^n$ has a single random $\pm 1$ in each block (so it is $k$-sparse) and $w \sim N(0, \epsilon \frac{Rk}{nt} I_n)$ is i.i.d. Gaussian, for some constant $\epsilon$.

Suppose that, in expectation over $x$, $A \in \mathbb{R}^{m \times n}$ allows recovering $\hat{x}$ from $Ax$ with
$$(\hat{x}_i - x_i)^2 \le \epsilon\, \frac{t}{R} \cdot \frac{\|x_{[k]}\|_2^2}{k} \qquad (5)$$
for more than a $1 - 2^{-2t}$ fraction of the coordinates $i$. We will show that such an $A$ must have $m \gtrsim Rk$ rows. Yao's minimax principle then gives a lower bound for distributions on $A$. The inability to increase $t$ and $k$ while preserving the number of rows then gives the desired lower bound on the failure probability.

First, we show that $I(Ax; z) \gtrsim kt$. Let $E$ be the event that (5) holds for more than a $1 - 2^{-t-2}$ fraction of coordinates $i$ and that $\|w\|_2^2 < 2\, E[\|w\|_2^2] = 2\epsilon Rk/t$. $E$ holds with probability $1 - o(1) > 1/2$ over $x$. Conditioned on $E$, we have $(\hat{x}_i - x_i)^2 < 2\epsilon^2$ for a $1 - 2^{-t-2}$ fraction of the coordinates $i$. Thus, for $\epsilon = 1/8$, if we round $\hat{x}_i$ to the nearest integer we recover $x^*$ with $x^*_i = z_i$ in a $1 - 2^{-t-2}$ fraction of the coordinates; hence $x^*$ agrees with $z$ on at least $3/4$ of the blocks. We know that $z$ has $(t+1)$ bits of entropy in each block. This means, conditioned on $E$,
$$I(z; x^*) = H(z) - H(z \mid x^*) \ge k(t+1) - \log\left(\binom{k}{k/4} 2^{(t+1)k/4}\right) \ge k(t+1) - k(t+1)/4 - k \log(4e)/4 \gtrsim kt$$
and hence $I(Ax; z \mid E = 1) \gtrsim kt$ by the data processing inequality. But since $\Pr[E] \ge 1/2$,
$$I(Ax; z) \ge I(Ax; z \mid E) - H(E) \ge I(Ax; z \mid E = 1)\, \Pr[E] - 1 \gtrsim kt. \qquad (6)$$

Second, we show that $I(Ax; z) \lesssim mt/R$. For each row $A_j$, $A_j x = A_j z + A_j w = A_j z + w'$ for $w' \sim N(0, \|A_j\|_2^2\, \epsilon Rk/(nt))$. We also have $E_z[(A_j z)^2] = \|A_j\|_2^2\, k/n$. Hence $A_j x$ is an additive white Gaussian noise channel with signal-to-noise ratio
$$\frac{E[(A_j z)^2]}{E[w'^2]} = \frac{t}{\epsilon R}.$$
By the Shannon–Hartley theorem, this channel has capacity
$$I(A_j x; z) \le \frac{1}{2} \log\left(1 + \frac{t}{\epsilon R}\right) < \frac{t}{\epsilon R} \lesssim \frac{t}{R},$$
and thus, by linearity and independence of the $w'$ (as in [PW11]),
$$I(Ax; z) \lesssim mt/R. \qquad (7)$$
Combining (6) and (7) gives $m \gtrsim Rk$.
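The hard instance in this proof is straightforward to instantiate. The Python sketch below (assumed parameters; the noise variance $\epsilon Rk/(nt)$ follows the reconstruction above) draws $x = z + w$ and checks by Monte Carlo that the per-row signal-to-noise ratio matches the $t/(\epsilon R)$ used in the Shannon–Hartley step.

```python
import numpy as np

rng = np.random.default_rng(3)
k, t, R, eps = 64, 6, 10, 1 / 8
n = k * 2**t                                    # k blocks of size 2^t
z = np.zeros(n)
pos = np.arange(k) * 2**t + rng.integers(0, 2**t, size=k)   # one support index per block
z[pos] = rng.choice([-1.0, 1.0], size=k)
sigma = np.sqrt(eps * R * k / (n * t))          # noise level from the construction above
x = z + rng.normal(0.0, sigma, size=n)          # the hard instance x = z + w

# Monte Carlo check of the per-row SNR used in the Shannon-Hartley step.
A_j = rng.normal(size=n)                        # one measurement row
Az, Aw = np.empty(2000), np.empty(2000)
for s in range(2000):
    zz = np.zeros(n)
    p = np.arange(k) * 2**t + rng.integers(0, 2**t, size=k)
    zz[p] = rng.choice([-1.0, 1.0], size=k)
    Az[s] = A_j @ zz
    Aw[s] = A_j @ rng.normal(0.0, sigma, size=n)

print("empirical SNR:", Az.var() / Aw.var(), " predicted t/(eps R):", t / (eps * R))
print("Shannon-Hartley capacity per row (nats):", 0.5 * np.log(1 + t / (eps * R)))
```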

References

[BKM+00] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Comput. Netw., 33(1-6):309–320, 2000.

[CCF02] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. ICALP, 2002.

[CH10] G. Cormode and M. Hadjieleftheriou. Methods for finding frequent items in data streams. The VLDB Journal, 19(1):3–20, 2010.

[CM04] G. Cormode and S. Muthukrishnan. Improved data stream summaries: The count-min sketch and its applications. LATIN, 2004.

[CM05] G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SDM, 2005.

[CM06] G. Cormode and S. Muthukrishnan. Combinatorial algorithms for compressed sensing. Sirocco, 2006.

[CRT06] E. J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1208–1223, 2006.

[GI10] A. Gilbert and P. Indyk. Sparse recovery using sparse matrices. Proceedings of the IEEE, 2010.

[Mit04] M. Mitzenmacher. A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1:226–251, 2004.

[Mut05] S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.

[MV08] M. Mitzenmacher and S. Vadhan. Why simple hash functions work: exploiting the entropy in a data stream. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 746–755. SIAM, 2008.

[Nis92] N. Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449–461, 1992.

[PDGQ05] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming, 13(4):277, 2005.

[Pri11] E. Price. Efficient sketches for the set query problem. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 41–56. SIAM, 2011.

[PW11] E. Price and D. P. Woodruff. (1 + ε)-approximate sparse recovery. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pages 295–304. IEEE, 2011.