Improved Concentration Bounds for Count-Sketch

Gregory T. Minton (MIT)        Eric Price (MIT)
Abstract

We present a refined analysis of the classic Count-Sketch streaming heavy hitters algorithm [CCF02]. Count-Sketch uses $O(k \log n)$ linear measurements of a vector $x \in \mathbb{R}^n$ to give an estimate $\hat{x}$ of $x$. The standard analysis shows that this estimate satisfies $\|\hat{x} - x\|_\infty^2 < \|x_{[k]}\|_2^2 / k$, where $x_{[k]}$ is the vector containing all but the largest $k$ coordinates of $x$. Our main result is that most of the coordinates of $\hat{x}$ have substantially less error than this upper bound; namely, for any $c < O(\log n)$, we show that each coordinate $i$ satisfies
$$(\hat{x}_i - x_i)^2 < \frac{c}{\log n} \cdot \frac{\|x_{[k]}\|_2^2}{k}$$
with probability $1 - 2^{-\Omega(c)}$.
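For concreteness, the data structure under analysis can be sketched as follows (an illustrative Python sketch, not from the paper; the class name and parameters are ours): $R$ rows of $C$ counters, a fully random bucket index and sign per (row, coordinate), linear updates, and a median-over-rows estimator.

```python
import numpy as np

class CountSketch:
    def __init__(self, R, C, n, seed=0):
        rng = np.random.default_rng(seed)
        self.R, self.C = R, C
        # Fully random hashing, stored explicitly for clarity; a real streaming
        # implementation would use hash functions instead of O(Rn) stored values.
        self.bucket = rng.integers(0, C, size=(R, n))
        self.sign = rng.choice([-1, 1], size=(R, n))
        self.table = np.zeros((R, C))

    def update(self, i, delta):
        # Linear update: x_i += delta.
        for r in range(self.R):
            self.table[r, self.bucket[r, i]] += self.sign[r, i] * delta

    def estimate(self, i):
        # x_hat_i = median over rows of the signed counter that i hashes to.
        return np.median([self.sign[r, i] * self.table[r, self.bucket[r, i]]
                          for r in range(self.R)])
```

Each row's signed counter is an unbiased estimate of $x_i$; taking the median over the $R$ rows is what drives the exponential concentration analyzed below.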
$$\Pr\left[|\hat{x}_i - x_i| > \epsilon \cdot \frac{\|x_{[k]}\|_2}{\sqrt{k}}\right] < 2e^{-\Omega(\epsilon^2 R)}.$$
Setting $\epsilon = \sqrt{t/R}$ yields the desired result.
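This per-coordinate tail bound can be checked numerically. The following Python sketch (illustrative only; the test vector, parameters, and trial counts are assumptions, not from the paper) estimates a fixed coordinate across independent sketches and reports the empirical probability that the squared error exceeds $(t/R)\,\mu^2$ with $\mu^2 = \|x_{[k]}\|_2^2/C$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, R, C = 2_000, 50, 20, 500
x = np.arange(1, n + 1, dtype=float) ** -0.7   # an arbitrary test vector
mu2 = np.sum(np.sort(x**2)[:-k]) / C           # mu^2 = ||x_[k]||_2^2 / C
i = 777                                        # a fixed, non-heavy coordinate

errs2 = []
for _ in range(300):                           # independent sketches
    buckets = rng.integers(0, C, size=(R, n))
    signs = rng.choice([-1.0, 1.0], size=(R, n))
    table = np.zeros((R, C))
    for r in range(R):
        np.add.at(table[r], buckets[r], signs[r] * x)   # signed counters
    est_i = np.median(signs[:, i] * table[np.arange(R), buckets[:, i]])
    errs2.append((est_i - x[i]) ** 2)
errs2 = np.array(errs2)

for t in [1, 2, 4, 8]:
    print(f"t = {t}: empirical Pr[(x_hat_i - x_i)^2 > (t/R) mu^2] =",
          np.mean(errs2 > t * mu2 / R))
```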

5  Concentration for Sets

Theorem 4.1 shows that each individual error $(\hat{x}_i - x_i)^2$ has a constant chance of being less than $O(1/R)$ times the $\ell_\infty^2$ bound. In Theorem 5.1 we show that this is almost surely true on average over large sets.

Theorem 5.1. Fix a constant $\alpha$ and consider the estimate $\hat{x}$ of $x$ from Count-Sketch using $R$ rows and $C = ck$ columns, for a sufficiently large (depending on $\alpha$) constant $c$. For any set $S \subset [n]$ with $|S| \le k$ and $\log|S| \lesssim R$,
$$\Pr\left[\|\hat{x}_S - x_S\|_2^2 > |S| \cdot \frac{1}{R} \cdot \frac{\|x_{[k]}\|_2^2}{k}\right] \le \min\left(\alpha,\ O\left(\frac{1}{|S|} + \sqrt{\frac{R^5}{k}}\right)\right).$$

Proof. Define $Y_i = \hat{x}_i - x_i$ for $i \in S$ and $\mu^2 = \|x_{[k]}\|_2^2/C$. Throughout we condition on having $|Y_i| \lesssim \mu$ for all $i \in S$, which happens with probability $1 - |S| \cdot 2^{-\Omega(R)} = 1 - 2^{-\Omega(R)} = 1 - o(1)$. By Theorem 4.1, we have
$$\Pr\left[Y_i^2 > \frac{t}{R}\,\mu^2\right] < 2e^{-\Omega(t)}$$
for any $i$ and for $t < R$. Thus $E[Y_i^2] \lesssim \mu^2/R$, so $E[\|Y\|_2^2] \lesssim |S|\,\mu^2/R$. The quantity we want to bound is $\Pr[\|Y\|_2^2 > c\,|S|\,\mu^2/R]$; by Markov's inequality this is $O(1/c)$, which can be made smaller than $\alpha$. This gives the first bound.

The difficult part is the bound which decays as $|S|, k \to \infty$. By Chebyshev's inequality we have
$$\Pr[\|Y\|_2^2 - E[\|Y\|_2^2] > |S|\,\mu^2/R] \le \Pr[(\|Y\|_2^2 - E[\|Y\|_2^2])^2 > (|S|\,\mu^2/R)^2] \le \frac{E[\|Y\|_2^4] - E[\|Y\|_2^2]^2}{(|S|\,\mu^2/R)^2}, \qquad (3)$$

so we proceed to bound $E[\|Y\|_2^4] - E[\|Y\|_2^2]^2$.

By the exponential bound of $Y_i$ around $\mu/\sqrt{R}$, we have $E[Y_i^4] \lesssim \mu^4/R^2$. We now bound $E[Y_i^2 Y_j^2]$ for $i \ne j$. Let $p \le R/C$ be the probability that there is at least one row in which $i$ and $j$ collide, and let $E$ be the associated event. As we show in Lemma 5.2,
$$E[Y_i^2 Y_j^2 \mid \bar{E}] \le E[Y_i^2 \mid \bar{E}]\, E[Y_j^2 \mid \bar{E}] + O(\mu^4 \sqrt{R/C}) \le E[Y_i^2]\, E[Y_j^2]/(1-p)^2 + O(\mu^4 \sqrt{R/C}).$$
On the other hand, $Y_i^2, Y_j^2 \lesssim \mu^2$, so $E[Y_i^2 Y_j^2 \mid E] \lesssim \mu^4$. Putting these together,
$$E[Y_i^2 Y_j^2] - E[Y_i^2]\, E[Y_j^2] \le (1-p)\, E[Y_i^2 Y_j^2 \mid \bar{E}] + p\, E[Y_i^2 Y_j^2 \mid E] - E[Y_i^2]\, E[Y_j^2]$$
$$\le \frac{p}{1-p}\, E[Y_i^2]\, E[Y_j^2] + O(\mu^4 \sqrt{R/C}) + p\, E[Y_i^2 Y_j^2 \mid E] \lesssim \mu^4 (p + \sqrt{R/C}) \lesssim \mu^4 \sqrt{R/C}.$$

Therefore
$$E[\|Y\|_2^4] - E[\|Y\|_2^2]^2 = \sum_i \left(E[Y_i^4] - E[Y_i^2]^2\right) + \sum_{i \ne j} \left(E[Y_i^2 Y_j^2] - E[Y_i^2]\, E[Y_j^2]\right) \lesssim |S|\,\mu^4/R^2 + |S|^2 \mu^4 \sqrt{R/C}$$
and hence
$$\Pr[\|Y\|_2^2 - E[\|Y\|_2^2] > |S|\,\mu^2/R] \lesssim \frac{1}{|S|} + \sqrt{\frac{R^5}{C}}$$

by (3). We now get the desired result by noting that $E[\|Y\|_2^2] \lesssim |S|\,\mu^2/R$ and making $c = C/k$ sufficiently large to absorb the resulting constant.

Lemma 5.2. In the Count-Sketch table with $R$ rows and $k$ columns, define $Y_{i'} = \hat{x}_{i'} - x_{i'}$ and $\mu^2 = \|x_{[k]}\|_2^2/k$. Suppose $R \gtrsim \log k$. After conditioning on the event that coordinates $i$ and $j$ do not collide in any row, and the event that $Y_i^2, Y_j^2 \lesssim \mu^2$, we have $E[Y_i^2 Y_j^2] \le E[Y_i^2]\, E[Y_j^2] + O(\mu^4 \sqrt{R/k})$.

Proof. Without loss of generality, $x_i = x_j = 0$ and, for all $u$, $h_u(i) = i$ and $h_u(j) = j$. Then $Y_{i'} = \operatorname{median}_u y_{u,i'}$ for $i' \in \{i, j\}$. Consider hashing in the following manner. For each row $u$, place each element of $x$ into $y_{u,i}$ independently with probability $1/k$; call the set of such elements $S_i$. Then independently place each element of $x$ into $y_{u,j}$ with probability $1/(k-1)$; call this set $S_j$. Now consider a subset $S'$ of $S_j$ in which each element of $S_j$ appears in $S'$ with probability $1/k$ (this subset will be specified in a moment). Remove the contribution of $S'$ from $y_{u,j}$. We will refer to the value of $y_{u,j}$ before removing $S'$ by $y'_u$, and refer to its median by $Y'$. Let $v$ be the vector $v = y_{*,j} - y'$ and set $a = Y_j - Y'$.

We will use two particular choices of $S'$. The first is $S' = S_i \cap S_j$; note that this choice yields the correct Count-Sketch hashing. The second choice is a set chosen independent of $S_i$ (so that $y_{u,j}$ is independent of $y_{u,i}$). We will show that, after conditioning against $2^{-\Omega(R)}$ probability events, we have
$$E[Y_i^2 (Y' + a)^2] - E[Y_i^2]\, E[Y'^2] \lesssim \mu^4 \sqrt{R/k}. \qquad (4)$$
For the two choices of $S'$, $Y' + a$ is either $Y_j$ or a version of $Y_j$ independent of $Y_i$; combining the two bounds for these choices gives the desired result.

We now prove (4). Condition on $Y_i^2, Y_j^2, Y'^2 \le c^2 \mu^2$ for some constant $c$, as happens with probability $1 - 2^{-\Omega(R)}$. Having done so, the lemma statement is immediate if $R \ge k$, so we may suppose $R \le k$ whenever convenient. We will partition our analysis depending on whether $S'$ contains any of the heavy hitters $[k]$, as happens with probability $1/(k-1)$. To handle the case when this does happen, just apply the trivial bound $E[Y_i^2 (Y' + a)^2] - E[Y_i^2]\, E[Y'^2] \le 2c^4 \mu^4$. This shows that the contribution from this part of the mass is $O(\mu^4/k)$.

Now suppose $S'$ does not contain any heavy hitters. Then, conditioned on the locations of the heavy hitters, we still have that $Y_i$ and $Y'$ are independent, so
$$E[Y_i^2 (Y' + a)^2] - E[Y_i^2]\, E[Y'^2] \le E[2 Y_i^2 Y' a + Y_i^2 a^2] \le 2c^3 \mu^3\, E[|a|] + c^2 \mu^2\, E[a^2].$$
It now suffices to show that $E[a^2] \le \mu^2 R/k$, because this implies $E[|a|] \le \mu \sqrt{R/k}$ and then, using $R/k \le 1$, both terms above are $O(\mu^4 \sqrt{R/k})$.

We have
$$a^2 = \left((\operatorname{median}_u y_{u,j}) - (\operatorname{median}_u y'_u)\right)^2 \le \max_u (y_{u,j} - y'_u)^2 = \|v\|_\infty^2 \le \|v\|_2^2.$$
But $E[\|v\|_2^2] = R\mu^2/(k-1)$, because each of the $R$ terms of $v$ is a sum, with random signs, of a sample of the non-heavy-hitter elements of $x$, each included with probability $1/(k(k-1))$. Finally, we would like to avoid conditioning on $Y'^2$. This is no trouble; if we drop that condition but keep the conditions on $Y_i^2, Y_j^2$, then the additional $2^{-\Omega(R)}$ probability mass can only change $E[Y_i^2 Y_j^2] - E[Y_i^2]\, E[Y_j^2]$ by $O(2^{-\Omega(R)} \mu^4) \ll \mu^4 \sqrt{R/k}$. This completes the proof.
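Theorem 5.1 can likewise be illustrated by simulation. The Python sketch below (assumed test vector and parameters, not from the paper) compares the average squared error over a fixed set $S$ of size $k$ against the $\mu^2/R$ scale from the theorem and the standard worst-case bound $\|x_{[k]}\|_2^2/k$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, R, C = 10_000, 50, 15, 500               # R rows, C = 10k columns
x = np.arange(1, n + 1, dtype=float) ** -0.7   # an arbitrary test vector
tail2 = np.sum(np.sort(x**2)[:-k])             # ||x_[k]||_2^2
mu2 = tail2 / C                                # mu^2 as in the proof of Theorem 5.1
S = rng.choice(n, size=k, replace=False)       # a fixed set, |S| = k, chosen in advance

set_errs = []
for _ in range(20):                            # independent sketches
    buckets = rng.integers(0, C, size=(R, n))
    signs = rng.choice([-1.0, 1.0], size=(R, n))
    table = np.zeros((R, C))
    for r in range(R):
        np.add.at(table[r], buckets[r], signs[r] * x)
    x_hat = np.median(signs * table[np.arange(R)[:, None], buckets], axis=0)
    set_errs.append(np.mean((x_hat[S] - x[S]) ** 2))

print("standard bound ||x_[k]||^2/k :", tail2 / k)
print("Theorem 5.1 scale mu^2/R     :", mu2 / R)
print("observed mean error over S   :", np.mean(set_errs))
```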

6  Concentration for Compressible Signals

As an application to $\ell_2$ reconstruction, we consider recovery of signals with suitable decay: that is, where $|x_k| - |x_{2k}| \gtrsim \|x_{[k]}\|_2/\sqrt{k}$. This condition is satisfied by, for example, any power-law distribution $x_i = i^{-\alpha}$ with $\alpha > 0.5$. The idea is that while Theorem 5.1 only applies to fixed sets of indices, on such distributions the largest $k$ coordinates of $\hat{x}$ will with high probability be among the largest $2k$ coordinates of $x$.

Theorem 6.1. Suppose $|x_k| - |x_{2k}| \gtrsim \|x_{[k]}\|_2/\sqrt{k}$. Let $\hat{x}$ be the result of Count-Sketch using $R = \Theta(\log n)$ rows and $\Theta(k)$ columns, with fully random hash functions. Let $S \subset [n]$ be the locations of the largest $k$ entries of $\hat{x}$. Then
$$\|(\hat{x} - x)_S\|_2^2 < \frac{1}{\log n} \|x_{[k]}\|_2^2$$
with probability $\max\left(3/4,\ 1 - O\left(\sqrt{\log^5 n / k}\right)\right)$.

Proof. Let the number of columns be $ck$ for some constant $c$. Let $S \subset [n]$ contain the largest $k$ coordinates of $\hat{x}$. By the standard Count-Sketch bound we have, with $1 - n^{-\Theta(1)}$ probability, that $\|\hat{x} - x\|_\infty^2 < \|x_{[ck]}\|_2^2/(ck)$. Then for sufficiently large $c$, $|\hat{x}_i| > |\hat{x}_j|$ for all $i \in [k]$ and $j \notin [2k]$, so $S \subseteq [2k]$. Thus $\|\hat{x}_S - x_S\|_2^2 \le \|\hat{x}_{[2k]} - x_{[2k]}\|_2^2$. But by Theorem 5.1,
$$\|\hat{x}_{[2k]} - x_{[2k]}\|_2^2 \le \frac{1}{\log n} \|x_{[2k]}\|_2^2$$
with probability at least $\max\left(1 - \alpha,\ 1 - O\left(\sqrt{\log^5 n / k}\right)\right)$ for any constant $\alpha$.
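The setting of Theorem 6.1 is easy to reproduce numerically; the following sketch (assumed parameters and power-law exponent, not from the paper) recovers the top-$k$ of a power-law vector from Count-Sketch and compares $\|(\hat{x} - x)_S\|_2^2$ to $\|x_{[k]}\|_2^2/\log n$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 20_000, 100
R, C = int(np.log2(n)), 8 * k                  # R = Theta(log n) rows, Theta(k) columns
x = np.arange(1, n + 1, dtype=float) ** -0.8   # power law with alpha > 0.5

buckets = rng.integers(0, C, size=(R, n))
signs = rng.choice([-1.0, 1.0], size=(R, n))
table = np.zeros((R, C))
for r in range(R):
    np.add.at(table[r], buckets[r], signs[r] * x)
x_hat = np.median(signs * table[np.arange(R)[:, None], buckets], axis=0)

S = np.argsort(-np.abs(x_hat))[:k]             # locations of the k largest estimates
tail2 = np.sum(np.sort(x**2)[:-k])             # ||x_[k]||_2^2
print("||(x_hat - x)_S||^2 :", np.sum((x_hat[S] - x[S]) ** 2))
print("tail bound / log n  :", tail2 / np.log(n))
```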

7  Lower Bound on Point Queries

The following is an easy application of the proof technique of [PW11], using Gaussian channel capacity to bound the number of measurements required for a given error tolerance.

Theorem 7.1. For any $1 \le t \le \log(n/k)$ and any distribution on $O(Rk)$ linear measurements of $x \in \mathbb{R}^n$, there must be some vector $x$ and index $i$ for which the estimate $\hat{x}$ of $x$ satisfies
$$\Pr\left[(\hat{x}_i - x_i)^2 > \epsilon\, \frac{t}{R} \cdot \frac{\|x_{[k]}\|_2^2}{k}\right] > e^{-\Omega(t)}$$
for some constant $\epsilon > 0$.

Proof. Suppose without loss of generality that $n = k 2^t$ (by ignoring indices outside $[k 2^t]$) and that $t$ is larger than some constant. Partition $[n]$ into $k$ blocks of size $2^t$. Set $x = z + w$, where $z \in \{0, 1, -1\}^n$ has a single random $\pm 1$ in each block (so it is $k$-sparse) and $w \sim N(0, \epsilon \frac{Rk}{nt} I_n)$ is i.i.d. Gaussian, for some constant $\epsilon$.

Suppose that, in expectation over $x$, $A \in \mathbb{R}^{m \times n}$ allows recovering $\hat{x}$ from $Ax$ with
$$(\hat{x}_i - x_i)^2 \le \epsilon\, \frac{t}{R} \cdot \frac{\|x_{[k]}\|_2^2}{k} \qquad (5)$$
for more than a $1 - 2^{-2t}$ fraction of the coordinates $i$. We will show that such an $A$ must have $m \gtrsim Rk$ rows. Yao's minimax principle then gives a lower bound for distributions on $A$. The inability to increase $t$ and $k$ while preserving the number of rows then gives the desired lower bound on the failure probability.

First, we show that $I(Ax; z) \gtrsim kt$. Let $E$ be the event that (5) holds for more than a $1 - 2^{-t-2}$ fraction of coordinates $i$ and that $\|w\|_2^2 < 2\, E[\|w\|_2^2] = 2\epsilon Rk/t$. $E$ holds with probability $1 - o(1) > 1/2$ over $x$. Conditioned on $E$, we have $(\hat{x}_i - x_i)^2 < 2\epsilon^2$ for a $1 - 2^{-t-2}$ fraction of the coordinates $i$. Thus, for $\epsilon = 1/8$, if we round $\hat{x}_i$ to the nearest integer we recover $x^*$ with $x^*_i = z_i$ in a $1 - 2^{-t-2}$ fraction of the coordinates; hence $x^*$ agrees with $z$ on at least $3/4$ of the blocks. We know that $z$ has $(t+1)$ bits of entropy in each block. This means, conditioned on $E$,
$$I(z; x^*) = H(z) - H(z \mid x^*) \ge k(t+1) - \log\left(\binom{k}{k/4} 2^{(t+1)k/4}\right) \ge k(t+1) - k(t+1)/4 - k \log(4e)/4 \gtrsim kt$$
and hence $I(Ax; z \mid E = 1) \gtrsim kt$ by the data processing inequality. But since $\Pr[E] \ge 1/2$,
$$I(Ax; z) \ge I(Ax; z \mid E) - H(E) \ge I(Ax; z \mid E = 1)\, \Pr[E] - 1 \gtrsim kt. \qquad (6)$$

Second, we show that $I(Ax; z) \lesssim mt/R$. For each row $A_j$, $A_j x = A_j z + A_j w = A_j z + w'$ for $w' \sim N(0, \|A_j\|_2^2\, \epsilon Rk/(nt))$. We also have $E_z[(A_j z)^2] = \|A_j\|_2^2\, k/n$. Hence $A_j x$ is an additive white Gaussian noise channel with signal-to-noise ratio
$$\frac{E[(A_j z)^2]}{E[w'^2]} = \frac{t}{\epsilon R}.$$
By the Shannon–Hartley theorem, this channel has capacity
$$I(A_j x; z) \le \frac{1}{2} \log\left(1 + \frac{t}{\epsilon R}\right) < \frac{t}{\epsilon R} \lesssim \frac{t}{R},$$
and thus, by linearity and independence of the $w'$ (as in [PW11]),
$$I(Ax; z) \lesssim mt/R. \qquad (7)$$
Combining (6) and (7) gives $m \gtrsim Rk$.
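The hard instance in this proof is straightforward to instantiate. The Python sketch below (assumed parameters; the noise variance $\epsilon Rk/(nt)$ follows the reconstruction above) draws $x = z + w$ and checks by Monte Carlo that the per-row signal-to-noise ratio matches the $t/(\epsilon R)$ used in the Shannon–Hartley step.

```python
import numpy as np

rng = np.random.default_rng(3)
k, t, R, eps = 64, 6, 10, 1 / 8
n = k * 2**t                                    # k blocks of size 2^t
z = np.zeros(n)
pos = np.arange(k) * 2**t + rng.integers(0, 2**t, size=k)   # one support index per block
z[pos] = rng.choice([-1.0, 1.0], size=k)
sigma = np.sqrt(eps * R * k / (n * t))          # noise level from the construction above
x = z + rng.normal(0.0, sigma, size=n)          # the hard instance x = z + w

# Monte Carlo check of the per-row SNR used in the Shannon-Hartley step.
A_j = rng.normal(size=n)                        # one measurement row
Az, Aw = np.empty(2000), np.empty(2000)
for s in range(2000):
    zz = np.zeros(n)
    p = np.arange(k) * 2**t + rng.integers(0, 2**t, size=k)
    zz[p] = rng.choice([-1.0, 1.0], size=k)
    Az[s] = A_j @ zz
    Aw[s] = A_j @ rng.normal(0.0, sigma, size=n)

print("empirical SNR:", Az.var() / Aw.var(), " predicted t/(eps R):", t / (eps * R))
print("Shannon-Hartley capacity per row (nats):", 0.5 * np.log(1 + t / (eps * R)))
```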

References

[BKM+00] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Comput. Netw., 33(1-6):309–320, 2000.

[CCF02] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. ICALP, 2002.

[CH10] G. Cormode and M. Hadjieleftheriou. Methods for finding frequent items in data streams. The VLDB Journal, 19(1):3–20, 2010.

[CM04] G. Cormode and S. Muthukrishnan. Improved data stream summaries: The count-min sketch and its applications. LATIN, 2004.

[CM05] G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SDM, 2005.

[CM06] G. Cormode and S. Muthukrishnan. Combinatorial algorithms for compressed sensing. Sirocco, 2006.

[CRT06] E. J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1208–1223, 2006.

[GI10] A. Gilbert and P. Indyk. Sparse recovery using sparse matrices. Proceedings of the IEEE, 2010.

[Mit04] M. Mitzenmacher. A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1:226–251, 2004.

[Mut05] S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.

[MV08] M. Mitzenmacher and S. Vadhan. Why simple hash functions work: exploiting the entropy in a data stream. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 746–755. SIAM, 2008.

[Nis92] N. Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449–461, 1992.

[PDGQ05] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming, 13(4):277, 2005.

[Pri11] E. Price. Efficient sketches for the set query problem. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 41–56. SIAM, 2011.

[PW11] E. Price and D. P. Woodruff. (1 + ε)-approximate sparse recovery. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pages 295–304. IEEE, 2011.