On Deterministic Sketching and Streaming for Sparse Recovery and Norm Estimation

Jelani Nelson (Princeton)

joint work with Huy Nguyễn (Princeton) and David Woodruff (IBM Almaden)

The turnstile model of streaming

▶ Vector $x \in \mathbb{R}^n$ starts as $\vec{0}$
▶ Receive a sequence of updates $(i_1, v_1), (i_2, v_2), \ldots$ with $(i, v) \in [n] \times \mathbb{R}$; update $(i, v)$ causes the change $x_i \leftarrow x_i + v$
▶ At the end of the stream, output some function $f(x)$
▶ Goal: use very little memory, e.g. $\log^{O(1)} n$ words
▶ This talk: we focus on linear sketches, i.e. algorithms that maintain $Ax$ for some $A$ with $m \ll n$ rows (a natural model in some of the compressed sensing applications we will consider; also, linear sketches can easily be combined across datasets to get aggregate statistics or to compute distance measures between datasets). A minimal sketch of this model appears below.
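
A minimal Python/NumPy sketch of the turnstile/linear-sketch model, assuming a hypothetical dense Gaussian sketching matrix $A$ rather than one of the deterministic constructions discussed later; it only illustrates that each update touches a single column of $A$.

```python
# Maintaining a linear sketch Ax under turnstile updates (illustrative only).
import numpy as np

n, m = 1000, 50
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n)) / np.sqrt(m)    # any fixed m x n sketching matrix

sketch = np.zeros(m)                            # we store only Ax, never x itself

def update(i, v):
    """Process a turnstile update (i, v), i.e. x_i <- x_i + v."""
    sketch[:] += v * A[:, i]                    # A(x + v*e_i) = Ax + v*A_i

for (i, v) in [(3, 1.0), (7, -2.5), (3, 0.5)]:  # a short stream of updates
    update(i, v)
# At the end of the stream, some output procedure is applied to the sketch.
```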

Goal of this talk

Revisit classic sketching/streaming problems with the goal of understanding their deterministic complexities:

▶ Point query: given $i$, output $x_i \pm \varepsilon\|x\|_1$
▶ Inner product: given $Ax, Ay$, output $\langle x, y \rangle \pm \varepsilon\|x\|_2\|y\|_2$
▶ $\ell_1/\ell_1$ sparse recovery: output $\tilde{x}$ with $\|x - \tilde{x}\|_1 \le (1+\varepsilon)\|x_{\mathrm{tail}(k)}\|_1$
▶ Norm estimation: output $(1 \pm \varepsilon)\|x\|_p$

Previous work

▶ Point query: given $i$, output $x_i \pm \varepsilon\|x\|_1$
  Randomized: CountMin [CM05], $m = O(\varepsilon^{-1}\log(1/\delta))$; $m = \Omega(\varepsilon^{-1}\log n)$ for $\delta < 1/n$ [JST11]
  Deterministic: CR-PRECIS [GM07], $m = O(\varepsilon^{-2}\log^2 n / \log(\varepsilon^{-1}\log n))$; $m = \Omega(\varepsilon^{-2} + \varepsilon^{-1}\log(\varepsilon n))$ [FPRU10], [Ganguly08], [Gluskin82]

Previous work

▶ Inner product: given $Ax, Ay$, output $\langle x, y \rangle \pm \varepsilon\|x\|_2\|y\|_2$
  Randomized: AMS sketch [AMS96], $m = O(1/\varepsilon^2)$; $m = \Omega(1/\varepsilon^2)$ [KNW10]
  Deterministic: impossible [AMS96]

Previous work

▶ $\ell_1/\ell_1$ sparse recovery: output $\tilde{x}$ with $\|x - \tilde{x}\|_1 \le (1+\varepsilon)\|x_{\mathrm{tail}(k)}\|_1$
  Randomized: $m = O(k\log n \log^3(1/\varepsilon)/\sqrt{\varepsilon})$ [PW11]
  Deterministic: $m = O(k\log(n/k)/\varepsilon^2)$ [IR08]

Previous work

▶ Norm estimation: output $(1 \pm \varepsilon)\|x\|_p$
  Randomized: for $0 < p \le 2$, $m = O(1/\varepsilon^2)$ [Indyk00] and $m = \Omega(1/\varepsilon^2)$ [KNW10]; for $p > 2$, $m = \tilde{O}(\varepsilon^{-2} n^{1-2/p})$ [IW05], [BGKS06], [AKO11] and $m = \tilde{\Omega}(\varepsilon^{-2} n^{1-2/p})$ [BJKS02], [CKS03], [Ganguly11], [Gronemeier09], [Jayram09], [WZ12]
  Deterministic: impossible for all $p$ [AMS96]

Our contributions

We improve some upper and lower bounds in the deterministic case:

▶ Point query: $m = O(\varepsilon^{-2} \cdot \min\{\log n,\ (\log n/\log(1/\varepsilon))^2\})$. Can even get $x_i \pm \varepsilon\|x_{\mathrm{tail}(1/\varepsilon^2)}\|_1$ with almost the same $m$.
▶ Inner product: equivalent to point query if we allow error $\langle x, y \rangle \pm \varepsilon\|x\|_1\|y\|_1$.
▶ $\ell_1/\ell_1$ sparse recovery: we prove $m = \Omega(k/\varepsilon^2 + k\log(n/k)/\varepsilon)$.
▶ Norm estimation: settle for $\|x\|_p \pm \varepsilon\|x\|_q$ ($q < p$). We show the complexity is then characterized by Gelfand widths from convex geometry, with the tight bound $m = \Theta(\varepsilon^{-2}\log(\varepsilon^2 n))$ for $p = 2$, $q = 1$.

Point query/inner product

Reductions between the problems

▶ Inner product ⇒ point query: output the estimated inner product of $x$ and $e_i$. The error is $\varepsilon\|x\|_1\|e_i\|_1 = \varepsilon\|x\|_1$.
▶ Point query ⇒ inner product: let $\tilde{x}, \tilde{y}$ be such that $\|x - \tilde{x}\|_\infty \le \varepsilon\|x\|_1$, and similarly for $y$. Output $\langle \tilde{x}_{\mathrm{head}(1/\varepsilon)}, \tilde{y}_{\mathrm{head}(1/\varepsilon)} \rangle$ (a sketch of this direction appears below).
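
A short sketch of the point-query ⇒ inner-product direction, assuming point-query estimates `x_est`, `y_est` with $\|x - \tilde{x}\|_\infty \le \varepsilon\|x\|_1$ are already available (e.g. from an incoherent-matrix recovery as on the following slides); the helper names here are hypothetical.

```python
# Point-query => inner-product reduction (illustrative).
import numpy as np

def head(v, k):
    """Zero out all but the k largest-magnitude coordinates of v."""
    out = np.zeros_like(v)
    idx = np.argsort(-np.abs(v))[:k]
    out[idx] = v[idx]
    return out

def inner_product_estimate(x_est, y_est, eps):
    """Estimate <x, y> by keeping only the top 1/eps entries of each estimate."""
    k = int(np.ceil(1.0 / eps))
    return float(np.dot(head(x_est, k), head(y_est, k)))
```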

Point query and codes [GM07]

Let $\mathcal{C} = \{C_1, \ldots, C_n\}$ be a code with block length $t$, alphabet $[q]$, and relative distance $1 - \varepsilon$. Define hash functions $h_1(i) = (C_i)_1$, $h_2(i) = (C_i)_2, \ldots, h_t(i) = (C_i)_t$, and maintain a $t \times q$ array of counters $Y$, where $Y_{j,s}$ sums $x_i$ over all $i$ with $h_j(i) = s$.

The estimate $\tilde{x}_i$ is the average of the $Y_{j, h_j(i)}$. It equals $x_i + \sum_{j \ne i} (\#\text{collisions with } i) \cdot x_j / t = x_i \pm \varepsilon\|x_{-i}\|_1$. [GM07] uses Chinese remaindering codes. (A toy implementation of this estimator follows.)
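
A toy implementation of this estimator, assuming a hypothetical random code in place of the Chinese remaindering or Reed-Solomon codes (a random code only heuristically has the required relative distance):

```python
# Code-based point-query sketch (illustrative; random stand-in code).
import numpy as np

n, q, t = 1000, 40, 10
rng = np.random.default_rng(1)
code = rng.integers(0, q, size=(n, t))     # code[i, j] = (C_i)_j = h_j(i)

Y = np.zeros((t, q))                       # the t x q array of counters

def update(i, v):
    """Turnstile update x_i <- x_i + v."""
    for j in range(t):
        Y[j, code[i, j]] += v

def point_query(i):
    """Return x_i plus collision error, at most eps*||x_{-i}||_1 for a good code."""
    return float(np.mean([Y[j, code[i, j]] for j in range(t)]))
```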

Point query and incoherent matrices

$A$ is incoherent if each column has unit $\ell_2$ norm and $|\langle A_i, A_j \rangle| \le \varepsilon$ for all $i \ne j$.

▶ Measurement: $Ax$
▶ Recovery: $\tilde{x} = A^T A x$

Proof: $(A^T A x)_i = A_i^T A x = \sum_{j=1}^n \langle A_i, A_j \rangle x_j = x_i \pm \varepsilon\|x_{-i}\|_1$. (A numerical sketch of this recovery is given below.)
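
A minimal numerical sketch of measurement and recovery with an incoherent matrix; here $A$ is a hypothetical random-sign matrix with unit-norm columns, which is incoherent only with high probability for suitable $m$.

```python
# Point query from an incoherent matrix: x_est = A^T A x (illustrative).
import numpy as np

n, m = 500, 300
rng = np.random.default_rng(2)
A = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)   # unit l2-norm columns

x = np.zeros(n); x[4] = 3.0; x[17] = -1.0                # example signal
sketch = A @ x                                           # measurement Ax
x_est = A.T @ sketch                                     # recovery A^T A x

# x_est[i] = x_i +- eps*||x_{-i}||_1 whenever |<A_i, A_j>| <= eps for i != j
print(x_est[4], x_est[17])
```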

How to get an incoherent matrix

▶ Johnson-Lindenstrauss: given any $v_1, \ldots, v_N$, one can find $A$ with $O(\varepsilon^{-2}\log N)$ rows so that $\|Av_i - Av_j\|_2 = (1 \pm \varepsilon)\|v_i - v_j\|_2$ for all $i, j$. If the vectors are $0, e_1, \ldots, e_n$, this implies $A$ is incoherent. Using Fast JL [Ailon-Chazelle'06] and related constructions, $A$ and $A^T$ support fast multiplication.
▶ Codes: given a $q$-ary code $C_1, \ldots, C_n$ with block length $t$ and relative distance $1 - \varepsilon$, form a matrix with $n$ columns and $m = qt$ rows. Column $A_i$ is broken into $t$ blocks of size $q$, with 1s indicating which symbol $C_i$ has in each position and 0s elsewhere. [GM07] used Chinese remainder codes, but both Reed-Solomon codes and random codes do better. (A small sketch of this construction appears below.)
▶ Almost pairwise independence: a set $S \subseteq \{-1,1\}^n$ such that $|\mathbb{E}_{x \in S} \prod_{i \in T} x_i| \le \varepsilon$ for all $|T| \in \{1, 2\}$. Form an $|S| \times n$ matrix whose rows are the elements of $S$ scaled down by $\sqrt{|S|}$.
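
A sketch of the codes-based construction in the second bullet; for simplicity a random code stands in (its relative distance, and hence the coherence bound, holds only with high probability, whereas Reed-Solomon would be the explicit choice), and the nonzero entries are scaled by $1/\sqrt{t}$ so that each column has unit $\ell_2$ norm.

```python
# Incoherent matrix from a q-ary code (illustrative; random stand-in code).
import numpy as np

n, q, t = 200, 64, 16
rng = np.random.default_rng(3)
code = rng.integers(0, q, size=(n, t))               # code[i, j] = j-th symbol of C_i

A = np.zeros((q * t, n))
for i in range(n):
    for j in range(t):
        A[j * q + code[i, j], i] = 1.0 / np.sqrt(t)  # unit l2-norm columns

# coherence = max over i != j of |<A_i, A_j>| = max fraction of agreeing positions
G = A.T @ A
np.fill_diagonal(G, 0.0)
print("coherence:", np.abs(G).max())
```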

$\ell_1/\ell_1$ sparse recovery

$\ell_1/\ell_1$ sparse recovery lower bound

▶ Show $m = \Omega(k/\varepsilon^2)$ via the probabilistic method (based on an argument of Gluskin, rediscovered by Ganguly)
▶ Show $m = \Omega(k\log(n/k)/\varepsilon)$ via communication complexity

$m = \Omega(k/\varepsilon^2)$

▶ Recall we must produce $\tilde{x}$ with $\|x - \tilde{x}\|_1 \le (1+\varepsilon)\|x_{\mathrm{tail}(k)}\|_1$
▶ Thus if $x = 0$, we must output $\tilde{x} = 0$. The plan: show that there exists $x \in \ker(A)$ such that $\|x_{\mathrm{head}(k)}\|_1 > \varepsilon\|x_{\mathrm{tail}(k)}\|_1$, i.e. $0$ is not an acceptable output.
▶ Without loss of generality $A$ has orthonormal rows.
▶ The bad $x$: note $x = (I - A^T A)y$ is in $\ker(A)$ for any $y$. Choose $y$ randomly as $\sum_{i=1}^k \sigma_i e_{\pi(i)}$, where $\pi$ is a random permutation and the $\sigma_i$ are independent random signs. We show $\|x_{\mathrm{head}(k)}\|_1 > \varepsilon\|x_{\mathrm{tail}(k)}\|_1$ with positive probability.

The tail: $x = (I - A^T A)y$, $y = \sum_{i=1}^k \sigma_i e_{\pi(i)}$

$$
\begin{aligned}
\mathbb{E}\,\|x_{\mathrm{tail}(k)}\|_1 &\le \mathbb{E}\,\|x\|_1 \\
&\le \mathbb{E}\,\|y\|_1 + \mathbb{E}\,\|A^T A y\|_1 \\
&\le k + \sqrt{n}\,\bigl(\mathbb{E}\,\|A^T A y\|_2^2\bigr)^{1/2} \\
&= k + \sqrt{n}\,\bigl(\mathbb{E}\, y^T A^T A A^T A y\bigr)^{1/2} \\
&= k + \sqrt{n}\,\bigl(\mathbb{E}\, y^T A^T A y\bigr)^{1/2} \\
&= k + \sqrt{n}\,\Bigl(\mathbb{E}\,\Bigl\langle \sum_{j=1}^k \sigma_j A_{\pi(j)},\ \sum_{j=1}^k \sigma_j A_{\pi(j)} \Bigr\rangle\Bigr)^{1/2} \\
&= k + \sqrt{n}\,\Bigl(\sum_{j=1}^k \mathbb{E}\,\|A_{\pi(j)}\|_2^2\Bigr)^{1/2} \\
&= k + \sqrt{km}.
\end{aligned}
$$

(Here the third step uses Cauchy-Schwarz and Jensen, the fifth uses $AA^T = I$ since $A$ has orthonormal rows, and the last uses $\sum_{j=1}^k \mathbb{E}\,\|A_{\pi(j)}\|_2^2 = km/n$.)

Wrapping up

▶ Calculating expectations and using Markov also shows that with constant probability an $\Omega(1)$ fraction of the coordinates $\pi([k])$ are at least a constant in $x$, i.e. $\|x_{\mathrm{head}(k)}\|_1 = \Omega(k)$ with probability $2/3$.
▶ Thus with positive probability we simultaneously have both $\|x_{\mathrm{head}(k)}\|_1 = \Omega(k)$ and $\|x_{\mathrm{tail}(k)}\|_1 \le 3(k + \sqrt{km})$. This implies the existence of a bad vector if $m < ck/\varepsilon^2$ for some constant $c > 0$. (A small numerical illustration of this construction follows.)
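
An optional numerical illustration of the construction (a sanity check under the stated assumptions, not part of the proof): draw a random $A$ with orthonormal rows and a random $k$-sparse sign vector $y$, and compare the $\ell_1$ mass of $x = (I - A^T A)y$ on the support of $y$ with $k + \sqrt{km}$.

```python
# Numerical illustration of the Omega(k/eps^2) lower-bound construction.
import numpy as np

n, m, k = 2000, 50, 20
rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((n, m)))   # n x m, orthonormal columns
A = Q.T                                            # m x n with orthonormal rows

support = rng.choice(n, size=k, replace=False)
y = np.zeros(n); y[support] = rng.choice([-1.0, 1.0], size=k)
x = y - A.T @ (A @ y)                              # x = (I - A^T A)y lies in ker(A)

head_mass = np.sum(np.abs(x[support]))             # lower-bounds ||x_head(k)||_1; ~ Omega(k)
tail_mass = np.sum(np.abs(x)) - head_mass          # roughly at most k + sqrt(k*m) on average
print(head_mass, tail_mass, k + np.sqrt(k * m))
```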

$m = \Omega(k\log(n/k)/\varepsilon)$

▶ Communication complexity: reduction from the Equality problem on strings of length $r = \Theta((k/\varepsilon)\log n \log(n/k))$.
▶ In Equality, Alice and Bob receive strings $x, y \in \{0,1\}^r$, respectively, and must decide whether $x = y$. It is known that $\Omega(r)$ communication is required deterministically.

Reduction from Equality details

▶ Inspired by an approach of [DIPW10].
▶ Let $S$ be the set of all strings in $\{0, c\varepsilon/k\}^n$ with $\ell_1$ norm 1. Note $\log|S| = \Theta((k/\varepsilon)\log(\varepsilon n/k))$.
▶ Alice and Bob each get strings in $\{0,1\}^r$ with $r = \log n \cdot \log|S|$.
▶ Alice treats her input as $\log n$ indices into elements of $S$: $x^1, \ldots, x^{\log n}$. Bob does the same with his input to get $y^1, \ldots, y^{\log n}$.
▶ Alice computes $u = \sum_{i=1}^{\log n} 2^i x^i$ and Bob computes $v = \sum_{i=1}^{\log n} 2^i y^i$.
▶ Alice sends $A'u$, where $A'$ is $A$ rounded to $O(\log n)$ bits of precision.
▶ Bob computes $A'(u - v)$ and says "Equal" if the result is $0$, else says "Not equal".
▶ Communication is $\#\mathrm{rows}(A) \cdot O(\log n)$, so the lower bound on the number of rows is $\Omega(r/\log n)$.

(A toy simulation of this protocol appears below.)
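A heavily simplified toy simulation of the protocol's structure: `index_to_S` is a hypothetical stand-in injection into $S$ built from a seeded pseudorandom support choice (the real reduction indexes $S$ exactly), and `A_prime` is merely a rounded random matrix rather than the rounded sparse-recovery matrix $A$ on which the lower bound actually relies.

```python
# Toy simulation of the Equality reduction (illustrative only).
import numpy as np

def index_to_S(idx, n, k, eps):
    """Hypothetical stand-in for indexing into S: a {0, eps/k}^n string of l1 norm 1."""
    rng = np.random.default_rng(idx)
    s = np.zeros(n)
    s[rng.choice(n, size=int(k / eps), replace=False)] = eps / k   # ||s||_1 = 1
    return s

def encode(indices, n, k, eps):
    """Form u = sum_i 2^i x^i from the log n indices an input string encodes."""
    return sum((2.0 ** i) * index_to_S(idx, n, k, eps)
               for i, idx in enumerate(indices, start=1))

n, k, eps = 256, 4, 0.5
A_prime = np.round(np.random.default_rng(5).standard_normal((32, n)), 6)

u = encode([10, 20, 30], n, k, eps)   # Alice's vector; she sends A_prime @ u
v = encode([10, 20, 31], n, k, eps)   # Bob's vector
print("Equal" if np.allclose(A_prime @ u, A_prime @ v) else "Not equal")
```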

Norm estimation

Connection to Gelfand widths

We show:

Theorem: For $1 \le q < p \le \infty$, let $m$ be the minimum number such that there is an $(n-m)$-dimensional subspace $S$ of $\mathbb{R}^n$ satisfying $\sup_{v \in S} \|v\|_p/\|v\|_q \le \varepsilon$. Then there is an $m \times n$ matrix $A$ and an associated output procedure Out which, for any $x \in \mathbb{R}^n$, given $Ax$, outputs an estimate of $\|x\|_p$ with additive error at most $\varepsilon\|x\|_q$. Moreover, any matrix $A$ with fewer rows will fail to perform the same task.

Definition: The Gelfand width of order $m$ is the infimum over all $(n-m)$-dimensional subspaces $S$ of $\mathbb{R}^n$ of $\sup_{v \in S} \|v\|_p/\|v\|_q$.

Thus, the theorem says that the optimal number of rows is exactly the smallest $m$ such that the Gelfand width of order $m$ is at most $\varepsilon$ (the subspace is then just the kernel of $A$). For example, [Foucart et al., 2010] and [Garnaev-Gluskin, 1984] give, for $p = 2$, $q = 1$, that the width of order $m$ is $\Theta(\sqrt{(1 + \log(n/m))/m})$, so $m = \Theta(\varepsilon^{-2}\log(\varepsilon^2 n))$ is optimal for these norms.

Proof of theorem

Proof (fewer rows fail): Suppose $A$ has fewer than $m$ rows. Then some $v \in \ker(A)$ has $\|v\|_p > \varepsilon\|v\|_q$, so $0$ is not a valid approximation of $\|v\|_p$, but we must output $0$ whenever $Av = 0$ in order to be correct on the zero vector.

Proof ($m$ rows suffice): For the other direction, let $A$ be such that the $(n-m)$-dimensional subspace above is its kernel. For any sketch $z$, we must output a number in the range $[\|x\|_p - \varepsilon\|x\|_q,\ \|x\|_p + \varepsilon\|x\|_q]$ for every $x$ with $Ax = z$. Suppose this were not possible, so there exist $x, y$ with $Ax = Ay$ but $\|x\|_p - \varepsilon\|x\|_q > \|y\|_p + \varepsilon\|y\|_q$. But $x - y \in \ker(A)$, so
$$\|x\|_p - \|y\|_p \le \|x - y\|_p \le \varepsilon\|x - y\|_q \le \varepsilon(\|x\|_q + \|y\|_q),$$
a contradiction. Thus the Out procedure can simply output $\min_{x' : Ax' = z} \|x'\|_p + \varepsilon\|x'\|_q$. This can be computed in polynomial time using the ellipsoid method; details are in the paper. (A convex-programming sketch of this Out procedure appears below.)
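
A sketch of this Out procedure for $p = 2$, $q = 1$, with a generic convex solver (cvxpy) standing in for the ellipsoid method used in the paper:

```python
# Out procedure for p = 2, q = 1: given z = Ax, return min_{x': Ax' = z} ||x'||_2 + eps*||x'||_1.
import cvxpy as cp
import numpy as np

def out_estimate(A, z, eps):
    """Estimate ||x||_2 up to additive eps*||x||_1 from the sketch z = Ax."""
    n = A.shape[1]
    xp = cp.Variable(n)
    prob = cp.Problem(cp.Minimize(cp.norm(xp, 2) + eps * cp.norm(xp, 1)),
                      [A @ xp == z])
    prob.solve()
    return prob.value
```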

Future Directions (open)

▶ Find the correct complexity for point query / inner product / heavy hitters.
▶ Obtain better time complexity for deterministic point query.
▶ Nail down the exact complexity of $\ell_1/\ell_1$ sparse recovery.
▶ Norm estimation with a faster Out procedure (avoid the ellipsoid method!).
▶ Understand deterministic complexities for other streaming problems.
▶ (Known?) What is a simple, "very explicit" construction of an incoherent matrix with $m = O(\varepsilon^{-2}\log n)$ rows?