Understanding the Sparse Vector Technique for Differential Privacy

Min Lyu (#), Dong Su (⋆), Ninghui Li (⋆)

(#) University of Science and Technology of China, [email protected]
(⋆) Purdue University, {su17, ninghui}@cs.purdue.edu

arXiv:1603.01699v1 [cs.CR] 5 Mar 2016

ABSTRACT

The Sparse Vector Technique (SVT) is a fundamental technique for satisfying differential privacy and has the unique quality that one can output some query answers without apparently paying any privacy cost. SVT has been used in both the interactive setting, where one tries to answer a sequence of queries that are not known ahead of time, and in the non-interactive setting, where all queries are known. Because of the potential savings on privacy budget, many variants of SVT have been proposed and employed in privacy-preserving data mining and publishing. However, most variants of SVT are actually not private. In this paper, we analyze these errors and identify the misunderstandings that likely contribute to them. We also propose a new version of SVT that provides better utility, and introduce an effective technique to improve the performance of SVT. These enhancements can be applied to improve utility in the interactive setting. In the non-interactive setting (but not the interactive setting), usage of SVT can be replaced by the Exponential Mechanism (EM); we have conducted analytical and experimental comparisons to demonstrate that EM outperforms SVT.

1. INTRODUCTION

Differential privacy (DP) is increasingly being considered the privacy notion of choice for privacy-preserving data analysis and publishing in the research literature. In this paper we study the Sparse Vector Technique (SVT), a basic technique for satisfying DP, which was first proposed by Dwork et al. [7] and later refined in [16] and [12], and used in [11, 13, 1, 18, 17]. Compared with other techniques for satisfying DP, SVT has the unique quality that one can output some query answers without apparently paying any privacy cost. More specifically, in SVT one is given a sequence of queries and a certain threshold T, and outputs a vector indicating whether each query answer is above or below T; that is, the output is a vector in $\{\bot, \top\}^{\ell}$, where ℓ is the number of queries answered, ⊤ indicates that the corresponding query is above the threshold and ⊥ indicates below. SVT works by perturbing both the threshold T and each individual query answer. When we expect that the majority of queries are on one side, e.g., below the threshold, we can use SVT so that while each output of ⊤ consumes some privacy budget, each output of ⊥ consumes none. That is, with a fixed privacy budget and a given level of noise added to each query answer, one can keep answering queries as long as the number of ⊤'s is below c, a pre-defined cutoff point.

This ability to avoid using any privacy budget for some queries is very powerful for the interactive setting, where one answers a sequence of queries without knowing ahead of time what these queries are. Some well-known lower-bound results [3, 5, 6, 10] suggest that “one cannot answer a linear, in the database size, number of queries with small noise while preserving privacy” [7]. This limitation can be bypassed by using SVT, as in the iterative construction approach in [11, 12, 16]. In this approach, one maintains a history of past queries and answers. For each new query, one first uses this history to derive an answer for the query, and then uses SVT to check whether the error of this derived answer is below a threshold. If it is, then one can use this derived answer for this new query without consuming any privacy budget. Only when the error of this derived answer is above the threshold does one need to spend privacy budget to access the database to answer the query.

With the power of SVT comes the subtlety of why it is private and the difficulty of applying it correctly. The version of SVT used in [11, 12], which was abstracted into a generic technique and described in Roth's 2011 lecture notes [15], turned out not to satisfy DP as claimed. This error in [11, 12] is arguably not critical because it is possible to use a fixed version of SVT without affecting the main asymptotic results in [11, 12]. Since 2014, several variants of SVT have been developed; they were used for frequent itemset mining [13], for feature selection in private classification [18], and for publishing high-dimensional data [1].
These usages are in the non-interactive setting, where all the queries are known ahead of time, and the goal is to find c queries that have larger values, e.g., finding the c most frequent itemsets. Unfortunately, these variants do not satisfy DP, as pointed out in [2]. When using a correct version of SVT in these papers, one would get significantly worse utility. Since these papers seek to improve the tradeoff between privacy and utility, the results in them are thus invalid.

Most of these errors were already known [2]. The SVT variants in [13, 18, 1] were modeled as a generalized private threshold testing algorithm (GPTT) in [2], and a proof showing that GPTT does not satisfy ε-DP for any finite ε (which we denote by ∞-DP in this paper) was given. However, as we show in this paper, the proof in [2] that GPTT is ∞-DP is itself incorrect. A version of SVT with a correct privacy proof appeared in Dwork and Roth's 2014 book [8], and was used in some recent work, e.g., [17].

In this paper, we present a version of SVT that adds less noise for the same level of privacy. In addition, we develop a technique that optimizes the privacy budget allocation between that for perturbing the threshold and that for perturbing the query answers, and experimentally demonstrate its effectiveness.

We also observe that most recent usages of SVT in [1, 13, 17, 18] are in the non-interactive setting with the goal to select up to c queries with higher answers. In this setting, one could also use the Exponential Mechanism (EM) [14] c times to achieve the same objective, each time selecting the query with the highest answer. In our experiments, we found that EM almost always outperforms SVT; this suggests that one should use EM instead of SVT in non-interactive settings to obtain better utility, even though SVT cannot be replaced by EM in the interactive setting, because not all queries are known ahead of time.

In summary, this paper has the following novel contributions.

1. We propose a new version of SVT that provides better utility. We also introduce an effective technique to improve the performance of SVT. These enhancements achieve better utility than the state-of-the-art SVT and can be applied to improve utility in the interactive setting.
2. While previous papers have pointed out most of the errors in usages of SVT, we use a detailed privacy proof of SVT to identify the misunderstandings that likely caused the different non-private versions.
3. Through experiments on real datasets, we have evaluated the effects of various SVT optimizations and compared them to EM. Our results show that for non-interactive settings, one should use EM instead of SVT.

The rest of the paper is organized as follows. Section 2 gives background information on DP. We analyze six variants of SVT in Section 3. In Section 4, we present our optimizations of SVT. The experimental results are shown in Section 5. Related works are summarized in Section 6. Section 7 concludes our work.

2. BACKGROUND

DEFINITION 1 (ε-DP [4, 5]). A randomized mechanism A satisfies ε-differential privacy (ε-DP) if for any pair of neighboring datasets D and D′, and any S ∈ Range(A),

$$\Pr[A(D) = S] \le e^{\epsilon} \cdot \Pr[A(D') = S].$$

Typically, two datasets D and D′ are considered to be neighbors when they differ by only one tuple. We use D ≃ D′ to denote this.

There are several primitives for satisfying ε-DP. The Laplacian mechanism [5] adds random noise sampled from the Laplace distribution with the scale parameter proportional to $GS_f$, the global sensitivity of the function f. That is, to compute f on a dataset D, one outputs

$$A_f(D) = f(D) + \mathrm{Lap}\left(\frac{GS_f}{\epsilon}\right), \quad \text{where} \quad GS_f = \max_{D \simeq D'} |f(D) - f(D')|, \quad \text{and} \quad \Pr[\mathrm{Lap}(\beta) = x] = \frac{1}{2\beta} e^{-|x|/\beta}.$$

In the above, Lap(β) denotes a random variable sampled from the Laplace distribution with scale parameter β.

The exponential mechanism [14] samples the output of the data analysis mechanism according to an exponential distribution. The mechanism relies on a quality function $q : \mathcal{D} \times \mathcal{R} \to \mathbb{R}$ that assigns a real valued score to one output $r \in \mathcal{R}$ when the input dataset is D, where higher scores indicate more desirable outputs. Given the quality function q, its global sensitivity $GS_q$ is defined as:

$$GS_q = \max_{r} \max_{D \simeq D'} |q(D, r) - q(D', r)|.$$

Outputting r using the following distribution satisfies ε-DP:

$$\Pr[r \text{ is selected}] \propto \exp\left(\frac{\epsilon}{2\, GS_q}\, q(D, r)\right).$$

If the changes of all quality values are one-directional, then one can make a more accurate selection by choosing each possible output with probability proportional to $\exp\left(\frac{\epsilon}{GS_q}\, q(D, r)\right)$, instead of $\exp\left(\frac{\epsilon}{2\, GS_q}\, q(D, r)\right)$.

DP is sequentially composable in the sense that combining multiple mechanisms $A_1, \cdots, A_m$ that satisfy DP for $\epsilon_1, \cdots, \epsilon_m$ results in a mechanism that satisfies ε-DP for $\epsilon = \sum_i \epsilon_i$. Because of this, we refer to ε as the privacy budget of a privacy-preserving data analysis task. When a task involves multiple steps, each step uses a portion of ε so that the sum of these portions is no more than ε.
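As an illustration of the two primitives just described, the following is a minimal Python sketch, not an implementation from this paper. The function names (`laplace_noise`, `laplace_mechanism`, `exponential_mechanism_probs`, `exponential_mechanism`) and the `monotonic` flag are hypothetical, and Laplace noise is sampled as the difference of two exponential variables, which is one standard way to obtain Lap(β) from the standard library.

```python
import math
import random


def laplace_noise(scale, rng=random):
    """Sample Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random):
    """Return f(D) + Lap(GS_f / epsilon)."""
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


def exponential_mechanism_probs(scores, sensitivity, epsilon, monotonic=False):
    """Selection probabilities proportional to exp(eps * q / (2 * GS_q)),
    or to exp(eps * q / GS_q) when all quality changes are one-directional."""
    factor = epsilon / sensitivity if monotonic else epsilon / (2.0 * sensitivity)
    top = max(scores)  # subtracting the max only rescales the weights
    weights = [math.exp(factor * (s - top)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(scores, sensitivity, epsilon, monotonic=False, rng=random):
    """Return the index of one output sampled from the distribution above."""
    probs = exponential_mechanism_probs(scores, sensitivity, epsilon, monotonic)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]
```

Here `scores[r]` plays the role of q(D, r). Sequential composition then corresponds to calling these functions several times with budgets that sum to at most ε.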

3. VARIANTS OF SVT

In this section, we analyze variants of SVT; six of them are listed in Figure 1. Among them, Algorithms 1 and 2 satisfy ε-DP, and the rest of them do not. Algorithm 1 is an instantiation of our proposed SVT. Algorithm 2 is the version taken from [8]. Algorithms 3, 4, 5, and 6 are taken from [15, 13, 18, 1] respectively. The table in Figure 2 summarizes the differences among these algorithms. Their privacy properties are given in the last row of the table. To understand the differences between these variants, one can view SVT as having the following three elements.

1. Generate the threshold noise ρ (Line 1 in each algorithm), which will be added to the threshold. In Algorithms 1, 3, 5 and 6, ρ follows the Laplace distribution $\mathrm{Lap}(2\Delta/\epsilon)$. In Algorithm 4, ρ follows $\mathrm{Lap}(4\Delta/\epsilon)$; however, this is due to the fact that it allocates 1/4 of the total privacy budget to this step, instead of 1/2. In Algorithm 2, ρ follows $\mathrm{Lap}(2c\Delta/\epsilon)$, where c is the maximum number of ⊤'s the algorithm outputs. This is much larger than in the other algorithms, and causes Algorithm 2 to have significantly worse performance than Algorithm 1, as we show in Section 5. Algorithm 2 also refreshes the noisy threshold T after each output of ⊤. We note that making the threshold noise scale with c is necessary for privacy only if one decides to refresh the threshold noise after each output of ⊤; however, such refreshing is unnecessary.

2. For each query $q_i$, perturb the answer by adding noise $\nu_i$, check the noisy answer against the noisy threshold (lines 3 and 4 in each algorithm), and output ⊥ or ⊤. In Algorithms 1 and 2, $\nu_i$ follows $\mathrm{Lap}(4c\Delta/\epsilon)$. In Algorithm 3, $\nu_i$ follows $\mathrm{Lap}(2c\Delta/\epsilon)$, which is not enough for ε-DP (even though it suffices for $\frac{3\epsilon}{2}$-DP). A further error in Algorithm 3 is that it actually outputs the noisy query answer instead of ⊤ for a query above the threshold. The latter error makes Algorithm 3 ∞-DP, as shown in Theorem 3. In Algorithms 4 and 6, $\nu_i$ follows $\mathrm{Lap}(4\Delta/(3\epsilon))$ and $\mathrm{Lap}(2\Delta/\epsilon)$, respectively, and does not depend on c. Algorithm 5 adds no noise to queries.

3. Keep track of the number of ⊤'s in the output, and stop when one has outputted c ⊤'s (Line 6 in Algorithms 1-6). Algorithms 1, 2, 3, 4 do this. Algorithms 5 and 6 do not have a cutoff and, because of this, are ∞-DP, as will be shown later.

Figure 1: A Selection of SVT Variants

Input/Output shared by all SVT Algorithms
Input: A private database D, a stream of queries Q = q1, q2, ··· each with sensitivity no more than ∆, either a sequence of thresholds T = T1, T2, ··· or a single threshold T (see footnote ∗), and c, the maximum number of queries to be answered with ⊤.
Output: A stream of answers a1, a2, ···, where each ai ∈ {⊤, ⊥} ∪ R and R denotes the set of all real numbers.

Algorithm 1 An instantiation of the SVT proposed in this paper.
Input: D, Q, ∆, T = T1, T2, ···, c.
 1: ρ = Lap(2∆/ε), count = 0
 2: for each query qi ∈ Q do
 3:   νi = Lap(4c∆/ε)
 4:   if qi(D) + νi ≥ Ti + ρ then
 5:     Output ai = ⊤
 6:     count = count + 1, Abort if count ≥ c.
 7:   else
 8:     Output ai = ⊥
 9:   end if
10: end for

Algorithm 2 SVT in Dwork and Roth 2014 [8].
Input: D, Q, ∆, T, c.
 1: ρ = Lap(2c∆/ε), count = 0
 2: for each query qi ∈ Q do
 3:   νi = Lap(4c∆/ε)
 4:   if qi(D) + νi ≥ T + ρ then
 5:     Output ai = ⊤, ρ = Lap(2c∆/ε)
 6:     count = count + 1, Abort if count ≥ c.
 7:   else
 8:     Output ai = ⊥
 9:   end if
10: end for

Algorithm 3 SVT in Roth's 2011 Lecture Notes [15].
Input: D, Q, ∆, T, c.
 1: ρ = Lap(2∆/ε), count = 0
 2: for each query qi ∈ Q do
 3:   νi = Lap(2c∆/ε)
 4:   if qi(D) + νi ≥ T + ρ then
 5:     Output ai = qi(D) + νi
 6:     count = count + 1, Abort if count ≥ c.
 7:   else
 8:     Output ai = ⊥
 9:   end if
10: end for

Algorithm 4 SVT in Lee and Clifton 2014 [13].
Input: D, Q, ∆, T, c.
 1: ρ = Lap(4∆/ε), count = 0
 2: for each query qi ∈ Q do
 3:   νi = Lap(4∆/(3ε))
 4:   if qi(D) + νi ≥ T + ρ then
 5:     Output ai = ⊤
 6:     count = count + 1, Abort if count ≥ c.
 7:   else
 8:     Output ai = ⊥
 9:   end if
10: end for

Algorithm 5 SVT in Stoddard et al. 2014 [18].
Input: D, Q, ∆, T.
 1: ρ = Lap(2∆/ε)
 2: for each query qi ∈ Q do
 3:   νi = 0
 4:   if qi(D) + νi ≥ T + ρ then
 5:     Output ai = ⊤
 6:
 7:   else
 8:     Output ai = ⊥
 9:   end if
10: end for

Algorithm 6 SVT in Chen et al. 2015 [1].
Input: D, Q, ∆, T = T1, T2, ···.
 1: ρ = Lap(2∆/ε)
 2: for each query qi ∈ Q do
 3:   νi = Lap(2∆/ε)
 4:   if qi(D) + νi ≥ Ti + ρ then
 5:     Output ai = ⊤
 6:
 7:   else
 8:     Output ai = ⊥
 9:   end if
10: end for

|                                  | Alg. 1 | Alg. 2 | Alg. 3   | Alg. 4         | Alg. 5 | Alg. 6 |
|----------------------------------|--------|--------|----------|----------------|--------|--------|
| Scale of threshold noise ρ       | 2∆/ε   | 2c∆/ε  | 2∆/ε     | 4∆/ε           | 2∆/ε   | 2∆/ε   |
| Reset ρ after each output of ⊤   | No     | Yes    | No       | No             | No     | No     |
| Scale of query noise νi          | 4c∆/ε  | 4c∆/ε  | 2c∆/ε    | 4∆/(3ε)        | 0      | 2∆/ε   |
| Outputting qi + νi instead of ⊤  | No     | No     | Yes      | No             | No     | No     |
| Stop after outputting c ⊤'s      | Yes    | Yes    | Yes      | Yes            | No     | No     |
| Privacy Property                 | ε-DP   | ε-DP   | ∞-DP ∗∗  | ((1+6c)/4)ε-DP | ∞-DP   | ∞-DP   |

Figure 2: Differences among Algorithms 1-6.

∗ Algorithms 1 and 6 use a sequence of thresholds T = T1, T2, ···, allowing different thresholds for different queries. The other algorithms use the same threshold T for all queries. We point out that this difference is mostly syntactical. In fact, having an SVT where the threshold always equals 0 suffices. Given a sequence of queries q1, q2, ···, and a sequence of thresholds T = T1, T2, ···, we can define a new sequence of queries ri = qi − Ti, and apply the SVT to ri using 0 as the threshold to obtain the same result. In this paper, we decide to use thresholds to be consistent with the existing papers.

∗∗ ∞-DP means that an algorithm doesn't satisfy ε′-DP for any finite privacy budget ε′.
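To make the structure of Algorithm 1 concrete, here is a minimal Python sketch under a few stated assumptions: the true query answers $q_i(D)$ are passed in as a list, ⊤ and ⊥ are represented by `True` and `False`, and the names `svt_algorithm1` and `laplace_noise` are hypothetical. It mirrors lines 1-10 of Algorithm 1 in Figure 1 and is meant as an illustration, not a hardened implementation.

```python
import random


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def svt_algorithm1(answers, thresholds, epsilon, sensitivity, c, rng=None):
    """Sketch of Algorithm 1; `answers[i]` is the true value q_i(D)."""
    if rng is None:
        rng = random.Random()
    # Line 1: a single threshold noise, never refreshed.
    rho = laplace_noise(2.0 * sensitivity / epsilon, rng)
    count = 0
    output = []
    for q, t in zip(answers, thresholds):
        # Line 3: fresh query noise whose scale grows with the cutoff c.
        nu = laplace_noise(4.0 * c * sensitivity / epsilon, rng)
        if q + nu >= t + rho:
            output.append(True)   # output ⊤
            count += 1
            if count >= c:        # Line 6: abort after c positive outputs
                break
        else:
            output.append(False)  # output ⊥
    return output
```

For example, `svt_algorithm1([3, 40, 5], [20, 20, 20], epsilon=1.0, sensitivity=1.0, c=1)` returns a list of at most three booleans and stops at the first `True`.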

3.1 Privacy Proof for Algorithm 1

We now prove the privacy of Algorithm 1. We break the proof down into two steps, to make the proof easier to understand, and, more importantly, to point out what confusions likely caused the different non-private variants of SVT to be proposed. In the first step, we analyze the situation where the output is $\bot^{\ell}$, a length-ℓ vector $\langle \bot, \cdots, \bot \rangle$, indicating that all ℓ queries are tested to be below the threshold.

LEMMA 1. Let A be Algorithm 1. For any neighboring datasets D and D′, and any integer ℓ, we have

$$\Pr[A(D) = \bot^{\ell}] \le e^{\epsilon/2} \Pr[A(D') = \bot^{\ell}].$$

PROOF. We have

$$\Pr[A(D) = \bot^{\ell}] = \int_{-\infty}^{\infty} f_{\bot}(D, z, L)\, dz, \quad \text{where} \quad f_{\bot}(D, z, L) = \Pr[\rho = z] \prod_{i \in L} \Pr[q_i(D) + \nu_i < T_i + z], \tag{1}$$

and $L = \{1, 2, \cdots, \ell\}$.

The probability of outputting $\bot^{\ell}$ over D is the summation (or integral) of terms $f_{\bot}(D, z, L)$, each of which is the product of $\Pr[\rho = z]$, the probability that the threshold noise equals z, and $\prod_{i \in L} \Pr[q_i(D) + \nu_i < T_i + z]$, the conditional probability that $\bot^{\ell}$ is the output on D given that the threshold noise ρ is z. (Note that given D, T, the queries, and ρ, whether one query results in ⊥ or not depends completely on the noise $\nu_i$ and is independent from whether any other query results in ⊥.) If we can prove

$$f_{\bot}(D, z, L) \le e^{\epsilon/2} f_{\bot}(D', z + \Delta, L), \tag{2}$$

then we have

$$\begin{aligned}
\Pr[A(D) = \bot^{\ell}] &= \int_{-\infty}^{\infty} f_{\bot}(D, z, L)\, dz \\
&\le \int_{-\infty}^{\infty} e^{\epsilon/2} f_{\bot}(D', z + \Delta, L)\, dz && \text{from (2)} \\
&= e^{\epsilon/2} \int_{-\infty}^{\infty} f_{\bot}(D', z', L)\, dz' && \text{let } z' = z + \Delta \\
&= e^{\epsilon/2} \Pr[A(D') = \bot^{\ell}].
\end{aligned}$$

This proves the lemma. It remains to prove Eq (2). For any query $q_i$, because $|q_i(D) - q_i(D')| \le \Delta$ and thus $-q_i(D) \le \Delta - q_i(D')$, we have

$$\begin{aligned}
\Pr[q_i(D) + \nu_i < T_i + z] &= \Pr[\nu_i < T_i - q_i(D) + z] \\
&\le \Pr[\nu_i < T_i + \Delta - q_i(D') + z] \\
&= \Pr[q_i(D') + \nu_i < T_i + (z + \Delta)]. && (3)
\end{aligned}$$

With (3), we prove (2) as follows:

$$\begin{aligned}
f_{\bot}(D, z, L) &= \Pr[\rho = z] \prod_{i \in L} \Pr[q_i(D) + \nu_i < T_i + z] \\
&\le e^{\epsilon/2} \Pr[\rho = z + \Delta] \prod_{i \in L} \Pr[q_i(D') + \nu_i < T_i + (z + \Delta)] \\
&= e^{\epsilon/2} f_{\bot}(D', z + \Delta, L).
\end{aligned}$$

That is, by using a noisy threshold, we are able to bound the probability ratio for all the negative query answers (i.e., ⊥'s) by $e^{\epsilon/2}$, no matter how many negative answers there are.

We can obtain a similar result for positive query answers in the same way. Let

$$f_{\top}(D, z, L) = \Pr[\rho = z] \prod_{i \in L} \Pr[q_i(D) + \nu_i \ge T_i + z].$$

We have $f_{\top}(D, z, L) \le e^{\epsilon/2} f_{\top}(D', z - \Delta, L)$, and thus

$$\Pr[A(D) = \top^{\ell}] \le e^{\epsilon/2} \Pr[A(D') = \top^{\ell}].$$

This likely contributes to the misunderstandings behind Algorithms 5 and 6, which treat positive and negative answers exactly the same way. The problem is that while one is free to choose to bound the positive or the negative side, one cannot bound both.

We also observe that the proof of Lemma 1 will go through if no noise is added to the query answers, i.e., $\nu_i = 0$, because Eq (3) holds even when $\nu_i = 0$. It is likely because of this observation that Algorithm 5 adds no noise to query answers. However, when considering outcomes that include both positive answers (⊤'s) and negative answers (⊥'s), one has to add noise to the query answers, as we show below.

THEOREM 2. Algorithm 1 is ε-DP.

PROOF. Consider any output vector $a \in \{\bot, \top\}^{\ell}$. Let $a = \langle a_1, \cdots, a_{\ell} \rangle$, $I^{a}_{\top} = \{i : a_i = \top\}$, and $I^{a}_{\bot} = \{i : a_i = \bot\}$. Clearly, $|I^{a}_{\top}| \le c$. We have

$$\Pr[A(D) = a] = \int_{-\infty}^{\infty} g(D, z)\, dz, \quad \text{where} \quad g(D, z) = \Pr[\rho = z] \prod_{i \in I^{a}_{\bot}} \Pr[q_i(D) + \nu_i < T_i + z] \prod_{i \in I^{a}_{\top}} \Pr[q_i(D) + \nu_i \ge T_i + z].$$

We want to show that $g(D, z) \le e^{\epsilon} g(D', z + \Delta)$. This suffices to prove that $\Pr[A(D) = a] \le e^{\epsilon} \Pr[A(D') = a]$. Note that g(D, z) can be written as:

$$g(D, z) = f_{\bot}(D, z, I^{a}_{\bot}) \prod_{i \in I^{a}_{\top}} \Pr[q_i(D) + \nu_i \ge T_i + z].$$

Following the proof of Lemma 1, we can show that $f_{\bot}(D, z, I^{a}_{\bot}) \le e^{\epsilon/2} f_{\bot}(D', z + \Delta, I^{a}_{\bot})$, and it remains to show

$$\prod_{i \in I^{a}_{\top}} \Pr[q_i(D) + \nu_i \ge T_i + z] \le e^{\epsilon/2} \prod_{i \in I^{a}_{\top}} \Pr[q_i(D') + \nu_i \ge T_i + z + \Delta]. \tag{4}$$

Because $\nu_i = \mathrm{Lap}(4c\Delta/\epsilon)$ and $|q_i(D) - q_i(D')| \le \Delta$, we have

$$\begin{aligned}
\Pr[q_i(D) + \nu_i \ge T_i + z] &= \Pr[\nu_i \ge T_i + z - q_i(D)] \\
&\le \Pr[\nu_i \ge T_i + z - \Delta - q_i(D')] && (5) \\
&\le e^{\epsilon/2c} \Pr[\nu_i \ge T_i + z - \Delta - q_i(D') + 2\Delta] && (6) \\
&= e^{\epsilon/2c} \Pr[q_i(D') + \nu_i \ge T_i + z + \Delta].
\end{aligned}$$

Eq (5) is because $-q_i(D) \ge -\Delta - q_i(D')$, and Eq (6) is from the Laplace distribution's property. Since $|I^{a}_{\top}| \le c$, multiplying the per-query bounds gives at most $e^{\epsilon/2}$, which proves Eq (4).

The basic idea of the proof is that when comparing g(D, z) with g(D′, z + ∆), we can bound the probability ratio for all outputs of ⊥ to no more than $e^{\epsilon/2}$ by using a noisy threshold, no matter how many such outputs there are. To bound the ratio for the ⊤ outputs to no more than $e^{\epsilon/2}$, we need to add sufficient Laplacian noise, which should scale with c, the number of positive outputs.

Now we turn to Algorithms 3-6 to clarify what is wrong with their privacy proofs and to give their DP properties.
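Before looking at the flawed variants, the bound of Theorem 2 can be sanity-checked numerically on a small instance. The sketch below evaluates $\Pr[A(D) = a] = \int g(D, z)\,dz$ for Algorithm 1 with a plain trapezoidal rule. The integration range, the step count, the example datasets, and the names `lap_pdf`, `lap_cdf` and `output_probability` are assumptions chosen for illustration; the output vector passed in is assumed to be one that Algorithm 1 can actually emit (at most c ⊤'s, with the c-th ⊤ last).

```python
import math


def lap_pdf(x, b):
    """Density of Lap(b) at x."""
    return math.exp(-abs(x) / b) / (2.0 * b)


def lap_cdf(x, b):
    """Pr[Lap(b) < x]."""
    return 0.5 * math.exp(x / b) if x < 0 else 1.0 - 0.5 * math.exp(-x / b)


def output_probability(answers, output, threshold, epsilon, sensitivity, c,
                       lo=-80.0, hi=80.0, steps=16000):
    """Numerically integrate g(D, z) over the threshold noise z."""
    b_rho = 2.0 * sensitivity / epsilon
    b_nu = 4.0 * c * sensitivity / epsilon

    def g(z):
        p = lap_pdf(z, b_rho)
        for q, is_top in zip(answers, output):
            below = lap_cdf(threshold + z - q, b_nu)  # Pr[q + nu < T + z]
            p *= (1.0 - below) if is_top else below
        return p

    h = (hi - lo) / steps
    total = 0.5 * (g(lo) + g(hi))
    for k in range(1, steps):
        total += g(lo + k * h)
    return total * h


eps = 1.0
a = [False, False, True]  # the output vector (⊥, ⊥, ⊤) with c = 1
p_d = output_probability([0, 0, 1], a, 0.0, eps, 1.0, 1)
p_d_prime = output_probability([1, 1, 0], a, 0.0, eps, 1.0, 1)
ratio = p_d / p_d_prime
```

In this instance every query answer differs by exactly ∆ = 1 between the two datasets, and the computed `ratio` stays below $e^{\epsilon}$, as the theorem requires. A numerical check of this kind only illustrates the bound for one pair of inputs; it is not a substitute for the proof.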

3.2 Algorithm 3 is ∞-DP

Algorithm 3 outputs the noisy query answers for those tested above the noisy threshold. The proof for its privacy in [15] goes as follows:

$$\begin{aligned}
\Pr[A(D) = a] &= \int_{-\infty}^{\infty} f_{\bot}(D, z, I^{a}_{\bot}) \prod_{i \in I^{a}_{\top}} \Pr[q_i(D) + \nu_i \ge T + z \,\wedge\, q_i(D) + \nu_i = a_i]\, dz && (7) \\
&= \int_{-\infty}^{\infty} f_{\bot}(D, z, I^{a}_{\bot}) \prod_{i \in I^{a}_{\top}} \Pr[q_i(D) + \nu_i = a_i]\, dz && (8) \\
&\le \int_{-\infty}^{\infty} e^{\epsilon/2} f_{\bot}(D', z + \Delta, I^{a}_{\bot}) \prod_{i \in I^{a}_{\top}} e^{\epsilon/2c} \Pr[q_i(D') + \nu_i = a_i]\, dz. && (9)
\end{aligned}$$

The error occurs when going from (7) to (8), which is implicitly done in [15]. This step removes the condition $q_i(D) + \nu_i \ge T + z$. When keeping the condition, we note that the following inequality, which is needed to reach (9), does not always hold:

$$\Pr[q_i(D) + \nu_i \ge T + z \,\wedge\, q_i(D) + \nu_i = a_i] \le e^{\epsilon/2c} \Pr[q_i(D') + \nu_i \ge T + z + \Delta \,\wedge\, q_i(D') + \nu_i = a_i].$$

When z takes a value such that $a_i - T - \Delta < z < a_i - T$, the left side term is nonzero, while the right side term is zero, because for both $q_i(D') + \nu_i \ge T + z + \Delta$ and $q_i(D') + \nu_i = a_i$ to be true, it requires $z \le a_i - T - \Delta$. The above error was pointed out to us in [19]; however, it was unclear whether Algorithm 3 satisfies ε′-DP for some other ε′. The theorem below answers this question.

THEOREM 3. Algorithm 3 is not ε′-DP for any finite ε′.

PROOF. Set c = 1 for simplicity. Given any finite ε′ > 0, we construct an example to show that Algorithm 3 is not ε′-DP. Consider an example with T = 0, and m + 1 queries q with sensitivity ∆ such that $q(D) = 0^{m}\Delta$ and $q(D') = \Delta^{m}0$, and the output vector $a = \bot^{m}0$, that is, only the last query answer is a numeric value 0. Let A be Algorithm 3. We show that $\frac{\Pr[A(D) = a]}{\Pr[A(D') = a]} \ge e^{\epsilon'}$ for any ε′ > 0 when m is large enough. We denote the cumulative distribution function of $\mathrm{Lap}(2\Delta/\epsilon)$ by F(x). By Eq (7), we have

$$\begin{aligned}
\Pr[A(D) = a] &= \int_{-\infty}^{\infty} f_{\bot}(D, z, I^{a}_{\bot}) \Pr[\Delta + \nu_{m+1} \ge z \,\wedge\, \Delta + \nu_{m+1} = 0]\, dz \\
&= \int_{-\infty}^{\infty} f_{\bot}(D, z, I^{a}_{\bot}) \Pr[0 \ge z] \Pr[\nu_{m+1} = -\Delta]\, dz \\
&= \frac{\epsilon}{4\Delta} e^{-\epsilon/2} \int_{-\infty}^{\infty} f_{\bot}(D, z, I^{a}_{\bot}) \Pr[0 \ge z]\, dz \\
&= \frac{\epsilon}{4\Delta} e^{-\epsilon/2} \int_{-\infty}^{0} f_{\bot}(D, z, I^{a}_{\bot})\, dz \\
&= \frac{\epsilon}{4\Delta} e^{-\epsilon/2} \int_{-\infty}^{0} \Pr[\rho = z] \prod_{i=1}^{m} \Pr[\nu_i < z]\, dz \\
&= \frac{\epsilon}{4\Delta} e^{-\epsilon/2} \int_{-\infty}^{0} \Pr[\rho = z] (F(z))^{m}\, dz, && (10)
\end{aligned}$$

and similarly

$$\Pr[A(D') = a] = \frac{\epsilon}{4\Delta} \int_{-\infty}^{0} \Pr[\rho = z'] (F(z' - \Delta))^{m}\, dz'. \tag{11}$$

The fact that 0 is given as an output reveals the information that the noisy threshold is at most 0, forcing the range of integration to be from −∞ to 0, instead of from −∞ to ∞. This prevents the use of changing z in (10) to z′ − ∆ to bound the ratio of (10) to (11).

Noting that $\frac{F(z)}{F(z - \Delta)} = e^{\epsilon/2}$ for any z ≤ 0, we thus have

$$\begin{aligned}
\frac{\Pr[A(D) = a]}{\Pr[A(D') = a]} &= e^{-\epsilon/2}\, \frac{\int_{-\infty}^{0} \Pr[\rho = z] (F(z))^{m}\, dz}{\int_{-\infty}^{0} \Pr[\rho = z'] (F(z' - \Delta))^{m}\, dz'} \\
&= e^{-\epsilon/2}\, \frac{\int_{-\infty}^{0} \Pr[\rho = z] (e^{\epsilon/2} F(z - \Delta))^{m}\, dz}{\int_{-\infty}^{0} \Pr[\rho = z'] (F(z' - \Delta))^{m}\, dz'} \\
&= e^{(m-1)\epsilon/2},
\end{aligned}$$

and thus when $m > \lceil 2\epsilon'/\epsilon \rceil + 1$, we have $\frac{\Pr[A(D) = a]}{\Pr[A(D') = a]} > e^{\epsilon'}$.

3.3 DP properties of Algorithms 4, 5 and 6

The key error in the proofs for Algorithms 4, 5, 6 in [13, 18, 1] is that they mistakenly used the following:

$$\begin{aligned}
\Pr[A(D) = a] &= \int_{-\infty}^{\infty} \Pr[\rho = z] \prod_{i \in I^{a}_{\bot}} \Pr[q_i(D) + \nu_i < T + z] \prod_{i \in I^{a}_{\top}} \Pr[q_i(D) + \nu_i \ge T + z]\, dz && (12) \\
&= \int_{-\infty}^{\infty} \Pr[\rho = z] \prod_{i \in I^{a}_{\bot}} \Pr[q_i(D) + \nu_i < T + z]\, dz \times \int_{-\infty}^{\infty} \Pr[\rho = z'] \prod_{i \in I^{a}_{\top}} \Pr[q_i(D) + \nu_i \ge T + z']\, dz' \\
&= \int_{-\infty}^{\infty} f_{\bot}(D, z, I^{a}_{\bot})\, dz \int_{-\infty}^{\infty} f_{\top}(D, z', I^{a}_{\top})\, dz'.
\end{aligned}$$

This led to the fact that Algorithms 5 and 6 do not have a cutoff point. This reasoning, however, is incorrect: only the first line, Eq (12), is valid, because we have to use the same noisy threshold for all query answers. We also note that even though the privacy proof of Algorithm 4 does not require a cutoff point, it nonetheless stops after outputting ⊤ for c times. However, the noise it adds to each query is a factor of 2c smaller than necessary for ε-DP. As a result, Algorithm 4 is $\left(\frac{1+6c}{4}\right)\epsilon$-DP. Theorem 6 can be applied to Algorithm 4 to establish this privacy property; we thus omit the proof of this.

Algorithm 5 not only requires no cutoff but also injects no noise to query answers, and thus is not ε′-DP for any finite ε′.

THEOREM 4. Algorithm 5 is not ε′-DP for any finite ε′.

PROOF. Consider a simple example, with T = 0, ∆ = 1, $q = \langle q_1, q_2 \rangle$ such that $q(D) = \langle 0, 1 \rangle$ and $q(D') = \langle 1, 0 \rangle$, and $a = \langle \bot, \top \rangle$. Then by Eq (12), we have

$$\Pr[A(D) = a] = \int_{-\infty}^{\infty} \Pr[\rho = z] \Pr[0 < z] \Pr[1 \ge z]\, dz = \int_{0}^{1} \Pr[\rho = z]\, dz > 0,$$

which is nonzero; and

$$\Pr[A(D') = a] = \int_{-\infty}^{\infty} \Pr[\rho = z'] \Pr[1 < z'] \Pr[0 \ge z']\, dz',$$

which is zero. So the probability ratio $\frac{\Pr[A(D) = a]}{\Pr[A(D') = a]} = \infty$.
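The counterexample in the proof of Theorem 4 is easy to reproduce empirically. The following sketch simulates Algorithm 5 on the two datasets of that proof and counts how often the output ⟨⊥, ⊤⟩ appears. The function names, the seed, the choice ε = 1 and the number of trials are assumptions for illustration only.

```python
import random


def svt_algorithm5(answers, threshold, epsilon, sensitivity, rng):
    """Algorithm 5: noisy threshold, no query noise, no cutoff."""
    rate = epsilon / (2.0 * sensitivity)  # rho ~ Lap(2 * sensitivity / epsilon)
    rho = rng.expovariate(rate) - rng.expovariate(rate)
    return tuple(q >= threshold + rho for q in answers)  # True is ⊤, False is ⊥


def count_output(answers, target, trials, epsilon=1.0, seed=0):
    """Count how many of `trials` runs produce exactly `target`."""
    rng = random.Random(seed)
    return sum(
        svt_algorithm5(answers, 0.0, epsilon, 1.0, rng) == target
        for _ in range(trials)
    )


target = (False, True)                                # a = (⊥, ⊤)
hits_d = count_output([0, 1], target, 10000)          # q(D)  = (0, 1)
hits_d_prime = count_output([1, 0], target, 10000)    # q(D') = (1, 0)
```

On D the target output requires 0 < ρ ≤ 1, which happens with positive probability, so `hits_d` is positive. On D′ it would require ρ > 1 and ρ ≤ 0 at the same time, so `hits_d_prime` is exactly zero regardless of the seed, matching the infinite probability ratio in the proof.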

Algorithm 6 is also ∞-DP, because there is no limitation on the number of positive query answers.

THEOREM 5. Algorithm 6 is not ε′-DP for any finite ε′.

PROOF. We construct a counterexample with ∆ = 1, T = 0, and 2m queries such that $q(D) = 0^{2m}$, and $q(D') = 1^{m}(-1)^{m}$. Consider the output vector $a = \bot^{m}\top^{m}$. Denote the cumulative distribution function of $\nu_i$ by F(x). From Eq. (12), we have

$$\begin{aligned}
\Pr[A(D) = a] &= \int_{-\infty}^{\infty} \Pr[\rho = z] \prod_{i=1}^{m} \Pr[0 + \nu_i < z] \prod_{i=m+1}^{2m} \Pr[0 + \nu_i \ge z]\, dz \\
&= \int_{-\infty}^{\infty} \Pr[\rho = z] (F(z)(1 - F(z)))^{m}\, dz,
\end{aligned}$$

and

$$\begin{aligned}
\Pr[A(D') = a] &= \int_{-\infty}^{\infty} \Pr[\rho = z] \prod_{i=1}^{m} \Pr[1 + \nu_i < z] \prod_{i=m+1}^{2m} \Pr[-1 + \nu_i \ge z]\, dz \\
&= \int_{-\infty}^{\infty} \Pr[\rho = z] (F(z - 1)(1 - F(z + 1)))^{m}\, dz.
\end{aligned}$$

We now show that $\frac{\Pr[A(D) = a]}{\Pr[A(D') = a]}$ is unbounded as m increases, proving this theorem. Compare $F(z)(1 - F(z))$ with $F(z - 1)(1 - F(z + 1))$. Note that F(z) is monotonically increasing. When z ≤ 0,

$$\frac{F(z)(1 - F(z))}{F(z - 1)(1 - F(z + 1))} \ge \frac{F(z)}{F(z - 1)} = \frac{\frac{1}{2} e^{\frac{\epsilon}{2} z}}{\frac{1}{2} e^{\frac{\epsilon}{2}(z - 1)}} = e^{\epsilon/2}.$$

When z > 0, we also have

$$\frac{F(z)(1 - F(z))}{F(z - 1)(1 - F(z + 1))} \ge \frac{1 - F(z)}{1 - F(z + 1)} = \frac{\frac{1}{2} e^{-\frac{\epsilon}{2} z}}{\frac{1}{2} e^{-\frac{\epsilon}{2}(z + 1)}} = e^{\epsilon/2}.$$

So, $\frac{\Pr[A(D) = a]}{\Pr[A(D') = a]} \ge e^{m\epsilon/2}$, which is greater than $e^{\epsilon'}$ when $m > \lceil 2\epsilon'/\epsilon \rceil$ for any finite ε′.

3.4 Error in Privacy Analysis of GPTT

In [2], the SVT variants in [13, 18, 1] were modeled as a generalized private threshold testing algorithm (GPTT). In GPTT, the threshold T is perturbed using $\rho = \mathrm{Lap}(\Delta/\epsilon_1)$ and each query answer is perturbed using $\mathrm{Lap}(\Delta/\epsilon_2)$ and there is no cutoff; thus GPTT can be viewed as a generalization of Algorithm 6. When setting $\epsilon_1 = \epsilon_2 = \epsilon/2$, GPTT becomes Algorithm 6.

There is a proof in [2] to show that GPTT is not ε′-DP for any finite ε′. However, this proof is incorrect. Following the logic of that proof, one can prove any version of SVT, including Algorithm 1, is not ε′-DP for any finite ε′. We now apply the logic in [2] to give such a “proof” that contradicts Lemma 1, and then identify the subtle error in it.

Let A be Algorithm 1, with c = 1. Consider an example with ∆ = 1, T = 0, a sequence q of t queries such that $q(D) = 0^{t}$ and $q(D') = 1^{t}$, and output vector $a = \bot^{t}$. Let

$$\beta = \Pr[A(D) = \bot^{t}] = \int_{-\infty}^{\infty} \Pr[\rho = z] \left(F_{\epsilon/4}(z)\right)^{t} dz,$$

$$\alpha = \Pr[A(D') = \bot^{t}] = \int_{-\infty}^{\infty} \Pr[\rho = z] \left(F_{\epsilon/4}(z - 1)\right)^{t} dz,$$

where $F_{\epsilon/4}(x)$ is the cumulative distribution function of $\mathrm{Lap}(4/\epsilon)$.

Find a parameter δ such that $\int_{-\delta}^{\delta} \Pr[\rho = z]\, dz \ge 1 - \frac{\alpha}{2}$. Then $\int_{-\delta}^{\delta} \Pr[\rho = z] \left(F_{\epsilon/4}(z - 1)\right)^{t} dz \ge \frac{\alpha}{2}$. Let κ be the minimum value of $\frac{F_{\epsilon/4}(z)}{F_{\epsilon/4}(z - 1)}$ in [−δ, δ]; it must be that κ > 1. Then

$$\begin{aligned}
\beta &> \int_{-\delta}^{\delta} \Pr[\rho = z] \left(F_{\epsilon/4}(z)\right)^{t} dz \\
&\ge \int_{-\delta}^{\delta} \Pr[\rho = z] \left(\kappa F_{\epsilon/4}(z - 1)\right)^{t} dz \\
&= \kappa^{t} \int_{-\delta}^{\delta} \Pr[\rho = z] \left(F_{\epsilon/4}(z - 1)\right)^{t} dz \ge \frac{\kappa^{t}}{2} \alpha.
\end{aligned}$$

Since κ > 1, one can choose a large enough t to make $\frac{\beta}{\alpha} = \frac{\kappa^{t}}{2}$ as large as needed. We note that this contradicts Lemma 1.

The error in this logic is that the parameter α itself is a function of t. Thus choosing a larger t causes α to decrease, δ to increase, κ to decrease, and the need to choose an even larger t, and so on. Such a proof is incorrect.

3.5 Other Variants

Some usages of SVT aim at satisfying (ε, δ)-DP [5], instead of ε-DP. These often exploit the advanced composition theorem for DP [9], which states that applying k instances of ε-DP algorithms satisfies (ε′, δ′)-DP, where $\epsilon' = \sqrt{2k \ln(1/\delta')}\,\epsilon + k\epsilon(e^{\epsilon} - 1)$. In this paper, we limit our attention to SVT variants satisfying ε-DP, which are what have been used in the data mining community [1, 13, 17, 18].

The SVT used in [12, 16] has another difference from Algorithm 3. In [12, 16], the goal of using SVT is to determine whether the error of using an answer derived from answers to past queries is below a threshold. This check takes the form of “if $|\tilde{q}_i - q_i(D) + \nu_i| \ge T + \rho$ then,” where $\tilde{q}_i$ gives the estimated answer of a query one can obtain without consuming privacy budget, and $q_i(D)$ gives the true answer. This is incorrect because the left-hand side of the comparison is never negative; whenever the output includes at least one ⊥, one immediately knows that T + ρ is positive, i.e., ρ ≥ −T. This leakage of ρ is somewhat similar to Algorithm 3's leakage caused by outputting noisy query answers found to be above the noisy threshold. This problem can be fixed by using “if $|\tilde{q}_i - q_i(D)| + \nu_i \ge T + \rho$ then” to find those poorly-estimated queries. By viewing $r_i = |\tilde{q}_i - q_i(D)|$ as the query to be answered, this becomes a standard application of SVT.
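The difference between the two forms of the error check can be seen in a small simulation. The sketch below draws ρ and $\nu_i$ once per run, evaluates both checks on the same noise, and records the value of T + ρ. The noise scales follow Algorithm 1 with ∆ = 1, and the estimate, true answer, threshold, seed and function names are assumptions made only for this illustration.

```python
import random


def lap(scale, rng):
    """Sample Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def run_checks(estimate, true_answer, threshold, epsilon, c, rng):
    """Evaluate both forms of the check with the same rho and nu."""
    rho = lap(2.0 / epsilon, rng)
    nu = lap(4.0 * c / epsilon, rng)
    # Form with the noise inside the absolute value.
    inside = abs(estimate - true_answer + nu) >= threshold + rho
    # Form with the noise outside the absolute value.
    outside = abs(estimate - true_answer) + nu >= threshold + rho
    return threshold + rho, inside, outside


rng = random.Random(0)
runs = [run_checks(3.0, 2.0, 0.0, 1.0, 1, rng) for _ in range(5000)]
# Noisy thresholds T + rho of the runs in which each check output ⊥.
inside_bottom = [t for t, inside, _ in runs if not inside]
outside_bottom = [t for t, _, outside in runs if not outside]
```

Every value in `inside_bottom` is strictly positive, because an absolute value can only fall below a positive threshold; a ⊥ from the first form therefore reveals the sign of T + ρ. The list `outside_bottom` contains both positive and non-positive values, so the second form does not leak this information in the same way.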

4. OPTIMIZING SVT

Algorithm 1 can be viewed as allocating half of the privacy budget for perturbing the threshold and half for perturbing the query answers. This allocation is somewhat arbitrary, and other allocations are possible. Indeed, Algorithm 4 uses a ratio of 1 : 3 instead of 1 : 1. In this section, we study how to improve SVT by optimizing this allocation ratio and by introducing other techniques.

4.1 A Generalized SVT Algorithm

We present a generalized SVT algorithm in Algorithm 7, which uses ε1 to perturb the threshold and ε2 to perturb the query answers. Furthermore, to accommodate the situations where one also wants the noisy counts for positive queries, we also use ε3 to output query answers using the Laplace mechanism.

Algorithm 7 Our Proposed Standard SVT
Input: D, Q, ∆, T = T1, T2, ···, c and ε1, ε2 and ε3.
Output: A stream of answers a1, a2, ···
 1: ρ = Lap(∆/ε1), count = 0
 2: for each query qi ∈ Q do
 3:   νi = Lap(2c∆/ε2)
 4:   if qi(D) + νi ≥ Ti + ρ then
 5:     if ε3 > 0 then
 6:       Output ai = qi(D) + Lap(c∆/ε3)
 7:     else
 8:       Output ai = ⊤
 9:     end if
10:     count = count + 1, Abort if count ≥ c.
11:   else
12:     Output ai = ⊥
13:   end if
14: end for
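A direct transcription of Algorithm 7 into Python might look like the sketch below. As assumptions for illustration, the true answers $q_i(D)$ are passed in as a list, ⊤ and ⊥ are represented by `True` and `False`, a positive answer is returned as a float when ε3 > 0, and the names `svt_algorithm7` and `lap` are hypothetical.

```python
import random


def lap(scale, rng):
    """Sample Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def svt_algorithm7(answers, thresholds, eps1, eps2, eps3, sensitivity, c, rng=None):
    """Sketch of Algorithm 7; `answers[i]` is the true value q_i(D)."""
    if rng is None:
        rng = random.Random()
    rho = lap(sensitivity / eps1, rng)                    # line 1
    count = 0
    output = []
    for q, t in zip(answers, thresholds):                 # line 2
        nu = lap(2.0 * c * sensitivity / eps2, rng)       # line 3
        if q + nu >= t + rho:                             # line 4
            if eps3 > 0:
                # Line 6: fresh Laplace noise, independent of nu.
                output.append(q + lap(c * sensitivity / eps3, rng))
            else:
                output.append(True)                       # line 8: ⊤
            count += 1
            if count >= c:                                # line 10
                break
        else:
            output.append(False)                          # line 12: ⊥
    return output
```

Note that the numeric answer on line 6 uses new noise rather than reusing $\nu_i$; releasing $q_i(D) + \nu_i$ itself is the flaw of Algorithm 3 discussed in Section 3.2.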

We now prove the privacy for Algorithm 7; the proof requires only minor changes from the proof of Theorem 2.

P ROOF. Because the second phase of Algorithm 7 is still ǫ3 -DP, we just need to show that for any output vector a,

T HEOREM 6. Algorithm 7 is (ǫ1 + ǫ2 + ǫ3 )-DP.

Pr[A(D) = a] =

Z



g(D, z) dz ≤

−∞

Z



eǫ1 +ǫ2 g(D ′ , z + ∆) dz

−∞

  = eǫ1 +ǫ2 Pr A(D ′ ) = a , Y

where g(D, z) = f⊥ (D, z, Ia ⊥)

Pr[qi (D) + νi ≥ Ti + z] ,

i∈Ia ⊤

and f⊥ (D, z, Ia ⊥ ) = Pr[ρ = z]

Y

Pr[qi (D) + νi < Ti + z] .

i∈Ia ⊥

This holds because, analogously to Eq (2) and Eq (4), we have ǫ1 f⊥ (D′ , z + ∆, Ia f⊥ (D, z, Ia ⊥) ⊥) ≤ e Y   Pr qi (D′ )+νi ≥ Ti +z +∆ . Pr[qi (D)+νi ≥ Ti + z] ≤ eǫ2 i∈Ia i∈Ia

Y







−∞

4.2 Optimizing Privacy Budget Allocation

In Algorithm 7, one needs to decide how to divide a total privacy budget $\epsilon$ into $\epsilon_1$, $\epsilon_2$, $\epsilon_3$. We note that $\epsilon_1+\epsilon_2$ is used for outputting the indicator vector, and $\epsilon_3$ is used for outputting the noisy counts for queries found to be above the threshold; thus the ratio $(\epsilon_1+\epsilon_2) : \epsilon_3$ is determined by the domain needs and should be an input to the algorithm. On the other hand, the ratio $\epsilon_1 : \epsilon_2$ affects the accuracy of SVT. Most variants use 1:1, without a clear justification. To choose a ratio that can be justified, we observe that this ratio affects the accuracy of the following comparison:

$$q_i(D) + \mathrm{Lap}\left(\frac{2c\Delta}{\epsilon_2}\right) \ge T + \mathrm{Lap}\left(\frac{\Delta}{\epsilon_1}\right).$$

To make this comparison as accurate as possible, we want to minimize the variance of $\mathrm{Lap}(\Delta/\epsilon_1) - \mathrm{Lap}(2c\Delta/\epsilon_2)$, which is

$$2\left(\frac{\Delta}{\epsilon_1}\right)^2 + 2\left(\frac{2c\Delta}{\epsilon_2}\right)^2,$$

when $\epsilon_1+\epsilon_2$ is fixed. This is minimized when

$$\epsilon_1 : \epsilon_2 = 1 : (2c)^{2/3}. \qquad (13)$$

We will evaluate the improvement resulting from this optimization in Section 5.

4.3 SVT for Monotonic Queries

In some usages of SVT, the queries are monotonic. That is, when changing from $D$ to $D'$, all queries whose answers are different change in the same direction, i.e., there do not exist $q_i$, $q_j$ such that $q_i(D) > q_i(D') \wedge q_j(D) < q_j(D')$. This is the case when using SVT for frequent itemset mining in [13] with neighboring datasets defined as adding or removing one tuple. For monotonic queries, adding $\mathrm{Lap}(c\Delta/\epsilon_2)$ instead of $\mathrm{Lap}(2c\Delta/\epsilon_2)$ suffices for privacy.

THEOREM 7. Algorithm 7 with $\nu_i = \mathrm{Lap}(c\Delta/\epsilon_2)$ in line 3 satisfies $(\epsilon_1+\epsilon_2+\epsilon_3)$-DP when all queries are monotonic.

PROOF. Algorithm 7 can be divided into two phases: the first phase outputs a vector marking which queries are above the threshold, and the second phase uses the Laplace mechanism to output noisy counts. Since the second phase is $\epsilon_3$-DP, it suffices to show that the first phase is $(\epsilon_1+\epsilon_2)$-DP. For any output vector $a \in \{\top, \perp\}^\ell$, we want to show

$$\Pr[A(D) = a] = \int_{-\infty}^{\infty} g(D, z)\,dz \le e^{\epsilon_1+\epsilon_2} \Pr[A(D') = a],$$

where

$$g(D, z) = f_\perp(D, z, I^a_\perp) \prod_{i \in I^a_\top} \Pr[q_i(D)+\nu_i \ge T_i+z],$$

and

$$f_\perp(D, z, I^a_\perp) = \Pr[\rho = z] \prod_{i \in I^a_\perp} \Pr[q_i(D)+\nu_i < T_i+z].$$

It suffices to show that either $g(D, z) \le e^{\epsilon_1+\epsilon_2} g(D', z)$, or $g(D, z) \le e^{\epsilon_1+\epsilon_2} g(D', z+\Delta)$.

First consider the case that $q_i(D) \ge q_i(D')$. In this case, we have $f_\perp(D, z, I^a_\perp) \le f_\perp(D', z, I^a_\perp)$, because $\Pr[q_i(D)+\nu_i < T_i+z] \le \Pr[q_i(D')+\nu_i < T_i+z]$. Note that $q_i(D) - q_i(D') \le \Delta$. Thus, $g(D, z) \le e^{\epsilon_2} g(D', z)$, because

$$\Pr[q_i(D)+\nu_i \ge T_i+z] \le \Pr[q_i(D')+\nu_i \ge T_i+z-\Delta] \le e^{\epsilon_2/c} \Pr[q_i(D')+\nu_i \ge T_i+z],$$

and at most $c$ queries receive the output $\top$, so the product over $i \in I^a_\top$ contributes a factor of at most $e^{\epsilon_2}$.

Then consider the case in which $q_i(D) < q_i(D')$. We have the usual $f_\perp(D, z, I^a_\perp) \le e^{\epsilon_1} f_\perp(D', z+\Delta, I^a_\perp)$ as in previous proofs. With the constraint that $q_i(D) < q_i(D')$, using $\nu_i = \mathrm{Lap}(c\Delta/\epsilon_2)$ suffices to ensure that

$$\Pr[q_i(D)+\nu_i \ge T_i+z] \le e^{\epsilon_2/c} \Pr[q_i(D')+\nu_i \ge T_i+\Delta+z],$$

thus proving $g(D, z) \le e^{\epsilon_1+\epsilon_2} g(D', z+\Delta)$.

For monotonic queries, the optimization of privacy budget allocation (13) becomes $\epsilon_1 : \epsilon_2 = 1 : c^{2/3}$.
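The allocation rule can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names `split_budget` and `comparison_variance` and the parameter `eps3_fraction` are hypothetical, introduced only for illustration. It splits a total budget according to the ratio $1 : (2c)^{2/3}$ (or $1 : c^{2/3}$ for monotonic queries) and evaluates the variance of the noisy comparison.

```python
def split_budget(eps_total, c, monotonic=False, eps3_fraction=0.0):
    """Split eps_total into (eps1, eps2, eps3).

    eps3_fraction is a hypothetical knob for the share reserved for noisy
    counts; the paper treats the (eps1 + eps2) : eps3 ratio as an input.
    """
    if not 0.0 <= eps3_fraction < 1.0:
        raise ValueError("eps3_fraction must be in [0, 1)")
    eps3 = eps_total * eps3_fraction
    remaining = eps_total - eps3
    # eps1 : eps2 = 1 : (2c)^(2/3) in general, 1 : c^(2/3) if monotonic.
    ratio = float(c if monotonic else 2 * c) ** (2.0 / 3.0)
    eps1 = remaining / (1.0 + ratio)
    eps2 = remaining - eps1
    return eps1, eps2, eps3


def comparison_variance(eps1, eps2, c, delta=1.0, monotonic=False):
    """Variance of Lap(delta/eps1) - Lap(k*c*delta/eps2), k = 1 or 2.

    Uses Var[Lap(b)] = 2 * b^2 and independence of the two noises.
    """
    threshold_scale = delta / eps1
    query_scale = (c if monotonic else 2 * c) * delta / eps2
    return 2.0 * threshold_scale ** 2 + 2.0 * query_scale ** 2


if __name__ == "__main__":
    eps1, eps2, _ = split_budget(1.0, c=4)
    print(eps1, eps2)                          # roughly 0.2 and 0.8
    print(comparison_variance(eps1, eps2, 4))  # optimized split
    print(comparison_variance(0.5, 0.5, 4))    # 1:1 split, larger variance
```

With $\epsilon_1+\epsilon_2 = 1$ and $c = 4$, the optimized split is approximately 0.2 : 0.8, and the comparison variance is less than half of what the 1:1 split yields.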

4.4 SVT versus EM

Most recent usages of SVT, e.g., [1, 13, 17, 18], are in the non-interactive setting with the goal of selecting up to c queries with the highest answers. In this setting, one could also use the Exponential Mechanism (EM) [14] to achieve the same objective. More specifically, one runs EM c times, each round with privacy budget $\epsilon/c$. The quality of each query is its answer; thus each query $q$ is selected with probability proportional to $\exp\left(\frac{\epsilon\, q(D)}{2c\Delta}\right)$ in the general case and to $\exp\left(\frac{\epsilon\, q(D)}{c\Delta}\right)$ in the monotonic case. After one query is selected, it is removed from the pool of candidate queries for the remaining rounds.

An intriguing question is which of SVT and EM offers higher accuracy. Theorem 3.24 in [8], regarding the utility of SVT with $c = \Delta = 1$, states: For any sequence of $k$ queries $f_1, \ldots, f_k$ such that $|\{i < k : f_i(D) \ge T - \alpha\}| = 0$ (i.e., the only query close to being above threshold is possibly the last one), SVT is $(\alpha, \beta)$-accurate (meaning that with probability at least $1 - \beta$, all queries with answers below $T - \alpha$ result in $\perp$ and all queries with answers above $T + \alpha$ result in $\top$) for

$$\alpha_{\mathrm{SVT}} = 8(\log k + \log(2/\beta))/\epsilon.$$

In the case where the last query is at least $T + \alpha$, being $(\alpha, \beta)$-correct ensures that with probability at least $1 - \beta$, the correct selection is made. For the same setting, we say that EM is $(\alpha, \beta)$-correct if given $k - 1$ queries with answer $\le T - \alpha$ and one query with answer $\ge T + \alpha$, with probability $1 - \beta$, the correct selection is made. The probability of selecting the query with answer $\ge T + \alpha$ is at least

$$\frac{e^{\epsilon(T+\alpha)/2}}{(k-1)e^{\epsilon(T-\alpha)/2} + e^{\epsilon(T+\alpha)/2}}.$$

To ensure this probability is at least $1 - \beta$, it suffices to take

$$\alpha_{\mathrm{EM}} = (\log(k-1) + \log((1-\beta)/\beta))/\epsilon,$$

which is less than 1/8 of $\alpha_{\mathrm{SVT}}$. This suggests that EM is more accurate than SVT.

The above analysis relies on assuming that the first $k - 1$ queries are no more than $T - \alpha$. When that is not assumed, it is difficult to analyze the utility of SVT. Therefore, we will use experimental methods to compare SVT with EM.
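To make the gap between the two bounds concrete, the two expressions for $\alpha$ can be evaluated directly. This is a small sketch for illustration; the function names are hypothetical, and the lower bound on EM's success probability is written in the algebraically simplified form obtained by dividing numerator and denominator by $e^{\epsilon(T+\alpha)/2}$, so that $T$ cancels.

```python
import math


def alpha_svt(k, beta, eps):
    """alpha for SVT with c = Delta = 1, as quoted from Theorem 3.24 in [8]."""
    return 8.0 * (math.log(k) + math.log(2.0 / beta)) / eps


def alpha_em(k, beta, eps):
    """alpha that makes the EM lower bound equal to 1 - beta (needs k >= 2)."""
    return (math.log(k - 1) + math.log((1.0 - beta) / beta)) / eps


def em_success_lower_bound(k, alpha, eps):
    """Lower bound on EM picking the single query with answer >= T + alpha.

    Equals e^{eps(T+a)/2} / ((k-1) e^{eps(T-a)/2} + e^{eps(T+a)/2});
    the threshold T cancels after simplification.
    """
    return 1.0 / (1.0 + (k - 1) * math.exp(-eps * alpha))


if __name__ == "__main__":
    k, beta, eps = 100, 0.05, 0.1
    a_svt, a_em = alpha_svt(k, beta, eps), alpha_em(k, beta, eps)
    print("alpha_SVT =", a_svt)
    print("alpha_EM  =", a_em)
    print("ratio     =", a_em / a_svt)  # below 1/8
    print("EM bound at alpha_EM =", em_success_lower_bound(k, a_em, eps))
```

Since $\log(k-1) < \log k$ and $\log((1-\beta)/\beta) < \log(2/\beta)$, the ratio stays below 1/8 for every valid choice of $k$ and $\beta$.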
Applying SVT in the non-interactive setting has another challenge, that of choosing the appropriate threshold T. When T is high, the algorithm may select fewer than c queries after traversing all queries. When T is low, the algorithm may have selected c queries before encountering later queries. No matter how large some of these later query answers are, they cannot be selected. We note that there is a way to deal with this apparent dilemma. One can use a higher threshold T, and when the algorithm runs out of queries before finding c above-threshold queries, one can retraverse the list of queries that have not been selected so far, until c queries are selected. In our experiments, we consider SVT-ReTr, which increases the threshold T by the scale factor of the Laplace noise injected into each query, and applies the retraversal technique.

| Dataset | Number of Transactions | Number of Items |
|---|---|---|
| BMS-POS | 515,597 | 1,657 |
| Kosarak | 990,002 | 41,270 |
| AOL | 647,377 | 2,290,685 |

Table 1: Dataset characteristics

| Settings | Methods | Description |
|---|---|---|
| Interactive | SVT-DPBook | DPBook SVT (Alg. 2). |
| Interactive | SVT-S | Standard SVT (Alg. 7). |
| Non-interactive | SVT-ReTr | Standard SVT with Retraversal. |
| Non-interactive | EM | Exponential Mechanism. |

Table 2: Summary of algorithms
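The retraversal idea can be sketched as follows. This is an illustrative reading of the description above rather than the authors' code: it assumes the selection phase of Algorithm 7 compares $q_i(D) + \nu_i$ against a threshold perturbed once by $\mathrm{Lap}(\Delta/\epsilon_1)$, draws fresh query noise for every comparison, raises the threshold by one query-noise scale, and stops after `max_passes` passes. The cap on passes, the function names, and the parameter names are all assumptions introduced here.

```python
import random


def laplace(scale, rng):
    """Sample Lap(scale) as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def svt_retraversal(answers, threshold, c, eps1, eps2, delta=1.0,
                    monotonic=True, max_passes=50, rng=None):
    """Select up to c query indices; a sketch of SVT with retraversal.

    max_passes is an assumption added to guarantee termination.
    """
    rng = rng or random.Random()
    # Query-noise scale: c*delta/eps2 if monotonic, else 2*c*delta/eps2.
    query_scale = (c if monotonic else 2 * c) * delta / eps2
    # Raise the threshold by one noise scale, then perturb it once.
    noisy_threshold = threshold + query_scale + laplace(delta / eps1, rng)

    selected = []
    remaining = list(range(len(answers)))
    for _ in range(max_passes):
        still_unselected = []
        for i in remaining:
            if answers[i] + laplace(query_scale, rng) >= noisy_threshold:
                selected.append(i)
                if len(selected) == c:
                    return selected  # budget for positive outcomes is used up
            else:
                still_unselected.append(i)
        remaining = still_unselected  # retraverse only unselected queries
        if not remaining:
            break
    return selected


if __name__ == "__main__":
    counts = [1000.0] * 3 + [0.0] * 20
    print(svt_retraversal(counts, 500.0, c=3, eps1=5.0, eps2=5.0,
                          rng=random.Random(0)))
```

Because the loop returns as soon as c indices have been selected, no more than c positive outcomes are ever produced, however many passes are made.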

5. EXPERIMENTAL RESULTS

In this section, we experimentally study the effectiveness of our proposed SVT algorithm and the privacy budget allocation optimization. We also compare SVT with EM. We use the problem of finding top-c frequent items on three real transaction datasets in the experiments. We point out that the choice of the problem does not affect the comparison results, as only the distribution of query answers affects the results.

Datasets. In the experiments, we use three real transaction datasets, BMS-POS, Kosarak and AOL. The characteristics of these datasets are summarized in Table 1.

Utility Measures. One standard metric is the False Negative Rate (FNR), i.e., the fraction of true top-c frequent items that are missed. When an algorithm outputs exactly c results, the FNR is the same as the False Positive Rate, the fraction of incorrectly selected items. The FNR metric has some limitations. First, missing the most frequent item is penalized the same as missing the c-th item. Second, selecting a very infrequent item is penalized the same as selecting the (c + 1)-th item, whose count may be quite close to that of the c-th item. We therefore use another metric that we call Support Error Rate (SER), which measures the ratio of "missed support counts" incurred by selecting $S$ instead of the true top-c items, denoted by $\mathrm{Top}_c$:

$$\mathrm{SER} = 1.0 - \frac{\mathrm{avgSup}(S)}{\mathrm{avgSup}(\mathrm{Top}_c)}.$$
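The two metrics can be computed as in the sketch below. It assumes that avgSup denotes the arithmetic mean of the support counts over a set of items, and it treats an empty selection as SER = 1.0; both are assumptions made here for illustration, as are the function names.

```python
def avg_support(items, support):
    """Mean support count over the given items (assumed meaning of avgSup)."""
    return sum(support[i] for i in items) / float(len(items))


def true_top_c(support, c):
    """The c items with the highest support counts."""
    return sorted(support, key=support.get, reverse=True)[:c]


def support_error_rate(selected, support, c):
    """SER = 1 - avgSup(S) / avgSup(Top_c)."""
    if not selected:
        return 1.0  # assumption: an empty selection misses all support
    return 1.0 - avg_support(selected, support) / avg_support(
        true_top_c(support, c), support)


def false_negative_rate(selected, support, c):
    """Fraction of the true top-c items that were not selected."""
    missed = set(true_top_c(support, c)) - set(selected)
    return len(missed) / float(c)


if __name__ == "__main__":
    support = {"a": 100, "b": 80, "c": 60, "d": 10}
    print(support_error_rate(["a", "c"], support, 2))   # 1 - 80/90
    print(false_negative_rate(["a", "c"], support, 2))  # one of two missed
```

In this toy example, swapping item "b" for item "c" costs half of the top-2 under FNR but only about 11% under SER, which reflects the second limitation of FNR noted above.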

In our experimental results, we have observed that FNR largely correlates with SER, although the correlation is not perfect. Because of space limitations, we present only results for SER.

Evaluation Methodology. We consider the following algorithms. SVT-DPBook is from Dwork and Roth's 2014 book [8] (Algorithm 2). SVT-S is our proposed standard SVT (Algorithm 7); since the count query is monotonic, we use the version for monotonic queries in Section 4.3. We consider four privacy budget allocations, 1:1, 1:3, 1:c and $1:c^{2/3}$; the last is what our analysis suggests for the monotonic case. These algorithms can be applied in interactive or non-interactive settings. SVT-ReTr is SVT with the optimizations of increasing the threshold and retraversing the queries (items) until c of them are selected. Because of space limitations, we present only the results for privacy budget allocations 1:1 and $1:c^{2/3}$. EM is the exponential mechanism. SVT-ReTr and EM can be applied only in the non-interactive setting.

We vary c from 25 to 300, and show results for two privacy budgets, $\epsilon = 0.1$ and $\epsilon = 0.5$. We run each experiment 100 times, each time randomizing the order of items to be examined. We report the average and standard deviation of SER. All algorithms are implemented in Python 2.7 and all the experiments are conducted on an Intel Core i7-3770 3.40GHz PC with 16GB memory.

Comparison Results. Figure 3 and Table 3 report the results. One can see that the ranking of the six algorithms is the same across the 3 datasets and 2 privacy budgets. SVT-DPBook performs the worst, followed by the four SVT-S algorithms, then by SVT-ReTr-$1:c^{2/3}$, and then by EM. The differences among these algorithms can be quite pronounced. For example, on the Kosarak dataset, with $\epsilon = 0.1$ and $c = 50$, SVT-DPBook's SER is 0.705, which means that the average support of the selected items is only around 30% of that of the true top-50 items; we interpret this to mean that the output is meaningless. In contrast, all four SVT-S algorithms have SER less than 0.05, suggesting high accuracy in the selection. SVT-DPBook's poor performance is due to the fact that the threshold is perturbed by noise with scale as large as $c\Delta/\epsilon$.

From Table 3 we can also observe the effect of the four ways of budget allocation on SVT-S. When $\epsilon = 0.5$, our proposed allocation $1:c^{2/3}$ is clearly the best, even though in cases where the errors are quite small, 1:3 occasionally performs slightly better. When $\epsilon = 0.1$, we see that 1:c performs slightly better than $1:c^{2/3}$ in terms of average SER; however, we note that the standard deviation for 1:c is significantly larger. Overall, we view the experimental results as supporting the recommendation of using the $1:c^{2/3}$ budget allocation. Its performance advantage over the standard 1:1 allocation is quite pronounced.

For the interactive setting, our proposed SVT-S and budget allocation optimization significantly improve upon SVT-DPBook. For the non-interactive setting, we observe that EM clearly performs best in all cases, even though SVT-ReTr with budget allocation $1:c^{2/3}$ is able to come quite close to EM's performance. This suggests that usage of SVT should be replaced by EM in the non-interactive setting.
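For reference, the EM baseline described in Section 4.4 (c rounds, each with budget $\epsilon/c$, removing the chosen query after each round) can be sketched as follows. This is an illustrative sketch, not the code used in the experiments; the function name and parameters are hypothetical, and subtracting the maximum answer before exponentiating is a numerical-stability choice made here that leaves the selection probabilities unchanged.

```python
import math
import random


def em_top_c(answers, c, eps, delta=1.0, monotonic=False, rng=None):
    """Select c indices by running the exponential mechanism c times."""
    rng = rng or random.Random()
    per_round_eps = eps / c
    # Exponent coefficient: eps/(2*c*delta) in general, eps/(c*delta)
    # in the monotonic case.
    coeff = per_round_eps / (delta if monotonic else 2.0 * delta)

    candidates = list(range(len(answers)))
    selected = []
    for _ in range(min(c, len(candidates))):
        best = max(answers[i] for i in candidates)
        # Shift by the best answer so the largest weight is exp(0) = 1.
        weights = [math.exp(coeff * (answers[i] - best)) for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        selected.append(pick)
        candidates.remove(pick)  # selected queries leave the pool
    return selected


if __name__ == "__main__":
    counts = [1000.0, 900.0, 800.0] + [0.0] * 20
    print(em_top_c(counts, c=3, eps=30.0, rng=random.Random(0)))
```

Unlike SVT, this procedure needs no threshold and always returns exactly c items when at least c candidates are available.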

[Figure 3: Comparison of SVT-DPBook, SVT-S, SVT-ReTr and EM on selecting top-c frequent items in terms of SER. Panels: (a) BMS-POS, ε = 0.1; (b) BMS-POS, ε = 0.5; (c) Kosarak, ε = 0.1; (d) Kosarak, ε = 0.5; (e) AOL, ε = 0.1; (f) AOL, ε = 0.5. Horizontal axis: top-c, from 25 to 300. Vertical axis: SER, from 0.0 to 1.0. Legend: SVT-DPBook, SVT-S-1:1, SVT-S-1:3, SVT-S-1:c, SVT-S-$1:c^{2/3}$, SVT-ReTr-$1:c^{2/3}$, EM.]

6. RELATED WORKS

SVT was introduced by Dwork et al. [7], and improved by Roth and Roughgarden [16] and by Hardt and Rothblum [12]. These usages are in an interactive setting. An early description of SVT as a stand-alone technique appeared in Roth's 2011 lecture notes [15], which is Algorithm 3 in this paper, and is in fact ∞-DP. The algorithms in [16, 12] also have another difference, as discussed in Section 3.5. Another version of SVT appeared in the 2014 book [8], which is Algorithm 2. This version is used in some papers, e.g., [17]. We show that it is possible to add less noise and obtain higher accuracy for the same privacy parameter.

Lee and Clifton [13] used a variant of SVT (see Algorithm 4) to find itemsets whose support is above the threshold. Stoddard et al. [18] proposed another variant (see Algorithm 5) for private feature selection for classification, to pick out the set of features with scores greater than the perturbed threshold. Chen et al. [1] employed yet another variant of SVT (see Algorithm 6) to return attribute pairs with mutual information greater than the corresponding noisy threshold. These usages are not private. Chen and Machanavajjhala [2] proposed a generalized private threshold testing algorithm (GPTT) and considered the SVT variants in [13, 18, 1] as its special cases. They showed that GPTT does not satisfy ε′-DP for any finite ε′. However, there is an error in the proof, as shown in Section 3.4.

7. CONCLUSION

We have introduced a new version of SVT that provides better utility. We have also introduced an effective technique to improve the performance of SVT by optimizing the distribution of the privacy budget. These enhancements achieve better utility than the state-of-the-art SVT and can be applied to improve utility in the interactive setting. We have also explained the misunderstandings and errors in a number of papers that use or analyze SVT, and we believe that these explanations will help clarify the misunderstandings regarding SVT and help avoid similar errors in the future. We have also shown that in the non-interactive setting, EM should be preferred over SVT.

| Dataset | ε | c | SVT-DPBook | SVT-S 1:1 | SVT-S 1:3 | SVT-S 1:c | SVT-S 1:c^{2/3} | SVT-ReTr 1:1 | SVT-ReTr 1:c^{2/3} | EM |
|---|---|---|---|---|---|---|---|---|---|---|
| BMS-POS | 0.1 | 50 | 0.072 ± 0.051 | 0.017 ± 0.027 | 0.016 ± 0.025 | 0.030 ± 0.040 | 0.017 ± 0.029 | 0.004 ± 0.016 | 0.001 ± 0.001 | 0.001 ± 0.001 |
| BMS-POS | 0.1 | 100 | 0.561 ± 0.085 | 0.255 ± 0.078 | 0.096 ± 0.047 | 0.101 ± 0.097 | 0.063 ± 0.055 | 0.054 ± 0.022 | 0.010 ± 0.019 | 0.007 ± 0.003 |
| BMS-POS | 0.1 | 150 | 0.669 ± 0.073 | 0.505 ± 0.079 | 0.341 ± 0.080 | 0.289 ± 0.221 | 0.220 ± 0.083 | 0.197 ± 0.057 | 0.049 ± 0.012 | 0.048 ± 0.006 |
| BMS-POS | 0.1 | 200 | 0.682 ± 0.071 | 0.628 ± 0.069 | 0.562 ± 0.070 | 0.508 ± 0.223 | 0.478 ± 0.101 | 0.316 ± 0.058 | 0.173 ± 0.067 | 0.092 ± 0.008 |
| BMS-POS | 0.1 | 300 | 0.621 ± 0.069 | 0.631 ± 0.065 | 0.607 ± 0.064 | 0.535 ± 0.225 | 0.581 ± 0.083 | 0.292 ± 0.049 | 0.224 ± 0.092 | 0.141 ± 0.009 |
| Kosarak | 0.1 | 50 | 0.705 ± 0.142 | 0.047 ± 0.061 | 0.032 ± 0.062 | 0.033 ± 0.063 | 0.025 ± 0.049 | 0.007 ± 0.018 | 0.009 ± 0.034 | 0.000 ± 0.000 |
| Kosarak | 0.1 | 100 | 0.968 ± 0.043 | 0.896 ± 0.080 | 0.640 ± 0.142 | 0.373 ± 0.247 | 0.398 ± 0.144 | 0.769 ± 0.109 | 0.103 ± 0.098 | 0.033 ± 0.005 |
| Kosarak | 0.1 | 150 | 0.972 ± 0.037 | 0.956 ± 0.046 | 0.904 ± 0.075 | 0.772 ± 0.208 | 0.818 ± 0.108 | 0.900 ± 0.069 | 0.627 ± 0.134 | 0.139 ± 0.007 |
| Kosarak | 0.1 | 200 | 0.974 ± 0.037 | 0.967 ± 0.044 | 0.960 ± 0.034 | 0.865 ± 0.177 | 0.942 ± 0.045 | 0.933 ± 0.063 | 0.833 ± 0.107 | 0.219 ± 0.008 |
| Kosarak | 0.1 | 300 | 0.971 ± 0.040 | 0.971 ± 0.034 | 0.961 ± 0.039 | 0.945 ± 0.065 | 0.950 ± 0.052 | 0.950 ± 0.046 | 0.902 ± 0.079 | 0.315 ± 0.007 |
| AOL | 0.1 | 50 | 0.225 ± 0.062 | 0.016 ± 0.016 | 0.013 ± 0.015 | 0.023 ± 0.023 | 0.011 ± 0.013 | 0.002 ± 0.004 | 0.002 ± 0.005 | 0.001 ± 0.001 |
| AOL | 0.1 | 100 | 0.993 ± 0.008 | 0.739 ± 0.046 | 0.072 ± 0.029 | 0.093 ± 0.093 | 0.025 ± 0.017 | 0.549 ± 0.047 | 0.007 ± 0.006 | 0.005 ± 0.002 |
| AOL | 0.1 | 150 | 0.998 ± 0.003 | 0.986 ± 0.009 | 0.902 ± 0.028 | 0.532 ± 0.215 | 0.587 ± 0.075 | 0.975 ± 0.015 | 0.346 ± 0.082 | 0.153 ± 0.015 |
| AOL | 0.1 | 200 | 0.999 ± 0.002 | 0.996 ± 0.005 | 0.987 ± 0.009 | 0.924 ± 0.090 | 0.955 ± 0.019 | 0.994 ± 0.007 | 0.894 ± 0.031 | 0.412 ± 0.014 |
| AOL | 0.1 | 300 | 0.999 ± 0.002 | 0.998 ± 0.003 | 0.997 ± 0.003 | 0.986 ± 0.033 | 0.995 ± 0.005 | 0.998 ± 0.003 | 0.990 ± 0.006 | 0.683 ± 0.014 |
| BMS-POS | 0.5 | 50 | 0.009 ± 0.011 | 0.008 ± 0.017 | 0.006 ± 0.016 | 0.007 ± 0.017 | 0.004 ± 0.006 | 0.000 ± 0.000 | 0.000 ± 0.002 | 0.000 ± 0.000 |
| BMS-POS | 0.5 | 100 | 0.040 ± 0.034 | 0.011 ± 0.012 | 0.009 ± 0.017 | 0.013 ± 0.021 | 0.008 ± 0.016 | 0.002 ± 0.006 | 0.000 ± 0.001 | 0.000 ± 0.000 |
| BMS-POS | 0.5 | 150 | 0.113 ± 0.047 | 0.012 ± 0.014 | 0.007 ± 0.005 | 0.017 ± 0.019 | 0.008 ± 0.008 | 0.004 ± 0.001 | 0.001 ± 0.001 | 0.001 ± 0.000 |
| BMS-POS | 0.5 | 200 | 0.369 ± 0.067 | 0.123 ± 0.051 | 0.040 ± 0.023 | 0.048 ± 0.057 | 0.024 ± 0.031 | 0.018 ± 0.003 | 0.003 ± 0.001 | 0.003 ± 0.001 |
| BMS-POS | 0.5 | 300 | 0.516 ± 0.068 | 0.429 ± 0.068 | 0.312 ± 0.061 | 0.284 ± 0.224 | 0.237 ± 0.069 | 0.110 ± 0.043 | 0.034 ± 0.028 | 0.022 ± 0.002 |
| Kosarak | 0.5 | 50 | 0.020 ± 0.044 | 0.021 ± 0.053 | 0.002 ± 0.005 | 0.006 ± 0.028 | 0.001 ± 0.003 | 0.002 ± 0.012 | 0.002 ± 0.012 | 0.000 ± 0.000 |
| Kosarak | 0.5 | 100 | 0.140 ± 0.104 | 0.036 ± 0.056 | 0.018 ± 0.032 | 0.039 ± 0.082 | 0.025 ± 0.052 | 0.001 ± 0.001 | 0.002 ± 0.018 | 0.000 ± 0.000 |
| Kosarak | 0.5 | 150 | 0.640 ± 0.126 | 0.042 ± 0.050 | 0.008 ± 0.013 | 0.062 ± 0.109 | 0.015 ± 0.036 | 0.009 ± 0.002 | 0.001 ± 0.000 | 0.001 ± 0.000 |
| Kosarak | 0.5 | 200 | 0.869 ± 0.075 | 0.430 ± 0.112 | 0.085 ± 0.061 | 0.077 ± 0.102 | 0.037 ± 0.045 | 0.130 ± 0.081 | 0.004 ± 0.003 | 0.004 ± 0.001 |
| Kosarak | 0.5 | 300 | 0.936 ± 0.051 | 0.845 ± 0.077 | 0.628 ± 0.095 | 0.423 ± 0.270 | 0.412 ± 0.123 | 0.672 ± 0.106 | 0.108 ± 0.086 | 0.048 ± 0.004 |
| AOL | 0.5 | 50 | 0.008 ± 0.010 | 0.005 ± 0.009 | 0.005 ± 0.009 | 0.006 ± 0.011 | 0.004 ± 0.005 | 0.001 ± 0.004 | 0.001 ± 0.004 | 0.000 ± 0.000 |
| AOL | 0.5 | 100 | 0.017 ± 0.015 | 0.010 ± 0.009 | 0.007 ± 0.007 | 0.022 ± 0.018 | 0.009 ± 0.009 | 0.001 ± 0.000 | 0.000 ± 0.001 | 0.000 ± 0.000 |
| AOL | 0.5 | 150 | 0.136 ± 0.029 | 0.013 ± 0.010 | 0.008 ± 0.008 | 0.020 ± 0.021 | 0.007 ± 0.007 | 0.002 ± 0.001 | 0.000 ± 0.000 | 0.000 ± 0.000 |
| AOL | 0.5 | 200 | 0.800 ± 0.032 | 0.016 ± 0.009 | 0.008 ± 0.005 | 0.032 ± 0.031 | 0.008 ± 0.007 | 0.005 ± 0.002 | 0.001 ± 0.000 | 0.001 ± 0.000 |
| AOL | 0.5 | 300 | 0.988 ± 0.007 | 0.748 ± 0.031 | 0.098 ± 0.017 | 0.074 ± 0.070 | 0.018 ± 0.012 | 0.533 ± 0.031 | 0.005 ± 0.002 | 0.005 ± 0.001 |

Table 3: Comparison of SVT-DPBook, SVT-S, SVT-ReTr and EM on selecting top-c frequent items in terms of SER. SVT-DPBook and the four SVT-S columns belong to the interactive setting; the two SVT-ReTr columns and EM belong to the non-interactive setting. Each cell gives the average value of SER with its standard deviation (average ± standard deviation). In the typeset table, the best SER value in each row is marked by italics for the non-interactive setting and by boldface for the interactive setting.

8. REFERENCES

[1] R. Chen, Q. Xiao, Y. Zhang, and J. Xu. Differentially private high-dimensional data publication via sampling-based inference. In KDD, pages 129–138, 2015.
[2] Y. Chen and A. Machanavajjhala. On the privacy properties of variants on the sparse vector technique. CoRR, abs/1508.07306, 2015.
[3] I. Dinur and K. Nissim. Revealing information while preserving privacy. In PODS, pages 202–210, 2003.
[4] C. Dwork. Differential privacy. In ICALP, pages 1–12, 2006.
[5] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
[6] C. Dwork, F. McSherry, and K. Talwar. The price of privacy and the limits of LP decoding. In STOC, pages 85–94, 2007.
[7] C. Dwork, M. Naor, O. Reingold, G. Rothblum, and S. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In STOC, pages 381–390, 2009.
[8] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Theoretical Computer Science, 9(3-4):211–407, 2013.
[9] C. Dwork, G. Rothblum, and S. Vadhan. Boosting and differential privacy. In FOCS, pages 51–60, 2010.
[10] C. Dwork and S. Yekhanin. New efficient attacks on statistical disclosure control mechanisms. In CRYPTO, pages 469–480, 2008.
[11] A. Gupta, A. Roth, and J. Ullman. Iterative constructions and private data release. In TCC, pages 339–356, 2012.
[12] M. Hardt and G. N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In FOCS, pages 61–70, 2010. Full version at http://www.mrtz.org/papers/HR10mult.pdf.
[13] J. Lee and C. W. Clifton. Top-k frequent itemsets via differentially private fp-trees. In KDD, pages 931–940, 2014.
[14] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103, 2007.
[15] A. Roth. The sparse vector technique, 2011. Lecture notes for "The Algorithmic Foundations of Data Privacy".
[16] A. Roth and T. Roughgarden. Interactive privacy via the median mechanism. In STOC, pages 765–774, 2010.
[17] R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In CCS, pages 1310–1321, 2015.
[18] B. Stoddard, Y. Chen, and A. Machanavajjhala. Differentially private algorithms for empirical machine learning. CoRR, abs/1411.5428, 2014.
[19] L. Xiong. Personal communication, Sept. 2015.