How to Estimate Change from Samples


arXiv:1203.4903v2 [cs.DS] 15 Feb 2013

Edith Cohen∗†

Haim Kaplan †

Abstract: Measurements, snapshots of a system, traffic matrices, and activity logs are typically collected repeatedly. Difference queries are then used to detect and localize changes for anomaly detection, monitoring, and planning. When the data is sampled, as is often done to meet resource constraints, queries are processed over the sampled data. We are not aware, however, of previously known estimators for Lp (p-norm) distances that are accurate when only a small fraction of the data is sampled. We derive estimators for Lp distances that are nonnegative and variance optimal in a Pareto sense, building on our recent work on estimating general functions. Our estimators are applicable both when samples are independent and when they are coordinated. For coordinated samples we present two estimators that trade off variance according to the similarity of the data; moreover, one of the estimators has the property that, for all data, its variance is close to the minimum possible for that data. We study the performance of our Manhattan and Euclidean distance (p = 1, 2) estimators on diverse datasets, demonstrating scalability and accuracy: we obtain accurate estimates even when a small fraction of the data is sampled. We also demonstrate the benefit of tailoring the estimator to the characteristics of the dataset.

1 Introduction

Data is commonly generated or collected repeatedly, where each instance has the form of a value assignment to a set of keys: daily summaries of the number of queries containing certain keywords, transmitted bytes for IP flow keys, performance parameters (delay, throughput, or loss) for IP source-destination pairs, environmental measurements for sensor locations, and requests for resources. In these examples, each set of values (instance) corresponds to a particular time or location. The universe of possible key values is fixed across instances, but the values of a key differ between instances. Difference queries between instances facilitate anomaly detection, monitoring, and planning by detecting, measuring, and localizing change [6, 24]. Figure 1 shows two instances and difference measures that include the Euclidean and Manhattan norms.

  key:         1     2     3     4     5     6
  Instance 1:  5     0     4     5     8     7
  Instance 2:  7    10     3     0     6     7

  key   RG1   RG1+   RG1-   RG2   RG2+   RG2-
  1      2     2      0      4     4      0
  2     10    10      0    100   100      0
  3      1     0      1      1     0      1
  4      5     0      5     25     0     25
  5      2     0      2      4     0      4
  6      0     0      0      0     0      0

Difference between instances for keys H:

  L^p_{p±}(H) = Σ_{h∈H} RG_{p±}(h),    L_{p±}(H) = (Σ_{h∈H} RG_{p±}(h))^{1/p}

  diff:       RG_p(h)    = |v1(h) − v2(h)|^p
  up-only:    RG_{p+}(h) = max{0, v2(h) − v1(h)}^p
  down-only:  RG_{p−}(h) = max{0, v1(h) − v2(h)}^p

Example queries: L1([6]) = 20; L1({4, 5, 6}) = 7; L2²({1, 5}) = 8; L2([6]) = √134 ≈ 11.6; L2−([6]) = √30 ≈ 5.5; L2+([6]) = √104 ≈ 10.2.
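The single-key diffs and the example queries of Figure 1 can be reproduced with a few lines of Python (a sketch; the function names are ours, and the up/down convention follows the figure's columns, with "up" meaning an increase from instance 1 to instance 2):

```python
# Values of keys 1..6 in the two instances of Figure 1.
v1 = [5, 0, 4, 5, 8, 7]
v2 = [7, 10, 3, 0, 6, 7]

def rg(p, a, b):
    """Exponentiated range |a - b|^p for a single key."""
    return abs(a - b) ** p

def rg_up(p, a, b):
    """Up-only diff: increase from instance 1 to instance 2."""
    return max(0, b - a) ** p

def rg_down(p, a, b):
    """Down-only diff: decrease from instance 1 to instance 2."""
    return max(0, a - b) ** p

def lpp(p, keys, f=rg):
    """L_p^p over a subset of keys (1-indexed, as in the figure)."""
    return sum(f(p, v1[h - 1], v2[h - 1]) for h in keys)

print(lpp(1, range(1, 7)))          # L1([6]) = 20
print(lpp(2, range(1, 7)))          # L2^2([6]) = 134
print(lpp(2, range(1, 7), rg_up))   # up-only component: 104
print(lpp(2, range(1, 7), rg_down)) # down-only component: 30
```

Taking the pth root of an `lpp` value yields the corresponding L_p query, e.g. L2([6]) = 134 ** 0.5 ≈ 11.6.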

Figure 1: Difference measures between two instances on a subset H of the keys. The top table shows the values vi(h) for key h ∈ [6] in instance i ∈ [2]. The bottom table shows the single-key diffs.

∗ Microsoft Research, SVC, USA.
† The Blavatnik School of Computer Science, Tel Aviv University, Israel.

Data collection and warehousing is subject to limitations on storage, throughput, and energy required for transmission. Even when the data is stored in full, exact processing of queries may be slow and resource consuming. Random

sampling of datasets is widely used as a means to obtain a flexible summary over which we can query the data while meeting these limitations [25, 33, 4, 3, 7, 19, 20, 1, 16, 21, 10, 18, 8, 11]. Quality estimators are essential to scalable and accurate querying of sampled data. We seek estimators that are accurate when a small fraction of the data is sampled and are efficient to compute. Since differences are nonnegative, we are interested in nonnegative estimators.

A sampling algorithm maps data values and a set of random bits to a set of sampled keys. We focus on weighted sampling, meaning that the probability that a key is sampled depends on its value. When values are skewed, sampling schemes that favor heavier keys allow for more accurate estimates. When values are 0/1, keys with value 0, which are often the majority of possible keys, have zero inclusion probability and hence need not be explicitly considered: in an IP router, only a small subset of all possible keys (IP addresses or IP flow keys) is active. Because instances are dispersed in time or location, for scalability, the sample of one instance cannot depend on the values assumed in another [15, 12], but random bits can be public (when generated using random hash functions). The two extremes of the joint distribution of samples of different instances are independent sampling (independent sets of random bits for each instance) and coordinated sampling (identical sets of random bits). With coordinated sampling, similar instances have similar samples, whereas independent samples of identical instances can be completely disjoint.
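The contrast between the two extremes can be sketched with hash-based seeds (the helper names and salts below are illustrative, not from the paper):

```python
import hashlib

def seed(key, salt):
    """Map (key, salt) to a reproducible 'random' seed in (0, 1]."""
    h = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
    return (int(h, 16) % (10 ** 9) + 1) / 10 ** 9

def pps_sample(instance, tau, salt):
    """Poisson PPS: include key h iff v(h) >= tau * u(h)."""
    return {h for h, v in instance.items() if v >= tau * seed(h, salt)}

a = {"k1": 5, "k2": 9, "k3": 2, "k4": 7}
b = dict(a)  # an identical second instance

# Coordinated sampling: a shared salt (shared seeds) means identical
# instances always get identical samples.
assert pps_sample(a, 10, "shared") == pps_sample(b, 10, "shared")

# Independent sampling: per-instance salts; the samples of even
# identical instances may differ.
print(pps_sample(a, 10, "salt-A"), pps_sample(b, 10, "salt-B"))
```

The shared salt plays the role of the public random bits: any party that knows it can reproduce the seeds, which is what coordination requires.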
These two sampling schemes have different strengths, and therefore we consider both. Coordination [2, 31, 28, 30, 4, 3, 7, 19, 20, 1, 10, 21, 11, 15] allows for tighter estimates of many basic queries, including distinct counts (set union), quantile sums, and, as we shall see, difference norms [7, 19, 20, 11, 15, 12]. It also has pitfalls: (i) it results in an unbalanced "burden," where the same keys tend to be sampled across instances (an issue when, for example, sampling is performed prior to transmission to save power), and (ii) the variance on some queries, notably linear combinations of single-instance queries ("total traffic on Monday-Wednesday", from daily summaries), is larger than with independent sampling (an issue if the sample is primarily used for such queries).

Contribution: We derive unbiased nonnegative estimators for L_p^p and its upward-only and downward-only components L^p_{p+} and L^p_{p−}. The sampling of each instance can be Poisson or bottom-k, and samples of different instances can be independent or coordinated. We estimate L_p^p as a sum, over selected keys, of nonnegative unbiased variance-optimal estimates of RG_p of the values assumed by the key (see Figure 1 for examples and definitions). Variance optimality is in a Pareto sense: another estimator with strictly lower variance on some data must have strictly higher variance on other data. L_p is estimated by the pth root of our L_p^p estimate. The bias of the L_p estimate decreases with the coefficient of variation of the (unbiased) L_p^p estimator, which decreases as more keys are selected. Similarly, the upward-only and downward-only components are estimated as respective sums of RG_{p+} and RG_{p−} estimates.

Over independently sampled instances, we derive an optimal monotone estimator for RG_p of two values (p > 0). Monotonicity means that the estimate value is non-decreasing with the information we can glean from the outcome. Our construction adapts a technique we developed in [12].
Over coordinated samples, we apply [14, 13] to derive the L and U estimators, which are unbiased, nonnegative, and variance-optimal. The L estimator has lower variance for data with a small difference (range), whereas the U estimator performs better when the range is large. The L estimator is monotone and "variance competitive": on all data vectors, its variance is not too far off the minimum possible variance for the vector by a nonnegative unbiased estimator. For p = 1, 2, we compute closed-form expressions for the estimators and their variance, and also tight bounds on the "competitiveness" of the L estimator.

We evaluate and compare the performance of our L1 and L2² difference estimators on diverse data sets, which vary in size and magnitude of change, and relate the observed performance to properties of the data. We also examine the behavior of the L and U estimators and provide guidelines for choosing between them based on properties of the data.

Roadmap: Section 3 contains necessary background and definitions. We present difference estimators for independent samples in Section 4 and for coordinated samples in Section 5. Section 6 contains an experimental evaluation.


2 Related work

There has been little prior work on estimating difference norms from samples. This is at least partly because, under common schemes such as sampling via random accesses, there are strong lower bounds [5, 12] on estimation quality, showing that most entries need to be sampled in order to obtain estimates with meaningful accuracy. Our estimators, which are accurate even when only a small fraction of the data is sampled, critically depend on the reproducibility of the "random bits" used by the sampling algorithm. More precisely, the inclusion probability of a key depends both on its value and on a "random seed." Knowledge of the seed (which can be hash based) provides the estimator with additional power, since when a key is not sampled we are able to bound its value. Fortunately, known seeds can be integrated with basic sampling schemes when data entries can be individually examined by the sampling algorithm, which is commonly the case when samples are produced as summaries of large data sets.

One can attempt to obtain nonnegative and unbiased estimates via classic inverse probabilities (Horvitz-Thompson [23]): when the outcome reveals the value of the estimated quantity, the estimate is equal to the value divided by the probability of such an outcome; the estimate is 0 otherwise. Inverse-probability estimates, however, are inapplicable to difference estimation over weighted samples, since they require a positive probability for outcomes that reveal the exact value of the estimated quantity. With multiple instances and weighted sampling, keys that have a zero value in one instance and a positive value in another contribute positively to the difference, but because zero values are never sampled, there is zero probability of determining the value from the outcome.
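For a single instance the inverse-probability construction does work, because a sampled key reveals its own value; a minimal sketch (the function names are ours, and the expectation is taken exactly over the seed):

```python
# Poisson PPS of one instance: key with value v is sampled iff
# v >= tau * u with u ~ U(0, 1], so the inclusion probability is
# p = min(1, v / tau).
def ht_estimate(v, tau):
    """Horvitz-Thompson estimate of v, given that v was sampled."""
    return v / min(1.0, v / tau)

def expected_estimate(v, tau):
    """Exact expectation over the seed: value ht_estimate(v) with
    probability min(1, v/tau), and 0 otherwise."""
    p = min(1.0, v / tau)
    return p * ht_estimate(v, tau)

for v, tau in [(3.0, 10.0), (8.0, 5.0)]:
    assert abs(expected_estimate(v, tau) - v) < 1e-12  # unbiased
```

For differences the same trick is unavailable: a key with v2 = 0 is never sampled in instance 2, so no outcome reveals RG(v1, 0) exactly, which is precisely the obstruction described above.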
The only pre-existing satisfactory difference estimator we are aware of is for L1 over coordinated samples, which uses the relation |v1 − v2 | = max{v1 , v2 } − min{v1 , v2 } to obtain an indirect estimate as the difference of two inverse probability estimates for the maximum and minimum [15]. Our U estimator for p = 1 is a strengthening of this L1 estimator. Lastly, difference estimation of streams was extensively studied using sketches of the streams (e.g. [17]), which are synopses that are not sample-based. With sketches it is possible to obtain tighter estimates on the difference between complete streams but sketches have limited usefulness for other queries and do not preserve information on values of particular keys, and in particular, do not naturally support subset queries.

3 Preliminaries

We denote by vi(h) ∈ V the value of key h ∈ K in instance i ∈ [r], and by the vector v(h) the values of key h in all instances. The exponentiated range function over the values of a single key v(h) is

  RG_p(v) = (max(v) − min(v))^p   (p > 0),   (1)

where max(v) ≡ max_i v_i and min(v) ≡ min_i v_i are the maximum and minimum entry values of the vector v. We omit the subscript when p = 1. For two instances (r = 2) we use the following to isolate upward-only and downward-only changes:

  RG_{p+}(v1, v2) = max{v2 − v1, 0}^p
  RG_{p−}(v1, v2) = max{v1 − v2, 0}^p

For a selected subset H ⊂ K of keys, we define

  L_p^p(H) = Σ_{h∈H} RG_p(v(h)).   (2)

The p-norm of the difference of two instances (r = 2) is L_p(H) ≡ (L_p^p(H))^{1/p} = ||v1(H) − v2(H)||_p. For upward-only and downward-only change we use L^p_{p±}(H) = Σ_{h∈H} RG_{p±}(v1(h), v2(h)). When data is sampled, we estimate L_p^p, L^p_{p+}, and L^p_{p−} by summing estimates for the respective single-key primitives (RG_p, RG_{p+}, and RG_{p−}) over keys in H. We use unbiased estimators for the primitives, which result, from


  key:         1     2     3     4     5     6
  Instance 1:  5     0     4     5     8     7
  Instance 2:  7    10     3     0     6     7

Independent sampling, PPS:
  seeds u1:  0.23  0.29  0.84  0.15  0.58  0.19
  seeds u2:  0.81  0.17  0.48  0.36  0.15  0.49
  v1/u1:     21.7  0     4.8   33.3  13.8  36.8
  v2/u2:     8.6   58.8  6.25  0     41.7  14.3

Coordinated sampling, PPS:
  seeds u:   0.23  0.29  0.84  0.15  0.58  0.19
  v1/u:      21.7  0     4.8   33.3  13.8  36.8
  v2/u:      30.4  34.5  3.6   0     10.3  36.8

Poisson samples, E[|S|] = 3:
  independent:           i = 1: τ1 = 29/3, S = {1, 4, 5, 6};   i = 2: τ2 = 11, S = {2, 5, 6}
  coordinated (u ≡ u1):  i = 1: τ1 = 29/3, S = {1, 4, 5, 6};   i = 2: τ2 = 11, S = {1, 2, 6}

Priority samples, |S| = 3:
  independent:           i = 1: eff. τ1 = 13.8, S = {1, 4, 6};   i = 2: eff. τ2 = 8.6, S = {2, 5, 6}
  coordinated (u ≡ u1):  i = 1: eff. τ1 = 13.8, S = {1, 4, 6};   i = 2: eff. τ2 = 10.3, S = {1, 2, 6}

Figure 2: Independent and coordinated samples of two instances: seed selection and ratios, Poisson PPS samples of expected size 3, and priority samples of size 3.

linearity of expectation, in unbiased estimates for the sums. Since only a fraction of the keys is sampled, our estimate for each primitive generally has a high CV (coefficient of variation, which is the ratio of the square root of the variance to the mean). Since estimates for different keys are (pairwise) independent (or nonpositively correlated), variance is (sub)additive, and the CV decreases as |H| increases, allowing for accurate estimates when H is sufficiently large. Unbiasedness of the single-key estimators is essential here. Since unbiasedness is not preserved under exponentiation, we need to carefully tailor the estimator to the exponent value p. We estimate Lp(H) by taking the pth root of the estimate for L_p^p(H). This estimate is biased, but the error is small when the CV of our L_p^p(H) estimate is small.

Sampling scheme of instances. Our estimators apply to Poisson sampling, where keys are sampled independently, and to bottom-k (order) sampling, which yields a sample of size exactly k. Bottom-k sampling includes priority (sequential Poisson) sampling and weighted sampling without replacement [30, 29, 27, 9, 18, 10, 11]. We state these classic schemes in a way that allows the "random" bits to be reproducible, first by the sampling algorithm, to facilitate coordination, and also by the estimator, to yield strong estimates. We reuse notation from [12, 14, 13]. Sampling is specified by a set of nondecreasing continuous functions τ_i^h defined on the interval [0, 1]. Each key h in instance i is associated with a random seed value u_i(h) ∼ U[0, 1] chosen uniformly at random. To make randomization reproducible, u_i(h) is generated via a random hash function (pairwise independence and fewer bits in the representation of the seed suffice, but we skip these details here).
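The samples listed in Figure 2 can be reproduced directly from the listed seeds; a sketch (Poisson PPS includes a key when its value is at least the threshold times its seed, and a priority sample keeps the k keys with the largest value-to-seed ratios):

```python
v1 = [5, 0, 4, 5, 8, 7]
v2 = [7, 10, 3, 0, 6, 7]
u1 = [0.23, 0.29, 0.84, 0.15, 0.58, 0.19]
u2 = [0.81, 0.17, 0.48, 0.36, 0.15, 0.49]

def poisson_pps(v, u, tau):
    """Keys (1-indexed) with v(h) >= tau * u(h), i.e. ratio v/u >= tau."""
    return {h + 1 for h in range(len(v)) if v[h] >= tau * u[h]}

def priority(v, u, k):
    """Bottom-k with tau(x) = x: the k keys with the largest ratios v/u."""
    order = sorted(range(len(v)), key=lambda h: v[h] / u[h], reverse=True)
    return {h + 1 for h in order[:k]}

print(poisson_pps(v1, u1, 29 / 3))  # instance 1, expected size 3
print(poisson_pps(v2, u2, 11))      # instance 2, independent seeds
print(poisson_pps(v2, u1, 11))      # instance 2, coordinated (u = u1)
print(priority(v1, u1, 3))          # priority sample of instance 1
```

Note how the coordinated sample of instance 2 overlaps the sample of instance 1 much more than the independent one does.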
With Poisson sampling, h is sampled in instance i ⇐⇒ v_i(h) ≥ τ_i^h(u_i(h)).   (3)

A bottom-k sample of instance i includes the k keys with the largest ratios r_i(h) = v_i(h)/τ_i^h(u_i(h)). Samples of different instances are independent when the seeds u_i(h) are independent for all i. They are coordinated (shared-seed) if the same seed is used for the same key in all instances, that is, ∀h ∈ K, ∀i ∈ [r], u_i(h) = u_1(h) ≡ u(h).

When the threshold functions have the form τ_i^h(x) = ax (to simplify notation we use τ_i^h(x) ≡ τ_i^h x, treating τ_i^h as a constant), the corresponding Poisson samples are PPS (Probability Proportional to Size) [22]. Strictly, PPS sampling assumes that the τ_i^h are consistent across the instance, that is, τ_i^h ≡ τ_i, but our analysis is general. Sampling can be performed efficiently both when the threshold τ_i is fixed and when it is set adaptively by a streaming algorithm to achieve a specified expected sample size E[|S|] = Σ_{h∈K} min{1, v_i(h)/τ_i}. As an example, to obtain a PPS Poisson sample of expected size E[|S|] = 3 for the instances in Figure 1, we use τ_1 = 29/3 (instance 1) and τ_2 = 33/3 = 11 (instance 2). The bottom-k sample obtained with τ_i^h(x) ≡ x is a priority (sequential Poisson) sample [27, 18, 32]. Weighted sampling without replacement is obtained with thresholds τ_i^h(x) ≡ −ln x. Figure 2 shows PPS and priority samples obtained with respect to random seeds for the two instances in Figure 1.

Sampling model (single key): The exponentiated range estimators are applied to samples of the same key h across instances i ∈ [r]. That is, we work with the restriction of the sample to one key at a time. With Poisson sampling, for key h, we can obtain from the sample (3) the values of the sampled entries of key h. The seed vector u ≡ u(h) and the thresholds τ = (τ_1(h), . . . , τ_r(h)) are all available to the estimator. With Bottom-k

sampling of instances, the threshold is not readily available, so we work with effective thresholds as follows. We condition the inclusion of h on the seeds of the other keys being fixed [10, 18] and define τ_i^h ≡ τ_i to be the inverse of the kth largest ratio r_i(h′) among the keys of instance i with h excluded (which is the (k+1)st largest ratio over all keys in the instance).

From here onward, we omit from the notation the reference to the key h and focus on exponentiated range estimators. We return to sum aggregates only for the experiments in Section 6. The data (the values of a single key h in instances i ∈ [r]) is v = (v_1, v_2, . . . , v_r) ∈ V = V^r (we mostly assume V ⊂ R_{≥0}). The outcome S depends on the data v, the random seed vector u, and the threshold functions τ. The ith entry is included in S if and only if its value is at least τ_i(u_i):

  i ∈ S ⇐⇒ v_i ≥ τ_i(u_i).

The set of all data vectors consistent with outcome S (we treat u ∈ [0, 1]^r as included with S) is

  V∗(S) = {v ∈ V | S = S(u, v)}
        = {z | ∀i ∈ [r], (i ∈ S ∧ z_i = v_i) ∨ (i ∉ S ∧ z_i < τ_i(u_i))}.

We can equivalently define the outcome as the set V∗(S), since it captures all information available to the estimator.

Estimators: An estimator f̂ for a function f on V is a numeric function applied to the outcome. To be well defined in continuous domains, f̂ should be (Lebesgue) integrable. For exponentiated ranges, which are nonnegative quantities, we are interested in estimators that are nonnegative: f̂(S) ≥ 0 for all S. As explained earlier, since we sum many estimates, we would like each estimate to be unbiased: E[f̂ | v] = f(v). Other properties we seek are bounded variance on all data and variance-optimality (respectively, variance+-optimality): there is no (resp., no nonnegative) estimator with the same or lower variance on all data and strictly lower variance on some data. An intuitive property that is sometimes desirable is monotonicity: the estimate value is non-decreasing with the information on the data that we can glean from the outcome: V∗(S) ⊂ V∗(S′) =⇒ f̂(S) ≥ f̂(S′).

Order-based variance optimality: Given a partial order ≺ on V, an estimator f̂ is ≺-optimal (respectively, ≺+-optimal) if it is unbiased (resp., also nonnegative) and, for all data v, minimizes the variance for v conditioned on the variance being minimized for all preceding vectors; formally, if there is no other unbiased (resp., nonnegative) estimator that has strictly lower variance on some data v and at most the variance of f̂ on all vectors that precede v. Order-based optimality implies variance optimality.

4 Independent PPS sampling

The outcome S(u, v) is determined by the data v and a random seed vector u ∈ [0, 1]^r with independent entry values:

  V∗(S) = {z | ∀i ∈ [r], (i ∈ S ∧ z_i = v_i) ∨ (i ∉ S ∧ z_i < τ_i u_i)}.

We derive the L estimator RĜ_p^(L), which is the unique symmetric, monotone, and variance+-optimal estimator, by applying our framework from [12] for constructing an estimator for a function f. The application has two ingredients. The first is a method to construct a ≺-optimal estimator f̂^(≺) with respect to a partial order ≺ on data vectors. The second is to identify a partial order ≺ such that the estimator f̂^(≺) is nonnegative and, therefore, ≺+-optimal.

Review of the construction of f̂^(≺): The determining vector φ(S) of an outcome S is a ≺-minimal vector in the closure of the consistent vectors: φ(S) = min_≺ cl(V∗(S)). Accordingly, we can specify the set φ^{−1}(v) of all outcomes determined by v, and the set S_0(v) of all outcomes that precede v, that is, outcomes consistent with v but determined by a vector that precedes v:

  φ^{−1}(v) = {S | v = φ(S)}
  S_0(v) = {S | v ∈ V∗(S) ∧ φ(S) ≺ v}.

The estimator f̂^(≺) is the same for all outcomes with the same determining vector, and therefore we can specify it in terms of the determining vector: f̂^(≺)(S) ≡ f̂^(≺)(φ(S)). We now state constraints that must be satisfied by f̂^(≺). The

contribution of the outcomes S_0(v) to the expectation E[f̂^(≺) | v] is f_0(v) = E[f̂^(≺) | S_0(v), v] PR[S_0(v) | v], where E[f̂^(≺) | S_0(v), v] is the expectation of f̂^(≺) on outcomes that precede v and PR[S_0(v) | v] is the probability that the outcome precedes v when the data is v. For all vectors v ∈ V, we require (this is necessary for unbiasedness) that

  if PR[φ^{−1}(v) | v] = 0, then f_0(v) ≡ f(v);   (4)

  else, f̂^(≺)(v) = (f(v) − f_0(v)) / PR[φ^{−1}(v) | v],   (5)

where PR[φ^{−1}(v) | v] is the probability that the outcome is determined by v when the data vector is v.

Choice of ≺: For RG_p, we choose ≺ so that the relation between vectors is according to an increasing lexicographic order on the lists L(v), which we define to be the sorted multiset of differences {v_i − min(v) | i ∈ [r]}. The ≺-minimal vectors are those with all entries equal. With our choice of ≺, the determining vector φ(S) is as follows: if S = ∅ (no entries are sampled), φ(S) = 0. Otherwise, if h ∈ S then φ(S)_h = v_h, and if h ∉ S then φ(S)_h = min{min_{i∈S} v_i, min_{i∉S} u_i τ_i}. The mapping of outcomes to determining vectors for r = 2 is shown in Table 1 (right).

  determining vector (v_1, v_2)   RĜ_p^(L)(v)
  v = (0, 0):                     0
  v_1 ≥ v_2 > τ_2:                τ_1 (v_1 − v_2)^p / min{τ_1, v_1}
  v_1 ≥ v_2, v_2 ≤ τ_2:           (p τ_1 τ_2 / min{v_1, τ_1}) ∫_{max{0, v_1 − τ_2}}^{v_1 − v_2} y^{p−1}/(v_1 − y) dy + τ_1 max{0, v_1 − τ_2}^p / min{v_1, τ_1}

  outcome S     φ(S)_1               φ(S)_2
  S = ∅:        0                    0
  S = {1}:      v_1                  min{u_2 τ_2, v_1}
  S = {2}:      min{u_1 τ_1, v_2}    v_2
  S = {1, 2}:   v_1                  v_2

Table 1: Left: the estimator RĜ_p^(L) for p > 0 and r = 2 over independent samples, stated as a function of the determining vector (v_1, v_2) when v_1 ≥ v_2 (the case v_2 > v_1 is symmetric). Right: the mapping of outcomes to determining vectors.

Derivation of RĜ_p^(L): To obtain the estimator, we solve (5) for all v such that PR[φ^{−1}(v) | v] > 0 and verify that f_0(v) ≡ f(v) for all v such that PR[φ^{−1}(v) | v] = 0. The estimator RĜ_p^(L) (p > 0) for r = 2 is specified in Table 1 through a mapping of determining vectors to estimate values. We can verify that for all p > 0 the estimator RĜ_p^(L) is nonnegative, monotone (RĜ_p^(L)(v, x) is non-increasing for x ∈ (0, v]), and has finite variances (which follows from ∫_0^v RĜ_p^(L)(v, x)^2 dx < ∞). Table 2 shows explicit expressions for RĜ^(L) and RĜ_2^(L). Vectors with PR[φ^{−1}(v) | v] = 0 are exactly those with one positive and one zero entry, and we can verify that (4) is satisfied, that is, that RĜ_p^(L) (p > 0) is unbiased on these vectors. We conjecture that a solution of (5) for r > 2 is also nonnegative and monotone.

  determining vector (v_1, v_2)   RĜ^(L)(v)
  v = (0, 0):                     0
  v_1 ≥ v_2 > τ_2:                τ_1 (v_1 − v_2) / min{τ_1, v_1}
  v_1 ≥ v_2, v_2 ≤ τ_2:           (τ_1 τ_2 / min{τ_1, v_1}) ln(min{v_1, τ_2}/v_2) + τ_1 max{0, v_1 − τ_2} / min{v_1, τ_1}

  determining vector (v_1, v_2)   RĜ_2^(L)(v)
  v = (0, 0):                     0
  v_1 ≥ v_2 > τ_2:                τ_1 (v_1 − v_2)^2 / min{τ_1, v_1}
  v_1 ≥ v_2, v_2 ≤ τ_2:           (2 τ_1 τ_2 / min{τ_1, v_1}) (v_2 − min{v_1, τ_2} + v_1 ln(min{v_1, τ_2}/v_2)) + τ_1 max{0, v_1 − τ_2}^2 / min{v_1, τ_1}

Table 2: Explicit form of the estimators RĜ^(L) and RĜ_2^(L) for r = 2 over independent samples. Each estimator is stated as a function of the determining vector (v_1, v_2) when v_1 ≥ v_2 (the case v_2 ≥ v_1 is symmetric).

We now provide details on the derivation. We consider vectors v in increasing ≺ order and solve (5) for f̂^(≺) on outcomes with determining vector v. We express RĜ_p^(L)(v, v − ∆) (p > 0, ∆ ∈ [0, v]) as a function of τ_1 and τ_2.



• Case v − ∆ ≥ τ_2: The estimate can be positive only when u_1 τ_1 ≤ v, which happens with probability min{1, v/τ_1}. We solve the equality ∆^p = min{1, v/τ_1} RĜ_p^(L), obtaining

  RĜ_p^(L)(v, v − ∆) = τ_1 ∆^p / min{v, τ_1}.   (6)

• Case v − ∆ < τ_2: From (5),

  ∆^p = (min{v, τ_1}/τ_1) ((v − ∆)/τ_2) RĜ_p^(L)(v, v − ∆) + (min{v, τ_1}/(τ_1 τ_2)) ∫_{max{0, v − τ_2}}^{∆} RĜ_p^(L)(v, v − y) dy.

Taking a partial derivative with respect to ∆, we obtain

  ∂RĜ_p^(L)(v, v − ∆)/∂∆ = (p τ_1 τ_2 / min{v, τ_1}) ∆^{p−1}/(v − ∆).

We use the boundary value at ∆ = max{0, v − τ_2}:

  RĜ_p^(L)(v, min{v, τ_2}) = τ_1 max{0, v − τ_2}^p / min{v, τ_1},

and obtain

  RĜ_p^(L)(v, v − ∆) = (p τ_1 τ_2 / min{v, τ_1}) ∫_{max{0, v − τ_2}}^{∆} y^{p−1}/(v − y) dy + τ_1 max{0, v − τ_2}^p / min{v, τ_1}.

The special case τ_1 = τ_2 = τ: The estimators RĜ^(L) and RĜ_2^(L) as a function of the determining vector, and their variance, are provided in Tables 3 and 4. For data vectors with v_1 ≥ v_2 ≥ τ, RĜ^(L) = v_1 − v_2 and VAR[RĜ^(L)] = 0. If v_1 ≥ τ ≥ v_2, RĜ^(L) = τ ln(τ/v_2) + v_1 − τ and VAR[RĜ^(L)] = −2τ v_2 ln(τ/v_2) − v_2^2 + τ^2. Finally, if v_2 ≤ v_1 ≤ τ, RĜ^(L)(v_1, v_2) = (τ^2/v_1) ln(v_1/v_2) and

  VAR[RĜ^(L) | (v_1, v_2)] = (v_1 v_2/τ^2) ((τ^2/v_1) ln(v_1/v_2) − (v_1 − v_2))^2 + (1 − v_1^2/τ^2)(v_1 − v_2)^2
                             + (v_1/τ) ∫_{v_2/τ}^{v_1/τ} ((τ^2/v_1) ln(v_1/(τ y)) − (v_1 − v_2))^2 dy
                           = 2τ^2 (1 − (v_2/v_1) ln(v_1/v_2) − v_2/v_1) − (v_1 − v_2)^2.

  determining vector v_1 ≥ v_2   RĜ^(L)(v_1, v_2)
  v_1 ≥ v_2 ≥ τ:                 v_1 − v_2
  v_1 ≥ τ ≥ v_2:                 τ ln(τ/v_2) + v_1 − τ
  v_2 ≤ v_1 ≤ τ:                 (τ^2/v_1) ln(v_1/v_2)

  data v_1 ≥ v_2                 VAR[RĜ^(L) | v]
  v_1 ≥ v_2 ≥ τ:                 0
  v_1 ≥ τ ≥ v_2:                 −2τ v_2 ln(τ/v_2) − v_2^2 + τ^2
  v_2 ≤ v_1 ≤ τ:                 2τ^2 (1 − (v_2/v_1) ln(v_1/v_2) − v_2/v_1) − (v_1 − v_2)^2

Table 3: Estimator RĜ^(L) and its variance for independent samples.

Similarly, for v_2 ≤ v_1 ≤ τ, RĜ_2^(L)(v_1, v_2) = 2τ^2 (ln(v_1/v_2) − (v_1 − v_2)/v_1). The variance when v_1 ≥ τ is the same as for shared-seed sampling (see next section).
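The entries of Table 3 can be sanity-checked numerically: build the estimator from the outcome-to-determining-vector mapping of Table 1, then integrate deterministically over the two independent seeds (a sketch; the data values below are arbitrary test inputs):

```python
import math

def estimate(u1, u2, v1, v2, tau):
    """L estimator for p = 1, tau1 = tau2 = tau (Tables 1 and 3)."""
    s1, s2 = v1 >= tau * u1, v2 >= tau * u2
    if s1 and s2:
        d = (v1, v2)
    elif s1:
        d = (v1, min(u2 * tau, v1))
    elif s2:
        d = (min(u1 * tau, v2), v2)
    else:
        d = (0.0, 0.0)
    a, b = max(d), min(d)  # the estimate is symmetric in the entries
    if b >= tau:
        return a - b
    if a >= tau:
        return tau * math.log(tau / b) + a - tau
    return (tau ** 2 / a) * math.log(a / b) if a > 0 else 0.0

# Deterministic (midpoint-rule) expectation over the two seeds.
v1, v2, tau, n = 0.6, 0.3, 1.0, 800
vals = [estimate((i + 0.5) / n, (j + 0.5) / n, v1, v2, tau)
        for i in range(n) for j in range(n)]
mean = sum(vals) / n ** 2
var = sum(x * x for x in vals) / n ** 2 - mean ** 2
print(round(mean, 3))  # close to v1 - v2 = 0.3 (unbiased)
# var should be close to the Table 3 entry
# 2*tau^2*(1 - (v2/v1)*ln(v1/v2) - v2/v1) - (v1 - v2)^2.
```

The quadrature over the unit square replaces Monte Carlo sampling of the seeds, so the check is reproducible.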

5 Shared-seed sampling

The outcome S(u, v) is determined by the data v and a scalar seed value u ∈ (0, 1], drawn uniformly at random: entry i is included in S if and only if v_i ≥ τ_i(u), where τ_i is a non-decreasing continuous function with range containing (min V, max V).

  determining vector v_1 ≥ v_2   RĜ_2^(L)(v_1, v_2)
  v_1 ≥ v_2 ≥ τ:                 (v_1 − v_2)^2
  v_1 ≥ τ ≥ v_2:                 v_1^2 − τ^2 − 2τ(v_1 − v_2) + 2τ v_1 ln(τ/v_2)
  v_2 < v_1 ≤ τ:                 2τ^2 (ln(v_1/v_2) − (v_1 − v_2)/v_1)

  data v_1 ≥ v_2                 VAR[RĜ_2^(L) | v]
  v_1 ≥ v_2 ≥ τ:                 0
  v_1 ≥ τ ≥ v_2:                 −4τ v_1 v_2 (2v_1 − v_2) ln(τ/v_2) + τ^4/3 + 4v_1^2 τ^2 + 4v_1 v_2 τ^2 − 2τ^2 v_2^2 − 2v_1 τ^3 − 6v_1 v_2^2 τ + (8/3) τ v_2^3 − 4v_1^2 v_2^2 + 4v_1 v_2^3 − v_2^4
  v_2 ≤ v_1 ≤ τ:                 (2τ^2/(3v_1)) (5v_1^3 − 9v_1 v_2^2 + 4v_2^3) − 4τ^2 v_2 (2v_1 − v_2) ln(v_1/v_2) − (v_1 − v_2)^4

Table 4: Estimator RĜ_2^(L) and its variance for independent samples.

Our RG_p (p > 0) estimators are derived by applying our general theory of shared-seed estimators [14, 13]. We derive two unbiased, nonnegative, variance+-optimal estimators: the L estimator RĜ_p^(L) and the U estimator RĜ_p^(U). As a reference point, we also consider, for each vector v, the v-optimal estimate values RĜ_p^(v). An estimator is v-optimal if, amongst all estimators that are nonnegative and unbiased for all data, it has the minimum possible variance when the data is v. It turns out that the values assumed by a v-optimal estimator on outcomes consistent with v are unique, up to equivalence, and we refer to them as the v-optimal estimates. We compute closed-form expressions for the estimators and variances when τ has all entries equal (to the scalar τ). The expressions for the upward-only and downward-only variants follow those for RĜ_p and are omitted.

Structure of the set of outcomes: The set of data vectors consistent with outcome S(u, v) is V∗(S) = {z | ∀i ∈ [r], (i ∈ S ∧ z_i = v_i) ∨ (i ∉ S ∧ z_i < τ_i(u))}. From the outcome S(u, v), we can determine not only V∗(S(u, v)) but also V∗(S(x, v)) for all x ≥ u. Observe that the sets V∗(S(u, z)) are the same for all consistent data vectors z ∈ V∗(S(u, v)). Fixing the data v, the upper bounds τ_i(u) on the values of entries that are not sampled are non-decreasing functions of u; therefore, the set V∗(S(u, v)) is non-decreasing with u and the set of sampled entries is non-increasing. This means that the information on the data that we can glean from the outcome increases as u decreases.

The lower bound function: To proceed, we need to define the lower bound function RG_p:

  RG_p(S) = inf_{v ∈ V∗(S)} RG_p(v),

which maps an outcome S to the infimum of the RG_p values over vectors consistent with the outcome. For RG, the lower bound is the difference between a lower bound on the maximum entry and an upper bound on the minimum entry:

  RG(S) = max_{i∈S} v_i − min{min_{i∈S} v_i, min_{i∉S} τ_i(u)}.

The lower bound on RG_p is the pth power of the respective bound on RG, that is, RG_p(S) = RG(S)^p. For S(u, v) we use the notation RG_p^{(v)}(u) ≡ RG_p(S), which is convenient when we fix v while varying the seed u. For PPS sampling with all-entries-equal τ:

  condition                      |S|            RG(S)
  u > max(v)/τ:                  0              0
  min(v)/τ ≤ u ≤ max(v)/τ:       1 . . . r−1    max(v) − uτ
  u < min(v)/τ:                  r              RG(v)

v-optimality: For a data vector v, we can determine when a nonnegative and unbiased estimator RĜ_p has minimum variance on v. We use the notation H_{RG_p}^{(v)}(u) for the lower boundary of the convex hull (lower hull) of RG_p^{(v)}(u). This function is monotone non-increasing and therefore differentiable almost everywhere.


Theorem 5.1 ([14]): A nonnegative unbiased estimator RĜ_p minimizes VAR[RĜ_p | v]

  ⇐⇒ H_{RG_p}^{(v)}(u) = ∫_u^1 RĜ_p(v, x) dx

  ⇐⇒ almost everywhere, RĜ_p^{(v)}(u) = −dH_{RG_p}^{(v)}(u)/du.   (7)

Note that the specification (7) of the v-optimal estimates on outcomes consistent with v is unique (in an almost-everywhere sense). The estimates (7) are monotone non-increasing in u. Observe that the specifications for different vectors with overlapping sets of consistent outcomes can be inconsistent; thus, it is not possible to obtain a single universally optimal estimator that is v-optimal for all v.

We can now specify RĜ_p^{(v)} for PPS sampling with all-entries-equal τ. The function RG_p^{(v)}(u) equals max{0, max(v) − τ}^p for u ≥ min{1, max(v)/τ} and equals RG_p(v) for u ≤ min(v)/τ. Therefore, when max(v) ≤ τ, the lower hull for u ≥ max(v)/τ is H_{RG_p}^{(v)}(u) = 0 and RĜ_p^{(v)}(u) = 0.

For p ≤ 1, the function is concave for u ∈ [min(v)/τ, max(v)/τ]. The lower hull is therefore a linear function for u ≤ max(v)/τ: when max(v) ≤ τ, H_{RG_p}^{(v)}(u) = RG_p(v)(1 − uτ/max(v)), and when max(v) ≥ τ, H_{RG_p}^{(v)}(u) = RG_p(v) − u(RG_p(v) − (max(v) − τ)^p). The v-optimal estimator is therefore constant for u ≤ min{1, max(v)/τ}: RĜ_p^{(v)}(u) = RG_p(v) τ/max(v) when max(v) ≤ τ, and RĜ_p^{(v)}(u) = RG_p(v) − (max(v) − τ)^p when max(v) ≥ τ.

For p > 1, RG_p^{(v)}(u) is convex for u ∈ [min(v)/τ, max(v)/τ]. Geometrically, the lower hull follows the lower bound function for u > α, where α is the point at which the slope of the lower bound function equals the slope of the line segment connecting the current point to the point (0, RG_p(v)). For u ≤ α, the lower hull follows this line segment and is linear. Formally, the point α is the solution of

  RG_p(v) = (max(v) − xτ)^{p−1} (p x τ + max(v) − xτ).

If there is no solution α ∈ [min(v)/τ, min{1, max(v)/τ}], we use α = min{1, max(v)/τ}. The estimates for u ∈ [α, min{1, max(v)/τ}] are RĜ_p^{(v)}(u) = −dRG_p^{(v)}(u)/du = pτ(max(v) − uτ)^{p−1}, and for u ≤ α, RĜ_p^{(v)}(u) = (RG_p(v) − (max(v) − ατ)^p)/α.

Figure 3 (top) illustrates RG_p^{(v)} and the corresponding lower hull H_{RG_p}^{(v)} for example vectors with p ∈ {0.5, 1, 2}.

From RĜ_p^{(v)} we can compute, for any vector v, the minimum possible variance attainable for it by an unbiased nonnegative estimator:

  VAR[RĜ_p^{(v)}] = ∫_0^1 RĜ_p^{(v)}(u)^2 du − RG_p(v)^2.   (8)
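For p = 1 and max(v) ≤ τ, the v-optimal estimates above are piecewise constant, so unbiasedness and the minimum variance (8) can be verified in closed form (a sketch; the specific numbers are arbitrary test values):

```python
# v-optimal estimates for p = 1, all-entries-equal tau, max(v) <= tau:
# the estimate equals RG(v) * tau / max(v) for u <= max(v)/tau, else 0.
def v_opt(u, rg_v, vmax, tau):
    return rg_v * tau / vmax if u <= vmax / tau else 0.0

rg_v, vmax, tau = 0.5, 0.8, 1.0  # e.g. v = (0.8, 0.3)

# Unbiasedness: E[estimate] = (vmax/tau) * (rg_v * tau / vmax) = rg_v.
mean = (vmax / tau) * v_opt(0.0, rg_v, vmax, tau)
print(mean)

# Minimum variance (8): integral of estimate^2 over u, minus rg_v^2,
# which simplifies to rg_v^2 * (tau/vmax - 1).
var = (vmax / tau) * v_opt(0.0, rg_v, vmax, tau) ** 2 - rg_v ** 2
print(var)
```

No estimator that is nonnegative and unbiased on all data can beat `var` on this vector, which is what makes it the natural competitiveness benchmark.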

We use the v-optimal estimates as a reference point to measure the "variance competitiveness" of estimators.

The L estimator:

Theorem 5.2 ([13]): The estimator RĜ_p^(L) (p > 0), specified as the solution of

  ∀v ∀ρ: RĜ_p^(L)(v, ρ) = −∫_ρ^1 (dRG_p^{(v)}(u)/du) (1/u) du,

has the following properties:
• It is nonnegative and unbiased.
• It is the unique (up to equivalence) variance+-optimal monotone estimator.
• It is ≺+-optimal with respect to the partial order ≺ in which v ≺ z ⇐⇒ RG_p(v) < RG_p(z).

The U estimator: The estimator RĜ_p^(U) (p > 0) is specified as the solution of

  RĜ_p^(U)(ρ, v) = sup_{0≤η} inf_{z∈V∗(S)} RG_p(z).

• It has finite variances for all data vectors.

The U estimator "prioritizes" data where the range (or, when aggregated, the difference) is large. In particular, it is the nonnegative unbiased estimator with minimum variance on data with min(v) = 0. The solution for PPS with all-entries-equal τ is provided as Algorithm 1 (see Appendix A for the calculation). The estimator RĜ_p^(U) and its variance for p = 1, 2 are provided in Tables 8 and 9 (see Appendix B for details).


Figure 4: Variance (normalized by the square of the expectation) of the \hat{RG}_p^{(L)} estimator over independent samples and of \hat{RG}_p^{(L)} and \hat{RG}_p^{(U)} over shared-seed samples. Sampling with all-entries-equal τ. (A): data with max(v) = 0.25τ. (B): data with max(v) = 0.01τ. (C): ratio VAR[\hat{RG}_p^{(L)}]/VAR[\hat{RG}_p^{(U)}] for shared-seed sampling, for selected ratios max(v)/τ. Sweeping min(v). Top row: p = 1; bottom row: p = 2.

Choosing an estimator. How should we choose between the L and U estimators? Figure 3 shows the v-optimal estimates and the L and U estimators for example vectors, illustrating their form and the monotonicity of the L and v-optimal estimators. The estimators and their variance depend only on τ and the maximum and minimum entry values max(v) and min(v). For all-entries-equal τ and max(v) ≤ τ, we study the dependence of the variance on the ratio min(v)/max(v). The variance of \hat{RG}_p^{(U)}

is 0 when RG(v) = 0. The estimator \hat{RG}_p^{(U)} has lower variance when min(v)/max(v) is sufficiently small: the solution φ_p of the equation VAR[\hat{RG}_p^{(U)}|v] = VAR[\hat{RG}_p^{(L)}|v], viewed as an equation in x = min(v)/max(v), is such that

    VAR[\hat{RG}_p^{(U)}|v] < VAR[\hat{RG}_p^{(L)}|v] ⟺ min(v)/max(v) < φ_p.

For p = 1, φ_1 ≈ 0.285 (the solution of the equality (1 − x)/(2x) = ln(1/x)). For p = 2, φ_2 ≈ 0.258. This suggests selecting an estimator according to the expected characteristics of the data: if we expect RG(v) ≥ (1 − φ_p) max(v), we choose \hat{RG}_p^{(U)}, and otherwise we choose \hat{RG}_p^{(L)}.

The variance of the L estimator over independent samples, and of the L and U estimators over shared-seed samples, is illustrated in Figure 4. The figure also illustrates the relation between the variance of the shared-seed L and U


dataset    # keys     p1%    p2%    Σ_{i,h} v_i(h)   p1%     p2%     L1/Σ     L1+/Σ    L1−/Σ
destIP     3.8×10^4   65%    65%    1.1×10^6         49%     51%     0.36     0.19     0.18
Server     2.7×10^5   53%    56%    2.9×10^6         50%     50%     0.75     0.38     0.37
Surnames   1.9×10^4   100%   100%   8.9×10^7         48.6%   51.4%   0.094    0.0617   0.0327
OSPD8      7.5×10^5   99%    99%    1.57×10^10       46.8%   53.2%   0.0826   0.0727   0.0099

Table 5: Datasets. The table shows the total number of distinct keys with a positive value in at least one of the two instances, the corresponding percentage in each instance, the total sum of values Σ_{i,h} v_i(h) ≡ Σ_{h∈K} Σ_{i∈[2]} v_i(h), the fraction (shown as a percentage) in instance i = 1, 2, i.e., Σ_{h∈K} v_i(h) / Σ_{i,h} v_i(h), and the normalized L1, L1+, and L1− differences.

estimators. When max(v)/τ ≪ 1 (which we expect to be a prevailing scenario), VAR[\hat{RG}^{(L)}] is nearly at most 2 times VAR[\hat{RG}^{(U)}], but as min(v) → max(v), the ratio VAR[\hat{RG}^{(U)}]/VAR[\hat{RG}^{(L)}] is not bounded. When max(v)/τ is close to 1, the variance of the U estimator is close to 0, and VAR[\hat{RG}^{(L)}]/VAR[\hat{RG}^{(U)}] is not bounded. Interestingly, for p = 2, the variance of the U estimator is always at least (1/3)RG_4(v), and thus, using (12), the variance of the L estimator is at most 4.4 times the variance of the v-optimal estimator (and thus of the U estimator).
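The constant φ_1 above can be reproduced numerically; the sketch below (ours, not from the paper) solves (1 − x)/(2x) = ln(1/x) on (0, 1/2) by bisection.

```python
import math

def phi1(tol=1e-10):
    """Solve (1 - x)/(2x) = ln(1/x) for x in (0, 1/2) by bisection.
    f(x) = (1 - x)/(2x) - ln(1/x) is positive as x -> 0+ and negative at x = 1/2,
    so the sign change brackets the root."""
    f = lambda x: (1 - x) / (2 * x) - math.log(1 / x)
    lo, hi = 1e-6, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(phi1())  # ~0.2846, matching phi_1 ~ 0.285 in the text
```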

6 Experimental Evaluation

We study the estimation quality of our L_p^p estimators, recalling that the L_p^p estimate is the sum of per-key RG_p estimates: \hat{L}_p^p = Σ_{h∈K} \hat{RG}_p(v(h)). This estimator is unbiased, with expectation L_p^p = Σ_{h∈K} RG_p(v(h)), and its variance is additive: Σ_{h∈K} VAR[\hat{RG}_p(h)]. The squared coefficient of variation is the ratio of the variance to the square of the expectation:

    CV²(\hat{L}_p^p) = ( Σ_{h∈K} VAR[\hat{RG}_p | v(h)] ) / ( Σ_{h∈K} RG_p(v(h)) )².

Figure 5 shows the squared coefficient of variation CV² of our L_p^p estimators as a function of the sampled fraction of the dataset. Each of the two instances was subjected to Poisson PPS [22] sampling (results are essentially identical for priority sampling [27, 18]). We study accuracy when applying the single-key estimator \hat{RG}_p^{(L)} for independent samples of the instances, and the estimators \hat{RG}_p^{(L)} and \hat{RG}_p^{(U)} for shared-seed (coordinated) samples of the two instances, for p = 1, 2. We used 4 datasets, with properties summarized in Table 5.

destIP: Keys: (anonymized) IP destination addresses. Values: the number of IP flows to the destination IP (source: IP packet traces). Instances: two consecutive time periods.
Server: Keys: (anonymized) source IP addresses. Values: the number of HTTP requests issued to the server from the address (source: Web server log). Instances: two consecutive time periods.
Surnames: Keys: the 18.5×10^3 most common surnames in the US. Values: the number of occurrences of the surname in English books digitized by Google and published within the time period [26]. Instances: the years 2007 and 2008.
OSPD8: Keys: 7.5×10^4 8-letter words that appear in the Official Scrabble Players Dictionary (OSPD). Values: the number of occurrences of the word in English books digitized by Google and published within the time period [26]. Instances: the years 2007 and 2008.

We can see, qualitatively, that all estimators, even over independent samples, are satisfactory, in that the CV is small for a sample that is a small fraction of the full data set. The monotone estimator \hat{RG}_p^{(L)} over coordinated (shared-seed) samples outperforms, by orders of magnitude, the monotone estimator \hat{RG}_p^{(L)} over independent samples. The gap widens for more aggressive sampling.
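As an illustration of the overall pipeline (ours, not the experimental code used in the paper), the sketch below applies the \hat{RG}^{(L)} estimator of Table 6 to shared-seed PPS samples of two toy instances with a common per-key threshold τ, and verifies that the expected estimate, computed by integrating over the shared seed, equals the L1 distance. The instance values and τ are made up.

```python
import math

TAU = 100.0  # common PPS threshold (all-entries-equal tau); illustrative value

def rg_L_estimate(v1, v2, u, tau=TAU):
    """Table 6 estimator RG-hat^(L) for one key under shared-seed PPS:
    entry v_i is sampled iff u < v_i/tau (values here are <= tau).
    Uses only information available from the sample."""
    s = [x for x in (v1, v2) if u < x / tau]
    if not s:
        return 0.0                                   # |S| = 0
    vmax = max(s)
    vmin = min(v1, v2) if len(s) == 2 else u * tau   # known lower bound when unseen
    return (max(vmax - tau, 0.0) - max(vmin - tau, 0.0)
            + tau * math.log(min(vmax, tau) / min(vmin, tau)))

instance1 = {"a": 40.0, "b": 10.0, "c": 70.0}
instance2 = {"a": 25.0, "b": 10.0, "c": 90.0}

# Expected estimate per key, by numeric integration over the shared seed u:
n = 200000
l1_hat_expect = 0.0
for h in instance1:
    v1, v2 = instance1[h], instance2[h]
    l1_hat_expect += sum(rg_L_estimate(v1, v2, (i + 0.5) / n) for i in range(n)) / n

l1 = sum(abs(instance1[h] - instance2[h]) for h in instance1)
print(abs(l1_hat_expect - l1) < 0.05)  # the estimator is unbiased
```

With a shared seed, whenever the smaller entry of a key is sampled the larger one is as well, which is what makes the per-key estimate computable from the sample alone.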
The first two datasets (destIP and Server) exhibit a significant difference between the instances: the L1 distance is a large fraction of the total sum of values Σ_{h∈K} Σ_{i∈[2]} v_i(h). On these datasets, \hat{RG}_p^{(U)} outperforms \hat{RG}_p^{(L)} on shared-seed samples. The last two datasets (Surnames and OSPD8) have a small difference between the instances, and \hat{RG}_p^{(L)} outperforms \hat{RG}_p^{(U)} on shared-seed samples. These trends are more pronounced for the higher moment p = 2: in this case, on the Surnames and OSPD8 datasets, \hat{RG}_p^{(L)} over independent samples outperforms \hat{RG}_p^{(U)} over shared-seed samples. We can see that we can significantly improve accuracy by tailoring the selection of the estimator to properties of the data. The performance of the U estimator, however, can significantly degrade when the instances are similar, whereas the competitive L estimator is guaranteed not to be too far off. Therefore, when there is no prior knowledge of the difference, we suggest using the L estimator.

The datasets also differ in the symmetry of the change. The change is more symmetric in the first two datasets (L_{p+} ≈ L_{p−}), whereas there is a general growth trend (L_{p+} ≫ L_{p−}) in the last two datasets. We did not include performance figures for the asymmetric differences Σ_{h∈K} RG_{p+}(v(h)) and Σ_{h∈K} RG_{p−}(v(h)), but the trends are similar to those of the symmetric variants.

7 Conclusion Difference queries are essential for monitoring, planning, and anomaly and change detection. Random sampling is an important tool for retaining the ability to query data under resource limitations. We provide the first satisfactory solution for estimating differences over sampled data sets. Our solution is comprehensive, covering common sampling schemes. It is supported by rigorous analysis and novel techniques that also allow us to gain deeper understanding and establish optimality. We demonstrated that our estimators perform well on diverse data sets.

References

[1] K. S. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. In SIGMOD, pages 199–210. ACM, 2007.
[2] K. R. W. Brewer, L. J. Early, and S. F. Joyce. Selecting several samples from a single population. Australian Journal of Statistics, 14(3):231–239, 1972.
[3] A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21–29. IEEE, 1997.
[4] A. Z. Broder. Identifying and filtering near-duplicate documents. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, volume 1848 of LNCS, pages 1–10. Springer, 2000.
[5] M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation error guarantees for distinct values. In Proceedings of ACM Principles of Database Systems, 2000.
[6] S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change detection in hierarchically structured information. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 1996.
[7] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441–453, 1997.
[8] E. Cohen, N. Duffield, C. Lund, M. Thorup, and H. Kaplan. Efficient stream sampling for variance-optimal estimation of subset sums. SIAM J. Comput., 40(5), 2011.
[9] E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In Proceedings of the ACM PODC'07 Conference, 2007.
[10] E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In Proceedings of the 34th VLDB Conference, 2008.
[11] E. Cohen and H. Kaplan. Leveraging discarded samples for tighter estimation of multiple-set aggregates. In Proceedings of the ACM SIGMETRICS'09 Conference, 2009.
[12] E. Cohen and H. Kaplan. Get the most out of your sample: Optimal unbiased estimators using partial information. In Proc. of the 2011 ACM Symp. on Principles of Database Systems (PODS 2011). ACM, 2011. Full version: http://arxiv.org/abs/1203.4903.

Figure 5: Datasets, top to bottom: destIP, Server, Surnames, OSPD8. CV² (var/mu²) as a function of the fraction of sampled items; left column: p = 1, right column: p = 2. Curves: the L estimator over independent samples ("ind L"), and the L and U estimators over shared-seed samples ("shared L", "shared U").

[13] E. Cohen and H. Kaplan. A case for customizing estimators: Coordinated samples. Technical Report cs.ST/1212.0243, arXiv, 2012.
[14] E. Cohen and H. Kaplan. What you can do with coordinated samples. Technical Report cs.DB/1206.5637, arXiv, 2012.
[15] E. Cohen, H. Kaplan, and S. Sen. Coordinated weighted sampling for estimating aggregates over multiple weight assignments. Proceedings of the VLDB Endowment, 2(1–2), 2009. Full version: http://arxiv.org/abs/0906.4560.
[16] T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In Proc. SIGMOD Conference, pages 240–251, 2002.
[17] W. Dong, M. Charikar, and K. Li. Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. In SIGIR, 2008.
[18] N. Duffield, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums. J. Assoc. Comput. Mach., 54(6), 2007.
[19] P. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proceedings of the 13th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, 2001.
[20] P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In International Conference on Very Large Databases (VLDB), pages 541–550, 2001.
[21] M. Hadjieleftheriou, X. Yu, N. Koudas, and D. Srivastava. Hashed samples: Selectivity estimators for set similarity selection queries. In Proceedings of the 34th VLDB Conference, 2008.
[22] J. Hájek. Sampling from a Finite Population. Marcel Dekker, New York, 1981.
[23] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
[24] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In VLDB, 2004.
[25] D. E. Knuth. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Addison-Wesley, 1969.
[26] J. B. Michel, Y. K. Shen, A. Presser Aiden, A. Veres, M. K. Gray, W. Brockman, The Google Books Team, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. A. Nowak, and E. L. Aiden. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011. http://www.sciencemag.org/content/331/6014/176.
[27] E. Ohlsson. Sequential Poisson sampling. J. Official Statistics, 14(2):149–162, 1998.
[28] E. Ohlsson. Coordination of PPS samples over time. In The 2nd International Conference on Establishment Surveys, pages 255–264. American Statistical Association, 2000.
[29] B. Rosén. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373–397, 1972.
[30] B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135–158, 1997.
[31] P. J. Saavedra. Fixed sample size PPS approximations with a permanent random number. In Proc. of the Section on Survey Research Methods, Alexandria VA, pages 697–700. American Statistical Association, 1995.
[32] M. Szegedy. The DLT priority sampling is essentially optimal. In Proc. 38th Annual ACM Symposium on Theory of Computing. ACM, 2006.
[33] J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, 1985.


Appendix

\hat{RG}^{(L)}:
    |S| = 0:   0
    |S| ≥ 1:   max{max(v) − τ, 0} − max{v_min − τ, 0} + τ ln( min{max(v), τ} / min{v_min, τ} ),   (13)

where v_min denotes min(v) when |S| = r (all entries sampled) and ρτ otherwise.

Condition                        VAR[\hat{RG}^{(L)} | v]
min(v) ≥ τ                       0
max(v) ≤ τ, min(v) = 0           2RG(v)τ − RG(v)²
max(v) ≤ τ, min(v) > 0           2RG(v)τ − RG(v)² − 2τ min(v) ln(max(v)/min(v))
0 < min(v) ≤ τ ≤ max(v)          τ² − min(v)² − 2τ min(v) ln(τ/min(v))
0 = min(v), τ ≤ max(v)           τ² − min(v)²                                    (14)

Table 6: The estimator \hat{RG}^{(L)} (top) and its variance for data v (bottom), for shared-seed sampling.
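A quick numeric cross-check (ours, not from the paper) of the max(v) ≤ τ, min(v) > 0 row of (14): integrate the estimator and its square over the shared seed and compare against the closed form. The parameter values are made up.

```python
import math

tau, vmax, vmin = 100.0, 40.0, 25.0   # example with max(v) <= tau, min(v) > 0
rg = vmax - vmin

def est(u):
    # Table 6 estimator for shared-seed PPS with all-entries-equal tau,
    # specialized to max(v) <= tau: entry v_i is sampled iff u < v_i/tau.
    if u >= vmax / tau:
        return 0.0                            # |S| = 0
    lo = vmin if u < vmin / tau else u * tau  # known lower bound on min(v)
    return tau * math.log(vmax / lo)

n = 400000
m1 = sum(est((i + 0.5) / n) for i in range(n)) / n
m2 = sum(est((i + 0.5) / n) ** 2 for i in range(n)) / n
var_formula = 2 * rg * tau - rg ** 2 - 2 * tau * vmin * math.log(vmax / vmin)
print(abs(m1 - rg) < 0.01, abs((m2 - rg ** 2) - var_formula) < 1.0)
```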

\hat{RG}_2^{(L)}:
    |S| = 0:   0
    |S| ≥ 1:   max{max(v), τ}² − max{v_min, τ}² − 2 max{v_min, τ}(max(v) − v_min) + 2τ max(v) ln( min{max(v), τ} / min{v_min, τ} ),   (15)

where, as in Table 6, v_min denotes min(v) when |S| = r and ρτ otherwise.

Condition                  VAR[\hat{RG}_2^{(L)} | v]
min(v) ≥ τ                 0
max(v) ≤ τ                 −4τ max(v) min(v)(2 max(v) − min(v)) ln(max(v)/min(v)) + (2τ/3)(5 max(v)³ + 4 min(v)³ − 9 max(v) min(v)²) − (max(v) − min(v))⁴
min(v) ≤ τ ≤ max(v)        4 max(v) min(v) τ (min(v) − 2 max(v)) ln(τ/min(v)) + 4 max(v) min(v) τ² + τ⁴/3 + (8/3) min(v)³ τ − 6 max(v) min(v)² τ − 4 max(v)² min(v)² − min(v)⁴ + 4 max(v) min(v)³ + 4 max(v)² τ² − 2 max(v) τ³   (16)

Table 7: The estimator \hat{RG}_2^{(L)} (top) and its variance for data v (bottom), for shared-seed sampling.

A   Derivation of \hat{RG}_p^{(U)}

If ρτ > max(v), then no entry is sampled and \hat{RG}_p^{(U)}(ρ, v) = 0. Otherwise, when min(v) < ρτ ≤ max(v), noting that the supremum is obtained by a vector v′ with maximum entry max(v) and minimum entry 0, we have

    \hat{RG}_p^{(U)}(ρ, v) = inf_{0≤η<ρ} ( (max(v) − ητ)^p − ∫_ρ^{min{1, max(v)/τ}} \hat{RG}_p^{(U)}(u, v) du ) / (ρ − η).   (17)

Condition                  \hat{RG}^{(U)} (for ρτ ≤ max(v))                  VAR[\hat{RG}^{(U)} | v]
min(v) ≥ τ                 RG(v)                                            0
max(v) ≤ τ                 τ if ρτ > min(v), else 0                         RG(v)(τ − RG(v))
min(v) ≤ τ ≤ max(v)        max(v) if ρτ > min(v), else max(v) − τ           min(v)(τ − min(v))

Table 8: The estimator \hat{RG}^{(U)} (left) and VAR[\hat{RG}^{(U)} | v] (right) for shared-seed sampling.

Condition                                      \hat{RG}_2^{(U)} (for ρτ ≤ max(v))                                                       VAR[\hat{RG}_2^{(U)} | v]
min(v) ≥ τ                                     RG_2(v)                                                                                  0
max(v) ≤ τ                                     2τ(max(v) − ρτ) if ρτ > min(v), else 0                                                   RG_3(v)((4/3)τ − RG(v))
max(v) ≥ 2τ                                    max(v)² if ρτ > min(v), else max(v)² − 2τ max(v) + min(v)τ                               min(v)(τ − min(v))(2 max(v) − min(v))²
max(v)/τ ∈ [1, 2], min(v)/τ ≥ 2 − max(v)/τ     4τ(max(v) − τ) if ρτ > min(v), else (τ/min(v))RG_2(v) − 4τ(max(v) − τ)(τ/min(v) − 1)     (τ/min(v) − 1)(RG_2(v) − 4τ(max(v) − τ))²
max(v)/τ ∈ [1, 2], min(v)/τ < 2 − max(v)/τ     4τ(max(v) − τ) if ρ ≥ 2 − max(v)/τ; 2τ(max(v) − ρτ) if min(v)/τ < ρ < 2 − max(v)/τ; else 0     (4τ/3)RG_3(v) + (16τ/3)(max(v) − τ)³ − RG_4(v)

Table 9: The estimator \hat{RG}_2^{(U)} (left) and VAR[\hat{RG}_2^{(U)} | v] (right) for shared-seed sampling.

Algorithm 1: \hat{RG}_p^{(U)}(ρ, v) for PPS with all-entries-equal τ   ⊲ m = max(v), n = min(v); the estimate is 0 when ρτ > m
    if n ≥ τ then return (m − n)^p                                     ⊲ case: min(v) ≥ τ
    if m ≤ τ then                                                      ⊲ case: max(v) ≤ τ
        if ρτ > n then return pτ(m − ρτ)^{p−1} else return 0
    ⊲ case: n < τ < m
    if p > 1 then η0 ← (pτ − m)/((p − 1)τ)
    if p > 1 and η0 ∈ (0, 1) then                                      ⊲ subcase: η0 ∈ (0, 1)
        if ρ ≥ max{η0, n/τ} then return (m − η0 τ)^p / (1 − η0)
        if n/τ < ρ < η0 then return pτ(m − ρτ)^{p−1}
        if ρ ≤ n/τ ≤ η0 then return 0
        if ρ ≤ n/τ and n/τ ≥ η0 then return (τ/n)(m − n)^p − ((τ − n)/n)(m − η0 τ)^p / (1 − η0)
    else                                                               ⊲ subcase: p ≤ 1 or η0 ∉ (0, 1)
        if ρτ > n then return m^p
        else return (τ/n)(m − n)^p − m^p (τ/n − 1)
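Algorithm 1 can be transcribed directly; the sketch below (ours, not from the paper) implements it and verifies unbiasedness, for a few (max(v), min(v), τ, p) regimes, by integrating the estimate over the shared seed ρ. The parameter values are illustrative.

```python
def rg_u_estimate(rho, vmax, vmin, tau, p=2):
    """U estimator RG-hat_p^(U)(rho, v) for shared-seed PPS with
    all-entries-equal tau, following Algorithm 1 (m = max(v), n = min(v))."""
    m, n = vmax, vmin
    if rho * tau > m:                     # |S| = 0: nothing sampled
        return 0.0
    if n >= tau:                          # both entries always sampled
        return (m - n) ** p
    if m <= tau:                          # case max(v) <= tau
        return p * tau * (m - rho * tau) ** (p - 1) if rho * tau > n else 0.0
    # case n < tau < m
    eta0 = (p * tau - m) / ((p - 1) * tau) if p > 1 else -1.0
    if 0 < eta0 < 1:
        if rho >= max(eta0, n / tau):
            return (m - eta0 * tau) ** p / (1 - eta0)
        if n / tau < rho < eta0:
            return p * tau * (m - rho * tau) ** (p - 1)
        if n / tau <= eta0:               # here rho <= n/tau
            return 0.0
        return (tau / n) * (m - n) ** p - ((tau - n) / n) * (m - eta0 * tau) ** p / (1 - eta0)
    if rho * tau > n:
        return m ** p
    return (tau / n) * (m - n) ** p - m ** p * (tau / n - 1)

def mean(vmax, vmin, tau, p, n=200000):
    # Expectation over the shared seed rho, by midpoint quadrature on (0, 1).
    return sum(rg_u_estimate((i + 0.5) / n, vmax, vmin, tau, p) for i in range(n)) / n

cases = [(40, 25, 100, 1), (40, 25, 100, 2),   # max(v) <= tau
         (300, 50, 200, 2),                    # max(v)/tau in (1, 2)
         (300, 50, 100, 2), (300, 50, 100, 1)] # max(v) >= 2*tau
for vmax, vmin, tau, p in cases:
    assert abs(mean(vmax, vmin, tau, p) - (vmax - vmin) ** p) < (vmax - vmin) ** p * 1e-3
print("unbiased in all tested regimes")
```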

If ρτ ≤ min(v),

    \hat{RG}_p^{(U)}(ρ, v) = ( RG_p(v) − ∫_ρ^{min{1, max(v)/τ}} \hat{RG}_p^{(U)}(u, v) du ) / ρ
                           = ( RG_p(v) − ∫_{min{1, min(v)/τ}}^{min{1, max(v)/τ}} \hat{RG}_p^{(U)}(u, v) du ) / min{1, min(v)/τ}.   (18)

If min(v) ≥ τ, then |S| = r and \hat{RG}_p^{(U)} = \hat{RG}_p^{(L)} ≡ RG_p(v). If max(v) ≤ τ, the infimum in (17) is attained at the derivative of the lower bound function, and thus \hat{RG}_p^{(U)}(ρ, v) = pτ(max(v) − ρτ)^{p−1} when min(v) < ρτ ≤ max(v); by (18) the estimate is 0 when ρτ ≤ min(v), since ∫_{min(v)/τ}^{max(v)/τ} pτ(max(v) − uτ)^{p−1} du = RG_p(v). Summarizing, for max(v) ≤ τ:

    \hat{RG}_p^{(U)} = 0 when |S| = 0 or |S| = r, and pτ(max(v) − ρτ)^{p−1} when 1 ≤ |S| ≤ r − 1.   (19)

We now consider the case min(v) ≤ τ ≤ max(v), solving (17) for ρ > min(v)/τ. For ρ = 1 we obtain the equation

    \hat{RG}_p^{(U)}(1, v) = inf_{0≤η<1} (max(v) − ητ)^p / (1 − η).

The infimum is attained at η0 = (pτ − max(v))/((p − 1)τ). For p = 2, η0 = 2 − max(v)/τ; thus η0 ∈ (0, 1) ⟺ max(v)/τ ∈ (1, 2).

B   Variance of \hat{RG}_2^{(U)}

We use

    ∫ (RG_2(v) − 2τ max(v) + 2τ²u)² du = (RG_2(v) − 2τ max(v) + 2τ²u)³ / (6τ²).

We start with the case max(v) ≤ τ:

    VAR[\hat{RG}_2^{(U)} | v] = ∫_{min(v)/τ}^{max(v)/τ} (2τ(max(v) − uτ))² du − RG_4(v)
                              = (4τ/3)RG_3(v) − RG_4(v) = RG_3(v)((4/3)τ − RG(v)).

The case max(v) ≥ 2τ:

    VAR[\hat{RG}_2^{(U)} | v] = (1 − min(v)/τ)(max(v)² − RG_2(v))² + (min(v)/τ)(max(v)² − 2τ max(v) + min(v)τ − RG_2(v))²
                              = (1 − min(v)/τ) min(v)²(2 max(v) − min(v))² + (min(v)/τ)(τ − min(v))²(2 max(v) − min(v))²
                              = min(v)(τ − min(v))(2 max(v) − min(v))².

We next handle the case τ ≤ max(v) ≤ 2τ with min(v)/τ ≥ η0:

    VAR[\hat{RG}_2^{(U)} | v] = (1 − min(v)/τ)(4τ(max(v) − τ) − RG_2(v))² + (min(v)/τ)(τ/min(v) − 1)²(RG_2(v) − 4τ(max(v) − τ))²
                              = (τ/min(v) − 1)(RG_2(v) − 4τ(max(v) − τ))².

Lastly, for the case τ ≤ max(v) ≤ 2τ with min(v)/τ < η0:

    VAR[\hat{RG}_2^{(U)} | v] = (1 − η0)(4τ(max(v) − τ))² + ∫_{min(v)/τ}^{η0} (2τ(max(v) − uτ))² du − RG_4(v)
                              = (4τ/3)RG_3(v) + (16τ/3)(max(v) − τ)³ − RG_4(v).

C   Variance of \hat{RG}^{(L)} and \hat{RG}_2^{(L)}

The estimators \hat{RG}^{(L)} and \hat{RG}_2^{(L)}, provided in Tables 6 and 7, are obtained using (10). We calculate their variance.

Variance of \hat{RG}^{(L)}: When max(v) ≤ τ, we have \hat{RG}^{(L)} = τ ln(max(v)/(ρτ)) when 1 ≤ |S| ≤ r − 1, and \hat{RG}^{(L)} = τ ln(max(v)/min(v)) when |S| = r. The variance is

    VAR[\hat{RG}^{(L)} | v] = (1 − max(v)/τ) RG(v)² + ∫_{min(v)/τ}^{max(v)/τ} (RG(v) − τ ln(max(v)/(τy)))² dy + (min(v)/τ)(RG(v) − τ ln(max(v)/min(v)))²
                            = −2τ min(v) ln(max(v)/min(v)) + 2RG(v)τ − RG(v)².

When min(v) ≤ τ ≤ max(v), \hat{RG}^{(L)} = max(v) − τ + τ ln(1/ρ) when 1 ≤ |S| ≤ r − 1, and \hat{RG}^{(L)} = max(v) − τ + τ ln(τ/min(v)) when |S| = r. The variance is VAR[\hat{RG}^{(L)} | v] = τ² − min(v)² − 2τ min(v) ln(τ/min(v)).

Variance of \hat{RG}_2^{(L)}: If min(v) < τ ≤ max(v), \hat{RG}_2^{(L)} = max(v)² − τ² + 2τ(ρτ − max(v) + max(v) ln(1/ρ)) when |S| ∈ [r − 1], and \hat{RG}_2^{(L)} = max(v)² − τ² + 2τ(min(v) − max(v) + max(v) ln(τ/min(v))) when |S| = r. The variance is

    VAR[\hat{RG}_2^{(L)} | v] = 4 max(v) min(v) τ (min(v) − 2 max(v)) ln(τ/min(v)) + τ⁴/3 + (8/3) min(v)³ τ − 6 max(v) min(v)² τ − min(v)⁴ − 4 max(v)² min(v)² + 4 max(v) min(v)³ + 4 max(v)² τ² + 4 max(v) min(v) τ² − 2 max(v) τ³.

If max(v) < τ, \hat{RG}_2^{(L)} = 2τ(ρτ − max(v) + max(v) ln(max(v)/(ρτ))) when |S| ∈ [r − 1], and \hat{RG}_2^{(L)} = 2τ(min(v) − max(v) + max(v) ln(max(v)/min(v))) when |S| = r. The variance is

    VAR[\hat{RG}_2^{(L)} | v] = −4τ max(v) min(v)(2 max(v) − min(v)) ln(max(v)/min(v)) + (2τ/3)(5 max(v)³ − 9 max(v) min(v)² + 4 min(v)³) − RG_4(v).
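A numeric cross-check (ours, not from the paper) of the max(v) < τ case above: integrate \hat{RG}_2^{(L)} and its square over the shared seed and compare with the closed-form variance. The parameter values are made up.

```python
import math

tau, vmax, vmin = 100.0, 40.0, 25.0     # example with max(v) < tau
rg = vmax - vmin

def est2(u):
    # RG-hat_2^(L) for shared-seed PPS (Table 7), specialized to max(v) < tau:
    # entry v_i is sampled iff u < v_i/tau.
    if u >= vmax / tau:
        return 0.0                       # |S| = 0
    lo = vmin if u < vmin / tau else u * tau
    return 2 * tau * (lo - vmax + vmax * math.log(vmax / lo))

n = 500000
m1 = sum(est2((i + 0.5) / n) for i in range(n)) / n
m2 = sum(est2((i + 0.5) / n) ** 2 for i in range(n)) / n
var_formula = (-4 * tau * vmax * vmin * math.log(vmax / vmin) * (2 * vmax - vmin)
               + (2 * tau / 3) * (5 * vmax ** 3 - 9 * vmax * vmin ** 2 + 4 * vmin ** 3)
               - rg ** 4)
# Unbiasedness for RG_2, and agreement of the empirical variance with the formula:
print(abs(m1 - rg ** 2) < 0.1, abs((m2 - rg ** 4) - var_formula) < var_formula * 1e-3)
```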