arXiv:1504.07235v1 [stat.ML] 27 Apr 2015

Sign Stable Random Projections for Large-Scale Learning

Ping Li
Department of Statistics and Biostatistics
Department of Computer Science
Rutgers University
Piscataway, NJ 08854, USA
[email protected]

Abstract

In this paper, we study the use of "sign α-stable random projections" (where 0 < α ≤ 2) for building basic data processing tools in the context of large-scale machine learning applications (e.g., classification, regression, clustering, and near-neighbor search). After the processing by sign stable random projections, the inner products of the processed data approximate various types of nonlinear kernels depending on the value of α. Thus, this approach provides an effective strategy for approximating nonlinear learning algorithms essentially at the cost of linear learning. When α = 2, it is known that the corresponding nonlinear kernel is the arc-cosine kernel. When α = 1, the procedure approximates the arc-cos-χ² kernel (under certain conditions). When α → 0+, it corresponds to the resemblance kernel, which provides an exciting connection between two popular randomized algorithms: (i) stable random projections and (ii) b-bit minwise hashing. No theoretical results are known so far for other α values except for α = 2, 1, or 0+. From a practitioner's perspective, the method of sign α-stable random projections is ready to be tested for large-scale learning applications, where α can simply be viewed as a tuning parameter. What is missing in the literature is an extensive empirical study to show the effectiveness of sign stable random projections, especially for α ≠ 2 or 1. This paper supplies such a study on a wide variety of classification datasets. In particular, we compare, shoulder-by-shoulder, sign stable random projections with the recently proposed "0-bit consistent weighted sampling (CWS)" [12] (which is only for nonnegative data). We provide detailed comparisons on all 34 datasets used by [12]. In addition, we present a comparison on a larger dataset with 350,000 examples. For all datasets, we experiment with α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2}. For most datasets, sign stable random projections can approach (or in some cases even slightly exceed) the performance of 0-bit CWS, given enough projections. Typically, to reach the same accuracy, sign stable random projections would require significantly more projections than the number of samples needed by 0-bit CWS. There are also datasets for which sign stable random projections could not achieve the same accuracy as 0-bit CWS regardless of α. While the comparison results seem to favor 0-bit consistent weighted sampling (which is only for nonnegative data), the distinct advantage of sign stable random projections is that the method is applicable to general data types, not only nonnegative data. It is also an interesting research problem to combine 0-bit CWS with sign stable random projections, for example, via a strategy similar to "CoRE kernels" [11].


1 Introduction

In this paper, we focus on the idea of "sign α-stable random projections" and its applications in machine learning with massive (and possibly streaming [18]) data. Consider two data vectors u, v ∈ R^D from a data matrix. The central idea is to multiply them by a random projection matrix {s_ij}, i = 1, ..., D, j = 1, ..., k, whose entries s_ij are sampled i.i.d. from an α-stable distribution, denoted by S(α, 1). That is,

x_j = \sum_{i=1}^{D} u_i s_{ij}, \qquad y_j = \sum_{i=1}^{D} v_i s_{ij}, \qquad s_{ij} \sim S(\alpha, 1) \ \text{i.i.d.}, \qquad j = 1, 2, ..., k    (1)

The use of α-stable distributions was studied in the context of estimating frequency moments of data streams [7, 10] and in the recent work on "one scan 1-bit compressed sensing" [13]. Here, we adopt the parameterization of [20, 19] such that, if s ∼ S(α, d), then the characteristic function is E(e^{\sqrt{-1} s t}) = e^{-d |t|^\alpha}. When α = 2, S(2, d) is equivalent to a Gaussian distribution N(0, σ² = 2d). When α = 1, S(1, 1) is the standard Cauchy distribution. Although in general no closed-form density functions of α-stable distributions are available, one can easily sample from an α-stable distribution by, e.g., the classical CMS method [3]. Stable distributions with α < 2 are also known to be "heavy-tailed" distributions because, if s ∼ S(α, 1) with α < 2, then E(|s|^λ) = ∞ whenever λ ≥ α. This is probably the reason why stable distributions have rarely been used in machine learning and data mining applications.
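To make the sampling and projection steps concrete, here is a minimal numpy sketch (function names are illustrative, not from an existing library). It draws symmetric (β = 0) α-stable variates by the CMS recipe and keeps only the signs of the projected data:

```python
import numpy as np

def sample_symmetric_stable(alpha, size, rng):
    # Chambers-Mallows-Stuck sampler for symmetric alpha-stable variates,
    # normalized so that alpha = 2 gives N(0, 2), matching S(2, 1) above.
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)   # random angle
    W = rng.exponential(1.0, size)                 # unit exponential
    if alpha == 1.0:
        return np.tan(V)                           # standard Cauchy
    return (np.sin(alpha * V) / np.cos(V) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * V) / W) ** ((1.0 - alpha) / alpha))

def sign_stable_projections(X, k, alpha, seed=0):
    # Project the rows of X (n x D) with an i.i.d. alpha-stable matrix
    # and keep only the signs, as in (1).
    rng = np.random.default_rng(seed)
    S = sample_symmetric_stable(alpha, (X.shape[1], k), rng)  # D x k
    return (X @ S) > 0                                        # n x k boolean signs
```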

1.1 Sign Stable Random Projections

By the property of stable distributions, we have x_j ∼ S(α, Σ_{i=1}^{D} |u_i|^α) and y_j ∼ S(α, Σ_{i=1}^{D} |v_i|^α), j = 1, 2, ..., k. Unless α = 2, it might be difficult to imagine how one could make use of these (manually generated) heavy-tailed data in machine learning applications. Indeed, we do not directly use the projected data. Instead, in this paper, we only utilize the projected data through their signs, i.e., sign(x_j) and sign(y_j), which are well-behaved and can be used for building tools for large-scale machine learning. If x_j ≤ 0, we code x_j as a two-dimensional vector [0 1]; if x_j > 0, we code it as [1 0]. We then concatenate k such two-dimensional vectors to form a vector of length 2k (with exactly k 1's). We apply the same coding scheme to y_j (and all the projected data).

The signs, sign(x_j) and sign(y_j), are statistically dependent, and it is interesting (and in general challenging) to find out how the signs are related. When α = 2, the relationship between sign(x_j) and sign(y_j) is well-known [6, 4, 15]:

\alpha = 2: \quad \Pr\left(\mathrm{sign}(x_j) = \mathrm{sign}(y_j)\right) = 1 - \frac{1}{\pi} \cos^{-1} \rho_2, \qquad \rho_2 = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} |u_i|^2} \sqrt{\sum_{i=1}^{D} |v_i|^2}}    (2)

Thus, the "collision probability" is monotonic in ρ_2, which is the correlation coefficient. Although cos^{-1} ρ_2 is nonlinear, the estimator of the probability, i.e., \frac{1}{k}\sum_{j=1}^{k} 1\{\mathrm{sign}(x_j) = \mathrm{sign}(y_j)\}, can be viewed as an inner product once we expand each sign as either [0 1] or [1 0]. In other words, we only need to pay the cost of linear learning to approximately train a classifier originally based on nonlinear kernels.

It is not so straightforward to calculate the collision probability once α < 2. A recent work [16] focused on α = 1 and showed that, when u_i ≥ 0, v_i ≥ 0, and \sum_{i=1}^{D} u_i = \sum_{i=1}^{D} v_i = 1, we have

\alpha = 1: \quad \Pr\left(\mathrm{sign}(x_j) = \mathrm{sign}(y_j)\right) \approx 1 - \frac{1}{\pi} \cos^{-1} \rho_{\chi^2}, \qquad \rho_{\chi^2} = \sum_{i=1}^{D} \frac{2 u_i v_i}{u_i + v_i}    (3)

Note that the so-called χ²-kernel, ρ_{χ²}, is popular in computer vision, for data generated from histograms.

When α → 0+, [16] mentioned, as future work, that the collision probability is related to the "resemblance" when the data are nonnegative:

\alpha = 0+: \quad \Pr\left(\mathrm{sign}(x_j) = \mathrm{sign}(y_j)\right) = \frac{1}{2} + \frac{1}{2} R, \qquad R = \frac{\sum_{i=1}^{D} 1\{u_i > 0 \ \text{and} \ v_i > 0\}}{\sum_{i=1}^{D} 1\{u_i > 0 \ \text{or} \ v_i > 0\}}    (4)

Interestingly, this collision probability is essentially the same as the collision probability of “1-bit minwise hashing” [14].

For other α values, at this moment we cannot relate the collision probabilities to any known similarity measures. On the other hand, the estimator \frac{1}{k}\sum_{j=1}^{k} 1\{\mathrm{sign}(x_j) = \mathrm{sign}(y_j)\} (which is an inner product) is of course still a valid positive definite kernel for any α. Thus, we can in any case use sign α-stable random projections for building large-scale learning algorithms, where α can be viewed as an important tuning parameter. What is missing in the literature is an extensive empirical study, and our paper supplies such a study.
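As a concrete illustration of the coding scheme and of the estimator being an inner product, the following sketch (hypothetical helper names, numpy assumed) expands each sign into [1 0] / [0 1] and recovers the empirical collision frequency from a plain dot product:

```python
import numpy as np

def encode_signs(signs):
    # signs: length-k boolean array (True if the projection is positive).
    # Each entry becomes [1 0] (positive) or [0 1] (non-positive),
    # giving a length-2k vector with exactly k ones.
    out = np.zeros(2 * len(signs))
    out[0::2] = signs
    out[1::2] = ~signs
    return out

def collision_estimate(x_signs, y_signs):
    # Inner product of the coded vectors, divided by k, counts sign agreements,
    # i.e., it is the empirical estimate of Pr(sign(x_j) = sign(y_j)).
    k = len(x_signs)
    return float(encode_signs(x_signs) @ encode_signs(y_signs)) / k
```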

1.2 Resemblance, Min-Max Kernel, and 0-Bit Consistent Weighted Sampling (CWS)

As mentioned above, the collision probability of sign stable random projections at α = 0+ is related to the resemblance R when the data (e.g., u and v) are nonnegative. From the definition

R = R(u, v) = \frac{\sum_{i=1}^{D} 1\{u_i > 0 \ \text{and} \ v_i > 0\}}{\sum_{i=1}^{D} 1\{u_i > 0 \ \text{or} \ v_i > 0\}}, \qquad u_i \geq 0, \ v_i \geq 0    (5)

we can see that R only makes sense when the data are sparse (i.e., most entries are zero). When the data are fully dense, we have R = 1 always. This may seriously limit the use of resemblance when the data are not sparse. This issue can be largely fixed by the introduction of the min-max kernel, which is defined as

K_{MM}(u, v) = \frac{\sum_{i=1}^{D} \min\{u_i, v_i\}}{\sum_{i=1}^{D} \max\{u_i, v_i\}}, \qquad u_i \geq 0, \ v_i \geq 0    (6)

The recent work [12] also provides a variant, called the "normalized min-max kernel":

K_{NMM}(u, v) = \frac{\sum_{i=1}^{D} \min\{u_i, v_i\}}{\sum_{i=1}^{D} \max\{u_i, v_i\}}, \qquad \sum_{i=1}^{D} u_i = 1, \ \sum_{i=1}^{D} v_i = 1    (7)
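The three similarity measures above are straightforward to evaluate directly on nonnegative data; a small numpy sketch (for reference, with illustrative function names) is:

```python
import numpy as np

def resemblance(u, v):
    # Resemblance R in (5): jointly-nonzero coordinates over coordinates
    # nonzero in either vector (u, v nonnegative).
    return np.sum((u > 0) & (v > 0)) / np.sum((u > 0) | (v > 0))

def min_max_kernel(u, v):
    # Min-max kernel (6): sum of coordinate-wise minima over sum of maxima.
    return np.minimum(u, v).sum() / np.maximum(u, v).sum()

def normalized_min_max_kernel(u, v):
    # Normalized min-max kernel (7): the same ratio after rescaling u and v
    # so that each sums to 1.
    return min_max_kernel(u / u.sum(), v / v.sum())
```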

The resemblance is a popular measure of similarity for binary data and can be sampled efficiently by minwise hashing [2, 14]. The min-max kernels can also be sampled using the technique called consistent weighted sampling (CWS) [17, 8]. Traditionally, each sample of CWS consists of two values, one of which is unbounded. The so-called "0-bit" CWS [12] simply discards the unbounded value, which makes CWS much more convenient for large-scale machine learning tasks. Because [12] experimented with a large collection of datasets, we hope to compare, shoulder-by-shoulder, sign stable random projections with 0-bit CWS, although we should reiterate that 0-bit CWS is only designed for nonnegative data and is hence not as general as sign stable random projections.
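For completeness, the sketch below outlines one standard CWS scheme (Ioffe's ICWS [8]) and the 0-bit variant, which keeps only the selected index. This is a simplified reference implementation under our reading of [8, 12], not the exact code used in those papers:

```python
import numpy as np

def icws_sample(x, sample_seed):
    # One CWS sample for a nonnegative vector x of length D (Ioffe's ICWS).
    # The per-coordinate randomness depends only on sample_seed, so the same
    # coordinate sees the same (r, c, beta) in every data vector -- this is
    # what makes the sampling "consistent" across vectors.
    rng = np.random.default_rng(sample_seed)
    D = len(x)
    r = rng.gamma(2.0, 1.0, D)
    c = rng.gamma(2.0, 1.0, D)
    beta = rng.uniform(0.0, 1.0, D)
    idx = np.flatnonzero(x)
    t = np.floor(np.log(x[idx]) / r[idx] + beta[idx])
    y = np.exp(r[idx] * (t - beta[idx]))
    a = c[idx] / (y * np.exp(r[idx]))
    j = np.argmin(a)
    return idx[j], int(t[j])        # full CWS keeps the pair (index, t)

def zero_bit_cws(x, k, seed=0):
    # 0-bit CWS: discard the (unbounded) t and keep only the k selected indices.
    return np.array([icws_sample(x, seed + s)[0] for s in range(k)])
```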

2 Experiments

2.1 Datasets and Summary of Results

We experimented with all 34 datasets used in the recent paper on "0-bit CWS" [12] to provide a shoulder-by-shoulder comparison. The results are summarized in Table 1. The results show that, given enough projections, sign α-stable random projections can often achieve good accuracies (and better than linear). The value of α is an important parameter which needs to be tuned individually for each dataset.

Table 1: Datasets and classification accuracies (in %). We use all the datasets in the recent work on "0-bit" CWS [12]. We report the results of linear kernels, min-max kernels (6), normalized min-max kernels (7), and sign α-stable random projections with α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2} and k = 8192. The values for the linear kernel, min-max kernel, and normalized min-max (n-m-m) kernel are directly quoted from [12]. For the min-max (and n-m-m) kernels, the accuracies were computed on the original data using the LIBSVM "pre-computed" kernel functionality and l2-regularized kernel SVM (which has a tuning parameter C). The reported test classification accuracies are the best accuracies over a wide range of C values. The accuracies of sign α-stable random projections (i.e., the last 9 columns) and of the linear kernel (l2-regularized linear SVM) were computed by LIBLINEAR [5]. We highlight (in bold) the highest accuracies among all methods as well as the highest accuracies of sign α-stable random projections among the 9 α values.

Dataset | # train | # test | linear | min-max | n-m-m | α=0.1 | α=0.25 | α=0.5 | α=0.75 | α=1 | α=1.25 | α=1.5 | α=1.75 | α=2
Covertype10k | 10,000 | 50,000 | 70.9 | 80.4 | 80.2 | 74.5 | 76.7 | 77.9 | 78.3 | 78.4 | 78.5 | 78.4 | 78.3 | 78.2
Covertype20k | 20,000 | 50,000 | 71.1 | 83.3 | 83.1 | 76.5 | 78.4 | 79.8 | 80.3 | 80.4 | 80.4 | 80.7 | 80.5 | 80.3
IJCNN5k | 5,000 | 91,701 | 91.6 | 94.4 | 95.3 | 91.0 | 92.8 | 93.7 | 94.5 | 95.2 | 94.7 | 95.4 | 95.3 | 95.4
IJCNN10k | 10,000 | 91,701 | 91.6 | 95.7 | 96.0 | 91.2 | 93.3 | 94.2 | 95.4 | 95.7 | 95.9 | 95.7 | 95.9 | 96.0
Isolet | 6,238 | 1,559 | 95.4 | 96.4 | 96.6 | 90.9 | 93.7 | 94.9 | 95.3 | 95.7 | 95.6 | 95.8 | 95.8 | 95.6
Letter | 16,000 | 4,000 | 62.4 | 96.2 | 95.0 | 88.0 | 92.2 | 94.1 | 94.8 | 95.3 | 95.3 | 95.4 | 95.6 | 95.6
Letter4k | 4,000 | 16,000 | 61.2 | 91.4 | 90.2 | 84.9 | 88.1 | 90.1 | 91.1 | 91.5 | 91.9 | 92.1 | 92.0 | 91.7
M-Basic | 12,000 | 50,000 | 90.0 | 96.2 | 96.0 | 95.9 | 96.0 | 96.0 | 95.9 | 95.7 | 95.5 | 95.4 | 95.2 | 95.0
M-Image | 12,000 | 50,000 | 70.7 | 80.8 | 77.0 | 55.6 | 64.1 | 67.9 | 69.9 | 70.9 | 71.4 | 71.9 | 72.1 | 72.0
MNIST10k | 10,000 | 60,000 | 90.0 | 95.7 | 95.4 | 95.6 | 95.7 | 95.6 | 95.5 | 95.3 | 95.2 | 95.0 | 94.8 | 94.7
M-Noise1 | 10,000 | 4,000 | 60.3 | 71.4 | 68.5 | 47.0 | 53.2 | 56.8 | 58.2 | 58.9 | 59.7 | 60.4 | 60.4 | 60.9
M-Noise2 | 10,000 | 4,000 | 62.1 | 72.4 | 70.7 | 46.4 | 54.6 | 57.5 | 59.4 | 60.6 | 61.5 | 61.9 | 61.5 | 61.7
M-Noise3 | 10,000 | 4,000 | 65.2 | 73.6 | 71.9 | 50.1 | 57.1 | 60.6 | 62.3 | 63.1 | 64.0 | 64.4 | 64.7 | 64.8
M-Noise4 | 10,000 | 4,000 | 68.4 | 76.1 | 75.2 | 53.0 | 59.2 | 62.9 | 65.2 | 66.0 | 66.7 | 67.2 | 67.5 | 67.8
M-Noise5 | 10,000 | 4,000 | 72.3 | 79.0 | 78.4 | 55.4 | 62.4 | 66.4 | 68.6 | 68.9 | 70.2 | 70.4 | 70.7 | 71.5
M-Noise6 | 10,000 | 4,000 | 78.7 | 84.2 | 84.3 | 59.9 | 68.4 | 72.6 | 74.2 | 75.5 | 76.1 | 76.5 | 76.6 | 77.3
M-Rand | 12,000 | 50,000 | 78.9 | 84.2 | 84.1 | 60.2 | 69.1 | 72.5 | 74.2 | 75.2 | 76.1 | 76.5 | 76.8 | 77.1
M-Rotate | 12,000 | 50,000 | 48.0 | 84.8 | 83.9 | 82.6 | 83.0 | 82.5 | 81.6 | 80.9 | 80.2 | 79.5 | 78.8 | 78.2
M-RotImg | 12,000 | 50,000 | 31.4 | 41.0 | 38.5 | 24.1 | 26.8 | 29.3 | 30.6 | 32.0 | 32.7 | 33.4 | 33.7 | 34.1
Optdigits | 3,823 | 1,797 | 95.3 | 97.7 | 97.4 | 95.7 | 96.4 | 96.7 | 97.3 | 97.4 | 97.5 | 97.8 | 97.8 | 97.7
Pendigits | 7,494 | 3,498 | 87.6 | 97.9 | 98.0 | 96.6 | 97.0 | 97.5 | 97.7 | 97.9 | 97.9 | 98.0 | 98.1 | 98.1
Phoneme | 3,340 | 1,169 | 91.4 | 92.5 | 92.0 | 88.0 | 90.4 | 91.3 | 91.5 | 91.7 | 91.6 | 91.5 | 91.9 | 91.6
Protein | 17,766 | 6,621 | 69.1 | 72.4 | 70.7 | 69.0 | 69.9 | 70.6 | 70.7 | 70.5 | 70.3 | 69.7 | 69.4 | 68.8
RCV1 | 20,242 | 60,000 | 96.3 | 96.9 | 96.9 | 94.8 | 94.9 | 94.9 | 94.9 | 94.9 | 94.8 | 94.7 | 94.6 | 94.4
Satimage | 4,435 | 2,000 | 78.5 | 90.5 | 87.8 | 84.3 | 86.1 | 87.1 | 87.1 | 87.3 | 87.7 | 88.0 | 87.8 | 87.7
Segment | 1,155 | 1,155 | 92.6 | 98.1 | 97.5 | 96.1 | 97.0 | 97.4 | 97.2 | 97.3 | 97.2 | 97.2 | 96.9 | 96.9
SensIT20k | 20,000 | 19,705 | 80.5 | 86.9 | 87.0 | 85.5 | 86.2 | 86.6 | 86.7 | 86.7 | 86.3 | 86.0 | 85.3 | 84.7
Shuttle1k | 1,000 | 14,500 | 90.9 | 99.7 | 99.6 | 99.2 | 99.2 | 99.4 | 99.6 | 99.5 | 99.6 | 99.5 | 99.6 | 99.6
Spam | 3,065 | 1,536 | 92.6 | 95.0 | 94.7 | 95.0 | 95.0 | 94.9 | 94.7 | 94.7 | 94.4 | 94.4 | 94.2 | 94.0
Splice | 1,000 | 2,175 | 85.1 | 95.2 | 94.9 | 87.4 | 90.7 | 91.7 | 91.6 | 91.0 | 90.7 | 89.6 | 88.9 | 87.3
USPS | 7,291 | 2,007 | 91.7 | 95.3 | 95.3 | 94.6 | 95.3 | 95.5 | 95.4 | 95.3 | 95.3 | 95.1 | 95.1 | 95.1
Vowel | 528 | 462 | 40.9 | 59.1 | 53.5 | 41.2 | 41.3 | 43.8 | 46.1 | 47.2 | 49.3 | 51.2 | 52.7 | 52.9
WebspamN1-20k | 20,000 | 60,000 | 93.0 | 97.9 | 97.8 | 96.9 | 97.3 | 97.5 | 97.5 | 97.5 | 97.4 | 97.3 | 97.2 | 97.0
YoutubeVision | 11,736 | 10,000 | 63.3 | 72.4 | 72.4 | 59.7 | 65.0 | 68.4 | 69.4 | 69.2 | 68.9 | 67.9 | 66.2 | 64.8

2.2 Detailed Results of Sign α-Stable Random Projections

Figures 1 to 4 present the detailed classification results of sign α-stable random projections for 4 selected datasets, using l2-regularized linear SVM (with a regularization parameter C ∈ [10^{-2}, 10^3]). In each figure, we present the results for k ∈ {64, 128, 256, 512, 1024, 2048, 4096, 8192} projections and α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2}. All experiments were conducted using LIBLINEAR [5]; we repeated each randomized experiment 5 times and report the average results. The classification results are very stable (i.e., very small variance) unless k is too small. The results (together with Table 1 and other figures later in the paper) show that, given enough projections (e.g., 8192), the method of sign α-stable random projections can typically achieve good accuracies.
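For reproducibility, the experimental loop can be summarized by the following sketch (scikit-learn's LinearSVC is used here as a stand-in for LIBLINEAR; the data-loading and variable names are placeholders):

```python
import numpy as np
from sklearn.svm import LinearSVC

def sign_features(X, S):
    # Signs of the stable projections, expanded to 2k binary features.
    # (For a linear model the ordering of the [1 0]/[0 1] codes is immaterial.)
    pos = (X @ S) > 0
    return np.hstack([pos, ~pos]).astype(np.float64)

def accuracy_vs_C(X_train, y_train, X_test, y_test, S,
                  C_grid=10.0 ** np.arange(-2.0, 4.0)):
    # l2-regularized linear SVM over a grid of C values, as in Figures 1-4.
    F_train, F_test = sign_features(X_train, S), sign_features(X_test, S)
    accs = []
    for C in C_grid:
        clf = LinearSVC(C=C).fit(F_train, y_train)
        accs.append(clf.score(F_test, y_test))
    return accs
```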


Figure 1: Covertype10k. Classification accuracies of sign α-stable random projections using l2 -regularized SVMs (with a tuning parameter C ∈ [10−2 , 103 ]) for α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2} and k ∈ {64, 128, 256, 512, 1024, 2048, 4096, 8192} projections. In each panel, the highest point (i.e., best accuracy) at k = 8192 was reported in Table 1. In addition, each panel also presents the accuracies of linear SVM (the pink curve marked by *). All experiments were conducted by LIBLINEAR.


Figure 2: Letter. Classification accuracies of sign α-stable random projections using l2 -regularized SVMs (with a tuning parameter C ∈ [10−2 , 103 ]) for α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2} and k ∈ {64, 128, 256, 512, 1024, 2048, 4096, 8192} projections. In each panel, the highest point (i.e., best accuracy) at k = 8192 was reported in Table 1. In addition, each panel also presents the accuracies of linear SVM (the pink curve marked by *). All experiments were conducted by LIBLINEAR.


Figure 3: MNIST10k. Classification accuracies of sign α-stable random projections using l2 -regularized SVMs (with a tuning parameter C ∈ [10−2 , 103 ]) for α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2} and k ∈ {64, 128, 256, 512, 1024, 2048, 4096, 8192} projections. In each panel, the highest point (i.e., best accuracy) at k = 8192 was reported in Table 1. In addition, each panel also presents the accuracies of linear SVM (the pink curve marked by *). All experiments were conducted by LIBLINEAR.


Figure 4: Segment. Classification accuracies of sign α-stable random projections using l2 -regularized SVMs (with a tuning parameter C ∈ [10−2 , 103 ]) for α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2} and k ∈ {64, 128, 256, 512, 1024, 2048, 4096, 8192} projections. In each panel, the highest point (i.e., best accuracy) at k = 8192 was reported in Table 1. In addition, each panel also presents the accuracies of linear SVM (the pink curve marked by *). All experiments were conducted by LIBLINEAR.


2.3 Detailed Comparisons with 0-Bit Consistent Weighted Sampling (CWS)

Figures 5 to 8 compare sign α-stable random projections with 0-bit CWS [12] on selected datasets. For clarity, we only show the results of sign stable random projections for k = 128, 256, 1024, 8192 projections, and the results for 0-bit CWS with k = 128, 256, 1024 samples. These results demonstrate that 0-bit CWS requires much fewer samples, although we should keep in mind that 0-bit CWS is only for nonnegative data.


Figure 5: MNIST10k (top 2 rows) and M-Rotate (bottom 2 rows). We compare sign α-stable random projections with 0-bit consistent weighted sampling (CWS). Each panel (for each α) consists of 8 curves. The solid (pink) curve marked by * represents the results of linear SVM. Four solid curves (labelled by k = 128, k = 256, k = 1024, and k = 8192, respectively) represent the results of sign α-stable random projections for 4 different k values. The 3 dashed curves correspond to the results of 0-bit CWS for k = 128, 256, 1024 (a higher curve for a higher k value). These experimental results, all conducted using LIBLINEAR, show that 0-bit CWS requires much fewer samples to achieve the same accuracies.


Figure 6: Pendigits and Satimage. We compare sign α-stable random projections with 0-bit consistent weighted sampling (CWS). Each panel (for each α) consists of 8 curves. The solid (pink) curve marked by * represents the results of linear SVM. Four solid curves (labelled by k = 128, k = 256, k = 1024, and k = 8192, respectively) represent the results of sign α-stable random projections for 4 different k values. The 3 dashed curves correspond to the results of 0-bit CWS for k = 128, 256, 1024 (a higher curve for a higher k value). These experimental results, all conducted using LIBLINEAR, show that 0-bit CWS requires much fewer samples to achieve the same accuracies.


Figure 7: Shuttle1k and Splice. We compare sign α-stable random projections with 0-bit consistent weighted sampling (CWS). Each panel (for each α) consists of 8 curves. The solid (pink) curve marked by * represents the results of linear SVM. Four solid curves (labelled by k = 128, k = 256, k = 1024, and k = 8192, respectively) represent the results of sign α-stable random projections for 4 different k values. The 3 dashed curves correspond to the results of 0-bit CWS for k = 128, 256, 1024 (a higher curve for a higher k value). These experimental results, all conducted using LIBLINEAR, show that 0-bit CWS requires much fewer samples to achieve the same accuracies.


Figure 8: USPS and WebspamN1-20k. We compare sign α-stable random projections with 0-bit consistent weighted sampling (CWS). Each panel (for each α) consists of 8 curves. The solid (pink) curve marked by * represents the results of linear SVM. Four solid curves (labelled by k = 128, k = 256, k = 1024, and k = 8192, respectively) represent the results of sign α-stable random projections for 4 different k values. The 3 dashed curves correspond to the results of 0-bit CWS for k = 128, 256, 1024 (a higher curve for a higher k value). These experimental results, all conducted using LIBLINEAR, show that 0-bit CWS requires much fewer samples to achieve the same accuracies.


2.4 Experiment on a Larger Dataset

The paper on 0-bit CWS [12] only experimented with datasets of moderate sizes, for an important reason. To prove the correctness of the method, they needed to show that the result of 0-bit CWS with enough samples approaches that of the exact min-max kernel. A straightforward and faithful implementation of SVM with the min-max kernel is to use the LIBSVM pre-computed kernel functionality, by computing the kernel explicitly and feeding it to the SVM from outside. This strategy, although the most repeatable, is very expensive even for datasets which are not large [1]. On the other hand, once the correctness of 0-bit CWS has been established, applying the method to larger datasets is easy, except that we would not be able to compute the exact result of the min-max kernel.

Figure 9 presents the detailed results on the WebspamN1 dataset, which has 350,000 examples. We use 50% of the examples for training and the other 50% for testing. With linear SVM, the test classification accuracy is about 93%. Both sign α-stable random projections and 0-bit CWS can achieve > 98% accuracies given enough samples. The figure also confirms that, to achieve comparable accuracies, 0-bit CWS requires significantly fewer samples than the number of projections needed by sign stable random projections.
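As a reference for what the "pre-computed kernel" route involves (and why it becomes expensive), a sketch using scikit-learn's SVC in place of LIBSVM, with illustrative variable names, could look like:

```python
import numpy as np
from sklearn.svm import SVC

def min_max_gram(A, B):
    # Pairwise min-max kernel (6) between the rows of A (n x D) and B (m x D),
    # assuming nonnegative data. The explicit n x m matrix is the bottleneck.
    K = np.empty((A.shape[0], B.shape[0]))
    for i, a in enumerate(A):
        K[i] = np.minimum(a, B).sum(axis=1) / np.maximum(a, B).sum(axis=1)
    return K

# K_train = min_max_gram(X_train, X_train)        # n_train x n_train
# K_test  = min_max_gram(X_test, X_train)         # n_test  x n_train
# model = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)
# accuracy = model.score(K_test, y_test)
```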


Figure 9: WebspamN1. We compare sign α-stable random projections with 0-bit consistent weighted sampling (CWS). Each panel (for each α) consists of 8 curves. The solid (pink) curve marked by * represents the results of linear SVM. Four solid curves (labelled by k = 128, k = 256, k = 1024, and k = 8192, respectively) represent the results of sign α-stable random projections for 4 different k values. The 3 dashed curves correspond to the results of 0-bit CWS for k = 128, 256, 1024 (a higher curve for a higher k value).


3 Conclusion

This paper provides an extensive empirical study of sign α-stable random projections for large-scale learning applications. Although the paper focuses on presenting results on classification tasks, one should keep in mind that the method is a general-purpose data processing tool which can be used for classification, regression, clustering, or near-neighbor search. Given enough projections, the method can often achieve good performance. The comparison with 0-bit CWS should also be of interest to practitioners.

Future work: The processing cost of sign α-stable random projections can be substantially reduced by "very sparse stable random projections" [9]; an empirical study is needed to confirm this claim. Another interesting line of research is to combine sign stable random projections with 0-bit CWS, for example, by a strategy similar to that in the recent work on "CoRE kernels" [11].

References

[1] L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors. Large-Scale Kernel Machines. The MIT Press, Cambridge, MA, 2007.
[2] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In WWW, pages 1157–1166, Santa Clara, CA, 1997.
[3] J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. Journal of the American Statistical Association, 71(354):340–344, 1976.
[4] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, Montreal, Quebec, Canada, 2002.
[5] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[6] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115–1145, 1995.
[7] P. Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM, 53(3):307–323, 2006.
[8] S. Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246–255, Sydney, AU, 2010.
[9] P. Li. Very sparse stable random projections for dimension reduction in l_α (0 < α ≤ 2) norm. In KDD, San Jose, CA, 2007.
[10] P. Li. Estimators and tail bounds for dimension reduction in l_α (0 < α ≤ 2) using stable random projections. In SODA, pages 10–19, San Francisco, CA, 2008.
[11] P. Li. CoRE kernels. In UAI, Quebec City, CA, 2014.
[12] P. Li. Min-max kernels. Technical report, arXiv:1503.0173, 2015.
[13] P. Li. One scan 1-bit compressed sensing. Technical report, arXiv:1503.02346, 2015.
[14] P. Li and A. C. König. Theory and applications of b-bit minwise hashing. Commun. ACM, 2011.
[15] P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for random projections. In ICML, 2014.

[16] P. Li, G. Samorodnitsky, and J. Hopcroft. Sign Cauchy projections and chi-square kernel. In NIPS, Lake Tahoe, NV, 2013.
[17] M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.
[18] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2):117–236, 2005.
[19] G. Samorodnitsky and M. S. Taqqu. Stable Non-Gaussian Random Processes. Chapman & Hall, New York, 1994.
[20] V. M. Zolotarev. One-dimensional Stable Distributions. American Mathematical Society, Providence, RI, 1986.
