arXiv:1504.07235v1 [stat.ML] 27 Apr 2015
Sign Stable Random Projections for Large-Scale Learning

Ping Li
Department of Statistics and Biostatistics
Department of Computer Science
Rutgers University
Piscataway, NJ 08854, USA
[email protected] Abstract In this paper, we study the use of “sign α-stable random projections” (where 0 < α ≤ 2) for building basic data processing tools in the context of large-scale machine learning applications (e.g., classification, regression, clustering, and near-neighbor search). After the processing by sign stable random projections, the inner products of the processed data approximate various types of nonlinear kernels depending on the value of α. Thus, this approach provides an effective strategy for approximating nonlinear learning algorithms essentially at the cost of linear learning. When α = 2, it is known that the corresponding nonlinear kernel is the arc-cosine kernel. When α = 1, the procedure approximates the arc-cos-χ2 kernel (under certain condition). When α → 0+, it corresponds to the resemblance kernel, which provides the exciting connection between two popular randomized algorithms: (i) stable random projections (ii) b-bit minwise hashing. No theoretical results are known so far for other α values except for α = 2, 1, or 0+. From practitioners’ perspective, the method of sign α-stable random projections is ready to be tested for large-scale learning applications, where α can be simply viewed as a tuning parameter. What is missing in the literature is an extensive empirical study to show the effectiveness of sign stable random projections, especially for α 6= 2 or 1. The paper supplies such a study on a wide variety of classification datasets. In particular, we compare shoulder-by-shoulder sign stable random projections with the recently proposed “0-bit consistent weighted sampling (CWS)” [12] (which is only for nonnegative data). We provide the detailed comparisons on all the 34 datasets used by [12]. In addition, we present the comparison on a larger dataset with 350,000 examples. For all datasets, we experiment with α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2}. For most datasets, sign stable random projections can approach (or in some cases even slightly exceed) the performance of 0-bit CWS, given enough projections. Typically, to reach the same accuracy, sign stable random projections would require significantly more projections than the number of samples needed by 0-bit CWS. There are also datasets for which sign stable random projections could not achieve the same accuracy as 0-bit CWS regardless of α. While the comparison results seem to favor 0-bit consistent weighted sampling (which is only for nonnegative data), the distinct advantage of sign stable random projections is that the method is applicable to general data types, not only for nonnegative data. It is also an interesting research problem to combine 0-bit CWS with sign stable random projections, for example, a strategy similar to “CoRE kernels” [11].
1 Introduction

In this paper, we focus on the idea of "sign α-stable random projections" and its applications in machine learning with massive (and possibly streaming [18]) data. Consider two data vectors $u, v \in \mathbb{R}^D$ from a data matrix. The central idea is to multiply them with a random projection matrix $\{s_{ij}\}$, $i = 1, ..., D$, $j = 1, ..., k$, whose entries $s_{ij}$ are sampled i.i.d. from an α-stable distribution, denoted by $S(\alpha, 1)$. That is,

$$x_j = \sum_{i=1}^{D} u_i s_{ij}, \qquad y_j = \sum_{i=1}^{D} v_i s_{ij}, \qquad s_{ij} \sim S(\alpha, 1) \text{ i.i.d.}, \qquad j = 1, 2, ..., k \qquad (1)$$
The use of α-stable distributions was studied in the context of estimating frequency moments of data streams [7, 10] and in the recent work on "one scan 1-bit compressed sensing" [13]. Here, we adopt the parameterization of [20, 19] such that, if $s \sim S(\alpha, d)$, then the characteristic function is $E\left(e^{\sqrt{-1}\, s t}\right) = e^{-d |t|^{\alpha}}$. When α = 2, $S(2, d)$ is equivalent to a Gaussian distribution $N(0, \sigma^2 = 2d)$. When α = 1, $S(1, 1)$ is the standard Cauchy distribution. Although in general no closed-form density functions of α-stable distributions are available, one can easily sample from an α-stable distribution by (e.g.) the classical CMS method [3]. Stable distributions with α < 2 are also known to be "heavy-tailed" distributions because, if $s \sim S(\alpha, 1)$, then unless α = 2 we always have $E(|s|^{\lambda}) = \infty$ for $\lambda \geq \alpha$. This is probably the reason why stable distributions were rarely used in machine learning and data mining applications.
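To make the sampling step concrete, the following is a minimal sketch (our own illustration, not code from the paper) of generating the projection matrix and the signs in Python/NumPy. The function names are ours; the CMS formula below is the standard one for the symmetric case:

```python
import numpy as np

def sample_stable(alpha, size, rng):
    """Chambers-Mallows-Stuck (CMS) sampler for symmetric alpha-stable S(alpha, 1).

    For alpha = 2 this reduces to N(0, 2); for alpha = 1, to the standard Cauchy.
    """
    theta = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                    # independent Exp(1)
    return (np.sin(alpha * theta) / np.cos(theta) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * theta) / w) ** ((1.0 - alpha) / alpha))

def sign_stable_projections(X, k, alpha, seed=0):
    """Signs of k alpha-stable random projections of the rows of X (n x D)."""
    rng = np.random.default_rng(seed)
    S = sample_stable(alpha, (X.shape[1], k), rng)  # projection matrix {s_ij}
    return np.sign(X @ S)                           # only the signs are kept
```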
1.1 Sign Stable Random Projections

By the property of stable distributions, we have $x_j \sim S\left(\alpha, \sum_{i=1}^{D} |u_i|^{\alpha}\right)$ and $y_j \sim S\left(\alpha, \sum_{i=1}^{D} |v_i|^{\alpha}\right)$, $j = 1, 2, ..., k$. Unless α = 2, it might be difficult to imagine how one can make use of these (manually generated) heavy-tailed data for machine learning applications. Indeed, we do not directly use the projected data. Instead, in this paper, we only utilize the projected data through their signs, i.e., $\text{sign}(x_j)$ and $\text{sign}(y_j)$, which are well-behaved and can be used for building tools for large-scale machine learning. If $x_j \leq 0$, we code $x_j$ as a two-dimensional vector [0 1]; if $x_j > 0$, we code it as [1 0]. We then concatenate k such two-dimensional vectors to form a vector of length 2k (with exactly k 1's). We apply the same coding scheme to $y_j$ (and all the projected data).

The signs $\text{sign}(x_j)$ and $\text{sign}(y_j)$ are statistically dependent, and it is interesting (and in general challenging) to find out how the signs are related. When α = 2, the relationship between $\text{sign}(x_j)$ and $\text{sign}(y_j)$ is well-known [6, 4, 15]:

$$\alpha = 2: \quad \Pr\left(\text{sign}(x_j) = \text{sign}(y_j)\right) = 1 - \frac{1}{\pi} \cos^{-1} \rho_2, \qquad \rho_2 = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} |u_i|^2} \sqrt{\sum_{i=1}^{D} |v_i|^2}} \qquad (2)$$
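The coding scheme and the collision-probability estimator are easy to express in code. Below is a small sketch (the helper name is ours) showing that, after the expansion, the estimator is literally an inner product:

```python
import numpy as np

def encode_signs(signs):
    """Expand each sign into [1 0] (positive) or [0 1] (non-positive)."""
    pos = (signs > 0).astype(np.float64)
    code = np.empty(2 * len(signs))
    code[0::2] = pos        # coordinate 2j holds the "positive" bit
    code[1::2] = 1.0 - pos  # coordinate 2j+1 holds the "non-positive" bit
    return code

# For sign vectors sx, sy of length k, encode_signs(sx) @ encode_signs(sy)
# counts the sign collisions; dividing by k gives the empirical estimate of
# the collision probability in Eq. (2).
```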
Thus, the "collision probability" is monotone in $\rho_2$, which is the correlation coefficient. Although $\cos^{-1} \rho_2$ is nonlinear, the estimator of the probability, i.e., $\frac{1}{k} \sum_{j=1}^{k} 1\{\text{sign}(x_j) = \text{sign}(y_j)\}$, can be viewed as an inner product once we expand each sign as either [0 1] or [1 0]. In other words, we only need to pay the cost of linear learning to approximately train a classifier originally based on nonlinear kernels.

It is not so straightforward to calculate the collision probability once α < 2. A recent work [16] focused on α = 1 and showed that, when $u_i \geq 0$, $v_i \geq 0$, and $\sum_{i=1}^{D} u_i = \sum_{i=1}^{D} v_i = 1$, we have

$$\alpha = 1: \quad \Pr\left(\text{sign}(x_j) = \text{sign}(y_j)\right) \approx 1 - \frac{1}{\pi} \cos^{-1} \rho_{\chi^2}, \qquad \rho_{\chi^2} = \sum_{i=1}^{D} \frac{2 u_i v_i}{u_i + v_i} \qquad (3)$$
Note that the so-called χ²-kernel, $\rho_{\chi^2}$, is popular in computer vision for data generated from histograms.
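For reference, here is a one-function sketch (our own, assuming dense nonnegative histograms that each sum to 1) of the χ²-similarity in Eq. (3), with the 0/0 convention handled explicitly:

```python
import numpy as np

def chi2_similarity(u, v):
    """rho_chi2 = sum_i 2 u_i v_i / (u_i + v_i), treating 0/0 terms as 0."""
    mask = (u + v) > 0
    return np.sum(2.0 * u[mask] * v[mask] / (u[mask] + v[mask]))
```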
When α → 0+, [16] mentioned in its "future work" that the collision probability is related to the "resemblance" when the data are nonnegative:

$$\alpha = 0+: \quad \Pr\left(\text{sign}(x_j) = \text{sign}(y_j)\right) = \frac{1}{2} + \frac{1}{2} R, \qquad R = \frac{\sum_{i=1}^{D} 1\{u_i > 0 \text{ and } v_i > 0\}}{\sum_{i=1}^{D} 1\{u_i > 0 \text{ or } v_i > 0\}} \qquad (4)$$

Interestingly, this collision probability is essentially the same as the collision probability of "1-bit minwise hashing" [14].
For other α values, at this moment we cannot relate the collision probabilities to any known similarity measures. On the other hand, the estimator $\frac{1}{k} \sum_{j=1}^{k} 1\{\text{sign}(x_j) = \text{sign}(y_j)\}$ (which is an inner product) is of course still a valid positive definite kernel for any α. Thus, we can nevertheless use sign α-stable random projections for building large-scale learning algorithms, where α can be viewed as an important tuning parameter. What is missing in the literature is an extensive empirical study, and our paper supplies such a study.
1.2 Resemblance, Min-Max Kernel, and 0-Bit Consistent Weighted Sampling (CWS)

As mentioned above, the collision probability of sign stable random projections at α = 0+ is related to the resemblance R when the data (e.g., u and v) are nonnegative. From the definition

$$R = R(u, v) = \frac{\sum_{i=1}^{D} 1\{u_i > 0 \text{ and } v_i > 0\}}{\sum_{i=1}^{D} 1\{u_i > 0 \text{ or } v_i > 0\}}, \qquad u_i \geq 0, \ v_i \geq 0 \qquad (5)$$

we can see that R only makes sense when the data are sparse (i.e., most entries are zero). When the data are fully dense, we always have R = 1. This may seriously limit the use of resemblance when the data are not sparse. This issue can be largely fixed by the introduction of the min-max kernel, which is defined as

$$K_{MM}(u, v) = \frac{\sum_{i=1}^{D} \min\{u_i, v_i\}}{\sum_{i=1}^{D} \max\{u_i, v_i\}}, \qquad u_i \geq 0, \ v_i \geq 0 \qquad (6)$$

The recent work [12] also provides a variant, called the "normalized min-max kernel":

$$K_{NMM}(u, v) = \frac{\sum_{i=1}^{D} \min\{u_i, v_i\}}{\sum_{i=1}^{D} \max\{u_i, v_i\}}, \qquad \sum_{i=1}^{D} u_i = 1, \ \sum_{i=1}^{D} v_i = 1 \qquad (7)$$
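All three similarity measures are straightforward to compute directly. Here is a short sketch (ours, assuming dense nonnegative NumPy vectors) mirroring Eqs. (5)-(7):

```python
import numpy as np

def resemblance(u, v):
    """R in Eq. (5): fraction of jointly nonzero coordinates."""
    return np.sum((u > 0) & (v > 0)) / np.sum((u > 0) | (v > 0))

def min_max_kernel(u, v):
    """K_MM in Eq. (6)."""
    return np.minimum(u, v).sum() / np.maximum(u, v).sum()

def normalized_min_max_kernel(u, v):
    """K_NMM in Eq. (7): the min-max kernel after scaling each vector to sum 1."""
    return min_max_kernel(u / u.sum(), v / v.sum())
```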
The resemblance is a popular measure of similarity for binary data and can be sampled efficiently by minwise hashing [2, 14]. The min-max kernels can also be sampled using the technique called consistent weighted sampling (CWS) [17, 8]. Traditionally, each sample of CWS consists of two values, one of which is unbounded. The so-called "0-bit" CWS [12] simply discards the unbounded value, which makes CWS much more convenient for large-scale machine learning tasks. Because [12] experimented with a large collection of datasets, we hope to compare, shoulder-by-shoulder, sign stable random projections with 0-bit CWS, although we should reiterate that 0-bit CWS is only designed for nonnegative data and is hence not as general as sign stable random projections.
2 Experiments

2.1 Datasets and Summary of Results

We have experimented with all 34 datasets used in the recent paper on "0-bit CWS" [12] to provide a shoulder-by-shoulder comparison. The results are summarized in Table 1. They show that, given enough projections, sign α-stable random projections can often achieve good accuracies (often better than linear). The value of α is an important parameter which needs to be tuned individually for each dataset.
Table 1: Datasets and classification accuracies (in %). We use all the datasets in the recent work on "0-bit" CWS [12]. We report the results of linear kernels, min-max kernels (6), normalized min-max kernels (7), and sign α-stable random projections with α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2} and k = 8192. The values for the linear kernel, min-max kernel, and normalized min-max (n-m-m) kernel are directly quoted from [12]. For the min-max (and n-m-m) kernels, the accuracies were computed on the original data using the LIBSVM "pre-computed" kernel functionality and l2-regularized kernel SVM (which has a tuning parameter C). The reported test classification accuracies are the best accuracies over a wide range of C values. The reported accuracies of sign α-stable random projections (i.e., the last 9 columns) and of the linear kernel (l2-regularized linear SVM) were computed by LIBLINEAR [5].
Dataset | # train | # test | linear | min-max | n-m-m | α=0.1 | α=0.25 | α=0.5 | α=0.75 | α=1 | α=1.25 | α=1.5 | α=1.75 | α=2
Covertype10k | 10,000 | 50,000 | 70.9 | 80.4 | 80.2 | 74.5 | 76.7 | 77.9 | 78.3 | 78.4 | 78.5 | 78.4 | 78.3 | 78.2
Covertype20k | 20,000 | 50,000 | 71.1 | 83.3 | 83.1 | 76.5 | 78.4 | 79.8 | 80.3 | 80.4 | 80.4 | 80.7 | 80.5 | 80.3
IJCNN5k | 5,000 | 91,701 | 91.6 | 94.4 | 95.3 | 91.0 | 92.8 | 93.7 | 94.5 | 95.2 | 94.7 | 95.4 | 95.3 | 95.4
IJCNN10k | 10,000 | 91,701 | 91.6 | 95.7 | 96.0 | 91.2 | 93.3 | 94.2 | 95.4 | 95.7 | 95.9 | 95.7 | 95.9 | 96.0
Isolet | 6,238 | 1,559 | 95.4 | 96.4 | 96.6 | 90.9 | 93.7 | 94.9 | 95.3 | 95.7 | 95.6 | 95.8 | 95.8 | 95.6
Letter | 16,000 | 4,000 | 62.4 | 96.2 | 95.0 | 88.0 | 92.2 | 94.1 | 94.8 | 95.3 | 95.3 | 95.4 | 95.6 | 95.6
Letter4k | 4,000 | 16,000 | 61.2 | 91.4 | 90.2 | 84.9 | 88.1 | 90.1 | 91.1 | 91.5 | 91.9 | 92.1 | 92.0 | 91.7
M-Basic | 12,000 | 50,000 | 90.0 | 96.2 | 96.0 | 95.9 | 96.0 | 96.0 | 95.9 | 95.7 | 95.5 | 95.4 | 95.2 | 95.0
M-Image | 12,000 | 50,000 | 70.7 | 80.8 | 77.0 | 55.6 | 64.1 | 67.9 | 69.9 | 70.9 | 71.4 | 71.9 | 72.1 | 72.0
MNIST10k | 10,000 | 60,000 | 90.0 | 95.7 | 95.4 | 95.6 | 95.7 | 95.6 | 95.5 | 95.3 | 95.2 | 95.0 | 94.8 | 94.7
M-Noise1 | 10,000 | 4,000 | 60.3 | 71.4 | 68.5 | 47.0 | 53.2 | 56.8 | 58.2 | 58.9 | 59.7 | 60.4 | 60.4 | 60.9
M-Noise2 | 10,000 | 4,000 | 62.1 | 72.4 | 70.7 | 46.4 | 54.6 | 57.5 | 59.4 | 60.6 | 61.5 | 61.9 | 61.5 | 61.7
M-Noise3 | 10,000 | 4,000 | 65.2 | 73.6 | 71.9 | 50.1 | 57.1 | 60.6 | 62.3 | 63.1 | 64.0 | 64.4 | 64.7 | 64.8
M-Noise4 | 10,000 | 4,000 | 68.4 | 76.1 | 75.2 | 53.0 | 59.2 | 62.9 | 65.2 | 66.0 | 66.7 | 67.2 | 67.5 | 67.8
M-Noise5 | 10,000 | 4,000 | 72.3 | 79.0 | 78.4 | 55.4 | 62.4 | 66.4 | 68.6 | 68.9 | 70.2 | 70.4 | 70.7 | 71.5
M-Noise6 | 10,000 | 4,000 | 78.7 | 84.2 | 84.3 | 59.9 | 68.4 | 72.6 | 74.2 | 75.5 | 76.1 | 76.5 | 76.6 | 77.3
M-Rand | 12,000 | 50,000 | 78.9 | 84.2 | 84.1 | 60.2 | 69.1 | 72.5 | 74.2 | 75.2 | 76.1 | 76.5 | 76.8 | 77.1
M-Rotate | 12,000 | 50,000 | 48.0 | 84.8 | 83.9 | 82.6 | 83.0 | 82.5 | 81.6 | 80.9 | 80.2 | 79.5 | 78.8 | 78.2
M-RotImg | 12,000 | 50,000 | 31.4 | 41.0 | 38.5 | 24.1 | 26.8 | 29.3 | 30.6 | 32.0 | 32.7 | 33.4 | 33.7 | 34.1
Optdigits | 3,823 | 1,797 | 95.3 | 97.7 | 97.4 | 95.7 | 96.4 | 96.7 | 97.3 | 97.4 | 97.5 | 97.8 | 97.8 | 97.7
Pendigits | 7,494 | 3,498 | 87.6 | 97.9 | 98.0 | 96.6 | 97.0 | 97.5 | 97.7 | 97.9 | 97.9 | 98.0 | 98.1 | 98.1
Phoneme | 3,340 | 1,169 | 91.4 | 92.5 | 92.0 | 88.0 | 90.4 | 91.3 | 91.5 | 91.7 | 91.6 | 91.5 | 91.9 | 91.6
Protein | 17,766 | 6,621 | 69.1 | 72.4 | 70.7 | 69.0 | 69.9 | 70.6 | 70.7 | 70.5 | 70.3 | 69.7 | 69.4 | 68.8
RCV1 | 20,242 | 60,000 | 96.3 | 96.9 | 96.9 | 94.8 | 94.9 | 94.9 | 94.9 | 94.9 | 94.8 | 94.7 | 94.6 | 94.4
Satimage | 4,435 | 2,000 | 78.5 | 90.5 | 87.8 | 84.3 | 86.1 | 87.1 | 87.1 | 87.3 | 87.7 | 88.0 | 87.8 | 87.7
Segment | 1,155 | 1,155 | 92.6 | 98.1 | 97.5 | 96.1 | 97.0 | 97.4 | 97.2 | 97.3 | 97.2 | 97.2 | 96.9 | 96.9
SensIT20k | 20,000 | 19,705 | 80.5 | 86.9 | 87.0 | 85.5 | 86.2 | 86.6 | 86.7 | 86.7 | 86.3 | 86.0 | 85.3 | 84.7
Shuttle1k | 1,000 | 14,500 | 90.9 | 99.7 | 99.6 | 99.2 | 99.2 | 99.4 | 99.6 | 99.5 | 99.6 | 99.5 | 99.6 | 99.6
Spam | 3,065 | 1,536 | 92.6 | 95.0 | 94.7 | 95.0 | 95.0 | 94.9 | 94.7 | 94.7 | 94.4 | 94.4 | 94.2 | 94.0
Splice | 1,000 | 2,175 | 85.1 | 95.2 | 94.9 | 87.4 | 90.7 | 91.7 | 91.6 | 91.0 | 90.7 | 89.6 | 88.9 | 87.3
USPS | 7,291 | 2,007 | 91.7 | 95.3 | 95.3 | 94.6 | 95.3 | 95.5 | 95.4 | 95.3 | 95.3 | 95.1 | 95.1 | 95.1
Vowel | 528 | 462 | 40.9 | 59.1 | 53.5 | 41.2 | 41.3 | 43.8 | 46.1 | 47.2 | 49.3 | 51.2 | 52.7 | 52.9
WebspamN1-20k | 20,000 | 60,000 | 93.0 | 97.9 | 97.8 | 96.9 | 97.3 | 97.5 | 97.5 | 97.5 | 97.4 | 97.3 | 97.2 | 97.0
YoutubeVision | 11,736 | 10,000 | 63.3 | 72.4 | 72.4 | 59.7 | 65.0 | 68.4 | 69.4 | 69.2 | 68.9 | 67.9 | 66.2 | 64.8
2.2 Detailed Results of Sign α-Stable Random Projections

Figures 1 to 4 present the detailed classification results of sign α-stable random projections for 4 selected datasets, using l2-regularized linear SVM (with a regularization parameter $C \in [10^{-2}, 10^{3}]$). In each figure, we present the results for k ∈ {64, 128, 256, 512, 1024, 2048, 4096, 8192} projections and α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2}. All experiments were conducted using LIBLINEAR [5]; we repeated each randomized experiment 5 times and report the average results. The classification results are very stable (i.e., have very small variance) unless k is too small. The results (together with Table 1 and other figures later in the paper) show that, given enough projections (e.g., 8192), the method of sign α-stable random projections can typically achieve good accuracies.
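For readers who wish to reproduce this pipeline, here is a minimal sketch using scikit-learn, whose LinearSVC is backed by LIBLINEAR; the helper names and the particular C grid are our own choices, not the exact setup of the paper:

```python
import numpy as np
from sklearn.svm import LinearSVC  # l2-regularized linear SVM (LIBLINEAR)

def sign_features(X, S):
    """Project, take signs, and expand each sign into a 2-dim binary code."""
    pos = (X @ S > 0).astype(np.float64)  # n x k indicator of positive signs
    return np.hstack([pos, 1.0 - pos])    # n x 2k; inner products match the
                                          # interleaved [1 0]/[0 1] coding

# S: D x k matrix with i.i.d. S(alpha, 1) entries (see the CMS sampler above).
# for C in 10.0 ** np.arange(-2.0, 3.5, 0.5):
#     clf = LinearSVC(C=C).fit(sign_features(X_train, S), y_train)
#     acc = clf.score(sign_features(X_test, S), y_test)
```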
Figure 1: Covertype10k. Classification accuracies of sign α-stable random projections using l2-regularized SVMs (with a tuning parameter $C \in [10^{-2}, 10^{3}]$) for α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2} and k ∈ {64, 128, 256, 512, 1024, 2048, 4096, 8192} projections. In each panel, the highest point (i.e., best accuracy) at k = 8192 was reported in Table 1. In addition, each panel also presents the accuracies of linear SVM (the pink curve marked by *). All experiments were conducted by LIBLINEAR.
Figure 2: Letter. Classification accuracies of sign α-stable random projections using l2-regularized SVMs (with a tuning parameter $C \in [10^{-2}, 10^{3}]$) for α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2} and k ∈ {64, 128, 256, 512, 1024, 2048, 4096, 8192} projections. In each panel, the highest point (i.e., best accuracy) at k = 8192 was reported in Table 1. In addition, each panel also presents the accuracies of linear SVM (the pink curve marked by *). All experiments were conducted by LIBLINEAR.
Figure 3: MNIST10k. Classification accuracies of sign α-stable random projections using l2-regularized SVMs (with a tuning parameter $C \in [10^{-2}, 10^{3}]$) for α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2} and k ∈ {64, 128, 256, 512, 1024, 2048, 4096, 8192} projections. In each panel, the highest point (i.e., best accuracy) at k = 8192 was reported in Table 1. In addition, each panel also presents the accuracies of linear SVM (the pink curve marked by *). All experiments were conducted by LIBLINEAR.
Figure 4: Segment. Classification accuracies of sign α-stable random projections using l2-regularized SVMs (with a tuning parameter $C \in [10^{-2}, 10^{3}]$) for α ∈ {0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2} and k ∈ {64, 128, 256, 512, 1024, 2048, 4096, 8192} projections. In each panel, the highest point (i.e., best accuracy) at k = 8192 was reported in Table 1. In addition, each panel also presents the accuracies of linear SVM (the pink curve marked by *). All experiments were conducted by LIBLINEAR.
2.3 Detailed Comparisons with 0-Bit Consistent Weighted Sampling (CWS)

Figures 5 to 8 compare sign α-stable random projections with 0-bit CWS [12] on selected datasets. For clarity, we only show the results of sign stable random projections for k = 128, 256, 1024, 8192 projections, and the results of 0-bit CWS for k = 128, 256, 1024 samples. These results demonstrate that 0-bit CWS requires far fewer samples, although we should keep in mind that 0-bit CWS is only for nonnegative data.
Figure 5: MNIST10k (top 2 rows) and M-Rotate (bottom 2 rows). We compare sign α-stable random projections with 0-bit consistent weighted sampling (CWS). Each panel (for each α) consists of 8 curves. The solid (pink) curve marked by * represents the results of linear SVM. Four solid curves (labelled k = 128, k = 256, k = 1024, and k = 8192, respectively) represent the results of sign α-stable random projections for 4 different k values. The 3 dashed curves correspond to the results of 0-bit CWS for k = 128, 256, 1024 (a higher curve for a higher k value). These experimental results, all conducted using LIBLINEAR, show that 0-bit CWS requires much fewer samples to achieve the same accuracies.
Figure 6: Pendigits and Satimage. We compare sign α-stable random projections with 0-bit consistent weighted sampling (CWS). Each panel (for each α) consists of 8 curves. The solid (pink) curve marked by * represents the results of linear SVM. Four solid curves (labelled k = 128, k = 256, k = 1024, and k = 8192, respectively) represent the results of sign α-stable random projections for 4 different k values. The 3 dashed curves correspond to the results of 0-bit CWS for k = 128, 256, 1024 (a higher curve for a higher k value). These experimental results, all conducted using LIBLINEAR, show that 0-bit CWS requires much fewer samples to achieve the same accuracies.
Figure 7: Shuttle1k and Splice. We compare sign α-stable random projections with 0-bit consistent weighted sampling (CWS). Each panel (for each α) consists of 8 curves. The solid (pink) curve marked by * represents the results of linear SVM. Four solid curves (labelled k = 128, k = 256, k = 1024, and k = 8192, respectively) represent the results of sign α-stable random projections for 4 different k values. The 3 dashed curves correspond to the results of 0-bit CWS for k = 128, 256, 1024 (a higher curve for a higher k value). These experimental results, all conducted using LIBLINEAR, show that 0-bit CWS requires much fewer samples to achieve the same accuracies.
Figure 8: USPS and WebspamN1-20k. We compare sign α-stable random projections with 0-bit consistent weighted sampling (CWS). Each panel (for each α) consists of 8 curves. The solid (pink) curve marked by * represents the results of linear SVM. Four solid curves (labelled k = 128, k = 256, k = 1024, and k = 8192, respectively) represent the results of sign α-stable random projections for 4 different k values. The 3 dashed curves correspond to the results of 0-bit CWS for k = 128, 256, 1024 (a higher curve for a higher k value). These experimental results, all conducted using LIBLINEAR, show that 0-bit CWS requires much fewer samples to achieve the same accuracies.
2.4 Experiment on a Larger Dataset

The paper on 0-bit CWS [12] only experimented with datasets of moderate sizes, for an important reason: to prove the correctness of the method, the authors needed to show that the result of 0-bit CWS with enough samples approaches that of the exact min-max kernel. A straightforward and faithful implementation of SVM with the min-max kernel is to use the LIBSVM pre-computed kernel functionality, by computing the kernel explicitly and feeding it to the SVM from outside. This strategy, although most repeatable, is very expensive even for datasets which are not large [1]. On the other hand, once the correctness of 0-bit CWS has been established, applying the method to larger datasets is easy, except that we can no longer compute the exact result of the min-max kernel.

Figure 9 presents the detailed results on the WebspamN1 dataset, which has 350,000 examples. We use 50% of the examples for training and the other 50% for testing. With linear SVM, the test classification accuracy is about 93%. Both sign α-stable random projections and 0-bit CWS can achieve > 98% accuracy given enough samples. The figure also confirms that, to achieve comparable accuracies, 0-bit CWS requires significantly fewer samples than the number of projections needed by sign stable random projections.
Figure 9: WebspamN1. We compare sign α-stable random projections with 0-bit consistent weighted sampling (CWS). Each panel (for each α) consists of 8 curves. The solid (pink) curve marked by * represents the results of linear SVM. Four solid curves (labelled by k = 128, k = 256, k = 1024, and k = 8192, respectively) represent the results of sign α-stable random projections for 4 different k values. The 3 dashed curves correspond to the results of 0-bit CWS for k = 128, 256, 1024 (a higher curve for a higher k value).
3 Conclusion

This paper provides an extensive empirical study of sign α-stable random projections for large-scale learning applications. Although the paper focuses on presenting results for classification tasks, one should keep in mind that the method is a general-purpose data processing tool which can be used for classification, regression, clustering, or near-neighbor search. Given enough projections, the method can often achieve good performance. The comparison with 0-bit CWS should also be interesting to practitioners.

Future work: The processing cost of sign α-stable random projections can be substantially reduced by "very sparse stable random projections" [9]. An empirical study is needed to confirm this claim. Another interesting line of research is to combine sign stable random projections with 0-bit CWS, for example, by a strategy similar to that in the recent work on "CoRE kernels" [11].
References

[1] L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors. Large-Scale Kernel Machines. The MIT Press, Cambridge, MA, 2007.
[2] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In WWW, pages 1157-1166, Santa Clara, CA, 1997.
[3] J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. Journal of the American Statistical Association, 71(354):340-344, 1976.
[4] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380-388, Montreal, Quebec, Canada, 2002.
[5] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.
[6] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115-1145, 1995.
[7] P. Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of ACM, 53(3):307-323, 2006.
[8] S. Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246-255, Sydney, AU, 2010.
[9] P. Li. Very sparse stable random projections for dimension reduction in lα (0 < α ≤ 2) norm. In KDD, San Jose, CA, 2007.
[10] P. Li. Estimators and tail bounds for dimension reduction in lα (0 < α ≤ 2) using stable random projections. In SODA, pages 10-19, San Francisco, CA, 2008.
[11] P. Li. CoRE kernels. In UAI, Quebec City, CA, 2014.
[12] P. Li. Min-max kernels. Technical report, arXiv:1503.0173, 2015.
[13] P. Li. One scan 1-bit compressed sensing. Technical report, arXiv:1503.02346, 2015.
[14] P. Li and A. C. König. Theory and applications of b-bit minwise hashing. Commun. ACM, 2011.
[15] P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for random projections. In ICML, 2014.
[16] P. Li, G. Samorodnitsky, and J. Hopcroft. Sign Cauchy projections and chi-square kernel. In NIPS, Lake Tahoe, NV, 2013.
[17] M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.
[18] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1:117-236, 2005.
[19] G. Samorodnitsky and M. S. Taqqu. Stable Non-Gaussian Random Processes. Chapman & Hall, New York, 1994.
[20] V. M. Zolotarev. One-dimensional Stable Distributions. American Mathematical Society, Providence, RI, 1986.