Performance of Approximate Nearest Neighbor Classification

Gregory Shakhnarovich (Brown University) and John W. Fisher (Massachusetts Institute of Technology)

• In high dimensions, exact NN search reduces to brute-force (linear) search.
• Locality sensitive hashing: search in O(dN^{1/(1+ε)}).
• Newer algorithm [1]: O(N^{1/(1+ε)² + o(1)}).
• Other algorithms exist (Best Bin First, ANN, etc.), but with no known theoretical guarantees.
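For concreteness, a minimal sketch of p-stable (random projection) LSH for Euclidean distance — an illustration of the general idea, not the algorithm of [1]; the bucket width w, the number of projections k, and the number of tables are placeholder values:

import numpy as np

def make_hash(d, k=8, w=4.0, rng=None):
    # One hash function: k p-stable (Gaussian) projections, h(x) = floor((a.x + b) / w).
    rng = rng or np.random.default_rng(0)
    A = rng.normal(size=(k, d))
    b = rng.uniform(0.0, w, size=k)
    return lambda X: [tuple(key) for key in np.floor((np.atleast_2d(X) @ A.T + b) / w).astype(int)]

def build_tables(X, n_tables=10, k=8, w=4.0, seed=0):
    rng = np.random.default_rng(seed)
    tables = []
    for _ in range(n_tables):
        h = make_hash(X.shape[1], k, w, rng)
        buckets = {}
        for i, key in enumerate(h(X)):
            buckets.setdefault(key, []).append(i)
        tables.append((h, buckets))
    return tables

def approx_nn(tables, X, x0):
    # Candidates = union of the buckets x0 falls into; return the closest candidate (or None).
    cand = set()
    for h, buckets in tables:
        cand.update(buckets.get(h(x0)[0], []))
    if not cand:
        return None
    cand = np.array(sorted(cand))
    return cand[np.argmin(np.linalg.norm(X[cand] - x0, axis=1))]

A query inspects only the points that share a bucket with x_0 in at least one table, which is where the sublinear search times quoted above come from.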

Gaussians: full covariance

• Both classes: f_c(x) = N(x; µ_c, σ²I).
• Bayes risk: R* = ½ [1 − erf( √2 ‖µ_1 − µ_2‖ / (4σ) )].
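A quick numerical check of this closed form (a sketch under the stated model with equal priors; the dimension and means below are arbitrary):

import numpy as np
from math import erf, sqrt

def bayes_risk_closed_form(mu1, mu2, sigma):
    # R* = (1/2) [1 - erf( sqrt(2) ||mu1 - mu2|| / (4 sigma) )]
    return 0.5 * (1.0 - erf(sqrt(2.0) * np.linalg.norm(mu1 - mu2) / (4.0 * sigma)))

def bayes_risk_monte_carlo(mu1, mu2, sigma, n=200000, seed=0):
    # Equal priors: draw y, then x | y, and classify by the nearer mean (the Bayes rule here).
    rng = np.random.default_rng(seed)
    y = rng.integers(1, 3, size=n)
    x = np.where((y == 1)[:, None], mu1, mu2) + sigma * rng.normal(size=(n, mu1.size))
    yhat = np.where(np.linalg.norm(x - mu1, axis=1) <= np.linalg.norm(x - mu2, axis=1), 1, 2)
    return float(np.mean(yhat != y))

mu1, mu2 = np.zeros(8), np.full(8, 0.5)
print(bayes_risk_closed_form(mu1, mu2, 1.0))   # ~0.24
print(bayes_risk_monte_carlo(mu1, mu2, 1.0))   # agrees up to Monte Carlo error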

• We consider binary classification problems: data X_N = {(x_i, y_i)}_{i=1}^N drawn i.i.d. from f(x, y) over R^d × {1, 2}.
• Priors: p_c = Pr(y = c).
• Compound density: f(x) = p_1 f_1(x) + p_2 f_2(x).


• The precise model by which “real” algorithms like LSH choose x'_0 is not known. Empirically the choice seems to be biased towards lower ‖x_0 − x'_0‖.

[Figure: P_B; R* = 0.07, panels at ε = 0.10 and 0.25; curves for N = 100 to 500,000.]

The computational model of ε-NN


• NN classifier: find x*_0 ∈ X_N such that

  ρ = ‖x_0 − x*_0‖ = min_{x ∈ X_N} ‖x_0 − x‖,

  and predict ŷ_0 := y*_0.

• ε-NN classifier: ŷ_0 := y'_0, where x'_0 is any point of X_N with

  ‖x_0 − x'_0‖ ≤ (1 + ε)ρ.

  Note: the random variable ρ depends on f, N and x_0.

• We assume that the classifier selects one of the L points of X_N lying in B_{x_0}((1 + ε)ρ) uniformly at random (probability 1/L each), and uses its label as the prediction ŷ_0.

[Figure: a test point x_0, its exact NN x_1 = x*_0, and further points x_2, …, x_7 inside B_{x_0}((1 + ε)ρ). With probability 1/7, x'_0 = x_i, for each i = 1, …, 7; equivalently, with probability 6/7, x'_0 ∼ f(x | R), where R is the ring B_{x_0}((1 + ε)ρ) \ B_{x_0}(ρ).]
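A minimal sketch of both rules, using brute-force search and the uniform-choice assumption just stated (this is the simplifying model, not the behavior of any specific approximate-search implementation):

import numpy as np

def nn_predict(X, y, x0):
    # Exact NN: predict the label of the closest training point.
    dist = np.linalg.norm(X - x0, axis=1)
    return y[np.argmin(dist)]

def eps_nn_predict(X, y, x0, eps, rng=np.random.default_rng(0)):
    # epsilon-NN under the simplified model: rho is the exact NN distance,
    # and the returned point is uniform over the L points within (1 + eps) * rho.
    dist = np.linalg.norm(X - x0, axis=1)
    rho = dist.min()
    candidates = np.flatnonzero(dist <= (1.0 + eps) * rho)
    return y[rng.choice(candidates)]

With ε = 0 the two rules coincide (up to ties).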


• Accuracy results (not shown) reflect this: even with ε = 0.1 and N = 500, most of the training set is included in B_{x_0}((1 + ε)ρ), and the classifier is reduced to guessing.

• Bayes risk: R* = E_{x_0}[R*(x_0)].

1. Draw a test point (x_0, y_0) ∼ f(x, y).

• NN risk for an N-sample: R_N = E_{x_0, X_N}[R(x_0; X_N)].

2. Draw the distance to the NN, ρ ∼ p(ρ | x_0, N; f). This defines the probability mass P_B = ∫_{B_{x_0}((1+ε)ρ)} f(x) dx.

3. Draw L' from Binomial(N − 1, P_B); L = L' + 1 is then the number of ε-NNs of x_0 (including x*_0).

• Cover and Hart's asymptotic bound [3]:

  R_∞ ≤ 2R*(1 − R*).

Key idea of the proof: lim_{N→∞} ρ(N) = 0, and so y*_0 ∼ η(x_0). Then R_∞(x_0) = 2η(1 − η), and the inequality follows by taking the expectation (and accounting for the variance term).
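Written out (the standard argument, restated here for completeness), with r(x_0) := min{η, 1 − η} = R*(x_0), so that 2η(1 − η) = 2r(1 − r):

\[
R_\infty = \mathbb{E}_{x_0}\bigl[2\eta(1-\eta)\bigr]
         = 2\,\mathbb{E}[r] - 2\,\mathbb{E}[r^2]
         = 2R^{*} - 2\bigl(\operatorname{Var}[r] + (R^{*})^{2}\bigr)
         \le 2R^{*}(1 - R^{*}).
\]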


• Thus, we can extend Cover's asymptotic result to ε-NN: R^ε_∞ = R_∞.

[Figure: P_B vs. ε; SNR = 12.5 dB; curves for N = 100 to 500,000.]

• 10-class problem (does not exactly comply with the assumptions here); N = 60,000.

[Figures: panels over d = 4 to 256 for ε = 0.25, 0.50, 1.00; and R^ε_{100000} vs. d (20 ≤ d ≤ 120) for R* = 0.07, 0.05, 0.01, with curves for exact NN, ε = 0.10, 0.25, 0.50, 1.00, the Bayes risk, and Cover's bound.]

• When the intrinsic dimension of the data is high, ε-NN becomes meaningless even for small ε.

• For small ε there may be a small gain in performance; we conjecture this is due to the reduced variance of the risk.

• Theoretical analysis (current work):
  – Distribution-specific bounds on R^ε_N, similar to [6, 7].
  – Distribution-independent bounds. Quantities of interest: R^ε_N − R_N, R^ε_N / R_N, or (R^ε_N − R_N)/R_∞, as in [5].
  – Adjustment of the overly pessimistic sampling model to a particular search algorithm, e.g. LSH.


References




• When there is low-dimensional structure in the data, using moderate values of ε incurs only a limited loss in accuracy for large N.


Conclusions

• Accuracy of classification, N = 100000


[1] A. Andoni and P. Indyk. New LSH-based algorithm for approximate nearest neighbor. Technical Report MIT-CSAIL-TR-2005-073, MIT, Cambridge, MA, December 2005.


The questions we are interested in: How much worse is R^ε_N compared to R_N? What is the accuracy/speed tradeoff between exact and approximate NN classification?


• More reasonable behavior of P_B:


• If lim_{N→∞} ρ = 0 then also lim_{N→∞} (1 + ε)ρ = 0 (by the dominated convergence theorem).


• The protocol: embed a 5-dimensional Gaussian in a linear subspace of R^d, with added isotropic noise:

  f_c(x) = N( x; µ_c, [I_5 0; 0 0] + σ_n² I ).

• σ_n is set to achieve the desired SNR = 10 log_10( 5 / (d σ_n²) ).
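A sketch of this data-generation protocol (our own illustrative code; the class means and the equal-priors choice are assumptions, and σ_n is derived from the target SNR as above):

import numpy as np

def make_low_dim_gaussians(N, d, mu1, mu2, snr_db, rng=np.random.default_rng(0)):
    # Two classes: 5-dim unit Gaussians embedded in R^d plus isotropic noise.
    # sigma_n is set so that SNR = 10*log10(5 / (d * sigma_n^2)) equals snr_db.
    sigma_n = np.sqrt(5.0 / (d * 10.0 ** (snr_db / 10.0)))
    y = rng.integers(1, 3, size=N)                  # equal priors (assumption)
    X = sigma_n * rng.normal(size=(N, d))           # noise in all d coordinates
    X[:, :5] += rng.normal(size=(N, 5))             # 5-dimensional signal subspace
    X += np.where((y == 1)[:, None], mu1, mu2)      # class means (length-d vectors)
    return X, y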


Asymptotic behavior of ε-NN


6. Draw ŷ_0 from f(y | x'_0).

• Convergence of R_N to R_∞ can be arbitrarily slow [2, 4]; no distribution-independent finite-sample results for R_N are known.


Gaussians: low intrinsic dimension

4. With probability 1/L, the classifier sets x'_0 = x*_0.

5. With probability 1 − 1/L, x'_0 is drawn from f(x | x ∈ B_{x_0}((1 + ε)ρ) \ B_{x_0}(ρ)).


• Conditional Bayes risk: R*(x_0) = min{η, 1 − η}.


For our ongoing theoretical analysis, we use the “inverse” sampling model given by steps 1–6 above; a simulation sketch follows.
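A Monte Carlo sketch of the model for the spherical-Gaussian example (our own illustration: conditioning on a fixed x_0, the rejection sampling in step 5, and the Monte Carlo estimate of P_B are implementation choices, not part of the model itself):

import numpy as np

def simulate_inverse_model(x0, eps, N, mu1, mu2, sigma, rng, n_mc=20000):
    # One draw of (y0, yhat0) conditional on the test point x0 (steps 1-6),
    # for two spherical Gaussians with equal priors.
    d = x0.size
    def sample_f(n):                              # draw from the compound density f(x)
        y = rng.integers(1, 3, size=n)
        return np.where((y == 1)[:, None], mu1, mu2) + sigma * rng.normal(size=(n, d))
    def eta1(x):                                  # posterior Pr(y = 1 | x)
        d1 = np.sum((x - mu1) ** 2); d2 = np.sum((x - mu2) ** 2)
        return 1.0 / (1.0 + np.exp((d1 - d2) / (2.0 * sigma ** 2)))
    y0 = 1 if rng.random() < eta1(x0) else 2      # step 1 (conditional on x0)
    Xa = sample_f(N)                              # auxiliary N-sample: yields a draw of rho
    dists = np.linalg.norm(Xa - x0, axis=1)
    rho, x_star = dists.min(), Xa[np.argmin(dists)]
    Xm = sample_f(n_mc)                           # step 2: Monte Carlo estimate of P_B
    PB = np.mean(np.linalg.norm(Xm - x0, axis=1) <= (1.0 + eps) * rho)
    L = 1 + rng.binomial(N - 1, PB)               # step 3
    if rng.random() < 1.0 / L:                    # step 4: take the exact NN
        x0p = x_star
    else:                                         # step 5: rejection-sample f(x | ring)
        while True:
            xc = sample_f(1)[0]
            if rho < np.linalg.norm(xc - x0) <= (1.0 + eps) * rho:
                x0p = xc
                break
    yhat0 = 1 if rng.random() < eta1(x0p) else 2  # step 6: label drawn at x0'
    return y0, yhat0

Averaging 1{ŷ_0 ≠ y_0} over many draws (and over test points x_0 ∼ f) gives an estimate of R^ε_N.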


Known results



• 45 binary classification problems (all one-vs-one pairs of the 10 classes); N ≈ 12,000 for each.
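A sketch of how these problems can be formed (load_mnist is a hypothetical loader standing in for whatever is actually used):

from itertools import combinations
import numpy as np

X, y = load_mnist()                         # hypothetical loader: X is N x 784, y in {0, ..., 9}
problems = []
for a, b in combinations(range(10), 2):     # C(10, 2) = 45 binary problems
    mask = (y == a) | (y == b)
    problems.append((X[mask], np.where(y[mask] == a, 1, 2)))   # relabel to {1, 2}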


• Test point: (x_0, y_0) ∼ f(x, y).

• Let x*_0 be the exact NN of x_0 in X_N, and let L be the number of x ∈ X_N such that x ∈ B_{x_0}((1 + ε)ρ).


• 28×28 grayscale images of handwritten digits.

A simplifying model used in our experiments:


MNIST data


• Posterior: Pr(y = c | x) = η_c(x); shorthand η ≡ η(x) ≡ η_1(x).


• The mass P_B of B_{x_0}((1 + ε)ρ) grows too fast:

Problem definition


Nearest-neighbor (NN) classifiers are often accurate but prohibitively expensive due to the cost of search. Recently proposed algorithms allow for much faster search at the cost of settling for an approximate, rather than exact, NN. We investigate the effect such approximations have on the classification error.

Experiments


Why use an ε-NN classifier?


Introduction

• A “fair” comparison (with equal computation): ε = .25 vs. ε = 1.



[2] T. M. Cover. Rates of Convergence for Nearest Neighbor Procedures. In Proc. 1st Ann. Hawaii Conf. Systems Theory, pages 413–415, January 1968.


[3] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, January 1967.


[4] L. Devroye. On the inequality of Cover and Hart in nearest neighbor discrimination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3:75–78, 1981.


[5] K. Fukunaga and D. M. Hummels. Bias of nearest neighbor error estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(1):103–112, January 1987.


[6] D. Psaltis, R. R. Snapp, and S. S. Venkatesh. On the Finite Sample Performance of the Nearest Neighbor Classifier. IEEE Transactions on Information Theory, 40(3):820–837, May 1994.

[7] R. R. Snapp and S. S. Venkatesh. Asymptotic derivation of the finite-sample risk of the k nearest neighbor classifier. Technical Report UVM-CS-1998-0101, University of Vermont, Burlington, VT, October 1997.