Distance concentration and detection of meaningless distances

Ata Kabán
School of Computer Science, The University of Birmingham
Birmingham B15 2TT, UK
http://www.cs.bham.ac.uk/~axk

Dagstuhl Seminar 12081 - Information Visualization, Visual Data Mining and Machine Learning 19.02.12 – 24.02.12

Distance concentration

Distance concentration is the phenomenon that, as the data dimensionality increases, all pairwise distances (dissimilarities) may converge to the same value. The resulting lack of contrast between the nearest and the furthest points affects every area where high-dimensional data processing is required, e.g. high-dimensional data analysis, database indexing & retrieval, and statistical machine learning.

Concentration of the L2-norm (Demartines, '94). Let x ∈ R^m be a random vector with i.i.d. components of any distribution. Then,

lim_{m→∞} E[||x||_2] / m^{1/2} = const.;    lim_{m→∞} Var[||x||_2] = const.    (1)
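To make (1) concrete, here is a minimal simulation sketch (Python/NumPy; the Uniform[0,1] components and the sample sizes are illustrative assumptions, not from the talk): both E[||x||_2]/m^{1/2} and Var[||x||_2] should stabilise as m grows.

```python
import numpy as np

# Illustrative check of (1): for i.i.d. components, E[||x||_2]/sqrt(m)
# and Var[||x||_2] both tend to constants as the dimension m grows.
rng = np.random.default_rng(0)
n_samples = 2000

for m in [10, 100, 1000, 5000]:
    X = rng.uniform(size=(n_samples, m))   # i.i.d. Uniform[0,1] components
    norms = np.linalg.norm(X, axis=1)      # L2 norms of the sampled vectors
    print(f"m={m:5d}  E[||x||]/sqrt(m)={norms.mean()/np.sqrt(m):.4f}  "
          f"Var[||x||]={norms.var():.4f}")
```

For Uniform[0,1] components the ratio approaches 1/sqrt(3) ≈ 0.577 and, by the delta method, the variance approaches 1/15 ≈ 0.067.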

Concentration of arbitrary dissimilarity functions in arbitrary multivariate distributions (Beyer et al., '99). Let F_m, m = 1, 2, ..., be an infinite sequence of data distributions and x_1^{(m)}, ..., x_n^{(m)} a random sample of n independent data vectors distributed as F_m. For each m, let ||·|| : dom(F_m) → R^+ be a function that takes a point from the domain of F_m and returns a positive real value. Let p > 0 be an arbitrary positive constant. Assume E[||x^{(m)}||^p] and Var[||x^{(m)}||^p] are finite, with E[||x^{(m)}||^p] ≠ 0.

Theorem [Sufficient conditions] (Beyer et al., '99). If lim_{m→∞} Var[||x^{(m)}||^p] / E[||x^{(m)}||^p]^2 = 0, then

∀ε > 0,  lim_{m→∞} P[ max_{1≤j≤n} ||x_j^{(m)}|| ≤ (1+ε) min_{1≤j≤n} ||x_j^{(m)}|| ] = 1.

[Figure: sample estimate of Var[||x||^2] / E[||x||^2]^2 (top panel) and log(DMAX_m / DMIN_m) (bottom panel), each plotted against the number of dimensions m, from 0 to 700.]
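The figure's quantities are easy to re-derive numerically. A minimal sketch (Python/NumPy; i.i.d. standard Gaussian data and n = 100 points per dimensionality are my illustrative choices):

```python
import numpy as np

# Sample estimate of RV_m(2) and the distance contrast log(DMAX/DMIN)
# versus the dimension m, for i.i.d. standard Gaussian points
# (an illustrative choice; n = 100 points per dimensionality).
rng = np.random.default_rng(1)
n = 100

for m in [10, 50, 200, 700]:
    X = rng.standard_normal((n, m))
    sq = np.sum(X**2, axis=1)                    # squared norms ||x_j||^2
    rv = sq.var() / sq.mean() ** 2               # plug-in estimate of RV_m(2)
    contrast = 0.5 * np.log(sq.max() / sq.min()) # = log(DMAX_m / DMIN_m)
    print(f"m={m:4d}  RV_m(2) ~ {rv:.4f}  log(DMAX/DMIN) = {contrast:.4f}")
```

Both quantities decay towards 0 as m grows, which is the concentration effect the figure illustrates.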

Our research - Roadmap

• → Necessary conditions?
• Which data analysis methods are still suitable in high-dimensional problems?
• How to detect problem cases from the data?

R.J. Durrant and A. Kabán: "When is 'nearest neighbor' meaningful: A converse theorem and implications." Journal of Complexity, 2009, pp. 385-397.

Notation: DMIN_m(n) = min_{1≤j≤n} ||x_j^{(m)}||;  DMAX_m(n) = max_{1≤j≤n} ||x_j^{(m)}||.

Theorem [Necessary conditions] (Durrant & Kabán '09). Assume n is large enough so that E[||x^{(m)}||^p] ∈ [DMIN_m^p(n), DMAX_m^p(n)] holds. Now, if

∀ε > 0,  lim_{m→∞} P{DMAX_m(n) < (1+ε) DMIN_m(n)} = 1,

then lim_{m→∞} Var[||x^{(m)}||^p] / E[||x^{(m)}||^p]^2 = 0.

Definition. RV_m(p) ≡ Var[||x^{(m)}||^p] / E[||x^{(m)}||^p]^2 is called the relative variance with exponent p.
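As a sketch, a plug-in estimate of RV_m(p) from a data matrix is one line of NumPy (the helper name is mine, and the papers' estimator may differ, e.g. with bias corrections):

```python
import numpy as np

def relative_variance(X, p=2):
    """Plug-in estimate of RV_m(p) = Var[||x||^p] / E[||x||^p]^2
    from a sample X with one data point per row."""
    v = np.linalg.norm(X, axis=1) ** p   # ||x_j||^p for each point
    return v.var() / v.mean() ** 2
```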

Is it possible for the converse theorem to apply?

• Notice that RV_m(p) can be rewritten as follows:

RV_m(p) = Var[Σ_{i=1}^m |x_i|^p] / E[Σ_{i=1}^m |x_i|^p]^2 = ( Σ_{i=1}^m Σ_{j=1}^m Cov[|x_i|^p, |x_j|^p] ) / ( Σ_{i=1}^m Σ_{j=1}^m E[|x_i|^p] E[|x_j|^p] )

If the numerator grows no slower than the denominator as m increases, then RV_m(p) will not tend to zero, and the contrast among distances between the data points remains good (see the numerical sketch after this list).

• A growing literature in the data mining domain focuses on studying the problem of distance concentration in i.i.d. data distributions. Hence the picture looks much more pessimistic than it is in reality: real data sources have structure (correlations between features).
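The promised numerical sketch (Python/NumPy; both data models and their parameters are illustrative assumptions): with independent features the covariance terms vanish and the estimated RV_m(2) decays roughly like 1/m, whereas a shared latent factor keeps it bounded away from zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000  # points per experiment

for m in [10, 100, 1000]:
    iid = rng.standard_normal((n, m))             # independent features
    y = rng.uniform(0.0, 1.0, size=(n, 1))        # shared systematic factor
    corr = y + 0.1 * rng.standard_normal((n, m))  # correlated features
    for name, X in (("i.i.d.", iid), ("correlated", corr)):
        v = np.sum(X**2, axis=1)                  # squared Euclidean norms
        print(f"m={m:5d}  {name:11s}  RV_m(2) ~ {v.var() / v.mean()**2:.4f}")
```

For the i.i.d. Gaussian case RV_m(2) = 2/m exactly, so the printed estimates shrink with m; the correlated case stays near a positive constant.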

• It is easy to construct examples with enough feature correlations that distance concentration creates no problems, even for the Euclidean distance.
• However, i.i.d. (or close to i.i.d.) content (i.e. unstructured noise) triggers distance concentration. For example, take:

x_i = a_i y + δ_i,  ∀i = 1, ..., m    (2)

which posits that the x_i are generated from a common systematic factor y, embedded in the high-dimensional space by the parameters a_i, with additive unstructured zero-mean noise δ_i.

Example. y ∼ Uniform[0, b] (so E[y^2] = b^2/3 and Var[y^2] = 4b^4/45); A := lim_{m→∞} a_m^2; δ_i = δ, ∀i = 1, ..., m; and so

lim_{m→∞} RV_m = Var[y^2] / (E[y^2] + E[δ^2]/A)^2.

[Figure 1: surface plot of lim_{m→∞} RV_m (vertical axis, 0 to 0.8) as a function of b (1 to 3) and E[δ^2]/A (0 to 0.8), for y ∼ Unif[0, b] and δ ∼ N(0, s^2).]
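A sketch checking the closed form above (Python/NumPy; I read the δ_i as i.i.d. copies of δ ∼ N(0, s^2) and take constant loadings a_i = sqrt(A), with all parameter values illustrative):

```python
import numpy as np

# Checks lim RV_m = Var[y^2] / (E[y^2] + E[d^2]/A)^2 for the model
# x_i = a_i*y + d_i. Assumptions (mine): a_i = sqrt(A) for every i,
# and the noise coordinates d_i are i.i.d. N(0, s^2).
rng = np.random.default_rng(3)
b, s, A = 2.0, 0.5, 1.0        # illustrative parameter choices
n, m = 5000, 2000              # sample size and dimensionality

y = rng.uniform(0.0, b, size=(n, 1))                  # shared factor
X = np.sqrt(A) * y + s * rng.standard_normal((n, m))  # model (2)

v = np.sum(X**2, axis=1)                              # ||x||^2
rv_emp = v.var() / v.mean() ** 2

Ey2, Vy2, Ed2 = b**2 / 3.0, 4.0 * b**4 / 45.0, s**2
rv_lim = Vy2 / (Ey2 + Ed2 / A) ** 2
print(f"empirical RV_m(2) = {rv_emp:.3f}   closed-form limit = {rv_lim:.3f}")
```

With these parameters both numbers come out near 0.57, i.e. far from zero: the shared factor prevents concentration.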

• Under what conditions would distance concentration not occur?
• → Which data analysis methods are still suitable in high-dimensional problems?
• How to detect problem cases from the data?

A. Kabán: "On the Distance Concentration Awareness of Certain Data Reduction Techniques." Pattern Recognition, Vol. 44, Issue 2, Feb 2011, pp. 265-277.

• Under what conditions would distance concentration not occur?
• Which data analysis methods are still suitable in high-dimensional problems?
• → How to detect problem cases from the data?

A. Kabán: "Non-parametric Detection of Meaningless Distances in High-Dimensional Data." Statistics and Computing, 22(1): 375-385, 2012.

• So far we characterised the phenomenon of distance concentration in the limit of infinite dimensions (Beyer et al. '99; Durrant & Kabán '09).
• However, real data is finite dimensional, and hence the infinite-dimensional characterisation is insufficient.
• Our analysis so far was model-based: the conclusions hold when the model holds.
• We need to quantify the phenomenon more precisely, for high- but still finite-dimensional data spaces, by bounding the tails of the probability that distances are concentrated. This can be used to test & detect problematic cases at the desired confidence level.

Given an m-dimensional data set of size r, drawn i.i.d. from F_m, a dissimilarity function ||·||, and an ε > 0, what is P[DMAX_m(n) < (1+ε) DMIN_m(n)]? That is, how likely is it that, in an arbitrary sample of size n drawn i.i.d. from F_m, the largest distance is within a factor (1+ε) of the smallest one?

F_m is unknown, and not factorisable; and m is finite... One could get an estimate of RV_m(p) from the data set, but how 'small' should that be to conclude that the above probability is 'large'?

In this work, we lower bound the above probability:
- first, as if E[||x^{(m)}||^p] and Var[||x^{(m)}||^p] were known;
- then refining the result to use estimates from the available data set of size r.

Theorem 1. Let x_1^{(m)}, ..., x_n^{(m)} ∼ i.i.d. F_m be a random sample, n ≥ 2, and let DMIN_m(n) = min_{1≤j≤n} ||x_j^{(m)}|| and DMAX_m(n) = max_{1≤j≤n} ||x_j^{(m)}||. Then, ∀ε > 0,

P{DMAX_m(n) < (1+ε) DMIN_m(n)} ≥ [ ( 1 − ( 2/((1+ε)^p − 1) + 1 )^2 RV_m(p) )_+ ]^n    (3)

where (u)_+ = max(0, u), and RV_m(p) ≡ Var[||x^{(m)}||^p] / E[||x^{(m)}||^p]^2 is the relative variance with exponent p.

Corollaries. The previous results for infinite dimensions, (Beyer et al. '99) and (Durrant & Kabán '09), can both be derived as corollaries of Theorem 1.
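Evaluating the bound (3) is a one-liner; a sketch (Python; the function name and the example values are mine):

```python
def concentration_lower_bound(rv, n, eps, p=2):
    """Lower bound (3) on P{DMAX_m(n) < (1+eps) * DMIN_m(n)},
    given the relative variance rv = RV_m(p)."""
    c = 2.0 / ((1.0 + eps) ** p - 1.0) + 1.0
    return max(0.0, 1.0 - c * c * rv) ** n

# e.g. a tiny relative variance forces near-certain concentration:
print(concentration_lower_bound(rv=1e-4, n=10, eps=0.1, p=2))  # ~0.89
```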

Theorem 2. Let x_1^{(m)}, ..., x_n^{(m)} ∼ i.i.d. F_m be a random sample, n ≥ 2, with DMIN_m(n) = min_{1≤j≤n} ||x_j^{(m)}|| and DMAX_m(n) = max_{1≤j≤n} ||x_j^{(m)}|| as before, and let y_1^{(m)}, ..., y_r^{(m)} ∼ i.i.d. F_m be a data sample, r ≥ 2. Assume P(||x_1|| = ··· = ||x_n|| = ||y_1|| = ··· = ||y_r||) = 0, and assume nothing (not even finiteness of the mean and variance) of F_m. Then,

P{DMAX_m(n) < (1+ε) DMIN_m(n)} ≥ [ ( 1 − ( 2/((1+ε)^p − 1) + 1 )^2 · ((r^2 − 1)/r^2) · RV̄_{m,r}(p) − 1/r )_+ ]^n    (4)

where RV̄_{m,r}(p) is the relative variance estimated from the data set y_1, ..., y_r.

Obs. As expected, the r.h.s. of (4) recovers that of (3) in the limit r → ∞, i.e. if the observed sample used to estimate RV̄_{m,r}(p) is infinitely large.

Turning the probability bound into a statistical test

The null hypothesis is that the dissimilarity function ||·|| suffers from concentration. Setting the probability bound to 1 − δ, the desired confidence level, and solving for RV̄_{m,r} yields an upper bound on RV̄_{m,r}:

RV̄_{m,r} ≤ ( r^2 / (r^2 − 1) ) × ( 1 − (1−δ)^{1/n} − 1/r ) / ( 2/((1+ε)^p − 1) + 1 )^2    (5)

This quantifies how small the estimated relative variance needs to be in order to confirm, with confidence 1 − δ, that in an arbitrary sample of size n the largest distance would be within a factor (1+ε) of the smallest one. It is a function of the available sample size r, the size n of the unseen sample that we are reasoning about, ε, δ (and p).
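The test threshold (5) in code, as a sketch (Python; naming is mine). An estimated RV̄_{m,r} below this value confirms concentration at confidence 1 − δ:

```python
def rv_threshold(r, n, eps, delta, p=2):
    """Upper bound (5) on the estimated relative variance: values below
    this confirm concentration at confidence 1 - delta."""
    c = 2.0 / ((1.0 + eps) ** p - 1.0) + 1.0
    return (r**2 / (r**2 - 1.0)) * (1.0 - (1.0 - delta) ** (1.0 / n) - 1.0 / r) / c**2

# e.g. with r = 5000 observed points, testing for n = 10 unseen points:
print(rv_threshold(r=5000, n=10, eps=0.5, delta=0.05, p=1))  # ~2e-4
```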

Figure 2: Left: the upper bound on RV̄_{m,r} required in the statistical test, to confirm for n = 10 previously unseen samples, with risk δ = 0.05, on the basis of an observed sample (data set) of size r, that ||·|| suffers from the concentration phenomenon, at various given levels ε. Right: dependence of this bound on ε and δ, when the available data set size is fixed to r = 5000.

Turning the probability bound into a statistical test (cont'd)

We can also ask: what is the smallest ε for which the test statistic RV̄_{m,r} confirms, with the desired confidence 1 − δ, that in an arbitrary sample of size n the largest distance would be within such an ε of the smallest one? This provides a notion of 'degree of concentration'.

Note on the required sample size r: For (5) to be meaningful we need the observed sample size r ≥ 1/(1 − (1−δ)^{1/n}), which is in O(n) since lim_{n→∞} n(1 − (1−δ)^{1/n}) = ln(1/(1−δ)). This is reasonable, since we try to conclude about n previously unseen cases based on r observed ones.

Figure 3: Dependence of the minimum required sample size (r, vertical axis, up to 9000) on the size of the unseen sample that we want to reason about (n, horizontal axis, 10 to 90), for risk levels δ ∈ {0.01, 0.05, 0.1, 0.5}. As one would expect, the smaller the allowed risk, the more samples are required.
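The curves in Figure 3 follow directly from the positivity requirement on the right-hand side of (5); a sketch (Python; the function name is mine):

```python
import math

def min_required_r(n, delta):
    """Smallest observed sample size r making the r.h.s. of (5) positive,
    i.e. the smallest integer r with r > 1 / (1 - (1 - delta)**(1/n))."""
    return math.floor(1.0 / (1.0 - (1.0 - delta) ** (1.0 / n))) + 1

for delta in (0.01, 0.05, 0.1, 0.5):
    print(f"delta={delta}: " +
          ", ".join(f"n={n}: r>={min_required_r(n, delta)}" for n in (10, 50, 90)))
```

For δ = 0.01 and n = 90 this gives r ≥ 8956, matching the order of magnitude shown in the figure.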


Testing synthetic data sources
