Stat Comput DOI 10.1007/s11222-011-9229-0

Non-parametric detection of meaningless distances in high dimensional data

Ata Kabán

Received: 30 June 2010 / Accepted: 10 January 2011 © Springer Science+Business Media, LLC 2011

Abstract Distance concentration is the phenomenon that, in certain conditions, the contrast between the nearest and the farthest neighbouring points vanishes as the data dimensionality increases. It affects high dimensional data processing, analysis, retrieval, and indexing, which all rely on some notion of distance or dissimilarity. Previous work has characterised this phenomenon in the limit of infinite dimensions. However, real data is finite dimensional, and hence the infinite-dimensional characterisation is insufficient. Here we quantify the phenomenon more precisely, for the possibly high but finite dimensional case in a distribution-free manner, by bounding the tails of the probability that distances become meaningless. As an application, we show how this can be used to assess the concentration of a given distance function in some unknown data distribution solely on the basis of an available data sample from it. This can be used to test and detect problematic cases more rigorously than it is currently possible, and we demonstrate the working of this approach on both synthetic data and ten real-world data sets from different domains.

Keywords High dimensional data · Curse of dimensionality · Distance concentration · Nearest neighbour · Chebyshev bound · Statistical test

A. Kabán, School of Computer Science, The University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK. e-mail: [email protected]

1 Introduction

Distance concentration, i.e. the lack of contrast between distances to a query point that may occur in high dimensional data spaces, is a long standing problem in database research, since it can compromise similarity search (Aggarwal et al. 2001; Beyer et al. 1999; Pramanik and Li 1999) in high dimensional databases. This is becoming a threat more generally in data analysis, data mining, pattern recognition, and statistical learning from high dimensional data. A better understanding of this phenomenon and its implications is therefore the subject of much recent research (Giannella 2009; Hsu and Chen 2009; Radovanović et al. 2010; François et al. 2007; Durrant and Kabán 2009; Kabán 2011). Previous work (Beyer et al. 1999; Durrant and Kabán 2009) has characterised the phenomenon of distance concentration asymptotically, in the limit of infinite dimensions. This enables analyses of a given distance function in a given data model family, and allows us to identify conditions on the data distribution that matter. For example, the existence of correlations among the features was shown to be a favourable trait (Durrant and Kabán 2009), whereas weakly dependent or independent features lead to meaningless distances in high dimensions. Another stream of research seeks to alleviate the problem by devising dissimilarity functions that suffer less in a worst-case scenario, e.g. on i.i.d. uniformly distributed features (François et al. 2007; Hsu and Chen 2009). However, neither approach is in direct connection with real-world data. The model based approach derives conclusions that hold if the model holds. The studies under the worst-case scenario are unrepresentative since they ignore many benign factors that are often present in real data sources. Indeed, if distance concentration were as severe in practice as in the theory that assumes i.i.d. features or weak correlations, then (cf. e.g. the results in Beyer et al. 1999) learning from data beyond ca. 10 dimensions would be impossible. What is overlooked is that distance concentration depends on two ingredients: the distance function and the data distribution.


Data typically has some structure, and this varies a lot across data sources. Hence, it is not very useful in practice to only talk about distance concentration without reference to the particular data source. To partly address this, a large number of possible indices and scores have been put forth (Hsu and Chen 2009) to estimate the extent of the problem for particular data sets. While these are indicative to some extent, they all have advantages and disadvantages, and there is no clear way to interpret them objectively in a unified way. In this paper we generalise the previous asymptotic results in Beyer et al. (1999), Durrant and Kabán (2009), Hsu and Chen (2009), and give a finite-dimensional characterisation of the distance concentration phenomenon that recovers the previous results in the limit of infinite dimensions. We do this by bounding the tails of the probability that distances from a reference query point concentrate, while we maintain the level of generality of Beyer et al. (1999) and Durrant and Kabán (2009), i.e. we assume nothing about the data distribution. We then further extend our approach to a version that only requires sample estimates—which we subsequently turn into a statistical test suitable for assessing distance concentration problems on the basis of a finite dimensional data sample. The next subsection introduces some background that will be built upon in the subsequent sections.

1.1 Preliminaries and previous results

Let F_m, m = 1, 2, ..., be an infinite sequence of multivariate distributions, and let x_1^(m), ..., x_n^(m) ∼_i.i.d. F_m be a random sample drawn i.i.d. from F_m. For each m, let ‖·‖ : dom(F_m) → R_+ be a positive valued function on the domain of F_m. We will denote DMIN_m(n) = min_{1≤j≤n} ‖x_j^(m)‖ and DMAX_m(n) = max_{1≤j≤n} ‖x_j^(m)‖. The function ‖·‖ may be interpreted as a dissimilarity (or distance metric, though it does not have to satisfy the properties of a metric). The positive integer m may be interpreted as the dimensionality of the data space. It will be assumed that, for any fixed constant p > 0, E[‖x^(m)‖^p] and Var[‖x^(m)‖^p] are finite and E[‖x^(m)‖^p] ≠ 0. This will be relaxed in a later section. No other assumption is made about the distribution F_m; in particular, it is not assumed to be factorisable, since data features are typically not independent of each other.

Theorem (Beyer et al. 1999) Let p > 0 be an arbitrary positive constant. If

$$\lim_{m\to\infty} \frac{\mathrm{Var}[\|x^{(m)}\|^p]}{E[\|x^{(m)}\|^p]^2} = 0,$$

then ∀ε > 0, lim_{m→∞} P[DMAX_m(n) < (1 + ε) DMIN_m(n)] = 1; where the operators E[·] and Var[·] refer to the theoretical expectation and variance of the distributions F_m, and the probability on the r.h.s. is over the random sample of size n drawn from F_m.

Converse Theorem (Durrant and Kabán 2009) Assume that n is large enough so that it holds that E[‖x^(m)‖^p] ∈ [DMIN_m^p(n), DMAX_m^p(n)], where p > 0 is an arbitrary fixed positive constant. If lim_{m→∞} P{DMAX_m(n) < (1 + ε) DMIN_m(n)} = 1, ∀ε > 0, then

$$\lim_{m\to\infty} \frac{\mathrm{Var}[\|x^{(m)}\|^p]}{E[\|x^{(m)}\|^p]^2} = 0.$$

For some p > 0, the infinite sequence

$$RV_m(p) \equiv \frac{\mathrm{Var}[\|x^{(m)}\|^p]}{E[\|x^{(m)}\|^p]^2}, \quad m = 1, 2, \dots,$$

is termed the sequence of relative variances, and RV_m(p) is the m-th term of this sequence. The above two results establish that the phenomenon that the contrast between the smallest and the largest distance between sample points vanishes as m → ∞ is equivalent to the condition that the relative variance of the distance distribution converges to zero. This completes the characterisation of the phenomenon of distance concentration in the limit of infinite dimensions. However, real data sets are possibly high but still finite dimensional. Hence, in order to assess whether or not a given dissimilarity function suffers from the distance concentration problem on some data distribution, the asymptotic characterisation is insufficient. Among the many possible indicator statistics proposed in Hsu and Chen (2009), the sample estimate of RV_m(p) has been found useful to indicate the severity of the distance concentration issue in the case of high dimensional data sets. Though, it remains largely unclear how exactly it relates to the probability that distances are concentrated in some finite m dimensions. That is, how 'small' should this estimate be in order to conclude that P[DMAX_m(n) < (1 + ε) DMIN_m(n)] is 'large'?

The approach we take in this paper is to lower bound P[DMAX_m(n) < (1 + ε) DMIN_m(n)], given a dissimilarity function and an ε, for an arbitrary sample of size n in finite m dimensions. As in François et al. (2007), the query point will be taken as the origin for the ease of exposition, without loss of generality. To avoid making assumptions on the data distribution F_m, so as to keep the generality of the previous asymptotic results, we will employ the distribution-free bound of Chebyshev. As we shall see, this allows us to express the lower bound as a function of the relative variance. We will also show that the existing asymptotic results can be readily recovered from our bound when letting m → ∞. Furthermore, noting that F_m is unknown in practice, while we have access to an observed sample from it, we refine our result to only require estimates from the available data set. This refinement also relaxes the finite moments assumption made earlier. Finally, we use the resulting bound to detect problematic cases with a pre-specified confidence.
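To make the relative variance criterion concrete, the short sketch below (ours, not part of the paper; it assumes NumPy and uses Euclidean distances to the origin as the query point) estimates RV_m(p) from a sample with i.i.d. features. For such data, Var[‖x‖²] = m·Var[x_1²] while E[‖x‖²]² = m²·E[x_1²]², so RV_m(2) decays like 1/m, which is exactly the regime where distances become meaningless.

```python
import numpy as np

def relative_variance(dists, p=1.0):
    """Sample relative variance of distances raised to the power p: Var(d^p) / mean(d^p)^2."""
    dp = dists ** p
    return dp.var(ddof=1) / dp.mean() ** 2

rng = np.random.default_rng(0)
for m in (10, 100, 1000, 10000):
    x = rng.uniform(size=(2000, m))        # i.i.d. uniform features: a worst-case distribution
    d = np.linalg.norm(x, axis=1)          # Euclidean distances to the origin (the query point)
    print(m, relative_variance(d, p=2))    # shrinks roughly like 1/m
```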


2 Bounding the probability of distance concentration

In this section we bound the tails of the probability that distances are concentrated.

Theorem 1 Let x_i^(m), i = 1, ..., n, be independently drawn m-dimensional sample points from some distribution F_m, and denote DMIN_m(n) = min_{1≤j≤n} ‖x_j^(m)‖ and DMAX_m(n) = max_{1≤j≤n} ‖x_j^(m)‖. Then,

$$P\{DMAX_m(n) < (1+\epsilon)\, DMIN_m(n)\} \ \ge\ \left\{\left(1 - \left(\frac{2}{(1+\epsilon)^p - 1} + 1\right)^2 RV_m(p)\right)_+\right\}^n \quad (1)$$

where (u)_+ = max(0, u).

Proof Since no structural assumption was made on F_m and the resulting distribution of distances, we employ a distribution-free bound, namely Chebyshev's inequality, to bound the probability that ‖x‖^p is close to its expectation E(‖x‖^p):

$$P\left[\left|\frac{\|x^{(m)}\|^p}{E(\|x^{(m)}\|^p)} - 1\right| < \epsilon\right] \ \ge\ 1 - \frac{1}{\epsilon^2} RV_m(p) \quad (2)$$

It follows, since the sample points are drawn i.i.d., that, ∀ε > 0:

$$P\left[\forall j = 1, \dots, n:\ \left|\frac{\|x_j^{(m)}\|^p}{E(\|x^{(m)}\|^p)} - 1\right| < \epsilon\right] \ \ge\ \left\{\left(1 - \frac{1}{\epsilon^2} RV_m(p)\right)_+\right\}^n \quad (3)$$

In the above, (·)_+ denotes capping possible negative values to zero, to avoid confusion when n is even and the expression in the bracket goes negative. Thus, we have with probability {(1 − (1/ε²) RV_m(p))_+}^n that the following holds simultaneously for all the n points:

$$\left|\|x_j^{(m)}\|^p - E(\|x^{(m)}\|^p)\right| < \epsilon \cdot E(\|x^{(m)}\|^p), \quad \forall x_j^{(m)},\ j = 1, \dots, n \quad (4)$$

or equivalently,

$$(1-\epsilon)\, E(\|x^{(m)}\|^p) < \|x_j^{(m)}\|^p < (1+\epsilon)\, E(\|x^{(m)}\|^p), \quad \forall x_j^{(m)},\ j = 1, \dots, n \quad (5)$$

Therefore (4)–(5) must hold also for DMAX_m^p and DMIN_m^p. We compute:

$$DMAX_m^p(n) - DMIN_m^p(n) = \left|DMAX_m^p(n) - DMIN_m^p(n)\right| \le \left|DMAX_m^p(n) - E(\|x^{(m)}\|^p)\right| + \left|E(\|x^{(m)}\|^p) - DMIN_m^p(n)\right| \quad (6)$$
$$< 2\epsilon \cdot E(\|x^{(m)}\|^p) \quad (7)$$
$$< \frac{2\epsilon}{1-\epsilon}\, DMIN_m^p(n) \quad (8)$$

where (6) follows from the triangle inequality, in (7) we applied (4) by instantiating ‖x^(m)‖ as DMAX_m(n) and DMIN_m(n), and (8) follows from applying the l.h.s. inequality of (5) to DMIN_m(n). In consequence, (8) holds with probability at least {(1 − (1/ε²) RV_m(p))_+}^n. Rearranging, we have (see footnote 1):

$$P\left\{DMAX_m^p(n) < \left(1 + \frac{2\epsilon}{1-\epsilon}\right) DMIN_m^p(n)\right\} \ \ge\ \left\{\left(1 - \frac{1}{\epsilon^2} RV_m(p)\right)_+\right\}^n \quad (9)$$

Now, setting ε' := 2ε/(1 − ε), and solving for ε we get ε = ε'/(2 + ε'). Replacing this into the r.h.s. of (9), and renaming ε' to ε, yields:

$$P\{DMAX_m^p(n) < (1+\epsilon)\, DMIN_m^p(n)\} \ \ge\ \left\{\left(1 - \left(\frac{2}{\epsilon} + 1\right)^2 RV_m(p)\right)_+\right\}^n \quad (10)$$

Finally, using that (·)^{1/p} on the positive domain is an increasing function, we set 1 + ε' := (1 + ε)^{1/p} to find ε = (1 + ε')^p − 1, which replaced into the r.h.s. of (10), and again renaming ε' to ε, gives the result stated in (1). □

Footnote 1: Alternatively, (9) could have been obtained more directly, noting that (5) implies P{DMAX_m^p(n)/DMIN_m^p(n) < (1+ε)/(1−ε)} ≥ {(1 − (1/ε²) RV_m(p))_+}^n; however, we would then need to exclude the elements of the sequence for which DMIN_m(n) = 0.

Fig. 1 The probability bound of Theorem 1, as a function of the relevant parameters

Theorem 1 provides a finite-dimensional characterisation of the distance concentration phenomenon that is also distribution-free. The lower bound on the probability of distance concentration is given as a function of the relative variance and its exponent, the radius or contrast parameter ε, and the number of distances n that entered the maximum DMAX_m(n) and the minimum DMIN_m(n). The behaviour of the obtained bound (1) is illustrated in Fig. 1. Most importantly, notice that it tightens as RV_m(p) decreases—which makes it useful in the critical problematic cases. As expected, it also tightens when ε is increased, or when n is decreased. Further, one may also observe that the bound (1) would tighten with decreasing p if RV_m(p) were constant—however, in general RV_m(p) varies with p, hence the use of a small p, i.e. "fractional norms" defined as

$$\|x\|_p = \left(\sum_{i=1}^m |x_i|^p\right)^{1/p} \quad \text{with } p \in (0, 1),$$

only reduces the concentration effect in certain cases (e.g. in the case of a uniform data distribution) but not always—in agreement with empirical results reported in François et al. (2007).

The next section shows that the two previous asymptotic results (Beyer et al. 1999; Durrant and Kabán 2009), given in the previous section, can be obtained as corollaries of Theorem 1. This is of interest not only to see that Theorem 1 subsumes these results, but the alternative proof will reveal the additional insight that the bound (1) is in fact tight when P[DMAX_m(n) < (1 + ε) DMIN_m(n)] is large, i.e. when the probability of distance concentration is large—which is exactly the situation of our interest.

2.1 Corollaries: Recovering previous asymptotic results

Corollary 1 (Theorem of Beyer et al. 1999)

Proof From (1) in Theorem 1, we have that, if RV_m(p) → 0 as m → ∞, then the r.h.s. converges to 1 (irrespective of n), and consequently the probability on the l.h.s. must also converge to 1. □

Corollary 2 (Converse of the theorem of Beyer et al.; Durrant and Kabán 2009)

Proof From the precondition

$$E[\|x^{(m)}\|^p] \in [DMIN_m^p(n),\ DMAX_m^p(n)] \quad (11)$$

we have that:

$$\left|\|x^{(m)}\|^p - E(\|x^{(m)}\|^p)\right| \le DMAX_m^p(n) - DMIN_m^p(n) \quad (12)$$

Denote by 'A' the event

$$DMAX_m^p(n) < (1+\epsilon)\, DMIN_m^p(n) \quad (14)$$

or, equivalently,

$$\left|DMAX_m^p(n) - DMIN_m^p(n)\right| < \epsilon \cdot DMIN_m^p(n)$$

Combining this with (12) gives, by transitivity:

$$\left|\|x^{(m)}\|^p - E(\|x^{(m)}\|^p)\right| < \epsilon \cdot DMIN_m^p(n) \le \epsilon \cdot E(\|x^{(m)}\|^p) \quad (15)$$

where in the last inequality we used the precondition (11). This event (15), denoted by 'B' in the sequel, can be rewritten equivalently as:

$$\left|\frac{\|x^{(m)}\|^p}{E(\|x^{(m)}\|^p)} - 1\right| < \epsilon$$

Now denote by RV_{m,r}(p) the relative variance estimated from an observed data sample of size r, for some p > 0 as before. Using this, the sample-estimated analogue of Theorem 1 bounds the probability of distance concentration for n randomly drawn points (again, not in the observed sample of r points).

Theorem 2 Let x_1^(m), ..., x_n^(m) ∼_i.i.d. F_m be a random sample, n ≥ 2, with DMIN_m(n) = min_{1≤j≤n} ‖x_j^(m)‖ and DMAX_m(n) = max_{1≤j≤n} ‖x_j^(m)‖ as before, and let y_1^(m), ..., y_r^(m) ∼_i.i.d. F_m be an observed data sample, r ≥ 2. Assume P(‖x_1‖ = ··· = ‖x_n‖ = ‖y_1‖ = ··· = ‖y_r‖) = 0, and make no assumptions about F_m. Then,

$$P\{DMAX_m(n) < (1+\epsilon)\, DMIN_m(n)\} \ \ge\ \left\{\left(1 - \frac{1}{r} - \left(\frac{2}{(1+\epsilon)^p - 1} + 1\right)^2 \frac{r^2 - 1}{r^2}\, \widehat{RV}_{m,r}(p)\right)_+\right\}^n \quad (21)$$

where RV_{m,r}(p) is the relative variance estimated from the data set y_1, ..., y_r. As expected, the r.h.s. of (21) recovers the form of (1) in the limit when r → ∞. Yet, it requires no knowledge of the theoretical mean and variance of the data distribution, and not even their finiteness.

Corollary 3 (Sample estimated version of Beyer's theorem) For any fixed constant p > 0, if lim_{m→∞, r→∞} RV_{m,r}(p) = 0, then, ∀ε > 0, lim_{m→∞, r→∞} P{DMAX_m(n) ≤ (1 + ε) DMIN_m(n)} = 1.

Corollary 4 (Sample estimated version of the converse of Beyer's theorem) Assume that (1/r) Σ_{i=1}^r ‖x_i^(m)‖^p ∈ [DMIN_m^p(n), DMAX_m^p(n)] holds, where p > 0 is a fixed constant. Now, if lim_{m→∞, r→∞} P{DMAX_m(n) < (1 + ε) DMIN_m(n)} = 1, ∀ε > 0, then lim_{m→∞, r→∞} RV_{m,r}(p) = 0.
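As a concrete illustration of how Theorem 2 can be evaluated in practice, the sketch below is our own illustration rather than code from the paper (the function names are made up): it estimates RV_{m,r}(p) from the observed dissimilarities to a query point and plugs the estimate into the right-hand side of (21).

```python
import numpy as np

def estimated_rv(dists, p=1.0):
    """Sample relative variance of the observed dissimilarities raised to the power p."""
    dp = np.asarray(dists, dtype=float) ** p
    return dp.var(ddof=1) / dp.mean() ** 2

def theorem2_lower_bound(dists, eps, p=1.0, n=2):
    """Sample-estimated lower bound (21) on P{DMAX_m(n) < (1+eps) * DMIN_m(n)}."""
    r = len(dists)
    c = 2.0 / ((1.0 + eps) ** p - 1.0) + 1.0
    rv_hat = estimated_rv(dists, p)
    inner = max(0.0, 1.0 - 1.0 / r - c ** 2 * (r ** 2 - 1) / r ** 2 * rv_hat)
    return inner ** n

# Example: dissimilarities from a query point (taken as the origin) to r observed points
rng = np.random.default_rng(1)
Y = rng.uniform(size=(1000, 500))        # r = 1000 points in m = 500 dimensions
d = np.linalg.norm(Y, axis=1)            # Euclidean dissimilarities to the query
print(theorem2_lower_bound(d, eps=0.2, p=1.0, n=2))
```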

4 Application: Assessing distance concentration problems in data sets

The bound (21) can now be turned into a statistical test to assess whether the data presents sufficient evidence to conclude that a dissimilarity function ‖·‖ of interest suffers from concentration in the distribution represented by the available data set. This can be of use to detect problematic data sets and distance functions in practice in a more rigorous, quantitative manner than has been possible so far. Employing (21), we can test, on the basis of r observed dissimilarities from some fixed query point (see footnote 3), to some user-defined risk parameter δ, whether in an arbitrary sample of size n, drawn from the same unknown data distribution as the observed r points, the nearest distance to the query point would be within some ε of the farthest one. Setting the r.h.s. of (21) greater than or equal to 1 − δ, where δ is the allowed risk, we obtain the following upper bound for the estimated relative variance:

$$\widehat{RV}_{m,r}(p) \ \le\ \frac{\left(1 - (1-\delta)^{1/n} - \frac{1}{r}\right) r^2}{(r^2 - 1)\left(\frac{2}{(1+\epsilon)^p - 1} + 1\right)^2} \quad (22)$$
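The algebra that takes (21) to (22) is elementary but worth spelling out; the following is our own unpacking of that step (writing C = 2/((1+ε)^p − 1) + 1 for brevity, and considering the non-trivial case where the bracketed quantity in (21) is non-negative), not text from the paper.

$$\left(1 - \frac{1}{r} - C^2\,\frac{r^2-1}{r^2}\,\widehat{RV}_{m,r}(p)\right)^n \ge 1 - \delta \;\Longleftrightarrow\; 1 - \frac{1}{r} - C^2\,\frac{r^2-1}{r^2}\,\widehat{RV}_{m,r}(p) \ge (1-\delta)^{1/n}$$

$$\Longleftrightarrow\; \widehat{RV}_{m,r}(p) \ \le\ \frac{\left(1 - (1-\delta)^{1/n} - \frac{1}{r}\right) r^2}{(r^2-1)\, C^2},$$

which is exactly (22); requiring the numerator to be non-negative is also what yields the minimum sample size condition discussed in Sect. 4.1.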

Figure 2 displays the behaviour of this upper bound, for n = 2 and for n = 10, when varying r, ε and δ. One can observe, as somewhat expected, that small non-zero estimates of the relative variance are indicative of the presence of the distance concentration problem—and we have now quantified this rigorously. We can see that the thresholds (i.e. the upper bounds on RV_{m,r}(p)) are smaller for a larger n, since a larger n signifies a stronger notion of distance concentration (namely that any n randomly drawn points from the distribution represented by the observed sample of size r is guaranteed to have its nearest and farthest distance from the query point within ε of each other). Further, as one would expect, when only a small sample size r is available, the RV_{m,r}(p) estimate required to confirm the same ε-concentration (i.e. having a deviation or contrast parameter of ε) with confidence 1 − δ must be even smaller than in the case of a large sample size r—this is because a poor estimate of the relative variance is less conclusive than a more accurate one. Though, we can also observe from the left hand plots that the variation of the bound with r levels off at a reasonable sample size of ca. 1000, and even much lower r still produces meaningful bounds.

Footnote 3: Apart from studying the phenomenon of distance concentration per se, this may be particularly useful e.g. in transductive learning type problems, where the query points are known at the time of learning.

Fig. 2 The upper bound on RV_{m,r} required to confirm from a sample of size r, with confidence δ = 0.95, that ‖·‖ suffers from ε-concentration in the sense of n = 2 and n = 10 respectively (left plots); dependence of this bound on both ε and δ, when the sample size is fixed to r = 5000 (right plots)

4.1 Comments

Fig. 3 Dependence of the minimum required sample size (r) on the size of the unseen sample that we want to reason about, in case of various risk levels δ. As one would expect, the smaller the allowed risk, the more samples are required

One must be aware that inequality (20) may be loose in some cases, hence a comment is in order with regard to the effects this may have on the statistical tests derived from it. That is, when (20) is loose, the test may fail to detect the concentration problem. A non-detection can also be caused by having too small a sample size r. However, when the estimated RV_{m,r}(p) does fall below the threshold for the levels tested, that definitely indicates, with confidence 1 − δ, the presence of the distance concentration problem at the level ε tested. Regarding the values of ε that are worth testing, previous experience (Beyer et al. 1999) has indicated that a small but non-negligible value of the ratio between the median squared distance and the nearest squared distance can already produce problems for algorithms that require neighbourhood searching. This suggests that a value of ε that is not particularly close to zero is already worth taking seriously in order to detect practically problematic cases.

Regarding the minimum required sample size, we see from the r.h.s. of (22) that we must have r ≥ 1/(1 − (1 − δ)^{1/n}). This is O(n) (since lim_{n→∞} [1/(1 − (1 − δ)^{1/n})]/n = 1/ln(1/(1 − δ)) = const.), which is reasonable, since we are reasoning about n previously unseen samples based on the r observed ones. The relationship between n and the minimum r is shown in Fig. 3 for various risk levels δ, and we see the minimum r is indeed O(n).
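A few lines of code (ours, not the paper's) make the r ≥ 1/(1 − (1 − δ)^{1/n}) requirement and its O(n) growth tangible.

```python
import math

def min_sample_size(n, delta):
    """Smallest integer r with r >= 1 / (1 - (1 - delta)**(1/n)), needed for the r.h.s. of (22) to be non-negative."""
    return math.ceil(1.0 / (1.0 - (1.0 - delta) ** (1.0 / n)))

for delta in (0.01, 0.05, 0.1):
    for n in (2, 10, 100, 1000):
        r = min_sample_size(n, delta)
        # r / n approaches 1 / ln(1 / (1 - delta)) as n grows, i.e. linear growth in n
        print(f"delta={delta}, n={n}: r_min={r}, r_min/n={r / n:.2f}")
```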

5 Numerical experiments and applications

As a validation of the procedure described in the previous section, we demonstrate its ability to detect the distance concentration problem in synthetic data. Data distributions with i.i.d. dimensions (features) represent worst-case situations: increasing dimensionality causes distances to concentrate and thus become meaningless. In particular, the case of features drawn from an i.i.d. uniform density has been the subject of study in e.g. Aggarwal et al. (2001) and François et al. (2007), which have shown that, for this particular distribution, the fractional distances with smaller exponent resist the problem up to a larger dimensionality. We use this distribution as our testbed to demonstrate that our testing procedure is able to reach valid conclusions from a data sample that agree with these known results.

Fig. 4 Numerical experiments on synthetic data drawn from an i.i.d. uniform distribution over [0, 1]^m. The number of points is r = 1000, the test is for n = 2, and the dimensionality m is varied. The average over 1000 query points is shown along with one standard error

Figure 4 shows the estimated lower bounds on the probability that n = 2 randomly drawn and previously unseen points would have nearly the same (within ε) distances from a reference query point, using (21), plotted against a range of values of ε tested. These experiments are conducted on four data sets with dimensions m ∈ {50, 100, 500, 1000}, and r = 1000 points used for estimation. We repeated each test 1000 times, each time with a different reference query point (drawn from the same distribution as the data set and then fixed during the test). The results shown on the plots represent the average (plus one standard error, which is so small that it is hardly visible) over these 1000 repetitions. The left-hand plot shows this for the Euclidean distance, the right-hand plot for the fractional distance with exponent parameter 0.1. The exponent p in RV_{m,r}(p) was set to 1 throughout. For each value of ε, the probability bound means the confidence 1 − δ with which the test would detect distance concentration with deviation ε. We can see that indeed, (i) for every fixed ε, this confidence gets higher as the data dimension increases—in agreement with the theory; (ii) the confidence reaches a high value close to 1 in the most problematic case (Euclidean distance, data with the largest dimension m tested) for reasonably low values of ε—that is, the bound is indeed tight in severe cases of distance concentration; and (iii) for any fixed dimensionality m, the Euclidean distance is more concentrated than the fractional distance—again, in agreement with the theory. We therefore conclude that the proposed approach has the ability to quantify and detect the severity of distance concentration problems from the data sample.
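For reproducibility, here is a condensed sketch of the synthetic experiment just described, written by us under the stated setup (r = 1000 i.i.d. uniform points, n = 2, Euclidean versus fractional distance with exponent 0.1); it is illustrative code, not the authors' original implementation, and the value of eps is an arbitrary choice.

```python
import numpy as np

def rv_estimate(dists, p=1.0):
    dp = dists ** p
    return dp.var(ddof=1) / dp.mean() ** 2

def bound21(dists, eps, p=1.0, n=2):
    r = len(dists)
    c = 2.0 / ((1.0 + eps) ** p - 1.0) + 1.0
    inner = max(0.0, 1.0 - 1.0 / r - c ** 2 * (r ** 2 - 1) / r ** 2 * rv_estimate(dists, p))
    return inner ** n

def fractional_dist(X, q, frac=0.1):
    # "fractional norm" dissimilarity with exponent frac in (0, 1)
    return (np.abs(X - q) ** frac).sum(axis=1) ** (1.0 / frac)

rng = np.random.default_rng(0)
eps = 0.2
for m in (50, 100, 500, 1000):
    X = rng.uniform(size=(1000, m))      # r = 1000 observed points
    q = rng.uniform(size=m)              # a fixed reference query point
    d_euc = np.linalg.norm(X - q, axis=1)
    d_frac = fractional_dist(X, q, 0.1)
    print(m, bound21(d_euc, eps), bound21(d_frac, eps))
```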

5.1 Testing real data sources

Contrary to numerous studies about the distance concentration problem on i.i.d. features, very little is known about whether and to what extent distance concentration affects real data sets. Real data often (though not always) has some structure, and this may suspend the dimensionality curse with regards to distance concentration. This has been observed empirically (François et al. 2007) and has also been confirmed theoretically in a class of non-i.i.d. data distributions, namely in linear latent variable models (Durrant and Kabán 2009). However, real data may or may not conform to such models, and it is desirable to be able to assess distance concentration problems directly from the data.

We now apply our method to ten real-world data sets from different domains. We have chosen four gene array data sets, since that is an area where major concerns were raised about the potential problems caused by the distance concentration phenomenon (Clarke et al. 2008): Adenocarcinoma (9856 features, 25 points), Brain (5597 features, 42 points), Colon (2000 features, 62 points), and Lymphoma (4026 features, 62 points). Of these, the Adenocarcinoma gene expression arrays (Kowalczyk 2007) are a particularly difficult data set that was found to be "anti-learnable" by means of classification methods. The other gene array data sets have been used in previous studies for classification and biomarker identification, see e.g. Shim et al. (2009) for some of the most recent works. Three further data sets were chosen from text retrieval benchmarks, where many successful studies were reported previously, so there seem to be fewer problems despite the high dimensionality of the data. These are: the NIPS papers collection (13,649 features, 1740 points; http://www.cs.toronto.edu/~roweis/data.html), MEDLINE (medical abstracts; 5735 features, 1063 points; ftp://ftp.cs.cornell.edu/pub/smart), and CRANFIELD (aeronautical systems abstracts; 4563 features, 1623 points; http://scgroup6.ceid.upatras.gr:8000/wiki/index.php/Main_Page). For the latter two data sets, the total number of points utilised is the union of the training and query instances from the benchmark distribution.

Fig. 5 Lower bounds on the probability that Euclidean distances are concentrated in the underlying unknown data distribution, for the case of four gene expression data sets and three text based data collections. For each data set, the average is shown when each data point acts as the query in turn while the remaining points are used for the estimation. The error bars represent one standard error. We can see that the probability that distances are meaningless is generally much higher in the case of gene arrays than it is for text data. Among these seven data sets tested, the Adenocarcinoma gene arrays display the highest evidence for the distance concentration problem

Figure 5 shows the lower bounds on the confidence that Euclidean distances are concentrated in the underlying unknown data distribution, for n = 2, plotted against a range of deviations ε tested. We repeated the test in a leave-one-out fashion, each time one point being held out to act as the query point. So the number of repetitions equals the number of points available in the various data sets, and r is one point less. As before, each test quantifies the probability that n = 2 randomly drawn points from the unknown data distribution would have similar distance from the query point. These bounds are superimposed for the above seven data sets, so we can see that the probability that distances are meaningless is generally much higher for the gene expression data in comparison to text.

Of the gene arrays, the 'anti-learnable' Adenocarcinoma data set presents the highest evidence for the distance concentration problem.

Fig. 6 Lower bounds on the probability that fractional distances are concentrated in the underlying unknown data distribution, for the case of the four gene expression data sets. We see the confidence bounds are generally lower for the fractional distance; however, the Adenocarcinoma data still presents strong evidence for the distance concentration problem

For the four gene array data sets, we repeated the tests using the non-metric fractional distances with parameter p = 0.1, and these are shown in Fig. 6. We see the confidence bounds decrease in nearly all cases, so the use of fractional norms may help alleviate the problem to some extent (though not always), in agreement with François et al. (2007).

Fig. 7 Confidence bounds on the concentration of Euclidean (solid lines) and fractional (dashed lines) distances on three high-dimensional UCI data sets. The same marker represents the same data set, under the two different distances tested. We see that the use of fractional distance is not always a cure for the problem. While it reduces the problem drastically on Gisette and Arcene, the problem remains in the case of Madelon

Finally, we performed similar testing on three high dimensional UCI data sets (http://archive.ics.uci.edu/ml/index.html) that were originally designed for feature selection benchmarking, and were also used in a previous study on distance concentration (Hsu and Chen 2009) aimed at validating a non-Euclidean dissimilarity function. These data are: Arcene (10,000 features, 100 points), for cancer prediction from mass-spectrometry data; Gisette (5000 features, 2000 points), a handwriting recognition data set; and Madelon (500 features, 1000 points), a synthetic data set designed to have difficult spurious features. Figure 7 summarises the results, displaying the confidence bounds for the concentration of both the Euclidean distances (solid lines) and the fractional distances (dashed lines). We see that the use of fractional distance does not always reduce the problem. It does so considerably on Gisette and Arcene, though not on Madelon. This is because the data distribution differs from the uniform distribution, which is when the fractional distance has a guaranteed beneficial effect (François et al. 2007).

Comparing our analysis with the assessments in Hsu and Chen (2009), we may notice that their Pearson Variation index also indicates that the distances are slightly more concentrated in Madelon and Gisette than in Arcene; however, it is less straightforward to give a unique, objective interpretation of those indices, in comparison to our probability bounds. Summing up, from previous studies (Durrant and Kabán 2009) we know that noisy spurious features trigger the problem of meaningless distances in high dimensions, whereas a strong correlation structure in the data can set us free of the problem up to arbitrarily high dimensions. Both the four gene array data sets and the three UCI data sets contain a large fraction of noise features—hence the results found are as one would expect. However, as opposed to previous analyses on parametric models (Durrant and Kabán 2009; Kabán 2011), or to using one or another of a variety of possible indices (Hsu and Chen 2009) that capture something about the problem on real data, we are now able to quantify the problem and estimate a rigorous lower bound on its severity solely from the data.
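To make the leave-one-out evaluation protocol used for the real data sets concrete, the sketch below is ours (the data matrix X and the parameter choices are placeholders, not the authors' setup): each point acts as the query in turn, the remaining points provide the estimate, and the resulting lower bounds are averaged, as done for Figs. 5–7.

```python
import numpy as np

def rv_estimate(dists, p=1.0):
    dp = dists ** p
    return dp.var(ddof=1) / dp.mean() ** 2

def bound21(dists, eps, p=1.0, n=2):
    r = len(dists)
    c = 2.0 / ((1.0 + eps) ** p - 1.0) + 1.0
    inner = max(0.0, 1.0 - 1.0 / r - c ** 2 * (r ** 2 - 1) / r ** 2 * rv_estimate(dists, p))
    return inner ** n

def leave_one_out_bounds(X, eps, n=2):
    """Each point acts as the query once; the remaining r-1 points provide the RV estimate."""
    bounds = []
    for i in range(len(X)):
        rest = np.delete(X, i, axis=0)
        d = np.linalg.norm(rest - X[i], axis=1)   # Euclidean dissimilarities to the held-out query
        bounds.append(bound21(d, eps, n=n))
    return np.mean(bounds), np.std(bounds) / np.sqrt(len(bounds))   # mean and standard error

# Placeholder data standing in for one of the data sets (e.g. a gene expression matrix)
X = np.random.default_rng(2).normal(size=(60, 2000))
print(leave_one_out_bounds(X, eps=0.3))
```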

6 Conclusions and future work

We provided a finite dimensional characterisation of the phenomenon of concentration of norms in high dimensional data spaces that recovers previous asymptotic results in the limit of infinite dimensions and provides probability bounds in the case of finite dimensions.

By extending this to a version that only requires sample estimates of the relative variance, we then devised confidence bounds that can be estimated from a data sample. These are lower bounds on the probability that a given dissimilarity or distance function is meaningless in the (unknown) data distribution that underlies the data set. This can be used to detect problematic data sets and distance functions in practice in a more rigorous manner than is currently possible. We demonstrated the working of this approach on both synthetic data and ten real-world data sets from different domains.

Although the main tool we used was the Chebyshev inequality, which is not very tight for a number of specific distributions, it has the advantage that the resulting testing procedure makes no assumptions on the data distribution. Moreover, it should be remembered that a tighter distribution-free bound cannot be obtained without introducing assumptions. The issue of identifying specific data distributions of wide enough interest that allow tighter bounds to be derived is non-trivial and, hence, subject to further research. New measure concentration inequalities for dependent variables (Kontorovich 2007) may be investigated for this purpose. Nevertheless, we have also demonstrated that the Chebyshev bound tightens exactly in severe distance concentration situations, and this makes its application suitable for detecting such problematic cases. Finally, another avenue of interest would be to extend this analysis to consider the query point as being random, in order to study the average and tails of the distribution of distance concentration probabilities.

References

Aggarwal, C.C., Hinneburg, A., Keim, D.A.: On the surprising behavior of distance metrics in high dimensional space. In: Proc. ICDT, pp. 420–434 (2001)
Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is nearest neighbor meaningful? In: Proc. ICDT, pp. 217–235 (1999)
Clarke, R., Ressom, H.W., Wang, A., Xuan, J., Liu, M.C., Gehan, E.A., Wang, Y.: The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat. Rev. Cancer 8, 37–49 (2008)
Durrant, R.J., Kabán, A.: When is 'nearest neighbour' meaningful: a converse theorem and implications. J. Complex. 25(4), 385–397 (2009)
François, D., Wertz, V., Verleysen, M.: The concentration of fractional distances. IEEE Trans. Knowl. Data Eng. 19(7), 873–886 (2007)
Giannella, C.: New instability results for high dimensional nearest neighbor search. Inf. Process. Lett. 109(19), 1109–1113 (2009)
Hsu, C.-M., Chen, M.-S.: On the design and applicability of distance functions in high-dimensional data space. IEEE Trans. Knowl. Data Eng. 21(4) (2009)
Kabán, A.: On the distance concentration awareness of certain data reduction techniques. Pattern Recognit. 44(2), 265–277 (2011)
Kontorovich, L.: Measure concentration of strongly mixing processes with applications. Ph.D. thesis, School of Computer Science, Carnegie Mellon University (2007)
Kowalczyk, A.: Classification of anti-learnable biological and synthetic data. In: Proc. PKDD, pp. 176–187 (2007)
Pramanik, S., Li, J.: Fast approximate search algorithm for nearest neighbor queries in high dimensions. In: Proc. ICDE, p. 251 (1999)
Radovanović, M., Nanopoulos, A., Ivanović, M.: Hubs in space: popular nearest neighbors in high-dimensional data. J. Mach. Learn. Res. 11, 2487–2531 (2010)
Saw, J.G., Yang, M.C.K., Mo, T.C.: Chebyshev inequality with estimated mean and variance. Am. Stat. 38(2), 130–132 (1984)
Shim, J., Sohn, I., Kim, S., Lee, J.-W., Green, P., Hwang, C.: Selecting marker genes for cancer classification using supervised weighted kernel clustering and the support vector machine. Comput. Stat. Data Anal. 53(5), 1736–1742 (2009)