A Trimmed Mean Approach to Finding Spatial Outliers

Tianming Hu* and Sam Yuan Sung
Department of Computer Science, National University of Singapore, Singapore 119260

Abstract: Outlier detection concerns discovering unusual data whose behavior is exceptional compared to the rest of the data. In contrast to non-spatial outliers, which consider only non-spatial attributes, spatial outliers are defined as those sites that are very different from their neighbors, where neighbors are defined in terms of spatial attributes, i.e., locations. In this paper, we propose a local trimmed mean approach to evaluating the spatial outlier factor, the degree to which a site is outlying compared to its neighbors. The structure of our approach strictly follows the general spatial data model, which states that spatial data consist of trend, dependence and error. We empirically demonstrate that the trimmed mean is more outlier-resistant than the median in estimating sample location, and we employ it to estimate spatial trend. In addition to the 1st order neighbors used in computing error, we also use higher order neighbors to estimate spatial trend. With the true outlier factor supposed to be given by the spatial error model, we compare our approach with spatial statistic and scatter plot. Experimental results on two real datasets show our approach is significantly better than scatter plot and slightly better than spatial statistic.

Keywords: spatial outlier detection, trimmed mean, spatial data mining, spatial error model

1 Introduction

One important branch of data mining is outlier detection. Outlier detection concerns discovering unusual data whose behavior is so exceptional compared to the rest of the data that it indicates something unusual has happened and deserves attention. Although the meaning of outlier seems straightforward to many people, there is no universally accepted formal definition; only some intuitive interpretations are available in the literature. A well-known definition was given by Hawkins [12], who defined an outlier as an observation which deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. A similar definition appears in Barnett and Lewis's book [6], which states that an outlier is an observation, or subset of observations, which appears to be inconsistent with the remainder of the data. Beckman and Cook [8] gave an alternative definition of an outlier as a contaminant or a discordant observation, where a discordant observation is any observation that appears surprising or discrepant to the investigator, and a contaminant is any observation that is not a realization from the target distribution.

Although outliers are often treated as noise or error in operations such as clustering and simply discarded, they may have identifiable causes and carry useful information that cannot be mined from the data residing deep inside clusters. It is not unusual that one man's noise is another man's signal. After identifying possible outliers, we may study the underlying reason why they occur, and this knowledge may be profitable. For instance, outliers may be produced by an incorrect distributional assumption; in such situations, further investigation of the outliers can lead to a more appropriate statistical model, which in turn leads to more appropriate statistical inference. Occasionally, the presence of outliers indicates more structure than was assumed. This is often true in exploratory data analysis, where at least three structures, clusters, a complete spatial random (Poisson) process and regular spacing, may be simultaneously present in the data. So in a way, finding outliers is at least as important as finding general patterns like clustering structure. Outlier detection has already found practical applications, including discovering crime in e-commerce, detecting computer intrusion, and detecting credit card fraud.

* Corresponding author. Email: [email protected]


1.1 Background of Spatial Outliers

In this paper, we study finding spatial outliers in spatial data. To be more exact, we calculate, for each data point, an outlier factor, the degree to which it is outlying, rather than assign it to one of two sets, outliers and non-outliers. Spatial data differ from other data in that the attributes associated with each object include not only the usual non-spatial attributes, which also exist in other data, but also spatial attributes, which are unique to or emphasized in spatial databases and describe the object's spatial information such as location and shape.

Independent identical distribution is a fundamental assumption often made about data sampling in machine learning, pattern recognition and data mining. In spatial data, this assumption is no longer valid. Let us first examine independence. In practice, almost every datum is related to every other, to a varying degree. For example, houses in nearby neighborhoods tend to have similar prices. This property was long ago observed by geographers, who described it as the first law of geography: everything is related to everything else, but nearby things are more related than distant things. As for the identical-distribution assumption, there are cases of spatial data where different regions seem to follow different distributions, which is referred to as spatial heterogeneity. Spatial data often exhibit two unique characteristics: spatial trend (large scale variation) and spatial dependence (small scale variation) [10]. Spatial dependence, also called spatial autocorrelation, has two types: positive and negative. Positive correlation means nearby sites tend to have similar characteristics, while negative correlation means nearby sites often have very different characteristics.

Common outlier detection techniques consider only non-spatial attributes and define neighbors based on distance in the non-spatial attribute space. If most data follow a distribution such as Gaussian, then outliers are probably those points on the tail of the distribution with low values of the probability density function. In contrast, if we assume homogeneity in spatial data, i.e., that similar sites tend to cluster together, spatial outliers are defined as those sites that are very different, in terms of non-spatial attributes, from their neighbors, where neighbors are defined in terms of spatial attributes (location). These sites are outlying locally, and finding them may provide interesting or unexpected knowledge. For example, it often holds that neighboring cities and towns have similar economic power. If a town is found to be significantly more powerful economically than its neighbors, it is labeled a spatial outlier and further investigation by domain experts is called for, which may reveal the cause, e.g., an airport that makes the town more economically competitive in terms of convenience of transportation.

Our contribution in this paper is as follows.

• We propose a local trimmed mean approach to evaluating the spatial outlier factor. The structure of our approach strictly follows the general spatial data model, which states that spatial data consist of trend, dependence and error. Only after removing spatial trend and dependence, and consequently making most errors, if not all, independent and identically distributed, does it make sense to compare spatial data and evaluate their outlier factors. The trimmed mean is used to estimate the spatial trend.
• Most existing methods for finding spatial outliers use only the 1st order neighbors in computing the difference (error) of each site from its neighbors. We propose to use higher order neighbors as well, in the hope that a larger sample of neighboring sites offers a more robust estimate of spatial trend, and hence of error.

The rest of the paper is organized as follows. Section 2 reviews related work on non-spatial outliers and two tests for spatial outliers, spatial statistic and scatter plot, with which we compare our approach. In Section 3, we introduce the spatial error model, which is assumed to give the true outlier factors, followed by a brief review of the median polish approach, which estimates spatial trend with medians. Section 4 first formulates the problem of estimating the spatial outlier factor, then presents our local trimmed mean approach. Comparative experimental evaluation on two real spatial datasets is reported in Section 5. Section 6 concludes the paper with a summary and some remarks.


2 Related Work

2.1 Techniques for Non-spatial Outliers

Most outlier detection techniques handle common (non-spatial) outliers, ignoring spatial relations. They are often designed to identify global outliers and do not differentiate spatial and non-spatial attributes. They can be divided into two classes, based on the dimensionality of the data. The first class [6] handles one dimensional data. Many techniques in this class were developed in the statistical field; they assume a statistical distribution such as Gaussian and try to fit the data to the model by estimating parameters such as mean and variance from the data. They vary in the type of distribution, the number of outliers to be identified and the type of outliers. They then employ a test based on properties of the distribution to identify outliers w.r.t. that distribution. For a dataset of $n$ values, $\{x_i : i = 1 \ldots n\}$, let $\hat{\mu}$ denote the sample mean defined in Eq. (1) and $\hat{\sigma}$ denote the sample standard deviation defined in Eq. (2); then the z-score of a point $x$ is defined in Eq. (3).

$$\hat{\mu} \equiv \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1)$$

$$\hat{\sigma}^2 \equiv \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat{\mu})^2 \qquad (2)$$

$$z(x) \equiv \frac{x - \hat{\mu}}{\hat{\sigma}} \qquad (3)$$

For data from Gaussian-like distributions, $z(x) \sim N(0, 1)$, and one popular test labels a point $x$ an outlier if its absolute z-score exceeds 3, i.e., $|z(x)| > 3$. We can see this test targets points on the distribution tail. Unfortunately, even for an outlier with a very large value $x$, its z-score cannot be arbitrarily large: although the numerator $x - \hat{\mu}$ in Eq. (3) is large, $x$ also contributes much to $\hat{\sigma}$, making the denominator large too. It can be proved that for a dataset of size $n$, $|z(x)|$ cannot be greater than $(n-1)/\sqrt{n}$ [25]. In reality, prior knowledge about the distribution of a dataset is not always available. Besides, these methods do not scale well even for datasets of modest size.
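To make the test concrete, the following minimal sketch (ours, not from the cited works; the seed and sample are arbitrary) applies the $|z| > 3$ rule and checks Shiffler's bound:

```python
import numpy as np

def z_scores(x):
    """z-scores of Eq. (3), using the (n-1)-normalized sample std of Eq. (2)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 99), [100.0]])  # 99 inliers, one gross outlier

z = z_scores(x)
print("flagged:", np.where(np.abs(z) > 3)[0])        # the classical |z| > 3 test

# Shiffler's bound [25]: |z| can never exceed (n-1)/sqrt(n), however extreme x is,
# because the outlier inflates the denominator sigma-hat along with the numerator.
n = len(x)
assert np.abs(z).max() <= (n - 1) / np.sqrt(n)
```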

The other class of techniques deals with multi-dimensional data. These techniques regard the data as a set of points in a high dimensional (attribute) space and offer tests based on distance, density, etc. They are closely related to the corresponding clustering algorithms, which try to find the general pattern followed by most points, rather than the singular pattern followed by a few elusive points (outliers). Whenever we declare a point an outlier, we always imply some pattern (cluster) w.r.t. which it is outlying. In fact, given a clustering algorithm with a function to measure clustering quality, a naive algorithm for calculating an outlier factor can assign each point a value equal to the absolute difference between the original clustering quality and the clustering quality after removing that point.

Distance based techniques distinguish points that are likely to be outliers based on the number of points in each point's neighborhood. They do not assume any prior distribution of the data and limit the counting of points to the neighborhood of each point. These properties make them suitable for finding outliers in large datasets. Corresponding to clustering algorithms that find convex clusters [18], one well-known such technique is the DB(p, d)-outlier proposed by Knorr et al. [15], where a point in a dataset T is an outlier if at least a fraction p of the points in T lie at distance greater than d from it. The strengths of this definition include its simplicity and its capture of the basic meaning of Hawkins' definition. However, it cannot handle data with different local patterns and characteristics and hence can only find global outliers. A special case of the DB(p, d)-outlier is proposed by Ramaswamy et al. [21], who consider the k-distance, the distance to the k-th nearest neighbor(s) of each point, and rank the top n points with the largest such distances as the outliers. This scheme shares the same weakness as the DB(p, d)-outlier. Corresponding to clustering algorithms capable of finding arbitrarily shaped clusters consisting of points with similar local densities [11, 3], Breunig et al. [9] proposed the notion of local outlier factor, which measures the degree of outlierness based on the difference in local density between a point and its k nearest neighbors. If the local density of a point is much lower than those of its k nearest neighbors, its outlier factor is high. This technique thus provides a ranking of points in descending order of outlier factor and is able to identify local outliers that may not be outlying globally. Sometimes a point can be outlying in a subspace obtained by projecting the original full space onto one of its subsets. Corresponding to clustering algorithms capable of finding clusters in subspaces [2], Aggarwal and Yu [1] considered such situations and searched all possible subspaces for regions with much lower density than the rest of the subspace; all points in such low density regions are declared outliers. Knorr and Ng [14] also considered subspaces and tried to explain why a point is outlying in terms of intensional knowledge, by finding the minimal subspace where it first becomes outlying.

2.2 Tests for Spatial Outliers

Numerous tests have been proposed for spatial outliers in the statistical field. They explicitly separate spatial attributes from non-spatial attributes. Spatial attributes are employed in the computation of distance, neighborhood, etc. Non-spatial attributes are used to compare a site with its neighbors. These tests can be divided into two categories, graphical tests and quantitative tests. The former provides the user a graph which visualizes the data and highlights the spatial outliers. The latter provides the user a partition of the dataset into outliers and non-outliers. Because we discuss outlier factors in this paper, we concentrate on the latter. Spatial statistic (SS) and scatter plot (SP) are two commonly used quantitative tests [5, 24]. The original forms of these two tests assign each site to one of two sets, outliers and non-outliers, with a threshold determined by the user. In a straightforward way, we generalize them to output an outlier factor for each site.

Informally speaking, SS regards a site as a standard non-outlier if its attribute value equals the average of its neighbors'. Let $y(s_i)$ denote the non-spatial attribute value of site $s_i$, $N(s_i)$ denote the set of neighboring sites of $s_i$ (exclusive of $s_i$), and $\overline{y}(s_i)$, defined in Eq. (4), denote the average attribute value of the neighbors of $s_i$. SS computes $e_{SS}(s_i)$, the difference of $s_i$ from its neighbors, in Eq. (5) and assigns the resulting outlier factor in Eq. (6), where $z(\cdot)$ is the z-score defined in Eq. (3).

$$\overline{y}(s_i) \equiv \frac{1}{|N(s_i)|}\sum_{s_j \in N(s_i)} y(s_j) \qquad (4)$$

$$e_{SS}(s_i) \equiv y(s_i) - \overline{y}(s_i) \qquad (5)$$

$$O_{SS}(s_i) \equiv |z(e_{SS}(s_i))| \qquad (6)$$

If $y(s_i)$ follows a Gaussian distribution, then so does $e_{SS}(s_i)$. SP's assumption about a standard non-outlier is more general than SS's. In contrast to SS's assumption that a standard non-outlier satisfies $y(s_i) = \overline{y}(s_i)$, SP assumes a linear relation between $y(s_i)$ and $\overline{y}(s_i)$, i.e., $y(s_i) = a \times \overline{y}(s_i) + b$, and declares a point a standard non-outlier if it lies on the regression line. We can see that if the slope of the line is approximately 1, SP's result is similar to SS's. In detail, SP computes the error (difference) of $s_i$ in Eq. (7) and outputs its outlier factor in Eq. (8).

$$e_{SP}(s_i) \equiv y(s_i) - (a \times \overline{y}(s_i) + b) \qquad (7)$$

$$O_{SP}(s_i) \equiv |z(e_{SP}(s_i))| \qquad (8)$$

For the regression line, the slope $a$ and the intercept $b$ can be estimated to minimize the sum of squared errors (SSE), $SSE \equiv \sum_{i=1}^{n} [y(s_i) - (a \times \overline{y}(s_i) + b)]^2$, by setting its partial derivatives w.r.t. $a$ and $b$ to 0. The original forms of SS and SP simply include one more parameter, $\theta$, to be determined by the user: any site whose outlier factor is greater than $\theta$ is declared an outlier.
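For concreteness, here is a minimal sketch of the generalized SS and SP tests as we read Eqs. (4)-(8); the helper name and the assumption of a row-normalized contiguity matrix $W$ (so that $Wy$ realizes Eq. (4)) are ours:

```python
import numpy as np

def ss_sp_factors(y, W):
    """Spatial statistic (SS) and scatter plot (SP) outlier factors.
    y : (n,) attribute values; W : (n, n) row-normalized contiguity matrix
    with zero diagonal, so that W @ y is each site's neighbor average (Eq. 4)."""
    ybar = W @ y                                              # Eq. (4)
    e_ss = y - ybar                                           # Eq. (5)
    o_ss = np.abs((e_ss - e_ss.mean()) / e_ss.std(ddof=1))    # Eq. (6)

    # Least-squares fit y = a*ybar + b for the scatter plot test.
    a, b = np.polyfit(ybar, y, deg=1)
    e_sp = y - (a * ybar + b)                                 # Eq. (7)
    o_sp = np.abs((e_sp - e_sp.mean()) / e_sp.std(ddof=1))    # Eq. (8)
    return o_ss, o_sp
```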

2.3 An Illustration for Comparison

To illustrate the difference between spatial outliers and common outliers, consider the example dataset in Fig. 1(a), with the outlier factors for DB(p, d)-outlier, SS and SP plotted in Fig. 1(b, c, d), respectively. Site $s_i$, $i = 2 \ldots 9$, has two neighbors, $s_{i-1}$ and $s_{i+1}$; $s_1$ has neighbors $s_{10}$ and $s_2$; $s_{10}$ has neighbors $s_9$ and $s_1$. The DB(p, d)-outlier is generalized to output an outlier factor as follows: given a fixed $d$, it outputs for each point the fraction $p$ of points that lie at distance greater than $d$ from that point. We set $d$ to the average distance over all pairs of points. All outlier factors under each definition $M$ are rescaled to the range $[0, 1]$ by dividing by the maximum of the original factors, i.e., $O'_M(s_i) = O_M(s_i)/\max_{s_j}\{O_M(s_j)\}$. We can see DB(p, d) assigns high factors to sites $s_1$ and $s_{10}$, whose attribute values are extremely small and lie far away from most other points, which center around 0.6. Although $s_3$'s attribute value is close to 0.6 and therefore not very exceptional globally, it achieves the highest outlier factor under SS, for the difference between it and the average of its two neighbors is very exceptional: most sites' attribute values are close to their neighborhood averages, while $s_3$'s is higher than both of its neighbors'. Similarly, $s_4$'s attribute value is exactly 0.6 and lies deep inside the densest area, but it is ranked most outlying by SP, partly because it is one of the few points whose attribute value is exceptionally lower than the average of its neighbors.

3 Spatial Error Model and Median Polish

3.1 Spatial Error Model

To compare the performance of spatial statistic, scatter plot and our trimmed mean approach, we need to know the true outlier factor, which is supposed to be given by the spatial error model (SEM). The general model of spatial data, which is the underlying model for spatial statistics [10], is as follows.

$$\text{spatial data} = \text{trend} + \text{dependence} + \text{error} \qquad (9)$$

Only after removing the spatial trend and spatial dependence is it meaningful to compare the residuals (errors) and evaluate outlier factors. Generally, for a set of values, only after making sure that most of them come from a Gaussian-like distribution is it reasonable to compare their z-scores. For example, suppose we have a set of values recording traffic volumes at different sites on some roads in a region. If we find the value of a site on a major highway is significantly larger than those of its neighbors on another road with lighter traffic, we cannot conclude that it is an outlier. We need to detrend first, e.g., subtract from it the average value of all sites on that major highway. If its residual is still very different from its neighbors', it can be labeled a potential outlier with more confidence. Formally, SEM decomposes spatial data in matrix form as in Eq. (10),

$$y - \mu = \rho W (y - \mu) + \epsilon \qquad (10)$$

$$\mu = X\beta \qquad (11)$$

where $\mu$ denotes the spatial trend and $\epsilon$ denotes the error. Spatial dependence is captured by $\rho W$, where $\rho$ is a parameter in $[0, 1]$ to be estimated and $W$ is the contiguity matrix, with $W(i, j) > 0$ iff sites $s_i$ and $s_j$ are neighbors. Because there are at least as many parameters as data, i.e., $n$ parameters in $\mu$, we resolve this by regressing the dependent variable on $m$ ($m < n$) independent variables. One common practice is to model the spatial trend as in Eq. (11), where $\beta$ is an $m \times 1$ vector of regression coefficients and each row of $X$ is a vector of known explanatory attributes for a site. Often $X$ is augmented as $[X, \mathbf{1}]$. Only after removing trend and dependence can we regard the elements of the error vector $\epsilon$ as independent and identically distributed, e.g., $\epsilon \sim N(0, \sigma^2 I)$, where $I$ denotes the $n \times n$ identity matrix. In this case, $y \sim N(\mu,\ (I - \rho W)^{-1} \sigma^2 I (I - \rho W^T)^{-1})$, and the parameters $\rho$ and $\beta$ can be estimated to maximize the likelihood of $y$ [10]. After obtaining estimates of the parameters, we obtain the outlier factor of $s_i$ by transforming the estimated residual vector $\hat{\epsilon}$ to the standard Gaussian distribution $N(0, I)$, i.e.,

$$O_{SEM}(s_i) \equiv \frac{|\hat{\epsilon}(i)|}{\hat{\sigma}} \qquad (12)$$

Because maximum likelihood estimation is computationally expensive, especially for large datasets, we try to approximate $\epsilon$ in a non-parametric way. It is not obvious how to estimate $\rho$ directly, though in practice the estimate obtained by maximizing the likelihood often takes a value in $[0.5, 1]$. In our approach, we simply fix it at 1. The remaining job is to estimate the spatial trend $\mu$ from $y$.

3.2 Median Polish

Since locations lie in 2-D (two-dimensional) space, the median polish approach [27, 10] tries to decompose the spatial trend into two directional components, horizontal and vertical. In detail, suppose a grid of $p \times q$ cells is overlaid on the sites, each site being assigned to its nearest cell; ideally there is one site per cell. The spatial trend of $s_i$ in cell $(k, l)$ is decomposed as $\mu(s_i) = t + r_k + c_l$, where $t$ denotes the total effect, $r_k$ the row effect of the $k$-th row, and $c_l$ the column effect of the $l$-th column. Median polish estimates these components by computing the respective medians and removing them, repeating until convergence (a sketch follows this paragraph). The median polish approach suffers from two weaknesses. First, the median is a holistic aggregate function, meaning it cannot be computed from a constant number of sub-aggregates over partitions of the dataset; the iterative calls to such an operation put it at a disadvantage for large datasets. Second, it is not obvious how to derive a universal strategy for overlaying the grid on the sites with one site per cell. A natural simplification of median polish is localization, i.e., applying it in the neighborhood of each site, in the hope that the differences between the site's row (column) effect and its neighbors' would cancel out, e.g., $r_{k-1} + r_{k+1} = 2r_k$. An even more simplified version just computes the median of the neighborhood, without distinguishing row and column effects. However, we will show later that the trimmed mean, the estimator employed in our approach, is more outlier-resistant. Besides, it unifies other estimators such as the mean and the median.
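A minimal sketch of classic (global) median polish, assuming the idealized one-site-per-cell grid; the function and variable names are our own illustration:

```python
import numpy as np

def median_polish(grid, n_iter=10):
    """Classic median polish on a p x q grid (one value per cell assumed).
    Decomposes grid ~ total + row_effect + col_effect + residual."""
    resid = np.array(grid, dtype=float)
    p, q = resid.shape
    total, row, col = 0.0, np.zeros(p), np.zeros(q)
    for _ in range(n_iter):
        r = np.median(resid, axis=1)       # sweep out row medians
        row += r
        resid -= r[:, None]
        c = np.median(resid, axis=0)       # sweep out column medians
        col += c
        resid -= c[None, :]
        # fold the medians of the effects into the overall (total) effect
        m = np.median(row); total += m; row -= m
        m = np.median(col); total += m; col -= m
    return total, row, col, resid          # trend mu(k, l) = total + row[k] + col[l]
```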

4 Finding Spatial Outliers with Local Trimmed Mean

4.1 Problem Formulation

• Given
  1. A spatial framework of $n$ sites, $S = \{s_i\}_{i=1}^{n}$, and a neighbor relation $N \subseteq S \times S$. Sites $s_i, s_j$ are neighbors iff $(s_i, s_j) \in N$, $i \neq j$.
  2. Associated with each $s_i$, an attribute $y_i \equiv y(s_i) \in \mathbb{R}$. $y_i$ is a particular realization of the random variable $Y_i \equiv Y(s_i) \in \mathbb{R}$ observed at $s_i$. Let $Y \equiv \{Y_1, \ldots, Y_n\}$ and $y \equiv [y_1, \ldots, y_n]^T$. We compute outlier factors based on this attribute.
• Find
  A mapping $O: S \to \mathbb{R}$, where $O(s_i)$ is the spatial outlier factor of site $s_i$. The higher the factor, the more outlying the site.
• Objective
  Maximize the similarity between the ranking in terms of outlier factor by $O$ and the ranking in terms of the true outlier factor. Here we assume the spatial error model gives the true outlier factor.
• Constraint
  Spatial context exists, i.e., $P(Y_i = y_i)$ is affected by the values $y_j$ of the neighbors in $N(s_i) \equiv \{s_j : (s_j, s_i) \in N\}$. The neighbor relation $N$ may not be given a priori and may need to be computed from the sites' coordinates. Neighborhood information can also be stored in an $n \times n$ contiguity matrix $W$, which is often sparse. It holds that $W(i, j) > 0$ iff $(s_i, s_j) \in N$, $i \neq j$, $W(i, j) = 0$ otherwise, and $W(i, j) > 0 \Leftrightarrow W(j, i) > 0$. Often $W$ is normalized so that each row sums to 1 (a construction sketch follows this formulation).
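As a small illustration of the constraint above, here is a hedged sketch (the helper name is ours) that builds the row-normalized contiguity matrix $W$ from a symmetric neighbor relation:

```python
import numpy as np

def contiguity_matrix(n, neighbor_pairs):
    """Row-normalized contiguity matrix W from a symmetric neighbor relation.
    neighbor_pairs : iterable of (i, j) index pairs with i != j."""
    W = np.zeros((n, n))
    for i, j in neighbor_pairs:
        W[i, j] = W[j, i] = 1.0          # W(i,j) > 0  iff  (si, sj) in N
    rowsum = W.sum(axis=1, keepdims=True)
    # Normalize each nonempty row to sum to 1, leaving isolated sites at zero.
    return np.divide(W, rowsum, out=np.zeros_like(W), where=rowsum > 0)
```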


4.2 Trimmed Means

The trimmed mean is the mean of a sample excluding extreme values: it is the mean of a 1-D dataset after excluding the highest and lowest 100p percent of the observations. A familiar application of the trimmed mean is scoring in gymnastics and skating, where the highest and lowest referee scores are excluded and the final score is the sample mean of the remaining scores. Formally, given an ordered sample of $n$ data values, $\{x_i\}_{i=1}^{n}$, the 100p% trimmed mean $T(p)$, adapted from [13], is defined in Eq. (13),

$$T(p) \equiv \frac{\sum_{i=r+1}^{n-r} x_i}{n - 2r} \qquad (13)$$

where $r = \lfloor pn \rfloor$. The trimmed mean actually unifies the sample mean and the median by varying $p$: setting $p = 0$ gives the sample mean, and setting $p = \lfloor (n-1)/2 \rfloor / n$ gives the median. We now show the trimmed mean is a robust estimator of the sample center. If there are outliers in the data, a desirable property of an outlier-resistant estimator is that its variance remains small. We use a small experiment to demonstrate that the trimmed mean is a more representative estimator of the center of the data than the median, comparing the two by their efficiency relative to the sample mean for estimating the population mean. Formally, the efficiency $E$ of the 100p% trimmed mean $T(p)$, relative to the sample mean $\hat{\mu}$ defined in Eq. (1), is defined as [13]

$$E \equiv \frac{\mathrm{var}(\hat{\mu})}{\mathrm{var}(T(p))} \qquad (14)$$

The efficiency of the median relative to the sample mean is defined similarly by substituting the median for the trimmed mean in Eq. (14). The larger the efficiency, the more robust the estimator. Of course, the efficiency depends on the sample size, on $p$, the fraction to be trimmed, and on the underlying distribution from which the data are assumed to come. We perform two sets of experiments. In one set, we randomly draw $n$ samples, each consisting of $n$ data values from the standard Gaussian distribution $N(0, 1)$. In the other set, we randomly draw $n$ samples, each consisting of $n$ data values from $N(0, 1)$ plus one outlier value from $N(0, 10^2)$. We set $p = 15\%$, a value recommended by many texts. The results are listed in Table 1, where the trimmed mean gives consistently higher efficiency than the median in both cases.
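The following sketch mimics that simulation; the fixed replication count and seed are our own choices rather than the paper's exact protocol:

```python
import numpy as np

def trimmed_mean(x, p):
    """100p% trimmed mean T(p) of Eq. (13): drop the lowest/highest r = floor(pn)."""
    x = np.sort(np.asarray(x, dtype=float))
    r = int(np.floor(p * len(x)))
    return x[r:len(x) - r].mean()

def efficiency(estimator, n, n_rep=10000, contaminate=False, seed=0):
    """Eq. (14): var(sample mean) / var(estimator), over repeated samples."""
    rng = np.random.default_rng(seed)
    means, ests = np.empty(n_rep), np.empty(n_rep)
    for k in range(n_rep):
        x = rng.normal(0, 1, n)
        if contaminate:                      # one gross outlier from N(0, 10^2)
            x = np.append(x, rng.normal(0, 10))
        means[k], ests[k] = x.mean(), estimator(x)
    return means.var() / ests.var()

for n in (10, 100):
    print(n,
          efficiency(np.median, n, contaminate=True),
          efficiency(lambda x: trimmed_mean(x, 0.15), n, contaminate=True))
```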

4.3 Our Approach

Now we present our algorithm for estimating the spatial outlier factor based on the local trimmed mean. The main idea is to approximate the spatial trend at each site by the trimmed mean of its neighborhood. Besides, we adopt a robust procedure to transform the independent and identically (Gaussian) distributed errors to the standard Gaussian distribution. There are two parameters in our algorithm: $p$ in the trimmed mean definition, and $k$, the order of the neighborhood. Besides the immediate neighbors, we can use 2nd or even higher order neighbors, in the hope that the differences between the spatial trend of the site under examination and those of its neighbors are more likely to cancel out, i.e., giving a more accurate estimate of its spatial trend. First we formally define the $k$-th order neighbors and illustrate how to find them efficiently with the contiguity matrix $W$. Given the spatial framework, site set $S$ and neighbor relation $N$, a path $P$ of length $k$ is a sequence of $k+1$ distinct sites $(s_0, s_1, \ldots, s_k)$, i.e., $|\{s_0, s_1, \ldots, s_k\}| = k + 1$, where $s_l \in S$ denotes the $(l+1)$-th site on the path and $(s_l, s_{l+1}) \in N$, $l = 0 \ldots k-1$. Consequently, $s_l$ is an $l$-th order neighbor of $s_0$, and we denote $s_i$'s $l$-th order neighbor set by $N_l(s_i)$. In particular, we define $s_i$ as the only 0-th order neighbor of itself, i.e., $N_0(s_i) = \{s_i\}$. Let $W_k \equiv (W + I)^k$, with $(W + I)^0 = I$ in particular. It can be proved that $W_k(i, j) > 0 \Leftrightarrow s_j \in \bigcup_{l=0}^{k} N_l(s_i)$.
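A minimal sketch of this neighborhood computation (a dense NumPy version; for large $W$, a sparse matrix would be preferable):

```python
import numpy as np

def k_order_neighborhood(W, k):
    """Boolean mask M with M[i, j] True iff s_j lies in the union of
    N_0(s_i), ..., N_k(s_i), computed via W_k = (W + I)^k as in the text."""
    n = W.shape[0]
    A = (W > 0).astype(float) + np.eye(n)   # binarize W and add self-loops
    return np.linalg.matrix_power(A, k) > 0
```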

Our algorithm for estimating the spatial outlier factor based on the local trimmed mean is as follows.

1. For each site $s_i$, estimate its spatial trend as $\hat{\mu}(i) = T_k(p)$, where the trimmed mean is computed over the set of sites $s_j$ in $\bigcup_{l=0}^{k} N_l(s_i)$, i.e., those with $W_k(i, j) > 0$.
2. Estimate the error by removing spatial trend and dependence: $\hat{\epsilon} = (I - W)(y - \hat{\mu})$.
3. Output the spatial outlier factor $O_{TR}(s_i)$ by transforming the error to the standard Gaussian distribution with the Median Absolute Deviation about the median (MAD), which is more robust than the sample standard deviation used in the z-score for estimating the sample scale (variance) [13]; see the sketch after Eqs. (15, 16). The constant 0.6745 is needed because, for a large sample from $N(0, \sigma^2)$, $E(MAD/0.6745) = \sigma$, where $E$ is the expectation operator.

$$MAD \equiv \mathrm{median}_i\left(|\hat{\epsilon}(i) - \mathrm{median}_i(\hat{\epsilon}(i))|\right) \qquad (15)$$

$$O_{TR}(s_i) \equiv \frac{0.6745\,|\hat{\epsilon}(i) - \mathrm{median}_i(\hat{\epsilon}(i))|}{MAD} \qquad (16)$$
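Putting the three steps together, here is a hedged end-to-end sketch of the TR$k$ factor computation (the function name is ours; a row-normalized contiguity matrix is assumed):

```python
import numpy as np

def tr_outlier_factors(y, W, k=2, p=0.15):
    """Local trimmed mean spatial outlier factors O_TR (steps 1-3, Eqs. 13, 15, 16).
    y : (n,) attribute vector; W : (n, n) row-normalized contiguity matrix."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Up-to-k-th-order neighborhoods via W_k = (W + I)^k.
    hood = np.linalg.matrix_power((W > 0).astype(float) + np.eye(n), k) > 0
    # Step 1: spatial trend = 100p% trimmed mean (Eq. 13) over each neighborhood.
    mu = np.empty(n)
    for i in range(n):
        x = np.sort(y[hood[i]])
        r = int(np.floor(p * len(x)))
        mu[i] = x[r:len(x) - r].mean()
    # Step 2: remove trend and dependence (rho fixed at 1): eps = (I - W)(y - mu).
    eps = (np.eye(n) - W) @ (y - mu)
    # Step 3: robust standardization with MAD (Eqs. 15, 16).
    dev = np.abs(eps - np.median(eps))
    mad = np.median(dev)
    return 0.6745 * dev / mad
```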

5 Experimental Evaluation

We evaluate our local trimmed mean approach in comparison with spatial statistic and scatter plot on two real datasets. The spatial error model is supposed to give the true outlier factors; it is implemented with the spatial econometrics toolbox [16] using Monte Carlo sampling [20, 7]. We fix $p = 0.15$ for all methods involving the trimmed mean.

5.1 Performance Criteria

We use precision and recall [23] to measure the performance of the various methods in finding outliers. In detail, sites are first sorted in descending order of spatial outlier factor by each method. Given the true answer set $L_{SEM}$, the set of outlier sites according to SEM, and the predicted answer set $L_M$, the set of outlier sites according to method $M$, we can evaluate the performance of $M$ in terms of standard recall and precision:

$$recall(M) \equiv \frac{|L_{SEM} \cap L_M|}{|L_{SEM}|} \qquad (17)$$

$$precision(M) \equiv \frac{|L_{SEM} \cap L_M|}{|L_M|} \qquad (18)$$

Often we only care about the top $t$ outliers, $L_M(t)$, the set of sites with the $t$ largest outlier factors. In this case recall equals precision, since $|L_M(t)| = |L_{SEM}(t)| = t$. Here every top outlier site receives the same weight, whether it is the 1st or the $t$-th. This unweighted practice cannot distinguish the following situation. Suppose the true answer set is $L_{SEM}(10) = \{s_1, \ldots, s_{10}\}$, with sites in descending order of outlier factor. For two methods $M1$ and $M2$ with $L_{SEM}(10) \cap L_{M1}(10) = \{s_1, \ldots, s_5\}$ and $L_{SEM}(10) \cap L_{M2}(10) = \{s_6, \ldots, s_{10}\}$, their recalls are both 0.5, but apparently $M1$ is better than $M2$. For this reason, we propose a weighted recall based on rank, where each site has a weight proportional to its rank in the outlier list. We use rank rather than the outlier factor itself because rank is less sensitive to the scale of, and changes in, the outlier factors; rank has also been used in comparing various interestingness measures for association rules [26]. Formally, for a top $t$ outlier list by a method $M$, $L_M(t) = \{s_i : i = 1 \ldots t\}$, we define the direct rank ($drank$) and the corresponding normalized rank ($R$) for $s_i \in L_M(t)$ in Eqs. (19, 20), respectively. From now on, we refer to the normalized rank $R$ when we mention rank.

$$drank_M(s_i) \equiv |\{s_j : O_M(s_j) \le O_M(s_i),\ s_i, s_j \in L_M(t)\}| \qquad (19)$$

$$R_M(s_i) \equiv \frac{drank_M(s_i)}{\sum_{s_j \in L_M(t)} drank_M(s_j)} \qquad (20)$$

Consequently, we can define the rank-based weighted recall ($wrec$) for a method $M$ on the top $t$ outliers as follows.

$$wrec(M, t) \equiv \frac{\sum_{s_j \in L_{SEM}(t) \cap L_M(t)} R_{SEM}(s_j)}{\sum_{s_j \in L_{SEM}(t)} R_{SEM}(s_j)} \qquad (21)$$

As with standard recall, the higher the weighted recall, the better the method $M$, because the denominator is the same for all methods. However, it is not straightforward to define a weighted precision analogously: if we substitute $M$ for SEM in the denominator of Eq. (21), we must consider the ranks of sites that are not in $L_{SEM}(t) \cap L_M(t)$. So we only use weighted recall in the later evaluation. Besides considering the top $t$ outliers, we also use standard recall and precision based on thresholding the outlier factors. Formally, given a user determined threshold $\theta$, the outlier list of a method $M$ is defined as $L_M(\theta) \equiv \{s_i : O_M(s_i) \ge \theta\}$, and thresholded recall and precision are then defined as in Eqs. (17, 18), with $L_M = L_M(\theta)$ and $L_{SEM} = L_{SEM}(\theta)$. In addition to performance on the top outliers, we also evaluate the global similarity between SEM and the other methods of the rankings of all sites in descending order of outlier factor. Formally, given $n$ sites $S = \{s_i : i = 1 \ldots n\}$ and a method $M : S \to [0, 1]$ mapping each site to a rank $R_M(s_i)$ as defined in Eq. (20), we evaluate the similarity of the global rankings of $M$ and SEM by computing the entropy and cross entropy as in Eqs. (22, 23).

$$entropy(M, SEM) \equiv -\sum_{s_i} R_{SEM}(s_i)\ln(R_M(s_i)) \qquad (22)$$

$$cross\_entropy(M, SEM) \equiv -\sum_{s_i} \left[ R_{SEM}(s_i)\ln(R_M(s_i)) + (1 - R_{SEM}(s_i))\ln(1 - R_M(s_i)) \right] \qquad (23)$$

Entropy is often used to measure information quantity [17]. The smaller the (cross) entropy, the more similar $M$ is to SEM, at least w.r.t. the set of sites under examination. The minimum is achieved by comparison with itself, i.e., $entropy(M, SEM) \ge entropy(SEM, SEM)$ [22].
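A small sketch of these criteria (Eqs. 19-22); the function names and toy data are our own illustration:

```python
import numpy as np

def normalized_ranks(factors):
    """Normalized ranks R of Eq. (20): drank (Eq. 19) divided by the drank sum."""
    f = np.asarray(factors, dtype=float)
    drank = np.array([(f <= v).sum() for v in f])
    return drank / drank.sum()

def weighted_recall(true_top, pred_top, r_sem):
    """Rank-based weighted recall of Eq. (21). true_top, pred_top are sets of
    site indices; r_sem maps a site index to its R_SEM in the true top-t list."""
    hits = true_top & pred_top
    return sum(r_sem[s] for s in hits) / sum(r_sem[s] for s in true_top)

def ranking_entropy(r_sem, r_m):
    """Entropy of Eq. (22) between the SEM ranking and method M's ranking."""
    return -np.sum(np.asarray(r_sem) * np.log(np.asarray(r_m)))

# Toy check from the text: with true top-10 {s1..s10} (s1 most outlying),
# a method recovering {s1..s5} scores higher than one recovering {s6..s10}.
r = normalized_ranks(list(range(10, 0, -1)))           # factors for s1..s10
r_sem = {i: r[i] for i in range(10)}
top = set(range(10))
print(weighted_recall(top, set(range(5)), r_sem))      # 40/55 ~ 0.727
print(weighted_recall(top, set(range(5, 10)), r_sem))  # 15/55 ~ 0.273
```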

5.2 Crime Dataset

We first compare the different methods on a small real crime dataset [4]. It records crime, household income and housing values in 49 neighborhoods in Columbus, Ohio, USA, and is illustrated in Fig. 2(a). Household income and housing values are treated as explanatory attributes to predict crime. The contiguity matrix is obtained from the sites' latitude-longitude pairs with [16] and is illustrated in Fig. 2(b). The histogram of the (without absolute operator) outlier factors by SEM, with $\hat{\rho} = 0.748783$ in Eq. (10), is shown in Fig. 3(a). As expected, the factors center around 0 and most of them lie in $[-1, 1]$; those whose absolute values exceed 2 are more likely to be true outliers. We compare the performance of the various methods in terms of thresholded recall and precision by varying $\theta$ over 5 values in $\{0.5i : i = 2, \ldots, 6\}$. The number of outliers by SEM with factor at least $\theta$, $|L_{SEM}(\theta)|$, is given in Fig. 3(b), where $|L_{SEM}(3)| = 1$ and $|L_{SEM}(1)| = 11$, about 20% of the whole population. Fig. 3(c) shows the thresholded recalls of SS, SP, TR1 and TR2, where $k$ in TR$k$ refers to the order of the neighborhood $\bigcup_{l=0}^{k} N_l(s_i)$, the set of neighboring sites over which the trimmed mean is computed. TR2's recall is best for all $\theta$. TR1's is better than or equal to SS's, except at the smallest value $\theta = 1$, where TR1's recall is slightly lower than SS's; this matters little, because we care more about the top outliers with high factors surviving high thresholds. SP's recall is always the lowest. For the thresholded precision shown in Fig. 3(d), SS is best, followed by TR1, whose precision is lower at the smallest two thresholds; then comes TR2, followed by SP. The reason TR$k$'s precision is lower than SS's is that most of its outlier factors are comparatively smaller than those of SS and SEM, a result of using MAD to transform the errors into outlier factors. Nevertheless, relatively low precision is not as serious a problem as relatively low recall: if a method's recall is low, we miss some true outliers; if its precision is low but its recall is high, we merely do some fruitless work, inspecting sites that later turn out to be non-outliers.

As for the standard (weighted) recall of the top $t$ outliers, we sample $t$ over 6 values in $\{1 + 2i : i = 0, \ldots, 5\}$. SS and TR2 are the top two methods in terms of standard recall, as shown in Fig. 3(e). They differ only at $t = 3$, where $SS > TR2$, and $t = 11$, where $TR2 > SS$. Because small $t$ receives more attention, SS seems slightly better than TR2. Although SS's standard recall equals TR2's at four values of $t$, further investigation based on weighted recall, shown in Fig. 3(f), indicates that at two values of $t$ ($t = 5, 7$), the ranks of the outliers recovered by SS are generally lower than those recovered by TR2, which means TR2 recovers the true outliers more faithfully than SS. We also tried using up to 3rd order neighbors, with results similar to using up to 2nd order neighbors. For such a small dataset, as the order of neighbors gets higher, the spatial trend estimated with the trimmed mean quickly becomes less local and more global. The average number of neighbors up to the $k$-th order is shown in Table 2, where we can see that as $k$ approaches 4, the estimate is approximately equivalent to using the global trimmed mean as the spatial trend for every site. In addition to plotting recall and precision curves, we also compare the similarity of the rankings over all sites between the various methods and SEM. The results are shown in Table 3. As expected, the entropy with itself (SEM) achieves the minimum. As far as the ranking of all sites is concerned, SS is the best, though the margin by which its (cross) entropy is smaller than TR2's is insignificant. It is also notable that TR2's (cross) entropy is the smallest among the TR$k$s, i.e., using up to 2nd order neighbors achieves the global ranking most similar to SEM's.

5.3 Election Dataset

We also test these methods on a large dataset of 1980 US presidential election results covering 3107 counties [19]. The original variables $v$ and the transformed variables $x$ and $y$ are listed in Table 4; we use $x_i$, $i = 1, 2, 3$, as explanatory variables to predict $y$, the dependent variable in SEM. The distribution of $y$ and the contiguity matrix are illustrated in Fig. 4(a, b), respectively. The histogram of the (without absolute operator) outlier factors by SEM is shown in Fig. 5(a), where $\hat{\rho} = 0.719174$ in Eq. (10). The likelihood of $y$ is maximized by concentrating most errors immediately around 0. Very few sites have outlier factors greater than 3, and those that do are potential top outliers. Although there are 3107 sites in total, we focus on up to the top $t = 50$ outliers. Thresholded recall and precision are computed at 5 values of $\theta$ in $\{0.5i : i = 6, \ldots, 10\}$. The number of outliers by SEM with factor at least $\theta$, $|L_{SEM}(\theta)|$, is given in Fig. 5(b); about 40 sites survive threshold 3 in SEM, roughly 1% of the whole population. Fig. 5(c, d) shows the thresholded recalls and precisions of SS, SP, TR1 and TR3, where $k$ in TR$k$ refers to the order of the neighborhood. For clarity, we did not plot TR2, whose performance lies between TR1's and TR3's. Similar to the crime dataset, TR3 (TR1) is better than SS in recall but a little worse in precision. As explained above, a higher recall is worth a lower precision when finding the top outliers. As for the standard (weighted) recall of the top $t$ outliers, we sample $t$ at 5 values in $\{10i : i = 1, \ldots, 5\}$ and show the results in Fig. 5(e, f). SP is the worst in both cases, while the other three, SS, TR1 and TR3, are very similar with negligible differences. We also tried using up to 4th order neighbors and obtained results similar to using up to 3rd order neighbors; it seems 3rd order neighbors suffice to achieve the maximum recall for the top 50 outliers in this dataset.

The average number of sites (neighbors plus the site itself) in $\bigcup_{l=0}^{k} N_l(s_i)$ for the election dataset is shown in Table 5, where such numbers nearly double each time $k$ increments from 2. When using up to 3rd order neighbors for estimating spatial trend, we retrieve, on average, 40.4 neighboring sites, which make up 1.3% of the 3107 sites, so this estimate is still local. Entropies between the global rankings over all sites by the different methods and SEM are given in Table 6. As higher order neighborhoods are used to compute the trimmed mean for estimating spatial trend, the entropy decreases, meaning the global ranking by TR$k$ gets closer to SEM's. The minimum entropy and cross entropy are both achieved at TR3; they then increase when up to 4th order neighbors are used. TR3 is also better than SS, though the difference is not significant.

6 Conclusion

Spatial outliers differ from non-spatial outliers in that they are outlying in their local neighborhoods, though they may not be outlying globally. In this paper, we developed a local trimmed mean approach to estimating spatial outlier factors. The trimmed mean unifies other estimators such as the mean and the median by varying the parameter $p$, the highest and lowest fraction to be trimmed off. The structure of our approach strictly follows the general spatial data model, i.e., spatial data consist of trend, dependence and error. Only after removing spatial trend and dependence, and thus making the errors independent and identically distributed, is it meaningful to compare them to evaluate outlier factors. In estimating spatial trend in a non-parametric way, we proposed to use higher order neighbors, in addition to the 1st order neighbors commonly used in other approaches. The general spatial data model also underlies the spatial error model, which is supposed to give the true outlier factors. In the final step of our approach, where the error must be transformed to an outlier factor that follows the standard Gaussian distribution, we employed the MAD, which is more robust than the sample standard deviation used in the z-score for estimating sample scale (variance).

We employed three variants of standard recall and precision over the top outliers to evaluate our approach against spatial statistic and scatter plot. Experimental results on two real datasets show our approach is significantly better than scatter plot and slightly better than spatial statistic. It seems that using 2nd order neighbors suffices for small datasets, e.g., of size less than 100, and 3rd order neighbors for large datasets, e.g., of size greater than 1000. The entropy of the various methods versus SEM shows our approach is similar to spatial statistic in producing global rankings over all sites close to the spatial error model's. Although our goal is to identify outliers, we need an accurate estimate that accommodates most non-outlier data, so that they are not falsely labeled outliers. The trimmed mean has been shown to give more robust, outlier-resistant estimates in the presence of outliers. It would be interesting future work to study the performance of other approaches, such as spatial statistic, with the mean operation replaced by the trimmed mean.

References

[1] C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 37-46, 2001.
[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 94-105, 1998.
[3] M. Ankerst, M. M. Breunig, H. P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 49-60, 1999.
[4] L. Anselin. Spatial Econometrics: Methods and Models. Kluwer, 1988.
[5] L. Anselin. Exploratory spatial data analysis and geographic information systems. In New Tools for Spatial Data Analysis, pages 45-54, 1994.
[6] V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons, 3rd edition, 1994.
[7] R. Barry and R. K. Pace. A Monte Carlo estimator of the log determinant of large sparse matrices. Linear Algebra and its Applications, 289(1-3):41-54, 1999.
[8] R. J. Beckman and R. D. Cook. Outliers. Technometrics, 25:119-149, 1983.
[9] M. M. Breunig, H. P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93-104, 2000.
[10] N. A. Cressie. Statistics for Spatial Data. Wiley, 1993.
[11] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226-231, 1996.
[12] D. M. Hawkins. Identification of Outliers. Chapman and Hall, 1980.
[13] B. Iglewicz and D. C. Hoaglin. How to Detect and Handle Outliers, volume 16. Quality Press, 1993.
[14] E. M. Knorr and R. T. Ng. Finding intensional knowledge of distance-based outliers. In Proceedings of the 25th International Conference on Very Large Data Bases, pages 211-222, 1999.
[15] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. The Very Large Data Bases Journal, 8(3):237-253, 2000.
[16] J. LeSage. MATLAB Toolbox for Spatial Econometrics. http://www.spatial-econometrics.com, 1999.
[17] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[18] R. Ng and J. Han. CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 14(5):1003-1016, 2002.
[19] R. K. Pace and R. Barry. Quick computation of spatial autoregressive estimators. Geographical Analysis, 29:232-247, 1997.
[20] R. K. Pace and R. P. Barry. Simulating mixed regressive spatially autoregressive estimators. Computational Statistics, 13:397-418, 1998.
[21] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427-438, 2000.
[22] S. Ross. A First Course in Probability. Prentice Hall, 5th edition, 1998.
[23] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[24] S. Shekhar, C. T. Lu, and P. Zhang. Detecting graph-based spatial outliers: Algorithms and applications (a summary of results). In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 371-376, 2001.
[25] R. E. Shiffler. Maximum z scores and outliers. The American Statistician, 42:79-80, 1988.
[26] P. N. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 32-41, 2002.
[27] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley, 1977.


Figure 1: Comparison of DB(p, d)-outlier, SS and SP. (a): a dataset of 10 sites; $s_i$, $i = 2 \ldots 9$, has two neighbors, $s_{i-1}$ and $s_{i+1}$; $s_1$ has neighbors $s_{10}$ and $s_2$; $s_{10}$ has neighbors $s_9$ and $s_1$. (b): DB(p, d)-outlier factor. (c): SS outlier factor. (d): SP outlier factor.

Figure 2: Crime dataset. (a): Crime distribution. (b): Contiguity matrix $W$; there are 270 nonzeros, with $W(i, j) > 0 \Leftrightarrow (s_i, s_j) \in N$.

Figure 3: Results on the crime dataset. The x-axis of (b, c, d) is the outlier factor threshold $\theta$ in thresholded recall and precision; the x-axis of (e, f) is $t$ in recall of the top $t$ outliers. $k$ in TR$k$ refers to the order of the neighborhood $\bigcup_{l=0}^{k} N_l(s_i)$, the set of neighboring sites over which the trimmed mean is computed. (a): Histogram of the (without absolute operator) outlier factor by SEM. (b): Number of outliers by SEM vs outlier factor threshold. (c): Thresholded recall vs outlier factor threshold. (d): Thresholded precision vs outlier factor threshold. (e): Standard recall vs $t$ in top $t$ outliers. (f): Rank-based weighted recall vs $t$ in top $t$ outliers.

Table 1: Efficiency comparison of the median and the 15% trimmed mean. In the first set, n data values are drawn from N(0, 1); in the second set, n data values from N(0, 1) plus 1 outlier from N(0, 10^2).

          N(0,1)               N(0,1) + N(0,10^2)
n         median    trimmed    median    trimmed
10        0.6497    0.9872     5.5843    5.7763
100       0.6110    0.9044     1.6699    2.4662
1000      0.6221    0.9041     0.6939    1.0072

Table 2: Average number of sites (neighbors plus the site itself) in $\bigcup_{l=0}^{k} N_l(s_i)$ for the crime dataset.

k                                 0     1      2      3     4
avg |union_{l=0}^k N_l(s_i)|      1     6.5    17.3   31    43

Table 3: Similarity of global ranking with SEM on the crime dataset.

                 SEM      SS       SP       TR1      TR2      TR3      TR4
entropy          3.7084   3.9838   4.1813   4.0414   4.0073   4.0410   4.0079
cross entropy    4.6948   4.9741   5.1746   5.0328   4.9979   5.0315   4.9984

Table 4: Variables in the election dataset.

number    description
v7        population casting votes
v8        population over age 19 eligible to vote
v9        population with college degrees
v10       home ownership
v11       income
x1        ln(v9/v8)
x2        ln(v10/v8)
x3        ln(v11/v8)
y         ln(v7/v8)

Table 5: Average number of sites (neighbors plus the site itself) in $\bigcup_{l=0}^{k} N_l(s_i)$ for the election dataset.

k                                 0     1       2       3      4
avg |union_{l=0}^k N_l(s_i)|      1     6.99    19.84   40.4   69.6

Table 6: Similarity of global ranking with SEM on the election dataset.

                 SEM      SS       SP       TR1      TR2      TR3      TR4
entropy          7.8484   8.1199   8.3061   8.1566   8.1200   8.1162   8.1221
cross entropy    8.8482   9.1197   9.3060   9.1565   9.1198   9.1161   9.1219

Figure 4: Election dataset. (a): Distribution of log voting rate in the 3107 counties. (b): Its contiguity matrix; there are a total of 18600 nonzeros, with $W(i, j) > 0 \Leftrightarrow (s_i, s_j) \in N$.

Figure 5: Results on the election dataset. The x-axis of (b, c, d) is the outlier factor threshold $\theta$ in thresholded recall and precision; the x-axis of (e, f) is $t$ in recall of the top $t$ outliers. $k$ in TR$k$ refers to the order of the neighborhood $\bigcup_{l=0}^{k} N_l(s_i)$, the set of neighboring sites over which the trimmed mean is computed. (a): Histogram of the (without absolute operator) outlier factor by SEM. (b): Number of outliers by SEM vs outlier factor threshold. (c): Thresholded recall vs outlier factor threshold. (d): Thresholded precision vs outlier factor threshold. (e): Standard recall vs $t$ in top $t$ outliers. (f): Rank-based weighted recall vs $t$ in top $t$ outliers.