
A Dimension Reduction Approach Using Shrinking for Multi-Dimensional Data Analysis

Yong Shi
Department of Computer Science and Information Systems, Kennesaw State University, Building 11, Room 3060, Kennesaw, GA 30144
[email protected]

doi: 10.4156/ijiip.vol1.issue2.9

Abstract

In this paper, we present ongoing research on data analysis based on our previous work on the shrinking approach. Shrinking [22] is a novel data preprocessing technique that optimizes the inner structure of data, and it can be applied in many data mining fields. Following our previous work on the shrinking method for multi-dimensional data analysis in the full data space, we propose a shrinking-based dimension reduction approach that addresses the dimension reduction problem from a new perspective. In this approach, data points are moved along the direction of the density gradient, making the inner structure of the data more prominent. The process is conducted on a sequence of grids with different cell sizes. The dimension reduction step is performed based on the difference between the data distributions projected on each dimension before and after the data-shrinking process: dimensions whose projected data distribution varies dramatically through the data-shrinking process are selected as good dimension candidates for further data analysis. This approach can help improve the performance of existing data analysis methods. We demonstrate how the shrinking-based dimension reduction approach affects the clustering results of well-known algorithms.

Keywords: Dimension Reduction, Shrinking, Multi-dimensional Data

1. Introduction

With the advance of modern technology, multi-dimensional data are being generated at an explosive rate in many disciplines. Data preprocessing procedures can greatly benefit the utilization and exploration of real data. Shrinking [22] is a novel data preprocessing technique that optimizes the inner structure of data, inspired by Newton's Universal Law of Gravitation [19]; it can be applied in many data mining fields. In this paper, we first give a brief introduction to our previous work on the formation of the shrinking concept and its application to clustering in the full data space; we then propose a shrinking-based approach for the dimension reduction problem in multi-dimensional data analysis.

1.1. Related work

Data preprocessing transforms the data into a format that can be processed more easily and effectively for the purposes of the users, and it is commonly used as a preliminary data mining practice. There are a number of data preprocessing techniques [18, 5], including data cleaning, data integration, data transformation, and data reduction. These techniques can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining [13].

Cluster analysis is used to identify homogeneous and well-separated groups of objects in databases. The basic steps in the development of a clustering process can be summarized as [7]: feature selection, application of a clustering algorithm, validation of results, and interpretation of the results. Among these steps, the clustering algorithm is especially critical, and many methods have been proposed in the literature for this step. Existing clustering algorithms can be broadly classified into four types [12]: partitioning [11, 15, 17], hierarchical [24, 8, 9], grid-based [23, 21, 2], and density-based [6, 10, 3] algorithms. One of the most common problems of existing algorithms is the rapid degeneration of performance with increasing dimensionality [10], particularly with approaches originally designed for low-dimensional data.


It is well acknowledged that, in the real world, a large proportion of data has irrelevant features, which may reduce the accuracy of some algorithms. One well-known technique for improving data analysis performance is dimension reduction [2, 1, 20], in which the data are transformed to a lower-dimensional space while preserving the major information they carry, so that further processing can be simplified without compromising the quality of the final results. There are several different ways in which the dimensions of a problem can be reduced. One approach selects an optimal subset of the existing dimensions (attributes). Another commonly used family of approaches is projection methods, in which the new projected dimensions are linear or nonlinear combinations of the old dimensions. A popular example is principal component analysis (PCA) via singular value decomposition (SVD) [14] for numerical attributes, which defines new attributes (principal components, or PCs) as mutually orthogonal linear combinations of the original attributes.

1.2. Data shrinking

In this subsection, we briefly introduce our previous work on the shrinking approach [22]. Shrinking is a data preprocessing approach that optimizes the inner structure of data by moving each data point along the direction in which it is most strongly attracted by the other data points. In the previous work [22], we proposed a shrinking-based approach for multi-dimensional data analysis which consists of three steps: data shrinking, cluster detection, and cluster evaluation and selection. The approach is based on a grid separation of the data space. Since grid-based clustering approaches depend on the proper selection of the grid-cell size, we used a technique called density span generation to select a sequence of grids of different cell sizes, and we performed the data-shrinking and cluster-detection steps on these suitable grids.

In the data-shrinking step, each data point moves along the direction of the density gradient, and the data set shrinks toward the inside of the clusters. The neighboring relationship of the points in the data set is grid-based: the data space is first subdivided into grid cells, and data points in sparse cells are considered noise or outliers and are ignored in the data-shrinking process. Data shrinking proceeds iteratively; in each iteration, we treat the points in each cell as a rigid body which is pulled as a unit toward the data centroid of those surrounding cells that contain more points. Therefore, all points in a single cell participate in the same movement.

Following the data-shrinking step, the cluster-detection step is performed: neighboring dense cells are connected, a neighboring graph of the dense cells is constructed, and a breadth-first search algorithm is applied to find the connected components (clusters) of the neighboring graph, as sketched below. The clusters detected at multiple grid scales are then compared by a cluster-wise evaluation measurement. The internal connecting distance (ICD) and external connecting distance (ECD) are defined to measure the closeness of the internal and external relationships, respectively. Compactness is then defined as the ratio of the external connecting distance to the internal connecting distance; it is used to evaluate the clusters detected at different scales and to select the best clusters as the final result.
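To make the cluster-detection step concrete, the following is a minimal sketch of connecting dense grid cells and extracting connected components with breadth-first search. The cell indexing, the density threshold, and the neighborhood definition are illustrative assumptions, not the implementation of [22]:

from collections import deque, defaultdict
from itertools import product

def detect_clusters(points, cell_size, min_points=3):
    # Assign each point to a grid cell (one integer index per dimension).
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x / cell_size) for x in p)
        cells[key].append(p)
    # Keep only dense cells; sparse cells are treated as noise/outliers.
    dense = {k for k, pts in cells.items() if len(pts) >= min_points}
    # BFS over the neighboring graph of dense cells.
    clusters, visited = [], set()
    for start in dense:
        if start in visited:
            continue
        component, queue = [], deque([start])
        visited.add(start)
        while queue:
            cell = queue.popleft()
            component.append(cell)
            for nb in neighbors(cell):
                if nb in dense and nb not in visited:
                    visited.add(nb)
                    queue.append(nb)
        clusters.append(component)
    return clusters

def neighbors(cell):
    # All cells whose index differs by at most 1 in every dimension.
    for offset in product((-1, 0, 1), repeat=len(cell)):
        if any(offset):
            yield tuple(c + o for c, o in zip(cell, offset))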

1.3. Proposed approach

In this paper, we propose a shrinking-based dimension reduction approach for multi-dimensional data analysis, addressing the inadequacies of current clustering algorithms in handling multi-dimensional data; it approaches the dimension reduction problem from a new perspective. In the proposed algorithm, data points are moved along the direction of the density gradient, simulating Newton's Universal Law of Gravitation, which leads to clusters that are condensed and widely separated. The process is grid-based; instead of choosing a grid with a fixed cell size, we use a sequence of grids of different cell sizes.

The dimension reduction process is performed based on the difference between the data distributions projected on each dimension through the data-shrinking process. For those dimensions which contribute strongly to good data analysis results (practically, the clustering process here), the alterations of their histogram variances through the data-shrinking process are significant. By evaluating the ratio of the histogram variances before and after the data-shrinking process, good dimension candidates for further data analysis steps (e.g., clustering algorithms) can be picked out efficiently, and unqualified ones are discarded. This can improve the clustering results of existing clustering algorithms; we demonstrate how the dimension reduction process improves the performance of existing clustering algorithms in the experimental section.

The remainder of this paper is organized as follows. Section 2 discusses the details of the dimension reduction process. Section 3 presents experimental results, and concluding remarks are offered in Section 4.

2. Dimension Reduction

In this section, a shrinking-based dimension reduction approach is proposed. In this approach, we consider the optimal selection of a subset of the existing dimensions for the purpose of easy interpretation.

2.1. Histogram variance

The shrinking-based dimension reduction approach is formalized as follows. To describe the approach, we first introduce some notation and definitions. Let the input d-dimensional dataset be

$$X = \{X_1, X_2, \ldots, X_n\} \quad (1)$$

The dataset is normalized to lie within the hypercube $[0, 1]^d \subset R^d$. Each data point $X_i$ is a d-dimensional vector:

$$X_i = \{X_{i1}, X_{i2}, \ldots, X_{id}\} \quad (2)$$

The data-shrinking process is performed under various grid scale candidates acquired by the density span generation method [22]. For each grid scale candidate and each dimension of X, a histogram is set up based on the current grid scale information:

$$H = \{H_1, H_2, \ldots, H_d\} \quad (3)$$

The number of segments on each dimension is not necessarily the same. Let $\eta_i$ be the number of bins in the histogram on the ith dimension. We denote each histogram as

$$H_i = \{H_{i1}, H_{i2}, \ldots, H_{i\eta_i}\} \quad (4)$$

$H_{ij}$ is a bin of the histogram. We denote the region of bin $H_{ij}$ as $[Min_{ij}, Max_{ij}]$ for $j = \eta_i$, or $[Min_{ij}, Max_{ij})$ otherwise. The size of bin $H_{ij}$ is the number of data points whose ith attributes fall in the region of $H_{ij}$:

$$|H_{ij}| = |\{X_l \mid Min_{ij} \le X_{lj} \le Max_{ij}\}| \quad (5)$$

for $j = \eta_i$, or $|H_{ij}| = |\{X_l \mid Min_{ij} \le X_{lj} < Max_{ij}\}|$ for $j \ne \eta_i$. Let $\bar{H}_i$ be the mean of the bin sizes of $H_i$,

$$\bar{H}_i = \frac{\sum_j |H_{ij}|}{\eta_i} \quad (6)$$

and let $\sigma^2_{H_i}$ be the variance of the bin sizes of $H_i$. We call

$$\sigma^2_{H_i} = \frac{\sum_j \left( |H_{ij}| - \bar{H}_i \right)^2}{\eta_i} \quad (7)$$

the histogram variance of the ith dimension.
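As an illustration of equations (1)-(7), the following minimal sketch (not the paper's implementation) normalizes a dataset into the unit hypercube and computes the histogram variance of every dimension. Using the same bin count on every dimension is a simplifying assumption; the paper allows $\eta_i$ to differ:

import numpy as np

def normalize(X):
    """Min-max normalize a dataset into the unit hypercube [0, 1]^d."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard constant dims
    return (X - mins) / span

def histogram_variance(X, n_bins):
    """Histogram variance (equation (7)) of each dimension of X,
    assumed normalized to [0, 1]^d."""
    X = np.asarray(X, dtype=float)
    variances = []
    for i in range(X.shape[1]):
        # Bin sizes |H_ij| (equation (5)); np.histogram closes the last
        # bin on the right, matching the [Min, Max] rule for j = eta_i.
        counts, _ = np.histogram(X[:, i], bins=n_bins, range=(0.0, 1.0))
        mean = counts.mean()                             # equation (6)
        variances.append(((counts - mean) ** 2).mean())  # equation (7)
    return np.array(variances)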

Figure 1. Histogram variance of Ecoli data on each dimension: (a) before the data-shrinking process; (b) after the data-shrinking process

2.2. Criteria for dimension reduction

A histogram is a popular way to mirror the inner structure of data, and the histogram information of each dimension is different. Observing the traits of the histograms on different dimensions can help determine the dimension selection process. The histogram variance of a dimension, defined in the previous subsection, can indicate the data distribution projected on that dimension to a certain degree. However, it is not reliable on its own. Figure 1 shows an example on the Ecoli data, which we describe in detail in the experimental section. The data set has seven dimensions. Figure 1(a) shows the original histogram variances of all seven dimensions of the Ecoli data, and Figure 1(b) shows the variances after the data-shrinking process. From Figure 1, we can see that dimension 3 and dimension 4 have much larger histogram variances than the other dimensions, both before and after the data-shrinking process. However, these two dimensions actually give poor support to the subsequent clustering process. The reason is that, on those dimensions, most of the data points lie in a very narrow dense area while just a few data points are far away from it, which makes the histogram variance extremely large even though those dimensions contribute almost nothing to capturing the characteristics of the data distribution.

The alteration of the histogram variance on each dimension through the data-shrinking process reflects the characteristics of the data distribution on that dimension much better than the histogram variance itself. For those dimensions which contribute strongly to good data analysis results (practically, the clustering process here), the alterations of their histogram variances through the data-shrinking process are significant. By evaluating the ratio of the histogram variances before and after the data-shrinking process on each dimension, instead of the histogram variance itself, good dimension candidates for further data analysis steps (e.g., clustering algorithms) can be picked out efficiently, and unqualified ones are discarded. For example, for the Ecoli data mentioned above, computing the ratio of the histogram variances through the data-shrinking process shows that neither dimension 3 nor dimension 4 has a high ratio compared to those of the other dimensions, so these two dimensions are properly removed from the candidate list of dimensions. The toy example below illustrates this criterion.
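The following toy computation, with made-up variance values chosen to echo the Ecoli discussion above (they are illustrative only, not measured results), shows why the before/after ratio is the better criterion:

import numpy as np

# Hypothetical histogram variances of a seven-dimensional data set.
var_before = np.array([10.0, 12.0, 900.0, 850.0, 15.0, 11.0, 9.0])
var_after  = np.array([60.0, 75.0, 910.0, 855.0, 95.0, 55.0, 11.0])

ratio = var_after / var_before   # equation (10), per dimension
# Dimensions 3 and 4 dominate the raw variances but barely change through
# shrinking (ratio near 1), so ranking by ratio correctly demotes them.
print(np.round(ratio, 2))        # [6.0, 6.25, 1.01, 1.01, 6.33, 5.0, 1.22]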

2.3. Dimension reduction process

In this subsection, we present the details of the proposed approach. Algorithm 1 describes the shrinking-based dimension reduction process; it is a grid-based approach.


Select reasonable grid scales. For grid-based approaches, it is crucial to select the grid size properly. However, proper grid-size selection is problematic without prior knowledge of the structure of a data set. In the shrinking approach, a histogram-based multi-scale gridding technique was proposed to address this problem; here we briefly describe the idea. Instead of choosing a grid with a fixed cell size, we use a sequence of grids of different cell sizes. We use the term density span to denote a combination of consecutive bins' segments on a certain dimension in which the number of data points exceeds a given threshold. For each histogram $H_i$, $i = 1, \ldots, d$, we sort its bins in descending order of the number of data points they contain. We then start from the first bin of the ordered bin set and merge it with its neighboring bins until the total number of data points in these bins exceeds a given threshold; a density span is thus generated as the combination of the segments of these bins. The operation continues until all the non-empty bins of the histogram belong to some density span, so each histogram yields a set of density spans; density spans with similar sizes are regarded as identical. Once we obtain the set S of all the density spans from all the histograms, we sort them by their frequencies in S and choose the most frequent density spans as the reasonable multiple scales for the following procedure. In other words, those density spans which appear most often in set S are chosen.
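The following is a minimal sketch of density span generation for a single histogram, under a simplified reading of [22]; the tie-breaking rule (grow toward the fuller neighboring bin) and the returned span representation are assumptions for illustration:

def density_spans(bin_counts, threshold):
    """Repeatedly start from the fullest unassigned bin and merge
    neighboring bins until the merged bins hold at least `threshold`
    points; stop when every non-empty bin belongs to some span.
    Returns (first_bin, last_bin, point_count) triples."""
    unassigned = {j for j, c in enumerate(bin_counts) if c > 0}
    spans = []
    while unassigned:
        j = max(unassigned, key=lambda b: bin_counts[b])  # fullest bin
        lo = hi = j
        total = bin_counts[j]
        unassigned.discard(j)
        while total < threshold:
            # Grow toward whichever neighboring bin is fuller
            # (an assumed rule; the text does not fix the order).
            candidates = [b for b in (lo - 1, hi + 1) if b in unassigned]
            if not candidates:
                break
            b = max(candidates, key=lambda c: bin_counts[c])
            total += bin_counts[b]
            unassigned.discard(b)
            lo, hi = min(lo, b), max(hi, b)
        spans.append((lo, hi, total))
    return spans

In the full method, the spans produced this way for all histograms are pooled into the set S, and the most frequent (similar-sized) density spans determine the sequence of grid scales used below.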

Perform data-shrinking and compute histogram variance alteration. After we obtain the set A of P reasonable grid scales, we perform the following steps under each grid scale $\lambda_j$:

a) For each dimension i, we first calculate the histogram variance defined in equation (7).

b) The data-shrinking process is then performed. In this step, each data point moves along the direction of the density gradient, and the data set shrinks toward the inside of the dense areas. Data shrinking proceeds iteratively: in each iteration, data points are "attracted" by their neighbors and move to create denser areas. The movement of a point can be intuitively understood as analogous to the attraction of a mass point by its neighbors, as described by Newton's Law of Universal Gravitation. Thus, the data points in the dense cells move toward the data centroid of the neighboring cells. We compute the movement of the data points in a dense grid cell $C_j^i$ in the ith iteration as

$$\mathrm{Movement}(C_j^i) = \begin{cases} \omega_j^s - \omega_j, & \text{if } \|\omega_j^s - \omega_j\| \ge T_{mv} \cdot \frac{1}{k} \text{ and } \sum_{k=1}^{w} n_{jk} > n_j \\ 0, & \text{otherwise} \end{cases} \quad (8)$$

where $\omega_j$ is the centroid of the grid cell $C_j^i$, $\omega_j^s$ is the data centroid of the surrounding cells of $C_j^i$, $\|\omega_j^s - \omega_j\|$ is the distance between them, $n_j$ is the number of data points in grid cell $C_j^i$, $\sum_{k=1}^{w} n_{jk}$ is the number of data points in the surrounding grid cells of $C_j^i$, $T_{mv}$ is a threshold ensuring that the movement is not too small, and $\frac{1}{k}$ is the side length of the grid cells. Equation (8) states that, if the distance between $\omega_j^s$ and $\omega_j$ is not too small and the surrounding cells have more points, cell $C_j^i$ will be translated such that its data centroid is moved to the data centroid of the surrounding dense cells; otherwise, cell $C_j^i$ remains static. The movement of each cell indicates the approximate direction of the density gradient around the cell. After the movement, a data point $X$ in cell $C_j^i$ is moved to $X + \omega_j^s - \omega_j$.

The data-shrinking iterations terminate if the average movement of all points is less than a threshold or if the number of iterations exceeds a threshold.
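The per-cell movement rule of equation (8) can be sketched as follows; the helper below is an illustrative reading of the rule, with the value of T_mv and the data-centroid convention as assumptions rather than the paper's exact implementation:

import numpy as np

def cell_movement(cell_points, surround_points, cell_side, T_mv):
    """Movement of one dense cell per equation (8).
    cell_points: points in cell C_j; surround_points: points in its
    surrounding cells; cell_side: the grid cell side length 1/k;
    T_mv: the minimum-movement threshold (value not fixed here)."""
    omega = cell_points.mean(axis=0)        # data centroid of the cell
    omega_s = surround_points.mean(axis=0)  # data centroid of surroundings
    delta = omega_s - omega
    moved = (np.linalg.norm(delta) >= T_mv * cell_side
             and len(surround_points) > len(cell_points))
    # Every point in the cell moves by the same delta (rigid body),
    # or the cell stays static when either condition fails.
    return delta if moved else np.zeros_like(delta)

After a nonzero movement, each point X in the cell becomes X + delta, matching the rigid-body translation described above.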

c) Let $\bar{\tilde{H}}_i$ be the mean of the bin sizes of $\tilde{H}_i$, and let $|\tilde{H}_{ij}|$ be the size of bin $\tilde{H}_{ij}$ after the data-shrinking process. We compute $\tilde{\sigma}^2_{H_i}$, the histogram variance after the data-shrinking process, as

$$\tilde{\sigma}^2_{H_i} = \frac{\sum_j \left( |\tilde{H}_{ij}| - \bar{\tilde{H}}_i \right)^2}{\eta_i} \quad (9)$$

d) We evaluate the variance difference between the original histogram and the after-shrinking histogram by the ratio of the variances:

$$\Delta\sigma^2_{H_i} = \frac{\tilde{\sigma}^2_{H_i}}{\sigma^2_{H_i}} \quad (10)$$

The sum of the histogram variance alterations over all the dimensions under a certain grid scale is then calculated:

$$\Gamma = \sum_{i=1}^{d} \Delta\sigma^2_{H_i} \quad (11)$$

Refine the set of grid scales. Suppose there are P reasonable grid scale candidates generated as above. We calculate the P corresponding histogram variance sums:

$$\Gamma = \{\Gamma_1, \Gamma_2, \ldots, \Gamma_P\} \quad (12)$$

Under some grid scales, the data cannot be properly shrunk by the data-shrinking process, because those grid scales cannot help make the data distribution more prominent. In such cases, the variance differences through the data-shrinking process are not significant, and those grid scales are discarded.

Select good dimension candidates. For the grid scales selected according to the sum of the histogram variance alterations, dimensions with a significant histogram variance alteration through the data-shrinking process are selected as good candidates for the clustering process, based on the integrated variance differences under these grid scales. Under each selected grid scale $\lambda$, we sort the dimensions $D_1, D_2, \ldots, D_d$ in descending order of $\Delta\sigma^2_{H_i}$. Suppose the dimension list after sorting is $\hat{D}_1, \hat{D}_2, \ldots, \hat{D}_d$; we select the first several dimensions. The cut on the dimension list is performed as follows: to keep the most valuable dimensions, the second half of the ordered dimension list is checked, and the cut spot is set at the first dimension with a sharp descent. We demonstrate the dimension-cut process in the experimental section. For each grid scale candidate, an ordered dimension list based on the histogram variance alteration is generated, and the ultimate selection of valuable dimensions is based on the integration of the dimension selections over these qualified grid scales. A code sketch of the whole procedure follows the listing below.

Algorithm 1 (Shrinking-based dimension reduction)
Input: data set X
Output: set of good dimension candidates for further data analysis steps
1) Select set A of P reasonable grid scales;
2) Under each grid scale $\lambda_j$, do:
   a) Calculate the histogram variance $\sigma^2_{H_i}$ for each dimension i;
   b) Perform the data-shrinking process;
   c) Calculate the histogram variance $\tilde{\sigma}^2_{H_i}$ for each dimension i;
   d) Calculate the sum of the histogram variance alterations on all the dimensions under the current grid scale: $\Gamma_j = \sum_{i=1}^{d} \Delta\sigma^2_{H_i}$;
3) Select subset B of the grid scale set A by large values of $\Gamma$ in $\Gamma = \{\Gamma_1, \Gamma_2, \ldots, \Gamma_P\}$;
4) Select set C of good dimension candidates from the integrated histogram variance alterations in set B;
5) Output the dimension set C.
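A compact sketch of Algorithm 1 follows. The shrink and n_bins_for hooks are hypothetical stand-ins for the data-shrinking procedure and the per-scale bin counts, and both the "top half" scale refinement and the 0.5 sharp-descent factor are illustrative choices, since the paper does not fix these thresholds:

def select_dimensions(X, grid_scales, shrink, n_bins_for):
    """Sketch of Algorithm 1; returns indices of good dimensions."""
    results = []
    for scale in grid_scales:                       # step 2
        X_shrunk = shrink(X, scale)                 # step 2b
        delta, gamma = variance_alteration(X, X_shrunk, n_bins_for(scale))
        results.append((gamma, delta))              # steps 2a/2c/2d
    # Step 3: keep the grid scales with the largest Gamma values.
    results.sort(key=lambda r: -r[0])
    kept = [delta for _, delta in results[: max(1, len(results) // 2)]]
    # Step 4: integrate the alterations over the kept scales, order the
    # dimensions, and cut at the first sharp descent in the second half.
    combined = sum(kept)
    order = sorted(range(len(combined)), key=lambda i: -combined[i])
    cut = len(order)
    for pos in range(len(order) // 2, len(order) - 1):
        if combined[order[pos + 1]] < 0.5 * combined[order[pos]]:
            cut = pos + 1
            break
    return order[:cut]                              # step 5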


2.4. Time and space analysis

Throughout the dimension reduction process, we need to keep track of the histogram information of each dimension for each grid scale candidate. Suppose the number of grid scale candidates is P, the maximum number of bins on a dimension is $\eta_{\max}$, and the dimensionality is d. The dimension reduction process then occupies $O(Pd\eta_{\max})$ space. Under a certain grid scale, the time for sorting the dimensions by their variance differences is $O(d \log d)$, so the total dimension-sorting time over all suitable grid scales is $O(Pd \log d)$.

3. Experiments

We conducted comprehensive experiments to assess the accuracy and efficiency of the proposed approach. Our experiments were run on SUN ULTRA 60 workstations with the Solaris 5.8 system. The experiments demonstrate how the shrinking-based dimension reduction approach can benefit well-known algorithms, and our algorithm has been found to yield encouraging results in real-world clustering problems. We tested our approach on three data sets from real applications and demonstrate that it helps improve the performance of existing clustering algorithms such as OPTICS and BIRCH, as well as data visualization tools such as VizCluster [16].

3.1. Data sets

The three data sets were obtained from the UCI Machine Learning Repository [4].

Data Set 1: The Wine Recognition data contains the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. It contains 178 instances, each of which has 13 features, including alcohol, magnesium, color intensity, etc. The data set has three clusters with sizes 59, 71, and 48.

Data Set 2: The Ecoli data concerns protein localization sites. It is made up of 336 instances, each with seven features, and contains 8 clusters with sizes 143, 77, 52, 35, 20, 5, 2, and 2.

Data Set 3: The third data set is Pendigits (Pen-Based Recognition of Hand-written Digits). It was created by collecting 250 samples from 44 writers, and it has two subsets used, respectively, for training and testing. For this experiment, we combined the two subsets, resulting in a data set with 10992 instances, each containing 16 attributes. The data set has ten clusters with sizes 1143, 1144, 1055, 1056, 1055, 1055, 1143, 1055, 1144, and 1142.

3.2. Shrinking-based dimension reduction approach on real data sets

Wine Data. We first performed the data-shrinking process on the Wine data. Table 1 shows the histogram variance differences of the Wine data before and after the data-shrinking process for a certain grid scale. We can see from Table 1 that the twelfth dimension has the most dramatic histogram variance change. Figure 2 shows the cut of the integrated ordered variance differences list on multiple scales for the Wine data. Seven dimensions are selected for further processing.

Table 1. Histogram variance differences of the Wine data for a certain grid scale


Figure 2. The cut of the variance differences list on Wine data

Ecoli Data. We then performed the data-shrinking process on the Ecoli data. Table 2 shows the histogram variance differences of the Ecoli data before and after the data-shrinking process for a certain grid scale. We can see from Table 2 that the first dimension has the most dramatic histogram variance change. Figure 3 shows the cut of the integrated ordered variance differences list on multiple scales for the Ecoli data. Five dimensions are selected for further processing.

Table 2. Histogram variance differences of the Ecoli data for a certain grid scale

Figure 3. The cut of the variance differences list on Ecoli data

Pendigits Data. We also performed the data-shrinking process on the Pendigits data. Table 3 shows the histogram variance differences of the Pendigits data before and after the data-shrinking process for a certain grid scale. We can see from Table 3 that the seventh dimension has the most dramatic histogram variance change. Figure 4 shows the cut of the integrated ordered variance differences list on multiple scales for the Pendigits data. Eleven dimensions are selected for further processing.

Table 3. Histogram variance differences of the Pendigits data for a certain grid scale


Figure 4. The cut of the variance differences list on Pendigits data

3.3. Testing results of existing clustering algorithms

In this section, we demonstrate how the shrinking-based dimension reduction approach improves the performance of well-known algorithms such as OPTICS and BIRCH, as well as data visualization tools such as VizCluster.

OPTICS: We adopted the implementation of OPTICS provided by Peer Kroeger. OPTICS does not produce a clustering of a data set explicitly; instead, it creates an augmented ordering of the data set representing its density-based clustering structure, from which the generated clusters can be roughly estimated by observation. We set the parameter values for OPTICS to be just "large" enough to yield a good result. Figure 5 shows the testing results of OPTICS on the Wine data before and after the shrinking-based dimension reduction process. We can see that, after the shrinking-based dimension reduction process, the clusters shown in curve (b) are much clearer than those in the original one (a).

Figure 5. Testing result of OPTICS for Wine data: (a) without shrinking-based dimension reduction; (b) after shrinking-based dimension reduction

Figure 6 shows the testing results of OPTICS on the Ecoli data before and after the shrinking-based dimension reduction process. The clusters shown in the curve of Figure 6(b) are also clearer than those in the original one, Figure 6(a).


Figure 6. Testing result of OPTICS for Ecoli data: (a) without shrinking-based dimension reduction; (b) after shrinking-based dimension reduction

Figure 7 shows the testing results of OPTICS on the Pendigits data before and after the shrinking-based dimension reduction process. Again, the performance on the dimension-reduced data is better than on the original data.

Figure 7. Testing result of OPTICS for Pendigits data: (a) without shrinking-based dimension reduction; (b) after shrinking-based dimension reduction

VizCluster: VizCluster [16] is an interactive visualization tool for multi-dimensional data. It combines the merits of both the multi-dimensional scatter plot and parallel coordinates. Integrated with useful features, it gives a simple, fast, intuitive, and yet powerful view of the data set. Due to space limitations, we present only the testing results on the Wine data and the Ecoli data. Figures 8, 9, 10, and 11 show the testing results on these two data sets; different shapes of the points represent different cluster IDs. From Figures 8 and 9, we can see that the visualization result of VizCluster on the dimension-reduced Wine data is much better than that on the original data.

Figure 8. Testing result of VizCluster for Wine data without shrinking-based dimension reduction


Figure 9. Testing result of VizCluster for Wine data after shrinking-based dimension reduction

From Figures 10 and 11, we can see that the visualization result of VizCluster on the dimension-reduced Ecoli data is also better than that on the original data.

Figure 10. Testing result of VizCluster for Ecoli data without shrinking-based dimension reduction

Figure 11. Testing result of VizCluster for Ecoli data after shrinking-based dimension reduction

BIRCH: We also tested how the shrinking-based dimension reduction approach affects the performance of BIRCH on different data. Due to space limitations, we show only the testing result on the Ecoli data described in the previous sections. The ground truth is that the Ecoli data contains 8 natural clusters, with sizes 143, 77, 52, 35, 20, 5, 2, and 2. Our test includes two steps. In the first step, we applied the BIRCH algorithm directly to the data, resulting in 8 clusters with sizes 133, 93, 74, 24, 6, 3, 2, and 1. In the second step, we applied BIRCH again to the data after the shrinking-based dimension reduction process and obtained 8 clusters with sizes 143, 101, 76, 7, 4, 2, 1, and 1. Comparing the two clustering results, we can see that the major clusters


generated after the shrinking-based dimension reduction process match the ground truth better than those generated by BIRCH on the original data.

4. Conclusion and discussion

In this paper, we presented a shrinking-based dimension reduction approach for multi-dimensional data. We select good dimension candidates for further data analysis based on the alteration of the histogram variance of each dimension through the data-shrinking process. We demonstrated the effectiveness and efficiency of our approach through tests on real data sets.

Data analysis methods still pose many open issues. Shrinking-based approaches rely, to a certain degree, on the selection of grid scales; improving the grid-scale acquisition approach would greatly benefit the whole shrinking concept and its implementation. This is one of our primary directions for future research.

5. References

[1] C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park. Fast algorithms for projected clustering. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 61-72, Philadelphia, PA, 1999.
[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 94-105, Seattle, WA, 1998.
[3] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'99), pages 49-60, Philadelphia, PA, 1999.
[4] S. D. Bay. The UCI KDD Archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Department of Information and Computer Science, 1999.
[5] D. Barbara, W. DuMouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. Bulletin of the Technical Committee on Data Engineering, 1997.
[6] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996.
[7] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996.
[8] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 73-84, Seattle, WA, 1998.
[9] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the IEEE Conference on Data Engineering, 1999.
[10] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 58-65, New York, August 1998.
[11] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume I: Statistics, 1967.
[12] A. Jain, M. Murty, and P. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3), 1999.
[13] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
[14] K. R. Kanth, D. Agrawal, and A. Singh. Dimensionality reduction for similarity searching in dynamic databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 166-176, Seattle, WA, 1998.
[15] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[16] L. Zhang, C. Tang, Y. Shi, Y. Song, A. Zhang, and M. Ramanathan. VizCluster: An interactive visualization approach to cluster analysis and its application on microarray data. 2002.
[17] R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th VLDB Conference, pages 144-155, Santiago, Chile, 1994.
[18] T. Redman. Data Quality: Management and Technology. Bantam Books, 1992.
[19] M. A. Rothman. The Laws of Physics. Basic Books, New York, 1963.
[20] T. Seidl and H.-P. Kriegel. Optimal multi-step k-nearest neighbor search. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 154-164, Seattle, WA, 1998.
[21] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the 24th International Conference on Very Large Data Bases, 1998.
[22] Y. Shi, Y. Song, and A. Zhang. A shrinking-based clustering approach for multidimensional data. IEEE Transactions on Knowledge and Data Engineering, 17(10):1389-1403, 2005.
[23] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd VLDB Conference, pages 186-195, Athens, Greece, 1997.
[24] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103-114, Montreal, Canada, 1996.
