WCCI 2012 IEEE World Congress on Computational Intelligence June, 10-15, 2012 - Brisbane, Australia

FUZZ IEEE

Comparison of Scalable Fuzzy Clustering Methods

Jonathon K. Parker and Lawrence O. Hall
Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620, USA
Email: [email protected] and [email protected]

James C. Bezdek
Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, Victoria 3010, Australia
Email: [email protected]

Abstract—Fuzzy c-means (FCM) is a well-known algorithm for clustering data, but for large datasets termination takes significant time. As a result, a number of scalable algorithms based on FCM have been developed. In this paper, four scalable variants of FCM are compared to the base algorithm. Runtime and three quality metrics are calculated. Experimental results using five datasets are analyzed. We show that the scalable algorithms are consistent with regard to speedup, but less consistent when quality is considered. The three quality measures are shown to have little correlation and vary in magnitude across datasets. Selection of a scalable algorithm must consider a tradeoff between the quality of results and speed. Of the variants, single pass FCM (SPFCM) is fastest with good fidelity to FCM, and extensible fast FCM (eFFCM) is almost as fast as SPFCM (as implemented) with very good fidelity to FCM. Random FCM is the fastest overall and often close in quality to FCM. The results showed that scalable algorithms occasionally produce better optimized results than FCM.

Index Terms—scalable, fuzzy, clustering, comparison

U.S. Government work not protected by U.S. copyright

I. INTRODUCTION

Given a set of data, clustering is the process of partitioning the dataset so that the members of each cluster are more similar to each other than to those in different clusters [1], [2]. Clustering is an important method for data analysis when the dataset is unlabeled. Fuzzy c-means (FCM), based on the venerable hard c-means algorithm (HCM) [3], is an effective, popular clustering method.

A vast volume of electronic data exists and the amount grows continuously [4]. Large repositories of data exist that typically exceed the memory of computers [5]. Additionally, there are data sources that continuously produce data, such as records of cell phone calls. If the data in question is unlabeled, clustering is the natural choice to find structure in the data. However, using a clustering algorithm such as FCM may take too long when the volume of data is large.

Many scalable algorithms based on FCM have been developed [6]–[11]. However, a review of the literature makes it difficult to select the best scalable algorithm for a particular application. This is because the research was performed on disparate machines and disparate code bases using inconsistent measures for speedup and quality.

The research described in this paper seeks to provide a fair comparison between four scalable algorithms and FCM. The same code base is used for all algorithms and the same quality metrics are collected and compared against FCM. Comparisons

are also made directly between similar scalable algorithms. Experiments were performed on five real-world datasets.

II. ALGORITHMS

The clustering algorithms studied all employ fuzzy sets as opposed to classical (i.e., crisp) sets. With fuzzy sets, an object can have a grade of membership in multiple sets [12]. In the case of clustering algorithms, this means a data element can simultaneously belong to multiple clusters. The degree of 'belonging' of object k in cluster i is u_ik. This is subject to the following constraints [1]:

u_{ik} \in [0, 1], \quad 1 \le i \le c, \; 1 \le k \le n    (1)

\sum_{i=1}^{c} u_{ik} = 1, \quad 1 \le k \le n    (2)

\sum_{k=1}^{n} u_{ik} > 0, \quad 1 \le i \le c    (3)

where n is the number of data elements and c is the number of clusters.

A. Fuzzy c-means (FCM)

This fuzzy variant of HCM produces a set of c cluster centers by approximately minimizing the objective function that calculates the within-group sum of squared distances from each data element to each cluster center. FCM alternates between calculating optimal cluster centers given the membership values of each data element and calculating the optimal membership values given the cluster centers. Assume the input data are feature vectors x_k in R^s. The objective function (J_m) is defined as:

J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} D_{ik}(x_k, v_i)    (4)

The functions for determining optimal membership values and optimal cluster centers are:

u_{ik} = \frac{D_{ik}(x_k, v_i)^{\frac{1}{1-m}}}{\sum_{j=1}^{c} D_{jk}(x_k, v_j)^{\frac{1}{1-m}}}    (5)

v_i = \frac{\sum_{j=1}^{n} (u_{ij})^{m} x_j}{\sum_{j=1}^{n} (u_{ij})^{m}}    (6)

where:
n: is the number of examples.
m > 1: is the 'fuzzifier'.
c: is the number of clusters.
U: is the membership matrix. u_{ik} refers to the membership of the kth data element (x_k) for the ith cluster.
V: is the set of cluster centers. v_i is the ith cluster center.
D_{ik}(x_k, v_i): is the squared distance between the kth data element and the ith cluster center. Any inner product induced distance metric can be used, but in this research the Euclidean distance was used.

The U or V matrices may be initialized with any valid set of values. Typically, the u_{ik} are initialized with a set of values adhering to (1) to (3), or each v_i is set to equal the position of a random point in the data set. The FCM algorithm is terminated when the difference between successive membership matrices or cluster centers does not exceed a given parameter ε.

The FCM algorithm has an expected runtime complexity of O(nisc^2) [11], where n and c are defined as above, s is the dimension of the data and i is the number of iterations. This can be reduced to O(nisc) [13] with some algorithmic modifications. Given a dataset with c natural clusters, runtime can be reduced by reducing n, s or i. The scalable solutions in this work reduce n or i.

B. Single Pass Fuzzy c-means (SPFCM)

The SPFCM algorithm [11], as the name implies, scans through the dataset a single time. The algorithm makes the assumption that the dataset is randomized. Many datasets have some inherent order of data elements, which is typical in magnetic resonance images. Randomization minimizes the likelihood of having subsets of data that are significantly different from the overall distribution.

The data set is broken into "partial data accesses" (PDAs), which contain a percentage of the data. The size of the PDA is equal to n times the fractional PDA parameter (fPDA), where n equals the total number of data elements. Then a weighted version of FCM (WFCM) is run on each PDA. WFCM differs from FCM in that each data element x_i has an associated weight w_i. The objective function and cluster center calculation are modified as follows [10], [11]:

J_{mw}(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} w_k D_{ik}(x_k, v_i)    (7)

v_i = \frac{\sum_{j=1}^{n} w_j (u_{ij})^{m} x_j}{\sum_{j=1}^{n} w_j (u_{ij})^{m}}    (8)

All data elements are initially given a weight of 1. After cluster centers V are calculated from the first PDA, they are assigned weights using the following equation [10]:

w'_i = \sum_{j=1}^{n} u_{ij} w_j, \quad 1 \le i \le c    (9)

These weighted cluster centers represent the partition information of the dataset from the first PDA. They are converted to c additional data elements and added to the dataset for the second PDA, which is then processed by WFCM. The final values for V calculated in the first PDA are used as the initial values for V in the second PDA. This routine repeats until all data is processed and a final set of cluster centers is obtained.

C. Online Fuzzy c-means (OFCM)

The OFCM algorithm is similar to SPFCM, but there are key differences [10]. Significantly, the dataset is not assumed to be randomized. The dataset is broken into PDAs and cluster centers from each PDA are calculated using FCM. As each set of cluster centers is obtained, weights for the cluster centers are calculated using (9). While each PDA can be clustered independently, in our experiments we use the cluster centers from the previous PDA as an initialization. This routine repeats until all data has been processed. Then the combined set of weighted cluster centers from all PDAs serves as the dataset for WFCM and a final set of cluster centers is obtained.

D. Extensible Fast Fuzzy c-means (eFFCM)

This algorithm randomly samples the dataset (with replacement) in an effort to obtain a statistically significant sample. Statistical significance is tested for with the Chi-square (χ^2) statistic or divergence. If the initial sample fails testing, additional data is added to the sample until a statistical test is passed [8]. eFFCM uses a progressive sampling scheme that was designed explicitly for image data because images have both spatial and intensity distributions.

In this work, the size of the initial sample, drawn without replacement, is equal to n times the fPDA parameter defined for SPFCM and OFCM, where n equals the total number of data elements. If the initial sample is not statistically significant, a small addition (n ∗ deltaPDA) is added to the initial sample and the sample is retested for statistical significance. The parameter deltaPDA represents a small increment. This process repeats until an acceptable sample is found. Depending on the dataset, statistical test used, and parameters, the eFFCM sampling procedure can oversample the input data. See [14] for variants of the statistical sampling technique. The final, statistically significant sample is then processed by FCM to obtain a set of cluster centers.

E. Random Fuzzy c-means (randFCM)

This algorithm randomly samples the dataset and FCM is run on the sample. Despite its simplicity, on certain datasets randFCM was reported to perform quite well in [15]. The speedup of the method is due to a direct reduction of the size of the dataset. In this work, the size of the sample is equal to n times the fPDA parameter defined for SPFCM and OFCM, where n equals the total number of data elements.
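As a concrete illustration, the WFCM update rules (5)-(9) and the single-pass loop of SPFCM can be sketched in Python with NumPy. This is our own sketch, not the paper's code (the paper's implementation is written in C); the function names, vectorized formulation, and default parameter values are assumptions for illustration only.

```python
import numpy as np

def wfcm(X, c, m=2.0, w=None, V=None, eps=1e-3, max_iter=300, rng=None):
    """Weighted FCM sketch; with w=None it reduces to plain FCM.

    Alternates Eq (5) (memberships) and Eq (8) (weighted centers, which
    equals Eq (6) when all weights are 1) and stops when the largest
    change in any membership value falls below eps (a sup-norm test)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    if V is None:  # initialize V with c randomly chosen data points
        V = X[rng.choice(n, size=c, replace=False)].astype(float)
    U = np.zeros((c, n))
    for _ in range(max_iter):
        # Eq (5): squared Euclidean distances raised to 1/(1-m)
        D = np.maximum(((X[None, :, :] - V[:, None, :]) ** 2).sum(-1), 1e-12)
        Dpow = D ** (1.0 / (1.0 - m))
        U_new = Dpow / Dpow.sum(axis=0, keepdims=True)   # columns sum to 1
        # Eq (8): weighted cluster-center update
        num = (w * U_new ** m) @ X
        den = (w * U_new ** m).sum(axis=1, keepdims=True)
        V = num / den
        max_change = np.abs(U_new - U).max()             # "maxChange"
        U = U_new
        if max_change < eps:
            break
    return V, U

def spfcm(X, c, fpda=0.1, m=2.0, **kw):
    """Single-pass FCM sketch: cluster each PDA with WFCM and carry the
    c weighted centers (weights from Eq (9)) into the next PDA."""
    n = len(X)
    pda = max(c + 1, int(n * fpda))
    V = None
    carry_pts = np.empty((0, X.shape[1]))
    carry_w = np.empty(0)
    for start in range(0, n, pda):
        chunk = X[start:start + pda]
        data = np.vstack([chunk, carry_pts])
        w = np.concatenate([np.ones(len(chunk)), carry_w])
        V, U = wfcm(data, c, m=m, w=w, V=V, **kw)
        carry_pts = V.copy()
        carry_w = (U * w).sum(axis=1)                    # Eq (9)
    return V
```

On a randomized dataset with well-separated clusters, `spfcm` tracks the centers FCM would find while loading only one PDA of raw data at a time.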

TABLE I
DATASETS

Dataset     examples (n)   attributes (s)   clusters (c)
MN016       3,882,771      3                3
MN017       3,898,407      3                3
MN018       4,293,292      3                3
Pendigits   10,992         16               10
Landsat     6,435          36               6

III. METRICS FOR COMPARISON

The purpose of scalable methods is to decrease runtime, but the benefits of a faster algorithm may be balanced by lower quality results. Therefore metrics must be defined to measure speed and quality. For this work, a single metric is employed to measure speed and three metrics to measure quality. The three quality metrics assess J_m, V and U respectively.

A. Relative Speedup (SU)

This calculation is a ratio of run times. If t_1 is the time for the reference algorithm (algorithm 1) and t_2 is the time for algorithm 2, the speedup (SU_12) of algorithm 2 relative to algorithm 1 is:

SU_{12} = \frac{t_1}{t_2}    (10)

Thus if the runtime of algorithm 1 is 800 ms and algorithm 2 takes 200 ms, the speedup is 4x, i.e., four times as fast.

B. Difference in Quality of Objective Function (DQ Rm)

The objective function J_m (4) is mathematically equivalent to a reformulated optimization criterion (R_m) [16], [17]:

R_m(V) = \sum_{k=1}^{n} \left( \sum_{i=1}^{c} D_{ik}(x_k, v_i)^{\frac{1}{1-m}} \right)^{1-m}    (11)

The R_m calculation is more convenient than J_m, as it only requires the original dataset and the cluster centers. Minimization of this value is the objective of the FCM algorithm and its variants, so comparisons using R_m have been employed as a means of comparing the quality of results for different algorithmic variants [11], [17]. The percentage difference in quality between algorithm 1 and 2 is calculated as follows [11]:

DQ_{R_m}\% = \frac{R_{m2} - R_{m1}}{R_{m1}} \times 100    (12)

where R_{m1} is the reformulated optimization criterion from the reference algorithm (algorithm 1) and R_{m2} from algorithm 2.

C. Quality of Partitions (VDQ)

This measure, introduced in [17], is designed to attach a metric to the variation of an algorithm's predicted cluster centers compared to a baseline algorithm. It can be used on a single algorithm reflexively, or can be used to compare two algorithms. VDQ is calculated as a percentage as follows:

VDQ\% = \frac{\sum_{i=1}^{t} \sum_{j=1}^{c} ||V'_{ij} - Vavg_j||}{t \cdot \sum_{j=1}^{c} ||Vavg_j||} \times 100    (13)

where:
t: is the number of trials.
V'_{ij}: is the jth cluster center (from the ith trial) to be compared.
Vavg_j: is the average position of the jth cluster center for the baseline algorithm.
||·||: is the Euclidean length of the vector.

The distance in the numerator is the Euclidean distance. The denominator scales the value. This measures how close the algorithm under evaluation is to the baseline algorithm (Vavg) it is being compared with.

D. Cluster Membership Change (CC)

Based on the concept that datasets have a 'natural' partition into groups that is revealed by clustering, this metric compares the cluster assignments of two soft clustering algorithms on the same dataset. Both partitions are hardened using the maximum membership rule on each column. If the assignment is not the same for column i in both partitions, the indicator variable δ_i is set to 1; otherwise this value is set to 0. The formula originally given by Gu [17] is presented below, as a percentage, in slightly modified form:

CC\% = \frac{\sum_{i=1}^{n} \delta_i}{n} \times 100    (14)

The comparison metrics VDQ% and CC% require identifying corresponding clusters across trials and experiments. The Hungarian method [18] was employed to provide the best match of clusters between partitions or algorithms.

IV. DATASETS

Five datasets were used for the experiments. See Table I for details on each dataset. Three datasets were magnetic resonance images (MRI) of a normal human brain. The intensities from the registered volumes of the T1-weighted, T2-weighted and Proton Density weighted sequences were the three features. These MRI images were clustered into three regions of the brain which generally correspond to cerebro-spinal fluid (CSF), gray matter (GM) and white matter (WM) [10].

Two small datasets were from the UCI repository, Pendigits and Landsat [19]. Pendigits is a collection of handwritten numerals converted to (x, y) positions and spatially resampled [20]. The ten clusters represent the numerals 0 through 9. Landsat is a collection of spatial and spectral information from satellite imagery. The clusters represent six different landcover types [21].
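For illustration, the R_m, DQ, and CC computations of the metrics section can be sketched in Python. This is our own simplified code, not the paper's: the function names are assumptions, m = 2 matches the experiments, and SciPy's `linear_sum_assignment` stands in for the Hungarian method [18].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian method

def r_m(X, V, m=2.0):
    """Eq (11): reformulated criterion R_m, needing only data and centers."""
    D = np.maximum(((X[None] - V[:, None]) ** 2).sum(-1), 1e-12)  # (c, n)
    return float(((D ** (1 / (1 - m))).sum(axis=0) ** (1 - m)).sum())

def dq_rm_pct(rm1, rm2):
    """Eq (12): percent difference of algorithm 2 vs. reference algorithm 1."""
    return (rm2 - rm1) / rm1 * 100

def cc_pct(U1, U2):
    """Eq (14): harden both partitions (maximum membership rule), align
    cluster labels with the Hungarian method, count changed assignments."""
    a, b = U1.argmax(axis=0), U2.argmax(axis=0)
    c = U1.shape[0]
    overlap = np.array([[np.sum((a == i) & (b == j)) for j in range(c)]
                        for i in range(c)])
    rows, cols = linear_sum_assignment(-overlap)   # maximize label overlap
    remap = {j: i for i, j in zip(rows, cols)}
    return 100.0 * np.mean(a != np.array([remap[x] for x in b]))
```

A lower `r_m` indicates a better-optimized set of centers, so a negative `dq_rm_pct` means the compared algorithm out-optimized the reference.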

V. EXPERIMENTS

The primary goal of the experiments was to compare the runtime and quality of the scalable algorithms on different datasets. Runtime is influenced by the number of iterations required for termination, which in turn depends on the

TABLE II
EXPERIMENT PARAMETER SETTINGS

Parameter   Value
m           2.0
ε           0.001
α           0.200
deltaPDA    0.02
fPDA        0.05, 0.10 or 0.20
Randomize   0 or 1

initialization of the iterations. Initialization was performed by randomly selecting c examples in X as a starting set of Vs. Each experiment consisted of 30 trials to ensure a statistically significant sample. We report averages over the 30 trials.

The algorithms have several parameters that can be tuned. Common parameters (m, ε) and algorithm-specific parameters (α, deltaPDA) were fixed for all experiments. Only two parameters were varied. The fractional partial data access (fPDA) was varied to see its effects on speed and accuracy. A flag parameter, which indicates data randomization prior to processing, was also toggled to explore the difference in the speed and accuracy of SPFCM and OFCM. These parameters are summarized in Table II.

Each experiment recorded the results from all algorithms on the same dataset with identical parameter settings. Metrics were also recorded for each algorithm. Specifically, the run time, the number of iterations to termination, the cluster centers and R_m were recorded.

VI. ALGORITHM IMPLEMENTATION NOTES

All algorithmic variants used the same weighted FCM implementation, written in C. Some code from [22] was used. All algorithms requiring random sampling or randomization used the same function. All or part of the dataset is loaded into memory and the positions of pairs of data elements are selected to be swapped n ∗ e times, where n is the number of data elements and e is the base of the natural logarithm. The randomized version of the dataset is then written to disk. Algorithms requiring a random sample read the file sequentially until the desired sample size is obtained. For consistency, this randomization method was used for sample selection with the eFFCM algorithm, despite [8] stating that selection should be made with replacement.
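The swap-based randomization described above can be sketched in Python (the paper's implementation is in C; this in-memory, list-based version is our own illustrative equivalent):

```python
import math
import random

def swap_randomize(data, seed=None):
    """Randomize a dataset in place by swapping n * e randomly chosen
    pairs of positions, where e is the base of the natural logarithm."""
    rng = random.Random(seed)
    n = len(data)
    for _ in range(int(n * math.e)):       # n * e pair swaps
        i, j = rng.randrange(n), rng.randrange(n)
        data[i], data[j] = data[j], data[i]
    return data
```

Each swap draws two positions independently, so 2ne positions are drawn in total, matching the probability analysis in the paper.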
The choice of n ∗ e pairs of swaps ensures that the probability of a data element never being picked to be swapped is [23]:

\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{2ne} = e^{-2e} \approx 0.00435    (15)

This ensures a greater than 99.5% probability that any given element will have its position altered, a reasonably high likelihood.

A. Termination Criterion

The FCM algorithm terminates when the difference between successive U matrices or successive sets of cluster centers (as measured by a convenient matrix norm) falls below ε.

Using a matrix norm to compare either alternative is problematic given the goal of the research, i.e., to compare algorithms across multiple datasets. The dataset sizes, sizes of PDA and range of the data differ in each experiment. Selection of a single ε parameter in these cases makes fair comparison difficult.

By way of example, imagine comparing two successive U matrices by using the summation of differences between cluster memberships as the matrix norm. This can be tuned for a single algorithm, but the same value for ε cannot be used as a termination criterion to compare FCM and SPFCM. The dataset size is n for FCM, but fPDA·n for SPFCM. To obtain an equal value of ε, FCM would require a smaller average difference in individual u values than SPFCM, increasing its runtime.

Likewise, comparing V matrices has the same problem as above when comparing performance across different datasets with different numbers of clusters. Using the same value for ε would force higher quality results from datasets with more clusters and allow lower quality results from datasets with fewer clusters. Another difficulty with using a V norm is that different datasets have different ranges of data, making the relative quality of results dependent on the dataset. Scaling the data would correct this one issue, but scaling is not possible for a true implementation of an online streaming algorithm such as OFCM, so scaling was not performed.

The solution is to use the U matrix sup norm, which is less dependent on the dataset size, scale or number of clusters. In our research, the maximum difference (maxChange) in a membership value for a data element in successive U matrices is used. As this criterion tests the stability of the membership values, it is independent of dataset size and data scale. While the effects due to the number of clusters are unknown, this technique provides a clear and unambiguous stopping criterion. This leaves the question of how to initialize the algorithm.
The V matrix is initialized by randomly selecting c data elements from the input dataset (except where noted in the algorithms' descriptions). The U membership matrix is then initialized by calling the function that executes equation (5) above. We then iterate V and U until termination [6], [7]. The software implementation calculates the maximum difference in membership value (maxChange) for a data element in X while equation (5) is executed. The stopping criterion for the FCM implementation is maxChange ≤ ε.

B. eFFCM

The eFFCM algorithm [8] presented the greatest implementation challenge. Some details were unclear, leaving ambiguity in how to implement eFFCM. The first decision is the test for statistical significance; we selected the χ^2 statistic. Using this statistic, equation (16) can be used to estimate "goodness of fit" at the desired level of significance (α) [24]:

\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}    (16)

where k is the number of 'bins', o_i is the observed number of objects in bin i, and e_i is the expected number of objects in bin i.

This leaves a few decisions of high impact to the experimenters. It was decided to consider each dimension of the dataset separately. The mean and standard deviation for each dimension were calculated. The values in each dimension were initially placed into bins with a width equal to the standard deviation divided by the number of clusters. The goodness of fit equation requires a minimum of 5 objects in each bin. To ensure successful sampling is possible, any bin with fewer than 5/fPDA data elements is merged with an adjacent bin until all surviving bins have a minimum of 5/fPDA data elements from the full dataset. So if fPDA = 0.05, each bin must have a minimum of 100 data elements. To merge the bins, we process the data from the lowest value bin to the highest value bin. If a bin has fewer data elements than required, it is merged with the next highest and retested. When the last bin is tested, the merging changes direction until the last bin tested is valid.

A sample of size n ∗ fPDA (where n is the number of data elements and fPDA ∈ [0, 1]) was then collected and each dimension was tested against equation (16). If the tests pass for all dimensions, the sample is considered statistically significant. Otherwise a sample of size n ∗ deltaPDA is added to the initial sample and the combined sample is retested. This process repeats until a sample passes all tests.

A significance level (α) for the χ^2 statistic also had to be chosen. Initial trials showed that high values for α, such as 0.95 or 0.90, would often require over 50% of the data before the goodness of fit test passed. This appeared to be an unduly large penalty on the runtime of the algorithm, so a rather relaxed value of 0.20 was chosen for α. In the choice of a value for α, there is a tradeoff between speed and obtaining a diverse sample.
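A simplified sketch of this progressive sampling loop in Python. This is our own illustration under simplifying assumptions: it uses ten fixed equal-width bins per dimension instead of the merged standard-deviation-based bins described above, and SciPy's `chi2.ppf` for the critical value.

```python
import numpy as np
from scipy.stats import chi2

def dimension_passes(full_col, sample_col, edges, alpha=0.20):
    """Eq (16) goodness of fit for one dimension: observed sample counts
    vs. expected counts scaled down from the full dataset."""
    o = np.histogram(sample_col, bins=edges)[0].astype(float)
    e = np.histogram(full_col, bins=edges)[0].astype(float)
    e *= len(sample_col) / len(full_col)        # expected counts in sample
    stat = ((o - e) ** 2 / np.maximum(e, 1e-12)).sum()
    return stat <= chi2.ppf(1.0 - alpha, df=len(o) - 1)

def progressive_sample(X, fpda=0.05, delta=0.02, alpha=0.20, rng=None):
    """eFFCM-style sampling sketch: start with n*fPDA points and grow by
    n*deltaPDA until every dimension passes the chi-square test."""
    rng = np.random.default_rng(rng)
    n, s = X.shape
    idx = rng.permutation(n)
    edges = [np.histogram_bin_edges(X[:, d], bins=10) for d in range(s)]
    size = int(n * fpda)
    while True:
        sample = X[idx[:size]]
        if all(dimension_passes(X[:, d], sample[:, d], edges[d], alpha)
               for d in range(s)):
            return sample
        if size == n:      # the whole dataset trivially fits itself
            return X[idx]
        size = min(n, size + int(n * delta))
```

Note the direction of the tradeoff: a higher α gives a smaller critical value, so the test is stricter and more data is typically required before a sample is accepted, matching the behavior reported above.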
We attempted to increase the speedup at a potential quality cost compared to FCM. The χ^2 statistic was calculated with code from [25].

VII. RESULTS

The metrics collected were used to calculate relative speedup (SU), DQ Rm%, VDQ% and CC% between the five algorithms for the five datasets. This created a large volume of data; Table III shows results from one volume of MRI data of the human brain. The speedups range from 1.5 times to over 10 times.

Using FCM as a baseline, Table IV displays the performance of each algorithm over all MRI datasets for each PDA in terms of speedup and quality. An average was taken over the MRI images, as performance was more or less uniform across them. The other two datasets (Tables V, VI) show more volatility. Table VII shows the average performance compared with FCM over all PDAs, for each dataset and algorithm.

VIII. DISCUSSION

These results are remarkable in a few ways. First, we examine the quality measures of the partitions. The quality measures of all algorithmic variants on the MRI datasets, for

TABLE III
SPEEDUP COMPARISON FOR MN016, fPDA = 0.05

Algorithm   vs. FCM   vs. SPFCM   vs. OFCM   vs. eFFCM   vs. randFCM
FCM         1         0.1969      0.6369     0.1928      0.0924
SPFCM       5.0795    1           3.2352     0.9793      0.4693
OFCM        1.5701    0.3091      1          0.3027      0.1451
eFFCM       5.1868    1.0211      3.3035     1           0.4792
randFCM     10.8234   2.1308      6.8936     2.0867      1

TABLE IV
AVERAGE PERFORMANCE VS. FCM ON MRI DATASETS

fPDA   Algorithm   Speedup   DQ Rm %   VDQ%    CC%
0.05   SPFCM       4.947     -0.011    0.193   0.250
0.05   OFCM        1.542     0.069     0.633   0.458
0.05   eFFCM       3.519     -0.010    0.187   0.253
0.05   randFCM     10.858    -0.005    0.253   0.240
0.1    SPFCM       4.073     -0.011    0.192   0.261
0.1    OFCM        1.466     0.077     0.649   0.528
0.1    eFFCM       3.307     -0.010    0.187   0.268
0.1    randFCM     7.413     -0.008    0.243   0.259
0.2    SPFCM       3.011     -0.011    0.166   0.256
0.2    OFCM        1.385     0.104     0.911   1.036
0.2    eFFCM       3.193     -0.010    0.170   0.259
0.2    randFCM     4.248     -0.010    0.179   0.250

TABLE V
AVERAGE PERFORMANCE VS. FCM ON PENDIGITS DATASET

fPDA   Algorithm   Speedup   DQ Rm %   VDQ%    CC%
0.05   SPFCM       8.376     0.218     6.153   7.651
0.05   OFCM        1.487     -0.028    3.434   8.270
0.05   eFFCM       4.828     0.076     4.048   6.850
0.05   randFCM     20.037    0.609     7.376   8.952
0.1    SPFCM       5.844     0.073     3.668   7.751
0.1    OFCM        1.822     -0.031    3.214   7.906
0.1    eFFCM       1.933     -0.063    3.477   5.295
0.1    randFCM     9.668     0.226     3.784   3.357
0.2    SPFCM       3.320     0.036     3.133   5.413
0.2    OFCM        1.155     -0.136    1.988   4.867
0.2    eFFCM       2.256     -0.064    3.436   5.268
0.2    randFCM     5.202     0.082     3.726   5.841

TABLE VI
AVERAGE PERFORMANCE VS. FCM ON LANDSAT DATASET

fPDA   Algorithm   Speedup   DQ Rm %   VDQ%    CC%
0.05   SPFCM       3.019     2.902     2.991   6.371
0.05   OFCM        1.072     0.636     1.200   5.159
0.05   eFFCM       1.287     0.284     0.684   1.383
0.05   randFCM     5.707     3.812     3.423   9.340
0.1    SPFCM       2.777     1.489     1.680   4.258
0.1    OFCM        1.052     0.999     2.608   18.508
0.1    eFFCM       1.607     0.282     0.709   2.424
0.1    randFCM     4.452     2.421     2.391   5.035
0.2    SPFCM       1.785     0.326     0.757   0.824
0.2    OFCM        0.875     0.791     1.735   8.982
0.2    eFFCM       1.159     0.219     0.612   1.507
0.2    randFCM     2.309     1.119     1.606   5.035

TABLE VII
AVERAGE DATASET PERFORMANCE VS. FCM

Dataset     Algorithm   Speedup   DQ Rm %   VDQ%    CC%
MRI         SPFCM       4.010     -0.011    0.183   0.256
MRI         OFCM        1.464     0.083     0.731   0.674
MRI         eFFCM       3.340     -0.010    0.181   0.260
MRI         randFCM     7.506     -0.008    0.225   0.250
Pendigits   SPFCM       5.847     0.109     4.318   6.938
Pendigits   OFCM        1.488     -0.065    2.879   7.014
Pendigits   eFFCM       3.006     -0.017    3.654   5.804
Pendigits   randFCM     11.636    0.306     4.962   6.050
Landsat     SPFCM       2.527     1.572     1.809   3.818
Landsat     OFCM        1.000     0.809     1.848   10.883
Landsat     eFFCM       1.351     0.262     0.668   1.772
Landsat     randFCM     4.156     2.451     2.473   6.470

TABLE VIII
randFCM SPEEDUP VS. FCM WITH TIME PENALTY

Dataset     fPDA   randFCM mean time (msec)   Avg pct overhead   FCM mean time (msec)   Avg speedup
MN16        0.05   8739                       45.211%            94586                  10.823
MN17        0.05   7968                       49.874%            83905                  10.530
MN18        0.05   8810                       50.250%            98842                  11.219
Pendigits   0.05   380                        22.895%            7614                   20.037
Landsat     0.05   191                        59.162%            1090                   5.707
MN16        0.1    12694                      31.282%            94214                  7.422
MN17        0.1    11638                      34.044%            83868                  7.206
MN18        0.1    13008                      33.710%            98981                  7.609
Pendigits   0.1    787                        11.563%            7609                   9.668
Landsat     0.1    252                        47.222%            1122                   4.452
MN16        0.2    23153                      16.832%            95786                  4.137
MN17        0.2    20387                      19.444%            84950                  4.167
MN18        0.2    22101                      19.818%            98128                  4.440
Pendigits   0.2    1502                       5.992%             7813                   5.202
Landsat     0.2    459                        25.490%            1060                   2.309

all fPDAs, show little deviation from the baseline algorithm (Tables IV, VII). Compared to the other datasets, the MRI datasets have a large number of data elements (> 3 × 10^6) and the dimensionality and number of clusters are low (3).

For the Pendigits data (Tables V, VII), the quality measures begin to degrade. The DQ Rm% is still small and occasionally an improvement over FCM. Regardless of whether the Rm value is an improvement or not, VDQ% and CC% are on average an order of magnitude higher than the corresponding values for the MRI datasets, but still below 10%.

For the Landsat dataset (Tables VI, VII), the DQ Rm% measures are all positive, i.e., consistently worse than FCM. The VDQ% and CC% metrics are also much larger than what was observed with the MRI datasets.

The original idea is that the FCM algorithm serves as the baseline for performance for the other algorithms. Reviewing the DQ Rm% column from Tables IV and V shows that occasionally the value for DQ Rm% is negative. Equation (11) was calculated with Rm1 equaling the value for FCM. Negative values for DQ Rm indicate that the competing algorithm had a lower Rm value than FCM. This means that, from the objective function perspective, it is preferred.

Occasionally, these seemingly small differences are statistically significant. A Welch's t test was employed to test the difference between FCM and SPFCM for the MRI dataset MN016. The DQ Rm is only 0.01%, but the t test yielded t = 93.4 with 30 d.f., resulting in p = 1.53 × 10^{-38}, a strongly significant statistical difference. A question is whether this difference represents an extremum for FCM, or if it matters.

The original assumption was that the VDQ% and CC% metrics would record the quality loss of the scalable method compared to the optimal baseline, FCM. When a lower Rm results from a scalable method, however, these metrics signify potentially positive differences. This is because the Rm's might be a more desirable extremum. So in cases where DQ Rm% is negative, these metrics show the quality improvement.

While the algorithms are designed to reduce Rm, the purpose is to partition the dataset. Changes to the partition are recorded in the CC% metric. In Tables V and VI, very small values of DQ Rm% correlate to large changes in cluster assignments (CC%).

The gains in speed overall are modest; the greatest is around 20x. Remember that the runtime complexity of FCM is O(nisc^2). Analysis of each algorithmic variant, ordered by its simplicity, is given in the following subsections.

A. randFCM's Speedup

This straightforward algorithmic variant reduces runtime by reducing the size of the dataset. Theoretically, randFCM should have a speedup inversely proportional to the size of the PDA. This theory does not take into account the time needed to randomly select the data from disk. The runtime reported in this paper does include this time, plus other overhead, which decreases the speedup (Table VIII). This has a proportionally larger effect the smaller the PDA is. Improvements could have been made in the random selection process, which would have improved the speedup. Regardless, a time penalty for disk I/O will still exist.

B. eFFCM's Speedup

The eFFCM algorithm (Tables IV, V and VI) always provides faster results than FCM, and the quality difference across all measures never exceeds 10%. On the low dimensionality MRI datasets, the quality difference never exceeds 1%.

The eFFCM algorithm's closest alternative is randFCM. Both algorithms randomly sample the dataset; the difference is eFFCM's statistical test before a sample is accepted. Table IX lists the paired results for the two algorithms with identical parameters. A select set of metrics (compared vs. FCM) is shown. The last six rows of Table IX show averages across all fPDAs.

The eFFCM results are of consistently higher quality than randFCM, judging from DQ Rm%. With the assumption that a lower DQ Rm% indicates better quality, the CC% shows the better quality of eFFCM's results. This is especially clear in the case of the high dimensionality Landsat and Pendigits datasets. However, randFCM has, on average, a significantly higher speedup than eFFCM.

TABLE IX: eFFCM AND randFCM VS. FCM

Dataset    fPDA  Algorithm  Time (msec)  Overhead (%)  Data used (%)  Speedup  DQ Rm %  CC%
MN16       0.05  eFFCM            18236        35.375             13    5.187   -0.010  0.322
MN16       0.05  randFCM           8739        45.211              5   10.823   -0.003  0.301
MN16       0.1   eFFCM            20968        30.923             14    4.493   -0.010  0.326
MN16       0.1   randFCM          12694        31.282             10    7.422   -0.007  0.363
MN16       0.2   eFFCM            25411        25.119             20    3.770   -0.011  0.306
MN16       0.2   randFCM          23153        16.832             20    4.137   -0.011  0.306
MN17       0.05  eFFCM            37875        21.885             37    2.215   -0.019  0.296
MN17       0.05  randFCM           7968        49.874              5   10.530   -0.016  0.264
MN17       0.1   eFFCM            36111        23.505             36    2.323   -0.019  0.309
MN17       0.1   randFCM          11638        34.044             10    7.206   -0.015  0.276
MN17       0.2   eFFCM            34056        22.372             36    2.494   -0.019  0.309
MN17       0.2   randFCM          20387        19.444             20    4.167   -0.018  0.287
MN18       0.05  eFFCM            31328        26.510             27    3.155   -0.002  0.140
MN18       0.05  randFCM           8810        50.250              5   11.219    0.003  0.157
MN18       0.1   eFFCM            31871        24.718             26    3.106   -0.002  0.169
MN18       0.1   randFCM          13008        33.710             10    7.609   -0.001  0.139
MN18       0.2   eFFCM            29595        25.095             26    3.316   -0.002  0.164
MN18       0.2   randFCM          22101        19.818             20    4.440   -0.002  0.156
Pendigits  0.05  eFFCM             1577        11.414             19    4.828    0.076  6.85
Pendigits  0.05  randFCM            380        22.895              5   20.037    0.609  8.952
Pendigits  0.1   eFFCM             3937         5.613             42    1.933   -0.063  5.295
Pendigits  0.1   randFCM            787        11.563             10    9.668    0.226  3.357
Pendigits  0.2   eFFCM             3463         6.093             42    2.256   -0.064  5.268
Pendigits  0.2   randFCM           1502         5.992             20    5.202    0.082  5.841
Landsat    0.05  eFFCM              847        30.224             35    1.287    0.284  1.383
Landsat    0.05  randFCM            191        59.162              5    5.707    3.812  9.340
Landsat    0.1   eFFCM              698        35.673             32    1.607    0.282  2.424
Landsat    0.1   randFCM            252        47.222             10    4.452    2.421  5.035
Landsat    0.2   eFFCM              915        27.541             36    1.159    0.219  1.507
Landsat    0.2   randFCM            459        25.490             20    2.309    1.119  5.035
MRI        AVG   eFFCM            29495        26.167           26.1    3.340   -0.010  0.260
MRI        AVG   randFCM          14278        27.260           11.7    7.506   -0.008  0.250
Pendigits  AVG   eFFCM             2992         7.707           34.4    3.006   -0.017  5.804
Pendigits  AVG   randFCM            890        13.483           11.7   11.636    0.306  6.050
Landsat    AVG   eFFCM              295        31.146           34.4    1.351    0.261  1.772
Landsat    AVG   randFCM             81        43.958           11.7    4.156    2.451  6.470
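The sample-selection overhead that separates eFFCM from randFCM in Table IX stems from eFFCM's χ² acceptance test: a candidate sample is kept only if its distribution is statistically consistent with the full dataset, and is enlarged and retested otherwise. The following is a minimal sketch of such an accept-and-extend loop, assuming numpy, a single feature binned into 10 histogram bins, and hypothetical function names; the actual eFFCM binning, significance test, and extension schedule differ:

```python
import numpy as np

CHI2_CRIT = 16.919  # chi-square critical value for df = 9, alpha = 0.05

def sample_is_representative(full, sample, bins=10):
    """Goodness-of-fit check: do the sample's histogram counts match the
    bin proportions observed in the full (1-D) dataset?"""
    edges = np.histogram_bin_edges(full, bins=bins)
    expected = np.histogram(full, bins=edges)[0] / len(full) * len(sample)
    observed = np.histogram(sample, bins=edges)[0]
    mask = expected > 0  # skip bins the full data never occupies
    stat = ((observed[mask] - expected[mask]) ** 2 / expected[mask]).sum()
    return stat < CHI2_CRIT

def effcm_style_sample(full, frac=0.05, step=0.01, rng=None):
    """Draw a random sample and enlarge it until it passes the chi-square
    acceptance test (progressive sampling, eFFCM-style)."""
    rng = np.random.default_rng(0) if rng is None else rng
    while True:
        size = int(min(frac, 1.0) * len(full))
        sample = rng.choice(full, size=size, replace=False)
        if sample_is_representative(full, sample):
            return sample
        frac += step  # sample failed: grow it and test again
```

The retest loop is exactly the "double hit" discussed next: eFFCM pays for the test itself and, whenever a sample fails, for clustering a larger sample than randFCM would have used.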

eFFCM takes a "double hit" in overhead from the sample selection process and from the increased sample size compared with randFCM. Assuming all other factors are equal (parameters, randomization), the sample selection process for eFFCM will take longer than that of randFCM because of the need to test for significance and the possibility of resampling if the initial sample fails the χ² test. When the total data used by eFFCM exceeds the size of the PDA used by randFCM, the runtime increases more or less proportionally, so on average the runtime of eFFCM will be longer than that of randFCM. The table rows showing average performance demonstrate this clearly: for the MRI datasets, the percentage of data used by eFFCM is 2.23 times that of randFCM while the overhead percentage of the two algorithms is about equal, and the corresponding speedup of randFCM over eFFCM is a proportional 2.24. Judging from the difference in quality, eFFCM appears to be useful when the dimensionality of the dataset is high, justifying the statistical testing. When the dimensionality is low, as with the MRI datasets, randFCM looks like the better choice.

C. SPFCM's Speedup

The SPFCM algorithm provides a reliable speedup with little or no loss in quality over all the datasets. A snapshot of the results is shown in Table X. The raw speedup ranges from roughly 1.8 to 8.4, but this figure includes overhead from the randomization process, disk I/O, and other sources, while the SPFCM algorithm itself assumes the data is already randomized. With the time for randomization subtracted, the speedup ranges from 2.2 to 9.2 (column "Adj. Speedup" in Table X). Based on DQ Rm % and CC%, SPFCM provides a small improvement in quality over FCM on the MRI datasets. The Landsat and Pendigits datasets show degradations in Rm, resulting in changes in cluster assignment ranging from 0.8% to 7.8%. There appears to be no correlation between DQ Rm % and CC% in this small sample.

TABLE X: SPFCM PERFORMANCE

Dataset    fPDA  Time (msec)  Pct overhead  Speedup  Adj. Speedup  DQ Rm %  CC%
MN016      0.05        18621        26.653    5.080         6.424   -0.010  0.303
MN016      0.1         23834        21.704    3.953         4.758   -0.010  0.330
MN016      0.2         31914        15.993    3.001         3.431   -0.011  0.287
MN017      0.05        17686        27.983    4.744         6.076   -0.019  0.349
MN017      0.1         21145        23.519    3.966         4.858   -0.019  0.344
MN017      0.2         28923        17.467    2.937         3.402   -0.019  0.343
MN018      0.05        19699        29.276    5.018         6.535   -0.003  0.097
MN018      0.1         23017        24.195    4.300         5.304   -0.003  0.109
MN018      0.2         31723        17.429    3.093         3.577   -0.002  0.139
Pendigits  0.05          909        16.282    8.376         9.263    0.218  7.651
Pendigits  0.1          1302        10.906    5.844         6.263    0.073  7.751
Pendigits  0.2          2353         6.162    3.320         3.452    0.036  5.413
Landsat    0.05          361        52.355    3.019         4.486    2.902  6.371
Landsat    0.1           404        46.040    2.777         3.909    1.489  4.258
Landsat    0.2           594        31.987    1.785         2.227    0.326  0.824

D. OFCM's Speedup

The OFCM algorithm produces only a negligible speedup over FCM when the clustering is done sequentially; in one experiment it was actually slower than FCM (Table XI has the details). An attempt to cross-reference results showed that no published reference to OFCM [10], [26], [27] actually includes a speed comparison with FCM. The quality of results is inconsistent over the datasets tested: OFCM performs poorly on Landsat but exceptionally well on Pendigits.

OFCM is very similar to the SPFCM algorithm. The primary differences are that the input data for OFCM is not assumed to be randomized and that weighted clusters from the previous PDA are not added to the subsequent PDA. The purpose of OFCM is to handle large amounts of streaming data [26]: data is processed as it arrives, and the weighted cluster centers derived from processing each chunk are saved to be combined with additional cluster centers later. As a result, the datasets were not randomized before processing by OFCM. A visual inspection of the original Pendigits dataset, which contains the class definitions, showed that its data is somewhat randomly distributed with respect to classes. This is not the case for the other datasets; for the MRI images, the image data was read in order.
The three types of brain tissue that constitute clusters are not randomly distributed in a normal human brain, thus increasing the likelihood that each PDA processed by OFCM was a non-representative sample of the whole dataset.
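The single-pass machinery that SPFCM and OFCM share can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' code: `wfcm` and `single_pass_fcm` are hypothetical condensed versions of weighted FCM and its chunked driver, with SPFCM's distinguishing step (carrying the previous PDA's weighted centers into the next chunk) marked in a comment:

```python
import numpy as np

def wfcm(X, w, c, m=2.0, tol=1e-5, max_iter=300, V=None, rng=None):
    """Weighted fuzzy c-means: each point x_k carries a weight w_k."""
    rng = np.random.default_rng(0) if rng is None else rng
    if V is None:
        V = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)  # n x c
        d = np.fmax(d, 1e-12)
        U = 1.0 / (d ** (2.0 / (m - 1.0)))   # fuzzy memberships
        U /= U.sum(axis=1, keepdims=True)
        um = (U ** m) * w[:, None]           # weights enter the V update
        V_new = (um.T @ X) / um.sum(axis=0)[:, None]
        if np.linalg.norm(V_new - V) < tol:
            return V_new, U
        V = V_new
    return V, U

def single_pass_fcm(X, c, chunk_size):
    """SPFCM-style driver: cluster each chunk, condense it to c weighted
    centers, and carry those into the next chunk (data assumed randomized)."""
    V, carry_pts, carry_w = None, None, None
    for start in range(0, len(X), chunk_size):
        chunk = X[start:start + chunk_size]
        w = np.ones(len(chunk))
        if carry_pts is not None:
            # SPFCM: previous centers join the next PDA as c weighted points
            chunk = np.vstack([carry_pts, chunk])
            w = np.concatenate([carry_w, w])
        V, U = wfcm(chunk, w, c, V=V)        # previous V also initializes
        carry_pts = V
        carry_w = (U * w[:, None]).sum(axis=0)  # membership mass per cluster
    return V
```

An OFCM-style variant would omit the `vstack`/`concatenate` step, saving each chunk's weighted centers aside and clustering all of them together only after the stream ends, which is consistent with the iteration counts compared in Table XII.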

TABLE XI: OFCM PERFORMANCE

Dataset    fPDA  Speedup  DQ Rm %   VDQ%   CC%
MN016      0.05    1.570    0.005  0.356   0.267
MN016      0.1     1.456    0.017  0.364   0.242
MN016      0.2     1.334    0.082  1.047   1.139
MN017      0.05    1.356    0.066  0.699   0.247
MN017      0.1     1.403    0.059  0.711   0.259
MN017      0.2     1.298    0.056  0.777   0.609
MN018      0.05    1.701    0.136  0.844   0.861
MN018      0.1     1.540    0.154  0.874   1.083
MN018      0.2     1.522    0.172  0.909   1.361
Pendigits  0.05    1.487   -0.028  3.434   8.270
Pendigits  0.1     1.822   -0.031  3.214   7.906
Pendigits  0.2     1.155   -0.136  1.988   4.867
Landsat    0.05    1.072    0.636  1.200   5.159
Landsat    0.1     1.052    0.999  2.608  18.508
Landsat    0.2     0.875    0.791  1.735   8.982

TABLE XII: AVERAGE PERFORMANCE OF OFCM VS. SPFCM

Dataset    Random  Algorithm  Time (msec)  Mean iterations  Iter. per PDA (î)  Speedup vs. SPFCM
MRI        no      OFCM             63391           137.16              12.25              0.479
MRI        no      SPFCM            30577            51.14               5.02              1
MRI        yes     OFCM             26029            46.06               4.25              0.757
MRI        yes     SPFCM            19888            28.18               2.90              1
Pendigits  no      OFCM              5437           574.70              52.05              0.274
Pendigits  no      SPFCM             1560           107.91              13.23              1
Pendigits  yes     OFCM              4420           487.17              42.62              0.336
Pendigits  yes     SPFCM             1455           110.22              12.30              1
Landsat    no      OFCM              1099           420.74              37.72              0.286
Landsat    no      SPFCM              317            76.02               7.93              1
Landsat    yes     OFCM               936           371.97              30.21              0.384
Landsat    yes     SPFCM              345            81.02               8.66              1

These non-representative subsets are a factor in OFCM's performance. Recall that the runtime complexity of FCM and its variants is O(n·i·s·c²), where i is the number of iterations. The SPFCM and OFCM methods still process 100% of the data; the gains are made because the average number of iterations is reduced. Hore et al. [11] identified that the gains in SPFCM come from the fact that, after the first PDA, the derived cluster centers are used to initialize V in the subsequent PDA. Initial cluster centers closer to the optimal cluster centers allow the algorithm to terminate with fewer iterations; that is where the speed gains are made. The more modest speedup observed from OFCM would likewise be due to fewer iterations. When a PDA is not representative of the whole dataset, there are two main effects: (1) the PDA might be slower to terminate due to homogeneous clusters being split, and (2) the PDA will pass on a set of cluster center initializations to the next PDA that are not representative of the ideal cluster centers.

In the previous experiments, SPFCM used randomized data and OFCM did not. In order to test this hypothesis, additional experiments were done to compare SPFCM to OFCM on identical datasets. The first set of 30 trials compared SPFCM and OFCM on non-randomized datasets with identical cluster initialization; the second set compared the algorithms on pre-randomized datasets. The only overhead to the algorithms was a small amount of disk I/O. A routine was written to capture the number of iterations of the core WFCM algorithm for each trial. Table XII shows the average results across datasets; the total mean iterations and the iterations per PDA are shown. The results show that randomized data consistently reduces the number of iterations for OFCM, which is directly correlated with a faster runtime. Surprisingly, SPFCM consistently requires fewer iterations than an identically configured set of OFCM trials.

When identically configured and applied to the same dataset, these algorithms have only two differences: (1) SPFCM uses the weighted cluster centers from the previous PDA, but OFCM saves them to disk, and (2) OFCM runs WFCM on the weighted cluster centers after all the data has been processed. Let us first consider difference (2). An examination of the diagnostic logs for the trials shows that the final run for OFCM typically added 2-7 iterations to the algorithm. With the datasets used, the size of the dataset for the final run is only 15-200 elements; hence, the amount of time added is negligible.

This leaves difference (1). OFCM uses the positions of the previous c cluster centers as initial values for V, but SPFCM uses those positions for V plus the c weighted points. Examination of equation (8) shows that the weights associated with the points influence the values in V. OFCM lacks the points with those weights, so it takes longer to terminate from the initialized values. A review of the diagnostic logs confirmed that, with identical initialization, OFCM required more iterations per PDA to terminate.

The average number of iterations per PDA, î, is the appropriate value to use when comparing performance with other FCM variants. Reviewing the complexity analysis in a manner similar to [11], the following notation is used:
p: the PDA size as a fraction of the full dataset (fPDA).
d = 1/p: the number of partial data accesses required.
i_j: the number of iterations in the j-th partial data access.
The total number of iterations can be expressed as Σ_j i_j, and î = p · Σ_j i_j. The runtime complexity for a single PDA is O(p·n·s·c²·i_j); thus the total runtime complexity equals O(Σ_j p·n·s·c²·i_j) = O(n·(p·Σ_j i_j)·s·c²) = O(n·î·s·c²). By way of comparison with Table XII, FCM averaged about 17 iterations on the MRI datasets to terminate to a solution.

IX. CONCLUSIONS AND FUTURE WORK

We developed an algorithmic test bed to compare FCM and four variants with respect to speed and three quality metrics. The core functions of the algorithms were carefully constructed to ensure comparisons on as equal a basis as possible. The algorithms, listed in order of speed from slowest to fastest, are: FCM, OFCM, eFFCM, SPFCM, and randFCM. OFCM had a surprisingly small speed improvement over FCM; slower termination of the partial data accesses was identified as the

root cause. Investigation of SPFCM and OFCM showed that their speedup is due to an overall decrease in average iterations. The time spent randomizing data was included in the calculations for eFFCM, SPFCM, and randFCM; the runtime was broken into categories for analysis, and if randomness can be assumed, the speedup of these three variants would further improve.

The algorithms are difficult to rank in order of quality, especially with three metrics. Judging solely by DQ Rm %, the eFFCM algorithm appeared to have the best quality overall and randFCM the worst. There was little consistency, however, in the quality measures across datasets, so any overall ranking of "best algorithm" would be very subjective. This study demonstrates the tradeoffs that must be made between speed and quality, and the variation in performance across datasets. The randFCM algorithm was fastest, with quality measures often close to FCM; SPFCM was next fastest, with quality measures slightly lower than eFFCM. So for big data and maximal speed, randFCM is competitive; SPFCM has a good mix of speed and fidelity to FCM; and eFFCM is almost as fast as SPFCM and closer to FCM in quality.

The scalable methods often produced results with a smaller Rm value than FCM. This shows that using a single quality metric such as CC% could be misleading, because the difference in cluster assignments could be beneficial: a perceived degradation of quality could actually be an improvement. Relatively small differences in Rm between algorithms on the same dataset occasionally resulted in disproportionate differences in CC%; in a similar vein, more significant differences in Rm occasionally had little effect on the other quality metrics.

The results raise many questions that may be explored in future work. The primary one is raised by the observation that some scalable methods occasionally have improved quality as measured by Rm: can one exploit this observation to get better partitions at termination in the general case? A related question is how CC% depends on differences in Rm. The eFFCM algorithm arguably had the best quality, despite the unclear implementation details. Some of the questions raised are discussed in [5], and future work may modify eFFCM to incorporate this or other research in an effort to maintain quality while improving speed.

REFERENCES

[1] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, 1981.
[2] A. K. Jain, "Data clustering: A review," ACM Computing Surveys (CSUR), vol. 31, no. 3, pp. 264-323, September 1999.
[3] S. P. Lloyd, "Least squares quantization of PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129-137, March 1982.
[4] A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651-666, June 2010.
[5] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics and Data Analysis, vol. 51, pp. 215-234, 2006.
[6] T. Cheng, D. Goldgof, and L. Hall, "Fast fuzzy clustering," Fuzzy Sets and Systems, vol. 93, no. 1, pp. 49-56, January 1998.

[7] M. C. Hung and D. L. Yang, "An efficient fuzzy c-means clustering algorithm," in Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM '01), 2001, pp. 225-232.
[8] N. R. Pal and J. C. Bezdek, "Complexity reduction for large image processing," IEEE Trans. Syst. Man Cybern., vol. 32, no. 5, pp. 598-611, October 2002.
[9] S. Eschrich, J. Ke, L. O. Hall, and D. B. Goldgof, "Fast accurate fuzzy clustering through data reduction," IEEE Trans. Fuzzy Systems, vol. 11, pp. 262-269, 2003.
[10] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," J. Sign. Process. Syst., vol. 54, pp. 183-203, 2009.
[11] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), July 2007, pp. 1-7.
[12] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, no. 3, pp. 338-353, 1965.
[13] J. Kolen and T. Hutcheson, "Reducing the time complexity of the fuzzy c-means algorithm," IEEE Transactions on Fuzzy Systems, vol. 10, no. 2, pp. 263-267, April 2002.
[14] L. Wang, C. Leckie, R. Kotagiri, and J. C. Bezdek, "Approximate pairwise clustering for large data sets via sampling plus extension," Pattern Recognition, vol. 44, no. 2, pp. 222-235, February 2011.
[15] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Trans. Fuzzy Systems, 2012, in review.
[16] R. J. Hathaway and J. C. Bezdek, "Optimization of clustering criteria by reformulation," IEEE Trans. Fuzzy Syst., vol. 3, no. 2, pp. 241-245, May 1995.
[17] Y. Gu, L. O. Hall, and D. B. Goldgof, "Evaluating scalable fuzzy clustering," in 2010 IEEE International Conference on Systems, Man and Cybernetics (SMC), October 2010, pp. 873-880.
[18] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83-97, March 1955.
[19] A. Frank and A. Asuncion, "UCI machine learning repository," 2010. [Online]. Available: http://archive.ics.uci.edu/ml
[20] F. Alimoglu and E. Alpaydin, "Methods of combining multiple classifiers based on different representations for pen-based handwritten digit recognition," in Proceedings of the Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium (TAINN '96), 1996.
[21] A. Srinivasan, "Statlog (Landsat satellite) data set," 1993. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Statlog+%28Landsat+Satellite%29
[22] B. Hoyt, "inih: simple .ini parser in C," 2011. [Online]. Available: http://code.google.com/p/inih/
[23] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
[24] R. Walpole and R. Myers, Probability and Statistics for Engineers and Scientists. Macmillan Publishing Company, 1985.
[25] B. Reiter and J. Aquino, "Statist 1.4.2," 2009. [Online]. Available: http://wald.intevation.org/projects/statist/
[26] P. Hore, "Scalable frameworks and algorithms for cluster ensembles and clustering data streams," Ph.D. dissertation, University of South Florida, June 2007.
[27] P. Hore, L. Hall, D. Goldgof, and W. Cheng, "Online fuzzy c means," in Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS), May 2008, pp. 1-5.