Int. J. Data Mining, Modelling and Management, Vol. 2, No. 4, 2010

A quantum evolutionary algorithm for data clustering

Chafika Ramdane*
Computer Science Department, University of Skikda, El-Hadaiek Road PB 26, Skikda, 21000, Algeria
Fax: 213-38-70-17-00
E-mail: [email protected]
*Corresponding author

Souham Meshoul and Mohamed Batouche
College of Computer and Information Sciences, Center of Excellence in Information Assurance, P.O. Box 51178, Riyadh 11543, Saudi Arabia
E-mail: [email protected]
E-mail: [email protected]

Mohamed-Khireddine Kholladi
Computer Science Department, University of Constantine, MISC Laboratory, PB 325, Ain El Bey Road, Constantine 25017, Algeria
E-mail: [email protected]

Abstract: The emerging field of quantum computing has recently created much interest in the computer science community due to the new concepts it suggests for storing and processing data. In this paper, we explore some of these concepts to cope with the data clustering problem. Data clustering is a key task in fields such as data mining and pattern recognition: it aims to discover cohesive groups in large datasets. In our work, we cast this problem as an optimisation process and describe a novel framework that relies on a quantum representation to encode the search space and a quantum evolutionary search strategy to optimise a quality measure in quest of a good partitioning of the dataset. Results on both synthetic and real data are very promising and show the ability of the method to identify valid clusters as well as its effectiveness compared with other evolutionary algorithms.

Keywords: data clustering; evolutionary algorithm; quantum computing; quantum representation; optimisation; data mining.

Reference to this paper should be made as follows: Ramdane, C., Meshoul, S., Batouche, M. and Kholladi, M-K. (2010) ‘A quantum evolutionary algorithm for data clustering’, Int. J. Data Mining, Modelling and Management, Vol. 2, No. 4, pp.369–387.

Copyright © 2010 Inderscience Enterprises Ltd.


Biographical notes: Chafika Ramdane received her MSc in Computer Science from the University of Constantine, Algeria, in 2006. She is a Lecturer at the University of Skikda in Algeria and is currently working towards her PhD degree. Her research interests include issues related to data clustering, evolutionary computing, quantum computing and optimisation.

Souham Meshoul is an Associate Professor in the Department of Information Technology at CCIS, King Saud University, Kingdom of Saudi Arabia. She received her Engineer, MSc and PhD (State Doctorate) degrees in Computer Science from the University of Constantine, Algeria. Her areas of interest include computational intelligence, quantum-inspired computing, bioinformatics and image analysis.

Mohamed Batouche is a Professor in the Department of Software Engineering, CCIS, King Saud University, Kingdom of Saudi Arabia. He received his Engineer degree in Computer Science from the University of Constantine, Algeria, and his MSc and PhD degrees from the Institut National Polytechnique de Lorraine, France, in 1989 and 1993, respectively. His research areas include complex systems, metaheuristics, quantum computing, image processing and computer vision.

Mohamed-Khireddine Kholladi is an Associate Professor in the Department of Computer Science at the College of Engineering, University Mentouri of Constantine. He received his PhD degree from INSA of Lyon, France. His interests include geographical information systems, computer graphics, knowledge databases, and new technologies of information and communication.

1 Introduction

Data clustering is the process of extracting groups, or clusters, from datasets. It is a key problem that remains the subject of active research in fields such as exploratory pattern analysis, machine learning, pattern recognition, image segmentation and data mining, to name just a few (Berkhin, 2002; Jain et al., 1999). Although a great deal of effort has been devoted to the data clustering problem, leading to an abundant literature (Xu and Wunsch, 2005), it remains a challenging task because of the lack of prior information about the underlying data distributions. Data clustering algorithms can be hierarchical or partitioning. Hierarchical clustering algorithms construct a hierarchy of partitions, represented as a dendrogram in which each partition is nested within the partition at the next level of the hierarchy (Xu and Wunsch, 2005). The basic advantage of hierarchical algorithms is that they are not sensitive to initial conditions. Their main drawback, however, is that they are static: data points assigned to a cluster cannot move to another cluster (Das et al., 2008). Moreover, the time complexity of this approach is quadratic. Partitioning clustering algorithms generate a single partition with a specified or estimated number of non-overlapping clusters. It has been recognised that partitioning algorithms are well suited to clustering large datasets due to their relatively low computational requirements (Das et al., 2008); their time complexity is almost linear, which makes them widely used. The simplest and most commonly used partitioning algorithm is K-means, which has the advantages of being computationally efficient and of not requiring the user to specify many parameters.


However, K-means is sensitive to the selection of the initial cluster centroids and may converge to local optima. To deal with the limitations of traditional partitioning clustering methods, numerous alternatives have been proposed, such as evolutionary partitioning clustering, in which clustering is viewed as an optimisation problem. Viewing clustering from this perspective requires a quality measure to serve as an objective function and a suitable optimisation strategy to find a good-quality clustering. The principal advantage of this approach is that the objective of the clustering is explicit, which enables us to better understand the performance of the clustering algorithm on particular types of data and to use task-specific clustering objectives (Handl and Meyer, 2007). Numerous heuristic-based optimisation algorithms have been proposed for data clustering, such as genetic algorithms (Krishna and Murty, 1999; Lu et al., 2004; Maulik and Bandyopadhyay, 2000), artificial ants (Azzag et al., 2003), artificial immune systems (De Castro and Von Zuben, 2001) and particle swarm optimisation (PSO) (Van der Merwe and Engelbrecht, 2003). Recently, quantum-inspired evolutionary metaheuristics have been developed, leading to hybrid algorithms such as the quantum evolutionary algorithm (QEA) (Han and Kim, 2002, 2004) and quantum particle swarm optimisation (QPSO) (Sun et al., 2004). The main advantage of this hybridisation is the quantum representation, a kind of probabilistic representation that offers better population diversity than other representations (Han and Kim, 2004). QEA has been applied to gene expression data clustering (Zhou et al., 2005). That algorithm starts by applying K-means to the dataset to build a population of initial solutions, each solution being a set of centroids. Its key idea is to combine the centroids of the initial solutions in order to generate new solutions; this is carried out while evolving the population of individuals through the operations of observation, selection, interference and migration. The algorithm has the same structure as the one described by Han and Kim (2002). In Xiao et al. (2008), an improved K-means clustering algorithm based on a quantum-inspired genetic algorithm is proposed, in which each individual encodes the cluster centroids of a partition. The algorithm applies operations such as selection, crossover, mutation and rotation to a population, and optimises the Davies-Bouldin index. QPSO has been combined with K-means and applied to cluster data (Sun et al., 2006). In that algorithm, a single particle represents a cluster centroid vector, while a swarm represents a number of candidate clustering schemes for the dataset. QPSO was also proposed to cluster gene expression data (Chen et al., 2008).

In this paper, we tackle the data clustering problem using a novel quantum evolutionary framework with a novel quantum representation in which each individual represents a partition and the probabilities carry a significant meaning: they reflect the distances between data points and cluster centroids. The main advantage of this representation is that it allows a small population size. Furthermore, we developed a novel initialisation procedure based on this quantum representation, as well as novel operations. The remainder of the paper is organised as follows. In Section 2, we describe the basic concepts underlying quantum computing. In Section 3, we present the proposed QEA for data clustering. Experiments and results are presented in Section 4. Finally, conclusions and future work are drawn.

2 Quantum computing principles

The origins of quantum computing go back to the early 1980s, when Richard Feynman observed that quantum mechanical effects cannot be simulated efficiently on a classical computer. This led to speculation that computation in general could be done more efficiently by exploiting quantum effects. A few years later, work on quantum computers progressed actively because such computers were shown to be more powerful than classical computers on various specialised problems. However, without quantum algorithms that solve practical problems, quantum computing hardware would be of little use. In the 1990s, the well-known quantum algorithms appeared, such as Shor's (1994) quantum factoring algorithm and Grover's (1996) database search algorithm. Since then, quantum computing has attracted serious attention, especially because it provides a form of parallelism that can substantially reduce algorithmic complexity. Such an ability of parallel processing can be used to solve combinatorial optimisation problems, which require the exploration of large solution spaces.

In a quantum system, the smallest unit of information is the qubit. Unlike the classical bit, a qubit can be in a superposition of the two values at the same time. The state of a qubit can be represented as

$$|\Psi\rangle = \alpha|0\rangle + \beta|1\rangle$$

where $|\Psi\rangle$ denotes a wave function in Hilbert space, $|0\rangle$ and $|1\rangle$ represent the classical bit values 0 and 1, respectively, and α and β are complex numbers, the probability amplitudes of the corresponding states, satisfying $|\alpha|^2 + |\beta|^2 = 1$. If the superposition is measured with respect to the basis $\{|0\rangle, |1\rangle\}$, the probability to measure 0 is $|\alpha|^2$ and the probability to measure 1 is $|\beta|^2$.

A quantum register, that is, a system of n qubits, can represent 2^n states at the same time, which means that an exponential amount of information can be stored in a quantum register. Quantum parallelism stems from the ability to act on all states within a superposition at once: since the number of possible states is 2^n, one operation on a quantum computer can achieve what would take an exponential number of operations on a classical computer. This is very attractive, but until now there is no powerful quantum machine able to execute the developed quantum algorithms. Quantum algorithms consist in applying a series of quantum operations to a quantum system; these operations are performed using quantum gates and quantum circuits. Pending the construction of a powerful quantum machine, research is being conducted to benefit from the quantum computing field in other ways. Based on the concepts of qubits and superposition of states, the merging of evolutionary computing and quantum computing has successfully resulted in the QEA (Han and Kim, 2002), which, like any other evolutionary algorithm, relies on a representation of the individual, an evaluation function and a population dynamics. However, instead of a binary, numeric or symbolic representation, QEA uses a quantum representation, which allows representing the superposition of all potential solutions for a given problem, and it evolves the entire population through generations using quantum operators. QEA has demonstrated its effectiveness and applicability on the knapsack problem and others (Han and Kim, 2002, 2004).
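As a concrete illustration of these measurement probabilities, here is a minimal Python sketch (ours, not from the paper) that samples a single qubit according to $|\alpha|^2$ and $|\beta|^2$:

```python
import numpy as np

def measure_qubit(alpha, beta, rng):
    """Collapse a qubit: return 0 with probability |alpha|^2, 1 with |beta|^2."""
    return int(rng.random() < abs(beta) ** 2)

rng = np.random.default_rng(seed=0)
alpha = beta = 1 / np.sqrt(2)                      # equal superposition
outcomes = [measure_qubit(alpha, beta, rng) for _ in range(10_000)]
print(np.mean(outcomes))                           # close to |beta|^2 = 0.5
```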

3 A quantum evolutionary algorithm for data clustering (QEAC)

In this section, we describe how quantum computing concepts have been used to perform data clustering. Two main features characterise the proposed approach: the quantum representation of the search space and the quantum evolutionary dynamics. Given a set of data to cluster, the main idea consists in optimising a measure of cluster quality to find a partition of the dataset. During the optimisation process, quantum operations are applied to the derived quantum representation. The quantum population evolves over generations until a termination criterion is reached.

3.1 Quantum representation

The problem to be solved can be formally defined as follows. Given a dataset S, we have to search for a partition C = {c1, c2, ..., cK} of S, where each ci represents a cluster and K, the number of clusters, is supposed to be given by the user. Each partition satisfies the following conditions:

1   ci ≠ ∅, for all i

2   ci ∩ ci′ = ∅, for all i ≠ i′

3   ∪i ci = S.

The search space is thus the space of all potential partitions. A partition can be encoded as a binary matrix, denoted by BM, where each row represents a cluster and each column a data point, that is, an element of S. The value of an element xij of this matrix is set to 1 to indicate that the data point pj of column j belongs to cluster ci, and to 0 otherwise. From this binary encoding, a quantum representation can easily be formed. It consists of a quantum matrix, denoted by QM (see Figure 1), identical in structure to the binary matrix BM but different from it in two ways.

Figure 1   Representation of a quantum partition QM: m is the number of data points and K is the number of clusters

$$QM = \begin{bmatrix} \binom{\alpha_{11}}{\beta_{11}} & \binom{\alpha_{12}}{\beta_{12}} & \cdots & \binom{\alpha_{1m}}{\beta_{1m}} \\ \vdots & \vdots & \ddots & \vdots \\ \binom{\alpha_{K1}}{\beta_{K1}} & \binom{\alpha_{K2}}{\beta_{K2}} & \cdots & \binom{\alpha_{Km}}{\beta_{Km}} \end{bmatrix}$$

First, each element qij of QM is in fact a qubit $\binom{\alpha_{ij}}{\beta_{ij}}$, where αij and βij are probability amplitudes satisfying $|\alpha_{ij}|^2 + |\beta_{ij}|^2 = 1$. The value $|\beta_{ij}|^2$ is interpreted as the probability of assigning data point pj to cluster ci, and $|\alpha_{ij}|^2$ as the probability of the non-assignment case. In this manner, the quantum representation of a partition is a quantum register containing a superposition of all possible combinations of data points within the clusters. As a consequence, the second main difference of this encoding is its ability to represent all potential partitions instead of only one: it is in fact a probabilistic representation of all the assignment configurations of data points to clusters.


3.2 Outline of the proposed framework

For the sake of clarity, we first describe the general scheme of the proposed framework and then detail each of its main steps. As we do not operate on a quantum computer, and in order to maintain diversity within partitions, we use a population of n quantum partitions whose lth partition is denoted by QMl. From the quantum population, a population of n binary partitions is derived, whose lth partition is denoted by BMl.

Figure 2   Structure of QEAC

The best binary partition derived from the lth quantum partition is denoted by Bl, and the global best binary partition found at each period of generations is denoted by Bglob. The binary population is divided into ng groups; each group contains nd partitions, and the local best partition found in the gth group is denoted by Bgroupg. Starting with an initial quantum population, the process consists in evolving this population by applying some basic quantum operators: measurement, constructive and destructive interference, and regeneration. Global and local migrations are applied to the binary population. Taking inspiration from Han and Kim (2004), Figure 2 outlines the structure of the proposed framework, which can be described by the following algorithm, where t denotes the iteration, period is a fixed number of iterations after which global migration is performed, n is the population size and NbrGeneration denotes the maximum number of generations:

INPUT: dataset S
Begin
    t ← 1; l ← 1
    Repeat
 1      Initialise(QMl); l ← l + 1
    Until (l > n)
    Repeat
        l ← 1
        Repeat
 2          BMl ← Measurement(QMl)
 3          Repair(BMl)
 4          Evaluate(BMl)
 5          If (t = 1)                      /* first best solution */
                Bl ← BMl
            Else                            /* best solution */
                Bl ← best solution among BMl and Bl
            End if
            p ← random([0, 1])
            If (p ≤ Pinterferdes)           /* destructive interference */
 6              QMl ← destructive interference(QMl, Bl)
            Else                            /* constructive interference */
 7              QMl ← constructive interference(QMl, Bl)
            End if
            l ← l + 1
        Until (l > n)
        If (t mod period = 0)               /* global migration */
            Bglob ← best solution among all Bl
 8          Migrate Bglob to all Bl
        Else                                /* local migration */
            g ← 1
            Repeat
                Bgroupg ← best solution among the nd Bl of the gth group
 9              Migrate Bgroupg to the nd Bl of the gth group
                g ← g + 1
            Until (g > ng)
        End if
        /* regeneration */
        p ← random([0, 1])
        If (p ≤ prob)
            l ← random(1, n)
10          QMl ← regeneration(BMl, prob)
        End if
        t ← t + 1
    Until (t > NbrGeneration)
End
OUTPUT: Bglob

3.3 Objective function

To evaluate the quality of a given partition, we choose a measure of compactness as the objective function M, the same function optimised by the genetic algorithm of Maulik and Bandyopadhyay (2000). It computes the sum of the distances between the data points and their corresponding cluster centroids:

$$M(C) = \sum_{c_i \in C} \sum_{p_j \in c_i} d(p_j, \mu_i) \qquad (1)$$

where pj denotes a data point, μi represents the centroid of cluster ci and d(.,.) is the Euclidean distance. The function M should be minimised. We now describe each procedure mentioned in the algorithm.
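As an illustration only (the paper gives no code), a NumPy sketch of M for a partition encoded as a label vector; the function and variable names are ours:

```python
import numpy as np

def objective_m(points, labels, k):
    """Equation (1): sum of Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for i in range(k):
        members = points[labels == i]
        if members.size == 0:
            continue                      # empty clusters are handled by the repair step
        centroid = members.mean(axis=0)
        total += np.linalg.norm(members - centroid, axis=1).sum()
    return total
```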

3.4 Initialisation step

QEAC starts by selecting K data points at random as initial cluster centroids and assigning each data point to its closest centroid. These two steps generate one possible binary partition. If this binary partition contains empty clusters, it is rejected and replaced by another one; this ensures that a valid initial binary partition is generated. To generate the quantum partition, we need a function that calculates the αij and βij of each qubit qij. One possible function, based on the distance between cluster centroids and data points, is defined as follows:

$$\alpha_{ij} = \cos(\operatorname{arccot}(d(p_j, \mu_i))) \qquad (2)$$

$$\beta_{ij} = \sin(\operatorname{arccot}(d(p_j, \mu_i))) \qquad (3)$$

The geometric interpretation of the chosen function is shown in Figure 3: the distance between the data point pj and the centroid μi of cluster ci is taken to be the cotangent of an angle θij, with αij its projection on the cosine axis and βij its projection on the sine axis.

Figure 3   Geometric interpretation of the function calculating the initial αij and βij
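A minimal sketch of this initialisation, assuming the dataset is an m × Dim NumPy array (names are ours). Note that arccot(d) = arctan2(1, d), so a zero distance gives β = 1 (certain assignment) while a large distance gives β → 0:

```python
import numpy as np

def initialise_qm(points, k, rng):
    """Build a K x m x 2 quantum partition from K random centroids, eqs. (2)-(3)."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    dists = np.linalg.norm(points[None, :, :] - centroids[:, None, :], axis=2)  # K x m
    theta = np.arctan2(1.0, dists)          # arccot(d); equals pi/2 when d == 0
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)  # (alpha, beta) pairs
```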

3.5 Measurement of a binary population

The measurement is the operation that allows the observation of the quantum states in order to extract one solution among all those present in superposition, without destroying the other configurations as a measurement would in a real quantum system. The result of this operation is a binary partition. It is generated by exploring the quantum partition column by column and searching for the maximum value of $|\beta_{ij}|^2$ in each column j. Once this maximum is found for column j, the corresponding element xij of the binary partition is set to 1 and the remaining elements of column j are set to 0, as shown in the following example:

Figure 4   Example of observing a quantum partition QM (left) with four data points and two clusters, yielding the binary partition BM (right)

$$\begin{bmatrix} \binom{0.1432}{0.9897} & \binom{0.3711}{0.9286} & \binom{0.6887}{0.7250} & \binom{0.9087}{0.4175} \\[4pt] \binom{0.0608}{0.9982} & \binom{0.8489}{0.5286} & \binom{0.3310}{0.9436} & \binom{0.9087}{0.4175} \end{bmatrix} \;\rightarrow\; \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \end{bmatrix}$$

When multiple maximum values of $|\beta_{ij}|^2$ are found in a column j, one of them is chosen at random. This observation ensures that each data point is assigned to exactly one cluster, but it does not prevent the appearance of empty clusters; for this reason we have introduced the repair procedure.
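A sketch of this observation under the same conventions (a K × m × 2 array; names ours), including the random tie-break described above:

```python
import numpy as np

def measurement(qm, rng):
    """Derive a binary partition: per column, set a 1 at the largest |beta|^2."""
    beta_sq = qm[..., 1] ** 2                          # K x m matrix of probabilities
    k, m = beta_sq.shape
    bm = np.zeros((k, m), dtype=int)
    for j in range(m):
        best = np.flatnonzero(beta_sq[:, j] == beta_sq[:, j].max())
        bm[rng.choice(best), j] = 1                    # random choice among ties
    return bm
```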

3.6 Repair step

When the measurement of a quantum partition QM at generation t leads to a binary partition BM containing an empty cluster, BM is replaced by a new binary partition generated as follows:

1   replace the centroid of each empty cluster in the set {μi} by a randomly chosen data point

2   recalculate QM using (2) and (3) with d(pj, μi)

3   BM ← Measurement(QM).

These steps are repeated every time the measurement procedure generates a binary partition including empty clusters.
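A sketch of the repair loop, reusing the helpers sketched above (equations (2)-(3) and the measurement; all names are ours, not the authors'):

```python
import numpy as np

def repair(bm, qm, points, rng):
    """Re-seed empty clusters with random points, rebuild QM, re-measure until valid."""
    while (bm.sum(axis=1) == 0).any():
        centroids = np.array([
            points[bm[i].astype(bool)].mean(axis=0) if bm[i].any()
            else points[rng.integers(len(points))]          # random point for empty cluster
            for i in range(bm.shape[0])])
        dists = np.linalg.norm(points[None, :, :] - centroids[:, None, :], axis=2)
        theta = np.arctan2(1.0, dists)                      # eqs. (2)-(3) again
        qm = np.stack([np.cos(theta), np.sin(theta)], axis=-1)
        bm = measurement(qm, rng)
    return bm, qm
```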


3.7 Constructive interference step

In this step, the quantum partition is updated by applying a unitary quantum operator that performs a rotation of angle Δθij, designed as a function of αij, βij, the corresponding binary value bij in the best binary partition and xij in the current binary partition (see Figure 5). Each element qij of the quantum partition is updated by the following two steps:

1   determine Δθij from the lookup table (see Table 1)

2   calculate the new values α′ij, β′ij using:

$$\begin{bmatrix} \alpha'_{ij} \\ \beta'_{ij} \end{bmatrix} = \begin{bmatrix} \cos(\Delta\theta_{ij} \times s(\alpha_{ij}, \beta_{ij})) & -\sin(\Delta\theta_{ij} \times s(\alpha_{ij}, \beta_{ij})) \\ \sin(\Delta\theta_{ij} \times s(\alpha_{ij}, \beta_{ij})) & \cos(\Delta\theta_{ij} \times s(\alpha_{ij}, \beta_{ij})) \end{bmatrix} \begin{bmatrix} \alpha_{ij} \\ \beta_{ij} \end{bmatrix} \qquad (4)$$

Table 1   Lookup table of Δθij

xij   bij   M(BM) ≥ M(B)   Δθij      s(αij, βij) if αij × βij ≥ 0   s(αij, βij) if αij × βij < 0
0     0     True            0.0045   +1                             –1
0     1     True            0.025    +1                             –1
1     0     True           –0.025    +1                             –1
1     1     True           –0.0045   +1                             –1

Notes: s(αij, βij) is the sign of the product αij × βij; bij and xij are the bits of the best partition B and the binary partition BM; M represents the objective function.

The value of the rotation angle Δθij is chosen by intuitive reasoning. When the condition M(BM) ≥ M(B) is satisfied, we try to invert xij, but in a gradual way: when xij and bij hold the same value, the inversion is slower than when they differ. When the condition M(BM) ≥ M(B) is not satisfied, we make no change. For example, if xij and bij are 0 and 0, respectively, and the condition M(BM) ≥ M(B) is true:

1   if the qubit qij lies in the first or the third quadrant of Figure 5, the sign s(αij, βij) is positive and the product Δθij × s(αij, βij) takes a positive value, which increases the probability of state 1 and thus tends to invert xij to 1

2   if the qubit qij lies in the second or the fourth quadrant of Figure 5, the sign s(αij, βij) is negative and the product Δθij × s(αij, βij) takes a negative value, which increases the probability of state 1 and thus tends to invert xij to 1.

If xij and bij are 1 and 0, respectively, and the condition M(BM) ≥ M(B) is true:

1   if the qubit qij lies in the first or the third quadrant of Figure 5, the sign s(αij, βij) is positive and the product Δθij × s(αij, βij) takes a negative value, which increases the probability of state 0 and thus tends to invert xij to 0

2   if the qubit qij lies in the second or the fourth quadrant of Figure 5, the sign s(αij, βij) is negative and the product Δθij × s(αij, βij) takes a positive value, which increases the probability of state 0 and thus tends to invert xij to 0.

Figure 5   Quantum interference
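For illustration (our sketch, not the authors' code), one qubit update combining Table 1 and equation (4):

```python
import numpy as np

# Rotation angles from Table 1, indexed by (x_ij, b_ij); used when M(BM) >= M(B).
DELTA_THETA = {(0, 0): 0.0045, (0, 1): 0.025, (1, 0): -0.025, (1, 1): -0.0045}

def rotate_qubit(alpha, beta, x, b, current_is_worse):
    """Apply the rotation gate of equation (4) to a single qubit."""
    if not current_is_worse:                 # M(BM) < M(B): leave the qubit unchanged
        return alpha, beta
    s = 1.0 if alpha * beta >= 0 else -1.0   # s(alpha, beta): sign of the product
    ang = DELTA_THETA[(x, b)] * s
    return (np.cos(ang) * alpha - np.sin(ang) * beta,
            np.sin(ang) * alpha + np.cos(ang) * beta)
```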

3.8 Destructive interference step

The destructive interference contributes to the diversification of the search space. For this reason, we added this operator, which introduces disturbances into the binary partitions by swapping the two values αij and βij whenever the objective M of the current binary partition is lower than or equal to that of the best binary partition, that is, M(BM) ≤ M(B). The destructive interference is applied with probability Pinterferdes. Table 2 shows the cases where destructive interference is applied.

Table 2   Destructive interference

xij   bij   M(BM) ≤ M(B)   Permutation
0     0     True            swap αij and βij
0     1     True            swap αij and βij
1     0     True            swap αij and βij
1     1     True            swap αij and βij
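Since the swap applies to every (xij, bij) combination, the destructive update reduces to one array operation; a one-line sketch under our array conventions:

```python
def destructive_interference(qm, current_is_better):
    """Swap alpha and beta of every qubit when M(BM) <= M(B) (Table 2)."""
    return qm[..., ::-1].copy() if current_is_better else qm
```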

3.9 Regeneration step

The interference parameters, chosen to explore solutions far from the best one, introduce perturbations on the αij and βij of the quantum solution. Since these probabilities reflect the distances between data points and cluster centroids, the regeneration step is introduced to restore this relation. At each generation, the regeneration consists in recalculating one quantum partition QM chosen at random from the quantum population. Recalculating QM is done using (2) and (3) with d(pj, μi), where {μi} is the set of cluster centroids of the binary partition BM. The regeneration is applied with probability prob.


3.10 Global and local migration

Migration in QEAC is defined as the process of copying the global or the local best solution. Global migration is implemented by selecting the best solution Bglob among all solutions Bl and replacing them all by Bglob; it is performed periodically, according to period. Local migration is implemented by replacing the solutions Bl of each group by the best one in that group.

4 Experiments

In order to assess the performance of the proposed method, we used different types of evaluation as well as several datasets.

4.1 Evaluation

Evaluation is done at different levels. The first evaluation uses the external measure Fmeasure (Stein et al., 2003), a function often used in the clustering literature. It compares the quality of a clustering with respect to known correct classes for a given dataset. Let C = (C1, C2, ..., CK) be a given clustering and R = (R1, R2, ..., RK′) be the correct classes. The Fmeasure of a given cluster Ci with respect to a class Rj is then:

$$F(C_i, R_j) = \frac{2\,|C_i \cap R_j|}{|C_i| + |R_j|} \qquad (5)$$

Let m be the total number of data points in the dataset. The Fmeasure for the whole clustering C with respect to R is defined by equation (6); it takes values in the range [0, 1] and should be maximised for an optimal clustering.

$$F(C, R) = \sum_j \frac{|R_j|}{m} \max_i F(C_i, R_j) \qquad (6)$$

The second evaluation uses the internal objective function M [see equation (1)] in order to evaluate the quality of the optimisation. The third evaluation is inspired by Das et al. (2008) and Abraham et al. (2006): we count the number of objective function evaluations (FE) that the algorithm takes to yield the best value of the objective function. This number gives an idea of the speed and quality of convergence of the algorithm.
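For completeness, a sketch (ours) of equations (5) and (6) for two label vectors of length m:

```python
import numpy as np

def f_measure(pred, truth):
    """Equation (6): class-size-weighted best-match F over all clusters."""
    m = len(truth)
    score = 0.0
    for r in np.unique(truth):
        in_r = (truth == r)
        best = max(2.0 * np.sum((pred == c) & in_r) / (np.sum(pred == c) + in_r.sum())
                   for c in np.unique(pred))             # equation (5) per cluster
        score += in_r.sum() / m * best
    return score
```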

4.2 Datasets

We have applied the proposed algorithm to both synthetic and real-world datasets, summarised in Table 3 and Table 4, where m refers to the total number of data points in the dataset, ni is the number of data points belonging to cluster i, Dim gives the dimensionality and K is the number of clusters. The synthetic datasets were generated with the Gaussian cluster generator described in Handl and Knowles (2005). The real datasets come from the UCI machine learning repository (Blake and Merz, 1998); data points containing missing values are discarded and, for the dermatology dataset only, we have discarded one feature with missing values. The Ruspini dataset is taken from Ruspini (1970). Figure 6 gives an idea of the distribution and the cluster shapes of some datasets.

Table 3   Summary of the used synthetic datasets

Synthetic datasets   K    Dim   m      ni
2d4c                 4    2     1123   369, 471, 53, 230
2d10c                10   2     520    67, 15, 19, 53, 83, 64, 65, 68, 68, 18
10d10c               10   10    436    18, 83, 57, 26, 67, 50, 12, 72, 39, 12
10d20c               20   10    433    20, 30, 12, 9, 24, 31, 10, 23, 15, 41, 32, 9, 13, 26, 10, 36, 13, 40, 10, 29

Table 4   Summary of the used real datasets

Real datasets   K   Dim   m     ni
Iris            3   4     150   50, 50, 50
Dermatology     6   33    366   112, 61, 72, 49, 52, 20
Wisconsin       2   9     699   458, 241
Ruspini         4   2     75    20, 23, 17, 15

Figure 6   Plot of synthetic datasets (see online version for colours)


4.3 Results

We have compared QEAC with other clustering algorithms: the genetic algorithm for data clustering developed by Maulik and Bandyopadhyay (2000) and the QEA for gene expression data clustering proposed by Zhou et al. (2005). The genetic algorithm (Maulik and Bandyopadhyay, 2000) starts with a random population and evolves it by applying the genetic operators of mutation, crossover and selection; each individual is encoded as real numbers representing the different cluster centroids. Both QEAC and the genetic algorithm optimise the same function M. The QEA (Zhou et al., 2005) has been applied to gene expression datasets and minimises the intra-cluster variance. It starts by applying K-means to the dataset ten times to build a population of initial solutions, each solution being a set of cluster centroids. Its key idea is to combine the centroids of the initial solutions in order to generate new solutions; this is carried out while evolving the population of individuals through the operations of observation, selection, interference and migration. QEA has the same steps as the algorithm described in Han and Kim (2002). To make a reasonable comparison between the three algorithms, we replaced the objective function of QEA (Zhou et al., 2005) by the function M and applied it to the real and synthetic datasets. In this way, the three algorithms optimise the same function and were applied to the same datasets; they were also developed in the same language (Borland C++ 5) and executed on the same PC (Pentium 4, 2.3 GHz, 2 GB RAM). In the experiments, the default parameters of the genetic algorithm and of QEA, described respectively in Maulik and Bandyopadhyay (2000) and Zhou et al. (2005), are used. The parameters of QEAC were set as shown in Table 5. The total number of FE is roughly equal to the product of the population size and the number of generations; here, the three optimisation-based algorithms were allowed to run for 1,000 FE. The population sizes of QEAC, QEA and the genetic algorithm are 6, 20 and 10, respectively. The three algorithms have been compared according to the Fmeasure, function M and FE values.

Table 5   Parameter settings

Parameter   n   NbrGeneration   ng   nd   period   prob   Pinterferdes
Value       6   165             2    3    20       0.8    0.4

Table 6 shows the median and interquartile values of the Fmeasure, the function M and the number of FE obtained over 100 runs of each algorithm. The interquartile range indicates the range of values to which the algorithms converge. Figure 7 visualises the distributions of the Fmeasure, the function M and the number of FE obtained over the 100 runs by the three algorithms. A scrutiny of Table 6 and Figure 7 reveals that, for all datasets, QEAC is significantly better than QEA and the genetic algorithm in terms of the function M. QEAC succeeds in minimising the function M better than the two other algorithms, which in turn improves the Fmeasure of the partitions it finds, as is clear in the dataset boxplots; QEAC is thus also significantly better than QEA and the genetic algorithm in terms of Fmeasure. For dermatology, the Fmeasure median of QEA (0.9428) is better than that of QEAC (0.9396), but QEAC minimises the function M better than QEA: QEAC reaches 1092.3917 against 1092.6051 for QEA.

Table 6   Obtained results: median (interquartile range) over 100 runs

Fmeasure
Dataset        Genetic algorithm    QEAC                 QEA
2d4c           0.9730 (0.0000)      0.9748 (0.0027)      0.9739 (0.0000)
2d10c          0.8882 (0.0564)      0.9548 (0.0024)      0.9310 (0.0411)
10d10c         0.9346 (0.0293)      1.0000 (0.0000)      0.9414 (0.0688)
10d20c         0.9001 (0.0469)      0.9668 (0.0471)      0.9179 (0.0273)
Iris           0.8923 (0.0005)      0.8988 (0.0000)      0.8918 (0.0000)
Dermatology    0.8403 (0.1026)      0.9396 (0.1417)      0.9428 (0.0082)
Wisconsin      0.9662 (0.0014)      0.9663 (0.0015)      0.9618 (0.0000)
Ruspini        1.0000 (0.0000)      1.0000 (0.0000)      1.0000 (0.0000)

Function M
Dataset        Genetic algorithm      QEAC                   QEA
2d4c           1817.7937 (0.0000)     1816.2321 (0.2829)     1816.9536 (0.0000)
2d10c          548.1742 (23.2170)     523.8776 (0.3900)      530.0951 (6.7029)
10d10c         2619.1201 (74.9768)    2570.9338 (0.0000)     2605.4126 (36.1503)
10d20c         2892.8553 (133.2963)   2687.5314 (62.8131)    2847.7891 (74.6640)
Iris           97.2322 (0.0937)       97.2221 (0.0000)       97.3259 (0.0000)
Dermatology    1097.4252 (15.6586)    1092.3917 (2.5053)     1092.6051 (0.2295)
Wisconsin      2984.3712 (0.4736)     2984.2911 (0.3030)     2986.9613 (0.0000)
Ruspini        864.2239 (0.0000)      864.2239 (0.0000)      864.2239 (0.0000)

FE
Dataset        Genetic algorithm     QEAC                   QEA
2d4c           41.0000 (19.0000)     534.0000 (396.0000)    17.0000 (34.0000)
2d10c          71.0000 (44.0000)     814.0000 (210.0000)    416.5000 (504.5000)
10d10c         61.0000 (48.0000)     130.5000 (160.5000)    492.5000 (496.5000)
10d20c         50.5000 (21.0000)     290.5000 (309.5000)    437.0000 (525.5000)
Iris           41.5000 (49.0000)     101.0000 (85.0000)     4.0000 (6.0000)
Dermatology    51.0000 (32.2500)     483.0000 (413.0000)    421.5000 (413.0000)
Wisconsin      21.0000 (7.0000)      182.0000 (292.0000)    3.0000 (3.5000)
Ruspini        10.0000 (11.0000)     10.0000 (8.5000)       5.0000 (6.0000)

As shown by the majority of the dataset boxplots, the genetic algorithm's performance is poorer than that of the two quantum evolutionary algorithms, which reflects the benefit of merging quantum concepts with an evolutionary algorithm. For almost all datasets, the genetic algorithm yields its best solution after a small number of FE, representing less than 7.1% of the total number of FE; this means that the genetic algorithm is quickly trapped in local optima, and its premature convergence can be one reason for its lower quality. The FE results show that QEA is also very quickly trapped in local optima of the search space in the case of 2d4c, Iris and Wisconsin. The reason for its premature convergence is that QEA starts by applying K-means many times to generate initial solutions and does not succeed in escaping the local optima found by K-means. For 10d10c and 10d20c, QEAC converges to the best solution faster and more efficiently than QEA. For 2d10c, QEAC converges to the best solution efficiently but more slowly than QEA. For dermatology, QEA is faster than QEAC but their convergence qualities are nearly similar. In fact, QEAC achieves the best trade-off between speed and quality of convergence. From Table 6, we can generally see that the genetic algorithm is not accurate. For 2d4c, Iris and Wisconsin, the interquartile range of QEA for both the Fmeasure and the function M is zero, whereas the number of FE is very small; this shows that the apparent accuracy of QEA is due to its trapping in local optima. For 2d10c, 10d10c and 10d20c, QEAC is more accurate than QEA; in the case of dermatology, QEA is more accurate than QEAC. Overall, QEAC gives the best trade-off between accuracy and efficiency.

Figure 7   Boxplots giving the distributions of the Fmeasure, the function M and the number of FE (see online version for colours)

Note: The Fmeasure, the objective function M and the number of FE are those achieved over 100 runs of the three algorithms on the eight datasets. Source: Weisstein (1999).


5 Conclusions and future work

In this paper, we explored the applicability of the QEA to data clustering. The main features of the proposed approach are the quantum representation of the search space and the quantum-based dynamics used to evolve the search through the adopted representation. Experiments have shown the applicability and effectiveness of the quantum evolutionary clustering algorithm. The performance of the algorithm could be significantly improved by exploiting its intrinsic parallelism. Several directions for future work can be investigated. For example, other measures of cluster quality, based on aspects such as spatial separation and connectedness, can be optimised by the proposed algorithm without changing the representation of solutions or the search strategy based on quantum operators. We can extend our algorithm to multi-objective clustering by optimising several criteria, and we can also adapt it to automatic clustering by determining the optimal number of clusters.

Acknowledgements

The authors would like to thank Dr. Smaine Mazouzi for his help and general advice.

References

Abraham, A., Das, S. and Konar, A. (2006) ‘Document clustering using differential evolution’, IEEE Congress on Evolutionary Computation, World Congress on Computational Intelligence, IEEE Press, pp.6248–6255.

Azzag, H., Monmarché, N., Slimane, M., Venturini, G. and Guinot, C. (2003) ‘AntTree: a new model for clustering with artificial ants’, IEEE Congress on Evolutionary Computation, Vol. 1, pp.2642–2647.

Berkhin, P. (2002) ‘Survey of clustering data mining techniques’, Technical report, Accrue Software, San Jose, California.

Blake, C.L. and Merz, C.J. (1998) ‘UCI repository of machine learning databases’, available at http://www.ics.uci.edu/~mlearn/Machine-Learning.html.

Chen, W., Sun, J., Ding, Y., Fang, W. and Xu, W. (2008) ‘Clustering of gene expression data with quantum-behaved particle swarm optimization’, in Proceedings of Industrial, Engineering and Other Applications of Applied Intelligent Systems, 18–20 June, Vol. 5027, pp.388–396, LNCS, Wroclaw, Poland.

Das, S., Abraham, A. and Konar, A. (2008) ‘Automatic clustering using an improved differential evolution algorithm’, IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, Vol. 38, No. 1, pp.218–237.

De Castro, L.N. and Von Zuben, F. (2001) ‘aiNET: an artificial immune network for data analysis’, in Abbas, H., Sarker, R. and Newton, C. (Eds.): Data Mining: A Heuristic Approach, Idea Group Publishing.

Grover, L.K. (1996) ‘A fast quantum mechanical algorithm for database search’, Proceedings of the 28th Annual Symposium on the Theory of Computing, pp.212–219.

Han, K.H. and Kim, J.H. (2002) ‘Quantum-inspired evolutionary algorithm for a class of combinatorial optimization’, IEEE Transactions on Evolutionary Computation, Vol. 6, No. 6, pp.580–593.

Han, K.H. and Kim, J.H. (2004) ‘Quantum-inspired evolutionary algorithms with a new termination criterion, Hε gate, and two phase scheme’, IEEE Transactions on Evolutionary Computation, Vol. 8, No. 2, pp.156–169.

Handl, J. and Knowles, J. (2005) ‘Improvements to the scalability of multiobjective clustering’, IEEE Congress on Evolutionary Computation, Vol. 3, pp.2372–2379.

Handl, J. and Meyer, B. (2007) ‘Ant-based and swarm-based clustering’, Journal of Swarm Intelligence, Vol. 1, No. 2, pp.95–113.

Jain, A.K., Murty, M.N. and Flynn, P.J. (1999) ‘Data clustering: a review’, ACM Computing Surveys, September, Vol. 31, No. 3, pp.264–323.

Krishna, K. and Murty, M.N. (1999) ‘Genetic K-means algorithm’, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 29, No. 3, pp.433–439.


Lu, Y., Lu, S., Fotouhi, F., Deng, Y. and Brown, S.J. (2004) ‘FGKA: a fast genetic K-means clustering algorithm’, in Proceedings of the ACM Symposium on Applied Computing, 14–17 March, pp.162–163, Nicosia, Cyprus.

Maulik, U. and Bandyopadhyay, S. (2000) ‘Genetic algorithm-based clustering technique’, Pattern Recognition, Vol. 33, No. 9, pp.1455–1465.

Ruspini, E.H. (1970) ‘Numerical methods for fuzzy clustering’, Information Sciences, Vol. 2, No. 3, pp.319–350.

Shor, P.W. (1994) ‘Algorithms for quantum computation: discrete logarithms and factoring’, Proceedings of the 35th Annual Symposium on Foundations of Computer Science, 20–22 November, pp.124–134, Santa Fe, New Mexico.

Stein, B., Eissen, S.M.Z. and Wißbrock, F. (2003) ‘On cluster validity and the information need of users’, 3rd IASTED Int. Conference on Artificial Intelligence and Applications, pp.216–221.

Sun, J., Xu, W. and Ye, B. (2006) ‘Quantum-behaved particle swarm optimization clustering algorithm’, in Proceedings of Advanced Data Mining and Applications, 14–16 August, Vol. 4093, pp.340–347, LNCS, Xi'an, China.

Sun, J., Xu, W.B. and Feng, B. (2004) ‘A global search strategy of quantum-behaved particle swarm optimization’, IEEE Conference on Cybernetics and Intelligent Systems, Vol. 1, pp.111–116.

Van der Merwe, D.W. and Engelbrecht, A. (2003) ‘Data clustering using particle swarm optimization’, IEEE Congress on Evolutionary Computation, Vol. 1, pp.215–220.

Weisstein, E.W. (1999) ‘Box-and-whisker plot’, From MathWorld – A Wolfram Web Resource, available at http://mathworld.wolfram.com/Box-and-WhiskerPlot.html.

Xiao, J., Yan, Y.P., Lin, Y., Yuan, L. and Zhang, J. (2008) ‘A quantum-inspired genetic algorithm for data clustering’, IEEE Congress on Evolutionary Computation, pp.1513–1519.

Xu, R. and Wunsch, D. (2005) ‘Survey of clustering algorithms’, IEEE Transactions on Neural Networks, Vol. 16, No. 3, pp.645–678.

Zhou, W., Zhou, C., Huang, Y. and Wang, Y. (2005) ‘Analysis of gene expression data: application of quantum-inspired evolutionary algorithm to minimum sum-of-squares clustering’, in Proceedings of Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, 31 August–3 September, Vol. 3642, pp.383–391, LNCS, Regina, Canada.