A similarity-based approach for data stream classification

Dayrelis Mena-Torres (a), Jesús S. Aguilar-Ruiz (b)

(a) University of Pinar del Rio "Hermanos Saiz Montes de Oca", Road Marti, No. 272, Pinar del Rio, Cuba
(b) University "Pablo de Olavide", Road Utrera, km 1, 41013, Sevilla, Spain
Abstract

Incremental learning techniques have been used extensively to address the data stream classification problem. The most important issue is to maintain a balance between accuracy and efficiency, i.e., the algorithm should provide good classification performance with a reasonable time response. This work introduces a new technique, named Similarity-based Data Stream Classifier (SimC), which achieves good performance by introducing a novel insertion/removal policy that adapts quickly to the data tendency and maintains a representative, small set of examples and estimators that guarantees good classification rates. The methodology is also able to detect novel classes/labels during the running phase, and to remove useless ones that do not add any value to the classification process. Statistical tests were used to evaluate the model performance from two points of view: efficacy (classification rate) and efficiency (online response time). Five well-known techniques and eighteen data streams were compared, using the Friedman's test. Also, to find out which schemes were significantly different, the Nemenyi's, Holm's and Shaffer's tests were considered. The results show that SimC is very competitive in terms of (absolute and streaming) accuracy, and classification/updating time, in comparison to several of the most popular methods in the literature.
Keywords: data streams, classification, similarity.

1 Introduction

Data stream mining refers to the extraction of informational structures, such as models and patterns, from continuous data streams. Traditional methods of data analysis require the data to be first stored and then processed off-line using complex algorithms that make several passes over the data. However, data streams are, in principle, infinite, and data are generated at high rates, so they cannot be stored in main memory. Different challenges arise in this context, such as storage, querying and mining. The last aspect is mainly related to the computational resources needed to analyze such volumes of data, and it has been widely studied in the literature, which introduces several approaches aiming to provide accurate and efficient algorithms. One recommendation is that data streams should be processed in an online manner, so as to guarantee that results are up-to-date and that queries can be answered in real time with negligible delay. Incremental learning techniques are efficient approaches to address these issues. A learning task is incremental if the training set is not available at first, but generated over time, so the examples are processed sequentially by the system, which has to learn in successive episodes. Generally, there are two main approaches for classifying data streams: a) the single classifier-based approach; and b) the ensemble-based approach. In the single classifier-based approach, the main issue is to build a model from a small portion of the data stream and incrementally update the model using newly arrived examples. The main techniques used are: Artificial Neural Networks (LEARN++ [52], Fuzzy-UCS [5]); Rule Learning (FACIL [22], OGA [47], AC-DS [37]); Decision Trees (VFDT [46], VFDTc [29], FlexDT [54], eFPT [9]); and Instance-Based Learning (TWF and LWF [41], Sliding Windows [51], IBL-DS [28], IBLStreams [8]).
In the ensemble-based approach, a number of base classifiers are built from different portions of the data stream, and then all base models are combined to form an ensemble of classifiers. Some of the proposed models are AE [48], EM [19] and AUE2 [16]. As opposed to model-based machine learning methods, which induce a general model (theory) from the data and then use that model for further reasoning, IBL (Instance-Based Learning) algorithms simply store a representative subset of the data itself. They work primarily through the maintenance of typical examples of each class and have three key features [15]: the similarity function, the selection of the typical instances, and the classification function. These models have been extensively used since they are conceptually simple and provide good results. However, correctly combining these three aspects is a critical issue. The main contributions of this work include the introduction of a new approach for the classification of data streams based on similarity, called Similarity-based Data Stream Classifier (SimC), where a representative and small set of information is kept in order to conserve the distribution of classes over the data stream. It guarantees the control of noise and outliers through an effective method for managing representative data and a novel insertion/removal policy, using appropriate estimators designed to improve the classification performance and the updating phase of the model. The remainder of this paper is organized as follows. Section 2 introduces the background knowledge of data stream mining and Instance-Based Learning techniques. Section 3 describes the SimC algorithm. SimC has been tested on a suite of data streams, whose results are illustrated in Section 4. Finally, Section 5 summarizes the most interesting conclusions and future work.

2 Background

2.1 Instance-Based Learning

Instance-Based Learning (IBL) algorithms are included within the so-called lazy learning techniques, where the process of induction is deferred until a query must be answered. IBL algorithms are inherently incremental, since adaptation basically comes down to adding or removing observed cases. Thus, incremental learning and model adaptation are simple in the case of IBL. As opposed to this, incremental learning is much more difficult to carry out for most model-based approaches. Even though incremental versions do exist for a number of well-known learning methods, such as decision tree induction [61] and the learning of Takagi-Sugeno (TS) fuzzy models [21], the incremental update of a model is often quite complex and in many cases requires storing a considerable amount of additional information. Despite their conceptual simplicity and ease of implementation, techniques based on the nearest neighbor have two severe drawbacks: high computational complexity and strong parametric dependence. Consequently, IBL might be preferable in a data stream application if the number of incoming data is large compared with the number of queries to be answered, i.e., if model updating is the dominant factor [8]. The high computational complexity results from having to process the entire training set for each new prediction, in order to find the k examples closest to the new test sample. To accelerate this process, numerous techniques have been proposed, responding to three strategies: a) approximate searches; b) efficient data structures; and c) reducing the number of training examples.
Approximate-search techniques do not compute the distance from the query to every stored example; instead, they rely on heuristics that ignore several of them, computing an approximate neighbor that is close to the true nearest neighbor. The second group includes proposals based on KD-Trees, which are able to reduce the complexity from O(n^2) to O(n log n). Strong parametric dependence means that the result depends largely on the value of k and on the similarity metric used, making it a technique that requires user knowledge of the problem. To reduce this dependence, cross-validation is usually applied to find the value of k that provides the greatest accuracy. The success of these techniques, and their application in the context of data streams, depends largely on the correct combination of three key features:

- Classification function: this function decides, when a new example is given, how it relates to the learned cases.
- Similarity function: this informs the algorithm how close two instances are. There is great complexity in the choice of the similarity function, especially in situations where some features are symbolic.
- Selection function: this informs the algorithm which of the instances should be retained as representative examples.

Among the best-known classification functions is the nearest neighbor rule (1-NN), stated as follows: given a training set T and a test example e in an m-dimensional metric space, the nearest neighbor of e is the example e' in T at the shortest distance from e, and e is classified according to the label associated with e'. The nearest neighbor scheme is based on a simple assumption: given a similarity measure d, for any two examples e_p = (x_p, y_p) and e_q = (x_q, y_q) of T, P(y_p | x_p) ≈ P(y_q | x_q) whenever d(x_p, x_q) ≈ 0. There are many ways to measure similarity, and there is no single proper implementation of this concept for all application domains. In particular, the k nearest neighbors algorithm is very sensitive to the definition of proximity, and although it has some disadvantages, there are many variations that ensure its efficiency. An important step in the application of IBL techniques is the appropriate selection of the distance measure to be used. Some proposals are based on the estimation of probabilities [56, 10, 18]. The best-known distance of this type was proposed in [10] and is called the Value Difference Metric (VDM). The comparison criterion of this function, defining the distance between two values x, y of an attribute, is specified in Eq. 1:

vdm_q(x, y) = \sum_{l=1}^{c} | Pr(C_l | x) - Pr(C_l | y) |^q    (1)

where q is a constant (typically 1 or 2) and C_l is a possible value of the class attribute. The term Pr(C_l | x) refers to the conditional probability that the output class is C_l given that the attribute has value x, which can be estimated from a set of known instances according to Eq. 2:

Pr(C_l | x) = N_{x,C_l} / N_x    (2)

where N_{x,C_l} is the number of instances in which the value x and the class C_l appear simultaneously, and N_x is the number of instances in which x appears (a relative frequency). An experimental study [23] using the k-NN rule and the non-weighted variant (all attributes influence the global similarity equally) shows classification rates for VDM at least as good as those of C4.5 [32] (a technique based on decision trees). VDM provides a normalized value in the interval [0, 1], and while most distance functions on discrete values require binary values or simply take into account the number of unmatched cases, the VDM result depends on the individual value of the attribute, and missing values are treated as just another category. In [18] it is noticed that VDM does not know what to do with values that were not present in the training set, and several extensions to numeric attributes are proposed. The simplest way to achieve this is to use discretization [20]. In this case the domain of a numeric attribute is partitioned into intervals, transforming the attribute into an ordinal one by associating each value with one of the defined intervals.
This distance is named Discretized VDM (DVDM), but its fundamental disadvantage is that discretization loses important information contained in the continuous values, which can cause extreme values of the same interval to be considered equal, with a negative impact on the accuracy of the generalization. Another extension is the Heterogeneous VDM (HVDM) distance, which combines a normalized Euclidean distance for numerical attributes with VDM for nominal features. This variant effectively manages data sets with both types of attributes, and is defined by Eq. 3:

HVDM(x, y) = \sqrt{ \sum_{j=1}^{m} d_j(x_j, y_j)^2 }    (3)

where x and y are two examples with m attributes, and the per-attribute distance d_j is given by Eq. 4:

d_j(x_j, y_j) = 1 if x_j or y_j is unknown; = normalized_vdm_j(x_j, y_j) if attribute A_j is nominal; = |x_j - y_j| / (4 σ_j) if A_j is numeric    (4)

where normalized_vdm_j is built from the class-conditional probabilities of Eq. 2, and σ_j is the standard deviation of the values of attribute A_j computed over the training set examples.
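For illustration, the following is a minimal, self-contained sketch of HVDM under the definitions above. The class and method names are ours (the paper's implementation, per Section 4, was written in Java), and the numeric standard deviation is recomputed from the stored values for simplicity; the statistics are deliberately kept as plain counts so that examples can be added and removed incrementally, a property discussed next.

```python
import math
from collections import defaultdict

class HVDM:
    """Sketch of the HVDM distance [18]: normalized VDM for nominal
    attributes and a 4-sigma normalized difference for numeric ones."""

    def __init__(self, nominal, classes):
        self.nominal = set(nominal)   # indices of nominal attributes
        self.classes = list(classes)  # possible class labels
        self.n_x = defaultdict(lambda: defaultdict(int))   # N_x per attribute
        self.n_xc = defaultdict(lambda: defaultdict(int))  # N_{x,C} per attribute
        self.values = defaultdict(list)                    # numeric values kept per attribute

    def add(self, x, label):
        for j, v in enumerate(x):
            if j in self.nominal:
                self.n_x[j][v] += 1
                self.n_xc[j][(v, label)] += 1
            else:
                self.values[j].append(v)

    def remove(self, x, label):
        # the 'updateable' property: statistics shrink when a case is dropped
        for j, v in enumerate(x):
            if j in self.nominal:
                self.n_x[j][v] -= 1
                self.n_xc[j][(v, label)] -= 1
            else:
                self.values[j].remove(v)

    def _nvdm(self, j, a, b):
        # normalized_vdm_j: Euclidean norm of class-probability differences
        s = 0.0
        for c in self.classes:
            pa = self.n_xc[j][(a, c)] / self.n_x[j][a] if self.n_x[j][a] else 0.0
            pb = self.n_xc[j][(b, c)] / self.n_x[j][b] if self.n_x[j][b] else 0.0
            s += (pa - pb) ** 2
        return math.sqrt(s)

    def distance(self, x, y):
        total = 0.0
        for j in range(len(x)):
            if j in self.nominal:
                dj = self._nvdm(j, x[j], y[j])
            else:
                vals = self.values[j]
                mean = sum(vals) / len(vals)
                sigma = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
                dj = abs(x[j] - y[j]) / (4 * sigma) if sigma else 0.0  # Eq. 4
            total += dj ** 2
        return math.sqrt(total)  # Eq. 3
```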
This measure requires special care with normalization, so that measurements on different scales are brought into the same range. To obtain a classifier for combined continuous and discrete attributes, preliminary experiments were performed with heterogeneous distance functions, including, among others, the HVDM and DVDM measures. As a result, HVDM was selected as the best-performing one. A key feature of measures to be used in data stream applications is that they must be updateable: it should be possible to add new cases and to eliminate previously used ones, modifying the measure's parameters accordingly. The implementation of the HVDM distance in SimC presents this feature, which allows it to take into account, at each moment, only the cases currently selected by the algorithm, rather than all those ever considered.

2.2 Data Stream Mining

In [1], the most significant requirements for learning models in the data stream realm are presented: (1) process an example at a time, and inspect it only once (at most); (2) be ready to predict at any point; (3) data may be evolving over time; and (4) expect an infinite stream, but process it under finite resources (time and memory). Numerous algorithms and models that satisfy these properties have been proposed for data stream mining. At present, data stream mining systems address the problems of classification [44], clustering [7], and frequent pattern recognition [20][42][25]. Special emphasis has been put on the classification of distributed [35][13], multi-label [33] and uncertain or unlabeled data streams [49][59][45][43]. The proposed models for data stream classification, in any of its variants, can be categorized into simple models and ensembles, as explained before (see Section 1). However, all classification tasks have a simple classifier at their base. The simple classification models proposed so far build on classic data mining techniques: decision tree and rule-based learning have been among the most frequent techniques in this area. The most representative algorithm in this line is the Very Fast Decision Tree (VFDT) [46]. VFDT is a decision-tree learning algorithm that dynamically adjusts its bias according to the availability of data. In decision tree induction, the main issue is deciding when to expand the tree, installing a splitting-test and generating new leaves. The basic idea consists of using a small set of examples to select the splitting-test to be incorporated in a decision tree node. If, after seeing a set of examples, the difference in merit between the two best splitting-tests does not satisfy a statistical test (the Hoeffding bound), VFDT proceeds by examining more
examples. It only makes a decision (i.e., adds a splitting-test in that node) when there is enough statistical evidence in favor of a particular test. Statistical support decreases as the tree grows, and regularization is mandatory. In VFDT-like algorithms, the number of examples needed to grow a node is defined only by the statistical significance of the difference between the two best alternatives; deeper nodes of the tree might require more examples than those used in the root. Later, some improvements of this model were published [34][14][49]. While these approaches exhibit interesting theoretical properties, the time required for updating the decision tree can be significant, and a large number of samples is needed to build a classifier with reasonable accuracy. When the size of the training set is small, the performance of this approach might be unsatisfactory. Moreover, its structure is very unstable: a slight change in the data distribution will trigger substantial changes in the tree. In rule-based learning, one of the first proposed models was FACIL (Fast and Adaptive Classifier by Incremental Learning) [22], an algorithm especially directed at the classification of data streams with numeric attributes. Later, other algorithms were published, such as AC-DS [37], an associative classification algorithm for data streams based on the estimation mechanism of Lossy Counting (LC) [26] and the landmark window model. In these rule-based stream classification methods, a temporary window of samples is needed and the updating process of the rule set is very complex. Neural network-based models have been scarcely used, due to their limitations in the data stream context: in a neural network, for example, a new observation causes an update of the network weights, and this influence on the network cannot simply be canceled later on; at best, it can be reduced gradually over the course of time. Instance-based learning techniques are more suitable for the data stream environment, as these algorithms are simple and inherently incremental. They yield a flexible and easily updated model, without complex computations, which keeps the update and classification times low. Within this group, some adaptations of the standard k-NN algorithm and other incremental learning models [15] were first proposed to adjust them to the data stream problem. For instance, Locally Weighted Forgetting (LWF) [41], one of the best adaptive learning algorithms, appears under this perspective. It is a technique that reduces the weight of the k nearest neighbors e1, ..., ek (in increasing order of distance) of a new instance e0 by the factor
β + (1 − β)·(d_i² / d_k²), provided that d_i ≤ d_k, where d_i² is the squared Euclidean distance from e0 to its ith nearest neighbor and β is a predefined parameter. An example is completely eliminated if its weight falls below a given threshold. Since the radius d_k is that of the sphere containing the k nearest neighbors of e0, this adapts the decay radius of influence to the local density of examples around the new one. As an alternative to LWF, TWF (Time Weighted Forgetting) [41] was considered, a windowing technique in which a sliding window keeps only the most recent experiences presented to the learner in the active learning set. It determines the weight of the cases according to their age, using a function that assigns a weight w_e, initialized to w_e = 1, to each observation e. In the Sliding Windows algorithm [51], the authors propose to adapt the window size in such a way as to minimize the estimated generalization error of the learner trained on that window. To this end, they divide the data into batches and successively test windows consisting of batches t − k, t − k + 1, ..., t (for k = 1, 2, ...). On each of these windows they train a classifier (in this case a support vector machine) and estimate its predictive accuracy (by approximating the leave-one-out error). The window/model
combination that yields the highest accuracy is finally selected. In [4], this approach is further generalized by allowing the selection of arbitrary subsets of batches instead of only uninterrupted sequences. Despite the appealing idea of this approach to window (training set) adjustment, the successive testing of different window lengths is computationally expensive and, therefore, not immediately applicable in a data stream scenario with tight time constraints. One of the most extensive works on this technique is presented in [28], which proposes the IBL-DS algorithm for data stream classification. IBL-DS optimizes the composition and size of the case base autonomously. On arrival of a new example (x0, λ0), this example is first added to the case base and then the replacement policy is applied, based on the following criteria:

- Temporal relevance: recent observations are considered more useful than older ones.
- Spatial relevance: a new example in a region of the space occupied by other examples is less relevant than an example located in an empty region.
- Consistency: an example should be removed if it is inconsistent with the current concept.

Moreover, it is checked whether other examples might be removed (redundant examples and outliers). To this end, a set C of examples within a neighborhood of e0 is considered as candidates. The most recent examples are excluded from removal due to the difficulty of distinguishing potentially noisy data from the beginning of a concept change. Even though unexpected observations will be made in both cases, noise and concept change, these observations should be removed only in the former but not in the latter case. SVDM [6], which is a version of the VDM distance measure, is used for determining the neighbors. IBL-DS is relatively robust and produces good results when using the default parameters. IBLStreams [8] is an extension of IBL-DS that also covers regression problems. Recently published, the AES algorithm [38] is a hormone-based nearest neighbor method for data stream classification. The Artificial Endocrine System (AES) is a model that simulates the way information is processed in biological endocrine systems [58]. It allows cells in a large system to communicate and interact with each other, forming an entire system. The AES classifier consists of endocrine cells on the boundaries of different classes, and the number of boundary samples, Ktotal, is decided a priori. Interior samples (points) of a class are seen as unimportant points which should be discarded in this process. With the evolution of the data stream, the endocrine cells keep changing their positions to track the changing boundaries between different classes. Every time a new record arrives, the cell that resides in the most unfit location moves to the newly arrived record. In this way, the changing boundaries between different classes are recorded in the locations where the endocrine cells reside. The model can be updated every time a new record arrives, and a big buffer of samples for training a real-time classifier is no longer needed. The Euclidean distance is used to calculate the distances between the newly arrived record and all of the endocrine cells, with the 1-NN rule.
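As an aside, a compact sketch of the two forgetting baselines described above (LWF and TWF) may help fix the ideas. The case representation is ours, `dist` stands for any distance function, and the 0.8, 0.04 and 0.996 values are merely illustrative, borrowed from the experimental settings of Section 4:

```python
def lwf_forget(case_base, e0, k, beta, threshold, dist):
    """Sketch of Locally Weighted Forgetting [41]: decay the weights of
    the k nearest neighbors of a new example e0 and drop faded cases.
    Each case is a dict {'x': attributes, 'w': weight}; beta in [0, 1]."""
    neighbors = sorted(case_base, key=lambda c: dist(c['x'], e0))[:k]
    if not neighbors:
        return
    dk2 = dist(neighbors[-1]['x'], e0) ** 2  # squared radius of the k-sphere
    for c in neighbors:
        di2 = dist(c['x'], e0) ** 2
        # closest neighbors decay the most (factor -> beta), the k-th barely
        c['w'] *= beta + (1 - beta) * (di2 / dk2 if dk2 else 1.0)
    case_base[:] = [c for c in case_base if c['w'] >= threshold]

def twf_forget(case_base, decay=0.996, threshold=0.04):
    """Time Weighted Forgetting: purely age-based decay of every case."""
    for c in case_base:
        c['w'] *= decay
    case_base[:] = [c for c in case_base if c['w'] >= threshold]
```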
The approach introduced in this paper embraces some characteristics of the previously analyzed algorithms and solves some of the presented problems by using novel policies of instance insertion/removal that are able to keep the case base updated, providing good performance in terms of accuracy and time.

3 Method

This section describes the main characteristics of the proposed algorithm. Our essential goal is to efficiently build a classification model from data streams. The similarity-based approach for data stream classification, named SimC, exploits the advantages of Instance-Based Learning techniques for classifying upcoming examples. The main objective of our learning system is to maintain an implicit concept description in the form of a case base, introducing a novel policy of instance removal and insertion in order to guarantee the good performance of the classifier. The general process of the model is depicted in Fig. 1, where the flow of instances is analyzed, one by one, by SimC, through three fundamental procedures that handle the tasks related to the classification and updating processes, maintaining a buffer of relevant information from the selected instances. The three main procedures are:

- Build the classifier: it builds an initial model with the first 100 instances, which are stored in groups, one for each class. Then, this model is used to classify new instances or it is updated, depending on the task.
- Update the classifier: once the initial model is built, this method is responsible for keeping it correct, applying an insertion/removal policy designed to retain the best representation of the knowledge distributed over both the search space and time.
- Classify new instance: it returns the class assigned to the new example; basically, this classification function is a standard k-NN (with k = 1 by default; this parameter can also be defined by the user) using the heterogeneous distance measure HVDM [18], which is updated with each insertion and deletion in the case base.
Fig. 1: General scheme of the SimC approach.
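To make the interaction of these three procedures concrete, the following is a minimal sketch of the processing loop; `simc.classify` and `simc.update` are hypothetical method names standing for the procedures above, and the classify-then-update scoring is our illustration of the streaming evaluation later used in Section 4, not part of the published pseudocode:

```python
def run_simc(stream, simc, warmup=100):
    """Process a stream one example at a time: the first `warmup`
    examples build the initial model; afterwards, each example is first
    classified (scoring the model) and then used to update it."""
    correct, seen = 0, 0
    for x, label in stream:
        if seen >= warmup:
            correct += int(simc.classify(x) == label)  # 1-NN over the case base
        simc.update(x, label)                          # insertion/removal policy
        seen += 1
    return correct / max(seen - warmup, 1)             # streaming accuracy
```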
3.1 Estimators

For the appropriate selection of instances, SimC uses some estimators that provide general information about each group and each instance stored in the case base. For a group of instances of each class:

- The mean instance of each group, obtained by calculating the mean or the mode of each attribute (depending on its type) over all instances in the group. It is used to calculate distances between groups.
- The age of the group, calculated as the number of instances that have belonged to the group. It is used to determine which groups should be eliminated first.

For an instance in the group:

- The usefulness, defined as the number of instances correctly classified by this instance. It is used to know which instances have been most useful for classification, and to determine which should not be removed from the group.
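For illustration only, the following sketch shows one possible representation of these estimators; the class and field names are ours, not taken from the original implementation:

```python
from dataclasses import dataclass, field

@dataclass
class StoredInstance:
    x: tuple             # attribute values
    usefulness: int = 0  # examples this instance has classified correctly

@dataclass
class Group:
    label: object                                  # class the group belongs to
    instances: list = field(default_factory=list)  # StoredInstance objects
    age: int = 0                                   # instances ever inserted

    def insert(self, inst):
        self.instances.append(inst)
        self.age += 1  # the age estimator grows with every insertion

    def mean_instance(self, nominal):
        """Mean (numeric) or mode (nominal) of each attribute."""
        cols = list(zip(*(i.x for i in self.instances)))
        return tuple(
            max(set(col), key=col.count) if j in nominal else sum(col) / len(col)
            for j, col in enumerate(cols)
        )
```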
In our previous work [17], two different estimators were used (the nearest and the farthest neighbors of each new case), which bore a spatial relation to the new case but did not correctly control the evolution of classes throughout several regions of the search space. The introduction of groups associated with each class (even new, unseen classes) helps retain the behavior of data over time, allowing the insertion of new relevant groups and the deletion of old and useless ones, improving the performance of the approach in terms of accuracy.

3.2 Description

The structure and management of the case base are critical for achieving good performance in accuracy. The insertion and removal policies over this structure determine how well the model adapts to concept changes. Inherently, the case base represents the knowledge extracted from the data stream. Therefore, this structure and the common cases found during the process are described next. Let B be a model composed of n classes, B = {C_1, ..., C_n}, where each class C is, in turn, formed by a set of groups G. For simplicity, we will assume that the groups, by definition, belong to a specific class. Let C be a class defined by a set of groups, C = {G_1, ..., G_m}, so the class is represented by a certain number of groups. Each group G_j represents a region of the search space, and it is composed of a number of estimators and a set of examples, i.e., G_j = (μ_j, α_j, E_j), where μ_j is the mean instance of the group, α_j is the age of the group, and E_j is the set of instances that belong to the group. Initially, B is empty, and the first 100 examples fill it (the choice of 100 is due to experimentation, although depending on the complexity of the information it might be decreased or increased slightly without much effect on the computational cost; in fact, this phase is also incremental, so there is no real need to build a small classifier at the beginning). In order to show how the method works, four cases have been identified, which illustrate all the situations that imply modifications of the case base, and describe the changes in the structure of the model due to the insertion/removal policy. Let B be the initial model, formed by two classes, CA and CB, where each class is composed of only one group, G1 and G2, respectively. Let us assume that only eight examples have been analyzed so far, four of each class, all belonging to the single group of their class. This starting scenario is depicted in Fig. 2, where the estimators (mean instance μ and age α) and the usefulness ranking have also been included. Next, each possible situation that implies a modification of the content or structure of the model is described in detail.
Fig. 2: Initial Model B: structure of the case base.
Let e0 = (x0, λ0) be the new example to be analyzed, where x0 represents the values of the set of attributes and λ0 its associated class; let e_nn be the nearest neighbor of e0 within B; let G_nn be the nearest group to e0; and let λ() be a polymorphic function that returns the class associated with its argument (example or group, respectively), i.e., λ(e0) = λ0.

Case A

If the class λ0 of a new example is not already in the model, i.e., λ0 ∉ B, then this new class is included in the model, and a new group containing the new example is created in B (its estimators are also calculated: μ = x0 and α = 1). This situation will transform B from Fig. 2 to Fig. 3. In case the model is completely full, which might happen if a new unexpected concept suddenly appears, the least useful example of the oldest class is removed.
Fig. 3: Updating the model: Insertion/Removal Policy. Case A.
Case B

If the class λ0 of the new example is already in the model, then the group in which e0 must be included is searched for. The distance from e0 to each group is calculated, and the one at minimum distance is selected, i.e., the nearest group G_nn. If λ(G_nn) = λ0, then the new example is included in the group G_nn. If B has reached the maximum size, then the least useful example of the group in which e0 is inserted is removed (according to the usefulness ranking; see Fig. 4, where an example of G_nn is removed).
Fig. 4: Updating the model: Insertion/Removal Policy. Case B.
Case C

This case occurs when the nearest group G_nn has an associated class different from λ0 (and λ0 is already in the model). Therefore, it is necessary to look at the class of the nearest neighbor e_nn of e0, to check how far e0 is from examples of its own class. If e_nn shares the same class as e0, then e0 is included in the group of its nearest neighbor. If B has reached its maximum size, the least useful example from the oldest group of the same class as e0 is removed. Fig. 5 shows that e0 is included in the group of e_nn, and the least useful example of the oldest group of that class is removed.
Fig. 5: Updating the model: Insertion/Removal Policy. Case C.
Case D

This situation arises when the nearest group G_nn has a class different from λ0 (which belongs to the model), and the class of the nearest neighbor e_nn of e0 is also different from λ0, i.e., λ(G_nn) ≠ λ0 and λ(e_nn) ≠ λ0. Thus, as neither its nearest group nor its nearest neighbor has the same class as e0, a new group (with only one example) is created. The least useful example of the oldest group with the same class as e0 is removed. Noise or concept drift might be the cause of this case, so the model adapts itself by creating new groups (which, if not used for a long time, will be automatically discarded by the removal policy). When concept drift arises, this new group will grow in size as other, older groups shrink, until they disappear. Fig. 6 illustrates the creation of the new group belonging to class λ0.
Fig. 6: Updating the model: Insertion/Removal Policy. Case D.
3.3 Algorithm

The case base B contains a limited number of examples (a parameter defined by the user), so each insertion leads to the elimination of another example. When all the instances in a group are removed, the group is also deleted. When a new group is created, the case base is checked in order to eliminate those older groups that contain a single instance, which helps control noise and outliers. In order to simplify the algorithm, two functions have been defined: lu(), which returns the least useful example of the set used as argument; and old(), which returns the oldest group of the class used as argument. The least useful example is selected by examining the usefulness ranking of examples. Algorithm 1 shows the insertion/removal policies described before.

Algorithm 1 SimC: Update
INPUT e0 = (x0, λ0): new example; d: distance function. OUTPUT B: model.
1:  if λ0 ∉ B then
2:    create a new group containing e0 (Case A)
3:    add the new group, and its class, to B
4:    if B is full, remove lu(old(c)), with c the oldest class
5:  else
6:    if λ(G_nn) = λ0 then
7:      insert e0 into G_nn (Case B)
8:      if B is full, remove lu(G_nn)
9:    else
10:     if λ(e_nn) = λ0 then
11:       insert e0 into the group of e_nn (Case C)
12:       if B is full, remove lu(old(λ0))
13:     else
14:       create a new group containing e0 (Case D)
15:       remove lu(old(λ0))
16:     end if
17:   end if
18: end if
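Under the structures just described, a compact Python rendering of the four cases might look as follows. It is a sketch under our own representation (dictionaries for groups); the mean-instance update and the single-instance-group cleanup are omitted for brevity:

```python
def simc_update(B, e0, label, d, max_size):
    """Sketch of SimC's insertion/removal policy (Cases A-D).
    B maps each class label to a list of groups; each group is a dict
    {'label', 'instances', 'age', 'mean'}; d is the distance function."""

    def size():
        return sum(len(g['instances']) for gs in B.values() for g in gs)

    def remove_least_useful(group):
        # drop the instance ranked lowest in usefulness
        group['instances'].remove(
            min(group['instances'], key=lambda i: i['usefulness']))

    def oldest_group(lbl):
        return max(B[lbl], key=lambda g: g['age'])

    def new_group(inst):
        return {'label': label, 'instances': [inst], 'age': 1, 'mean': inst['x']}

    inst = {'x': e0, 'usefulness': 0}

    if label not in B:                                  # Case A: novel class
        B[label] = [new_group(inst)]
        if size() > max_size:
            oldest = max(B, key=lambda l: max(g['age'] for g in B[l]))
            remove_least_useful(oldest_group(oldest))
        return

    groups = [g for gs in B.values() for g in gs]
    g_nn = min(groups, key=lambda g: d(e0, g['mean']))  # nearest group
    if g_nn['label'] == label:                          # Case B
        g_nn['instances'].append(inst); g_nn['age'] += 1
        if size() > max_size:
            remove_least_useful(g_nn)
        return

    pairs = [(i, g) for g in groups for i in g['instances']]
    e_nn, g_of_nn = min(pairs, key=lambda p: d(e0, p[0]['x']))
    if g_of_nn['label'] == label:                       # Case C
        g_of_nn['instances'].append(inst); g_of_nn['age'] += 1
    else:                                               # Case D: noise or drift
        B[label].append(new_group(inst))
    if size() > max_size:
        remove_least_useful(oldest_group(label))
```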
The case base is designed in such a way that it remains balanced during the entire process, in terms of the concepts seen so far, and it is able to rebalance itself if a new concept appears. This feature differentiates our approach from any other, because it is interesting to learn about new concepts once the learning process has already started, instead of only before starting. Also, in contrast to other techniques, SimC behaves consistently with respect to the examples to be removed, and maintains spatial relevance within the case base.

4 Comparative Analysis

Experiments are conducted to evaluate the classification performance of SimC and to compare it with well-known data stream classification algorithms: IBL-DS, LWF, TWF and Sliding Windows. All the algorithms have some parameters to be tuned; in this paper, the default parameter settings suggested by the authors of each algorithm are chosen. LWF was run with its two forgetting parameters set to 0.04 and 0.8, and TWF with its aging factor set to 0.996. The IBL-DS, LWF and Sliding Windows models use the SVDM distance measure [6], while TWF and SimC use the HVDM distance measure presented previously. For all the models, experiments were performed with k = 1 and a case base of maximum size 400. To simulate data streams, the Agrawal, LED24, RDG1, RandomRBF, RandomTree and Stagger generators included in MOA [2] were used, and a total of 9 data sets from the UCI-MLR [3] were also incorporated (each of these data sets was generated with a maximum of 20,000 instances and 5% of noise, following the same criteria as other works). Airlines, Bank and Poker Hand are freely available real data streams. Table 1 shows information about the type and number of attributes and classes of each data stream.

Table 1: Data Streams.
      Data Stream    No. Attributes  Type  Classes  No. Instances
MOA   Agrawal        9               Mix   2        20000
      LED24          24              Nom   10       20000
      RDG1           10              Nom   2        20000
      RandomRBF      10              Num   2        20000
      RandomTree     10              Mix   2        100000
      Stagger        3               Nom   2        100000
Real  Airlines       7               Mix   2        539383
      Bank           16              Mix   2        45210
      Poker Hand     10              Mix   10       25010
UCI   Annealing      39              Mix   5        20000
      Balance        5               Num   3        20000
      Breast         10              Num   6        20000
      Car            7               Nom   4        20000
      Pages-blocks   11              Num   5        20000
      Pen Digits     17              Num   10       20000
      Solar-Flare    13              Nom   6        20000
      Vehicle        19              Num   4        20000
      Vowel          14              Mix   11       20000
The compared algorithms were implemented in Java for the experimental analysis and performance evaluation, using NetBeans IDE 6.0 as the development environment. The experiments were conducted on a machine with an Intel Core 2 Duo 2.21 GHz processor and 1.49 GB of RAM. The Weka [40] software was also used as a library. Weka is a collection of machine learning algorithms for data mining tasks, which contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. These learning algorithms can either be applied directly to a dataset or called from our own implementation. To analyze the performance of the techniques on each data set, the following measures are provided:

- Absolute accuracy of algorithm A on data stream D is recorded when A finishes classifying every instance in D. It is the number of correctly classified instances divided by the total number of instances in D.
- Accuracy of streaming of algorithm A on data stream D is recorded when A finishes classifying the last 100 examples in D. It equals the number of correctly classified instances divided by 100.
- Mean classification time is the average time required to classify a new unseen example.
- Mean updating time is the average time the system requires to update the case base when a new example arrives.

The performance of the classifier is also evaluated by classes, with the following measures [39]:

- Positive precision, P: percentage of examples the classifier has predicted as positive that are really positive.
- Recall or True Positive Rate, TPR: percentage of correctly classified patterns of the positive class with regard to the total number of existing elements of that class.
- F-Measure: a combined measure based on P and TPR, calculated as follows: F-Measure = 2 · (P · TPR) / (P + TPR).

To evaluate the measured performance of each competing scheme, some statistical tests are employed, as recommended in [53]. The Friedman's test is a non-parametric test that is effective for comparing multiple algorithms across multiple data streams. It ranks the algorithms for each data set separately, the best performing algorithm getting rank 1, the second best rank 2, etc.; in case of ties, average ranks are assigned. Then it decides whether to reject the null hypothesis, which states that all the schemes are equivalent and so their ranks should be equal. If the Friedman's test rejects its null hypothesis, post-hoc tests are applied in order to detect significant pairwise differences among the classifiers; the Nemenyi, Holm and Shaffer tests were used, respectively. They indicate which schemes have statistically significant differences (here the 0.05 critical level is used).

4.1 Results

In this section it is evaluated how well SimC performs when classifying data streams. The absolute accuracy of each algorithm on each data stream is detailed in Table 2, including mean values and standard deviations. The experiment shows that the average performance of the proposed model is the best: with this measure, the model obtains an average of 84% accuracy and the lowest standard deviation, showing a satisfactory behavior. To statistically analyze the results, the next step is to proceed to the Friedman's test, with 5 algorithms and 18 data streams, using a Chi-square (χ²) statistic with (5 − 1) = 4 degrees of freedom.
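For reproducibility, the Friedman statistic over the columns of Table 2 can be obtained with standard tools; the sketch below uses scipy, with only the first six rows of Table 2 shown for brevity (the paper's test uses all 18 data streams):

```python
from scipy.stats import friedmanchisquare

# Absolute accuracy per algorithm (first six rows of Table 2).
simc   = [.7534, .6514, .9374, .9067, .6520, 1.000]
ibl_ds = [.6214, .3953, .8896, .8003, .6426, 1.000]
lwf    = [.8601, .6259, .8825, .8803, .6812, .6675]
twf    = [.7927, .5526, .9207, .9240, .6265, .8523]
win400 = [.5600, .3690, .8849, .9251, .6314, .8520]

stat, p = friedmanchisquare(simc, ibl_ds, lwf, twf, win400)
print(f"chi2 = {stat:.3f}, p-value = {p:.4f}")
# If p < 0.05, the null hypothesis of equivalent schemes is rejected and
# post-hoc procedures (Nemenyi, Holm, Shaffer) locate the pairwise differences.
```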
Table 2: Absolute Accuracy.

Data Streams   SimC    IBL-DS  LWF     TWF     Win400
Agrawal        .7534   .6214   .8601   .7927   .5600
Led24          .6514   .3953   .6259   .5526   .3690
Rdg            .9374   .8896   .8825   .9207   .8849
RandomRBF      .9067   .8003   .8803   .9240   .9251
RandomTree     .6520   .6426   .6812   .6265   .6314
Stagger        1.000   1.000   .6675   .8523   .8520
Airlines       .5877   .6074   .6059   .1984   .5575
Bank           .8825   .8669   .8962   .8798   .8766
Poker Hand     .4854   .4469   .5200   .4652   .4719
Annealing      .9814   .9832   .9512   .9688   .8681
Balance        .8693   .8426   .7751   .6858   .6314
Breast         .9458   .8546   .7995   .9995   .9996
Car            .9499   .7880   .7753   .9004   .8094
Pages-block    .9530   .9330   .9377   .9474   .9467
Pendigits      .9551   .8575   .6093   .9596   .9578
Solar Flare    .9886   .9885   .9693   .9855   .9901
Vehicle        .7123   .6286   .5942   .6726   .6181
Vowel          .9176   .5828   .2224   .7980   .8546
Mean           .8405   .7627   .7363   .7838   .7669
Deviation      .15     .18     .19     .21     .19
Table 3: Absolute Accuracy: average ranking.

Algorithm  Ranking
SimC       1.86
IBL-DS     3.41
LWF        3.49
TWF        3.00
WIN400     3.22
The p-value computed by the Friedman's test is 0.0125, so the null hypothesis can be rejected and it can be inferred that there exist significant differences among the rival schemes. Table 3 shows the ranking obtained with the Friedman's test for absolute accuracy. To find out exactly which schemes are significantly different, we proceed to the Nemenyi's (p-value 0.01), Holm's (p-value 0.01428) and Shaffer's (p-value 0.01) procedures, whose results reject the null hypotheses and demonstrate that, with the measure of absolute accuracy, SimC is significantly better than IBL-DS, LWF and WIN400. The results obtained with the accuracy of streaming measure, with mean values and standard deviations, are shown in Table 4. Once again the experiment shows that the average performance of the proposed model is the best: an average of 84% accuracy is obtained, giving our model the best performance. The same statistical test procedure is applied again to the rival methods with the accuracy of streaming measure. This time the p-value computed by the Friedman's test is 0.00221. Therefore, the null hypothesis can be rejected and it can be concluded that there exist significant differences among the rival schemes. Table 5 shows the average ranking of the algorithms with this measure.

Table 4: Accuracy of Streaming.
Data Streams   SimC    IBL-DS  LWF     TWF     Win400
Agrawal        .7740   .5620   .8180   .7900   .5340
Led24          .6600   .3940   .6100   .5760   .3980
Rdg            .9420   .8680   .8800   .9280   .9060
RandomRBF      .9020   .7920   .8060   .9420   .9400
RandomTree     .7101   .5110   .6302   .5212   .6210
Stagger        1.000   1.000   .7102   .8703   .9105
Airlines       .6410   .6010   .5612   .2121   .5513
Bank           .5301   .6401   .6507   .5702   .5801
Poker Hand     .5002   .4705   .5604   .4814   .5805
Annealing      .9860   .9880   .9480   .9800   .8760
Balance        .8860   .8620   .8220   .6700   .5500
Breast         .9300   .8520   .8080   1.000   1.000
Car            .9620   .8240   .6680   .9160   .8000
Pages-block    .9960   .9240   .9220   .9380   .9520
Pendigits      .9620   .7920   .4020   .9620   .9580
Solar Flare    .9960   .9780   .9640   .9860   .9940
Vehicle        .7080   .6420   .5320   .6780   .6420
Vowel          .9380   .5360   .0160   .8240   .8840
Mean           .8405   .7627   .7363   .7838   .7669
Deviation      .15     .18     .19     .21     .19
Table 5: Accuracy of Streaming: average ranking.

Algorithm  Ranking
SimC       1.75
IBL-DS     3.55
LWF        3.66
TWF        2.97
WIN400     3.05
Then the Nemenyi's (p-value 0.01), Holm's (p-value 0.0125) and Shaffer's (p-value 0.01) procedures demonstrate that, with the accuracy of streaming measure, SimC is significantly better than IBL-DS, LWF, TWF and WIN400. An important task in data stream processing is updating the model and answering queries in reasonable time, so the behavior of the proposed model is also considered with respect to time consumption. Tables 6 and 7 show the average updating and classification times for each algorithm on each data stream. Win400 is the fastest model; however, it presents worse performance on the accuracy-related measures.
Table 6: Averaged Updating Time (in milliseconds).

Data Streams   SimC    IBL-DS  LWF     TWF     Win400
Agrawal        .1435   3.132   2.641   .0336   .0241
Led24          1.456   1.784   3.273   .0848   .0747
Rdg            .1437   1.066   1.528   .0269   .0310
RandomRBF      .1012   5.484   5.101   .0041   .0308
RandomTree     .1246   .8089   .0767   .0024   .0164
Stagger        .0208   .4552   .9718   .0190   .0418
Airlines       1.025   34.66   22.94   .7889   .1010
Bank           2.576   10.18   17.97   1.002   .2536
Poker Hand     1.765   2.888   1.859   .1258   .0158
Annealing      .7615   2.638   10.00   .1189   .0910
Balance        .3762   2.059   .6860   .0021   .0124
Breast         .0566   2.729   .7347   .0038   .0247
Car            .3036   .8825   3.511   .0364   .0271
Pages-block    .1813   1.897   2.744   .0043   .0282
Pendigits      .1953   2.887   1.830   .0043   .0517
Solar Flare    .1296   .9357   5.509   .0313   .0402
Vehicle        .1557   3.155   1.426   .0052   .0427
Vowel          .7879   1.787   1.361   .0457   .0508
Mean           .5725   4.413   4.844   .1932   .0667
Deviation      .71     7.8     6.1     .34     .07
SimC yields better results when updating the model than when classifying, because classification must analyze all the instances in the case base, as in a standard k-NN. However, the balance between updating time and classification accuracy makes the approach a very efficient and effective classifier for high-dimensional data streams. Performance evaluation by classes is decisive to obtain a measure of the classifier's behavior. Usually, in binary problems there is a significant imbalance among classes, allowing the classifier to obtain close to 100% precision on the majority class but low performance on the minority class. A good classifier achieves good results for both classes in terms of recall, which ensures its effectiveness in identifying the elements of each class. Table 8 shows the Precision/Recall/F-measure results for each classifier on the binary data streams, by class. Minority classes are marked with * and the best F-measure results are highlighted in bold in the original table. SimC provides good results for each class, especially for the most unbalanced one. In the RDG data stream, where the minority class represents only 26% of the total set, SimC obtains a recall of 75%, while the rest of the classifiers do not exceed 60%. A similar situation happens with the RandomTree and Stagger data streams; on Stagger, the SimC and IBL-DS algorithms are 100% effective on both classes, whereas the rest of the models provide very poor performance on the minority class. On the Airlines data stream, all the models achieve good per-class precision, in some cases surpassing SimC; however, in terms of recall, SimC obtained the best result (54%) for the minority class, compared to the rest of the classifiers, which do not exceed 40%, showing that SimC is quite effective at identifying the elements of the minority class.
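The per-class measures reported in Table 8 can be computed from simple counts; a small sketch follows (the argument names are ours):

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision (P), recall (TPR) and F-measure for one class,
    as defined in Section 4."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```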
Table 7: Averaged Classification Time (in milliseconds).

Data Streams   SimC    IBL-DS  LWF     TWF     Win400
Agrawal        1.115   .7173   1.873   1.556   .0877
Led24          9.094   .8781   3.022   9.917   .9096
Rdg            1.399   .5107   .8501   1.851   .3114
RandomRBF      .5684   .8657   4.763   18.129  .3872
RandomTree     1.282   .0717   .0452   .0991   .0283
Stagger        .2197   .1359   .3565   .1785   .0656
Airlines       5.725   1.186   1.514   58.47   .1211
Bank           6.033   1.072   3.138   .5049   .2091
Poker Hand     3.456   .8739   1.108   .3923   .3742
Annealing      9.825   .9390   4.628   26.55   .2329
Balance        .6952   .2023   .3828   .3225   .0693
Breast         .5629   .1717   .4565   1.584   .0749
Car            1.714   .4755   1.624   .7062   .3955
Pages-block    .7650   .3629   1.558   18.283  .1063
Pendigits      1.119   .5064   1.405   4.506   .2697
Solar Flare    2.071   .3808   2.564   2.583   .2116
Vehicle        1.052   .4347   1.035   5.602   .2179
Vowel          2.611   .2126   .8670   3.072   .3305
Mean           2.739   .5555   1.733   3.578   .2446
Deviation      2.95    .34     1.38    13.70   .20
The Bank data stream displays the worst results: all the models perform well on the majority class, but for the minority class, which represents only 13% of the complete set, the IBL-DS, LWF, TWF and WIN400 algorithms are only able to classify one example correctly. Unlike the rest of the comparison models, SimC is able to maintain a balance among all the classes that compose the problem, avoiding the unbalancing effect of the majority class. The flexible structure of SimC, formed by groups of each class, and its insertion/removal policy exploit the importance of the most significant examples in terms of space and time, which is very effective for unbalanced data streams. One of the most critical issues in IBL techniques is the selection function: each model must decide which cases are most important to retain in order to achieve better accuracy. The combination of various criteria, such as spatial and temporal relevance and consistency, may result in a more appropriate selection of the cases to be stored. Furthermore, the influence of external parameters can affect the natural learning process in certain environments. Table 9 summarizes some qualitative features of the case selection policies presented by each model. The instance selection policy of SimC relies on the combination of several estimators from instances and neighborhoods. These criteria guarantee adaptation to gradual concept changes. Its flexibility allows it to incorporate new concepts at any stage of the learning process, resetting the case base to the new conditions. Learning is done naturally, without the influence of external parameters that can affect the results and without requiring expert knowledge of the problem to define external values.
Table 8: Class Performance (P/TPR/F-Measure).

Data Stream  Class     SimC           IBL-DS         LWF            TWF            WIN400
Agrawal      GroupA    .81/.84/.82    .69/.79/.74    .85/.95/.90    .84/.84/.84    .67/.67/.67
             GroupB*   .64/.59/.61    .39/.26/.31    .87/.65/.74    .68/.68/.68    .31/.31/.31
RDG          C0        .91/.95/.93    .86/.99/.92    .82/.99/.90    .82/.99/.90    .81/.99/.89
             C1*       .86/.75/.80    .96/.56/.71    .97/.40/.57    .98/.37/.54    .99/.34/.51
RandomRBF    Class1    .92/.94/.93    .78/.84/.81    .90/.90/.90    .91/.92/.91    .91/.92/.92
             Class2    .94/.92/.93    .82/.76/.79    .90/.90/.90    .92/.91/.91    .92/.90/.91
RandomTree   Class1*   .57/.67/.62    .59/.52/.55    .63/.53/.58    .60/.59/.60    .63/.62/.62
             Class2    .72/.63/.67    .68/.74/.70    .69/.77/.73    .70/.71/.71    .72/.73/.73
Stagger      False*    1/1/1          1/1/1          1/.99/.99      1/.33/.50      1/.12/.21
             True      1/1/1          1/1/1          .99/1/.99      .92/1/.96      .90/1/.94
Airlines     0         .63/.62/.62    .62/.79/.70    .60/.89/.72    .60/.89/.72    .58/.94/.72
             1*        .53/.54/.54    .61/.40/.48    .67/.28/.39    .68/.27/.38    .69/.16/.26
Bank         No        .88/.99/.93    .88/.99/.93    .88/.99/.93    .88/.99/.93    .88/1/.93
             Yes*      .32/.0017/.0033  .5/.0001/.0003  .5/.0001/.0003  .5/.0001/.0003  1/.0001/.0003
Table 9: Qualitative comparison of instance selection policies.

Features                                       SimC  IBL-DS  LWF  TWF  Win400
Case selection depends on temporal relevance    X     X      X    X    X
Case selection depends on spatial relevance     X     X      X
Case selection depends on consistency           X     X
Independence of external parameters             X
Novelty detection                               X
Good adaptation to gradual concept drift        X     X      X    X    X
5 Conclusions

To deal with the data stream classification problem, several important aspects must be analyzed. The most important issue is to maintain a balance between accuracy and efficiency, i.e., the algorithm should provide good classification performance with a reasonable response time. Thus, the knowledge model (the case base in this case) must be representative of the data behavior (tendency), and should be manageable in size so that applying the insertion/removal policies does not consume much time.
The approach, a similarity-based data stream classifier named SimC, achieves this goal by introducing a novel insertion/removal policy, which adapts quickly to the data tendency and maintains a representative, small set of examples and estimators that guarantees good classification rates. Furthermore, the methodology is supported by IBL techniques, which are very intuitive, flexible and easy to replicate. The selection of the distance measure (a critical aspect in this context) has been studied in depth, in order to handle both numeric and nominal attributes satisfactorily. The methodology is also able to handle new concepts appearing during the running phase and to remove useless ones that do not add any value to the classification process. The proposed methodology uses estimators designed to guarantee the spatial and temporal relevance, and the consistency, of the selected examples, which improves accuracy and updating times. SimC achieves an appropriate balance among the classes that compose the problem and guarantees its effectiveness on the minority classes of unbalanced problems. Also, the performance of the classifiers was measured taking into account the precision, recall and F-measure of each algorithm on each binary data stream, class by class. Performance measures and statistical tests were used to evaluate the model behavior from two points of view: efficacy (classification rate) and efficiency (online response time). The results show that our approach is very competitive in terms of accuracy (absolute and streaming), and classification/updating time, against several of the most popular techniques in the literature. In short, five well-known techniques and eighteen data streams were compared, using the Friedman's test; to find out which schemes were significantly different, the Nemenyi's, Holm's and Shaffer's tests were considered. The future directions of this work include: (1) refining the model to detect and deal with abrupt concept changes that occur in the general distribution or in a particular class, using the absolute accuracy and the per-class precision as indicators; and (2) improving the performance of the classifier in terms of time, including efficient data structures to support the classification process.
6 Acknowledgements

Special thanks to the Agencia Universitaria Iberoamericana de Postgrado (AUIP) for funding the research visits, and to the Spanish Ministry of Science and Innovation for supporting the research under grant TIN2011-28956-C02-01. We are grateful to the anonymous referees for their invaluable suggestions to improve the paper.

References

1. A. Bifet, R. Gavaldà, Adaptive learning from evolving data streams, Lecture Notes in Computer Science, Vol. 5772, Advances in Intelligent Data Analysis VIII, 8th International Symposium on Intelligent Data Analysis, IDA, 249-260, 2009.
2. A. Bifet, G. Holmes, R. Kirkby, B. Pfahringer, MOA: Massive Online Analysis, J. Mach. Learn. Res., 11, 1601-1604, 2010.
3. A. Frank, A. Asuncion, UCI machine learning repository, 2010.
4. A. Klinkenberg, Learning drifting concepts: Example selection vs. example weighting, Intelligent Data Analysis, Special Issue on Incremental Learning Systems Capable of Dealing with Concept Drift, 8, 281-300, 2004.
5. A. Orriols-Puig, J. Casillas, E. Bernado, Fuzzy-UCS: A Michigan-style Learning Fuzzy-Classifier System for Supervised Learning, Transactions on Evolutionary Computation, 1-23, 2008.
6. A. Skowron, A. Wojna, K-nearest neighbor classification with local induction of the simple value difference metric, Proceedings of the Fourth International Conference on Rough Sets, LNAI 3066, 229-234, 2004.
7. A. Marascu, F. Masseglia, Atypicity detection in data streams: A self-adjusting approach, Intelligent Data Analysis, 15, 89-105, 2011.
8. A. Shaker, E. Hullermeier, IBLStreams: A System for Instance-Based Classification and Regression on Data Streams, Evolving Systems, 1-31, 2013.
9. A. Shaker, R. Senge, E. Hullermeier, Evolving fuzzy pattern trees for binary classification on data streams, Information Sciences, 220, 34-45, 2013.
10. C. Stanfill, D. Waltz, Toward memory-based reasoning, Communications of the ACM, 29(12), 1213-1228, 1986.
11. C. Liang, Y. Zhang, Q. Song, Decision tree for dynamic and uncertain data streams, Journal of Machine Learning Research Proceedings Track, 13, 209-224, 2010.
12. C. Li, Y. Zhang, X. Li, OcVFDT: One-class very fast decision tree for one-class classification of data streams, In Proc. of the 3rd International Workshop on Knowledge Discovery from Sensor Data, 79-86, 2009.
13. C. Ho-Leung, L. Tak-Wah, L. Lap-Kei, T. Hing-Fung, Continuous Monitoring of Distributed Data Streams over a Time-Based Sliding Window, Algorithmica, 62, 1088-1111, 2012.
14. C. Li, Y. Zhang, X. Li, Learning decision trees from dynamic data streams, Journal of Universal Computer Science, 11(8), 1353-1366, 2009.
15. D.W. Aha, D. Kibler, M.K. Albert, Instance-Based Learning Algorithms, Machine Learning, 6, 37-66, 1991.
16. D. Brzezinski, J. Stefanowski, Reacting to different types of concept drift: The accuracy updated ensemble algorithm, IEEE Transactions on Neural Networks and Learning Systems, 2013.
17. D. Mena, J.S. Aguilar-Ruiz, Y. Rodriguez, Classification Model for Data Streams Based on Similarity, IEA/AIE, LNAI 6703, 1, 1-9, 2011.
18. D.R. Wilson, T.R. Martinez, Improved Heterogeneous Distance Functions, Journal of Artificial Intelligence Research, 6, 1-34, 1997.
19. D. Md. Farid, L. Zhang, M.A. Hossain, C.M. Rahman, R. Strachan, G. Sexton, K.P. Dahal, An adaptive ensemble classifier for mining concept drifting data streams, Expert Syst. Appl., 40(15), 5895-5906, 2013.
20. D. Yang, E.A. Rundensteiner, M.O. Ward, Mining neighbor-based patterns in data streams, Information Systems, 38, 331-350, 2013.
21. E. Lughofer, FLEXFIS: A robust incremental learning approach for evolving Takagi-Sugeno fuzzy models, IEEE Transactions on Fuzzy Systems, 16(6), 1393-1410, 2008.
22. F.J. Ferrer-Troyano, J.S. Aguilar-Ruiz, J.C.R. Santos, Incremental rule learning and border examples selection from numerical data streams, J. UCS, 11(8), 1426-1439, 2005.
23. G. Gora, A. Wojna, RIONA: A classifier combining rule induction and k-NN method with automated selection, European Conference on Machine Learning, LNAI 2430, 111-123, 2002.
24. G. Widmer, Combining Robustness and Flexibility in Learning Drifting Concepts, Machine Learning, 111, 1994.
25. G. Lee, U. Yun, K.H. Ryu, Sliding window based weighted maximal frequent pattern mining over data streams, Expert Systems with Applications, 41, 694-708, 2014.
26. G.S. Manku, R. Motwani, Approximate Frequency Counts over Data Streams, Proceedings of the 28th VLDB Conference, 2002.
27. H. Abdulsalam, D.B. Skillicorn, P. Martin, Classifying evolving data streams using dynamic streaming random forests, Proceedings of the 19th International Conference on Database and Expert Systems Applications, 5181, 643-651, 2008.
28. J. Beringer, E. Hullermeier, Efficient Instance-Based Learning on Data Streams, Intelligent Data Analysis, 1-43, 2007.
29. J. Gama, R. Rocha, P. Medas, Accurate decision trees for mining high-speed data streams, In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 523-528, 2003.
30. J. Gama, P. Kosina, Learning decision rules from data streams, In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), 1255-1260, 2011.
31. J. Gama, A survey on learning from data streams: current and future trends, Prog Artif Intell, 1, 45-55, 2012.
32. J.R. Quinlan, C4.5: Programs for Machine Learning, 235-240, 1994.
33. J. Read, A. Bifet, G. Holmes, B. Pfahringer, Scalable and efficient multi-label classification for evolving data streams, Mach Learn, 88, 243-272, 2012.
34. J. Gama, P. Medas, Learning decision trees from dynamic data streams, Journal of Universal Computer Science, 11(8), 1353-1366, 2009.
35. K. Pripuic, I. Podnar, K. Aberer, Distributed processing of continuous sliding-window k-NN queries for data stream filtering, World Wide Web, 14, 465-494, 2011.
36. L.A. Kurgan, K.J. Cios, CAIM discretization algorithm, IEEE Transactions on Knowledge and Data Engineering, 16(2), 145-153, 2004.
37. L. Su, H. Liu, Z. Song, A new classification algorithm for data stream, IJMECS, 3(4), 32-39, 2011.
38. L. Zhao, L. Wang, Q. Xu, Data stream classification with artificial endocrine system, Appl Intell, 37, 390-404, 2012.
39. M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Information Processing and Management, 45, 427-437, 2009.
40. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: An update, SIGKDD Explorations, 11, 2009.
41. M. Salganicoff, Tolerating Concept and Sampling Shift in Lazy Learning Using Prediction Error Context Switching, Artificial Intelligence Review, 133-155, 1997.
42. M. Toyoda, Y. Sakurai, Y. Ishikawa, Pattern discovery in data streams under the time warping distance, The VLDB Journal, 22, 295-318, 2013.
43. M.M. Mohammad, W. Clay, G. Jing, K. Latifur, H. Jiawei, W.H. Kevin, C.O. Nikunj, Facing the reality of data stream classification: coping with scarcity of labeled data, Knowl Inf Syst, 33, 213-244, 2012.
44. N.G. Pavlidis, D.K. Tasoulis, N.M. Adams, D.J. Hand, λ-Perceptron: An adaptive classifier for data streams, Pattern Recognition, 44, 78-96, 2011.
45. N. Mozafari, S. Hashemi, A. Hamzeh, A Precise Statistical approach for concept change detection in unlabeled data streams, Computers and Mathematics with Applications, 62, 1655-1669, 2011.
46. P. Domingos, G. Hulten, Mining High-Speed Data Streams, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 71-80, 2000.
47. P. Vivekanandan, R. Nedunchezhian, Mining data streams with concept drifts using genetic algorithm, Artif Intell Rev, 36, 163-178, 2011.
48. P. Zhang, X. Zhu, Y. Shi, L. Guo, X. Wu, Robust ensemble learning for mining noisy data streams, Decision Support Systems, 50, 469-479, 2011.
49. Q. Xiangju, Z. Yang, L. Chen, L. Xue, Learning from data streams with only positive and unlabeled data, Intell Inf Syst, 40, 405-430, 2013.
50. R. Jin, G. Agrawal, Efficient Decision Tree Construction on Streaming Data, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 571-576, 2003.
51. R. Klinkenberg, T. Joachims, Detecting Concept Drift with Support Vector Machines, In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), 487-494, 2000.
52. R. Polikar, L. Udpa, V. Honavar, LEARN++: An Incremental Learning Algorithm for Multilayer Perceptron Networks, IEEE Transactions on Systems, Man and Cybernetics (C), Special Issue on Knowledge Management, 3414-3417, 2000.
53. S. Garcia, F. Herrera, An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons, Journal of Machine Learning Research, 9, 2677-2694, 2008.
54. S. Hashemi, Y. Yang, Flexible decision tree for data stream classification in the presence of concept change, noise and missing values, Data Mining and Knowledge Discovery, 19, 95-131, 2009.
55. S. Hashemi, Z. Ying, Y. Mirzamomen, M. Kangavari, Adapted one-versus-all decision trees for data stream classification, IEEE Transactions on Knowledge and Data Engineering, 21, 624-637, 2009.
56. S. Cost, S. Salzberg, A weighted nearest neighbor algorithm for learning with symbolic features, Machine Learning, 10, 57-78, 1993.
57. S. Miyamoto, K. Mori, H. Ihara, Autonomous decentralized control and its application to the rapid transit system, Comput Ind, 5, 115-124, 1984.
58. T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, 13, 21-27, 1967.
59. W. Xindong, L. Peipei, H. Xuegang, Learning from concept drifting data streams with unlabeled data, Neurocomputing, 92, 145-155, 2012.
60. X. Zhu, X. Wu, Y. Yang, Effective classification of noisy data streams with attribute-oriented dynamic classifier selection, Knowl. Inf. Syst., 9, 339-363, 2006.
61. Z. Qun, H. Xuegang, Z. Yuhong, L. Peipei, W. Xindong, A Double-Window-Based Classification Algorithm for Concept Drifting Data Streams, 2010 IEEE International Conference on Granular Computing, 639-644, 2010.