AGGE: A Novel Method to Automatically Generate Rule Induction Classifiers Using Grammatical Evolution

Romaissaa Mazouni and Abdellatif Rahmoun

Abstract. One of the fundamental tasks of data mining is the automatic induction of classification rules from a set of examples and observations. A variety of methods performing this task have been proposed in the recent literature, and many comparative studies have been carried out in this field. However, the common feature of these methods is that they are designed manually. Meanwhile, there have been some successful attempts to design such methods automatically using Grammar-based Genetic Programming (GGP). In this paper, we propose a different system, AGGE (Automatic Generation of classifiers using Grammatical Evolution), that can evolve complete Java programs. Each program implements a rule induction algorithm and is obtained through grammatical evolution, in which a Backus-Naur Form grammar definition governs the mapping to a program. To perform this task, binary strings are used as inputs to the mapper along with the Backus-Naur Form grammar. These binary strings represent potential solutions built from the initialization component and Weka building blocks, which eases the induction process and keeps the induced programs short. Experimental results on a benchmark of well-known data sets demonstrate the efficiency of the proposed method and show that it outperforms some recent, similar manually designed techniques (Prism, Ripper, Ridor, OneRule).

Keywords: AGGE (Automatic Generation of classifiers using Grammatical Evolution), Grammatical Evolution, Context-Free Grammar, Rule Induction Algorithms, Data Mining, Rule-Based Classification.

1 Introduction

The use of Evolutionary Algorithms (EA) in artificial life to simulate the natural process of evolution started in the 1960s with the works of Alex Fraser and Nils Barricelli, the writings of John Holland on genetic algorithms, and Ingo Rechenberg's work on evolution strategies [1]. Koza [2] made this field very popular. The dramatic increase in the power of computers has allowed researchers to solve complicated real-world problems, including the automatic generation of computer programs, using bio-inspired techniques such as evolutionary algorithms, which have proven able to solve multidimensional problems more efficiently than software produced by human designers.

Genetic Programming (GP) is the most popular EA technique for the automatic generation of computer programs using genetic operators. However, in their standard form, GP algorithms have some issues that need to be taken into consideration at design time. One of the major problems faced when designing a GP system is the selection of the function set and the terminal set. These sets should be chosen so as to satisfy the closure property [3], which requires that each function in the function set be able to handle the values it receives as inputs, whether these inputs are terminals or the outputs of other functions. In order to respect this property, early GP algorithms dealt with only one data type, which of course reduced the power of GP. Recent GP systems use new approaches that allow GP to handle different data types while satisfying the closure property; grammar-based genetic programming [4] is the technique most commonly used to tackle the closure problem. Another problem faced when using genetic programming is the huge search space that GP has to scan. Using grammars along with GP addresses this issue as well, because grammars offer the ability to introduce prior knowledge about the basic structure of solutions, making it easy to restrict the search space. Ensuring the syntactic validity of individuals is also a matter to consider when using GP; grammars are a way to guarantee this validity and to preserve it after applying genetic operators.

One of the most popular research fields in computer science, besides evolutionary algorithms, is data mining, and more specifically classification. Several approaches to classification exist, and among the most widely used is rule-based classification. Rule-based classifiers are frequently used due to the humanly comprehensible nature of the models they generate. They have evolved over time, from simple sequential covering concepts in algorithms such as PRISM and CN2 to newer concepts such as the minimum description length in the RIPPER algorithm. These old and novel components can be modified and combined automatically in different ways to produce new algorithms.

2 Grammar Based Genetic Programming

Writing a computer program is a purely manual task that is time consuming and requires a lot of reflection; it is a tedious task. The idea of automating this process appeared in the 1950s, when Arthur Samuel thought of giving computers the ability to learn without being explicitly programmed. This process was called, back then, Machine Learning; in the 1980s the definition was changed by Tom Mitchell to computers having the ability to learn via experience. At the same time, another field of computer science appeared that was directly applied to automatic code generation: Genetic Programming. After several years of research in this new field, a new branch called Grammar-Based GP was conceived in order to satisfy the closure property and to restrict the search space as well.

2.1 Context Free Grammar Based Genetic Programming (CFG-GP)

The use of formal grammars was first introduced by Peter Whigham in order to control the search performed by genetic programming; it was also a solution to the typing problem recognized by [9] and a means to introduce more bias into genetic programming. Whigham proposed a method called Context-Free Grammar GP [4, 10] and noted that a CFG can be used, in a way similar to typing, to restrict the structure of candidate solutions. The method redefines the elements of tree-based GP so that they respect a given grammar G. Individuals in CFG-GP keep the tree structure and are derived according to the CFG supplied to the GP, and the genetic operators are modified so as to preserve this representation. Context-free grammars also make it easy to use the programming language most appropriate for a given problem [12].

2.2 Grammatical Evolution (GE)

Grammatical evolution is a special case of grammar-based genetic programming that uses a mapping procedure between the genotypes and phenotypes of individuals [6, 13]. When using GE to evolve solutions to a problem, we do not need to pay attention to the operators and how they are implemented: GE brings the validity of the evolved programs for free. Grammatical evolution as described by [17] marries principles from molecular biology to the representational power of formal grammars [4]. The genotype-phenotype mapping in GE allows the operators to act on genotypes rather than on solution trees as in traditional GP, and this is what makes the technique attractive. In analogy to biological DNA, the string of integers that represents the genotype is called a chromosome, and the values it is composed of are called codons. A BNF grammar definition must be supplied when using GE to solve a problem; this grammar describes the output language produced by the system [19] and is used along with the chromosomes in the mapping process.

The mapping process maps non-terminals to terminals. The binary-string data structure brought from the genetic algorithm into the machinery of GE is first converted into a string of integers, which is then passed through a translation process in which rules of the BNF definition are selected. The production rules, analogous to amino acids in genetics, combine to produce terminals, the components making up the final program. One problem that can arise during mapping is running out of genes while non-terminals remain to be mapped. One solution is to wrap the individual and reuse its genes (a gene in GE may be used several times in a mapping process); another is to declare the individual invalid and punish it with a suitably harsh fitness value. Rule selection is performed using the modulo operator: each time a codon (an integer) is read, it is divided by the number of choices of the current rule, and the remainder of this division designates the rule to be selected.
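To make the mapping concrete, the following is a minimal sketch in Java (the language of the programs evolved by the system). The toy grammar, class, and method names are our own illustrative choices, not part of the AGGE implementation; the sketch only demonstrates modulo rule selection and wrapping.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Minimal sketch of GE's genotype-to-phenotype mapping.
 * Toy grammar (hypothetical): <expr> ::= ( <expr> + <expr> ) | x | 1
 */
public class GeMapperSketch {
    private static final String[][] EXPR_RULES = {
        {"(", "<expr>", "+", "<expr>", ")"},  // choice 0: recursive expansion
        {"x"},                                 // choice 1
        {"1"}                                  // choice 2
    };

    /** Returns the derived program text, or null if the individual is invalid. */
    static String map(int[] codons, int maxWraps) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("<expr>");                  // start symbol
        StringBuilder out = new StringBuilder();
        int i = 0, wraps = 0;
        while (!stack.isEmpty()) {
            String sym = stack.pop();
            if (!sym.equals("<expr>")) {       // terminal: emit it
                out.append(sym);
                continue;
            }
            if (i == codons.length) {          // out of genes: wrap and reuse them
                if (++wraps > maxWraps) return null; // or punish with a harsh fitness
                i = 0;
            }
            // Modulo rule selection: codon value mod number of choices.
            String[] chosen = EXPR_RULES[codons[i++] % EXPR_RULES.length];
            for (int j = chosen.length - 1; j >= 0; j--) stack.push(chosen[j]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Codon 3 -> 3 % 3 = 0 (expand), 1 -> "x", 2 -> "1": prints (x+1)
        System.out.println(map(new int[]{3, 1, 2}, 2));
    }
}
```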

3 Sequential Covering

Rule induction is an area of data mining in which formal rules are extracted from a dataset; these rules may represent a full scientific model of the data or merely local patterns in it. One paradigm of rule induction is decision rules, induced from a set of observations in two different ways: the indirect method extracts decision rules from other knowledge representations, while the direct method extracts decision rules directly from the training set. In this paper we focus on the direct method. Sequential covering is a technique following this paradigm. The idea is to learn one rule, remove the examples it covers, and then repeat this process [15], as described in the following pseudocode.

The Sequential Covering Pseudocode

    SequentialCovering(target_att, atts, examples, threshold)
        Learned_rules = {}
        rule = Learn_One_Rule(target_att, atts, examples)
        while performance(rule, examples) > threshold do
            Learned_rules = Learned_rules + rule
            examples = examples - examples correctly classified by rule
            rule = Learn_One_Rule(target_att, atts, examples)
        done
        Learned_rules = sort(Learned_rules, performance)
        return Learned_rules

In sequential covering algorithms, Learn_One_Rule should have high accuracy but not necessarily high coverage, and it is not guaranteed to find the best or smallest set of rules because it performs a greedy search. Plenty of algorithms following the sequential covering paradigm have been proposed; they differ in four main components. The first is the way they represent candidate rules: some use propositional logic (e.g., CN2 and RIPPER), while others use first-order logic (e.g., FOIL and REP). Second, they can use different search mechanisms [16]. There are three search strategies: bottom-up, where we start from a random example of the dataset and then generalize it; top-down, where we start with an empty rule and specialize it by adding preconditions; and bidirectional. There are also two search methods: the greedy method (e.g., PRISM), which is the most used, and the beam method (e.g., CN2, BEXA). Third, covering algorithms evaluate rules in different manners; existing measures include confidence (e.g., SWAP-1), the Laplace estimate (e.g., BEXA, CN2), the m-estimate, the ls-content (e.g., HYDRA), the minimum description length, and the information gain (two of these measures are illustrated in the code sketch at the end of this section). The final component that differentiates covering algorithms is the pruning method. Pruning is used to handle overfitting and noisy data. There are two kinds: pre-pruning, which deals with overfitting and noisy data, and post-pruning, which deals with rejection and conflict problems in order to find a complete and consistent rule set. Pre-pruning yields a model quickly, while post-pruning helps find simpler and more accurate models.

Pappa et al. [16] proposed a full context-free grammar, in Backus-Naur Form terminology, that presents all the elements necessary for building a basic rule-based classifier following the sequential covering method. This grammar contains 26 production rules with non-terminal symbols and 83 terminal symbols describing these elements. It produces either a rule list, where rules are executed in a certain order, or a rule set, when no order is needed when applying the rules, and it provides different initialization, refinement, and evaluation techniques. The grammar is presented in Figure 1.
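As an illustration of the rule-evaluation component, here is a small Java sketch of two of the measures named above: confidence and the Laplace estimate. The formulas are the standard ones; the class and method names are ours, not taken from any of the cited systems.

```java
/** Illustrative rule-quality measures (standard formulas; names are ours). */
public class RuleMeasures {
    /** Confidence: fraction of the examples covered by the rule that it classifies correctly. */
    static double confidence(int covered, int correct) {
        return (double) correct / covered;
    }

    /** Laplace estimate: confidence corrected toward the prior, robust for low-coverage rules. */
    static double laplace(int covered, int correct, int numClasses) {
        return (correct + 1.0) / (covered + numClasses);
    }

    public static void main(String[] args) {
        // A rule covering 5 examples, all 5 correct, in a 2-class problem:
        System.out.println(confidence(5, 5)); // 1.0
        System.out.println(laplace(5, 5, 2)); // ~0.857, less optimistic than confidence
    }
}
```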

4 Automatic Generation of New Rule Based Classifiers

Context-free-grammar-based genetic programming was used by [16] to induce rule-based classifiers; to the best of our knowledge, it is the first and only method used to automatically design a rule induction algorithm. It would be very interesting to design another system that performs the same task with less design effort. Grammatical evolution is such a method, because neither the crossover nor the mutation operator needs to be modified to respect the grammar. Grammatical evolution is therefore the method we use to automatically generate rule-based classifiers. Plenty of classification methods exist, such as support vector machines, Bayesian neural networks, artificial neural networks, and decision trees; the decision-rule model is chosen because it tends to be intuitively comprehensible to human beings. In the following section we propose a system combining grammatical evolution with a context-free grammar to evolve code fragments able to generate accurate, noise-tolerant, and compact decision rule sets.

4.1 Proposed System

The proposed system has five main components. First, we need a grammar that represents the overall structure of all manually designed rule-based classifiers following the sequential covering paradigm. Second, we need some building blocks taken from Weka to facilitate the tasks of reading ARFF files and testing the newly generated classifiers; this can be seen as "code reuse". We also need machine learning data sets to train and test these classifiers; we downloaded them from the UCI Machine Learning Repository [18]. Multiple datasets are needed so that the rule-based classifiers (candidate solutions) are not tailored to one specific domain during training. Finally, we have the mapper of GE [12], which we modified so that, when it reads terminals, it inserts the Java code implementing each terminal's actions; we used the GEVA framework [6] to implement the system.
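For example, the Weka building block for reading ARFF files amounts to only a few lines. The sketch below uses Weka's standard DataSource API; the file path is hypothetical.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Sketch of the Weka "code reuse" block for loading a data set. */
public class ArffLoadSketch {
    public static void main(String[] args) throws Exception {
        DataSource source = new DataSource("data/monks-1.arff"); // hypothetical path
        Instances data = source.getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class label is the last attribute
        System.out.println(data.numInstances() + " instances loaded");
    }
}
```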

Fig. 1 The Grammar Describing Sequential Covering Method Elements

Fig. 2 Modules of The Proposed System

The most important module in the system is the mapper. It reads the integer values from the chromosomes (candidate solutions, called AGGE-classifiers in this paper), chooses the appropriate production rule for the current non-terminal, and imports some of the already coded Weka building blocks, inserting them along with the Java code corresponding to the terminals into the buildClassifier method of the new AGGE classifier. Figure 2 is a schema representing the steps of the global system. Individuals are represented as integer arrays that are mapped into rule-based classifiers. Each array is read integer by integer; each integer is divided by the number of choices of the current rule, and the remainder of this division is used by the mapper to choose the next rule to apply. The first integer of each individual is divided by 2 (the number of choices of the first rule, Start): we construct a rule set if integer MOD 2 = 1 and a rule list if integer MOD 2 = 0.

When using evolution to solve any problem, we need a measure that enables selecting the best individuals in a population. The fitness function used in this work to evaluate the AGGE-classifiers generated during evolution is accuracy. After the initialization of the first population, individuals (AGGE-classifiers) undergo the mapping process and are turned into Java programs. These programs are actual classifiers that are compiled and then executed (trained and tested) on different data sets, so each classifier obtains a set of accuracies (one per data set). These accuracies are then averaged and used as the classifier's overall accuracy, which allows the AGGE-classifiers in a population to be compared. The following equation defines the overall accuracy $f_i$ of an AGGE-classifier $i$ in a given population, where $acc_{i,j}$ is the accuracy of AGGE-classifier $i$ on data set $j$ and $h$ is the number of data sets:

$$f_i = \frac{1}{h} \sum_{j=1}^{h} acc_{i,j}$$
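Assuming each mapped AGGE-classifier is a Weka Classifier, the fitness computation could be sketched as follows. The class and method names are ours; the Evaluation API is Weka's standard one, and the 5-fold cross-validation mirrors the experimental setup described in Section 5.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

/** Sketch of the fitness f_i: accuracy averaged over the h data sets. */
public class FitnessSketch {
    static double fitness(Classifier aggeClassifier, Instances[] dataSets) throws Exception {
        double sum = 0.0;
        for (Instances data : dataSets) {
            Evaluation eval = new Evaluation(data);
            // 5-fold cross-validation on each meta-training set (as in the experiments).
            eval.crossValidateModel(aggeClassifier, data, 5, new Random(1));
            sum += eval.pctCorrect(); // acc_{i,j}, in percent
        }
        return sum / dataSets.length; // f_i = (sum over j of acc_{i,j}) / h
    }
}
```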

When using grammatical evolution, we do not need to worry about the grammar-consistency of new offspring generated by mutation or crossover, because these operators are applied to the genotypes.

5 Experimental Results and Discussion

Prior to testing the system, we downloaded 19 data sets from [18] to train and test it. These data sets come from different public domains so that, as mentioned earlier, the system is not tailored to a specific domain. Some of the data sets have only nominal attributes, others only numerical attributes, and some have both attribute types. The data sets were randomly divided into two groups: 70% of them (13) were used to train and then validate the models, and the rest were used for testing. The meta-training set contains Monks-2, Monks-3, Balance-scale, lymph, promoters, splice, vowel, vehicle, pima, glass, zoo, hepatitis, and ionosphere, and the meta-testing set contains Monks-1, segment, crx, sonar, heart-c, and mushrooms. Before the training phase, each set of the meta-training set is subsampled using 5-fold cross-validation in order to avoid overfitting and to make predictions more generalizable.

The system needs three components to start the evolution process: the grammar presented in Section 3, the meta data sets, and the grammatical evolution parameters. The number of generations was set to 40, the population size to 200, the mutation probability to 0.01, and the crossover probability to 0.7; tournament selection and generational replacement were used. These parameters are not optimized but rather were chosen empirically after analyzing a number of trials. In order to evaluate the newly generated classifiers, we computed the accuracies of four manually designed rule-based classifiers on all 19 datasets. Table 1 reports the accuracies of the newly generated classifier (AGGE-classifier) alongside those of the four manually designed classifiers (Prism, Ripper, Ridor, and OneRule). The first 13 rows report the accuracies of the rule sets generated by the AGGE-classifier and the four baselines using only the meta-training set (each row represents the test accuracy on a single set from the meta-training set); these values are reported to show the success of the training phase. The last six rows of the table show the real predictive accuracy values, because the AGGE-classifier never encountered these sets during the training or validation phase.

Table 1 Accuracy rates (%) using both meta-sets

| Data set | AGGE Classifier | Prism | Ripper | Ridor | OneRule |
|---|---|---|---|---|---|
| monks-2 | 53.8462 | 37.8698 | 53.2544 | 50.8876 | 42.0118 |
| monks-3 | 36.8852 | 30.3279 | 45.9016 | 46.7213 | 37.7049 |
| promoters | 90.5660 | 66.0377 | 78.3019 | 74.5283 | 69.8113 |
| splice | 99.1223 | 70.1823 | 93.6991 | 92.1003 | 24.3574 |
| vowel | 82.2222 | 83.0303 | 69.6970 | 77.6768 | 31.8182 |
| vehicle | 64.4526 | 66.7849 | 68.5579 | 70.5674 | 51.4184 |
| pima | 75.3404 | 70.0349 | 75.1302 | 75.0000 | 70.1823 |
| glass | 62.6923 | 57.4939 | 66.8224 | 64.0187 | 58.4112 |
| zoo | 100.000 | 62.3762 | 86.1386 | 94.0594 | 42.5743 |
| lymph | 68.9189 | 75.6757 | 77.7072 | 85.1351 | 85.1351 |
| balance-scale | 78.7225 | 52.3200 | 80.0800 | 79.5200 | 56.3200 |
| ionosphere | 88.1474 | 91.1681 | 89.7436 | 88.0342 | 80.9117 |
| hepatitis | 68.3871 | 78.0645 | 78.0645 | 78.7097 | 83.2258 |
| segment | 94.2857 | 92.2944 | 95.4978 | 96.1472 | 64.8052 |
| crx | 91.4493 | 77.5362 | 85.5072 | 83.3333 | 85.7971 |
| mushrooms | 100.000 | 100.000 | 100.000 | 100.000 | 98.5229 |
| monks-1 | 49.236 | 26.6129 | 49.1995 | 51.6129 | 50.0000 |
| sonar | 76.4423 | 74.0385 | 73.0769 | 73.5577 | 62.5000 |
| heart-c | 80.5291 | 76.8977 | 81.5182 | 79.538 | 71.6172 |

The aim of this work was to propose a system that uses grammatical evolution to automatically generate rule-based classifiers whose accuracy is at least competitive with that of existing manually designed classifiers. After the implementation and testing phases, the system proved its ability to produce classifiers that are highly competitive with human-designed ones, as Table 1 shows. Note that we used 5-fold cross-validation so as to train the classifiers on the complete data set and then, indirectly, test them with all the data; this takes advantage of all the knowledge available in the data sets and yields reliable performance results. The accuracies in Table 1 were obtained by averaging the accuracy of the rule model generated by the AGGE-classifier for each test set over the 5 iterations of the 5-fold cross-validation used during the experiments; the same applies to the baseline classifiers used for comparison.

It is worth noting that in Table 1 the newly generated classifier achieves practically the same results as the other methods. Comparing only the baseline methods with each other, RIPPER records 10 wins, against 5 for PRISM and RIDOR and 2 for OneRule; this is due to the sophisticated nature of the RIPPER classifier, which uses growing, pruning, and optimization phases and the minimum description length (MDL) criterion as a stopping condition when constructing rules. Looking at the AGGE-classifier, its accuracies are very close to those of the baseline algorithms, which is remarkable given that the AGGE-classifier is automatically generated; this removes a great deal of the time otherwise spent on coding tasks. Human designers can easily go wrong when parameterizing an algorithm during the design process; by contrast, the chance of obtaining bad parameters when evolving algorithms automatically is very low. The last six rows show that the AGGE-classifier records 3 wins against the baseline classifiers (crx, mushrooms, sonar), and for the heart-c and segment data sets the AGGE results are very close to the best accuracy (80.5291 versus 81.5182 and 94.2857 versus 95.4978). These results show that the proposed approach can be very interesting. However, the AGGE system is time consuming while evolving AGGE-classifiers and requires high computational power; it cannot be run properly on an ordinary computer, where the evolution process can take up to one week of continuous computation. We should also note that this version of the system does not handle missing values, and thus instances with missing values must be removed before the datasets are used; numeric values are handled through discretization.
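The preprocessing just described (removing instances with missing values, then discretizing numeric attributes) could look like the following Weka-based sketch. The class name is ours, and the choice of supervised discretization is our assumption, since the paper only mentions "the discretization method".

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

/** Sketch of the stated preprocessing; assumes the class index is already set. */
public class PreprocessSketch {
    static Instances preprocess(Instances data) throws Exception {
        // The system cannot handle missing values: drop any instance that has one.
        for (int a = 0; a < data.numAttributes(); a++) {
            data.deleteWithMissing(a);
        }
        // Discretize numeric attributes (supervised discretization is an assumption).
        Discretize discretize = new Discretize();
        discretize.setInputFormat(data);
        return Filter.useFilter(data, discretize);
    }
}
```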

6 Conclusion

This paper demonstrated the possibility of automatically generating rule-based classifiers using grammatical evolution. The automatically and genetically evolved classifiers can produce results that are competitive with those produced by manually designed algorithms, and the results obtained with the newly generated rule-based classifier prove the efficiency and effectiveness of this approach. However, there are still some gaps to be addressed.

One interesting research direction for improving the system is to focus on the fitness. This can be done by using a multi-objective fitness function that considers, along with the accuracy, either the rule and rule-set size or the time consumed when performing classification (for real-time classification applications). Such a method might help in finding more efficient and powerful algorithms. Another way to improve the system would be to extend the grammar so that it can produce more complex algorithm structures, for example through the use of a fuzzy rule representation. The system would then be able to produce fuzzy-rule-based classifiers, which are more expressive and more powerful than classical ones.

References

1. Beyer, H.G., Schwefel, H.P.: Evolution strategies: A comprehensive introduction. Natural Computing 1, 3–52 (2002)
2. Koza, J.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press (1992)
3. Pappa, G.L., Freitas, A.: Automating the Design of Data Mining Algorithms. Springer, Heidelberg (2010)
4. McKay, R., Hoai, N.X., Whigham, P.A., Shan, Y., O'Neill, M.: Grammar-based Genetic Programming: A review. Genetic Programming and Evolvable Machines 11, 365–396 (2010)
5. Wong, M.L., Leung, K.S.: Applying logic grammars to induce sub-functions in genetic programming. In: IEEE International Conference on Evolutionary Computation, vol. 2, pp. 737–740 (1995)
6. O'Neill, M., Hemberg, E., Gilligan, C., Bartley, E., McDermott, J., Brabazon, A.: GEVA: Grammatical Evolution in Java. SIGEVOlution (ACM) 3, 17–23 (2008)
7. Norman, P.: Genetic programming with context-sensitive grammars. PhD thesis, Saint Andrew's University (2002)
8. Wong, M.L., Leung, K.S.: Data Mining Using Grammar-Based Genetic Programming and Applications. Kluwer Academic Publishers (2000)
9. Montana, D.J.: Strongly typed genetic programming. Evolutionary Computation 3, 199–230 (1995)
10. McKay, R., Hoai, N.X., Whigham, P.A., Shan, Y., O'Neill, M.: Grammar-based Genetic Programming: A review. Genetic Programming and Evolvable Machines 11, 365–395 (2010)
11. McKay, R.I., Nguyen, X.H., Whigham, P.A., Shan, Y.: Grammars in Genetic Programming: A Brief Review. In: Progress in Intelligence Computation and Intelligence: Proceedings of the International Symposium on Intelligence, Computation and Applications, pp. 3–18 (2005)
12. Nohejl, A.: Grammar-Based Genetic Programming. MSc thesis, Charles University, Prague (2011)
13. Nohejl, A.: Grammatical Evolution. BSc thesis, Charles University, Prague (2009)
14. Freitas, A.A.: Data Mining and Knowledge Discovery with Evolutionary Algorithms. Springer (2002)
15. Liu, B.: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer (2011)
16. Pappa, G.L.: Automatically Evolving Rule Induction Algorithms using Grammar-based Genetic Programming. PhD thesis, University of Kent (2007)
17. Dempsey, I., O'Neill, M., Brabazon, A.: Foundations in Grammatical Evolution for Dynamic Environments. SCI, vol. 194. Springer, Heidelberg (2009)
18. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/
19. Oltean, M., Grosan, C.: A Comparison of Several Linear Genetic Programming Techniques. Complex Systems 14, 285–313 (2004)