EUSFLAT - LFA 2005
Fuzzy Box-plot based induction rules. Towards automatic generation of classes-interpretations

Karina Gibert
Dep. Statistics and Operations Research, Technical University of Catalonia, C. Pau Gargallo 5, 08028 Barcelona, Spain

Alejandra Pérez-Bonilla
Becaria CONICYT, Dep. Statistics and Operations Research, Technical University of Catalonia, Spain

Abstract

Many Knowledge Discovery problems involve a clustering process to discover the underlying structure of the target domain. Later, interpretation of the clusters is required in order to understand the discovered structure. This interpretation becomes more difficult as the number of classes and the number of available variables increase. In this paper a method for determining the class of a new object, given a previously known set of classes, is presented. Starting from a previous work based on the combination of Artificial Intelligence techniques with statistical ones (boxplot-based induction rules), a fuzzy membership function to every class is calculated and graphically represented. The presented formalism is also useful for automatic generation of conceptual interpretations of a given set of classes. A real application of the proposal to typical situations in a Catalan wastewater treatment plant is also presented to illustrate the performance of the method.

Key Words: Knowledge Discovery, Knowledge Base, Conceptual Interpretation of the results, Rules generation, Boxplot, Partition, Wastewater Treatment Plant.

1 Introduction

Clustering is one of the most used techniques in Knowledge Discovery (KD). Indeed, according to [7], most KD processes either require a clustering or can be reduced to it, as is the case of the target domain presented in this paper. As said before, once the clusters are identified, the next step is to interpret the results.

In previous works the method of boxplot-based induction rules (BbIR) [?] for inducing characterizations of a given partition was presented. It is a hybrid method combining AI techniques and statistical ones, proposed for automatically generating descriptions of a given partition, in order to make its interpretation and prediction easier, as well as to contribute to KD in the target domain. The method included a graphical, and crisp, representation of the membership function of a certain variable to every class. A very simple transformation moves this function to a fuzzy paradigm.

The aim of this work is to present a modification of BbIR as a tool for identifying the class of a given object, using numerical variables. By combining inductive learning and statistical tools, fuzzy membership functions of a day to every class are found, which model reality better. Later, automatic production of a conceptual description of those classes will be addressed. Once identified and understood by the user, these typical situations may be used afterwards for supporting the decision-making process. This decision making may be either automatic or not.

The main goal of wastewater treatment plants is, in general terms, to guarantee the outflow water quality (referred to certain legal requirements), in order to restore the natural environmental balance which is disturbed by industry wastes, domestic wastewaters, etc. When the plant is not in normal operation, which is extremely difficult to model by traditional mechanistic models, decisions have to be taken to modify some parameters of the wastewater treatment process in order to reestablish normality as soon as possible. This process is very complex, on the one hand because of the intrinsic features of wastewater and, on the other hand, because of the bad consequences of an incorrect management of the plant [?].
Figure 1: Wastewater treatment diagram (diagram labels: primary treatment, secondary treatment, first settler outlet, third settler, flow, aeration tank, discharged water, recycled water).

In this paper, our proposal is applied to real data from a Catalan wastewater treatment plant, to identify the class of a new day from a set of previously identified situations [?]. A very brief description of the process in the plant is presented: the water flows sequentially through three or four processes, which are commonly known as pretreatment, primary, secondary, and advanced treatment (see [?] for a detailed description of the process). Figure ?? depicts its general structure:

(i) In the pretreatment, an initial separation of solids from wastewater is performed.
(ii) Primary treatment consists of leaving the wastewater in a settler for some hours. Solids deposit at the bottom of the settler and can be removed.
(iii) Secondary treatment occurs inside a biological reactor. A population of microorganisms (biomass) degrades the organic matter dissolved in the wastewater.
(iv) In the advanced treatment another settler is used to separate the water from the biomass.

The settler output (solids or biomass) produces a kind of mud which is the input of another process called sludge line.

The paper is organized as follows: in §?? previous works are presented. In §?? a general view of the proposed method is presented. In §?? the application of this method to the wastewater treatment plant data is detailed. Finally, conclusions and future work come in §??.
2 Previous work
Given the results of a clustering process, the next step towards final knowledge induction for supporting decision-making is to interpret the real meaning of each class. In the context of statistics, a manual interpretation is usually done, in close connection with the domain experts, on the basis of the statistics provided by statistical software. Trying to parallel the analysis done by experts to support this phase, a procedure for qualitative variables was first proposed in [?, ?]. The main idea was to automatically analyze conditional distributions to identify characteristic variables [?, ?], that is, variables with exclusive values in a class. Next, the extension to numerical variables, very relevant in many domains, as is the case of WWTP, was addressed [?]. Since in this case experts required graphical representations for interpreting the classes, the starting point of the presented method was to study multiple boxplots. In [?] BbIR is presented as a method which produces a probabilistic knowledge base to describe classes. In [?] a comparison between a very early version of the method and other inductive methods showed that BbIR appeared as a good imitation of the real process used by experts to manually interpret the classes. It also confirmed that modifying the method to provide more flexibility would noticeably improve its performance [?]. Here a fuzzy version of the method is presented [?], to avoid big changes in probabilities which do not correctly model the real situation. In this version some rules have a variable degree of certainty depending on $x_{ik}$.
3 The Methodology
The standard input of a clustering algorithm is a data matrix with the values of $K$ variables $X_1, \dots, X_K$ (numerical or not) observed over a set $I = \{1, \dots, n\}$ of individuals. Variables are in columns and individuals in rows. Cells contain the value $x_{ik}$ taken by individual $i \in I$ for variable $X_k$ ($k = 1 : K$). A partition of $I$ in $\xi$ classes is denoted by $P_\xi = \{C_1, \dots, C_\xi\}$. Thus, $P_2 = \{C_1, C_2\}$ is a binary partition of $I$.

3.1 Multiple boxplot

Figure ?? displays the multiple boxplot of a given variable through a set of 4 classes. In short, a multiple boxplot [?] shows the relationship between a common numerical variable and some groups. For each class the interval of values taken by the variable is visualized, and rare observations (outliers) are marked as "◦" and "*". A box is displayed from Q1 (first quartile) to Q3 (third quartile), and the median, usually inside the box, is marked with a vertical sign. Boxes include, then, 50% of the elements of the class.

Figure 2: left: Multiple box-plot of variable Q-E (inflow wastewater) through a 4-classes partition. right: Variable windows for inducing a coding of the numerical variable.

With this representation it is easy to see whether there is a class in which some variable has special values. In the figure, this happens for class 4 (see Figure ??). This fits the definition of characterizing variable for class $C$ originally given in [?], and reproduced below.

3.2 Boxplot-based induction rules

A short description of the method [?] is presented:

1. For every $C \in P$ calculate the minimum and maximum of $X_k$:
$$x^m_C = \min(X_k \mid C) = \min_{i \in C} \{x_{ik}\}, \qquad x^M_C = \max(X_k \mid C) = \max_{i \in C} \{x_{ik}\}$$

2. Build a new set with all the extreme values calculated in 1: $E = \{x^m_C, x^M_C : C \in P\}$

3. Sort $E$ in increasing order into $E^* = (e^*_1, \dots, e^*_{2\xi})$.

4. Build a coding of $X_k$ using as cutpoints the elements in $E^*$: $I^k = \{I^k_s\}$, where $I^k_s = [e^*_s, e^*_{s+1})$, $\forall s \in \{1, \dots, 2\xi - 1\}$, and
$$\bigcup_{s=1}^{2\xi-1} I^k_s = [e^*_1, e^*_{2\xi}) = [\min_i x_{ik}, \max_i x_{ik})$$

5. Build the table of frequencies of classes conditioned to intervals:
$$
\begin{array}{c|cccc}
P \mid I^k & I^k_1 & I^k_2 & \cdots & I^k_{2\xi-1} \\ \hline
C_1 & f_{11} & f_{21} & \cdots & \\
C_2 & & & & \\
\vdots & & & f_{sc} & \\
C_\xi & & & & f_{(2\xi-1)\xi} \\ \hline
 & 1 & 1 & \cdots & 1
\end{array}
$$
where
$$f_{sc} = \frac{\mathrm{card}\{i : i \in C \;\&\; x_{ik} \in I^k_s\}}{\mathrm{card}\{i : x_{ik} \in I^k_s\}}$$
is the proportion of the individuals falling in interval $I^k_s$ that belong to class $C$, so that every column adds up to 1.

6. Identify the conditional empirical frequencies with the degrees of certainty of the rules. This may be graphically represented and used as an interpretation supporting tool. Fig. ?? shows the degrees of membership of variable Q-E in a partition of two classes.

7. For every non-null cell of table $P \mid I^k$ build a probabilistic rule in the following way:
$$R = \{r : x_{ik} \in I^k_s \xrightarrow{f_{sc}} C : \forall f_{sc} > 0\}$$

8. If crisp decisions are required, choose a certainty level $\alpha$, remove from $R$ all the rules with a degree of certainty less than $\alpha$, and get an automatic interpretation of $P$ at level $\alpha$.

3.3 Fuzzy Boxplot-based induction rules

Of course characteristic values are found for $f_{sc} = 1$ and are quickly visualized in the figure. It is also clear that by smoothing the bounds of Fig. ?? (for example, by eliminating 5% of the horizontal segments per side, followed by linear interpolation of the function gap) the interpretation turns into a fuzzy paradigm, and induction of linguistic labels is also suitable. Our proposal consists of modifying the previous method by using trapezoidal bounds for the induced intervals. So, the key is to modify the cutting of the variable, by introducing intermediate points near the boundaries where smoothing of the membership function can be applied.

1. Reduce the horizontal edges of the membership function by redefining $I^k_s$ as:
$$I'^k_s = [e^*_s + \delta_s, \; e^*_{s+1} - \delta_s), \quad \forall s \in \{1, \dots, 2\xi - 1\}, \quad \text{with } \delta_s = 0.05\,(e^*_{s+1} - e^*_s).$$

2. Enlarge $I^k$ with the following intervals:
$$J^k_s = [e^*_s - \delta_{s-1}, \; e^*_s + \delta_s), \quad \forall s \in \{2, \dots, 2\xi - 1\}$$

3. Modify the antecedents of the set of rules by using the new intervals:
$$R = \{r : x_{ik} \in I'^k_s \xrightarrow{f_{sc}} C : \forall f_{sc} > 0\}$$

4. Enlarge the set of rules by adding a new rule for each element in $J^k_s$:
$$R = \{r : x_{ik} \in I'^k_s \xrightarrow{f_{sc}} C : \forall f_{sc} > 0\} \;\cup\; \{r : x_{ik} \in J^k_{s+1} \xrightarrow{f(x_{ik},\, f_{sc},\, f_{s+1,c})} C : \forall f_{sc} > 0\}$$
where
$$f(x, f_{sc}, f_{s+1,c}) = f_{sc} + \frac{x - e^*_{s+1} + \delta_s}{\delta_s + \delta_{s+1}}\,(f_{s+1,c} - f_{sc})$$
interpolates linearly, over $J^k_{s+1} = [e^*_{s+1} - \delta_s, \; e^*_{s+1} + \delta_{s+1})$, between the degrees of certainty of the two adjacent intervals.

4 The Application

4.1 The target domain

The data analyzed in this paper come from the Wastewater Treatment Plant of Girona (Spain). It is a sample of 396 observations taken from September 1, 1995 to September 30, 1996. Each observation refers to a daily mean, and it is identified by the date itself. The state of the plant is described through a set of 25 variables, considered the most relevant according to the experts' opinion. They can be grouped as:

(i) Input (measures taken at the entrance of the plant): Q-E: Inflow wastewater (daily m3 of water); FE-E: Iron pre-treatment (g/l); pH-E: pH; SS-E: Suspended solids (mg/l); SSV-E: Volatile suspended solids (mg/l); COD-E: Chemical organic matter (mg/l); BOD-E: Biodegradable organic matter (mg/l).
(ii) After settler (measures taken when the wastewater comes out of the first settler): PH-D: pH; SS-D: Suspended solids (mg/l); SSV-D: Volatile suspended solids (mg/l); COD-D: Chemical organic matter (mg/l); BOD-D: Biodegradable organic matter (mg/l).
(iii) Biological treatment (measures taken in the biological reactor): Q-B: Biological reactor flow; V30: Index at the biological reactor (measures the sedimentation quality of the mixed liquor, in ml/l); MLSS-B: Mixed liquor suspended solids at the biological reactor; MLVSS-B: Mixed liquor volatile suspended solids; MCRT-B: Mean cell residence time at the biological reactor.
(iv) Output (when the water is meeting the river): PH-S: pH; SS-S: Suspended solids (mg/l); SSV-S: Volatile suspended solids (mg/l); COD-S: Chemical organic matter (mg/l); BOD-S: Biodegradable organic matter (mg/l).
(v) Other variables: QR-G: Recirculated flow; QP-G: Purged flow; QA-G: Air inflow.

Figure ?? shows where all the measures are taken along the plant. It is easy to see that this set is heterogeneous; usually there are missing values, and the sensors may provide noisy data.
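Before the method is applied to these data, the crisp induction steps of Section 3.2 (class extremes, sorted cutpoints, table of conditional frequencies, rules, and the optional cut at level α) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `bbir_rules` and `cut_rules`, the toy data, and the convention for interval ends (first interval closed, the remaining ones open on the left and closed on the right) are assumptions, not part of the original description. It also assumes a single numerical variable without missing values and distinct class extremes.

```python
from bisect import bisect_left
from collections import Counter, defaultdict


def bbir_rules(values, labels):
    """Crisp boxplot-based induction for one numerical variable.

    values: observed values x_ik of the variable; labels: class of each individual.
    Returns (cuts, rules), where rules maps (interval index, class) to f_sc.
    """
    # Steps 1-3: per-class minimum and maximum, pooled and sorted into E*.
    by_class = defaultdict(list)
    for x, c in zip(values, labels):
        by_class[c].append(x)
    cuts = sorted(v for xs in by_class.values() for v in (min(xs), max(xs)))

    # Step 4: code each value by the interval it falls in.
    # Assumed convention: [e1, e2], (e2, e3], (e3, e4], ...
    def interval_of(x):
        return max(bisect_left(cuts, x) - 1, 0)

    # Step 5: frequencies of classes conditioned to intervals.
    per_interval = Counter(interval_of(x) for x in values)
    per_cell = Counter((interval_of(x), c) for x, c in zip(values, labels))

    # Steps 6-7: one rule per non-null cell, with degree of certainty f_sc.
    rules = {cell: n / per_interval[cell[0]] for cell, n in per_cell.items()}
    return cuts, rules


def cut_rules(rules, alpha):
    """Step 8: keep only the rules whose degree of certainty is at least alpha."""
    return {cell: f for cell, f in rules.items() if f >= alpha}


# Toy data (hypothetical, not taken from the plant).
values = [1.0, 2.0, 3.0, 2.5, 4.0, 5.0]
labels = ["A", "A", "A", "B", "B", "B"]
cuts, rules = bbir_rules(values, labels)
for (s, c), f in sorted(rules.items()):
    print(f"interval {s} -> class {c} with certainty {f:.2f}")
```

In the toy data the two classes overlap only in the first interval, so that interval yields two uncertain rules while the other two intervals yield rules with certainty 1; calling `cut_rules(rules, 0.5)` would keep a single rule per interval.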
4.2 Crisp Analysis
For the presented data, $P_2 = \{C392, C393\}$ contains 2 classes. The multiple boxplot of Q-E versus $P_2$ is shown in Figure ??. Descriptive statistics by class can be computed for every one of the variables. Those regarding Q-E can be seen in Table ??.

Figure 3: Multiple boxplot of Q-E versus $P_2$ (classes C392 and C393; horizontal axis: Q-E).

Table 1: Summary statistics for Q-E versus $P_2$ (N = 396).

Q-E             C393           C392
n_c             390            6
mean            42,112.9453    22,563.7988
S               4,559.2437     1,168.8481
min             29,920         20,500
max             54,088.6016    23,662.9004
N*              1              0

Using BbIR as presented in §??, the following set of intervals is found:
$$I^{Q\text{-}E} = \{I_1^{Q\text{-}E} = [20500.0, 23662.9], \; I_2^{Q\text{-}E} = (23662.9, 29920.0], \; I_3^{Q\text{-}E} = (29920.0, 54088.6]\}.$$
Table ?? shows the distribution of the observations across the classes and the intervals simultaneously. From it, the following reduced system of probabilistic rules can be induced for the variable Q-E:
Table 2: Crossing of classes versus intervals.

        C393    C392
I1      0       6
I2      1       0
I3      388     0

$r_{1,2}: x_{Q\text{-}E,i} \in [20500.0, 23662.9] \xrightarrow{1.0} C392$
$r_{2,1}: x_{Q\text{-}E,i} \in (23662.9, 29920.0] \xrightarrow{1.0} C393$
$r_{3,1}: x_{Q\text{-}E,i} \in (29920.0, 54088.6] \xrightarrow{1.0} C393$

In fact, this information can also be represented graphically in a degrees-of-membership diagram (see Figure ??). Since Q-E is a totally characterizing variable (it produces a set of certain rules), it is enough to associate the concepts $x_{Q\text{-}E,i} \in [20500.0, 23662.9]$ with C392 and $x_{Q\text{-}E,i} \in (23662.9, 54088.6]$ with C393 as an interpretation of the classes.

Figure 4: Membership degrees for Q-E vs $P_2$ (one panel per class, C393 and C392; vertical axis from 0 to 100; horizontal axis: Q-E).

However, in a number of cases totally characterizing variables may not be found, and then it is much more suitable to move to a fuzzy paradigm.

4.3 Fuzzy membership functions

Following the proposal from [?] (§??), a complete set of rules induced from QA-G is:

$r_1: x_{QA\text{-}G,i} \in [124120.0, 136371.0] \xrightarrow{0.67} C393$
$r_2: x_{QA\text{-}G,i} \in (136371.0, 324470.0] \xrightarrow{0.18} C393$
$r_3: x_{QA\text{-}G,i} \in (324470.0, 367840.0] \xrightarrow{0} C393$
$r_4: x_{QA\text{-}G,i} \in [124120.0, 136371.0] \xrightarrow{0.33} C392$
$r_5: x_{QA\text{-}G,i} \in (136371.0, 324470.0] \xrightarrow{0.82} C392$
$r_6: x_{QA\text{-}G,i} \in (324470.0, 367840.0] \xrightarrow{1} C392$

Regarding this variable, the concept $x_{QA\text{-}G,i} \in [124120.0, 136371.0]$ can be induced for C393 and $x_{QA\text{-}G,i} \in (136371.0, 367840.0]$ for C392, but this is an uncertain interpretation, since it is based on probabilistic rules with $p < 1$.

Figure 5: Membership degrees for QA-G vs $P_2$ (one panel per class, C393 and C392; vertical axis from 0 to 100; horizontal axis: QA-G, from 124120 to 367840).

Fig. ?? shows the membership function, which can be easily transformed to a fuzzy paradigm by following the proposal presented in §??. First of all, the limits of the intervals are modified and the $J^k_s$ are included. Then sloped lines for these new intervals are calculated, and the final set of rules is enlarged with those corresponding to the $J^k_s$.

Figure 6: Fuzzy membership degrees for QA-G (one panel per class, C393 and C392; vertical axis from 0 to 100; horizontal axis: QA-G, from 124120 to 367840).

In this way, the following system of probabilistic rules can be induced for the variable QA-G:
$r_1: x_{QA\text{-}G,i} \in [124120.0, 129552.0] \xrightarrow{0.67} C393$
$r_2: x_{QA\text{-}G,i} \in (129552, 143189.55] \xrightarrow{0.33\,x_{ik} - 42752.16} C392$
$r_3: x_{QA\text{-}G,i} \in (143189.55, 308246.5] \xrightarrow{0.18} C393$
$r_4: x_{QA\text{-}G,i} \in (308246.5, 340693.5] \xrightarrow{0.82\,x_{ik} - 252763.84} C392$
$r_5: x_{QA\text{-}G,i} \in [124120.0, 129552.0] \xrightarrow{0.33} C392$
$r_6: x_{QA\text{-}G,i} \in (129552, 143189.55] \xrightarrow{0.67\,x_{ik} - 86795.18} C393$
$r_7: x_{QA\text{-}G,i} \in (143189.55, 308246.5] \xrightarrow{0.82} C392$
$r_8: x_{QA\text{-}G,i} \in (308246.5, 340693.5] \xrightarrow{0.82\,x_{ik} - 252760.42} C393$
$r_9: x_{QA\text{-}G,i} \in (340693.5, 365840.0] \xrightarrow{1.0} C392$
$r_{10}: x_{QA\text{-}G,i} \in (365840.0, 367840] \xrightarrow{0.99\,x_{ik} - 340680.94} C392$

From this new interpretation the class of a given new day is very easy to evaluate. Degrees of certainty of belonging to one class or the other are calculated by following the generated set of rules, or directly on the graph.
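To illustrate how such degrees could be evaluated for a new day, the following Python sketch builds a trapezoidal membership function from the crisp cutpoints and degrees of certainty of QA-G. It is a sketch under stated assumptions, not the authors' implementation: the function name `fuzzy_degree` is hypothetical, the margin is taken as 5% of the length of each interval, as defined in Section 3.3, so the breakpoints it produces need not coincide with those printed in the rules above, and the flat parts of the first and last intervals are assumed to reach the ends of the observed range.

```python
def fuzzy_degree(x, cuts, freqs, frac=0.05):
    """Trapezoidal membership degree of x for one class.

    cuts: sorted cutpoints e*_1 ... e*_{2 xi}; freqs: crisp degree f_sc of the
    class on each interval between consecutive cutpoints; frac: share of each
    interval given up at each side (5% in the text).
    """
    n = len(freqs)
    if len(cuts) != n + 1:
        raise ValueError("need one more cutpoint than intervals")
    if x < cuts[0] or x > cuts[-1]:
        return 0.0  # assumption: no membership outside the observed range
    deltas = [frac * (cuts[s + 1] - cuts[s]) for s in range(n)]

    # Sloped zones J around each interior cutpoint: linear interpolation
    # between the degrees of the two adjacent intervals.
    for s in range(1, n):
        lo, hi = cuts[s] - deltas[s - 1], cuts[s] + deltas[s]
        if lo <= x < hi:
            return freqs[s - 1] + (x - lo) / (hi - lo) * (freqs[s] - freqs[s - 1])

    # Otherwise x lies on the flat part of some interval.
    # Assumption: the plateaus of the first and last interval reach the ends.
    s = max(i for i in range(n) if cuts[i] <= x)
    return freqs[s]


# Cutpoints and crisp degrees of certainty of QA-G, taken from rules r1 to r6.
cuts = [124120.0, 136371.0, 324470.0, 367840.0]
certainty = {"C393": [0.67, 0.18, 0.0], "C392": [0.33, 0.82, 1.0]}

for x in (130000.0, 136371.0, 200000.0, 360000.0):
    degrees = {c: round(fuzzy_degree(x, cuts, f), 3) for c, f in certainty.items()}
    print(x, degrees)
```

Because both classes are interpolated over the same sloped zones, the two degrees still add up to 1 for every value inside the observed range, which keeps the probabilistic reading of the rules.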
5 Conclusions and future work
In this paper a method for automatically generating a fuzzy membership function of a certain variable regarding a given set of classes is presented. It consists of an algebraic modification of BbIR, previously presented. Extension to a fuzzy paradigm is immediate, by fuzzifying the limits of the generated intervals in such a way that the membership function is no longer crisp (see Fig. ??). These tools are very promising for automatic generation of conceptual characterizations taking uncertainty into account. The proposal has been successfully applied to real data coming from a wastewater treatment plant. The next step is to take advantage of the hierarchical structure of the reference clustering in order to build the concepts associated with every class.

Acknowledgments

This research was partially financed by TIC2003.
References

[1] Metcalf and Eddy. Wastewater Engineering: Treatment, Disposal, Reuse. 4th ed., revised by George Tchobanoglous and Franklin L. Burton. McGraw-Hill, New York, 2003.
[2] J. Comas, S. Dzeroski, K. Gibert, I. Roda, and M. Sànchez-Marrè. Knowledge discovery by means of inductive methods in wastewater treatment plant data. AI Communications. The European Journal on Artificial Intelligence, 14(1):45–62, March 2001.
[3] K. Gibert. The use of symbolic information in automation of statistical treatment for ill-structured domains. AI Communications, 9(1):36–37, March 1996.
[4] K. Gibert. Técnicas híbridas de Inteligencia Artificial y Estadística para el descubrimiento de conocimiento y la minería de datos. Red Nacional de Minería de Datos y Aprendizaje, 2004.
[5] K. Gibert and U. Cortés. Generació automàtica de regles a partir de la caracterització de classes. Boletín de la Asociación Catalana de Inteligencia Artificial, 14-15:193–203, 1998.
[6] K. Gibert and I. Roda. Identifying characteristic situations in wastewater treatment plants. In Workshop in Binding Environmental Sciences and Artificial Intelligence, volume 1, pages 1–9, 2000. ECAI.
[7] K. Gibert and A. Salvador. Aproximación difusa a la identificación de situaciones características en el tratamiento de aguas residuales. In X Congreso Español sobre tecnologías y lógica fuzzy (ESTYLF 2000), pages 497–502, España, September 2000.
[8] A. Pérez-Bonilla and K. Gibert. Clasificación de Algunas Plantas Depuradoras de Aguas Residuales de Cataluña. Research DR 2004/13, Universidad Politécnica de Cataluña, Barcelona, España, November 2004.
[9] K. Gibert, T. Aluja, and U. Cortés. Knowledge Discovery with Clustering Based on Rules. Interpreting Results. In Principles of Data Mining and Knowledge Discovery, volume 1510 of LNAI, pages 83–92, Nantes, 1998. Springer-Verlag.
[10] K. Gibert, U. Cortés, and S. Sonickis. Using rules as a bias mechanism in Knowledge Discovery with clustering. In Workshop on causal networks: From inference to data mining (IBERAMIA), pages 47–58, Portugal, 1998.
[11] J. M. Gimeno, J. Béjar, M. Sànchez-Marrè, and U. Cortés. Discovering and modelling process change: An application to industrial processes. In Practical Applications of Data Mining and Knowledge Discovery, 1998.
[12] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley, 1977.
[13] I. R-Roda, M. Poch, M. Sànchez-Marrè, U. Cortés, and J. Lafuente. Consider a case-based system for control of complex processes. Chemical Engineering Progress, 1999.
[14] M. Sànchez-Marrè et al. Concept formation in WWTP by means of classification techniques: A compared study. Applied Intelligence, 1997.