Adaptability, interpretability and rule weights in fuzzy rule-based systems

Andri Riid (a), Ennu Rüstern (b)

(a) Laboratory of Proactive Technologies, Tallinn University of Technology, Ehitajate tee 5, 19086, Tallinn, Estonia, e-mail: [email protected]
(b) Department of Computer Control, Tallinn University of Technology, Ehitajate tee 5, 19086, Tallinn, Estonia, e-mail: [email protected]

Abstract

This paper discusses interpretability in the two main categories of fuzzy systems: fuzzy rule-based classifiers and interpolative fuzzy systems. Our goal is to show that high-level interpretability is more relevant to fuzzy classifiers, whereas fuzzy systems employed in modeling and control benefit more from low-level interpretability. We also discuss the interpretability-accuracy tradeoff and observe why the various rule weighting schemes that have been brought into play to increase the adaptability of fuzzy systems instead merely increase computational overhead and seriously compromise the interpretability of fuzzy systems.

Keywords: Fuzzy modeling, fuzzy control, classification, interpretability of fuzzy systems

1. Introduction

Fuzzy rule-based systems can be divided into two main categories. In the fields of modeling and control, interpolative fuzzy systems are usually employed because what is needed is a continuous output variable. The inference algorithm of interpolative fuzzy systems is based on rule interpolation, hence the name (1). In classification, on the other hand, the principal task of a fuzzy rule-based classifier is simply to assign a class label (the number of which is limited) to the sample presented to it. The inference algorithm relies on rule competition rather than on cooperation.
Not to be confused with an alternative concept of fuzzy interpolative systems [34].
Preprint submitted to Information Sciences, December 11, 2012
Interpretability of fuzzy systems, the ability to explain the behavior of a system in an understandable way, has attracted many researchers in recent years (see e.g. [2, 3, 4, 5, 9, 10, 11, 14, 17, 21, 30, 31, 33, 37, 39, 40, 41, 44, 50, 45, 65]). While the ultimate definition of interpretability with all its implications (interpretability measures and requirements) is still underway [17], a taxonomy of fuzzy system interpretability into low-level and high-level interpretability has been firmly established [17, 65]. Low-level interpretability issues can be traced back to fuzzy partition properties such as normality, coverage, convexity, distinguishability and complementarity. High-level interpretability, on the other hand, is associated with rule base properties, and in many studies high-level interpretability improvement essentially boils down to complexity reduction (i.e. reducing the number of variables, rules and conditions per rule). Some recent studies have included further semantic issues to deal with [6, 18, 43].

Our previous research [51, 55] has mostly focused on low-level interpretability aspects in interpolative fuzzy systems under the label of fuzzy system transparency. In the present paper we take a step further from [57] and show that these two levels of interpretability are not equally relevant to the two categories of fuzzy rule-based systems. In fuzzy interpolative systems, the high level of rule interaction dictates that partition properties are the primary concern from an interpretability viewpoint and complexity reduction is a background issue (Sect. 2). This is because fuzzy models and controllers cannot usually have a large number of variables without falling prey to the curse of dimensionality. In classification, on the other hand, it can be shown that due to specific characteristics of the inference algorithm, the low-level interpretability requirements appear in a milder formulation.
Interpretability improvement is therefore largely a matter of finding a small number of concise fuzzy rules with a limited number of conditions per rule while preserving a satisfactory balance between interpretability and accuracy (the interpretability-accuracy tradeoff). The roots of this approach date back to the 1990s [25, 26] and we revisit its main aspects in Sect. 4.

We also discuss the adaptability of fuzzy rule-based systems. Some interpolative fuzzy systems, such as Mamdani systems, greatly benefit from product-sum inference as it provides an analytical expression of the inference function and permits us to apply computationally efficient methods for the identification of consequent parameters (Sect. 3). In classification, rule weights are often introduced to improve the classification rate [15, 24, 29, 32, 38, 66] and
they are sometimes considered an improvement to the way in which the rules in fuzzy interpolative systems interact [12, 48, 63]. Our purpose, however, is to show that in both fuzzy system categories (see Sect. 5 and Sect. 6, respectively), the rate at which rule weights contribute to adaptability is often overestimated and their true identity is usually overlooked.

2. Interpolative fuzzy systems

It is generally acknowledged that of the two prevailing types of interpolative fuzzy systems, Mamdani systems are inherently more interpretable than Takagi-Sugeno (TS) systems (2). This is because Mamdani systems provide a better (more intuitive) mechanism for the integration of expert knowledge into the system, as fuzzy rules in Mamdani systems closely follow the format of natural languages and deal with fuzzy sets exclusively. These rules are based on the disjunctive rule format

IF x_1 is A_{1r} AND x_2 is A_{2r} AND ... AND x_N is A_{Nr} THEN y is B_r OR ...,   (1)
where A_{ir} denote the linguistic labels of the i-th input variable associated with the r-th rule (i = 1, ..., N; r = 1, ..., R), and B_r is the linguistic label of the output variable associated with the same rule. Each A_{ir} has its representation in the numerical domain, the membership function \mu_{ir} (the same applies to B_r, represented by \gamma_r), and in the general case the inference function that computes the fuzzy output F(y) of the system (1) has the following form

F(y) = \bigcup_{r=1}^{R} \Big( \Big( \bigcap_{i=1}^{N} \mu_{ir}(x_i) \Big) \cap \gamma_r \Big),   (2)

where \bigcup_{r=1}^{R} denotes the aggregation operator (corresponding to OR in (1)), \cap is the implication operator (THEN) and \bigcap_{i=1}^{N} is the conjunction operator (AND). In order to obtain a numerical output, (2) is generally defuzzified with the center-of-gravity method

y = Y_{cog}(F(y)) = \frac{\int_Y y F(y)\,dy}{\int_Y F(y)\,dy}.   (3)

(2) 1st and higher order TS systems are not considered in this paper but the reader may refer to a related study [53].
In the following, the activation degree of the r-th rule, the result of the conjunction operation in (2), is denoted as

\tau_r = \bigcap_{i=1}^{N} \mu_{ir}(x_i).   (4)
In a normal Mamdani system the number of membership functions (MFs) per i-th variable (S_i) is relatively small; this number is rarely equal to R as the notation style in (1) implies. Moreover, for the sake of coverage it is often desired that all possible unique combinations of input MFs are represented (R = \prod_{i=1}^{N} S_i). The MFs of the system are thus shared between the rules, and a separate R \times N dimensional matrix that accommodates the identifiers m_{ri} \in \{1, 2, ..., S_i\} maps the existing input MFs \mu_i^s to the rule slots. Unless unique output MFs are assigned to all rules, they also need some external allocation mechanism.

According to [55], Mamdani systems are subject to transparency constraints to ensure low-level interpretability. The most convenient way to satisfy the transparency constraints for the input MFs \mu_i^s (s = 1, ..., S_i; i = 1, ..., N) is to use the following definition

\mu_i^s(x_i) = \begin{cases} \frac{x_i - a_i^{s-1}}{a_i^s - a_i^{s-1}}, & a_i^{s-1} < x_i \le a_i^s \\ \frac{a_i^{s+1} - x_i}{a_i^{s+1} - a_i^s}, & a_i^s < x_i < a_i^{s+1} \\ 0, & \text{otherwise} \end{cases}   (5)

that defines the strong fuzzy partition [59]. One can see that (5) is almost a perfect embodiment of normality, coverage, distinguishability and complementarity. For the output MFs \gamma_r(y), however, the respective constraint is totally different

Y_{cog}(\gamma_r(y)) = \frac{\int_{y_{min}}^{y_{max}} y\,\gamma_r(y)\,dy}{\int_{y_{min}}^{y_{max}} \gamma_r(y)\,dy} = core(\gamma_r(y)),   (6)

and simply requires that \gamma_r must be symmetrical. The latter property is satisfied by default by symmetrical triangular MFs given by

\gamma_r(y) = \max\Big(\min\Big(\frac{2y - 2b_r + s_r}{s_r}, \frac{2b_r + s_r - 2y}{s_r}\Big), 0\Big),   (7)
where s_r is the width of \gamma_r and b_r = core(\gamma_r(y)) is its center. Utilization of (5) establishes a grid-like input partition, where system behavior at the rule centroids (\tau_r = 1) can be fully predicted from the interpretation of the rule base and the MFs (assuming that the \gamma_r are symmetrical). Consider the fuzzy controller in Figure 1, which is the Trajectory Management Unit from [52]. Between the rule centroids, the system output is interpolated (at most 2^N rules can contribute to the output at any given input vector x_k). However, because of the grid, it is difficult to model phenomena that are oblique to the axes. Moreover, for best results in data-driven modeling, it is assumed that we are able to provide enough data to cover the whole input space. The high space coverage requirement combined with the curse of dimensionality means that we are usually restricted to applications where the number of inputs is relatively small.
Figure 1: Grid-like partition of the input space and interpolation-driven inference. Rule centroids are denoted by small cubes.
There are many inference operators developed for Mamdani systems (a thorough study [13] counts over 40 different operators for fuzzy implication alone). Their role is to determine the character of the interpolation between rule centroids. Some researchers [62] find that we should define a specific t-norm/t-conorm for each specific application, properly catching the semantics of the problem. In practice, however, only a handful of them have found wider use (and perhaps rightfully so). The most common choices for the implication (and conjunction) operators are minimum and product, and for the aggregation operator usually maximum or sum is chosen.

3. Product-sum fuzzy systems

It is easy to show that if we fix product implication and sum aggregation, then (3) is significantly reduced

y = \frac{\int_{y_{min}}^{y_{max}} \sum_{r=1}^{R} \tau_r \gamma_r(y)\, y\, dy}{\int_{y_{min}}^{y_{max}} \sum_{r=1}^{R} \tau_r \gamma_r(y)\, dy} = \frac{\sum_{r=1}^{R} \tau_r C_r S_r}{\sum_{r=1}^{R} \tau_r S_r},   (8)

where
C_r = \frac{\int_{y_{min}}^{y_{max}} y\,\gamma_r(y)\,dy}{\int_{y_{min}}^{y_{max}} \gamma_r(y)\,dy},   (9)

is the center of gravity of the given \gamma_r and

S_r = \int_{y_{min}}^{y_{max}} \gamma_r(y)\,dy,   (10)
is its area. (8) implies that with the given inference operators it is sufficient to consider only the area and the center of gravity of \gamma_r. For example, with (7), C_r = b_r and S_r = s_r/2, thus (8) can be rewritten as

y = \frac{\sum_{r=1}^{R} \tau_r b_r s_r}{\sum_{r=1}^{R} \tau_r s_r}.   (11)

Note that if the \gamma_r are of equal width (\forall r, s_r = \xi, where \xi is an arbitrary positive constant), (11) reduces to

y = \frac{\sum_{r=1}^{R} \tau_r b_r}{\sum_{r=1}^{R} \tau_r},   (12)

which is the inference function of the well-known 0-th order Takagi-Sugeno systems, meaning, of course, that if the output MFs of the fuzzy system (11)
are of equal width, these can with no loss of generality be reduced to scalars (3). Furthermore, as \sum_{s=1}^{S_i} \mu_i^s(x_i) = 1 (the input transparency constraint), (12) further reduces to

y = \sum_{r=1}^{R} \tau_r b_r,   (13)

which is as basic as it can get.

Until recently, optimization of Mamdani systems such as (11) was limited to using derivative-free but computationally greedy population-based guided search methods such as evolutionary algorithms [16]. Thanks to a recently developed method [54] it is possible to identify s_r and b_r for (11) in a more efficient manner. To include the situations where the number of unique output MFs (T) is, or is expected to be, smaller than R (meaning that they are shared among the rules), let us introduce an R \times T allocation matrix M that maps the t-th output MF (t \in \{1, 2, ..., T\}) to the r-th rule if the element in the r-th row and t-th column of M is equal to one (only one such element is permitted per row). If there is no sharing of output MFs (T = R), M is an identity matrix and can be neglected. Using the notations

\Gamma = \begin{bmatrix} \tau_1(1) & \tau_2(1) & \dots & \tau_R(1) \\ \tau_1(2) & \tau_2(2) & \dots & \tau_R(2) \\ \dots & \dots & \dots & \dots \\ \tau_1(K) & \tau_2(K) & \dots & \tau_R(K) \end{bmatrix},   (14)

s_0 = [s_1, s_2, ..., s_T]^T, \quad b_0 = [b_1, b_2, ..., b_T]^T,   (15)

y = [y(1), y(2), ..., y(K)]^T,   (16)

where K, of course, is the number of training samples, we can show that if M, \Gamma, s_0 and y are known, a least squares solution to (11) that lacks an
(3) An early attempt to improve the tractability and computational properties of Mamdani systems [64] reduces (3) to (12) via the application of "weighted average" defuzzification, an approximate substitute of the center-of-gravity method with attendant information loss.
exact solution in terms of b_0 is given by

b_0 = pinv(\Gamma \cdot M \cdot diag(s_0)) \cdot diag(\Gamma \cdot M \cdot s_0) \cdot y,   (17)

where diag() denotes the operation which transforms a column vector (its argument) into a diagonal matrix and pinv() is the Moore-Penrose pseudoinverse [49] applied for matrix inversion. If M, \Gamma, b_0 and y are known and s_0 is unknown, we face the homogeneous matrix equation

(diag(y) \cdot \Gamma \cdot M - \Gamma \cdot M \cdot diag(b_0)) \cdot s_0 = 0.   (18)

To find a non-trivial solution to (18), we can apply singular value decomposition [20], i.e. find matrices U, \Sigma and V so that

U \cdot \Sigma \cdot V^T = diag(y) \cdot \Gamma \cdot M - \Gamma \cdot M \cdot diag(b_0),   (19)
where \Sigma is a diagonal matrix and U and V are orthogonal matrices. The least-squares solution is given by the column of V which corresponds to the smallest diagonal entry of \Sigma.

For the parameter identification of (11) we can start with (17) (4) and then apply (19) and (17) repeatedly until the solution converges (fewer than 10 iterations are needed), for as long as all elements of s_0 maintain the same sign (to preserve the physical meaning of s_r).

(4) For the first iteration, b_0 is identified with s_0 set to a T \times 1 vector of ones. If the goal is to obtain consequent parameters for a 0-th order TS system (12), a one-time application of (17) concludes the identification procedure.

To illustrate the basic characteristics of the algorithm we use a very simple regression example. We extract a fuzzy model of

y = \sin(2x - 0.7), \quad x \in [0, 1],   (20)

assuming that a three-rule model is sufficient to approximate the function. Two input MFs are fixed to the extremes of the input domain (to maintain the coverage property) and we let a_1^2 take values from [0, 1]. For each position of a_1^2, the set of output parameters (b_1, b_2, b_3, s_1, s_2 and s_3) is identified using (17) and (19). We also identify the corresponding 0-th order TS models. The curves of the modeling root mean square error (\epsilon) with respect
to the position of a_1^2 for both types of models are depicted in Figure 2, left hand side. We can see that, as expected, the error depends on how a_1^2 is positioned; however, the Mamdani model appears to be both more accurate and more robust than the corresponding 0-th order TS model. It is quite obvious that one additional parameter per rule offers increased adaptation potential. To exploit this potential fully, however, we need support from the method used to determine the input partition.
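To make the procedure concrete, here is a minimal NumPy sketch of the alternating (17)/(19) identification on example (20). This is our own illustrative reimplementation, not the authors' code: the breakpoint grid, sample count and iteration cap are invented, and T = R = 3 so the allocation matrix M is the identity and is omitted.

```python
import numpy as np

def activations(x, a):
    """Activation matrix Gamma (14) for a single input with the strong
    fuzzy partition (5) defined by sorted breakpoints a."""
    G = np.zeros((len(x), len(a)))
    for s in range(len(a)):
        if s > 0:
            m = (x > a[s - 1]) & (x <= a[s])
            G[m, s] = (x[m] - a[s - 1]) / (a[s] - a[s - 1])
        if s < len(a) - 1:
            m = (x > a[s]) & (x < a[s + 1])
            G[m, s] = (a[s + 1] - x[m]) / (a[s + 1] - a[s])
    G[x == a[0], 0] = 1.0
    return G

x = np.linspace(0, 1, 101)
y = np.sin(2 * x - 0.7)                        # target function (20)
G = activations(x, np.array([0.0, 0.5, 1.0]))  # a_1^2 placed at 0.5 here

s0 = np.ones(3)                                # footnote 4: unit widths at start
for _ in range(10):
    # (17): least-squares centers b_0 for the current widths s_0
    b0 = np.linalg.pinv(G @ np.diag(s0)) @ np.diag(G @ s0) @ y
    # (18)-(19): widths from the SVD of the homogeneous system
    _, _, Vt = np.linalg.svd(np.diag(y) @ G - G @ np.diag(b0))
    s_new = Vt[-1]                             # vector of the smallest singular value
    if not (np.all(s_new > 0) or np.all(s_new < 0)):
        break                                  # sign change: keep previous widths
    s0 = np.abs(s_new)

b0 = np.linalg.pinv(G @ np.diag(s0)) @ np.diag(G @ s0) @ y
y_hat = (G * s0 * b0).sum(axis=1) / (G @ s0)   # inference function (11)
print("RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))
```

With s_0 frozen at the vector of ones, the first application of (17) alone already yields the 0-th order TS consequents of (12), as footnote 4 notes.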
Figure 2: Left hand side: dependence between the position of a_1^2 and the modeling error for (20); Mamdani models (solid line), 0-th order TS models (dashed line). Right hand side: interpolation between neighboring rules in Mamdani systems (the dashed line depicts the linear interpolation obtained when s_{r-1} = s_r = s_{r+1}).
Interpretation of b_r in a fuzzy system follows naturally from the definition of transparency: each b_r is the value of y when \tau_r = 1. Interpretation of s_r, however, becomes ambiguous. As can be seen from Figure 2, right hand side, a larger s_r flattens the output by acting like a magnet that pulls the data samples between a_1^{r-1} and a_1^{r+1} toward b_r; the value of s_r could therefore be interpreted as the "magnetic" value of a rule. There are two reasons, though, why this (or any other) interpretation of s_r cannot be exploited very well. Firstly, unlike b_r, the absolute value of a given s_r is meaningless; what counts is its relative magnitude with respect to the widths of the output MFs of its neighboring rules (and remember that all s_r can be scaled up and down proportionally without any effect on the inferred y). Secondly, each rule interacts with up to 3^N - 1 neighboring rules (where N is the number of inputs), so we have to deal with an exponentially growing
set of parameters in this comparison as the number of inputs increases, and the whole effort quickly becomes impractical. It can also be argued that if M is not the identity matrix, the increased adaptation potential of (11) will be cancelled out by possibly contradictory requirements that derive from extensive sharing of consequent parameters (this, no doubt, influences b_r as well, but to a lower degree).

4. Fuzzy rule-based classifiers

A classifier is an algorithm that assigns a class label to an object, based on the object description. The object description comes in the form of a vector containing the values of the features (attributes) that are considered relevant for the classification task. Typically, the classifier learns to predict class labels using a training algorithm and a training data set (alternatively, when a training data set is not available, a classifier can be designed from prior knowledge and expertise). Once trained, the classifier is ready for operation on unseen objects.

A classifier is expected to group the examples presented to it into a small number of distinct classes that are labelled with discrete values (1, 2, ..., T), where T is the number of classes. Note that the actual numerical value assigned to a class is irrelevant; it just functions as a label. A fuzzy classifier is a fuzzy rule-based system that utilizes fuzziness only in the reasoning mechanism [35] (as opposed to algorithms that assign fuzzy class memberships to the objects, e.g. fuzzy clustering algorithms such as fuzzy c-means [8], Gustafson-Kessel [23] or Gath-Geva clustering [19]) and that consists of rules in the following format

IF x_1 is A_{1r} AND x_2 is A_{2r} AND ... AND x_N is A_{Nr} THEN y belongs to class c_r,
(21)
where c_r is the class assigned to the r-th rule (c_r \in \{1, ..., T\}) and A_{ir} denote the linguistic labels of the i-th feature associated with the r-th rule (i = 1, ..., N; r = 1, ..., R). Other than the phrasing of the consequent part, these rules look identical to (1); however, the underlying reasoning mechanism of a fuzzy rule-based classifier is usually implemented by the single-winner approach [27, 42, 58] that selects the class label c_r associated with the rule that provides the
highest rule activation degree (4) for the given feature vector

y = c_{r^*}, \quad r^* = \arg\max_{1 \le r \le R}(\tau_r).   (22)
(22) implies that proper classification rules do not cooperate in producing the output of the fuzzy system as in (2); instead they compete with each other.

The grid partition (5), useful for interpolative fuzzy systems, serves altogether three purposes. Firstly, it ensures a sufficient amount of interpolation between individual rules so as to provide a continuous and smooth output. This purpose is void in classifier systems because there is no need for a continuous output. Secondly, it isolates the rule centroids so as to ensure interpretability. Classifier rules, however, isolate themselves from each other naturally, due to the rule competition in (22). Thirdly, the grid partition also provides what is known as "global semantics" [7], meaning that MFs can be shared among the rules and therefore it is possible to assign meaningful linguistic labels to them ("big", "small", etc.).

However, more often than not, classification rules are created directly in the product space; thus the number of MFs \mu_{ir} is generally equal to the number of rules and the MFs are rarely shared. This practice has been driven by the character of classification problems: we typically deal with a high number of features/inputs and as a result the data samples are distributed unevenly over the input (hyper)space with low space coverage. Creating the rules directly in the product space helps us to concentrate on the relevant regions of the product space and keep the number of rules low. Obviously, this comes at a price: the loss of global semantics. Utilization of the grid partition (5), on the other hand, would mean that we would have to deal with all the problems related to the curse of dimensionality.

Consider the following example (Figure 3), where a two-feature five-rule classifier classifies the feature vectors into five classes (for the sake of simplicity we assume that each rule specifies a distinct class). The problem is simplistic because the classes in this classification problem do not overlap.
Nevertheless, even though the classes and corresponding rules do not overlap, the associated MFs on the individual features x_1 and x_2 are bound to do so. Does the loss of global semantics imply the loss of interpretability? Not necessarily. As useful as global semantics is from the viewpoint of interpretability, its absence does not prevent us from interpreting the rules and understanding the behavior of the system, at least not in classification.
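As an illustration of the single-winner mechanism (22), here is a minimal sketch of our own (the rules and MF parameters are invented, not taken from the paper), with the rules created directly in the product space as described above:

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular MF with feet a, c and peak b (cf. (23))."""
    if a < x < b:
        return (x - a) / (b - a)
    if b <= x < c:
        return (c - x) / (c - b)
    return 0.0

# One rule per class; each rule holds one (a, b, c) triple per feature,
# so MFs are unshared ("local semantics").
rules = [
    {"mfs": [(0.0, 0.2, 0.6), (0.4, 0.8, 1.0)], "cls": 1},
    {"mfs": [(0.4, 0.8, 1.0), (0.0, 0.3, 0.7)], "cls": 2},
]

def classify(x, conj=min):
    # activation degree (4): conjunction over the features,
    # then winner-takes-all rule competition (22)
    taus = [conj(tri(xi, *mf) for xi, mf in zip(x, r["mfs"])) for r in rules]
    return rules[int(np.argmax(taus))]["cls"]

print(classify([0.2, 0.8]))  # prints 1: deep inside rule 1's region
print(classify([0.8, 0.3]))  # prints 2: deep inside rule 2's region
```

Because only the arg max matters, the output stays the same however the losing activations are reshaped, which is the key difference from the interpolative inference (2).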
Figure 3: A five-rule fuzzy classifier (left hand side) and the rule view of the same classifier (right hand side).
In the rule view mode (right side of Figure 3) the rules of the system can easily be analyzed and interpreted. For example, looking at rule R_1, one can see that A_{21} only overlaps with A_{24} (which implies that R_1 does not overlap with rules R_2, R_3 and R_5). As A_{11}, on the other hand, does not overlap with A_{14}, we can conclude that R_1 does not overlap with any other rule and resides in the upper left corner of the product space. Such analysis can be carried out with the rest of the rules.

This, of course, fires back a little in terms of the number of parameters, because triangular MFs not belonging to the grid need to be defined with three parameters (a_{ir} < b_{ir} < c_{ir})

\mu_{ir}(x_i) = \begin{cases} \frac{x_i - a_{ir}}{b_{ir} - a_{ir}}, & a_{ir} < x_i < b_{ir} \\ \frac{c_{ir} - x_i}{c_{ir} - b_{ir}}, & b_{ir} \le x_i < c_{ir} \\ 0, & \text{otherwise} \end{cases}   (23)

and one might fear that classifier systems with unshared input MFs may be subject to overtraining, as they give the system additional degrees of freedom it might misuse. Overtraining, however, owes as much to the system configuration as to the algorithm that is used to optimize its parameters. Moreover, the outcome is heavily influenced by the way we separate the data into training and testing sets. The generalization ability is also tightly connected with the coverage
property; since the early studies (e.g. [1]), it has been recognized that MFs with non-compact supports perform better in this respect, and [55] provides a formula for replacing (23) with numerically near-equivalent double-sided Gaussian MFs that have the ability to classify samples even outside the original rule borders.

In numerical terms, the main advantage of fuzzy rule-based classifiers over conventional logic-based classification systems or fuzzy min-max neural networks [60] lies in their ability to provide oblique decision boundaries, which allows us to achieve high classification accuracy even with a small number of rules [36]. For example, the class distribution in Figure 4 can be modeled at different levels of granularity with no loss of accuracy. Non-axis-parallel decision boundaries are the result of rule overlap, and the shape of the boundary depends on the placement of the overlapping rules with respect to each other as well as on the inference (conjunction) operator. Minimum conjunction, for example, yields the angular decision borders that can be witnessed in Figure 4, as opposed to the smoother decision borders obtained with product conjunction. In consequence, many studies in fuzzy classification focus on improving the decision borders so as to obtain as high an accuracy as possible. However, rules that cover a lot of the feature space are less meaningful than concise ones, and a high overlap rate presents a further interpretability problem. Therefore it often makes sense to introduce a higher level of granulation even if it means that there will be more rules to interpret.
Figure 4: Reduced complexity and bloated rules (left hand side) vs. higher granularity (right hand side). Rule centroids are depicted with "+".
Emphasis is even stronger in a further, and perhaps the most effective, way to improve the interpretability of classifiers with many features: rule compression (5). Rule compression (rule-level feature selection) [3, 17, 22, 56] is a procedure that discards less relevant antecedent conditions from individual rules, resulting in shorter and more comprehensible rules that contain only the conditions that matter. Typically, rules that are more concise from the beginning are also more likely to be compressed at a higher rate with negligible accuracy loss. It would be unfair, however, not to mention that rule compression can have side effects: shorter rules are also more general, and the compression of a rule in one given dimension is in fact equivalent to the expansion of the rule in that very same dimension, which can create a conflict situation in the unexplored part of the input space and lead to incorrect classification of unseen samples (Figure 5).
Figure 5: Compressed rules “IF x1 is A11 THEN y belongs to class 1” and “IF x2 is A22 THEN y belongs to class 2” and their conflict.
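The conflict of Figure 5 can be reproduced with a toy computation (the MF parameters are invented for illustration): after compression each rule tests a single feature, so a sample in the previously unexplored corner activates both rules fully and the single-winner rule (22) is left to break a tie arbitrarily.

```python
def tri(x, a, b, c):
    """Triangular MF (23) with feet a, c and peak b."""
    if a < x < b:
        return (x - a) / (b - a)
    if b <= x < c:
        return (c - x) / (c - b)
    return 0.0

# Compressed rules, each with a single remaining condition (cf. Figure 5):
# R1: IF x1 is A11 THEN class 1      (x2 dropped, so R1 spans all of x2)
# R2: IF x2 is A22 THEN class 2      (x1 dropped, so R2 spans all of x1)
def tau1(x1, x2):
    return tri(x1, 0.0, 0.25, 0.5)

def tau2(x1, x2):
    return tri(x2, 0.0, 0.25, 0.5)

# A sample in the "conflict area": both rules are fully activated.
x = (0.25, 0.25)
print(tau1(*x), tau2(*x))  # both 1.0: the winner is decided by tie-breaking only
```

Before compression the two original rules also tested the dropped feature, so at most one of them could fire here; the conflict is purely an artifact of the expansion.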
(5) Some authors use the term "incompleteness" to indicate that some conditions (and variables) are missing from the rules.
5. Rule weights in fuzzy interpolative systems

In interpolative systems, rule weights can be applied to complete rules or only to the consequent parts of the rules [46]. In the first case, the corresponding weight w_r is used to modulate the activation degree of the rule and in the second case it is used to modulate the rule conclusion. Consider a weighted fuzzy system (r = 1, ..., R)

IF x_1 is A_{1r} AND ... AND x_N is A_{Nr} THEN y is P_r with w_r,   (24)

where P_r labels the output fuzzy set associated with the r-th rule (having the parameters p_r and z_r) and w_r is the rule weight. The interpretation of rule weights has long been a controversial issue [46], with many possible interpretations being tossed around (credibility, importance, influence, reliability, etc.). With product-sum inference, however, the inference functions corresponding to (24) appear as

y = \frac{\sum_{r=1}^{R} \tau_r w_r p_r z_r}{\sum_{r=1}^{R} \tau_r w_r z_r},   (25)

(if the weights are applied to complete rules) or

y = \frac{\sum_{r=1}^{R} \tau_r w_r p_r z_r}{\sum_{r=1}^{R} \tau_r z_r},   (26)
(if applied only to the rule consequents). It is easy to see that both (25) and (26) are equivalents of (11), with s_r = w_r z_r, b_r = p_r in the first case and b_r = w_r p_r, s_r = z_r in the second. The only principal difference between (24) and (1) is that even if the original MFs in (24) may have been shared between the rules, the modulated MFs are now unique to each rule (as were the weights w_r, for that matter). In fact, we can consider the s_r in (11) as rule weights applied to complete rules. A similar reduction is obtained if the weights are used in multiple-consequent rules, in which a rule appears as

IF x_1 is A_{1r} AND ... AND x_N is A_{Nr} THEN y is P_1 with w_{1r} AND P_2 with w_{2r} AND ... AND P_T with w_{Tr}.   (27)
The inference functions corresponding to (27) and the transformation formulas that reduce (27) to (11) are given in Table 1, only further emphasizing the redundancy of rule weights under the given circumstances.
Table 1: Transformation of multiple-consequent weighted fuzzy systems into (11)

Weights applied to rule conclusions:
  inference function: y = \frac{\sum_{r=1}^{R} \tau_r \sum_{j=1}^{T} w_{jr} p_j z_j}{\sum_{r=1}^{R} \tau_r \sum_{j=1}^{T} z_j}
  b_r = \frac{\sum_{j=1}^{T} w_{jr} p_j z_j}{\sum_{j=1}^{T} z_j}, \quad s_r = \sum_{j=1}^{T} z_j

Weights applied to complete rules:
  inference function: y = \frac{\sum_{r=1}^{R} \tau_r \sum_{j=1}^{T} w_{jr} p_j z_j}{\sum_{r=1}^{R} \tau_r \sum_{j=1}^{T} w_{jr} z_j}
  b_r = \frac{\sum_{j=1}^{T} w_{jr} p_j z_j}{\sum_{j=1}^{T} w_{jr} z_j}, \quad s_r = \sum_{j=1}^{T} w_{jr} z_j
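The reduction s_r = w_r z_r, b_r = p_r claimed for (25) can be verified with a quick numerical check (all values invented for illustration): the weighted system and the plain system (11) produce identical outputs.

```python
import numpy as np

tau = np.array([0.3, 0.7, 0.1])   # rule activation degrees
p = np.array([0.0, 1.0, 2.0])     # output-MF centers of the weighted system
z = np.array([0.4, 0.6, 0.5])     # output-MF widths of the weighted system
w = np.array([2.0, 0.5, 1.0])     # rule weights applied to complete rules

# Weighted inference function (25)
y_weighted = (tau * w * p * z).sum() / (tau * w * z).sum()

# Plain product-sum inference (11) with s_r = w_r * z_r, b_r = p_r
s, b = w * z, p
y_plain = (tau * b * s).sum() / (tau * s).sum()

assert np.isclose(y_weighted, y_plain)  # the weight is just a width rescaling
print(y_weighted)
```

In other words, a "weighted" Mamdani system is just an unweighted one with rescaled output-MF widths, so the weights add parameters without adding expressive power.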
6. Rule weights in fuzzy rule-based classifiers

Consider a weighted fuzzy classifier (r = 1, ..., R)

IF x_1 is A_{1r} AND ... AND x_N is A_{Nr} THEN y belongs to class c_r with w_r.   (28)

In the inference algorithm the rule weight is simply multiplied with the corresponding \tau_r

y = c_{r^*}, \quad r^* = \arg\max_{1 \le r \le R}(\tau_r \cdot w_r).   (29)
The effect of rule weighting (see [28] for an earlier take on the subject) is equivalent to multiplying each MF by w_r (with minimum conjunction) or by \sqrt{w_r} (with product conjunction over two features; w_r^{1/N} in general), which is a direct violation of the normality property. Note that the weights are as scalable as the s_r in (11); this derives directly from (29).

It is important to understand that with compact MFs such as (23), rule weights only influence the shape of the decision boundary between two or more rules in their overlap area. Furthermore, the weights have a noticeable effect on the decision boundaries only if the weights are substantially different (Figure 6). In general, the decision boundary between P overlapping rules (rules whose intersection is not an empty set) is a composite line consisting of up to P!/(2 \cdot (P-2)!) segments. Each such segment is in turn a part of the decision boundary between two overlapping rules, say R_p and R_q, defined by

\tau_p \cdot w_p = \tau_q \cdot w_q, \quad p \ne q.   (30)
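That the weights in (29) are scalable, i.e. that only their relative magnitudes matter, can be checked with a toy computation (the activation degrees and weights are invented): multiplying all weights by a common positive constant changes no winner and hence no decision boundary (30).

```python
import numpy as np

tau = np.array([0.35, 0.60, 0.20])      # activation degrees of three rules
w = np.array([1.5, 0.8, 2.0])           # rule weights

winner = np.argmax(tau * w)             # weighted single-winner inference (29)
for c in (0.1, 3.0, 42.0):              # rescale all weights uniformly
    assert np.argmax(tau * (c * w)) == winner
print("winner:", winner)                # only relative weight magnitudes matter
```

This mirrors the observation made for (11): one degree of freedom in the weight vector is always redundant.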
What one should not forget, however, is that rule weighting is a symmetrical procedure (the corresponding MFs are affected in all dimensions);
Figure 6: Decision boundaries with different weight assignments. Here one can see the smoother decision boundaries characteristic of product conjunction.
there can be several isolated overlap areas; consequently, fine-tuning the decision boundary with the help of rule weights in one such area may improve classification accuracy but may easily degrade performance in another overlap area.

For multiple-consequent rules

IF x_1 is A_{1r} AND ... AND x_N is A_{Nr} THEN y belongs to c_1 with w_{1r}, to c_2 with w_{2r}, ..., to c_T with w_{Tr},   (31)
usually the voting inference [29] is applied

y = c_{j^*}, \quad j^* = \arg\max_{1 \le j \le T}(V_j),   (32)

where

V_j = \sum_{r=1}^{R} \tau_r w_{jr}.   (33)
Comparing (32) to (22), it is easy to see that instead of individual rules, the classes themselves compete with each other in (32). The decision boundary segments are thus defined by

\sum_{r=1}^{R} \tau_r w_{jr} = \sum_{r=1}^{R} \tau_r w_{kr}, \quad k \ne j.   (34)
Let us subtract a positive constant \bar{w}_r from both w_{jr} and w_{kr}:

\sum_{r=1}^{R} \tau_r (w_{jr} - \bar{w}_r) = \sum_{r=1}^{R} \tau_r (w_{kr} - \bar{w}_r).   (35)
It is obvious that the \bar{w}_r on both sides of the equation cancel each other out and (35) reduces back to (34). In summary, this means that one of the rule weights per rule in (31) can always be reduced to zero by choosing \bar{w}_r = \min_{1 \le j \le T}(w_{jr}), and therefore the effect of voting becomes apparent only
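This cancellation is easy to confirm numerically (a small sketch with invented weights): subtracting a per-rule constant from every class weight of that rule shifts all class votes V_j by the same amount and leaves the winner of (32) unchanged.

```python
import numpy as np

tau = np.array([0.5, 0.9, 0.2])          # activations of R = 3 rules
W = np.array([[0.7, 0.2, 0.5],           # W[j, r]: weight of class j in rule r
              [0.4, 0.8, 0.3],
              [0.9, 0.6, 0.1]])

V = W @ tau                              # class votes (33)
W_reduced = W - W.min(axis=0)            # one zero weight per rule (column)
V_reduced = W_reduced @ tau

# The per-rule shift moves every class vote by the same amount ...
assert np.allclose(V - V_reduced, (W.min(axis=0) * tau).sum())
# ... so the winning class of (32) is unchanged.
assert np.argmax(V) == np.argmax(V_reduced)
```

As (35) predicts, a full T weights per rule overstate the adaptability of (31): at least one weight per rule is always redundant.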