A Rough Set Framework for Data Mining of Propositional Default Rules

Torulf Mollestad
Dept. of Computer Systems and Telematics, Institute of Computer Science, The Norwegian Institute of Technology, 7034 Trondheim, Norway
[email protected]

Andrzej Skowron
Institute of Mathematics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland
[email protected]

Abstract

As the amount of information in the world is steadily increasing, there is a growing demand for tools for analysing the information. In this paper we investigate the problem of data mining, that is, constructing decision rules from a set of primitive input data. The main contention of the present work is that there is a need to be able to reason also in the presence of inconsistencies, and that more general, possibly unsafe rules should be made available through the data mining process. Such rules are typically simpler in structure, and allow the user to reason in the absence of information. A framework is suggested for the generation of propositional default rules that reflect normal intradependencies in the data.
1 Introduction

As the amount of information in the world is steadily increasing, there is a growing demand for tools for analysing the information, finding patterns in terms of implicit dependencies in data. Realising that much of the collected data will not be handled or even seen by human beings, systems that are able to generate pragmatic summaries from large quantities of information will be of increasing importance in the future. Although simple statistical techniques for data analysis were developed long ago, advanced techniques for intelligent data analysis are not yet mature. As a result, there is a growing gap between data generation and data understanding. At the same time, there is a growing realisation and expectation that data, intelligently analysed and presented, will be a valuable resource to be used for a competitive advantage.

Recently, the concept of knowledge discovery [PSF91] has been brought to the attention of the business community, one main reason being the general recognition that there is untapped value in large databases. Knowledge discovery (KD) is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Knowledge discovery is thus a form of machine learning which discovers interesting knowledge and represents the information in a high-level language. If the underlying source of information is a database, the term data mining is used to denote the process of automatic extraction of information. The information is then represented in terms of rules reflecting the intradependencies in the data. Applications of such rules include customer behavior analysis in a supermarket or banking environment, and telecommunications alarm diagnosis and prediction. Typically, such rules would emulate an expert's reasoning process when classifying objects.

(Accepted for presentation at the 9th International Symposium on Methodologies for Intelligent Systems, ISMIS'96, Zakopane, Poland, June 9-13, 1996.)
A great deal of work in data mining has concentrated on the generation of rules that cover the situation where the training data is entirely consistent, i.e. all objects that are indiscernible are classified equally. In these cases, definite rules may be generated that map all such objects into the same decision class. In many cases there is however a clear need to be able to reason also in the presence of inconsistencies. Different experts may disagree on the classification of one particular object, in which case it is desirable to assign different trust to the respective conclusions. Also, if objects are classified inconsistently, we still want to be able to generate rules that reflect the normal situation. Such normalcy rules typically sanction a particular conclusion given some information, whereas additional knowledge may invalidate previous conclusions.

In this work we look at the problems related to incompleteness and uncertainty/inconsistency in information systems; we wish to be able to generate rules that can handle these common phenomena. More specifically, we investigate how Rough Sets [Paw82, Paw91] can be applied to the problem of generating default rules [Rei80, Poo88, PGA86] from a set of primitive sample data; such rules enable us to express common relationships. The conclusions that are drawn from a default theory rely on the soundness of a set of assumptions that may themselves be disproven when new knowledge is made available. This generalises the notion of decision rules, and provides a framework under which more, potentially interesting, statistical information may be extracted. Much of this information would be lost if the knowledge extraction process were restricted to generating definite rules only. The contention of this paper is that the generation of default rules from databases provides a framework that is suited to handle many of the problems described above [Mol95, Mol96]. By using the rough set approach, we are able to emulate an expert's decision process by generating a set of decision rules. In combination with default reasoning, we generate rules that cover the most general patterns in the data; noise in the form of abnormal objects will not prevent the generation of such normalcy rules.

The input to the knowledge extraction process is a set of example objects and information about their properties. To learn rules from a set of examples, we assume the existence of an oracle that has complete knowledge of the domain, the world or universe. The oracle is able to classify the elements of the universe in the sense that he can make decisions with respect to some restricted set of decision properties. In doing this, he identifies a set of concepts, the classes of the classification. The task of the learner is to learn the oracle's knowledge by trying to find the characteristic features of each concept, finding descriptions of the oracle's concepts in terms of the attributes that are available to the learner. The task of learning is thus the problem of expressing the oracle's basic concepts in terms of the learner's basic concepts.

In sections 2 and 3 we give an introduction to Default Reasoning and Rough Set theory, respectively. In section 4 we present the principle behind definite rule generation, and give a small example that will be referred to throughout the rest of the paper. In section 5 we investigate the problem of default rule generation using the Rough Set approach. In section 6 we give references to related work. In the last section we give a summary of the work done and point to some questions that give directions for present and future research.
2 Nonmonotonic and Default Reasoning

The notion of default reasoning is omnipresent in common-sense reasoning, and is manifested in all reasoning from inconsistent and incomplete information. The Default Logic of Reiter [Rei80] formalises the notion of default rules. The framework uses a non-monotonic reasoning strategy to draw conclusions from defaults that are consistent with the intended model. Default Logic seems to be a non-monotonic formalism that can be quite easily adapted to the problem of explanation finding. It is possible to interpret defaults as predefined hypotheses, and reasoning with them as a special way of logical theory formation [Rei80, Poo88, PGA86].

Poole [Poo88] proposes an abductive approach to non-monotonic reasoning which corresponds to Reiter's Default Logic. Poole's Theorist is a simple framework for default reasoning. The input to the system is two sets of first-order formulae, facts and defaults, that may be used as premises in a logical argument. The facts are closed formulae that are taken to be true in the domain, whereas the defaults may be seen as possible hypotheses that may be used in a theory to explain the answers, provided that consistency is maintained. An explanation in the Theorist framework ⟨F, Δ⟩ of a formula g is a consistent set of facts and instances of defaults F ∪ D that entails g (D is a set of ground instances of elements from Δ). Theorist [Poo88, Poo89, PGA86] relies on a backward-chaining theorem prover to collect the assumptions that will be needed to explain a given set of observations and to prove their consistency. If new facts are added to the theory, it may be that the goal can no longer be explained, because the defaults used are inconsistent with the new facts.

Tweety the non-flying bird is the most common example of the inadequacy of monotonic reasoning from consistent facts, illustrating the need for non-monotonic reasoning capabilities. Consider the framework ⟨F, Δ⟩ where
F = { bird(X) ← penguin(X),
      ¬fly(X) ← penguin(X),
      penguin(tweety),
      bird(john) }

Δ = { fly(X) ← bird(X) }

From the knowledge represented we may conclude by default that John flies, since we possess no further knowledge with respect to his flying abilities. That is, we are able to explain fly(john) by making the explicit assumption fly(john) ← bird(john) (a ground instance of a default). The use of the default is however blocked in Tweety's case, since penguins do not fly (by a fact in the theory). Obviously, if it came to be known that John is a penguin too, the proposition that he flies should no longer be explainable.

From having observed, say, one hundred birds, all of which fly except for ten which are penguins, a default may be generated to cover the normal situation: birds do fly, whereas penguins do not. If more specific knowledge becomes available, we may construct further normalcy rules to account for this special situation. Assume that one of the ten penguins is indeed observed to fly. In this situation, the theory should be modified such that the statement that penguins do not fly is no longer a fact, but rather a default. The conclusion that a given penguin does not fly would then be sanctioned unless more specific knowledge is available with respect to the flying abilities of the particular penguin. To model this situation, we need a framework in which a default rule may be used to block other defaults. Brewka [Bre89] defines a useful generalisation of Theorist in which defaults of many different priority levels may exist. When defaults are in conflict, the defaults of higher priority take precedence, and the others are deemed inconsistent.
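To make the blocking mechanism concrete, the following is a minimal sketch in Python (a toy propositional rendering; the set-based representation and the helper explainable are our own assumptions, not part of Theorist): a default fires only if its premise holds and the negation of its conclusion is not among the facts.

    # Consequences of the facts (bird(tweety), -fly(tweety)) are precomputed here.
    FACTS = {"penguin(tweety)", "bird(tweety)", "-fly(tweety)", "bird(john)"}
    DEFAULTS = [("bird(X)", "fly(X)")]  # fly(X) <- bird(X)

    def explainable(goal, facts, defaults):
        """A goal is explainable if it is a fact, or if a ground instance of a
        default concludes it and its negation is not contradicted by a fact."""
        if goal in facts:
            return True
        pred, arg = goal.split("(")
        for premise, conclusion in defaults:
            if conclusion.split("(")[0] == pred:
                ground_premise = premise.replace("X)", arg)
                if ground_premise in facts and "-" + goal not in facts:
                    return True
        return False

    print(explainable("fly(john)", FACTS, DEFAULTS))    # True, by assumption
    print(explainable("fly(tweety)", FACTS, DEFAULTS))  # False, blocked by a fact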
3 Overview of Rough Set Theory

The Rough Set approach was designed as a tool to deal with uncertain or vague knowledge in AI applications, and has been shown to provide a theoretical basis for the solution of many problems within knowledge discovery. The notion of classification is central to the approach: the ability to distinguish between objects, and consequently to reason about partitions of the universe. In Rough Sets, objects are perceived only through the information that is available about them, that is, through their values for a predetermined set of attributes. In the case of inexact information, one has to be able to give and reason about rough classifications of objects.

In this paper we investigate how Rough Sets [Paw82, Paw91] can be applied to the problem of generating default rules from a set of primitive sample data, given in terms of a set of objects and information about their properties. As our starting point we have a situation in which data about some domain has been collected and represented in an Information System, containing knowledge about a set of objects in terms of a predefined set of attributes/properties. The attribute values are given for each object in the domain.
Definition (Information System, Decision System). An Information System (IS) is an ordered pair A = (U, A) where U is a nonempty finite set of objects, the Universe, and A is a nonempty, finite set of elements called Attributes. The elements of the Universe will in the following be referred to as Objects. Every attribute a ∈ A is a total function a: U → V_a, where V_a is the set of allowable values for the attribute (its range). A Decision System is an IS A = (U, A) for which the attributes in A are further classified into disjoint sets of condition attributes C and decision attributes D (A = C ∪ D, C ∩ D = ∅).
Definition (Indiscernibility Relation). With every subset of attributes B ⊆ A in the IS A = (U, A), we associate an equivalence relation IND(B), called an Indiscernibility Relation, defined as follows:

    IND(B) = {(x, y) ∈ U² | a(x) = a(y) for every a ∈ B}

By U/IND(B) we mean the set of all equivalence classes in the relation IND(B). The intuition behind the notion of an indiscernibility relation is that selecting a set of attributes B ⊆ A effectively defines a partitioning of the universe into sets of objects that cannot be discerned/distinguished using the attributes in B only. The equivalence classes E_i ∈ U/IND(B), induced by a set of attributes B ⊆ A, will in the following be referred to as object classes or simply classes. Through the remainder of the paper, the definitions will be given in terms of the classes E_i induced by a set of attributes B, and not the objects themselves. In other words, each object x is represented by its class E_i, containing all objects that are indistinguishable from x. Throughout the paper, the notation |·| denotes the function that returns the cardinality of the argument set.
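As an illustration, the partition U/IND(B) is straightforward to compute from an attribute-value table. A minimal sketch in Python (the table layout, object id mapped to an attribute/value dict, is our own assumption):

    from collections import defaultdict

    def ind_classes(universe, B):
        """Partition the universe into the equivalence classes of IND(B)."""
        classes = defaultdict(set)
        for obj, values in universe.items():
            trace = tuple(values[a] for a in sorted(B))  # value trace over B
            classes[trace].add(obj)
        return list(classes.values())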
Definition (Discernibility Matrix). For a set of attributes B ⊆ A in A = (U, A), the Discernibility Matrix M_D(B) = {m_D(i, j)}_{n×n}, 1 ≤ i, j ≤ n = |U/IND(B)|, where

    m_D(i, j) = {a ∈ B | a(E_i) ≠ a(E_j)}   for i, j = 1, 2, ..., n

The entry m_D(i, j) in the discernibility matrix is the set of attributes from B that discern the object classes E_i, E_j ∈ U/IND(B).
To each attribute a ∈ B ⊆ A we now associate a unique boolean variable ā. Hence, to each element m_D(i, j) of the discernibility matrix corresponds a set m̄_D(i, j) = {ā | a ∈ m_D(i, j)}.
Definition (Discernibility Function). The Discernibility Function f(B) of a set of attributes B ⊆ A is

    f(B) = ∧_{i,j ∈ {1..n}} ( ∨ m̄_D(E_i, E_j) )

where n = |U/IND(B)|, and ∨ m̄_D(E_i, E_j) is the disjunction taken over the set of boolean variables m̄_D(i, j) that corresponds to the discernibility matrix element m_D(i, j). The Relative Discernibility Function f(E, B) of an object class E, attributes B ⊆ A, is

    f(E, B) = ∧_{j ∈ {1..n}} ( ∨ m̄(E, E_j) )

where n = |U/IND(B)|. The relative discernibility function f(E, B) computes the minimal sets of attributes (from B) that are necessary to distinguish E from the other object classes defined by B.
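A sketch of how the discernibility matrix, and from it the relative discernibility function of a class, might be computed. Here value_of(E, a) is an assumed helper returning the common value of attribute a on class E, and the relative function is returned as a conjunction of attribute sets (one disjunction per discerning entry), without the boolean simplification needed to obtain prime implicants:

    def discernibility_matrix(classes, attrs, value_of):
        """m_D(i, j): the attributes from `attrs` that discern E_i from E_j."""
        n = len(classes)
        return {(i, j): {a for a in attrs
                         if value_of(classes[i], a) != value_of(classes[j], a)}
                for i in range(n) for j in range(i + 1, n)}

    def relative_discernibility(i, matrix):
        """f(E_i, B) as a list of disjunctions: E_i must be discerned from
        every other class by at least one attribute in each returned set."""
        return [entry for (p, q), entry in matrix.items()
                if i in (p, q) and entry]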
Definition (Dispensability). An attribute a is said to be dispensable or superfluous in B ⊆ A if IND(B) = IND(B − {a}); otherwise the attribute is indispensable in B. If all attributes a ∈ B are indispensable in B, then B is called orthogonal.

Definition (Reduct, Relative Reduct). A Reduct of B is a set of attributes B′ ⊆ B such that all attributes a ∈ B − B′ are dispensable, and IND(B′) = IND(B). We use the term RED(B) to denote the family of reducts of B. The set of prime implicants of the discernibility function f(B) determines the reducts of B. The set of prime implicants of the relative discernibility function f(E, B) determines the relative reducts of B. A reduct (there may be several) is thus a subset of the attributes A by which all classes discernible by the original IS may still be kept separate.
Definition (Lower and Upper Approximation). The Lower Approximation B̲X and the Upper Approximation B̄X of a set of objects X ⊆ U with reference to a set of attributes B ⊆ A (defining an equivalence relation on U) may be defined in terms of the classes in the equivalence relation, as follows:

    B̲X = ∪ {E ∈ U/IND(B) | E ⊆ X}
    B̄X = ∪ {E ∈ U/IND(B) | E ∩ X ≠ ∅}

called the B-lower and the B-upper approximation, respectively. The region BN_B(X) = B̄X − B̲X is called the B-boundary (region) of X. The lower approximation B̲X is the set of elements from U that can with certainty be classified as elements of X, according to the attribute set B. The set B̄X contains the set of objects that may possibly be classified as elements of X. The boundary region contains elements that can neither be classified as being definitely within nor definitely outside X, again using the attributes B.
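The approximations may be computed directly over the classes of U/IND(B); a minimal sketch, with the classes and X represented as Python sets of object ids:

    def lower_approx(classes, X):
        """B-lower approximation: union of classes entirely contained in X."""
        return set().union(*[E for E in classes if E <= X])

    def upper_approx(classes, X):
        """B-upper approximation: union of classes that intersect X."""
        return set().union(*[E for E in classes if E & X])

    def boundary(classes, X):
        return upper_approx(classes, X) - lower_approx(classes, X)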
Definition (Rough Membership Function). For A = (U, A), x ∈ U, X ⊆ U, attributes B ⊆ A, the Rough Membership Function for a class E ∈ U/IND(B) is

    μ_B(E, X) = |E ∩ X| / |E|,   0 ≤ μ_B(E, X) ≤ 1
4 Extraction of Knowledge from Information Systems

In this section we present a simple example that will help clarify the concepts that have been introduced, and that will also serve as an illustration throughout the rest of the paper. The information system displayed in Table 1 resulted from having observed a total of one hundred objects that were classified according to the condition attributes C = {a, b, c}. Furthermore, the oracle's classification is represented as a set of decision attributes D = {d}.
            a   b   c   d (count)
    E1      1   2   3   1 (50)
    E2      1   2   1   2 (5)
    E3      2   2   3   2 (30)
    E4      2   3   3   2 (10)
    E5,1    3   5   1   3 (4)
    E5,2    3   5   1   4 (1)

Table 1: The example information system

The partition of the universe induced by the condition attributes contains five classes, namely E1 through E5. The objects that have the same values for the attributes C = {a, b, c} (indiscernible objects) are represented by their equivalence class (one of E1, ..., E5). For instance, the class E1 contains 50 objects, all characterised by the attribute-value vector a1 ∧ b2 ∧ c3. The class E5 is shown split into two disjoint sets of objects, E5 = E5,1 ∪ E5,2, reflecting the two different decisions: d = 3 (for E5,1) and d = 4 (for E5,2). Hence, the system is indeterministic with respect to the objects in class E5.

More generally, Skowron et al. [SGB91] argue that a knowledge system may always be split into two distinct parts, one of which is totally deterministic, the other indeterministic. The notion of determinism is related to the upper and lower approximations of sets: the objects contained in the lower approximation of some decision class are exactly those for which deterministic/consistent decision rules may be generated. For objects contained in the upper but not in the lower approximation (i.e. that are in the boundary region) of several classes, no such consistent decision can be made.

Before computing the discernibility matrix for the example information system, we note the following. A simplification may be made in the case of decision tables: if we are interested in generating decision rules that cover the input data, we do not have to distinguish between classes that are mapped into the same decision X_j. Hence, the items in the discernibility matrix that record the differences between these classes need not be considered. In the example, this simplification may be applied to the classes E2, E3 and E4.
The (modified) discernibility matrix (over the condition attributes C = {a, b, c}) for the decision system is given in Table 2. On the right side of the table we have shown the relative discernibility functions, i.e. the sets of attributes that are needed to discern one particular class E_i from all other classes. For instance, in order to be able to separate the objects in class E1 from any object class that maps into another decision, we need to make use of both attributes a and c.

          E1    E2    E3    E4    E5    | f(E_i, C)
    E1    -     c     a     ab    abc   | ac
    E2    c     -                 ab    | c(a ∨ b)
    E3    a           -           abc   | a
    E4    ab                -     abc   | a ∨ b
    E5    abc   ab    abc   abc   -     | a ∨ b

Table 2: Discernibility matrix for the decision system

The discernibility function for the entire system is ca(a ∨ b)(a ∨ b ∨ c) = ca; hence, only two of the three attributes are needed to distinguish the classes when the decision mapping is taken into account.
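For concreteness, a sketch that reproduces the entries of Table 2 from the data of Table 1 (the encoding, one dict entry per class with its condition values and consistent decision, is our own; E5 carries None since its decision is ambiguous):

    CLASSES = {
        "E1": ({"a": 1, "b": 2, "c": 3}, 1),
        "E2": ({"a": 1, "b": 2, "c": 1}, 2),
        "E3": ({"a": 2, "b": 2, "c": 3}, 2),
        "E4": ({"a": 2, "b": 3, "c": 3}, 2),
        "E5": ({"a": 3, "b": 5, "c": 1}, None),  # indeterministic: d = 3 or 4
    }

    def modified_matrix(classes):
        """Discernibility entries, skipping pairs mapped to the same decision."""
        names = list(classes)
        entries = {}
        for i, p in enumerate(names):
            for q in names[i + 1:]:
                (vp, dp), (vq, dq) = classes[p], classes[q]
                if dp is not None and dp == dq:
                    continue  # same decision: difference need not be recorded
                entries[(p, q)] = {a for a in vp if vp[a] != vq[a]}
        return entries

    # modified_matrix(CLASSES)[("E1", "E2")] == {"c"}, as in Table 2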
Definite Rule Generation

Several authors have worked on applying the Rough Set approach to the problem of decision rule generation, designing algorithms to extract information from a set of primitive data by building propositional rules that cover the available input knowledge [Yas91, GB88, GB92, Sko93, SGB91, PS93, HCH93]. In the example given above, a typical learning task would be to define the oracle's knowledge (the classification of objects wrt. the decision attribute set D) in terms of the properties available to the learner, i.e. the attributes contained in the set C ⊆ A. As a first step, we will however introduce some concepts and definitions that will be referred to later.

In the following we will use the following shorthand notation: the expression E_{a1v1 ... anvn} denotes the class over U that is defined by the corresponding values v1, ..., vn for the attributes a1, ..., an. For example, referring to the example given in Table 1, the class E_{a2c3} is equal to E3 ∪ E4.

Definition (Class Description). A Class Description Des(E, B) of an object class E, attributes B ⊆ A, is a value trace over the set of attributes B, i.e. a set of attribute-value pairs over B that characterise E.

By using the lower approximations of the sets X_j ∈ U/IND({d}), decision rules may be generated for the cases where objects can be classified with full certainty into some decision class. As mentioned above, the rules are defined as a mapping from an object/class description over the set of condition attributes into a description over the decision attribute(s). A definite rule is generated according to the schema below if all objects of E_i with full certainty have the decision j, in other words, if E_i is fully contained in X_j:

    Des(E_i, C) → Des(X_j, D),   where E_i ⊆ X_j

In defining the rules, the minimal class description of each class is used, which is exactly the set of values for the attributes in a relative reduct for the class. Referring to the discernibility matrix given in Table 2, the minimal set of attributes needed to distinguish E1 from all other classes is ac. Hence, one rule is generated for this class, mapping the value trace over the reduct (a1c3) into the decision class d1. There are two minimal class descriptions of the E2 class, namely a1c1 and b2c1, corresponding to the two disjuncts of the discernibility function f(E2, C). From the table we find that the objects contained in classes E1 through E4 may all be mapped deterministically into one of the decision classes. Each of these classes is contained in the lower approximation C̲X_j for some decision class X_j, j ∈ {1, 2}. The following five (definite) rules are generated, applying the minimal class descriptions:
E1 : E2 :
However, for the class E5, the ambiguity with respect to the decision d means that no de nite rule can be generated. All objects in the class are contained in the boundary region of both classes X3 and X4. The fraction of instances for which (deterministic) decision rules can be learned may be used as a metric on the quality of the learning process. By this simple de nition, the quality of the learning (quality of approximation) described above would be:
S CX j j CX [ CX j j E [ E [ E [ E j 95 j
= jU j = = = 100 = 0:95 jU j jU j C
4 i=1
i
1
1
2
2
3
4
5 Default Rule Generation The procedure described above was able to generate rules that cover the deterministic situation, i.e. where all objects of a particular class E 2 U=IND(C ) are mapped into the same decision class. The example given in the last section does however demonstrate the fact that it is not always possible to obtain a set of deterministic rules that completely covers the entire training set. More importantly, in general we do not even want to generate such a set of overly speci c rules. Even if they do cover all the objects in the training information system, such rules may prove completely insucient when used on other input data. It is very often important to be able to handle inconsistencies in the data. From such data we may still be able to extract a lot of interesting information, speci cally knowledge that re ects the most common or normal situation. A strict requirement on the absolute correctness of the rules has proven insucient in many real world applications. Computer systems { as well as people { are often under pressure to make a decision under strict time constraints, and the ability to reason in absence of knowledge is a great advantage. To reason in this way, we may choose to believe in some rule provided that the evidence supporting it is strong enough. Furthermore, results have been shown [Qui86] that suggest that simpli ed, \uncertain" rules in general prove better than the original ones when applied to new cases. Finding such simple and natural expressions for regularities in databases is useful in various analysis or prediction tasks. Hence, we do want to obtain a set of rules that model the general characteristics of the data, and that are less susceptible to noise. In the following section we de ne a rst generalisation, de ning a framework for the generation of default rules. We will also show how default reasoning may be used as a framework to represent and reason about indeterministic information systems. i
8
Handling the Indeterministic Situation Uncertainty in information systems may stem from uncertainty in the represented data (due to noise), or even uncertainty of the knowledge itself. Inconsistencies are present because of dierent actions of an expert for dierent objects that are described by the same values of conditions, ie. the objects that are indiscernible by these attributes. In the case where the decisions of several experts are represented, diering decisions will be another source of inconsistency. Ultimately, the reason that indeterminacy arises in information systems is the lack of ability to represent { in the particular information system { some attributes that dierentiate the objects in the "real" world. If an expert classi es two objects with the same information vector in the system dierently, this is because he is able to distinguish them according to some attribute(s) that are not represented in the system. These attributes may however be signi cant only in very special situations, and the data collected may suggest that one particular decision eectively covers a great majority of the cases. Default rules that account for some normal dependency between objects and their properties may be accepted if there exists a decision that dominates to a certain extent the set of objects characterised by the rule. Consider the general rule schema Des(E ; C ) ! Des(X ; D). To nd whether the rule can be accepted as a default relation between classes E 2 U=IND(C ) and X 2 U=IND(D), we test the value for the membership function versus the threshold by testing that (E ; X ) : In other words, we accept the default rule for the condition class E provided that the value (E ; X ) exceeds a threshold value for some dominating decision class X . Note that, in the deterministic situation, (E ; X ) = 1, since the class E is in its entirety contained in X . Again referring to the example given on page 6, we aim at characterising the data represented in the class E5 (non{deterministic). The computation yields (E5; X3) = 4=5, and provided that this value is greater than the preset threshold, two default rules are generated for the class (for each of the two minimal class descriptions): i
j
i
j
tr
C
i
i
j
tr
C
i
j
tr
j
i
C
i
j
j
C
: a3 ! d3
b5 ! d3
Which rules will be generated in the process depends upon the setting of . The threshold value may be a preset constant, or be parametrised by the classes in U=IND(C ) [Mol96]. tr
Finding More General Patterns In the above, we have suggested generating de nite rules from the deterministic subpart of the information system, as well as default- or \common pattern" rules for the nondeterministic part. The default rules are generated in the cases where 1 > . From any (deterministic or nondeterministic) information system, we are however often interested in nding more general patterns in the data. We propose to do this by a framework that is able to generate and reason about classes for which no unique decision can be made. In fact, sets may be constructed that cover more objects (and hence have a simpler description), by forming unions of the classes induced by the condition attributes C . Rules are generated that map these composed classes into the decision that dominates the set. The rules that result generally have at least two great advantages as compared to deterministic rules: they are always simpler in structure, and though not entirely correct wrt. the training tr
9
data, they will in many cases prove to be better when handling yet unseen cases, being less susceptible to noise. The ability of the system to classify future events will in general depend greatly on its ability to generalise over its knowledge.

At the core of the approach is the idea of creating indeterminacy in information systems, generating rules that cover the majority of the cases. There are in principle three ways of creating an indeterminacy in an information system [Mol96]; here we consider the generation of indeterminacy through selecting projections over the condition attributes, allowing certain attributes to be excluded from consideration. In doing this, we effectively join equivalence classes over the condition attributes, classes which may be mapped into different decisions. Classes in the equivalence relation defined by the condition attributes C are "glued" together by selecting suitable projections C^Pr of C, while ensuring that, for the resulting object class, one particular decision remains the dominating one. This simple idea is fundamental to the approach. In this paper, we suggest a framework for extracting such "normal" dependencies from the information system. The basic steps are the following:
1. Selection of projections C^Pr = C − C^Cut over the attributes. The projections are selected such that new indeterminacies result.

2. By each such projection, classes are glued together into more compound object classes E_{(k,C^Pr)} ∈ U/IND(C^Pr), k = 1 .. |U/IND(C^Pr)|.

3. Rules are generated, ensuring that one particular decision X_j remains the dominating one by checking that

       μ_{C^Pr}(E_{(k,C^Pr)}, X_j) = |E_{(k,C^Pr)} ∩ X_j| / |E_{(k,C^Pr)}| ≥ μ_tr

4. Blocks to the rules are generated for those classes E_i which are now contained in the compound class, but which deterministically map into another decision than the dominating one, in other words for which E_i ∩ X_j = ∅.
Pr
i
j
D
E
k;C
Pr
S
Cut
= fE [ E j E ; E 2 U=IND(C ); m (i; j ) C g i
j
i
j
D
Cut
The selection of projections that will potentially lead to interesting rules is in other words dictated by the content of the discernibility matrix. Any two object classes E and E may be glued together by making a projection which excludes all the attributes represented by m (i; k) = fa 2 C j a(E ) 6= a(E )g. Furthermore, the consequence of removing a set of attributes may be studied simply by syntactically removing all the corresponding boolean variables from the discernibility matrix. The cells in the matrix that become empty signify exactly the classes that are glued together by performing the projection. Consider now an information system having discernibility function f (B ) (de ned on page 5). The only immediate ways of creating indeterminacies in the system is to remove either of the factors in the conjunction, which amounts to cutting out each of the attributes that are i
D
i
k
10
k
W
represented by corresponding boolean variables in (x ; x ). For instance, for a system having discernibility function, say, a(b _ c), there are two immediate ways of creating an indeterminacy; either removing a or removing both attributes b and c. There are exactly as many dierent projections as there are conjuncts in the discernibility function. Each projection results in a \new" indeterministic information system, de ned over the condition attributes C C . From each of these systems, a recursive call is made to the rule generation procedure, which will eventually generate still more simpli ed rules. The projection operation is de ned over subsets of the condition attributes C . Hence, the recursive process is de ned on the lattice over the powerset 2 . Using the properties of the discernibility function, described above, we may reduce the search in the lattice signi cantly. For instance, in the example IS, having discernibility function ac, we nd that two immediately interesting projections are C ? fag = fb; cg and C ? fcg = fa; bg. Each of these projections are made, respective rules are generated, and recursive calls are made initiating further search from each of the subsystems. The search for the example system is illustrated in gure 1. As mentioned above, rules are generated provided that the objects contained in the class E P r in a given, high fraction of the cases still are classi ed correctly into the decision j . The number of objects from the training set that are wrongly classi ed (into the decision class X , l 6= j ) by rules generated in this way is equal to (1 ? (E P r ; X )) j E P r j. The new object classes may be characterised by their minimal description/information vectors P r = Des(E P r ; MinDes), where MinDes is a prime implicant of the relative discernibility function f (E P r ; C ). Recall that Des(E; C ) computes the value trace of the class E over the attributes C . Since there may exist several such minimal descriptions of a given class, the schema gives rise to potentially several rules. Default rules () and facts (F ) are constructed according to the following schema: i
j
Pr
C
k;C
l
k;C
C
j
k;C
C
Pr
C
: Des(E P r ; C ) ! Des(X ; D) F : Des(E ; C ) ! :Des(X ; D) , where E 2 U=IND(C ), E E Pr
k;C
i
Cut
j
j
i
i
k;C
Pr
and E \ X = ;. i
j
Note that in addition to the default rules, some facts are generated that may potentially block the defaults. These rules are generated for the \minority" classes, the antecedents of the counterexample rules are expressions over the set of attributes that were removed in the projection. The counterexamples, then, are exactly those sets E that are distinguishable from the other E 0 E P r by the attributes in C . i
i
k;C
Cut
5.1 Example Let us now consider the method at work with an example. As shown in section 4, we are, from the given information system, able to generate ve de nite rules that cover the available \certain" knowledge. The algorithm is initially called with argument C ? ;, i.e. not removing any attributes. At this (top) level in the lattice, the de nite rules are generated ( = 1), and default rules that cover the indeterministic part of the IS ( < 1). The total set of de nite rules generated for the example are found on page 8. The only class which is indeterministic wrt. the decision is E5. The degree of membership (E5; X3) = 4=5, whereas (E5; X4) = 1=5. Class E5 has relative discernibility function a _ b. Hence, the following rules may be extracted (each rule is shown with its corresponding value): C
C
11
a3 (C ? ;): ab 3 5 b5
!dj !dj !dj !dj 3
4/5
4
1/5
3
4/5
4
1/5
As argued above, the value for the discernibility function for the system will at each point suggest which paths may allow new and interesting rules to be obtained. In this example, additions may only be obtained through the deletion of either one of attributes a or c. Deleting the attribute b from the condition attributes would not create any new indeterminacy, and hence, no new rules could be generated. Making the two interesting projections yields two new systems over C ? fag and C ? fcg, that have discernibility matrices M (C ? fag) and M (C ? fcg), respectively. D
D
abc
abc a
bc
bc
ab
ac a
bc
ab
ac
a
c
b
b c
c
c
a
c
b
ac
b
c
a
b b
c
a
c
a
c
b
b
c
a
0
0
a
b
−
Figure 1: The search in the lattice The search in the lattice is shown in gure 1, where rules are generated at nodes abc, ab, bc, c, b and ;. Note again that the paths traversed are directed by the value for the discernibility function of the nodes at dierent levels, shown on the right hand side of each node in the lattice. For the projections C ? fag and C ? fcg, respectively, the following new sets of (default) rules are generated:
d1 j (C ? fag): bb2cc3 ! ! d2 j 2 3
    (C − {a}):  b2c3 → d1 | μ = 50/80
                b2c3 → d2 | μ = 30/80

    (C − {c}):  a1 → d1 | μ = 50/55
                a1 → d2 | μ = 5/55
c3 c3 (C ? fabg): c1 c1 c1
!dj !dj !dj !dj !dj 1
50/90
2
40/90
2
5/10
3
4/10
4
1/10
d1 j (C ? facg): bb2 ! ! d2 j 2
50/85 35/85
12
If, starting from the system over C ? fabg, the attribute c is removed, there is obviously nothing that enables us to distinguish the dierent classes any more. The rules generated at the bottom level of the lattice re ect simply the distribution of values for the decision attribute d, i.e the fraction of the objects that are classi ed into the respective decisions. In practice one would not want to generate rules that do not have a particular coverage above the threshold. The threshold value may be set at any level ( 1), thereby eectively discarding the rules with insucient con dence. Sorting the generated rules by their respective values of the degree of membership , and setting the threshold value = 0.55, we obtain the table given below: tr
tr
F: a c ! d ac !d bc !d a !d b !d 1 3
1
1 1
2
2 1
2
2
2
3
2
: a1
a3 b5 b2c3 b2 c3
!dj !dj !dj !dj !dj !dj 1
50/55=0.91
3
4/5=0.80
3
4/5=0.80
1
50/80=0.62
1
50/85=0.59
1
50/90=0.56
Assume now that a new object is observed, for which the value of the a attribute is found to be 1, whereas the values for all the other attributes for the object are unknown. The de nite rules do not sanction any conclusion in this case, we may however apply the rst default rule to conclude (by assumption) that the decision in this case should be 1. If, later, further knowledge is made available, the assumption (and therefore the conclusion) may have to be retracted. The framework described here allows the counterexamples (blocks) to defaults to be recorded explicitly. For instance, consider again the rst (highest priority) default rule (having given it a unique name); def
a1 d 1
: a1 ! d 1
The rule was output at level C ? fcg in the lattice, and the counterexample is produced in terms of the attributes that were cut out, that is c in this case. Hence, we record the value of this attribute for all those object classes that are contained in E = E1 [ E2 and that map into any decision other than d1 . In the example, this concerns the objects in the class E2, for which the value for c is 1. A fact is generated that blocks the default in this case. Notice that other objects may well exist in the information system that have c = 1 and that still map into decision d = 1. Hence, the knowledge that c = 1 should be used to block the application of the default, instead of the conclusion itself. A rule is generated that may block def . a1
a1 d 1
c1 ! :def
a1 d1
6 Related Work Yasdi [Yas91] presents an approach which allows generation of rules from non{deterministic situations. In his framework, the rules are built up according to the schema
fDes(E ) ! Des(X ) j E \ X 6= ;g i
j
i
j
where i and j range over the learner's and the oracle's concept partitions, respectively. The approach eectively uses the upper approximations of sets, creating indeterministic/inconsistent 13
rules. Yasdi's paper provides no way of dealing with such inconsistencies in the generated rules, there is no reference to information which could imply that one decision be preferred over others in situations of con ict. The framework described in this paper does also, in principle, allow mutually inconsistent (default) rules to coexist. However, the setting of the threshold value may be used to eectively control which rules are included in the theory. Also, the relative trust given to the rules in terms of the membership function re ects a preference relation over the rules. Grzymala{Busse [GB88] describes the process of knowledge acquisition under uncertainty, using the Rough set approach. From a set of examples, the framework produces rules that are classi ed into certain and possible, depending on their classi cation determinacy. The inconsistencies present in the original information system are thus not corrected; rather, they are used to generate unsafe rules that cover a greater set of the objects represented in the information system. The author suggests propagating the possible rules in parallel with the propagation of the certain rules, eectively using two dierent production systems that are run concurrently. Hu, Cercone and Han [HCH93] describe an algorithm that treats tuples that occur rarely in the training data as noise, not allowing for the generation of rules from such data. This is accomplished by calculating a simple frequency ratio for each tuple and ltering out the exceptional tuples using a noise lter threshold. In our example, the ratio for the object class E5 2 would be 1/100 = 0.01, and one might consider cutting out the tuple from the information system. (Incidentally, this would have solved our problem with the indeterminacy of the decision attribute for the E5 class). The concept of Boundary region thinning [Sko93] treats objects corresponding to small values of the probability distribution wrt the decision as abnormal or noisy objects. ;
The generalised decision for a given information vector can then be modi ed by removing from it the decision values corresponding to these small values of the probability distribution. The decision rules generated for the modi ed generalised decision can give better quality of classi cation of new, yet unseen objects. [Sko93]
7 Summary and Future Research In this paper we contend that default rules provide a powerful tool for representing common characteristics of a set of data. We suggest that Rough Set theory may be applied to solve the problem of generation of default decision rules from such primitive (sample) data. The framework and its properties is extensively described in [Mol96]. An algorithm has been developed which is in the process of being implemented and tested. Apart from the speci c issue tackled, we foresee several interesting topics and questions for future investigation, some of which are indicated below. This study shows that Rough Sets will prove bene cial to the research in these areas.
The study of the relationship between the propositional data tables and the possible
generation of rst-order clauses or default rules, generalising the approach into such a framework. The problem of assigning relative priorities to default rules. Development of strategies that enable an ordering of the rules into dierent priority strata (Prioritized Theorist 14
[Bre89]). A further study into the relationship between (prioritized) default rule systems and probabilistic systems. The problem of ensuring consistency between the rules added, and with the original examples. The problem of updating the rule base when new data becomes available, and the reassignment of priorities to default rules should be studied.
References [Bre89] G. Brewka. Proc. of IJCAI-89. In Preferred subtheories: An Extended Logical Framework for Default Reasoning, pages 1043{1048, 1989. [GB88] J.W. Grzymala-Busse. Knowledge Acquisition under Uncertainty { A Rough Set Approach. Journal of Intelligent and Robotic Systems, 1:3{16, 1988. [GB92] J.W. Grzymala-Busse. LERS { A System for Learning from Examples based on Rough Sets. In R. Slowinski, editor, Handbook of Applications and Advances of the Rough Sets Theory, pages 3{18. 1992. [HCH93] X. Hu, N. Cercone, and J. Han. An Attribute{Oriented Rough Set approach for Knowledge Discovery in Databases. In W. P. Ziarko, editor, Rough Sets, Fuzzy Sets and Knowledge Discovery, pages 90{99. Springer Verlag, 1993. [Mol95] T. Mollestad. Learning Propositional Default Rules using the Rough Set Approach. In A. Aamodt and J. Komorowski, editors, Scandinavian Conference on Arti cial Intelligence, pages 208{219, Trondheim, Norway, May 1995. IOS Press. [Mol96] T. Mollestad. A Rough Set Approach to Default Rules Data Mining. PhD thesis, The Norwegian Institute of Technology, Trondheim, Norway, September 1996. To appear. [Paw82] Z. Pawlak. Rough Sets. International Journal of Information and Computer Science, 11(5):341{356, 1982. [Paw91] Z. Pawlak. Rough Sets { Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991. [PGA86] D. Poole, R. Goebel, and R. Aleliunas. Theorist: A Logical Reasoning System for Defaults and Diagnosis. In N.J. Cercone and G. McCalla, editors, The Knowledge Frontier: Essays in the Representation of Knowledge, pages 331{352. Springer{Verlag, New York, 1986. [Poo88] D. Poole. A Logical Framework for Default Reasoning. Arti cial Intelligence, 36:27{47, 1988. [Poo89] D. Poole. Explanation and Prediction: An Architecture for Default and Abductive Reasoning. Computational Intelligence, 5(2):97{110, 1989. [PS93] Z. Pawlak and A. Skowron. A Rough Set Approach to Decision Rules Generation. Technical report, University of Warsaw, 1993. [PSF91] G. Piatetsky-Shapiro and W.J. Frawley, editors. Knowledge Discovery in Databases. AAAI/MIT, 1991. [Qui86] J.R. Quinlan. Induction of Decision Trees. Machine learning, 1:81{106, 1986. [Rei80] R. Reiter. A Logic for Default Reasoning. Computational Intelligence, 13:81{132, 1980. [SGB91] A. Skowron and J. Grzymala-Busse. From Rough Set Theory to Evidence Theory. Technical report, Warsaw University of Technology, 1991. [Sko93] A. Skowron. Boolean Reasoning for Decision Rules generation. In J. Komorowski and Z.W. Ras, editors, 7th International Symposium for Methodologies for Intelligent Systems, ISMIS `93, pages 295{305, Trondheim, Norway, June 1993. Springer Verlag. [Yas91] R. Yasdi. Learning Classi cation Rules from Databases in the Context of KnowledgeAcquisition and -Representation. IEEE Transactions on Knowledge and Data Engineering, 3:293{306, September 1991.