Decision Support Systems 42 (2006) 1673 – 1683 www.elsevier.com/locate/dss
A logical framework for identifying quality knowledge from different data sources ☆

Kaile Su a, Huijing Huang b, Xindong Wu c, Shichao Zhang d,e,⁎

a Faculty of Computer Science, Zhongshan University, China
b Bureau of Personnel and Education, Chinese Academy of Sciences, China
c Department of Computer Science, University of Vermont, Burlington, Vermont 05405, USA
d Guangxi Normal University, Guilin, China
e Faculty of Information Technology, University of Technology Sydney, PO Box 123, Broadway NSW 2007, Australia
Received 9 February 2005; received in revised form 16 February 2006; accepted 20 February 2006 Available online 19 April 2006
Abstract

As the Web has emerged as a large distributed data repository, individuals and organizations can draw on low-cost information and knowledge from the Internet when making business decisions. Because data in different data sources may be conflicting or untrue, researchers and practitioners must intensify efforts to develop appropriate techniques for their efficient use and management. In this paper, a logical framework is designed for identifying quality knowledge from different data sources, thus working towards the development of an agreed ontology. Our experimental results demonstrate that the approach is promising, and that a minor data enhancement adjustment can bring higher effectiveness.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Knowledge; Data source; Logic
☆ This work is partially supported by Australian large ARC grants (DP0449535, DP0559536 and DP0667060), a China NSF major research program (60496327), and two China NSF grants (60463003, 60473004).
⁎ Corresponding author. Faculty of Information Technology, University of Technology Sydney, PO Box 123, Broadway NSW 2007, Australia. E-mail address: [email protected] (S. Zhang).

1. Introduction

The vast amount of information available on the Web provides great potential for people to improve the quality of decision-making by enhancing results mined from databases [9]. If a company has an internal dataset D to be mined, high-profit pressures generate an urgent need to collect extra information from external data sources, referred to here as D1, D2, …, Dn, when mining D. As such, knowledge discovery from different data sources (K3D) has recently been recognized as an important research topic in the data mining community. Here, the knowledge from D is referred to as 'internal knowledge', whereas the knowledge from D1, D2, …, Dn is 'external knowledge'.

There are essential differences between mono- and multi-database mining. Both data and patterns in multi-databases present more challenges than those in mono-databases. For example, unlike in mono-databases, data
items in multi-databases may have different names, formats and structures. They may also conflict with one another, or be untrue [23]. Therefore, an agreed ontology must be developed for mining quality knowledge from multiple data sources, where quality knowledge is knowledge that is reliable and not contradictory.

Liu et al. have proposed a means of searching for interesting knowledge in multiple databases according to a user query [11]. Zhong et al. have proposed a method of mining peculiarity rules from multiple databases [26]. Aronis et al. have introduced a system called WoRLD that uses spreading activation to enable inductive learning from multiple tables in multiple databases spread across the network [1]. These research efforts provide good insight into knowledge discovery from multiple data sources. The authors of this paper have also contributed research towards mining multiple data sources. For example, Wu and Zhang [20] advocated an approach for identifying patterns in multiple databases by weighting; Zhang et al. [23] designed a local pattern analysis for mining multiple databases; Wu et al. [22] proposed a database classification for mining multiple databases; and Zhang et al. [24] systematically studied various strategies for mining multiple databases.

However, all of the above techniques are based only on quality data. That is, researchers have assumed that the input to mining algorithms conforms to a well-defined data distribution, containing no missing, inconsistent, or incorrect values [25]. This leaves a large gap between the available data and the machinery available to process it. Because real-world data might be incomplete, noisy, or inconsistent, thus disguising useful patterns, researchers and practitioners must intensify efforts to develop appropriate techniques for efficiently using and managing data. Although data enhancement, which straddles data preprocessing and K3D, often presents itself as less glamorous, it is in fact a more critical step than other steps in K3D applications, as minor data enhancement adjustments have the potential to bring about higher effectiveness. Therefore, we see data enhancement as a crucial research topic in K3D applications.

In this paper, breaking fresh ground from traditional data mining strategies, we take a data source as a knowledge base and design a logic framework for identifying trustworthy knowledge from various data sources, thus working toward the development of an agreed ontology.

The rest of this paper is organized as follows. Some basic concepts are recalled in Section 2. In Section 3 we present a logic and its semantics for K3D. Section 4 constructs the proof theory. Section 5 discusses how the proposed methodology can be used to enhance nonmonotonic reasoning. Section 6 illustrates the use of our framework. Our research contributions are summarized in the last section.

2. Needed concepts

The aim of our research in this paper can be formulated as follows: given a mining task for a company that has data source DS1, and assuming DS2, DS3, …, DSn are n − 1 external data sources that have been collected for the mining task, we construct a logic framework for identifying trustworthy knowledge from the external data sources.

Because privacy is a very sensitive issue, and safeguarding its protection in a data source is of extreme importance, sharing knowledge (rather than simply using the original raw data) presents a feasible way to deal with different data source problems [20]. Accordingly, we assume here that a data source is taken as a knowledge base¹; a company is viewed as a data source; and a rule has two values in a data source: true (the data source supports the rule) and false (otherwise).

¹ If a data source contains only data, we can transform it into knowledge by existing mining techniques.

The knowledge collected from external data sources may be subject to noise. Thus, if a data source wants to create its own knowledge for data mining applications, it needs the ability to refine the knowledge it has collected. That is, the data source has to determine which set of knowledge to believe according to its own knowledge. To do this, we advocate a logic framework that pursues the following principle, based on work in [16]: if a data source i believes that another data source j is veridical, written in a standard multi-modal language as Ki(Kjα ⇒ α) for all formulas α, then data source i inherits and accepts the knowledge in data source j. Otherwise, the knowledge in data source j must be preprocessed before it is applied. In this stipulation, the knowledge in data source i is referred to as 'internal knowledge' and the knowledge in data source j as 'external knowledge'.
For example, let D1 be a data source with the rule set {a1 → b1}, let D2 be an external data source with the rule set {a2 → b2, a3 → b3}, and suppose data source D1 believes that the external data source D2 is veridical. Then D1 inherits and accepts the knowledge in D2. This means that D1 has the rule set {a1 → b1, a2 → b2, a3 → b3} after collecting knowledge from D2.

Our logic framework focuses on the following epistemic properties:

• Veridicality. Knowledge is true.
• Introspection. A data source is aware of what it supports and of what it does not support.
• Consistency. A data source's knowledge is non-contradictory.

The well-known modal logics S5, K, T, K45 and KD45 have already been constructed by Turner [18] and Wooldridge [19] using different combinations of the above properties. In this paper we build a new modal logic for K3D, in which the veridicality and introspection of knowledge are taken into account. We allow the explicit mention of the introspection and veridicality of a data source according to its own knowledge, so that the formulas in our language indicate which one of K, T, S5, or K45 can be used by a data source.

The language of those logics is propositional logic augmented by the modal operators K1, K2, …, Kn, where Kiϕ reads "data source i supports ϕ". We denote this language by Ln. For convenience, we define true as an abbreviation for a fixed valid propositional formula, say p ∨ ¬p, where p is a primitive proposition. We abbreviate ¬true by false.

According to Halpern and Moses [5], the semantics of logic formulas can be given by means of Kripke structures [7], which formalize the intuition behind possible worlds. A Kripke structure is a tuple (W, π, K1, …, Kn), where W is a set of worlds, π associates with each world a truth assignment to the primitive propositions, so that π(w)(p) ∈ {true, false} for each world w and primitive proposition p, and K1, …, Kn are binary accessibility relations. By convention, Ki^M and π^M are used to refer to the Ki relation and the π function in the Kripke structure M, respectively. We omit the superscript M if it is obvious from the context. Finally, we define

Ki(w) = {w′ | (w, w′) ∈ Ki}

That is, Ki(w) is the set of worlds that data source i considers possible at w. A situation is a pair (M, w) consisting of a Kripke structure M and a world w in M. By using situations, we can inductively give semantics to formulas as follows. For a primitive proposition p,

(M, w) ⊨ p iff π^M(w)(p) = true

Conjunctions and negations are dealt with in the standard way. Finally,

(M, w) ⊨ Kiα iff for all w′ ∈ Ki^M(w), (M, w′) ⊨ α

Thus, a data source i supports α if α is true in all situations that the data source considers possible. Note that the Kripke structure M is fixed in the above inductive interpretation. However, as we will see in the next section, our interpretation differs from the above case.
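For illustration, the following minimal Python sketch implements the standard semantics just recalled; the encoding (dictionaries and tuples) and all names are our own assumptions, not part of the formalism.

```python
# Standard Kripke semantics: (M, w) |= K_i(alpha) iff alpha holds at every
# world w' with (w, w') in K_i. A sketch for illustration only.
from typing import Dict, Set, Tuple

World = str
Formula = Tuple  # ('atom', p) | ('not', f) | ('and', f, g) | ('K', i, f)

class Kripke:
    def __init__(self, worlds: Set[World],
                 pi: Dict[World, Dict[str, bool]],
                 K: Dict[int, Set[Tuple[World, World]]]):
        self.worlds, self.pi, self.K = worlds, pi, K

    def Ki(self, i: int, w: World) -> Set[World]:
        # K_i(w): the worlds data source i considers possible at w.
        return {v for (u, v) in self.K[i] if u == w}

def holds(M: Kripke, w: World, f: Formula) -> bool:
    kind = f[0]
    if kind == 'atom':
        return M.pi[w].get(f[1], False)
    if kind == 'not':
        return not holds(M, w, f[1])
    if kind == 'and':
        return holds(M, w, f[1]) and holds(M, w, f[2])
    if kind == 'K':  # K_i alpha: alpha at every accessible world
        i, alpha = f[1], f[2]
        return all(holds(M, v, alpha) for v in M.Ki(i, w))
    raise ValueError(kind)

# Tiny usage example: one data source, two worlds.
M = Kripke({'w1', 'w2'},
           {'w1': {'p': True}, 'w2': {'p': True}},
           {1: {('w1', 'w1'), ('w1', 'w2')}})
print(holds(M, 'w1', ('K', 1, ('atom', 'p'))))  # True: p holds in every accessible world
```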
3. Formal semantics

The language we define in this section is Ln augmented by two classes of special proposition constants, Ii and Vi (1 ≤ i ≤ n), and is denoted by Ln(VI). The constants Ii and Vi (1 ≤ i ≤ n) correspond to the epistemic properties introspection and veridicality, respectively: formula Ii indicates that data source i has the ability to introspect its knowledge, and formula Vi indicates that data source i supports true knowledge only. The language Ln(VI) is tailored to represent and tackle the relationship between internal and external knowledge faced by the data sources. When identifying trustworthy knowledge from external data sources, the most important merit of this logic is that it distinguishes the internal knowledge of a data source from its external knowledge.

We now present the interpretation of Ln(VI). In this section we use standard Kripke structures and situations in a non-standard way. The key point is that the accessibility relation Ki in the Kripke structures is no longer related to data source i's knowledge: in each situation (M, w), a situation (M, w′) with w′ ∈ Ki(w) represents the information that data source i has collected, rather than its own knowledge. Nonetheless, the relation Ki, together with the actual world w, uniquely determines data source i's knowledge in some implicit way.

Let ⊨N be the satisfaction relation we are going to define. The most subtle case is that of formulas of the form Kiα. One might be tempted to let

(M, w) ⊨N Kiα iff for all w′ ∈ Ki(w), (M, w′) ⊨N α

This would be entirely adequate if the situations (M, w′) with w′ ∈ Ki(w) were exactly those thought possible from the standpoint of data source i's knowledge. Unfortunately, these (M, w′) are thought possible only from the standpoint of the information that data source i has collected. Indeed, if we proceeded in this way, we would get nothing but the well-known modal logic Kn. Thus, the crucial question is which situations data source i should consider possible from the standpoint of what it actually supports, after checking and reflecting on the external knowledge. Let Si(M, w) be the set of all such possible situations for the given situation (M, w); we then interpret formulas of the form Kiα as

(M, w) ⊨N Kiα iff ∀(M′, w′) ∈ Si(M, w), (M′, w′) ⊨N α

We can now work out what Si(M, w) should be. There are four cases, according to the logic style of data source i in the situation (M, w).

Firstly, if the logic style is K, i.e. the value of π(w) is false at both Ii and Vi, then data source i is unable to distinguish its knowledge from what has been collected, and hence the situations in Si(M, w) are exactly the (M, w′) with w′ ∈ Ki(w).

Secondly, assume the logic style is T, that is,

π(w)(Ii) = false and π(w)(Vi) = true

Then we obtain Si(M, w) by adding the actual situation (M, w) to the situations (M, w′) with w′ ∈ Ki(w). This enables data source i to delete the false part from the collected knowledge.

Thirdly, suppose the logic style is K45, that is,

π(w)(Ii) = true and π(w)(Vi) = false

Assuming that, from its internal knowledge, data source i thinks the situations (M′, w′) are possible, we have, by the introspection of data source i, that data source i's external knowledge in each of those situations (M′, w′) is exactly the same as that in the actual situation (M, w). This is semantically represented as

Ki^M′(w′) = Ki^M(w)

On the other hand, the introspection property Ii is such that, if it holds, then data source i must support it, no matter what information data source i has collected. Thus, in each of those situations (M′, w′), Ii must hold. Hence,

π^M′(w′)(Ii) = true

Based on the above discussion, we define Si(M, w) as the set of those situations (M′, w′) where w′ ∈ Ki^M(w) and M′ coincides with M except that

Ki^M′(w′) = Ki^M(w) and π^M′(w′)(Ii) = true

For convenience, we denote such an M′ by M[Ki(w′)/Ki(w), π(w′)(Ii)/true], the structure obtained from M by replacing Ki(w′) with Ki(w) and π(w′)(Ii) with true.

Finally, let the logic style be S5. Then, by considering veridicality, we put the actual world w into the set Ki^M(w) and take Ki^M(w) ∪ {w} to be the set of worlds possible from the standpoint of data source i's external knowledge. By considering introspection, we define Si(M, w) in the same way as in the case of K45, with Ki^M(w), the set of worlds possible from the standpoint of data source i's external knowledge, replaced by Ki^M(w) ∪ {w}. In other words, we define Si(M, w) as the set of those situations (M′, w′) where w′ ∈ Ki^M(w) ∪ {w} and M′ coincides with M except that

Ki^M′(w′) = Ki^M(w) ∪ {w} and π^M′(w′)(Ii) = true

For convenience, we denote such an M′ by M[Ki(w′)/(Ki(w) ∪ {w}), π(w′)(Ii)/true].

As we have seen, Si(M, w) is equal to

(1) {(M, w′) | w′ ∈ Ki^M(w)}, if π(w)(Ii) = false and π(w)(Vi) = false;
(2) {(M, w′) | w′ ∈ Ki^M(w)} ∪ {(M, w)}, if π(w)(Ii) = false and π(w)(Vi) = true;
(3) {(M[Ki(w′)/Ki(w), π(w′)(Ii)/true], w′) | w′ ∈ Ki^M(w)}, if π(w)(Ii) = true and π(w)(Vi) = false;
(4) {(M[Ki(w′)/(Ki(w) ∪ {w}), π(w′)(Ii)/true], w′) | w′ ∈ Ki^M(w) ∪ {w}}, if π(w)(Ii) = true and π(w)(Vi) = true.
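The four cases can be read operationally. The following minimal Python sketch, assuming a dictionary-based encoding of structures (the Model class and all names are our illustrative assumptions), computes Si(M, w) from the flags Ii and Vi at w.

```python
# S_i(M, w) under the four logic styles K, T, K45 and S5. A sketch only.
from copy import deepcopy

class Model:
    """K[i][w] is the set of worlds collected as possible by source i at w;
    pi[w] maps primitive propositions and the constants 'I_i', 'V_i' to booleans."""
    def __init__(self, K, pi):
        self.K, self.pi = K, pi

def substituted(M, i, w_prime, new_Ki):
    # M[Ki(w')/new_Ki, pi(w')(I_i)/true]: M with K_i(w') and pi(w')(I_i) replaced.
    Mp = Model(deepcopy(M.K), deepcopy(M.pi))
    Mp.K[i][w_prime] = set(new_Ki)
    Mp.pi[w_prime][f'I_{i}'] = True
    return Mp

def S(M, i, w):
    """Cases (1)-(4) above, selected by the flags I_i and V_i at w."""
    Ii = M.pi[w].get(f'I_{i}', False)
    Vi = M.pi[w].get(f'V_{i}', False)
    base = set(M.K[i][w])
    if not Ii and not Vi:                                   # (1) style K
        return [(M, wp) for wp in base]
    if not Ii and Vi:                                       # (2) style T: add (M, w)
        return [(M, wp) for wp in base | {w}]
    if Ii and not Vi:                                       # (3) style K45
        return [(substituted(M, i, wp, base), wp) for wp in base]
    return [(substituted(M, i, wp, base | {w}), wp)         # (4) style S5
            for wp in base | {w}]

# Usage: source 1 at w1 has style T (V_1 true, I_1 false), so (M, w1) itself
# is one of the situations in S_1(M, w1).
M = Model({1: {'w1': {'w2'}, 'w2': set()}},
          {'w1': {'V_1': True}, 'w2': {'p': True}})
print([wp for (_, wp) in S(M, 1, 'w1')])   # contains both 'w2' and 'w1'
```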
From the above, we can obtain the following properties of Si.

Proposition 1. For all situations (M, w):

(1) if π^M(w)(Vi) = true and π^M(w)(Ii) = false, then (M, w) ∈ Si(M, w);
(2) if π^M(w)(Vi) = true and π^M(w)(Ii) = true, then (M[Ki(w)/(Ki(w) ∪ {w})], w) ∈ Si(M, w);
(3) if π^M(w)(Ii) = true, then, for all situations (M′, w′) in Si(M, w), π^M′(w′)(Ii) = true and Si(M′, w′) = Si(M, w).

These properties can be proven directly from the above definitions and deliberations.

We now formally present our semantic framework by defining inductively the satisfaction relation ⊨N between a situation and a formula as follows.

(1) (M, w) ⊨N p iff π(w)(p) = true, for a primitive proposition p.
(2) (M, w) ⊨N ¬α iff not (M, w) ⊨N α, and (M, w) ⊨N α ∧ β iff (M, w) ⊨N α and (M, w) ⊨N β.
(3) (M, w) ⊨N Kiα iff (M′, w′) ⊨N α for all (M′, w′) ∈ Si(M, w).

We remark that, according to this semantic framework, a veridical data source takes the knowledge that has been collected and adds the current world to the set of possibilities in order to obtain its own knowledge. It would be difficult for real data sources to construct such a knowledge-sharing environment, because a data source typically does not support what the current world is, and so it cannot itself perform the operation of adding the current world. Nevertheless, this does not make the semantic framework counterintuitive or suspect. There might be no veridical data source in the real world, but there are data sources that have the veridicality property from the viewpoint of other data sources. For example, if data source j thinks that data source i is veridical then, from data source j's viewpoint, data source i still does not support the current world, but data source j supports (or believes) that the current world is one of data source i's possible worlds. Thus, if data source j supposes that the current world is w and that data source i's set of possible worlds (corresponding to its external knowledge) is W, then data source j takes data source i's knowledge to be determined by the set W plus the supposed current world w.

The following proposition helps to prove the validity of the axioms in the next section.

Proposition 2. For all formulas ϕ, if

π^M(w)(Vi) = true and π^M(w)(Ii) = true

then, for each world w′,

(M[Ki(w)/(Ki(w) ∪ {w})], w′) ⊨N ϕ iff (M, w′) ⊨N ϕ

Proof. The proof is accomplished by a somewhat tedious induction on the structure of ϕ. More precisely, assuming that the claim holds for all sub-formulas of ϕ, for all M, i, w and w′, we show that it also holds for ϕ. Suppose

π^M(w)(Vi) = true and π^M(w)(Ii) = true

If ϕ is a primitive proposition p, the claim is immediate from the fact that the assignment function π is the same in both Kripke structures M and M[Ki(w)/(Ki(w) ∪ {w})]. The cases where ϕ is a conjunction or a negation follow from the definition of ⊨N above. If ϕ is of the form Kiψ then, in the case w = w′, the claim holds, since it is easy to check that

Si(M, w) = Si(M[Ki(w)/(Ki(w) ∪ {w})], w)

Thus, we can assume w ≠ w′. There are four subcases:

(1) π^M(w′)(Vi) = false and π^M(w′)(Ii) = false,
(2) π^M(w′)(Vi) = true and π^M(w′)(Ii) = false,
(3) π^M(w′)(Vi) = false and π^M(w′)(Ii) = true,
(4) π^M(w′)(Vi) = true and π^M(w′)(Ii) = true.

Subcases 1 and 2 follow immediately from the definition of ⊨N and the inductive assumption. For subcase 3, we first note that, by w ≠ w′,

Ki^M(w′) = Ki^{M[Ki(w)/(Ki(w) ∪ {w})]}(w′)

Accordingly, by the definition of ⊨N, it suffices to show, for each w″ ∈ Ki^M(w′), that

(M[Ki(w)/(Ki(w) ∪ {w}), Ki(w″)/Ki(w′), π(w″)(Ii)/true], w″) ⊨N ψ iff (M[Ki(w″)/Ki(w′), π(w″)(Ii)/true], w″) ⊨N ψ

Note that this assertion holds if w″ = w, because the structure
M[Ki(w)/(Ki(w) ∪ {w}), Ki(w)/Ki(w′), π(w)(Ii)/true] is equal to M[Ki(w)/Ki(w′), π(w)(Ii)/true], since the second substitution on Ki(w) overrides the first. And if w″ ≠ w, then the assertion also holds by the inductive assumption. Subcase 4 can be proved in the same way.

Finally, suppose that ϕ is of the form Kjψ for j ≠ i. This case is also divided into four subcases, depending on the values of π^M(w′) at Vj and Ij. We prove the claim only in the case where data source j's logic style at w′ is S5, that is,

π^M(w′)(Vj) = true and π^M(w′)(Ij) = true

since the other cases are simpler, or can be handled in the same way. In this subcase, by the definition of ⊨N, it suffices to show, for every w″ ∈ Kj(w′) ∪ {w′}, that

(M[Ki(w)/(Ki(w) ∪ {w}), Kj(w″)/(Kj(w′) ∪ {w′}), π(w″)(Ij)/true], w″) ⊨N ψ iff (M[Kj(w″)/(Kj(w′) ∪ {w′}), π(w″)(Ij)/true], w″) ⊨N ψ

But this can be obtained in a straightforward way from the inductive assumption, because even if w″ = w, in the Kripke structure

M[Kj(w″)/(Kj(w′) ∪ {w′}), π(w″)(Ij)/true]

data source i's logic style in world w remains S5, that is, π(w)(Vi) = true and π(w)(Ii) = true. Hence, the inductive assumption can be applied. □

The following proposition reflects some of the formal properties of ⊨N.

Proposition 3. For all formulas α, β ∈ Ln(VI), and all situations (M, w):

(1) (M, w) ⊨N (Kiα ∧ Ki(α ⇒ β)) ⇒ Kiβ;
(2) (M, w) ⊨N Vi ⇒ (Kiα ⇒ α);
(3) (M, w) ⊨N Ii ⇒ (KiIi ∧ (Kiα ⇒ KiKiα) ∧ (¬Kiα ⇒ Ki¬Kiα)).

Proof. To prove part (1), assume (M, w) ⊨N Kiα ∧ Ki(α ⇒ β); we must show (M, w) ⊨N Kiβ. By the assumption, we have, for each (M′, w′) ∈ Si(M, w), (M′, w′) ⊨N α and (M′, w′) ⊨N α ⇒ β. It follows that (M′, w′) ⊨N β for all such (M′, w′), and therefore (M, w) ⊨N Kiβ. Parts (2) and (3) can be obtained from Proposition 1 (together with Proposition 2) and from Proposition 1 (3), respectively. □
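As an end-of-section illustration, the self-contained Python sketch below (our own encoding; the dictionaries K and pi and all names are assumptions) implements ⊨N with the four-case construction of Si inlined, and spot-checks the idea behind Proposition 3 (2): when Vi holds at w the actual world is among the possibilities, so Kip fails whenever p fails at w.

```python
# The satisfaction relation |=_N with S_i(M, w) inlined. A sketch only.
from copy import deepcopy

def situations(K, pi, i, w):
    Ii, Vi = pi[w].get(f'I_{i}', False), pi[w].get(f'V_{i}', False)
    base = set(K[i][w])
    worlds = base | {w} if Vi else base          # veridicality adds the actual world
    if not Ii:                                   # styles K and T: the structure is unchanged
        return [((K, pi), wp) for wp in worlds]
    out = []                                     # styles K45 and S5: fix K_i(w') and I_i at w'
    for wp in worlds:
        K2, pi2 = deepcopy(K), deepcopy(pi)
        K2[i][wp] = set(worlds)
        pi2[wp][f'I_{i}'] = True
        out.append(((K2, pi2), wp))
    return out

def holdsN(K, pi, w, f):
    kind = f[0]
    if kind == 'atom':
        return bool(pi[w].get(f[1], False))
    if kind == 'not':
        return not holdsN(K, pi, w, f[1])
    if kind == 'and':
        return holdsN(K, pi, w, f[1]) and holdsN(K, pi, w, f[2])
    if kind == 'K':                              # (M, w) |=_N K_i alpha over S_i(M, w)
        i, alpha = f[1], f[2]
        return all(holdsN(K2, pi2, wp, alpha)
                   for ((K2, pi2), wp) in situations(K, pi, i, w))
    raise ValueError(kind)

# V_1 holds at w1 but p does not, so K_1 p must fail at w1 under |=_N.
K = {1: {'w1': {'w2'}, 'w2': set()}}
pi = {'w1': {'V_1': True, 'p': False}, 'w2': {'p': True}}
assert holdsN(K, pi, 'w1', ('K', 1, ('atom', 'p'))) is False
```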
4. Proof theory

With respect to the semantic framework in the previous section, we present a sound and complete proof theory below. For any data source i, the axioms are:

P.   All instances of axioms of propositional logic
K.   (Kiα ∧ Ki(α ⇒ β)) ⇒ Kiβ
T.   Vi ⇒ (Kiα ⇒ α)
K45. Ii ⇒ (KiIi ∧ (Kiα ⇒ KiKiα) ∧ (¬Kiα ⇒ Ki¬Kiα))

and the rules of inference are:

R1.  From α and α ⇒ β infer β
R2.  From α infer Kiα.
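A small sketch of how this system could be encoded for experimentation follows; the formula constructors and function names are our own assumptions, not part of the proof theory. The axiom schemas act as instance generators, and the two rules operate on a set of already-derived formulas.

```python
# Formulas as nested tuples; axiom schemas of K_n^{VI} as instance generators.
def Not(a): return ('not', a)
def Imp(a, b): return ('imp', a, b)
def And(a, b): return ('and', a, b)
def K(i, a): return ('K', i, a)
def I(i): return ('I', i)     # the introspection constant I_i
def V(i): return ('V', i)     # the veridicality constant V_i

def axiom_K(i, a, b):      # (K_i a ∧ K_i(a ⇒ b)) ⇒ K_i b
    return Imp(And(K(i, a), K(i, Imp(a, b))), K(i, b))

def axiom_T(i, a):         # V_i ⇒ (K_i a ⇒ a)
    return Imp(V(i), Imp(K(i, a), a))

def axiom_K45(i, a):       # I_i ⇒ (K_i I_i ∧ (K_i a ⇒ K_i K_i a) ∧ (¬K_i a ⇒ K_i ¬K_i a))
    return Imp(I(i), And(K(i, I(i)),
                         And(Imp(K(i, a), K(i, K(i, a))),
                             Imp(Not(K(i, a)), K(i, Not(K(i, a)))))))

def modus_ponens(derived):        # R1: from a and a ⇒ b infer b
    return {f[2] for f in derived if f[0] == 'imp' and f[1] in derived}

def necessitation(i, derived):    # R2: from a infer K_i a
    return {K(i, a) for a in derived}

p = ('atom', 'p')
print(axiom_T(1, p))   # ('imp', ('V', 1), ('imp', ('K', 1, ('atom', 'p')), ('atom', 'p')))
```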
Axioms P and K, together with the inference rules R1 and R2, constitute the well-known modal logic system Kn. Axioms T and K45 capture the aforementioned meanings of the constants Vi and Ii, respectively. For convenience, we denote this system by Kn^VI.

The two main results of our logic are the soundness and completeness of the proof system Kn^VI. Soundness is easy to check; nevertheless, we need Proposition 2 to prove the validity of Axiom T, as shown in the proof of Proposition 3. Completeness is proved by the standard techniques originally due to Kaplan [6], which establish a close correspondence between the axioms and a particular Kripke structure known as the canonical structure.

Theorem 1. For the language Ln(VI), the system Kn^VI is a sound and complete axiomatization with respect to the semantics presented in the previous section.

Proof. Soundness: the validity of axiom P and rule R1 follows immediately from the fact that the interpretation of ∧ and ¬ in the definition of ⊨N is the same as in propositional calculus. The validity of axioms K, T and K45 is simply Proposition 3. For rule R2, if (M, w) ⊨N α for all situations (M, w), then, for any fixed situation (M′, w′), it follows that (M, w) ⊨N α for all situations (M, w) ∈ Si(M′, w′); thus (M′, w′) ⊨N Kiα for all situations (M′, w′).

Completeness: it suffices to prove that every consistent formula is satisfied by some situation. We construct a special Kripke structure Mc, known as the canonical Kripke structure, as follows. Given a set w of formulas, define w/Ki = {ϕ : Kiϕ ∈ w}. Let Mc = (W, π, K1, …, Kn), where

W = {w : w is a maximal consistent set},
π(w)(p) = true if p ∈ w, and false if p ∉ w,
Ki = {(w, w′) : w/Ki ⊆ w′}.

We first show that, for each w ∈ W, Si(Mc, w) = {(Mc, w′) : w′ ∈ Ki(w)}, so that we have

(Mc, w) ⊨N Kiϕ iff (Mc, w′) ⊨N ϕ for all w′ ∈ Ki(w)   (⁎)

which we refer to as fact (⁎).

Firstly, if the logic style of data source i is K (i.e., π(w)(Ii) = false and π(w)(Vi) = false), then, by the definition of Si, this claim is trivially true. Secondly, if the logic style of data source i is T (i.e., π(w)(Ii) = false and π(w)(Vi) = true), then, by the definition,

Si(Mc, w) = {(Mc, w′) : w′ ∈ Ki(w)} ∪ {(Mc, w)}

But, by Axiom T and the maximality of w, we have w/Ki ⊆ w. It follows that w ∈ Ki(w), and hence

Si(Mc, w) = {(Mc, w′) : w′ ∈ Ki(w)}

Thirdly, suppose the logic style of data source i is K45, that is, π(w)(Ii) = true and π(w)(Vi) = false. By the definition, it suffices to show that

∀w′ ∈ Ki(w), (Mc[Ki(w′)/Ki(w), π(w′)(Ii)/true], w′) = (Mc, w′)

that is, Ki(w′) = Ki(w) and π(w′)(Ii) = true. By Axiom K45 and the fact that π(w)(Ii) = true (i.e. Ii ∈ w), we have KiIi ∈ w. Hence Ii ∈ w′ and therefore π(w′)(Ii) = true. To show Ki(w′) = Ki(w), given an arbitrary w″ ∈ W, we must prove that w/Ki ⊆ w″ iff w′/Ki ⊆ w″. Assuming w/Ki ⊆ w″ and ϕ ∈ w′/Ki, we want to show ϕ ∈ w″. If not, then Kiϕ ∉ w, and hence ¬Kiϕ ∈ w. Thus, by Axiom K45, we would have Ki¬Kiϕ ∈ w, whence ¬Kiϕ ∈ w′, contradicting the assumption that ϕ ∈ w′/Ki. On the other hand, assuming w′/Ki ⊆ w″ and ϕ ∈ w/Ki, we have Kiϕ ∈ w, and hence, by Axiom K45, KiKiϕ ∈ w. Thus Kiϕ ∈ w′ (by w/Ki ⊆ w′) and ϕ ∈ w″ (by w′/Ki ⊆ w″). Finally, for the case where the logic style of data source i is S5 (i.e., π(w)(Ii) = true and π(w)(Vi) = true), by the same argument as above, it suffices to show that w ∈ Ki(w), Ki(w′) = Ki(w) and π(w′)(Ii) = true. But, as we have shown, the first follows from π(w)(Vi) = true and Axiom T, and the last two from π(w)(Ii) = true and Axiom K45.

We now show, by induction on the structure of ϕ, that for every w we have

(Mc, w) ⊨N ϕ iff ϕ ∈ w   (⁎⁎)

More precisely, assuming that the claim holds for all sub-formulas of ϕ, we show that it also holds for ϕ. If ϕ is a primitive proposition p, this is immediate from the definition of π(w) above. The case where ϕ is a conjunction or a negation follows easily from the definition of ⊨N and basic properties of a maximal consistent set of formulas. Finally, suppose ϕ is of the form Kiψ. If Kiψ ∈ w, then ψ ∈ w/Ki. Thus, by the definition of Ki, for each w′ ∈ Ki(w), ψ ∈ w′, and hence, by the inductive hypothesis, (Mc, w′) ⊨N ψ. By fact (⁎), it follows that (Mc, w) ⊨N Kiψ. For the other direction, assuming (Mc, w) ⊨N Kiψ, we must prove Kiψ ∈ w. We first show that w/Ki ∪ {¬ψ} is inconsistent. If this is not the case, then it must have a maximal consistent extension w′, and by construction we have w′ ∈ Ki(w). By the inductive hypothesis, we would have (Mc, w′) ⊨N ¬ψ, and so (Mc, w) ⊭N Kiψ by fact (⁎), contradicting our previous assumption. Since w/Ki ∪ {¬ψ} is inconsistent, some finite subset, say {ϕ1, …, ϕk, ¬ψ}, must also be inconsistent. Thus, by propositional reasoning, we have

⊢ ϕ1 ⇒ (ϕ2 ⇒ (… (ϕk ⇒ ψ) …))

and, by R2,

⊢ Ki(ϕ1 ⇒ (ϕ2 ⇒ (… (ϕk ⇒ ψ) …)))

Therefore, by iterated applications of Axiom K and propositional reasoning, we get

Kiϕ1, Kiϕ2, …, Kiϕk ⊢ Kiψ
Since ϕ1, …, ϕk ∈ w/Ki, we must have Kiϕ1, …, Kiϕk ∈ w. Since w is a maximally consistent set of formulas, it is closed under ⊢. Therefore, Kiψ ∈ w. □

5. Circumscription

Viewing each modal operator Ti (see Section 2) as a predicate, we may circumscribe Ti in a theory T in order to enhance the mechanism of nonmonotonic reasoning, in the same way as in the circumscription of [12]. The intuition behind this circumscription is that, if T holds, then as little as possible has been collected from data source i; in other words, what has been collected from data source i is, in some sense, derivable from T. Similar to Reiter's default logic [14], which has been widely investigated in the AI community [15], we capture this intuition by the default theory (T, Di), where Di consists of the free normal defaults

: ¬Tiϕ / ¬Tiϕ

for arbitrary formulas ϕ. Interestingly, we can obtain the notion of circumscribing external knowledge about some subject simply by restricting the formulas ϕ in the defaults : ¬Tiϕ / ¬Tiϕ to those that concern the subject. This notion appears useful, because we are often interested only in what has been collected from a data source about a specific subject, rather than in all that has been collected from it.

The semantic counterpart of the above is as follows. Just as for the HM notion of only knowing [4], we need some appropriate notion of possibility. Here we adopt the ω-tree ([2,3,17]). Let TM,w be the ω-tree corresponding to the situation (M, w), and let Possi(M, w) be the set {TM,w′ | w′ ∈ Ki(w)}. We have the following:

Theorem 2. For each situation (M, w) and each theory T of Ln^T(VI), there is an extension E of the default theory (T, Di) such that (M, w) ⊨N E, iff (M, w) ⊨N T and, for all (M′, w′) with (M′, w′) ⊨N T, Possi(M, w) is not a proper subset of Possi(M′, w′).

We note that it would be unreasonable to do exactly the same for the knowledge modalities Ki. In the presence of introspection, ¬Ki¬Kip is equivalent to Kip. Thus the default : ¬Ki¬Kip / ¬Ki¬Kip essentially says that data source i knows p whenever possible, contradicting the intuition of circumscribing the knowledge of data source i. We should thus restrict our attention to those formulas that are, in some sense, objective for data source i, and the resulting notion of only knowing is essentially only knowing about objective formulas. Nevertheless, in the presence of veridicality, it is rather subtle for standard modal logics to single out such formulas, and some additional, more complicated operators, such as the Qiξ operators in [3], may be needed.
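As a rough illustration of circumscribing what has been collected, the following toy Python sketch treats each Tiϕ as an independent propositional atom, which is only a crude approximation of the modal setting; under that assumption the defaults are prerequisite-free normal defaults whose consequents are independent literals, so the unique extension can be computed greedily. All names are our own.

```python
# Toy propositional circumscription of "collected" facts via free normal defaults.
def consistent(literals):
    # literals: set of (atom, polarity); consistent iff no atom has both polarities.
    atoms = {a for a, _ in literals}
    return all(not ((a, True) in literals and (a, False) in literals) for a in atoms)

def circumscribe_collected(T, candidate_phis, i):
    """T: set of literals over atoms like ('T', i, phi). Add the default
    conclusion not-T_i(phi) for every phi whose denial is consistent with T."""
    E = set(T)
    for phi in candidate_phis:
        lit = (('T', i, phi), False)          # the default's conclusion: ¬ T_i phi
        if consistent(E | {lit}):
            E.add(lit)
        # otherwise T already forces T_i phi: phi really was collected
    return E

# Usage: the theory says rule r1 was collected from source 2; r2 and r3 were not mentioned.
T = {(('T', 2, 'r1'), True)}
E = circumscribe_collected(T, ['r1', 'r2', 'r3'], 2)
print(sorted(E, key=str))   # ¬T_2(r2) and ¬T_2(r3) are added; T_2(r1) is kept as is
```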
Interestingly, our methodology leads naturally to a notion of only knowing about some subject, obtained by limiting the formulas ϕ in the defaults : ¬Kiϕ / ¬Kiϕ to those that concern the subject. This is beneficial, since we are usually more interested in what a data source knows about a particular subject. This approach to only-knowing-about differs from that in [8], where two special kinds of subject are discussed for the logic KD45; our approach, using default theories, can uniformly address arbitrary and interesting subjects, such as those regarding some data sources' knowledge about the actual world, provided that it is clear which formulas concern those subjects. Our notion of only-knowing-about, moreover, is given as a special case of circumscribing external knowledge.

Assuming O⁎iα denotes that α is all data source i knows, we define it semantically as follows. Given a situation (M, w), (M, w) ⊨N O⁎iα iff (M, w) ⊨N Kiα and, for each (M′, w′) with (M′, w′) ⊨N Kiα, Possi(M, w) is not a proper subset of Possi(M′, w′). Thus, by Theorem 2, (M, w) ⊨N O⁎iα iff (M, w) ⊨N E for some extension E of the default theory (Kiα, Di).

It is worth pointing out that O⁎iα is satisfiable for arbitrary α. Our notion thus differs from that in [10], which is closely related to the notion of a stable expansion in [13]. As shown by Example 5.4 of [10], it is impossible there to know only the fact of knowing some falsifiable objective sentence, though it is possible to only know any objective sentence. Despite the reasons demonstrated in [10], this seems to contradict the intuition that if knowing one thing is logically equivalent to knowing another, then only knowing one of them should be equivalent to only knowing the other.

We say that T is i-determinate if the default theory (T, Di) has a unique extension, and that a formula α is honest if Kiα is i-determinate. In comparison with the HM notion of only knowing [4], we also characterize the notion of honesty for particular logics as follows. To present the notion of S5n-i-honesty, we limit our attention to so-called S5-situations, where for all worlds w′ and all data sources j, π(w′)(Ij) = true and π(w′)(Vj) = true. We say that α is S5n-i-honest if there is an S5-situation (M, w), called an S5-i-maximum situation for α, such that (M, w) ⊨N Kiα and, for each S5-situation (M′, w′) with (M′, w′) ⊨N Kiα, we have Possi(M′, w′) ⊆ Possi(M, w). Similar notions can be obtained for the logics Kn, Tn and K45n.

The above notions are reasonable; in fact, we can prove that for Kn, Tn and K45n, and also for S5n, they coincide with those in [3]. The essential point of our approach to only-knowing-about is that the
circumscription of knowledge usually results from that of external knowledge: the less that is collected, the less is known. This only-knowing-about approach has many potential applications in non-monotonic reasoning environments, and can be used as a qualifier in answering systems.

6. Evaluation

As we have seen, the above framework provides a formal description for considering external knowledge under two epistemic properties: introspection and veridicality. It can be taken as a basis for K3D. This section evaluates the logic framework.

6.1. Examples

Our logic framework is taken as the first step in mining multiple data sources. Its use is illustrated as follows. Let DS1 = {a1 → b1} be an internal data source, and let DS2 = {a2 → b2, a3 → b3, a4 → b4} and DS3 = {a2 → b2, a5 → b4} be two external data sources. Then DS1 can form its own knowledge set, Ruleset1, by collecting quality knowledge from DS2 and DS3.

(1) When I1 holds, DS1 has the ability to select the true rules in DS2 ∪ DS3 and add them to Ruleset1. When I1 does not hold, Ruleset1 is formed depending on K1(I2), K1(I3), K1(V2) and K1(V3).
(2) When V1 does not hold, Ruleset1 = ∅. When V1 holds, Ruleset1 is formed depending on K1(I2), K1(I3), K1(V2) and K1(V3).
(3) When K1(I2) holds, Ruleset1 = DS1 ∪ DS2. When K1(I3) holds, Ruleset1 = DS1 ∪ DS3. When both K1(I2) and K1(I3) hold, Ruleset1 = DS1 ∪ DS2 ∪ DS3.
(4) When K1(V2) holds, the true rules in DS2 are added to Ruleset1. When K1(V3) holds, the true rules in DS3 are added to Ruleset1.
(5) When both K1(V2) and K1(V3) hold, the true rules in DS2 ∪ DS3 are added to Ruleset1. When K1(V2) or K1(V3) holds, if the rule a2 → b2 in DS2 ∩ DS3 is true, then a2 → b2 is added to Ruleset1.

In the above examples, K1(Ii) means that data source DS1 believes that data source DSi has the introspection ability, and K1(Vi) means that data source DS1 believes that data source DSi is veridical. The values of I1 and V1 are determined by the domain knowledge in DS1, whereas the values of K1(I2), K1(I3), K1(V2) and K1(V3) are determined by both domain knowledge and experience knowledge in DS1. Domain knowledge can be a set of constraints, for example, 'the salary of a regular employee is not over $10,000.00 per week'. Experience knowledge can be a set of rules extracted from historical data in data sources, for example, customers who 'think that the supermarket Safeway is credible'.
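A minimal Python sketch of cases (3)-(5) follows; it assumes that I1 holds, so that DS1 can filter rules, and the is_true test is a stand-in for DS1's domain-knowledge check. All names are our illustrative assumptions.

```python
# Assembling Ruleset1 from external sources, driven by DS1's beliefs
# K1(I_j) (source j is introspective) and K1(V_j) (source j is veridical).
def collect_rules(internal, externals, believes_I, believes_V, is_true):
    """internal: set of DS1's own rules; externals: {name: set_of_rules};
    believes_I / believes_V: {name: bool}; is_true: rule -> bool."""
    ruleset = set(internal)
    for name, rules in externals.items():
        if believes_I.get(name, False):
            # Case (3): DS_j believed introspective, so its rule set is inherited wholesale.
            ruleset |= rules
        elif believes_V.get(name, False):
            # Cases (4)-(5): DS_j believed veridical, so only rules DS1 verifies as true are added.
            ruleset |= {r for r in rules if is_true(r)}
        # Otherwise the rules of DS_j are left out (they would need preprocessing first).
    return ruleset

DS1 = {('a1', 'b1')}
DS2 = {('a2', 'b2'), ('a3', 'b3'), ('a4', 'b4')}
DS3 = {('a2', 'b2'), ('a5', 'b4')}
print(collect_rules(DS1, {'DS2': DS2, 'DS3': DS3},
                    believes_I={'DS2': True}, believes_V={'DS3': True},
                    is_true=lambda r: r == ('a2', 'b2')))
```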
6.2. Experiments

To evaluate the effectiveness of the proposed framework, we have carried out experiments using Java on a DELL machine. Our experiments are designed to test the proposed approach with respect to the veridicality of data sources.

[Fig. 1. Success ratios of TD and NOTD: success ratio (%) against the number of data sources.]
To obtain a group of data sources that are relevant and not contradictory to a given dataset, we vertically partition a transaction database into a number of subsets, each containing a certain number of attributes. We also modify some of the data in some subsets so that the modified subsets are contradictory to the given dataset; we do this to test inconsistencies among data sources. In these experiments, multiple subsets were generated using databases from the Synthetic Classification Data Sets on the Internet (http://www.kdnuggets.com/).

We carried out two sets of experiments on four classes of applications from a data source DS. One set assumes that DS uses rules only from trustworthy data sources; this set we call TD. The other we call NOTD, in which case DS randomly borrows external rules from other data sources. We select 10 data sources, each with a set of rules. Eight are trustworthy, and two contain rules contradictory to DS, where those rules always cause failed applications. Note that association rules (including negative association rules) from a dataset are taken as patterns of the dataset for checking contradictions. A negative association rule (see Ref. [21]) A ⇒ ¬B is defined by:

(1) A ∩ B = ∅;
(2) supp(A) ≥ minsupp, supp(B) ≥ minsupp, and supp(A ∪ ¬B) ≥ minsupp;
(3) supp(A ∪ ¬B) / supp(A) ≥ minconf,

where minsupp and minconf are the minimum support and minimum confidence given by the user.

Each of the four classes of applications consists of ten reasoning tasks. The first class of applications requires rules from 2 data sources, the second from 3 data sources, the third from 5 data sources, and the fourth from 6 data sources. The success ratios of TD and NOTD are depicted in Fig. 1. The TD model received a 100% success ratio because (1) the proposed technique has been utilized and (2) the given reasoning tasks can be finished by calling in trustworthy data sources. The NOTD model obtained a lower success ratio, decreasing with the number of fraudulent rules used from untrustworthy data sources.
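For concreteness, a small Python sketch follows that checks the three conditions above for a candidate rule A ⇒ ¬B over a list of transactions; the function names, example data and thresholds are our own illustrative assumptions.

```python
# Check whether A => not-B qualifies as a negative association rule.
def supp(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def supp_A_notB(A, B, transactions):
    # support of A ∪ ¬B: transactions containing all of A and none of B
    return sum(1 for t in transactions if A <= t and not (B & t)) / len(transactions)

def is_negative_rule(A, B, transactions, minsupp, minconf):
    if A & B:                                               # condition (1): A ∩ B = ∅
        return False
    sA, sB = supp(A, transactions), supp(B, transactions)
    sAnB = supp_A_notB(A, B, transactions)
    if sA < minsupp or sB < minsupp or sAnB < minsupp:      # condition (2): support thresholds
        return False
    return sAnB / sA >= minconf                             # condition (3): confidence threshold

transactions = [{'a', 'c'}, {'a', 'c'}, {'a', 'b'}, {'b', 'c'}, {'a', 'c', 'd'}]
print(is_negative_rule({'a'}, {'b'}, transactions, minsupp=0.2, minconf=0.6))  # True
```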
7. Conclusions

Existing mining techniques may not always be helpful for identifying patterns from different data sources, because knowledge from external data sources may be untrustworthy, or even conflicting, and can disguise realistic patterns that might be useful to real-world applications. For this reason, we have proposed a framework for K3D, which aims to identify trustworthy knowledge in veridical data sources. The proposed approach differs from traditional data mining techniques in that (1) we distinguish internal knowledge from external knowledge; (2) we have designed a logic framework for identifying trustworthy knowledge from various data sources; and (3) untrustworthy and fraudulent knowledge is eliminated by veridicality and introspection analysis. Future work will involve the design of logic frameworks for resolving conflict, as we move towards the development of an agreed ontology for mining multiple data sources.

References

[1] J. Aronis, et al., The WoRLD: knowledge discovery from multiple distributed databases, Proceedings of the 10th International Florida AI Research Symposium, 1997, pp. 337–341.
[2] J.Y. Halpern, Reasoning about only knowing with many agents, Proc. of AAAI'93, 1993, pp. 655–661.
[3] J.Y. Halpern, A theory of knowledge and ignorance for many agents, Journal of Logic and Computation 7 (1997) 79–108.
[4] J.Y. Halpern, Y. Moses, Towards a theory of knowledge and ignorance, Proc. of the AAAI Workshop on Non-Monotonic Logic, 1984, pp. 125–143.
[5] J. Halpern, Y. Moses, A guide to completeness and complexity for modal logics of knowledge and belief, Artificial Intelligence 54 (1992) 319–379.
[6] D. Kaplan, Review of "A semantical analysis of modal logic. I: normal modal propositional calculi", Journal of Symbolic Logic 31 (1966) 120–122.
[7] S. Kripke, A semantical analysis of modal logic. I: Normal modal propositional calculi, Zeitschrift für mathematische Logik und Grundlagen der Mathematik 9 (1963) 67–96.
[8] G. Lakemeyer, All they know about, Proc. of AAAI'93, 1993, pp. 662–667.
[9] V. Lesser, B. Horling, F. Klassner, A. Raja, T. Wagner, S. Zhang, BIG: an agent for resource-bounded information gathering and decision making, Artificial Intelligence 118 (1–2) (2000) 197–244.
[10] H.J. Levesque, All I know: a study in autoepistemic logic, Artificial Intelligence 42 (1990) 263–309.
[11] H. Liu, H. Lu, J. Yao, Identifying relevant databases for multidatabase mining, Proceedings of PAKDD'98, 1998, pp. 210–221.
[12] J. McCarthy, Circumscription: a form of nonmonotonic reasoning, Artificial Intelligence 13 (1980) 27–39.
[13] R.C. Moore, Semantical considerations on nonmonotonic logic, Artificial Intelligence 25 (1985) 75–94.
[14] R. Reiter, A logic for default reasoning, Artificial Intelligence 13 (1980) 81–132.
[15] K. Su, W. Li, Computation of seminormal default theories, Fundamenta Informaticae 40 (1999) 79–102.
[16] K. Su, X. Luo, H. Wang, C. Zhang, S. Zhang, Q. Chen, A logical framework for knowledge sharing in multi-agent systems, Proceedings of COCOON'01, August 2001.
[17] K. Su, H. Chen, D. Ding, Two alternative notions of possibility, Journal of Logic and Computation 10 (2) (2000).
[18] R. Turner, Truth and Modality for Knowledge Representation, Pitman Publishing, London, 1990.
[19] M. Wooldridge, The logical modelling of computational multi-agent systems, Ph.D. thesis, University of Manchester, 1992.
[20] X. Wu, S. Zhang, Synthesizing high-frequency rules from different data sources, IEEE Transactions on Knowledge and Data Engineering 15 (2) (2003) 353–367.
[21] X. Wu, C. Zhang, S. Zhang, Mining both positive and negative association rules, Proceedings of the 19th International Conference on Machine Learning, Sydney, Australia, July 2002, pp. 658–665.
[22] X. Wu, C. Zhang, S. Zhang, Database classification for multi-database mining, Information Systems 30 (2005) 71–88.
[23] S. Zhang, X. Wu, C. Zhang, Multi-database mining, IEEE Computational Intelligence Bulletin 2 (1) (2003) 5–13.
[24] S. Zhang, C. Zhang, X. Wu, Knowledge Discovery in Multiple Databases, Springer, 2004, ISBN 1-85233-703-6.
[25] S. Zhang, C. Zhang, Q. Yang, Information enhancement for data mining, IEEE Intelligent Systems (March/April 2004).
[26] N. Zhong, Y. Yao, S. Ohsuga, Peculiarity oriented multi-database mining, Principles of Data Mining and Knowledge Discovery, 1999, pp. 136–146.

Kaile Su is a professor at Sun Yat-sen University and a Research Fellow at Griffith University. He received his PhD degree in computer science from Nanjing University in 1995. He was a Visiting Research Fellow at the University of NSW from 2001 to 2002. His research interests include logic-based knowledge representation, verification of security protocols, model checking, multi-agent systems, and modal logic. He has published 68 papers, including full papers in the top international conferences AAAI-04, AAAI-05, KR-04, and AAMAS-05, and papers in journals such as Information and Computation, Journal of Logic and Computation and Fundamenta Informaticae. He (with Meyden and Engelhardt) won the best paper award at AiML 2002.
Hui-jing Huang received his BS degree from Peking University in 1996 and an MS degree in 2005 from the Institute of Computing Technology, Chinese Academy of Sciences. He is currently an engineer and a project manager at the Bureau of Personnel and Education, Chinese Academy of Sciences, Beijing, China. His research interests include network administration, data mining, ERP and software engineering.
Xindong Wu is a Professor and the Chair of the Department of Computer Science at the University of Vermont. He holds a PhD in Artificial Intelligence from the University of Edinburgh, Britain. His research interests include data mining, knowledge-based systems, and Web information exploration. He has published extensively in these areas in various journals and conferences, including IEEE TKDE, TPAMI, ACM TOIS, IJCAI, AAAI, ICML, KDD, ICDM, and WWW, as well as 12 books and conference proceedings. Dr. Wu is the Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering (by the IEEE Computer Society), the founder and current Steering Committee Chair of the IEEE International Conference on Data Mining (ICDM), an Honorary Editor-in-Chief of Knowledge and Information Systems (by Springer), and a Series Editor of the Springer Book Series on Advanced Information and Knowledge Processing (AI&KP). He is the 2004 ACM SIGKDD Service Award winner.
Shichao Zhang is a senior research fellow in the Faculty of Information Technology at the University of Technology, Sydney, and a professor at the Guangxi Normal University. He received his PhD degree in computer science from Deakin University, Australia. His research interests include data analysis and smart pattern discovery. He has published over 30 international journal papers (including 6 in IEEE/ACM Transactions, 2 in Information Systems, 6 in IEEE magazines) and over 30 international conference papers (including 2 ICML papers and 3 FUZZ-IEEE/AAMAS papers). He has won 4 China NSF/863 grants, 3 Australian large ARC grants and 2 Australian small ARC grants. He is a senior member of the IEEE, a member of the ACM, and serving as an associate editor for Knowledge and Information Systems and The IEEE Intelligent Informatics Bulletin.