Database Repair by Signed Formulae

Ofer Arieli¹, Marc Denecker², Bert Van Nuffelen², and Maurice Bruynooghe²

¹ Department of Computer Science, The Academic College of Tel-Aviv, Antokolski 4, Tel-Aviv 61161, Israel [email protected]
² Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, B-3001 Heverlee, Belgium {marcd,bertv,maurice}@cs.kuleuven.ac.be

Abstract. We introduce a simple and practically efficient method for repairing inconsistent databases. The idea is to properly represent the underlying problem, and then use off-the-shelf applications for efficiently computing the corresponding solutions. Given a possibly inconsistent database, we represent the possible ways to restore its consistency in terms of signed formulae. Then we show how the ‘signed theory’ that is obtained can be used by a variety of computational models for processing quantified Boolean formulae, or by constraint logic program solvers, in order to rapidly and efficiently compute desired solutions, i.e., consistent repairs of the database.

1 Introduction

In this paper we consider a uniform representation of repairs of inconsistent relational databases, that is, a general description of how to restore the consistency of database instances that do not satisfy a given set of integrity constraints. We then show how this description can be used by a variety of computational methodologies for efficiently computing database repairs, i.e., new consistent database instances that differ from the original database instance by a minimal set of changes (with respect to set inclusion or set cardinality).

Reasoning with inconsistent databases has been extensively studied in the last few years, especially in the context of integrating (possibly contradicting) independent data-sources.¹ In this paper we introduce a novel representation of the repair problem as a theory that consists of what we call signed formulae. Then we illustrate how off-the-shelf computational systems can use the theory to solve the problem, i.e., to compute repairs of the database. Here we apply two types of tools for repairing a database:

– We show that the problem of finding repairs with minimal cardinality for a given database can be converted to the problem of finding minimal Herbrand models for the corresponding ‘signed theory’. Thus, once the process for consistency restoration of the database has been represented by a signed theory (using a polynomial transformation), tools for minimal model computations (such as the Sicstus Prolog constraint solver [12], or the answer set programming solver dlv [15]) can be used to efficiently find the required repairs.

– For finding repairs that are minimal with respect to set inclusion, satisfiability solvers on appropriate quantified Boolean formulae (QBF) can be utilized. Again, we provide a polynomial-time transformation to (signed) QBF theories, and show how QBF solvers [5, 11, 16–18, 21, 26] can be used to restore the database consistency.

The rest of the paper is organized as follows: In the next section we formally define the underlying problem and in Section 3 we show how to represent it by signed formulae. In Sections 4 and 5 we show how constraint solvers for logic programs and quantified Boolean formulae can be utilized for computing database repairs based on the signed theories. In Section 6 we present some experimental results, and in Section 7 we conclude with some further remarks and observations.

¹ See, e.g., [1, 4, 9, 10, 13, 14, 19, 20, 23] for more details on reasoning with inconsistent databases and further references to related works.

2 Database Repairs

Let $L$ be a first-order language, based on a fixed database schema $S$ and a fixed domain $\mathbb{D}$. Every element of $\mathbb{D}$ has a unique name. A database instance $D$ consists of atoms in the language $L$ that are instances of the schema $S$. As such, every database instance $D$ has a finite active domain, $A(D)$, which is a subset of $\mathbb{D}$. A database is a pair $(D, IC)$, where $D$ is a database instance, and $IC$, the set of integrity constraints, is a finite and classically consistent set of formulae in $L$. Given a database $DB = (D, IC)$, we apply to it the closed world assumption, so only the facts that are explicitly mentioned in $D$ are considered true. The underlying semantics of a database $(D, IC)$ corresponds, therefore, to the least Herbrand model of $D$ (notation: $H^D$), i.e., the model of $D$ that assigns true to all the ground instances of atomic formulae in $D$, and assigns false to all the other atoms.

Given a database $DB = (D, IC)$, let

$$DB^A = D \cup IC^A = D \cup \{\rho(\psi) \mid \psi \in IC,\ \rho : var(\psi) \to A(D)\},$$

where $\rho$ is a ground substitution of variables to the individuals of $A(D)$, the active domain of $D$.² $DB^A$ is called the Herbrand expansion of $DB$. As $D$, $IC$, and $A(D)$ are all finite sets, $DB^A$ is also finite, and so $\Sigma_{DB} = \{p_1, p_2, \ldots, p_n\}$, the set of the (ground) atomic formulae that appear in $DB^A$, is finite as well. In what follows we shall assume that the databases are grounded w.r.t. their active domains, therefore we shall omit the superscripts of $IC^A$ and $DB^A$.

² Thus, e.g., $\rho(\forall x\, \psi(x)) = \psi(p_1) \wedge \ldots \wedge \psi(p_n)$ and $\rho(\exists x\, \psi(x)) = \psi(p_1) \vee \ldots \vee \psi(p_n)$, where $p_1, \ldots, p_n$ are the elements of $A(D)$; the transformation for other formulae is standard.

We say that a formula $\psi$ follows from a database instance $D$ (notation: $D \models \psi$) if the minimal Herbrand model of $D$ is also a model of $\psi$. A database $DB = (D, IC)$ is consistent if every formula in $IC$ follows from $D$ (notation: $D \models IC$).³ Given a possibly inconsistent database, our goal is to restore its consistency, i.e., to ‘repair’ the database:

Definition 2.1. An update of a database $DB = (D, IC)$ is a pair $(Insert, Retract)$, s.t. $Insert \cap D = \emptyset$ and $Retract \subseteq D$.⁴ A repair of $DB$ is an update of $DB$, for which $(D \cup Insert \setminus Retract, IC)$ is a consistent database.

Intuitively, a database is updated by inserting the elements of $Insert$ and removing the elements of $Retract$. An update is a repair when the resulting database is consistent. Note that if $DB$ is consistent, then $(\emptyset, \emptyset)$ is a repair of $DB$.

Example 2.1. Let $DB = (\{P(a)\}, \{\forall x (P(x) \to Q(x))\})$. Clearly, this database is not consistent. The Herbrand expansion of $DB$ is $(\{P(a)\}, \{P(a) \to Q(a)\})$, and it has three repairs, namely $R_1 = (\{\}, \{P(a)\})$, $R_2 = (\{Q(a)\}, \{\})$, and $R_3 = (\{Q(a)\}, \{P(a)\})$ that correspond, respectively, to removing $P(a)$ from the database, inserting $Q(a)$ into the database, and performing both actions simultaneously. Note that as the underlying semantics is determined by Herbrand interpretations, the Domain Closure Assumption⁵ is implicit here, and should be regarded as another constraint that should be satisfied by every repair. Therefore, e.g., $(\{Q(b)\}, \{P(a)\})$ is not a repair of $DB$ in this case, for any $b \neq a$. Another implicit assumption, induced by the use of Herbrand semantics, is that Clark’s equality axioms are satisfied, and so the elements of $\Sigma_{DB}$ are all different.
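Example 2.1 can be checked mechanically. The following Python fragment is only an illustrative sketch, not part of the method of this paper: it enumerates every update of the grounded database by brute force and keeps those that yield a consistent instance. The names `satisfies_ic`, `subsets`, and `repairs` are hypothetical helpers introduced for this illustration, and the grounded constraint is hard-coded.

```python
from itertools import chain, combinations

# Ground database of Example 2.1: D = {P(a)}, IC = {P(a) -> Q(a)}.
D = frozenset({"P(a)"})
ATOMS = ["P(a)", "Q(a)"]          # ground atoms of the Herbrand expansion


def satisfies_ic(instance):
    """Closed world check of the grounded constraint P(a) -> Q(a)."""
    return ("P(a)" not in instance) or ("Q(a)" in instance)


def subsets(items):
    """All subsets of the given items, as tuples."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1))


def repairs(d, atoms):
    """All updates (Insert, Retract) whose result satisfies the constraint."""
    found = []
    for insert in subsets(a for a in atoms if a not in d):
        for retract in subsets(a for a in atoms if a in d):
            updated = (d | set(insert)) - set(retract)
            if satisfies_ic(updated):
                found.append((frozenset(insert), frozenset(retract)))
    return found


all_repairs = repairs(D, ATOMS)
```

Under these assumptions `all_repairs` contains exactly the three pairs $R_1$, $R_2$, and $R_3$ of the example. The enumeration is exponential in the number of ground atoms, so it serves only to make the definition concrete.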
As the example above shows, there are many ways to repair a given database, and some of them may not be very natural or sensible. It is usual, therefore, to specify some preference criterion on the possible repairs, and to apply only those that are (most) preferred with respect to the underlying criterion. The most common criteria for preferring a repair $(Insert, Retract)$ over a repair $(Insert', Retract')$ are set inclusion [1, 4, 9, 10, 14, 19, 20], i.e.,

$$(Insert, Retract) \leq_i (Insert', Retract') \quad \text{if} \quad Insert \cup Retract \subseteq Insert' \cup Retract',$$

or minimal cardinality [4, 13, 23], i.e.,

$$(Insert, Retract) \leq_c (Insert', Retract') \quad \text{if} \quad |Insert| + |Retract| \leq |Insert'| + |Retract'|.$$

Both criteria above reflect the intuitive feeling that a ‘natural’ way to repair an inconsistent database should require some minimal amount of changes, so that the recovered data is kept ‘as close as possible’ to the original one. According to this view, for instance, each one of the repairs $R_1$ and $R_2$ of Example 2.1 is strictly better than $R_3$. Note also that $(\emptyset, \emptyset)$ is the only $\leq_i$-preferred and $\leq_c$-preferred repair of consistent databases, as expected.

³ That is, there is no integrity constraint that is violated in $D$.
⁴ Note that these two conditions imply that $Insert \cap Retract = \emptyset$.
⁵ Namely, that the domain of every variable is in the set $\Sigma_{DB}$ of the ground atoms that appear in $DB$.
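The two preference criteria can be sketched as simple filters over a given list of repairs. The Python fragment below is an illustration under the assumption that all repairs are already available as `(Insert, Retract)` pairs; the function names `changes`, `inclusion_preferred`, and `cardinality_preferred` are hypothetical and introduced only here.

```python
# The three repairs of Example 2.1, written as (Insert, Retract) pairs.
R1 = (frozenset(), frozenset({"P(a)"}))
R2 = (frozenset({"Q(a)"}), frozenset())
R3 = (frozenset({"Q(a)"}), frozenset({"P(a)"}))
REPAIRS = [R1, R2, R3]


def changes(repair):
    """The set Insert ∪ Retract of atoms whose status is changed."""
    insert, retract = repair
    return insert | retract


def inclusion_preferred(repairs):
    """Repairs whose change set has no proper subset among the other repairs."""
    return [r for r in repairs
            if not any(changes(o) < changes(r) for o in repairs)]


def cardinality_preferred(repairs):
    """Repairs with the smallest number of inserted plus retracted atoms."""
    # Insert and Retract are disjoint, so |Insert| + |Retract| = |changes|.
    best = min(len(changes(r)) for r in repairs)
    return [r for r in repairs if len(changes(r)) == best]
```

For Example 2.1 both filters keep $R_1$ and $R_2$ and discard $R_3$. The criteria differ in general: a repair that changes one atom and a repair that changes two unrelated atoms are both minimal with respect to set inclusion, but only the first has minimal cardinality.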

3 Representation of Repairs by Signed Formulae

In what follows we represent (preferred) repairs in terms of what we call ‘signed formulae’. Then we incorporate corresponding solvers in order to compute the repairs. For every (ground) atom $p \in \Sigma_{DB}$ we introduce a new atom, $s_p$, intuitively understood as ‘switch $p$’, or ‘change the status of $p$’, that is, $s_p$ holds iff $p \in Insert \cup Retract$. For every integrity constraint $\psi \in IC$ we define a new formula, $\overline{\psi}$, obtained from $\psi$ by simultaneously substituting every occurrence of an atom $p$ by a corresponding expression $\tau_p$ that is defined as follows:

$$\tau_p = \begin{cases} \neg s_p & \text{if } p \in D, \\ s_p & \text{otherwise.} \end{cases}$$

The formula $\overline{\psi} = \psi[\tau_{p_1}/p_1, \ldots, \tau_{p_m}/p_m]$ (i.e., the simultaneous substitution in $\psi$ of all the atomic formulae $p_i$, $1 \leq i \leq m$, by their ‘signed expressions’ $\tau_{p_i}$) is called the signed formula that is obtained from $\psi$.

Given a repair $R = (Insert, Retract)$ of a database $DB$, define a valuation $\nu^R$ on $\{s_p \mid p \in \Sigma_{DB}\}$ as follows: $\nu^R(s_p) = t$ iff $p \in Insert \cup Retract$. $\nu^R$ is called the valuation that is associated with $R$. Conversely, a valuation $\nu$ on $\{s_p \mid p \in \Sigma_{DB}\}$ induces a database update $R^\nu = (Insert, Retract)$, where $Insert = \{p \notin D \mid \nu(s_p) = t\}$ and $Retract = \{p \in D \mid \nu(s_p) = t\}$.⁶ Obviously, these mappings are the inverse of each other.

Example 3.1. Let $DB = (\{p\}, \{p \to q\})$ be a ground representation of the database considered in Example 2.1. In this case, the signed formula of $\psi = p \to q$ is $\overline{\psi} = \neg s_p \to s_q$, or, equivalently, $s_p \vee s_q$. Intuitively, this formula indicates that in order to restore the consistency of $DB$, at least one of $p$ or $q$ should be ‘switched’, i.e., either $p$ should be removed from the database or $q$ should be inserted into it. Indeed, the three classical models of $\overline{\psi}$ are exactly the three valuations on $\{s_p, s_q\}$ that are associated with the three repairs of $DB$ (see Example 2.1). The next theorem shows that this is not a coincidence.

⁶ Clearly, $R^\nu$ is an update of $DB$, but it is not necessarily a repair of $DB$ (see Definition 2.1).
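The substitution that produces a signed formula is easy to sketch on a small formula tree. In the Python fragment below, formulae are nested tuples such as `("imp", ("atom", "p"), ("atom", "q"))`; this representation, the `s_` prefix for switch atoms, and the helper names `sign`, `evaluate`, and `models` are assumptions made for the illustration only.

```python
from itertools import product


def sign(formula, d):
    """Replace each atom p by tau_p: ('not', s_p) if p is in d, else s_p."""
    op = formula[0]
    if op == "atom":
        switch = ("atom", "s_" + formula[1])
        return ("not", switch) if formula[1] in d else switch
    return (op,) + tuple(sign(sub, d) for sub in formula[1:])


def evaluate(formula, valuation):
    """Classical two-valued evaluation of a formula tree."""
    op = formula[0]
    if op == "atom":
        return valuation[formula[1]]
    if op == "not":
        return not evaluate(formula[1], valuation)
    left, right = (evaluate(sub, valuation) for sub in formula[1:])
    if op == "and":
        return left and right
    if op == "or":
        return left or right
    if op == "imp":
        return (not left) or right
    raise ValueError("unknown connective: " + op)


def models(formula, atoms):
    """All valuations over the given atoms that satisfy the formula."""
    return [dict(zip(atoms, bits))
            for bits in product([False, True], repeat=len(atoms))
            if evaluate(formula, dict(zip(atoms, bits)))]


# Example 3.1: D = {p}, constraint p -> q.
psi = ("imp", ("atom", "p"), ("atom", "q"))
signed_psi = sign(psi, {"p"})
signed_models = models(signed_psi, ["s_p", "s_q"])
```

Here `signed_psi` is the tree for $\neg s_p \to s_q$, and `signed_models` lists the three valuations in which at least one of $s_p$, $s_q$ is true.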


Theorem 3.1. Let $DB = (D, IC)$ be a database. Denote: $\overline{IC} = \{\overline{\psi} \mid \psi \in IC\}$.
a) If $R$ is a repair of $DB$ then $\nu^R$ is a model of $\overline{IC}$,
b) if $\nu$ is a model of $\overline{IC}$ then $R^\nu$ is a repair of $DB$.

Proof. For (a), suppose that $R$ is a repair of $DB = (D, IC)$. Then, in particular, $D_R \models IC$, where $D_R = D \cup Insert \setminus Retract$. Let $H^{D_R}$ be the least Herbrand model of $D_R$, and let $\psi \in IC$. Then $H^{D_R}(\psi) = t$, and so it remains to show that $\nu^R(\overline{\psi}) = H^{D_R}(\psi)$. The proof of this is by induction on the structure of $\psi$, and we show only the base step (the rest is trivial), i.e., for every $p \in \Sigma_{DB}$, $\nu^R(\overline{p}) = H^{D_R}(p)$. Indeed,

– $p \in D \setminus Retract \Rightarrow p \in D_R \Rightarrow \nu^R(\overline{p}) = \nu^R(\neg s_p) = \neg \nu^R(s_p) = \neg f = t = H^{D_R}(p)$.
– $p \in Retract \Rightarrow p \in D \setminus D_R \Rightarrow \nu^R(\overline{p}) = \nu^R(\neg s_p) = \neg \nu^R(s_p) = \neg t = f = H^{D_R}(p)$.
– $p \in Insert \Rightarrow p \in D_R \setminus D \Rightarrow \nu^R(\overline{p}) = \nu^R(s_p) = t = H^{D_R}(p)$.
– $p \notin D \cup Insert \Rightarrow p \notin D_R \Rightarrow \nu^R(\overline{p}) = \nu^R(s_p) = f = H^{D_R}(p)$.

For part (b), suppose that $\nu$ is a model of $\overline{IC}$. Let

$$R^\nu = (Insert, Retract) = (\{p \notin D \mid \nu(s_p) = t\},\ \{p \in D \mid \nu(s_p) = t\}).$$

We shall show that $R^\nu$ is a repair of $DB$. According to Definition 2.1, it is obviously an update. It remains to show that every $\psi \in IC$ follows from $D_R = D \cup Insert \setminus Retract$, i.e., that $H^{D_R}(\psi) = t$, where $H^{D_R}$ is the least Herbrand model of $D_R$. Since $\nu$ is a model of $\overline{IC}$, $\nu(\overline{\psi}) = t$, and so it remains to show that $H^{D_R}(\psi) = \nu(\overline{\psi})$. Again, the proof is by induction on the structure of $\psi$, and we show only the base step, that is: for every $p \in \Sigma_{DB}$, $H^{D_R}(p) = \nu(\overline{p})$:

– $p \in D \setminus Retract \Rightarrow p \in D_R,\ \nu(s_p) = f \Rightarrow H^{D_R}(p) = t = \neg \nu(s_p) = \nu(\neg s_p) = \nu(\overline{p})$.
– $p \in Retract \Rightarrow p \in D \setminus D_R,\ \nu(s_p) = t \Rightarrow H^{D_R}(p) = f = \neg \nu(s_p) = \nu(\neg s_p) = \nu(\overline{p})$.
– $p \in Insert \Rightarrow p \in D_R \setminus D,\ \nu(s_p) = t \Rightarrow H^{D_R}(p) = t = \nu(s_p) = \nu(\overline{p})$.
– $p \notin D \cup Insert \Rightarrow p \notin D_R,\ \nu(s_p) = f \Rightarrow H^{D_R}(p) = f = \nu(s_p) = \nu(\overline{p})$. □

The last theorem implies, in particular, that in order to compute repairs for a given database $DB$, it is sufficient to find the models of the signed formulae that are induced by the integrity constraints of $DB$; the pairs that are induced by these models are the repairs of $DB$.

Example 3.2. Consider again the (grounded) database of Examples 2.1 and 3.1. The corresponding signed formula $\overline{\psi} = s_p \vee s_q$ has three models: $\{s_p : t, s_q : f\}$, $\{s_p : f, s_q : t\}$, and $\{s_p : t, s_q : t\}$.⁷ These models induce, respectively, three pairs, $(\{\}, \{p\})$, $(\{q\}, \{\})$, and $(\{q\}, \{p\})$, which are the repairs of $DB$ (cf. Example 2.1).

⁷ We denote here by $p : x$ the fact that the atom $p$ is assigned the value $x$ by the corresponding valuation.
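The correspondence of Theorem 3.1 can be confirmed exhaustively on the running example. The Python sketch below hard-codes the grounded constraint and its signed version, and indexes the switch atoms by the names of the atoms they switch; `induced_update` and `associated_valuation` are hypothetical names for the mappings $\nu \mapsto R^\nu$ and $R \mapsto \nu^R$.

```python
from itertools import product

D = {"p"}                  # database instance of Example 3.1
ATOMS = ["p", "q"]         # ground atoms; nu[a] is the value of the switch s_a


def constraint_holds(instance):
    """The grounded integrity constraint p -> q."""
    return ("p" not in instance) or ("q" in instance)


def signed_constraint_holds(nu):
    """The signed formula (not s_p) -> s_q, i.e. s_p or s_q."""
    return nu["p"] or nu["q"]


def induced_update(nu, d):
    """R^nu: switched atoms outside d are inserted, those in d are retracted."""
    insert = {a for a, switched in nu.items() if switched and a not in d}
    retract = {a for a, switched in nu.items() if switched and a in d}
    return insert, retract


def associated_valuation(update, atoms):
    """nu^R: s_p is true exactly when p is inserted or retracted."""
    insert, retract = update
    return {a: (a in insert or a in retract) for a in atoms}


checked = []
for bits in product([False, True], repeat=len(ATOMS)):
    nu = dict(zip(ATOMS, bits))
    insert, retract = induced_update(nu, D)
    is_repair = constraint_holds((D | insert) - retract)
    # Theorem 3.1 on this instance: nu is a model iff R^nu is a repair.
    checked.append(signed_constraint_holds(nu) == is_repair)
    # The two mappings are inverse to each other.
    checked.append(associated_valuation((insert, retract), ATOMS) == nu)
```

Every entry of `checked` is true for this database. This is a check on one instance, not a substitute for the proof.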


4 Computing Preferred Repairs by Model Generation

In this section we show how solvers for constraint logic programs (CLPs), answer-set programming (ASP) and SAT solvers can be used for computing $\leq_c$-preferred repairs and $\leq_i$-preferred repairs. The experimental results are presented in Section 6.

4.1 Computing $\leq_c$-Preferred Repairs

By Theorem 3.1, the repairs of a database correspond exactly to the models of the signed theory. It is straightforward to see that $\leq_c$-preferred repairs of $DB$ (i.e., those with minimal cardinality) correspond to models of $\overline{IC}$ that minimize the number of $t$-assignments of the atoms $s_p$. Hence, the problem is to find Herbrand models for $\overline{IC}$ with minimal cardinality (called $\leq_c$-minimal Herbrand models).

Theorem 4.1. Let $DB = (D, IC)$ be a database and $\overline{IC} = \{\overline{\psi} \mid \psi \in IC\}$. Then:
a) if $R$ is a $\leq_c$-preferred repair of $DB$, then $\nu^R$ is a $\leq_c$-minimal Herbrand model of $\overline{IC}$,
b) if $\nu$ is a $\leq_c$-minimal Herbrand model of $\overline{IC}$, then $R^\nu$ is a $\leq_c$-preferred repair of $DB$.

We discuss two techniques to compute $\leq_c$-minimal Herbrand models. The first approach is to use a finite domain CLP solver. Encoding the computation of $\leq_c$-preferred repairs using a finite domain constraint solver is a straightforward process. The ‘switch atoms’ $s_p$ are encoded as finite domain variables with domain $\{0, 1\}$. A typical encoding specifies the relevant constraints (i.e., the encoding of $\overline{IC}$), assigns a special variable, Sum, for summing up all the signed variables that are assigned the value ‘1’, and asks for a solution with a minimal value for Sum.

Example 4.1. Below is code for repairing the database of Example 3.2 with the Sicstus Prolog finite domain constraint solver CLP(FD) [12].⁸

    domain([Sp,Sq],0,1),                   % domain of the signed atoms
    Sp #\/ Sq,                             % the signed theory
    sum([Sp,Sq],#=,Sum),                   % Sum = num of vars with val 1
    minimize(labeling([],[Sp,Sq]),Sum).    % find a solution with min sum

The solutions computed here are [1, 0] and [0, 1], and the value of Sum is 1. This means that the cardinality of the $\leq_c$-preferred repairs of $DB$ should be 1, and that these repairs are induced by the valuations $\nu_1 = \{s_p : t, s_q : f\}$ and $\nu_2 = \{s_p : f, s_q : t\}$. Thus, the two $\leq_c$-minimal repairs here are $(\{\}, \{p\})$ and $(\{q\}, \{\})$, which indeed insert or retract exactly one atomic formula.

⁸ A Boolean constraint solver would also be appropriate here. As the Sicstus Prolog Boolean constraint solver has no minimization capabilities, we prefer to use here the finite domain constraint solver.
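The shape of this encoding (0/1 variables, a constraint, and a sum to be minimized) can be mimicked without a constraint solver. The Python sketch below replaces the CLP(FD) search by plain enumeration of all 0/1 assignments, so it illustrates what is being minimized rather than how a finite domain solver does it; `min_cardinality_models` is a hypothetical helper name.

```python
from itertools import product


def min_cardinality_models(variables, constraint):
    """Return (best_sum, assignments) over 0/1 assignments satisfying constraint."""
    best_sum, best = None, []
    for bits in product([0, 1], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if not constraint(assignment):
            continue
        total = sum(bits)                      # plays the role of Sum
        if best_sum is None or total < best_sum:
            best_sum, best = total, [assignment]
        elif total == best_sum:
            best.append(assignment)
    return best_sum, best


# Signed theory of Example 3.2: Sp or Sq.
best_sum, solutions = min_cardinality_models(
    ["Sp", "Sq"], lambda a: a["Sp"] == 1 or a["Sq"] == 1)
```

On this input `best_sum` is 1 and `solutions` holds the two assignments with exactly one switch set, matching the solutions reported in Example 4.1.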


A second approach is to use the disjunctive logic programming system DLV [15]. To compute $\leq_c$-minimal repairs using DLV, the signed theory $\overline{IC}$ is transformed into a propositional clausal form. A clausal theory is a special case of a disjunctive logic program without negation in the body of the clauses. The stable models of a disjunctive logic program without negation as failure in the body of rules coincide exactly with the $\leq_i$-minimal models of such a program. Hence, by transforming the signed theory $\overline{IC}$ to clausal form, DLV can be used to compute $\leq_i$-minimal Herbrand models.

To eliminate models with non-minimal cardinality, weak constraints are used. A weak constraint is a formula for which a cost value is defined. With each model computed by DLV, a cost is defined as the sum of the cost values of all weak constraints whose formula holds in the model (in DLV's own terminology, a weak constraint `:~ a.` is violated, and its cost incurred, exactly when `a` is true). The DLV system can be asked to generate models with minimal total cost. The set of weak constraints used to compute $\leq_c$-minimal repairs is exactly the set of all atoms $s_p$; each atom has cost 1. Clearly, $\leq_i$-minimal models of a theory with minimal total cost are exactly the models with least cardinality.

Example 4.2. Below is code for repairing the database of Example 3.2 with DLV. The switch atoms are written in lower case, since DLV reads identifiers that start with an upper-case letter as variables.

    sp v sq.          % the clause
    :~ sp.  :~ sq.    % the weak constraints (their cost is 1 by default)

Clearly, the solutions here are $\{s_p : t, s_q : f\}$ and $\{s_p : f, s_q : t\}$. These valuations induce the two $\leq_c$-minimal repairs of $DB$, $R_1 = (\{\}, \{p\})$ and $R_2 = (\{q\}, \{\})$.

4.2 Computing $\leq_i$-Preferred Repairs

The $\leq_i$-preferred repairs of a database correspond to minimal Herbrand models with respect to set inclusion of the signed theory $\overline{IC}$. We focus on the computation of one minimal model. The reason is simply that in most sizable applications, the computation of all minimal models is not feasible (there are too many of them). We consider here three simple techniques to compute a $\leq_i$-preferred repair. In the next section we consider another, more complex method.

I. One technique, mentioned already in the previous section, is to transform $\overline{IC}$ to clausal form and use the DLV system. In this case the weak constraints are not needed.

II. Another possibility is to adapt CLP-techniques to compute $\leq_i$-minimal models of Boolean constraints. The idea is simply to make sure that whenever a Boolean variable (or a finite domain variable with domain $\{0, 1\}$) is selected for being assigned a value, one first assigns the value 0 before trying to assign the value 1.

Proposition 4.1. If the above strategy for value selection is used, then the first computed model is provably a $\leq_i$-minimal model.


Proof. Consider the search tree of the CLP-problem. Each path in this tree represents a value assignment to a subset of the constraint variables. Internal nodes, which correspond to partial solutions, are labeled with the variable selected by the labeling function of the solver and have two children: the left child assigns value 0 to the selected variable and the right child assigns value 1. We say that node $n_2$ is on the right of a node $n_1$ in this tree if $n_2$ appears in the right subtree, and $n_1$ appears in the left subtree, of the deepest common ancestor node of $n_1$ and $n_2$. It is then easy to see that in such a tree, each node $n_2$ to the right of a node $n_1$ assigns the value 1 to the variable selected in this ancestor node, whereas $n_1$ assigns value 0 to this variable. Hence the set of variables that $n_2$ sets to 1 is not contained in the corresponding set of $n_1$, so no model to the right of $n_1$ is strictly smaller than $n_1$. Consequently, the left-most node in the search tree which is a model of the Boolean constraints is $\leq_i$-minimal. □

In CLP-systems such as Sicstus Prolog, one can control the order in which values are assigned to variables. We have implemented the above strategy and discuss the results in Section 6.

III. A third technique considered here uses SAT-solvers. SAT-solvers, such as zChaff [25], do not directly compute minimal models, but can be easily extended to do so. The algorithm uses the SAT-solver to generate models of the theory $T$, until it finds a minimal model. Minimality of a model $M$ of $T$ can be verified by checking the unsatisfiability of $T$, augmented with the axioms $\bigvee_{p \in M} \neg p$ and $\bigwedge_{p \notin M} \neg p$. The model $M$ is minimal exactly when these axioms are inconsistent with $T$. This approach has been tested using the SAT solver zChaff [25]; the results are discussed in Section 6.
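Techniques II and III can be sketched together in a few lines of Python. The fragment below is an illustration under simplifying assumptions: the labeling search checks the constraint only at the leaves (a real CLP solver would also propagate), and the minimality test of technique III is done by enumeration instead of a call to a SAT solver. The function names are hypothetical.

```python
from itertools import product


def first_model_zero_first(variables, constraint):
    """Depth-first labeling that tries 0 before 1; returns the first model found."""
    def search(index, partial):
        if index == len(variables):
            return dict(partial) if constraint(partial) else None
        for value in (0, 1):                   # left child 0, right child 1
            partial[variables[index]] = value
            found = search(index + 1, partial)
            if found is not None:
                return found
        del partial[variables[index]]
        return None
    return search(0, {})


def is_inclusion_minimal(model, variables, constraint):
    """True if no model keeps the 0s of `model` and turns one of its 1s into 0.

    This is the brute-force counterpart of checking that T together with
    the two axioms of technique III is unsatisfiable.
    """
    ones = [v for v in variables if model[v] == 1]
    for bits in product([0, 1], repeat=len(ones)):
        if all(bits):
            continue                           # same model, not strictly smaller
        candidate = {v: 0 for v in variables}
        candidate.update(zip(ones, bits))
        if constraint(candidate):
            return False
    return True
```

For the signed theory $s_p \vee s_q$, with variables ordered as `["sp", "sq"]`, the zero-first search returns the assignment that sets only `sq` to 1, and the minimality check accepts it while rejecting the assignment that sets both switches.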

5 Computing $\leq_i$-Preferred Repairs by QBF Solvers

In this section we show how solvers for quantified Boolean formulae (QBFs) can be used for computing the $\leq_i$-preferred repairs of a given database. In this case it is necessary to add to the signed formulae of $\overline{IC}$ an axiom (represented by a quantified Boolean formula) that expresses $\leq_i$-minimality, i.e., that an $\leq_i$-preferred repair does not properly include any other database repair. Then, QBF solvers such as QUBOS [5], EVALUATE [11], QUIP [16], QSOLVE [17], QuBE [18], QKN [21], SEMPROP [22], and DECIDE [26], can be applied to the signed quantified Boolean theory that is obtained, in order to compute the $\leq_i$-preferred repairs of the database. Below we give a formal description of this process.

5.1 Quantified Boolean Formulae

Quantified Boolean formulae (QBFs) are propositional formulae extended with quantifiers ∀, ∃ over propositional variables. In what follows we shall denote propositional formulae by Greek lower-case letters (usually ψ, φ) and QBFs by Greek upper-case letters (e.g., Ψ, Φ). Intuitively, the meaning of a QBF of the form ∃p ∀q ψ is that there exists a truth assignment of p such that ψ is true for every truth assignment of q. Next we formalize this intuition.


As usual, we say that an occurrence of an atomic formula $p$ is free if it is not in the scope of a quantifier $Qp$, for $Q \in \{\forall, \exists\}$, and we denote by $\Psi[\phi_1/p_1, \ldots, \phi_m/p_m]$ the uniform substitution of each free occurrence of a variable $p_i$ in $\Psi$ by a formula $\phi_i$, for $i = 1, \ldots, m$. The notion of a valuation is extended to QBFs as follows: Given a function $\nu_{at} : \Sigma_{DB} \cup \{\mathsf{t}, \mathsf{f}\} \to \{t, f\}$ s.t. $\nu_{at}(\mathsf{t}) = t$ and $\nu_{at}(\mathsf{f}) = f$, a valuation $\nu$ on QBFs is recursively defined as follows:

– $\nu(p) = \nu_{at}(p)$ for every $p \in \Sigma_{DB} \cup \{\mathsf{t}, \mathsf{f}\}$,
– $\nu(\neg\psi) = \neg\nu(\psi)$,
– $\nu(\psi \circ \phi) = \nu(\psi) \circ \nu(\phi)$, where $\circ \in \{\wedge, \vee, \to, \leftrightarrow\}$,
– $\nu(\forall p\, \psi) = \nu(\psi[\mathsf{t}/p]) \wedge \nu(\psi[\mathsf{f}/p])$,
– $\nu(\exists p\, \psi) = \nu(\psi[\mathsf{t}/p]) \vee \nu(\psi[\mathsf{f}/p])$.

A valuation $\nu$ satisfies a QBF $\Psi$ if $\nu(\Psi) = t$; $\nu$ is a model of a set $\Gamma$ of QBFs if it satisfies every element of $\Gamma$. A QBF $\Psi$ is entailed by a set $\Gamma$ of QBFs (notation: $\Gamma \vdash \Psi$) if every model of $\Gamma$ is also a model of $\Psi$. In what follows we shall use the following notations: for two valuations $\nu_1$ and $\nu_2$ we denote by $\nu_1 \leq \nu_2$ that for every atomic formula $p$, $\nu_1(p) \to \nu_2(p)$ is true. We shall also write $\nu_1 < \nu_2$ to denote that $\nu_1 \leq \nu_2$ and $\nu_2 \not\leq \nu_1$.

5.2 Representing $\leq_i$-Preferred Repairs by Signed QBFs

It is well-known that quantified Boolean formulae can be used for representing circumscription [24], thus they properly express logical minimization [7, 8]. In our case we use this property for expressing minimization of repairs w.r.t. set inclusion. Given a database $DB = (D, IC)$, denote by $\overline{IC}^{\wedge}$ the conjunction of all the elements in $\overline{IC}$ (i.e., the conjunction of all the signed formulae that are obtained from the integrity constraints of $DB$). Consider the following QBF, denoted $\Psi_{DB}$:

$$\forall s'_{p_1}, \ldots, s'_{p_n} \Big( \overline{IC}^{\wedge}\big[s'_{p_1}/s_{p_1}, \ldots, s'_{p_n}/s_{p_n}\big] \to \Big( \bigwedge_{i=1}^{n} (s'_{p_i} \to s_{p_i}) \to \bigwedge_{i=1}^{n} (s_{p_i} \to s'_{p_i}) \Big) \Big).$$

Consider a model $\nu$ of $\overline{IC}^{\wedge}$, i.e., a valuation for $s_{p_1}, \ldots, s_{p_n}$ that makes $\overline{IC}^{\wedge}$ true. The QBF $\Psi_{DB}$ expresses that every interpretation $\mu$ (valuation for $s'_{p_1}, \ldots, s'_{p_n}$) that is a model of $\overline{IC}^{\wedge}$ has the property that $\mu \leq \nu$ implies $\nu \leq \mu$, i.e., there is no model $\mu$ of $\overline{IC}^{\wedge}$ s.t. the set $\{s_p \mid \nu(s_p) = t\}$ properly contains the set $\{s_p \mid \mu(s_p) = t\}$. In terms of database repairs, this means that if $R^\nu = (Insert, Retract)$ and $R^\mu = (Insert', Retract')$ are the database repairs that are associated, respectively, with $\nu$ and $\mu$, then $Insert' \cup Retract' \not\subset Insert \cup Retract$. It follows, therefore, that in this case $R^\nu$ is a $\leq_i$-preferred repair of $DB$, and in general $\Psi_{DB}$ represents $\leq_i$-minimality.
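The recursive valuation of Section 5.1 and the minimality axiom $\Psi_{DB}$ can be tried out on the running database $(\{p\}, \{p \to q\})$, whose signed theory is $\{s_p \vee s_q\}$. The Python sketch below is an illustration only: formulae are nested tuples, quantifiers are expanded naively over both truth values (so the cost is exponential), and the primed atoms $s'_p$, $s'_q$ are written `sp1`, `sq1`. None of this reflects how the QBF solvers cited above work internally.

```python
def qbf_value(formula, nu):
    """Valuation of Section 5.1; quantified atoms are expanded to both values."""
    op = formula[0]
    if op == "atom":
        return nu[formula[1]]
    if op == "not":
        return not qbf_value(formula[1], nu)
    if op in ("forall", "exists"):
        _, name, body = formula
        values = [qbf_value(body, {**nu, name: b}) for b in (True, False)]
        return all(values) if op == "forall" else any(values)
    left, right = qbf_value(formula[1], nu), qbf_value(formula[2], nu)
    if op == "and":
        return left and right
    if op == "or":
        return left or right
    if op == "imp":
        return (not left) or right
    raise ValueError("unknown connective: " + op)


def atom(name):
    return ("atom", name)


# Psi_DB for the signed theory {s_p or s_q}.
smaller = ("and", ("imp", atom("sp1"), atom("sp")),
                  ("imp", atom("sq1"), atom("sq")))
larger = ("and", ("imp", atom("sp"), atom("sp1")),
                 ("imp", atom("sq"), atom("sq1")))
psi_db = ("forall", "sp1", ("forall", "sq1",
          ("imp", ("or", atom("sp1"), atom("sq1")),
                  ("imp", smaller, larger))))
theory = ("and", ("or", atom("sp"), atom("sq")), psi_db)

minimal_models = [(sp, sq) for sp in (False, True) for sq in (False, True)
                  if qbf_value(theory, {"sp": sp, "sq": sq})]
```

Only the two valuations that switch exactly one of $p$, $q$ satisfy the theory; the valuation that switches both is rejected by $\Psi_{DB}$ because a strictly smaller model exists.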


Example 5.1. With the database $DB$ of Examples 2.1, 3.1, and 3.2, $\overline{IC} \cup \Psi_{DB}$ is the following theory, $\Gamma$:

$$\Big\{\, s_p \vee s_q,\ \ \forall s'_p \forall s'_q \Big( (s'_p \vee s'_q) \to \Big( \big((s'_p \to s_p) \wedge (s'_q \to s_q)\big) \to \big((s_p \to s'_p) \wedge (s_q \to s'_q)\big) \Big) \Big) \Big\}.$$

The models of $\Gamma$ are those that assign $t$ either to $s_p$ or to $s_q$, but not to both of them, i.e., $\nu_1 = (s_p : t, s_q : f)$ and $\nu_2 = (s_p : f, s_q : t)$. The database updates that are induced by these valuations are, respectively, $R^{\nu_1} = (\{\}, \{p\})$ and $R^{\nu_2} = (\{q\}, \{\})$. By Theorem 5.1 below, these are the only $\leq_i$-preferred repairs of $DB$.

Theorem 5.1. Let $DB = (D, IC)$ be a database and $\overline{IC} = \{\overline{\psi} \mid \psi \in IC\}$. Then:
a) if $R$ is an $\leq_i$-preferred repair of $DB$ then $\nu^R$ is a model of $\overline{IC} \cup \Psi_{DB}$,
b) if $\nu$ is a model of $\overline{IC} \cup \Psi_{DB}$ then $R^\nu$ is an $\leq_i$-preferred repair of $DB$.

Proof. Suppose that $R = (Insert, Retract)$ is an $\leq_i$-preferred repair of $DB$. In particular, it is a repair of $DB$ and so, by Theorem 3.1, $\nu^R$ is a model of $\overline{IC}$. Since Theorem 3.1 also assures that a database update that is induced by a model of $\overline{IC}$ is a repair of $DB$, in order to prove both parts of the theorem, it remains to show that the fact that $\nu^R$ satisfies $\Psi_{DB}$ is a necessary and sufficient condition for assuring that $R$ is $\leq_i$-minimal among the repairs of $DB$. Indeed, $\nu^R$ satisfies $\Psi_{DB}$ iff for every valuation $\mu$ that satisfies $\overline{IC}^{\wedge}$ and for which $\mu \leq \nu^R$, it is also true that $\nu^R \leq \mu$. Thus, $\nu^R$ satisfies $\Psi_{DB}$ iff there is no model $\mu$ of $\overline{IC}$ s.t. $\mu < \nu^R$, iff (by Theorem 3.1 again) there is no repair $R'$ of $DB$ s.t. $\nu^{R'} < \nu^R$, iff there is no repair $R' = (Insert', Retract')$ s.t. $Insert' \cup Retract' \subset Insert \cup Retract$, iff $R$ is an $\leq_i$-minimal repair of $DB$. □

Note 5.1. (Complexity results) A skeptical (conservative) approach to query answering is considered, e.g., in [1, 19], where an answer to a query $Q$ and a database $DB$ is evaluated with respect to (the databases that are obtained from) all the $\leq_i$-preferred repairs of $DB$. A credulous approach to the same problem evaluates queries with respect to some $\leq_i$-preferred repair of $DB$. Theorem 5.1 implies the following upper complexity bounds for these approaches:

Corollary 5.1. Credulous query answering lies in $\Sigma_2^P$, and skeptical query answering is in $\Pi_2^P$.

Proof. By Theorem 5.1, credulous query answering is equivalent to satisfiability checking for $\overline{IC} \cup \Psi_{DB}$, and conservative query answering is equivalent to entailment checking for the same theory (see also Corollary 5.2 below). Thus, these decision problems can be encoded by QBFs in prenex normal form with exactly one quantifier alternation. The corollary is obtained, now, by the following well-known result:

Proposition 5.1. [27] Given a propositional formula $\psi$, whose atoms are partitioned into $i \geq 1$ sets $\{p^1_1, \ldots, p^1_{m_1}\}, \ldots, \{p^i_1, \ldots, p^i_{m_i}\}$, deciding whether

$$\exists p^1_1, \ldots, \exists p^1_{m_1}, \forall p^2_1, \ldots, \forall p^2_{m_2}, \ldots, Q p^i_1, \ldots, Q p^i_{m_i}\ \psi$$

is true, is $\Sigma_i^P$-complete (where $Q = \exists$ if $i$ is odd and $Q = \forall$ if $i$ is even). Also, deciding if

$$\forall p^1_1, \ldots, \forall p^1_{m_1}, \exists p^2_1, \ldots, \exists p^2_{m_2}, \ldots, Q p^i_1, \ldots, Q p^i_{m_i}\ \psi$$

is true, is $\Pi_i^P$-complete (where $Q = \forall$ if $i$ is odd and $Q = \exists$ if $i$ is even). □

As shown, e.g., in [19], the complexity bounds specified in the last corollary are tight, i.e., these decision problems are hard for the respective complexity classes.

Note 5.2. (Consistent query answering) Another consequence of Theorem 5.1 is that the conservative approach to query answering [1, 19] may be represented in our context in terms of a consequence relation as follows:

Corollary 5.2. $Q$ is a consistent query answer of a database $DB = (D, IC)$ in the sense of [1, 19] iff $\overline{IC} \cup \Psi_{DB} \vdash Q$.

The last corollary and Section 4.2 provide, therefore, some additional methods for consistent query answering, all of which are based on signed theories.
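The difference between the skeptical and the credulous approach can be made concrete on the running example. The Python sketch below assumes that the $\leq_i$-preferred repairs have already been computed, and it evaluates ground queries directly on the repaired instances rather than through the entailment relation of Corollary 5.2; the function names and the two sample queries are hypothetical.

```python
D = frozenset({"p"})
# The two <=_i-preferred repairs of the running example, as (Insert, Retract).
PREFERRED = [(frozenset(), frozenset({"p"})), (frozenset({"q"}), frozenset())]


def repaired_instances(d, repairs):
    """The database instances obtained by applying each repair to d."""
    return [(d | insert) - retract for insert, retract in repairs]


def skeptical(query, d, repairs):
    """True if the query holds in every preferred repaired instance."""
    return all(query(inst) for inst in repaired_instances(d, repairs))


def credulous(query, d, repairs):
    """True if the query holds in at least one preferred repaired instance."""
    return any(query(inst) for inst in repaired_instances(d, repairs))


def p_implies_q(instance):
    return ("p" not in instance) or ("q" in instance)


def has_q(instance):
    return "q" in instance
```

The repaired instances here are $\{\}$ and $\{p, q\}$. The query `p_implies_q` holds in both, so it is a skeptical answer; `has_q` holds only in the second, so it is a credulous answer but not a skeptical one.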

6 Experiments and Comparative Study

The idea of using formulae that introduce new (‘signed’) variables aimed at designating the truth assignments of other related variables is used, for different purposes, e.g. in [2, 3, 6, 7]. In the area of database integration, signed variables are used in [19], and have a similar intended meaning as in our case. In [19], however, only $\leq_i$-preferred repairs are considered, and a rewriting process for converting relational queries over a database with constraints to extended disjunctive queries (with two kinds of negations) over a database without constraints must be employed. As a result, only solvers that are able to process disjunctive Datalog programs and compute their stable models (e.g., DLV) can be applied. In contrast, as we have already noted above, motivated by the need to find practical and effective methods for repairing inconsistent databases, signed formulae serve here as a representative platform that can be directly used by a variety of off-the-shelf applications for computing (either $\leq_i$-preferred or $\leq_c$-preferred) repairs. In what follows we examine some of these applications and compare their appropriateness to the kind of problems that we are dealing with.

We have randomly generated instances of a database, consisting of three relations: teacher of the schema (teacher name), course of the schema (course name), and teaches of the schema (teacher name, course name). Also, the following two integrity constraints were specified:

ic1 A course is given by one teacher:

$$\forall X\, \forall Y\, \forall Z\ \Big( \big(teacher(X) \wedge teacher(Y) \wedge course(Z) \wedge teaches(X, Z) \wedge teaches(Y, Z)\big) \to X = Y \Big)$$


ic2 Each teacher gives at least one course:

$$\forall X\ \Big( teacher(X) \to \exists Y\ \big(course(Y) \wedge teaches(X, Y)\big) \Big)$$

The next four test cases (identified by the enumeration below) were considered:

1. Small database instances with ic1 as the only constraint.
2. Larger database instances with ic1 as the only constraint.
3. Databases with IC = {ic1, ic2}, where the number of courses equals the number of teachers.
4. Databases with IC = {ic1, ic2} and with fewer courses than teachers.

Note that in the first two test cases, only retractions of database facts are needed in order to restore consistency, in the third test case both insertions and retractions may be needed, and the last test case is unsolvable, as the theory is not satisfiable.

For each benchmark we generated a sequence of instances with an increasing number of database facts, and tested them w.r.t. the following applications:

– ASP/CLP-solvers: DLV [15] (release 2003-05-16), CLP(FD) [12] (version 3.10.1).
– QBF-solvers: SEMPROP [22] (release 24.02.02), QuBE-BJ [18] (release 1.3), DECIDE [26].
– SAT-solvers: A minimal-model generator based on zChaff [25].

The goal was to construct $\leq_i$-preferred repairs within a time limit of five minutes. The systems DLV and CLP(FD) were tested also for constructing $\leq_c$-preferred repairs. All the experiments were done on a Linux machine, 800MHz, with 512MB memory. Tables 1–4 show the results for providing the first answer.⁹

The results of the first benchmark (Table 1) already indicate that DLV, CLP, and zChaff perform much better than the QBF-solvers. In fact, among the QBF-solvers that were tested, only SEMPROP could repair within the time limit most of the database instances of benchmark 1, and none of them could successfully repair (within the time restriction) the larger database instances, tested in benchmark 2. Also, we encountered some space limitation problems and a bug¹⁰ in DECIDE, and this discouraged us from using it in our experiments.

Another observation from Tables 1–4 is that DLV, CLP, and the zChaff-based system perform very well for minimal inclusion greedy algorithms. However,

⁹ Times are given in seconds, empty cells mean that timeout is reached without an answer, vars is the number of variables, IC is the number of grounded integrity constraints, and size is the size of the repairs.
¹⁰ For the unsatisfiable QBF $\exists x y \forall u v ((x \vee y) \wedge (u \vee v))$, the answer $x = 1$ and $y = 0$ is returned. The system developers were notified about this and the bug is being fixed.


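Table 1. Results for test case 1

| No. | vars | IC | size | ≤i DLV | ≤i CLP | ≤i zChaff | ≤i SEMPROP | ≤c DLV | ≤c CLP |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 20 | 12 | 8 | 0.005 | 0.010 | 0.024 | 0.088 | 0.011 | 0.020 |
| 2 | 25 | 16 | 7 | 0.013 | 0.010 | 0.018 | 0.015 | 0.038 | 0.020 |
| 3 | 30 | 28 | 12 | 0.009 | 0.020 | 0.039 | 0.100 | 0.611 | 0.300 |
| 4 | 35 | 40 | 15 | 0.023 | 0.020 | 0.008 | 0.510 | 2.490 | 1.270 |
| 5 | 40 | 48 | 16 | 0.016 | 0.020 | 0.012 | 0.208 | 3.588 | 3.220 |
| 6 | 45 | 42 | 17 | 0.021 | 0.030 | 0.008 | 0.673 | 12.460 | 10.350 |
| 7 | 50 | 38 | 15 | 0.013 | 0.020 | 0.009 | 0.216 | 23.146 | 20.760 |
| 8 | 55 | 50 | 20 | 0.008 | 0.030 | 0.018 | 1.521 | 29.573 | 65.530 |
| 9 | 60 | 58 | 21 | 0.014 | 0.030 | 0.036 | 3.412 | 92.187 | 136.590 |
| 10 | 65 | 64 | 22 | 0.023 | 0.030 | 0.009 | 10.460 | 122.399 | 171.390 |
| 11 | 70 | 50 | 22 | 0.014 | 0.030 | 0.019 | 69.925 | | |
| 12 | 75 | 76 | 27 | 0.021 | 0.030 | 0.010 | 75.671 | | |
| 13 | 80 | 86 | 29 | 0.021 | 0.030 | 0.009 | 270.180 | | |
| 14 | 85 | 76 | 30 | 0.022 | 0.030 | 0.010 | | | |
| 15 | 90 | 78 | 32 | 0.024 | 0.040 | 0.020 | | | |
| 16 | 95 | 98 | 35 | 0.027 | 0.040 | 0.047 | | | |
| 17 | 100 | 102 | 40 | 0.017 | 0.040 | 0.016 | | | |
| 18 | 105 | 102 | 37 | 0.018 | 0.040 | 0.033 | | | |
| 19 | 110 | 124 | 43 | 0.030 | 0.040 | 0.022 | | | |
| 20 | 115 | 116 | 44 | 0.027 | 0.040 | 0.041 | | | |

QuBE (≤i-repairs): the source shows a single entry, 14.857; the instance it belongs to cannot be recovered from the extracted text.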

Table 2. Results for test case 2

Test info.            ≤i-repairs
No.  vars   IC  size    DLV    CLP  zChaff
  1   480  171   470  0.232  0.330   0.155
  2   580  214   544  0.366  0.440   0.051
  3   690  265   750  0.422  0.610   0.062
  4   810  300   796  0.639  0.860   0.079
  5   940  349   946  0.815  1.190   0.094
  6  1080  410  1108  1.107  1.560   0.123
  7  1230  428  1112  1.334  2.220   0.107
  8  1390  509  1362  1.742  2.580   0.135
  9  1560  575  1562  2.254  3.400   0.194
 10  1740  675  1782  2.901  4.140   0.182
 11  1930  719  2042  3.592  5.260   0.253


Table 3. Results for test case 3

Test info.       ≤i-repairs            ≤c-repairs
No.  vars  size    DLV    CLP  zChaff     DLV      CLP
  1    25     4  0.008  0.030   0.066   0.010     0.05
  2    36     9  0.008  0.030   0.087   0.070     0.42
  3    49    15  0.027  0.250   0.050   0.347     9.48
  4    64    23  0.019  0.770   0.013   2.942    58.09
  5    81    30  0.012  4.660   0.102  26.884  244.910
  6   100    34  0.021           0.058
  7   121    38  0.626           1.561
  8   144    47  0.907           2.192
  9   169    51  0.161           0.349
 10   196    68  1.877           4.204
 11   225    70  8.496          16.941

Table 4. Results for test case 4

Test info.              ≤i-repairs
No.  teachers  courses     DLV    CLP  zChaff
  1         5        4   0.001   0.01   0.001
  2         7        5   0.005   0.13   0.010
  3         9        6   0.040   1.41   0.020
  4        11        7   0.396  17.18   0.120
  5        13        8   3.789           1.050
  6        15        9  44.573          13.370
  7        17       10

≤c -repairs DLV CLP 0.001 0.001 0.005 0.120 0.042 1.400 3.785 17.170 44.605


when using DLV and CLP for cardinality minimization, their performance is much worse. This is due to an exhaustive search for a ≤c-minimal solution. While in benchmark 1 the time differences among DLV, CLP, and zChaff for computing ≤i-repairs are marginal, in the other benchmarks the differences become more evident. For instance, zChaff performs better than the other solvers on bigger database instances with many simple constraints (see benchmark 2), while DLV performs better when the problem has bigger and more complicated sets of constraints (see benchmark 3). The SAT approach with zChaff was the fastest in detecting unsatisfiable situations (see benchmark 4). As shown in Table 4, detecting unsatisfiability requires a considerable amount of time, even for small instances.

Some of the conclusions from the experiments may be summarized as follows:
1. In principle, QBF-solvers, CLP-solvers, ASP-solvers, and SAT-solvers are all adequate tools for computing database repairs.
2. All the QBF-solvers, as well as DLV and zChaff, are ‘black-boxes’ that accept the problem specification in a certain format. In contrast, CLP(FD) provides a more ‘open’ environment, in which it is possible to incorporate problem-specific search algorithms, such as the greedy algorithm for finding ≤i-minimal repairs (see Section 4.2).
3. Currently, the performance of the QBF-solvers is considerably below that of the other solvers. Moreover, most of the QBF-solvers require that the formulae are represented in prenex CNF, and specified in DIMACS or Rintanen format. These requirements are usually space-demanding. In our context, the fact that many QBF-solvers (e.g., SEMPROP and QuBE-BJ) return only yes/no answers (according to the satisfiability of the input theory) is another problem, since it is impossible to construct repairs from these answers alone. One needs to be able to extract the assignments to the outermost existentially quantified variables (as done, e.g., by DECIDE).

Despite these drawbacks of QBF-solvers, reasoning with QBFs seems to be particularly suitable for our needs, since this framework provides a natural way to express minimization (in our case, representations of optimal repairs). It is most likely, therefore, that future versions of QBF-solvers will be the basis of powerful mechanisms for handling consistency in databases.
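The difference between a yes/no answer and an assignment to the outermost existential variables can be illustrated with a small brute-force evaluator for prenex formulae of the form ∃X ∀U φ. This is only a sketch under stated assumptions: the function name solve_exists_forall and the encoding of the matrix φ as a Python callable are hypothetical, and the exhaustive enumeration bears no relation to how SEMPROP, QuBE-BJ, or DECIDE work internally. It returns a witness when one exists and None otherwise, and it is applied to the unsatisfiable QBF mentioned in the footnote on DECIDE.

```python
from itertools import product


def solve_exists_forall(exist_vars, forall_vars, matrix):
    """Return an assignment to exist_vars under which matrix holds for every
    assignment to forall_vars, or None if no such assignment exists.

    matrix takes one dict mapping variable names to booleans.
    """
    for e_vals in product([False, True], repeat=len(exist_vars)):
        witness = dict(zip(exist_vars, e_vals))
        if all(
            matrix({**witness, **dict(zip(forall_vars, u_vals))})
            for u_vals in product([False, True], repeat=len(forall_vars))
        ):
            return witness  # assignment to the outermost existential block
    return None


# The unsatisfiable QBF from the footnote: ∃x,y ∀u,v ((x ∨ y) ∧ (u ∨ v)).
unsat = solve_exists_forall(
    ["x", "y"],
    ["u", "v"],
    lambda a: (a["x"] or a["y"]) and (a["u"] or a["v"]),
)

# A satisfiable formula chosen for illustration: ∃x,y ∀u ((x ∨ u) ∧ (y ∨ ¬u)).
sat = solve_exists_forall(
    ["x", "y"],
    ["u"],
    lambda a: (a["x"] or a["u"]) and (a["y"] or not a["u"]),
)
```

For the first formula the evaluator finds no witness, since u = v = 0 falsifies the matrix whatever x and y are. For the second it returns the assignment x = y = 1, which is the kind of information needed to read off a repair.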

7 Concluding Remarks

This work provides further evidence for the well-known fact that in many cases a proper representation of a given problem is a major step in finding robust solutions to it. In our case, a uniform method for encoding the restoration of database consistency by signed formulae allows us to use off-the-shelf solvers for efficiently computing the desired repairs.


As shown in Corollary 5.1, the task of repairing a database is on the second level of the polynomial hierarchy, hence it is presumably not tractable. However, despite the high computational complexity of the problem, the experimental results of Section 6 show that our method of repairing databases by signed theories is practically appealing, as it allows a rapid construction of repairs for large problem instances.

References
1. M.Arenas, L.Bertossi, and J.Chomicki. Consistent query answers in inconsistent databases. Proc. 18th ACM Symp. on Principles of Database Systems (PODS’99), pp.68–79, 1999.
2. O.Arieli and M.Denecker. Modeling paraconsistent reasoning by classical logic. Proc. 2nd Symp. on Foundations of Information and Knowledge Systems (FoIKS’02), T.Eiter and K.D.Schewe, editors, LNCS 2284, Springer, pp.1–14, 2002.
3. O.Arieli and M.Denecker. Reducing preferential paraconsistent reasoning to classical entailment. Journal of Logic and Computation 13(4), pp.557–580, 2003.
4. O.Arieli, B.Van Nuffelen, M.Denecker, and M.Bruynooghe. Coherent composition of distributed knowledge-bases through abduction. Proc. 8th Int. Conf. on Logic Programming, Artificial Intelligence and Reasoning (LPAR’01), R.Nieuwenhuis and A.Voronkov, editors, LNCS 2250, Springer, pp.620–635, 2001.
5. A.Ayari and D.Basin. QUBOS: Deciding quantified Boolean logic using propositional satisfiability solvers. Proc. 4th Int. Conf. on Formal Methods in Computer-Aided Design (FMCAD’02), M.D.Aagaard and J.W.O’Leary, editors, LNCS 2517, Springer, pp.187–201, 2002.
6. P.Besnard and T.Schaub. Signed systems for paraconsistent reasoning. Journal of Automated Reasoning 20(1), pp.191–213, 1998.
7. P.Besnard, T.Schaub, H.Tompits, and S.Woltran. Paraconsistent reasoning via quantified Boolean formulas, part I: Axiomatizing signed systems. Proc. 8th European Conf. on Logics in Artificial Intelligence (JELIA’02), S.Flesca et al., editors, LNAI 2424, Springer, pp.320–331, 2002.
8. P.Besnard, T.Schaub, H.Tompits, and S.Woltran. Paraconsistent reasoning via quantified Boolean formulas, part II: Circumscribing inconsistent theories. Proc. 7th European Conf. on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU’03), T.D.Nielsen and N.L.Zhang, editors, LNAI 2711, Springer, pp.528–539, 2003.
9. L.Bertossi, J.Chomicki, A.Cortes, and C.Gutierrez. Consistent answers from integrated data sources. Proc. Flexible Query Answering Systems (FQAS’2002), T.Andreasen et al., editors, LNCS 2522, Springer, pp.71–85, 2002.
10. L.Bertossi and C.Schwind. Analytic tableau and database repairs: Foundations. Proc. 2nd Int. Symp. on Foundations of Information and Knowledge Systems (FoIKS’02), T.Eiter and K.D.Schewe, editors, LNCS 2284, Springer, pp.32–48, 2002.
11. M.Cadoli, M.Schaerf, A.Giovanardi, and M.Giovanardi. An algorithm to evaluate quantified Boolean formulae and its experimental evaluation. Journal of Automated Reasoning 28(2), pp.101–142, 2002.
12. M.Carlsson, G.Ottosson, and B.Carlson. An open-ended finite domain constraint solver. Proc. 9th Int. Symp. on Programming Languages, Implementations, Logics, and Programs (PLILP’97), LNCS 1292, Springer, 1997.


13. M.Dalal. Investigations into a theory of knowledge base revision. Proc. National Conf. on Artificial Intelligence (AAAI’88), AAAI Press, pp.475–479, 1988.
14. S.de Amo, W.Carnielli, and J.Marcos. A logical framework for integrating inconsistent information in multiple databases. Proc. 2nd Int. Symp. on Foundations of Information and Knowledge Systems (FoIKS’02), T.Eiter and K.D.Schewe, editors, LNCS 2284, Springer, pp.67–84, 2002.
15. T.Eiter, N.Leone, C.Mateis, G.Pfeifer, and F.Scarcello. The KR system dlv: Progress report, comparisons and benchmarks. Proc. 6th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR’98), Morgan Kaufmann Publishers, pp.406–417, 1998.
16. U.Egly, T.Eiter, H.Tompits, and S.Woltran. Solving advanced reasoning tasks using quantified Boolean formulas. Proc. National Conf. on Artificial Intelligence (AAAI’00), AAAI Press, pp.417–422, 2000.
17. R.Feldmann, B.Monien, and S.Schamberger. A distributed algorithm to evaluate quantified Boolean formulae. Proc. National Conf. on Artificial Intelligence (AAAI’00), AAAI Press, pp.285–290, 2000.
18. E.Giunchiglia, M.Narizzano, and A.Tacchella. QuBE: A system for deciding quantified Boolean formulas satisfiability. Proc. 1st Int. Conf. on Automated Reasoning (IJCAR’01), R.Goré, A.Leitsch, and T.Nipkow, editors, LNCS 2083, Springer, pp.364–369, 2001.
19. S.Greco and E.Zumpano. Querying inconsistent databases. Proc. Int. Conf. on Logic Programming and Automated Reasoning (LPAR’2000), M.Parigot and A.Voronkov, editors, LNAI 1955, Springer, pp.308–325, 2000.
20. G.Greco, S.Greco, and E.Zumpano. A logic programming approach to the integration, repairing and querying of inconsistent databases. Proc. 17th Int. Conf. on Logic Programming (ICLP’01), LNCS 2237, Springer, pp.348–363, 2001.
21. H.Kleine-Büning, M.Karpinski, and A.Flögel. Resolution for quantified Boolean formulas. Journal of Information and Computation 117(1), pp.12–18, 1995.
22. R.Letz. Lemma and model caching in decision procedures for quantified Boolean formulas. Proc. TABLEAUX’2002, U.Egly and C.G.Fermüller, editors, LNAI 2381, pp.160–175, 2002.
23. P.Liberatore and M.Schaerf. BReLS: A system for the integration of knowledge bases. Proc. Int. Conf. on Principles of Knowledge Representation and Reasoning (KR’2000), Morgan Kaufmann Publishers, pp.145–152, 2000.
24. J.McCarthy. Applications of circumscription to formalizing common-sense knowledge. Artificial Intelligence 28, pp.89–116, 1986.
25. M.Moskewicz, C.Madigan, Y.Zhao, L.Zhang, and S.Malik. Chaff: Engineering an efficient SAT solver. Proc. 39th Design Automation Conference, 2001.
26. J.T.Rintanen. Improvements of the evaluation of quantified Boolean formulae. Proc. 16th Int. Joint Conf. on Artificial Intelligence (IJCAI’99), Morgan Kaufmann Publishers, pp.1192–1197, 1999.
27. C.Wrathall. Complete sets and the polynomial-time hierarchy. Theoretical Computer Science 3(1), pp.23–33, 1976.