Tree patterns with Full Text Search

Maria-Hendrike Peetz

Maarten Marx

ISLA, University of Amsterdam Science Park 107, 1098 XG Amsterdam, The Netherlands


ABSTRACT

Tree patterns with full text search form the core of both XQuery Full Text and the NEXI query language. On such queries, users expect a relevance-ranked list of XML elements as an answer. But this requirement may lead to undesirable behavior of XML retrieval systems: two queries which are intuitively (i.e., without ranking) equivalent return differently ordered lists of elements. We show that the best performing XML retrieval semantics has this behavior. We also show how minimization of tree patterns can efficiently solve this problem.

1. INTRODUCTION

The extension of the strict boolean XML query languages XPath and XQuery with ranked full text search functionality is the topic of the INEX evaluation forum [11] and the W3C XQuery Full Text standard [19]. Information needs which combine constraints on the content and the structure of documents are natural and often easy to express in XPath-like languages. For example, from a corpus of scientific articles we want all sections about XML from articles which are about information retrieval. In XPath, extended with an about function, this is expressed as

collection('DBLP')//article[about(.,'Information Retrieval')]//section[about(.,'XML')].

Giving a semantics to this query is not easy at all, but the following two constraints are basic:¹

1. results should be ranked by relevance to the underlying information need [13];
2. the strict constraints within the query should be obeyed (e.g., only section elements from articles in the DBLP collection may be returned).²

¹ They combine the condition in [9] which says that "Ranking should reflect the actual, combined content and structure constraints".

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WebDB '10 Indianapolis, IN USA Copyright 2010 ACM 978-1-4503-0186-2/10/06 ...$10.00.

The situation with this type of query is very different from standard database queries. The latter have, in principle, one well-defined semantics to which implementations can adhere, for instance, the W3C recommendations for XPath and XQuery [19]. For query languages with full text functionality which give a relevance-ranked list of answers it is inherently impossible to give a mathematically precise reference semantics. In fact, the INEX evaluation forum can be seen as a quest for finding better and more efficient implementations of a most desired semantics. Also, the W3C XQuery Full Text Recommendation [19] does not specify a semantics. It specifies a Full Text function with a certain interpretation, but all key decisions about ranking are left implementation dependent.

As the number of possible implementations is in principle unbounded, it is desirable to have a number of theoretically well-founded minimal requirements on the semantics of content and structure query languages. In this paper we discuss a number of these requirements and apply them to the NEXI query language developed within the INEX community [18].

The most important requirement is a conservativity principle, first discussed in [9]. It states that in the setting where keyword search is implemented using XPath's contains() function, queries which are XPath-equivalent should return the same ranked list of answers. Thus, for example, //article[about(.,'XML')] should return the same ranked list of articles as //article[about(.,'XML')][//article], because //article[contains(.,'XML')] and //article[contains(.,'XML')][//article] are equivalent under the XPath semantics. A system that does not return the same results for those two queries exhibits unsound behaviour and would irritate the user. [9] used an experimental technique to show that out of 200 ranking approaches with good experimental results only 13 were conservative.
The input language was vertical Core XPath with union and a Full Text keyword search function as described in [19].

² Actually, research within INEX has shown that users prefer systems that (partially) violate these constraints. Users prefer that the results be grouped naturally (in the example, grouped by article). Also, users do consider other XML elements like paragraphs or chapters as relevant answers to the example query, so some systems return them as well [17, 11].

Main results. In this paper we study the conservativity principle theoretically for descendant-only tree patterns extended with full text search. We give a counterexample to conservativity in this small language. The counterexample works for most semantics studied in [9]. But more importantly, it works for the semantics with the best retrieval performance at INEX which also adheres to the XQuery Full Text standards: a semantics based on language modeling with smoothing and a length prior. Next to this negative result we show how implementations can be made conservative by first performing tree pattern minimization [4, 16], a polynomial time step.

The paper is organized as follows. The next section contains related work. Section 3 describes syntax and different semantics for tree patterns with full text search. Section 4 defines a number of conservativity properties, and Section 5 contains our results.

2. RELATED WORK

Our work is most closely related to the W3C Full Text recommendation [19] and the design goals in [2]. The restrictions in the recommendation and the list of design goals, together with the use cases in [3], bound the possible algorithms and semantics. [9] added the natural restriction that we called conservativity above. INEX is the evaluation initiative which studies information retrieval on XML documents using query languages which combine content and structure. From those experiments the NEXI query language emerged [18]. Its most distinctive feature is the about(Path,String) function which has the same interpretation as the ftcontains expression from the XQuery Full Text spec: it returns true if some element reachable from the context node by Path is about String. [10] contains an extensive analysis of a large set of content and structure queries from INEX and shows that most can be expressed in descendant-only tree patterns with full text search. The connections between tree patterns and XPath fragments are studied in [5]. Minimization of tree patterns is first introduced in [4] and later improved in [16]. Descendant-only XPath has close connections to modal languages studied in provability logic and has better algorithmic and definability properties than fragments with child and descendant axes [6].

3. NEXI: SYNTAX AND SEMANTICS

In this section, we provide syntax and semantics for the NEXI query language. We first give an XPath-like syntax used within INEX, and then an alternative formulation using tree patterns. After that we discuss the various possible semantics and formalize two of them.

3.1 Different syntaxes

In this paper, N denotes a set of labels, corresponding to names of XML elements, and K denotes a set of strings, corresponding to Full Text queries. For technical reasons, we assume that the two sets are disjoint. The intuition behind the N(arrowed) E(xtended) X(Path) I query language is that it restricts Core XPath [8] to only the descendant axis and extends it with a function about(P,S) which takes a path P and a string S as arguments [18].³ Queries are always interpreted at the root of an XML document tree. [10] provides a theoretical foundation for the restriction to only the descendant axis in an IR setting.

³ In [18] more restrictions are made than we make here. These further restrictions come from the experimental setup of INEX and have no semantic impact.

3.1.1 XPath-like syntax

Definition 1. Let s ∈ K be a string, ∗ be the wildcard and p ∈ N ∪ {∗}. The language NEXI is defined as

ϕ ::= . | ϕ//p | ϕ union ϕ | ϕ[ϕ] | ϕ[about(ϕ, s)].

In order to smoothly connect NEXI to tree patterns we need to remove the union operator and incorporate the about() function. The intended meaning of the filter expression [about(P,S)] is: there exists an element about 'S' reachable from the context node by path P. Thus, [about(P,S)] is equivalent to [P[about(.,'S')]]. Hence, without loss of expressive power we can restrict the path in about() to be always the dot.

Definition 2. Twig NEXI ⊆ NEXI. It has no union and restricts full-text search to about(.,s).

Under any reasonable semantics, union distributes over filters and //, and so we have

Proposition 1. Every NEXI query is equivalent to a union of Twig NEXI queries.

Union only plays a marginal role in NEXI (see below), so in this paper we just consider Twig NEXI. We now adapt the tree pattern or twig formalism to incorporate the about() function.

3.1.2 Twig syntax

The definition of a twig is based on [12, 15]. We do not need edge labeling anymore, as we only consider the descendant relation between nodes. Full text search is modelled with a function C mapping nodes to a set of keywords.

Definition 3. A full text twig T is a tuple (V, E, C, L, d, r) where
• V is a finite set, E ⊆ V × V, and (V, E) is a tree with root r,
• d ∈ V is the answer (distinguished) node,
• L : V → N ∪ {∗} maps nodes to labels, and
• C : V → P(K) maps nodes to finite sets of keywords.

The distinguished path of a twig T consists of all nodes between and including the root and the distinguished node. By design every Twig NEXI query can be written as a full text twig and the other way around. This is based on the usual mapping between XPath and twig syntax, with the extension that the argument of about(.,t) forms the keywords in a twig.

There are different twigs which are nonetheless equivalent. For instance, consider .//p and .[.//p]//p. The minimal twigs from [4, 16] provide a unique normal form for twigs. A twig is minimal if every homomorphism to itself is the identity mapping. A homomorphism h for full text twigs also preserves the keywords, that is, for all nodes n we have that C(n) ⊆ C(h(n)).
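As an illustration of the minimality condition, the following is a minimal Python sketch, not the minimization algorithm of [4, 16]. It enumerates all homomorphisms from a full text twig to itself by brute force (exponential in the number of nodes) and checks whether each one is the identity. The class `Twig`, the functions `homomorphisms` and `is_minimal`, and the node names are hypothetical. The conditions that a homomorphism maps root to root and answer node to answer node, that a wildcard node may map to any node, and that the root carries the wildcard label are assumptions taken from the usual tree pattern setting; only the keyword condition C(n) ⊆ C(h(n)) is stated in the text.

```python
from dataclasses import dataclass
from itertools import product

WILDCARD = "*"

@dataclass
class Twig:
    """Full text twig (V, E, C, L, d, r); E is stored as child -> parent."""
    parent: dict    # node -> parent node, None for the root r
    label: dict     # L: node -> element name or WILDCARD
    keywords: dict  # C: node -> frozenset of keywords
    root: str       # r
    answer: str     # d, the distinguished node

    def is_proper_ancestor(self, a, b):
        cur = self.parent[b]
        while cur is not None:
            if cur == a:
                return True
            cur = self.parent[cur]
        return False

def homomorphisms(src, dst):
    """Yield every homomorphism src -> dst (assumed conditions, brute force)."""
    nodes = list(src.parent)
    for image in product(list(dst.parent), repeat=len(nodes)):
        h = dict(zip(nodes, image))
        if h[src.root] != dst.root or h[src.answer] != dst.answer:
            continue
        if all(
            # labels are preserved unless the source node is a wildcard
            src.label[n] in (WILDCARD, dst.label[h[n]])
            # keywords are preserved: C(n) is a subset of C(h(n))
            and src.keywords[n] <= dst.keywords[h[n]]
            # descendant edges map to proper ancestor/descendant pairs
            and (src.parent[n] is None
                 or dst.is_proper_ancestor(h[src.parent[n]], h[n]))
            for n in nodes
        ):
            yield h

def is_minimal(twig):
    """A twig is minimal if every homomorphism to itself is the identity."""
    return all(all(h[n] == n for n in h) for h in homomorphisms(twig, twig))

none = frozenset()
# .//p
single = Twig({"r": None, "a": "r"}, {"r": WILDCARD, "a": "p"},
              {"r": none, "a": none}, "r", "a")
# .[.//p]//p : the filter node f can be folded onto the answer node a
redundant = Twig({"r": None, "f": "r", "a": "r"},
                 {"r": WILDCARD, "f": "p", "a": "p"},
                 {"r": none, "f": none, "a": none}, "r", "a")
# .[.//p[about(.,'XML')]]//p : the keyword on f blocks the folding
guarded = Twig({"r": None, "f": "r", "a": "r"},
               {"r": WILDCARD, "f": "p", "a": "p"},
               {"r": none, "f": frozenset({"XML"}), "a": none}, "r", "a")
```

Under these assumptions, `is_minimal(single)` and `is_minimal(guarded)` hold while `is_minimal(redundant)` does not, which matches the observation that .//p and .[.//p]//p are equivalent twigs of which only the first is in normal form.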

3.2 Possible semantics

The semantics of any NEXI variant is determined by the interpretation of the about(Path,String) function. All other operators receive their standard XPath interpretation. The reason to introduce about() was to implement a less strict variant of contains() with functionality familiar from search engines. For instance, about(.,'Xml') and about(.,'XML') could be equivalent queries. Our about() function serves the same purpose as the ftcontains() function in [19, 2]. We refer to the XQuery Full Text document for further semantic considerations on the textual part of about().

Within INEX there have also been experiments with less strict versions of the path condition. For example, a query //article[about(p,'XML')] could return articles which only contain XML elements called ip, p1 or p2 which are about 'XML'. What happened here was that the query was posed to a heterogeneous collection of articles (all IEEE journals from a certain period) which do not have a uniform XML schema, and as a result paragraphs could be called p, ip, p1 or p2 in this collection. With a non-strict interpretation of the structural part of about(), systems are in essence performing XML data mediation [1]. In fact, a simple mapping rule mechanism was used in INEX.⁴

To summarize, with two arguments of about() and two possibilities (strict and loose) for the semantics of each, there are four different styles of semantics. We want to separate the data mediation issues from the full text issues, so we only consider a strict interpretation of the structural part of about(). This leaves us with two semantics which we discuss now.

3.2.1 Strict semantics

With the strict semantics, about() is interpreted as contains() [9]. Let ϕ be a Twig NEXI query and T an XML document tree. Let ϕ′ be ϕ with each about(.,s) replaced by contains(.,s). Then ans_S(ϕ, T) denotes the set of answer nodes obtained from interpreting ϕ′ at the root of T according to the standard XPath semantics. We say that ϕ₁ ≡_S ϕ₂ iff ∀T, ans_S(ϕ₁, T) = ans_S(ϕ₂, T). Under the strict semantics the results about minimization from [4] can be extended to tree patterns with full text, and each NEXI query has a unique normal form.

Theorem 1 ([4]). (i) Under the strict semantics, every full text twig is equivalent to a unique minimal full text twig. (ii) This minimal full text twig can be found in polynomial time.
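The strict answer set can be illustrated with a small evaluator. The following is a minimal Python sketch under stated assumptions, not an implementation taken from the paper or from INEX: the class `Pattern` and the functions `descendants`, `local_ok`, `holds` and `answers` are hypothetical names; about(.,s) is read as a case-sensitive substring test on the concatenated text of an element, mimicking contains(); and the root element of the parsed document plays the role of the context node at which the query is interpreted.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Pattern:
    """One twig node: label, keywords, filter branches, next path step."""
    name: str                                    # element name, "*" or "."
    about: tuple = ()                            # keywords s of about(.,s)
    filters: list = field(default_factory=list)  # branches [.//...]
    step: "Pattern | None" = None                # next //-step on the path

def descendants(elem):
    it = elem.iter()
    next(it)              # skip elem itself: // is the proper descendant axis
    return list(it)

def local_ok(p, elem):
    """Label, keyword and filter conditions of pattern node p at elem."""
    if p.name not in (".", "*") and elem.tag != p.name:
        return False
    text = "".join(elem.itertext())
    if not all(s in text for s in p.about):      # about() read as contains()
        return False
    return all(any(holds(f, d) for d in descendants(elem)) for f in p.filters)

def holds(p, elem):
    """True if the filter path starting at p can be matched at elem."""
    if not local_ok(p, elem):
        return False
    return p.step is None or any(holds(p.step, d) for d in descendants(elem))

def answers(query, root):
    """Elements matched by the last step of the distinguished path."""
    current = [root] if local_ok(query, root) else []
    p = query
    while p.step is not None:
        p = p.step
        seen, found = set(), []
        for e in current:
            for d in descendants(e):
                if id(d) not in seen and local_ok(p, d):
                    seen.add(id(d))
                    found.append(d)
        current = found
    return current

doc = ET.fromstring(
    "<dblp><article><title>XML retrieval</title><sec><p>about XML</p></sec>"
    "</article><article><p>databases</p></article></dblp>"
)
q1 = Pattern(".", step=Pattern("p"))                           # .//p
q2 = Pattern(".", filters=[Pattern("p")], step=Pattern("p"))   # .[.//p]//p
q3 = Pattern(".", step=Pattern("article", about=("XML",),
                               step=Pattern("p")))
```

On the sample document, `answers(q1, doc)` and `answers(q2, doc)` contain the same two p elements, which is one instance of ϕ₁ ≡_S ϕ₂ on a single tree; equivalence proper quantifies over all trees T. The query `q3`, that is .//article[about(.,'XML')]//p, selects only the p element inside the first article.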

3.2.2 Loose semantics

One approach to implementing an XPath query with a loose interpretation of about() is as a score region algebra expression [14] interpreted on a region algebra corresponding to an XML tree. A score region algebra is an extended version of the region algebra for structured text search [7]. The universe R of a region algebra is a set of regions {(s, e, n) ∈ ℕ × ℕ × N | s ≤ e}.

⁴ This is basically a GAV style mechanism in which elements in the global schema were replaced by a union of local schema elements. E.g., p was replaced by p | ip | p1 | p2.

[Figure 1: Example tree T, with structure root(s(n(John), vp(v(loves), n(Mary))))]

There are several ways in which an XML tree can be transformed into a region set. Here we discuss the simplest one, in which there is a bijection between tree nodes and regions: the region bounds s and e for XML trees are determined from the pre-order document tree traversal. For example, the tree T in Figure 1 corresponds to the XML document on the left in Table 1 and to the set of regions R(T) on the right in Table 1. The name n of a region is the label of its corresponding node.

Position  Token    Region
0         <root>   (0, 14, root)
1         <s>      (1, 13, s)
2         <n>      (2, 4, n)
3         John     (3, 3, John)
4         </n>
5         <vp>     (5, 12, vp)
6         <v>      (6, 8, v)
7         loves    (7, 7, loves)
8         </v>
9         <n>      (9, 11, n)
10        Mary     (10, 10, Mary)
11        </n>
12        </vp>
13        </s>
14        </root>

Table 1: Pre-order example document of T, with the set of corresponding regions R(T).

The domain of a region algebra is a set of sets of regions. Many operators can be defined on such a domain. We discuss just those needed for interpreting Twig NEXI: σ_{n=name}(·), ⊐ and ⊏.

σ_{n=name}(R) = {(s, e, n) ∈ R | n = name}
R1 ⊏ R2 = {(s, e, n) ∈ R1 | ∃(s′, e′, n′) ∈ R2 : s′ < s & e′ > e}
R1 ⊐ R2 = {(s, e, n) ∈ R1 | ∃(s′, e′, n′) ∈ R2 : s′ > s & e′ < e}.

Then a region algebra is a tuple (P(R), ⊐,