Fuzzy Sets and Systems 140 (2003) 75 – 91 www.elsevier.com/locate/fss
An approach to knowledge-based query evaluation
Troels Andreasen
Computer Science, Roskilde University, Roskilde, Denmark
Abstract
We describe a query evaluation principle based on hierarchical aggregation. Queries are assumed to be free-form expressions: lists of unconnected words or fragments of natural language (NL). A query is transformed into a nested structure of subsets of attributes. During query evaluation, each subset is aggregated individually and the resulting aggregates are in turn aggregated, leading to a single aggregate that measures the full query against an object from the database. The key idea of the described principle is to obtain a knowledge-based query evaluation. This is achieved by initiating the query evaluation process with a manipulation of the query based on knowledge in a domain-specific knowledge base. An experimental prototype system implementing this line of query evaluation is described. © 2003 Elsevier B.V. All rights reserved.
1. Introduction
The evaluation principle described in this paper is considered in the context of an approach to ontology-based querying, where the employed knowledge base contains domain-specific knowledge comprising a dictionary and an ontology for a given domain. The general idea of this line of evaluation is to assimilate applicable knowledge during the evaluation in order to guide and improve the process. The transformation of the query into a compound expression is based on knowledge from the knowledge base. In addition, the knowledge base is the source for inferring appropriate parameters for the aggregation of query terms, and finally it is used to introduce similarity: query terms are expanded into sets that also cover similar terms. While the user may pose a query in a simple form, as a list of words or in natural language (NL), queries are evaluated as compound expressions. Through a knowledge-based manipulation of the initial query, answers to queries are improved. The goal is to obtain a retrieval from text databases that is content-based rather than word-based and that further exploits the knowledge represented in an ontology.
E-mail address: [email protected] (T. Andreasen).
0165-0114/03/$ - see front matter. doi:10.1016/S0165-0114(03)00028-9
In this approach, content-based retrieval is achieved through descriptions derived for database objects as well as for queries. A description of a query is an intermediate representation of the “content”, derived as described above. Descriptions are assumed to be derived similarly for text objects in the database. Hence the two main components of a system supporting this kind of querying are one for generating descriptions and another for comparing them. The approach described in this paper has been developed as part of the OntoQuery project.
The remainder of this paper is organized as follows. In the following section the OntoQuery project is briefly described. In Sections 3 and 4 descriptions, as ontology-based internal representations of queries and text objects, are discussed. In Section 5 the problem of ontology-based querying is introduced and in Section 6 a specific approach is suggested. Section 7 describes a prototype system implementing the approach and Section 8 concludes.

2. OntoQuery project
The OntoQuery or ‘Ontology-based Querying’ project [5,6] aims to contribute to the development of general solutions to the querying of text databases and to the extraction of descriptions of database objects through limited computational natural language understanding. Stressing the use of ontologies, the project provides a content-based query and retrieval functionality that goes beyond the superficial keyword recognition typical of current search engines, whilst not attempting a full semantic analysis of source texts. The overall methodological goal of the project is to develop a coherent theory for the ontological representation of domain knowledge, for ontological semantics for natural language phrases and for ontology-based search in text databases. The project addresses:
• Development of theories, methods and tools for establishing formal ontologies integrated with language-specific terminology and lexical networks. A key idea is to introduce a formal language for ontologies, Ontolog, which combines taxonomies with object and relational expression forms.
• Development of methods for ontology-based linguistic analysis of source texts and queries. This primarily concerns the identification and analysis of noun phrases (NPs), comprising morphological, syntactic and semantic analysis. In our approach, NPs are central to the specification of particular concepts in an ontology.
• Development of methodologies for ontology-based query processing that efficiently compare an internal formal description of a query with the ontological descriptions of text database objects. Query processing then becomes a matching of the query description with the descriptions of text database objects in the framework of the given ontology.
Ontolog is used simultaneously for the representation of domain knowledge in the ontology, for the representation of natural language semantics and for descriptions of the texts in the database, and it will allow for reasoning with the ontology. It is intended primarily as a theoretical, logical framework in which the different traditional representations at the various levels of analysis are conceptualized, analyzed and integrated. Thus a strategic purpose is to facilitate coherence in the resulting system architecture. The language offers descriptions that serve multiple purposes, ranging from feature structures in the linguistic analysis, via lexical semantic bases and terminology bases, to ontologies and query descriptions.
The use of natural language for database querying traditionally relies on a logical semantics, which determines the translation from syntax trees into a logical query language. This technique must restrict the query language to a small fragment of natural language, because a full computational semantic treatment of comprehensive natural language fragments is beyond the scope of current language technology. As an alternative, this project introduces an ontology-based semantic analysis of natural language texts and query phrases, which refrains from a full logical analysis of the meaning of natural language texts. As a starting point, we focus on the analysis and disambiguation of NPs, particularly those including adjectives, prepositional phrases (both complements and adjuncts) and genitives.
The project is funded by the Danish Research Agency under the Information Technology Program and further details can be found at http://www.ontoquery.dk/.
3. The role of descriptions
As already mentioned, a key idea is to introduce descriptions. Rather than matching words and properties from the query with words and properties from database objects, the approach is to add a level of abstraction by introducing descriptions and to perform the matching at the level of these instead. Objects in the database and queries are preprocessed in a similar manner, leading to descriptions, which may simply be considered as sets of attributes. Simple forms of attributes are words, but in general an attribute is an expression capturing a portion of meaning as a part of the description of the object. Expressions are concepts formed from words and other concepts using a set of relations available in the ontology language. This principle of processing assumes a knowledge base that includes a dictionary of words and an ontology of concepts. Words are transformed based on the dictionary such that morphological variants and synonyms are united. Concepts (and words) are related through the ontology, forming new concepts.
To generate descriptions, objects¹ are prepared by a special lightweight natural language processing (NLP). One special case of this form of processing is simple heuristic NP recognition leading to a grouping of the words in the sentence. The result may simply be in the form of a markup of the sentence identifying the beginnings and endings of its NP fragments. For instance, the sentence ‘The strange and horrifying man drove his red car into the dark wood’ may be marked ‘(The strange and horrifying man) drove (his red car) into (the dark wood)’, giving five groups: three NPs and two non-NPs (gaps between NPs).
¹ Objects are either queries or text fragments from the database, for instance a sentence from a document. Instead of a sentence, there may be a list of words. In this approach, queries and text objects are not required to be well-formed sentences.
Based on the markup (and grouping) of the sentence, further concept formation may be performed by applying semantic relations. This is an important aspect of the OntoQuery project, as explained in [5,6,9,13]. However, a very simple form of concept formation arises when the groups are applied directly. The groups may be restricted to certain word classes and the resulting groups considered as concepts. For instance, when restricted to nouns, adjectives and verbs, the description of the above sentence becomes
{{strange, horrifying, man}, {drive}, {red, car}, {dark, wood}}.
We may therefore consider this set of four elements, each of which is a set of words, as a description of the sentence having eight attributes joined in four different groups.
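The step from the NP markup to the description can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy `LEXICON` (word form to lemma and word class), the class names and the helper functions are hypothetical and stand in for the dictionary and NLP components of the knowledge base.

```python
import re

# Hypothetical toy lexicon: word form -> (lemma, word class).
# In a real system this information would come from a dictionary and a tagger.
LEXICON = {
    "the": ("the", "determiner"), "strange": ("strange", "adjective"),
    "and": ("and", "conjunction"), "horrifying": ("horrifying", "adjective"),
    "man": ("man", "noun"), "drove": ("drive", "verb"),
    "his": ("his", "pronoun"), "red": ("red", "adjective"),
    "car": ("car", "noun"), "into": ("into", "preposition"),
    "dark": ("dark", "adjective"), "wood": ("wood", "noun"),
}
KEPT_CLASSES = {"noun", "adjective", "verb"}


def split_groups(marked_sentence):
    """Split an NP-marked sentence into NP groups and the gaps between them."""
    parts = re.split(r"\(([^()]*)\)", marked_sentence)
    return [part.strip() for part in parts if part.strip()]


def describe(marked_sentence):
    """Return the description: one set of lemmas per non-empty group."""
    description = []
    for group in split_groups(marked_sentence):
        lemmas = []
        for form in group.lower().split():
            lemma, word_class = LEXICON.get(form, (form, None))
            if word_class in KEPT_CLASSES:  # restrict to selected word classes
                lemmas.append(lemma)
        if lemmas:  # groups left empty by the restriction are dropped
            description.append(frozenset(lemmas))
    return description


MARKED = "(The strange and horrifying man) drove (his red car) into (the dark wood)"
groups = split_groups(MARKED)          # three NPs and two gaps
description = describe(MARKED)         # four sets of words
attribute_count = sum(len(group) for group in description)
```

The gap ‘into’ disappears because it contains no noun, adjective or verb, which is why five groups yield a description with four elements.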
4. Two-directional concept-formation
In a simple word-based document retrieval principle a text object is recognized through the words appearing in it. Thus in this context we may define a text object description as the corresponding single set of words. At the other end of the scale of complexity we may imagine a thorough, ontology-based NL analysis leading to a single concept capturing the meaning of the text object as a whole. There is no question that the word-based approach is too simplistic and that considerable improvements leading to overall better responses should be feasible. However, with the current results within NLP and the application of ontologies, single-concept descriptions are simply not achievable in general. Furthermore, even in cases where they can be derived, it is questionable whether single-concept descriptions are the best way to represent, for instance, a user’s intention. If the user brings several aspects together in a single query, then this query would probably be better represented by a set of items corresponding to these aspects than by a single-concept description. Thus we need a means of representing text objects through descriptions that are sets of “describers”, where describers are not reduced to single words but may be sets and/or concepts formed using semantic relations (expressions in the description language). As already mentioned, descriptions should be formed and interpreted through NLP and through the domain-specific ontology. The extent to which NLP and concept formation through the ontology should be integrated is an open question. Of course, some concept formation through the ontology should be performed during the NLP of the text object.
For instance, we may in some cases resolve ambiguities by partial concept formation, as in
{{many, disease}, {cause}, {lack[WRT vitaminD], winter}}
in place of
{{many, disease}, {cause}, {deficiency, vitaminD, winter}}
as a description of the sentence “Many diseases are caused by deficiency of vitamin D during winter”, using a semantic relation WRT (with respect to). Similar further formation may apply a relation like CBY (caused-by). However, if descriptions are to be sets of items and be open to further interpretation through the ontology, then we are concerned with two-directional concept formation. Consider the simple NLP approach presented above, where the result is a grouping of the words from the sentence, each group being a representative for a concept. In this simple approach, we obviously have two directions in which we can determine concept-inclusion relationships. From the ISA relation of the ontology we may derive that “man ISA human” and thus also, for instance, that
{strange, man} ISA {strange, human}.
Having sets as items representing concepts in descriptions also leads to concept formation from set inclusion. Basically we have that A ISA B if B ⊂ A, and thereby for instance the relation
{strange, man} ISA {man}.
This phenomenon obviously increases the complexity of concept formation and of the comparison of descriptions, since relations that run along both directions, such as {strange, man} ISA {human}, must also be taken into consideration.

5. The query-answering problem
In building systems for querying information bases, we should take into account that the typical user does not know the content of the base and that users in general do not have a clear picture of what prototypical instances within their intention might be. Therefore, queries posed will in general not have strictly matching objects, and even when they do, the objects might not be exactly what the user was really looking for. Query-answering thus becomes a very diffuse task where explicit handling of imprecision becomes vital, and even seemingly well-founded measures for evaluating the quality of answers, such as recall and precision, become inaccurate and subjective. Because of the lack of appropriate imprecision handling in query-answering systems, users have developed habits of being minimalist in query expressions. The average number of words used in queries on the Internet is well below two. While minimalist queries may be a good means of querying when the user has a clear view of need and intention, this approach may sometimes be a very poor method of working for less determined or less experienced users, who might require a greater number of words to indicate their intention.
With only a few words per query there is a large number of possible combinations, and thereby a large number of different queries to try in order to see whether they lead in the right direction, while a query with a large number of words would, with a Boolean interpretation, typically lead either to virtually everything or to nothing. At least, this is the case when a Boolean interpretation is taken and only a single logical connective indirectly applies³ to form the set of words into an expression. A range of improvements over these simple choices has emerged. Models suggested within the Information Retrieval (IR) area, such as the vector space model and the extended Boolean model, adopt more advanced “best match”-like interpretations, leading to an ordering of the answers that puts forward those objects that match the largest number of attributes. This is a matter of aggregating the correspondences of the individual attributes to each object, leading to a single grading of the object in the answer to the query. Further, when the restriction to Boolean evaluation of single query attributes (words, phrases or other properties of the query) is relaxed, further improvements of the matching mechanism may appear. While IR models in general also capture this aspect, e.g. by grading the relation of attributes to documents, we see a more concise treatment of the subject within fuzzy sets, where the general query-answering problem reduces to a problem of computing with attributes (as words), in the sense that both the correspondence between individual attributes and the aggregation of these correspondences are treated as a matter of fuzzy set membership evaluation.
³ The single logical connective could be indirectly specified through user interface choices like “at least one” or “all”, leading to either “or” or “and”.
[Figure: a tree of aggregation nodes labelled M1 Q1, M2 Q2 and M3 Q3 over the attributes A1 to A6.]
Fig. 1. A hierarchical query.
The query-answering problem is about finding objects satisfying the query attributes. For each attribute, some degree of satisfaction is better than none, and the more attributes satisfied the better. The problem may, in other words, be seen as a question of finding an appropriate fuzzy aggregation. In this direction the class of ordered weighted averaging (OWA) aggregators [8] has been shown to be very useful [2,11]. The OWA aggregation principle is very flexible and may include importance weighting. Further, it is very useful, in connection with querying, to apply linguistic quantifiers or, probably even simpler, to relate an adjustable ‘orness’ value to the query. Because of this flexibility, OWA fills part of the gap that arises when logical connectors are eliminated from query expressions for the purpose of simplification. Users may simply pose queries as lists of words; the interpretation may be based on a preferred orness value or a chosen linguistic quantifier, and importances may be added to query attributes. However, especially for the (very few) skilled users who easily understand logical expressions, there are still drawbacks from not having connectors to combine more complex query expressions. Importances may, of course, alter the influence of individual attributes but, apart from this, it is still only a single aggregation operator that is applied to the query as a whole. In [9], Yager introduces a hierarchical approach to aggregation as a language intended for document retrieval based on OWA.
The intention is to enable users to better represent their requirements using the language, which is called Hierarchical Document Retrieval Language (with the acronym HI-RET). The key idea in this language is to extend expressiveness beyond that of single-operator aggregation as used in OWA. Query attributes may be grouped for individual aggregation and the language is orthogonal in the sense that aggregated values may appear as arguments to aggregations. Thus, queries may be viewed as hierarchies, as shown in Fig. 1. If the general form of an importance-weighted, quantified aggregation expression is
$$\langle c_1, \ldots, c_n : M : Q \rangle$$
with components $c_i$, importance $M$ and quantifier $Q$, the evaluation of the query in Fig. 1 on a database object (or document) $d$ is
$$\langle A_1(d), \langle A_2(d), A_3(d), \langle A_4(d), A_5(d), A_6(d) : M_3 : Q_3 \rangle : M_2 : Q_2 \rangle : M_1 : Q_1 \rangle$$
where $A_i(d) \in [0, 1]$ measures the degree to which attribute $A_i$ conforms to document $d$, while $M_j$ and $Q_j$ are the importance and quantifier applied in the $j$th aggregate. This is very similar to the formation of compound expressions in logic using logical connectors (as aggregation operators), parentheses and terms (as attributes).
Without doubt, Yager’s generalization of OWA into hierarchical aggregation expressions has great potential within IR. As Yager observed, because of the generality obtained, a range of retrieval principles can be accommodated, among which is the vector space type. The question is, however, whether a means of querying through hierarchies, as suggested by Yager, is a realistic approach to a document retrieval language. Almost every user knows the meaning of words like ‘and’ and ‘or’. The problems that separate the skilled from other users lie in combining attributes, operators and parentheses while still keeping track of what the query really means. Almost every user also knows the meanings of words like ‘most’ and ‘few’. Similarly, however, there may be problems with combining attributes, linguistic quantifiers and parentheses, and this suggests that there may in turn be a problem with the usability of the suggested HI-RET language as a general-purpose document retrieval language.
We will not take this discussion any further here, but turn to another approach where Yager’s generalized aggregation principle, which is the key idea in HI-RET, may be shown to play an important role. On the one hand, it may be necessary to accept that dividing the query attributes into subsets and joining them back again using logical connectors or aggregation operators may be too complex a task for the typical user.
On the other hand, it may turn out that even though flexible and soft operators like OWA are particularly well suited for query evaluation, the limitations of the inherent single-operator principle may in some cases be too restrictive. If, therefore, we could utilize the advantages of an expressive language that allows compound expressions without adding complexity to the query formulations posed by users, we might significantly improve query-answering. What is needed to obtain this is a knowledge-based approach to query evaluation, where the simple list-of-words (possibly NL) query is transformed into a compound expression by heuristically deducing groupings of the query attributes and appropriate aggregation operators, and by inferring which parts of the query are more important. One approach in this direction is the following.

6. A knowledge-based query matching approach
In the OntoQuery project described above a key idea is to perform a knowledge-based query evaluation. The knowledge base concerned includes, as an essential part, a domain-specific ontology.
The evaluation principle involves a transformation leading to a unified expression of query attributes structured as a two-level hierarchical expression. This transformation involves a simplistic NLP analysis of the initial query expression, leading to the query description, and a heuristically based determination of the aggregation parameters connecting the description. The hierarchical expression resulting from the transformed query is further expanded using concept relations from the knowledge base, leading to an expression where single words/concepts are replaced by fuzzy sets, thereby introducing similarity. Finally, the answer to the query is derived by measuring the compatibilities between the transformed, expanded query and each database text object.

6.1. Transforming the query
Initially, NL processing is performed that heuristically identifies NPs. As exemplified above, groups are formed from NPs and NP-gaps and the result is a prepared hierarchy of two levels. For the query evaluation a hierarchical aggregation is applied over the groups, and each group is aggregated through importance-weighted, quantified, order-weighted averaging. The aggregation parameters are derived during the transformation process. The general principle is that aggregation is restrictive for individual groups and relaxed for the overall query aggregate, corresponding to linguistic quantifiers like ‘most’ for individual groups and ‘some’ for the query. This is, however, modified through importance weighting based on domain knowledge from the knowledge base, primarily by giving more importance to nouns in general and to domain-specific concepts from the ontology in particular. The sentence ‘The strange and horrifying man drove his red car into the dark wood’ will lead, as explained, to the description
{{strange, horrifying, man}, {drive}, {red, car}, {dark, wood}}.
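Section 6.1 names the group-level operator but does not define it. One common reading, given here only as a sketch, is Yager's importance-weighted, quantifier-guided OWA. The symbols $a_i$, $m_i$, $b_j$, $u_j$, $S_j$, $T$ and the power-function quantifier $Q(r) = r^2$ used for MOST are assumptions introduced for illustration; they are not taken from this paper.

```latex
% Assumed formulation: values a_i in [0,1] with importances m_i >= 0.
% b_j is the j-th largest value and u_j is the importance attached to it.
% Q is non-decreasing with Q(0) = 0 and Q(1) = 1, and S_0 = 0.
\[
  T = \sum_{i=1}^{n} m_i, \qquad
  S_j = \sum_{k=1}^{j} u_k, \qquad
  w_j = Q\!\left(\frac{S_j}{T}\right) - Q\!\left(\frac{S_{j-1}}{T}\right), \qquad
  \langle a_1, \ldots, a_n : M : Q \rangle = \sum_{j=1}^{n} w_j\, b_j .
\]
% Worked example with the assumed Q(r) = r^2 for MOST:
% group {strange, horrifying, man} with importances (1/2, 1/2, 1), so T = 2.
% Case 1: only "man" matches, so b_1 = 1 carries importance u_1 = 1.
\[
  w_1 = Q\!\left(\tfrac{1}{2}\right) = \tfrac{1}{4}
  \quad\Rightarrow\quad
  \langle 0, 0, 1 : (\tfrac{1}{2}, \tfrac{1}{2}, 1) : \mathrm{MOST} \rangle = \tfrac{1}{4}.
\]
% Case 2: only "strange" matches, so b_1 = 1 carries importance u_1 = 1/2.
\[
  w_1 = Q\!\left(\tfrac{1}{4}\right) = \tfrac{1}{16}
  \quad\Rightarrow\quad
  \langle 1, 0, 0 : (\tfrac{1}{2}, \tfrac{1}{2}, 1) : \mathrm{MOST} \rangle = \tfrac{1}{16}.
\]
```

Under these assumptions a matching noun contributes four times as much as a matching adjective, which is the kind of effect intended by giving nouns more importance.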
During query evaluation, the description of the query is further transformed into a hierarchical expression reflecting the aggregation principles mentioned. This expression may, for the example query sentence, be
$$\langle \langle \text{strange}, \text{horrifying}, \text{man} : (\tfrac{1}{2}, \tfrac{1}{2}, 1) : \text{MOST} \rangle, \text{drive}, \langle \text{red}, \text{car} : (\tfrac{1}{2}, 1) : \text{MOST} \rangle, \langle \text{dark}, \text{wood} : (\tfrac{1}{2}, 1) : \text{MOST} \rangle : (1, 1, 1, 1) : \text{SOME} \rangle$$
corresponding to the hierarchy in Fig. 2. Here, importance weighting is only exploited at the level of individual groups, where nouns are given more importance. The restrictive quantification for groups is by MOST and the relaxed quantification for the overall query is by SOME.

6.2. Expanding the transformed query/Applying ontology knowledge
A query is connected to the ontology through the query description; thus attributes and compound attributes of the query correspond to, and are treated as, words and concepts in the knowledge base. As indicated above, the ontology in the knowledge base is assumed to explicate concept inclusion. However, the ontology may well include various other relations between concepts, such as synonymy, partonomy and association relations, that directly contribute to similarity between concepts.
[Figure: two-level hierarchy. Root: some, importances (1, 1, 1, 1). Groups: {strange, horrifying, man} with most and (½, ½, 1); drive; {red, car} with most and (½, 1); {dark, wood} with most and (½, 1).]
Fig. 2. A representation of ‘The strange and horrifying man drove his red car into the dark wood’.
[Figure: ISA hierarchy with animal at the top, dog and cat below animal, and poodle and alsatian below dog.]
Fig. 3. Inclusion relation (ISA) with upwards reading, e.g. dog ISA animal.
Moreover, semantic relations defined in the ontology indirectly contribute to similarity through subsumption. For instance, ‘disease CBY lack WRT vitamin’ is subsumed by, and thus included in, the more general concepts ‘disease CBY lack’ and ‘disease’. For concept inclusion we should intuitively have strong similarity in the opposite direction of the inclusion (specialization), but the direction of the inclusion (generalization) must also contribute some degree of similarity. Synonymy obviously implies strong similarity, while partonomy is in general difficult to measure in terms of degrees of similarity. Association, in the variant that is statistically based on a corpus, has an inherent grading from the statistics and is thus very accurately reflected in graded similarity.
Until now the main concern in the OntoQuery project has been the concept inclusion relation. The mapping into similarity is based on ‘distance’ in the relation, where greater distance (a longer path in the relation graph) corresponds to smaller similarity, and where specialization corresponds to strong and generalization to weak similarity. With the example fragment of an ontology in Fig. 3, the term dog could be expanded to, for instance,
$$\text{dog}^{+} = 1/\text{dog} + 0.9/\text{poodle} + 0.9/\text{alsatian} + 0.2/\text{animal}$$
reflecting that generalization is expensive in connection with query expansion. Various principles can be applied for mapping concept relations into similarity functions. This issue is discussed in [4].
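One way to obtain an expansion like the one for dog is sketched below. The per-step factors (0.9 for a specialization step, 0.2 for a generalization step), the multiplicative decay with distance, the cut-off threshold and all function names are assumptions chosen only so that the sketch reproduces the example values; they are not the mapping used in the project, for which the text refers to [4]. The sketch also assumes a tree-shaped ISA relation with a single parent per concept, as in Fig. 3.

```python
# Child -> parent edges of the ISA fragment in Fig. 3 (a tree is assumed).
ISA = {"poodle": "dog", "alsatian": "dog", "dog": "animal", "cat": "animal"}

SPECIALIZATION = 0.9   # assumed factor per step down the ISA relation
GENERALIZATION = 0.2   # assumed factor per step up the ISA relation


def expand(term, isa=ISA, threshold=0.1):
    """Expand a term into a fuzzy set {concept: membership degree}."""
    fuzzy = {term: 1.0}                     # the original term is a full member

    # Specialization: walk down level by level, decaying with distance.
    frontier, degree = [term], SPECIALIZATION
    while frontier and degree >= threshold:
        children = [c for c, parent in isa.items() if parent in frontier]
        for child in children:
            fuzzy[child] = degree
        frontier, degree = children, degree * SPECIALIZATION

    # Generalization: walk up towards the top, decaying much faster.
    node, degree = term, GENERALIZATION
    while node in isa and degree >= threshold:
        node = isa[node]
        fuzzy[node] = degree
        degree *= GENERALIZATION

    return fuzzy


dog_plus = expand("dog")
```

In this sketch siblings such as cat are not reached at all, because only pure specialization and pure generalization paths are followed; a mapping based on general path distance would assign them a small degree instead.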
Expansion of the query is a simplification that replaces direct reasoning over the ontology during query evaluation, and graded similarity is the obvious means to make expansion useful. In the present approach the expansion replaces a term from the query with a fuzzy set in which the original term is a full member.

6.3. Query matching
The query is to be evaluated against text objects in the database at an appropriate level, probably the sentence level, such that the query is in principle compared to each sentence of each text in the database. As described above, the original query expression is transformed into a description (a grouping of terms). This description is modified into a hierarchical expression (a grouping of terms with specified quantifiers and importances). By applying the ontology, the hierarchical query expression is further modified into a hierarchical expression over expanded terms (a grouping of fuzzy sets with specified quantifiers and importances).
Description generation, as mentioned, applies not only to queries. It is assumed that descriptions are also generated for the texts in the database and stored as an indexing of the database objects. At the sentence level, each sentence in the database has a description attached. Thus matching is a matter of comparing the manipulated query expression with sentence descriptions in the database. Conformity between the query and a sentence in the database is calculated as the degree to which the query expression evaluates, given the sentence description. Based on this, the answer may be given as the most similar sentences and/or documents.
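The matching step can be sketched as follows, under several assumptions that are not taken from the text: the aggregation operator is Yager's importance-weighted, quantifier-guided OWA; MOST and SOME are modelled as the power functions r² and r^0.5; a term's degree of satisfaction is the highest membership degree of any member of its expanded fuzzy set that occurs in the sentence; and the sentence description is flattened to a set of words. The query, the sentences and all names are hypothetical.

```python
def rim_quantifier(alpha):
    """Power-function quantifier Q(r) = r**alpha (an assumed family)."""
    return lambda r: r ** alpha


MOST = rim_quantifier(2.0)   # assumption: restrictive, and-like
SOME = rim_quantifier(0.5)   # assumption: relaxed, or-like


def owa(values, importances, quantifier):
    """Importance-weighted, quantifier-guided OWA of values in [0, 1]."""
    total = sum(importances)
    if total <= 0:
        return 0.0
    ranked = sorted(zip(values, importances), key=lambda p: p[0], reverse=True)
    result, covered = 0.0, 0.0
    for value, importance in ranked:
        lower = quantifier(min(covered / total, 1.0))
        covered += importance
        upper = quantifier(min(covered / total, 1.0))
        result += (upper - lower) * value   # weight of this ordered position
    return result


def term_degree(expanded_term, sentence_words):
    """Best membership degree among the expanded term's members in the sentence."""
    return max((degree for word, degree in expanded_term.items()
                if word in sentence_words), default=0.0)


def conformity(query, sentence_description):
    """Two-level evaluation: MOST within groups, SOME across groups."""
    sentence_words = set().union(*sentence_description)
    group_scores = [
        owa([term_degree(term, sentence_words) for term in group["terms"]],
            group["importances"], MOST)
        for group in query["groups"]
    ]
    return owa(group_scores, query["importances"], SOME)


def rank(query, descriptions):
    """Indices of the descriptions, best match first."""
    scored = [(conformity(query, d), i) for i, d in enumerate(descriptions)]
    return [i for _, i in sorted(scored, key=lambda p: p[0], reverse=True)]


# Hypothetical expanded query for 'big dog': one group, noun weighted higher.
QUERY = {
    "groups": [{
        "terms": [{"big": 1.0},
                  {"dog": 1.0, "poodle": 0.9, "alsatian": 0.9, "animal": 0.2}],
        "importances": [0.5, 1.0],
    }],
    "importances": [1.0],
}
SENTENCES = [
    [{"small", "animal"}, {"sleep"}],
    [{"big", "poodle"}, {"bark"}],
    [{"big", "dog"}, {"run"}],
    [{"red", "car"}],
]
ranking = rank(QUERY, SENTENCES)
```

With these toy data the exact match comes first, the specialization (poodle) second and the generalization (animal) third, which mirrors the intended asymmetry between specialization and generalization.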
7. An experimental prototype
A preliminary prototype of the approach described has been developed for the purpose of querying a set of articles from the new Danish Encyclopedia, together with an ontology on nutrition developed in collaboration with Danmarks Nationalleksikon (the organization responsible for the Danish Encyclopedia). In the prototype we distinguish, in accordance with the approach, a database and a knowledge base component. The database includes documents/text objects (articles from the Encyclopedia) and descriptions of these. The knowledge base contains knowledge about the domain, mainly in the form of a concept ontology and dictionaries. The descriptions tie text objects (documents or fragments of texts) to the ontology. Thus, it is through the descriptions that the database and the knowledge base are connected, as depicted in Fig. 4.
The prototype may be seen as comprising two main components: a description generator and a description comparator. The description generator is applied when loading new documents/texts into the system database and when interpreting queries posed to the system. The most important part of the description generator is a simple and limited ontology-based NL parser. The description comparator is applied during query evaluation and may also be applied in establishing measures of text object distance. Comparison of descriptions involves reasoning within the adopted description language ONTOLOG [9].
[Figure: a knowledge base (domain knowledge: ontology, word lists/concepts, additional concept relations) and a database (text objects), connected through descriptions.]
Fig. 4. System resources: a database containing text objects (documents) and a knowledge base comprising knowledge about the domain of the texts, connected through descriptions.
Any text object in the database is provided with descriptions that serve as a sort of index. Essentially, keyword indexes can be regarded as special cases of concept descriptions. For additional details on the motivation behind the architecture of the prototype we refer to [6]. Below we introduce details of the current main components of the prototype system. These are, apart from the knowledge base (the ontology and lexicon), an auxiliary ontology navigator, a description generator and a query tool (the description comparator).
The prototype ontology and ontology-based lexicon are built on the Danish SIMPLE ontology [12,15] and extended with a domain-specific part developed empirically on the basis of texts on nutrition from the Danish Encyclopedia. These texts also make up the coverage of the domain-specific lexicon, currently consisting of approximately 1000 lexical entries. Pustejovsky’s theory of lexical meaning [16], relying on the Qualia Structure and on a highly structured lexicon in general, constitutes the backbone of the SIMPLE ontology.

7.1. The ontology navigation tool
The present prototype ontology only captures concept inclusion (the ISA relation), and it is navigation in this relation that is supported by the ontology navigator. As can be seen from Fig. 5, a concept is displayed with its connections to other concepts through the inclusion relation. What can be revealed about a concept by this tool is
• all paths to the top,
• immediate sibling concepts (the set of sub-concepts of any super-concept of the concept in focus) and
• immediate sub-concepts of the concept in focus.
The tool also includes a “find” function, which is a simple string-based search for concepts. The tool is available at www.ontoquery.dk/prototype/ontonav.
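The three kinds of information listed above, plus the string search, can be sketched over a small ISA graph. The data below contain only the concepts named in the caption of Fig. 5; the dictionary layout, the function names and the reading of "sibling" as sharing an immediate super-concept are assumptions for illustration and say nothing about how the actual tool is implemented.

```python
# Concept -> list of immediate super-concepts (fragment named in Fig. 5).
SUPER = {
    "pellagra": ["mangelsygdom"],
    "mangelsygdom": ["sygdom"],
    "allergi": ["sygdom"],
    "sygdom": ["disease"],
    "disease": ["agentive", "phenomenon"],
    "agentive": ["top"],
    "phenomenon": ["event"],
    "event": ["entity"],
    "entity": ["top"],
    "top": [],
}


def paths_to_top(concept, supers=SUPER):
    """All ISA paths from the concept to a concept without super-concepts."""
    parents = supers.get(concept, [])
    if not parents:
        return [[concept]]
    return [[concept] + path
            for parent in parents
            for path in paths_to_top(parent, supers)]


def sub_concepts(concept, supers=SUPER):
    """Immediate sub-concepts of the concept in focus."""
    return sorted(c for c, parents in supers.items() if concept in parents)


def siblings(concept, supers=SUPER):
    """Concepts sharing an immediate super-concept with the concept in focus."""
    found = set()
    for parent in supers.get(concept, []):
        found.update(sub_concepts(parent, supers))
    found.discard(concept)
    return sorted(found)


def find(substring, supers=SUPER):
    """Simple string-based search for concepts."""
    return sorted(c for c in supers if substring in c)


paths = paths_to_top("mangelsygdom")
```

Because disease has two super-concepts in this fragment, the recursion yields exactly the two paths to the top that the navigator displays for mangelsygdom.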
Fig. 5. The ontology navigator showing that mangelsygdom (deficiency disease) has, for instance, sygdom (disease) as super- and pellagra as sub-concept and further, that e.g. allergi (allergy) is a sibling and that the only two paths to the top of the ontology are mangelsygdom → sygdom → disease → agentive → top and mangelsygdom → sygdom → disease → phenomenon → event → entity → top.
In its current state the ontology navigator is only a simple auxiliary tool. The navigator is to be further developed in two directions. Firstly, the plan is to develop the tool towards a more general interpretation engine for the ontology representation language ONTOLOG [9], and secondly it is also a goal to shift to a more user-friendly graphical user interface for the navigator.

7.2. The description generator tool
Descriptions are compiled by the description generator when text documents are introduced to the system, and the description generator is also applied during the interpretation of queries. Description generation involves partial NL analysis, and the generator includes a tagger, a parser that recognizes NPs, and a subcomponent that builds descriptions in the description language. The generation process is performed in that order, as illustrated in Fig. 6.
[Diagram: Sentence → Tagger → Tagged sentence → NP-recognizer → Tagged sentence with marked NPs → Lemmatizer / Generator → Description]
Fig. 6. Prototype architecture—the process of generating descriptions.
Tagging is performed by an implementation of Eric Brill’s tagger [8], customized for interaction with a text database and trained for Danish. NP recognition is performed by Steven Abney’s chunk parser “Cass” [1]. The grammar has been developed manually as described in [14], and since the parser is based on finite state transducers, processing is very fast. In the first version, the NP recognizer only finds NP chunks extending from the beginning of the constituent to its head, according to the definition of a chunk given by Abney. This is a minimalist approach that keeps noise to a very low level. We have also developed a second version with an extension of the grammar aimed at recognizing at least some types of post-head dependents, so that the search algorithm can work with richer descriptions, but further experiments with this version are still needed. It still remains to develop a parser that produces a semantic analysis of the identified NPs in terms of concepts and semantic relations, resulting in more general concept expressions in the ontology language Ontolog (as described in [9]).

Descriptions are generated from the result of the NP recognizer through morphological processing producing unified lemmas (word identifiers) from words, through a selection of lemmas based on word classes, and through a grouping of lemmas based on NPs, as illustrated in Fig. 6. Presently the descriptions have the form of sets of sets of words, where a single set of words corresponds to an NP. Words may be concepts from the ontology or word forms from the Danish Language Council’s dictionary, and a set of words in principle represents an abstraction of concepts applied in the prototype.

The morphological processing at the current state only involves identifying words from word forms. Based on the Danish Language Council’s dictionary, a mapping from a Danish word form to a unique identifier (a lemma) for the word is performed. A numeric identifier or lemma is used, but when needed for display it is transformed into a standard form of the word. The selection based on word classes currently lets only adjectives and nouns through to the description. The grouping into sets of sets of words simply reflects the NPs recognized.

Experiments have been carried out with an extension of this coarse-grained approach so that gaps between NPs may also produce groups and so that verbs are also selected. These experiments have not yet been successful, and it seems that a means of selecting only verbs carrying meaning is needed. (A possible next step in this direction is to involve a positive list of words selected from general and domain-specific words.)
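The last stage of the pipeline (lemmatization, selection by word class and grouping by NP) can be sketched as follows. This is only an illustration under stated assumptions: the input is a hand-tagged, hand-chunked version of the Danish text used in Fig. 7, the tag names `N` and `ADJ`, the table `LEMMAS` and the function `describe` are hypothetical, and plain strings stand in for the numeric lemma identifiers used in the prototype.

```python
# Hypothetical word-form -> lemma table standing in for the dictionary
# lookup; strings replace the numeric identifiers used in the prototype.
LEMMAS = {
    "sygdomme": "sygdom",
    "ensidig": "ensidig",
    "kost": "kost",
    "tryptofan": "tryptofan",
}

# Only adjectives and nouns are let through (tag names are assumptions).
KEPT_CLASSES = {"ADJ", "N"}


def describe(chunks):
    """Build a set-of-sets description from NP chunks.

    `chunks` is a list of NP chunks; each chunk is a list of
    (word_form, word_class) pairs. Words outside NPs are assumed to have
    been dropped already by the NP recognizer.
    """
    description = set()
    for chunk in chunks:
        group = frozenset(
            LEMMAS.get(form.lower(), form.lower())
            for form, word_class in chunk
            if word_class in KEPT_CLASSES
        )
        if group:  # skip chunks with no adjective or noun
            description.add(group)
    return description


# Hand-chunked NPs of "sygdomme der følger af ensidig kost og vedrører tryptofan".
query_chunks = [
    [("sygdomme", "N")],
    [("ensidig", "ADJ"), ("kost", "N")],
    [("tryptofan", "N")],
]
description = describe(query_chunks)
print(sorted(sorted(group) for group in description))
```

Frozen sets are used for the inner groups so that they can be members of the outer set, which mirrors the set-of-sets form of the descriptions.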
Fig. 7. The description generator tool screen dump showing descriptions generated and intermediate results for the text “sygdomme der følger af ensidig kost og vedrører tryptofan” (“diseases following from homogeneous food and related to tryptofan”).
Query evaluation involves generating a description for the query, and generating descriptions to operate as indexes to text objects is a main task when building the database. Thus generation of descriptions is a subtask in other parts of the system. However, for isolated experiments with description generation, a tool that takes a string and produces a description has been developed. This tool is available at www.ontoquery.dk/prototype/generator and it can be used to produce a description from any fragment of Danish text. An example is shown in Fig. 7. It should also be noted that descriptions can be generated from simple lists of words rather than well-formed sentences. Thus queries can also be posed as lists of words.

7.3. The query tool

Query evaluation is, as already mentioned, reduced to comparison of a description of the query with descriptions of text objects in the database. As explained above, this mainly involves an aggregation of measures of correspondence between elements, or attributes, of the descriptions.

The query evaluation part of the current prototype, which is available at www.ontoquery.dk/prototype/query, is based on a simplified aggregation approach as compared with the principle introduced in Section 2. The simplified aggregation is a two-level order-weighted aggregation, with the two levels arising from the set-of-sets structure of descriptions. The method is a special, simplified case of the hierarchical aggregation principle introduced in [3].

As for the similarities to be aggregated, we use a naive approach based on distance (shortest path length) in the ontology. As an example take the fraction of the ontology shown in Fig. 5 and the query shown in Fig. 8. A concept X of the query matches a sub-concept Y by (10 − d(X, Y))/10, where d(X, Y) is the distance (path length in the current simple prototype). Thus X matches X to 1.0 and matches an immediate sub-concept to 0.9, a second-level sub-concept to 0.8, etc. Therefore sygdom (disease) matches pellagra to 0.8. Further, the degree to which (ensidig, mangelfuld, kost) ((homogeneous, insufficient, food)) matches (ensidig, kost) is aggregated over the query description part by simple average; thus the grade becomes 0.67, since (mangelfuld) is missing. The grading of the first object in the answer in Fig. 8 is derived from an aggregation of similarities (0.8, 0.67, 1.0), where the 1.0 corresponds to {tryptofan}, which describes both the query and the text object. The (outer) aggregation is in this case also a simple average of the similarities; thus the grading becomes 0.82.

Fig. 8. The querying tool of the OntoQuery prototype showing the answer to a query. Notice that the first object in the answer matches the query, but not fully since sygdom is a second-level super-concept to pellagra (see Fig. 5) and (ensidig, mangelfuld, kost) ((homogeneous, insufficient, food)) is considered a sub-concept of (ensidig, kost) ((homogeneous, food)).

Text objects in the prototype are sentences from the articles in the database. Thus a sentence-level description is assumed. In the tool, sentences from the answer appear as hyperlinks leading from a sentence to the article it appears in. The tool includes a separate article viewing part, not shown above.

8. Concluding remarks

We have introduced a knowledge-based approach to query evaluation, where simple word lists or queries posed in natural language are transformed into hierarchical expressions over quantified, importance-weighted groups of attributes for order-weighted evaluation. To further draw
on ontological knowledge from the knowledge base, queries are expanded through similarity reflecting concept relations from the ontology.

The approach described here combines two directions for utilizing knowledge in query evaluation. On the one hand, based on common knowledge, aggregation parameters (quantifiers and importances) are derived, and on the other, based on domain knowledge, similarities at the level of terms are derived.

References

[1] S. Abney, Partial parsing via finite-state cascades, Proc. ESSLLI’96 Robust Parsing Workshop, 1996, pp. 8–15.
[2] T. Andreasen, Flexible database querying based on associations of domain values, in: ISMIS’97, Eighth Internat. Symp. on Methodologies for Intelligent Systems, Charlotte, North Carolina, Springer Verlag, Lecture Notes in Artificial Intelligence, 1997, pp. 570–578.
[3] T. Andreasen, Query evaluation based on domain-specific ontologies, NAFIPS’2001, Proc. 20th IFSA/NAFIPS Internat. Conf. Fuzziness and Soft Computing, 2001, pp. 1844–1849.
[4] T. Andreasen, On knowledge-guided fuzzy aggregation, Proc. IPMU’2002 Ninth Internat. Conf. on Information Processing and Management of Uncertainty in Knowledge-Based Systems, 2002, pp. 1561–1568.
[5] T. Andreasen, H. Erdman Thomsen, J. Fischer Nilsson, The OntoQuery project, in: P. Anker Jensen, P. Skadhauge (Eds.), Ontology-Based Interpretation of Noun Phrases, Proc. First Internat. OntoQuery Workshop, Department of Business Communication and Information Science, University of Southern Denmark, Kolding, 2001, 2000, pp. 1–10.
[6] T. Andreasen, J. Fischer Nilsson, H. Erdman Thomsen, Ontology-based querying, in: H.L. Larsen et al. (Eds.), Flexible Query Answering Systems, Recent Advances, Physica-Verlag, Springer, Berlin, 2000, pp. 15–26.
[7] P. Anker Jensen, P. Skadhauge (Eds.), Ontology-Based Interpretation of Noun Phrases, Proc. First Internat. OntoQuery Workshop, Department of Business Communication and Information Science, University of Southern Denmark, Kolding, 2001.
[8] E. Brill, Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging, Comput. Linguistics 21 (4) (1995) 543–565.
[9] J. Fischer Nilsson, A logico-algebraic framework for ontologies, in: P. Anker Jensen, P. Skadhauge (Eds.), Ontology-Based Interpretation of Noun Phrases, Proc. First Internat. OntoQuery Workshop, Department of Business Communication and Information Science, University of Southern Denmark, Kolding, 2001, 2000, pp. 11–42.
[10] D.H. Kraft, G. Bordogna, G. Pasi, Fuzzy set techniques in information retrieval, in: J.C. Bezdek, D. Dubois, H. Prade (Eds.), Fuzzy Sets in Approximate Reasoning and Information Systems, Kluwer Academic Publishers, Norwell, MA, 1999, pp. 469–510.
[11] H.L. Larsen, T. Andreasen, H. Christiansen, Knowledge discovery for flexible querying, in: T. Andreasen et al. (Eds.), Proc. Internat. Conf. on Flexible Query Answering Systems, 11–15 May 1998, Roskilde, Denmark, Lecture Notes in Artificial Intelligence, Springer, Berlin, 1998.
[12] A. Lenci, N. Bel, F. Busa, N. Calzolari, E. Gola, M. Monachini, A. Ogonowski, I. Peters, W. Peters, N. Ruimy, M. Villegas, A. Zampolli, SIMPLE: a general framework for the development of multilingual lexicons, A.P. Cowie, T. Fontenelle (Eds.), Internat. J. of Linguistics, vol. 13, Number 4, Oxford University Press, Oxford, 2000.
[13] B. Nistrup Madsen, B. Sandford Pedersen, H. Erdman Thomsen, Defining semantic relations for OntoQuery, in: P. Anker Jensen, P. Skadhauge (Eds.), Ontology-Based Interpretation of Noun Phrases, Proc. First Internat. OntoQuery Workshop, Department of Business Communication and Information Science, University of Southern Denmark, Kolding, 2001, 2000, pp. 57–88.
[14] P. Paggio, Parsing in OntoQuery – Experiments with LKB, in: P. Anker Jensen, P. Skadhauge (Eds.), Ontology-Based Interpretation of Noun Phrases, Proc. First Internat. OntoQuery Workshop, Department of Business Communication and Information Science, University of Southern Denmark, Kolding, 2001, 2000, pp. 89–102.
[15] B.S. Pedersen, The OntoQuery prototype ontology, in: H. Thomsen (Ed.), Proc. from OntoQuery Workshop on Ontologies and Search, January, 2001, LAMBDA no. 28, Department of Computational Linguistics, Copenhagen Business School, 2001, pp. 103–114.
[16] J. Pustejovsky, The Generative Lexicon, The MIT Press, Cambridge, MA, 1995.
[17] H. Thomsen (Ed.), Proc. from OntoQuery Workshop on Ontologies and Search, January, 2001, LAMBDA no. 28, Department of Computational Linguistics, Copenhagen Business School, 2001.
[18] R.R. Yager, On ordered weighted averaging aggregation operators in multicriteria decision making, IEEE Trans. Systems, Man Cybernet. 18 (1) (1988) 183–190.
[19] R.R. Yager, A hierarchical document retrieval language, in: Information Retrieval, vol. 3, Issue 4, Kluwer Academic Publishers, Dordrecht, MA, 2000, pp. 357–377.
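As a supplementary illustration, the grading of the first object in the answer of Fig. 8 (Section 7.3) can be retraced numerically. The sketch below assumes the distance-based similarity (10 − d)/10 and simple averaging at both levels, as stated in the text; the function names are hypothetical. Scoring the missing word mangelfuld as 0 and averaging over the three-word group is an assumption made to reproduce the reported 0.67. The general order-weighted aggregation is not implemented here.

```python
def similarity(distance):
    """Naive similarity (10 - d) / 10 for a path length d in the ontology."""
    return (10 - distance) / 10


def average(values):
    """Simple average, used here for both the inner and outer aggregation."""
    values = list(values)
    return sum(values) / len(values)


def two_level_grade(groups):
    """Average within each group, then average the group grades."""
    return average(average(group) for group in groups)


# One list of element-level similarities per group of the description.
groups = [
    [similarity(2)],   # sygdom vs. pellagra, a second-level sub-concept
    [1.0, 0.0, 1.0],   # ensidig, mangelfuld (missing, assumed 0), kost
    [similarity(0)],   # tryptofan occurs in both descriptions
]

inner = [round(average(group), 2) for group in groups]
grade = round(two_level_grade(groups), 2)
print(inner, grade)
```

With these inputs the inner grades are 0.8, 0.67 and 1.0 and the overall grade is 0.82, in agreement with the values reported in Section 7.3.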