Query Terms Abstraction Layers

Stein L. Tomassen and Darijus Strasunskas

Department of Computer and Information Science, Norwegian University of Science and Technology, Sem Saelandsvei 7-9, NO-7491 Trondheim, Norway
{stein.l.tomassen, darijus.strasunskas}@idi.ntnu.no

Abstract. A problem with traditional information retrieval systems is that they typically retrieve information without an explicitly defined domain of interest to the user. Consequently, the system presents a lot of information that is of little relevance to the user. Ideally, the queries’ real intentions should be exposed and reflected in the way the underlying retrieval machinery deals with them. In this paper we propose using abstraction layers to differentiate between the query terms. We explain why we believe this differentiation of query terms is necessary and discuss the potential of this approach. The whole retrieval system is under development as part of a Semantic Web standardization project for the Norwegian oil and gas industry.

1 Introduction

Query interpretation is the first phase of an information retrieval (IR) session and the only part of the session that receives clear input from the user. Traditional vector space retrieval systems view queries from a syntactic perspective and calculate document similarities by counting the frequencies of strings whose meaning they do not interpret. They typically retrieve information without an explicitly defined domain of interest to the user. As a result, the system presents a lot of information that is of little relevance to the user. The retrieval and ranking process is therefore very important, though the result crucially hinges on the user’s ability to specify his or her information needs unambiguously. Ideally, the queries’ real intentions should be exposed and reflected in the way the underlying retrieval machinery deals with them. In this paper we propose using abstraction layers to differentiate between the terms specified in a query, to better grasp the real intention of the query.

Users tend to use very few terms (three or fewer) in their search queries [1, 2]. As a result, the system cannot understand the context of the user’s query, which results in lower precision. By adding more relevant terms to the query, the domain of interest can to some extent be identified. However, adding the correct terms is not always trivial, since the user needs knowledge of the terminology used in that particular domain to find them. For an explorative search, the process of finding the correct terms can be satisfactory. However, that is not the case when precise information is needed.
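To make the syntactic view of queries concrete, the following is a minimal sketch of term-frequency matching with cosine similarity. It is not the system described in this paper; the two toy documents and the function names `tf_vector` and `cosine` are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term frequencies; word order and meaning are discarded."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two hypothetical documents from different domains that share the query words.
docs = {
    "subsea": "the christmas tree controls flow from the subsea well",
    "holiday": "decorate the christmas tree with lights for the holiday",
}

query = tf_vector("christmas tree")
scores = {name: cosine(query, tf_vector(text)) for name, text in docs.items()}
```

In this toy case both documents receive the same score, because the short query gives the system nothing with which to tell the two domains apart.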

According to Gulla [1], only about 10% of users use the advanced features of Web search engines. In addition, an internal study at FAST¹ shows that 48% of the retrieved documents that users view before starting a new search are among the three top-ranked documents [1]. This makes it similarly difficult to state what kind of documents might be of interest to the user.

Another problem with current search engines is that they do not understand the content of the documents and consequently cannot filter out documents that are not relevant. Most Web search engines use many traditional information retrieval techniques when indexing the documents, such as stemming and removal of stopwords. In this process, they do not try to understand the content of the documents but primarily make them as easily available for retrieval as possible. This often means that the documents are normalized, and it becomes even harder to differentiate between them.

One reason why IR systems have major problems understanding the intention of either queries or documents is how we humans use concepts. Typically, humans think of words as concepts, e.g. a word like ‘sports’ covers all kinds of sports like soccer, skiing, golf, etc. However, we also interpret concepts differently depending on each person’s background; for example, some do not regard golf as a real sport. IR systems work in word-space while we humans deal with information in concept-space [3, 4]. Since typical IR systems work in word-space to retrieve documents written by humans thinking in concept-space, the result is often not satisfying.

A novel and promising approach is concept-based search [5, 6, 7]. With this approach, the burden of knowing how the documents are written is taken off the user, who can instead focus on searching at a conceptual level. One problem with this approach is finding good concepts. Some promising approaches described in [5, 7] find concepts based on the result set of each search, which is then used to refine the search. However, the relationships between the concepts are neglected.

Ontologies can define concepts and the relationships among them for any domain of interest [8] and are therefore suitable for defining domains. In our approach [9], we use ontologies to define concepts in a particular domain. We use a query enrichment approach that uses contextually enriched ontologies to bring the queries closer to the user’s preferences and the characteristics of the document collection. The idea is to associate every concept (classes and instances) of the ontology with a feature vector (fv) to tailor these concepts to the specific terminology used in the document collection. Synonyms and conjugations would naturally go into such a vector, but we would also like to include related terms that tend to be used in connection with the concept and that provide a contextual definition of it. Afterward, the fvs are used to enrich the query provided by the user. In addition, we exploit the relationships between the concepts defined in the ontology when post-processing the result set of the search for filtering and presentation.

Preliminary results of our approach look promising compared to traditional IR approaches. However, we have acknowledged the need to differentiate between the query terms to increase the precision of the search. For an explorative search, using only concepts when defining the query gives satisfactory results. However, for a more precise search, using only concepts does not give equally acceptable results. Consequently, the user should be able to use a mixed approach with both terms and concepts. In our approach all the concepts are related to the domain specified by the ontology; this should also be the case for the other query terms. This proposed differentiation of the query terms is further described in section 3.

This research is part of the Integrated Information Platform for reservoir and subsea production systems (IIP) project supported by the Norwegian Research Council (NFR)², which funds this work. The IIP project is creating an ontology for all subsea equipment used by the oil and gas industry. The project will make this ontology publicly available and have it standardized by the International Organization for Standardization (ISO)³.

This paper is organized as follows. In section 2, related work is discussed. In section 3, the proposed layers of abstraction of query terms are presented. In section 4, we describe an approach to constructing feature vectors. Finally, in section 5 we discuss the potential of this approach and conclude the paper.

¹ Fast Search & Transfer ASA, http://www.fastsearch.com/
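The query enrichment idea described in this section, where each ontology concept carries a feature vector that is used to expand the user’s query, could be sketched roughly as follows. This is only an illustration under stated assumptions: the terms and weights in `FEATURE_VECTORS` are invented, and `enrich_query` is a hypothetical helper, not the implementation from [9].

```python
# Hypothetical feature vectors: concept label -> {term: weight in [0, 1]}.
# The terms and weights are invented for illustration only.
FEATURE_VECTORS = {
    "christmas tree": {
        "christmas tree": 1.0,
        "xmas tree": 0.9,
        "valve": 0.6,
        "wellhead": 0.5,
    },
}


def enrich_query(query_terms, feature_vectors):
    """Expand each query term that names a concept with that concept's fv.

    Terms that are not known concepts are kept unchanged with weight 1.0.
    If a term is contributed more than once, the highest weight wins.
    """
    enriched = {}
    for term in query_terms:
        key = term.lower()
        fv = feature_vectors.get(key, {key: 1.0})
        for fv_term, weight in fv.items():
            enriched[fv_term] = max(enriched.get(fv_term, 0.0), weight)
    return enriched


enriched = enrich_query(["Christmas tree", "pressure"], FEATURE_VECTORS)
```

Here the concept “Christmas tree” is replaced by its weighted terms, while the plain term “pressure” passes through untouched; the weighted term set would then be handed to an ordinary keyword-based search engine.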

2 Related Work

Traditional information retrieval techniques (i.e., the vector-space model) have the advantage of being fast and giving fair results. However, it is difficult to represent the content of the documents meaningfully using these techniques. That is, after the documents are indexed, they become a “bag of terms”, and hence the semantics is partly lost in this process.

The work related to our approach comes from two main areas: ontology-based IR in general, and approaches to query expansion in particular. General approaches to ontology-based IR can be further subdivided into Knowledge Base (KB) and vector space model driven approaches. KB approaches use reasoning mechanisms and ontological query languages to retrieve instances. Documents are either treated as instances or annotated using ontology instances [10, 11, 12, 13]. These approaches focus on retrieving instances rather than documents. Some approaches are combined with ontological filtering [14, 15, 16].

There are also approaches that combine ontology-based IR and the vector space model. For instance, some start with semantic querying using ontology query languages and use the resulting instances to retrieve relevant documents [13, 17]. The approach in [17] uses weighted annotation when associating documents with ontology instances. The weights are based on the frequency of occurrence of the instances in each document. The approach in [18] combines ontology usage with the vector-space model by extending a non-ontological query. There, the ontology is used to disambiguate queries. A simple text search is run on the concepts’ labels and users are asked to choose the proper term interpretation. A similar approach is described in [19], where documents are associated with concepts in the ontology. The concepts in the query are matched to the concepts of the ontology in order to retrieve terms, which are then used to calculate document similarity.

² The Research Council of Norway, http://www.forskningsradet.no
³ ISO, http://www.iso.org/

The approach in [14] uses ontologies for retrieval and filtering of domain information across multiple domains. There, each ontology concept is defined as a domain feature with detailed information relevant to the domain, including relationships with other features. The relationships used are hypernyms (super class), hyponyms (sub class), and synonyms. Unfortunately, [14] provides no details on how a domain feature is created.

Most query enrichment approaches, such as [4, 5, 6, 7, 20], do not use ontologies. Query expansion is typically done by extending the provided query terms with synonyms or hyponyms (cf. [21]). Some approaches focus on using ontologies in the process of enriching queries [12, 14, 19]. However, the ontology in such cases typically serves as a thesaurus containing synonyms and hypernyms/hyponyms, and the context of each term is not considered, i.e. every term is equally weighted.

The approach in [6] uses query expansion based on a similarity thesaurus. Weighting of terms is used to reflect the domain knowledge, and the query expansion is done by similarity measures. Similarly, [5] describes a conceptual query expansion in which the query concepts are created from a result set. Both approaches show an improvement compared to simple term-based queries, especially for short queries.

The system in [20] is a commercial search engine that provides three basic search strategies: word, concept, and superconcept search. A concept is represented as a set of words, while a superconcept is a combination of several closely related concepts. The user may mix strategies when searching. Unfortunately, [20] does not provide enough detail to state how this works.

In [4], each document and query is represented by a concept lattice, and ontologies are not used. The concept lattice for a document can learn and be improved by relevance feedback. Tests show a significant increase in efficiency as the system learns from experience. The authors have also recognized the need for a hybrid approach in which both concept matching and keyword matching are done.

The approaches presented in [3, 7] are most similar to ours. However, [7] does not use ontologies but relies on query concepts. Two techniques are used to create the feature vectors of the query concepts, based on the document set and on the result set of a user query, respectively. The approach presented in [3], in contrast, uses ontologies for the representation of concepts. The concepts are extended with similar words using a combination of Latent Semantic Analysis (LSA) and WordNet⁴. Both approaches get promising results for short or poorly formulated queries. The approach presented in [20] does differentiate between query terms by providing different search strategies. However, it is hard to tell how similar this approach is to ours, or how the differentiation is done, since few details are provided.

3 Layers of Abstraction

As mentioned, one of the major problems for traditional IR systems is to understand the intended meaning of the queries as well as the content of the documents in order to provide high retrieval effectiveness. A reason for this is that IR systems work in word-space while we humans deal with information in concept-space [3, 4]. In addition, humans are in general good at figuring out the correct context from relatively few words. The reason for this is the enormous amount of common knowledge that we possess. Researchers in artificial intelligence (AI) have acknowledged this and are therefore trying to capture this common knowledge in enormous knowledge bases (e.g. CYC⁵, Open Mind⁶). In practice, this common knowledge is not available for typical IR systems to use.

For an IR system to get a better understanding of the user query, the user needs to provide more information than is traditionally given. We believe that if the query specifies what is a concept, what is a term, and to what domain they relate, the IR system can better understand the intended meaning of the user. If the IR system gets a better understanding of the real intention of the user query, then retrieval effectiveness can be considerably improved.

⁴ WordNet, http://wordnet.princeton.edu/

Fig. 1. Query terms abstraction layers. The user can specify any of the levels in the pyramid, but domain must always be specified to take full advantage of this approach. Further, term, relaxed term, and concept must also have some kind of a relation to the specified domain.

Fig. 1 depicts our proposal of how the different terms of a query can be grouped into different abstraction layers depending on their intended role. The idea is that every term, relaxed term, and concept has a relation to a domain, where a domain is specified in an ontology consisting of classes and instances and the relationships among them.

A concept is defined in a domain and is consequently either a class or an instance defined in an ontology. Further, each concept has a feature vector consisting of, e.g., 5-15 terms from the document collection that have been associated with the concept. Synonyms and conjugations would naturally go into such a vector, but related terms that tend to be used in connection with the concept are also included. A concept can be defined for several domains, but the corresponding fvs will most likely be different.

A relaxed term is also defined in a domain and is proposed to have a shorter fv than a typical concept. An fv for a relaxed term normally contains synonyms that are closer to the original term than is typical for a concept (e.g. Volkswagen=). Note that nothing in this proposal prevents, e.g., “Volkswagen” from being both a concept and a relaxed term, with both defined in the same domain. Section 4 describes how fvs for both relaxed terms and concepts can be built; an example of fvs is depicted in Fig. 3.

Finally, a term is equal to a query term in a traditional IR system. A term does not necessarily have to be defined in a domain, that is, it need not be a class or instance defined in an ontology, but it is always indirectly related to a domain. A simple example of a term is “rabbit”, which is used in many different contexts. However, if it were said to have a relation to a domain describing “cars”, we would probably find that there is a car called, e.g., “Volkswagen Rabbit”.

⁵ Cycorp, Inc., http://www.cyc.com/
⁶ Open Mind Initiative, http://www.openmind.org/
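One possible way to represent a query that follows these abstraction layers is sketched below. The class names (`Layer`, `QueryItem`, `LayeredQuery`), the `expand` helper, and the contents of the feature vectors are all assumptions introduced for illustration; the paper does not give the actual vectors or a data model.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    TERM = "term"
    RELAXED_TERM = "relaxed term"
    CONCEPT = "concept"


@dataclass(frozen=True)
class QueryItem:
    text: str
    layer: Layer


@dataclass
class LayeredQuery:
    domain: str   # the domain must always be specified
    items: list   # list of QueryItem


# Hypothetical fvs keyed by (domain, layer, label). The same label may be
# both a concept and a relaxed term, with a shorter fv for the relaxed term.
FVS = {
    ("cars", Layer.CONCEPT, "volkswagen"): ["volkswagen", "vw", "golf", "beetle"],
    ("cars", Layer.RELAXED_TERM, "volkswagen"): ["volkswagen", "vw"],
}


def expand(query):
    """Return one list of search terms per query item.

    Plain terms are passed through as in a traditional IR system; relaxed
    terms and concepts are replaced by their domain-specific fv when known.
    """
    expanded = []
    for item in query.items:
        label = item.text.lower()
        if item.layer is Layer.TERM:
            expanded.append([label])
        else:
            expanded.append(FVS.get((query.domain, item.layer, label), [label]))
    return expanded


q = LayeredQuery("cars", [QueryItem("Volkswagen", Layer.RELAXED_TERM),
                          QueryItem("rabbit", Layer.TERM)])
result = expand(q)
```

Keying the vectors by domain reflects the statement that a concept defined in several domains will most likely have different fvs in each.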

4

Feature Vector Construction

Fig. 2 gives an overall view of the steps involved in the feature vector construction process. Before the process can start, the user needs to identify both the ontology and the document collection to use, where the latter is indicated in Fig. 2 as index and document collection. Next, we will explain in more detail the different steps of this process.

Fig. 2. The feature vector construction process. The dotted line from the document collection to the index indicates that the index is based on the document collection.

Step 1. Since we endeavor to create feature vectors for every concept in an ontology, the algorithm starts by traversing the ontology and creating an fv for each relaxed term. These fvs are based on a thesaurus such as WordNet.
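Step 1 might look roughly like the sketch below. To keep the example self-contained, a small hand-written dictionary (`TOY_THESAURUS`) stands in for a real thesaurus such as WordNet, and the ontology is reduced to a parent-to-children mapping; both, along with the function names, are assumptions made for illustration.

```python
# Toy stand-ins: a real system would read an ontology file and query a
# thesaurus such as WordNet. All entries here are invented.
ONTOLOGY = {
    "Equipment": ["Christmas tree", "Valve"],
    "Christmas tree": [],
    "Valve": [],
}
TOY_THESAURUS = {
    "christmas tree": ["xmas tree", "x-mas tree"],
}


def traverse(ontology, root):
    """Yield every concept reachable from root exactly once (depth-first)."""
    stack, seen = [root], set()
    while stack:
        concept = stack.pop()
        if concept in seen:
            continue
        seen.add(concept)
        yield concept
        stack.extend(ontology.get(concept, []))


def relaxed_term_fvs(ontology, root, thesaurus):
    """Build a short fv per concept: its own label plus thesaurus synonyms."""
    fvs = {}
    for concept in traverse(ontology, root):
        label = concept.lower()
        fvs[concept] = [label] + thesaurus.get(label, [])
    return fvs


fvs = relaxed_term_fvs(ONTOLOGY, "Equipment", TOY_THESAURUS)
```

The resulting short vectors are what Step 2 would use as keyword queries to retrieve candidate documents for each concept.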

Step 2. The fvs for the relaxed terms are used when retrieving documents for each of the concepts in the ontology. This retrieval session is keyword-based (for instance, see the concept “Christmas tree” in the ontology fragment illustrated in the upper part of Fig. 3).

Step 3. The result set for each concept is further processed by clustering techniques in order to identify (discriminate) different domains within the document collection. This is necessary because, at this stage of the process, the ontology concepts are treated as ordinary terms and can therefore be used in many different domains. Clustering allows us to find these different domains. However, we endeavor to enrich concepts only with terms from the domain relevant to the ontology at hand. That is done in the following steps.

Step 4. A problem at this stage is to identify the relevant domain. Therefore, we compute the similarity between the clusters of neighboring concepts. Commonality (i.e. high similarity) here identifies the document sets (clusters) that are relevant to the domain of interest. The hypothesis is that individual clusters having high similarity across ontology concepts are with high probability of the same domain. This hypothesis is backed up by observed patterns of collocated terms within the same domain; consequently, different domains will have different collocation patterns of terms.

Step 5. In this step, we identify the cluster that is relevant to the domain for each concept. However, the similarity of clusters depends a lot on the quality of the ontology, especially on how much the different concepts overlap. If the degree of overlap is high, it can be difficult to discriminate between the clusters when choosing the most representative one for a particular concept. Therefore, to resolve this ambiguity we might need manual intervention by an expert to identify these candidates before proceeding.

Step 6. The final cluster for a concept serves as a basis for constructing a feature vector, i.e. for relating a concept from the ontology to the terminology used in a document collection within a particular domain. Therefore, all the documents from the cluster of a concept are analyzed to find the terms most relevant to that concept (for instance, see the middle part of Fig. 3, where related terms are emphasized in bold). This is done by using a combination of Natural Language Processing (NLP), text mining, and statistical methods. The relationship between each term and the concept is indicated by a value in the range [0, 1], where 1 is the highest degree of relationship. Only key phrases or terms having a relationship to the concept above some threshold are considered; all other terms are rejected. The relationship values are used as the weight for each term of the feature vector (see the bottom part of Fig. 3, where a feature vector is composed from the terms related to “Christmas tree” and found in the document collection). Fig. 3 shows an explanatory example of this process for the concept “christmas tree”.
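Steps 4 to 6 can be illustrated with the following minimal sketch. It assumes that each cluster has already been summarized as term counts, uses cosine similarity to compare clusters of neighboring concepts, and uses frequency relative to the most frequent term as a stand-in for the combination of NLP, text mining, and statistical methods mentioned above. The function names, the toy clusters, and the threshold of 0.5 (treated as inclusive) are all assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_domain_cluster(clusters, neighbour_clusters):
    """Steps 4-5: index of the cluster most similar to a neighbour's cluster."""
    def best_match(cluster):
        return max((cosine(cluster, n) for n in neighbour_clusters), default=0.0)
    return max(range(len(clusters)), key=lambda i: best_match(clusters[i]))


def feature_vector(cluster, threshold=0.5):
    """Step 6: weight terms in [0, 1] and keep those at or above the threshold."""
    if not cluster:
        return {}
    top = max(cluster.values())
    weights = {term: count / top for term, count in cluster.items()}
    return {term: w for term, w in weights.items() if w >= threshold}


# Invented clusters for the concept "Christmas tree" and one neighbouring concept.
tree_clusters = [
    Counter({"holiday": 4, "lights": 3, "gift": 2}),
    Counter({"valve": 4, "wellhead": 3, "subsea": 2, "pipe": 1}),
]
neighbour_clusters = [Counter({"subsea": 5, "casing": 3, "valve": 2})]

chosen = pick_domain_cluster(tree_clusters, neighbour_clusters)
fv = feature_vector(tree_clusters[chosen])
```

In this toy data the holiday cluster shares no terms with the neighboring concept, so the subsea cluster is selected and its most frequent terms form the feature vector. A real implementation would need the expert intervention described in Step 5 when several clusters score similarly.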

Fig. 3. An explanatory example to illustrate parts of the process of constructing a feature vector for the concept “christmas tree”.

5 Discussion and Conclusion

In this paper we have proposed the use of abstraction layers to differentiate between the query terms in order to better grasp the user’s real intention behind a query. Results from the concept-based approaches presented in section 2 show an improvement over traditional IR methods where short or badly formulated queries are used. However, for longer, more specific queries, results show a slight decrease in effectiveness. This indicates that purely keyword-based or purely concept-based approaches are not sufficient. Therefore, by differentiating between the query terms (a mixed approach) and also relating all the query terms to a domain, we believe that IR systems can better understand the real intention and hence improve retrieval effectiveness considerably.

However, this approach has to be accepted by the end users or else it will fail. As mentioned, only about 10% of users use the advanced features of Web search engines. A probable reason is that most users either do not see a large increase in retrieval effectiveness from using the advanced feature dialog, or know these features by heart and hence use them directly instead. Nevertheless, for the approach presented here, users have to add some more information than they might be used to. E.g. for a simple search the user must identify the correct domain for a concept, or else the approach will be equal to the typical IR systems existing today. We believe that the user might be willing to do that if (s)he sees an improvement in search result quality, which we believe there will be. However, this will also very much depend on how easy it is for the user to specify this extra information.

Another important issue is the availability of ontologies. According to Swoogle⁷, a search engine for retrieval of Semantic Web documents, there are more than 10,000 ontologies available. Many of these may be suitable and could be adapted for search by using, e.g., our approach described in [9].

As the research reported here is still in progress [23], we have not been able to fully implement and formally evaluate the approach. Therefore, in future work we plan to inspect and tackle the following issues. We will investigate alternative methods for assigning relevant terms to the ontology concepts, i.e. using association rules, and evaluate their influence on the search results. We will need to investigate alternative user interfaces for this system. We will also look into alternative methods for post-processing the retrieved documents, utilizing the semantic relations in the ontology for better ranking and navigation.

Acknowledgements. This research work is funded by the Integrated Information Platform for reservoir and subsea production systems (IIP) project, which is supported by the Norwegian Research Council (NFR), NFR project number 163457/S30. In addition, we would like to thank Jon Atle Gulla for his support and help.

⁷ Swoogle, http://swoogle.umbc.edu

References

1. Gulla, J.A., Auran, P.G., Risvik, K.M.: Linguistic Techniques in Large-Scale Search Engines. Fast Search & Transfer (2002) 15 p.
2. Spink, A., Wolfram, D., Jansen, M.B.J., Saracevic, T.: Searching the Web: the public and their queries. J. Am. Soc. Inf. Sci. Technol. 52 (2001) 226-234
3. Ozcan, R., Aslandogan, Y.A.: Concept Based Information Access Using Ontologies and Latent Semantic Analysis. Technical Report CSE-2004-8. University of Texas at Arlington (2004) 16
4. Rajapakse, R.K., Denham, M.: Text retrieval with more realistic concept matching and reinforcement learning. Information Processing & Management 42 (2006) 1260-1275
5. Grootjen, F.A., van der Weide, T.P.: Conceptual query expansion. Data & Knowledge Engineering 56 (2006) 174-193
6. Qiu, Y., Frei, H.-P.: Concept based query expansion. Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval. ACM Press, Pittsburgh, Pennsylvania, USA (1993) 160-169
7. Chang, Y., Ounis, I., Kim, M.: Query reformulation using automatically generated query concepts from a document space. Information Processing and Management 42 (2006) 453-468
8. Gruber, T.R.: A translation approach to portable ontology specifications. Knowledge Acquisition 5 (1993) 199-220
9. Tomassen, S.L., Gulla, J.A., Strasunskas, D.: Document Space Adapted Ontology: Application in Query Enrichment. In: Kop, C., Fliedl, G., Mayer, H.C., Métais, E. (eds.): 11th International Conference on Applications of Natural Language to Information Systems (NLDB 2006), Vol. 3999. Springer-Verlag, Klagenfurt, Austria (2006) 46-57
10. Song, J-F., Zhang, W-M., Xiao, W., Li, G-H., Xu, Z-N.: Ontology-Based Information Retrieval Model for the Semantic Web. Proceedings of EEE 2005. IEEE Computer Society (2005) 152-155
11. Rocha, C., Schwabe, D., de Aragao, M.P.: A hybrid approach for searching in the semantic web. Proceedings of WWW 2004. ACM (2004) 374-383
12. Ciorscu, C., Ciorscu, I., Stoffel, K.: knOWLer - Ontological Support for Information Retrieval Systems. In: Proceedings of SIGIR 2003 Conference, Workshop on Semantic Web, Toronto, Canada (2003)
13. Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D.: Semantic Annotation, Indexing, and Retrieval. Journal of Web Semantics 2(1). Elsevier (2005)
14. Braga, R.M.M., Werner, C.M.L., Mattoso, M.: Using Ontologies for Domain Information Retrieval. Proceedings of the 11th International Workshop on Database and Expert Systems Applications. IEEE Computer Society (2000) 836-840
15. Borghoff, U.M., Pareschi, R.: Information Technology for Knowledge Management. Journal of Universal Computer Science 3 (1997) 835-842
16. Shah, U., Finin, T., Joshi, A., Cost, R.S., Mayfield, J.: Information Retrieval On The Semantic Web. Proceedings of Conference on Information and Knowledge Management. ACM Press, McLean, Virginia, USA (2002) 461-468
17. Vallet, D., Fernández, M., Castells, P.: An Ontology-Based Information Retrieval Model. In: Gómez-Pérez, A., Euzenat, J. (eds.): Proceedings of ESWC 2005, LNCS 3532. Springer-Verlag (2005) 455-470
18. Nagypal, G.: Improving Information Retrieval Effectiveness by Using Domain Knowledge Stored in Ontologies. OTM Workshops 2005, LNCS 3762. Springer-Verlag (2005) 780-789
19. Paralic, J., Kostial, I.: Ontology-based Information Retrieval. Information and Intelligent Systems, Croatia (2003) 23-28
20. Adi, T., Ewell, O.K., Adi, P.: High Selectivity and Accuracy with READWARE’s Automated System of Knowledge Organization. Management Information Technologies, Inc. (MITi) (1999)
21. Chenggang, W., Wenpin, J., Qijia, T. et al.: An information retrieval server based on ontology and multiagent. Journal of computer research & development 38(6) (2001) 641-647
22. Det Norske Veritas: Tyrihans Terminology for Subsea Equipment and Subsea Production Data. Det Norske Veritas (DNV) (2005) 60 p.
23. Tomassen, S.L.: Research on Ontology-Driven Information Retrieval. In: Meersman, R., Tari, Z., Herrero, P., et al. (eds.): OTM 2006 Workshops. Springer-Verlag, Montpellier, France (2006)