A data abstraction approach for query relaxation

Information and Software Technology 42 (2000) 407–418 www.elsevier.nl/locate/infsof

S.-Y. Huh*, K.-H. Moon, H. Lee
Graduate School of Management, Korea Advanced Institute of Science and Technology, 207-32 Cheongryangri-dong, Dongdaemun-gu, Seoul, South Korea
Received 26 February 1999; received in revised form 20 July 1999; accepted 8 November 1999

Abstract

Since a query language is used as a handy tool to obtain information from a database, users want more user-friendly and fault-tolerant query interfaces. When a query search condition does not match with the underlying database, users would rather receive approximate answers than null information by relaxing the condition. They also prefer a less rigid querying structure, one which allows for vagueness in composing queries, and want the system to understand the intent behind a query. This paper presents a data abstraction approach to facilitate the development of such a fault-tolerant and intelligent query processing system. It specifically proposes a knowledge abstraction database that adopts a multilevel knowledge representation scheme called the knowledge abstraction hierarchy. Furthermore, the knowledge abstraction database extracts semantic data relationships from the underlying database and supports query relaxation using query generalization and specialization steps. Thus, it can broaden the search scope of original queries to retrieve neighborhood information and help users to pose conceptually abstract queries. Specifically, four types of vague queries are discussed, including approximate selection, approximate join, conceptual selection and conceptual join. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Abstraction hierarchy; Approximate query; Cooperative query answering; Abstraction database; Decision support system

* Corresponding author. Fax: +82-2-958-3604. E-mail addresses: [email protected] (S.-Y. Huh), khmoon@green.kaist.ac.kr (K.-H. Moon).
0950-5849/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0950-5849(99)00100-7

1. Introduction

A fault-tolerant and intelligent interface for database users is needed to improve the effectiveness of information retrieval and decision-making. The traditional query language, as the standard database user interface, has often frustrated database users, including expert users, by enforcing rigidity and preciseness in query writing and by providing unsatisfactory query results. To compose correct queries, users are required to fully understand the database schema and the underlying data, which is not easy for ordinary users. Even when a query is correct, users receive only null information if no exact answer exists. In such a case, it may be better for the database system to produce approximate answers. Even when exact answers are found, neighboring information is still useful to users if the query is intended to explore some hypothetical information or an abstract general fact. Moreover, if the database system is intelligent enough to allow users to express vague queries, the fault-tolerance of information retrieval and the effectiveness of decision analysis can be greatly enhanced.

The purpose of this paper is to provide such an intelligent interface to a relational database system by facilitating cooperative query handling with query relaxation mechanisms and producing approximate answers [5–7,12,14,15,19]. At the core of cooperative query answering, knowledge representation plays a crucial role in capturing the semantic relationships between raw data values of the database and high-level abstract concepts, in accepting vague queries, and in carrying out approximate query processing. A variety of knowledge representation approaches have been studied, including approaches based on rules, fuzzy sets, semantic distance, and conceptual classification.

The rule-based approach [4,7,8] represents the semantic relationships and integrity constraints among data values using first-order logic predicates. Thus, the entire database consists of a set of base predicates, and a database query is also written as a predicate rule in which the information sought is specified with free variables. The query is answered through conflict resolution and inference mechanisms, and query relaxation is carried out by coordinating the integrity constraints. However, this approach has two limitations: it lacks a systematic organization for guiding the query relaxation process, and its query answering process is less intuitive.

The fuzzy database approach [1,16,20] supports various kinds of fuzziness derived from the data themselves and from linguistic queries. It assumes that data objects in each domain can be assigned a degree of similarity between 0 and 1. In addition, the knowledge base can store various kinds of imprecise information such as mixed hair color (e.g. 0.6 brown and 0.4 black) and a certain person's residence (e.g. Boston or New York). Users compose approximate queries by using fuzzy comparators such as much-greater-than. Relations are extended to accept fuzzy values or to allow values that are sets of domain objects.

The semantic distance approach [10,11,13] defines numerical distances to represent the degree of similarity between a pair of data objects. Multiple distance metrics are also available for each attribute domain. This approach is easy and efficient for developing query relaxation algorithms, since quantitative distances among data objects are easy to compute. However, it is limited by the difficulty of transforming quantitative and qualitative data into a uniform quantitative measure. In addition, there is a lack of objective criteria for assessing semantic similarities among various data objects.

The conceptual clustering approach [2,3,5,6,17,19] applies the data abstraction method to object instances and replaces the instance attribute values with the abstract concepts to which they belong. For example, in a PET attribute, dogs and cats are mammals, snakes and turtles are reptiles, and so on. The abstract representation forms a clustering hierarchy in which similar attribute values are clustered by abstract concepts and the individual abstract concepts are related to one another by an abstraction function or by abstraction similarities. This approach aims at accepting a user-defined casual query that is vague in terms of the schema of the database. It transforms the query into a schema-compatible form using the clustering hierarchy, either by mapping appropriate query conditions or by interactively relaxing the query conditions so that they become more general and cover a broader search scope. The conceptual clustering approach is especially advantageous when the characteristics of data objects are qualitative and categorical.

However, existing conceptual clustering approaches are limited in the diversity of vague queries they support and in the flexibility of clustering hierarchy maintenance. This is largely due to an ambiguous distinction between value abstraction and domain abstraction information. Most studies focus on constructing the clustering hierarchy itself from data object instances in the underlying database, and the semantics of the hierarchy need further elaboration.

This paper extends the conceptual clustering framework on the basis of a knowledge abstraction hierarchy (KAH) that captures not only value abstraction information but also domain abstraction information from the underlying database. Based on the KAH, a knowledge abstraction database is constructed on a relational data model, and various cooperative query-answering mechanisms are explored using the generalization and specialization processes in the KAH. In terms of approximate queries, four kinds of mechanisms are presented: approximate selection, conceptual selection, approximate join and conceptual join. A prototype system has been implemented at KAIST to demonstrate the usefulness of the KAH in ordinary database application systems.

In Section 2, the KAH is discussed in terms of two abstraction perspectives: value abstraction and domain abstraction. In Section 3, we construct a knowledge abstraction database that incorporates the KAH on a relational data model. Section 4 describes the cooperative query answering processes on the basis of the knowledge abstraction database. Finally, Section 5 provides the conclusion of the paper.

Fig. 1. Example of knowledge abstraction hierarchy instances.

2. A knowledge abstraction hierarchy

Data abstraction has been considered to be an effective association method for accommodating semantic relationships between data values [18]. In data abstraction, a specific data value is generalized into an abstract value; thus, the specific data value can be viewed as a specialized value of the abstract value. Using data abstraction, we construct a KAH, a knowledge representation framework that facilitates multilevel representation of data and knowledge in databases. Specifically, the KAH is constructed using such generalization and specialization relationships between abstract values and specific values.

Table 1
Formal representation of value and domain abstraction relationships

Relationship: Formal notation
(1) 1-level domain abstraction: $D_i \Rightarrow D_{i+1}$, where $D_i$ is a sub-domain and $D_{i+1}$ is its super-domain
(2) 1-level value abstraction: $v_i^{j_i} \in_p v_{i+1}^{j_{i+1}}$, where $v_i^{j_i}$ is a specific value and $v_{i+1}^{j_{i+1}}$ is its abstract value
(3) n-level value abstraction: $v_i^{j_i} \in_p^n v_{i+n}^{j_{i+n}}$ iff $\exists v_{i+1}^{j_{i+1}}, \ldots, \exists v_{i+n-1}^{j_{i+n-1}}$ s.t. $v_i^{j_i} \in_p v_{i+1}^{j_{i+1}} \in_p v_{i+2}^{j_{i+2}} \in_p \ldots v_{i+n-1}^{j_{i+n-1}} \in_p v_{i+n}^{j_{i+n}}$
(4) n-level domain abstraction: $D_i \Rightarrow^n D_{i+n}$ s.t. $v_i^{j_i} \in_p^n v_{i+n}^{j_{i+n}}$, $v_i^{j_i} \in D_i$, $v_{i+1}^{j_{i+1}} \in D_{i+1}$, $\ldots$, $v_{i+n}^{j_{i+n}} \in D_{i+n}$ for all $j_i$

Fig. 1 illustrates the KAH by showing two instances, one for employee majors and one for career development education courses, derived from the same underlying database. Two abstraction hierarchies exist in the KAH: a value abstraction hierarchy and a domain abstraction hierarchy.

First, in the value abstraction hierarchy, a specific value at a lower abstraction level is generalized into an abstract value at a higher abstraction level, and the abstract value can be generalized further into a more abstract value. Conversely, the specific value is considered a specialized value of the abstract value. As such, there exist value generalization/specialization (IS-A) relationships among the abstract and specific values. For instance, Finance is a Management while Management is a Business. Higher levels thus provide a more abstract data representation than lower levels. Accordingly, the value at the top level is the most abstract in the hierarchy, while values at the bottom level are the most specialized. The cardinality relationship between an abstract value and its specific values is assumed to be one-to-many. Thus, the value generalization of Finance returns Management, while the value specialization of Management returns three values: Finance, Accounting, and Marketing. However, a specific value can also have multiple abstract values that are located at different abstraction levels along the path from the specific value to its most abstract value at the highest abstraction level.
In this sense, an abstract value is called the n-level abstract value of the specific value, where n is the difference between their abstraction levels.

Second, the domain abstraction hierarchy consists of domains that encompass all individual values in the value abstraction hierarchy, and there exist INSTANCE-OF relationships between the domains and values. Much as generalization/specialization relationships exist between the data values at two different abstraction levels of the value abstraction hierarchy, a super-domain/sub-domain relationship exists between two different domains, which is obtained by domain abstraction. For instance, MAJOR_AREA is the super-domain of MAJOR_NAME. All the abstract values of the instance values in a sub-domain correspond to the instance values of the super-domain. Thus, the super-domain MAJOR_AREA is more generalized than the sub-domain MAJOR_NAME, since MAJOR_AREA contains more generalized values than MAJOR_NAME. The cardinality relationship between two adjacent domains is assumed to be one-to-one, and a super-domain is called the n-level super-domain of the sub-domain, where n is the difference between their abstraction levels.

The abstraction relationships between values and domains can be formally represented as shown in Table 1. In the table, $D_i$ denotes a domain at abstraction level $i$ and $v_i^{j_i}$ is a specific value of the domain $D_i$. Relationships (1) and (2) represent the 1-level domain and value abstraction relationships, respectively. On the basis of these 1-level abstraction relationships, (3) and (4) represent the general n-level abstraction relationships.

As shown in Fig. 1, since the KAH is specific to individual applications, various instances of the KAH can be derived from a single underlying database. When necessary, additional hierarchies can be created from the same database. Since value abstraction knowledge is sensitive to the context of the application, the designer of the KAH may first study the database query purposes and usage patterns of the company. The designer then either interprets the semantic abstraction knowledge from the database directly, or performs a bottom-up analysis that clusters attribute values into a hierarchy on the basis of the distribution and frequency of data values. After such preliminary studies, the designer can structure a binary relationship between abstract values and specific values using the generalization and specialization relationships, and construct the entire hierarchy incrementally.

In the presence of multiple KAH instances, a domain is unique and can be located in only one hierarchy, while the same value can be located in multiple hierarchies and in multiple domains. For instance, in Fig. 1, the value Economics is found in both the College Major and the Career Development Education hierarchies but belongs to a different domain in each. In the College Major hierarchy, it belongs to the MAJOR_AREA domain, while in the Career Development Education hierarchy it belongs to the COURSE_AREA domain. Since the Economics value exists in multiple places, its abstract values and specific values are not uniquely determined by the value name alone. Domain information is additionally needed to identify the abstract and specific values of a given value. For instance, the 1-level abstract value of Economics in the MAJOR_AREA domain is Business, while that of Economics in the COURSE_AREA domain is Practical Course. Likewise, different sets of 1-level specific values are identified for a given value depending on the domain. We call this property the domain-dependency of abstract values and specific values.

Table 2
Basic functional dependencies

Dependency: Functional dependency; Relation representing the functional dependency
(1) Domain/Super-domain: $D_i \to D_{i+1}$ for $D_i, D_{i+1}$ where $D_i \Rightarrow D_{i+1}$; DOMAIN_ABSTRACTION
(2) Domain/Sub-domain: $D_i \to D_{i-1}$ for $D_{i-1}, D_i$ where $D_{i-1} \Rightarrow D_i$; DOMAIN_ABSTRACTION
(3) Value/Abstract value: $(v_i^{j_i}, D_i) \to v_{i+1}^{j_{i+1}}$ for $v_i^{j_i}, v_{i+1}^{j_{i+1}}$ where $v_i^{j_i} \in_p v_{i+1}^{j_{i+1}}$; VALUE_ABSTRACTION
(4) Value/Specific value: $(v_i^{j_i}, D_i) \to\to v_{i-1}^{j_{i-1}}$ for $v_i^{j_i}, v_{i-1}^{j_{i-1}}$ where $v_{i-1}^{j_{i-1}} \in_p v_i^{j_i}$; VALUE_ABSTRACTION
(5) Relation/Attribute: (Relation, Attribute) $\to$ Domain; ATTRIBUTE_MAPPING
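The domain-dependency property can be illustrated with a small sketch. It is only an assumed in-memory encoding, not the representation used by the authors: the dictionary `ABSTRACT_OF` and the helpers `abstract_value` and `domains_of` are hypothetical names, and the entries are limited to value relationships stated in the text for Fig. 1.

```python
# Hypothetical in-memory encoding of a few 1-level value abstractions.
# The key is a (value, domain) pair because a value name alone is ambiguous.
ABSTRACT_OF = {
    ("Finance", "MAJOR_NAME"): "Management",
    ("Accounting", "MAJOR_NAME"): "Management",
    ("Marketing", "MAJOR_NAME"): "Management",
    ("Management", "MAJOR_AREA"): "Business",
    ("Economics", "MAJOR_AREA"): "Business",
    ("Economics", "COURSE_AREA"): "Practical Course",
}


def abstract_value(value, domain):
    """Return the 1-level abstract value of `value` within `domain`."""
    return ABSTRACT_OF[(value, domain)]


def domains_of(value):
    """List every domain in which the value name occurs."""
    return sorted(d for (v, d) in ABSTRACT_OF if v == value)


# The same name resolves differently depending on its domain.
print(domains_of("Economics"))
print(abstract_value("Economics", "MAJOR_AREA"))
print(abstract_value("Economics", "COURSE_AREA"))
```

Keying on the pair rather than on the name mirrors the point made above: without the domain, a lookup for Economics cannot decide between Business and Practical Course.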

3. A knowledge abstraction database

The semantics of the KAH are incorporated into the knowledge abstraction database, a cooperative knowledge database that facilitates cooperative query answering processes. Specifically, the schema of the knowledge abstraction database is designed on a relational data model so that both the underlying database and its knowledge abstraction database can be handled in a single formalism by a relational query language.

3.1. Relational schema of the knowledge abstraction database

To accommodate the data semantics of the KAH, the knowledge abstraction database captures three kinds of information: value abstraction, domain abstraction, and mapping information between the abstraction hierarchies and the underlying databases. This information can be structured on the basis of the functional dependencies in Table 2 among the KAH constructs, i.e. values, abstract values, domains, and super-domains. These dependencies are derived from the cardinality relationships among the KAH constructs and from the domain-dependency in the presence of multiple KAH instances.

Fig. 2. The constructor relations of the knowledge abstraction database.


In the table, functional dependencies are expressed by the notation → and →→, where “A → B” implies that A determines only one B, while “A →→ B” implies that A determines multiple Bs. Dependencies (1) and (2) mean that, if a domain name is known, its super-domain and sub-domain can be identified. On the other hand, dependencies (3) and (4) mean that, to identify the abstract value or the specific values of a certain value, we must first know its domain, which follows from the domain-dependency. Finally, dependency (5) helps us to identify, during query analysis, the domain of an attribute in a relation specified in a query.

In carrying out the generalization and specialization processes for cooperative query answering, the knowledge abstraction database uses three core relations: DOMAIN_ABSTRACTION, VALUE_ABSTRACTION and ATTRIBUTE_MAPPING.

DOMAIN_ABSTRACTION{(Domain, Super_Domain, Hierarchy, Abstraction_Level)}
VALUE_ABSTRACTION{(Value, Domain, Abstract_Value)}
ATTRIBUTE_MAPPING{(Relation, Attribute, Domain)}

Fig. 2 provides a sample knowledge abstraction database that accommodates the two KAH instances of Fig. 1. In the ATTRIBUTE_MAPPING relation, the relations and their associated attributes of the underlying database are artificially enumerated for explanatory purposes.

The DOMAIN_ABSTRACTION relation captures the semantics of the domain abstraction hierarchy. A domain instance has a unique name, its super-domain’s name, the hierarchy name, and the abstraction level representing its position in the hierarchy. Since a domain is unique in the KAH instances and has a one-to-one correspondence with its super-domain, the ‘Domain’ attribute becomes the key attribute.

Next, the VALUE_ABSTRACTION relation captures the semantics of the value abstraction hierarchy. Since the same value name can be used differently in multiple hierarchies, the value entity becomes a weak entity that is not uniquely determined by its name. Instead, it needs the domain to become unique; in other words, both the value name and its domain name are needed to form the composite key, which follows from the domain-dependency. In terms of abstraction relationships, each tuple of the relation represents only a 1-level abstraction relationship. On the basis of such 1-level abstraction relationships, an abstract value at any arbitrary level can be retrieved transitively.

Finally, the ATTRIBUTE_MAPPING relation maintains the domain information of the underlying database and helps to analyze the query intent in query processing. Specifically, the ATTRIBUTE_MAPPING relation integrates the two abstraction relations with the relations and attributes of the underlying databases, and thereby constructs a knowledge abstraction database on top of the existing databases. Such a context-based knowledge abstraction database can be used to support cooperative query answering as well as to accommodate changes in the underlying databases dynamically. Since one attribute name in the underlying database can be used in multiple relations, ‘Relation’ and ‘Attribute’ become the composite key of the ATTRIBUTE_MAPPING relation. The procedure for utilizing the ATTRIBUTE_MAPPING relation in cooperative query answering is discussed in detail in Section 4.

3.2. Value generalization process

An abstract value in a KAH corresponds to multiple specific values at lower levels in the hierarchy. Thus, in cooperative query answering, querying an abstract value is equivalent to querying multiple specific values, as discussed further in the following. The generalization process provides a higher-level abstract value by moving up the KAH from a given value of a query condition. Consider the value Finance in the College Major KAH in Fig. 1. A 1-level generalization returns its abstract value, Management, which covers Accounting and Marketing in addition to Finance. A 2-level generalization returns a more abstract value, Business, which corresponds to wider neighborhood values ranging from Management to Economics.

The n-level generalization process uses both the VALUE_ABSTRACTION and DOMAIN_ABSTRACTION relations. In principle, an n-level abstract value can be obtained by repeating the 1-level generalization process n times. Suppose that we want to obtain the 2-level abstract value of Cost Accounting in the Career Development Education KAH. With its domain, COURSE_NAME, the first 1-level generalization process retrieves the abstract value, Accounting, from the VALUE_ABSTRACTION relation. However, to perform the second 1-level generalization process for Accounting, the domain of Accounting is needed, which is the super-domain of COURSE_NAME. Such super-domain information is looked up in the DOMAIN_ABSTRACTION relation, and COURSE_AREA is found as the super-domain.
The 3-level abstract value can be found similarly through a third 1-level generalization process, using the super-domain information of COURSE_AREA provided in the DOMAIN_ABSTRACTION relation. Repeated recursively, the same procedure yields the n-level abstract value.

Fig. 3. The core steps of query relaxation.

3.3. Value specialization process

The specialization process searches for the n-level specific values of a given abstract value by moving down the KAH. Consider the Business value in the College Major KAH of Fig. 1. The 1-level specialization process returns Management and Economics as its 1-level specific values. The 2-level specialization process returns six values ranging from Finance to Econometrics, which are the specific values of Management and Economics.

Like the generalization process, the specialization process uses both the VALUE_ABSTRACTION and DOMAIN_ABSTRACTION relations to obtain n-level specific values. In general, the 1-level specialization for a given abstract value starts with the identification of the domain of the abstract value. From the domain of the abstract value, the DOMAIN_ABSTRACTION relation identifies the domain of its specific values. Once the domain of the specific values is identified, we can obtain the specific values of the abstract value from the VALUE_ABSTRACTION relation.

As an example, consider the Economics value in the College Major KAH of Fig. 1 and the knowledge abstraction database in Fig. 2. To obtain the 1-level specific values of Economics, we first need to know the domain of those specific values, due to the domain-dependency of abstract and specific values. Since Economics has MAJOR_AREA as its domain, we get MAJOR_NAME as the sub-domain of MAJOR_AREA from the DOMAIN_ABSTRACTION relation. Next, by selecting the VALUE_ABSTRACTION tuples that have MAJOR_NAME and Economics for the Domain and Abstract_Value attributes, we obtain the specific values of Economics: Macroeconomics, Microeconomics, and Econometrics. By carrying out such a 1-level specialization process n times, we can obtain the n-level specific values of the given abstract value.

In the following, we present typical cooperative query answering mechanisms that use the generalization and specialization processes.
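The generalization and specialization processes of Sections 3.2 and 3.3 can be sketched against the two abstraction relations. The following is a minimal illustration using Python's built-in sqlite3 module, not the authors' prototype. Several details are assumptions: the top-level domain name MAJOR_GROUP is invented (the text does not name the super-domain of MAJOR_AREA), the Abstraction_Level numbering is assumed to start at 1 for the most specific domain, and the function names `generalize` and `specialize` are hypothetical. Only the College Major values named in the text are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DOMAIN_ABSTRACTION (
    Domain TEXT PRIMARY KEY, Super_Domain TEXT,
    Hierarchy TEXT, Abstraction_Level INTEGER);
CREATE TABLE VALUE_ABSTRACTION (
    Value TEXT, Domain TEXT, Abstract_Value TEXT,
    PRIMARY KEY (Value, Domain));
""")
conn.executemany(
    "INSERT INTO DOMAIN_ABSTRACTION VALUES (?, ?, ?, ?)",
    [
        ("MAJOR_NAME", "MAJOR_AREA", "College Major", 1),
        # MAJOR_GROUP is an assumed name for the top-level domain.
        ("MAJOR_AREA", "MAJOR_GROUP", "College Major", 2),
        ("MAJOR_GROUP", None, "College Major", 3),
    ],
)
conn.executemany(
    "INSERT INTO VALUE_ABSTRACTION VALUES (?, ?, ?)",
    [
        ("Finance", "MAJOR_NAME", "Management"),
        ("Accounting", "MAJOR_NAME", "Management"),
        ("Marketing", "MAJOR_NAME", "Management"),
        ("Macroeconomics", "MAJOR_NAME", "Economics"),
        ("Microeconomics", "MAJOR_NAME", "Economics"),
        ("Econometrics", "MAJOR_NAME", "Economics"),
        ("Management", "MAJOR_AREA", "Business"),
        ("Economics", "MAJOR_AREA", "Business"),
    ],
)


def generalize(value, domain, n=1):
    """Repeat the 1-level generalization n times; return (value, domain)."""
    for _ in range(n):
        row = conn.execute(
            "SELECT Abstract_Value FROM VALUE_ABSTRACTION "
            "WHERE Value = ? AND Domain = ?", (value, domain)).fetchone()
        sup = conn.execute(
            "SELECT Super_Domain FROM DOMAIN_ABSTRACTION WHERE Domain = ?",
            (domain,)).fetchone()
        if row is None or sup is None or sup[0] is None:
            raise LookupError(f"no abstract value for {value!r} in {domain!r}")
        value, domain = row[0], sup[0]
    return value, domain


def specialize(value, domain, n=1):
    """Repeat the 1-level specialization n times; return (values, domain)."""
    values = {value}
    for _ in range(n):
        # Sub-domains are assumed one-to-one with super-domains, as in the text.
        sub = conn.execute(
            "SELECT Domain FROM DOMAIN_ABSTRACTION WHERE Super_Domain = ?",
            (domain,)).fetchone()
        if sub is None:
            raise LookupError(f"{domain!r} has no sub-domain")
        domain = sub[0]
        lower = set()
        for v in values:
            rows = conn.execute(
                "SELECT Value FROM VALUE_ABSTRACTION "
                "WHERE Domain = ? AND Abstract_Value = ?", (domain, v))
            lower.update(r[0] for r in rows)
        values = lower
    return values, domain


print(generalize("Finance", "MAJOR_NAME", 2))
print(specialize("Economics", "MAJOR_AREA"))
```

Both functions carry the current domain along with the value, because each 1-level step needs the domain to select the right VALUE_ABSTRACTION tuples.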

4. Cooperative query answering using query relaxation

Cooperative query answering can return neighborhood or generalized information by relaxing the search conditions, that is, by broadening the answer scope to include additional information. It can be invoked in various ways depending on the query requirements and the extent of the knowledge abstraction. We present four typical query answering mechanisms on the basis of the generalization and specialization processes in the KAH: approximate selection, approximate join, conceptual selection and conceptual join.

Approximate queries, including approximate selection and approximate join, can provide approximate neighborhood information in addition to the exact answers that can also be obtained by conventional query processing. The system first searches for the exact answers, which may be either unavailable or insufficient. In either case, if the user is not satisfied with the result, the user can request query relaxation by increasing the abstraction level. Higher-level query relaxation may also be carried out repeatedly under the user’s control until the user is satisfied.

In terms of the extent of query relaxation, the knowledge abstraction database can guide a more interactive query relaxation process, either incrementally or directly to a preferred abstraction level, which makes the relaxation more efficient and effective. In an incremental procedure, the user increases the abstraction level one by one until satisfied with the query result. Between query processing steps, the user examines the intermediate result and decides whether to proceed with further relaxation. In a direct procedure, on the other hand, the user specifies in advance the abstraction level to which the query condition is relaxed, and receives the query result only after relaxation to the specified extent. If the user is not satisfied with the result, either incremental or direct relaxation can follow, which means that the two procedures can be used together.

In conceptual queries, including conceptual selection and conceptual join, the attributes and values specified in the query condition have different abstraction levels. Such queries are posed at a higher conceptual level based on the user’s context, without detailed knowledge of the database schema. Without proper query relaxation, processing a conceptual query would compare data values from different domains when checking the query condition. Thus, conceptual queries are rejected in conventional query processing, which provides only exact answers.

As shown in Fig. 3, cooperative query answering consists of three core steps: ordinary query processing, query generalization, and query specialization. In the first step, every kind of query goes through conventional ordinary query processing to search for the exact answers satisfying the original query condition. If the user is unsatisfied with the result, the query relaxation process is invoked through query generalization and specialization. The query generalization step moves upward along the KAH and generalizes (or relaxes) the value or attributes specified in the query condition. The query specialization step moves downward along the KAH and specializes (or shrinks) the abstracted value in the abstract query to transform it into a range of specific values in an ordinary query.

The detailed processes for the four types of queries are explained individually in the following sections. In brief, query answering for the join operation needs only the generalization step. The approximate selection query needs both the generalization and specialization steps, while the conceptual selection query needs only the specialization step. These differences are due to the features of the selection and join operations and to the abstraction levels of the values and attributes in the query condition. As a whole, each mechanism serves a different aspect of cooperative query answering. Depending on the user’s needs and knowledge about the KAH, the four types of query answering conditions can be selectively combined in a single query to enable more sophisticated queries.

For the demonstration and explanation of the query answering processes, we use a simplified personnel database that is defined as follows:

EMPLOYEE{(id, emp_name, dept, title)}
TASK_HISTORY{(id, beginning_date, ending_date, task_performed)}
COLLEGE_MAJOR{(id, entrance_data, major, graduation_date)}
CAREER_PATH{(task, prerequisite_task)}
TASK_MAJOR{(task, required_major_area)}

The EMPLOYEE relation provides the current job position information of an employee, while the TASK_HISTORY relation provides the history of tasks that the employee has performed. The COLLEGE_MAJOR relation contains the college education records, and CAREER_PATH shows the career development path defining prerequisite relationships among job tasks. Finally, the TASK_MAJOR relation prescribes the relationships between individual tasks and the college major area requirements for the tasks.

Fig. 4. The core steps of approximate selection.
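The three core steps and the incremental relaxation procedure can be sketched for a selection on the major attribute of the personnel database. This is only an illustration under stated assumptions, not the authors' prototype: the employee rows are invented sample data, the dictionary `PARENT` is a simplified stand-in for the knowledge abstraction database that covers only the College Major hierarchy (so value names are unambiguous here), and the `wanted` answer count stands in for the user's judgment of whether the result is satisfactory. The helper names `relax`, `select_by_major` and `approximate_selection` are hypothetical.

```python
import sqlite3

# 1-level value abstractions of the College Major hierarchy named in the text.
PARENT = {
    "Finance": "Management", "Accounting": "Management",
    "Marketing": "Management", "Macroeconomics": "Economics",
    "Microeconomics": "Economics", "Econometrics": "Economics",
    "Management": "Business", "Economics": "Business",
}

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE EMPLOYEE (id INTEGER PRIMARY KEY, emp_name TEXT,
                       dept TEXT, title TEXT);
CREATE TABLE COLLEGE_MAJOR (id INTEGER, entrance_data TEXT,
                            major TEXT, graduation_date TEXT);
""")
# Invented sample rows, for illustration only.
db.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)", [
    (1, "Kim", "Treasury", "Analyst"),
    (2, "Lee", "Audit", "Manager"),
    (3, "Park", "Planning", "Analyst"),
])
db.executemany("INSERT INTO COLLEGE_MAJOR (id, major) VALUES (?, ?)", [
    (1, "Accounting"), (2, "Marketing"), (3, "Econometrics"),
])


def relax(value, level):
    """Generalize `value` by `level` steps, then specialize back down."""
    top = value
    for _ in range(level):          # query generalization
        top = PARENT[top]
    values = {top}
    for _ in range(level):          # query specialization
        values = {c for c, p in PARENT.items() if p in values}
    return values


def select_by_major(majors):
    """Ordinary query processing: exact match against a set of majors."""
    marks = ", ".join("?" for _ in majors)
    sql = ("SELECT e.emp_name FROM EMPLOYEE e "
           "JOIN COLLEGE_MAJOR c ON e.id = c.id "
           f"WHERE c.major IN ({marks}) ORDER BY e.emp_name")
    return [row[0] for row in db.execute(sql, sorted(majors))]


def approximate_selection(value, wanted, max_level=2):
    """Incremental relaxation: raise the level until enough answers exist."""
    for level in range(max_level + 1):
        answers = select_by_major(relax(value, level))
        if len(answers) >= wanted:
            break
    return level, answers


print(approximate_selection("Finance", wanted=1))
```

At level 0 the function performs only ordinary query processing; each further level widens the IN list to the specific values that share an abstract value with the original condition. A direct procedure would call `select_by_major(relax(value, level))` once with a level chosen in advance.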

4.1. Approximate selection In the approximate selection, an attribute in the selection condition can be abstracted into a range of neighboring answers approximate to a specified value in a query. In the personnel database example, if the right candidate employees majoring in Finance are unavailable or there are not enough, other employees with related majors need to be obtained by enlarging the scope of the search. In the KAH, searching approximate values for a specific value is equivalent to finding the abstract value of the specific value, since specific values of the same abstract value constitute approximate values of one another. Thus, as explained earlier, the core relaxation steps include both the query generalization and specialization after ordinary query processing. With the steps in view, the approximate selection query for searching for employees with Finance or related majors is written in step 1 (Fig. 4). In the query, the approximate selection is specified by the relaxation operator ˆ ?. Note that if the relaxation operator is to be meaningful, both the attribute and specified value should be in the same domain. In this sense, the approximate selection in the above query is valid since both domains are identically MAJOR_NAME; the domain of the major attribute in the college_major relation is found to be MAJOR_ NAME from the ATTRIBUTE_MAPPING relation, and the domain of Finance is also found from the VALUE_ ABSTRACTION relation. Moreover, in the cooperative query answering process based on the knowledge abstraction database, the domain

414

S.-Y. Huh et al. / Information and Software Technology 42 (2000) 407–418

Fig. 5. The core steps of approximate join.

information of the attributes and values in a query plays a central role for the generalization and specialization processes by obtaining an abstract value and a set of specific values for a given value. Such domain information enables the functions in Table 3 to be used as operational functions for performing the generalization and specialization processes on the target attributes and values. Since an abstract value at a higher level corresponds to multiple specific values at lower levels, finding a one-level abstract value can produce neighborhood values for the specified value in the selection condition. Thus, the generalized query at the one level is produced by finding 1-level abstract value of Finance as shown in the first query of step 2. An additional query language construct, is-a, indicates the generalization relationship between a specific value and an abstract value. In a generalized query, since the domain of Finance is identified as MAJOR_NAME through query analysis, Get_Abstract_Value(“Finance”) in the generalized query returns the abstract value, Management, and thus the query condition is relaxed as the second query of step 2. The is-a operator transforms the abstract value, Management, in first query in step 3, to evaluate if the c.major has membership in the set of specific values of Management. The 1-level specialization of the abstract value, Management, returns a set of specific values {Finance, Accounting, Marketing} as neighborhood values around Finance. Thus, the specialized query is finally written as the second query in step 3 and can be answered as an ordinary SQL query. Users may want to distinguish exact answers from approximate answers. The proposed KAH and knowledge abstraction database do not include the necessary semantics to distinguish the two kinds of answers. However, by help of

an application system, they can be distinguished by checking whether each returned answer exactly satisfies the query condition. Our prototype uses a different method that is easy to implement and does not burden the system. In its first response, it returns only the exact answers; it then asks the user for the range of query relaxation and searches for approximate answers in the next response.

4.2. Approximate join

As an extension of the approximate selection, an attribute in the join condition can be abstracted into an approximate range of nearby values. Viewed in terms of the KAH, the query is equivalent to joining the two attributes on the basis of their abstract values, which yields broader join results than a join based on the ordinary specific values. Thus, query answering does not need the query specialization step after the generalization step; it performs an ordinary join operation on the abstract values of both join attributes. The approximate join query that searches for people who have experienced a task close to the prerequisite tasks of the Asset Management task is written in step 1 (Fig. 5). In the query, the relaxation operator =? is used between t.task and c.prerequisite_task. The use of the relaxation operator is similar to that in the approximate selection query in that both compared attributes must be in the same domain. In the above example, both domains are TASK_NAME. Subsequently, the query is relaxed by generalizing the two attributes to their abstract values in step 2. Joining the two relations on the basis of common abstract


Fig. 6. The ABSTRACTION relation.

values can be performed in several ways with the knowledge abstraction database. One intuitive approach is to introduce the ABSTRACTION relation, which selects only the Value and Abstract_Value attributes from the VALUE_ABSTRACTION relation. In the above example, records having the TASK_NAME domain are extracted into the ABSTRACTION relation in Fig. 6. Since the ABSTRACTION relation provides the abstract value of each specific value, it can be used as an intermediary between the two joined attributes, yielding the ordinary query shown in step 3 (Table 3).

Table 3
Basic operations for query relaxation

Function                        Description
Get_Abstract_Value(value)       Provides an abstract value for the given value. Executes a 1-level
                                generalization process and returns a 1-level abstract value.
Get_Abstract_Value(attribute)   Provides an abstract value for the given attribute. Facilitates
                                relational comparison or joining with the 1-level abstract values.
Get_Specific_Value(value)       Provides a set of specific values for the given value. Executes a
                                1-level specialization process and returns 1-level specific values.

4.3. Conceptual selection

If users are not acquainted with attribute values, which is often the case in database systems emphasizing classification and categorization of data, they cannot be expected to formulate an accurate query. In the personnel database example, suppose that a personnel manager wants to find people who majored in a business area in general but, being unfamiliar with college majors in that area, does not know values such as Accounting, Finance, and Marketing. In the cooperative query answering mechanism, the user may write a more conceptual query, i.e. “Find the name and department of an employee who has majored in any field that belongs to the general Business area.” When an abstract value is encountered in the selection condition, the query is interpreted as a conceptual query and the abstract value is specialized to produce an ordinary query. Thus, the core relaxation steps, which are different from those of

approximate selection, omit the query generalization step and proceed directly to the specialization step. An example of a conceptual query that searches for employees with a Business background is written in step 1 in Fig. 7. In ordinary query answering, this query is rejected since there is no Business major in the database. However, the cooperative query answering mechanism interprets the mismatching condition as a conceptual condition, since the value Business is found as an abstract value in the hierarchy of Fig. 1. As such, a conceptual condition is a selection condition in which the two compared parties, the attribute and the specified value, are located in different domains of the same hierarchy. In the above query, the domain of the c.major attribute is found to be MAJOR_NAME from the ATTRIBUTE_MAPPING relation, while the domain of Business is found to be MAJOR_GROUP from the VALUE_ABSTRACTION relation. For the conceptual condition to be valid, the domain of the Business value must be a super-domain of MAJOR_NAME (i.e. the domain of the c.major attribute). Once the query condition is shown to be a conceptual one, it is automatically specialized as the first query in the specialization step. The 1-level specialization of the abstract value, Business, returns the set of specific values {Management, Economics}. Since the domain of Management and Economics is still a super-domain of the domain of the c.major attribute, further specialization is performed along the hierarchy until the two compared parties are in the same domain. The third query in the specialization step is the result of this second specialization, in which all values in the selection condition are in the same domain as the attribute, thereby making the query an ordinary query.

4.4. Conceptual join

As an extension of the conceptual selection, the two attributes in the join condition can have different domains and thus be at different abstraction levels. In the personnel database example, the TASK_MAJOR relation prescribes the required major area for each task. Note that the domain of the required major area is MAJOR_AREA, which is more general than that of the major attribute (i.e. MAJOR_NAME) in the COLLEGE_MAJOR relation. In this setting, a user may want to find people whose college major belongs to the major area required for performing a certain task, e.g. Health Insurance tasks. When the domains of the two join attributes are different, the query is interpreted as a conceptual join, and the two domains are made the same through generalization, yielding an ordinary join query. Thus, the core relaxation steps are the same as those of the approximate join, except that the generalization step generalizes only the attribute with the lower domain to make both domains the same. A conceptual join query, which retrieves people whose college major belongs to the major area required for performing Health Insurance tasks, can be written as the query in step 1 in Fig. 8.


Fig. 7. The core steps of conceptual selection.
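The repeated specialization used in conceptual selection (Section 4.3) can be sketched as follows. This is a minimal illustration in Python, not the prototype's implementation: the dictionary standing in for the VALUE_ABSTRACTION relation, the function names, and the placement of Management and Economics in the MAJOR_AREA domain are assumptions made for the example. Only the value names and the Business query come from the text.

```python
# Hypothetical in-memory stand-in for the VALUE_ABSTRACTION relation:
# value -> (domain, abstract value). Economics has no listed majors in the
# text, so it has no specific values in this sketch.
VALUE_ABSTRACTION = {
    "Finance": ("MAJOR_NAME", "Management"),
    "Accounting": ("MAJOR_NAME", "Management"),
    "Marketing": ("MAJOR_NAME", "Management"),
    "Management": ("MAJOR_AREA", "Business"),   # domain name assumed
    "Economics": ("MAJOR_AREA", "Business"),    # domain name assumed
    "Business": ("MAJOR_GROUP", None),
}


def get_specific_values(value):
    """1-level specialization: all values whose abstract value is `value`."""
    return sorted(v for v, (_, parent) in VALUE_ABSTRACTION.items()
                  if parent == value)


def specialize_to_domain(value, target_domain):
    """Specialize level by level until every value lies in target_domain."""
    frontier = [value]
    while any(VALUE_ABSTRACTION[v][0] != target_domain for v in frontier):
        next_frontier = []
        for v in frontier:
            if VALUE_ABSTRACTION[v][0] == target_domain:
                next_frontier.append(v)
            else:
                # Values without specific values simply drop out here.
                next_frontier.extend(get_specific_values(v))
        frontier = next_frontier
    return frontier


def in_condition(attribute, values):
    """Render the fully specialized condition as an ordinary SQL predicate."""
    quoted = ", ".join("'" + v + "'" for v in values)
    return f"{attribute} IN ({quoted})"


majors = specialize_to_domain("Business", "MAJOR_NAME")
print(in_condition("c.major", majors))
```

The first pass turns Business into {Economics, Management}; the second pass reaches the MAJOR_NAME domain of c.major, at which point the condition is an ordinary one.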

The domains of the two join attributes are different, but since one domain, MAJOR_AREA, is the super-domain of the other, MAJOR_NAME, the query is valid as a conceptual join query. Subsequently, in the generalization step, generalization is performed on the lower-domain attribute, c.major. As in the approximate join mechanism, the ABSTRACTION relation is used to mediate between the two join attributes as the query specialization step.

4.5. A prototype system

A prototype query answering system has been developed at KAIST and tested with a personnel database

system, to demonstrate the usefulness and practicality of the knowledge abstraction database in ordinary database application systems. As shown in Fig. 9, the prototype system consists of three modules: a user client part, an interactive extended SQL part, and a database server. The client part supports only user interface functions, while the interactive extended SQL part interprets the vague queries and transforms them into conventional queries. The database server accommodates both the knowledge abstraction database and a corporate database, such as a personnel database, and executes the conventional queries submitted from the interactive SQL part.
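The interpretation step performed by the interactive extended SQL part can be illustrated by the domain checks described in Sections 4.1 and 4.3. The sketch below is hypothetical: the dictionaries stand in for the ATTRIBUTE_MAPPING and VALUE_ABSTRACTION lookups, the super-domain table and all function names are invented for the example, and treating a super-domain value as conceptual regardless of the operator used is an assumption.

```python
# Hypothetical stand-ins for lookups against the knowledge abstraction database.
ATTRIBUTE_DOMAIN = {"college_major.major": "MAJOR_NAME"}
VALUE_DOMAIN = {
    "Finance": "MAJOR_NAME",
    "Management": "MAJOR_AREA",
    "Business": "MAJOR_GROUP",
}
# domain -> its direct super-domain (assumed layout of the domain hierarchy)
SUPER_DOMAIN = {"MAJOR_NAME": "MAJOR_AREA", "MAJOR_AREA": "MAJOR_GROUP"}


def is_super_domain(candidate, domain):
    """True if `candidate` lies above `domain` in the domain hierarchy."""
    current = SUPER_DOMAIN.get(domain)
    while current is not None:
        if current == candidate:
            return True
        current = SUPER_DOMAIN.get(current)
    return False


def classify_selection(attribute, value, relaxed):
    """Classify one selection condition.

    `relaxed` is True when the condition uses the relaxation operator =?.
    """
    attr_domain = ATTRIBUTE_DOMAIN[attribute]
    value_domain = VALUE_DOMAIN.get(value)
    if value_domain is None:
        return "invalid"          # value unknown to the hierarchy
    if value_domain == attr_domain:
        return "approximate" if relaxed else "ordinary"
    if is_super_domain(value_domain, attr_domain):
        return "conceptual"       # abstract value: specialize it
    return "invalid"


print(classify_selection("college_major.major", "Finance", relaxed=True))
print(classify_selection("college_major.major", "Business", relaxed=False))
```

A condition classified as approximate would then go through generalization and specialization, while a conceptual one would go through specialization only.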

Fig. 8. The core steps of conceptual join.
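One way to picture the conceptual join of Section 4.4 is as an ordinary three-way join in which the ABSTRACTION relation lifts c.major to its abstract value before it is compared with the required major area. The following is a sketch using Python's built-in sqlite3 module; the column names, the employee names, the Econometrics major, the Market Forecasting task and the major areas assigned to each task are all hypothetical, and the SQL is not the query shown in Fig. 8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schemas and rows; only the roles of the relations
    -- (ABSTRACTION, COLLEGE_MAJOR, TASK_MAJOR) come from the text.
    CREATE TABLE abstraction (value TEXT PRIMARY KEY, abstract_value TEXT);
    CREATE TABLE college_major (name TEXT, major TEXT);
    CREATE TABLE task_major (task TEXT, required_major_area TEXT);

    INSERT INTO abstraction VALUES
        ('Finance', 'Management'), ('Accounting', 'Management'),
        ('Marketing', 'Management'), ('Econometrics', 'Economics');
    INSERT INTO college_major VALUES
        ('Kim', 'Finance'), ('Lee', 'Marketing'), ('Park', 'Econometrics');
    INSERT INTO task_major VALUES
        ('Health Insurance', 'Management'), ('Market Forecasting', 'Economics');
""")

# Only the lower-domain attribute (c.major, MAJOR_NAME) is generalized;
# t.required_major_area is already in the MAJOR_AREA domain.
CONCEPTUAL_JOIN = """
    SELECT c.name, c.major
    FROM college_major AS c
    JOIN abstraction AS a ON a.value = c.major
    JOIN task_major AS t ON t.required_major_area = a.abstract_value
    WHERE t.task = ?
    ORDER BY c.name
"""

rows = conn.execute(CONCEPTUAL_JOIN, ("Health Insurance",)).fetchall()
print(rows)
```

For the approximate join of Section 4.2, the same idea would apply with both join attributes passing through the ABSTRACTION relation (two aliases of it), so that the join condition compares the two abstract values.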


Fig. 9. The system configuration of prototype query answering system.

The prototype test example for approximate selection is provided in Fig. 10. In the system, the user is asked to designate the values out of the set of specific values, {Finance, Accounting, Marketing} to relax the condition, and the candidate employees who majored in either Finance or Accounting are presented as the result of the approximate selection query.
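The interaction just described, exact answers first and then a user-chosen relaxation range, can be sketched as below. This is an illustrative Python sketch rather than the prototype's code: the employee rows, the dictionary standing in for the VALUE_ABSTRACTION relation and the function names (modelled loosely on Table 3) are assumptions.

```python
# Hypothetical data: value -> abstract value, and (employee, major) rows.
ABSTRACT_OF = {
    "Finance": "Management",
    "Accounting": "Management",
    "Marketing": "Management",
}
EMPLOYEES = [("Kim", "Accounting"), ("Lee", "Marketing"), ("Park", "Accounting")]


def get_abstract_value(value):
    """1-level generalization of a specific value."""
    return ABSTRACT_OF[value]


def get_specific_values(abstract_value):
    """1-level specialization of an abstract value."""
    return sorted(v for v, a in ABSTRACT_OF.items() if a == abstract_value)


def approximate_selection(value, chosen=None):
    """Return (exact answers, neighborhood values, relaxed answers).

    `chosen` is the subset of neighborhood values the user designates;
    None means the whole neighborhood is accepted.
    """
    exact = [name for name, major in EMPLOYEES if major == value]
    neighbors = get_specific_values(get_abstract_value(value))
    allowed = [v for v in neighbors if chosen is None or v in chosen]
    relaxed = [name for name, major in EMPLOYEES if major in allowed]
    return exact, neighbors, relaxed


exact, neighbors, relaxed = approximate_selection(
    "Finance", chosen={"Finance", "Accounting"})
print(exact)      # exact answers, shown to the user first
print(neighbors)  # values offered for relaxation
print(relaxed)    # answers after the user restricts the range
```

With this sample data no employee majored in Finance, so the exact answer set is empty and the relaxed answers come from the Accounting majors the user chose to admit.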

5. Conclusion

In this paper we introduced a knowledge abstraction database to support cooperative query answering, which relaxes the search condition and provides approximate neighborhood information when exact answers are unavailable. The knowledge abstraction database was constructed on the basis of a knowledge representation framework, the KAH, that extracts abstract values and their domains from an underlying corporate database into several hierarchies using value and domain abstraction. We also presented query processing procedures that handle a variety of intelligent queries, including approximate selection, conceptual selection, approximate join and conceptual join. The proposed knowledge abstraction database and query processing procedures improve the effectiveness of information retrieval and decision-making.

Fig. 10. Cooperative query answering examples.


The proposed knowledge abstraction database has advantages over other approaches, including the conceptual clustering approach, the object-oriented database approach and the rule-based approach. First, in terms of cooperative query processing, the knowledge abstraction database increases the diversity of admitted vague queries. Additionally, structural and content changes in the knowledge abstraction database can be flexibly accommodated. Details of change maintenance in the KAH are the subject of a separate document [9]. In contrast, existing conceptual clustering approaches are limited in the diversity of vague queries and the flexibility of abstraction hierarchy maintenance. Existing approaches do not support join relaxation (e.g. approximate join, conceptual join), which allows multiple relations to be approximately joined at once, and thus require several intermediate approximate selection queries to perform the same approximate join query. Because of the dual abstraction hierarchies, join relaxation is supported in the KAH, and such join-related approximate queries can be performed in a single query.

Second, the KAH-based knowledge abstraction database can guide more interactive and flexible query relaxation processes. Information about the abstraction level facilitates conceptual queries. Internally, it enables the comparison of abstraction levels in multiple domains, supports the query relaxation process both incrementally and directly to a certain preferred level, and provides users with flexible relaxation control.

In future research, the KAH will be extended to retrieve and manage data in federated database systems, where component databases are autonomous in both data and schema but store data sets that are similar to one another.
Such an extended KAH will facilitate the query relaxation processes across multiple component databases and draw a wider range of approximate answers, which would not be obtainable from existing federated database approaches. Fuzzy set theory is also being studied, since both approximate and conceptual conditions have some commonality with fuzzy conditions. Fuzzy systems are believed to enrich the semantics of the KAH and increase the intelligent characteristics of the knowledge abstraction database.

References

[1] B.P. Buckles, F.E. Petry, A fuzzy representation of data for relational databases, Fuzzy Sets and Systems 7 (3) (1982) 213–226.
[2] Y. Cai, N. Cercone, J. Han, Attribute-oriented induction in relational databases, Knowledge Discovery in Databases, AAAI Press/The MIT Press, 1993.
[3] Q. Chen, W. Chu, R. Lee, Providing cooperative answers via knowledge-based type abstraction and refinement, in: Proceedings of the Fifth International Symposium on Methodologies for Intelligent Systems, Knoxville, TN, 1990.
[4] L. Cholvy, R. Demolombe, Querying a rule base, in: Proceedings of the First International Conference on Expert Database System, 1986, pp. 365–371.
[5] W. Chu, Q. Chen, Structured approach for cooperative query answering, IEEE Transactions on Knowledge and Data Engineering 6 (5) (1994) 738–749.
[6] W. Chu, Q. Chen, R. Lee, Cooperative query answering via type abstraction hierarchy, in: S.M. Deen (Ed.), Cooperative Knowledge Base System, Elsevier, Amsterdam, 1991, pp. 271–292.
[7] F. Cuppens, R. Demolombe, Cooperative answering: a methodology to provide intelligent access to databases, in: Proceedings of the Second International Conference on Expert Database Systems, October 1989, pp. 621–643.
[8] A. Hemerly, M. Casanova, A. Furtado, Exploiting user models to avoid misconstruals, Nonstandard Queries and Nonstandard Answers, Oxford, 1994, pp. 73–98.
[9] S.-Y. Huh, K.-H. Moon, Knowledge abstraction hierarchy for cooperative query answering, Working Paper, Graduate School of Management, KAIST, 1998, submitted for publication.
[10] T. Ichikawa, ARES: a relational database with the capability of performing flexible interpretation of queries, IEEE Transactions on Software Engineering SE-12 (5) (1986) 624–634.
[11] A.K. Jain, R.C. Dubes, Algorithms for Cluster Analysis, Prentice Hall, Englewood Cliffs, NJ, 1988.
[12] M.J. Minock, W. Chu, Explanation for cooperative information systems, in: Proceedings of the Ninth International Symposium on Methodologies for Intelligent Systems, June 1996.
[13] A. Motro, User interface to relational databases that permits vague queries, ACM Transactions on Office Information Systems 6 (3) (1988) 187–214.
[14] A. Motro, Tolerant and cooperative user interface to databases, IEEE Transactions on Knowledge and Data Engineering 2 (2) (1990) 231–246.
[15] H. Prade, C. Testemale, Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries, Information Sciences 34 (2) (1984) 115–143.
[16] S. Ram, Intelligent database design using the unifying semantic model, Information and Management 29 (4) (1995) 191–206.
[17] C.D. Shum, R.R. Muntz, An information-theoretic study on aggregate responses, in: Proceedings of the 14th International Conference on Very Large Databases, Morgan Kaufmann, Los Altos, CA, 1988, pp. 479–490.
[18] J.D. Ullman, Database and Knowledge-Base Systems, vol. 1, Computer Science Press, Rockville, MD, 1987.
[19] S.V. Vrbsky, W.S. Liu, APPROXIMATE—a query processor that produces monotonically improving approximate answers, IEEE Transactions on Knowledge and Data Engineering 5 (6) (1993) 1056–1068.
[20] M. Zemankova, A. Kandel, Implementing imprecision in information systems, Information Sciences 37 (1) (1985) 107–141.