Information Processing Letters 46 (1993) 251-256
Elsevier, 9 July 1993

Analysis of fuzzy operators for high quality information retrieval

Myoung Ho Kim, Joon Ho Lee and Yoon Joon Lee
Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1, Kusung-dong, Yusung-gu, Taejon 305-701, South Korea

Communicated by K. Ikeda
Received 13 March 1992
Revised 26 November 1992

Abstract

Kim, M.H., J.H. Lee and Y.J. Lee, Analysis of fuzzy operators for high quality information retrieval, Information Processing Letters 46 (1993) 251-256.

It has been argued that the conventional fuzzy set model based on the MIN and MAX operators is not appropriate as a model of information retrieval systems, because the MIN and MAX operators have properties adverse to retrieval effectiveness. In this paper we investigate the behavioral aspects of various fuzzy operators and address the properties that are important in achieving high quality document rankings. We then present a fuzzy set model based on positively compensatory operators, which provides higher retrieval effectiveness than the conventional fuzzy set model. We also show that the behavior of the proposed fuzzy set model almost coincides with that of the extended boolean model.

Keywords: Information retrieval; fuzzy set model; fuzzy operators; extended boolean model

Correspondence to: M.H. Kim, Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1, Kusung-dong, Yusung-gu, Taejon 305-701, South Korea.
This work was supported in part by Korea Science and Engineering Foundation Grant No. 921-1100-005-1.
0020-0190/93/$06.00 © 1993 - Elsevier Science Publishers B.V. All rights reserved

1. Introduction

The conventional fuzzy set model based on the MIN and MAX operators has been proposed in the past to support a ranking facility for boolean retrieval systems [1,2,5,8]. It uses document term weights, which reflect the importance of individual terms to the document. Formally, an Information Retrieval (IR) system based on the conventional fuzzy set model is defined by the quadruple $\langle T, Q, D, F \rangle$, where:

(1) $T$ is a set of index terms which are used to represent queries and documents.
(2) $Q$ is a set of queries that can be recognized by the system. Each query $q \in Q$ is a legitimate boolean expression composed of index terms and the logical operators AND, OR and NOT.
(3) $D$ is a set of documents. Each document $d \in D$ is represented by $((t_1, w_1), \ldots, (t_n, w_n))$, where $w_i$ designates the weight of term $t_i$ in $d$ and may take any value between zero and one, i.e. $0 \le w_i \le 1$.
(4) $F$ is a retrieval function $F: D \times Q \to [0, 1]$ which assigns to each pair $(d, q)$ a number in the closed interval $[0, 1]$. This number is a measure of similarity between the document $d$ and the query $q$

Boolean formulation            Evaluation formula
F(d, t_1 AND t_2)              MIN(F(d, t_1), F(d, t_2))
F(d, t_1 OR t_2)               MAX(F(d, t_1), F(d, t_2))
F(d, t_1 AND NOT t_2)          F(d, t_1) · (1 − F(d, t_2))

Fig. 1. Similarity evaluation rules in the conventional fuzzy set model.

and is called the document value for document $d$ with respect to query $q$. The retrieval function $F(d, q)$ is defined as follows:

(i) For each term $t_i$ in a query, the function $F(d, t_i)$ is defined as the weight of $t_i$ in document $d$, i.e., $w_i$.
(ii) The logical operators AND, OR and NOT are then evaluated based on the rules given in Fig. 1. For boolean queries containing more than one logical operator, the evaluation proceeds recursively from the innermost clause.

Although the conventional fuzzy set model is an elegant approach, it generates incorrect document rankings in many cases [1,4]. This is because the MIN and MAX operators do not correspond well to the way humans rank documents. Since the first introduction of fuzzy set theory, a variety of fuzzy operators have been proposed for the AND and OR operations. These fuzzy operators can be classified into T-operators [3] and averaging operators [9,10], depending on their operational characteristics. In this paper we investigate the behavioral aspects of the fuzzy operators and address important issues in enhancing the conventional fuzzy set model.

The remainder of this paper is organized as follows. Section 2 gives some terminology relevant to the presentation of this paper. In Section 3 we show that T-operators are not appropriate as evaluation formulas of the fuzzy set model. In Section 4 the analyses of averaging operators are presented, and a new formulation of the fuzzy set model based on positively compensatory operators is proposed. In Section 5 the proposed fuzzy set model is compared with the extended boolean model, which has been shown to be effective in the literature. The concluding remarks are given in Section 6.
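The recursive evaluation of the retrieval function described in the Introduction can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the classes `Term`, `And`, `Or` and the function `document_value` are hypothetical names, a term missing from a document is assumed to have weight 0, and the NOT operator is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Union


# Hypothetical query representation (not from the paper): a query is
# either a single index term or a binary AND / OR node.
@dataclass(frozen=True)
class Term:
    name: str


@dataclass(frozen=True)
class And:
    left: "Query"
    right: "Query"


@dataclass(frozen=True)
class Or:
    left: "Query"
    right: "Query"


Query = Union[Term, And, Or]


def document_value(doc: Dict[str, float], query: Query) -> float:
    """Evaluate F(d, q) recursively, innermost clauses first."""
    if isinstance(query, Term):
        # Assumption: a term absent from the document has weight 0.
        return doc.get(query.name, 0.0)
    if isinstance(query, And):
        # Conventional model: AND is evaluated with MIN.
        return min(document_value(doc, query.left),
                   document_value(doc, query.right))
    if isinstance(query, Or):
        # Conventional model: OR is evaluated with MAX.
        return max(document_value(doc, query.left),
                   document_value(doc, query.right))
    raise TypeError(f"unsupported query node: {query!r}")


# Illustrative weights only; they do not come from the paper.
doc = {"t1": 0.3, "t2": 0.8, "t3": 0.5}
query = Or(And(Term("t1"), Term("t2")), Term("t3"))  # (t1 AND t2) OR t3
print(document_value(doc, query))  # 0.5
```

The inner clause `t1 AND t2` evaluates to MIN(0.3, 0.8) = 0.3, and the outer OR then yields MAX(0.3, 0.5) = 0.5.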


2. Preliminaries


In this section we give some relevant terminology about behavioral properties of the fuzzy operator, and then define the operator graph which is useful to analyze the behavioral properties of various fuzzy operators.

Definition 1. An operator $\theta$ is single operand dependent if $\theta(x, y)$ is either $x$ or $y$ for all $x, y \in [0, 1]$. It is called partially single operand dependent when one of the following conditions is satisfied:
(i) $\theta(0, x) = \theta(x, 0) = 0$ or $x$,
(ii) $\theta(1, x) = \theta(x, 1) = 1$ or $x$.

Definition 2. An operator $\theta$ is negatively compensatory if $\theta(x, y)$ is less than $\mathrm{MIN}(x, y)$ or greater than $\mathrm{MAX}(x, y)$ for all $x, y \in (0, 1)$. When $\theta(x, y)$ is less than $\mathrm{MIN}(x, y)$ or greater than $\mathrm{MAX}(x, y)$ only in some value ranges of $x$ and $y$, it is called partially negatively compensatory.

Definition 3. An operator $\theta$ is positively compensatory if $\theta(x, y)$ is greater than $\mathrm{MIN}(x, y)$ and less than $\mathrm{MAX}(x, y)$ for all $x, y \in [0, 1]$, with the exception that $\theta(x, x)$ is equal to $x$.

Figure 2(a) shows how the operator graph is constructed for the given two operand graphs A and B, where the vertical coordinate denotes the degree of membership. The operator graph is a set of points which are computed by applying the operator to the values of A and B at each element in the set of objects. Note that operand graphs A and B can be any arbitrary graphs. This notation of the operator graph is presented in [3]. For example, a point $\gamma$ in Fig. 2(a) is computed by applying the operator to the values of two operands $\alpha$ and $\beta$. The operator graphs for the MIN and MAX operators, for instance, are depicted in Fig. 2(b), where the bold line represents the values of MIN and the other line represents the values of MAX.

Fig. 2. The operator graph. (a) Creating the operator graph. (b) The operator graphs for the MIN and MAX operators.

We can easily understand behavioral properties of an operator from its operator graph. For example, Fig. 2(b) shows that the MIN and MAX operators are single operand dependent because all points of the operator graphs are on either one operand graph or the other. For each operator graph in Fig. 3, the corresponding operator has the following properties:
(a) partially single operand dependent and negatively compensatory,
(b) partially single operand dependent and partially negatively compensatory,
(c) partially negatively compensatory,
(d) positively compensatory.
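The properties in Definitions 1 to 3 can also be spot-checked numerically. The sketch below samples each operator on a finite grid, so it can only refute a property or fail to refute it; it is not a proof. All function names are hypothetical, and the check for partial single operand dependency encodes one reading of Definition 1, namely that for every $x$ the value $\theta(0, x) = \theta(x, 0)$ is either 0 or $x$ (and analogously for 1).

```python
EPS = 1e-9
GRID = [i / 20 for i in range(21)]   # sample points in [0, 1]
INTERIOR = GRID[1:-1]                # sample points in (0, 1)


def close(a, b):
    return abs(a - b) <= EPS


def is_single_operand_dependent(op):
    # Definition 1: op(x, y) is always one of its operands.
    return all(close(op(x, y), x) or close(op(x, y), y)
               for x in GRID for y in GRID)


def is_partially_single_operand_dependent(op):
    # Definition 1, conditions (i) and (ii), with c = 0 and c = 1.
    def holds(c):
        return all(close(op(c, x), op(x, c))
                   and (close(op(c, x), c) or close(op(c, x), x))
                   for x in GRID)
    return holds(0.0) or holds(1.0)


def is_negatively_compensatory(op):
    # Definition 2: strictly outside [MIN, MAX] on the open interval.
    return all(op(x, y) < min(x, y) - EPS or op(x, y) > max(x, y) + EPS
               for x in INTERIOR for y in INTERIOR)


def is_positively_compensatory(op):
    # Definition 3: strictly between MIN and MAX, except op(x, x) = x.
    for x in GRID:
        for y in GRID:
            v = op(x, y)
            if close(x, y):
                if not close(v, x):
                    return False
            elif not (min(x, y) + EPS < v < max(x, y) - EPS):
                return False
    return True


def product(x, y):
    return x * y


def mean(x, y):
    # Arithmetic mean, used here only as a simple illustrative operator.
    return (x + y) / 2


for name, op in [("MIN", min), ("product", product), ("mean", mean)]:
    print(name,
          is_single_operand_dependent(op),
          is_partially_single_operand_dependent(op),
          is_negatively_compensatory(op),
          is_positively_compensatory(op))
```

On this grid, MIN is reported as single operand dependent, the product as partially single operand dependent and negatively compensatory, and the arithmetic mean as positively compensatory.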

3. Critics against the use of T-operators

The conventional fuzzy set model has been criticized as inappropriate for a model of IR systems [1,4]. This is because the MIN and MAX operators are single operand dependent, which decreases retrieval effectiveness. The examples below show that the conventional fuzzy set model using the MIN operator for the AND operation produces ranked output that runs counter to human intuition. The MAX operator causes similar problems.

Example 4. Suppose that we have two documents $d_1$ and $d_2$ and a query $q_1$ as shown below. The documents are represented by two pairs of an index term and its weight.

$d_1 = \{(\text{Thesaurus}, 0.40), (\text{Clustering}, 0.40)\}$,
$d_2 = \{(\text{Thesaurus}, 0.99), (\text{Clustering}, 0.39)\}$,
$q_1 = \text{Thesaurus AND Clustering}$.

When the MIN operator is used for the AND operation, the document values of $d_1$ and $d_2$ for the query $q_1$ are evaluated as 0.40 and 0.39, respectively. Hence, $d_1$ is retrieved with a higher rank than $d_2$. Most people, however, will decide that $d_2$ rather than $d_1$ is more similar to $q_1$.

Fig. 3. The behavioral properties of various operators. Panels (a) to (d).

Example 5. Suppose that we have two documents $d_3$ and $d_4$ and a query $q_2$ as follows:

$d_3 = \{\ldots, (t_{100}, 1)\}$,
$d_4 = \{\ldots, (t_{100}, 0)\}$,
$q_2 = t_1 \text{ AND } t_2 \text{ AND } \ldots \text{ AND } t_{100}$.

Here, the fuzzy set model using the MIN operator decides that the document values of $d_3$ and $d_4$ for the query $q_2$ are the same, i.e., zero. Furthermore, the document value of $d_3$ is zero even though ninety-nine terms of $d_3$ satisfy the query terms specified in the given query $q_2$.

The problems shown in Examples 4 and 5 occur whenever an operator is single operand dependent. A partially single operand dependent operator can alleviate the problem described in Example 4, but it still causes the problem described in Example 5. For example, suppose that the product operator, i.e. $x \cdot y$, is used instead of the MIN operator in Example 4. (Note that the product operator is partially single operand dependent.) The document values of $d_1$ and $d_2$ are evaluated as 0.16 and 0.39 respectively, and hence $d_2$ is retrieved with a higher rank than $d_1$.

The T-operators, namely T-norms and T-conorms, originated from the studies of probabilistic metric spaces. Later, it was proposed that T-norms and T-conorms could be used for the AND and OR operations, respectively, in fuzzy set theory. The MIN operator belongs to T-norms and the MAX operator belongs to T-conorms. In the rest of this section we consider the effect on retrieval effectiveness of using other types of T-operators instead of the MIN and MAX operators.

T-norms and T-conorms are denoted by $T$ and $T^*$, respectively. Figures 4 and 5 show some T-operators and their operator graphs. (For more T-operators, see [3].) We can easily see from Fig. 5 that they are partially single operand dependent. Moreover, it follows from the definition of T-operators that all T-operators are partially single operand dependent [9]. Therefore the problem described in Example 5 cannot be overcome, whichever T-operators the fuzzy set model uses for the AND and OR operations.

In addition, the T-operators share the following characteristics adverse to retrieval effectiveness [9]: T-norms give a resulting value which is less than or equal to the minimum value, and T-conorms give a resulting value which is greater than or equal to the maximum value. In other words, T-operators are negatively compensatory. When negatively compensatory operators are used in the fuzzy set model, the document value depends heavily on the number of query terms. This property can cause the problem illustrated in the next example.

Example 6. Suppose a document $d_5$ and two queries $q_1$ and $q_2$ are given as follows:

$d_5 = \{(\text{Thesaurus}, 0.70), (\text{Clustering}, 0.70), (\text{System}, 0.70)\}$,
$q_1 = \text{Thesaurus AND Clustering}$,
$q_2 = \text{System}$.

Whatever operator the fuzzy set model uses, the similarity between $q_2$ and $d_5$ is evaluated as 0.70, which is the weight of the term “System” in $d_5$.

No.   T(x, y)                        T*(x, y)
1     x · y                          x + y − xy
2     MAX(x + y − 1, 0)              MIN(x + y, 1)
3     xy / (x + y − xy)              (x + y − 2xy) / (1 − xy)
4     x if y = 1,                    x if y = 0,
      y if x = 1,                    y if x = 0,
      0 otherwise                    1 otherwise

Fig. 4. T-operators.

Fig. 5. The operator graphs for the T-operators in Fig. 4.

Negatively compensatory operators will decide that the similarity between $q_1$ and $d_5$ is less than 0.70. The similarity between $q_1$ and $d_5$ is therefore less than that between $q_2$ and $d_5$, which clearly does not agree with most people's judgment.

We have shown that the T-operators do not model human behavior for document ranking well. We close this section with the following summary. First, the conventional fuzzy set model based on the MIN and MAX operators suffers from the problem of single operand dependency, as presented in Examples 4 and 5. Second, the fuzzy set model adopting other types of T-operators suffers not only from partially single operand dependency, as described in Example 5, but also from negative compensation, as described in Example 6.
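The arithmetic behind Examples 4 to 6 can be reproduced with a short Python sketch. The helper `and_value` and the operator names are hypothetical. The weights for Examples 4 and 6 are those given in the text, while `many_terms` is an assumed stand-in in the spirit of Example 5 (ninety-nine weights of 1 and a single weight of 0), not the paper's exact document.

```python
from functools import reduce


def t_min(x, y):
    return min(x, y)


def t_product(x, y):
    return x * y


def t_bounded(x, y):
    # Bounded difference T-norm: MAX(x + y - 1, 0).
    return max(x + y - 1.0, 0.0)


def and_value(doc, terms, op):
    """Fold a binary AND operator over the weights of the query terms."""
    return reduce(op, (doc[t] for t in terms))


d1 = {"Thesaurus": 0.40, "Clustering": 0.40}
d2 = {"Thesaurus": 0.99, "Clustering": 0.39}
d5 = {"Thesaurus": 0.70, "Clustering": 0.70, "System": 0.70}
q1 = ["Thesaurus", "Clustering"]

# Example 4: MIN ranks d1 above d2, the product ranks d2 above d1.
print("MIN    :", and_value(d1, q1, t_min), and_value(d2, q1, t_min))
print("product:", and_value(d1, q1, t_product), and_value(d2, q1, t_product))

# Example 5 (assumed weights): a single zero forces the value to zero.
many_terms = [1.0] * 99 + [0.0]
for op in (t_min, t_product, t_bounded):
    print(op.__name__, reduce(op, many_terms))

# Example 6: T-norms other than MIN score q1 below the single-term q2.
print("q2:", d5["System"])
print("q1 with product:", and_value(d5, q1, t_product))
print("q1 with bounded difference:", and_value(d5, q1, t_bounded))
```

Folding left to right is adequate here because T-norms are associative, so the order in which the query terms are combined does not change the result beyond rounding.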

4. The fuzzy set model based on positively compensatory operators

In fuzzy decision theory the “decision” has been viewed as the intersection or the union of fuzzy sets, and T-operators have been used to model human decisions in many cases. However, it has been noted that T-operators are not appropriate to model “managerial decisions”. Averaging operators have been developed to overcome this problem. In this paper we consider four averaging operators proposed in the past [9,10]. In averaging operators the resulting value is controlled by a parameter $\gamma$. In Figs. 6 and 7 three general operators, $A_1$ to $A_3$, and a pair of operators which distinguishes the AND operator, $A_{41}$, and the OR operator, $A_{42}$, are shown with the corresponding operator graphs.

From Fig. 7 we can easily see that $A_1$ and $A_2$ are partially negatively compensatory and that $A_1$ is also partially single operand dependent. Hence, the fuzzy set model based on $A_1$ cannot avoid the problems of partially single operand dependency and negative compensation, and the fuzzy set model based on $A_2$ causes the problem of negative compensation. The averaging operators $A_3$ and $A_4$ are positively compensatory. Because positively compensatory operators are neither partially single operand dependent nor partially negatively compensatory, the fuzzy set model using positively compensatory operators as evaluation formulas for the AND and OR operations avoids all the problems described in Examples 4, 5 and 6.

Consequently, we propose the use of positively compensatory operators, e.g., $A_3$ and $A_4$, as evaluation formulas of the fuzzy set model. The proposed fuzzy set model also provides higher retrieval effectiveness, although we have not performed exact experiments owing to the lack of experimental data. The next section, however, supports this claim.
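To make the effect of positive compensation concrete, the sketch below uses a convex combination of MIN and MAX as a stand-in AND operator. This is an illustrative assumption: the function `compensatory_and` and the default `gamma = 0.5` are hypothetical, and the operator is not claimed to be the paper's $A_3$ or $A_4$, whose formulas are given in Fig. 6. For any `gamma` strictly between 0 and 1 the result lies strictly between MIN and MAX when the operands differ and equals $x$ when both operands are $x$, which is what Definition 3 requires.

```python
def compensatory_and(x: float, y: float, gamma: float = 0.5) -> float:
    """Convex combination of MIN and MAX (illustrative stand-in only)."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    return gamma * min(x, y) + (1.0 - gamma) * max(x, y)


# Example 4: d2 = (0.99, 0.39) now ranks above d1 = (0.40, 0.40).
d1_value = compensatory_and(0.40, 0.40)
d2_value = compensatory_and(0.99, 0.39)
print(d1_value, d2_value)

# Two-term analogue of Example 5: matching one of two query terms
# scores higher than matching neither.
print(compensatory_and(1.0, 0.0), compensatory_and(0.0, 0.0))

# Example 6: the two-term query q1 no longer scores below the
# single-term query q2 when all weights are 0.70.
print(compensatory_and(0.70, 0.70))
```

With equal weights the operator returns the common weight, so adding query terms does not by itself lower the document value.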

5. Comparison with the extended boolean model

The extended boolean model has been shown to be an effective IR model in the literature [6,7]. It represents a unifying retrieval model in which the conventional fuzzy set model and the vector space model are special cases. The query-document similarity in the extended boolean model is based on $L_p$ vector norm computations, and is controlled by a parameter $p$, 1