Information Processing Letters 46 (1993) 251-256
Elsevier
9 July 1993
Analysis of fuzzy operators for high quality information retrieval
Myoung Ho Kim, Joon Ho Lee and Yoon Joon Lee
Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1, Kusung-dong, Yusung-gu, Taejon 305-701, South Korea
Communicated by K. Ikeda
Received 13 March 1992
Revised 26 November 1992
Abstract

Kim, M.H., J.H. Lee and Y.J. Lee, Analysis of fuzzy operators for high quality information retrieval, Information Processing Letters 46 (1993) 251-256.
It has been argued that the conventional fuzzy set model based on the MIN and MAX operators is not appropriate as a model of information retrieval systems, because the MIN and MAX operators have properties that are adverse to retrieval effectiveness. In this paper we investigate the behavioral aspects of various fuzzy operators and identify the properties that are important in achieving high quality document rankings. We then present a fuzzy set model based on positively compensatory operators, which provides higher retrieval effectiveness than the conventional fuzzy set model. We also show that the behavior of the proposed fuzzy set model almost coincides with that of the extended boolean model.

Keywords: Information retrieval; fuzzy set model; fuzzy operators; extended boolean model

Correspondence to: M.H. Kim, Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1, Kusung-dong, Yusung-gu, Taejon 305-701, South Korea.
This work was supported in part by Korea Science and Engineering Foundation Grant No. 921-1100-005-1.
0020-0190/93/$06.00 © 1993 - Elsevier Science Publishers B.V. All rights reserved

1. Introduction

The conventional fuzzy set model based on the MIN and MAX operators has been proposed in the past to support a ranking facility for boolean retrieval systems [1,2,5,8]. It uses document term weights, which reflect the importance of individual terms to the document. Formally, an Information Retrieval (IR) system based on the conventional fuzzy set model is defined by the quadruple ⟨T, Q, D, F⟩, where:

(1) T is a set of index terms which are used to represent queries and documents.
(2) Q is a set of queries that can be recognized by the system. Each query q ∈ Q is a legitimate boolean expression composed of index terms and the logical operators AND, OR and NOT.
(3) D is a set of documents. Each document d ∈ D is represented by ((t_1, w_1), ..., (t_n, w_n)), where w_i designates the weight of term t_i in d and may take any value between zero and one, i.e. 0 ≤ w_i ≤ 1.
(4) F is a retrieval function F: D × Q → [0,1] which assigns to each pair (d, q) a number in the closed interval [0,1]. This number is a measure of similarity between the document d and the query
q, and is called the document value for document d with respect to query q. The retrieval function F(d, q) is defined as follows:

(i) For each term t_i in a query, the function F(d, t_i) is defined as the weight of t_i in document d, i.e., w_i.
(ii) The logical operators AND, OR and NOT are then evaluated based on the rules given in Fig. 1.

For boolean queries containing more than one logical operator, the evaluation proceeds recursively from the innermost clause.

Boolean formulation        Evaluation formula
F(d, t_1 AND t_2)          MIN(F(d, t_1), F(d, t_2))
F(d, t_1 OR t_2)           MAX(F(d, t_1), F(d, t_2))
F(d, t_1 AND NOT t_2)      F(d, t_1) × (1 − F(d, t_2))

Fig. 1. Similarity evaluation rules in the conventional fuzzy set model.

Although the conventional fuzzy set model is an elegant approach, it generates incorrect document rankings in many cases [1,4]. This is because the MIN and MAX operators do not correspond well with the way humans rank documents. Since the introduction of fuzzy set theory, a variety of fuzzy operators have been proposed for the AND and OR operations. These fuzzy operators can be classified into T-operators [3] and averaging operators [9,10], depending on their operational characteristics. In this paper we investigate the behavioral aspects of the fuzzy operators and address important issues in enhancing the conventional fuzzy set model.

The remainder of this paper is organized as follows. Section 2 gives the terminology used in this paper. In Section 3 we show that T-operators are not appropriate as evaluation formulas of the fuzzy set model. Section 4 presents the analysis of averaging operators and proposes a new formulation of the fuzzy set model based on positively compensatory operators. In Section 5 the proposed fuzzy set model is compared with the extended boolean model, which has been shown to be effective in the literature. Concluding remarks are given in Section 6.
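The recursive evaluation of F(d, q) described in the Introduction can be sketched in a few lines of Python. This is a minimal illustration and not code from the paper: the tuple-based query representation, the function name `evaluate`, the choice to treat a term missing from the document as weight 0, and the sample weights are all assumptions made for the example. Only the AND and OR rules are covered.

```python
from typing import Dict, Union

# Assumed representation: a query is either a term (str) or a tuple
# ("AND", q1, q2) / ("OR", q1, q2) whose operands are themselves queries.
Query = Union[str, tuple]


def evaluate(doc: Dict[str, float], query: Query) -> float:
    """Document value F(d, q) under the conventional MIN/MAX rules."""
    if isinstance(query, str):
        # Rule (i): a single term evaluates to its weight in the document.
        # Treating an absent term as weight 0 is an assumption of this sketch.
        return doc.get(query, 0.0)
    operator, left, right = query
    # Rule (ii): evaluate the inner clauses first, then combine them.
    left_value = evaluate(doc, left)
    right_value = evaluate(doc, right)
    if operator == "AND":
        return min(left_value, right_value)
    if operator == "OR":
        return max(left_value, right_value)
    raise ValueError(f"unsupported operator: {operator!r}")


# Hypothetical document and query, used only to exercise the function.
d = {"t1": 0.8, "t2": 0.3, "t3": 0.6}
q = ("OR", ("AND", "t1", "t2"), "t3")
print(evaluate(d, q))  # MAX(MIN(0.8, 0.3), 0.6) = 0.6
```

Because MIN and MAX return one of their operands unchanged, the result of any such query is always the weight of a single term of the document, which is the behavior the later sections criticize.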
2. Preliminaries
In this section we give some terminology for the behavioral properties of fuzzy operators, and then define the operator graph, which is useful for analyzing the behavioral properties of various fuzzy operators.

Definition 1. An operator θ is single operand dependent if θ(x, y) is either x or y for all x, y ∈ [0,1]. It is called partially single operand dependent when one of the following conditions is satisfied: (i) θ(0, x) = θ(x, 0) = 0 or x, (ii) θ(1, x) = θ(x, 1) = 1 or x.
Definition 2. An operator θ is negatively compensatory if θ(x, y) is less than MIN(x, y) or greater than MAX(x, y) for all x, y ∈ (0, 1). When θ(x, y) is less than MIN(x, y) or greater than MAX(x, y) only in some value ranges of x and y, it is called partially negatively compensatory.
Definition 3. An operator θ is positively compensatory if θ(x, y) is greater than MIN(x, y) and less than MAX(x, y) for all x, y ∈ [0,1], with the exception that θ(x, x) is equal to x.

Figure 2(a) shows how the operator graph is constructed for two given operand graphs A and B, where the vertical coordinate denotes the degree of membership. The operator graph is the set of points computed by applying the operator to the values of A and B at each element in the set of objects. Note that the operand graphs A and B can be arbitrary graphs. This notation of the operator graph is presented in [3]. For example, the point γ in Fig. 2(a) is computed by applying the operator to the values α and β of the two operands. The operator graphs for the MIN and MAX operators are depicted in Fig. 2(b), where the bold line represents the values of MIN and the other line represents the values of MAX.

Fig. 2. The operator graph. (a) Creating the operator graph. (b) The operator graphs for the MIN and MAX operators.

The behavioral properties of an operator can easily be read from its operator graph. For example, Fig. 2(b) shows that the MIN and MAX operators are single operand dependent, because all points of their operator graphs lie on one operand graph or the other. For each operator graph in Fig. 3, the corresponding operator has the following properties: (a) partially single operand dependent and negatively compensatory, (b) partially single operand dependent and partially negatively compensatory, (c) partially negatively compensatory, (d) positively compensatory.
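The properties in Definitions 1 to 3 can also be probed numerically rather than graphically. The sketch below is only an illustration under stated assumptions: it samples operand pairs on a finite grid, so a positive result is evidence and not a proof, and the function names, the grid resolution, the tolerance `EPS` and the use of the arithmetic mean as a sample operator are all choices made for this example rather than anything taken from the paper.

```python
from itertools import product
from typing import Callable

Operator = Callable[[float, float], float]

EPS = 1e-9                          # tolerance for floating-point comparison
GRID = [i / 20 for i in range(21)]  # sample operand values 0.0, 0.05, ..., 1.0


def is_single_operand_dependent(op: Operator) -> bool:
    """Definition 1: op(x, y) equals x or y at every sampled point."""
    return all(
        abs(op(x, y) - x) < EPS or abs(op(x, y) - y) < EPS
        for x, y in product(GRID, repeat=2)
    )


def falls_outside_min_max(op: Operator) -> bool:
    """True if op(x, y) is below MIN or above MAX somewhere in (0, 1),
    i.e. the operator is at least partially negatively compensatory."""
    interior = [v for v in GRID if 0.0 < v < 1.0]
    return any(
        op(x, y) < min(x, y) - EPS or op(x, y) > max(x, y) + EPS
        for x, y in product(interior, repeat=2)
    )


def is_positively_compensatory(op: Operator) -> bool:
    """Definition 3: strictly between MIN and MAX, and op(x, x) = x."""
    for x, y in product(GRID, repeat=2):
        value = op(x, y)
        if abs(x - y) < EPS:
            if abs(value - x) > EPS:
                return False
        elif not (min(x, y) + EPS < value < max(x, y) - EPS):
            return False
    return True


def algebraic_product(x: float, y: float) -> float:
    return x * y


def arithmetic_mean(x: float, y: float) -> float:
    # Illustrative operator only; not one of the paper's operators.
    return (x + y) / 2.0


for name, op in [("MIN", min), ("MAX", max),
                 ("product", algebraic_product), ("mean", arithmetic_mean)]:
    print(name,
          is_single_operand_dependent(op),
          falls_outside_min_max(op),
          is_positively_compensatory(op))
```

On this grid, MIN and MAX pass the first check, the product falls below MIN in the interior, and the arithmetic mean passes the third check.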
3. Critics against the use of T-operators
The conventional fuzzy set model has been criticized as inappropriate as a model of IR systems [1,4]. This is because the MIN and MAX operators are single operand dependent, which decreases retrieval effectiveness. The examples below show that the conventional fuzzy set model using the MIN operator for the AND operation produces ranked output that runs counter to human intuition. The MAX operator causes similar problems.

Example 4. Suppose that we have two documents d_1 and d_2 and a query q_1 shown below. The documents are represented by two pairs of an index term and its weight.

d_1 = {(Thesaurus, 0.40), (Clustering, 0.40)},
d_2 = {(Thesaurus, 0.99), (Clustering, 0.39)},
q_1 = Thesaurus AND Clustering.

When the MIN operator is used for the AND operation, the document values of d_1 and d_2 for the query q_1 are evaluated as 0.40 and 0.39, respectively. Hence, d_1 is retrieved with a higher rank than d_2. Most people, however, would judge d_2 rather than d_1 to be more similar to q_1.

Fig. 3. The behavioral properties of various operators. Panels (a)-(d).

Example 5. Suppose that we have two documents d_3 and d_4 and a query q_2 as follows:
d_3 = { ..., 1)},
d_4 = { ..., 0)},
q_2 = t_1 AND t_2 AND ... AND t_100.
Here, the fuzzy set model using the MIN operator decides that the document values of d_3 and d_4 for the query q_2 are the same, i.e., zero. The document value of d_3 is zero even though ninety-nine terms of d_3 satisfy the query terms specified in q_2.

The problems shown in Examples 4 and 5 occur whenever an operator is single operand dependent. A partially single operand dependent operator can alleviate the problem described in Example 4, but it still causes the problem described in Example 5. For example, suppose that the product operator, i.e. x · y, is used instead of the MIN operator in Example 4. (Note that the product operator is partially single operand dependent.) The document values of d_1 and d_2 are then evaluated as 0.16 and 0.39, respectively, and hence d_2 is retrieved with a higher rank than d_1.

The T-operators, namely T-norms and T-conorms, originated from studies of probabilistic metric spaces. Later, it was proposed that T-norms and T-conorms could be used for the AND and OR operations, respectively, in fuzzy set theory. The MIN operator is a T-norm and the MAX operator is a T-conorm. In the rest of this section we consider the effect on retrieval effectiveness of using other types of T-operators instead of the MIN and MAX operators. T-norms and T-conorms are denoted by T and T*, respectively. Figures 4 and 5 show some T-operators and their operator graphs. (For more T-operators, see [3].) Fig. 5 shows that they are partially single operand dependent. Moreover, it follows from the definition of T-operators that all T-operators are partially single operand dependent [9]. Therefore the problem described in Example 5 cannot be overcome whichever T-operators the fuzzy set model uses for the AND and OR operations.

In addition, the T-operators share the following characteristic, which is adverse to retrieval effectiveness [9]: T-norms give a resulting value that is less than or equal to the minimum value, and T-conorms give a resulting value that is greater than or equal to the maximum value. In other words, T-operators are negatively compensatory. When negatively compensatory operators are used in the fuzzy set model, the document value depends heavily on the number of query terms. This property can cause the problem illustrated in the next example.

Example 6. Suppose a document d_5 and two queries q_1 and q_2 are given as follows:

d_5 = {(Thesaurus, 0.70), (Clustering, 0.70), (System, 0.70)},
q_1 = Thesaurus AND Clustering,
q_2 = System.

Whichever operator the fuzzy set model uses, the similarity between q_2 and d_5 is evaluated as 0.70, which is the weight of the term "System" in d_5.
No.   T(x, y)                       T*(x, y)
1     x · y                         x + y − xy
2     MAX(x + y − 1, 0)             MIN(x + y, 1)
3     xy / (x + y − xy)             (x + y − 2xy) / (1 − xy)
4     x if y = 1; y if x = 1;       x if y = 0; y if x = 0;
      0 otherwise                   1 otherwise

Fig. 4. T-operators.

Fig. 5. The operator graphs for the T-operators in Fig. 4.
The negatively compensatory operators will decide that the similarity between q_1 and d_5 is less than 0.70. That is, the similarity between q_1 and d_5 is less than that between q_2 and d_5, which clearly does not agree with most people's judgment.

We have shown that the T-operators do not model human behavior for document ranking well. We close this section with the following summary. First, the conventional fuzzy set model based on the MIN and MAX operators suffers from the problem of single operand dependency, which has been presented in Examples 4 and 5. Second, the fuzzy set model adopting other types of T-operators suffers not only from the problem of partially single operand dependency described in Example 5 but also from the negative compensation described in Example 6.
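The three examples of this section can be replayed with a short Python sketch. It is offered as an illustration under assumptions, not as the authors' implementation: the T-norm formulas follow one reading of Fig. 4 (product, bounded difference, a quotient form often called the Hamacher product, and the drastic product), the function names are invented for the example, and the document of Example 5 is modelled simply as ninety-nine weights of 1 and one weight of 0.

```python
from functools import reduce
from typing import Callable, Dict, List

TNorm = Callable[[float, float], float]


def t_min(x: float, y: float) -> float:
    return min(x, y)


def t_product(x: float, y: float) -> float:
    return x * y


def t_bounded(x: float, y: float) -> float:
    return max(x + y - 1.0, 0.0)


def t_hamacher(x: float, y: float) -> float:
    denominator = x + y - x * y
    # The quotient is undefined at x = y = 0; returning 0 there is an assumption.
    return 0.0 if denominator == 0.0 else (x * y) / denominator


def t_drastic(x: float, y: float) -> float:
    if y == 1.0:
        return x
    if x == 1.0:
        return y
    return 0.0


T_NORMS: Dict[str, TNorm] = {
    "MIN": t_min,
    "product": t_product,
    "bounded": t_bounded,
    "Hamacher": t_hamacher,
    "drastic": t_drastic,
}


def and_value(t_norm: TNorm, weights: List[float]) -> float:
    """Document value of t_1 AND ... AND t_n, folding the T-norm pairwise."""
    return reduce(t_norm, weights)


# Example 4: MIN ranks d_1 above d_2, the product reverses the order.
d1, d2 = [0.40, 0.40], [0.99, 0.39]
print(and_value(t_min, d1) > and_value(t_min, d2))
print(and_value(t_product, d1) < and_value(t_product, d2))

# Example 5: one zero weight forces every T-norm above to zero.
many_matches = [1.0] * 99 + [0.0]
print({name: and_value(t, many_matches) for name, t in T_NORMS.items()})

# Example 6: the two-term query never scores above the single term's 0.70.
print({name: and_value(t, [0.70, 0.70]) for name, t in T_NORMS.items()})
```

The fold in `and_value` relies on the T-norms being associative, which is why a query with many AND terms can be evaluated pairwise.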
4. The fuzzy set model based on positively compensatory operators

In fuzzy decision theory the “decision” has been viewed as the intersection or the union of fuzzy sets, and T-operators have been used to model human decisions in many cases. However, it has been noted that T-operators are not appropriate for modeling “managerial decisions”. Averaging operators have been developed to overcome this problem. In this paper we consider four averaging operators proposed in the past [9,10]. In averaging operators the resulting value is controlled by a parameter γ. Figs. 6 and 7 show three general operators, A_1-A_3, and a pair of operators which distinguishes the AND operator, A_41, from the OR operator, A_42, together with the corresponding operator graphs. From Fig. 7 we can easily see that A, and A, are partially negatively compensatory and A, is
also partially single operand dependent. Hence, the fuzzy set model based on A, cannot avoid the problems of partially single operand dependency and negative compensation, and the fuzzy set model based on A, suffers from the problem of negative compensation. The averaging operators A, and A, are positively compensatory. Because positively compensatory operators are neither partially single operand dependent nor partially negatively compensatory, the fuzzy set model using positively compensatory operators as evaluation formulas for the AND and OR operations avoids all the problems described in Examples 4, 5 and 6.

Consequently, we propose the use of positively compensatory operators, e.g., A, and A,, as evaluation formulas of the fuzzy set model. This new formulation of the fuzzy set model overcomes all the problems described in Examples 4, 5 and 6. We expect the proposed fuzzy set model to provide higher retrieval effectiveness, although we have not verified this experimentally owing to the lack of experimental data. The comparison in the next section supports this claim.
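The effect of a positively compensatory AND operator on Examples 4, 5 and 6 can be illustrated with a small Python sketch. The operator used here, γ · MIN + (1 − γ) · arithmetic mean with 0 ≤ γ < 1, is an assumption chosen because it lies strictly between MIN and MAX whenever the operands differ; it is not claimed to be identical to any of the operators defined in Fig. 6, and the function name and the default γ = 0.5 are likewise hypothetical.

```python
from typing import List


def compensatory_and(weights: List[float], gamma: float = 0.5) -> float:
    """Illustrative positively compensatory AND (assumed form, see lead-in):
    gamma * MIN(weights) + (1 - gamma) * mean(weights)."""
    if not 0.0 <= gamma < 1.0:
        # gamma = 1 would reduce the operator to plain MIN.
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    mean = sum(weights) / len(weights)
    return gamma * min(weights) + (1.0 - gamma) * mean


# Example 4: d_2 = (0.99, 0.39) now outranks d_1 = (0.40, 0.40).
print(compensatory_and([0.99, 0.39]) > compensatory_and([0.40, 0.40]))

# Example 5: ninety-nine matching terms are no longer wiped out by one zero.
print(compensatory_and([1.0] * 99 + [0.0]) > compensatory_and([0.0] * 100))

# Example 6: two terms of weight 0.70 score the same as one term of weight 0.70.
print(abs(compensatory_and([0.70, 0.70]) - 0.70) < 1e-12)
```

With these assumed settings all three comparisons come out in the direction the section argues for, which is the qualitative behavior expected of a positively compensatory operator.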
5. Comparison with the extended boolean model
The extended boolean model has been shown to be an effective IR model in the literature [6,7]. It represents a unifying retrieval model in which the conventional fuzzy set model and the vector space model are special cases. The query-document similarity in the extended boolean model is based on L_p vector norm computations, and is controlled by a parameter p, 1