A comparison of two weighting schemes for Boolean retrieval

Abraham Bookstein

3.1 Introduction

Boolean retrieval logic is the basis of most operating information retrieval systems (IRSs). There are many reasons why this type of system has been so attractive. For example, it allows users to issue requests in which the topics of interest and the relations between them are clearly and precisely stated; the user has considerable flexibility in formulating his request; and the request can be reformulated, as convenient, to equivalent requests that will retrieve the identical set of documents (Bookstein and Cooper, 1976). Further, it is relatively easy to learn how to use such a system, and, given software that is widely available, such systems can be easily implemented to permit an efficient search, even of rather large files. For these reasons, a number of weaknesses inherent in these systems are often overlooked.

A very serious constraint of Boolean systems is the necessity of associating a number of index terms with each document. The problem is that it is often unclear whether a given index term is appropriate for a document: both the decision to include the term and the decision to omit it might result in retrieval errors (false drops in the first case, lost relevant documents in the second). The issuer of a request is similarly constrained either to include a term in his request or to leave it out. It is not possible for a patron to include two terms, while indicating that one is more important than the other.

The above weaknesses have encouraged the development of alternative approaches, such as the use of vector models (Salton, 1968), which permit the patron and the indexer to differentiate index terms by weight. A user of such a system, however, cannot indicate how the terms logically relate to one another. Others have created multi-stage systems, in which a standard Boolean retrieval process first retrieves a set of documents; these documents are then processed by an independent weighting mechanism that assigns to each retrieved document a value representing the importance of the terms by which the document is indexed (Noreault, Koll and McGill, 1977). Unfortunately, such hybrid methods are subject to inconsistencies, in that two logically equivalent requests can retrieve different sets of documents (Bookstein, 1978).

Recently attempts have been made to incorporate weighting into an intrinsically Boolean system (Bookstein, 1980; Buell and Kraft, to be published). It is the purpose of this chapter to present and compare two such techniques. As both have been motivated by, and are most effectively described in terms of, fuzzy sets, we shall briefly define this concept in the next section.

3.2 Fuzzy sets

Boolean retrieval, though often thought of in terms of the relation of structured requests to a document collection, is conveniently defined in terms of sets and set operations. For example, we associate a set with each index term: those documents that are indexed by that term. Similarly, each logical operator is naturally associated with a set operator:

AND ↔ intersection ($\cap$)
OR ↔ union ($\cup$)
NOT ↔ complementation (indicated by a bar over the affected set)

If, in a Boolean request, one replaces the index terms by their associated sets and the logical connectives by their associated set operators, we obtain an expression which, when evaluated, produces the set of retrieved documents. This set will be identical with the set of documents that would be retrieved if a search of each document were made and all those for which the request was satisfied were retrieved. Because of the isomorphism between sets and their operators and statements and their connectives, manipulations of Boolean requests will be consistent with the corresponding manipulations of the sets. Thus, traditional Boolean retrieval can be thought of as the application of set operators to sets associated with index terms.

Our concern in this chapter is to generalise this process so that the importance of, as well as the logical relations among, terms in a request or document can be represented. Perhaps the most direct way that this can be done is by generalising the characteristic function associated with a set. In traditional set theory, if $M_A$ is the characteristic function of the set A, then

$$M_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

Thus, a set can be thought of as a function, and the various set operations can be translated into operations on functions. Given this representation, however, the generalisation we are seeking is immediate: permit $M_A(x)$ to take values between 0 and 1. That is, whereas traditional set theory permits us to say, for any entity, only whether it is in the set or not, we can now indicate degrees of membership. In the context of IRSs, an indexer, instead of simply deciding whether or not to index a document by a given index term, will assign a weight to each term, the weight designating the strength of applicability of that term to the document. The weight of 0 means that the term has not been assigned to the document; the weight of 1 means that the document is fully indexed by the term. Thus, in the instance when a traditional indexer would be uncertain whether or not to assign a term to a document, the fuzzy indexer might assign it, say, with a weight of .3.

Once each document has been indexed in this manner, we can define the fuzzy sets of documents associated with each index term: for a given index term, T, the fuzzy set is defined by the membership function $M_T(x)$, where, for x a document, $M_T(x)$ is the weight with which T was assigned to x. Thus, the graphic conceptualisation of index terms as sets of documents can be extended naturally to the conceptualisation in terms of fuzzy sets: sets with partial membership.

To complete our definition of fuzzy sets, it is necessary to indicate how the traditional set operators can be generalised to act on fuzzy sets.

Union: Traditionally, $M_{A \cup B}(x)$ is equal to 1 if either $M_A(x)$ or $M_B(x)$ is 1; that is,

$$M_{A \cup B}(x) = \max(M_A(x), M_B(x)) \qquad (2)$$

We can directly extend this definition to fuzzy sets, where $M_A$ and $M_B$ are no longer constrained to take only the values 0 and 1.

Intersection: Again we can express the operator in terms of its effect on the characteristic functions of traditional sets:

$$M_{A \cap B}(x) = \min(M_A(x), M_B(x)) \qquad (3)$$

and again this definition generalises to fuzzy sets as defined by their membership function.

Complementation: For fuzzy sets, as for traditional sets,

$$M_{\bar{A}}(x) = 1 - M_A(x) \qquad (4)$$
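The three operators just defined can be sketched in a few lines of Python. This is only an illustration under stated assumptions: membership functions are modelled as dictionaries from document identifiers to weights, a document absent from a dictionary is taken to have weight 0, and the names `union`, `intersection`, `complement` and the sample terms are hypothetical, not part of the chapter.

```python
# Assumption of this sketch: a fuzzy set is a dict mapping
# document id -> membership weight in [0, 1]; missing ids have weight 0.

def union(m_a, m_b):
    """Membership in A u B: max of the two memberships, as in equation (2)."""
    docs = set(m_a) | set(m_b)
    return {d: max(m_a.get(d, 0.0), m_b.get(d, 0.0)) for d in docs}

def intersection(m_a, m_b):
    """Membership in A n B: min of the two memberships, as in equation (3)."""
    docs = set(m_a) | set(m_b)
    return {d: min(m_a.get(d, 0.0), m_b.get(d, 0.0)) for d in docs}

def complement(m_a, universe):
    """Membership in the complement of A: 1 - M_A(x), as in equation (4)."""
    return {d: 1.0 - m_a.get(d, 0.0) for d in universe}

# Hypothetical weighted index: two terms over a three-document collection.
universe = ["d1", "d2", "d3"]
retrieval = {"d1": 1.0, "d2": 0.3}   # d2 is only peripherally about 'retrieval'
fuzzy = {"d2": 0.8, "d3": 0.5}

either = union(retrieval, fuzzy)          # request: retrieval OR fuzzy
both = intersection(retrieval, fuzzy)     # request: retrieval AND fuzzy
not_retrieval = complement(retrieval, universe)
```

When every weight is 0 or 1, these functions behave exactly like ordinary set union, intersection and complementation, which is the sense in which the fuzzy definitions extend the traditional ones.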

Finally, we define the inclusion relation as follows: A is a sub-set of B (denoted $A \subseteq B$) if [...]

[...] $M_A(x) \ge M_C(x)$ for x does not imply that we are assuming this to be true for all documents; it is a property of x, not of A and B.

A4. The membership functions for intersection and union are non-decreasing: if $M_A(x) \ge M_C(x)$, then

$$M_{A \cap B}(x) \ge M_{C \cap B}(x) \quad \text{and} \quad M_{A \cup B}(x) \ge M_{C \cup B}(x)$$

For example, suppose that x is retrieved for the request 'A AND B' with a certain strength. Then a reassessment in which x is evaluated as being more relevant to A than acknowledged earlier should not cause us to reduce the strength with which x is retrieved for the request 'A AND B'.

A5. If $M_{A \cap B}(x) = M_B(x)$, then $M_A(x) \ge M_B(x)$. That is, if x is less about A than it is about B, then it cannot be as much about both A and B as it is about B. This assumption formally describes intersection as an essentially restrictive operator. For similar reasons, we assume: if $M_{A \cup B}(x) = M_B(x)$, then $M_A(x)$ [...]

[...] $M_{A \cap B}(x) \ge \min(M_A(x), M_B(x))$. This follows from A1, A3 and the monotonicity assumption, A4. Without loss of generality, we label the sets so that $M_A(x) \ge M_B(x)$. Thus, $M_{A \cap B}(x) \ge M_{B \cap B}(x) = M_B(x) = \min(M_A(x), M_B(x))$. Similarly, we prove:

C1b. [...] Thus, $M_{A \cup B}(x)$ [...] $\max(M_A(x), M_B(x))$

C3. $M_{A \cup B}(x) = \max(M_A(x), M_B(x))$; $M_{A \cap B}(x) = \min(M_A(x), M_B(x))$. These are immediate consequences of C1 and C2.

C4. $M_{A \cap B}(x)$ and $M_{A \cup B}(x)$ depend only on the values $M_A(x)$ and $M_B(x)$.

Thus, they cannot depend on characteristics of x not related to their membership in A and in B, or on the relationship of the set A to the set B; nor can they be influenced by other documents in the collection. For example, once $M_A(x)$ and $M_B(x)$ have been evaluated, introducing new or deleting old documents from the collection cannot change the strength with which x is retrieved for requests of the form 'A OR B' or 'A AND B'.

For those who find the above, or Bellman and Giertz', criteria persuasive, Zadeh's definitions of set operators are the only ones that will allow us to conceptualise sets as having partial members, and as capable of manipulation in a way that satisfies those properties of set algebra most critical for information retrieval. This development shows that it is possible to extend our notion of Boolean retrieval to include weights, and describes how this can be done. However, our results also have a negative interpretation. They show that if the max and min definitions of fuzzy set operations are not acceptable, it will not be possible to find alternative definitions of the basic operations without sacrificing some very desirable characteristics of Boolean retrieval. In such a situation a totally different approach may be called for.

Fuzzy sets have been discussed before in the context of information retrieval systems. A partial set of papers and authors that indicate the range of effort appears in the references (Negoita and Flonder, 1976; Radecki, 1976; Tahani, 1976; Yager, 1979). Indeed, the application of fuzzy sets to information retrieval has recently been the source of controversy in the information science literature (Robertson, 1978, 1979; Cerny, 1979).
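The claim that the max and min operators preserve the set algebra on which Boolean retrieval relies can be spot-checked numerically. The sketch below is a finite check on a small grid of membership values, not a proof. The function name `satisfies_set_algebra` is hypothetical, and the product and probabilistic-sum pair used for contrast is an illustrative choice that does not appear in the chapter.

```python
from itertools import product

# A small grid of membership values; multiples of 1/4 are exact in floating
# point, so the equality tests below are not disturbed by rounding.
GRID = [i / 4 for i in range(5)]

def satisfies_set_algebra(f_and, f_or):
    """Check commutativity, associativity and distributivity on the grid."""
    for a, b, c in product(GRID, repeat=3):
        checks = [
            f_and(a, b) == f_and(b, a),
            f_or(a, b) == f_or(b, a),
            f_and(a, f_and(b, c)) == f_and(f_and(a, b), c),
            f_or(a, f_or(b, c)) == f_or(f_or(a, b), c),
            f_and(a, f_or(b, c)) == f_or(f_and(a, b), f_and(a, c)),
            f_or(a, f_and(b, c)) == f_and(f_or(a, b), f_or(a, c)),
        ]
        if not all(checks):
            return False
    return True

# Zadeh's operators: intersection = min, union = max.
zadeh_ok = satisfies_set_algebra(min, max)

# Contrast (an assumption of this sketch, not from the chapter):
# intersection = product, union = a + b - a*b. Distributivity fails.
product_ok = satisfies_set_algebra(
    lambda a, b: a * b,
    lambda a, b: a + b - a * b,
)
```

On this grid the max and min pair passes every check, while the product-based pair fails (for instance at a = b = c = 0.5 the two sides of the distributive law are 0.375 and 0.4375), which illustrates the kind of sacrifice the negative interpretation refers to.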
The ultimate test of a new concept such as fuzzy sets is whether a substantial number of researchers find the concept an economical means for comprehending and discussing an otherwise complex issue, and whether the view it presents of the field encourages the generation of new ideas that are fruitful for approaching the field's problems. Among the outstanding examples of such a conceptualisation is the introduction of the vector notion into physics, especially mechanics and electromagnetism. In fact, vectors provide nothing new for the theory of these fields; however, they do provide an enormous economy of presentation, and, by means of the algebraic rules regarding their manipulation, have facilitated research in these areas. Whether fuzzy set theory provides such an economy will only become apparent in time. It has been the author's experience that a mechanism for increasing the flexibility of Boolean retrieval by permitting weighting, while at the same time allowing the manipulability of traditional Boolean algebra, has been very elusive, while the use of fuzzy set ideas immediately suggested a possible approach. As for the aesthetic appeal of fuzzy set theory and its value as an economical device that permits the unification of weighted and traditional indexing systems: this each reader will have to determine for himself.

3.3 A comparison of two weighted Boolean retrieval systems

In an earlier paper (Bookstein, 1980) the following approach towards weighted Boolean retrieval was presented. The weighted indexing of documents was directly defined in terms of fuzzy sets: if T denotes the fuzzy set of documents indexed by some index term, then $M_T(x)$ represents the strength with which that index term is associated with the document denoted by x. As we saw, this definition is consistent with most rules by which sets are manipulated in information retrieval systems.

The weighting of terms in a request is more complex, and was defined as follows (we modify the original notation in this presentation). We first define the fuzzy sets $aA$ and $(1/a)A$, where a is a number between 0 and 1, as follows:

$$M_{aA}(x) = a\,M_A(x)$$

and

$$M_{(1/a)A}(x) = \begin{cases} (1/a)\,M_A(x) & \text{if this quantity is less than or equal to 1} \\ 1 & \text{otherwise} \end{cases}$$

We can now define a weighted request and use these definitions to define the fuzzy set retrieved by any such request. To avoid ambiguity, when a single character represents both a request and its associated set, we shall print the character in bold-face type to represent the set. We first note that any Boolean request can be uniquely represented in one of the following forms:

(1) T (where T is an index term)
(2) E OR F (where E and F are legitimate requests)
(3) E AND F
(4) NOT E

(We omit parentheses to simplify notation, although they can be introduced when needed to avoid ambiguity.)

We now permit a user to assign a weight between 0 and 1 to any request, and denote the weight a assigned to request E by aE. The request, E, could itself contain weighted components. These requests are translated by the system into the following set operations:

(1) aE: retrieves the set $a\mathbf{E}$ if E retrieves the set $\mathbf{E}$
(2) aE OR bF: retrieves the set $a\mathbf{E} \cup b\mathbf{F}$
(3) aE AND bF: retrieves the set $(1/a)\mathbf{E} \cap (1/b)\mathbf{F}$

NOT aE is ambiguous and its definition more arbitrary. Standing alone, we define it as follows:

(4') NOT aE: retrieves $\overline{(1/a)\mathbf{E}}$

If NOT aE is connected to another expression by an OR operator, then

(4'') (NOT aE) OR F: retrieves $\overline{(1/a)\mathbf{E}} \cup \mathbf{F}$

If NOT aE is connected to another expression by an AND operator, then

(4''') (NOT aE) AND F: retrieves $\overline{a\mathbf{E}} \cap \mathbf{F}$

We note that NOT aE is a different request from a(NOT E), and the system must take care in deciding which form to use. In situations in which the patron tentatively includes a term of the form NOT E in his request but is uncertain whether it really ought to be there, then the a(NOT E) form is indicated. We expect this to be the typical situation, and it should be assumed when there is doubt.

The properties of this system are described in Bookstein (1980); in particular, rules are given that permit one to translate one weighted Boolean expression into others that are equivalent. For example, corresponding to the usual relation A AND (B OR C) = (A AND B) OR (A AND C) we have

$$aA \text{ AND } (bB \text{ OR } cC) = b(abA \text{ AND } B) \text{ OR } c(acA \text{ AND } C)$$

or, more naturally, in set-theoretic terms,

$$(1/a)A \cap (bB \cup cC) = ((1/a)A \cap bB) \cup ((1/a)A \cap cC) = b((1/ab)A \cap B) \cup c((1/ac)A \cap C)$$

We believe that the above system presents to the user a powerful and flexible language that is easy to learn and use. The complexities of the system are invisible to the user, who merely assigns weights, if desired, to index terms and/or component expressions; this approach offers to the system the capability of changing the request, as convenient, to another that will retrieve the identical fuzzy set. We further believe that the weighting scheme reasonably models what the user means when he indicates that one term in his request is less important than another.

An alternative approach is possible, however, that simplifies the relationship between the requests and the set operations, and yet continues to permit flexible manipulation of the requests and their associated sets. We shall now describe such an alternative.

Weighted indexing in the second approach is defined as before. The indexing process structures the collection into fuzzy sets, each set associated with an index term. In the preceding approach we represented the strength of a term or expression in a request by modifying the set itself; that is, by changing the strength with which documents appear in the associated set. Thus, the request .8T, for T an index term, retrieves the set T, with each document receiving a reduced weight. But the user can express his interest in T somewhat differently. If he is very interested in T, he can tell the system that he would like to receive all the documents about T, regardless of how peripheral they are to T. On the other hand, if the user is only weakly interested in T, he can express this by requesting that the system retrieve only the most relevant documents.
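For comparison with the threshold idea just introduced, the first approach's set translations and its weighted distributive rule can be spot-checked in Python. This is a sketch under stated assumptions: the helper names `scale`, `boost`, `lhs` and `rhs` are hypothetical, request weights are kept strictly above 0 so that 1/a is defined, and an unweighted operand is treated as having weight 1. The check covers only a small grid of values and is not a proof.

```python
import math
from itertools import product

def scale(a, m):
    """Membership value in aA: a * M_A(x)."""
    return a * m

def boost(a, m):
    """Membership value in (1/a)A: M_A(x) / a, capped at 1."""
    return min(1.0, m / a)

def lhs(a, b, c, m_a, m_b, m_c):
    """aA AND (bB OR cC), evaluated as (1/a)A n (bB u cC)."""
    return min(boost(a, m_a), max(scale(b, m_b), scale(c, m_c)))

def rhs(a, b, c, m_a, m_b, m_c):
    """b(abA AND B) OR c(acA AND C), with B and C unweighted inside the ANDs."""
    left = scale(b, min(boost(a * b, m_a), m_b))
    right = scale(c, min(boost(a * c, m_a), m_c))
    return max(left, right)

weights = [0.2, 0.5, 0.8, 1.0]        # request weights a, b, c (assumed > 0)
members = [0.0, 0.3, 0.6, 0.9, 1.0]   # document membership values

# Compare both sides of the rule for every combination on the grid,
# allowing for floating-point rounding.
identity_holds = all(
    math.isclose(lhs(a, b, c, x, y, z), rhs(a, b, c, x, y, z), abs_tol=1e-12)
    for a, b, c in product(weights, repeat=3)
    for x, y, z in product(members, repeat=3)
)
```

The check reflects the asymmetry of the first approach: an OR operand is weakened by multiplying its membership by a, while an AND operand is weakened by dividing by a, which moves the set towards the universal set and so makes it less restrictive.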
In other words, instead of modifying the set associated with T and retrieving that set, the user can introduce his strength of interest into the retrieval process by setting a threshold and retrieving only documents whose membership values exceed this threshold. This criterion can be generalised and expressed more precisely as follows: The user represents the strength of his interest in a term or expression, E, by presenting a non-decreasing 'filter' function $f(x, \theta)$ to the system; $\theta$ represents a set of parameters. ($f(x, \theta)$, in fact, depends on x through the value $M_E(x)$, but this detailed dependence is not explicitly indicated for notational simplicity.) If E, with membership function $M_E(x)$, is the set that would have been retrieved before application of f, then the set actually retrieved, denoted by $E_\theta$, has as its membership function $f(x, \theta)M_E(x)$. Thus, f acts as a filter through which the set E must pass before being retrieved.
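A minimal sketch of the filter idea follows, assuming the simplest possible filter: a cut-off that passes a document unchanged when its membership reaches the threshold and suppresses it otherwise. The names `threshold_filter` and `apply_filter`, the use of `>=` rather than a strict inequality, and the sample weights are assumptions made for illustration.

```python
def threshold_filter(theta):
    """Return a filter f(., theta) that depends on x only through M_E(x).

    Assumption of this sketch: f is 1 when the membership value reaches
    theta and 0 otherwise, which is non-decreasing in the membership value.
    """
    def f(membership):
        return 1.0 if membership >= theta else 0.0
    return f

def apply_filter(f, m_e):
    """Membership function of the filtered set: f(x, theta) * M_E(x)."""
    return {doc: f(weight) * weight for doc, weight in m_e.items()}

# Hypothetical weighted index for one term T.
m_t = {"d1": 0.9, "d2": 0.5, "d3": 0.2}

# Strong interest in T: a low threshold lets peripheral documents through.
strong_interest = apply_filter(threshold_filter(0.1), m_t)

# Weak interest in T: a high threshold keeps only the most relevant documents.
weak_interest = apply_filter(threshold_filter(0.8), m_t)
```

Documents that pass the filter keep their original weight; this is the feature that distinguishes the threshold scheme from the first approach, where the weight itself is rescaled.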

Indeed, if D represents the full document space, a fuzzy set, E, can be described as what remains of D after passing through the filter $f_E$. Thus, $E_\theta$ is the set of documents that passed through the two filters $f_E$ and $f_\theta$: one representing the documents' content, the other the user's interest. In general, the retrieval criterion can be loosely described as establishing conditions regarding the filters through which a retrieved document must pass. For example, $E_{\theta_1} \cup F_{\theta_2}$ is the set of documents that passed through either $E_{\theta_1}$ or $F_{\theta_2}$. As possibilities for the form of f we may have:

(1) Simple threshold. The user simply gives a cut-off value, $\theta_0$, and f is defined by

$f(x, \theta_0) = \{$ [...] if $M(x)$ [...]

[...] This is not true if [...] is substituted for $aE$. We finally note that the algebraic properties of this system, while not difficult to show, do require proof. We refer the reader to Buell and Kraft for a more complete discussion of the properties of these systems.

The main advantage of the second approach is its simplicity: intersection and union are treated symmetrically. Many of the algebraic properties are immediate consequences of the definitions; they follow directly from the truth of these properties for fuzzy sets. Other relations follow easily. For example: if $E_\theta$ denotes E after passing through the filter, $f(x, \theta)$, then

$$(A \cup B)_\theta = A_\theta \cup B_\theta$$

and

$$(A \cap B)_\theta = A_\theta \cap B_\theta$$

This follows immediately from the multiplicative properties of f:

$$f(x, \theta)\max\{M_A(x), M_B(x)\} = \max\{f(x, \theta)M_A(x), f(x, \theta)M_B(x)\}$$

and similarly for the intersection operation. Similarly, we find

$$(A_{\theta_1} \cup B_{\theta_2})_{\theta_3} = A_{\theta_1\theta_3} \cup B_{\theta_2\theta_3}$$

where, for example, $A_{\theta_1\theta_3}$ is equal to $(A_{\theta_1})_{\theta_3}$: its membership function is $f(x, \theta_1)f(x, \theta_3)M_A(x)$.

A weakness of this approach, at least given the class of filters we are using, is that it may distort the intentions of the patron who is using threshold values to represent the importance of a term. Consider, for example, the request 'A OR B', where A is more important than B for the patron. Suppose that document $d_1$ has strength .9 in A and 0 in B, while $d_2$ has strength 0 in A and .9 in B. In an unweighted request, documents $d_1$ and $d_2$ would be retrieved with equal weights of .9. But if A is more important to the issuer of the request, then we would expect $d_1$ to be retrieved with greater strength than $d_2$. This is indeed the case for the first approach. For example, the weighted request '.8A OR .2B' would retrieve $d_1$ with strength .72 and $d_2$ with strength .18. On the other hand, using the class of simple threshold functions, we may issue the request '$A_{.?}$ OR $B_{.7}$', the subscripts representing the thresholds. Here, however, both documents are retrieved with equal weight. Although a fuzzier threshold function could lower the weight of $d_2$, the thrust of the second approach is to remove less relevant documents rather than to alter the strength with which [...] relevant (to a component of the request) documents are retrieved; thus, for documents such as $d_2$, the system is relatively insensitive. We see, then, that unlike the first approach, the threshold approach does not directly model the notion of importance as applied to an index term.

This distinction between the two approaches can be underscored by examining the effect on retrieval of changing the weights in a Boolean expression. Consider, for example, the request '$A_\theta$ AND B', which is to retrieve the set $A_\theta \cap B$. We observe, following the arguments presented in Bookstein (1980), that we can interpret such an intersection as an operation in which $A_\theta$ acts on the set B to reduce its membership function. With such an interpretation, it is reasonable to expect that as A becomes less important in the request (so $\theta$ increases), $A_\theta$ should have less impact on B; indeed, in the limiting case of $\theta = 1$, $A_\theta$ should approach the universal set. But in fact, as $\theta$ increases, $A_\theta$ becomes more restrictive as an operator. Indeed, if no document has a membership-function value of unity, in the limiting case of $\theta = 1$, $A_\theta$ is the null set and $A_\theta \cap B$ is also the null set, which is as extreme a change of B as we can cause. On the other hand, $A_\theta$ approaching the empty set as $\theta$ approaches unity is proper in the request '$A_\theta$ OR B'. It was, in fact, reasoning such as this that caused us to define two types of weights in the first approach: one for unions, the other for intersections.

In conclusion, we note that we have described two approaches, tied together by their conceptual basis in fuzzy set theory, for representing weighted Boolean requests. Although both may be used by a patron to configure precisely the request to his need, they are doing rather different things. The first represents directly the importance of component Boolean expressions to the user; the other establishes conditions that must be met before a document can be retrieved. Both, or variants thereof, may have a place in information retrieval systems. Their relative strengths should appear through empirical investigation.
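The contrast between the two approaches can be reproduced numerically with the two documents of the example. The sketch below is illustrative only. The function names are hypothetical, the cut-off filter uses `>=` as an assumption, and the threshold 0.1 chosen for A is an arbitrary stand-in, since the value printed for A in the source is not legible.

```python
import math

def weighted_or(a, m_a, b, m_b):
    """First approach: 'aA OR bB' retrieves aA u bB."""
    return max(a * m_a, b * m_b)

def threshold(theta, m):
    """Second approach, simple cut-off: keep the weight if it reaches theta."""
    return m if m >= theta else 0.0

def threshold_or(theta_a, m_a, theta_b, m_b):
    """Thresholded 'A OR B': union of the two filtered sets."""
    return max(threshold(theta_a, m_a), threshold(theta_b, m_b))

def threshold_and(theta_a, m_a, m_b):
    """Thresholded 'A AND B' with a filter on A only."""
    return min(threshold(theta_a, m_a), m_b)

# (membership in A, membership in B) for the two documents of the example.
docs = {"d1": (0.9, 0.0), "d2": (0.0, 0.9)}

# First approach, request '.8A OR .2B'.
first = {d: weighted_or(0.8, m_a, 0.2, m_b) for d, (m_a, m_b) in docs.items()}

# Second approach, thresholds 0.1 on A (assumed) and 0.7 on B.
second = {d: threshold_or(0.1, m_a, 0.7, m_b) for d, (m_a, m_b) in docs.items()}

# Raising the threshold on A makes 'A AND B' more restrictive, not less:
# a document with M_A = 0.6 and M_B = 0.8 drops out once theta passes 0.6.
restrictiveness = [threshold_and(t, 0.6, 0.8) for t in (0.0, 0.5, 1.0)]
```

Under these assumptions the first approach ranks d1 above d2 (about .72 against .18), while the threshold approach returns both at .9, and the last line shows the intersection collapsing to 0 as the threshold on A approaches 1.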

3.5 Postscript

After submission of the abstract of this chapter, the author learned that a very similar approach was being developed independently by D. A. Buell and D. H. Kraft. We wish to acknowledge their contribution, and to thank the authors for making their work available to us.

References

BELLMAN, R. and GIERTZ, M. (1973). 'On the analytic formalism of the theory of fuzzy sets', Information Sciences, 5, 149-156
BIRKHOFF, G. and MACLANE, S. (1953). A Survey of Modern Algebra, Macmillan, New York
BOOKSTEIN, A. (1978). 'On the perils of merging Boolean and weighted retrieval systems', Journal of the ASIS, 29, 156-158
BOOKSTEIN, A. (1980). 'Fuzzy requests: an approach to weighted Boolean searches', Journal of the ASIS, 31, 240-247
BOOKSTEIN, A. and COOPER, W. S. (1976). 'A general mathematical model for information retrieval systems', Library Quarterly, 46, 153-167
BUELL, D. A. and KRAFT, D. H. (to be published). 'A model for a weighted retrieval system'
CERNY, B. A. (1979). 'A reply to Robertson's diatribe on the nature of fuzz', Journal of the ASIS, 30, 356-357
NEGOITA, C. V. and FLONDER, P. (1976). 'On fuzziness in information retrieval', International Journal of Man-Machine Studies, 8, 711-716
NOREAULT, T., KOLL, M. and MCGILL, M. (1977). 'Automatic ranked input from Boolean searches in SIRE', Journal of the ASIS, 28, 333-341
RADECKI, T. (1976). 'Mathematical model of information retrieval systems based on a concept of fuzzy thesaurus', Information Processing and Management, 12, 313-318
ROBERTSON, S. E. (1978). 'On the nature of fuzz: a diatribe', Journal of the ASIS, 29, 304-307
ROBERTSON, S. E. (1979). 'On fuzzy sets: reply to Cerny', Journal of the ASIS, 30, 357-358
SALTON, G. (1968). Automatic Information Organization and Retrieval, McGraw-Hill, New York
TAHANI, V. (1976). 'A fuzzy model of document retrieval systems', Information Processing and Management, 12, 177-188
YAGER, R. R. (1979). A Logical Online Bibliographic Searcher: An Application of Fuzzy Sets, Iowa College, School of Business Administration, Technical Report RRY79-08
ZADEH, L. A. (1965). 'Fuzzy sets', Information and Control, 8, 338-353

Appendix: Bellman and Giertz' criteria for fuzzy sets

Much of the power of Boolean retrieval is a consequence of several axioms of Boolean algebra, and if fuzzy-retrieval systems are to retain this ability to manipulate requests, these axioms must be satisfied. Hence, we demand of any acceptable generalisation that the following properties be valid:

(a) Commutative: $A \cup B = B \cup A$; $A \cap B = B \cap A$
(b) Associative: $A \cap (B \cap C) = (A \cap B) \cap C$; $A \cup (B \cup C) = (A \cup B) \cup C$
(c) Distributive: $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$; $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$

We would expect that, for those documents that are full members or null members of sets, the fuzzy operations would act in the same way as traditional operations. Hence:

(d) If $M_A(x) = M_B(x) = 1$, then $M_{A \cap B}(x) = 1$; and if $M_A(x) = M_B(x) = 0$, then $M_{A \cup B}(x) = 0$

The above requirements directly tie fuzzy sets to traditional sets. In addition, three properties intrinsically pertaining to fuzzy sets seem compelling.

(e) $M_{A \cup B}(x)$ and $M_{A \cap B}(x)$ should depend continuously on $M_A(x)$, and certainly should not decrease if $M_A(x)$ increases.

(f) If $M_A(x) = M_B(x)$, then $M_{A \cup B}(x)$ and $M_{A \cap B}(x)$ should actually increase as $M_A(x)$ increases.

And finally:

(g) $M_{A \cup B}(x) \ge \max(M_A(x), M_B(x))$; $M_{A \cap B}(x) \le \min(M_A(x), M_B(x))$

For example, it is unreasonable that a document be retrieved with greater strength for the request 'A AND B' than it would for a request for documents about 'A', or for documents about 'B'. With these assumptions, it is shown that $M_{A \cap B}(x) = \min(M_A(x), M_B(x))$ and $M_{A \cup B}(x) = \max(M_A(x), M_B(x))$. If, in addition, we accept the set-theoretic definition of sub-sets, $A \subseteq B$ if and only if $A \cap B = A$, then it is easy to see that $A \subseteq C$ if and only if $f_A(x)$