
Using Genetic Algorithms to Find Suboptimal Retrieval Expert Combinations

Holger Billhardt, Daniel Borrajo, Victor Maojo
ESCET, Departamento de Informática, Univ. Rey Juan Carlos, 28933 Móstoles (Madrid), Spain
Univ. Carlos III de Madrid, 28911 Leganés (Madrid), Spain
Facultad de Informática, Univ. Politécnica de Madrid, 28660 Boadilla del Monte (Madrid), Spain
[email protected], [email protected], [email protected]

ABSTRACT

combinations of different retrieval models ([1, 10, 16]). In these studies the final similarity values are obtained from a linear combination of the similarity values of the different experts. The question of which retrieval experts should be combined has been addressed in [12, 15], and different combination functions have been analyzed in [8]. In this article we describe and evaluate the use of Genetic Algorithms (GA) [9] for the task of automatically obtaining a suboptimal linear combination of retrieval experts for a given document collection. In other data fusion studies (e.g. [1, 10, 16]) the experts to be combined are either fixed or selected based on some indicators, and only the parameters of the combination function are optimized in a learning process. In contrast, our approach automatically selects both the experts to be combined and the parameters of the combination function (e.g. the weights given to each expert in a linear combination). Furthermore, instead of using just a small set of combination candidates, we use a massive set of 6056 different experts, consisting of variants of the classical Vector Space Model (VSM) and of an IR model called Context Vector Model (CVM). The Context Vector Model has been introduced by Billhardt et al. [3]. It is a semantic indexing approach that uses co-occurrence data to estimate the probability-based semantic meaning of a term or its context in relation to other terms. In contrast to other semantic indexing approaches, e.g. Latent Semantic Indexing (LSI) [6] or Hofmann's Probabilistic Latent Semantic Indexing (PLSI) [10], CVM maintains the original (term) dimensions of the vector space. Therefore, it can be easily combined with VSM, and common VSM term weighting techniques can be used.
The experiments reported in [3] have shown that particular CVM variants perform very well on some collections, and it has been argued that combinations of CVM variants with word-matching methods may improve retrieval effectiveness. Several researchers have previously used GAs and genetic programming for IR tasks. Much of the research is concerned with improving query representations in a relevance feedback scenario or in information filtering [4, 5, 11, 18]. GAs have also been used for adapting the matching function employed in a retrieval system [7]. The latter can be considered an expert selection approach and, in this sense, it is similar to our framework. However, in our setting GAs select multiple experts by choosing a particular term weighting scheme and document and query transformation functions.

A c o m m o n problem of expert combination approaches in Information Retrieval (IR) is the selection of both, the experts to be combined and the combination function. In most studies the experts are selected from a r a t h e r small set of candidates using some heuristics. Thus, only a r e d u c e d n u m b e r of possible combinations is considered and other possibly better solutions are left out. In this p a p e r we propose the use of genetic algorithms to find a s u b o p t i m a l combination of experts for a d o c u m e n t collection. Our system a u t o m a t i c a l l y determines both, the experts to be combined and the p a r a m eters of t h e combination function. We test and evaluate the approach on four classical text collections. T h e results show that~ the learnt combination strategies perform b e t t e r t h a n any o f t h e individual m e t h o d s and that genetic algorithms provide a viable m e t h o d to learn e x p e r t combinations.

Keywords
Information retrieval, Data fusion, Genetic algorithms

1. INTRODUCTION

Many IR models have been proposed in the past to improve the effectiveness of information retrieval systems. In recent years, IR researchers have analyzed the possibility of combining the results of multiple IR models or different query representations in a way that the advantages of different models are brought together. This approach is known as data fusion, and the combined retrieval systems are often called retrieval experts. Research has shown that well-selected combinations of multiple retrieval systems improve retrieval effectiveness in contrast to using single systems. Some of the proposed methods center the data fusion approach on combining the similarities obtained with different query representations or on combining different query representations directly ([2, 8]). Other researchers investigated

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC 2002, Madrid, Spain
Copyright 2002 ACM 1-58113-445-2/02/03 ...$5.00.


In the first step, the term frequencies of all terms are obtained for all documents. Then, an n x n term correlation matrix C is calculated, which defines the relationships among terms. Each element c_kj of this matrix specifies the degree to which term t_k indicates the existence of term t_j. The k-th column vector of C, denoted by c_k, is called the term context vector for term t_k. These vectors are semantically enriched representations of terms and reflect their context in relation to other terms. In this paper we use four definitions for C, denoted prob, intu, probOdiag and intuOdiag. All of them are based on the concept of co-occurrence frequency, i.e. the number of times two different terms occur in the same documents.
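As a toy sketch of steps ii) and iii), the following builds a co-occurrence-based correlation matrix from a tf matrix and then turns a tf vector into a context vector. The off-diagonal normalization used here is a plausible stand-in rather than the exact prob or intu definition from [3], and all data are invented:

```python
import math

# tf[k][i] = frequency of term t_i in document d_k (toy data, invented).
# Step ii): a co-occurrence-based term correlation matrix C with ones on
# the diagonal; the off-diagonal normalization is illustrative only.
def correlation_matrix(tf):
    m, n = len(tf), len(tf[0])
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        total_i = sum(tf[k][i] for k in range(m))
        for j in range(n):
            if i == j:
                C[i][j] = 1.0
            elif total_i > 0:
                C[i][j] = sum(tf[k][i] * tf[k][j] for k in range(m)) / total_i
    return C

# Step iii): document context vector as the tf-weighted centroid of the
# length-normalized term context vectors (here: rows of C, a simplification).
def context_vector(tf_doc, C):
    n = len(tf_doc)
    norms = [math.sqrt(sum(C[k][j] ** 2 for j in range(n))) or 1.0
             for k in range(n)]
    total = sum(tf_doc) or 1.0
    return [sum(tf_doc[k] * C[k][j] / norms[k] for k in range(n)) / total
            for j in range(n)]
```

With the identity matrix as C, the context vector reduces to the normalized tf distribution, which illustrates how CVM degenerates to plain tf indexing when terms are uncorrelated.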

In Sections 2 and 3 we briefly review the Vector Space Model and the Context Vector Model. Section 4 describes the set of retrieval experts and how they are combined into retrieval strategies. Section 5 explains the use of GAs in our framework. Experimental results on four different test collections are shown and discussed in Section 6. Finally, Section 7 gives some conclusions and directions for future research.

2. THE VECTOR SPACE MODEL

VSM [14] is one of the most popular models used in information retrieval. It implements full-text automatic indexing and relevance ranking. A VSM variant is characterized by three parameters: i) a transformation function that transforms queries and documents into a representation in the vector space (local term weights), ii) a term weighting scheme to be applied globally (global term weights), and iii) a similarity measure. Let {d_i | 1 <= i <= m} be the set of documents in a collection, and {t_i | 1 <= i <= n} the set of index terms. Then, d_i = (w_i1, w_i2, ..., w_in) is the vector for the i-th document, where w_ij is the local weight of term t_j in d_i. Similarly, a query q is represented as a query vector q = (w_q1, w_q2, ..., w_qn). Usually the cosine matching function is used to calculate the similarity between queries and documents:

s(d_i, q) = ( Σ_{j=1..n} w_ij · wg(t_j) · w_qj · wg(t_j) ) / ( |d_i| · |q| )    (1)

where wg(t_k) is the global term weight of term t_k, and |d_i| and |q| are the Euclidean lengths of the weighted document and query vectors. In the classical VSM the transformation function for documents and queries is defined by the term frequencies (tf): w_ij = tf_ij, i.e. the number of times term t_j occurs in document d_i. The popular Inverse Document Frequency (idf) is often used as the global term weighting scheme:

wg(t_j) = idf(t_j) = log2( m / df_j ) + 1    (2)

where df_j is the number of documents containing term t_j. Often the same transformation function and global term weighting scheme are applied to queries and documents, but different functions may be used. Other definitions of local and global term weights have been proposed. However, the classical VSM is characterized by the fact that only the terms contained in a document (query) are represented in its vector.

3. THE CONTEXT VECTOR MODEL

In this section we briefly review the Context Vector Model for IR. A deeper description of CVM and its variants can be found in [3]. CVM is based on VSM and can be considered a variation of VSM. The difference is that it incorporates term dependencies in the document and query representations. In this sense, each term in a document indicates the existence of a set of different concepts, where the concepts themselves are represented by terms. As a result, a term that does not actually occur in a document may have a high value in its vector representation because it is strongly related to many of the document's terms. Context vector indexing consists of three steps: i) compute the classical tf-based document vectors, ii) generate a term correlation matrix that reflects the term relationships, and iii) transform the tf-based document vectors into context vectors.

The four matrix definitions are:

prob: c_ij = 1 if i = j; c_ij = ( Σ_{k=1..m, tf_kj ≠ 0} tf_ki ) / ( Σ_{k=1..m} tf_ki ) if i ≠ j    (3)

intu: c_ij = 1 if i = j; c_ij = ( Σ_{k=1..m} tf_ki · tf_kj ) / ( Σ_{k=1..m} tf_ki ) if i ≠ j    (4)

probOdiag and intuOdiag are defined in the same way, but c_ij is set to 0 if i = j. See [3] for a discussion of the matrix definitions.

Document context vectors d_i = (w_i1, w_i2, ..., w_in) are obtained by calculating the centroid of the term context vectors for all terms belonging to a document:

w_ij = ( Σ_{k=1..n} tf_ik · c_kj / |c_k| ) / ( Σ_{k=1..n} tf_ik )    (5)

where |c_k| is the Euclidean vector length of term context vector c_k.

In the retrieval process a query is transformed into a context vector q = (w_q1, w_q2, ..., w_qn). Then, this vector is matched against the document context vectors using some similarity measure and global term weights. The retrieval result is a list of documents ranked by their similarity to the query.

4. RETRIEVAL STRATEGIES

4.1 The Set of Retrieval Experts

The set of retrieval experts in our framework consists of different VSM and CVM variations. All experts use the cosine similarity measure (1) to compute the relevance of a document to a query. An expert is specified by a tuple (dtf, qtf, dw, qw, cvmtype), where dtf and qtf denote the transformation functions (for documents and queries, respectively), dw and qw denote the global term weighting schemes (for documents and queries, respectively), and cvmtype specifies the term correlation matrix used in CVM indexing. The last parameter is only required if a CVM-dependent transformation function is used. It can be one of the four matrix definitions given in Section 3: prob, intu, probOdiag, and intuOdiag. We implemented two basic transformation functions: i) tf: w_ij is defined by the term frequencies (tf_ij), and ii) cvm: w_ij is defined by equation (5).
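As a toy sketch of scoring with globally weighted cosine similarity, in the spirit of equations (1) and (2) (the collection below is invented):

```python
import math

def idf(df, m):
    # Equation (2)-style global weight: log2(m / df) + 1.
    return math.log2(m / df) + 1

def cosine(u, v):
    # Standard cosine similarity between two weighted vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def score(doc_tf, query_tf, dfs, m):
    # Apply global idf weights to the local tf weights, then match.
    g = [idf(df, m) for df in dfs]
    d = [w * gw for w, gw in zip(doc_tf, g)]
    q = [w * gw for w, gw in zip(query_tf, g)]
    return cosine(d, q)
```

A query with the same tf vector as a document scores 1.0, and a query sharing no terms with the document scores 0.0.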


The Vector Space Model assumes that the angle between the vectors of a document and a query determines their similarity. If only one expert is used, both measures will lead to exactly the same ranking. However, this is not the case when experts are combined. Suppose we have two document vectors d_1 and d_2, a query vector q, and two retrieval experts RE_1 and RE_2. Furthermore, suppose that the angles between d_1 and q in RE_1 and RE_2 are 30° and 20°, respectively, and that the angles between d_2 and q are 10° and 40°. A linear combination of the cosine values of RE_1 and RE_2 with uniform weights will judge d_1 more relevant than d_2 (with an estimate of 1.81 versus 1.75). However, from the VSM assumption one could argue that both documents are equally relevant (the averages of the angles are the same). This is assured using the proposed equation (with a similarity estimate of 2.27).
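The numbers in this example are easy to check; the following sketch reproduces the 1.81 and 1.75 cosine estimates and the equal angle-based estimates of 2.27:

```python
import math

def combine_cos(angles_deg):
    # Uniform-weight linear combination of the cosine values.
    return sum(math.cos(math.radians(a)) for a in angles_deg)

def combine_angle(angles_deg):
    # Combination in the spirit of the proposed equation: sum of
    # (pi/2 - angle), since arccos(cos(angle)) is the angle itself.
    return sum(math.pi / 2 - math.radians(a) for a in angles_deg)

d1_cos = combine_cos([30, 20])    # ~1.81
d2_cos = combine_cos([10, 40])    # ~1.75
d1_ang = combine_angle([30, 20])  # ~2.27
d2_ang = combine_angle([10, 40])  # ~2.27 (same average angle)
```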

With respect to query indexes, it has been argued that it is sometimes better to use binary instead of tf indexes [3]. We defined two additional transformation functions for queries: i) bin: w_qj is 1 if t_j occurs in q and 0 otherwise, and ii) cvmbin: w_qj is obtained by using the binary query vectors as the starting point for the query context vector calculation. 14 different term weighting schemes were used:

1. no: constant weight of 1 for all terms,

2. idf: inverse document frequency (equation 2),

3. tfmamd: modified average mean deviation over tf document vectors,

4. tfmvar: modified variance over tf document vectors,

5. dcvmamd: modified average mean deviation over document context vectors,

6. dcvmvar: modified variance over document context vectors,

7. tcvmamd: modified average mean deviation of term context vectors,

8. tcvmvar: modified variance of term context vectors,

9. idftfmamd, idftfmvar, idftcvmamd, idftcvmvar, idfdcvmamd, and idfdcvmvar: combinations of the previous weights with idf.

Weights 3 to 6 measure the dispersion of the local weights of a term across the tf and document context vectors, respectively. Weights 7 and 8 measure the dispersion of the values in term context vectors. We define the modified average mean deviation of a sample X of size k by 1 + ( Σ_{i=1..k} |X_i − X̄| ) / ( m · X̄ ), where X̄ is the sample mean. The modified variance is given by 1 + log2( 1 + S²_X / X̄ ), where S²_X is the sample variance. In tcvmvar we omitted the log. Before calculating these weights, tf and CVM document vectors were normalized to a Euclidean length of 1. CVM-dependent weights are calculated for each matrix definition and can only be used if documents and/or queries are indexed with a CVM approach. A discussion of the different term weights can be found in [3].

4.2 Combining Experts

Experts are combined by means of a linear combination of their individual similarity scores. Let {RE_1, ..., RE_k} be a set of retrieval experts and let s_REj(d_i, q), for 1 <= j <= k, denote the similarity of document d_i to query q calculated with expert RE_j. Then the overall similarity estimate of the combination of {RE_1, ..., RE_k} is given by:

s(d_i, q) = Σ_{j=1..k} p_j · ( π/2 − arccos( s_REj(d_i, q) ) )    (6)

where p_j, for 1 <= j <= k, are weights given to the individual experts. A set of experts with associated combination weights is called a retrieval strategy. We use π/2 − arccos( s_REj(d_i, q) ) instead of the cosine values because this function grows linearly as the angle between the two vectors decreases. We think that this corresponds better to the underlying assumption of the Vector Space Model that the angle between the vectors of a document and a query determines their similarity.

5. RETRIEVAL STRATEGY SELECTION WITH GENETIC ALGORITHMS

Genetic algorithms are inspired by the principles of selection and inheritance of natural evolution [9]. GAs can be viewed as search techniques and are often used to solve optimization problems where the search space is very big. They start with an initial population of individuals, each one representing a solution to the problem. In the classical approach each individual is represented as a string of bits. The goal of the GA is to produce new populations of better individuals. A new generation is built in two steps: i) selecting the individuals that will be used to create the new generation, and ii) applying a set of genetic operations to the selected individuals. In our framework, the goal of the GA is to find a suboptimal retrieval strategy (combination of retrieval experts) for a given document collection. Each individual of the population represents a set of k retrieval experts, where each expert is an instantiation of the parameters (dtf, qtf, dw, qw, cvmtype) together with its weight in the combination. Experts are encoded as strings of binary genes of length 18, i = (i_1, ..., i_18), where the encoding is as follows:

1. i_1: document transformation function (dtf): 0 - tf, 1 - cvm

2. (i_2, i_3, i_4, i_5): global term weighting scheme for documents (dw): 0000 - no, 0001 - idf, ...

3. (i_6, i_7): query transformation function (qtf): 00 - tf, 01 - bin, 10 - cvm, 11 - cvmbin

4. (i_8, i_9, i_10, i_11): global term weighting scheme for queries (qw): 0000 - no, 0001 - idf, ...

5. (i_12, i_13): type of term correlation matrix used in CVM indexing (cvmtype): 00 - prob, 01 - probOdiag, 10 - intu, 11 - intuOdiag (only used in CVM experts).

6. (i_14, ..., i_18): weight of the expert in the combination (between 0 and 32).

Thus, each individual is the concatenation of k strings encoding k particular experts. Because of the encoding some individuals may be invalid or incoherent (for example, two combinations of (i_2, i_3, i_4, i_5) are invalid because 4 bits encode 14 weighting schemes). We control such individuals by assigning them a fitness value of zero.
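Decoding one 18-gene expert string into its parameters can be sketched as follows (the bit layout follows the list above; the dictionary keys are illustrative):

```python
WEIGHTING_SCHEMES = 14  # 4 bits encode only 14 valid schemes

def decode_expert(bits):
    # bits: string of 18 characters '0'/'1' encoding one expert.
    assert len(bits) == 18
    val = lambda s: int(s, 2)
    expert = {
        "dtf": ["tf", "cvm"][val(bits[0])],
        "dw": val(bits[1:5]),
        "qtf": ["tf", "bin", "cvm", "cvmbin"][val(bits[5:7])],
        "qw": val(bits[7:11]),
        "cvmtype": ["prob", "probOdiag", "intu", "intuOdiag"][val(bits[11:13])],
        "weight": val(bits[13:18]),
    }
    # Invalid dw/qw codes make the whole individual invalid (fitness 0).
    expert["valid"] = (expert["dw"] < WEIGHTING_SCHEMES
                       and expert["qw"] < WEIGHTING_SCHEMES)
    return expert
```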


we did a T-test on paired observations. For all four collections, T > 2 implies a statistically reliable improvement at a significance level lower than 5%. The results show that appropriate linear combinations of retrieval experts do better than any of the individual systems. On all four collections the learnt combinations of two experts perform better than the best individual variant. Moreover, the statistical significance of the results increases when using two instead of one expert. The improvement is small on MED but impressive on the other collections. Our explanation of this fact is that the best expert for the MED collection is already very good and no significantly better strategy can be found when adding new experts. As the number of experts grows, precision increases rather slowly and may even fall. The latter is due to the nature of genetic algorithms. The size of the search space increases dramatically as the number of experts augments. Thus, it is harder to find good solutions. In fact, in our encoding the search space of all combinations with k experts includes all possible combinations of fewer than k experts because the expert weights may be set to 0. Thus, if the precision falls for combinations with more experts, this is because the GA did not yet find the same or a better solution. We did not analyze combinations of more than 4 experts.
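The paired test mentioned above can be sketched with a plain paired t statistic (the paper's exact test setup may differ):

```python
import math
from statistics import mean, stdev

def paired_t(baseline, improved):
    # t statistic on paired observations (e.g. per-query average
    # precision): mean difference divided by its standard error.
    diffs = [b - a for a, b in zip(baseline, improved)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```

Values of T above roughly 2 correspond to significance below the 5% level for moderate query-set sizes.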
GAs would probably need more generations to find better combinations with more experts. Since the fitness evaluation in the proposed framework is costly, this would lead to very long learning times. Furthermore, we believe that the obtained improvements are acceptable and combinations of more experts will probably not do much better. Empirically, the optimal number of experts seems to be around 3. About 90% of the selected experts were CVM variants, i.e. they used CVM indexing for documents and/or queries. Only 10% of the experts were "pure" VSM variants. Furthermore, all of the combinations used at least one CVM expert. This empirically confirms the hypothesis of Billhardt et al. that the advantages of CVM can be best exploited by combining different CVM variants or by combining them with traditional word-matching approaches. One of the limitations of CVM is its memory and calculation cost. CVM uses the same vector space as VSM. However, in contrast to VSM, CVM document vectors are usually not sparse because many terms that do not occur in a document will have non-zero values in the document's vector. The same holds for queries. This results in high memory requirements and worse retrieval times. A solution to this problem could be a dimension reduction of the vector space as performed in LSI or PLSI.
However, such an approach changes the interpretation of the dimensions from terms to some abstract factors. We tested another method that maintains the original dimensions of the vector space but generates sparser vectors. We set all but the v highest elements in each document, query, and term vector to 0. After some experiments we empirically chose v = 150 because it drastically reduces the number of non-zero elements in the vectors and still models the term relationships sufficiently well. Before reducing document or query context vectors we temporarily multiplied the elements with global term weights (idf for documents; the weight specified by qw in the expert for queries). The global term weights that depend on document context vectors were recalculated with the reduced model. The results obtained with the reduced CVM approach are given in Ta-
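The top-v reduction described above can be sketched as follows (v = 150 in the experiments; a tiny v is used here for illustration):

```python
def sparsify(vec, v):
    # Keep only the v largest elements of vec; zero out the rest.
    if v >= len(vec):
        return list(vec)
    keep = set(sorted(range(len(vec)), key=lambda i: vec[i],
                      reverse=True)[:v])
    return [x if i in keep else 0.0 for i, x in enumerate(vec)]
```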

In the selection process we used an elitist strategy, keeping the l fittest individuals to be part of the new generation. The remaining individuals were selected using a proportional selection criterion following the roulette principle. The selected individuals are modified with the two standard genetic operators: one-point crossover (exchanging the bits between two individuals up to a randomly chosen crossover point), and mutation (random changes of bits). Since chromosomes encode particular retrieval strategies, the fitness of individuals can be evaluated using classical IR performance measures. We used the non-interpolated average precision over all relevant documents (calculated as in the TREC experiments [17]). Let {q_1, ..., q_r} be the set of queries in a collection and let avp_RS(q_i), for 1
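One generation of the described scheme (elitism, roulette-wheel selection, one-point crossover, bit mutation) can be sketched as follows; the operator rates and function names are illustrative:

```python
import random

def next_generation(population, fitness, elite=2, p_mut=0.01, rng=random):
    # population: list of equal-length bit strings; fitness: string -> float.
    ranked = sorted(population, key=fitness, reverse=True)
    new_pop = ranked[:elite]  # elitist strategy: keep the fittest as-is
    total = sum(fitness(ind) for ind in population)

    def roulette():
        # Proportional selection: chance of being picked grows with fitness.
        r = rng.uniform(0, total)
        acc = 0.0
        for ind in population:
            acc += fitness(ind)
            if acc >= r:
                return ind
        return population[-1]

    while len(new_pop) < len(population):
        a, b = roulette(), roulette()
        point = rng.randrange(1, len(a))  # one-point crossover
        child = a[:point] + b[point:]
        child = "".join(bit if rng.random() > p_mut else "10"[int(bit)]
                        for bit in child)  # mutation: random bit flips
        new_pop.append(child)
    return new_pop
```

In the paper's setting, `fitness` would evaluate the decoded retrieval strategy by its non-interpolated average precision over the training queries.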