Application of Genetic Algorithm in Search Engine

Weifeng Li, Baowen Xu
Department of Computer Science and Engineering, Southeast University, Nanjing, China
bwxu@seu.edu.cn

Hongji Yang
Department of Computer Science, De Montfort University, Leicester, England

William Cheng-Chung Chu
TungHai University, Taiwan

Chih-Wei Lu
Department of Information Science, Feng-Chia University, Taiwan

0-7695-0933-9/00 $10.00 © 2000 IEEE

Abstract
The general search engine (GSE) provides service to users by dispatching their requests to existing search engines. The search engines selected by the GSE determine the search quality. Because the performance of the existing search engines and the users’ requests change dynamically, a fixed set of search engines cannot optimize the overall performance of the GSE. This paper applies the genetic algorithm (GA), which simulates the evolution process of living things vividly and efficiently, to realize the scheduling strategy of the agent manager. By using the GA, the combination of search engines can be optimized, and hence the overall performance of the GSE can be improved dramatically.

Keywords: Internet, Search Engine, Genetic Algorithm, Agent

1. Introduction
With the rapid development of the Internet and the WWW, the resources on the Internet have become much more abundant. Many types of Internet-based information search services, such as Gopher, WAIS and so on, have been produced and developed rapidly. Search engines make it very convenient to find information in this enormous collection. In particular, some special search engines provided by companies, such as Yahoo, Infoseek and Sohoo, are indeed useful search tools. However, the special search engines mainly cover the information in some specific domain, and it is difficult for users to select the appropriate one for a specific piece of information. There are now some meta search engines[4], which do not provide search services themselves but provide consistent interfaces to users and distribute the users’ requests to different existing search engines.

Speed and precision are the two basic guidelines of search engines. It is difficult to decide which search engine to select when a request arrives. The precision requirement can be met if several search engines (preferably fewer than 5) are used together to search for the same information. The key problem is how to select these search engines. The policy of group search and the exchange of information among individuals are the two basic features of the genetic algorithm, which contribute to global optimization and implicit parallelism. Aiming at the above problems and at implementing the general search engine (GSE), this paper uses the GA[8,14,15] method to dynamically change the combinations of search engines. First, we present the basic ideas of the GSE. Then the genetic algorithm is introduced. Finally, we describe the use of the GA in the scheduler of the GSE.

2. Basic Ideas of GSE
2.1 Inheritance Anomaly
We have presented a general search engine (GSE), which uses the idea of the meta search engine and does not maintain a database of search information itself. The users’ requests are dispatched to other existing search engines. The results returned by these search engines are processed first and then returned to the users. In effect, the GSE provides a middle agent between the users and the existing search engines. Using the GSE to search for information has the following features:

* The GSE can use multiple engines to process queries in parallel, which extends the coverage of queries and makes the search results depend on more than one search engine. Different search engines have different relevance to the same information. The GSE can analyze, compare and classify the results returned by the different search engines so that satisfactory precision can be achieved. As the GSE need not maintain a huge database, its developer can put emphasis on the distribution of the requests and the processing of the results.
* As the GSE lies between the users and the other web search engines, it can trace the users’ query requests and adopt an appropriate cache policy to improve the search speed. The GSE can provide a consistent search interface. The users need not consider which existing search engines are selected or which search methods are adopted by the different search engines.

The whole search space is divided into different fields by the GSE. Each field is served by one or more agents[7,11,12,13]. The number of agents used for each field can be adjusted according to the load: based on the request frequency of the information in a field, the scheduler can send one or more agents to serve these requests. When a request is submitted to the GSE, it is first divided into requests for sub-fields of the field, each of which is dispatched to the agents of the corresponding field. Each agent is made up of a group of search engines. It distributes the search requests in its field to the search engines it is in charge of, and it collects, processes and feeds back the search results returned by these search engines.

From the paragraphs above, we can see that the agents are the kernel components of the GSE. These agents lie in the lowest layer of the GSE and contact the existing search engines directly. The overall performance of the GSE is greatly affected by the agents. Training the agents to be specialists, i.e. to perform better in searching some fields, is the objective of the GSE. It is desired that the agents’ search engine sequences be optimal. The simplest scheduling method is first to test the different combinations of the search engines by simulation, and then to assign fixed search engine sequences to certain kinds of agents. But the load of the GSE changes continuously, and the number, functions and performance of the search engines keep changing too. Such a fixed combination of search engines cannot keep up with the changes of the GSE system, so the search engine sequence in the agent cannot remain optimal. The search engine sequence of each agent should therefore change dynamically according to the changes of the system load and the search engines. This is the main reason that the genetic algorithm[1,3,6,9,10] is introduced into the GSE.

The genetic algorithm is a computing model that simulates natural biological evolution. It follows such principles of survival and evolution as the survival of the fittest: throughout selection or contest, the better individuals are accepted and the worse are eliminated. The basic genetic operations are applied repeatedly to the possible solutions, new groups are produced constantly, the groups evolve continually, and satisfactory solutions can be reached by searching for the optimal individual over the whole space with parallel search techniques. The genetic algorithm has such obvious features as simplicity, blindness, self-organization, self-adaptation, self-learning and parallelism. The more complicated the problems are and the less clear the targets are, the more useful the GA is. By integrating the existing model of the GSE, we use the GA to describe the dispatcher model of the GSE.

3. Use of the Genetic Algorithm

3.1 Coding Method
Let $S_{se} = \{E_1, E_2, \ldots, E_m\}$ be the set of search engines. The user’s requests in a field are processed by the agents assigned to that field. The GSE can contain several agents, such as $agent_1$, $agent_2$, $agent_3$, each of which may use some search engines when it is running, such as $E_{i1}, E_{i2}, \ldots, E_{il}$. Because an agent $agent_i$ is mainly represented by its sequence of search engines, we can represent $agent_i$ as $(E_{i1}, E_{i2}, \ldots, E_{il})$, where $l$ is the number of search engines used by $agent_i$ and $E_{i1}, E_{i2}, \ldots, E_{il} \in S_{se}$. We can take the search engines used by $agent_i$ as the genes of the agent. All the $m^l$ possible sequences form the value space (denoted $I$) of $agent_i$.
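The coding method of Section 3.1 can be illustrated with a minimal Python sketch. It assumes that engines are identified by plain strings and that an agent is a fixed-length tuple of such identifiers. The names `Agent`, `value_space_size` and `random_agent` are hypothetical and do not come from the paper.

```python
import random
from typing import List, Tuple

# An agent (chromosome) is an ordered sequence of search engine identifiers (genes).
Agent = Tuple[str, ...]


def value_space_size(m: int, l: int) -> int:
    """Number of distinct engine sequences of length l drawn from m engines."""
    return m ** l


def random_agent(engines: List[str], l: int, rng: random.Random) -> Agent:
    """Build one agent by drawing l genes (repetition allowed) from the engine set."""
    return tuple(rng.choice(engines) for _ in range(l))


# Illustrative parameters: m = 50 candidate engines, l = 5 engines per agent.
engines = [f"E{k}" for k in range(1, 51)]
rng = random.Random(0)
agent = random_agent(engines, 5, rng)
```

With 50 candidate engines and 5 genes per agent the value space already holds 312,500,000 sequences, which is why an exhaustive test of all combinations is impractical.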

3.2 Adaptation Function and Selection
The search engines can be divided into integrated search engines and special search engines. The integrated search engines can search for information in almost all fields, but that information is not detailed. The special search engines are designed for particular fields. The coverage of their search results is much more limited, but the results are much more detailed and suit users’ special demands. The adaptation function is used to evaluate an individual’s performance comprehensively. In the GSE, each agent corresponds to some search engines, and the agent’s performance guidelines can be obtained by integrating the performance guidelines of these search engines. The performance guidelines of a search engine include several important aspects: the field allocating set $f$, the search precision $p$, the search completeness $c$, the average response time $u$ and the updated time $t$. The performance of a search engine $E$ can be represented as a pentad $P(E) = (f(E), p(E), c(E), u(E), t(E))$. The search space $T$ of the search engine is represented as a set $\{t_1, t_2, \ldots, t_p\}$, where $t_1, t_2, \ldots, t_p$ denote the individual fields.

Definition 1: In the search space $T$, the weight $w_i$ of the field $t_i$ in the search engine $E$ is defined as a pair $(t_i, w_i)$. The set of all such pairs is named the field allocating set of the search space $T$ in the search engine $E$, denoted as $f(E)$:

$$f(E) = \{(t_i, w_i) \mid t_i \in T,\ 0 \le w_i \le 1,\ i = 1, 2, \ldots, p,\ \sum_{i=1}^{p} w_i = 1\}$$

Definition 2: In the search space, the precision of the field $t$ in the search engine $E$ is defined as the ratio of the number of corresponding documents in the result set to the number of documents in the result set, denoted as a pair $(t, precision)$. The set of all such pairs is named the precision of the search space $T$ in the search engine $E$, denoted as $p(E)$:

$$p(E) = \{(t, precision) \mid t \in T,\ 0 \le precision \le 1\}$$

Definition 3: In the search space, the completeness of the field $t$ in the search engine $E$ is defined as the ratio of the number of corresponding documents in the results to the number of corresponding documents that exist, denoted as a pair $(t, completeness)$. The set of all such pairs is named the completeness of the search space $T$ in the search engine $E$, denoted as $c(E)$:

$$c(E) = \{(t, completeness) \mid t \in T,\ 0 \le completeness \le 1\}$$

Definition 4: The correlative search engine set RSE(s-field) is the set of search engines whose degree of correspondence to s-field is greater than $M$:

$$RSE(\text{s-field}) = \{E \mid a(E, \text{s-field}) > M,\ E \in S_{se},\ \text{s-field} \subseteq T\}$$

where $a(E, \text{s-field})$ is a congregating function and $M$ is a critical value. The greater $M$ is, the more closely the selected correlative search engines match the field set s-field. But if $M$ is too great, too few search engines remain to choose from. The manager of the GSE can adjust $M$ on demand.

The agents in the GSE are made up of the search engines they manage. The performance guidelines of an agent can also be represented as a pentad $(f, p, c, u, t)$. Assume the agent $agent_i$ is made up of a search engine sequence $(E_{i1}, E_{i2}, \ldots, E_{il})$. The field allocating set $f$ can be represented as the combination of the field allocating sets of the agent’s search engines, denoted as

$$P(agent_i).f = \bigcup_{E \in agent_i} f(E)$$

The search precision and the search completeness of an agent depend closely on the way in which the agent processes the results returned from each search engine. Because the results are obtained from every search engine and then processed before they are returned to the users, the response time of the agent can be written as

$$P(agent_i).u = \mathrm{Max}(u(E_{i1}), \ldots, u(E_{il})) + \tau(E_{i1}, \ldots, E_{il}, l)$$

where $\tau(E_{i1}, \ldots, E_{il}, l)$ is the average time the agent $agent_i$ takes to process the results returned by the search engines $E_{i1}, \ldots, E_{il}$. $\tau$ is a function of the search engines used by the agent and of the number of search engines (here $l$), and the value of $l$ has a great effect on $\tau$. The latest updated time of the search engines used by the agent $agent_i$ can be used to represent the overall updated time of $agent_i$, denoted as

$$P(agent_i).t = \mathrm{Max}(t(E_{i1}), \ldots, t(E_{il}))$$

Definition 5: The correlation between the agent $agent_i$ and the field set s-field is defined as

$$REA(agent_i, \text{s-field}) = \sum_{E \in agent_i} a(E, \text{s-field})$$

where s-field and $a(E, \text{s-field})$ are the same as in Definition 4.

From the definitions above, we can let the adaptation function of the agent $agent_i$ be $\Phi(agent_i) = r(f, p, c, u, t)$, where $r$ is a scale transforming function and $f$, $p$, $c$, $u$, $t$ are the performance guidelines of $agent_i$. All the agents in the GSE form a parent group whose scale is $p$ ($p \ge 1$). To avoid confusion, we simply use $agent_i(t)$ ($t \ge 0$) to represent the $i$-th agent.

3.3 Initialization of the Group
When the GSE is used for the first time, it is necessary to assign an agent to each field set according to the partition of the field sets. Each agent is assigned a search engine sequence by hand or at random. These agents form the primary group $P(0) = \{agent_1(0), \ldots, agent_p(0)\}$. The simplest allocation method is to specify one search engine $E_k$ ($E_k \in R$) for all the agents in the GSE, i.e. $agent_1(0) = agent_2(0) = \ldots = agent_p(0) = (E_k, E_k, \ldots, E_k)$. Under such circumstances, the crossover has no effect on these agents.
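The agent-level guidelines of Section 3.2 can be sketched in Python as follows. This is only an illustration under stated assumptions. The `Engine` record and the concrete congregating function `a` (sum of an engine's weights over the fields of s-field) are hypothetical, because the paper leaves $a$ unspecified. The linear processing-time function `linear_tau` is also an assumption. REA is computed as a sum over the engines of the agent, which is one reading of Definition 5.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Set, Tuple


@dataclass
class Engine:
    name: str
    f: Dict[str, float]  # field allocating set f(E): field -> weight (weights sum to 1)
    u: float             # average response time u(E), in seconds
    t: float             # updated time t(E), e.g. a timestamp


def agent_f(agent: Sequence[Engine]) -> Set[Tuple[str, float]]:
    """P(agent).f: union of the (field, weight) pairs of the agent's engines."""
    pairs: Set[Tuple[str, float]] = set()
    for engine in agent:
        pairs |= set(engine.f.items())
    return pairs


def agent_u(agent: Sequence[Engine], tau: Callable[[Sequence[Engine]], float]) -> float:
    """P(agent).u: slowest engine response time plus the agent's processing time tau."""
    return max(engine.u for engine in agent) + tau(agent)


def agent_t(agent: Sequence[Engine]) -> float:
    """P(agent).t: latest updated time among the agent's engines."""
    return max(engine.t for engine in agent)


def a(engine: Engine, s_field: Set[str]) -> float:
    """Hypothetical congregating function: total weight the engine gives to s_field."""
    return sum(engine.f.get(field, 0.0) for field in s_field)


def rse(engines: Sequence[Engine], s_field: Set[str], M: float) -> List[Engine]:
    """Correlative search engine set: engines whose correspondence exceeds M."""
    return [engine for engine in engines if a(engine, s_field) > M]


def rea(agent: Sequence[Engine], s_field: Set[str]) -> float:
    """Correlation between an agent and s_field (summation over engines assumed)."""
    return sum(a(engine, s_field) for engine in agent)


# Illustrative data only; none of these values come from the paper.
e1 = Engine("E1", {"sports": 0.7, "news": 0.3}, u=1.2, t=100.0)
e2 = Engine("E2", {"sports": 0.2, "music": 0.8}, u=0.8, t=250.0)
agent = (e1, e2)


def linear_tau(ag: Sequence[Engine]) -> float:
    """Assumed processing time: 0.1 s per engine managed by the agent."""
    return 0.1 * len(ag)
```

Raising the threshold `M` in `rse` shrinks the candidate set, which mirrors the trade-off described for Definition 4.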


3.4 Crossover
In the standard genetic algorithm, the crossover (recombination) operator is the main genetic operator. The genes in different individuals are recombined, and new individuals are produced. Every two individuals are recombined with probability $p_c$ (the crossover probability), where $p_c$ is in the range [0.6, 1.0]. The GSE divides the search space into different field sets. Each agent is oriented to some special field set, i.e. the agents are partitioned according to the division of the search space. To ensure that the recombined agents are more suitable for serving some field set, the individuals selected for crossover must be related to each other, i.e. they must belong to the same class. Though this limitation does not favor the creation of new classes of individuals, it makes the recombined agents more specialized for a field set once the partition of the search space is decided. The system optimization achieved by the genetic algorithm rests on optimizing the service capability of each field in the GSE.

The crossover between agents proceeds as follows. First, randomly select from the parent group two agents $a = (A_1, A_2, \ldots, A_l)$ and $b = (B_1, B_2, \ldots, B_l)$ serving the same field set s-field. These two agents satisfy $REA(a, \text{s-field}) > N$ and $REA(b, \text{s-field}) > N$, where $REA(a, \text{s-field})$ and $REA(b, \text{s-field})$ are defined above and $N$ is a constant. Second, a random number is produced in the range [0, 1.0]. If this number is greater than $p_c$, the two agents stay the same; otherwise $a$ and $b$ are crossed and two sub-agents are produced:

$a' = (A_1, \ldots, A_{j-1}, B_j, \ldots, B_l)$
$b' = (B_1, \ldots, B_{j-1}, A_j, \ldots, A_l)$

The crossover point $j$ is a uniform random number between 1 and $l$. This crossover method is called one-point crossover[14].

3.5 Mutation
In the standard genetic algorithm, mutation is an auxiliary operator, which changes the genes of the chromosome with probability $p_m$, where $p_m$ is in the range [0.001, 0.01]. In the GSE, to ensure that the mutated agents still belong to the same field set, the genes that may be selected are limited when agents mutate: in our algorithm, a value range is set for the genes selected by mutation. The agent $a = (A_1, A_2, \ldots, A_l)$, which serves one particular field set s-field in the parent group, becomes the agent $a' = (A_1', A_2', \ldots, A_l')$ by mutation, where $w_j$ is a uniform random number in the range [0, 1.0], $B_k \in RSE(\text{s-field})$, and $k$ is a uniform random number in $(1, \ldots, l)$ (if $B_k = A_j$, $k$ is reselected).

4. Intelligent Scheduler in GSE

[Fig. 1 Intelligent scheduler in the GSE. Recoverable labels: request, response, request-distributing and result-integrating unit, agent manager.]

Figure 1 represents the intelligent scheduler in the GSE. It is made up of the user interface, the request-distributing and result-integrating unit, and the agent[5] manager. The user interface receives the users’ requests and feedback, and provides the manager with the corresponding managing interface for the request-distributing and result-integrating unit. The request-distributing and result-integrating unit distributes the decomposed requests to the corresponding agents, which send the requests to the existing search engines and collect the results returned. The results returned by all the corresponding agents are first integrated by the request-distributing and result-integrating unit and are then returned to the user who submitted the request. Each agent in the intelligent scheduler represents a search engine sequence. The agent manager uses the genetic algorithm to coordinate these agents with the changes in the performance of the search engines[2] and in the users’ requests. The procedure of coordinating these agents can be described as below:

Step 1 Initialize the $p$ agents: $t := 0$, $P(0) := \{agent_1(0), \ldots, agent_p(0)\}$;
Step 2 Do the crossover on the agents in $P(t)$ to produce $P'(t)$;
Step 3 Do the mutation on the agents in $P'(t)$ to produce $P''(t)$;
Step 4 Evaluate $P''(t)$: $\Phi(agent_1''(t)), \ldots, \Phi(agent_p''(t))$;
Step 5 Select $p$ appropriate agents from $P''(t)$ (select one for each main field);
Step 6 If the set time comes, go to Step 2, or else wait.

This adaptation procedure runs continuously at the agent manager in the GSE. It coordinates the search engine sequences in the agents according to the users’ requests, the history data of the search results and the current working states of the search engines. In the GSE, to maintain the running efficiency of this procedure and the speed of processing the users’ search results, we select 50 search engines as candidates, i.e. $m = 50$. There are 100 agents used for crossover and mutation, i.e. $p = 100$. Each agent manages 5 search engines, i.e. $l = 5$.

In Step 6 of the coordinating procedure, the set time has great influence on the system performance. If the time is set too short, running the coordinating procedure is of little use, because the performance of the agents does not change dramatically in that time. If the time is set too long, the service capability of the system can still be adjusted by running the coordinating procedure, but the adjustment is no longer timely. We submitted 1000 users’ requests in turn to the GSE over 48 hours. The query results and the query response times were recorded and analyzed. Table 1 shows the results of this experiment. We note that the experimental data depend on the time when the experiment is done. From this experiment, we find that the average change rate of the updated time of the agents in the GSE increases markedly as the interval increases. When the interval increases to 24 hours, the average change rate of the updated time of these agents goes up to 13.4 percent. It is appropriate to set the interval to 24 hours.
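The genetic operators of Sections 3.4 and 3.5 and one pass of the coordinating procedure can be sketched in Python as below. This is a minimal sketch, not the authors' implementation. It makes several assumptions: all agents serve a single field set; a gene mutates when its random number $w_j$ falls below $p_m$ (the paper's mutation formula is not legible in this copy); replacement genes are drawn from a candidate list standing in for RSE(s-field); and selection keeps the best agents among parents and offspring. The function names and the toy fitness function are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Agent = Tuple[str, ...]  # sequence of search engine identifiers


def one_point_crossover(a: Agent, b: Agent, p_c: float,
                        rng: random.Random) -> Tuple[Agent, Agent]:
    """Cross a and b with probability p_c at a uniform point j in 1..l (1-based)."""
    if rng.random() >= p_c:
        return a, b  # no crossover: both agents stay the same
    j = rng.randint(1, len(a))
    # a' = (A_1..A_{j-1}, B_j..B_l), b' = (B_1..B_{j-1}, A_j..A_l)
    return a[:j - 1] + b[j - 1:], b[:j - 1] + a[j - 1:]


def mutate(a: Agent, candidates: Sequence[str], p_m: float,
           rng: random.Random) -> Agent:
    """Replace each gene with probability p_m by a different engine from candidates."""
    genes = list(a)
    for j, gene in enumerate(genes):
        if rng.random() < p_m:  # assumed mutation condition on w_j
            choices = [e for e in candidates if e != gene]  # same effect as reselecting k
            if choices:
                genes[j] = rng.choice(choices)
    return tuple(genes)


def evolve_once(population: List[Agent], candidates: Sequence[str],
                fitness: Callable[[Agent], float], p_c: float, p_m: float,
                rng: random.Random) -> List[Agent]:
    """One pass of Steps 2 to 5: crossover, mutation, evaluation, selection."""
    shuffled = population[:]
    rng.shuffle(shuffled)
    offspring: List[Agent] = []
    for i in range(0, len(shuffled) - 1, 2):
        offspring.extend(one_point_crossover(shuffled[i], shuffled[i + 1], p_c, rng))
    if len(shuffled) % 2:
        offspring.append(shuffled[-1])  # unpaired agent passes through unchanged
    offspring = [mutate(agent, candidates, p_m, rng) for agent in offspring]
    # Assumption: parents and offspring compete, and the best p agents survive.
    ranked = sorted(population + offspring, key=fitness, reverse=True)
    return ranked[:len(population)]


def toy_fitness(agent: Agent) -> float:
    """Hypothetical stand-in for the adaptation function: count distinct engines."""
    return float(len(set(agent)))


# Illustrative run with m = 10 candidate engines, p = 4 agents, l = 5 genes.
candidates = [f"E{k}" for k in range(1, 11)]
rng = random.Random(1)
population = [tuple(rng.choice(candidates) for _ in range(5)) for _ in range(4)]
next_population = evolve_once(population, candidates, toy_fitness, 0.8, 0.01, rng)
```

In a deployment along the lines of Step 6, `evolve_once` would be called once per scheduling interval, with `fitness` computed from the measured performance guidelines of each agent.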

5. Conclusion
The selection of the search engine sequences of the agents in the GSE has direct influence on the overall performance of the GSE. The application of the genetic algorithm in the scheduler of the GSE makes the agents capable of self-adaptation. Based on the current working situations of the search engines in the GSE and the states of the users’ requests, the search engines of the agents are coordinated at the appropriate time, and the overall quality of service is thereby improved. We will do further research on the related technologies to improve the adaptation function of the genetic algorithm, so that it evaluates the performance of the agents more accurately.

6. Acknowledgments
We would like to thank Guan Yu, Xu Lei, Liu Yuan, Li Shenzhi, Huang Hui and other people at Ada lab who met to discuss the related issues and improved the readability.

References
[1] Belew R K, Booker L B eds. Proceedings of the Fourth International Conference on Genetic Algorithms and Their Applications. San Diego, CA: Morgan Kaufmann, 1991
[2] Cheung D W, et al. Discovering User Access Patterns on the World-Wide Web. Knowledge Based Systems Journal, Elsevier Science, 1998, 10(7)
[3] De Jong K. An analysis of the behavior of a class of genetic adaptive systems [Ph.D. dissertation]. University of Michigan, 1975
[4] Dreilinger D, Howe A E. Experiences with selecting search engines using metasearch. ACM Trans on Inf Sys, 1997, 15(3): 195-222
[5] Etzioni O, Weld D. Intelligent agents on the Internet: fact, fiction and forecast. IEEE Expert, 1995, 10(4): 44-49
[6] Forrest S ed. Proceedings of the Fifth International Conference on Genetic Algorithms and Their Applications. San Mateo, CA: Morgan Kaufmann, 1993
[7] Genesereth M R, Ketchpel S P. Software Agents. Communications of the ACM, 1994, 37(7): 48-53
[8] Goldberg D E. Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley, 1989
[9] Grefenstette J J ed. Proceedings of the Third International Conference on Genetic Algorithms and Their Applications. Hillsdale, NJ: Lawrence Erlbaum, 1985
[10] Holland J. Adaptation in Natural and Artificial Systems. Ann Arbor: The University of Michigan Press, 1975
[11] Krulwich B, Burkey C. The InfoFinder Agent: Learning user interests through heuristic phrase extraction. IEEE Expert, 1997, 12(5)
[12] Marko Balabanović and Yoav Shoham. Learning information retrieval agents: Experiments with automated web browsing. In Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Resources, Stanford, CA, March 1995
[13] Robert Armstrong, Dayne Freitag, Thorsten Joachims, and Tom Mitchell. WebWatcher: A learning apprentice for the World-Wide Web. In Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Resources, Stanford, CA, March 1995
[14] Yao Xin, Chen Guoliang and Xu Huimin. A Survey of Evolutionary Algorithms. Chinese J. Computers, 1995, 18(9): 694-706
[15] Zbigniew Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag Berlin Heidelberg, 1996
[16] Zhang Weifeng, Xu Baowen. Research on Framework Supporting Web Search Engine. Journal of Computer Research & Development, 2000, 37(3)
[17] Zhang Weifeng, Xu Baowen, Zhou Xiaoyu. Counting Techniques in Web Pages. Mini-Micro Systems, in preparation
[18] Zhang Weifeng, Xu Baowen, Zhou Xiaoyu. Web Page Techniques for Interacting between Elements. Computer Engineering, in preparation
[19] Zou Tao, et al. The Technology Implementation of Information Mining on W. Journal of Computer Research & Development, 1999, 36(8): 1021-1024