World Academy of Science, Engineering and Technology 4 2007

MRST: A New Technique for Information Summarization

Afnan Ullah Khan, Shahzad Khan, Waqar Mahmood

Abstract—Information summarization is defined as “the process of concisely restating the essential ideas of a text or passage, and synthesizing the ideas into an overarching idea”. There is a growing need for techniques that generate good summaries: a large repository of data is available online, but getting to the right information is very difficult. Many summarization techniques have been devised for this task, such as LEAD, MEAD and RANDOM. This paper proposes a new technique for information summarization that combines Rhetorical Structure Theory (RST) with the MEAD summarizer system. Because Mead is based entirely on mathematical calculation and lacks a knowledge base, Rhetorical Structure Theory is used to overcome this weakness. The new summarizer system is then evaluated against the original MEAD summarizer system in two areas of information: financial articles and Medline abstracts. In theory MRST should be better than Mead, but in practice Mead came out ahead of MRST, for one reason: there is no true parser that completely implements Rhetorical Structure Theory. The results show that Mead produces successful summaries 75% of the time for both short and long documents. MRST produces successful summaries 70% of the time for short documents and 65% of the time for long documents; as the size of the document increases, the performance of MRST deteriorates. The main finding of the work is that a parser that comprehensively implements Rhetorical Structure Theory would make possible a summarizer system better than MEAD.

Keywords—Mead, Rhetorical Structure Theory, SPADE, Text Summarization.

Manuscript received February 18, 2004. Afnan Ullah Khan is with the National University of Science and Technology, 166-A St 9, Chaklala Scheme 3, Rawalpindi, Pakistan (phone: 0092-51-2260911; e-mail: 31afnan@niit.edu.pk). Shahzad Khan was with the National University of Science and Technology. He is now with the Department of Computer Science, Cambridge University, UK (e-mail: [email protected]). Waqar Mehmood is with the Networking Engineering Department, National University of Science and Technology, 166-A St 9, Chaklala Scheme 3, Rawalpindi, Pakistan (e-mail: [email protected]).

1. INTRODUCTION

In this section a short account of information summarization and its various techniques is given. The World Wide Web is today the largest repository of information available, and it is increasingly becoming the main source of content for many different forms of expression in modern society. The content of the web includes different kinds of audiovisual information; however, the majority of the available information is still in textual form, i.e. coded in natural language. It is therefore not surprising that recent research and proposals for new projects on web information retrieval mostly concern textual information, even when the final goal is to address other kinds of data, usually through the use of metadata.

On the other hand, the common paradigms by which a regular user retrieves and interprets web content are highly inefficient with regard to the amount of information the user has to retrieve before he can actually interpret it in a meaningful way. This is not a new problem in the information society. Taking the display of information in a newspaper as a common example, we find headlines and carefully created titles that are inserted strategically so that the reader can get a general overview of the contents before he decides whether he wants to read more in-depth information. It is clear that, in modern information retrieval and interpretation, understanding and comprehension are roles that are more and more being shifted to the information provider rather than the receiver, with the objective of reducing the latency between information retrieval and interpretation. In this context the automatic summarization of textual content is a major research area, and the development of systems oriented to the specificities of web information is of great interest to the computer science community. Automatic text summarization is the field that deals with techniques to reduce a text to a smaller size and to its most important points.

Automatic text summarization is of particular importance nowadays because of the practical need to deal with the information overload to which the Internet is perceived to have greatly contributed. Apart from the interest in tools to filter, extract and summarize text on the Internet, there is also increasing interest in techniques that allow text from the Internet to be pushed to fixed smaller formats such as WAP and SMS on mobile phones and PDAs. Since text understanding is, from a descriptive and a computational point of view, to a very large extent still an unsolved problem, there currently exist only quite simple stochastic extraction methods that do not require an "understanding" of the text. These approaches extract what are probably the most relevant sentences of a text and perform surprisingly well, although there is still room for improvement concerning the methods on the one hand and


This section describes the approach through which the work was conducted. The first iteration was the implementation of Mead [1]. The flavor of Mead implemented in this work differs slightly from the original proposal: vectors are built at the sentence level rather than at the document level, as originally proposed in Mead [1]. The idea of building vectors at the sentence level was tried to see how well Mead [1] would perform, and the results were astonishing. To build the sentence-level vectors, the Vector Space Model (VSM) [10] was used. The VSM proposes a way of assigning weights to the words present in a document; on the basis of those weighted words, the summary of the document can be made.

the cohesion of the output text on the other (e.g., anaphoric references not included in the summary).

1.1 Mead
Since its publication in 1965, the Nelder-Mead "simplex" algorithm has become one of the most widely used methods for nonlinear unconstrained optimization. The Nelder-Mead algorithm should not be confused with the (probably) more famous simplex algorithm of Dantzig for linear programming; both algorithms employ a sequence of simplices but are otherwise completely different and unrelated. In particular, the Nelder-Mead method is intended for unconstrained optimization. The Nelder-Mead algorithm is especially popular in the fields of chemistry, chemical engineering, and medicine. A recent book, which contains a bibliography with thousands of references, is devoted entirely to the Nelder-Mead method and its variations. Two measures of the ubiquity of the Nelder-Mead method are that it appears in the best-selling handbook Numerical Recipes, where it is called the "amoeba algorithm," and in Matlab. The Nelder-Mead method attempts to minimize a scalar-valued nonlinear function of n real variables using only function values, without any derivative information (explicit or implicit). The Nelder-Mead method thus falls in the general class of direct search methods.

As for RST [2], the SPADE [8] parser is used to obtain the required RST trees. The SPADE parser works at the sentence level and identifies the nuclei, the satellites and their relations in the form of RST trees. Internally, the SPADE parser uses the Charniak parser to generate the RST trees. The next iteration combined Mead with RST to obtain the new technique, MRST. The new technique first uses Mead to build vectors and then compares the result with the RST-generated summary; the sentences that appear in both are taken as the summary of the document.

1.2 Lead
Lead [21] is a technique in which the first and last sentence or sentences of a paragraph are chosen, depending on the compression percentage. It works very well for news articles, which set out their main theme in the opening lines.
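
The Lead selection described above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in the paper: the function name `lead_summary`, the reading of the compression percentage as the share of sentences to keep, and the even split between opening and closing sentences are all assumptions.

```python
import math


def lead_summary(sentences, keep_pct):
    """Keep sentences from the start and the end of a paragraph.

    Assumption: keep_pct is the percentage of sentences to keep, and the
    kept sentences are split between the opening and the closing of the
    paragraph, with the opening favored when the count is odd.
    """
    n = len(sentences)
    if n == 0:
        return []
    keep = min(n, max(1, math.ceil(n * keep_pct / 100)))
    head = math.ceil(keep / 2)   # sentences taken from the start
    tail = keep - head           # sentences taken from the end
    return sentences[:head] + sentences[n - tail:]


paragraph = ["a", "b", "c", "d", "e", "f"]
print(lead_summary(paragraph, 50))
```

With six sentences and a 50% setting, the sketch keeps the first two sentences and the last one.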

The final iteration was the evaluation of all the techniques on the two sources of information used, namely Medline [5] abstracts and Wall Street Journal financial articles. Four main criteria were laid down for the evaluation of these techniques: Coherence [8], Correctness [8], Compression [8] and Overall. Finally, reports were generated to support our analysis.

1.3 Random
The Random [2] technique is the simplest of all: it randomly selects lines from the source document, depending on the compression percentage, and puts them in the summary. It is used to generate baseline documents against which more sophisticated algorithms can be benchmarked.
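
A random baseline of this kind is short enough to sketch directly. The function name `random_summary`, the `seed` parameter and the choice to return the selected lines in their original order are assumptions made for the illustration; the text only states that lines are picked at random according to the compression percentage.

```python
import random


def random_summary(lines, keep_pct, seed=None):
    """Pick a random subset of lines as a baseline summary.

    Assumption: keep_pct is the percentage of lines to keep. A seed can
    be passed so that a benchmark run is reproducible.
    """
    n = len(lines)
    if n == 0:
        return []
    keep = min(n, max(1, round(n * keep_pct / 100)))
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(n), keep))  # keep document order
    return [lines[i] for i in chosen]


doc = ["line one", "line two", "line three", "line four"]
print(random_summary(doc, 50, seed=7))
```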

3. DESIGN AND IMPLEMENTATION
This section explains the implementation of each technique.

3.1 Mead Summarizer
The implementation of Mead [1] is as follows. First, all the sentences in the input file were extracted and assigned separate identifiers. Next, all the words in the file were extracted and their numbers of occurrences were counted, so that the inverse term frequency could be calculated. Then, for every sentence, the number of occurrences of each word was calculated and multiplied by the inverse term frequency of the same word over the complete input file (tf: term frequency of each word in a particular sentence; itf: inverse term frequency of a particular word over the whole document). The vector of each sentence was then built from the words present in the sentence. The centroid is calculated by averaging the vectors of all the sentences. Finally, the cosine measure is used to find the similarity between the centroid and each sentence vector; it gives the angle of each vector with the centroid.
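
The steps above can be sketched in a few functions. This is a minimal sketch under stated assumptions, not the authors' code: the paper does not give a formula for the inverse term frequency, so `itf(w) = log(total tokens / occurrences of w)` is assumed here, and the tokenizer and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def sentence_vectors(sentences):
    """Build one sparse tf * itf vector per sentence."""
    tokenized = [tokenize(s) for s in sentences]
    doc_counts = Counter(w for toks in tokenized for w in toks)
    total = sum(doc_counts.values())
    # Assumed itf: log(total tokens / occurrences of the word in the file).
    itf = {w: math.log(total / c) for w, c in doc_counts.items()}
    return [{w: tf * itf[w] for w, tf in Counter(toks).items()}
            for toks in tokenized]


def centroid(vectors):
    """Average of all sentence vectors."""
    acc = {}
    for vec in vectors:
        for w, x in vec.items():
            acc[w] = acc.get(w, 0.0) + x
    return {w: x / len(vectors) for w, x in acc.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_centroid(sentences):
    """Sentence indices, closest to the centroid first."""
    vectors = sentence_vectors(sentences)
    center = centroid(vectors)
    scores = [cosine(vec, center) for vec in vectors]
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)


doc = ["Stocks fell sharply as markets reacted.",
       "Markets reacted as stocks fell.",
       "The weather was sunny."]
print(rank_by_centroid(doc))
```

In this toy document the two market sentences share most of their vocabulary, so they sit closer to the centroid than the unrelated third sentence.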

1.4 RST
RST [2] was originally developed as part of studies of computer-based text generation. A team at the Information Sciences Institute (part of the University of Southern California) was working on computer-based authoring. In about 1983, part of the team (Bill Mann, Sandy Thompson and Christian Matthiessen) noted that there was no available theory of discourse structure or function that provided enough detail to guide the programming of any sort of author. Responding to this lack, RST [2] was developed out of studies of edited or carefully prepared text from a wide variety of sources. It now has a status in linguistics that is independent of its computational uses.

2. APPROACH


According to the compression percentage entered by the user, the lines nearest to the centroid, in terms of the angle they make with it, are selected, starting from the closest and moving onwards. These selected lines form the summary and are then written to a file.
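
The selection step can be sketched as below, given one similarity score per sentence (for example, the cosine with the centroid). The function name `select_summary`, the reading of the percentage as the share of sentences to keep, and the decision to emit the selected sentences in document order are assumptions; the text specifies only that sentences are picked from the closest onwards.

```python
import math


def select_summary(sentences, scores, keep_pct):
    """Keep the highest-scoring sentences up to the requested share.

    Assumptions: scores[i] is the similarity of sentence i to the
    centroid, keep_pct is the percentage of sentences to keep, and the
    summary is emitted in the original document order.
    """
    n = len(sentences)
    if n == 0:
        return []
    keep = min(n, max(1, math.ceil(n * keep_pct / 100)))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]


print(select_summary(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.7], 50))
```

Because whole sentences are kept or dropped, the achieved length can only approximate the requested percentage.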

4. ANALYSIS DATA

4.1 Data Repository
The main sources of information for the experiments fall into two categories, which are as follows:

3.2 Rhetorical Structure Theory Summarizer
The implementation of RST [2] is as follows. To summarize multiple documents, all the input documents are first dumped into one file so that the file can be passed to the SPADE parser. The SPADE [8] parser only accepts files marked up with the tags <s> and </s>: every sentence must have <s> at its start and </s> at its end. To meet this requirement, the file containing all the input documents is read and a new file is created in the required format. Once the tagged file is available, the parser is invoked: the Runtime class of Java is used to run a script that enters the SPADE directory and runs the parser on the given file, and the RST [2] trees are written to a new file. The next task is to retrieve the nuclei from the RST [2] trees. An algorithm was devised for this purpose; studying the trees made it clear that only three kinds of line can follow a nucleus line:

• Financial news articles.
• PubMed abstracts.

4.1.2 Financial News Articles
The financial news articles are taken from the Wall Street Journal. Most of them focus on the current financial situation of the world.

4.1.3 PubMed Abstracts
PubMed [5], a service of the National Library of Medicine, includes over 15 million citations for biomedical articles back to the 1950s. These citations are from MEDLINE and additional life science journals. PubMed abstracts were downloaded from its website and then tested.

5. EVALUATION
The summaries are evaluated on the basis of the following four criteria.

• A line containing nucleus information.
• Another nucleus line.
• A satellite line.

5.1.1 Coherence
The coherence of a text is the degree to which the reader can describe the role of each individual sentence (or group of sentences) with respect to the text as a whole. Theories such as Rhetorical Structure Theory (Mann and Thompson, 1988) attempt to formalize coherence using a set of inter-segment relations (such as Cause, Solutionhood, Elaboration) that express the internal document structure.

With this in mind, the relevant lines of data were extracted from the RST [2] trees. One major problem remained: the extracted lines still contained the syntax abbreviations inserted by the SPADE parser. These abbreviations were deleted and the lines were then properly formatted.
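
The preprocessing step of Section 3.2, in which the merged input is rewritten with sentence tags before the parser is invoked, might look like the sketch below. It assumes the closing tag is `</s>` and that one tagged sentence per line is acceptable; the exact input format should be checked against the SPADE distribution. The sentence splitter is deliberately naive and the function names are hypothetical.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by space.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tag_for_parser(documents):
    """Merge several documents into one string of tagged sentences."""
    lines = []
    for doc in documents:
        for sentence in split_sentences(doc):
            lines.append(f"<s> {sentence} </s>")
    return "\n".join(lines)


print(tag_for_parser(["Rates rose. Stocks fell.", "Bonds rallied."]))
```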

5.1.2 Correctness
A subjective evaluation of the degree to which the information contained in the original text has been reproduced without distortion in the summary.

3.3 MRST
The implementation of MRST is as follows:

5.1.3 Compression
Whether or not the technique compresses the text according to the percentage given by the user of the system.
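
One way to make the compression criterion concrete is to compare the achieved summary size with the requested percentage. The sketch below is an assumption-laden illustration: the paper does not state the unit of measurement or any tolerance, so size is counted in whitespace-separated words and the `tolerance_pct` parameter is hypothetical.

```python
def achieved_pct(source_text, summary_text):
    """Summary length as a percentage of source length, counted in words."""
    source_words = len(source_text.split())
    if source_words == 0:
        return 0.0
    return 100.0 * len(summary_text.split()) / source_words


def complies(source_text, summary_text, requested_pct, tolerance_pct=5.0):
    """True if the achieved percentage is within the assumed tolerance."""
    gap = abs(achieved_pct(source_text, summary_text) - requested_pct)
    return gap <= tolerance_pct


source = "a b c d e f g h i j"
summary = "a b c"
print(achieved_pct(source, summary), complies(source, summary, 30))
```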

With the two summarizer systems in place, the task was to combine them. First the input file was passed to the Mead [1] summarizer and the results were written to a file. The same input file was then passed to the RST [2] summarizer and the results were written to another file. The two files were then compared, and the sentences that occurred in both were put in the final summary. If, according to the given compression percentage, there is room for more sentences, the RST-generated summary is given priority and further sentences are taken from it. The problem with the RST-generated summary is that it has grammatical problems. To address this, each sentence taken from the RST summary is replaced by its counterpart from the Mead summary; if no counterpart is present in Mead, the sentence is taken from the RST summary itself.
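
The combination step can be sketched as follows. This is one possible reading of the procedure, not the authors' code: sentences are matched after normalizing case and punctuation, the "counterpart" of an RST line is looked up among the clean sentences that Mead extracted from the input, and the budget is the requested percentage of the source sentence count. All names are hypothetical.

```python
import math
import re


def match_key(sentence):
    # Assumed matching rule: compare lowercase alphanumeric tokens only.
    return " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))


def mrst_summary(mead_sentences, mead_summary, rst_summary, keep_pct):
    """Combine a Mead summary and an RST summary.

    mead_sentences: all clean sentences Mead extracted from the input.
    mead_summary:   the sentences Mead selected.
    rst_summary:    nucleus lines from the RST trees, possibly malformed.
    """
    budget = max(1, math.ceil(len(mead_sentences) * keep_pct / 100))
    clean = {match_key(s): s for s in mead_sentences}
    mead_keys = {match_key(s) for s in mead_summary}
    final, used = [], set()

    # Pass 1: sentences that occur in both summaries.
    for line in rst_summary:
        key = match_key(line)
        if key in mead_keys and key not in used and len(final) < budget:
            final.append(clean.get(key, line))
            used.add(key)

    # Pass 2: fill remaining room from the RST summary, preferring the
    # clean counterpart and falling back to the RST line itself.
    for line in rst_summary:
        if len(final) >= budget:
            break
        key = match_key(line)
        if key not in used:
            final.append(clean.get(key, line))
            used.add(key)
    return final


source = ["Rates rose today.", "Stocks fell.", "Bonds rallied.", "Oil was flat."]
mead = ["Stocks fell.", "Oil was flat."]
rst = ["rates rose today", "stocks fell", "gold slipped"]
print(mrst_summary(source, mead, rst, 50))
```

In the example, "stocks fell" appears in both summaries and is taken first; the remaining slot goes to the next RST line, rendered with its clean counterpart.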

5.1.4 Overall
In the overall part of the evaluation, the generated summary is judged against human-made summaries.

6. TEXTUAL ANALYSIS

6.1 Mead
The results show that Mead is more successful than the new technique, MRST. The main problem with Mead is its non-compliance with the given percentage, since it works at the sentence level when compressing. It is, however, far faster than MRST: the new technique takes at least 20 times longer to process, because it invokes two parsers and then uses the Mead technique for further summarization. On almost all the remaining judgment criteria Mead performs well. Because it works at the sentence level, the sentences themselves are not altered, so it scores full marks for correctness.

6.2 MRST
The MRST results show that it performs well, and sometimes even better than Mead, if the given compression percentage is small. The results are quite similar to those of Mead, but performance worsens when high compression is required, because sentences then have to be taken from the RST summary, which has serious correctness problems.

7. CONCLUSION
Text summarization has played an important role in the development of the World Wide Web. As Internet usage increases and the amount of information online grows, so will the importance of and need for summarization techniques, and researchers will have to come up with techniques that can summarize information according to the will of the people. In this work a new technique, MRST, was presented and compared with an older technique, Mead, that has been very successful in recent years. In theory MRST should outperform Mead, but because no true RST parser is available the new technique failed to perform as expected. A lot of work remains to be done on the new technique before it could take the place of Mead as the main summarizer system in the industry.

8. RECOMMENDATIONS
RST [2] takes information summarization from the sentence level to the paragraph level or above, but currently no parser is available that implements the RST [2] technique completely. The main focus of research should therefore be on developing such a parser, so that eventually we would have a technique that outperforms all others.

9. FUTURE WORK
The work can be extended by developing a new parser that implements the technique of RST [2] to the fullest. With such a parser, the problems faced by the RST [2] and the MRST techniques would be solved, and eventually we would have a summarization system that outperforms the Mead [1] summarization system.

REFERENCES
[1] Radev D., H. Jing and M. Budzikowska, 2000. “Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies.” ANLP/NAACL Workshop on Summarization. Seattle, WA.
[2] William C. Mann, Sandra A. Thompson: Rhetorical Structure Theory: A Theory of Text Organization, Text, 8 (3), 1988.
[3] Radev, D. 2000. A common theory of information fusion from multiple text sources, step one: Cross-document structure. In Proceedings, 1st ACL SIGDIAL Workshop on Discourse and Dialogue.
[4] PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
[5] Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. 2000. Centroid-Based Summarization of Multiple Documents: Sentence Extraction, Utility-Based Evaluation, and User Studies. In Proceedings of the Workshop on Automatic Summarization at the 6th Applied Natural Language Processing Conference and the 1st Conference of the North American Chapter of the Association for Computational Linguistics, Seattle, WA, April.
[6] Paul Seaebury: www.c-sharpcorner.com
[7] www.isi.edu/
[8] Radu Soricut and Daniel Marcu. “Sentence Level Discourse Parsing using Syntactic and Lexical Information”. Information Sciences Institute, University of Southern California.
[9] ISO 215:1986
[10] A. Wong, G. Salton, C. S. Yang. “A vector space model for automatic indexing”. Cornell Univ., Ithaca, NY.
[11] Daniel Mallett, James Elding, Mario A. Nascimento. “Information-Content Based Sentence Extraction for Text Summarization”. Department of Computing Science, University of Alberta, Canada.
