Using Document Structure for Automatic Summarization
Aurélien Bossard
LIPN - CNRS UMR 7030, Université Paris 13, 99 av. J-B Clément, 93430 Villetaneuse, France
[email protected]
Categories and Subject Descriptors: I.2.7 [Automatic Summarization]: Miscellaneous
General Terms: Algorithms
Keywords: Automatic summarization
1. INTRODUCTION
In this paper, we present a novel approach to automatic summarization. We believe redundancy is the most important factor in building a summary automatically. We want to detect it with an unsupervised method that can apply to any multi-document summarization task. CBSEAS, the system implementing our approach, integrates a new redundancy detection method at its very core in order to produce more expressive summaries than previous approaches. However, the evaluation of our system at TAC 2008 (Text Analysis Conference) revealed some failings. We propose to make up for these weaknesses by using document structure inside the automatic summarizer.
2. CBSEAS SYSTEM DESCRIPTION
We assume that, in multi-document summarization, redundancy is a good indicator of sentence importance: if a piece of information appears in several documents of a collection, it is crucial and should be included in the summary of that collection. Detecting redundancy is a means of selecting sentences, but it also improves summary quality by preventing the selection of two redundant pieces of text. Our system, CBSEAS, clusters the sentences of the documents to summarize using fast global k-means [1] on the sentence similarity matrix. The semantic clusters created this way are close to redundancy-based clusters. From each cluster, the system then selects the sentence presumed most important: the one that maximizes a weighted sum of its similarity to the other sentences in its cluster and its similarity to a user query.
3. CBSEAS EVALUATION
We evaluated our system on two tasks of the TAC 2008 evaluation campaign¹: the Opinion Summarization Task (blog summarization) and the Update Task (newswire summarization and information update). We obtained very good results on the blog summarization task, ranking second overall, but our results on the Update Task were disappointing. Blogs are poorly structured documents, whereas newswire articles share a specific discourse structure. We concluded that our system is not competitive enough for newswire summarization because it does not take any structural element into account.
4. USING STRUCTURE IN CBSEAS
We chose to implement a naive method to test whether using structure can really improve summary quality. For the moment, we classify newswire articles into four groups: chronologies, comparative news, enumerative news and classic news. The first three categories are especially interesting for an automatic summarizer: they are written in a concise style, and their sentences are easier to insert into a summary. All sentences belonging to an article tagged as chronology, comparative or enumerative news receive a 15% bonus in the scoring function, which makes our system favor sentences extracted from non-classic news. With this change, we observed a 10% improvement in our results on the Update Task, using the ROUGE measure² to evaluate these post-campaign experiments.
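As an illustration only, the per-cluster sentence selection of the system description, combined with the 15% bonus for chronology, comparative and enumerative news, could be sketched as follows. This is a minimal sketch and not the CBSEAS implementation: the clusters and the similarity matrix are taken as given, and the names `Sentence`, `score`, `select`, the weight `alpha`, the use of a mean for the in-cluster similarity, and the multiplicative form of the bonus are all assumptions made for the example.

```python
from dataclasses import dataclass

# Article categories named in the text; the bonus applies to these three only.
STRUCTURED_CATEGORIES = {"chronology", "comparative", "enumerative"}
STRUCTURE_BONUS = 0.15  # 15% bonus stated in the text


@dataclass
class Sentence:
    text: str
    category: str     # category of the source article (hypothetical field)
    query_sim: float  # similarity to the user query, assumed precomputed


def score(index, cluster, sim, sentences, alpha=0.5):
    """Weighted sum of in-cluster similarity and query similarity.

    Assumptions: in-cluster similarity is averaged over the other cluster
    members, alpha is the mixing weight, and the bonus is multiplicative.
    """
    others = [j for j in cluster if j != index]
    centrality = sum(sim[index][j] for j in others) / len(others) if others else 0.0
    base = alpha * centrality + (1 - alpha) * sentences[index].query_sim
    if sentences[index].category in STRUCTURED_CATEGORIES:
        base *= 1 + STRUCTURE_BONUS
    return base


def select(clusters, sim, sentences, alpha=0.5):
    """Return the index of the best-scoring sentence of each cluster."""
    return [
        max(cluster, key=lambda i: score(i, cluster, sim, sentences, alpha))
        for cluster in clusters
    ]


# Toy data: three equally similar sentences, one from a chronology article.
sentences = [
    Sentence("First report of the event.", "classic", 0.5),
    Sentence("Second report of the event.", "classic", 0.5),
    Sentence("Timeline entry for the event.", "chronology", 0.5),
]
sim = [
    [1.0, 0.8, 0.8],
    [0.8, 1.0, 0.8],
    [0.8, 0.8, 1.0],
]
print(select([[0, 1, 2]], sim, sentences))  # the chronology sentence (index 2) wins
```

With equal similarities, the bonus alone decides the selection, which is the effect the naive method is meant to produce: sentences from non-classic news are favored when they are otherwise as good as the alternatives.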
5. CONCLUSION
We presented a new approach to multi-document summarization. It uses an unsupervised clustering method to group semantically related sentences together and extracts one sentence per cluster to create the summary. We also proposed a way to improve the quality of news summaries using the specific structure of newswire articles. By integrating some basic structural traits, we showed that structure does improve the quality of the summaries. We plan to extend this work by taking into account more news categories and a real representation of structure in our summarizer.
6. REFERENCES
[1] S. López-Escobar, J. A. Carrasco-Ochoa, and J. F. M. Trinidad. Fast global k-means with similarity functions algorithm. In Intelligent Data Engineering and Automated Learning - IDEAL, pages 512–521, 2006.
Copyright is held by the author/owner(s). SIGIR’09, July 19–23, 2009, Boston, Massachusetts, USA. ACM 978-1-60558-483-6/09/07.
¹ http://www.nist.gov/tac/tracks/2008/summarization/
² http://berouge.com/