Long C, Huang ML, Zhu XY et al. A new approach for update multi-document summarization. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 25(4): 739–749 July 2010. DOI 10.1007/s11390-010-1057-8
A New Approach for Multi-Document Update Summarization

Chong Long1, Min-Lie Huang1, Xiao-Yan Zhu1,* (Member, CCF), and Ming Li2 (Fellow, ACM, IEEE)

1 State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
2 School of Computer Science, University of Waterloo, Waterloo N2L 3G1, Canada

E-mail: [email protected]; [email protected]; [email protected]; [email protected]

Received October 22, 2009; revised April 8, 2010.

Abstract    Fast changing knowledge on the Internet can be acquired more efficiently with the help of automatic document summarization and updating techniques. This paper describes a novel approach for multi-document update summarization. The best summary is defined to be the one which has the minimum information distance to the entire document set. The best update summary has the minimum conditional information distance to a document cluster given that a prior document cluster has already been read. Experiments on the DUC/TAC 2007 to 2009 datasets (http://duc.nist.gov/, http://www.nist.gov/tac/) show that our method correlates closely with the human summaries and outperforms other programs such as LexRank in many categories under the ROUGE evaluation criterion.

Keywords    data mining, text mining, Kolmogorov complexity, information distance

1 Introduction
Automated summarization dates back to the 1950s[1]. In recent years, as web content has grown at an increasing speed, people need a concise overview of a large set of articles in a short time. Document summarization, which aims at generating brief and understandable summaries, has therefore quickly become a hot research topic. Document updating techniques are also very helpful for acquiring new information or knowledge by eliminating out-of-date or redundant information. Multi-document update summarization was introduced by the Document Understanding Conference (DUC) in 2007. It aims to produce a summary describing the majority of the information content of a set of documents under the assumption that the user has already read a given set of earlier documents. This type of summarization has proved extremely useful in tracing news stories: only new and updated content needs to be summarized if we already know something about the story. On the news service website NewsBlaster (http://newsblaster.cs.columbia.edu/), news articles are grouped into several topics, and a great number of articles have two summaries: one is the whole story of the topic, and the other tells readers what has happened recently.
For example, consider ten news articles about the development of Australia's uranium mine project in its Kakadu National Park and the protests and obstacles encountered. A good summary should cover four aspects: 1) How is the project progressing? 2) What is the attitude of the government? 3) Where are the protests and obstacles coming from? 4) How does the government deal with these problems? As exemplified, a good summary is expected to preserve the information contained in the documents as much as possible, and at the same time keep the information as novel as possible[2].
Information distance is based on the theory of Kolmogorov complexity. Kolmogorov complexity was introduced almost half a century ago[3], and it is now widely accepted as an information theory for individual objects, parallel to Shannon's information theory, which is defined on an ensemble of objects. In this paper, we propose a novel document summarization approach based on the theory of information distance among many objects. In order to deal with update summarization, we extend the information distance theory to conditional information distance among many objects. Finally, summaries are generated according to our newly developed theory.
Regular Paper
The work was supported by the National Natural Science Foundation of China under Grant No. 60973104, the National Basic Research 973 Program of China under Grant No. 2007CB311003, and the IRCI Project from IDRC, Canada.
* Corresponding Author
2010 Springer Science + Business Media, LLC & Science Press, China
Our paper's main contributions lie in two aspects. 1) We provide a framework in which multi-document summarization can be modeled by the information distance theory: the best summary is defined as the one having the minimal information distance (or conditional information distance, if a prior document set is given) to the entire document set. 2) We provide two feasible ways to implement this framework, one by compression and the other through semantic element extraction. Extensive experiments on the DUC/TAC 2007 to 2009 datasets show that the proposed method outperforms state-of-the-art systems.
The paper is organized as follows. Section 2 discusses related studies. Kolmogorov complexity and information distance are reviewed and the conditional information distance among many objects is described in Section 3. Section 4 presents our proposed summarization method based on conditional information distance among many objects, and the experiments in Section 5 demonstrate the novelty and advantages of our work. Conclusions and future work are outlined in Section 6.

2 Related Work
Our work aims at summarizing documents using information distance, thus there are mainly two related topics: document summarization and the theory of information distance. In this section, we introduce related studies on these two topics, respectively.

2.1 Document Summarization
Generally speaking, there are two main kinds of document summarization methods: extraction-based and abstraction-based. Here we focus on extractive summarization. Most extractive summarization studies have focused on NLP and statistical machine learning techniques. Carbonell and Goldstein proposed to use Maximal Marginal Relevance (MMR), which aims to select summary sentences relevant to the user query and least similar to previously chosen ones[4]. Radev et al. described an extractive multi-document summarizer which extracts a summary from multiple documents based on the document cluster centroids[5]. Researchers have also proposed a number of machine learning methods, including Naive Bayes[6], SVM[7], Conditional Random Fields[8], and Adaboost[9], to search for sentences related to a topic, to extract the most representative sentences, or both. Most recently, graph-based methods have been proposed for document summarization, such as LexRank[10] and TextRank[11-12]. Wan et al.
improved graph models by studying the relationship between different semantic granularities[13-14]. Different from all the previous summarization methods, we propose a novel summarization approach based on the information distance theory. Next we introduce related work on the theory of information distance and its applications in the text mining area.

2.2 Information Distance and Its Applications in Text Mining
Kolmogorov complexity was introduced in 1963 by Solomonoff et al.[3] Then Bennett et al. defined information distance to measure the information between two objects[15]. Li et al. defined normalized information distances in [16-17] and showed how to use them to compare documents and genomes. Long et al. generalized the theory of information distance to more than two objects[18]. Since Kolmogorov complexity is not computable, there are mainly two different methods of approximation: one by compression and the other through semantic information. More details about this theory are given in Section 3.
The theory of information distance has been applied in the text mining area since 2002. Bennett et al. first approximated information distance through compression. They traced out language history[17,19] and chain letter history[20]: if two languages or chain letters are similar to each other, they share similar words and phrases and are able to compress each other well. In 2007, Cilibrasi and Vitanyi measured the information distance between words and phrases through their semantic information[21], considered through these words' probabilities and frequencies. Then Zhang et al. proposed a method to measure the information distance from a question to an answer in a question answering (QA) system[22]. Long et al. selected typical reviews that are close to the other reviews in information distance[18].
Our document summarization approach is based on a new theory of conditional information distance among many objects. It provides a general framework that handles both the traditional summarization task (summarizing a single document set) and the update summarization task (summarizing document cluster B when given cluster A). In the next section, the theories of Kolmogorov complexity and information distance are reviewed and our extended theory is introduced.

3 Theory

Fix a universal Turing machine U. The Kolmogorov
complexity[3] of a binary string x conditioned on another binary string y, K_U(x|y), is the length of the shortest (prefix-free) program for U that outputs x with input y. It can be shown that for a different universal Turing machine U', for all x, y,

K_U(x|y) = K_{U'}(x|y) + C,

where the constant C depends only on U'. Thus K_U(x|y) can simply be written as K(x|y). We write K(x|ε), where ε is the empty string, as K(x).
It has also been defined in [15] that the energy to convert between x and y is the smallest number of bits needed to convert x to y and vice versa. That is, with respect to a universal Turing machine U, the cost of conversion between x and y is:

E(x, y) = min{|p| : U(x, p) = y, U(y, p) = x}.   (1)

It is clear that E(x, y) ≤ K(x|y) + K(y|x). From this observation, the following theorem has been proved in [15].

Theorem 1. E(x, y) = max{K(x|y), K(y|x)}.

Thus, the max distance was defined in [15]:

D_max(x, y) = max{K(x|y), K(y|x)}.   (2)

This distance is shown to satisfy the basic distance requirements such as positivity, symmetry, and the triangle inequality, and it is admissible[15]. D_max(x, y) satisfies these requirements because of Kraft's Inequality (with the prefix-free version of Kolmogorov complexity). It has been proved in [15] that for any admissible computable distance D, there is a constant c such that for all x, y, D_max(x, y) ≤ D(x, y) + c. Putting it bluntly, if any such distance D discovers some similarity between x and y, so will D_max[15].
Here, for an object x, we can measure its information by the Kolmogorov complexity K(x); for two objects x and y, their shared information can be measured by the information distance D(x, y). In [18], the authors generalize the theory of information distance to more than two objects. Similar to (1), given strings x_1, ..., x_n, they define the minimal amount of thermodynamic energy needed to convert any x_i to any x_j as:

E_m(x_1, ..., x_n) = min{|p| : U(x_i, p, j) = x_j for all i, j}.   (3)

Then it is proved in [18] that:

Theorem 2. Modulo an O(log n) additive factor,

min_i K(x_1 ... x_n | x_i) ≤ E_m(x_1, ..., x_n) ≤ min_i Σ_{k≠i} D_max(x_i, x_k).   (4)

In update summarization, the summary should contain new information which the former documents have not mentioned, so (3) is extended to:

E_m(x_1, ..., x_n | c) = min{|p| : U(x_i, p, j | c) = x_j for all i, j},   (5)

where c is the conditional sequence that is given for free when computing from x_i to x_j and back. Similar to (4):

Theorem 3. Modulo an O(log n) additive factor,

min_i K(x_1 ... x_n | x_i, c) ≤ E_m(x_1, ..., x_n | c) ≤ min_i Σ_{k≠i} D_max(x_i, x_k | c).   (6)

Given n objects and a conditional sequence c, the left-hand side of (6) may be interpreted as the most comprehensive object, which contains the most information about all of the others. The right-hand side of (6) may be interpreted as the most typical object, which is similar to all of the others.

4 Summarization Approach
We have developed the theory of conditional information distance among many objects. In this section, a new summarization model is first built based on this theory, and then we develop methods to approximate Kolmogorov complexity and information distance in two different ways.

4.1 Modeling
4.1.1 Modeling Traditional Summarization

The task of traditional multi-document summarization can be described as follows: given n documents B = {B_1, B_2, ..., B_n}, the task requires the system to generate a summary S of B. According to our theory, the information distance among B_1, B_2, ..., B_n is E_m(B). However, it is very difficult to compute E_m. Moreover, E_m itself does not tell us how to generate a summary. Eq. (4) provides a feasible way to approximate E_m: the most comprehensive object and the most typical one are the lower and upper bounds of E_m, respectively. The most comprehensive object is long enough to cover as much information in B as possible, while the most typical object is a concise one that expresses the most common idea shared by those objects. Since we aim to produce a short summary representing the general information, the right-hand side of (4) should be used. The most typical document is the B_j such that

min_j Σ_{i≠j} D_max(B_i, B_j).

However, B_j alone is far from enough to be a good summary. A good method should be able to select information from B_1 to B_n to form the best S. We view this S as a document in the set. Since S is a short summary, it does not contain extra information outside B. The best traditional summary S_trad should satisfy the constraint:

S_trad = arg min_S Σ_i D_max(B_i, S).   (7)
In most applications, the length of S is confined by |S| ≤ θ (θ is a constant integer) or |S| ≤ α Σ_i |B_i| (α is a constant real number between 0 and 1).

4.1.2 Modeling Update Summarization

Given a set of m earlier articles A = {A_1, A_2, ..., A_m}, the update summarization task is to summarize the new contents presented by a document set B = {B_1, B_2, ..., B_n}. The earlier article set A can be viewed as a precondition. Thus this task can be well modeled by the conditional version of information distance. The best summary S_best should satisfy the constraint:

S_best = arg min_S Σ_i D_max(B_i, S | A).   (8)

If m = 0 (A = ∅), this is the traditional multi-document summarization problem; if m > 0 (A ≠ ∅), it is the multi-document update summarization problem. Therefore, traditional summarization can be viewed as a special case of (8). According to [22], from (8) we can get:

D_max(B_i, S | A) = D_max(B_i^A, S | A) = D_max(B_i^A, S),

where B_i is mapped to B_i^A under the condition of A. For a document B_i and a document set A, B_i^A is the set of B_i's sentences (B_{i,k}'s) which are different from all the sentences in A_1 to A_m:

B_i^A = {B_{i,k} | ∀ sen ∈ ∪_i A_i, D_max(B_{i,k}, sen) > φ},

where A_i is the sentence set of document A_i and φ is a threshold. Note that φ is the only parameter to be specified in our approach and it is only related to update summarization clusters. We tune it on the B cluster of the DUC 2007 dataset under the ROUGE-1 criterion.
We have now developed a framework for summarization. However, the problem is that neither K(·) nor D_max(·, ·) is computable. As mentioned in Subsection 2.2, two methods can be used in the approximation and computation of information distance: one by compression and the other through semantic element extraction. In the next several subsections, we discuss how to use these two methods to do the approximations, respectively.
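The sentence filtering that produces B_i^A is straightforward to prototype. The following Python fragment is only a minimal sketch, not the implementation used in our experiments; it assumes a sentence-level distance function d_max (supplied by either approximation described below) and a threshold phi, both of which are placeholders here.

```python
def update_filter(doc_sentences, prior_sentences, d_max, phi):
    """B_i^A: keep only the sentences of B_i whose distance to every sentence
    of the prior cluster A exceeds the threshold phi, i.e., the new content."""
    return [s for s in doc_sentences
            if all(d_max(s, a) > phi for a in prior_sentences)]
```

When A is empty, all(...) over an empty set is true, so the filter keeps every sentence and (8) reduces to the traditional case (7).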
4.2 Approximation by Compression
From (1) we can get

E(x, y) = max{K(x|y), K(y|x)}
        = max{K(x, y) − K(x), K(x, y) − K(y)}
        = K(x, y) − min{K(x), K(y)},

where K(x, y) is the length of the shortest program which outputs both x and y. In [16], the authors proposed to approximate Kolmogorov complexity through a compressor C. The boundary case is C = K if C is powerful enough. In practice we use a real-world reference compressor C to approximate the information distance E(x, y). The compression distance E_C(x, y) is defined as

E_C(x, y) = C(xy) − min{C(x), C(y)},

where C(xy) denotes the compressed size of the concatenation of x and y, and C(x) and C(y) denote the compressed sizes of x and y, respectively. Then E_C(x, y) is an approximation of D_max:

D_max(x, y) = E(x, y) ≈ E_C(x, y).   (9)
Here we take LZ77[23] as the compressor, as in [19]. E_C is computed by compressing pairs of sentences. For example, consider two sentences x and y: x = "The trial of Caldera Inc.'s antitrust lawsuit against Microsoft Corp. has been postponed from June until next January.", and y = "Business seemed as usual at Microsoft headquarters on the first day of the U.S. government's antitrust trial against this software giant."; then xy = "The trial of Caldera Inc.'s antitrust lawsuit against Microsoft Corp. has been postponed from June until next January. Business seemed as usual at Microsoft headquarters on the first day of the U.S. government's antitrust trial against this software giant." Through LZ77 we get C(x) = 171, C(y) = 182, C(xy) = 292 and E_C(x, y) = 121 (bytes).
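For illustration, the same computation can be sketched with a general-purpose LZ77-family compressor such as zlib (DEFLATE). This is only a sketch, not the exact compressor configuration used in our experiments, and the absolute sizes will differ from the LZ77 figures above, but the distance behaves analogously.

```python
import zlib

def compressed_size(text: str) -> int:
    """Approximate K(text) by the size in bytes of its zlib-compressed form."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def compression_distance(x: str, y: str) -> int:
    """E_C(x, y) = C(xy) - min{C(x), C(y)}, an approximation of D_max(x, y)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + " " + y)
    return cxy - min(cx, cy)

if __name__ == "__main__":
    x = ("The trial of Caldera Inc.'s antitrust lawsuit against Microsoft Corp. "
         "has been postponed from June until next January.")
    y = ("Business seemed as usual at Microsoft headquarters on the first day "
         "of the U.S. government's antitrust trial against this software giant.")
    print(compression_distance(x, y))  # similar sentences give a smaller distance
```

Because DEFLATE only finds matches within a 32 KB window, such compression distances are most meaningful for short texts like sentences, which is how they are used here.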
4.3 Approximation by Semantic Element Extraction
The compression method is language-independent. It is easy to implement, and it can summarize documents written in any language without modification. However, this method only uses the morphological features of a sentence; the semantic meanings of terms or phrases are largely neglected by simple compression. Alternatively, we can apply an approximation method based on semantic information: first we divide a sentence into semantic elements; then the information distance between two sentences is estimated through their semantic element sets.

4.3.1 Semantic Element Extraction

In a document, each word or entity contains a certain amount of information, and the information varies according to the word or entity's importance. Such words or entities are called "semantic elements", or "elements" for short in this paper. There are two types of elements: 1) named entities such as person, organization, time, and location, which carry a large portion of the meaningful information; and 2) common words other than stop-words. For example, the meaningful elements of the sentence "George W. Bush was born on July 6th, 1946" are "George W. Bush" (person), "born" and "July 6th, 1946" (time).
First, we recognize entities about a person, location, organization and other names with Stanford Named Entity Recognition (NER, http://nlp.stanford.edu/ner/index.shtml). We also extract entities about a date or time with patterns. In total, five types of named entities are recognized. Second, words or phrases with the same meaning are normalized into one entity through coreference resolution. For example, "George W. Bush" and "George Bush" are normalized to the same entity; "May 15th, 2008", "May 15, 2008" and "5/15/2008" are recognized as the same date.

4.3.2 Information Distance Approximation

Next we take several steps to do the approximations. Although some steps contain rough approximations, we investigate the influence of these estimations with extensive experiments in Subsection 5.6. Let M = {M_1, M_2, ...} and N = {N_1, N_2, ...} be two sets of sentences. After the steps mentioned in Subsection 4.3.1, each sentence M_i (or N_j) has an element set M_i^* (or N_j^*). According to (2),

D_max(M, N) = max{K(M|N), K(N|M)},

then

K(M|N) ≈ K(∪_i M_i^* \ ∪_j N_j^*),
K(N|M) ≈ K(∪_j N_j^* \ ∪_i M_i^*).   (10)

The Kolmogorov complexity of an element set W can be computed as the sum of the complexities of all its elements:

K(W) = Σ_{w∈W} K(w).

We can use frequency counts and the Shannon-Fano code[21] to encode a semantic element that occurs with probability p in approximately −log p bits, obtaining a short description. A semantic element's probability p can usually be approximated by its document frequency in the corpus:

K(w) = −log P(w) ≈ −log df(w).   (11)
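A minimal sketch of (10) and (11) in Python follows. It assumes the element sets M_i^* and N_j^* have already been extracted and normalized (the NER and coreference pipeline itself is not shown), and that document frequencies come from the corpus as relative frequencies; these names and the smoothing are illustrative assumptions, not our exact implementation.

```python
import math
from typing import Dict, List, Set

def element_complexity(element: str, doc_freq: Dict[str, float], n_docs: int) -> float:
    """K(w) ~= -log P(w), with P(w) estimated from document frequency (Eq. (11))."""
    p = doc_freq.get(element, 1) / n_docs  # smoothed relative document frequency
    return -math.log(p)

def set_complexity(elements: Set[str], doc_freq: Dict[str, float], n_docs: int) -> float:
    """K(W) as the sum of the complexities of all elements in W."""
    return sum(element_complexity(e, doc_freq, n_docs) for e in elements)

def d_max_semantic(m_sets: List[Set[str]], n_sets: List[Set[str]],
                   doc_freq: Dict[str, float], n_docs: int) -> float:
    """D_max(M, N) = max{K(M|N), K(N|M)}, approximated via element-set differences (Eq. (10))."""
    m_union = set().union(*m_sets) if m_sets else set()
    n_union = set().union(*n_sets) if n_sets else set()
    k_m_given_n = set_complexity(m_union - n_union, doc_freq, n_docs)
    k_n_given_m = set_complexity(n_union - m_union, doc_freq, n_docs)
    return max(k_m_given_n, k_n_given_m)
```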
Although this method contains semantic information, there are mainly three steps which may lead to an approximation bias during the process of generating a summary: 1) when the complexity between two sentences is computed through their elements' complexities in (10); 2) when an element's complexity is estimated by its document frequency in (11); and 3) when E_m is approximated by its upper bound. In Subsection 5.6 we analyze how these approximations affect our system's performance.

5 Experimental Results
In this section, our summarization method is evaluated on the DUC/TAC 2007 to 2009 datasets. First, the datasets, the preprocessing method and the ROUGE evaluation criterion are introduced. Then our results are compared with LexRank on the DUC/TAC 2007 to 2009 update summarization tasks. Finally, pyramid units, which are important semantic units written by annotators in the reference summaries, are used to justify our complexity approximation method based on semantic element extraction.

5.1 Datasets
The Document Understanding Conference (DUC), renamed the Text Analysis Conference (TAC) in 2008, has been the major forum for comparing summarization systems on a shared test set. Every year several summarization topics are released, and the summaries
produced by participants are evaluated, both automatically and manually. In DUC/TAC 2007 to 2009, one of the tasks is "update summarization". The task requires participants to write a short summary of a set of newswire articles, under the assumption that the user has already read a given set of earlier articles. The length of the summary should be no more than 100 words.

Table 1. Summary of Datasets

Year                           2007   2008   2009
No. Topics                       10     48     44
No. Clusters for Each Topic       3      2      2
No. Documents                   250    960    880

Table 1 shows the statistics of the datasets. In DUC 2007, there are 10 topics, each of which has three clusters: A, B and C. Each cluster has about 10 news articles. Summarizing cluster A is a traditional summarization task. Summarizing cluster B requires generating an update summary given that A has already been read. The summary for cluster C should be updated with respect to clusters A and B. In TAC 2008 and 2009, there are 48 and 44 topics, respectively, with two clusters (A and B) for each topic. The update task is similar to that in DUC 2007. For each cluster, there are four standard summaries written by four different people. The score on a cluster is the mean of the scores against the four manual summaries. The overall score on a dataset is the mean of the scores on its clusters.

5.2 Preprocessing
During preprocessing, we need to filter out those sentences which cannot possibly be part of a summary. The top 10% of entities with the highest document frequency in a document set are viewed as "the topic set". Sentences which do not contain any entity of the topic set are eliminated, and the remaining ones are called "candidate sentences".
After preprocessing, we search for the combination of candidate sentences that makes (8) minimal. In the datasets described in Subsection 5.1, each set has on average 9.6 documents with fewer than 300 sentences, and fewer than 70 candidate sentences are selected on average. As the length of the summary is less than 100 words, there are on average three to four sentences per summary. Therefore, there are fewer than C_{70}^{4} < 10^6 different combinations in total. This is a small number, and our approach can generate the summaries in real time. The exhaustive enumeration is simple but obtains the optimal result. We will develop a heuristic search algorithm to generate longer summaries for larger document sets in our future work.
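A minimal sketch of this exhaustive search, assuming the candidate sentences, the (filtered) documents and a document-to-summary distance d_max (from either approximation in Section 4) are given, follows; the word-limit handling and the function names are illustrative, not our exact implementation.

```python
from itertools import combinations

def generate_summary(candidates, documents, d_max, max_words=100, max_sents=4):
    """Enumerate small combinations of candidate sentences and return the one
    minimizing the total D_max to all documents, per Eqs. (7)/(8)."""
    best, best_score = None, float("inf")
    for k in range(1, max_sents + 1):
        for combo in combinations(candidates, k):
            summary = " ".join(combo)
            if len(summary.split()) > max_words:
                continue
            score = sum(d_max(doc, summary) for doc in documents)
            if score < best_score:
                best, best_score = summary, score
    return best
```

With at most four sentences drawn from fewer than 70 candidates, the number of combinations enumerated stays below the C_{70}^{4} bound mentioned above.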
5.3 Evaluation Metrics
The ROUGE toolkit[24] (http://haydn.isi.edu/ROUGE/) is used for evaluation. It measures the quality of a summary by counting the overlapping units, such as n-grams, word sequences and word pairs, between the candidate summary and the reference summaries. ROUGE-N measures the n-gram recall as follows:

ROUGE-N = Σ_{S∈{RefSum}} Σ_{ngram∈S} Count_match(ngram) / Σ_{S∈{RefSum}} Σ_{ngram∈S} Count(ngram),

where n stands for the length of the n-gram, Count_match(ngram) is the number of n-grams co-occurring in an evaluated summary and a set of reference summaries, and Count(ngram) is the number of n-grams in the reference summaries. The ROUGE toolkit reports separate F-measure scores for 1, 2, 3 and 4-grams, and also for longest common subsequence co-occurrences. Among these different scores, the unigram-based ROUGE score (ROUGE-1) has been shown to agree with human judgment most[24]. Thus ROUGE-1 recall is used to evaluate our results.
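For illustration, a simplified version of the ROUGE-N recall above, without the tokenization, stemming and stop-word options of the official toolkit, can be written as follows.

```python
from collections import Counter

def ngrams(text, n):
    """Counter of word n-grams in a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, references, n=1):
    """ROUGE-N recall: matched reference n-grams / total reference n-grams."""
    cand = ngrams(candidate, n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        total += sum(ref_counts.values())
        matched += sum(min(count, cand[g]) for g, count in ref_counts.items())
    return matched / total if total else 0.0
```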
5.4 Results on the DUC/TAC 2007 to 2009 Datasets
First we show the results of our system on all datasets. We take the popular summarization method LexRank[10], as implemented in MEAD (http://www.summarization.com/mead/), as our baseline. We run the LexRank package on each set's "candidate sentences" (as described in Subsection 5.2), with the same preprocessing step as our method uses. Then our multi-document update summarization approach is applied, approximated by compression and by semantic element extraction, respectively. The results on the DUC/TAC 2007 to 2009 datasets are shown in Fig.1.

Fig.1. Results on the datasets. (a) DUC 2007. (b) TAC 2008. (c) TAC 2009.

From these figures we can draw three conclusions: 1) our methods, even when simply approximated by compression, always outperform the LexRank method; this means that the proposed framework is safe and sound, and already exhibits the potential to generate good summaries even with a very simple compression technique; 2) the results of approximation through semantic element extraction are much better than those of compression, which implies that semantic information is very important when approximating the information distance between sentences; and 3) our system performs almost equally well on traditional clusters (cluster A) and update clusters (clusters B and C), because the framework is universal to both cases.

5.5 A Sample Result
Here is a sample result of our system, taken from the DUC 2007 dataset. The input files are news articles about the development of Australia's uranium mine project in its Kakadu National Park and the protests and obstacles encountered. One of these ten articles begins as follows:

SYDNEY, Australia (AP) _ The opening of Australia's first new uranium mine in a decade continues to inch closer, despite the latest vow Wednesday by Aboriginal landowners to block the development. Energy Resources of Australia Ltd. said it expects to begin digging at the remote Jabiluka mining lease in the Northern Territory this month, after the territorial government on Tuesday signed authorizations allowing work to begin on the mine's entrance and shaft. ...

The summary output by our system is:

The Australian government today gave the green light to the country's controversial uranium mining plan, arguing that it would generate billions of U.S. dollars in revenue for Australians and create 2,000 jobs. Construction of Australia's first new uranium mine in a decade was under way Tuesday, after a protracted legal battle with Aboriginal land owners and amid continuing environmental protest. Police arrested more than 100 protesters Tuesday at Energy Resources of Australia's Jabiluka mine development at Kakadu, a world heritage listed area in the Northern Territory.

Manually written summaries for evaluation are provided by DUC 2007. One of them is as follows:

In 1997, the Australian government approved the development of the Jabiluka uranium mine. The mine is in Kakadu National Park, which is on the UNESCO World Heritage list of important cultural and environmental sites. The government's decision was condemned by environmentalists, who were concerned with the potential for consigning tons of radioactive waste in an environmentally sensitive area, and by opposition parties, who claimed the government ignored the wishes of Aboriginal people in the area. The government defended their decision saying the mine would generate millions of revenue dollars and create 2,000 jobs. Protestors were arrested at the mine site.

The system's summary and the manual summary both mention four important pieces of information about this news article cluster: 1) the Australian government approved the development of the Jabiluka uranium mine; 2) the government believed that the mine would generate millions of revenue dollars and create 2,000 jobs; 3) the government's decision was condemned by environmentalists; and 4) protestors were arrested by the police. Our output has a 45.2% overlap with this manual summary under the ROUGE-1 recall criterion.

5.6 Estimation Analysis
We noted in Subsection 4.3.2 that there are mainly three approximation biases: 1) approximating the
complexity of a sentence through its elements; 2) using document frequency to approximate an element's complexity; and 3) approximating E_m by its upper bound. It is important to note that although our theory can sometimes be specialized (or trivialized) into a known "frequency" estimation method as in [22], the theory allows other, better kinds of approximation (the comparison between our compression-based and semantic-element-based methods has already demonstrated this), and it is more general than any of these trivializations. In fact, it is possible to prove the optimality of our theory; alternatively, we justify our approach through experiments.

5.6.1 Different Elements

For the first bias, we check our method of estimating the distance among sentences through their elements. Two important steps in going from a sentence M to its element set M^* are recognizing words or phrases as entities and grouping them through reference resolution. Here three different methods are compared on the DUC 2007 dataset, as shown in Fig.2: (a) "Words" treats every word as an element; for example, "George Bush" is viewed as two different elements. (b) "Words+Entities": after entities are recognized, phrases such as "George Bush" are viewed as one element; other words such as "born" remain in the element set. (c) "Normalized": after reference resolution, "George Bush" and "George W. Bush", and "May 15, 2008" and "5/15/2008", are normalized into one element, respectively.

Fig.2. Comparison of different elements.

From this figure we can see that after entity recognition and reference resolution the performance improves remarkably, and we conjecture that this yields a more accurate approximation of the information distance.

5.6.2 Complexity Estimation

Pyramids[25] (see http://www1.cs.columbia.edu/~becky/DUC2006/2006-pyramid-guidelines.html) are used to study the second bias. Manually built pyramids provide a good way to investigate how human beings write summaries with important semantic units. A pyramid represents the opinions of multiple human summary writers, each of whom has written a reference summary for the input set of documents. Each semantic unit (usually a short sentence or a phrase) is called a Summary Content Unit (SCU). For example, in the pyramids provided by DUC/TAC 2007 to 2009, words or phrases contributed to the reference summaries are named "contributors". They are collected and grouped according to manual annotations. The following is an SCU example in an XML format:

<scu uid="22" label="Euro was scheduled to be launched on January 1, 1999">

This SCU has four contributors from four different reference summaries. The most important elements, "January 1999" and "Euro", exist in all four contributors. The more frequently an element occurs in the contributors, the more important it is.
In a pyramid, we take every contributor defined in the XML file as a document. Each element T in the pyramid has a weight w(T), computed as T's document frequency over the contributors. Compared with (11), w(T) might be a more accurate approximation of K(T) in that w(T) is weighted by the human-annotated SCUs. Letting K(T) = w(T), we get a group of better results, which are closer to the human summaries.

Fig.3. Comparisons on the DUC 2007 dataset.

In Fig.3 we have three groups of results on each topic of the DUC 2007 dataset: the brown ones are the average ROUGE-1 scores of the human-written summaries provided by the organizers (each set has four reference summaries written by four different people); the blue ones are the results of our approach with K(T) = w(T); and the pink ones are the results of our proposed approach with (11). From
this figure we have two observations. The first is that the approximation of K(T) by (11) yields results very close to those with K(T) = w(T); the latter takes into account the important semantic units that assessors attend to when writing summaries, so we believe this may be a right way to approach the unit weights of "perfect" summaries. The second is that our ROUGE-1 scores are close to those of the human-written summaries.
We further compare the average differences and Pearson correlations of the ROUGE-1 scores among three approximation methods for K(T) in Table 2. The first method is our approximation defined by (11) (PK(T) for short); the second is K(T) = w(T), as mentioned before; and the third is K(T) = constant, which means each T contains one unit of information (termed CK(T) for short). As shown in the table, our PK(T) correlates more with w(T) than CK(T) does, and the difference between PK(T)'s results and w(T)'s results is smaller than that between CK(T)'s results and w(T)'s results. This observation suggests that our approximation method already reflects the human summarization process to some extent, and we may conclude that our approximation is a reasonable computation schema for the theory. As to the third bias, we will study it in theory to find the upper bound of the difference between E_m and min_S Σ_i D_max(B_i, S|A).

Table 2. Differences and Correlations

Cluster    PK(T) vs. w(T)        CK(T) vs. w(T)
           Diff.     Cor.        Diff.     Cor.
07A        0.003     0.953       0.022     0.846
07B        0.015     0.933       0.022     0.921
07C        0.012     0.899       0.024     0.612
08A        0.013     0.889       0.020     0.782
08B        0.015     0.880       0.023     0.812
09A        0.009     0.935       0.019     0.818
09B        0.012     0.901       0.020     0.798
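For clarity, the Diff. and Cor. columns of Table 2 are simply the mean absolute difference and the Pearson correlation between two lists of per-topic ROUGE-1 scores; the scores themselves come from the experiments, and the sketch below only shows how the two statistics can be computed from such lists.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def average_difference(xs, ys):
    """Mean absolute difference between two equal-length lists of scores."""
    return mean(abs(x - y) for x, y in zip(xs, ys))
```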
5.7 Discussion
Many existing document summarization methods need to eliminate sentences with redundant content in extra post-processing steps. We have no post-processing step to remove redundancy as MMR methods do, for either the compression-based or the semantic-element-based approximation. The reasons are as follows. 1) In the compression-based approximation, K(xx) = K(x) + O(1) for a string x, according to the definition of Kolmogorov complexity. If another string y is very similar to x, then K(xy) ≈ K(x) ≈ K(y). The concatenation of redundant sentences contains nearly the same information as any one of them, so they will not all be included in the summaries generated by our system. 2) In the semantic-element-based approximation, the union of element sets is used to compute the information distance between sentences. If two sentences are redundant, the number of distinct elements in their union is smaller, which leads to a larger information distance from the two sentences to all the other sentences in the document cluster. Furthermore, each element is weighted by Kolmogorov complexity: if an element is observed more frequently in different documents, it contains more information and is more relevant to the topic of the document cluster; thus our method does not simply count overlapping terms to exclude redundant sentences. Therefore, our approach always tends to select as many sentences with different important elements (related to the topic, as defined in Subsection 5.2) as possible.
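The first point can be illustrated with the compression-based approximation, again using zlib as a stand-in for LZ77 (an illustrative check, not an experiment from this paper): duplicating a sentence barely increases its compressed size, so a near-duplicate sentence adds almost no information to a candidate summary.

```python
import zlib

def c(text: str) -> int:
    """Compressed size in bytes, a stand-in for K(.)."""
    return len(zlib.compress(text.encode("utf-8"), 9))

sentence = ("Police arrested more than 100 protesters at the Jabiluka "
            "mine development in the Northern Territory.")
print(c(sentence))              # size of one copy, approximating K(x)
print(c(sentence + sentence))   # K(xx): only marginally larger than K(x)
```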
6 Conclusion and Future Work
In this paper, we have proposed a novel document summarization framework based on the theory of information distance. Two approximation methods are used to estimate information distance, one by compression and the other through semantic element extraction. The very simple compression-based method has demonstrated that our framework is theoretically safe
and practically feasible for producing good summaries. Better approximation schemas which take more semantic information into account when estimating information distance, as exemplified by the semantic elements or summary content units in this paper, will contribute better results within this framework, and the framework readily embraces other approximation techniques. Experiments show that our approach performs well on the DUC/TAC 2007 to 2009 datasets. In future work, we will further improve our approach in two ways: first, better approximations of information distance will be studied; second, a heuristic method will be developed in order to find the best summary more efficiently.

References

[1] Luhn H P. The automatic creation of literature abstracts. IBM Journal of Research and Development, 1958, 2(2): 159-165.
[2] Wan X, Yang J, Xiao J. Manifold-ranking based topic-focused multi-document summarization. In Proc. IJCAI, Hyderabad, India, Jan. 6-12, 2007, pp.2903-2908.
[3] Li M, Vitanyi P M. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, 1997.
[4] Carbonell J, Goldstein J. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proc. SIGIR, Melbourne, Australia, Aug. 24-28, 1998, pp.335-336.
[5] Radev D R, Jing H, Stys M, Tam D. Centroid-based summarization of multiple documents. Information Processing and Management, 2004, 40(6): 919-938.
[6] Kupiec J, Pedersen J, Chen F. A trainable document summarizer. In Proc. SIGIR, Seattle, USA, Jul. 9-13, 1995, pp.68-73.
[7] Leskovec J, Milic-Frayling N, Grobelnik M. Impact of linguistic analysis on the semantic graph coverage and learning of document extracts. In Proc. AAAI, Pittsburgh, USA, Jul. 9-13, 2005, pp.1069-1074.
[8] Shen D, Sun J T, Li H, Yang Q, Chen Z. Document summarization using conditional random fields. In Proc. IJCAI, Hyderabad, India, Jan. 6-12, 2007, pp.2862-2867.
[9] Zhang J, Cheng X, Wu G, Xu H. AdaSum: An adaptive model for summarization. In Proc. CIKM, Napa Valley, USA, Oct. 26-30, 2008, pp.901-909.
[10] Erkan G, Radev D R. LexPageRank: Prestige in multi-document text summarization. In Proc. EMNLP, Barcelona, Spain, Jul. 25-26, 2004, pp.365-371.
[11] Mihalcea R, Tarau P. TextRank: Bringing order into texts. In Proc. EMNLP, Barcelona, Spain, Jul. 25-26, 2004, pp.119-126.
[12] Mihalcea R, Tarau P. A language independent algorithm for single and multiple document summarization. In Proc. IJCNLP, Jeju Island, Korea, Oct. 11-13, 2005, pp.19-24.
[13] Wan X, Yang J, Xiao J. Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. In Proc. ACL, Prague, Czech Republic, Jun. 23-30, 2007, pp.552-559.
[14] Wan X. An exploration of document impact on graph-based multi-document summarization. In Proc. EMNLP, Hawaii, USA, Oct. 25-27, 2008, pp.755-762.
[15] Bennett C H, Gacs P, Li M, Vitanyi P M, Zurek W H. Information distance. IEEE Transactions on Information Theory, Jul. 1998, 44(4): 1407-1423.
[16] Li M, Badger J H, Chen X, Kwong S, Kearney P, Zhang H. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 2001, 17(2): 149-154.
[17] Li M, Chen X, Li X, Ma B, Vitanyi P M. The similarity metric. IEEE Transactions on Information Theory, 2004, 50(12): 3250-3264.
[18] Long C, Zhu X, Li M, Ma B. Information shared by many objects. In Proc. CIKM, Napa Valley, USA, Oct. 26-30, 2008, pp.1213-1220.
[19] Benedetto D, Caglioti E, Loreto V. Language trees and zipping. Physical Review Letters, Jan. 2002, 88(4): 048702.
[20] Bennett C H, Li M, Ma B. Chain letters and evolutionary histories. Scientific American, Jun. 2003, 288(6): 76-81.
[21] Cilibrasi R L, Vitanyi P M. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering, Mar. 2007, 19(3): 370-383.
[22] Zhang X, Hao Y, Zhu X, Li M. Information distance from a question to an answer. In Proc. SIGKDD, San Jose, USA, Aug. 12-15, 2007, pp.874-883.
[23] Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 1977, 23(3): 337-343.
[24] Lin C Y, Hovy E. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proc. HLT-NAACL, Edmonton, Canada, May 27-June 1, 2003, pp.71-78.
[25] Nenkova A, Passonneau R, McKeown K. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing, Apr. 2007, 4(2): 1-23.
Chong Long received his B.E. degree from Tsinghua University, China in 2005. He is a Ph.D. candidate in the Department of Computer Science and Technology, Tsinghua University, China. His research interests include Kolmogorov complexity and its applications, text mining and natural language processing.

Min-Lie Huang is now a faculty member of the Department of Computer Science and Technology, Tsinghua University. He received his Ph.D. degree from Tsinghua University in 2006. His research interests include machine learning, natural language processing, graph-based text mining, opinion and review mining, and complex question answering.
Xiao-Yan Zhu is a professor and the Deputy Head of the State Key Lab of Intelligent Technology and Systems, Tsinghua University. She obtained her Bachelor's degree from the University of Science and Technology Beijing in 1982, her Master's degree from Kobe University in 1987, and her Ph.D. degree from the Nagoya Institute of Technology, Japan in 1990. She has been teaching at Tsinghua University since 1993. Her research interests include pattern recognition, neural networks, machine learning, natural language processing and bioinformatics. She is a member of CCF.
Ming Li is a Canada Research Chair in Bioinformatics and a University Professor at the University of Waterloo. He is a fellow of the Royal Society of Canada, ACM, and IEEE. He is a recipient of the E.W.R. Steacie Fellowship Award in 1996 and the 2001 Killam Fellowship. Together with Paul Vitanyi, he pioneered the applications of Kolmogorov complexity and co-authored the book "An Introduction to Kolmogorov Complexity and Its Applications". His recent research interests include protein structure determination and next-generation Internet search engines.