Long C, Huang ML, Zhu XY et al. A new approach for update multi-document summarization. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 25(4): 739–749 July 2010. DOI 10.1007/s11390-010-1057-8
A New Approach for Multi-Document Update Summarization

Chong Long1, Min-Lie Huang1, Xiao-Yan Zhu1,* (Member, CCF), and Ming Li2 (Fellow, ACM, IEEE)

1 State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
2 School of Computer Science, University of Waterloo, Waterloo N2L 3G1, Canada

E-mail: [email protected]; [email protected]; [email protected]; [email protected]

Received October 22, 2009; revised April 8, 2010.

Abstract    Fast changing knowledge on the Internet can be acquired more efficiently with the help of automatic document summarization and updating techniques. This paper describes a novel approach for multi-document update summarization. The best summary is defined to be the one which has the minimum information distance to the entire document set. The best update summary has the minimum conditional information distance to a document cluster given that a prior document cluster has already been read. Experiments on the DUC/TAC 2007 to 2009 datasets (http://duc.nist.gov/, http://www.nist.gov/tac/) show that our method correlates closely with the human summaries and outperforms other programs such as LexRank in many categories under the ROUGE evaluation criterion.

Keywords    data mining, text mining, Kolmogorov complexity, information distance

1 Introduction
Automated summarization dates back to the 1950s[1]. In recent years, as web content has grown at an increasing speed, people need a concise overview of a large set of articles in a short time. Document summarization, which aims at generating brief and understandable summaries, has therefore quickly become a hot research topic. Document updating techniques are also very helpful for acquiring new information or knowledge by eliminating out-of-date or redundant information. Multi-document update summarization was introduced by the Document Understanding Conference (DUC) in 2007. It aims to produce a summary describing the majority of the information content of a set of documents under the assumption that the user has already read a given set of earlier documents. This type of summarization has proved extremely useful in tracing news stories: only new and updated content needs to be summarized if we already know something about the story. On the news service website NewsBlaster (http://newsblaster.cs.columbia.edu/), news articles are grouped into several topics, and a great number of articles have two summaries: one is the whole story of the topic, and the other tells readers what has happened recently.
For example, consider ten news articles about the development of Australia's uranium mine project in its Kakadu National Park and the protests and obstacles encountered. A good summary should cover four aspects: 1) How is the project progressing? 2) What is the attitude of the government? 3) Where are the protests and obstacles coming from? 4) How does the government deal with these problems? As exemplified, a good summary is expected to preserve the information contained in the documents as much as possible, and at the same time keep the information as novel as possible[2].
Information distance is based on the theory of Kolmogorov complexity. Kolmogorov complexity was introduced almost half a century ago[3], and it is now widely accepted as an information theory for individual objects, parallel to Shannon's information theory, which is defined on an ensemble of objects. In this paper, we propose a novel document summarization approach based on the theory of information distance among many objects. In order to deal with update summarization, we extend the information distance theory to conditional information distance among many objects. Finally, summaries are generated according to our newly developed theory.
Regular Paper
The work was supported by the National Natural Science Foundation of China under Grant No. 60973104, the National Basic Research 973 Program of China under Grant No. 2007CB311003, and the IRCI Project from IDRC, Canada.
* Corresponding Author
2010 Springer Science + Business Media, LLC & Science Press, China
Our paper's main contributions lie in two aspects. 1) We provide a framework in which multi-document summarization can be modeled by the information distance theory: the best summary is defined as the one having the minimal information distance (or conditional information distance, if a prior document set is given) to the entire document set. 2) We provide two feasible ways to implement this framework, one by compression and the other through semantic element extraction. Extensive experiments on the DUC/TAC 2007 to 2009 datasets show that the proposed method outperforms state-of-the-art systems.
The paper is organized as follows. Section 2 discusses related studies. Kolmogorov complexity and information distance are reviewed and the conditional information distance among many objects is described in Section 3. Section 4 presents our proposed summarization method based on conditional information distance among many objects, and the experiments in Section 5 demonstrate the novelty and advantages of our work. Conclusions and future work are outlined in Section 6.

2 Related Work
Our work aims at summarizing documents using information distance, thus there are mainly two related topics: document summarization and the theory of information distance. In this section, we introduce related studies on these two topics, respectively.

2.1 Document Summarization
Generally speaking, there are two main kinds of document summarization methods: extraction-based and abstraction-based. Here we focus on extractive summarization. Most extractive summarization studies have focused on NLP and statistical machine learning techniques. Carbonell and Goldstein proposed to use Maximal Marginal Relevance (MMR), which aims to select summary sentences relevant to the user query and least similar to previously chosen ones[4]. Radev et al. described an extractive multi-document summarizer which extracts a summary from multiple documents based on the document cluster centroids[5]. Researchers have also proposed a number of machine learning methods, including Naive Bayes[6], SVM[7], Conditional Random Fields[8], and Adaboost[9], to search for sentences related to a topic, to extract the most representative sentences, or both. Most recently, graph-based methods have been proposed for document summarization, such as LexRank[10] and TextRank[11-12]. Wan et al.
improved graph models by studying the relationship between different semantic granularities[13-14]. Different from all the previous summarization methods, we propose a novel summarization approach based on the information distance theory. Next we introduce related work on the theory of information distance and its applications in the text mining area.

2.2 Information Distance and Its Applications in Text Mining
Kolmogorov complexity was introduced in 1963 by Solomonoff et al.[3] Then Bennett et al. defined information distance to measure the information between two objects[15]. Li et al. defined normalized information distances in [16-17] and showed how to use them to compare documents and genomes. Long et al. generalized the theory of information distance to more than two objects[18]. Since Kolmogorov complexity is not computable, there are mainly two different methods of approximation: one by compression and the other through semantic information. More details about this theory are given in Section 3.
The theory of information distance has been applied in the text mining area since 2002. Bennett et al. first approximated information distance through compression. They traced out language history[17,19] and chain letter history[20]: if two languages or chain letters are similar to each other, they share similar words and phrases and are able to compress each other well. In 2007, Cilibrasi and Vitanyi measured the information distance between words and phrases through their semantic information[21], considered through these words' probabilities and frequencies. Then Zhang et al. proposed a method to measure the information distance from a question to an answer in a question answering (QA) system[22]. Long et al. selected typical reviews that are close to the other reviews in information distance[18].
Our document summarization approach is based on a new theory of conditional information distance among many objects. It provides a general framework that handles both the traditional summarization task (summarizing a single document set) and the update summarization task (summarizing document cluster B when given cluster A). In the next section, the theories of Kolmogorov complexity and information distance are reviewed and our extended theory is introduced.

3 Theory

Fix a universal Turing machine U. The Kolmogorov
complexity[3] of a binary string x conditioned on another binary string y, K_U(x|y), is the length of the shortest (prefix-free) program for U that outputs x with input y. It can be shown that for a different universal Turing machine U', for all x, y,

K_U(x|y) = K_{U'}(x|y) + C,

where the constant C depends only on U'. Thus K_U(x|y) can simply be written as K(x|y). We write K(x|ε), where ε is the empty string, as K(x).
It has also been defined in [15] that the energy to convert between x and y is the smallest number of bits needed to convert x to y and vice versa. That is, with respect to a universal Turing machine U, the cost of conversion between x and y is:

E(x, y) = min{|p| : U(x, p) = y, U(y, p) = x}.   (1)

It is clear that E(x, y) ≤ K(x|y) + K(y|x). From this observation, the following theorem has been proved in [15].

Theorem 1. E(x, y) = max{K(x|y), K(y|x)}.

Thus, the max distance was defined in [15]:

D_max(x, y) = max{K(x|y), K(y|x)}.   (2)

This distance is shown to satisfy the basic distance requirements such as positivity, symmetry, and the triangle inequality, and it is admissible[15]. D_max(x, y) satisfies these requirements because of Kraft's Inequality (with the prefix-free version of Kolmogorov complexity). It has been proved in [15] that for any admissible computable distance D, there is a constant c such that for all x, y, D_max(x, y) ≤ D(x, y) + c. Putting it bluntly, if any such distance D discovers some similarity between x and y, so will D_max[15].
Here, for an object x, we can measure its information by the Kolmogorov complexity K(x); for two objects x and y, their shared information can be measured by the information distance D(x, y). In [18], the authors generalize the theory of information distance to more than two objects. Similar to (1), given strings x_1, ..., x_n, they define the minimal amount of thermodynamic energy needed to convert any x_i to any x_j as:

E_m(x_1, ..., x_n) = min{|p| : U(x_i, p, j) = x_j for all i, j}.   (3)

Then it is proved in [18] that:

Theorem 2. Modulo an O(log n) additive factor,

min_i K(x_1 ... x_n | x_i) ≤ E_m(x_1, ..., x_n) ≤ min_i Σ_{k≠i} D_max(x_i, x_k).   (4)

In update summarization, the summary should contain new information which the former documents have not mentioned, so (3) is extended to:

E_m(x_1, ..., x_n | c) = min{|p| : U(x_i, p, j | c) = x_j for all i, j},   (5)

where c is the conditional sequence that is given for free when computing from x_i to x_j and back. Similar to (4):

Theorem 3. Modulo an O(log n) additive factor,

min_i K(x_1 ... x_n | x_i, c) ≤ E_m(x_1, ..., x_n | c) ≤ min_i Σ_{k≠i} D_max(x_i, x_k | c).   (6)

Given n objects and a conditional sequence c, the left-hand side of (6) may be interpreted as the most comprehensive object, which contains the most information about all of the others. The right-hand side of (6) may be interpreted as the most typical object, which is similar to all of the others.

4 Summarization Approach
We have developed the theory of conditional information distance among many objects. In this section, a new summarization model is first built based on this theory, and then we develop methods to approximate Kolmogorov complexity and information distance in two different ways.

4.1 Modeling
4.1.1 Modeling Traditional Summarization

The task of traditional multi-document summarization can be described as follows: given n documents B = {B_1, B_2, ..., B_n}, the task requires the system to generate a summary S of B. According to our theory, the information distance among B_1, B_2, ..., B_n is E_m(B). However, it is very difficult to compute E_m. Moreover, E_m itself does not tell us how to generate a summary. Eq. (4) provides a feasible way to approximate E_m: the most comprehensive object and the most typical one are the lower and upper bounds of E_m, respectively. The most comprehensive object is long enough to cover as much information in B as possible, while the most typical object is a concise one that expresses the most common idea shared by those objects. Since we aim to produce a short summary representing the general information, the right-hand side of (4) should be used. The most typical document is the B_j such that

min_j Σ_{i≠j} D_max(B_i, B_j).

However, B_j alone is far from enough to be a good summary. A good method should be able to select information from B_1 to B_n to form the best S. We view this S as a document in the set. Since S is a short summary, it does not contain extra information outside B. The best traditional summary S_trad should satisfy the constraint:

S_trad = arg min_S Σ_i D_max(B_i, S).   (7)
In most applications, the length of S is confined by |S| ≤ θ (θ is a constant integer) or |S| ≤ α Σ_i |B_i| (α is a constant real number between 0 and 1).

4.1.2 Modeling Update Summarization

Given a set of m earlier articles A = {A_1, A_2, ..., A_m}, the update summarization task is to summarize the new contents presented by a document set B = {B_1, B_2, ..., B_n}. The earlier article set A can be viewed as a precondition. Thus this task can be well modeled by the conditional version of information distance. The best summary S_best should satisfy the constraint:

S_best = arg min_S Σ_i D_max(B_i, S | A).   (8)

If m = 0 (A = ∅), this is the traditional multi-document summarization problem; if m > 0 (A ≠ ∅), it is the multi-document update summarization problem. Therefore, traditional summarization can be viewed as a special case of (8). According to [22], from (8) we can get:

D_max(B_i, S | A) = D_max(B_i^A, S | A) = D_max(B_i^A, S),

where B_i is mapped to B_i^A under the condition of A. For a document B_i and a document set A, B_i^A is the set of B_i's sentences (B_{i,k}'s) which are different from all the sentences in A_1 to A_m:

B_i^A = {B_{i,k} | ∀ sen ∈ ∪_i A_i, D_max(B_{i,k}, sen) > φ},

where A_i is the sentence set of document A_i and φ is a threshold. Note that φ is the only parameter to be specified in our approach and it is only related to update summarization clusters. We tune it on the B cluster of the DUC 2007 dataset under the ROUGE-1 criterion.
We have now developed a framework for summarization. However, the problem is that neither K(·) nor D_max(·, ·) is computable. As mentioned in Subsection 2.2, two methods can be used in the approximation and computation of information distance: one by compression and the other through semantic element extraction. In the next several subsections, we discuss how to use these two methods to do the approximations, respectively.
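The sentence filtering that produces B_i^A is straightforward to prototype. The following Python fragment is only a minimal sketch, not the implementation used in our experiments; it assumes a sentence-level distance function d_max (supplied by either approximation described below) and a threshold phi, both of which are placeholders here.

```python
def update_filter(doc_sentences, prior_sentences, d_max, phi):
    """B_i^A: keep only the sentences of B_i whose distance to every sentence
    of the prior cluster A exceeds the threshold phi, i.e., the new content."""
    return [s for s in doc_sentences
            if all(d_max(s, a) > phi for a in prior_sentences)]
```

When A is empty, all(...) over an empty set is true, so the filter keeps every sentence and (8) reduces to the traditional case (7).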
4.2 Approximation by Compression
From (1) we can get

E(x, y) = max{K(x|y), K(y|x)}
        = max{K(x, y) − K(x), K(x, y) − K(y)}
        = K(x, y) − min{K(x), K(y)},

where K(x, y) is the length of the shortest program which outputs both x and y. In [16], the authors proposed to approximate Kolmogorov complexity through a compressor C. The boundary case is C = K if C is powerful enough. In practice we use a real-world reference compressor C to approximate the information distance E(x, y). The compression distance E_C(x, y) is defined as

E_C(x, y) = C(xy) − min{C(x), C(y)},

where C(xy) denotes the compressed size of the concatenation of x and y, and C(x) and C(y) denote the compressed sizes of x and y, respectively. Then E_C(x, y) is an approximation of D_max:

D_max(x, y) = E(x, y) ≈ E_C(x, y).   (9)
Here we take LZ77[23] as the compressor, as in [19]. E_C is computed by compressing pairs of sentences. For example, consider two sentences x and y: x = "The trial of Caldera Inc.'s antitrust lawsuit against Microsoft Corp. has been postponed from June until next January.", and y = "Business seemed as usual at Microsoft headquarters on the first day of the U.S. government's antitrust trial against this software giant."; then xy = "The trial of Caldera Inc.'s antitrust lawsuit against Microsoft Corp. has been postponed from June until next January. Business seemed as usual at Microsoft headquarters on the first day of the U.S. government's antitrust trial against this software giant." Through LZ77 we get C(x) = 171, C(y) = 182, C(xy) = 292 and E_C(x, y) = 121 (bytes).
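For illustration, the same computation can be sketched with a general-purpose LZ77-family compressor such as zlib (DEFLATE). This is only a sketch, not the exact compressor configuration used in our experiments, and the absolute sizes will differ from the LZ77 figures above, but the distance behaves analogously.

```python
import zlib

def compressed_size(text: str) -> int:
    """Approximate K(text) by the size in bytes of its zlib-compressed form."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def compression_distance(x: str, y: str) -> int:
    """E_C(x, y) = C(xy) - min{C(x), C(y)}, an approximation of D_max(x, y)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + " " + y)
    return cxy - min(cx, cy)

if __name__ == "__main__":
    x = ("The trial of Caldera Inc.'s antitrust lawsuit against Microsoft Corp. "
         "has been postponed from June until next January.")
    y = ("Business seemed as usual at Microsoft headquarters on the first day "
         "of the U.S. government's antitrust trial against this software giant.")
    print(compression_distance(x, y))  # similar sentences give a smaller distance
```

Because DEFLATE only finds matches within a 32 KB window, such compression distances are most meaningful for short texts like sentences, which is how they are used here.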
4.3 Approximation by Semantic Element Extraction
The compression method is language-independent. It is easy to implement, and it can summarize documents written in any language without modification. However, this method only uses the morphological features of a sentence; the semantic meanings of terms or phrases are largely neglected by simple compression. Alternatively, we can apply an approximation method based on semantic information: first we divide a sentence into semantic elements; then the information distance between two sentences is estimated through their semantic element sets.

4.3.1 Semantic Element Extraction

In a document, each word or entity contains a certain amount of information, and the information varies according to the word or entity's importance. Such words or entities are called "semantic elements", or "elements" for short in this paper. There are two types of elements: 1) named entities such as person, organization, time, and location, which carry a large portion of the meaningful information; and 2) common words other than stop-words. For example, the meaningful elements of the sentence "George W. Bush was born on July 6th, 1946" are "George W. Bush" (person), "born" and "July 6th, 1946" (time).
First, we recognize entities about a person, location, organization and other names with Stanford Named Entity Recognition (NER, http://nlp.stanford.edu/ner/index.shtml). We also extract entities about a date or time with patterns. In total, five types of named entities are recognized. Second, words or phrases with the same meaning are normalized into one entity through coreference resolution. For example, "George W. Bush" and "George Bush" are normalized to the same entity; "May 15th, 2008", "May 15, 2008" and "5/15/2008" are recognized as the same date.

4.3.2 Information Distance Approximation

Next we take several steps to do the approximations. Although some steps contain rough approximations, we investigate the influence of these estimations with extensive experiments in Subsection 5.6. Let M = {M_1, M_2, ...} and N = {N_1, N_2, ...} be two sets of sentences. After the steps mentioned in Subsection 4.3.1, each sentence M_i (or N_j) has an element set M_i^* (or N_j^*). According to (2),

D_max(M, N) = max{K(M|N), K(N|M)},

then

K(M|N) ≈ K(∪_i M_i^* \ ∪_j N_j^*),
K(N|M) ≈ K(∪_j N_j^* \ ∪_i M_i^*).   (10)

The Kolmogorov complexity of an element set W can be computed as the sum of the complexities of all its elements:

K(W) = Σ_{w∈W} K(w).

We can use frequency counts and the Shannon-Fano code[21] to encode a semantic element that occurs with probability p in approximately −log p bits, obtaining a short description. A semantic element's probability p can usually be approximated by its document frequency in the corpus:

K(w) = −log P(w) ≈ −log df(w).   (11)
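A minimal sketch of (10) and (11) in Python follows. It assumes the element sets M_i^* and N_j^* have already been extracted and normalized (the NER and coreference pipeline itself is not shown), and that document frequencies come from the corpus as relative frequencies; these names and the smoothing are illustrative assumptions, not our exact implementation.

```python
import math
from typing import Dict, List, Set

def element_complexity(element: str, doc_freq: Dict[str, float], n_docs: int) -> float:
    """K(w) ~= -log P(w), with P(w) estimated from document frequency (Eq. (11))."""
    p = doc_freq.get(element, 1) / n_docs  # smoothed relative document frequency
    return -math.log(p)

def set_complexity(elements: Set[str], doc_freq: Dict[str, float], n_docs: int) -> float:
    """K(W) as the sum of the complexities of all elements in W."""
    return sum(element_complexity(e, doc_freq, n_docs) for e in elements)

def d_max_semantic(m_sets: List[Set[str]], n_sets: List[Set[str]],
                   doc_freq: Dict[str, float], n_docs: int) -> float:
    """D_max(M, N) = max{K(M|N), K(N|M)}, approximated via element-set differences (Eq. (10))."""
    m_union = set().union(*m_sets) if m_sets else set()
    n_union = set().union(*n_sets) if n_sets else set()
    k_m_given_n = set_complexity(m_union - n_union, doc_freq, n_docs)
    k_n_given_m = set_complexity(n_union - m_union, doc_freq, n_docs)
    return max(k_m_given_n, k_n_given_m)
```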
Although this method contains semantic information, there are mainly three steps which may lead to an approximation bias during the process of generating a summary: 1) when the complexity between two sentences is computed through their elements' complexities in (10); 2) when an element's complexity is estimated by its document frequency in (11); and 3) when E_m is approximated by its upper bound. In Subsection 5.6 we analyze how these approximations affect our system's performance.

5 Experimental Results
In this section, our summarization method is evaluated on the DUC/TAC 2007 to 2009 datasets. First, the datasets, the preprocessing method and the ROUGE evaluation criterion are introduced. Then our results are compared with LexRank on the DUC/TAC 2007 to 2009 update summarization tasks. Finally, pyramid units, which are important semantic units written by annotators in the reference summaries, are used to justify our complexity approximation method based on semantic element extraction.

5.1 Datasets
The Document Understanding Conference (DUC), renamed the Text Analysis Conference (TAC) in 2008, has been the major forum for comparing summarization systems on a shared test set. Every year several summarization topics are released, and the summaries
produced by participants are evaluated, both automatically and manually. In DUC/TAC 2007 to 2009, one of the tasks is "update summarization". The task requires participants to write a short summary of a set of newswire articles, under the assumption that the user has already read a given set of earlier articles. The length of the summary should be no more than 100 words.

Table 1. Summary of Datasets

Year                           2007   2008   2009
No. Topics                       10     48     44
No. Clusters for Each Topic       3      2      2
No. Documents                   250    960    880

Table 1 shows the statistics of the datasets. In DUC 2007, there are 10 topics, each of which has three clusters: A, B and C. Each cluster has about 10 news articles. Summarizing cluster A is a traditional summarization task. Summarizing cluster B requires generating an update summary given that A has already been read. The summary for cluster C should be updated with respect to clusters A and B. In TAC 2008 and 2009, there are 48 and 44 topics, respectively, with two clusters (A and B) for each topic. The update task is similar to that in DUC 2007. For each cluster, there are four standard summaries written by four different people. The score on a cluster is the mean of the scores against the four manual summaries. The overall score on a dataset is the mean of the scores on its clusters.

5.2 Preprocessing
During preprocessing, we need to filter out those sentences which cannot possibly be part of a summary. The top 10% of entities with the highest document frequency in a document set are viewed as "the topic set". Sentences which do not contain any entity of the topic set are eliminated, and the remaining ones are called "candidate sentences".
After preprocessing, we search for the combination of candidate sentences that makes (8) minimal. In the datasets described in Subsection 5.1, each set has on average 9.6 documents with fewer than 300 sentences, and fewer than 70 candidate sentences are selected on average. As the length of the summary is less than 100 words, there are on average three to four sentences per summary. Therefore, there are fewer than C_{70}^{4} < 10^6 different combinations in total. This is a small number, and our approach can generate the summaries in real time. The exhaustive enumeration is simple but obtains the optimal result. We will develop a heuristic search algorithm to generate longer summaries for larger document sets in our future work.
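A minimal sketch of this exhaustive search, assuming the candidate sentences, the (filtered) documents and a document-to-summary distance d_max (from either approximation in Section 4) are given, follows; the word-limit handling and the function names are illustrative, not our exact implementation.

```python
from itertools import combinations

def generate_summary(candidates, documents, d_max, max_words=100, max_sents=4):
    """Enumerate small combinations of candidate sentences and return the one
    minimizing the total D_max to all documents, per Eqs. (7)/(8)."""
    best, best_score = None, float("inf")
    for k in range(1, max_sents + 1):
        for combo in combinations(candidates, k):
            summary = " ".join(combo)
            if len(summary.split()) > max_words:
                continue
            score = sum(d_max(doc, summary) for doc in documents)
            if score < best_score:
                best, best_score = summary, score
    return best
```

With at most four sentences drawn from fewer than 70 candidates, the number of combinations enumerated stays below the C_{70}^{4} bound mentioned above.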
5.3 Evaluation Metrics
The ROUGE toolkit[24] (http://haydn.isi.edu/ROUGE/) is used for evaluation. It measures the quality of a summary by counting the overlapping units, such as n-grams, word sequences and word pairs, between the candidate summary and the reference summaries. ROUGE-N measures the n-gram recall as follows:

ROUGE-N = Σ_{S∈{RefSum}} Σ_{ngram∈S} Count_match(ngram) / Σ_{S∈{RefSum}} Σ_{ngram∈S} Count(ngram),

where n stands for the length of the n-gram, Count_match(ngram) is the number of n-grams co-occurring in an evaluated summary and a set of reference summaries, and Count(ngram) is the number of n-grams in the reference summaries. The ROUGE toolkit reports separate F-measure scores for 1, 2, 3 and 4-grams, and also for longest common subsequence co-occurrences. Among these different scores, the unigram-based ROUGE score (ROUGE-1) has been shown to agree with human judgment most[24]. Thus ROUGE-1 recall is used to evaluate our results.
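For illustration, a simplified version of the ROUGE-N recall above, without the tokenization, stemming and stop-word options of the official toolkit, can be written as follows.

```python
from collections import Counter

def ngrams(text, n):
    """Counter of word n-grams in a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, references, n=1):
    """ROUGE-N recall: matched reference n-grams / total reference n-grams."""
    cand = ngrams(candidate, n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        total += sum(ref_counts.values())
        matched += sum(min(count, cand[g]) for g, count in ref_counts.items())
    return matched / total if total else 0.0
```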
5.4 Results on the DUC/TAC 2007 to 2009 Datasets
First we show the results of our system on all datasets. We take the popular summarization method LexRank[10], as implemented in MEAD (http://www.summarization.com/mead/), as our baseline. We run the LexRank package on each set's "candidate sentences" (as described in Subsection 5.2), with the same preprocessing step as our method uses. Then our multi-document update summarization approach is applied, approximated by compression and by semantic element extraction, respectively. The results on the DUC/TAC 2007 to 2009 datasets are shown in Fig.1.

Fig.1. Results on the datasets. (a) DUC 2007. (b) TAC 2008. (c) TAC 2009.

From these figures we can draw three conclusions: 1) our methods, even when simply approximated by compression, always outperform the LexRank method; this means that the proposed framework is safe and sound, and already exhibits the potential to generate good summaries even with a very simple compression technique; 2) the results of approximation through semantic element extraction are much better than those of compression, which implies that semantic information is very important when approximating the information distance between sentences; and 3) our system performs almost equally well on traditional clusters (cluster A) and update clusters (clusters B and C), because the framework is universal to both cases.

5.5 A Sample Result
Here is a sample result of our system, taken from the DUC 2007 dataset. The input files are news articles about the development of Australia's uranium mine project in its Kakadu National Park and the protests and obstacles encountered. One of these ten articles begins as follows:

SYDNEY, Australia (AP) _ The opening of Australia's first new uranium mine in a decade continues to inch closer, despite the latest vow Wednesday by Aboriginal landowners to block the development. Energy Resources of Australia Ltd. said it expects to begin digging at the remote Jabiluka mining lease in the Northern Territory this month, after the territorial government on Tuesday signed authorizations allowing work to begin on the mine's entrance and shaft. ...

The summary output by our system is:

The Australian government today gave the green light to the country's controversial uranium mining plan, arguing that it would generate billions of U.S. dollars in revenue for Australians and create 2,000 jobs. Construction of Australia's first new uranium mine in a decade was under way Tuesday, after a protracted legal battle with Aboriginal land owners and amid continuing environmental protest. Police arrested more than 100 protesters Tuesday at Energy Resources of Australia's Jabiluka mine development at Kakadu, a world heritage listed area in the Northern Territory.

Manually written summaries for evaluation are provided by DUC 2007. One of them is as follows:

In 1997, the Australian government approved the development of the Jabiluka uranium mine. The mine is in Kakadu National Park, which is on the UNESCO World Heritage list of important cultural and environmental sites. The government's decision was condemned by environmentalists, who were concerned with the potential for consigning tons of radioactive waste in an environmentally sensitive area, and by opposition parties, who claimed the government ignored the wishes of Aboriginal people in the area. The government defended their decision saying the mine would generate millions of revenue dollars and create 2,000 jobs. Protestors were arrested at the mine site.

The system's summary and the manual summary both mention four important pieces of information about this news article cluster: 1) the Australian government approved the development of the Jabiluka uranium mine; 2) the government believed that the mine would generate millions of revenue dollars and create 2,000 jobs; 3) the government's decision was condemned by environmentalists; and 4) protestors were arrested by the police. Our output has a 45.2% overlap with this manual summary under the ROUGE-1 recall criterion.

5.6 Estimation Analysis
We noted in Subsection 4.3.2 that there are mainly three approximation biases: 1) approximating the
complexity of a sentence through its elements; 2) using document frequency to approximate an element's complexity; and 3) approximating E_m by its upper bound. It is important to note that although our theory can sometimes be specialized (or trivialized) into a known "frequency" estimation method as in [22], the theory allows other, better kinds of approximation (the comparison between our compression-based and semantic-element-based methods has already demonstrated this), and it is more general than any of these trivializations. In fact, it is possible to prove the optimality of our theory; alternatively, we justify our approach through experiments.

5.6.1 Different Elements

For the first bias, we check our method of estimating the distance among sentences through their elements. Two important steps in going from a sentence M to its element set M^* are recognizing words or phrases as entities and grouping them through reference resolution. Here three different methods are compared on the DUC 2007 dataset, as shown in Fig.2: (a) "Words" treats every word as an element; for example, "George Bush" is viewed as two different elements. (b) "Words+Entities": after entities are recognized, phrases such as "George Bush" are viewed as one element; other words such as "born" remain in the element set. (c) "Normalized": after reference resolution, "George Bush" and "George W. Bush", and "May 15, 2008" and "5/15/2008", are normalized into one element, respectively.

Fig.2. Comparison of different elements.

From this figure we can see that after entity recognition and reference resolution the performance improves remarkably, and we conjecture that this yields a more accurate approximation of the information distance.

5.6.2 Complexity Estimation

Pyramids[25] (see http://www1.cs.columbia.edu/~becky/DUC2006/2006-pyramid-guidelines.html) are used to study the second bias. Manually built pyramids provide a good way to investigate how human beings write summaries with important semantic units. A pyramid represents the opinions of multiple human summary writers, each of whom has written a reference summary for the input set of documents. Each semantic unit (usually a short sentence or a phrase) is called a Summary Content Unit (SCU). For example, in the pyramids provided by DUC/TAC 2007 to 2009, words or phrases contributed to the reference summaries are named "contributors". They are collected and grouped according to manual annotations. The following is an SCU example in an XML format:

<scu uid="22" label="Euro was scheduled to be launched on January 1, 1999">

This SCU has four contributors from four different reference summaries. The most important elements, "January 1999" and "Euro", exist in all four contributors. The more frequently an element occurs in the contributors, the more important it is.
In a pyramid, we take every contributor defined in the XML file as a document. Each element T in the pyramid has a weight w(T), computed as T's document frequency over the contributors. Compared with (11), w(T) might be a more accurate approximation of K(T) in that w(T) is weighted by the human-annotated SCUs. Letting K(T) = w(T), we get a group of better results, which are closer to the human summaries.

Fig.3. Comparisons on the DUC 2007 dataset.

In Fig.3 we have three groups of results on each topic of the DUC 2007 dataset: the brown ones are the average ROUGE-1 scores of the human-written summaries provided by the organizers (each set has four reference summaries written by four different people); the blue ones are the results of our approach with K(T) = w(T); and the pink ones are the results of our proposed approach with (11). From
this figure we have two observations. The first is that the approximation of K(T) by (11) yields results very close to those with K(T) = w(T); the latter takes into account the important semantic units that assessors attend to when writing summaries, so we believe this may be a right way to approach the unit weights of "perfect" summaries. The second is that our ROUGE-1 scores are close to those of the human-written summaries.
We further compare the average differences and Pearson correlations of the ROUGE-1 scores among three approximation methods for K(T) in Table 2. The first method is our approximation defined by (11) (PK(T) for short); the second is K(T) = w(T), as mentioned before; and the third is K(T) = constant, which means each T contains one unit of information (termed CK(T) for short). As shown in the table, our PK(T) correlates more with w(T) than CK(T) does, and the difference between PK(T)'s results and w(T)'s results is smaller than that between CK(T)'s results and w(T)'s results. This observation suggests that our approximation method already reflects the human summarization process to some extent, and we may conclude that our approximation is a reasonable computation schema for the theory. As to the third bias, we will study it in theory to find the upper bound of the difference between E_m and min_S Σ_i D_max(B_i, S|A).

Table 2. Differences and Correlations

Cluster    PK(T) vs. w(T)        CK(T) vs. w(T)
           Diff.     Cor.        Diff.     Cor.
07A        0.003     0.953       0.022     0.846
07B        0.015     0.933       0.022     0.921
07C        0.012     0.899       0.024     0.612
08A        0.013     0.889       0.020     0.782
08B        0.015     0.880       0.023     0.812
09A        0.009     0.935       0.019     0.818
09B        0.012     0.901       0.020     0.798
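For clarity, the Diff. and Cor. columns of Table 2 are simply the mean absolute difference and the Pearson correlation between two lists of per-topic ROUGE-1 scores; the scores themselves come from the experiments, and the sketch below only shows how the two statistics can be computed from such lists.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def average_difference(xs, ys):
    """Mean absolute difference between two equal-length lists of scores."""
    return mean(abs(x - y) for x, y in zip(xs, ys))
```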
5.7 Discussion
Many existing document summarization methods need to eliminate sentences with redundant content in extra post-processing steps. We have no post-processing step to remove redundancy as MMR methods do, for either the compression-based or the semantic-element-based approximation. The reasons are as follows. 1) In the compression-based approximation, K(xx) = K(x) + O(1) for a string x, according to the definition of Kolmogorov complexity. If another string y is very similar to x, then K(xy) ≈ K(x) ≈ K(y). The concatenation of redundant sentences contains nearly the same information as any one of them, so they will not all be included in the summaries generated by our system. 2) In the semantic-element-based approximation, the union of element sets is used to compute the information distance between sentences. If two sentences are redundant, the number of distinct elements in their union is smaller, which leads to a larger information distance from the two sentences to all the other sentences in the document cluster. Furthermore, each element is weighted by Kolmogorov complexity: if an element is observed more frequently in different documents, it contains more information and is more relevant to the topic of the document cluster; thus our method does not simply count overlapping terms to exclude redundant sentences. Therefore, our approach always tends to select as many sentences with different important elements (related to the topic, as defined in Subsection 5.2) as possible.
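The first point can be illustrated with the compression-based approximation, again using zlib as a stand-in for LZ77 (an illustrative check, not an experiment from this paper): duplicating a sentence barely increases its compressed size, so a near-duplicate sentence adds almost no information to a candidate summary.

```python
import zlib

def c(text: str) -> int:
    """Compressed size in bytes, a stand-in for K(.)."""
    return len(zlib.compress(text.encode("utf-8"), 9))

sentence = ("Police arrested more than 100 protesters at the Jabiluka "
            "mine development in the Northern Territory.")
print(c(sentence))              # size of one copy, approximating K(x)
print(c(sentence + sentence))   # K(xx): only marginally larger than K(x)
```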
6 Conclusion and Future Work
In this paper, we have proposed a novel document summarization framework based on the theory of information distance. Two approximation methods are used to estimate information distance, one by compression and the other through semantic element extraction. The very simple compression-based method has demonstrated that our framework is theoretically safe
and practically feasible for producing good summaries. Better approximation schemas which take more semantic information into account when estimating information distance, as exemplified by the semantic elements or summary content units in this paper, will contribute better results within this framework, and the framework readily embraces other approximation techniques. Experiments show that our approach performs well on the DUC/TAC 2007 to 2009 datasets. In future work, we will further improve our approach in two ways: first, better approximations of information distance will be studied; second, a heuristic method will be developed in order to find the best summary more efficiently.

References

[1] Luhn H P. The automatic creation of literature abstracts. IBM Journal of Research and Development, 1958, 2(2): 159-165.
[2] Wan X, Yang J, Xiao J. Manifold-ranking based topic-focused multi-document summarization. In Proc. IJCAI, Hyderabad, India, Jan. 6-12, 2007, pp.2903-2908.
[3] Li M, Vitanyi P M. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, 1997.
[4] Carbonell J, Goldstein J. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proc. SIGIR, Melbourne, Australia, Aug. 24-28, 1998, pp.335-336.
[5] Radev D R, Jing H, Stys M, Tam D. Centroid-based summarization of multiple documents. Information Processing and Management, 2004, 40(6): 919-938.
[6] Kupiec J, Pedersen J, Chen F. A trainable document summarizer. In Proc. SIGIR, Seattle, USA, Jul. 9-13, 1995, pp.68-73.
[7] Leskovec J, Milic-Frayling N, Grobelnik M. Impact of linguistic analysis on the semantic graph coverage and learning of document extracts. In Proc. AAAI, Pittsburgh, USA, Jul. 9-13, 2005, pp.1069-1074.
[8] Shen D, Sun J T, Li H, Yang Q, Chen Z. Document summarization using conditional random fields. In Proc. IJCAI, Hyderabad, India, Jan. 6-12, 2007, pp.2862-2867.
[9] Zhang J, Cheng X, Wu G, Xu H. AdaSum: An adaptive model for summarization. In Proc. CIKM, Napa Valley, USA, Oct. 26-30, 2008, pp.901-909.
[10] Erkan G, Radev D R. LexPageRank: Prestige in multi-document text summarization. In Proc. EMNLP, Barcelona, Spain, Jul. 25-26, 2004, pp.365-371.
[11] Mihalcea R, Tarau P. TextRank: Bringing order into texts. In Proc. EMNLP, Barcelona, Spain, Jul. 25-26, 2004, pp.119-126.
[12] Mihalcea R, Tarau P. A language independent algorithm for single and multiple document summarization. In Proc. IJCNLP, Jeju Island, Korea, Oct. 11-13, 2005, pp.19-24.
[13] Wan X, Yang J, Xiao J. Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. In Proc. ACL, Prague, Czech Republic, Jun. 23-30, 2007, pp.552-559.
[14] Wan X. An exploration of document impact on graph-based multi-document summarization. In Proc. EMNLP, Hawaii, USA, Oct. 25-27, 2008, pp.755-762.
[15] Bennett C H, Gacs P, Li M, Vitanyi P M, Zurek W H. Information distance. IEEE Transactions on Information Theory, Jul. 1998, 44(4): 1407-1423.
[16] Li M, Badger J H, Chen X, Kwong S, Kearney P, Zhang H. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 2001, 17(2): 149-154.
[17] Li M, Chen X, Li X, Ma B, Vitanyi P M. The similarity metric. IEEE Transactions on Information Theory, 2004, 50(12): 3250-3264.
[18] Long C, Zhu X, Li M, Ma B. Information shared by many objects. In Proc. CIKM, Napa Valley, USA, Oct. 26-30, 2008, pp.1213-1220.
[19] Benedetto D, Caglioti E, Loreto V. Language trees and zipping. Physical Review Letters, Jan. 2002, 88(4): 048702.
[20] Bennett C H, Li M, Ma B. Chain letters and evolutionary histories. Scientific American, Jun. 2003, 288(6): 76-81.
[21] Cilibrasi R L, Vitanyi P M. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering, Mar. 2007, 19(3): 370-383.
[22] Zhang X, Hao Y, Zhu X, Li M. Information distance from a question to an answer. In Proc. SIGKDD, San Jose, USA, Aug. 12-15, 2007, pp.874-883.
[23] Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 1977, 23(3): 337-343.
[24] Lin C Y, Hovy E. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proc. HLT-NAACL, Edmonton, Canada, May 27-June 1, 2003, pp.71-78.
[25] Nenkova A, Passonneau R, McKeown K. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing, Apr. 2007, 4(2): 1-23.
Chong Long received his B.E. degree from Tsinghua University, China in 2005. He is a Ph.D. candidate in the Department of Computer Science and Technology, Tsinghua University, China. His research interests include Kolmogorov complexity and its applications, text mining and natural language processing.

Min-Lie Huang is now a faculty member of the Department of Computer Science and Technology, Tsinghua University. He received his Ph.D. degree from Tsinghua University in 2006. His research interests include machine learning, natural language processing, graph-based text mining, opinion and review mining, and complex question answering.
Xiao-Yan Zhu is a professor and the Deputy Head of the State Key Lab of Intelligent Technology and Systems, Tsinghua University. She obtained her Bachelor's degree from the University of Science and Technology Beijing in 1982, her Master's degree from Kobe University in 1987, and her Ph.D. degree from the Nagoya Institute of Technology, Japan in 1990. She has been teaching at Tsinghua University since 1993. Her research interests include pattern recognition, neural networks, machine learning, natural language processing and bioinformatics. She is a member of CCF.
Ming Li is a Canada Research Chair in Bioinformatics and a University Professor at the University of Waterloo. He is a fellow of the Royal Society of Canada, ACM, and IEEE. He is a recipient of the E.W.R. Steacie Fellowship Award in 1996 and the 2001 Killam Fellowship. Together with Paul Vitanyi, he pioneered the applications of Kolmogorov complexity and co-authored the book "An Introduction to Kolmogorov Complexity and Its Applications". His recent research interests include protein structure determination and next-generation Internet search engines.