Topic Analysis for Topic-Focused Multi-Document Summarization

Xiaojun Wan
Institute of Computer Science and Technology & Key Laboratory of Computational Linguistics, MOE, Peking University, Beijing 100871, China
[email protected]
ABSTRACT
Topic-focused multi-document summarization has been a challenging task because the created summary is required to be biased to the given topic or query. Existing methods consider the given topic as a single coarse unit and directly incorporate the relevance between each sentence and this single topic into the sentence evaluation process. However, the given topic is usually not well-defined, and it consists of a few explicit or implicit subtopics. In this study, the related subtopics are discovered from the topic's narrative text or from the document set through topic analysis techniques. Then, the sentence relationships against each subtopic are considered as an individual modality, and the multi-modality manifold-ranking method is proposed to evaluate and rank sentences by fusing the multiple modalities. Experimental results on the DUC benchmark datasets demonstrate the effectiveness of the proposed methods.
Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – abstracting methods; I.2.7 [Artificial Intelligence]: Natural Language Processing – text analysis
General Terms: Algorithms, Experimentation, Performance
Keywords: Topic-focused multi-document summarization, topic analysis, multi-modality manifold-ranking
1. INTRODUCTION
Topic-focused (or query-based) multi-document summarization aims to create from a document set a summary that answers the information need expressed in a given topic or query. Topic-focused summarization has drawn much attention in recent years, and it has been one of the main tasks in recent Document Understanding Conferences (DUC). A topic-focused summary can be used to provide personalized news services for different users. In a question answering system, a question-focused summary is usually required to answer the information need expressed in the issued question.
To date, a variety of methods have been proposed for topic-focused multi-document summarization that incorporate the topic information into the process of sentence evaluation and selection [1, 5, 6, 8, 9, 10, 11]. For example, graph-based methods have recently been exploited for topic-focused multi-document summarization [8, 9, 10]. These methods first construct a graph representing the sentence relationships at different granularities and then evaluate the topic-biased saliency of the sentences based on the graph. Wan et al. [8] used the basic manifold-ranking algorithm for topic-focused multi-document summarization by considering the topic as a single query unit, and Wan et al. [9] further used a two-modality manifold-ranking algorithm for extracting topic-focused summaries by considering the within-document sentence relationships and the cross-document sentence relationships as two separate modalities (graphs). Almost all these methods consider the given topic or query as a single coarse unit and then directly evaluate and incorporate the relevance between each sentence and this single topic. However, the given topic is usually not well-defined by users, and it may consist of a few explicit or implicit subtopics (or aspects). These subtopics are comprehensive and vivid descriptions of the coarse topic. They can be extracted explicitly from the topic's narrative text or implicitly from the document set. We believe that making use of the subtopics will benefit sentence evaluation and selection.

In this study, we propose a novel graph-based summarization method that makes use of the multiple subtopics in a multi-modality learning process. First, the related subtopics for a given topic/query are discovered from the topic's narrative text or the document set through different topic analysis techniques. Second, the sentence relationships against each subtopic are considered as an individual modality, and we propose the more general multi-modality manifold-ranking algorithm to evaluate and rank the sentences by fusing the multiple modalities. Two fusion schemes are proposed for fusing the multiple modalities: the linear fusion scheme and the score combination scheme.
As compared with generic multi-document summarization, the challenge for topic-focused multi-document summarization is that a topic-focused summary is not only expected to deliver the important information contained in the whole document set as much as possible, but is also expected to guarantee that the information is biased to the given topic. Therefore, we need effective methods to take this topic-biased characteristic into account during the summarization process.
Experiments have been performed on the DUC2005 and DUC2006 benchmark datasets, and the results demonstrate that the proposed multi-modality learning methods outperform the baseline manifold-ranking methods. Both fusion schemes are effective.
2. THE PROPOSED SUMMARIZATION METHOD

2.1 Summarization Framework
The proposed summarization method consists of three steps: topic analysis, sentence ranking and sentence selection. The step of topic analysis aims to discover the related subtopics from the given topic description or the document set. The step of sentence ranking aims to make use of the discovered subtopics to better evaluate the topic-biased saliency of the sentences by exploiting the multi-modality manifold-ranking algorithm. The step of sentence selection aims to select both highly salient and novel sentences into the summary as in [8], by penalizing sentences that highly overlap with other informative sentences; the details are omitted here due to the page limit.
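As a rough illustration of such penalty-based selection, the following sketch greedily picks the highest-scored sentence and then discounts the scores of overlapping sentences. The penalty weight omega and all helper names are illustrative assumptions, not the exact algorithm of [8].

```python
import numpy as np

def select_sentences(scores, sim, lengths, length_limit, omega=0.7):
    """Greedy penalty-based selection: pick salient sentences, damp redundancy.

    scores: saliency scores of the n sentences (e.g. from manifold ranking)
    sim:    n x n cosine similarity matrix between sentences
    omega:  penalty weight (illustrative value, not specified in this paper)
    """
    scores = np.array(scores, dtype=float)
    remaining = set(range(len(scores)))
    selected, total_len = [], 0
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if total_len + lengths[best] > length_limit:
            continue
        selected.append(best)
        total_len += lengths[best]
        # Penalize sentences that highly overlap with the chosen one,
        # so that redundant information is less likely to be selected.
        for i in remaining:
            scores[i] -= omega * sim[best, i] * scores[best]
    return selected
```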
2.2 Topic Analysis
A DUC topic usually consists of a title and a narrative text. The title is usually a short phrase concisely describing the focused object that the users are interested in. The narrative text is a further description of the title, and it usually reflects a few aspects of the object. The title and the narrative text together describe the information need to which the summary should be biased, and the narrative text is much more detailed than the short title. In this study, two topic analysis methods are employed for discovering subtopics related to a given topic.

2.2.1 Explicit Subtopic Discovery
By analyzing the narrative text, we can discover a few subtopics related to the given topic. The subtopics discovered directly from the narrative text are called explicit subtopics, and explicit subtopic discovery is very similar to complex question decomposition in question answering systems. In this study, our aim is not to decompose the given topic into all possible subtopics at a very fine granularity as in question decomposition. Instead, we adopt a straightforward algorithm that simplifies the syntactic question decomposition method in [2, 3, 4]. The algorithm first splits the narrative text into sentences and then applies heuristic syntactic rules to split each compound sentence into simple sentences by using conjunctions (e.g. and, or) or punctuation marks. Each simple sentence is used as a subtopic.
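A minimal sketch of this splitting step is given below. The paper does not specify the exact syntactic rules, so the sentence splitter and the conjunction/punctuation patterns here are illustrative assumptions.

```python
import re

def explicit_subtopics(narrative):
    """Split a topic narrative into simple sentences, each used as a subtopic."""
    # 1) Naive sentence splitting on end-of-sentence punctuation.
    sentences = re.split(r'(?<=[.!?])\s+', narrative.strip())
    subtopics = []
    for sent in sentences:
        # 2) Split compound sentences on semicolons and on ", and" / ", or".
        parts = re.split(r';|,\s*(?:and|or)\s+', sent)
        subtopics.extend(p.strip(' .') for p in parts if p.strip(' .'))
    return subtopics
```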
2.2.2 Implicit Subtopic Discovery
In the DUC datasets, the narrative text for each topic has already been provided by NIST annotators. However, in practical applications, we cannot require users to input a long narrative text to explain their information need. Users often issue only a short query (i.e., the title) to reflect their information need, as in most search engines. Therefore, in real applications, the only representation of each given topic is its title. In this study, we also investigate this realistic circumstance. The problem is then how to discover the related subtopics from the title and the document set; such subtopics are called implicit subtopics. We adopt clustering techniques for implicit subtopic discovery from the document set. We first collect all sentences in the document set that share some terms with the title, and then group the sentences into a few clusters by applying the k-means clustering algorithm, where the cluster number is hard to predict and we simply set it to the square root of the sentence number. Finally, we sort the clusters by size, use the several largest clusters as subtopics, and use the centroid term vector of each cluster to represent the corresponding subtopic.
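The following sketch shows this clustering step with scikit-learn. The tokenization, the term-overlap filter, and the number of clusters kept are assumptions where the paper leaves the details open.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def implicit_subtopics(title, sentences, max_subtopics=5):
    """Discover implicit subtopics by clustering title-related sentences."""
    title_terms = set(title.lower().split())
    # Keep only sentences sharing at least one term with the title.
    related = [s for s in sentences if title_terms & set(s.lower().split())]
    if not related:
        return []
    X = TfidfVectorizer().fit_transform(related)
    # Cluster number: square root of the number of collected sentences.
    k = max(1, int(np.sqrt(len(related))))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Keep the largest clusters; each centroid term vector represents a subtopic.
    sizes = np.bincount(km.labels_, minlength=k)
    largest = np.argsort(sizes)[::-1][:max_subtopics]
    return [km.cluster_centers_[c] for c in largest]
```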
2.3 Multi-Modality Sentence Ranking

2.3.1 Overview
Given the discovered subtopics, we consider the sentence relationships against each subtopic as an individual modality, and thus we have multiple modalities based on the multiple subtopics. We believe that ranking sentences at the finer granularity of subtopics, instead of the coarse granularity of the whole topic, will benefit the sentence ranking process. We propose the multi-modality manifold-ranking algorithm for sentence ranking by extending the manifold-ranking algorithms for text and image ranking in [7, 13].

Formally, given a subtopic and the document set, the corresponding data point set is denoted by $\chi = \{x_0, x_1, \ldots, x_n\} \subset \mathbb{R}^d$, where the first point $x_0$ represents the given subtopic (the query point) and the remaining $n$ points represent all the sentences in the document set (the data points to be ranked).

Let $W = [W_{ij}]_{(n+1) \times (n+1)}$ be the affinity matrix reflecting the similarity relationships among the $n+1$ data points, where $W_{ij}$ is the cosine similarity value if $x_i$ or $x_j$ is $x_0$; otherwise, $W_{ij}$ is the topic-sensitive similarity value between $x_i$ and $x_j$ against the subtopic $x_0$, defined as follows:

$$W_{ij} = \theta \cdot \frac{\vec{x}_i \cdot \vec{x}_j}{\|\vec{x}_i\| \, \|\vec{x}_j\|} + (1 - \theta) \cdot \frac{\vec{c}_{ij} \cdot \vec{x}_0}{\|\vec{c}_{ij}\| \, \|\vec{x}_0\|} \qquad (1)$$

where $\vec{x}_i$, $\vec{x}_j$ and $\vec{x}_0$ represent the term vectors of $x_i$, $x_j$ and $x_0$, respectively, and $\vec{c}_{ij} = \vec{x}_i \cap \vec{x}_j$ is a vector containing the common terms of the sentences $x_i$ and $x_j$. The weight of each term in $\vec{c}_{ij}$ is the average of the associated weights of the term in $\vec{x}_i$ and $\vec{x}_j$. The above similarity measure is a linear combination of two sources: the original sentence similarity and the topic-biased similarity. $\theta$ is a weighting parameter specifying the relative weights of the two sources, and we simply set $\theta = 0.5$. Note that we let $W_{ii} = 0$ to avoid self-loops. We then normalize $W$ by $S = D^{-1/2} W D^{-1/2}$, where $D$ is the diagonal matrix whose $(i,i)$-element equals the sum of the $i$-th row of $W$. Assume we have discovered in total $m$ subtopics for the given topic, and we use $W_{\{k\}}$, $D_{\{k\}}$ and $S_{\{k\}}$ to denote the matrices constructed in the $k$-th modality based on the $k$-th subtopic. In total, there are $m$ matrix tuples $\{(W_{\{k\}}, D_{\{k\}}, S_{\{k\}}) \mid k = 1, \ldots, m\}$.
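A minimal numpy sketch of this per-modality graph construction, assuming tf-idf style term vectors as rows; the helper name and the small epsilon guard are illustrative.

```python
import numpy as np

def modality_matrices(X, theta=0.5):
    """Build (W, D, S) for one modality, following Eq. (1).

    X: (n+1) x d array of term vectors; row 0 is the subtopic x0,
       rows 1..n are the sentence vectors.
    """
    n1 = X.shape[0]
    norms = np.linalg.norm(X, axis=1) + 1e-12   # guard against zero vectors
    cos = (X @ X.T) / np.outer(norms, norms)
    W = np.zeros((n1, n1))
    for i in range(n1):
        for j in range(n1):
            if i == j:
                continue                        # W_ii = 0: no self-loops
            if i == 0 or j == 0:
                W[i, j] = cos[i, j]             # plain cosine for the query point
            else:
                # c_ij: average weights on the terms shared by x_i and x_j
                c = np.where((X[i] > 0) & (X[j] > 0), (X[i] + X[j]) / 2.0, 0.0)
                cn = np.linalg.norm(c)
                topic_sim = (c @ X[0]) / (cn * norms[0]) if cn > 0 else 0.0
                W[i, j] = theta * cos[i, j] + (1 - theta) * topic_sim
    d = W.sum(axis=1)
    inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    S = W * np.outer(inv_sqrt, inv_sqrt)        # S = D^{-1/2} W D^{-1/2}
    return W, np.diag(d), S
```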
In addition, we let $f: \chi \to \mathbb{R}$ denote a ranking function which assigns to each point $x_i$ ($0 \le i \le n$) a ranking value $f_i$, and we view $f$ as a vector $f = [f_0, \ldots, f_n]^T$. We also define a prior vector $y = [y_0, \ldots, y_n]^T$, in which $y_0 = 1$ because $x_0$ is the query object, and $y_i = 0$ ($1 \le i \le n$) for all the remaining points that we want to rank. With the above notation, the multi-modality learning task is to infer the ranking function $f$ from $W_{\{1\}}, \ldots, W_{\{m\}}$ and $y$. Here $S_{\{1\}}, \ldots, S_{\{m\}}$ and $y$ can be considered as constraints in the learning task: 1) if two points $x_i$ and $x_j$ are measured as similar by any $S_{\{k\}}$, they should receive similar ranking values $f_i$ and $f_j$, and vice versa; 2) if a data point $x_i$ is among the initial query points, its ranking value $f_i$ should be as consistent as possible with the initial value $y_i$. In the following subsections, we describe two different learning schemes for fusing the multiple modalities based on different optimization strategies.
2.3.2 The Score Combination Scheme
This scheme is the most intuitive method for fusing the multiple modalities: it first computes a ranking score with the basic manifold-ranking algorithm in each individual modality, and then linearly combines all the ranking scores into the final scores. In the $k$-th modality, the cost function associated with the ranking function $f_{\{k\}}$ is defined as:

$$Q(f_{\{k\}}) = \alpha \cdot \sum_{i,j=0}^{n} (W_{\{k\}})_{ij} \left\| \frac{(f_{\{k\}})_i}{\sqrt{(D_{\{k\}})_{ii}}} - \frac{(f_{\{k\}})_j}{\sqrt{(D_{\{k\}})_{jj}}} \right\|^2 + \eta \cdot \sum_{i=0}^{n} \left\| (f_{\{k\}})_i - y_i \right\|^2 \qquad (2)$$

where $(W_{\{k\}})_{ij}$ refers to the $(i,j)$-th element in matrix $W_{\{k\}}$, $(f_{\{k\}})_i$ refers to the $i$-th element in vector $f_{\{k\}}$, and $\alpha$ and $\eta$ capture the trade-off between the smoothness constraint and the fitting constraint. The optimal ranking function $f_{\{k\}}^*$ is achieved when $Q(f_{\{k\}})$ is minimized:

$$f_{\{k\}}^* = \arg\min_{f_{\{k\}}} Q(f_{\{k\}}) \qquad (3)$$

As in [12], the solution can be expressed in the closed form

$$f_{\{k\}}^* = (1 - \alpha) \cdot (I - \alpha S_{\{k\}})^{-1} y \qquad (4)$$

According to [12, 13], the ranking values can also be obtained by iterating the following computation until convergence:

$$f_{\{k\}}^{(t+1)} = \alpha S_{\{k\}} f_{\{k\}}^{(t)} + (1 - \alpha) y \qquad (5)$$

where $f_{\{k\}}^{(0)} = y$. The theorem in [13] guarantees that the sequence $\{f_{\{k\}}^{(t)}\}$ converges to

$$f_{\{k\}}^* = \lim_{t \to \infty} f_{\{k\}}^{(t)} \qquad (6)$$

Although $f_{\{k\}}^*$ can be expressed in the closed form of Equation (4), for large-scale problems the iteration algorithm in Equation (5) is preferable due to its computational efficiency.

After we obtain the ranking scores of the sentences in each modality, the final ranking function $f^*$ is defined as follows:

$$f^* = \sum_{k=1}^{m} \lambda_k f_{\{k\}}^* \qquad (7)$$

where $\lambda_k \in [0, 1]$ is the combination weight and we have

$$\sum_{k=1}^{m} \lambda_k = 1 \qquad (8)$$

In this scheme, the regularization parameter $\eta$ for the fitting constraint is fixed at 0.01, as in [7, 12, 13] (with $\alpha = 1 - \eta = 0.99$ accordingly). $\lambda_k$ is set to the normalized cosine similarity value between the subtopic representation and the whole document set.
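A compact sketch of this scheme, reusing the S matrices from the construction above; the convergence tolerance and iteration cap are illustrative choices.

```python
import numpy as np

def manifold_rank(S, y, alpha=0.99, tol=1e-6, max_iter=1000):
    """Basic manifold ranking in one modality: iterate Eq. (5) to convergence."""
    f = y.copy()
    for _ in range(max_iter):
        f_new = alpha * (S @ f) + (1 - alpha) * y
        if np.abs(f_new - f).max() < tol:
            return f_new
        f = f_new
    return f

def score_combination(S_list, y, lambdas):
    """Score combination scheme, Eq. (7): rank per modality, then combine."""
    lambdas = np.asarray(lambdas, dtype=float)
    lambdas = lambdas / lambdas.sum()            # enforce Eq. (8)
    f_stars = [manifold_rank(S, y) for S in S_list]
    return sum(l * f for l, f in zip(lambdas, f_stars))
```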
2.3.3 The Linear Fusion Scheme
This scheme fuses the constraints from $S_{\{1\}}, \ldots, S_{\{k\}}, \ldots, S_{\{m\}}$ and $y$ simultaneously by a weighted sum. The cost function associated with $f$ is defined to be:

$$Q(f) = \sum_{k=1}^{m} \left( \mu_k \cdot \sum_{i,j=0}^{n} (W_{\{k\}})_{ij} \left\| \frac{f_i}{\sqrt{(D_{\{k\}})_{ii}}} - \frac{f_j}{\sqrt{(D_{\{k\}})_{jj}}} \right\|^2 \right) + \eta \cdot \sum_{i=0}^{n} \left\| f_i - y_i \right\|^2 \qquad (9)$$

where $\mu_1, \ldots, \mu_k, \ldots, \mu_m$ and $\eta$ capture the trade-off between the constraints; usually we have $0 \le \mu_1, \ldots, \mu_m, \eta \le 1$ and $\sum_{k=1}^{m} \mu_k + \eta = 1$. With the above optimization criterion, the optimal ranking function $f^*$ is achieved when $Q(f)$ is minimized:

$$f^* = \arg\min_f Q(f) \qquad (10)$$

Similar to [12], solving the above optimization problem by differentiating $Q(f)$ defined by Equation (9) with respect to $f$ leads to the following optimal ranking function $f^*$:

$$f^* = \left(1 - \sum_{k=1}^{m} \mu_k \right) \cdot \left(I - \sum_{k=1}^{m} \mu_k S_{\{k\}} \right)^{-1} y \qquad (11)$$

In practice, the following iterative form is preferable to the above closed form for obtaining the ranking function $f^*$:

$$f^{(t+1)} = \sum_{k=1}^{m} \left( \mu_k S_{\{k\}} f^{(t)} \right) + \left(1 - \sum_{k=1}^{m} \mu_k \right) y \qquad (12)$$

where $f^{(0)} = y$, and the sequence $\{f^{(t)}\}$ converges to $f^* = \lim_{t \to \infty} f^{(t)}$ by a similar analysis to that in [13].

In this scheme, the regularization parameter $\eta$ for the fitting constraint is fixed at 0.01, the same as in [7, 12, 13]; therefore, we have $\sum_{k=1}^{m} \mu_k = 0.99$. $\mu_k$ is set to the normalized cosine similarity value between the subtopic representation and the whole document set.
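A corresponding sketch of the linear fusion iteration of Equation (12); rescaling the raw similarity weights so that they sum to 1 − η = 0.99 follows the parameter setting stated above, while the tolerance is an illustrative choice.

```python
import numpy as np

def linear_fusion_rank(S_list, y, mu, eta=0.01, tol=1e-6, max_iter=1000):
    """Linear fusion scheme, Eq. (12): fuse all modalities in one iteration."""
    mu = np.asarray(mu, dtype=float)
    mu = (1 - eta) * mu / mu.sum()      # enforce sum(mu) = 1 - eta = 0.99
    f = y.copy()
    for _ in range(max_iter):
        # Weighted sum of smoothing terms plus (1 - sum(mu)) * y = eta * y.
        f_new = sum(m_k * (S @ f) for m_k, S in zip(mu, S_list)) + eta * y
        if np.abs(f_new - f).max() < tol:
            return f_new
        f = f_new
    return f
```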
3. EXPERIMENTS
Topic-focused multi-document summarization has been the main task of DUC2005 and DUC2006, so we used these two DUC datasets for evaluation in this study. We used the ROUGE-1.5.5 toolkit for evaluation, which was officially adopted by DUC for automatic summarization evaluation. In this study, we report three ROUGE F-measure scores in the experimental results: ROUGE-1 (unigram-based), ROUGE-2 (bigram-based), and ROUGE-W (based on weighted longest common subsequence, weight = 1.2).

In the experiments, the proposed multi-modality manifold-ranking methods with the score combination scheme and the linear fusion scheme are denoted as "MultiMR-COM" and "MultiMR-LIN", respectively. Because the multi-modality learning algorithms rely on the subtopic discovery algorithms, we combine each subtopic discovery algorithm with each learning algorithm, so there are in total four multi-modality summarization methods. For example,
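For reference, ROUGE-1 scores a candidate summary by unigram overlap with reference summaries. The following minimal illustration computes the F-measure from clipped unigram counts; it is not the official ROUGE-1.5.5 toolkit, which additionally handles stemming, multiple references and confidence intervals.

```python
from collections import Counter

def rouge_1_f(candidate_tokens, reference_tokens):
    """Illustrative ROUGE-1 F-measure from clipped unigram overlap counts."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())     # clipped unigram matches
    if not candidate_tokens or not reference_tokens or overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)
```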