An Evaluation of Passage-based Text Categorization

Journal of Trivia, Volume 1, Issue 4 (July 2009) Date of Revision: 2003-09-26

An Evaluation of Passage-based Text Categorization

Jinsuk Kim∗

Center for Computational Biology & Bioinformatics, Korea Institute of Science and Technology Information (KISTI), P.O. Box 122, Yuseong-gu, Daejon, Republic of Korea 305-600 [email protected]

Myoung Ho Kim†

Department of Electrical Engineering & Computer Science, Korea Advanced Institute of Science and Technology (KAIST), 373-1, Guseong-dong, Yuseong-gu, Daejon, Republic of Korea 305-701 [email protected]

∗ This is a copy of a paper published in another journal. To cite this paper, use the following reference: Jinsuk Kim and Myoung-Ho Kim. "An Evaluation of Passage-Based Text Categorization". Journal of Intelligent Information Systems 23(1):47-65, 2004. Copyright © 2004 Journal of Intelligent Information Systems. Copyrighted by Springer Science.
† To whom correspondence should be addressed.

Category

Computer Science/Information Retrieval, Information Retrieval/Text Categorization

Keywords

Text Categorization, Passage, Non-overlapping Window, Overlapping Window, Paragraph, Bounded-Paragraph, Page, TextTile, Passage Weight Function

Abstract

Research in text categorization has been confined to whole-document-level classification, probably due to the lack of full-text test collections. However, the full-length documents available today in large quantities pose renewed interest in text classification. A document is usually written in an organized structure to present its main topic(s). This structure can be expressed as a sequence of subtopic text blocks, or passages. In order to reflect the subtopic structure of a document, we propose a new passage-level, or passage-based, text categorization model, which segments a test document into several passages, assigns categories to each passage, and merges the passage categories into document categories. Compared with traditional document-level categorization, two additional steps, passage splitting and category merging, are required in this model. Using four subsets of the Reuters text categorization test collection and a full-text test collection whose documents vary from tens to hundreds of kilobytes, we evaluate the proposed model, especially the effectiveness of various passage types and the importance of passage location in category merging. Our results show that simple windows perform best on all test collections in these experiments. We also found that passages contribute to the main topic(s) to different degrees, depending on their location in the test document.

1 Introduction

Text categorization, the task of assigning one or more predefined categories to a document, is an active research field in information retrieval and machine learning. However, research interest in text categorization has been confined to problems such as feature extraction, feature selection, supervised learning algorithms, and hypertext classification. Traditional categorization systems, or classifiers, have treated the whole document as the categorization unit, and there has been little research on the input units of classifiers. However, the emergence of full-length documents in large quantities today, such as word-processor files, full-text SGML/XML/HTML documents, and PDF/PostScript files, challenges traditional categorization models that process the whole document as a single input unit. As an alternative access method, we regard each document as a set of passages, where a passage is a contiguous segment of text. In information retrieval, the introduction of passages dates back to the early 1990s[4, 3], and various types of passages have been proposed and tested for document retrieval effectiveness[3, 5, 6, 7]. The large quantity of full-length documents available today renews interest in passages for text categorization as well as for information retrieval.

In this article, we propose a new text categorization model in which a test document is split into passages, categorization is performed on each passage, and the document's categories are then merged from the passages' categories. We name the proposed model Passage-based Text Categorization, in contrast to the tradition of performing categorization at the document level. In our experiments, we compare the effectiveness of several passage types in text categorization using a kNN (k-Nearest Neighbor) classifier[14]. For a test collection consisting of very long documents, we find that using passages improves effectiveness by about 10% for all passage types in our experiments compared with document-level categorization. In addition, for collections of rather short documents such as newswires, there is an improvement of about 5% as well. This paper introduces passage-based text categorization in Section 2. The data sets and measures used in the experiments are explained in Section 3, the experimental results are given in Section 4, and we conclude in Section 5.

2 Passage-based Text Categorization Model

Generally, a document is deliberately structured as a sequence of subtopical discussions that occur in the context of one or more main topic discussions[5]. If this is true, it is natural to treat a document as a sequence of subtopic blocks of some unit, such as sentences, paragraphs, sections, or contiguous text segments. With this motivation, we propose a new text categorization model, passage-based text categorization, shown in Figure 1. The primary difference between passage usage in information retrieval and in text categorization lies in the target documents. In information retrieval, the documents stored in the database are split into passages, and queries are evaluated for similarity against the passages[3, 4, 10, 18]. In text categorization, on the other hand, the document to be categorized is split into passages, and the categorization task is applied to these passages instead of to the parent document. As shown in Figure 1, a passage-based categorization system splits the test document into several passages, classifies each passage into categories, and determines the test document's categories by merging all passage categories. This procedure includes two additional steps, passage splitting and category merging, compared with traditional document-level classification systems; these two additional steps are the topics of Section 2.1 and Section 2.2, respectively.
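For concreteness, the overall procedure can be sketched as follows. This is a minimal illustration rather than the paper's implementation; the function and parameter names (split_into_passages, classify_passage, and so on) are ours, and the weighted merging shown here anticipates the passage weight functions of Section 2.2.

    from collections import defaultdict

    def categorize_by_passages(document, split_into_passages, classify_passage,
                               passage_weight, threshold):
        """Passage-based categorization: split, classify each passage, merge.

        split_into_passages: maps a document to a list of passages.
        classify_passage:    maps a passage to a set of categories.
        passage_weight:      maps a passage location (1-based) and the passage
                             count to a weight in [0, 1] (Section 2.2).
        """
        passages = split_into_passages(document)
        n = len(passages)
        relevance = defaultdict(float)
        for i, passage in enumerate(passages, start=1):
            for category in classify_passage(passage):
                # Each passage votes for its categories with a
                # location-dependent weight.
                relevance[category] += passage_weight(i, n)
        # Categories whose accumulated weight clears the threshold are
        # assigned to the whole document.
        return {c for c, w in relevance.items() if w >= threshold}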


[Figure 1. Comparison of document-based (upper) and passage-based (lower) text categorization models. In the traditional document-based model, the test document is fed directly to the text classifier, which outputs the document categories. In the proposed passage-based model, a passage splitter divides the test document into passages 1 through n, the text classifier assigns categories to each passage, and a category merger combines the passage categories into the document categories.]

2.1 Passages in Text Categorization

Regarding the passage splitting step in Figure 1, the definition of a passage is the key issue in passage-based text categorization. Since a passage can be any sequence of text from a document[7], many types of passages have been used in document retrieval. [3] grouped these passage types into three classes: discourse passages, semantic passages, and window passages.

2.1.1 Discourse passages

Discourse passages are based on logical components of documents, such as sentences and paragraphs[3, 5]. This passage definition is intuitive, because discourse boundaries organize material by content. There are three problems with discourse passages. First, there is no guarantee of discourse consistency across authors[3]. Second, it is sometimes impossible to build discourse passages, because many documents are supplied without passage demarcation[7]. Finally, the lengths of discourse passages can vary widely, from very long to very short[7].


As an alternative solution to the first and last problems, [3] suggested a passage type known as bounded-paragraphs. As the name implies, when building bounded-paragraphs, short paragraphs are merged with subsequent paragraphs, while paragraphs longer than some minimum length are kept intact. [3] used 50 words as the minimum length of a bounded-paragraph.
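The construction reduces to a short accumulation loop. The sketch below is ours, assuming paragraphs are separated by blank lines; the 50-word minimum follows [3], and the 200-word upper bound mentioned in Section 4.3.1 is omitted for brevity.

    def bounded_paragraphs(text, min_words=50):
        """Merge short paragraphs with their successors until each passage
        reaches min_words; paragraphs already long enough are kept intact."""
        paragraphs = [p for p in text.split("\n\n") if p.strip()]
        passages, buffer = [], []
        for paragraph in paragraphs:
            buffer.append(paragraph)
            merged = "\n\n".join(buffer)
            if len(merged.split()) >= min_words:
                passages.append(merged)
                buffer = []
        if buffer:  # trailing short paragraphs form the last passage
            passages.append("\n\n".join(buffer))
        return passages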

2.1.2 Semantic passages

As discussed above, discourse passages may be inconsistent or impractical to build due to the poor structure of the source document. An alternative approach is to split a document into semantic passages, each corresponding to a topic or a subtopic. Several algorithms that partition documents into such segments have been proposed and developed[7]. One of them, known as TextTiling[5], partitions full-length documents into coherent multi-paragraph segments, known as TextTiles or simply tiles, which represent the subtopic structure of a document. TextTiling splits a document into small text blocks and computes the similarities of all adjacent blocks based on term frequencies. Boundaries between two blocks that show relatively low similarity are regarded as boundaries between two adjacent tiles, while blocks with high similarity are merged into a tile.
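The block-similarity idea can be sketched as follows. This is a strong simplification of Hearst's algorithm, not a faithful reimplementation: blocks are fixed-size token windows, similarity is the cosine of term-frequency vectors, and the boundary rule (similarity dipping more than one standard deviation below the mean) is our assumption.

    import math
    from collections import Counter

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in set(a) & set(b))
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def tile_boundaries(tokens, block_size=100):
        """Return block indices after which a subtopic boundary is placed."""
        blocks = [Counter(tokens[i:i + block_size])
                  for i in range(0, len(tokens), block_size)]
        if len(blocks) < 3:
            return []
        sims = [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]
        mean = sum(sims) / len(sims)
        std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
        # Low similarity between adjacent blocks marks a tile boundary;
        # highly similar blocks are considered part of the same tile.
        return [i for i, s in enumerate(sims) if s < mean - std]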

2.1.3 Window passages

While discourse and semantic passages are based on structural properties of documents, an alternative approach called window passages is based on sequences of words. This approach is computationally simple and can be applied to documents without explicit structural properties as well as to well-structured documents. [4] segmented documents into even-sized blocks, each corresponding to a fixed-length sequence of words and starting just after the end of the previous block. Accordingly, there is no shared region between two adjacent blocks, and this passage type is referred to as non-overlapping windows. [3] partitioned documents into overlapping windows, where two adjacent segments share words at the boundary. In our experiments, the second half of an overlapping window is shared with the following window. Another window passage type is the page[9]. Pages are similar to bounded-paragraphs, but the lengths of pages are bounded by physical length in bytes, while bounded-paragraphs are bounded by number of words. A page's minimum length is 1.0kb[9].
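Both window types reduce to a few lines over the token sequence; the sketch below (with illustrative names) follows the convention just stated, in which the second half of each overlapping window is shared with the window that follows.

    def non_overlapping_windows(tokens, size=100):
        """Even-sized blocks; each window starts where the previous ends."""
        return [tokens[i:i + size] for i in range(0, len(tokens), size)]

    def overlapping_windows(tokens, size=100):
        """Each window starts at the midpoint of the previous one, so its
        second half is shared with the following window."""
        step = max(size // 2, 1)
        return [tokens[i:i + size]
                for i in range(0, max(len(tokens) - step, 1), step)]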

2.2 Selecting Document Categories from Passage Categories

Usually a document is written in an organized manner to convey the author's intention. For example, a newspaper article may place its main topic in the title and early part to catch its readers' attention, while a scientific article may place its conclusions at the end. Therefore, depending on their location, passages from a document may contribute to the document's main topic(s) to different degrees. In our passage-based text categorization model, a passage's degree of contribution to the document categories is expressed as a passage weight function. We chose the six passage weight functions shown in Table 2 in Section 3.4. After categories are assigned to all passages of a document, passage weights are computed by the passage weight function, and a category's weight is summed over the weights of the passages to which that category was assigned. Categories with weights higher than some predefined threshold are assigned to the document as the final result. More detailed procedures are described in Section 3.4.
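As a hypothetical illustration of this merging step: suppose a document is split into n = 3 passages, that passages 1 and 2 are assigned category c1, that passages 2 and 3 are assigned category c2, and that pwf_2(p) = 1/p from Table 2 is used (ignoring normalization). Then c1's weight is 1 + 1/2 = 1.5 and c2's weight is 1/2 + 1/3 ≈ 0.83, so with a threshold of 1.0 only c1 is assigned to the document.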

3 Data Sets and Measures

3.1 Data Sets

We used the Reuters version 3 collection[1], its three subsets named GT800, GT1200, and GT1600, and the KISTI-Theses collection to verify the effectiveness of passages in text categorization.

The Reuters version 3 collection was constructed by C. Apte et al.[1]. They removed all unlabeled documents from both the training and test sets and restricted the categories to those with a training set frequency of at least two. After its constructors, we call this data set the Apte collection hereafter.

The GTnnnn test collections were constructed from the Apte data set by removing from the test set all documents whose lengths are less than nnnn bytes (a construction sketch is given at the end of this subsection). This restriction resulted in test sets of 1,109, 652, and 410 documents for GT800, GT1200, and GT1600, respectively.

Finally, the KISTI-Theses data set was constructed from master's and doctoral theses of KAIST1, POSTECH2, and CNU3, which were submitted in electronic form in partial fulfillment of graduation requirements. These theses amount to 1,042 documents from 22 departments. We regarded the 22 departments as categories and selected one third of the documents as the test set (347 documents) and the remaining two thirds as the training set (695 documents). The majority of the documents are written in Hangul (Korean text). The category distributions for the KISTI-Theses data set are listed below in <category, #te, #tr, sum> format, where #te is the number of documents in the test set, #tr is the number of documents in the training set, and sum is the sum of #te and #tr:

<Electrical Engineering, 48, 102, 150> <Environmental Engineering, 5, 6, 11> <Management Engineering, 44, 90, 134>

1. Korea Advanced Institute of Science and Technology, Daejon, Korea, http://www.kaist.ac.kr
2. Pohang University of Science and Technology, Pohang, Korea, http://www.postech.ac.kr
3. Chungnam National University, Daejon, Korea, http://www.cnu.ac.kr


Table 1. Characteristics of test collections used in this work.

Collection                Apte    GT800   GT1200   GT1600   KISTI-Theses
Test Set                  3309    1019    652      410      347
Training Set              7789    7789    7789     7789     695
Category Count            93      93      93       93       22
Minimal text size (kb)    0.1     0.8     1.2      1.6      14.8
Average text size (kb)    0.8     1.8     2.2      2.7      92.9

<Materials Science, 20, 33, 53> <Mathematics, 7, 21, 28> <Mechanical Engineering, 45, 98, 143> <Metal Engineering, 11, 17, 28> <Miscellaneous, 0, 1, 1> <Steel Engineering, 7, 12, 19>

The characteristics of the data sets used in this experiment are shown in Table 1.
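Since the GTnnnn collections are plain length filters over the Apte test set, their construction can be sketched in a line (an illustrative helper of ours, assuming documents are available as raw byte strings):

    def gt_subset(test_docs, min_bytes):
        """Build a GTnnnn test set: keep only test documents of at least
        min_bytes (e.g. 800 for GT800); the training set is unchanged."""
        return [doc for doc in test_docs if len(doc) >= min_bytes]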

3.2 Similarity Measures

To assess the effectiveness of the various passaging methods, a k-nearest neighbor (kNN) classifier[14] was used as the document-level classifier. As an example-based classifier[14, 11], a kNN classifier has many similarities to traditional information retrieval systems. Our kNN classifier was built on an information retrieval system, KRISTAL-II, which was developed by KISTI's Information Systems Group4 to manage and retrieve semi-structured texts such as bibliographies, theses, and journal articles. To retrieve the k top-ranked documents, the kNN classifier uses a vector-space similarity measure Sim(q, d) between a query document q and a target document d, defined as

    Sim(q, d) = \frac{1}{W_d} \sum_{t \in q \wedge d} \big( w_{q,t} \cdot w_{d,t} + \min(f_{d,t}, f_{q,t}) \big)    (1)

with

    w_{d,t} = \log(f_{d,t} + 1) \cdot \log(N / f_t + 1)
    w_{q,t} = \log(f_{q,t} + 1) \cdot \log(N / f_t + 1)
    W_d = \log \big( \sum_{t \in d} f_{d,t} \big)

where f_{x,t} is the frequency of term t in document x; N is the total number of documents; min(x, y) is the smaller of x and y; f_t is the number of documents in which the term t occurs more than once; w_{x,t} is the weight of term t in query or document x; and W_d represents the length of document d.

Equation 1 is an empirically derived TF·IDF form of the traditional vector-space information retrieval schemes[13], which have been commonly used due to their robustness and simplicity. The most noticeable modification in Equation 1, compared with traditional vector schemes, is the introduction of the expression min(f_{d,t}, f_{q,t}), which reflects the term frequencies of the query and document directly in the query-document similarity. With this expression, categorization performance is slightly better than with traditional vector-space similarity measures (data not shown). The summation of min(f_{d,t}, f_{q,t}) in Equation 1 is the total frequency of the terms that co-occur in the query and the target document; it thus reflects, if indirectly, term co-occurrence information in the similarity measure, which may explain the improved performance.

4. More information about the KRISTAL-II information retrieval system can be obtained at http://www.kristalinfo.com.
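Equation 1 can be transcribed directly; in this sketch (identifiers are ours) documents are term-frequency dictionaries, df[t] holds f_t, and n_docs is N.

    import math

    def sim(query_tf, doc_tf, df, n_docs):
        """Equation 1: TF-IDF dot product plus the shared term frequency
        min(f_dt, f_qt), normalized by W_d = log(sum over t of f_dt)."""
        w_d = math.log(sum(doc_tf.values()))
        if w_d == 0.0:
            return 0.0
        score = 0.0
        for t in query_tf.keys() & doc_tf.keys():  # t occurs in both q and d
            idf = math.log(n_docs / df[t] + 1)     # log(N/f_t + 1)
            w_qt = math.log(query_tf[t] + 1) * idf
            w_dt = math.log(doc_tf[t] + 1) * idf
            score += w_qt * w_dt + min(doc_tf[t], query_tf[t])
        return score / w_d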

3.3 Performance Measures

To evaluate the categorization effectiveness of the various passages, we use the standard definitions of precision (p) and recall (r) as basic performance measures:

    p = \frac{\text{categories relevant and retrieved}}{\text{categories retrieved}}, \qquad r = \frac{\text{categories relevant and retrieved}}{\text{categories relevant}}

Along with precision and recall, much research in text categorization uses the F1 measure as a performance measure. The F1 measure[12] is the harmonic mean of precision and recall, defined as

    F_1 = \frac{2pr}{p + r}

The point at which precision equals recall is called the precision-recall break-even point, or simply break-even point (BeP). Since, in theory, the BeP is always less than or equal to the F1 measure at any point, the BeP is usually used to compare effectiveness across classifiers and categorization methods[16, 11]. We report precision, recall, and BeP as categorization effectiveness; where a BeP cannot be obtained, the F1 measure is reported instead. To average precision and recall across categories, we used the micro-averaging method[11].
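In micro-averaging, the counts are pooled over all (document, category) decisions before the ratios are taken; a sketch with illustrative names:

    def micro_metrics(assigned, relevant):
        """assigned, relevant: sets of (document_id, category) pairs.
        Returns micro-averaged precision, recall, and F1."""
        tp = len(assigned & relevant)
        p = tp / len(assigned) if assigned else 0.0
        r = tp / len(relevant) if relevant else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

The BeP is then located by sweeping the classifier's decision threshold until precision and recall meet.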

3.4 Category Relevance Measure

In document-level categorization, to determine whether a document d belongs to a category c_j ∈ C = {c_1, c_2, ..., c_{|C|}}, our kNN classifier retrieves the k training documents most similar to d and computes c_j's weight by adding up the similarities between d and the retrieved documents that belong to c_j; if the weight is large enough, the decision is positive, and negative otherwise. Category c_j's weight for document d is called the category relevance score, Rel(c_j, d), and is computed as follows[17]:

    Rel(c_j, d) = \sum_{d' \in R_k(d) \cap D_j} Sim(d', d)    (2)

Table 2. Passage weight functions†

Function                                     Weighting tendency
pwf_1(p) = 1                                 Head = Body = Tail
pwf_2(p) = p^{-1}                            Head ≫ Body ≫ Tail
pwf_3(p) = p                                 Head < Body < Tail
pwf_4(p) = \sqrt{(p - n/2)^2}                Head = Tail > Body
pwf_5(p) = \sqrt{(n/2)^2 - (p - n/2)^2}      Head = Tail < Body
pwf_6(p) = (\log(p + 1))^{-1}                Head > Body > Tail

† Normalization factors are omitted for clarity. p: passage location; n: total number of passages.

where R_k(d) is the set of the k nearest neighbors (the top-ranked training documents from the 1st to the kth) of document d, D_j is the set of training documents assigned category c_j, and Sim(d', d) is the document-document similarity obtained by Equation 1 in Section 3.2. For each test document, categories with relevance scores greater than a given threshold are assigned to the document.

In passage-level categorization, the same procedure as in document-level categorization is applied, and categories are assigned to each passage p_i of the test document d. However, since d's categories are not yet determined, relevance scores for all candidate categories are computed from the categories of all passages as follows:

    Rel(c_j, d) = \sum_{p_i \in P_j} pwf_n(i)    (3)

where p_i is the i-th passage of document d, P_j is the set of passages assigned category c_j, and pwf_n() is one of the passage weight functions shown in Table 2. As in document-level categorization, categories with relevance scores greater than a given threshold are assigned to the document. As stated in Section 2.2, we use 6 passage weight functions (pwfs), which are functions of passage location and return a value between 0 and 1 (for clarity, the normalization factors are omitted in Table 2). The weighting tendency of each pwf is also shown, dividing a document roughly into Head, Body, and Tail.
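The weight functions of Table 2 and the relevance scores of Equations 2 and 3 transcribe directly. The sketch below is ours: in particular, the paper omits the normalization factors, so scaling each function to [0, 1] by its maximum over p = 1..n is our assumption, and all identifiers are illustrative.

    import math

    # Table 2, unnormalized; p is the passage location (1..n).
    PWF = {
        1: lambda p, n: 1.0,                        # Head = Body = Tail
        2: lambda p, n: 1.0 / p,                    # Head >> Body >> Tail
        3: lambda p, n: float(p),                   # Head < Body < Tail
        4: lambda p, n: math.sqrt((p - n / 2) ** 2),                 # ends high
        5: lambda p, n: math.sqrt((n / 2) ** 2 - (p - n / 2) ** 2),  # body high
        6: lambda p, n: 1.0 / math.log(p + 1),      # Head > Body > Tail
    }

    def pwf(k, p, n):
        """Weight of the p-th of n passages under pwf_k, scaled to [0, 1]."""
        peak = max(PWF[k](q, n) for q in range(1, n + 1))
        return PWF[k](p, n) / peak if peak else 0.0

    def relevance_document(neighbor_sims, neighbor_cats, c):
        """Equation 2: sum of Sim(d', d) over the k nearest neighbors d' of d
        that are assigned category c (parallel lists of similarity scores
        and category sets)."""
        return sum(s for s, cats in zip(neighbor_sims, neighbor_cats)
                   if c in cats)

    def relevance_passages(passage_categories, c, k):
        """Equation 3: sum of pwf_k(i) over the passages assigned c."""
        n = len(passage_categories)
        return sum(pwf(k, i, n)
                   for i, cats in enumerate(passage_categories, start=1)
                   if c in cats)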

4 Experiments

4.1 Experimental Settings

Terms separated by space characters were extracted as features from documents and passages. Stemming was not applied to the terms, since it hurt effectiveness in our experiments (data not shown), as shown in previous research such as [2]. Common words (also known as stopwords) were removed from the feature pool. We also removed digits and numeric values from the feature pool; removing digits slightly improved the categorization effectiveness for the Apte data set and its three subsets (data not shown).


Table 3. Average passage number and length.

Passage Type              Apte       GT800      GT1200     GT1600     Theses
Non-overlapping window    1.8(0.5)   2.2(0.5)   4.2(0.6)   4.7(0.6)   114.0(0.8)
Overlapping window        2.4(0.5)   5.1(0.5)   6.5(0.6)   8.0(0.6)   226.4(0.8)
Paragraph                 7.1(0.1)   10.8(0.2)  12.8(0.2)  15.6(0.2)  90.8(1.0)
Bounded-paragraph         2.3(0.4)   2.8(0.6)   5.4(0.4)   6.6(0.4)   72.3(1.3)
Page                      N/A        N/A        N/A        N/A        56.9(1.6)
TextTile                  1.9(0.5)   3.1(0.6)   3.5(0.6)   3.8(0.7)   64.4(1.4)

Entries are the average number of passages per document; the average length (kilobytes) of a passage is shown in parentheses.

An additional step was applied to the KISTI-Theses data set. Since the majority of its documents are written in Hangul (Korean text), we applied a Hangul morpheme analyzer to the Hangul terms and included the resulting morphemes in the feature pool5. Feature selection was applied according to terms' document frequencies (DF); [15] showed that DF is a simple, effective, and reliable thresholding measure for selecting features in text categorization. In our experiments, by varying the DF range during feature selection, we chose the minimal DF (DFmin) and maximal DF (DFmax) that performed best in document-level categorization for each test collection. The same DFmin and DFmax were also applied to the passage-level categorization tasks. Regarding k in our kNN classifier, the best k value for document-level categorization was also used for the passage-level tasks. For the Apte, GT800, GT1200, and GT1600 data sets, k was selected to be 10; for the KISTI-Theses collection, k = 1 was chosen. The value k = 1 for the KISTI-Theses collection is very interesting; it is probably due to the fact that only one category is assigned to each document, and each document may be long enough to describe the retrieved category. More discussion is given in Section 4.3.5.

Table 3 shows the average number of passages per test document and their average length for each data set. The data for both non-overlapping and overlapping windows are shown for a passage size of 100 words. Since a page is bounded to a minimal length of 1.0kb, we consider it meaningless to apply the page type to short documents of 1 or 2kb. Therefore, we did not test the effectiveness of the page type on the Apte collection and its three subsets.
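The DF thresholding used here amounts to a few lines; the following sketch (names are ours) keeps the terms whose document frequency falls inside the tuned [DFmin, DFmax] range.

    from collections import Counter

    def select_features(documents, df_min, df_max):
        """documents: iterable of token lists. Keep terms whose document
        frequency lies in [df_min, df_max]."""
        df = Counter()
        for tokens in documents:
            df.update(set(tokens))  # count each term once per document
        return {t for t, f in df.items() if df_min <= f <= df_max}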

4.2 Effectiveness of Passage Weight Functions

As stated in Section 2.2, we assume that the location of a passage plays an important role in determining the parent document's main topic(s). To reflect this assumption, we introduced six different passage weight functions, which are functions of passage location and return passage weights between 0 and 1 (see Table 2). By summing these passage weights, the document's categories are assigned, as shown in Figure 1 in Section 2.

5. The Hangul morpheme analyzer used in this experiment is a component of the KRISTAL-II information retrieval system, which serves as the base of our kNN classifier.

[Figure 2. Effectiveness of the 6 passage weighting functions (data set: GT1600): micro-averaged break-even points plotted against passage type for each weight function.]

Micro-averaged break-even points for the GT1600 collection are plotted against the various passage types in Figure 2, where the trends for the six passage weight functions can be seen. For the GT1600 collection, passage weight functions 2 (pwf2) and 6 (pwf6) show the best performance. These two functions return a high weight for passages from the Head part, a middle weight from the Body, and a low weight from the Tail part (Table 2). pwf3 has the reverse pattern and shows the worst performance. The situation is the same for the Apte, GT800, and GT1200 collections (data not shown). This means that the main topics of Reuters news articles are determined mainly by the early part of the document. On the other hand, the KISTI-Theses data set shows a very different picture (Figure 3). pwf2 shows the worst performance for the KISTI-Theses data set, while it was best for the newswire data, Apte and its three subsets. Furthermore, although the weighting tendencies of pwf2 and pwf6 are very similar (Table 2), pwf6 performs well for several passage types while pwf2 is poor for all passage types. This situation makes interpretation very difficult. However, it is clear that different passage weighting functions should be applied to different document types. Note that pwf1 always returns 1 for any passage location, while the other pwfs return variable values depending on the passage's location in the document. Yet in Figure 2, pwf2 and pwf6 outperform pwf1, and in Figure 3, pwf3 and pwf6 outperform pwf1. These results support our assumption that each passage contributes to the document's main topic(s) to a different degree.

[Figure 3. Effectiveness of the 6 passage weighting functions (data set: KISTI-Theses).]

4.3 Effectiveness of Passages

4.3.1 Apte collection

Experimental results for the Apte test collection are shown in Table 4. Micro-averaged precision, recall, and break-even point (BeP) with passage weight function 2 (pwf2) applied are presented as performance measures; see Section 4.2 for a comparison of the six pwfs' effectiveness. The differences, ∆%, are the performance improvements in percent compared with document-based categorization and are based on the break-even points. Since the Apte, GT800, GT1200, and GT1600 test collections consist of rather short documents (see Table 1), the page type was not applied to these data sets. We also slightly modified TextTiling to yield fine-grained tiles, because standard TextTiling is too coarse-grained a segmentation algorithm for these data sets: the average size of the tiles partitioned by the modified TextTiling algorithm is 98 words, corresponding to 454 bytes, while the average size of standard TextTiles is 1 or 2 kilobytes (see Table 3). Though there is no significant improvement in BeP, overlapping windows at passage sizes of 100 and 150 words showed the best performance.

Table 4. Effectiveness of passages for Apte dataset with pwf2

                              Precision   Recall   BeP†     ∆%
Document                      0.818       0.818    0.818    0.0
Non-overlapping Windows
  window size = 50            0.817       0.817    0.817    -0.1
  window size = 100           0.820       0.821    0.820    0.3
  window size = 150           0.822       0.822    0.822    0.5
  window size = 200           0.819       0.824    0.822‡   0.4
Overlapping Windows
  window/overlap = 50/25      0.820       0.820    0.820    0.3
  window/overlap = 100/50     0.823       0.823    0.823    0.6
  window/overlap = 150/75     0.823       0.823    0.823    0.6
  window/overlap = 200/100    0.822       0.822    0.822    0.4
Paragraphs                    0.812       0.812    0.812    -0.8
Bounded-Paragraphs            0.761       0.766    0.763‡   -6.7
TextTiles                     0.831       0.812    0.821‡   0.4

† Micro-averaged precision and recall break-even points
‡ Micro-averaged F1 measures

The passage type showing the worst performance is bounded-paragraphs. A bounded-paragraph is defined as a paragraph containing at least 50 words and at most 200 words[3]. We reason that the short documents of the Apte collection cause malformed bounded-paragraphs, resulting in poor performance. The small performance improvement for all passage types on this data set seems to be due to the fact that the proportion of very short documents in the Apte collection is too high. For example, about 60% of the documents in the test set are shorter than 100 words, which corresponds to only one non-overlapping window of size 100. The improvement from passage-level categorization is thus overwhelmed by the inaccuracy introduced by the category merging step in Figure 1. To eliminate the negative effect of short test documents and to verify the effectiveness of passage-level categorization, we prepared three subsets of the Apte collection: GT800, GT1200, and GT1600. The results are shown in the following sections.

4.3.2 GT800 collection

Experimental results for the GT800 test collection are shown in Table 5. The GT800 collection is a subset of the Apte test collection in which documents shorter than 800 bytes are removed from the test set. The experimental environment is the same as that of the Apte collection. There are some improvements in performance for most passage types, with overlapping windows of size 100 working best. Compared with the Apte collection, the performance improvements on this data set are clear. A detailed explanation is given in Section 4.3.4.

Table 5. Effectiveness of passages for GT800 dataset with pwf2

                              Precision   Recall   BeP†     ∆%
Document                      0.690       0.690    0.690    0.0
Non-overlapping Windows
  window size = 50            0.688       0.688    0.688    -0.3
  window size = 100           0.695       0.699    0.697    1.0
  window size = 150           0.706       0.701    0.703    1.9
  window size = 200           0.706       0.706    0.706    2.3
Overlapping Windows
  window/overlap = 50/25      0.690       0.690    0.690    0.0
  window/overlap = 100/50     0.715       0.707    0.711‡   3.0
  window/overlap = 150/75     0.704       0.704    0.704    2.0
  window/overlap = 200/100    0.706       0.710    0.708    2.6
Paragraphs                    0.696       0.697    0.697    0.9
Bounded-Paragraphs            0.688       0.707    0.697‡   1.0
TextTiles                     0.704       0.702    0.703    1.9

† Micro-averaged precision and recall break-even points
‡ Micro-averaged F1 measures

4.3.3 GT1200 collection

Experimental results for the GT1200 test collection are shown in Table 6. The GT1200 collection is a subset of the Apte test collection in which documents shorter than 1,200 bytes are removed from the test set. The experimental environment is the same as that of the Apte collection. There are significant improvements in performance for most passage types, with overlapping windows of size 100 working best. As the lengths of the test documents increase, compared with the Apte and GT800 data sets, the difference in BeP between passage-level and document-level categorization grows. A detailed explanation is given in Section 4.3.4.

4.3.4 GT1600 collection

Experimental results for the GT1600 test collection are shown in Table 7. The GT1600 collection is a subset of the Apte test collection in which documents shorter than 1,600 bytes are removed from the test set. The experimental environment is the same as that of the Apte collection. For the Apte and all GTnnnn collections, overlapping windows showed the best performance. This is probably because the variable length distributions of paragraphs, bounded-paragraphs, and tiles, in contrast to the evenly distributed lengths of overlapping windows, cause skewed categorization of individual passages and, accordingly, poorer performance.

Table 6. Effectiveness of passages for GT1200 dataset with pwf2

                              Precision   Recall   BeP†     ∆%
Document                      0.660       0.660    0.660    0.0
Non-overlapping Windows
  window size = 50            0.674       0.673    0.673    2.1
  window size = 100           0.689       0.683    0.686‡   4.0
  window size = 150           0.665       0.664    0.665    0.8
  window size = 200           0.642       0.637    0.640‡   -3.0
Overlapping Windows
  window/overlap = 50/25      0.677       0.678    0.678    2.7
  window/overlap = 100/50     0.690       0.689    0.689    4.5
  window/overlap = 150/75     0.665       0.664    0.665    0.8
  window/overlap = 200/100    0.643       0.645    0.644    -2.3
Paragraphs                    0.675       0.675    0.675    2.3
Bounded-Paragraphs            0.683       0.683    0.683    3.5
TextTiles                     0.671       0.670    0.671    1.7

† Micro-averaged precision and recall break-even points
‡ Micro-averaged F1 measures

Furthermore, the overlapping window type is superior to the non-overlapping window type for all data sets. While other passage types, including non-overlapping windows, lose some term locality information at passage boundaries, overlapping window passages preserve this locality information, since they overlap with adjacent passages. The best-performing passage size for both non-overlapping and overlapping windows is 100 words; this corresponds to about 3 paragraphs for the test documents of the Apte collection and its three subsets. (Note that we do not include numeric values in the word count.) The point of the GTnnnn collections is the increasing length of the test documents obtained by removing short documents from the Apte test set. From Tables 5, 6, and 7, the performance improvements are approximately proportional to the lengths of the test documents. This is clear in Figure 4, where the performance improvements in percent are plotted against the average sizes of the test documents for the Apte, GT800, GT1200, and GT1600 data sets; the figure again shows that the overlapping passage type is the best performer. There is a trend that performance improves further as the average document size grows, and there is a strong linear correlation between average document size and performance improvement for the overlapping window type. This means that the longer the test documents are, the better all passage types perform. So far we have examined the results for the longer documents among the short-document data sets. In the following section, we present passage-level classification of the full-length documents of master's and doctoral theses.

Table 7. Effectiveness of passages for GT1600 dataset with pwf2

                              Precision   Recall   BeP†     ∆%
Document                      0.636       0.636    0.636    0.0
Non-overlapping Windows
  window size = 50            0.649       0.648    0.648    1.9
  window size = 100           0.665       0.663    0.664    4.4
  window size = 150           0.653       0.642    0.647‡   1.8
  window size = 200           0.626       0.621    0.623    -2.0
Overlapping Windows
  window/overlap = 50/25      0.658       0.658    0.658    3.4
  window/overlap = 100/50     0.670       0.670    0.670    5.4
  window/overlap = 150/75     0.668       0.663    0.666    4.6
  window/overlap = 200/100    0.649       0.649    0.649    2.0
Paragraphs                    0.652       0.652    0.652    2.5
Bounded-Paragraphs            0.665       0.665    0.665    4.5
TextTiles                     0.659       0.658    0.658    3.4

† Micro-averaged precision and recall break-even points
‡ Micro-averaged F1 measures

4.3.5 KISTI-Theses collection

So far we have examined passage-based text categorization on data sets with short documents of one or two kilobytes (Table 1). In this section, we present experimental results for the KISTI-Theses data set, which consists of full-length documents averaging 92.9kb and varying from 14.8kb to 533.5kb. Data with passage weight function 3 (pwf3) are shown in Table 8. In this experiment, the k value for the kNN classifier is 1, and the feature selection condition is a document frequency of at least 2 and at most 69 (corresponding to 10% of the training documents). The choice k = 1 for this collection is very interesting. It is probably due to the fact that this collection is a rather small data set and only one category is assigned to each document; this means that the higher the k value, the higher the possibility that inadequate documents are contained among the top-ranked documents, causing performance degradation. In addition, since all the documents in the collection are very long (92,900 characters on average), even a single top-ranked document may contain sufficient terms to fully describe a category's features. Categorization using any type of passage is more effective than using the whole document as the categorization unit (Table 8). Based on Figure 4 and Table 8, the overlapping window of passage size 100 shows the best performance for all data sets. Moreover, for all data sets except the Apte collection, the bounded-paragraph type performs better than the paragraph type. As stated in Section 2.1, the lengths of bounded-paragraphs are less skewed than those of paragraphs, because the length of a bounded-paragraph is bounded below by some minimal value, while paragraphs vary from one sentence to tens of sentences. Skew in passage sizes seems to be harmful to passage-based text categorization.

[Figure 4. Correlation between document size and performance improvement for various passage types. The first point for bounded-paragraphs is omitted due to its huge bias.]

Finally, we briefly note the speed-effectiveness trade-off in classifying the KISTI-Theses data set. In this experiment, we find that passage-level categorization is 2 to 5 times slower than document-level categorization, but the required memory space is greatly reduced, because splitting a document into passages multiplies the number of categorization tasks by the passage count while shrinking the work load of each task from document size to passage size. This is a bearable trade-off considering that a document is split into hundreds of passages in the case of the KISTI-Theses collection (see Table 3). The speed-memory trade-off will be very useful when it is impractical to apply whole-document categorization due to the bulky volume of the categorization unit, as with full-length digital books and whole web sites.

5 Conclusions

The advent of full-length document databases and the expansion of web sites pose new challenges to text classification, because traditional whole-document classification may be impractical for these document types. In this article, we introduced a new text categorization model, called passage-based text categorization, in which the problem size (the categorization unit) is reduced from the whole document to the small passages split from it.

Table 8. Effectiveness of passages for KISTI-Theses dataset with pwf3

                              Precision   Recall   BeP†     ∆%
Document                      0.631       0.631    0.631    0.0
Non-overlapping Windows
  window size = 50            0.677       0.677    0.677    7.3
  window size = 100           0.695       0.695    0.695    10.0
  window size = 200           0.683       0.683    0.683    8.2
  window size = 400           0.687       0.689    0.688‡   9.0
Overlapping Windows
  window/overlap = 50/25      0.669       0.669    0.669    5.9
  window/overlap = 100/50     0.697       0.697    0.697    10.5
  window/overlap = 200/100    0.692       0.692    0.692    9.6
  window/overlap = 400/200    0.686       0.686    0.686    8.7
Paragraphs                    0.669       0.669    0.669    5.9
Bounded-Paragraphs            0.686       0.686    0.686    8.7
Pages                         0.689       0.689    0.689    9.1
TextTiles                     0.689       0.689    0.689    9.1

† Micro-averaged precision and recall break-even points
‡ Micro-averaged F1 measures

We explored text categorization using various passage types: non-overlapping windows, overlapping windows, paragraphs, bounded-paragraphs, pages, and tiles. The improvement obtained by passage-level categorization compared with whole-document categorization is greater than 5% for the short-document data sets and greater than 10% for the long-document data set. For all passage types, there are general improvements in categorization effectiveness. However, overlapping windows showed effectiveness superior to the other passage types across all five test collections.

We also introduced the passage weight, which is applied when merging passage categories into document categories. While overall improvements are observed for all passage weighting schemes tested in this article, there exist one or more schemes superior to the simple summation of weights (pwf1). Therefore, careful design of the passage weighting scheme can further improve effectiveness. Furthermore, our results showed that different data sets have different optimal passage weighting schemes.

The superior effectiveness of passages over the whole document may be the result of a few factors. First, passages can capture the subtopic structure of the test document, while the whole-document approach ignores the detailed design of the document. In passage-level text categorization, this subtopic structure can be exploited by a pool of classifiers, each of which has a local, i.e. passage-level, view of the document. Second, passages are short segments of the document, so they embody locality: document-level classification mashes together a term's locality information, but dividing the document into passages preserves partial locality. In this respect, it is natural that overlapping windows, which preserve locality even at the boundaries, outperform the other passage types.

Volume 1, Issue 4 (July 2009)

381

Dividing documents into smaller passages is an effective mechanism for text categorization when collections of long documents are considered. Even in collections of short texts, we found that passages have the potential to improve effectiveness. Though there is a slight speed-effectiveness trade-off in passage-based text categorization, it is bearable when processing really long documents. We suggest that passage-level text categorization is an influential method in environments with unstructured full-length documents, XML documents, and whole web sites. For unstructured documents, any passage type cited in this paper can be applied to the categorization task. Regarding XML documents, since they are naturally composed of several passages (each suitable element content being a passage), it is reasonable to expect that passage-based categorization is well suited to them. For web site classification, since a web site is usually composed of many web pages, the passage-based categorization model can be applied by regarding each web page as a passage. The passage-based text categorization model resembles classifier committees[8, 11] in two ways. First, its classification result is obtained by a pool of classifiers. Second, it uses a passage weight function to compute the document's categories, while classifier committees use a combination function to choose categories[8, 11]. Therefore, we expect that many research results on classifier committees can also be applied to passage-based categorization in the near future.

Acknowledgements

We would like to thank Wonkyun Joo for some helpful comments and fruitful discussions, Hwa-muk Yoon for providing the raw data for the KISTI-Theses test collection, and Changmin Kim and Jieun Chong for supporting this work.

References

1. Apte, C., Damerau, F., and Weiss, F. (1994). Towards Language Independent Automated Learning of Text Categorization Models. Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 23-30.
2. Baker, L. D. and McCallum, A. K. (1998). Distributional Clustering of Words for Text Classification. Proceedings of the 21st Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 96-103.
3. Callan, J. P. (1994). Passage Retrieval Evidence in Document Retrieval. Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 302-310.
4. Hearst, M. A. and Plaunt, C. (1993). Subtopic Structuring for Full-length Document Access. Proceedings of the 16th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 59-68.
5. Hearst, M. A. (1994). Multi-paragraph Segmentation of Expository Texts. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 9-16.
6. Kaszkiel, M., Zobel, J., and Sacks-Davis, R. (1999). Efficient Passage Ranking for Document Databases. ACM Transactions on Information Systems, 17(4), 406-439.
7. Kaszkiel, M. and Zobel, J. (2001). Effective Ranking with Arbitrary Passages. Journal of the American Society for Information Science and Technology, 52(4), 344-364.
8. Larkey, L. S. and Croft, W. B. (1996). Combining Classifiers in Text Categorization. Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, 289-297.
9. Moffat, A., Sacks-Davis, R., Wilkinson, R., and Zobel, J. (1994). Retrieval of Partial Documents. NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC 2), 181-190.
10. Salton, G., Allan, J., and Buckley, C. (1993). Approaches to Passage Retrieval in Full Text Information Systems. Proceedings of the 16th Annual International Conference on Research and Development in Information Retrieval, 49-58.
11. Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), 1-47.
12. van Rijsbergen, C. (1979). Information Retrieval. Butterworths, London.
13. Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco.
14. Yang, Y. (1994). Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval. Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 13-22.
15. Yang, Y. and Pedersen, J. O. (1997). A Comparative Study on Feature Selection in Text Categorization. Proceedings of the 14th International Conference on Machine Learning (ICML'97), 412-420.
16. Yang, Y. (1999). An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, 1(1), 67-88.
17. Yang, Y., Slattery, S., and Ghani, R. (2002). A Study of Approaches to Hypertext Categorization. Journal of Intelligent Information Systems, 17(2), 219-241.
18. Zobel, J., Moffat, A., Wilkinson, R., and Sacks-Davis, R. (1995). Efficient Retrieval of Partial Documents. Information Processing and Management, 31(3), 361-377.