A Hybrid Two-Stage Approach for Discipline-Independent Canonical Representation Extraction from References

Sung Hee Park
Roger W. Ehrich
Edward A. Fox
Digital Library Research Laboratory Department of Computer Science Virginia Tech Blacksburg, Virginia, 24061
Center for Human Computer Interaction Department of Computer Science Virginia Tech Blacksburg, Virginia, 24061
Digital Library Research Laboratory Department of Computer Science Virginia Tech Blacksburg, VA, 24061
[email protected] [email protected] [email protected]

ABSTRACT
In education and research, references play a key role. However, extracting and parsing references are difficult problems. One concern is that there are many styles of references; hence, given a surface form, identifying what style was employed is problematic, especially in heterogeneous collections of theses and dissertations, which cover many fields and disciplines, and where different styles may be used even in the same publication. We address these problems by drawing upon suitable knowledge found in the WWW. In particular, we research a two-stage classifier approach, involving multiclass classification with respect to reference styles, and partially solve the problem of parsing surface representations of references. We describe empirical evidence for the effectiveness of our approach and plans for improvement of our methods.
Categories and Subject Descriptors H.3.7 [Information storage and retrieval]: Digital Libraries—systems issues; I.5 [Pattern Recognition]: Design Methodology—classifier design and evaluation
General Terms Algorithm, Performance, Design, Experimentation
Keywords Canonical Representation Extraction, Knowledge Acquisition, Reverse-Engineering, Style-Free Reference Metadata Extraction
1. INTRODUCTION
In scholarly digital libraries, citation/reference analysis can support patrons if it leads to suitable services. Citation relations between digital objects in collections can be
identified. Examples of citation analysis include: metadata extraction [1, 4, 5, 6, 12, 14, 29], citation matching/entity resolution [2, 23, 27], and citation/co-authorship network construction and analysis [7]. These studies are important, since scholars use citations to support their discussion and argumentation, and since collections of references aid in identifying relationships among works. Garfield and others have used citation indexes to aid bibliometric studies [10]. Further, as research papers have become readily available on the WWW, automated citation indexing systems, such as CiteSeer [18], have been developed. Accurate citation metadata extraction now plays a critical role in research evaluation [25]. Some work on digital libraries aims to aggregate publications across groups and disciplines. For instance, consider Virginia Tech's ETD-db digital library, an instantiation of the ETD-db system originally developed at Virginia Tech. It contains over 18,000 electronic theses and dissertations from 8 colleges and 79 departments (http://www.vt.edu/academics/academicdepartments.html). Another example is arXiv, with 662,023 e-prints from 7 large disciplines, covering 148 sub-categories (Physics (45), Mathematics (35), Nonlinear Science (5), Computer Science (40), Quantitative Biology (10), Quantitative Finance (7), Statistics (6)) as of March 2011. Extracting references from such collections is difficult since those references take a variety of different forms depending on the discipline. Currently, a number of reference extraction and analysis approaches and systems make use of domain knowledge and are specialized to an area, such as computing or chemistry. However, for a citation analysis algorithm to be scalable, it should be able to extract information from heterogeneous collections and put the results into suitable canonical forms. It should operate across disciplines and reference styles, and handle a broad range of reference surface forms [25]. Due to the success of machine learning, natural language processing, and web mining, some researchers envision Machine Reading [22], which goes beyond domain-independent and unsupervised text understanding. We aim to help move toward the vision of Machine Reading by using knowledge derived from web mining, along with machine learning techniques such as support vector machines (SVM) and conditional random fields (CRF).
Figure 1: Overview of the Canonical Representation Extraction Problem (surface forms of the same reference in styles such as APA, Turabian, and IEEE are mapped, by rule-based or machine learning approaches, to a canonical representation such as BibTeX).

This will overcome limitations of current citation information extraction techniques, which work in a domain-specific and supervised manner. Further, we can learn from analysis of collections of references that appear in large documents, as opposed to only working on individual references. This paper is structured as follows. Section 2 defines the challenging problems we deal with. Section 3 reviews related work in terms of classification methods and features used in classification. Section 4 describes our methodology for tackling these problems. Section 5 discusses our evaluation of performance and effectiveness. Conclusions and future work close the paper.
Table 1: Comparison of Previous Approaches

Author & Year              | Supervised (S) / Unsupervised (U) | Approach
Day et al. (2006) [6]      | S | Rule/knowledge-based
Cortez et al. (2007) [4]   | U | Rule/knowledge-based
Afzal et al. (2010) [1]    | U | Rule/knowledge-based
Councill et al. (2008) [5] | S | Machine learning
Hong et al. (2009) [14]    | S | Machine learning
Hetzner (2008) [12]        | S | Machine learning

2. PROBLEM STATEMENT
Our problem, as illustrated intuitively in Figure 1, can be considered an inverse problem by mathematicians, a reverse engineering problem by software engineering researchers, and a parsing vs. rendering problem by compiler and linguistics experts. More formally, the problems we deal with include: 1) the surface & semantics mapping problem, 2) the disciplines & styles problem, and 3) the implicit background knowledge acquisition problem. The surface & semantics mapping problem is a semantic tagging / sequence labeling problem, as in natural language processing and bioinformatics: the association of a sequence of reference tokens with their semantic labels, where the input sequence is a reference surface form. This can be formalized as follows. Let a reference surface form be a sequence of tokens T = (t1, t2, ..., tn) and let its meaning (semantics) be a sequence of semantic labels L = (l1, l2, ..., lm). A sequence labeling is a mapping f : T → L. From the conditional probabilistic perspective, the sequence labeling problem is to find a mapping f that maximizes the conditional probability P(L|T), that is, how likely the label sequence L is given the token sequence T. The disciplines & styles problem concerns the scalability of the surface and semantics mapping problem. For the many specific disciplines found, reference styles must be interpreted so references can be analyzed. For this reason, the two terms style-free and discipline-independent will be used in
this paper interchangeably. Thus, solving this problem will provide a discipline-independent & style-independent surface & semantics mapping solution for users. This research involves knowledge acquisition from the Web: when humans interpret the semantics of the surface form of references, they use implicit background knowledge about a variety of reference styles. Citation analysis is a challenging problem because of: 1) the wide variety of citation styles [20], 2) different document types [26], 3) domain dependent properties [20], 4) lexical ambiguities (e.g., acronyms, homonyms of acronyms), 5) missing information (e.g., pages or venues), 6) inconsistency, and 7) errors in typing and other inaccuracies. The large number of reference styles (e.g., APA, IEEE) makes it difficult to fit a machine learning model to each specific reference style. Different document types, sometimes referred to as genres (e.g., books, book chapters, journal articles, and web pages), also complicate the task of citation metadata extraction. In particular, different domains (e.g., Computer Science, Health Science, and Social Science) tend to generate different terminals (e.g., different journal titles and conference titles) in the references, and these affect the output. This dependency reduces the generality of existing solutions and complicates the development of scalable approaches that can operate across the myriad domains found in large heterogeneous collections. Our problems raise the following research questions: 1. What features should be selected for training? 2. What methods give the best performance and greatest effectiveness?
3. LITERATURE REVIEW
In this section, we review prior work regarding canonical representation extraction in terms of 1) sequence labeling methods and 2) features used in sequence labeling.
3.1 Sequence Labeling Methods
Many efforts have been made to improve canonical representation extraction performance. These can be categorized into two broad types of approaches: 1) rule & knowledge-based approaches [6, 4, 1] and 2) machine learning approaches [5, 14, 12, 29]. Table 1 summarizes the labeling methods in terms of the classification method.
3.1.1 Rule/Knowledge-Based Approaches
Rule-based approaches to canonical reference representation extraction are, basically, classification methods that exploit rules to mark the tokens in the input sequence with appropriate semantic labels, where, generally, the rules are constructed in consultation with one or more human experts in the specific areas.
Some prior successful research has been done using rule-based approaches to reference metadata extraction, in a limited problem space (e.g., fewer than ten reference styles, two or three disciplines). Day et al. [6] adopted a rule-based approach using a knowledge representation framework called INFOMAP, with respect to six styles; this knowledge describes the layout appearances of semantic labels such as author, title, publisher, and year. Ding et al. [8] used different templates designed to deal with specific citations from digital contents. Cortez et al. [4] proposed a knowledge-based approach for extracting citation metadata in a flexible way, called FLUX-CiM, which consists of blocking, matching, binding, and joining processes. Unlike other knowledge/ontology based approaches such as [4, 6], this approach used a knowledge base to gather frequencies of terms that occur in each field, such as authors, titles, and journals. To evaluate the effectiveness of the method on reference style-free extraction, it used two or three disciplines (e.g., health science and social science). However, term frequency, the feature they used as the primary feature, is likely to depend on the discipline. In addition, Embley et al. [9] used a conceptual model, like Ding's template-based approach, not only for citation parsing but also for general information extraction. A general disadvantage of rule-based approaches is that it is not easy to extract rules. Although previous research has shown that they are effective for a limited set of reference styles and disciplines [4, 6], they are not easily adapted to our problem, which extends the problem space to the point where discipline-independent and style-free solutions are almost certainly required.
3.1.2 Machine Learning Approaches
The surface and semantics mapping problem can be framed as the sequence labeling problem in a natural language processing context. Machine learning approaches to sequence labeling include hidden Markov models (HMMs) [12], conditional random fields (CRFs) [14, 5, 21], prediction by partial matching (PPM) [28], and support vector machines (SVMs) [11]. These approaches can generally be grouped into two main categories: kernel-function-based methods, such as the support vector machine, and probabilistic graphical models, such as CRF, HMM, and the maximum entropy Markov model (MEMM). Originally, the SVM was a binary classifier, but it has been extended to solve the sequence labeling problem; SVMstruct is one support vector machine implementation for sequence labeling [16]. Probabilistic graphical models can further be grouped into generative and discriminative models. Generative models calculate a joint probability P(X, Y), whereas discriminative models directly calculate the conditional probability P(Y|X), where X is an input sequence and Y is an output label sequence. An example of a generative classifier for single-label classification is the naïve Bayes classifier; the HMM [24] is a typical generative model for segmenting and labeling sequence data, while a discriminative model such as the CRF calculates P(Y|X) directly.
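For concreteness, the standard factorizations (textbook formulations, not specific to our system) are, for a token sequence X = (x_1, ..., x_n) and a label sequence Y = (y_1, ..., y_n):

    P(X, Y) = \prod_{t=1}^{n} P(y_t \mid y_{t-1}) \, P(x_t \mid y_t)            \quad \text{(generative, e.g., HMM)}

    P(Y \mid X) = \prod_{t=1}^{n} P(y_t \mid y_{t-1}, x_t)                      \quad \text{(discriminative, e.g., MEMM; a CRF instead normalizes over the whole sequence)}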
Table 2: Features for Canonical Representation Extraction

Feature class       | Description                                                 | Literature
Local features      | Non-lexical information about the token                     | [5, 14, 21, 29, 30]
Lexical features    | Information about the meaning of the words within the token | [5, 14, 21, 29, 30]
Contextual features | Lexical or local features of a token's neighbours           | [5, 14]
Layout features     | Relative position of a word in the entire reference string  | [5, 14, 21, 29]
Supervised learning techniques learn from a training set labeled by human effort, which is time consuming and expensive. To alleviate this burden, some efficient methods called semi-supervised (weakly supervised) learning have been proposed and evaluated [15, 19]. In this paper, we describe how these machine learning approaches can be applied to our research problems.
3.2 Features
In general, with classification methods, features play a critical role in making it possible to discriminate the data belonging to one class from the data belonging to other classes. Novel and effective features have been proposed and utilized. Councill et al. [5] used 23 features, grouped as: token identity (3 features), N-gram prefix/suffix (9), orthographic case (1), punctuation (1), number (1), dictionary (6), location (1), and possible editor (1), in the open source reference parsing tool ParsCit, which is currently used in CiteSeerX. Hong et al. [14] used a set of ten features, categorized into lexical (dictionary), local, contextual, and layout features, in FireCite, a lightweight real-time reference string extraction and parsing system. Similarly, Zou et al. [30] used 14 binary features, consisting of three dictionary features (author name, article title, and journal title) and an additional 11 binary features describing local and orthographic information (e.g., pagination pattern, name initial pattern, four-digit year pattern). Yu and Fan [29] used 15 features, including nine local features (ALLCHINESE, CONTDIGITS, ALLDIGITS, SIXDIGITS, CONTDOTS, CONTAINS@, SINGLECHAR, NAME, EMAIL), three layout features (LINE START, LINE IN, LINE END), and three lexicon features (FAMILYNAME, AFFILIATION, ADDRESS) for metadata extraction from Chinese research papers. Peng and McCallum [21] investigated state transition features (unsupported vs. supported), local features, layout features, and lexicon features. Recognizing the properties of these features, automatic reference metadata labeling methods have used the following four classes of features, which we arbitrarily categorize as: 1) local features, 2) lexical features (dictionaries), 3) contextual features, and 4) layout features. A comparison of these classes of features is given in Table 2.
4. METHODOLOGY
In this study, we consider a hybrid method of knowledge-based approaches and machine learning methods to solve the style-free reference metadata extraction problem.
Algorithm 1: Acquiring training data for reference styles from the Web
1: Extract all style names from the bibliographical reference generation interface.
2: Import a reference set into EndNoteWeb.
3: Crawl all styled references.
4: Convert HTML files to text files.
5: Convert raw files to training sets.
Figure 2: A Flow of the Hybrid Information Extraction Process (input source texts, e.g., ETD references, go through preprocessing, tokenization, and feature extraction; knowledge bases and a training corpus of references, both built from the Web, drive the learning of classifiers/taggers, which perform information extraction and produce structured information as output).

One of the key features of our proposed method is that knowledge bases for tagging tokens, obtained by web mining, make reference metadata extraction more accurate. In addition to the knowledge bases, datasets consisting of concepts and instances for training are also crawled from the Web. Once the knowledge bases and training datasets are built, machine learning methods capture the rules for labeling a sequence of input reference strings. Our methodology (see Figure 2) can be largely divided into three parts: 1) building knowledge bases through web mining, 2) feature extraction, including preprocessing and tokenization, and 3) two-stage classification and sequence labeling, highlighted with labels 1, 2, and 3 in Figure 2.
4.1 Building Knowledge Bases from Mining the Web

4.1.1 Knowledge Bases
Recently, Cortez et al. [4] defined a knowledge base as a set of pairs of objects and instances, like a dictionary. We borrow their definition of a knowledge base here. A knowledge base is defined as a set of pairs K = {(o1, i1), (o2, i2), ..., (on, in)}, where oi is a bibliographical field like author, title, journal, or year, and ii is its corresponding instance or value. For example, entries in the knowledge base might be ('AUTHOR', 'John Smith'), ('JOURNAL', 'ACM Transactions on Information Systems'), and ('YEAR', '2011'). We also consider formats for knowledge bases other than (object, instance) pairs; a so-called compound object is a set of objects that can describe a canonical representation of references (e.g., ('IEEE style', 'A. Albillos, et al., "Value of the hepatic venous pressure gradient to monitor drug therapy for portal hypertension: A meta-analysis," American Journal of Gastroenterology, vol. 102, pp. 1116-1126, May 2007.')).
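For illustration only, such a knowledge base of (object, instance) pairs can be kept as a simple lookup structure; the entries below are hypothetical examples in the spirit of the definition above, not the contents of our actual knowledge base:

    # Knowledge base as a set of (field, instance) pairs, indexed by instance for token tagging.
    knowledge_base = {
        ("AUTHOR", "John Smith"),
        ("JOURNAL", "ACM Transactions on Information Systems"),
        ("YEAR", "2011"),
    }

    index = {}
    for field, instance in knowledge_base:
        # Map each known instance (lower-cased) to the bibliographic fields it can denote.
        index.setdefault(instance.lower(), set()).add(field)

    def lookup(token):
        # Candidate bibliographic fields for a token or phrase; empty set if unknown.
        return index.get(token.lower(), set())

    print(lookup("2011"))        # {'YEAR'}
    print(lookup("John Smith"))  # {'AUTHOR'}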
4.1.2 Reference Styles

Sources: EndNoteWeb (http://www.myendnoteweb.com) is a private web resource that can be accessed by subscription. It provides a service for generating more than 3,000 reference styles from an imported canonical representation (e.g., BibTeX, RIS, or EndNote format).

Pipelines: Figure 3 illustrates a knowledge acquisition process for reference styles based on reverse-engineering a reference management tool. The input data are randomly sampled references crawled from the Web. They are imported into a reference management tool such as EndNote or Zotero, which exports a large number of styled references (see Algorithm 1). This knowledge is used as a training corpus of references for training our reference style classifiers and sequence labelers.
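A minimal sketch of steps 4-5 of Algorithm 1, assuming the crawled pages have already been saved locally; the file layout (one page per style, named after the style) and the tag-stripping heuristic are illustrative assumptions, not the exact EndNoteWeb page structure:

    import re
    from pathlib import Path

    def html_to_text(html):
        # Step 4 of Algorithm 1: strip tags and collapse whitespace.
        text = re.sub(r"<[^>]+>", " ", html)
        return re.sub(r"\s+", " ", text).strip()

    def build_training_set(html_dir, out_path):
        # Step 5 of Algorithm 1: one tab-separated line per styled reference.
        with open(out_path, "w", encoding="utf-8") as out:
            for page in sorted(Path(html_dir).glob("*.html")):
                style_name = page.stem   # assumption: file is named after the style
                styled_reference = html_to_text(page.read_text(encoding="utf-8"))
                out.write(f"{style_name}\t{styled_reference}\n")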
Figure 3: Our Proposed Reference Style Knowledge Acquisition Method (references, randomly sampled across disciplines and genres, are imported into a reference management tool such as EndNote and rendered into a large number of styled surface forms, which serve as training data for classifier training).

Knowledge instances: Instances of the more than 3,000 reference styles are shown in Figure 4 and are available at http://parsifal.dlib.vt.edu/kbstyles. The second frame is illustrative of the 3,338 style entries. Reference surface forms are listed in the third frame.

Figure 4: Knowledge Base for Reference Styles Acquired from EndNote.
4.2 Feature Extraction

4.2.1 Tokenization
Generally, in natural language processing, tokenization is a step in which an input string is segmented into a set of tokens. It uses non-alphanumeric characters, such as commas (,), parentheses, spaces, or tabs ('\t'), as delimiters; thereafter, the delimiters are usually discarded. However, we exploit these delimiters as significant features for discriminating among fields even after the tokenization step, since the delimiters in the surface form of our input reference strings carry useful information.
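A minimal sketch of such a tokenizer (our illustration, not the exact implementation): splitting on the delimiters with a capturing group keeps them as tokens instead of discarding them:

    import re

    # Delimiters are kept (via the capturing group) because they are cues for reference styles.
    DELIMITER_PATTERN = r'([,;:.()"\s])'

    def tokenize(reference):
        parts = re.split(DELIMITER_PATTERN, reference)
        # Drop empty strings and whitespace, but keep punctuation tokens as features.
        return [p for p in parts if p and not p.isspace()]

    print(tokenize('Albillos, A., et al. (2007). "Value of ..."'))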
4.2.2 Feature Types
We consider two types of features: local features and contextual features. Local features: Local features provide orthographic information about a token. The local features shown in Table 3 will be investigated, as used in both reference style classification (stage 1) and sequence labeling (stage 2). Contextual features: Contextual features capture state transitions; in our problem, these features measure the plausibility of orderings of tokens. The contextual features reviewed in Table 2 are being investigated.
Table 3: Local Features

Category                   | Name                | Description                                    | Example
Letters Patterns           | INITCAP             | Starts with a capitalized letter               | Computer Science
Letters Patterns           | ALLCAP              | All letters are capitalized                    | COMPUTER
Letters Patterns           | ACRO                | Acronyms                                       | WWW
Letters Patterns           | LONELYINITIAL       | One single capitalized letter                  | S.
Letters Patterns           | CAPLETTER           | Contains capital letters                       | arXiv
Letters Patterns           | SINGLECHAR          | One single character                           | p
Special Character Patterns | CONTAINSDOTS        | Contains at least one dot                      | S., C4.5
Special Character Patterns | CONTAINSDASH        | Contains at least one dash                     | 123-124
Special Character Patterns | PUNC                | Punctuation                                    | dot ("."), comma (",")
Special Patterns           | URL                 | Regular expression for URLs                    | http://www.acm.org
Special Patterns           | ENDEDWITHDOT        | Regular expression for ending with a dot       | A.
Special Patterns           | EMAIL               | Regular expression for e-mail addresses        | [email protected]
Special Patterns           | WORD                | Word                                           | references
Numeric Patterns           | PAGINATION          | Regular expression for pagination formats      | 200-5, H100-H105
Numeric Patterns           | FOURDIGITYEAR       | Regular expression for four-digit year pattern | 2005
Numeric Patterns           | SIXDIGIT            | Regular expression for six-digit patterns      | 2005
Numeric Patterns           | CONTAINSDIGITS      | Contains at least one digit                    | 1, F1, A1*
Numeric Patterns           | ALLDIGITS/DIGITONLY | All letters are numeric                        | 111
Numeric Patterns           | DIGITANDLETTERONLY  | Contains digits and letters only               | H1N1
Numeric Patterns           | DIGITANDLETTER      | Contains both digits and letters               | 2-3-4 tree
Numeric Patterns           | PHONEORZIP          | Phone number or zip                            | 231-3615
Length Patterns            | FIELDLENGTH         | # of characters the token has                  | fieldLength(style)=5
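As an illustration (a sketch, not the authors' implementation; the regular expressions are simplified assumptions), a few of the local features in Table 3 can be computed per token as boolean indicators:

    import re

    def local_features(token):
        # A handful of the local features from Table 3.
        return {
            "INITCAP":       token[:1].isupper(),
            "ALLCAP":        token.isalpha() and token.isupper(),
            "SINGLECHAR":    len(token) == 1,
            "CONTAINSDOTS":  "." in token,
            "CONTAINSDASH":  "-" in token,
            "FOURDIGITYEAR": bool(re.fullmatch(r"(19|20)\d{2}", token)),
            "PAGINATION":    bool(re.fullmatch(r"[A-Z]?\d+-+[A-Z]?\d+", token)),
            "ALLDIGITS":     token.isdigit(),
            "FIELDLENGTH":   len(token),
        }

    print(local_features("2007")["FOURDIGITYEAR"])   # True
    print(local_features("1116-1126")["PAGINATION"]) # True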
4.3 Style-Free Canonical Representation Extraction through Two-Stage Method

To solve our challenging problem, discipline-independent and style-free canonical representation extraction, we propose a hybrid combination of machine learning based and knowledge-based approaches. Figure 5 illustrates the proposed method. Our solution for discipline-independent citation metadata extraction uses reference style information, built from mining the Web, as a critical aid; the Web is a resource abundant in structured knowledge. Our method is broadly divided into two stages: the first stage is reference style classification, while the second is sequence labeling.
4.3.1 Two-stage SVM/CRF Sequence Classifier

We adopt a two-stage SVM/CRF sequence classifier. We choose this structure because references in the same citation style are generated by the same rules. A two-stage classifier has been used in several tasks [13]; for example, Hoefel and Elkan [13] exploited a two-stage SVM/CRF sequence classifier for handwritten word recognition. Our style classification is illustrated by labels 1-7 in Figure 5, and our sequence labeling is shown by labels 5-14. The following sections describe more details.
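The two-stage control flow can be sketched as follows; this is illustrative Python only, and the dummy classifier and tagger classes are placeholders for the trained SVM and CRF models described in the next two sections:

    class DummyStyleClassifier:
        """Placeholder for the SVM style classifier of Section 4.3.2 (illustration only)."""
        def predict(self, tokens):
            return "IEEE" if "vol." in tokens else "APA"

    class DummyStyleTagger:
        """Placeholder for a per-style CRF sequence labeler of Section 4.3.3 (illustration only)."""
        def label(self, tokens):
            return [(token, "UNKNOWN") for token in tokens]

    def extract_canonical(reference, style_classifier, style_taggers):
        tokens = reference.split()                    # stand-in for the tokenizer of Section 4.2.1
        style = style_classifier.predict(tokens)      # stage 1: reference style classification
        tagged = style_taggers[style].label(tokens)   # stage 2: style-specific sequence labeling
        return style, tagged

    taggers = {"APA": DummyStyleTagger(), "IEEE": DummyStyleTagger()}
    style, tagged = extract_canonical(
        'A. Albillos, et al., "Value ...," Am. J. Gastroenterol., vol. 102, pp. 1116-1126, 2007.',
        DummyStyleClassifier(), taggers)
    print(style)   # IEEE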
4.3.2 Reference Style Classification
Figure 5: Our proposed extraction method (any style of reference (1) is preprocessed and fed to a classifier, e.g., an SVM (2), trained on the output-style corpus from EndNote (3, 4); classified references (5-7) are routed to style-specific taggers, e.g., CRF taggers for Chicago, IEEE, or Turabian styles (8-10), trained on labeled citations (11); the taggers produce tagged references (12-14)).

Reference style classification is a general multi-class classification problem. It is convenient to convert a multi-class problem into multiple binary classification problems. In this work, we use the support vector machine (SVM) [3], first proposed by Cortes and Vapnik. An SVM separates two classes so as to maximize the margin between them. For a complex dataset, a kernel function can transform the data into a higher-dimensional space in which the classes become linearly separable, with the decision later mapped back to the original space. In Figure 5, the input (the part labeled 1) is a styled reference string S. After the input is preprocessed and tokenized using a set of m delimiters D = {d1, d2, ..., dm}, it is segmented into a set of n tokens T = {t1, t2, ..., tn}. Next, a set of features {fi,1, fi,2, ..., fi,q} is extracted for each token ti. Once the features have been extracted, they comprise a reference feature vector r = (f1,1, f1,2, ..., fp,q), where fi,j is the jth feature of the ith token, similar to a document vector in information retrieval. A set of reference feature vectors is fed into the SVM classifier mentioned above (part 2 in Figure 5). This SVM has learned from a training corpus of references (part 3 in Figure 5), built by collecting general world knowledge from the Web (e.g., via the EndNoteWeb reference management tool, part 4). As a result of reference style classification, a reference string, transformed into a feature vector, is assigned to one of the reference styles (parts 5-7 in Figure 5). The next steps concern the sequence labeling problem, described in the next section.
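As a sketch only (our experiments used the WEKA toolkit; the scikit-learn pipeline, toy training references, and style labels below are illustrative assumptions), stage 1 could be prototyped as follows:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy training corpus: styled surface forms paired with their reference style labels.
    train_refs = [
        'Albillos, A., Banares, R., et al. (2007). Value of ... American Journal of Gastroenterology, 102(5), 1116-1126.',
        'A. Albillos, et al., "Value of ...," American Journal of Gastroenterology, vol. 102, pp. 1116-1126, May 2007.',
    ]
    train_styles = ["APA", "IEEE"]

    # Keep punctuation attached to tokens, since delimiters are style cues (Section 4.2.1).
    pipeline = make_pipeline(
        CountVectorizer(token_pattern=r"[^\s]+"),
        LinearSVC(),
    )
    pipeline.fit(train_refs, train_styles)
    print(pipeline.predict(['B. Smith, et al., "A title," Some Journal, vol. 1, pp. 1-10, 2010.']))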
Table 4: Different Styles and Surface Forms of Dataset 1

Style      | Surface form
AAG        | Albillos, A., R. Banares, M. Gonzalez, C. Ripoll, R. Gonzalez, M. V. Catalina & L. M. Molinero (2007) Value of the hepatic venous pressure gradient to monitor drug therapy for portal hypertension: A meta-analysis. American Journal of Gastroenterology, 102, 1116-1126.
Turabian   | Albillos, A., R. Banares, M. Gonzalez, C. Ripoll, R. Gonzalez, M. V. Catalina, and L. M. Molinero. "Value of the Hepatic Venous Pressure Gradient to Monitor Drug Therapy for Portal Hypertension: A Meta-Analysis." American Journal of Gastroenterology 102, no. 5 (2007): 1116-1126.
AIP        | A. Albillos, R. Banares, M. Gonzalez, C. Ripoll, R. Gonzalez, M. V. Catalina and L. M. Molinero, Am J Gastroenterol 102 (5), 1116-1126 (2007).
APA        | Albillos, A., Banares, R., Gonzalez, M., Ripoll, C., Gonzalez, R., Catalina, M. V., et al. (2007). Value of the hepatic venous pressure gradient to monitor drug therapy for portal hypertension: A meta-analysis. American Journal of Gastroenterology, 102(5), 1116-1126.
Chicago15A | Albillos, A., R. Banares, M. Gonzalez, C. Ripoll, R. Gonzalez, M. V. Catalina, and L. M. Molinero. "Value of the Hepatic Venous Pressure Gradient to Monitor Drug Therapy for Portal Hypertension: A Meta-Analysis." American Journal of Gastroenterology 102, no. 5 (2007): 1116-26.
IEEE       | A. Albillos, et al., "Value of the hepatic venous pressure gradient to monitor drug therapy for portal hypertension: A meta-analysis," American Journal of Gastroenterology, vol. 102, pp. 1116-1126, May 2007.
JAMA       | Albillos A, Banares R, Gonzalez M, et al. Value of the hepatic venous pressure gradient to monitor drug therapy for portal hypertension: A meta-analysis. Am J Gastroenterol. May 2007;102(5):1116-1126.
MLA        | Albillos, A., et al. "Value of the Hepatic Venous Pressure Gradient to Monitor Drug Therapy for Portal Hypertension: A Meta-Analysis." American Journal of Gastroenterology 102 5 (2007): 1116-26. Print.
NLM        | Albillos A, Banares R, Gonzalez M, Ripoll C, Gonzalez R, Catalina MV, Molinero LM. Value of the hepatic venous pressure gradient to monitor drug therapy for portal hypertension: A meta-analysis. Am J Gastroenterol 2007 May;102(5):1116-26.
4.3.3 Sequence Labeling

A reference string (e.g., reference 5 of the classified references 5-7 in Figure 5), already classified into a specific reference style, is input to a sequence labeler dedicated to that style (e.g., the Chicago style tagger, 8, of the style taggers 8-10 in Figure 5). Style taggers are trained on knowledge from the Web (web resource 11, labeled citations from EndNoteWeb, in Figure 5). These sequence labelers (style taggers) tag each field corresponding to the tokens. The output of sequence labeling is a tagged reference string (canonical form), e.g., tagged references 12-14 in Figure 5. For the sequence labeling problem, we use a machine learning method based on probabilistic graphical models. Conditional random fields (CRFs), first proposed by Lafferty et al. [17], are a discriminative probabilistic graphical model for labeling sequence data, and have been reported to outperform techniques such as hidden Markov models. Our goal is to find parameters maximizing argmax_Y P(Y|X; W) rather than argmax_Y P(Y, X), where Y is a sequence of labels drawn from the set L = {l1, l2, ..., lk}, X is an input reference string transformed into a set of tokens T = {t1, t2, ..., tn}, and W is a set of weights for feature functions, W = {w1, w2, ..., wm}.
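For reference, the standard linear-chain CRF form of this objective (with feature functions f_j and weights w_j as above) is:

    P(Y \mid X; W) = \frac{1}{Z(X)} \exp\!\left( \sum_{t=1}^{n} \sum_{j=1}^{m} w_j \, f_j(y_{t-1}, y_t, X, t) \right),
    \qquad
    Z(X) = \sum_{Y'} \exp\!\left( \sum_{t=1}^{n} \sum_{j=1}^{m} w_j \, f_j(y'_{t-1}, y'_t, X, t) \right),
    \qquad
    \hat{Y} = \operatorname*{arg\,max}_{Y} P(Y \mid X; W).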
5. EVALUATION
5.1 Reference Style Classification
5.1.1 Experimental Design
Objectives: The objective of this experiment is to investigate which methods give the best performance and greatest effectiveness in improving canonical reference extraction.
Datasets: We use three large and heterogeneously styled reference datasets, randomly selected from the knowledge base of reference styles described in Section 4.1. Datasets 1 and 2 are described in Table 4 and Table 5, respectively. As the styles columns show, the reference styles include ones used in a variety of disciplines.
Tokenization and Feature Extraction: Tokenization is also one of the critical processes determining the accuracy of style classification. Tokenization involves delimiters.
Table 5: Different Styles and Surface Forms of Dataset 2 No. Styles 1 396. Behavioral Cognitive Psycho 2
3
4
5
6
186. Anal Soc Issues Pub Policy 2143. J Phytopathology 721. Circulation
2815. Proc Amer Thoracic Soc.ens 17. ACS
7
1047. Entomologia Exp et App
8
1721. J Biogeography
9
2326. Law Probability Risk
10
2866. Public Finance Review
Surface forms BONFERONI, M., ROSSI, S., FERRARI, F., STAVIK, E., PENA-ROMERO, A., & CARAMELLA, C. (2000). Factorial analysis of the influence of dissolution medium on drug release from carrageenan-diltiazem complexes. AAPS PharmSciTech, 1 (2), 72-79. Bonferoni, M., Rossi, S., Ferrari, F., Stavik, E., Pena-Romero, A., & Caramella, C. (2000). Factorial analysis of the influence of dissolution medium on drug release from carrageenandiltiazem complexes. AAPS PharmSciTech, 1 (2), 72-79. Bonferoni, M., Rossi, S., Ferrari, F., Stavik, E., Pena-Romero, A. & Caramella, C. (2000) Factorial analysis of the influence of dissolution medium on drug release from carrageenandiltiazem complexes. AAPS PharmSciTech, 1: 72-79. 58. Bonferoni M, Rossi S, Ferrari F, Stavik E, Pena-Romero A, Caramella C. Factorial analysis of the influence of dissolution medium on drug release from carrageenan-diltiazem complexes. AAPS PharmSciTech . 2000;1:72-79 58. Bonferoni M, Rossi S, Ferrari F, Stavik E, Pena-Romero A, Caramella C. Factorial analysis of the influence of dissolution medium on drug release from carrageenan-diltiazem complexes. AAPS PharmSciTech 2000;1(2):72-79. 58. Bonferoni, M.; Rossi, S.; Ferrari, F.; Stavik, E.; PenaRomero, A.; Caramella, C., Factorial analysis of the influence of dissolution medium on drug release from carrageenandiltiazem complexes. AAPS PharmSciTech 2000, 1 (2), 72-79. A. Bonferoni M, Rossi S, Ferrari F, Stavik E, Pena-Romero A & Caramella C (2000) Factorial analysis of the influence of dissolution medium on drug release from carrageenan-diltiazem complexes. AAPS PharmSciTech 1: 72-79. doi:citeulikearticle-id:6447037. Bonferoni, M., Rossi, S., Ferrari, F., Stavik, E., Pena-Romero, A. & Caramella, C. (2000) Factorial analysis of the influence of dissolution medium on drug release from carrageenandiltiazem complexes. AAPS PharmSciTech , 1 , 72-79 BONFERONI, M., ROSSI, S., FERRARI, F., STAVIK, E., PENA-ROMERO, A. and CARAMELLA, C. 2000 Factorial analysis of the influence of dissolution medium on drug release from carrageenan-diltiazem complexes. AAPS PharmSciTech 1 , 72-79. Bonferoni, Maria, Silvia Rossi, Franca Ferrari, Evy Stavik, Angelina Pena-Romero, and Carla Caramella. ”Factorial Analysis of the Influence of Dissolution Medium on Drug Release from Carrageenan-Diltiazem Complexes.” AAPS PharmSciTech 1, no. 2 (2000): 72-79.
Table 6: Three Datasets Used in Experiments

Item                             | Dataset 1 | Dataset 2 | Dataset 3
# of styles                      | 9         | 10        | 10
# of instances                   | 2,250     | 4,290     | 4,290
# of features (normal tokenizer) | 3,880     | 8,022     | 5,014
# of features (simple tokenizer) | 6,801     | 10,887    | 7,927
Figure 6: ROC of SVM before and after Clustering with Datasets 1, 2, and 3
The normal tokenizer that we used in this evaluation uses the following delimiters: carriage return, tab, blank, period, comma, semicolon, colon, single quote, double quote, round parentheses, question mark, and exclamation mark. Sometimes the delimiters are noise. However, some delimiters in the normal tokenizer are also cues for distinguishing reference styles, such as the period (.), comma (,), semicolon (;), colon (:), double quote ("), and parentheses, so some delimiters should be included as critical features along with word features (e.g., A. and "A Hybrid). Thus, in addition to experimenting with the normal tokenizer, we also test a simple tokenizer that uses only carriage return, tab, blank, single quote, question mark, and exclamation mark.
Methods: We compare several multi-class classification methods: 1) decision tree, 2) naïve Bayes, and 3) support vector machine. In addition, to address the difficulty of classifying similar reference styles, we investigate whether clustering before classification helps classifiers select reference styles effectively and enhances sequence labeling accuracy. The methods we compare are decision tree, naïve Bayes, support vector machine, clustering + decision tree, clustering + naïve Bayes, and clustering + support vector machine. To implement the reference style classification, we used the WEKA toolkit.
Metrics: To assess classification ability, we use accuracy, receiver operating characteristic (ROC), precision, and recall as performance and effectiveness metrics. Each metric can be formalized as follows, using true positives (A), true negatives (B), false positives (C), and false negatives (D):
Accuracy = (A + B) / (A + B + C + D)
True positive rate (TP rate) = A / (A + D)
False positive rate (FP rate) = C / (C + B)
Receiver operating characteristic (ROC) = TP rate / FP rate
Precision = A / (A + C)
Recall = A / (A + D)
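For clarity, the same metrics can be written as a small helper using the A, B, C, D convention above (illustration only):

    def metrics(A, B, C, D):
        # A = true positives, B = true negatives, C = false positives, D = false negatives.
        accuracy = (A + B) / (A + B + C + D)
        tp_rate = A / (A + D)          # also the recall
        fp_rate = C / (C + B)
        precision = A / (A + C)
        return {"accuracy": accuracy, "TP rate": tp_rate, "FP rate": fp_rate,
                "precision": precision, "recall": tp_rate}

    print(metrics(A=90, B=95, C=5, D=10))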
5.1.2 Results
We conducted 10-fold cross-validation with each dataset. First, Figure 6 illustrates a receiver operating characteristic (ROC), which shows the true positive rate vs. false positive rate. Figures 7, 8, and 9 display the precision and recall before and after clustering with Datasets 1, 2, and 3, respectively.
5.1.3 Discussion
Table 7 lists the accuracy and F1-score for the variants of classification methods with Datasets 1, 2, and 3. Figure 6 shows the results of classification before and after clustering with all data; all points lie above the non-discrimination line, and classifiers above this line are generally regarded as better classifiers. Using a completely randomized design, we conducted an ANOVA test. For clustering vs. non-clustering, the p-value is less than 0.0001 (< 0.05). Thus, the null hypothesis that the clustering variant makes no significant difference is rejected; in other words, the variation of the clustering method significantly affects accuracy. In addition, with respect to the simple tokenization method vs. the normal one, the p-value is 0.0004 (< 0.05).
Table 7: Accuracies of Different Classifiers with the Preprocessing Combinations

Clustering         | Tokenization     | Classifier    | Dataset 1 F1 | Dataset 1 Acc. | Dataset 2 F1 | Dataset 2 Acc. | Dataset 3 F1 | Dataset 3 Acc. | Avg. Acc.
Individual classes | Normal tokenizer | Decision Tree | 0.73 | 74.0 | 0.49 | 50.2 | 0.31 | 32.1 | 52.1
Individual classes | Normal tokenizer | Naïve Bayes   | 0.58 | 58.6 | 0.44 | 43.8 | 0.21 | 21.1 | 41.1
Individual classes | Normal tokenizer | SVM           | 0.71 | 71.2 | 0.50 | 47.9 | 0.24 | 23.2 | 47.4
Individual classes | Simple tokenizer | Decision Tree | 0.75 | 74.7 | 0.75 | 73.7 | 0.67 | 67.5 | 72.0
Individual classes | Simple tokenizer | Naïve Bayes   | 0.66 | 65.8 | 0.72 | 72.5 | 0.61 | 61.2 | 66.5
Individual classes | Simple tokenizer | SVM           | 0.74 | 74.4 | 0.76 | 77.2 | 0.63 | 62.9 | 71.5
Clustered classes  | Normal tokenizer | Decision Tree | 0.94 | 93.7 | 0.82 | 83.1 | 0.60 | 63.6 | 80.1
Clustered classes  | Normal tokenizer | Naïve Bayes   | 0.94 | 94.3 | 0.82 | 81.8 | 0.55 | 55.4 | 77.2
Clustered classes  | Normal tokenizer | SVM           | 0.98 | 98.4 | 0.86 | 85.9 | 0.59 | 60.1 | 81.5
Clustered classes  | Simple tokenizer | Decision Tree | 0.96 | 95.8 | 0.95 | 95.4 | 0.94 | 94.3 | 95.2
Clustered classes  | Simple tokenizer | Naïve Bayes   | 0.98 | 97.8 | 0.94 | 93.6 | 0.96 | 95.7 | 95.7
Clustered classes  | Simple tokenizer | SVM           | 1.00 | 99.6 | 0.98 | 98.1 | 0.97 | 97.1 | 98.2
Figure 7: Precision and Recall of SVM before & after Clustering with Dataset 1
Figure 8: Precision and Recall of SVM before & after Clustering with Dataset 2
Hence we can also reject the null hypothesis that there is no significant difference between the feature types produced by the different tokenization methods; that is, the simplified delimiters of the simple tokenization method can be said to improve accuracy. The use of different classifiers (decision tree, naïve Bayes, and SVM) gives a p-value of 0.3892, so the corresponding null hypothesis is accepted: there was no significant difference among the classifiers. However, since SVM showed the best accuracy on all three datasets, we used only the classification outcomes of the SVM method in the sequence labeling stage. As a result of classification, a few references were misclassified even after clustering. For example, for Dataset 2 in Table 5, 7 references out of 858 from class 9 (2326. Law Probability Risk), clustered with class 10 (2866. Public Finance Review), were mis-assigned to class 1. We looked into one of them, instance 3471 of Dataset 2: "BARAKAT, N. In Vitro and In Vivo Characteristics of a Thermogelling Rectal Delivery System of Etodolac. AAPS PharmSciTech." from class 1 and "BARAKAT, N. In Vitro and In Vivo Characteristics of a Thermogelling Rectal Delivery System of Etodolac. AAPS PharmSciTech." from class 9. In these cases, the year was missing, so the two references became identical. As Dataset 2 in Table 5 shows, instances in classes 1 and 9 are almost the same except for the year part. Clearly, identical references should yield the same result regardless of reference style classification.
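As an illustration of this kind of test only (a sketch over the summary values in Table 7, not our per-run experimental data or exact completely randomized design), a one-way ANOVA can be run with SciPy:

    from scipy.stats import f_oneway

    # Illustrative grouping: the average accuracies from Table 7, split by clustering variant.
    without_clustering = [52.1, 41.1, 47.4, 72.0, 66.5, 71.5]
    with_clustering = [80.1, 77.2, 81.5, 95.2, 95.7, 98.2]

    f_stat, p_value = f_oneway(without_clustering, with_clustering)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")   # a p-value below 0.05 rejects the null hypothesis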
5.2 Sequence Labeling

5.2.1 Experimental Design
Objectives: The objective of this experiment is to check the dependence of reference metadata extraction accuracy on styles and to ascertain the effectiveness of the two-stage method for improving canonical representation extraction performance. Accordingly, we compare the performance of a tagger for a single style with that of a tagger for a combination of styles.
Datasets: We used the 2,250 references of Dataset 1, described in Table 4. This dataset consists of nine different surface forms with 250 reference entries per reference style. We retrieved 250 reference entries from EndNote under the 'citation' and 'analysis' keywords and generated references in nine different reference styles: 1) AAG, 2) Turabian, 3) AIP, 4) APA, 5) Chicago15A, 6) IEEE, 7) JAMA, 8) MLA, and 9) NLM. Additionally, we used the CORA dataset to evaluate the effectiveness of our approach.
Methods: We used a machine learning method, conditional random fields (CRFs), with an open source Java implementation, CRF (http://crf.sourceforge.net), by Sarawagi et al. The methods compared are a one-stage CRF and our two-stage CRF, with the same dataset and features.
Figure 9: Precision and Recall of SVM before & after Clustering with Dataset 3

Table 8: Sequence labeling results on the CORA dataset with respect to different methods

Field       | Hybrid Precision | Hybrid Recall | Hybrid F1 | ParsCit F1 | Peng F1
Author      | 0.98 | 0.99 | 0.98 | 0.99 | 0.99
Booktitle   | 0.91 | 0.88 | 0.89 | 0.93 | 0.94
Date        | 0.98 | 0.96 | 0.97 | 0.99 | 0.99
Editor      | 1.00 | 0.47 | 0.63 | 0.86 | 0.88
Institution | 0.89 | 0.81 | 0.83 | 0.89 | 0.94
Journal     | 0.88 | 0.95 | 0.91 | 0.91 | 0.91
Location    | 0.87 | 0.89 | 0.88 | 0.93 | 0.87
Note        | 0.95 | 0.62 | 0.69 | 0.65 | 0.81
Pages       | 0.96 | 0.98 | 0.97 | 0.98 | 0.99
Publisher   | 0.86 | 0.87 | 0.86 | 0.92 | 0.76
Tech        | 0.86 | 0.80 | 0.82 | 0.86 | 0.87
Title       | 0.95 | 0.98 | 0.97 | 0.97 | 0.98
Volume      | 0.91 | 0.97 | 0.94 | 0.96 | 0.98
Average     | 0.94 | 0.94 | 0.94 | 0.95 | 0.91
Figure 10: Performance of Reference Labeling with Regard to the Number of Reference Styles

Metrics: As in the reference style classification stage, we used F1 as the effectiveness metric.
5.2.2 Results
Figure 10 compares the performance of the sequence labeling methods on the AUTHOR field, the JOURNAL field, and overall, as the number of reference styles grows from 1 to 9. The results show that the performance of the method without reference style classification decreases as the number of reference styles increases. Surprisingly, in contrast to the one-stage sequence labelers, the two-stage sequence labelers keep their F1-scores close to 1, as if they were dealing with references in a single style. Table 8 compares the sequence labeling results on the CORA dataset using our approach versus existing methods.
5.2.3 Discussion
We conducted an experiment on the dependence between the sequence labeling ability of CRF and the number of reference styles, without style classification. The results before classification, illustrated in Figure 10, show that performance decreases as the number of reference styles increases. We cautiously infer that the smaller the number of reference styles observed in a collection, the better the performance; notice that the performance when a single style is used is close to 100. This observation inspired our approach, a hybrid two-stage methodology. We apply machine learning and knowledge-based techniques to the discipline-independent canonical representation extraction problem. The sequence-tagging-after-classification approach results in high F1-scores regardless of reference style; in other words, our two-stage method appears to be independent of reference styles.
6. CONCLUSIONS
In this paper, we proposed a hybrid two-stage method for canonical representation extraction from heterogeneously styled references. To go beyond domain-specific and style-specialized citation analysis methods, we built a reference style knowledge base. Based on our hybrid method, combining general background knowledge and machine learning, we conducted several experiments. We showed that our hybrid two-stage approach, along with pre-clustered classes, is effective for our challenging problem: discipline-independent and style-free canonical representation extraction. In future work, we plan to compare our CRF sequence tagger with additional machine learning approaches to sequence labeling, such as HMM, structured SVM, and MEMM.
7. ACKNOWLEDGMENTS
This material is based upon work supported by Digital Library and Archives, Virginia Tech and the National Science Foundation under Grant No. IIS-0916733.
8. REFERENCES
[1] M. Afzal, H. Maurer, W. Balke, and N. Kulathuramaiyer. Rule based autonomous citation mining with TIERL. Journal of Digital Information Management, 8(3), 2010.
[2] I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Trans. Knowl. Discov. Data, 1, March 2007.
[3] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[4] E. Cortez, A. da Silva, M. Gonçalves, F. Mesquita, and E. de Moura. FLUX-CiM: flexible unsupervised extraction of citation metadata. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, page 224. ACM, 2007.
[5] I. Councill, C. Giles, and M. Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of the 6th International Conference on Language Resources and Evaluation (2008). European Language Resources Association (ELRA), 2008.
[6] M. Day, R. Tsai, C. Sung, C. Hsieh, C. Lee, S. Wu, K. Wu, C. Ong, and W. Hsu. Reference metadata extraction using a hierarchical knowledge representation framework. Decision Support Systems, 43(1):152-167, 2007.
[7] Y. Ding. Applying weighted PageRank to author citation networks. Journal of the American Society for Information Science and Technology, 62(2):236-245, 2011.
[8] Y. Ding, G. Chowdhury, and S. Foo. Template mining for the extraction of citation from digital documents. In Proceedings of the Second Asian Digital Library Conference, Taiwan, pages 47-62, 1999.
[9] D. Embley, D. Campbell, Y. Jiang, S. Liddle, D. Lonsdale, Y. Ng, and R. Smith. Conceptual-model-based data extraction from multiple-record Web pages. Data & Knowledge Engineering, 31(3):227-251, 1999.
[10] E. Garfield and R. Merton. Citation Indexing: Its Theory and Application in Science, Technology, and Humanities. Wiley, New York, 1979.
[11] H. Han, C. Giles, E. Manavoglu, H. Zha, Z. Zhang, and E. Fox. Automatic document metadata extraction using support vector machines. In Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, pages 37-48. IEEE, 2003.
[12] E. Hetzner. A simple method for citation metadata extraction using hidden Markov models. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 280-284. ACM, 2008.
[13] G. Hoefel and C. Elkan. Learning a two-stage SVM/CRF sequence classifier. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 271-278. ACM, 2008.
[14] C. Hong, J. Gozali, and M. Kan. FireCite: Lightweight real-time reference string extraction from webpages. In Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries, pages 71-79. Association for Computational Linguistics, 2009.
[15] F. Jiao, S. Wang, C. Lee, R. Greiner, and D. Schuurmans. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 209-216. Association for Computational Linguistics, 2006.
[16] T. Joachims, T. Hofmann, Y. Yue, and C. Yu. Predicting structured objects with support vector machines. Communications of the ACM, 52(11):97-104, 2009.
[17] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Machine Learning - International Workshop and Conference, pages 282-289, 2001.
[18] S. Lawrence, L. Giles, and K. Bollacker. Digital libraries and autonomous citation indexing. Computer, 32(6):67-71, 2002.
[19] G. Mann and A. McCallum. Simple, robust, scalable semi-supervised learning via expectation regularization. In Proceedings of the 24th International Conference on Machine Learning, pages 593-600. ACM, 2007.
[20] K. Nagao and K. Hasida. Automatic text summarization based on the Global Document Annotation. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 2, pages 917-921. Association for Computational Linguistics, 1998.
[21] F. Peng and A. McCallum. Information extraction from research papers using conditional random fields. Information Processing & Management, 42(4):963-979, 2006.
[22] H. Poon, J. Christensen, P. Domingos, O. Etzioni, R. Hoffmann, S. Soderland, D. Weld, F. Wu, and C. Zhang. Machine Reading at the University of Washington. In First International Workshop on Formalisms and Methodology for Learning by Reading (FAM-LbR), page 87, 2010.
[23] H. Poon and P. Domingos. Joint inference in information extraction. In Proceedings of the National Conference on Artificial Intelligence, volume 22, page 913. AAAI Press, 2007.
[24] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, 1986.
[25] A. Strotmann and D. Zhao. Combining commercial citation indexes and open-access bibliographic databases to delimit highly interdisciplinary research fields for citation analysis. Journal of Informetrics, 4(2):194-200, 2010.
[26] A. Stubbe, C. Ringlstetter, and K. Schulz. Genre as noise: Noise in genre. International Journal on Document Analysis and Recognition, 10(3):199-209, 2007.
[27] M. Wick, A. Culotta, K. Rohanimanesh, and A. McCallum. An entity based model for coreference resolution. In SIAM International Conference on Data Mining (SDM), 2009.
[28] S. Yeates, I. Witten, and D. Bainbridge. Tag insertion complexity. In Data Compression Conference (DCC 2001), Proceedings, pages 243-252. IEEE, 2001.
[29] J. Yu and X. Fan. Metadata extraction from Chinese research papers based on conditional random fields. In FSKD 2007, volume 1, pages 497-501, 2007.
[30] J. Zou, D. Le, and G. Thoma. Locating and parsing bibliographic references in HTML medical articles. International Journal on Document Analysis and Recognition, 13(2):107-119, 2010.