SuperSense Tagging with a Maximum Entropy Markov Model

Giuseppe Attardi, Luca Baronti, Stefano Dei Rossi, and Maria Simi

Università di Pisa, Dipartimento di Informatica, Largo B. Pontecorvo 3, 56127 Pisa, Italy
{attardi,barontil,deirossi,simi}@di.unipi.it
Abstract. We tackled the task of SuperSense tagging by means of the Tanl Tagger, a generic, flexible and customizable sequence labeler developed as part of the Tanl linguistic pipeline. The tagger can be configured to use different classifiers and to extract features according to feature templates expressed through patterns, so that it can be adapted to different tagging tasks, including PoS and Named Entity tagging. The tagger operates as a Markov chain, using a statistical classifier to infer state transitions and dynamic programming to select the best overall sequence of tags. We exploited the extensive customization capabilities of the tagger to tune it for the task of SuperSense tagging, by performing an extensive process of feature selection. The resulting configuration achieved the best scores in the closed subtask.

Keywords: SuperSense tagging, WordNet, Maximum Entropy, Maximum Entropy Markov Model, MEMM, dynamic programming.
1 Introduction
SuperSense tagging (SST) is an NLP task, proposed in [5], which aims at annotating nouns, verbs, adjectives and adverbs in a text according to a general semantic taxonomy corresponding to the WordNet lexicographer classes (called SuperSenses) [8]. It can be considered a partial form of word sense disambiguation (WSD), where the possible senses of a word are coarser grained than the tens of thousands of senses typically listed in dictionaries. SST can achieve better accuracy than WSD, which remains a difficult task, while still providing more detailed semantic information than Named Entity Recognition (NER), which is usually limited to proper nouns and a few semantic categories. SuperSense tagging is therefore a practical and useful technique for many NLP tasks involving large-scale information extraction.

SST can be tackled as a special case of sequence labeling; we therefore implemented a SuperSense tagger by extending and customizing a generic tagger developed as part of the Tanl pipeline [2]. This tagger was also used to implement the Tanl NER, which achieves state-of-the-art accuracy on the CoNLL 2003 benchmarks for English.
At LREC 2010 [1] we reported preliminary results in SuperSense tagging. A specific resource annotated with SuperSenses for Italian was created, and a tagger trained on that resource achieved an accuracy of 79.1 (F1 score): a significant improvement over the state-of-the-art accuracy for Italian [12] and a small improvement with respect to English [5]. The annotated resource, called ISST-SST, was derived from ISST [11] through a semi-automatic process followed by manual revision.

The Tanl Tagger can be configured to use different classifiers and to extract features according to feature templates expressed through patterns provided in a configuration file. This flexibility allows experimenting with different configurations of features and settings for the learning model. The tagger adopts a Maximum Entropy Markov Model (MEMM) approach to sequence labeling: a statistical classifier learns which transition to perform between states, and dynamic programming is applied to select the sequence of tags with the highest overall probability. Two types of classifiers can be used: Maximum Entropy or Logistic Regression. Both are discriminative methods, quite effective for labeling since they do not assume independence of features. Both algorithms are also more efficient than SVMs and, complemented with dynamic programming, can achieve similar levels of accuracy.

For Evalita 2011, the challenge was to deal with the new ISST-SST resource, expressly revised for the Evalita 2011 SuperSense task, and to try to improve upon our previous results by careful tuning of the system. In the following we describe the system used for tagging, the experiments performed to tune the system for the task and the results achieved, and finally draw some conclusions.
2 Description of the System
The Tanl Tagger is a generic, customizable statistical sequence labeler, suitable for many sequence labeling tasks such as PoS tagging, SuperSense tagging and Named Entity Recognition. Its design was inspired by the approach of Chieu & Ng [4]. The tagger implements a Maximum Entropy Markov Model (MEMM) [10] for sequence labeling, which combines features of Hidden Markov Models (HMMs) and Maximum Entropy models. A MEMM is a discriminative model that extends a standard Maximum Entropy classifier by assuming that the unknown values to be learned are connected in a Markov chain rather than being conditionally independent of each other. Dynamic programming is applied to the outputs of the classifier in order to select the best sequence of labels to assign to the whole sequence. Dynamic programming is only used in tagging, not in training as required in the inner loops of Conditional Random Fields: this makes the tagger much more efficient.
2.1 Maximum Entropy and Dynamic Programming
The Maximum Entropy framework estimates probabilities based on the principle of making as few assumptions as possible, other than the constraints imposed by the
observations. Such constraints are derived from the training data and express relationships between features and outcomes. The probability distribution that satisfies these constraints is the one with the highest entropy; it is unique and agrees with the maximum-likelihood distribution. The distribution has the following exponential form [7]:
p(o \mid h) = \frac{1}{Z(h)} \prod_{j=1}^{k} \alpha_j^{f_j(h,o)}
where o refers to the outcome, h is the history or context, and Z(h) is a normalization function. The features used in the Maximum Entropy framework are binary. An example of a feature function is:

f_j(h,o) = \begin{cases} 1 & \text{if } o = \text{B-noun.location and FORM = Washington} \\ 0 & \text{otherwise} \end{cases}
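To make the formula concrete, the following sketch (with hypothetical helper names) implements a binary feature function and the exponential-form probability p(o | h) above:

```python
def f_location(h, o):
    # Fires only when the proposed outcome is B-noun.location and the
    # token form in the context is "Washington", as in the example above.
    return 1 if o == "B-noun.location" and h.get("FORM") == "Washington" else 0

def maxent_prob(h, o, outcomes, features, alphas):
    """h: context dict; o: candidate outcome; outcomes: all candidate
    outcomes; features: list of binary functions f_j(h, o); alphas: the
    matching positive weights alpha_j."""
    def unnormalized(out):
        p = 1.0
        for f, a in zip(features, alphas):
            p *= a ** f(h, out)   # alpha_j ** f_j(h, o)
        return p
    z = sum(unnormalized(out) for out in outcomes)  # normalization Z(h)
    return unnormalized(o) / z
```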
The Tanl Tagger estimates the parameters αj either with Generalized Iterative Scaling (GIS) [6] or with the L-BFGS algorithm for large-scale multidimensional unconstrained minimization problems [9]. Since the Maximum Entropy classifier assigns tags to each token independently, it may produce inadmissible sequences of tags. Hence a dynamic programming technique is applied to select correct sequences. A probability is assigned to a sequence of tags t1, t2, ..., tn for a sentence s, based on the probability of the transition between two consecutive tags, P(ti | ti−1), and the probability of a tag given the sentence, P(ti | s), obtained from the probability distribution computed by Maximum Entropy:

P(t_1, t_2, \ldots, t_n) = \prod_{i=1}^{n} P(t_i \mid s)\, P(t_i \mid t_{i-1})
In principle the algorithm should compute the sequence with maximum probability over the whole sentence. We instead use a dynamic programming solution that operates on a window of size w = 5, long enough for most SuperSense entities. For each position n, we compute the best probability PB(t_n) considering the k-grams of tags of length k < w preceding t_n. A baseline is computed assuming that the k-gram consists entirely of ‘O’ (outside) tags:

PB_O(t_n) = \max_k PB(t_{n-k-1})\, P(t_{n-k} = O) \cdots P(t_{n-1} = O)

Similarly, for each class C we compute:

PB_C(t_n) = \max_k PB(t_{n-k-1})\, P(t_{n-k} = C) \cdots P(t_{n-1} = C)

and finally:

PB(t_n) = \max\bigl(PB_O(t_n), \max_C PB_C(t_n)\bigr)
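For reference, the sketch below implements the standard (non-windowed) Viterbi search over the classifier posteriors P(ti | s) and the transition probabilities P(ti | ti−1); it is a simplified illustration, not the tagger's windowed variant, and all names in it are hypothetical:

```python
import math

def viterbi(posteriors, transitions, tags):
    """posteriors[i][t] = P(t_i = t | s) from the classifier;
    transitions[(prev, t)] = P(t | prev); tags: the label set."""
    n = len(posteriors)
    # best[i][t]: best log-probability of a sequence ending at i with tag t
    best = [{t: -math.inf for t in tags} for _ in range(n)]
    back = [{t: None for t in tags} for _ in range(n)]
    for t in tags:
        best[0][t] = math.log(posteriors[0].get(t, 1e-12))
    for i in range(1, n):
        for t in tags:
            emit = math.log(posteriors[i].get(t, 1e-12))
            for prev in tags:
                score = (best[i - 1][prev]
                         + math.log(transitions.get((prev, t), 1e-12))
                         + emit)
                if score > best[i][t]:
                    best[i][t] = score
                    back[i][t] = prev
    # Recover the best final tag, then backtrack
    last = max(tags, key=lambda t: best[n - 1][t])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]
```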
2.2 Model Parameter Specification
The Tanl Tagger can be configured by specifying which classifier to use, which optimization algorithm, and the specific parameters for each, for example the number of iterations of the GIS procedure. Other parameters can be set to influence the behavior of the tagger, namely:

1. the cutoff option, which prevents the tagger from learning from features that appear fewer times than a given threshold;
2. the refine option, which splits the IOB tags into a more refined set: the B tag is replaced by U for entities consisting of a single token, and the last I tag of an entity of more than one token is replaced by E (a minimal sketch of this transformation follows).
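The following sketch shows one way to implement the refine transformation, assuming tags of the form “O”, “B-label” and “I-label”; it is an illustration, not the tagger's actual code:

```python
def refine(tags):
    """Replace B with U for single-token entities and the final I of a
    multi-token entity with E, leaving all other tags unchanged."""
    out = list(tags)
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("B-") and nxt != "I-" + tag[2:]:
            out[i] = "U-" + tag[2:]   # entity consisting of a single token
        elif tag.startswith("I-") and nxt != "I-" + tag[2:]:
            out[i] = "E-" + tag[2:]   # last token of a longer entity
    return out

# Example:
# refine(["B-noun.person", "O", "B-noun.act", "I-noun.act"])
# -> ["U-noun.person", "O", "B-noun.act", "E-noun.act"]
```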
2.3 Feature Specification
The tagger extracts features at each step in the labeling of a sequence; these contribute to define the context of a training event, whose outcome is the label of the current token. Features are divided into local and global features. Two kinds of local features can be specified:

• attribute features: extracted from attributes (e.g. Form, PoS, Lemma, NE) of the surrounding tokens, denoted by their positions relative to the current token. The feature to extract is expressed through a regular expression. For example, the notation POSTAG .* -1 0 means: extract as features the whole PoS tag (matching .*) of both the previous token (position -1) and the current token (position 0);

• morphological features: binary features extracted from a token when it matches a given regular expression. For example, FORM ^\p{Lu} -1 means: “the previous word is capitalized”. The pattern specifies a Unicode uppercase-letter category (\p{Lu}) occurring at the beginning (^) of the token.

Besides local features, the tagger also considers:

• global features: properties holding at the document level. For instance, if a word in a document was previously annotated with a certain tag, then it is likely that other occurrences of the same word should be tagged similarly. Global features are particularly useful when the context of a word is ambiguous but the word appeared earlier in a simpler context.

A hypothetical configuration fragment combining these feature templates is sketched below.
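The exact file syntax of the Tanl Tagger is not reproduced here, so the following fragment is illustrative only; it combines attribute and morphological feature templates in the notation used above:

```
# Attribute features: attribute, regex over its value, relative positions
FORM     .*        0        # the form of the current token
POSTAG   .*        -1 0     # full PoS tag of the previous and current tokens
POSTAG   .         -1 0     # first letter only of the same PoS tags
LEMMA    .*        -1 0     # lemma of the previous and current tokens

# Morphological (binary) features: fire when the token at the given
# position matches the regular expression
FORM     ^\p{Lu}      -1    # previous word is capitalized
FORM     \p{L}-\p{L}  0     # current word is hyphenated
```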
3 Tuning the System
In preparing for the experiments, we set up the dataset for a proper validation process. The sentences in the training set were shuffled and partitioned into three sets:
• a training set (about 70% of the training corpus), used to train the models;
• a validation set (about 20% of the training corpus), used to choose the best model;
• a test set (about 10% of the training corpus), used to evaluate the accuracy.

To produce a baseline, we used a base configuration with no attribute features and with the following set of morphological features:

• features of the current word:
  - capitalized first word of the sentence;
  - non-capitalized first word of the sentence;
  - hyphenated word;
• features from the surrounding words:
  - previous, current and following words are all capitalized;
  - current and following words are both capitalized;
  - current and previous words are both capitalized;
  - word occurs in a sequence within quotes.

With 100 iterations of the Maximum Entropy algorithm, this baseline obtained an F-score of 71.07 on the validation set.

Tuning consisted of an extensive process of automatic feature selection, involving the creation of many configuration files with different combinations of features. In particular, about 300 positional permutations of the attribute features were tested, along with variations of other parameters such as the number of iterations, the cutoff option and the refine option. The experimental setup allowed many tests to be run in parallel (a sketch of such a driver is shown below). The accuracy of each resulting system was computed by testing the model on the validation set and comparing it with that of the other systems. Each configuration was then used to train a new model on the union of the training and validation sets, and the accuracy was measured again on the test set. This validation process ensured that the accuracy did not degrade on new, unseen data because of overfitting on the validation set.

The best run on the validation set obtained an F-score of 80.01, about 10 points higher than the baseline.
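A driver for this selection loop might look as follows; the tanl-tagger command line, its flags and its output format are assumptions made for illustration, not the actual Tanl tooling:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def evaluate(config: Path) -> tuple[float, Path]:
    """Train a model with one configuration and return its validation F1."""
    model = config.with_suffix(".model")
    subprocess.run(["tanl-tagger", "--train", "train.iob",        # assumed flags
                    "--config", str(config), "--model", str(model)],
                   check=True)
    result = subprocess.run(["tanl-tagger", "--eval", "validation.iob",
                             "--model", str(model)],
                            check=True, capture_output=True, text=True)
    f1 = float(result.stdout.rsplit("FB1:", 1)[1].split()[0])     # assumed output
    return f1, config

if __name__ == "__main__":
    configs = sorted(Path("configs").glob("*.conf"))  # ~300 generated variants
    with ProcessPoolExecutor() as pool:               # run trainings in parallel
        best_f1, best_config = max(pool.map(evaluate, configs))
    print(f"best validation F1: {best_f1:.2f} with {best_config}")
```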
4 Experiments and Runs
We participated only in the closed task, after some experiments using external dictionaries and gazetteers, in particular ItalWordNet (IWN) [13], did not give encouraging results. For the final submission we selected the four runs with the best and most balanced accuracy on the validation and test sets. In the following we describe the features and parameters used in the four runs, henceforth referred to as Runs 1-4.

Attribute Features. Table 1 shows the positional parameters of the attribute features used in the four runs. For example, LEMMA .* -1 0 tells the tagger
to use as features the whole LEMMA (matching ‘.*’) of the previous (-1) and the current (0) token, while POSTAG . -1 0 indicates using just the first letter (matching ‘.’) of the POSTAG of the same tokens.

Table 1. Attribute features for the four runs

Attribute   Pattern   Run 1-2    Run 3-4
FORM        .*        0          0
POSTAG      .*        -2 0 1 2   0 1
POSTAG      .         -1 0       -1 0
LEMMA       .*        -1 0       0
Morphological Features. The set of morphological features described above for the baseline was used in all the runs. An additional set of local features was used in Runs 3 and 4, with the aim of improving the accuracy of the tagger on the SuperSense classes with low F-scores: verb.emotion, verb.possession, verb.contact and verb.creation. A list of the most common non-ambiguous verbs in those classes was obtained from the training set, and these were added as local features on the current LEMMA. The list of verbs is the following (a hypothetical encoding as feature templates is sketched after the list):
• verb.emotion: sperare, interessare, preoccupare, piacere, mancare, temere, amare;
• verb.possession: vendere, perdere, offrire, pagare, ricevere, raccogliere;
• verb.contact: porre, mettere, cercare, colpire, portare, toccare;
• verb.creation: realizzare, creare, produrre, aprire, compiere.
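As an illustration, these lemma lists could be encoded as binary morphological features on the current LEMMA (position 0) using the template notation of Sect. 2.3; the exact encoding in our configuration files may differ, so the fragment below is a sketch only:

```
LEMMA  ^(sperare|interessare|preoccupare|piacere|mancare|temere|amare)$  0
LEMMA  ^(vendere|perdere|offrire|pagare|ricevere|raccogliere)$           0
LEMMA  ^(porre|mettere|cercare|colpire|portare|toccare)$                 0
LEMMA  ^(realizzare|creare|produrre|aprire|compiere)$                    0
```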
Other Parameters. The refine option, which performed well for tasks with a smaller number of classes such as Named Entity Recognition, proved less relevant for SST, where the number of classes and the level of ambiguity are already high, so we did not use it in the runs. Likewise, raising the threshold of the cutoff option above one did not improve the accuracy of the system, so we left it at zero. Different numbers of training iterations were used for the four runs, i.e. 100, 150, 200 and 500. The main differences between the four runs are summarized in Table 2.

Table 2. Summary of differences between the four runs

Feature                Run 1      Run 2      Run 3   Run 4
Non-ambiguous verbs    No         No         Yes     Yes
POSTAG                 -2 0 1 2   -2 0 1 2   -1 0    -1 0
LEMMA                  0 1        0 1        0       0
N. of iterations       100        150        200     500
5 Results
The Evalita test set consists of two parts: about 30,000 tokens from the original ISST (mostly newspaper articles) and an additional portion of about 20,000 tokens derived from Wikipedia texts, which can be considered a different genre. It is interesting to observe the behavior of the systems on the two subsets, since lower accuracy on the new domain was to be expected. Table 3 summarizes the official results of the four runs in the closed task.

Table 3. UniPI system results on the closed subtask

        Accuracy   Precision   Recall    FB1
run 3   88.50%     76.82%      79.76%    78.27
run 2   88.34%     76.69%      79.38%    78.01
run 1   88.30%     76.64%      79.33%    77.96
run 4   88.27%     76.48%      79.29%    77.86
Run 3 was the best of our systems and also the best performing system in the Evalita 2011 SST closed task. Measuring the accuracy of Run 3 separately on the two parts of the test set gives 78.23 F1 on ISST and 78.36 F1 on the Wikipedia fragment. Table 4 reports the detailed results of Run 3 for each of the 44 categories.

Table 4. Results by category

Category              FB1     Category              FB1
adj.all               88.43   noun.process          76.19
adj.pert              77.24   noun.quantity         81.96
adv.all               96.77   noun.relation         67.25
noun.Tops             60.47   noun.shape            66.67
noun.act              85.37   noun.state            80.34
noun.animal           50.00   noun.substance        57.14
noun.artifact         63.68   noun.time             83.61
noun.attribute        82.09   verb.body             22.22
noun.body             85.25   verb.change           80.10
noun.cognition        75.44   verb.cognition        82.39
noun.communication    72.03   verb.communication    85.26
noun.event            79.59   verb.competition      48.28
noun.feeling          78.79   verb.consumption      50.00
noun.food             28.57   verb.contact          71.43
noun.group            59.46   verb.creation         69.27
noun.location         65.70   verb.emotion          62.96
noun.motive           72.41   verb.motion           71.23
noun.object           64.46   verb.perception       78.76
noun.person           61.73   verb.possession       79.91
noun.phenomenon       82.61   verb.social           77.29
noun.plant            37.84   verb.stative          85.75
noun.possession       75.88   verb.weather          0.00
6 Discussion
Analyzing the data from the tuning experiments, we noticed that the tagger using Maximum Entropy achieves its best F1 with a number of iterations between 100 and 200 (Fig. 1). Increasing the number further is apparently not beneficial, possibly causing overfitting, and of course increases the training time.
Fig. 1. Tagger accuracy vs. number of iterations of GIS
It is worth noting that, consistently with this analysis, the best results on the test set were obtained by runs 3, 2 and 1, with 200, 150 and 100 iterations respectively, while run 4, with 500 iterations, obtained the worst score. The tagger achieved about the same accuracy on the two parts of the test set, with even a small increase in accuracy when moving to the new domain. We regard this as an indication that the extensive tuning of the parameters succeeded in selecting a stable configuration of the system, capable of good accuracy across domains.

Looking at the results on individual categories, Table 4 shows that among the most difficult ones are categories that refer to encyclopedic knowledge (as opposed to common sense), i.e. noun.animal, noun.plant and noun.food, and rare verbs in the categories consumption and competition. A more balanced corpus, including annotated portions of Wikipedia, would help mitigate the lack of background knowledge, which is likely under-represented in the training corpus, consisting mostly of newspaper articles. Alternatively, external resources such as Wikipedia itself or WordNet might be exploited to address this problem, as shown by the good results obtained by Basile [3] in the open task.
7 Conclusions
We tackled the Evalita 2011 SuperSense tagging task by performing extensive tuning of the Tanl Tagger. We chose Maximum Entropy as the classifier and generated approximately 300 different configurations, varying the choice of features and parameters for the tagger.
The results were slightly lower than those obtained with a previous version of the ISST-SST resource, but still represent the state of the art for Italian. The Evalita 2011 task was more realistic, though, since in the corpus multi-word expressions were left as separate tokens rather than grouped into single ones. Analyzing which categories turned out harder than others to predict, we concluded that disambiguating among some of them would require access to external resources. Other categories introduce very subtle distinctions, whose utility in applications remains questionable. In the future, we would like to investigate a more rational grouping of the 44 SuperSenses, in order to determine whether accuracy can be improved while still preserving a semantic tagger that is practical and useful for building applications.

Acknowledgments. Partial support for this work was provided by the PARLI Project (MIUR/PRIN 2008).
References

1. Attardi, G., Dei Rossi, S., Di Pietro, G., Lenci, A., Montemagni, S., Simi, M.: A Resource and Tool for SuperSense Tagging of Italian Texts. In: Proceedings of the 7th Language Resources and Evaluation Conference (LREC 2010), Malta, pp. 17–23 (2010)
2. Attardi, G., Dei Rossi, S., Simi, M.: The Tanl Pipeline. In: Proceedings of the Workshop on Web Services and Processing Pipelines in HLT, Malta (2010)
3. Basile, P.: Super-Sense Tagging Using Support Vector Machines and Distributional Features. In: Working Notes of Evalita 2011, Rome, Italy (January 2012), ISSN 2240-5186
4. Chieu, H.L., Ng, H.T.: Named Entity Recognition with a Maximum Entropy Approach. In: Proceedings of CoNLL 2003, Edmonton, Canada, pp. 160–163 (2003)
5. Ciaramita, M., Altun, Y.: Broad-Coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 594–602 (2006)
6. Darroch, J.N., Ratcliff, D.: Generalized Iterative Scaling for Log-Linear Models. Annals of Mathematical Statistics 43(5), 1470–1480 (1972)
7. Della Pietra, S., Della Pietra, V., Lafferty, J.: Inducing Features of Random Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4), 380–393 (1997)
8. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998)
9. Liu, D.C., Nocedal, J.: On the Limited Memory BFGS Method for Large Scale Optimization. Mathematical Programming 45, 503–528 (1989)
10. McCallum, A., Freitag, D., Pereira, F.: Maximum Entropy Markov Models for Information Extraction and Segmentation. In: Proceedings of ICML 2000, pp. 591–598 (2000)
11. Montemagni, S., et al.: Building the Italian Syntactic-Semantic Treebank. In: Abeillé, A. (ed.) Building and Using Parsed Corpora, Language and Speech Series, pp. 189–210. Kluwer, Dordrecht (2003)
12. Picca, D., Gliozzo, A., Ciaramita, M.: SuperSense Tagger for Italian. In: Proceedings of LREC 2008, Marrakech (2008)
13. Roventini, A., Alonge, A., Calzolari, N., Magnini, B., Bertagna, F.: ItalWordNet: A Large Semantic Database for Italian. In: Proceedings of LREC 2000, Athens (2000)