
Terminology extraction and term variation patterns: a study of French and German data

Marion Weller, Helena Blancafort, Anita Gojun, Ulrich Heid

Institut für maschinelle Sprachverarbeitung, Universität Stuttgart

Syllabs, Paris

{wellermn|gojunaa|heid}@ims.uni-stuttgart.de

[email protected]

Abstract

The terminology of many technical domains, especially new and evolving ones, is not fully fixed and shows considerable variation. The purpose of the work described in this paper is to capture term variation. For term extraction, we apply hand-crafted POS patterns to tagged corpora, and we use rules to relate morphological and syntactic variants. We discuss some French and German variation patterns, and we present first experimental results from our tools. It is not always easy to distinguish (near) synonyms from variants that have a slightly different meaning from the original term; we discuss ways of making such a distinction. Our tools are based on POS tagging and an approximation of derivation and compounding; however, we also propose a non-symbolic, statistics-based line of development. We discuss general issues of evaluating variant detection and present a small-scale precision evaluation.

Keywords: terminology, term variation, comparable corpora, pattern-based term extraction, compound nouns

1. Introduction

The objective of the EU-funded project TTC¹ (Terminology Extraction, Translation Tools and Comparable Corpora) is the extraction of terminology from comparable corpora. The tools under development within the project address the issues of compiling corpus collections, monolingual term extraction and the alignment of terms into pairs of multilingual equivalence candidates, as well as the management and the export of the resulting terminological data towards CAT and MT tools.

¹ http://www.ttc-project.eu The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement n. 248005.

Since parallel corpora of specialized domains are scarce and not necessarily available for a broad range of languages (TTC deals with English (EN), Spanish (ES), German (DE), French (FR), Latvian (LV), Russian (RU), Chinese (ZH)), comparable corpora are used instead: textual material from specialized domains is accessible for many languages, either on the Internet or in publications of companies.

In technical domains which are rapidly evolving, documents published on the Internet are often the most recent sources of data. In such domains, terminology typically has not yet been standardized, and thus numerous variants co-exist in published documents. Tools which support the extraction, identification and interrelating of term variants are thus necessary to capture the full range of expressions used in the respective domain. End users may then decide (e.g. on the basis of variant frequency and sources of variants) which expression to prefer.

A second, more technical motivation for term variant extraction is provided by the procedures for term alignment (either lexical or statistical strategies), for which data sparseness is a problem. In order to reduce the complexity of term alignment, TTC intends to gather monolingual variants into sets of related terms. Particularly for this application, we do not only allow for (quasi) synonyms, but also for variants with a slight difference in meaning as shown in 1.

1) production d'électricité ↔ électricité produite (production of electricity ↔ produced electricity)

Terms may be of different forms (single-word vs. multiword terms) in different languages: this is a challenge for term alignment. For example, compound nouns play an important role in German terminology, but have no equivalents of the same morpho-syntactic structure in many other languages. Grouping equivalent terms of different syntactic structures can help to deal with such cases, as illustrated in 2:

2) Energieproduktion ↔ Produktion von Energie ↔ production d'électricité (energy production ↔ production of energy)

2. Methodology

The steps required for term extraction and for variant identification follow a simple pipeline architecture: first, a corpus collection is compiled, which then undergoes general linguistic pre-processing. Following these steps, monolingual term candidates are extracted. As not all extracted items are domain relevant, we apply statistical filtering. Since we intend to detect term variation on a morpho-syntactic level, this last step requires morphological processing in order to model derivational relationships between word classes.

2.1. Compiling a corpus and pre-processing

To collect corpus data, we use the focused Web crawler Babouk (de Groc, 2011) which has been developed within the TTC project. Babouk starts with a set of seed terms or URLs given by the user which are combined into queries and submitted to a search engine. Babouk scores the relevance of the retrieved web pages using a weighted-lexicon-based thematic filter. Based on the content of relevant retrieved pages, the lexicon is extended and new search queries are combined.

One objective of the TTC project is to rely on flat linguistic analysis that is available for all languages. One strand of research thus goes towards the development of knowledge-poor strategies, such as using a pseudo part-of-speech tagger (Clark, 2003) as a basis for probabilistic NP-extraction (Guégan & Loupy, 2011). A knowledge-rich approach is term extraction based on hand-crafted part-of-speech (POS) patterns, which is the method we chose for the present work.

Pre-processing of our data collection consists of tokenizing, POS-tagging and lemmatization using TreeTagger (Schmid, 1994). For efficiency reasons, with German and French being morphologically rich languages, we work with lemmas rather than inflected forms.

2.2. Term candidate extraction and filtering

Our main focus is on the extraction of nominal phrases such as [NN NN] or [NN PRP NN] constructions (cf. tables 2-5), but [V NN] collocations are also of interest².

² NN: noun, PRP: preposition, V: verb, VPART: participle

For each language, we identify term candidates by using hand-crafted POS patterns. In contrast to nominal phrases, which are relatively easy to capture by POS patterns, the identification of [V NN] collocations is more challenging, as verbs and their object nouns do not necessarily occur in adjacent positions, depending on the structure of the sentence. This applies particularly to German where constituent order is rather flexible and allows for long distances between verbs and their objects.

In order to reduce the extracted term candidates to a set of domain-relevant items, we estimate their domain specificity by comparing them with terms extracted from general language corpora (Ahmad et al., 1992). The underlying idea of this procedure is the assumption that terms which occur in both domain-specific and general language corpora are not domain-relevant, whereas terms occurring only or predominantly in the domain-specific data can be considered as specialized terms. We use the quotient q of a term's relative frequency in the specialized data and in the general language corpus as an indicator for its domain relevance (see table 1).

term candidate | f domain | f general | q
Gleichstrom (direct current) | 128 | 4 | 22362.7
Jahr (year) | 2157 | 221,213 | 1.2

Table 1: Domain-specific vs. general language

2.3. Term variation

In TTC we define a term variant as “an utterance which is semantically and conceptually related to an original term” (Daille, 2005). Thus, term variants are bound to texts (“utterance”) and require the presence of an “original term” identified e.g. by means of a morpho-syntactic term pattern.
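The original terms that variants are related to come from the extraction and filtering steps of section 2.2. As a minimal sketch of those two steps, and not the project's actual implementation, the following matches POS patterns over (lemma, tag) pairs and then ranks candidates by the quotient q. The pattern list, the function names and the rule of counting an unseen candidate once in the general corpus are assumptions made for illustration; the paper does not specify how zero frequencies are handled, so the values in table 1 are not expected to be reproduced.

```python
from collections import Counter

# Pattern symbols follow footnote 2 (NN, PRP, ...). Mapping a tagger's own
# tagset onto these symbols is assumed to have happened already.
PATTERNS = [("NN", "PRP", "NN"), ("NN", "NN")]


def extract_candidates(sentences, patterns=PATTERNS):
    """Count lemma sequences whose tag sequence equals one of the patterns."""
    counts = Counter()
    for sentence in sentences:
        tags = tuple(tag for _, tag in sentence)
        for pattern in patterns:
            width = len(pattern)
            for start in range(len(sentence) - width + 1):
                if tags[start:start + width] == pattern:
                    window = sentence[start:start + width]
                    counts[" ".join(lemma for lemma, _ in window)] += 1
    return counts


def domain_specificity(f_domain, n_domain, f_general, n_general):
    """Quotient q of relative frequencies (domain corpus / general corpus)."""
    # Assumption: a candidate unseen in the general corpus is counted once
    # there, so that q stays finite.
    return (f_domain / n_domain) / (max(f_general, 1) / n_general)


def filter_candidates(domain_sents, general_sents, threshold):
    """Keep candidates from the domain corpus whose q reaches the threshold."""
    n_domain = sum(len(s) for s in domain_sents)
    n_general = sum(len(s) for s in general_sents)
    domain_counts = extract_candidates(domain_sents)
    general_counts = extract_candidates(general_sents)
    kept = {}
    for term, freq in domain_counts.items():
        q = domain_specificity(freq, n_domain, general_counts[term], n_general)
        if q >= threshold:
            kept[term] = q
    return kept
```

Only contiguous patterns are matched here, which is why this sketch would miss the non-adjacent [V NN] collocations discussed in section 2.2.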

The relationship between term variant and original term is supposed to mainly be one of (quasi-) synonymy or of controlled modification (e.g. by attributive adjectives, NPs or PPs). We formalize this by explicitly classifying relationships between patterns.

We distinguish the following types of variants:

- graphical: air flow ↔ airflow
- morphological (derivation, compounding): Energieproduktion ↔ Produktion von Energie (production of energy); solare Energie ↔ Solarenergie (solar energy)
- paradigmatic, e.g. omissions: les énergies renouvelables ↔ les renouvelables (the renewable energies ↔ the renewables)
- abbreviations, acronyms: Windenergieanlage ↔ WEA (wind energy plant)
- syntactic variants³: consommation d’énergie ↔ consommation annuelle d’énergie (energy consumption ↔ yearly energy consumption)

³ This last type of variants is not necessarily synonymous with the original term.

Assuming that German technical texts contain many domain-specific compounds, we focus in this work on compound nouns and their variant [NN PRP NN] as illustrated above (morphological variants).

For French, we choose a similar pattern [NN de NN] ↔ [NN VPART]. In our current work, we restrict this pattern to nouns ending in -tion. The addition of French morphology tools is planned to widen the scope of these patterns.

2.4. Morphological processing

In order to identify morphological variants of German compounds, we need to split compounds into their components: in the present work, we opt for a statistical compound splitter; the implementation is based on Koehn & Knight (2003). The splitter searches for the most probable split of a given word; the basic idea is that the components of a compound also appear as single words and consequently should occur in corpus data. A word frequency list serves as training data, supplemented with a hand-crafted set of rules to model transitional elements, such as the s in Produktions|kosten (production costs).

For French, we created a set of rules to model the relationship between nouns ending in -tion and the respective verbs:

- production → produire (production → produce)
- évolution → évoluer (evolution → evolve)
- condition → conditionner (condition → condition)
- protection → protéger (protection → protect)

Similar rules can be formulated, e.g. for nouns ending in -ment or -eur, e.g. chargement (nominalized action) → charger (verb), as well as convertisseur (nominalized tool name) → convertir (verb). Similarly, terms containing adjectives ending in -able, such as utilisable → utiliser (cf. table 5) or relational adjectives (prototypique → prototype) are under study. A further type of pattern that could be added are rules to handle prefixation (e.g. anti-corrosion → corrosion).

2.5. Processing formally related items

A very common form of graphic variation is hyphenation, e.g. Luftwärmepumpe vs. Luft-Wärmepumpe (air-source heat pump). This type of variation is dealt with by the splitting program, which uses hyphens as splitting points. Hyphenated and non-hyphenated forms are treated as one term.

To a certain extent, our variant detection tools also deal with alternating transitional elements (Kraftwerkbetrieb vs. Kraftwerksbetrieb). This is modeled by hand-crafted rules which allow for several realizations. Additionally, there are relatively regular forms of spelling variation, e.g. the new/old orthography in German, resulting in e.g. ph/f variation. This can be dealt with either by rules or using a method based on string distance.

3. Experiments and examples of results

Our experiments are based on comparable corpora crawled from the Web. While they are generally easy to obtain with a focused crawler, such corpora might be inhomogeneous with respect to domain coverage or types of sources. When working with several languages, the degree of comparability may also vary.

We use a collection of 1000 documents each for French and German, with a total size of 1.55 M tokens (FR) and 1.29 M tokens (DE) from the domain of wind energy.
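The German comparisons in this section depend on relating each compound to its components. As a minimal sketch of the frequency-based splitting idea described in section 2.4 (after Koehn & Knight, 2003), and not the splitter actually used in the project, the following picks the segmentation whose parts have the highest geometric mean of frequencies and treats hyphens as given split points (section 2.5). The toy frequency list, the inventory of transitional elements and the minimum part length are assumptions made for illustration.

```python
from math import prod

# Hypothetical lemma frequencies; a real list would be counted from corpus data.
TOY_FREQ = {"produktion": 300, "kosten": 500, "luft": 350,
            "wärme": 250, "pumpe": 60, "wind": 700}
# Assumed inventory of transitional elements ("" means no element).
FILLERS = ("", "s", "es", "n", "en")


def _candidate_splits(word, freq, min_len):
    """Yield every segmentation of `word` into parts that occur in `freq`."""
    if word in freq:
        yield [word]  # the unsplit word is a candidate as well
    for cut in range(min_len, len(word) - min_len + 1):
        left, rest = word[:cut], word[cut:]
        for filler in FILLERS:
            if filler and not left.endswith(filler):
                continue
            stem = left[:len(left) - len(filler)]  # drop the transitional element
            if len(stem) >= min_len and stem in freq:
                for tail in _candidate_splits(rest, freq, min_len):
                    yield [stem] + tail


def split_compound(word, freq=TOY_FREQ, min_len=3):
    """Return the parts of `word`; unknown words are returned unsplit."""
    parts = []
    for chunk in word.lower().split("-"):  # hyphens are fixed split points
        if not chunk:
            continue
        candidates = list(_candidate_splits(chunk, freq, min_len))
        if candidates:
            # Score each candidate by the geometric mean of its part frequencies.
            best = max(candidates,
                       key=lambda c: prod(freq[p] for p in c) ** (1 / len(c)))
        else:
            best = [chunk]
        parts.extend(best)
    return parts
```

With this toy list, split_compound("Luft-Wärmepumpe") and split_compound("Luftwärmepumpe") return the same parts, which is one way hyphenated and non-hyphenated forms could be treated as one term. The candidate enumeration is exhaustive and therefore only suitable for a sketch.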

prepositional phrase | freq. | compound noun | freq. | English gloss
Abgabe von Wärme | 1 | Wärmeabgabe | 18 | release of warmth
Beleuchtung von Straße | 1 | Straßenbeleuchtung | 49 | street lighting
Erzeugung von Strom | 32 | Stromerzeugung | 569 | power generation
Produktion von Strom | 4 | Stromproduktion | 72 | power production
Speicherung von Energie | 7 | Energiespeicherung | 37 | energy storage
Verbrauch an Primärenergie | 1 | Primärenergieverbrauch | 114 | primary energy consumption
Versorgung mit Fernwärme | 2 | Fernwärmeversorgung | 13 | district heating
Nutzung von Biomasse | 8 | Biomassenutzung | 7 | biomass utilization

Table 2: Prepositional phrases vs. compound nouns

prepositional phrase | English gloss | freq. | noun-participle construction | English gloss | freq.
consommation d'électricité | electricity consumption | 28 | électricité consommée | consumed electricity | 15
consommation d'énergie | energy consumption | 66 | énergie consommée | consumed energy | 22
importation de pétrole | import of petroleum | 9 | pétrole importé | imported petroleum | 1
production d'électricité | electricity production | 225 | électricité produite | produced electricity | 95
production de chaleur | heat production | 26 | chaleur produite | produced heat | 21
installation d'éolienne | wind turbine installation | 5 | éolienne installée | installed wind turbine | 16
installation de puissance | installation of power | 1 | puissance installée | installed power | 69
utilisation d'énergie | use of energy | 5 | énergie utilisée | used energy | 19

Table 3: Related French terms: prepositional phrases vs. noun-participle constructions.

term | English gloss | freq.
Nutzenergie | useful energy | 89
nutzbar Energie | usable energy | 24
genutzt Energie | used energy | 5
lower part (grey background in the original):
nutzbar Energieform | usable energy form | 9
genutzt Energieform | used energy form | 4
nutzbar Energiegehalt | usable energy content | 3
Nutzenergie-Anteil | proportion of useful energy | 1
nutzbar Energiemenge | usable amount of energy | 1

Table 4: Variants of the compound Nutzenergie.

term | English gloss | freq.
énergie utilisée | used energy | 19
énergie utile | useful energy | 14
énergie utilisable | usable energy | 14
lower part (grey background in the original):
forme d'énergie utile | useful energy form | 2
forme d'énergie utilisable | form of usable energy | 2
source d'énergie utilisable | source of usable energy | 1

Table 5: Different combinations of the components énergie and utile.

When looking at the extracted German data, we find that

the realization of a term as a compound is often more frequent than the alternative structures [NN PRP NN] or [NN ARTgen NNgen], as illustrated in table 2. This does not only apply to common words like Stromerzeugung (power generation), but also to comparatively long and more complex words like Fernwärmeversorgung (lit. long-distance heat supply: district heating). We consider this as evidence that the respective compound nouns are established as terms in the domain or even in general language. The degree of preference varies, up to the point of there not being an alternative realization, as is the case with Windgeschwindigkeit (wind speed, freq=149), for which one could imagine a construction like *Geschwindigkeit des Windes (speed of the wind), which does not occur in our corpus.

In contrast to the German structures, the French terms of the pattern pair⁴ [NN de NN] ↔ [NN VPART] in table 3 are not (near) synonyms, but could rather be considered as related. While some terms seem to prefer one of the two patterns, the overall tendency for preference is less clear than for the German examples. The difference in meaning (i.e. action vs. situation) does not allow for full interchangeability of related terms, and the use of the different forms of realization is context dependent.

⁴ Note that the extracted lemma of the participle is its infinitive; we show the inflected form for better readability, i.e. consommée instead of consommer.

Some terms from the pairs contained in table 3 have different meanings, as is the case with puissance installée vs. installations de puissance élevée in example (3).

3) Par contre, le coût et la complexité des installations les réservent le plus souvent à des installations de puissance élevée pour bénéficier d’économies d’échelle.
However, due to the cost and complexity of the installations, they are mostly restricted to installations of high power in order to benefit from the scaling effects.

In other cases, grammatical and/or stylistic constraints may lead authors to use one variant rather than another. For example, compounds in enumerations tend to be split in order to facilitate the combination with other nouns, e.g. Meeresboden vs. Boden von Meeren in example (4).

4) Methanhydrat bildet sich am Boden von Meeren bzw. tiefen Seen
Methane hydrate develops at the ground of the sea or deep lakes

In table 4, we show examples of variants in a wider sense: starting with the compound Nutzenergie (useful energy), we find the synonym nutzbare Energie (usable energy) and the related form genutzte Energie (used energy). In the entries in the lower part of the table (grey background), the component Energie is part of a compound noun while still preserving the (basic) meaning of the term Nutzenergie (useful energy).

The French examples in table 5 correspond to the German ones (table 4), with related terms consisting of the basic components in the upper part of the table, and terms expanded by an additional component in the lower part of the table (grey background). The forms nutzbar and utilisable (usable) in tables 4 and 5 illustrate one of the above-mentioned variation patterns for adjectives.

4. Evaluation and discussion

4.1. Issues in measuring precision and recall

While it is relatively easy to measure the precision of identified (near) synonyms (such as the compound ↔ [NN PRP NN] pairs), it is comparatively difficult to determine the precision of related terms like the ones in tables 4 and 5, as it is often difficult to decide on the degree of relatedness.

Even more difficult is the evaluation of recall, which largely depends on the set of term variation patterns, but also on the patterns used for term candidate extraction. In order to avoid noise, term candidate extraction is restricted to productive patterns; this implies that not all term variants might be extracted and consequently, that some may not be available for variant grouping. The same applies to the set of rules used to group variants. For example, the French pattern [NN PRP NN] is restricted to the prepositions de and à. While there might be valid terms containing other prepositions, they are excluded from being extracted. Similarly, the large number of potential paraphrases of German compounds cannot be captured.

The examples in tables 4 and 5 illustrate the wide range of possible types of variation and thus the difficulty of capturing and relating the different types of variation. In addition to the problem of pattern coverage, another factor is the quality of the morphological tools used to model the relationship between word classes.

4.2. Evaluation of precision

In a small experiment, we measured the precision of the 100 most-frequent German compound nouns and their proposed variants: 74 of the variants are valid, i.e. a precision of 74%. Most of the 26 invalid variants are due to bad PP-attachment, as illustrated by the following example:

5) Stromkunde (energy customer) → *Kunde mit Strom (customer with energy)

which is part of the verbal phrase Kunden mit Strom versorgen (supply customers with energy). This kind of error can rather be considered a problem of the extraction step than of the variant detection.

However, in the examined set of 100 items, there was one term-variant pair whose derivation is technically correct, but whose meaning is not related:

6) Grundwasser (ground water) → Wasser am Grund eines Sees (water on the ground of a lake)

4.3. Symbolic vs. non-symbolic approach

By relying on a fixed set of rules for extraction, we clearly favour precision at the cost of recall.

In order to extract terms without a set of patterns, we present a knowledge-poor approach for term extraction using a probabilistic NP extractor and string-level term variation detection. First, we apply a probabilistic NP extractor trained on a small corpus annotated manually with NPs (300 to 600 sentences): this tool has been described in Guégan & Loupy (2011) for the extraction of NP chunks and uses a pseudo part-of-speech tagger (Clark, 2003).

A further non-symbolic procedure consists in relating extracted terms without relying on a predefined set of variation patterns. We experimented with comparing NPs on a string level (using Levenshtein distance ratio) and grouping terms by similarity. The resulting term groups also provide a basis for the automatic derivation of term variation patterns, which can be used as an input to the symbolic method.

4.4. Relatedness of term candidates

Using a predefined set of term variation patterns facilitates the decision whether terms are (near) synonyms or related. As synonyms, we consider for example the type [compound noun] ↔ [NN PRP NN]. Structures involving relational adjectives ([ADJ NN] (DE), [NN ADJ] (FR)) can be expressed by prepositional phrases, e.g. production énergétique ↔ production d'énergie (energy production ↔ production of energy).

Similarly, patterns can also help to specify the degree of relatedness: by explicitly formulating term variation rules we can differentiate between merely related terms (e.g. consumption vs. annual consumption) and term variants where we assume quasi synonymy (cf. compound nouns in table 2).

A difficult task is the identification of (neoclassical) synonyms: without additional information (e.g. a dictionary), it is impossible to relate terms like Sonnenenergie ↔ Solarenergie (solar energy), as the relation between Sonne and solar is not known to the system and cannot be derived by morphological means.

While the terms in the example above are synonyms, there can be some slight difference in meaning between neoclassical compounds and their native form: the term hydroélectricité (hydroelectricity) is more precise than énergie de l'eau (water energy), and not necessarily a synonym.

5. Conclusion and next steps

We presented a method for terminology extraction and for the identification of a certain type of term variation. Preliminary results show that there are preferences for a certain type of realization, especially when considering German compound nouns.

Since our current work only deals with a small part of variation possibilities, we intend to enlarge our inventory by exploring more variation patterns. We particularly plan to include high-quality morphological tools, e.g. SMOR (Schmid et al., 2004) for German, and DériF (Namer, 2009) for French. SMOR has proven to outperform our statistical splitter.

Another strand of research is the exploration of term variation across languages, e.g. relations between term variants that are similar within different language pairs.

References

Ahmad, K., Davies, A., Fulford, H., Rogers, M. (1992): What is a Term? The semi-automatic extraction of terms from text. In Translation Studies - an Interdiscipline. John Benjamins Publishing Company.

Clark, A. (2003): Combining distributional and morphological information for part of speech induction. In Proceedings of the 10th conference of the European chapter of the Association for Computational Linguistics. Budapest, Hungary.

Daille, B. (2005): Variants and application-oriented terminology engineering. In Terminology, volume. 1.

Guégan, M., de Loupy, C. (2011): Knowledge-Poor Approach to Shallow Parsing: Contribution of Unsupervised Part-of-Speech Induction. RANLP 2011 - Recent Advances in Natural Language Processing.

de Groc, C. (2011): Babouk: Focused web crawling for corpus compilation and automatic terminology extraction. In Proceedings of the IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology. Lyon, France.

Koehn, P., Knight, K. (2003): Empirical Methods for Compound Splitting. In Proceedings of the 10th conference of the European chapter of the Association for Computational Linguistics. Budapest, Hungary.

Namer, F. (2009): Morphologie, Lexique et Traitement Automatique des Langues - Le système DériF. Hermès – Lavoisier Publishers.

Schmid, H. (1994): Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing. Manchester, UK.

Schmid, H., Fitschen, A., Heid, U. (2004): SMOR: A German computational morphology covering derivation, composition and inflection. In Proceedings of LREC '04. Lisbon, Portugal.