Terminology extraction and term variation patterns: a study of French and German data

Marion Weller, Helena Blancafort, Anita Gojun, Ulrich Heid

Institut für maschinelle Sprachverarbeitung, Universität Stuttgart
{wellermn|gojunaa|heid}@ims.uni-stuttgart.de
Syllabs, Paris
[email protected]

Abstract

The terminology of many technical domains, especially new and evolving ones, is not fully fixed and shows considerable variation. The purpose of the work described in this paper is to capture term variation. For term extraction, we apply hand-crafted POS patterns on tagged corpora, and we use rules to relate morphological and syntactic variants. We discuss some French and German variation patterns, and we present first experimental results from our tools. It is not always easy to distinguish (near) synonyms from variants that have a slightly different meaning from the original term; we discuss ways of drawing such a distinction. Our tools are based on POS tagging and an approximation of derivation and compounding; however, we also propose a non-symbolic, statistics-based line of development. We discuss general issues of evaluating variant detection and present a small-scale precision evaluation.

Keywords: terminology, term variation, comparable corpora, pattern-based term extraction, compound nouns
1. Introduction

The objective of the EU-funded project TTC¹ (Terminology Extraction, Translation Tools and Comparable Corpora) is the extraction of terminology from comparable corpora. The tools under development within the project address the issues of compiling corpus collections, monolingual term extraction and the alignment of terms into pairs of multilingual equivalence candidates, as well as the management and the export of the resulting terminological data towards CAT and MT tools.

Since parallel corpora of specialized domains are scarce and not necessarily available for a broad range of languages (TTC deals with English (EN), Spanish (ES), German (DE), French (FR), Latvian (LV), Russian (RU), Chinese (ZH)), comparable corpora are used instead: textual material from specialized domains is accessible for many languages, either on the Internet or in publications of companies.

In technical domains which are rapidly evolving, documents published on the Internet are often the most recent sources of data. In such domains, terminology typically has not yet been standardized, and thus numerous variants co-exist in published documents. Tools which support the extraction, identification and interrelating of term variants are thus necessary to capture the full range of expressions used in the respective domain. End users may then decide (e.g. on the basis of variant frequency and sources of variants) which expression to prefer.

A second, more technical motivation for term variant extraction is provided by the procedures for term alignment (either lexical or statistical strategies), for which data sparseness is a problem. In order to reduce the complexity of term alignment, TTC intends to gather monolingual variants into sets of related terms. Particularly for this application, we do not only allow for (quasi) synonyms, but also for variants with a slight difference in meaning, as shown in (1).

1) production d'électricité ↔ électricité produite
   (production of electricity ↔ produced electricity)

¹ http://www.ttc-project.eu
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement n. 248005.
Terms may be of different forms (single-word vs. multiword terms) in different languages: this is a challenge for term alignment. For example, compound nouns play an important role in German terminology, but have no equivalents of the same morpho-syntactic structure in many other languages. Grouping equivalent terms of different syntactic structures can help to deal with such cases, as illustrated in (2):

2) Energieproduktion ↔ Produktion von Energie ↔ production d'électricité
   (energy production ↔ production of energy)

2. Methodology

The steps required for term extraction and for variant identification follow a simple pipeline architecture: first, a corpus collection is compiled, which then undergoes linguistic pre-processing. Following these steps, monolingual term candidates are extracted. As not all extracted items are domain relevant, we apply statistical filtering. Since we intend to detect term variation on a morpho-syntactic level, this last step requires morphological processing in order to model derivational relationships between word classes.

2.1. Compiling a corpus and pre-processing

To collect corpus data, we use the focused Web crawler Babouk (de Groc, 2011) which has been developed within the TTC project. Babouk starts with a set of seed terms or URLs given by the user which are combined into queries and submitted to a search engine. Babouk scores the relevance of the retrieved web pages using a weighted-lexicon-based thematic filter. Based on the content of relevant retrieved pages, the lexicon is extended and new search queries are combined.

One objective of the TTC project is to rely on flat linguistic analysis that is available for all languages. One strand of research thus goes towards the development of knowledge-poor strategies, such as using a pseudo part-of-speech tagger (Clark, 2003) as a basis for probabilistic NP-extraction (Guégan & Loupy, 2011). A knowledge-rich approach is term extraction based on hand-crafted part-of-speech (POS) patterns, which is the method we chose for the present work.

Pre-processing of our data collection consists of tokenizing, POS-tagging and lemmatization using TreeTagger (Schmid, 1994). For efficiency reasons, with German and French being morphologically rich languages, we work with lemmas rather than inflected forms.

2.2. Term candidate extraction and filtering

Our main focus is on the extraction of nominal phrases such as [NN NN] or [NN PRP NN] constructions (cf. tables 2-5), but [V NN] collocations are also of interest².

For each language, we identify term candidates by using hand-crafted POS patterns. In contrast to nominal phrases, which are relatively easy to capture by POS patterns, the identification of [V NN] collocations is more challenging, as verbs and their object nouns do not necessarily occur in adjacent positions, depending on the general structure of the sentence. This applies particularly to German where constituent order is rather flexible and allows for long distances between verbs and their objects.

In order to reduce the extracted term candidates to a set of domain-relevant items, we estimate their domain specificity by comparing them with terms extracted from general language corpora (Ahmad et al., 1992). The underlying idea of this procedure is the assumption that terms which occur in both domain-specific and general language corpora are not domain-relevant, whereas terms occurring only or predominantly in the domain-specific data can be considered as specialized terms. We use the quotient q of a term's relative frequency in the specialized data and in the general language corpus as an indicator for its domain relevance (see table 1).

term candidate                 f domain   f general   q
Gleichstrom (direct current)   128        4           22362,7
Jahr (year)                    2157       221.213     1,2

Table 1: Domain-specific vs. general language

2.3. Term variation

In TTC we define a term variant as “an utterance which is semantically and conceptually related to an original term” (Daille, 2005). Thus, term variants are bound to texts (“utterance”) and require the presence of an “original term” identified e.g. by means of a morpho-syntactic term pattern.

² NN: noun, PRP: preposition, V: verb, VPART: participle
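The two steps of section 2.2, POS-pattern matching and filtering by the quotient q, can be illustrated with a short Python sketch. It is a minimal illustration and not the TTC implementation: the pattern list, the function names, the add-one smoothing of the general-language count and the size of the general corpus are all assumptions made for this example.

```python
# Hypothetical POS patterns over (lemma, tag) pairs. The tag names follow the
# paper's notation (NN, PRP), not the tagset of a particular tagger.
PATTERNS = [("NN", "PRP", "NN"), ("NN", "NN")]


def extract_candidates(tagged_sentence, patterns=PATTERNS):
    """Return lemma n-grams whose tag sequence matches one of the patterns."""
    candidates = []
    for pattern in patterns:
        n = len(pattern)
        for i in range(len(tagged_sentence) - n + 1):
            window = tagged_sentence[i:i + n]
            if tuple(tag for _, tag in window) == pattern:
                candidates.append(" ".join(lemma for lemma, _ in window))
    return candidates


def specificity_quotient(f_domain, n_domain, f_general, n_general, smoothing=1.0):
    """Quotient q of relative frequencies (domain corpus / general corpus).

    `smoothing` is added to the general-corpus count so that terms unseen in
    general language do not cause a division by zero (an assumption, not
    something the paper specifies).
    """
    rel_domain = f_domain / n_domain
    rel_general = (f_general + smoothing) / n_general
    return rel_domain / rel_general


# Usage: a lemmatized, tagged toy sentence.
sentence = [("Produktion", "NN"), ("von", "PRP"), ("Energie", "NN"), ("steigen", "V")]
print(extract_candidates(sentence))  # ['Produktion von Energie']

# Toy counts; the general corpus size of 100M tokens is hypothetical.
q_specialized = specificity_quotient(128, 1_290_000, 4, 100_000_000)
q_common = specificity_quotient(2157, 1_290_000, 221_213, 100_000_000)
print(q_specialized > q_common)  # the domain-specific term ranks higher
```

Because the corpus sizes are invented, the resulting values are not expected to match table 1; only the ranking of a domain-specific term above a general-language word carries over.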
The relationship between term variant and original term is supposed to mainly be one of (quasi-) synonymy or of controlled modification (e.g. by attributive adjectives, NPs or PPs). We formalize this by explicitly classifying relationships between patterns.

We distinguish the following types of variants:

- graphical: air flow ↔ airflow
- morphological (derivation, compounding): Energieproduktion ↔ Produktion von Energie (production of energy); solare Energie ↔ Solarenergie (solar energy)
- paradigmatic, e.g. omissions: les énergies renouvelables ↔ les renouvelables (the renewable energies ↔ the renewables)
- abbreviations, acronyms: Windenergieanlage ↔ WEA (wind energy plant)
- syntactic variants³: consommation d'énergie ↔ consommation annuelle d'énergie (energy consumption ↔ yearly energy consumption)

Assuming that German technical texts contain many domain-specific compounds, we focus in this work on compound nouns and their variant [NN PRP NN] as illustrated above (morphological variants).

For French, we choose a similar pattern [NN de NN] ↔ [NN VPART]. In our current work, we restrict this pattern to nouns ending in -tion. The addition of French morphology tools is planned to widen the scope of these patterns.

2.4. Morphological processing

In order to identify morphological variants of German compounds, we need to split compounds into their components: in the present work, we opt for a statistical compound splitter; the implementation is based on Koehn & Knight (2003). Searching for the most probable split of a given word, the basic idea is that the components of a compound also appear as single words and consequently should occur in corpus data. A word frequency list serves as training data, supplemented with a hand-crafted set of rules to model transitional elements, such as the s in Produktions|kosten (production costs).

For French, we created a set of rules to model the relationship between nouns ending in -tion and the respective verbs:

production → produire (production → produce)
évolution → évoluer (evolution → evolve)
condition → conditionner (condition → condition)
protection → protéger (protection → protect)

Similar rules can be formulated, e.g. for nouns ending in -ment or -eur, e.g. chargement (nominalized action) → charger (verb), as well as convertisseur (nominalized tool name) → convertir (verb). Similarly, terms containing adjectives ending in -able, such as utilisable → utiliser (cf. table 5) or relational adjectives (prototypique → prototype) are under study. A further type of pattern that could be added is rules to handle prefixation (e.g. anti-corrosion → corrosion).

2.5. Processing formally related items

A very common form of graphic variation is hyphenation, e.g. Luftwärmepumpe vs. Luft-Wärmepumpe (air-source heat pump). This type of variation is dealt with by the splitting program, which uses hyphens as splitting points. Hyphenated and non-hyphenated forms are treated as one term.

To a certain extent, our variant detection tools also deal with alternating transitional elements (Kraftwerkbetrieb vs. Kraftwerksbetrieb). This is modeled by hand-crafted rules which allow for several realizations. Additionally, there are relatively regular forms of spelling variation, e.g. the new/old orthography in German, resulting in e.g. ph/f variation. This can be dealt with either by rules or by using a method based on string distance.

3. Experiments and examples of results

Our experiments are based on comparable corpora crawled from the Web. While they are generally easy to obtain with a focused crawler, such corpora might be inhomogeneous with respect to domain coverage or types of sources. When working with several languages, the degree of comparability may also vary.

We use a collection of 1000 documents each for French and German, with a total size of 1.55 M tokens (FR) and 1.29 M tokens (DE) from the domain of wind energy.

³ This last type of variant is not necessarily synonymous with the original term.
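The frequency-based splitting idea of section 2.4, together with the handling of hyphens and transitional elements from section 2.5, can be sketched as follows. This is a simplified illustration in the spirit of Koehn & Knight (2003), scoring each candidate split by the geometric mean of the corpus frequencies of its parts; it is not the splitter used in this work. The function names, the minimum part length, the list of linking elements (only "s" is mentioned in the text) and the toy frequency list are assumptions.

```python
import math

# Hypothetical linking elements; "" means the parts are simply concatenated.
LINKERS = ("", "s")


def _splits(word, freq, min_len, linkers):
    """Yield every segmentation of `word` into parts found in `freq`."""
    if word in freq:
        yield [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        for linker in linkers:
            if linker and not head.endswith(linker):
                continue
            stem = head[:len(head) - len(linker)]  # strip the linking element
            if len(stem) < min_len or stem not in freq:
                continue
            for rest in _splits(tail, freq, min_len, linkers):
                yield [stem] + rest


def _geometric_mean(parts, freq):
    return math.exp(sum(math.log(freq[p]) for p in parts) / len(parts))


def split_compound(word, freq, min_len=3, linkers=LINKERS):
    """Split a (possibly hyphenated) compound into lowercased components."""
    components = []
    for piece in word.lower().split("-"):  # hyphens are given split points
        if not piece:
            continue
        candidates = list(_splits(piece, freq, min_len, linkers))
        if candidates:
            best = max(candidates, key=lambda parts: _geometric_mean(parts, freq))
        else:
            best = [piece]  # unknown word: keep it unsplit
        components.extend(best)
    return components


# Toy frequency list (invented counts).
freq = {"produktion": 50, "kosten": 80, "produktionskosten": 5,
        "luft": 30, "wärme": 60, "pumpe": 20, "wärmepumpe": 25,
        "kraftwerk": 40, "betrieb": 70}

print(split_compound("Produktionskosten", freq))
print(split_compound("Luft-Wärmepumpe", freq) == split_compound("Luftwärmepumpe", freq))
```

Under these toy counts, hyphenated and non-hyphenated spellings, as well as forms with and without the transitional s, are reduced to the same component list, which is one simple way of treating them as a single term.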
Prepositional phrase         freq   Compound noun            freq   Gloss
Abgabe von Wärme                1   Wärmeabgabe                18   release of warmth
Beleuchtung von Straße          1   Straßenbeleuchtung         49   street lighting
Erzeugung von Strom            32   Stromerzeugung            569   power generation
Produktion von Strom            4   Stromproduktion            72   power production
Speicherung von Energie         7   Energiespeicherung         37   energy storage
Verbrauch an Primärenergie      1   Primärenergieverbrauch    114   primary energy consumption
Versorgung mit Fernwärme        2   Fernwärmeversorgung        13   district heating
Nutzung von Biomasse            8   Biomassenutzung             7   biomass utilization

Table 2: Prepositional phrases vs. compound nouns

[NN de NN]                   Gloss                       freq   [NN VPART]              Gloss                    freq
consommation d'électricité   electricity consumption       28   électricité consommée   consumed electricity       15
consommation d'énergie       energy consumption            66   énergie consommée       consumed energy            22
importation de pétrole       import of petroleum            9   pétrole importé         imported petroleum          1
production d'électricité     electricity production       225   électricité produite    produced electricity       95
production de chaleur        heat production               26   chaleur produite        produced heat              21
installation d'éolienne      wind turbine installation      5   éolienne installée      installed wind turbine     16
installation de puissance    installation of power          1   puissance installée     installed power            69
utilisation d'énergie        use of energy                  5   énergie utilisée        used energy                19

Table 3: Related French terms: prepositional phrases vs. noun-participle constructions.

Term                    Gloss                         freq
Nutzenergie             useful energy                   89
nutzbar Energie         usable energy                   24
genutzt Energie         used energy                      5
--------------------------------------------------------
nutzbar Energieform     usable energy form               9
genutzt Energieform     used energy form                 4
nutzbar Energiegehalt   usable energy content            3
Nutzenergie-Anteil      proportion of useful energy      1
nutzbar Energiemenge    usable amount of energy          1

Table 4: Variants of the compound Nutzenergie.

Term                          Gloss                     freq
énergie utilisée              used energy                 19
énergie utile                 useful energy               14
énergie utilisable            usable energy               14
------------------------------------------------------------
forme d'énergie utile         useful energy form           2
forme d'énergie utilisable    form of usable energy        2
source d'énergie utilisable   source of usable energy      1

Table 5: Different combinations of the components énergie and utile.

When looking at the extracted German data, we find that
the realization of a term as a compound is often more frequent than the alternative structures [NN PRP NN] or [NN ARTgen NNgen], as illustrated in table 2. This does not only apply to common words like Stromerzeugung (power generation), but also to comparatively long and more complex words like Fernwärmeversorgung (lit. long-distance heat supply: district heating). We consider this as evidence that the respective compound nouns are established as terms in the domain or even in general language. The degree of preference varies, up to the point of there not being an alternative realization, as is the case with Windgeschwindigkeit (wind speed, freq=149), for which one could imagine a construction like *Geschwindigkeit des Windes (speed of the wind), which does not occur in our corpus.

In contrast to the German structures, the French terms of the pattern pair⁴ [NN de NN] ↔ [NN VPART] in table 3 are not (near) synonyms, but could rather be considered as related. While some terms seem to prefer one of the two patterns, the overall tendency for preference is less clear than for the German examples. The difference in meaning (i.e. action vs. situation) does not allow for full interchangeability of related terms, and the use of the different forms of realization is context dependent.

⁴ Note that the extracted lemma of the participle is its infinitive; we show the inflected form for better readability, i.e. consommée instead of consommer.

Some terms from the pairs contained in table 3 have different meanings, as is the case with puissance installée vs. installations de puissance élevée in example (3).

3) Par contre, le coût et la complexité des installations les réservent le plus souvent à des installations de puissance élevée pour bénéficier d'économies d'échelle.
   However, due to the cost and complexity of the installations, they are mostly restricted to installations of high power in order to benefit from economies of scale.

In other cases, grammatical and/or stylistic constraints may lead authors to use one variant rather than another. For example, compounds in enumerations tend to be split in order to facilitate the combination with other nouns, e.g. Meeresboden vs. Boden von Meeren in example (4).

4) Methanhydrat bildet sich am Boden von Meeren bzw. tiefen Seen
   Methane hydrate develops at the ground of the sea or deep lakes

In table 4, we show examples of variants in a wider sense: starting with the compound Nutzenergie (useful energy), we find the synonym nutzbare Energie (usable energy) and the related form genutzte Energie (used energy). In the entries in the lower part of the table (grey background), the component Energie is part of a compound noun while still preserving the (basic) meaning of the term Nutzenergie (useful energy).

The French examples in table 5 correspond to the German ones (table 4), with related terms consisting of the basic components in the upper part of the table, and terms expanded by an additional component in the lower part of the table (grey background). The forms nutzbar and utilisable (usable) in tables 4 and 5 illustrate one of the above-mentioned variation patterns for adjectives.

4. Evaluation and discussion

4.1. Issues in measuring precision and recall

While it is relatively easy to measure the precision of identified (near) synonyms (such as the compound ↔ [NN PRP NN] pairs), it is comparatively difficult to determine the precision of related terms like the ones in tables 4 and 5, as it is often difficult to decide on the degree of relatedness.

Even more difficult is the evaluation of recall, which largely depends on the set of term variation patterns, but also on the patterns used for term candidate extraction. In order to avoid noise, term candidate extraction is restricted to productive patterns; this implies that not all term variants might be extracted and consequently, that some may not be available for variant grouping. The same applies to the set of rules used to group variants. For example, the French pattern [NN PRP NN] is restricted to the prepositions de and à. While there might be valid terms containing other prepositions, they are excluded from being extracted. Similarly, the large number of potential paraphrases of German compounds cannot be captured.

The examples in tables 4 and 5 illustrate the wide range of possible types of variation and thus the difficulty of capturing and relating the different types of variation. In addition to the problem of pattern coverage, another factor is the quality of the morphological tools used to model the relationship between word classes.

4.2. Evaluation of precision

In a small experiment, we measured the precision of the 100 most frequent German compound nouns and their proposed variants: 74 of the variants are valid, i.e. a precision of 74%. Most of the 26 invalid variants are due to bad PP-attachment, as illustrated by the following example:

5) Stromkunde (energy customer) → *Kunde mit Strom (customer with energy)

which is part of the verbal phrase Kunden mit Strom versorgen (supply customers with energy). This kind of error is a problem of the extraction step rather than of the variant detection.

However, in the examined set of 100 items, there was one term-variant pair whose derivation is technically correct, but whose meaning is not related:

6) Grundwasser (ground water) → Wasser am Grund eines Sees (water on the ground of a lake)

4.3. Symbolic vs. non-symbolic approach

By relying on a fixed set of rules for extraction, we clearly favour precision at the cost of recall.

In order to extract terms without a set of patterns, we present a knowledge-poor approach for term extraction using a probabilistic NP extractor and string-level term variation detection. First, we apply a probabilistic NP extractor trained on a small corpus annotated manually with NPs (300 to 600 sentences): this tool has been described in Guégan & Loupy (2011) for the extraction of NP chunks and uses a pseudo part-of-speech tagger (Clark, 2003).

A further non-symbolic procedure consists in relating extracted terms without relying on a predefined set of variation patterns. We experimented with comparing NPs on a string level (using Levenshtein distance ratio) and grouping terms by similarity. The resulting term groups also provide a basis for the automatic derivation of term variation patterns, which can be used as an input to the symbolic method.

4.4. Relatedness of term candidates

Using a predefined set of term variation patterns facilitates the decision whether terms are (near) synonyms or related. As synonyms, we consider for example the type [compound noun] ↔ [NN PRP NN]. Structures involving relational adjectives ([ADJ NN] (DE), [NN ADJ] (FR)) can be expressed by prepositional phrases, e.g. production énergétique ↔ production d'énergie (energy production ↔ production of energy).

Similarly, patterns can also help to specify the degree of relatedness: by explicitly formulating term variation rules we can differentiate between merely related terms (e.g. consumption vs. annual consumption) and term variants where we assume quasi synonymy (cf. compound nouns in table 2).

A difficult task is the identification of (neoclassical) synonyms: without additional information (e.g. a dictionary), it is impossible to relate terms like Sonnenenergie ↔ Solarenergie (solar energy), as the relation between Sonne and solar is not known to the system and cannot be derived by morphological means.

While the terms in the example above are synonyms, there can be some slight difference in meaning between neoclassical compounds and their native form: the term hydroélectricité (hydroelectricity) is more precise than énergie de l'eau (water energy), and not necessarily a synonym.

5. Conclusion and next steps

We presented a method for terminology extraction and for the identification of a certain type of term variation. Preliminary results show that there are preferences for a certain type of realization, especially when considering German compound nouns.

Since our current work only deals with a small part of variation possibilities, we intend to enlarge our inventory by exploring more variation patterns. We particularly plan to include high-quality morphological tools, e.g. SMOR (Schmid et al., 2004) for German, and DériF (Namer, 2009) for French. SMOR has proven to outperform our statistical splitter.

Another strand of research is the exploration of term variation across languages, e.g. relations between term variants that are similar within different language pairs.

References

Ahmad, K., Davies, A., Fulford, H., Rogers, M. (1992): What is a Term? The semi-automatic extraction of terms from text. In Translation Studies - an Interdiscipline. John Benjamins Publishing Company.

Clark, A. (2003): Combining distributional and morphological information for part of speech induction. In Proceedings of the 10th conference of the European chapter of the Association for Computational Linguistics. Budapest, Hungary.

Daille, B. (2005): Variants and application-oriented terminology engineering. In Terminology, volume. 1.

Guégan, M., de Loupy, C. (2011): Knowledge-Poor Approach to Shallow Parsing: Contribution of Unsupervised Part-of-Speech Induction. RANLP 2011 - Recent Advances in Natural Language Processing.

de Groc, C. (2011): Babouk: Focused web crawling for corpus compilation and automatic terminology extraction. In Proceedings of the IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology. Lyon, France.

Koehn, P., Knight, K. (2003): Empirical Methods for Compound Splitting. In Proceedings of the 10th conference of the European chapter of the Association for Computational Linguistics. Budapest, Hungary.

Namer, F. (2009): Morphologie, Lexique et Traitement Automatique des Langues - Le système DériF. Hermès – Lavoisier Publishers.

Schmid, H. (1994): Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing. Manchester, UK.

Schmid, H., Fitschen, A., Heid, U. (2004): SMOR: A German computational morphology covering derivation, composition and inflection. In Proceedings of LREC '04. Lisbon, Portugal.