TOR, TORMD: Distributional Profiles of Concepts for Unsupervised Word Sense Disambiguation
Saif Mohammad, Graeme Hirst, and Philip Resnik
University of Toronto and University of Maryland
{smm,gh}@cs.toronto.edu, [email protected]
Multilingual Chinese–English Lexical Sample Task
[Figure labels: "Cross-lingual candidate senses of Chinese words and"; "Words having ‘celestial body’ as cross-lingual candidate sense"]
How to represent a concept?
As a category: a set of near-synonymous words from a thesaurus.
  concept: CELESTIAL BODY
  near-synonyms: celestial body, sun, planet, star, ...
By its usage in text: words that co-occur.
  concept: CELESTIAL BODY
  co-occurring words: light, fusion, gravity, revolve, ...
Combining the two
Central Idea: DISTRIBUTIONAL PROFILES OF CONCEPTS
[Figure: the words "space" and "star" in text, linked to the concepts CELESTIAL BODY (strong association) and CELEBRITY (weak association).]
Distributional Profile of a Concept (DPC): a concept, its near-synonyms, and its co-occurring words with strength of association.
CELESTIAL BODY (planet, sun, star, ...): space 0.36, light 0.27, revolve 0.14, ...
How to create these DPCs?
You know someone by the company they keep.
Create a Word–Category Co-occurrence Matrix (WCCM)

           categories/concepts →
         c_1    c_2    ...   c_j    ...
  w_1    m_11   m_12   ...   m_1j   ...
  w_2    m_21   m_22   ...   m_2j   ...
  ...    ...    ...    ...   ...    ...
  w_i    m_i1   m_i2   ...   m_ij   ...
  ...    ...    ...    ...   ...    ...

Unsupervised naïve Bayes word sense classifier
$$\text{desired concept} = \operatorname*{argmax}_{c_j \in C} \; P(c_j) \prod_{w_i \in W} P(w_i \mid c_j)$$
The WCCM can be used to estimate, in an unsupervised manner, probabilities that are traditionally calculated using sense-annotated data:
$$P(c_j) = \frac{\sum_i m_{ij}}{\sum_{i,j} m_{ij}}, \qquad P(w_i \mid c_j) = \frac{m_{ij}}{\sum_i m_{ij}}$$
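The classifier above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matrix counts are invented, the function names are hypothetical, and the add-k smoothing (parameter `k`) is an addition not mentioned on the poster, used here so that a zero count does not rule out a concept outright.

```python
import math

# Hypothetical toy WCCM: rows are words, columns are concepts.
# All counts are invented for illustration.
CONCEPTS = ["CELESTIAL BODY", "CELEBRITY"]
WCCM = {
    "space":   [8, 1],
    "light":   [6, 2],
    "revolve": [4, 0],
    "movie":   [0, 9],
    "fame":    [2, 7],
}

def column_total(j):
    """Sum of m_ij over all words i, for concept column j."""
    return sum(row[j] for row in WCCM.values())

def prior(j):
    """P(c_j) = sum_i m_ij / sum_{i,j} m_ij."""
    grand_total = sum(sum(row) for row in WCCM.values())
    return column_total(j) / grand_total

def likelihood(word, j, k=0.5):
    """P(w_i | c_j) = m_ij / sum_i m_ij, with add-k smoothing (an assumption)."""
    return (WCCM[word][j] + k) / (column_total(j) + k * len(WCCM))

def classify(context_words, k=0.5):
    """Return the concept maximising P(c_j) * prod_i P(w_i | c_j)."""
    # Context words missing from the matrix are skipped.
    known = [w for w in context_words if w in WCCM]

    def log_score(j):
        # Work in log space to avoid underflow on long contexts.
        return math.log(prior(j)) + sum(math.log(likelihood(w, j, k)) for w in known)

    best = max(range(len(CONCEPTS)), key=log_score)
    return CONCEPTS[best]

print(classify(["space", "revolve"]))
print(classify(["movie", "fame"]))
```

With these toy counts, a context containing "space" and "revolve" favours CELESTIAL BODY, while "movie" and "fame" favour CELEBRITY.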
So effectively we have Chinese words with English senses.
Apply this cross-lingually
Capturing co-occurrence associations between words and concepts
Rows of the WCCM are words $w_i$; columns are categories/concepts $c_j$.
Base WCCM: $m_{ij}$ is the number of times word $w_i$ co-occurs with any word that has $c_j$ as a sense.
Bootstrapped WCCM: $m_{ij}$ is the number of times word $w_i$ co-occurs with any word used in sense $c_j$.
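A minimal sketch of the two counting schemes follows. The thesaurus and corpus are invented, each sentence is treated as one co-occurrence window, and all names are hypothetical. The poster does not say how the sense a word is "used in" is decided, so `pick_sense` below is an assumption: it picks the candidate sense with the highest total base count over the context words.

```python
from collections import defaultdict

# Hypothetical toy thesaurus: word -> categories it is listed under.
THESAURUS = {
    "star": ["CELESTIAL BODY", "CELEBRITY"],
    "sun": ["CELESTIAL BODY"],
    "planet": ["CELESTIAL BODY"],
    "actor": ["CELEBRITY"],
}

# Hypothetical toy corpus; each sentence is one co-occurrence window.
CORPUS = [
    ["star", "light", "space"],
    ["sun", "light", "space"],
    ["planet", "revolve", "sun"],
    ["actor", "movie", "fame"],
    ["star", "movie"],
]

def base_wccm(corpus, thesaurus):
    """m[w][c]: times w co-occurs with any word that has c as a sense."""
    m = defaultdict(lambda: defaultdict(int))
    for window in corpus:
        for a, word in enumerate(window):
            for b, other in enumerate(window):
                if a == b:
                    continue
                # Every listed sense of the neighbour gets a count.
                for concept in thesaurus.get(other, []):
                    m[word][concept] += 1
    return m

def pick_sense(index, window, thesaurus, base):
    """Assumed heuristic: sense with the largest base count over the context."""
    context = [w for k, w in enumerate(window) if k != index]
    return max(
        thesaurus[window[index]],
        key=lambda c: sum(base.get(w, {}).get(c, 0) for w in context),
    )

def bootstrapped_wccm(corpus, thesaurus, base):
    """m[w][c]: times w co-occurs with a word judged to be used in sense c."""
    m = defaultdict(lambda: defaultdict(int))
    for window in corpus:
        for b, other in enumerate(window):
            if other not in thesaurus:
                continue
            sense = pick_sense(b, window, thesaurus, base)
            # Only the chosen sense of the neighbour gets a count.
            for a, word in enumerate(window):
                if a != b:
                    m[word][sense] += 1
    return m

base = base_wccm(CORPUS, THESAURUS)
boot = bootstrapped_wccm(CORPUS, THESAURUS, base)
print(dict(base["light"]), dict(boot["light"]))
```

In this toy run the base matrix credits "light" with a CELEBRITY count because "star" is ambiguous; the bootstrapped pass drops that count once "star" is resolved in context.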
Apply this monolingually
English Lexical Sample Task
Accuracies were markedly better than the random baseline: an increase of more than twenty percentage points.
Conclusions
Capturing co-occurrence associations between Chinese words and English concepts
Cross-lingual Distributional Profiles of Concepts
Chinese word–English category co-occurrence matrix

            c^en_1   c^en_2   ...   c^en_j   ...
  w^ch_1    m_11     m_12     ...   m_1j     ...
  w^ch_2    m_21     m_22     ...   m_2j     ...
  ...       ...      ...      ...   ...      ...
  w^ch_i    m_i1     m_i2     ...   m_ij     ...
  ...       ...      ...      ...   ...      ...

Base WCCM: $m_{ij}$ is the number of times Chinese word $w^{ch}_i$ co-occurs with a word that has $c_j$ as English sense.
Bootstrapped WCCM: $m_{ij}$ is the number of times Chinese word $w^{ch}_i$ co-occurs with a word used in English sense $c_j$.
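The cross-lingual base matrix can be sketched as follows. This is an illustration only: the data is invented, the names are hypothetical, and obtaining candidate senses by looking up a Chinese word's English translations in an English thesaurus is an assumption about how the cross-lingual candidate senses are found.

```python
from collections import Counter, defaultdict

# All data below is invented for illustration.
BILINGUAL_LEXICON = {  # Chinese word -> English translations
    "星": ["star"],
    "太阳": ["sun"],
    "明星": ["star", "celebrity"],
}
ENGLISH_THESAURUS = {  # English word -> thesaurus categories
    "star": ["CELESTIAL BODY", "CELEBRITY"],
    "sun": ["CELESTIAL BODY"],
    "celebrity": ["CELEBRITY"],
}
CHINESE_CORPUS = [  # each sentence is one co-occurrence window
    ["星", "光"],
    ["太阳", "光"],
    ["明星", "电影"],
]

def candidate_senses(chinese_word):
    """English categories reachable through any translation of the word."""
    senses = []
    for english in BILINGUAL_LEXICON.get(chinese_word, []):
        for concept in ENGLISH_THESAURUS.get(english, []):
            if concept not in senses:
                senses.append(concept)
    return senses

def cross_lingual_base_wccm(corpus):
    """m[w_ch][c_en]: times w_ch co-occurs with a word having c_en as a candidate sense."""
    m = defaultdict(Counter)
    for window in corpus:
        for a, word in enumerate(window):
            for b, other in enumerate(window):
                if a != b:
                    for concept in candidate_senses(other):
                        m[word][concept] += 1
    return m

m = cross_lingual_base_wccm(CHINESE_CORPUS)
print(candidate_senses("明星"))
print(dict(m["光"]))
```

The rows are Chinese words and the columns are English categories, so the resulting counts can feed the same probability estimates as in the monolingual case.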
• Placed first among unsupervised systems in the Chinese–English Task.
• Only about 1 percentage point behind the best in the English Lexical Sample Task.
• Cross-lingual DPCs can help automatic machine translation.
• DPCs create simple yet powerful baselines for WSD.
See how cross-lingual DPCs can be used to obtain state-of-the-art semantic distance accuracies in a resource-poor language using a knowledge source from a resource-rich one.
Come to EMNLP’s Friday morning session.