Crowdsourcing Synset Relations with Genus-Species-Match Dmitry Ustalov
IMM UB RAS / UrFU
Outline • Introduction • Related Work • Problem • Approach • Experiments • Results • Conclusion
Introduction • A thesaurus is a critical resource for successful NLP and AI applications.
• No open source thesaurus for Russian.
• Yet Another RussNet, started in 2013, aims to create one. • Crowdsourcing is used. • The data are noisy!
http://russianword.net/en/
Introduction: Synset Editor
http://russianword.net/editor
Related Work: Relations • Hearst (1992) proposed a set of patterns for extracting hyponymy. • Yang & Powers (2008) used syntactic parsing for a similar purpose. • Giuliano et al. (2010) constructed a thesaurus by leveraging Wikipedia. • Sarasua et al. (2012) created CrowdMap for aligning knowledge resources.
Related Work: Workflows • Soylent, a Word plugin for improving text (Bernstein et al., 2010). • CrowdForge, a crowdsourced MapReduce (Kittur et al., 2011). • TWSI, sense inventory induction (Biemann, 2013). • CrowdCleaner, multi-version data cleansing (Tong et al., 2014).
Related Work: Frameworks • Dawid & Skene (1979) proposed an EM algorithm for quality analysis. • GLAD model for labels, answers, and difficulties (Whitehill et al., 2009). • ZenCrowd for answer aggregation and worker ranking (Demartini et al., 2013). • Karger et al. (2014) proposed an iterative algorithm for answer aggregation.
Problem • Let a thesaurus be composed of synsets, i.e., sets of quasi-synonyms. {clothes, wear, vesture, …}
• We have the is-a relations between word pairs extracted from Wiktionary. (animal, cat)
• How can is-a relations between the synsets be established?
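To make the problem concrete, here is a minimal Python sketch of the two inputs. The synset contents (apart from the clothes example), the second sense of "cat", and the helper `synsets_containing` are illustrative assumptions, not data or code from the project.

```python
# Toy data: illustrative only, not taken from Yet Another RussNet or Wiktionary.
synsets = [
    frozenset({"clothes", "wear", "vesture"}),
    frozenset({"cat", "feline"}),
    frozenset({"cat", "caterpillar tractor"}),  # a hypothetical second sense of "cat"
    frozenset({"animal", "beast", "creature"}),
]

# Word-level is-a relations stored as (genus, species) pairs.
relations = [("animal", "cat")]


def synsets_containing(word, synsets):
    """Return every synset that lists the given word."""
    return [s for s in synsets if word in s]


for genus, species in relations:
    genus_candidates = synsets_containing(genus, synsets)
    species_candidates = synsets_containing(species, synsets)
    # "cat" belongs to two synsets, so the word pair alone does not say
    # which synset pair should carry the is-a relation.
    print(genus, len(genus_candidates), species, len(species_candidates))
```

The ambiguity on the species side is exactly what has to be resolved before a word-level relation can be lifted to the synset level.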
Approach • Let us be pragmatic. • Do not try to build such relations for all the synsets. • Instead, stick to a particular domain and build the relations there. • Sure, a domain dictionary will be required (it is fine).
Approach: Genus-Species-Match Given a set of synsets, a set of word relations and a domain dictionary, annotate the synset relations.
Approach: Genus Given a genus-species pair $(g, s) \in R$ and a synset $\vec{s} \in S_g$, a worker has to confirm whether $\vec{s}$ represents the genus of $s$.
Approach: Species Given a genus-species pair $(g, s) \in R$ and a synset $\vec{s} \in S_s$, a worker has to confirm whether $\vec{s}$ represents the species of $g$.
Approach: Match Given a pair of synsets $(s_g, s_s)$, a worker has to confirm whether or not this pair represents a reasonable is-a relation.
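The three stages can be read as a small pipeline: candidate synsets for the genus and for the species are filtered by worker confirmation, and the surviving synsets are paired for the final check. The sketch below is only an assumption about how such task generation could look. The function names, the candidate rule (a synset is a candidate if it contains the word), and the toy data are hypothetical and are not taken from the actual implementation.

```python
from itertools import product


def stage_tasks(pairs, synsets, position):
    """Build Genus (position=0) or Species (position=1) tasks.

    Each task is one (word pair, candidate synset) combination.
    Assumption: a synset is a candidate if it contains the word in question.
    """
    return [
        (pair, synset)
        for pair in pairs
        for synset in synsets
        if pair[position] in synset
    ]


def match_tasks(pair, confirmed_genus, confirmed_species):
    """Pair every confirmed genus synset with every confirmed species synset."""
    return list(product(confirmed_genus[pair], confirmed_species[pair]))


# Toy data, illustrative only.
pairs = [("animal", "cat")]
synsets = [
    frozenset({"animal", "beast"}),
    frozenset({"cat", "feline"}),
    frozenset({"cat", "caterpillar tractor"}),
]

genus = stage_tasks(pairs, synsets, position=0)
species = stage_tasks(pairs, synsets, position=1)

# Suppose the workers confirmed the only genus candidate
# and the first of the two species candidates.
confirmed_genus = {pairs[0]: [genus[0][1]]}
confirmed_species = {pairs[0]: [species[0][1]]}

print(match_tasks(pairs[0], confirmed_genus, confirmed_species))
```

Filtering before pairing keeps the Match stage small: only synsets that survived the first two stages are combined.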
Approach: Implementation
Experiments • The emergency management domain has been chosen. • The EMERCOM dictionary has been used.
• After the preparations:
• 383 genus-species word pairs, • 1438 synsets for the “Genus” stage, • 833 synsets for the “Species” stage.
Experiments: The Platform • TurboText, an online copywriting marketplace, has been chosen. • It supports microtasks!
Experiments: The Process • TurboText only allows the requester to write the task description and to accept or reject the answers. • Thus, the requester needs their own infrastructure. • Mechanical Tsar, an open source crowdsourcing engine, has been used (Ustalov, 2015). • http://mtsar.nlpub.org/
Experiments: Setup • As of October 1, 2015, 5 RUR = $0.08. • The tasks have been provided in batches.
Experiments: TurboText
Experiments: The Annotation
[Figure: plot for the Genus, Species, and Match stages; horizontal axis in minutes, from 0 to 264; labeled values 7357, 4205, and 1494.]
Experiments: Aggregation • “Genus” and “Species” have been aggregated using majority voting. • “Match” has been aggregated by: • majority voting, • “KOS” (Karger et al., 2014), • ZenCrowd (Demartini et al., 2013).
• The annotation resulted in 287 final judgements.
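Majority voting, the simplest of the aggregation methods listed above, can be sketched in a few lines of Python. The answer format (task, worker, label triples), the function name, and the tie-breaking rule are assumptions made for this illustration. KOS and ZenCrowd are iterative methods and are not shown.

```python
from collections import Counter


def majority_vote(answers):
    """Aggregate (task_id, worker_id, label) triples by majority voting.

    Ties go to the label that was seen first for the task; this
    tie-breaking rule is an assumption made for the sketch.
    """
    votes = {}
    for task_id, _worker_id, label in answers:
        votes.setdefault(task_id, Counter())[label] += 1
    return {
        task_id: counts.most_common(1)[0][0]
        for task_id, counts in votes.items()
    }


# Toy answers, illustrative only.
answers = [
    ("t1", "w1", "yes"), ("t1", "w2", "yes"), ("t1", "w3", "no"),
    ("t2", "w1", "no"), ("t2", "w2", "no"), ("t2", "w3", "yes"),
]

print(majority_vote(answers))
```

Majority voting treats all workers as equally reliable, whereas KOS and ZenCrowd also estimate how reliable each worker is.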
Results An expert has annotated the same result set providing us with a gold standard.
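One common way to compare aggregated judgements with a gold standard is to compute precision, recall, and F1 for the positive class. The choice of these measures, the function name, and the toy labels below are assumptions made for illustration. They are not results from the experiment.

```python
def precision_recall_f1(predicted, gold, positive=True):
    """Score predicted labels against gold labels for one positive class.

    Both arguments map an item identifier to a label. Items missing
    from `predicted` count as not positive.
    """
    tp = sum(1 for k in gold if predicted.get(k) == positive and gold[k] == positive)
    fp = sum(1 for k in gold if predicted.get(k) == positive and gold[k] != positive)
    fn = sum(1 for k in gold if predicted.get(k) != positive and gold[k] == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, illustrative only.
gold = {"a": True, "b": True, "c": False, "d": False}
predicted = {"a": True, "b": False, "c": True, "d": False}

print(precision_recall_f1(predicted, gold))
```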
Results: Error Analysis • The workers made mistakes by incorrectly interpreting the lexical sense. • The synsets mapped to the relations have meanings that are either too broad or too narrow. • The matched synsets or relations inherit nonsense. Cleansing is needed.
Results: Feedback
Conclusion • Genus-Species-Match disambiguates the synsets and establishes the is-a relations between them. • Several directions for future work: • establishing holonymy/meronymy, • budget allocation algorithms, • data cleansing, • better task design?
Conclusion: Final Remarks • This is the first attempt to recruit paid crowd workers from an online labor marketplace for annotating a language resource made in Russia. (to the best of our knowledge)
• The results are published:
http://ustalov.imm.uran.ru/pub/gsmainl.tar.gz.
Thanks! Dmitry Ustalov, IMM UB RAS / UrFU. • http://ustalov.imm.uran.ru/ •
[email protected] •
[email protected]
The present work is supported by the Russian Foundation for the Humanities, project № 13-04-12020, and by the Mikhail Prokhorov Foundation.