Crowdsourcing Synset Relations with Genus-Species-Match

Dmitry Ustalov

IMM UB RAS / UrFU

Outline • Introduction • Related Work • Problem • Approach • Experiments • Results • Conclusion

Introduction • A thesaurus is a critical resource for successful NLP and AI applications.

• No open source thesaurus for Russian.

• Yet Another RussNet, started in 2013, aims to create one. • Crowdsourcing is used. • The data are noisy!

http://russianword.net/en/

Introduction: Synset Editor

http://russianword.net/editor


Related Work: Relations • Hearst (1992) proposed a set of patterns for extracting hyponymy. • Yang & Powers (2008) used syntactic parsing for a similar purpose. • Giuliano et al. (2010) constructed a thesaurus by leveraging Wikipedia. • Sarasua et al. (2012) created CrowdMap for aligning knowledge resources.

Related Work: Workflows • Soylent, a Word plugin for improving text (Bernstein et al., 2010). • CrowdForge, a crowdsourced MapReduce (Kittur et al., 2011). • TWSI, sense inventory induction (Biemann, 2013). • CrowdCleaner, multi-version data cleansing (Tong et al., 2014).

Related Work: Frameworks • Dawid & Skene (1979) proposed an EM algorithm for quality analysis. • GLAD model for labels, answers, and difficulties (Whitehill et al., 2009). • ZenCrowd for answer aggregation and worker ranking (Demartini et al., 2013). • Karger et al. (2014) proposed an iterative algorithm for answer aggregation.

Problem • Let a thesaurus be composed of synsets, i.e. sets of quasi-synonyms. {clothes, wear, vesture, …}

• We have the is-a relations between word pairs extracted from Wiktionary. (animal, cat)

• How to establish is-a relations between the synsets?

Approach • Let us be pragmatic. • Do not try to build such relations for all the synsets. • Instead, stick to a particular domain and build the relations there. • Sure, a domain dictionary will be required (it is fine).

Approach: Genus-Species-Match Given a set of synsets, a set of word relations and a domain dictionary, annotate the synset relations.


Approach: Genus Given a genus-species pair $(g, s) \in R$ and a synset $\vec{s} \in S_g$, a worker has to confirm whether $\vec{s}$ represents the genus of $s$.


Approach: Species Given a genus-species pair $(g, s) \in R$ and a synset $\vec{s} \in S_s$, a worker has to confirm whether $\vec{s}$ represents the species of $g$.


Approach: Match Given a pair of synsets $(s_g, s_s)$, a worker has to confirm whether or not this pair represents a reasonable is-a relation.
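The three stages can be sketched as task generators. This is only an illustration under stated assumptions, not the implementation used in the study: the slides do not say how candidate synsets are selected, so the sketch assumes a synset is a candidate for a word when it contains that word, and that the relations have already been restricted to the domain dictionary. All names (`StageTask`, `MatchTask`, `candidates`, and so on) are hypothetical.

```python
from itertools import product
from typing import Dict, FrozenSet, Iterable, List, NamedTuple, Tuple

Synset = FrozenSet[str]
Relation = Tuple[str, str]  # (genus word, species word)


class StageTask(NamedTuple):
    stage: str         # "genus" or "species"
    relation: Relation
    synset: Synset     # the synset shown to the worker


class MatchTask(NamedTuple):
    relation: Relation
    genus_synset: Synset
    species_synset: Synset


def candidates(word: str, synsets: Iterable[Synset]) -> List[Synset]:
    # Assumption: a synset is a candidate for a word if it contains the word.
    return [syn for syn in synsets if word in syn]


def genus_tasks(relations: Iterable[Relation],
                synsets: Iterable[Synset]) -> List[StageTask]:
    # One yes/no question per (relation, candidate genus synset).
    synsets = list(synsets)
    return [StageTask("genus", (g, s), syn)
            for g, s in relations
            for syn in candidates(g, synsets)]


def species_tasks(relations: Iterable[Relation],
                  synsets: Iterable[Synset]) -> List[StageTask]:
    # One yes/no question per (relation, candidate species synset).
    synsets = list(synsets)
    return [StageTask("species", (g, s), syn)
            for g, s in relations
            for syn in candidates(s, synsets)]


def match_tasks(confirmed_genus: Dict[Relation, List[Synset]],
                confirmed_species: Dict[Relation, List[Synset]]) -> List[MatchTask]:
    # Assumption: every confirmed genus synset is paired with every
    # confirmed species synset of the same word relation.
    tasks = []
    for rel, genus_synsets in confirmed_genus.items():
        for g_syn, s_syn in product(genus_synsets, confirmed_species.get(rel, [])):
            tasks.append(MatchTask(rel, g_syn, s_syn))
    return tasks
```

For the relation (animal, cat), the "Species" stage would then ask about every synset containing "cat", and the "Match" stage would only see the pairs whose two sides were both confirmed.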


Approach: Implementation


Experiments • The emergency management domain has been chosen. • The EMERCOM dictionary has been used.

• After the preparations:

• 383 genus-species word pairs, • 1438 synsets for the “Genus” stage, • 833 synsets for the “Species” stage.

Experiments: The Platform • TurboText, an online copywriting marketplace, has been chosen. • It supports microtasks!


Experiments: The Process • TurboText only allows the requester to write the task description and to accept or reject the answers. • Thus, the requester needs their own infrastructure. • Mechanical Tsar, an open source crowdsourcing engine, has been used (Ustalov, 2015). • http://mtsar.nlpub.org/


Experiments: Setup • As of October 1, 2015, 5 RUR = $0.08. • The tasks have been provided in batches.


Experiments: TurboText


Experiments: The Annotation

[Figure: annotation progress over time in minutes (0 to 264) for the Genus, Species, and Match stages; the vertical axis marks 1494, 4205, and 7357.]

Experiments: Aggregation • “Genus” and “Species” have been aggregated using majority voting. • “Match” has been aggregated by: • majority voting, • “KOS” (Karger et al., 2014), • ZenCrowd (Demartini et al., 2013).

• The annotation resulted in 287 final judgements.
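Majority voting, used here for “Genus” and “Species” and as one of the options for “Match”, can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the answer format `(task_id, worker_id, label)` and the choice to return `None` on a tie are assumptions, and the “KOS” and ZenCrowd methods are not covered.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Aggregate (task_id, worker_id, label) triples into one label per task.

    Assumption: a tie between the two most frequent labels yields None,
    leaving the decision to the requester.
    """
    votes = defaultdict(Counter)
    for task_id, _worker_id, label in answers:
        votes[task_id][label] += 1

    result = {}
    for task_id, counter in votes.items():
        ranked = counter.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result[task_id] = None  # tie: no majority label
        else:
            result[task_id] = ranked[0][0]
    return result


# Toy answers, invented for illustration only.
answers = [
    ("t1", "w1", "yes"), ("t1", "w2", "yes"), ("t1", "w3", "no"),
    ("t2", "w1", "yes"), ("t2", "w2", "no"),
]
aggregated = majority_vote(answers)
```

With an odd number of workers per task and binary labels, ties cannot occur, which is one reason to assign tasks to three or five workers.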

Results An expert has annotated the same result set providing us with a gold standard.
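Comparing aggregated judgements with such a gold standard can be sketched as below. The slides do not state which quality measures were reported, so accuracy, precision, and recall over a binary "yes"/"no" label are assumptions chosen for illustration, and the data are toy values rather than results from the study.

```python
def evaluate(predicted, gold, positive="yes"):
    """Compare aggregated labels with expert labels, task by task.

    Any predicted value other than `positive` (including a missing or
    None label) is counted as a negative prediction.
    """
    tp = fp = fn = tn = 0
    for task_id, truth in gold.items():
        guess_positive = predicted.get(task_id) == positive
        truth_positive = truth == positive
        if guess_positive and truth_positive:
            tp += 1
        elif guess_positive:
            fp += 1
        elif truth_positive:
            fn += 1
        else:
            tn += 1

    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels, invented for illustration only.
predicted = {"a": "yes", "b": "yes", "c": "no", "d": "no"}
gold = {"a": "yes", "b": "no", "c": "yes", "d": "no"}
scores = evaluate(predicted, gold)
```

Running the same function on the output of each aggregation method gives directly comparable numbers for majority voting, “KOS”, and ZenCrowd.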


Results: Error Analysis • The workers made mistakes by incorrectly interpreting the lexical sense. • The synsets mapped to the relations have either too broad or too narrow meanings. • The matched synsets or relations inherit nonsense. Cleansing is needed.

Results: Feedback


Conclusion • Genus-Species-Match disambiguates the synsets and establishes the is-a relations between them. • Several directions for future work: • establishing holonymy/meronymy, • budget allocation algorithms, • data cleansing, • better task design?


Conclusion: Final Remarks • To the best of our knowledge, this is the first attempt to recruit paid crowd workers from an online labor marketplace for annotating a language resource made in Russia.

• The results are published:

http://ustalov.imm.uran.ru/pub/gsmainl.tar.gz

Thanks! Dmitry Ustalov, IMM UB RAS / UrFU. • http://ustalov.imm.uran.ru/

The present work is supported by the Russian Foundation for the Humanities, project № 13-04-12020, and by the Mikhail Prokhorov Foundation.