Experiments with Semantic Similarity Measures based on LDA and LSA Nobal Niraula, Rajendra Banjade, Dan Stefanescu, and Vasile Rus
[email protected] The University of Memphis
Semantic Similarity • A practical approach to language understanding • Basic idea – Compare a target text with a benchmark/expert text whose meaning is known
• Semantic similarity vs. True understanding – True understanding is intractable and scales poorly
Current Study • How does Latent Dirichlet Allocation (LDA) compare to Latent Semantic Analysis (LSA)? • LDA – Probabilistic method – Latent topics – Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993-1022. • LSA – Algebraic method – Latent concepts – Landauer, T., McNamara, D. S., Dennis, S., & Kintsch, W. (2007). Handbook of Latent Semantic Analysis. Mahwah, NJ: Erlbaum.
Why LDA and LSA? • We have used LSA for more than a decade • LDA and LSA are fully automated – All you need is a large collection of texts
Practical purpose • To promote Deep Learning … – By humans
• Understand student utterances in dialogue-based Intelligent Tutoring Systems
DeepTutor
www.deeptutor.org
DeepTutor • Promoting deep learning of science topics through quality instruction and interaction – Deep Natural Language Processing • For now, just advanced semantic similarity methods • In the long run: deep, true understanding, e.g. using semantic nets
– Funded by the Institute of Education Sciences
• Started in 2009 • Dialogue-based • First Intelligent Tutoring System based on Learning Progressions
DeepTutor • Topic: Conceptual Physics • Target Population: – High school students (335 students; d=0.908; online unsupervised learning) – Non-science majors in college (30 students; d=0.786)
Semantic Similarity: LSA vs LDA • LSA vs. LDA (vs. ESA vs. etc.) – Fully automated – LSA: Latent concepts • concept = abstract dimension in a reduced-dimensionality abstract space • Word = vector/point in the latent space – All senses of a word map to the same vector
– LDA: latent topics • Topic = distribution over words • Word contributes to many topics – A topic may encode a particular word sense
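The contrast between the two representations can be sketched with gensim; the toy corpus, tiny dimension counts, and model settings below are assumptions purely for illustration.

```python
# Minimal sketch: LSA gives each word one vector in a reduced space,
# while LDA gives each word a set of contributions to latent topics.
from gensim import corpora, models

docs = [["force", "acceleration", "mass"],
        ["velocity", "constant", "force"],
        ["net", "force", "zero", "acceleration"]]          # assumed toy corpus
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# LSA: each word becomes one dense vector (all senses collapse into it).
lsa = models.LsiModel(bow, id2word=dictionary, num_topics=2)
force_vector = lsa.projection.u[dictionary.token2id["force"]]

# LDA: each topic is a distribution over words; a word may contribute to many topics.
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2, passes=10, random_state=0)
force_topics = lda.get_term_topics(dictionary.token2id["force"], minimum_probability=0.0)

print(force_vector)   # one point in the latent LSA space
print(force_topics)   # list of (topic_id, contribution) pairs
```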
LDA: Latent Dirichlet Allocation
[Figure: illustration of example LDA topics]
Blei, D.M., Ng, A.Y., & Jordan, M.I. 2003. Latent dirichlet allocation, The Journal of Machine Learning Research 3, 993-1022.
Semantic Similarity between Short Texts • From Microsoft Research Paraphrase corpus (Dolan, Quirk, & Brockett, 2004) – Text A: York had no problem with MTA’s insisting the decision to shift funds had been within its legal rights. – Text B: York had no problem with MTA’s saying the decision to shift funds was within its powers.
• ITS data: – Student Response: An object that has a zero force acting on it will have zero acceleration. – Expert Answer: If an object moves with a constant velocity, the net force on the object is zero.
W2W versus T2T Similarity Measures • W2W (word-to-word) – LSA: cosine between corresponding vectors – LDA: similarity of word contributions to all topics T in the LDA model
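To make the two word-to-word measures concrete, here is a minimal sketch; the input vectors and topic distributions are toy values, and 1 minus the Hellinger distance is just one plausible way to compare two words' topic contributions (the slide does not fix the exact function).

```python
# Sketch of word-to-word (W2W) similarity. The LSA vectors and LDA per-word
# topic distributions below are assumed toy values, not real model output.
import numpy as np

def lsa_w2w(u, v):
    """LSA: cosine between the two words' vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def lda_w2w(p, q):
    """LDA: compare the words' contributions across all topics; here via
    1 - Hellinger distance between the two topic distributions (one possible choice)."""
    return 1.0 - float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

print(lsa_w2w(np.array([0.2, 0.7, 0.1]), np.array([0.3, 0.6, 0.2])))
print(lda_w2w(np.array([0.8, 0.1, 0.1]), np.array([0.6, 0.3, 0.1])))
```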
• T2T (text-to-text) – Direct methods • LSA – Step 1: obtain an LSA vector for the text by summing individual words’ vectors – Step 2: cosine between text vectors
• LDA – Step 1: Similarity of documents (distributions over topics) – Step 2: Similarity of topics (distributions over words) – Step 3: Take the product
– Expanding W2W measures • Greedy • Optimal (Hungarian algorithm or Kuhn-Munkres algorithm; Munkres, 1957)
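The text-to-text strategies above can be sketched as follows; this is a minimal illustration, not the study's exact formulas: word_vec and w2w are assumed inputs, and simple averaging is one choice of normalization.

```python
# Sketch of text-to-text (T2T) similarity. word_vec maps a word to its LSA vector;
# w2w(a, b) is any word-to-word similarity (e.g., lsa_w2w above).
import numpy as np
from scipy.optimize import linear_sum_assignment

def lsa_t2t_direct(text_a, text_b, word_vec):
    """Direct LSA method: sum the words' vectors, then take the cosine."""
    va = np.sum([word_vec[w] for w in text_a if w in word_vec], axis=0)
    vb = np.sum([word_vec[w] for w in text_b if w in word_vec], axis=0)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

def t2t_greedy(text_a, text_b, w2w):
    """Greedy expansion: pair each word in A with its most similar word in B."""
    return float(np.mean([max(w2w(a, b) for b in text_b) for a in text_a]))

def t2t_optimal(text_a, text_b, w2w):
    """Optimal one-to-one alignment via the Hungarian (Kuhn-Munkres) algorithm."""
    cost = np.array([[-w2w(a, b) for b in text_b] for a in text_a])
    rows, cols = linear_sum_assignment(cost)   # minimizing cost = maximizing similarity
    return float(-cost[rows, cols].mean())
```

With an LDA-based w2w these correspond to the LDA-Greedy and LDA-Optimal variants in the results, and with an LSA-based w2w to LSA-Greedy and LSA-Optimal.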
Experiments • Several LSA and LDA models • W2W and T2T methods • 2 datasets – Microsoft Research Paraphrase corpus • Sentences extracted from news articles
– User Language Paraphrase corpus • Sentences from student-computer tutor interactions
LDA vs. LSA Models • LSA model – Typically derived using N=300 latent dimensions/concepts
• LDA model – N=300 topics; to compare with same number of latent concepts used in LSA – Optimized number of topics (N=100) • Maximizing topic coherence based on Point-wise Mutual Information (PMI) derived from Wikipedia
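A sketch of PMI-based topic coherence follows; the document-frequency counts are placeholders standing in for statistics gathered from Wikipedia, and the exact coherence formula used in the study may differ in its details.

```python
# Sketch: average PMI over pairs of a topic's top words, as a coherence score.
import math
from itertools import combinations

def topic_pmi_coherence(top_words, doc_freq, pair_freq, n_docs):
    """Average pointwise mutual information over all pairs of the topic's top words."""
    pmis = []
    for w1, w2 in combinations(top_words, 2):
        p1 = doc_freq[w1] / n_docs
        p2 = doc_freq[w2] / n_docs
        p12 = pair_freq.get((w1, w2), 0) / n_docs
        if p12 > 0:
            pmis.append(math.log(p12 / (p1 * p2)))
    return sum(pmis) / len(pmis) if pmis else 0.0

# Assumed placeholder counts from a reference corpus.
doc_freq = {"force": 900, "mass": 800, "acceleration": 500}
pair_freq = {("force", "mass"): 400, ("force", "acceleration"): 300,
             ("mass", "acceleration"): 200}
print(topic_pmi_coherence(["force", "mass", "acceleration"], doc_freq, pair_freq, 10000))
```

The number of topics is then chosen as the value that maximizes average coherence across topics.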
Results (each cell: Accuracy / Kappa / F-measure)

Method        | MSRP, T=300       | MSRP, T=100       | ULPC, T=300       | ULPC, T=100
LDA-IR        | 71.17/16.17/81.94 | 68.24/3.09/80.92  | 67.47/4.52/79.87  | 67.01/3.15/79.98
LDA-Hellinger | 71.32/18.85/81.75 | 68.24/2.46/80.99  | 67.36/4.39/79.73  | 67.18/3.50/80.04
LDA-Manhattan | 71.07/10.10/82.50 | 71.21/23.41/81.16 | 66.78/3.56/79.91  | 67.18/4.04/80.04
LDA-Greedy    | 77.32/34.40/85.75 | 76.85/37.89/84.94 | 73.04/35.01/81.31 | 73.10/34.27/81.32
LDA-Optimal   | 76.97/36.96/85.06 | 75.96/36.75/84.14 | 73.27/36.74/80.71 | 73.15/36.86/80.71
LSA-Greedy    | 77.22/33.82/85.73 | same              | 72.86/33.89/81.11 | same
LSA-Optimal   | 77.12/36.80/85.24 | same              | 73.04/35.95/80.80 | same
LSA           | 77.47/37.54/85.50 | same              | 73.56/34.61/81.83 | same

(T = number of LDA topics; "same" means identical to the T=300 column, since the LSA models do not depend on T.)
Conclusions • W2W LDA-based measures perform well and are comparable to their LSA counterparts • The direct T2T LDA-based measures have limitations for short texts
SEMILAR: A Semantic Similarity Toolkit • It implements a number of semantic similarity methods – Preprocessing – Overlap (n-gram) – WordNet-based similarity measures – LDA – LSA – Syntax – Negation – Etc.
• Library available to download • GUI-based application to be released soon • DEMO at ACL-2013
www.semanticsimilarity.org
Acknowledgments • This research has been supported in part by the Institute of Education Sciences under award R305A100875
Thank You!
www.deeptutor.org