Query Spelling Correction Using Multi-task Learning
WWW 2012 – Poster Presentation

April 16–20, 2012, Lyon, France

Xu Sun
Dept. of Statistical Science, Cornell University, Ithaca, NY 14853
[email protected]

Anshumali Shrivastava
Dept. of Computer Science, Cornell University, Ithaca, NY 14853
[email protected]

Ping Li
Dept. of Statistical Science, Cornell University, Ithaca, NY 14853
[email protected]

ABSTRACT

This paper explores the use of online multi-task learning for search query spelling correction, effectively transferring information across different and biased training datasets to improve spelling correction on all of them. Experiments were conducted on three query spelling correction datasets, including the well-known TREC benchmark data. Our experimental results demonstrate that the proposed method considerably outperforms existing baseline systems in terms of accuracy. Importantly, the proposed method is about one order of magnitude faster than baseline systems in terms of training speed: whereas existing methods typically require more than 50 training passes, our algorithm can closely approach the empirical optimum in around five passes.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Query Formulation

General Terms
Algorithms, Experimentation, Performance

Keywords
Query Spelling Correction, Multi-task Learning

1. INTRODUCTION
Query spelling correction is important and has received much attention [1, 5, 4, 3], and a variety of datasets (from different domains) have been developed for this task. As discussed in [2], queries in one dataset can be heavily biased relative to those in another, and combining such biased datasets is in general difficult, because (i) their error patterns can be highly biased; (ii) their distributions of misspelled queries can be quite different; and (iii) their domains can be quite different (hence, a domain adaptation problem exists). The natural question is: how can such highly biased datasets be effectively integrated to improve query spelling correction?

In this paper, we propose a multi-task learning framework for query spelling correction. Our method adaptively integrates highly biased training datasets by automatically learning the task relationships (data similarities). As a result, it is capable of "softly" merging biased datasets to learn a unified speller. In our approach, we first generate a large number of spelling correction candidates via a candidate generation module, and then use multi-task learning to accurately re-rank the top candidates.

2. METHODOLOGY
The first step is to generate correction candidates. Following Ganjisaffar et al. [2], we implemented candidate generators that produce about 2,000 candidates for each query. We then perform a naive ranking of the candidates via the Web n-gram service provided by Microsoft. A minimal sketch of this two-stage pipeline is given below.
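The following sketch illustrates the generate-then-rank structure under stated assumptions; it is not the authors' implementation. In particular, `lexicon` stands in for a trusted term list and `ngram_logprob` stands in for the Microsoft Web n-gram service, and real generators use richer edit operations than the edit-distance-1 neighbors shown here.

import itertools

def term_candidates(term, lexicon, max_per_term=10):
    # Lexicon terms within edit distance 1 of `term`, plus `term` itself.
    def within_edit1(a, b):
        la, lb = len(a), len(b)
        if la == lb:                      # exactly one substitution
            return sum(x != y for x, y in zip(a, b)) == 1
        if abs(la - lb) != 1:
            return False
        short, long_ = (a, b) if la < lb else (b, a)
        return any(long_[:i] + long_[i + 1:] == short
                   for i in range(len(long_)))
    cands = [w for w in lexicon if within_edit1(term, w)]
    return ([term] + cands)[:max_per_term]

def generate_and_rank(query, lexicon, ngram_logprob, max_candidates=2000):
    # Cartesian product of per-term candidates, capped at about 2,000
    # per query as in Section 2, then naively ranked by LM score.
    per_term = [term_candidates(t, lexicon) for t in query.split()]
    pool = itertools.islice(itertools.product(*per_term), max_candidates)
    scored = ((" ".join(c), ngram_logprob(" ".join(c))) for c in pool)
    return sorted(scored, key=lambda pair: -pair[1])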

2.1 Re-Ranking
On top of the naive ranker, a re-ranker based on the Maximum Entropy model can significantly improve the results of the naive ranker. Assuming a feature function that maps a pair of an observation sequence x and a classification label y to a feature vector f(y, x), the probability is modeled as

    P(y \mid x, w) = \frac{\exp\{w^\top f(y, x)\}}{\sum_{\forall y'} \exp\{w^\top f(y', x)\}},    (1)

where w is a weight vector. Given a training set (x_i, y_i) for i = 1, ..., n, the weights are obtained by maximizing the objective function L(w) = \sum_{i=1}^{n} \log P(y_i \mid x_i, w) - R(w), where the term R(w) is an L2 regularizer in this study.
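A minimal sketch of this re-ranker, assuming each candidate has already been mapped to a feature vector (the matrix `F` below, with one row per candidate, is an illustrative name, not the authors' code). The gradient is the standard observed-minus-expected feature count for Eq. (1), plus the L2 term from R(w).

import numpy as np

def candidate_probs(w, F):
    # P(y | x, w) over candidates; row y of F holds f(y, x) in Eq. (1).
    scores = F @ w
    scores -= scores.max()            # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def sgd_step(w, F, gold, lr=0.1, l2=1e-3):
    # One stochastic step on log P(gold | x, w) - (l2 / 2) * ||w||^2;
    # gradient = observed features - expected features - l2 * w.
    p = candidate_probs(w, F)
    grad = F[gold] - p @ F - l2 * w
    return w + lr * grad

def rerank(w, F):
    # Return candidate indices sorted by model probability.
    return np.argsort(-candidate_probs(w, F))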

2.2 Features for Re-Ranking
In addition to the features based on transformations between query-candidate pairs, we also design features from the top candidates. In total, we use 30 feature templates for each query-candidate pair, including: error model features (e.g., edit distance), candidate language model features, query language model features, surface features capturing differences between the query and the candidate, frequencies of the query and the candidate, the rank of the candidate in the naive ranking phase, and so on. A sketch of a few of these feature families follows.
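The paper does not enumerate all 30 templates, so this sketch only illustrates a handful of the named families; `lm_logprob` and `freq` are hypothetical helpers standing in for the language-model and frequency lookups.

import difflib

def extract_features(query, candidate, naive_rank, lm_logprob, freq):
    # A few of the feature families from Section 2.2 (illustrative only).
    sim = difflib.SequenceMatcher(None, query, candidate).ratio()
    return {
        "edit_similarity": sim,                     # error-model family
        "cand_lm": lm_logprob(candidate),           # candidate LM family
        "query_lm": lm_logprob(query),              # query LM family
        "same_token_count": float(
            len(query.split()) == len(candidate.split())),  # surface family
        "query_freq": freq(query),                  # frequency family
        "cand_freq": freq(candidate),
        "naive_rank": float(naive_rank),            # rank from naive phase
        "is_identity": float(query == candidate),   # surface family
    }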

2.3 Multi-Task (Online) Learning (MTL)
Define N_q = {1, ..., q}, and let T be the number of tasks. For each task t \in N_T, there are n data examples {(x_{t,i}, y_{t,i}) : i \in N_n} available. D denotes the n \times T matrix whose t-th column is given by the vector d_t of data examples, and W denotes the f \times T matrix whose t-th column is given by the vector w_t. We learn W by maximizing the objective

    \mathrm{Obj}(W, D) = \sum_{t, t' \in N_T} \alpha_{t,t'} \Big\{ \sum_{i \in N_n} \ell_t(i, w_{t'}) \Big\} - \sum_{t \in N_T} \frac{\|w_t\|^2}{2\sigma_t^2},

where \alpha_{t,t'} is the task similarity, with \alpha_{t,t'} = \alpha_{t',t}. In our study, the similarity is automatically learned from the data. For online learning, we use the following update formula:

    w_t^{(k+1)} = w_t^{(k)} + \eta_t \cdot g_t,    (2)

where the update term g_t is derived by weighted sampling over the different tasks, and \eta_t is a positive vector-valued step size whose optimal value asymptotically approaches the (approximate) inverse Hessian matrix.
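A minimal sketch of one online pass under this objective, with simplifying assumptions: the similarity matrix `alpha` and step sizes `eta` are treated as given placeholders (the paper learns the similarities and approximates the inverse Hessian, neither of which is shown), and the per-example loss gradient reuses the MaxEnt gradient from Section 2.1.

import numpy as np

def grad_loss(w, F, gold):
    # Gradient of log P(gold | x, w) for one example (MaxEnt, Sec. 2.1).
    s = F @ w
    p = np.exp(s - s.max())
    p /= p.sum()
    return F[gold] - p @ F

def mtl_pass(W, tasks, alpha, eta, sigma2, rng):
    # One online pass of the Eq. (2) update. W: (f, T) weight matrix;
    # tasks[t]: list of (F, gold) examples; alpha: (T, T) nonnegative
    # task similarities; eta: (T, f) positive vector-valued step sizes.
    T = W.shape[1]
    for t in range(T):
        for _ in range(len(tasks[t])):
            # g_t: weighted sampling over tasks by similarity to task t
            t2 = rng.choice(T, p=alpha[t] / alpha[t].sum())
            F, gold = tasks[t2][rng.integers(len(tasks[t2]))]
            g = alpha[t, t2] * grad_loss(W[:, t], F, gold) \
                - W[:, t] / sigma2[t]       # L2 term from the objective
            W[:, t] += eta[t] * g
    return W

# Usage sketch: W = mtl_pass(W, tasks, alpha, eta, sigma2,
#                            np.random.default_rng(0))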

3. EXPERIMENTS
The TREC dataset is based on the TREC queries (2008 Million Query Track). It contains 5,892 queries, 5.3% of which are judged to be misspelled. We use two auxiliary datasets for multi-task learning. One is the JDB dataset [2], collected from the AOL query sets, with a total of 12,000 samples, of which 2,000 (16.7%) are misspelled. The other is collected from MSN queries, with about 11% of the queries misspelled. We evenly split each dataset into training and testing data. The baseline methods are the (well-known) stochastic gradient descent method in the single-task setting (SGD-Single) and the SGD method in the merged setting (SGD-Merge). The hyper-parameters of the multi-task learning method are tuned in preliminary experiments using cross validation. As in the 2011 Speller Challenge, we use the expected F-score as the evaluation metric.

The performance comparisons under the 1-pass and 5-pass settings are highlighted in Figure 1, where Single denotes the SGD-Single baseline and Merge denotes SGD-Merge. The proposed MTL method achieves substantially higher F-scores than the baselines. In addition, we examine the training process and the empirical convergence speed; the curves comparing MTL with the baselines are shown in Figure 2. The MTL method has the best empirical convergence speed. These results demonstrate that the proposed method achieves good accuracy with fast convergence.

[Figure 1: Comparison of F-scores (%) of the different methods (Single, Merge, MTL) under the 1-pass and 5-pass settings, on the MSN, JDB, and TREC datasets.]

[Figure 2: F-score (%) curves of the different methods (Single, Merged, MTL) versus the number of training passes, on the MSN, JDB, and TREC datasets.]
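For reference, here is a hedged sketch of the expected-F-score computation as we read the 2011 Speller Challenge metric; it is not code from the paper. `posteriors[q]` maps each returned correction of query q to its posterior probability (assumed to sum to 1 per query), and `gold[q]` is the set of human-judged plausible spellings.

def expected_f1(posteriors, gold):
    # Expected precision: probability mass placed on plausible spellings.
    n = len(gold)
    ep = sum(p for q in gold for c, p in posteriors[q].items()
             if c in gold[q]) / n
    # Expected recall: fraction of plausible spellings that are returned.
    er = sum(len(gold[q] & set(posteriors[q])) / len(gold[q])
             for q in gold) / n
    return 2 * ep * er / (ep + er) if ep + er > 0 else 0.0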

4. CONCLUSION
We demonstrated that multi-task learning can reliably integrate highly biased datasets for query spelling correction and may outperform existing baselines in terms of accuracy and speed. Since query spelling correction can involve very large-scale training data in industrial applications (e.g., query spelling logs containing millions of entries; see [4]), the proposed method's ability to approach the empirical optimum in only a few passes will be of critical importance in such large-scale settings.

5. ACKNOWLEDGMENTS
This work is partially supported by NSF (DMS-0808864, SES-1131848), ONR (YIP-N000140910911), and DARPA. Xu Sun was a Postdoctoral Associate supported by ONR and NSF. Anshumali Shrivastava is a Ph.D. student supported by ONR and NSF. The authors thank Yanen Li (author of [3]) for helpful discussions.

6. REFERENCES
[1] Brill, E., and Moore, R. C. An improved error model for noisy channel spelling correction. In ACL'00 (Hong Kong, October 2000), pp. 286–293.
[2] Ganjisaffar, Y., Zilio, A., Javanmardi, S., Cetindil, I., Sikka, M., Katumalla, S. P., Khatib-Astaneh, N., Li, C., and Lopes, C. qSpell: Spelling correction of web search queries using ranking models and iterative correction. In Spelling Alteration for Web Search Workshop (July 2011).
[3] Li, Y., Duan, H., and Zhai, C. CloudSpeller: Spelling correction for search queries by using a unified hidden Markov model with web-scale resources. In Spelling Alteration for Web Search Workshop (July 2011).
[4] Sun, X., Gao, J., Micol, D., and Quirk, C. Learning phrase-based spelling error models from clickthrough data. In ACL'10 (Uppsala, Sweden, July 2010), pp. 266–274.
[5] Toutanova, K., and Moore, R. Pronunciation modeling for improved spelling correction. In ACL'02 (Philadelphia, USA, 2002), pp. 144–151.
