Regularized query classification using search click information

Pattern Recognition 41 (2008) 2283–2288 www.elsevier.com/locate/pr

Xiaofei He∗, Pradhuman Jhala
Yahoo! Research Labs, 3333 Empire Avenue, Burbank, CA 91504, USA
Received 26 September 2006; received in revised form 1 October 2007; accepted 9 January 2008

Abstract

Hundreds of millions of users submit queries to Web search engines each day. User queries are typically very short, which makes query understanding a challenging problem. In this paper, we propose a novel approach to query representation and classification. By submitting the query to a web search engine, the query can be represented as a set of terms found on the web pages returned by the search engine. In this way, each query can be considered as a point in a high-dimensional space, and standard classification algorithms such as regression can be applied. However, traditional regression is too flexible in situations with large numbers of highly correlated predictor variables and may suffer from overfitting. By using search click information, the semantic relationship between queries can be incorporated into the learning system as a regularizer. Specifically, from all the functions which minimize the empirical loss on the labeled queries, we select the one which best preserves the semantic relationship between queries. We present experimental evidence suggesting that the regularized regression algorithm is able to use search click information effectively for query classification.
© 2008 Elsevier Ltd. All rights reserved.

Keywords: Query classification; Query representation; Web search; Regression; Regularization; User logs

1. Introduction

Due to the rapid growth of web-based applications, there is an increasing demand for effective and efficient methods for query understanding. Successfully assigning a category label to a user query would potentially help many applications such as web search, web-based advertisement, and recommendation systems. In this paper, we consider the problem of query representation and classification.

One of the major difficulties of query understanding is that user queries are often very short; a typical query has fewer than three terms [1]. If we represent a query by a term vector, as with documents, and use the inner product to evaluate the similarity of query pairs, most pairs would have zero similarity. For example, the queries "SVM" and "Support Vector Machine" have the same meaning, but they do not share any common terms. Moreover, even if two queries share some

∗ Corresponding author. Tel.: +1 626 243 3793.

E-mail addresses: [email protected] (X. He), [email protected] (P. Jhala).
doi:10.1016/j.patcog.2008.01.010

common terms, evaluating their similarity according to these terms may not accurately reflect their semantic relationship. Consider the two queries "Apple Juice" and "Apple Computer". They share the common term "Apple", yet they have totally different meanings.

In traditional information retrieval, techniques such as query expansion have been proposed to help users formulate better queries [2–5]. Query expansion supplements the original query with additional terms, aiming to formulate better queries that enhance web search performance. However, the expanded queries generated by these techniques may still fail to provide a context rich enough for a term-wise similarity measure to reflect intrinsic semantic relationships.

In this paper, we introduce a novel approach to query representation and classification. For each query, we submit it to a search engine and use the top returned web documents to represent the query, based on the assumption that the top returns are likely to be relevant to it. These web documents are combined to form a term vector with rich context. Thus, each query can be represented as a point in a high-dimensional space. With vector representations, query similarities can be evaluated by the inner product and standard

machine learning algorithms can be applied to query classification. In real-world query data sets, the number of categories ranges from tens to hundreds. For reasons of computational complexity, we adopt regression as our classification tool.

In general, the dimensionality of the query space is very high. When the ratio of dimensionality to sample size is too large, we cannot reliably find a regression function with good generalization capability, and overfitting may occur. To avoid overfitting, one can impose additional constraints on the learning system as a regularizer. One of the most popular approaches to regularization is dimensionality reduction, in which the intrinsic discriminative structure is captured by projecting the data onto a lower-dimensional basis; the predictors are then replaced by their basis coefficients. A drawback of this approach is that one has to supply a suitable basis, which tends to involve more subjectivity.

In this work, we incorporate a regularizer based on users' search click information. For each query, we can obtain the list of URLs that were clicked during searches for that query. If the same URL is clicked during searches for two different queries, the queries are likely to be related. Intuitively, the relatedness of two queries can be evaluated as the size of the intersection of their URL lists. Thus, we can build a graph that models the semantic relationship between queries. This graph can then be incorporated into the regression framework as a regularization term. Specifically, from all the functions which minimize the empirical loss on the labeled queries, we select the one which best preserves the semantic relationship between queries.

The paper is structured as follows: in Section 2, we describe how to represent a query with rich context by using a search engine. In Section 3, the regularized query classification algorithm is introduced.
The experimental results are presented in Section 4. Finally, we provide some concluding remarks in Section 5.

Table 1
Web search results of "Apple Juice" and "Apple Computer"

Apple Juice:
www.applejuice.org
www.thenandnow.net/fanlisting/applejuice
en.wikipedia.org/wiki/Apple_juice
www.appleproducts.org/recipes.html
importer.alibaba.com/buyeroffers/Apple_Juice.html
www.applejuice.org/recipes.html
www.martinellis.com
www.martinellis.com/Products/50oz_Unfiltered_AppleJuice.htm
www.webtender.com/db/ingred/461
www.appleproducts.org/nutritn.html
www.drinksmixer.com/desc68.html
www.soymilkquick.com/applejuice.html
www.ichef.com/recipe.cfm?task=display&itemid=71667&recipeid=71330
www.hormel.com/kitchen/glossary.asp?id=36065&catitemid=
www.amazon.com/exec/obidos/tg/detail/-/B00000K41T?v=glance

Apple Computer:
www.apple.com
store.apple.com
www.apple.com/support
en.wikipedia.org/wiki/Apple_Computer
quicktime.apple.com
asia.apple.com
www.info.apple.com
developer.apple.com
jobs.apple.com/cgi-bin/WebObjects/Employment.woa
store.apple.com/Catalog/US/Images/routingpage.html
asu.info.apple.com
www.apple.co.jp
www.appleclub.com.hk
www.apple-history.com
finance.yahoo.com/q?s=AAPL


2. Query representation based on web search

As described previously, one of the major difficulties of query understanding is how to represent a query. In this work, we propose to represent a query by using web search results. The procedure is outlined below:

• Submit query q to a search engine. Let (p1, p2, ..., pn) be the top n web pages.
• For each pi, i = 1, ..., n, extract words and compute their frequencies. Keep the top m words with the highest frequencies. Each pi is thus represented as a term vector.
• Normalize each pi such that ‖pi‖ = 1, and add them together: p̄ = Σ_{i=1}^{n} pi.
• Finally, query q is represented by the vector q = p̄/‖p̄‖.
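The steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: term extraction is simplified to whitespace tokenization, and fetching pages from a search engine is assumed to happen elsewhere, so the functions take raw page texts as input.

```python
from collections import Counter
import math

def page_vector(text, m=20):
    """Represent one page by the raw counts of its m most frequent terms."""
    counts = Counter(text.lower().split())
    return dict(counts.most_common(m))

def normalize(vec):
    """Scale a sparse term vector to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else vec

def query_vector(pages, m=20):
    """q = p_bar / ||p_bar||, where p_bar sums the normalized page vectors."""
    p_bar = Counter()
    for text in pages:
        for term, value in normalize(page_vector(text, m)).items():
            p_bar[term] += value
    return normalize(dict(p_bar))
```

With vectors in this sparse dictionary form, the inner product over shared terms gives exactly the query similarity discussed in the introduction.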

Recall the two queries “Apple Juice” and “Apple Computer”. By submitting them to a search engine (search.yahoo.com), we

Table 2
Query classification results

                                                        Training error (%)    Testing error (%)
Ridge regression                                        3.32                  22.05
Regularized regression with search click information    8.37                  18.86

can get a list of relevant URLs. The top 15 returns are shown in Table 1. For each URL in Table 1, we extract words from the corresponding web page. Finally, the two queries can be represented by the following term vectors:

• Apple Juice: apple, juice, cider, oz, product, serves, recipes, pie, ingredients, fruit, frozen, punch, soyquick, tofu, lemon, orange, home, cinnamon, drink, beverage, ...
• Apple Computer: apple, computer, authorized, mac, college, product, SVP, purchaser, school, proposal, university, support, software, repair, ipod, jobs, make, users, com, retail, ...

The terms are sorted in decreasing order of frequency. Due to space limitations, we only list the top 20 terms


with the highest frequencies. Clearly, the expanded representations using web search results can better describe the meanings of the queries.

3. Regularized query classification

Once the queries are equipped with vector representations, classical supervised learning algorithms can be applied, including support vector machines (SVM) [6], linear regression [7], and naive Bayes [7]. When dealing with web-scale data, efficiency is a major concern, and for this reason we adopt the regression framework. In this section, we introduce our regularized query classification algorithm using search click information. We begin with a brief review of standard linear regression.

3.1. A brief review of linear regression

We consider the binary classification problem. Let {xi, yi}, i = 1, ..., m, be a set of training samples, where yi ∈ {1, −1} is the label of xi. Linear regression aims to fit a function f(x) = a^T x + b such that the residual sum of squares is minimized:

  RSS(a) = Σ_{i=1}^{m} (f(xi) − yi)^2.   (1)

For the sake of simplicity, we append a new element "1" to each xi. Thus, the coefficient b can be absorbed into a and we have f(x) = a^T x. Let X = (x1, ..., xm) and y = (y1, ..., ym)^T. We have

  RSS(a) = (y − X^T a)^T (y − X^T a).   (2)

Requiring ∂RSS(a)/∂a = 0, we obtain

  a = (X X^T)^{−1} X y.

One problem with linear regression is that it is too flexible. When the number of terms is larger than the number of samples, the minimizer is not unique, and it is unclear which of these minimizers has the best generalization capability. In fact, in this situation it can be shown that the matrix X X^T is singular, so the above optimization problem is not well defined. Regularization methods are introduced to overcome this problem. Tikhonov regularization, also known as ridge regression, is the most commonly used; it aims to find a minimum-norm minimizer:

  min_a Σ_{i=1}^{m} (a^T xi − yi)^2 + λ‖a‖^2.

The optimal solution of ridge regression is given by a = (X X^T + λI)^{−1} X y, where I is the identity matrix. Clearly, the matrix X X^T + λI is no longer singular. However, the Tikhonov regularizer is data independent: it fails to discover the intrinsic structure in the data.

3.2. Regularized regression with search click information

In this subsection, we introduce a novel regularized query classification method using search click information. We apply Belkin and Niyogi's manifold regularization framework [8] to learn a linear classifier. In web search, most search engines have accumulated a large amount of query logs, from which we can discover the semantic relationship between queries from the users' perspective. Among all the queries available, let m1 be the number of labeled queries and m2 the number of unlabeled queries. Without loss of generality, let {xi}, i = 1, ..., m1, be the labeled queries and {xi}, i = m1 + 1, ..., m1 + m2, the unlabeled queries.

When a user submits a query x and then clicks a returned web page p, there is reason to suspect that x is somewhat related to p. The strength of relevance between x and p increases as more users click p while searching for x. Let C(x, p) denote the number of clicks on p with respect to query x. Our basic assumption is that if two queries xi and xj are both relevant to a web page p, then xi and xj are relevant to each other. Thus, the similarity between xi and xj from the users' perspective can be naturally defined as follows:

  Wij = Σ_p C(xi, p) C(xj, p).

Once we have constructed a weighted query graph G with edges connecting related queries, consider the problem of mapping G to a line so that related queries stay as close together as possible. A reasonable criterion for choosing a "good" map is to minimize the following [8]:

  Σ_{i,j} (a^T xi − a^T xj)^2 Wij.   (3)
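The click-based similarity Wij = Σ_p C(xi, p) C(xj, p) can be computed directly from click logs. A minimal sketch follows; the log format, a list of (query, url, clicks) records, is an assumption for illustration rather than the actual Yahoo! log schema.

```python
from collections import defaultdict
from itertools import combinations

def click_graph(log):
    """Build Wij = sum over p of C(xi, p) * C(xj, p) from (query, url, clicks) records."""
    # Aggregate click counts per URL: url -> {query: total clicks}
    by_url = defaultdict(lambda: defaultdict(int))
    for query, url, clicks in log:
        by_url[url][query] += clicks
    # Each URL clicked for two different queries contributes the product
    # of the two click counts to their similarity.
    W = defaultdict(int)
    for clicks_for_url in by_url.values():
        for (qi, ci), (qj, cj) in combinations(sorted(clicks_for_url.items()), 2):
            W[(qi, qj)] += ci * cj
    return W
```

The sorted pair key makes the symmetric weight Wij = Wji stored once per query pair.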

The objective function with our choice of Wij incurs a heavy penalty if related queries xi and xj are mapped far apart. Minimizing it is therefore an attempt to ensure that if xi and xj are relevant, then a^T xi and a^T xj are close. Incorporating Eq. (3) into standard linear regression as a regularizer, we obtain the following objective function [8]:

  V(a) = Σ_{i=1}^{m1} (a^T xi − yi)^2 + λ1 Σ_{i,j=1}^{m} (a^T xi − a^T xj)^2 Wij + λ2 ‖a‖^2,   (4)

where m = m1 + m2. Following some simple algebraic steps, we have

  (1/2) Σ_{i,j=1}^{m} (a^T xi − a^T xj)^2 Wij
    = Σ_{i=1}^{m} a^T xi Dii xi^T a − Σ_{i,j=1}^{m} a^T xi Wij xj^T a
    = a^T X (D − W) X^T a
    = a^T X L X^T a,
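The identity above, (1/2) Σ_{i,j} Wij (a^T xi − a^T xj)^2 = a^T X L X^T a with L = D − W, is easy to verify numerically. The following NumPy sketch checks it on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 8                        # dimensionality, number of queries
X = rng.standard_normal((d, m))    # columns are query vectors x1 ... xm
a = rng.standard_normal(d)

W = rng.random((m, m))
W = (W + W.T) / 2                  # symmetric similarity weights
D = np.diag(W.sum(axis=1))         # Dii = row (= column) sum of W
L = D - W                          # graph Laplacian

f = a @ X                          # f[i] = a^T xi
lhs = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
rhs = f @ L @ f                    # = a^T X L X^T a
assert np.isclose(lhs, rhs)
```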


where X = (x1, ..., xm) and D is a diagonal matrix whose entries are the column (or row, since W is symmetric) sums of W, Dii = Σ_j Wij. L = D − W is the graph Laplacian [9]. It is worth noting that the graph Laplacian has been used in many areas, including dimensionality reduction [10], face recognition [11], clustering [12], ranking [13], and image segmentation [14].

Define X1 = (x1, ..., xm1) and y = (y1, ..., ym1)^T, where yi is the label of xi. Then V(a) reduces to

  V(a) = (y − X1^T a)^T (y − X1^T a) + λ1 a^T X L X^T a + λ2 a^T a.

Requiring that the gradient of V(a) vanish gives the following solution:

  a = (X1 X1^T + λ1 X L X^T + λ2 I)^{−1} X1 y.   (5)
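Eq. (5) yields a direct closed-form solver. A minimal NumPy sketch, assuming the labeled query vectors occupy the first m1 columns of X:

```python
import numpy as np

def regularized_regression(X, y, W, m1, lam1=0.1, lam2=0.1):
    """Solve a = (X1 X1^T + lam1 X L X^T + lam2 I)^(-1) X1 y, as in Eq. (5).

    X  : d x m matrix of all query vectors (labeled columns come first)
    y  : length-m1 label vector for the labeled queries
    W  : m x m click-based similarity matrix (symmetric)
    m1 : number of labeled queries
    """
    d = X.shape[0]
    X1 = X[:, :m1]
    L = np.diag(W.sum(axis=1)) - W              # graph Laplacian L = D - W
    A = X1 @ X1.T + lam1 * X @ L @ X.T + lam2 * np.eye(d)
    return np.linalg.solve(A, X1 @ y)           # avoids forming the inverse explicitly
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the standard numerically preferable route; for the sparse 273,238-dimensional vectors of Section 4, sparse solvers would be needed in practice.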

4. Experimental results

In this section, several experiments are presented to show the effectiveness of the proposed algorithm. We begin with a description of the data used in our experiments.

4.1. Data preparation

The data for this study are derived from Yahoo! search engine logs. A typical search engine log contains information about query and click events, such as time, query terms, number of served URLs, and details of served and clicked URLs. We used a manually labeled taxonomy of query terms in this experiment. The taxonomy has six levels of hierarchy, containing 31, 175, 324, 265, 111, and 43 categories, respectively, as shown in Fig. 1. For reasons of effectiveness, we selected the second-level categories and removed those with fewer than 20 queries, leaving us with 144 categories and 94,415 queries. We collected one week of search engine logs to construct a query graph as described in Section 3. Fig. 2 shows the number of queries in each of these 144 categories. It is worth noting that our method can also handle all the levels; however, when all levels are used, the classification accuracy may not be acceptable for real-world web applications.

We sent all queries to the Yahoo! search engine, parsed the top 15 returned web documents, and found all unique terms present in these documents. The number of unique terms found in this way is 273,238, so each query can be represented as a 273,238-dimensional vector. Here, the number of dimensions equals the total number of terms used to represent all the queries. For a term vector x = (x1, x2, ..., x273,238), the element xi is set to 1 if the ith term is used to represent the query, and to 0 otherwise. Since only a small number of terms are used to represent each single query, the term vector is highly sparse. Since the number of queries is far smaller than the number of dimensions, overfitting may occur.

4.2. Query classification results

The query data set was randomly split into a training set containing 80% of the queries and a testing set containing the remaining 20%. The search click information was used to construct a query graph as described in Section 3, and the training set was used to learn a linear classifier. We averaged the results over 20 random splits, and applied cross-validation on the training set to select the parameters λ1 and λ2 in our algorithm. Since query classification is essentially a multi-class classification problem, we use a one-vs-rest strategy to train c binary classifiers f1, ..., fc, where c is the number of categories. Thus, for a test query x, its label is predicted as

  predicted label of x = argmax_i fi(x).
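The one-vs-rest scheme can be sketched as follows. For brevity this sketch uses plain ridge regression for each binary classifier, a simplified stand-in for the regularized solver of Eq. (5); the data layout (columns of X are queries) matches Section 3.

```python
import numpy as np

def train_one_vs_rest(X, labels, num_classes, lam=0.1):
    """Train one ridge-regression classifier per class: fi(x) = ai^T x."""
    d, m = X.shape
    A = []
    for c in range(num_classes):
        y = np.where(labels == c, 1.0, -1.0)                  # class c vs. rest
        a = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)  # ridge solution
        A.append(a)
    return np.array(A)                                         # num_classes x d

def predict(A, x):
    """Predicted label = argmax_i fi(x)."""
    return int(np.argmax(A @ x))
```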

The classification accuracy is defined as the ratio of the number of correct predictions to the total number of predictions. Table 2 shows the classification results for ridge regression and our proposed algorithm. The testing error of ridge regression is 22.05%, while that of our algorithm is 18.86%; our algorithm thus gains a 14.5% relative improvement over ridge regression. For a detailed comparison, Figs. 3 and 4 show the training and testing accuracy for each category.

4.3. Discussion

Our experiment reveals a number of interesting points:

Fig. 1. Query taxonomy used in our experiments.

1. The query classification performance is reasonably good. This indicates that queries can be accurately represented by the web pages returned by a search engine.
2. The training error of ridge regression is very low (3.32%), while its testing error is high (22.05%). This large gap implies that ridge regression suffers from overfitting.
3. The search click information and our proposed graph-based regularized regression method are effective in characterizing the semantic relationship between queries. The new algorithm achieves better classification accuracy than ridge regression. Also, the gap between training and testing error is much smaller than that of ridge regression, which


Fig. 2. The query distribution in 144 categories.

Fig. 3. The query classification results without using search click information.

Fig. 4. The query classification results by using search click information.

implies that our algorithm avoids the overfitting problem to some extent.

5. Conclusions

We have introduced a novel method for query representation and classification. One of the major difficulties of query understanding is that queries are often very short. By using a web search engine, a query can be represented as a term vector extracted from the top returned web documents; the expanded query better describes its meaning. For query classification, we have described a graph-based regularized regression method which makes use of search click information. Our basic assumption is that if two queries share a large number of clicked web documents, they are probably relevant to each other. Using search click information, we construct a weighted graph

over the queries which reflects the semantic relationship between queries from the users' perspective. This graph is then incorporated into regression as a regularizer. An experiment was carried out on a large real-world query data set, on which our algorithm achieved a 3.19 percentage-point improvement over ridge regression.

References

[1] J.-R. Wen, J.-Y. Nie, H.-J. Zhang, Query clustering using user logs, ACM Trans. Inf. Syst. 20 (1) (2002).
[2] H. Cui, J.-R. Wen, J.-Y. Nie, W.-Y. Ma, Query expansion by mining user logs, IEEE Trans. Knowl. Data Eng. 15 (4) (2003).
[3] C. Carpineto, G. Romano, B. Bigi, An information-theoretic approach to automatic query expansion, ACM Trans. Inf. Syst. 19 (1) (2001).
[4] B. Billerbeck, F. Scholer, H.E. Williams, J. Zobel, Query expansion using associated queries, in: Proceedings of the Ninth International Conference on Information and Knowledge Management, 2003, pp. 2–9.


[5] M. Theobald, R. Schenkel, G. Weikum, Efficient and self-tuning incremental query expansion for top-k query processing, in: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, August 2005, pp. 242–249.
[6] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin, 1995.
[7] Y. Yang, An evaluation of statistical approaches to text categorization, J. Inf. Retrieval 1 (1/2) (1999) 67–88.
[8] M. Belkin, P. Niyogi, V. Sindhwani, On manifold regularization, Technical Report TR-2004-05, Computer Science Department, The University of Chicago, 2004.
[9] F.R.K. Chung, Spectral Graph Theory, Regional Conference Series in Mathematics, vol. 92, American Mathematical Society, Providence, RI, 1997.

[10] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, Advances in Neural Information Processing Systems, vol. 14, MIT Press, Cambridge, MA, 2001, pp. 585–591.
[11] X. He, S. Yan, Y. Hu, P. Niyogi, H.-J. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005).
[12] A.Y. Ng, M. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, Advances in Neural Information Processing Systems, vol. 14, MIT Press, Cambridge, MA, 2001, pp. 849–856.
[13] D. Zhou, O. Bousquet, T. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, Advances in Neural Information Processing Systems, vol. 16, 2004.
[14] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 888–905.

About the Author—XIAOFEI HE received the BS degree from Zhejiang University, China, and the PhD degree from the University of Chicago, both in computer science, in 2000 and 2005, respectively. He is currently a research scientist at Yahoo! Research. His research interests include machine learning and pattern recognition.

About the Author—PRADHUMAN JHALA received the MS degree in computer science in 2003 from the Rochester Institute of Technology. He worked as a research associate with the Virginia Bioinformatics Institute and is currently working with Yahoo! Search and Advertising Sciences. His research interests include machine learning, information retrieval, and bioinformatics.