Regularized query classification using search click information

Pattern Recognition 41 (2008) 2283–2288 www.elsevier.com/locate/pr

Xiaofei He∗, Pradhuman Jhala
Yahoo! Research Labs, 3333 Empire Avenue, Burbank, CA 91504, USA
Received 26 September 2006; received in revised form 1 October 2007; accepted 9 January 2008

Abstract

Hundreds of millions of users submit queries to Web search engines each day. User queries are typically very short, which makes query understanding a challenging problem. In this paper, we propose a novel approach to query representation and classification. By submitting the query to a web search engine, the query can be represented as a set of terms found on the web pages returned by the search engine. In this way, each query can be considered as a point in a high-dimensional space, and standard classification algorithms such as regression can be applied. However, traditional regression is too flexible in situations with large numbers of highly correlated predictor variables and may suffer from overfitting. By using search click information, the semantic relationship between queries can be incorporated into the learning system as a regularizer. Specifically, from all the functions which minimize the empirical loss on the labeled queries, we select the one which best preserves the semantic relationship between queries. We present experimental evidence suggesting that the regularized regression algorithm is able to use search click information effectively for query classification.
© 2008 Elsevier Ltd. All rights reserved.

Keywords: Query classification; Query representation; Web search; Regression; Regularization; User logs

1. Introduction

Due to the rapid growth of web-based applications, there is an increasing demand for effective and efficient methods for query understanding. Successfully assigning a category label to a user query would potentially help many applications such as web search, web-based advertisement, and recommendation systems. In this paper, we consider the problem of query representation and classification.

One of the major difficulties of query understanding is that user queries are often very short; a typical query has fewer than three terms [1]. If we represent a query by a term vector, as with documents, and use the inner product to evaluate the similarity of query pairs, most pairs would have zero similarity. For example, the queries "SVM" and "Support Vector Machine" have the same meaning, but they do not share any common terms. Moreover, even if two queries share some

∗ Corresponding author. Tel.: +1 626 243 3793.

E-mail addresses: [email protected] (X. He), [email protected] (P. Jhala).
doi:10.1016/j.patcog.2008.01.010

common terms, evaluating their similarity according to these terms may not accurately reflect their semantic relationship. Consider the two queries "Apple Juice" and "Apple Computer". They share the common term "Apple", yet they have totally different meanings.

In traditional information retrieval, techniques such as query expansion have been proposed to help users formulate better queries [2–5]. Query expansion supplements the original query with additional terms, aiming to formulate better queries that enhance web search performance. However, the expanded queries generated by these techniques may still fail to provide a context rich enough for a term-wise similarity measure to reflect intrinsic semantic relationships.

In this paper, we introduce a novel approach to query representation and classification. For each query, we submit it to a search engine and use the top returned web documents to represent the query, based on the assumption that the top returns are likely to be relevant to it. These web documents are combined to form a term vector with rich context. Thus, each query can be represented as a point in a high-dimensional space. With vector representations, query similarities can be evaluated by the inner product and standard

machine learning algorithms can be applied to query classification. In real-world query data sets, the number of categories ranges from tens to hundreds. For reasons of computational complexity, we adopt regression as our classification tool.

In general, the dimensionality of the query space is very high. When the ratio of dimensionality to sample size is too large, we cannot reliably find a regression function with good generalization capability, and overfitting may occur. To avoid overfitting, one can impose additional constraints on the learning system as a regularizer. One of the most popular approaches to regularization is dimensionality reduction, in which the intrinsic discriminative structure is captured by projecting the data onto a lower-dimensional basis; the predictors are then replaced by their basis coefficients. A drawback of this approach is that one has to supply a suitable basis, which tends to involve more subjectivity.

In this work, we incorporate a regularizer based on users' search click information. For each query, we can obtain the list of URLs that were clicked during searches for that query. If the same URL is clicked during searches for two different queries, the queries are likely to be related. Intuitively, the relatedness of two queries can be evaluated as the size of the intersection of their URL lists. Thus, we can build a graph that models the semantic relationship between queries. This graph can then be incorporated into the regression framework as a regularization term. Specifically, from all the functions which minimize the empirical loss on the labeled queries, we select the one which best preserves the semantic relationship between queries.

The paper is structured as follows: in Section 2, we describe how to represent a query with rich context by using a search engine. In Section 3, the regularized query classification algorithm is introduced.
The experimental results are presented in Section 4. Finally, we provide some concluding remarks in Section 5.

Table 1
Web search results of "Apple Juice" and "Apple Computer"

Apple Juice:
www.applejuice.org
www.thenandnow.net/fanlisting/applejuice
en.wikipedia.org/wiki/Apple_juice
www.appleproducts.org/recipes.html
importer.alibaba.com/buyeroffers/Apple_Juice.html
www.applejuice.org/recipes.html
www.martinellis.com
www.martinellis.com/Products/50oz_Unfiltered_AppleJuice.htm
www.webtender.com/db/ingred/461
www.appleproducts.org/nutritn.html
www.drinksmixer.com/desc68.html
www.soymilkquick.com/applejuice.html
www.ichef.com/recipe.cfm?task=display&itemid=71667&recipeid=71330
www.hormel.com/kitchen/glossary.asp?id=36065&catitemid=
www.amazon.com/exec/obidos/tg/detail/-/B00000K41T?v=glance

Apple Computer:
www.apple.com
store.apple.com
www.apple.com/support
en.wikipedia.org/wiki/Apple_Computer
quicktime.apple.com
asia.apple.com
www.info.apple.com
developer.apple.com
jobs.apple.com/cgi-bin/WebObjects/Employment.woa
store.apple.com/Catalog/US/Images/routingpage.html
asu.info.apple.com
www.apple.co.jp
www.appleclub.com.hk
www.apple-history.com
finance.yahoo.com/q?s=AAPL


2. Query representation based on web search

As described previously, one of the major difficulties of query understanding is how to represent a query. In this work, we propose to represent a query by using web search results. The procedure is outlined below:

• Submit query q to a search engine. Let (p1, p2, ..., pn) be the top n web pages.
• For each pi, i = 1, ..., n, extract words and compute their frequencies. Keep the top m words with the highest frequencies. Each pi is thus represented as a term vector.
• Normalize each pi such that ‖pi‖ = 1, and add them together: p̄ = Σ_{i=1}^{n} pi.
• Finally, query q is represented by the vector q = p̄/‖p̄‖.
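The steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: term extraction is simplified to whitespace tokenization, and fetching pages from a search engine is assumed to happen elsewhere, so the functions take raw page texts as input.

```python
from collections import Counter
import math

def page_vector(text, m=20):
    """Represent one page by the raw counts of its m most frequent terms."""
    counts = Counter(text.lower().split())
    return dict(counts.most_common(m))

def normalize(vec):
    """Scale a sparse term vector to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else vec

def query_vector(pages, m=20):
    """q = p_bar / ||p_bar||, where p_bar sums the normalized page vectors."""
    p_bar = Counter()
    for text in pages:
        for term, value in normalize(page_vector(text, m)).items():
            p_bar[term] += value
    return normalize(dict(p_bar))
```

With vectors in this sparse dictionary form, the inner product over shared terms gives exactly the query similarity discussed in the introduction.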

Recall the two queries “Apple Juice” and “Apple Computer”. By submitting them to a search engine (search.yahoo.com), we

Table 2
Query classification results

                                                        Training error (%)    Testing error (%)
Ridge regression                                        3.32                  22.05
Regularized regression with search click information    8.37                  18.86

can get a list of relevant URLs. The top 15 returns are shown in Table 1. For each URL in Table 1, we extract words from the corresponding web page. Finally, the two queries can be represented by the following term vectors:

• Apple Juice: apple, juice, cider, oz, product, serves, recipes, pie, ingredients, fruit, frozen, punch, soyquick, tofu, lemon, orange, home, cinnamon, drink, beverage, ...
• Apple Computer: apple, computer, authorized, mac, college, product, SVP, purchaser, school, proposal, university, support, software, repair, ipod, jobs, make, users, com, retail, ...

The terms are sorted in decreasing order of frequency. Due to space limitations, we only list the top 20 terms


with the highest frequencies. Clearly, the expanded representations using web search results can better describe the meanings of the queries.

3. Regularized query classification

Once the queries are equipped with vector representations, classical supervised learning algorithms can be applied, including support vector machines (SVM) [6], linear regression [7], and naive Bayes [7]. When dealing with web-scale data, efficiency is a major concern, and for this reason we adopt the regression framework. In this section, we introduce our regularized query classification algorithm using search click information. We begin with a brief review of standard linear regression.

3.1. A brief review of linear regression

We consider the binary classification problem. Let {xi, yi}, i = 1, ..., m, be a set of training samples, where yi ∈ {1, −1} is the label of xi. Linear regression aims to fit a function f(x) = a^T x + b such that the residual sum of squares is minimized:

  RSS(a) = Σ_{i=1}^{m} (f(xi) − yi)^2.   (1)

For the sake of simplicity, we append a new element "1" to each xi. Thus, the coefficient b can be absorbed into a and we have f(x) = a^T x. Let X = (x1, ..., xm) and y = (y1, ..., ym)^T. We have

  RSS(a) = (y − X^T a)^T (y − X^T a).   (2)

Requiring ∂RSS(a)/∂a = 0, we obtain

  a = (X X^T)^{−1} X y.

One problem with linear regression is that it is too flexible. When the number of terms is larger than the number of samples, the minimizer is not unique, and it is unclear which of these minimizers has the best generalization capability. In fact, in this situation it can be shown that the matrix X X^T is singular, so the above optimization problem is not well defined. Regularization methods are introduced to overcome this problem. Tikhonov regularization, also known as ridge regression, is the most commonly used; it aims to find a minimum-norm minimizer:

  min_a Σ_{i=1}^{m} (a^T xi − yi)^2 + λ‖a‖^2.

The optimal solution of ridge regression is given by a = (X X^T + λI)^{−1} X y, where I is the identity matrix. Clearly, the matrix X X^T + λI is no longer singular. However, the Tikhonov regularizer is data independent: it fails to discover the intrinsic structure in the data.

3.2. Regularized regression with search click information

In this subsection, we introduce a novel regularized query classification method using search click information. We apply Belkin and Niyogi's manifold regularization framework [8] to learn a linear classifier. In web search, most search engines have accumulated a large amount of query logs, from which we can discover the semantic relationship between queries from the users' perspective. Among all the queries available, let m1 be the number of labeled queries and m2 the number of unlabeled queries. Without loss of generality, let {xi}, i = 1, ..., m1, be the labeled queries and {xi}, i = m1 + 1, ..., m1 + m2, the unlabeled queries.

When a user submits a query x and then clicks a returned web page p, there is reason to suspect that x is somewhat related to p. The strength of relevance between x and p increases as more users click p while searching for x. Let C(x, p) denote the number of clicks on p with respect to query x. Our basic assumption is that if two queries xi and xj are both relevant to a web page p, then xi and xj are relevant to each other. Thus, the similarity between xi and xj from the users' perspective can be naturally defined as follows:

  Wij = Σ_p C(xi, p) C(xj, p).

Once we have constructed a weighted query graph G with edges connecting related queries, consider the problem of mapping G to a line so that related queries stay as close together as possible. A reasonable criterion for choosing a "good" map is to minimize the following [8]:

  Σ_{i,j} (a^T xi − a^T xj)^2 Wij.   (3)
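The click-based similarity Wij = Σ_p C(xi, p) C(xj, p) can be computed directly from click logs. A minimal sketch follows; the log format, a list of (query, url, clicks) records, is an assumption for illustration rather than the actual Yahoo! log schema.

```python
from collections import defaultdict
from itertools import combinations

def click_graph(log):
    """Build Wij = sum over p of C(xi, p) * C(xj, p) from (query, url, clicks) records."""
    # Aggregate click counts per URL: url -> {query: total clicks}
    by_url = defaultdict(lambda: defaultdict(int))
    for query, url, clicks in log:
        by_url[url][query] += clicks
    # Each URL clicked for two different queries contributes the product
    # of the two click counts to their similarity.
    W = defaultdict(int)
    for clicks_for_url in by_url.values():
        for (qi, ci), (qj, cj) in combinations(sorted(clicks_for_url.items()), 2):
            W[(qi, qj)] += ci * cj
    return W
```

The sorted pair key makes the symmetric weight Wij = Wji stored once per query pair.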

The objective function with our choice of Wij incurs a heavy penalty if related queries xi and xj are mapped far apart. Minimizing it is therefore an attempt to ensure that if xi and xj are relevant, then a^T xi and a^T xj are close. Incorporating Eq. (3) into standard linear regression as a regularizer, we obtain the following objective function [8]:

  V(a) = Σ_{i=1}^{m1} (a^T xi − yi)^2 + λ1 Σ_{i,j=1}^{m} (a^T xi − a^T xj)^2 Wij + λ2 ‖a‖^2,   (4)

where m = m1 + m2. Following some simple algebraic steps, we have

  (1/2) Σ_{i,j=1}^{m} (a^T xi − a^T xj)^2 Wij
    = Σ_{i=1}^{m} a^T xi Dii xi^T a − Σ_{i,j=1}^{m} a^T xi Wij xj^T a
    = a^T X (D − W) X^T a
    = a^T X L X^T a,
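The identity above, (1/2) Σ_{i,j} Wij (a^T xi − a^T xj)^2 = a^T X L X^T a with L = D − W, is easy to verify numerically. The following NumPy sketch checks it on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 8                        # dimensionality, number of queries
X = rng.standard_normal((d, m))    # columns are query vectors x1 ... xm
a = rng.standard_normal(d)

W = rng.random((m, m))
W = (W + W.T) / 2                  # symmetric similarity weights
D = np.diag(W.sum(axis=1))         # Dii = row (= column) sum of W
L = D - W                          # graph Laplacian

f = a @ X                          # f[i] = a^T xi
lhs = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
rhs = f @ L @ f                    # = a^T X L X^T a
assert np.isclose(lhs, rhs)
```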


where X = (x1, ..., xm) and D is a diagonal matrix whose entries are the column (or row, since W is symmetric) sums of W, Dii = Σ_j Wij. L = D − W is the graph Laplacian [9]. It is worth noting that the graph Laplacian has been used in many areas, including dimensionality reduction [10], face recognition [11], clustering [12], ranking [13], and image segmentation [14].

Define X1 = (x1, ..., xm1) and y = (y1, ..., ym1)^T, where yi is the label of xi. Then V(a) reduces to

  V(a) = (y − X1^T a)^T (y − X1^T a) + λ1 a^T X L X^T a + λ2 a^T a.

Requiring that the gradient of V(a) vanish gives the following solution:

  a = (X1 X1^T + λ1 X L X^T + λ2 I)^{−1} X1 y.   (5)
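Eq. (5) yields a direct closed-form solver. A minimal NumPy sketch, assuming the labeled query vectors occupy the first m1 columns of X:

```python
import numpy as np

def regularized_regression(X, y, W, m1, lam1=0.1, lam2=0.1):
    """Solve a = (X1 X1^T + lam1 X L X^T + lam2 I)^(-1) X1 y, as in Eq. (5).

    X  : d x m matrix of all query vectors (labeled columns come first)
    y  : length-m1 label vector for the labeled queries
    W  : m x m click-based similarity matrix (symmetric)
    m1 : number of labeled queries
    """
    d = X.shape[0]
    X1 = X[:, :m1]
    L = np.diag(W.sum(axis=1)) - W              # graph Laplacian L = D - W
    A = X1 @ X1.T + lam1 * X @ L @ X.T + lam2 * np.eye(d)
    return np.linalg.solve(A, X1 @ y)           # avoids forming the inverse explicitly
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the standard numerically preferable route; for the sparse 273,238-dimensional vectors of Section 4, sparse solvers would be needed in practice.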

4. Experimental results

In this section, several experiments are presented to show the effectiveness of the proposed algorithm. We begin with a description of the data used in our experiments.

4.1. Data preparation

The data for this study are derived from Yahoo! search engine logs. A typical search engine log contains information about query and click events, such as time, query terms, number of served URLs, and details of served and clicked URLs. We used a manually labeled taxonomy of query terms in this experiment. The taxonomy has six levels of hierarchy, containing 31, 175, 324, 265, 111, and 43 categories, respectively, as shown in Fig. 1. For reasons of effectiveness, we selected the second-level categories and removed those with fewer than 20 queries, leaving us with 144 categories and 94,415 queries. We collected one week of search engine logs to construct a query graph as described in Section 3. Fig. 2 shows the number of queries in each of these 144 categories. It is worth noting that our method can also handle all the levels; however, when all levels are used, the classification accuracy may not be acceptable for real-world web applications.

We sent all queries to the Yahoo! search engine, parsed the top 15 returned web documents, and found all unique terms present in these documents. The number of unique terms found in this way is 273,238, so each query can be represented as a 273,238-dimensional vector. Here, the number of dimensions equals the total number of terms used to represent all the queries. For a term vector x = (x1, x2, ..., x273,238), the element xi is set to 1 if the ith term is used to represent the query, and to 0 otherwise. Since only a small number of terms are used to represent each single query, the term vector is highly sparse. Since the number of queries is far smaller than the number of dimensions, overfitting may occur.

4.2. Query classification results

The query data set was randomly split into a training set containing 80% of the queries and a testing set containing the remaining 20%. The search click information was used to construct a query graph as described in Section 3, and the training set was used to learn a linear classifier. We averaged the results over 20 random splits, and applied cross-validation on the training set to select the parameters λ1 and λ2 in our algorithm. Since query classification is essentially a multi-class classification problem, we use a one-vs-rest strategy to train c binary classifiers f1, ..., fc, where c is the number of categories. Thus, for a test query x, its label is predicted as

  predicted label of x = argmax_i fi(x).
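The one-vs-rest scheme can be sketched as follows. For brevity this sketch uses plain ridge regression for each binary classifier, a simplified stand-in for the regularized solver of Eq. (5); the data layout (columns of X are queries) matches Section 3.

```python
import numpy as np

def train_one_vs_rest(X, labels, num_classes, lam=0.1):
    """Train one ridge-regression classifier per class: fi(x) = ai^T x."""
    d, m = X.shape
    A = []
    for c in range(num_classes):
        y = np.where(labels == c, 1.0, -1.0)                  # class c vs. rest
        a = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)  # ridge solution
        A.append(a)
    return np.array(A)                                         # num_classes x d

def predict(A, x):
    """Predicted label = argmax_i fi(x)."""
    return int(np.argmax(A @ x))
```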

The classification accuracy is defined as the ratio of the number of correct predictions to the total number of predictions. Table 2 shows the classification results for ridge regression and our proposed algorithm. The testing error of ridge regression is 22.05%, while that of our algorithm is 18.86%; our algorithm thus gains a 14.5% relative improvement over ridge regression. For a detailed comparison, Figs. 3 and 4 show the training and testing accuracy for each category.

4.3. Discussion

Our experiment reveals a number of interesting points:

Fig. 1. Query taxonomy used in our experiments.

1. The query classification performance is reasonably good. This indicates that queries can be accurately represented by the web pages returned by a search engine.
2. The training error of ridge regression is very low (3.32%), while its testing error is high (22.05%). This large gap implies that ridge regression suffers from overfitting.
3. The search click information and our proposed graph-based regularized regression method are effective in characterizing the semantic relationship between queries. The new algorithm achieves better classification accuracy than ridge regression. Also, the gap between training and testing error is much smaller than that of ridge regression, which


Fig. 2. The query distribution in 144 categories.

Fig. 3. The query classification results without using search click information.

Fig. 4. The query classification results by using search click information.

implies that our algorithm avoids the overfitting problem to some extent.

5. Conclusions

We have introduced a novel method for query representation and classification. One of the major difficulties of query understanding is that queries are often very short. By using a web search engine, a query can be represented as a term vector extracted from the top returned web documents; the expanded query better describes its meaning. For query classification, we have described a graph-based regularized regression method which makes use of search click information. Our basic assumption is that if two queries share a large number of clicked web documents, they are probably relevant to each other. Using search click information, we construct a weighted graph

over the queries which reflects the semantic relationship between queries from the users' perspective. This graph is then incorporated into regression as a regularizer. An experiment was carried out on a large real-world query data set, on which our algorithm achieved a 3.19 percentage-point improvement over ridge regression.

References

[1] J.-R. Wen, J.-Y. Nie, H.-J. Zhang, Query clustering using user logs, ACM Trans. Inf. Syst. 20 (1) (2002).
[2] H. Cui, J.-R. Wen, J.-Y. Nie, W.-Y. Ma, Query expansion by mining user logs, IEEE Trans. Knowl. Data Eng. 15 (4) (2003).
[3] C. Carpineto, G. Romano, B. Bigi, An information-theoretic approach to automatic query expansion, ACM Trans. Inf. Syst. 19 (1) (2001).
[4] B. Billerbeck, F. Scholer, H.E. Williams, J. Zobel, Query expansion using associated queries, in: Proceedings of the Ninth International Conference on Information and Knowledge Management, 2003, pp. 2–9.


[5] M. Theobald, R. Schenkel, G. Weikum, Efficient and self-tuning incremental query expansion for top-k query processing, in: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, August 2005, pp. 242–249.
[6] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin, 1995.
[7] Y. Yang, An evaluation of statistical approaches to text categorization, J. Inf. Retrieval 1 (1/2) (1999) 67–88.
[8] M. Belkin, P. Niyogi, V. Sindhwani, On manifold regularization, Technical Report TR-2004-05, Computer Science Department, The University of Chicago, 2004.
[9] F.R.K. Chung, Spectral Graph Theory, Regional Conference Series in Mathematics, vol. 92, American Mathematical Society, Providence, RI, 1997.

[10] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, Advances in Neural Information Processing Systems, vol. 14, MIT Press, Cambridge, MA, 2001, pp. 585–591.
[11] X. He, S. Yan, Y. Hu, P. Niyogi, H.-J. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005).
[12] A.Y. Ng, M. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, Advances in Neural Information Processing Systems, vol. 14, MIT Press, Cambridge, MA, 2001, pp. 849–856.
[13] D. Zhou, O. Bousquet, T. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, Advances in Neural Information Processing Systems, vol. 16, 2004.
[14] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 888–905.

About the Author—XIAOFEI HE received the BS degree from Zhejiang University, China, and the PhD degree from the University of Chicago, both in computer science, in 2000 and 2005, respectively. He is currently a research scientist at Yahoo! Research. His research interests include machine learning and pattern recognition.

About the Author—PRADHUMAN JHALA received the MS degree in computer science in 2003 from the Rochester Institute of Technology. He worked as a research associate with the Virginia Bioinformatics Institute and is currently working with Yahoo! Search and Advertising Sciences. His research interests include machine learning, information retrieval, and bioinformatics.