ADVERTISING BASED ON USERS’ PHOTOS
Xin-Jing Wang†, Mo Yu‡, Lei Zhang†, Wei-Ying Ma†
†Microsoft Research Asia, ‡Harbin Institute of Technology
ABSTRACT

In this paper, we tackle the problem of learning a user’s interest from his photo collections and suggesting relevant ads. We address two key challenges: 1) understanding a user’s photos to detect his interest, and 2) bridging the lexical and semantic gap between the vocabulary of ads and that of general users’ photos. We solve the first problem by employing a data-driven image annotation approach to annotate each photo and by modeling a group of photos, and we tackle the second problem by learning and matching the topics of users’ photos and ads. Experiments on real Flickr data show the effectiveness of the approach.

Index Terms— image advertising, user interest modeling

1. INTRODUCTION

Though images have become one of the most pervasive media formats on the Web, image-based online advertising is still an untapped gold mine. We believe that user-generated photos are invaluable resources for understanding users and providing user-targeted ads: although users are generally reluctant to expose their interests publicly, the content they record with a camera may reveal those interests. Motivated by this, we propose a photo topic mining approach that identifies users’ interests from their uploaded photos and provides image content-based advertising.

Intuitively, we could suggest ads by directly matching the textual descriptions of ads with user-generated tags and showing the most relevant ads to users. However, a major barrier is that many Web photos have no tags. Moreover, even when users have tagged their photos, it is still challenging to detect their real interests. To solve these problems, we first adopt a data-driven image annotation approach to automatically tag the content of each photo, and then model the interest by learning the “topic” of a set of photos, typically the photos shown on one page. We emphasize a “set of photos” because a user’s interest is more reliably discovered from a set of his photos than from a single photo. For example, given a photo of a red tomato, it is unclear whether the user is interested in tomatoes, fruits, or red objects. However, if several photos of red and green tomatoes are given, we are more confident to conclude that the user’s interest is “tomato”.
[Figure 1 near here: system diagram with components Online Photo Collection (images, users’ tags), Image Annotation, Text Processing & Feature Combination, Ads Textual Features, Ranking Model, and Suggested Ads.]
Figure 1. System overview. Image annotations are combined with users’ tags to retrieve relevant ads. The suggested ads of pet tags and foods for the dog images are generated from the real system developed in this study.

On the other hand, an additional advantage of learning photos’ “topics” is that it helps solve the vocabulary impedance problem [2] between ads and general images.

Note that we focus only on relevance and do not discuss the bidding problem in this research. Relevance is very important because current online advertising revenue is mainly driven by user behaviors (Pay-Per-Click or Pay-Per-Action); thus, the key to attracting a user’s click is to suggest ads that are relevant to either the user’s information need or interest [2][4].

2. THE APPROACH

The entire process is illustrated in Figure 1. Firstly, we use the image annotation approach to automatically annotate each image in a query image group. Secondly, we combine the generated annotations with user-submitted tags (called u-tags hereafter) if they are available. Thirdly, a ranking model is applied to rank ads by their relevance to the user’s interest learnt from the image textual features, and the top-ranked ads are returned as suggestions.
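To make this three-step flow concrete, the following is a minimal Python sketch, not the authors’ implementation: annotate_image, combine_tags, rank_ads, and suggest_ads_for_page are hypothetical helpers, and the annotator and the relevance score are reduced to trivial placeholders (the real system uses the search-based annotator of Section 2.1 and the ranking models of Section 2.3).

```python
from collections import Counter
from typing import Dict, List

TermVector = Dict[str, float]


def annotate_image(image_bytes: bytes) -> List[str]:
    """Stand-in for the search-based annotator of Section 2.1.

    The real system retrieves visually similar web images and mines salient
    terms from their surrounding texts; this placeholder returns no tags so
    the sketch stays self-contained.
    """
    return []


def combine_tags(auto_tags: List[str], u_tags: List[str]) -> TermVector:
    """Merge auto-annotations with user-submitted tags into raw term counts."""
    return dict(Counter(auto_tags + u_tags))


def rank_ads(interest: TermVector, ads: List[TermVector], top_n: int = 5) -> List[int]:
    """Rank ads by a simple relevance score against the learnt interest."""
    def score(ad: TermVector) -> float:
        return sum(w * ad.get(t, 0.0) for t, w in interest.items())
    return sorted(range(len(ads)), key=lambda i: -score(ads[i]))[:top_n]


def suggest_ads_for_page(photos: List[dict], ads: List[TermVector]) -> List[int]:
    """End-to-end flow for one page of a user's photos."""
    interest: Counter = Counter()
    for photo in photos:
        auto = annotate_image(photo.get("image", b""))
        interest.update(combine_tags(auto, photo.get("u_tags", [])))
    return rank_ads(dict(interest), ads)
```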
2.1. Image Content Understanding

We adopt the data-driven approach proposed by Wang et al. [3] to annotate an image. Its basic idea is as follows: if a large-scale image database were available such that every new image had a duplicate in the database, then annotating a new image would simply mean copying the duplicate’s textual description. Since it is infeasible to obtain such a database in reality, we instead find, for a new image, a set of visually and semantically similar images, or near-duplicates, and infer the annotations from their descriptions. Motivated by this, we annotate an image in three steps: 1) retrieving its similar images; 2) mining salient terms from the surrounding texts of the retrieval results; and 3) rejecting noisy terms. The database is made up of 2.4 million high-quality web images with meaningful surrounding texts.

The input of the approach can be either an image alone or an image with a few terms describing some of its content. This property suits web images well, since not all web images have surrounding texts. We call the annotations resulting from the former type of query c-tags and those from the latter ck-tags, and compare the quality of these two types of tags in our evaluation (see Section 3).

Combining the annotations with u-tags (if available), we represent an image in the vector space model, with stopwords removed and terms stemmed. Each term is weighted by its TF*IDF score.

2.2. User Interest Modeling

We consider the photos displayed on one page together and learn group-wise topics. The reason is threefold. Firstly, this coincides with the browsing behavior supported by current photo sharing websites, e.g. Flickr.com, which display users’ photo collections in successive pages. Secondly, as analyzed above, it is more reasonable to learn interests from a group of photos than from a single photo. Thirdly, we can display different ads for each page of photos, which increases the chance of a user clicking on the suggested ads. We propose two methods to model a user’s interest from a page of his photos, namely the term distribution model and the topic distribution model.

2.2.1. Term Distribution Model

This is a naïve model which simply computes the histogram of term frequency to represent a user’s interest. This model serves as the baseline.

2.2.2. Topic Distribution Model

There are at least two key problems that greatly degrade the effectiveness of the term distribution model. The first is the noisy tag problem. User-generated tags are usually noisy, and image annotation remains a challenging research topic that cannot provide 100 percent accurate tags either. Since it cannot differentiate noisy tags from accurate ones, the term distribution model is inevitably biased. The second problem is known as “vocabulary impedance” [1][2]: there is not only a lexical gap but also a semantic gap between the vocabulary of ads and the vocabulary of general images.
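Looking back at the term-level representations just introduced (the TF*IDF vector of Section 2.1 and the term-frequency histogram baseline of Section 2.2.1), a minimal sketch could look as follows; the document-frequency table and the example tags are hypothetical and not drawn from the 2.4-million-image database.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vector(tags: List[str], doc_freq: Dict[str, int], n_docs: int) -> Dict[str, float]:
    """Represent one photo: weight each stemmed, stopword-free tag by TF*IDF."""
    counts = Counter(tags)
    return {t: tf * math.log(n_docs / (1 + doc_freq.get(t, 0)))
            for t, tf in counts.items()}


def term_distribution(page_tags: List[List[str]]) -> Dict[str, float]:
    """Baseline interest model (Section 2.2.1): a normalised term-frequency
    histogram computed over all photos shown on one page."""
    counts = Counter(t for tags in page_tags for t in tags)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


# Hypothetical page of two tomato photos: "tomato" dominates the histogram.
page = [["tomato", "red", "garden"], ["tomato", "green", "plant"]]
print(term_distribution(page))
```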
[Figure 2 near here: an ODP tree with numbered nodes; a user photo term vector [u1, u2, …, un] and an ad term vector [w1, w2, …, wn] are each linked to tree nodes, yielding topic distributions such as [0, 0, 0.3, 0.6, 0.1, 0, 0] and [0, 0, 0, 0.8, 0, 0.2, 0].]
Figure 2. Representing images by ODP-based topic distribution features. The dotted lines indicate the relevant ODP topics, and their thickness indicates the relevance score.

As an example of the lexical gap, a user may tag his dog photo with the dog’s name, a term that lies outside the ads’ vocabulary. The semantic gap between ads and photos, however, is more complex. For example, even though a user submitted only car photos and car-related tags, it is still reasonable to show him advertisements from insurance companies. Obviously, the term distribution model is not able to bridge these two types of gaps, which degrades the relevance of the ads matched to the learnt interest. Moreover, only the lexical gap was tackled in previous work [1][2], while the semantic mismatch problem has not been addressed.

In this study, we propose a topic distribution method to model a user’s interest in an attempt to bridge both gaps. The basic idea is to map a term distribution to a concept distribution and represent the interest by a ranked list of predefined concepts. We leverage the ODP taxonomy (Open Directory Project, http://dmoz.org/) for this task. ODP is the largest human-edited directory of the Web, constructed by expert editors. In the ODP tree, each concept is a tree node representing a category of webpages manually attached by human experts. We therefore learn a term distribution from the associated webpages as the feature vector representing each concept. Then, given a photo, we score the ODP concepts by their cosine similarities to its term distribution, and the concepts whose scores are above a certain threshold are used to represent the image. In particular, let I_i be an image and θ_j be the j-th concept. I_i is represented as the feature vector whose nonzero elements are the normalized similarity scores of the selected concepts, i.e.

$I_i = [\, w_j(\theta_j) \mid 0 \le w_j(\theta_j) \le 1 \,]^T$   (1)

where w_j(θ_j) is the normalized score of θ_j. Such a feature vector represents the “topic” of a certain photo. We perform this process for each photo under consideration and then aggregate the topics; the resulting topic distribution is assumed to represent the user’s interest.

The advantage of representing an image as a concept distribution, rather than categorizing it to a single concept as some previous work did, is twofold. Firstly, the soft membership effectively handles textual ambiguities, e.g. “apple” can represent either a fruit or a computer brand. Secondly, it solves the semantic mismatch problem to a certain extent. For example, a photo labeled “apple” can be matched with an “apple pie” ad since both of them have high weights on the concept “fruits”.
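A minimal sketch of this mapping from a term vector to an ODP concept distribution (Eq. (1)) is given below; concept_vectors stands in for the term vectors learnt from each ODP node’s associated webpages, and the threshold value is illustrative rather than the one used in our system.

```python
import math
from typing import Dict, List

TermVector = Dict[str, float]


def cosine(a: TermVector, b: TermVector) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def topic_distribution(image_terms: TermVector,
                       concept_vectors: List[TermVector],
                       threshold: float = 0.1) -> List[float]:
    """Eq. (1): score every ODP concept against the image's term vector,
    keep only concepts above the threshold, and normalise the scores."""
    scores = [cosine(image_terms, c) for c in concept_vectors]
    scores = [s if s >= threshold else 0.0 for s in scores]
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores
```

Averaging these per-photo distributions over a page (one possible aggregation) would then give the page-level topic distribution taken to represent the user’s interest.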
2.3. Ads Ranking

Corresponding to the two user interest models, we propose two ranking methods to identify relevant ads. Moreover, a mixture model is presented based on these two models.

2.3.1. Direct Match Model (COS)

The direct match model (COS) is widely adopted [2][4]. In this model, an ad is also represented as a term distribution, and COS simply measures the cosine similarity between the term distributions of photos and ads. This is used as a baseline method.

2.3.2. Topic Bridging Model (TB)

We adopt the same approach described in Section 2.2.2 to learn the topic distribution of an ad, and then measure the cosine similarity between topic distributions to score the ad. The ads are ranked in descending order of score and the top-ranked ones are returned to be displayed. The entire process is shown in Figure 2. Fundamentally, this model introduces a super-vocabulary which is assumed to be a superset of both the vocabulary of general images and that of ads. Moreover, this vocabulary is structured to define a semantic topic space, and an image is represented as a data point in this space.

2.3.3. Mixture Model

The TB model may be inferior to the COS model if the textual descriptions of photos and ads already suggest a good match; in this case, mapping the terms to an intermediate taxonomy may introduce additional noise. To address this issue, we propose a mixture model which performs a convex combination of the above two models. In particular, we measure the relevance of an ad ad_i to the user’s interest as in Eq. (2):

$Score_{mix}(ad_i) = \alpha \cdot tb(ad_i) + (1 - \alpha) \cdot \cos(ad_i)$   (2)

where tb(·) and cos(·) are the relevance scores output by TB and COS respectively, and α is an empirically determined parameter which weighs the confidence in tb(·). When α is set to zero the model reduces to COS, and when α is set to one it is exactly TB.
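Given topic distributions for the interest and for an ad (for example from a helper like topic_distribution above), the TB score and the mixture of Eq. (2) could be sketched as follows; the default alpha is only an illustrative value, not the tuned one.

```python
from typing import List


def tb_score(interest_topics: List[float], ad_topics: List[float]) -> float:
    """Topic bridging (TB): cosine similarity in the ODP topic space."""
    dot = sum(x * y for x, y in zip(interest_topics, ad_topics))
    ni = sum(x * x for x in interest_topics) ** 0.5
    na = sum(y * y for y in ad_topics) ** 0.5
    return dot / (ni * na) if ni and na else 0.0


def mixture_score(tb: float, cos: float, alpha: float = 0.5) -> float:
    """Eq. (2): convex combination of the TB and COS relevance scores.

    alpha = 0 reduces to the direct match model (COS); alpha = 1 is pure TB.
    """
    return alpha * tb + (1.0 - alpha) * cos
```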
3. EVALUATION

A series of experiments were performed to evaluate the performance of the proposed approach, the effect of auto-annotated tags, and the impact of the coefficient α in the mixture model. The evaluation is based on a photo collection of 25 Flickr users who submitted about 5,000 photos shown on about 420 pages. 6,500 Amazon product categories, containing about 20 million products in total, are used as ad campaigns.
Figure 3. AP performance of the three ranking models on u-tag+ck-tag features vs. the top N ads. The mixture model performs the best.

Given a page of Flickr photos, we rank each category and then randomly select a few products from the top-ranked categories as the output. The intuition is that products in the same category have very similar descriptions; if we ranked specific products, the suggested ads would most probably be dominated by a single category, which would result in poor user satisfaction.

Three volunteers were asked to mark each suggested ad as irrelevant, relevant or perfect. “Irrelevant” means an ad is a false alarm, “relevant” means an ad is somewhat relevant, and “perfect” means strong relevance. Two metrics were used for the evaluation. One is Average Precision (AP), the mean fraction of correct answers (i.e. relevant or perfect ads) among the top N results. In addition, to differentiate “perfect” suggestions from merely relevant ones for a better understanding of the performance, we adopt the Weighted Average Precision (WAP) metric defined in Eq. (3):

$WAP = (p + 0.5\,r) / (p + r + w)$   (3)

where p, r and w are the numbers of perfect, relevant and irrelevant ads among the considered results, respectively.

Figure 3 illustrates the average precision of the three ranking models vs. the top N output on the 420 groups of photos. The textual features used are the combinations of u-tags with ck-tags (i.e. “u-tag+ck-tag”). The mixture model was tuned to its best performance. From the figure, it can be seen that the mixture model performs the best, while the TB model generally outperforms the COS model except for the top-one result. This is probably because the vocabulary mismatch problem is not very severe for the top-one result. It also suggests that the image annotation system is effective, so that the ck-tags capture the semantics of the query images.

Figure 4 illustrates the effect of different types of textual features of the general images, namely c-tag, ck-tag, u-tag plus c-tag, and u-tag plus ck-tag. Note that we ignore the case of “u-tag only”, since some photos do not have any u-tags; in that case no ads could be suggested and the approach would be meaningless. Figure 4 suggests that: 1) u-tags can help image understanding and hence improve the performance. 2) ck-tags are more effective than c-tags; recall that a ck-tag is generated by leveraging the available u-tags in the annotation process, which helps bridge the semantic gap, and this conclusion is consistent with the findings in [3]. 3) The u-tag+ck-tag features produce more “perfect” ads than the u-tag+c-tag features and the other features, since they are significantly superior to the others in terms of the WAP metric while being comparable in terms of the AP metric.
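For concreteness, the two metrics above can be computed as in the sketch below, assuming each suggested ad carries one of the three judge labels; the label strings are our own naming, not part of the original evaluation tooling.

```python
from typing import List


def ap_at_n(labels: List[str], n: int) -> float:
    """Average Precision at N: fraction of relevant-or-perfect ads in the top N."""
    top = labels[:n]
    return sum(l in ("relevant", "perfect") for l in top) / len(top) if top else 0.0


def wap(labels: List[str]) -> float:
    """Eq. (3): WAP = (p + 0.5 r) / (p + r + w), where p, r, w count the
    perfect, relevant and irrelevant ads among the considered results."""
    p = labels.count("perfect")
    r = labels.count("relevant")
    w = labels.count("irrelevant")
    return (p + 0.5 * r) / (p + r + w) if (p + r + w) else 0.0
```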
Figure 4. The effect of different types of tags on the TB model. The effectiveness of u-tag+ck-tag is comparable to that of u-tag+c-tag under the AP metric, but is significantly superior under the WAP metric. This means that it produces more “perfect” ads than the other methods.

Figure 5 shows the effect of the coefficient α on the mixture model for the top N results. WAP and the u-tag+ck-tag features are used for this evaluation. The model obtains its best performance when α is strictly between zero and one, which means that the mixture model performs the best. Moreover, when α is larger than 0.5 the performance drops only slowly as α increases, which suggests that TB is superior to COS.

Figure 5. The effect of α on the mixture model vs. the top N results. The “u-tag+ck-tag” features were used. It shows that the mixture model is the most effective, and TB generally works better than COS.

Figure 6 shows a few examples of the system output. Three groups of users’ photos are given together with their u-tags, ck-tags and c-tags. Note that although the photos shown in the third row have no u-tags available, our system still generated meaningful ads.

4. CONCLUSION

The image advertising market is becoming non-negligible as photo sharing grows popular. In this paper, we have investigated the problem of how to recognize a user’s interest from his shared photos and advertise based on it. Two key challenges are addressed: 1) how to understand a user’s interest from his photo collections, and 2) how to overcome the vocabulary impedance between general photos and ads. We adopted a search-based annotation method to understand each photo and proposed a topic modeling method to learn a user’s interest, which solves the first problem. We address the second problem by leveraging an intermediate taxonomy and mapping the textual descriptions of both photos and ads onto it. Each ad is then ranked by its topic similarity to the user’s interest. Furthermore, a mixture model which combines the learnt topics and the raw texts is shown to achieve the best performance. Evaluations on real Flickr data demonstrate the effectiveness of the proposed method.
Figure 6. Examples of users’ photos and suggested ads. We suggest makeup products for the girls’ photos, sports equipment for the game photos, and planting products for the outdoor images. The outdoor images have no u-tags, yet our method still works.
Row 1: u-tags “woman, white, sweet, sunshine, summer, stunning”; ck-tags “women, red, dress, festival, street, sweet girl, smile”; c-tags “falls, flowers, garden, trees, red, river, house”.
Row 2: u-tags “illinois, girls, high, school, basketball, bourbonnais”; ck-tags “chicago, school girls, basketball game”; c-tags “flower, orange, rose, car, city”.
Row 3: no u-tags; c-tags “trees, green, stream, mountain, garden, waterfall”.

5. REFERENCES

[1] A. Broder, M. Fontoura, et al., “A Semantic Approach to Contextual Advertising,” SIGIR, pp. 559-566, 2007.
[2] B. Ribeiro-Neto, M. Cristo, et al., “Impedance Coupling in Content-targeted Advertising,” SIGIR, pp. 496-503, 2005.
[3] X.-J. Wang, L. Zhang, et al., “Annotating Images by Mining Image Search Results,” IEEE T-PAMI, vol. 30, no. 11, pp. 1919-1932, 2008.
[4] X.-S. Hua, T. Mei, and S.P. Li, “When Multimedia Advertising Meets the New Internet Era,” Int’l Workshop on Multimedia Signal Processing, pp. 1-5, 2008.