Large Scale Ad Latency Analysis

Mihajlo Grbovic, Jon Malkin, Hirakendu Das
Yahoo! Labs, Sunnyvale, CA, USA

Abstract—Late web display advertisements are problematic for both the user experience and the monetary machinery powering the display advertising industry. If a web page is delivered to a user but the ad fails to load in time, the publisher cannot charge the advertiser for that impression. Detecting whether a specific ad will render in time could give the publisher a choice to show that ad or another one. Further, discovering the root causes of latency, possibly over time as new violators emerge, would allow the publisher to address the actionable issues. We propose a system that predicts, at serve time, which ads are likely to have high latency. Once identified, we can either ignore those ads, even if they win the auction, or apply a penalty to those ads. In addition, our system collects the daily impression logs, consisting of different types of observations measured at serve time and the associated latency in milliseconds, and analyzes the data to identify the features associated with late ads and likely to be causing the delay.

Keywords-ad latency; classification; big data; mapreduce
I. INTRODUCTION

As computers and network speeds have increased, users now expect web pages to load seemingly instantaneously. Along with wanting content to arrive quickly, websites want to ensure ads are also delivered promptly, since sites cannot charge advertisers for ads that never show. Improving ad load times will ultimately improve both the user experience and the revenue for sites.

The online advertising world consists of advertisers, organizations wishing to advertise a product, service or event; publishers, who run the websites and want to supplement their income; and ad exchanges, which connect the participants together and create a marketplace.

Studies of user responses to page load times indicate that users prefer faster loading pages [1]. For users who are likely to engage with an advertisement, having an ad display sooner means the user has a chance to see the ad before becoming immersed in the page content. Even for a user who will not engage with an ad, having the page finish rendering quickly avoids a potentially distracting page change as a white-space placeholder is suddenly filled after the user has started to consume the page's content.

Financially, the impact of slow ads is clear. If an ad takes too long to load, the user may navigate away from the page before seeing the advertisement. This results in a missed opportunity for the advertiser and lost revenue for the publisher.

In this work, we consider two ways to use machine learning to analyze latency data from Yahoo's ad serving system to reduce the impact of ad latency. Only a small portion of traffic has been instrumented to record ad latency data, but Yahoo's large scale means that even with a small sampling rate we still have tens of millions of records per day. First, we perform offline analysis on recent delivery data to identify the largest factors contributing to slow ad rendering times. Second, we propose a run-time system to predict when ads are likely to be late. Using such a system, we could choose to bypass ads with a high probability of being late, even when selling the impression slot via auction.

II. PROBLEM DESCRIPTION

In the ad latency setup, the input space is defined by a feature vector $x \in \mathcal{X}$, and the output is defined as a latency value in milliseconds $y \in \mathcal{Y}$. The problem can be formulated as a binary classification problem with labels $c \in \{0, 1\}$ by thresholding: an ad is defined as being late, with $c = 1$, if the time between the ad request and the ad rendering exceeds $\tau$ milliseconds.

Let $D = \{(x_n, c_n) : n = 1, \ldots, N\}$ denote our sample data, where $N$ is the total number of records, $x_n$ is a $K$-dimensional feature vector of observations measured at ad serving time, and $c_n$ is the latency indicator. Our goal is both to learn a predictive model that maps $x$ to a latency indicator $f(x) = c$, and to perform an analysis to detect the top violators in terms of measured features. These two goals can be achieved simultaneously if a visualizable and interpretable model with good prediction power [2] is used, such as logistic regression [3] or decision trees [4].

An observation $x$ consists of user features, advertiser features, publisher features and temporal features. Table I shows the specific features and their cardinalities; some sensitive information is omitted (marked as X).
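As a concrete illustration of this formulation, the following minimal Python sketch converts a raw impression record into a binary late/not-late label. The field names and the specific value of $\tau$ are illustrative assumptions, not the production log schema:

```python
# Minimal sketch of the thresholding step; the field names and the choice
# of tau are illustrative assumptions, not the production log schema.
TAU_MS = 1000  # e.g. a 1-second "late" threshold, as used in Section IV

def label_impression(record, tau_ms=TAU_MS):
    """Map a raw impression record to (feature dict, binary label c)."""
    latency_ms = record["render_time_ms"] - record["request_time_ms"]
    c = 1 if latency_ms > tau_ms else 0  # c = 1 marks a late ad
    return record["features"], c

record = {"request_time_ms": 0, "render_time_ms": 1350,
          "features": {"device": "iphone"}}
print(label_impression(record))  # ({'device': 'iphone'}, 1)
```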
Table I: Ad latency feature categories and cardinalities

Type        Feature          Card.    Feature            Card.
User        device           15       OS                 22
            browser          100      robot              2
            CS               10       country            ∼215
            state            ∼261     city               ∼5583
            ISP              ∼3774    distance to DC     numeric
Advertiser  account          X        creative           X
            ad size (dim)    33       ad size (in KB)    numeric
Publisher   property         X        page on property   X
            ad position      28       DC                 X
            hostname         X        ad network         X
            serve type       X
Time        time             24       day of week        7
The user features include features extracted from the browser's user agent: the user's device type, operating system, browser, and whether or not the user is likely a robot. Based on a user's IP address, we also determine the geographic location (country, city, and state), an approximate physical distance from the data center (DC) that handles the ad request, the user's internet connection speed (CS), and the internet service provider (ISP) used by the user.

Advertiser features include the advertiser's identity; the actual advertisement image, typically referred to as the creative; and the size of the advertisement, both the 2-D dimensions of the creative and its byte size. A single advertiser may have multiple creatives, so some of these features are hierarchical.

The publisher-specific features include the property on which the ad is shown (in this case, limited to Yahoo-owned sites, e.g. Mail or Homepage), the specific web page on the property (e.g. compose new email), the position on the page at which the ad is shown (e.g. top or right), whether the ad is hosted by a 3rd-party ad network, and the hostname and the data center used to process the ad request.

We use a one-hot encoding for our categorical features, meaning each feature with cardinality $k$ is mapped to $k$ mutually exclusive binary features. The exact number of features varies from day to day, but the feature dimensionality is typically larger than 20,000.

We use logistic regression, where the feature vector $x$ is parameterized using a weight vector $w$: large positive weights are associated with a higher likelihood of latency and large negative weights with a lower likelihood of latency. Using such a model, we can provide feedback to publishers, advertisers and even the exchange about the sources of latency. Examples of the desired analysis could range from "ads are late in geographical region X," to "ads are late on a particular device Y and browser Z," or "serving ads from data center W is problematic."
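A minimal sketch of the one-hot expansion described above, in Python; the raw feature names and values are illustrative examples drawn from Table I, not the full production feature set:

```python
# Sketch of the one-hot expansion; the raw feature names and values are
# illustrative, not the full production feature set.
def one_hot(raw):
    """Map categorical values to sparse binary indicators such as
    "device=iphone"; each feature of cardinality k yields k mutually
    exclusive indicators, so total dimensionality easily exceeds 20,000."""
    return {f"{name}={value}": 1.0 for name, value in raw.items()}

raw = {"device": "iphone", "os": "mac osx", "browser": "safari5",
       "country": "US", "ad_position": "top", "hour": "08h"}
print(one_hot(raw))  # {'device=iphone': 1.0, 'os=mac osx': 1.0, ...}
```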
III. METHODOLOGY

A. Logistic Regression

The assumptions involved in logistic regression are similar to those involved with any linear model, namely the existence of a linear relationship between the inputs and the output. In the case of logistic regression, we assume that the posterior class probabilities can be estimated as a linear function of the inputs $x$, passed through a sigmoidal function,

$$P(c = 1 \mid x) = g\Big(w_0 + \sum_{k=1}^{K} w_k x_k\Big) = \frac{1}{1 + e^{-\left(w_0 + \sum_{k=1}^{K} w_k x_k\right)}}, \qquad (1)$$

and $P(c = 0 \mid x) = 1 - P(c = 1 \mid x)$.

Given a dataset $D$ of observations, we can estimate the parameters $w$ via maximum likelihood (ML). The likelihood function of the data $D$ is given by

$$p(D \mid w) = \prod_{n=1}^{N} P(c_n \mid x_n, w) = \prod_{n=1}^{N} \mu_n^{c_n} (1 - \mu_n)^{1 - c_n}, \qquad (2)$$
where we denote $\mu_n = P(c_n = 1 \mid x_n)$. Note that the term $\mu_n^{c_n}(1-\mu_n)^{1-c_n}$ reduces to the posterior class probability of class 0 when $c_n = 0$, and to the posterior class probability of class 1 when $c_n = 1$. In order to find the ML estimate of the parameters, we form the log-likelihood,

$$\ell = \log p(D \mid w) = \sum_{n=1}^{N} \big(c_n \log \mu_n + (1 - c_n) \log (1 - \mu_n)\big). \qquad (3)$$

The ML estimate requires us to solve $\nabla_w \ell = 0$, which is a non-linear system of $K+1$ equations in $K+1$ unknowns, so we do not expect a closed-form solution. Hence, the usual approach is to apply the gradient descent (GD) algorithm to obtain the parameter estimates.

It should be noted that the parameter estimates can also be calculated to minimize the mean squared error (MSE) between the ground truth $c$ and the prediction $f(x, w) = g\big(w_0 + \sum_{k=1}^{K} w_k x_k\big)$,

$$\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N} \big(c_n - f(x_n, w)\big)^2, \qquad (4)$$
which can be solved using gradient descent or similar optimization algorithms.

To avoid overfitting, one can use regularization, such as $\ell_2$ or $\ell_1$. Applying $\ell_2$ regularization, adding $\lambda_2\|w\|_2^2$ to the loss function, forces the parameter vector to have small values, thereby ensuring that no one dimension can have an overwhelming influence on classification. Using $\ell_1$ regularization, adding $\lambda_1\|w\|_1$ to the loss function, induces sparsity in the parameter vector, reducing the feature space to the subset of features that are the most predictive. This reduces the complexity of prediction, which results in faster prediction. Hyperparameters $\lambda_2 > 0$ and $\lambda_1 > 0$ can be selected based on the problem requirements and require tuning.

In the case of class imbalance, where examples of one class in a data set vastly outnumber examples of the other class, one can add per-example weights $m_n$ to the loss function, based on the proportion of the two classes in the data. The final problem that we tackle, including sample weights and regularization, is of the form

$$\min_{w \in \mathbb{R}^{K}} \; \frac{1}{N}\sum_{n=1}^{N} m_n \big(c_n - f(x_n, w)\big)^2 + \lambda R(w), \qquad (5)$$
where $R$ is the regularizer, e.g. $\ell_1$ and/or $\ell_2$.

Given logistic regression predictions $\hat{y}_n = f(x_n, w) \in [0, 1]$, the classification predictions are made by thresholding, as $\hat{c}_n = \mathrm{sign}(\hat{y}_n - \theta)$, where the threshold $\theta$ is usually set to 0.5, i.e. $f(x_n, w) \geq 0.5 \Rightarrow \hat{c}_n = 1$. However, $\theta$ can be chosen anywhere between 0 and 1 to achieve a desired False Positive Rate (FPR) or True Positive Rate (TPR).

There are many scalable implementations of different optimization techniques for solving problem (5) with and without regularization, such as stochastic and (mini-)batch gradient descent, sparse gradient descent, Newton, quasi-Newton and truncated Newton methods, trust region, and nonlinear conjugate gradient descent. An overview of different optimization techniques for logistic regression is given in [5].

B. Performance Metrics

To evaluate the performance of our ad latency model, we used a test set of labeled ad serving observations, disjoint from the training set, consisting of both late and non-late ads. The true positive rate, TPR, is defined as

$$\mathrm{TPR} = \frac{n_1}{N_1} \cdot 100, \qquad (6)$$

where $n_1$ is the number of correctly classified late-ad examples and $N_1$ is the total number of late-ad examples in the test data. Similarly, the false positive rate, FPR, is given by

$$\mathrm{FPR} = \frac{n_0}{N_0} \cdot 100, \qquad (7)$$

where $n_0$ is the number of misclassified non-late-ad examples and $N_0$ is the total number of non-late-ad examples in the test data.

For different values of the threshold $\theta$, the logistic regression model achieves different TPR and FPR performance. By sliding the threshold from 0 to 1, both the TPR and the FPR decrease, and by plotting these values on a 2-D plane we form a Receiver Operating Characteristic (ROC) curve [6]. Usually the threshold is chosen depending on application requirements. During online ad latency prediction, the publisher typically prefers a very low FPR, ensuring that very few ads are wrongly classified as late and thus minimally hurting revenue. At the same time, the publisher prefers a high TPR. Since these two goals are conflicting, the best strategy is to maximize the area under the ROC curve (AUC).
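To make the weighted objective (5) and the metrics (6)-(7) concrete, the following self-contained sketch trains a weighted, $\ell_2$-regularized logistic regression with batch gradient descent and sweeps the threshold $\theta$ to trace a ROC curve. It is a minimal illustration on synthetic NumPy data, not the production system (which uses Vowpal Wabbit):

```python
# Minimal sketch, assuming synthetic data and plain NumPy; the production
# system uses Vowpal Wabbit rather than this loop.
import numpy as np

rng = np.random.default_rng(0)
N, K = 10000, 20
X = rng.normal(size=(N, K))
true_w = rng.normal(size=K)
c = (X @ true_w > 2.0 * np.linalg.norm(true_w)).astype(float)  # ~2% "late"

# Per-example weights m_n set from class proportions, as in (5).
pos_frac = c.mean()
m = np.where(c == 1, 1.0 - pos_frac, pos_frac)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(K), 0.0
lam2, lr = 1e-3, 0.5
for _ in range(500):  # batch gradient descent on the weighted MSE loss
    y_hat = sigmoid(X @ w + b)
    err = m * (y_hat - c) * y_hat * (1.0 - y_hat)  # dMSE/dscore (up to a factor of 2)
    w -= lr * (X.T @ err / N + lam2 * w)           # l2-regularized gradient step
    b -= lr * err.mean()

# Sweep theta to trace the ROC curve and estimate AUC, per (6)-(7).
scores = sigmoid(X @ w + b)
thresholds = np.linspace(0.0, 1.0, 101)
tpr = np.array([(scores[c == 1] >= t).mean() for t in thresholds]) * 100
fpr = np.array([(scores[c == 0] >= t).mean() for t in thresholds]) * 100
auc = -np.trapz(tpr / 100, fpr / 100)  # FPR decreases as theta grows, hence the sign
print(f"AUC ~ {auc:.3f}")
```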
IV. EXPERIMENTAL RESULTS

In this section we present the results of a study of runtime prediction of ad latency and the sample output of our pipeline for offline ad latency analysis. We used the Vowpal Wabbit (VW) package [7] (https://github.com/JohnLangford/vowpal_wabbit) for logistic regression. This online-learning library is capable of running in a distributed manner on several machines in parallel, as well as on a single machine. We demonstrate the trade-offs in time and performance of training on a single machine vs. distributed training on a Hadoop cluster. To address label imbalance, as the overwhelming majority of our ads are not late, we use weighted learning (5), where weights are set based on the proportion of the classes.

A. MapReduce and AllReduce

The MapReduce [8] abstraction, implemented in the open-source platform Hadoop (http://hadoop.apache.org/), has become popular for distributed processing of big data. As data grows, performing regular operations and extracting statistics becomes impractical on a single machine, whether the bottleneck is storing all the data points on disk or fitting them in memory when operating on the entirety of the data. The MapReduce framework helps to bridge this gap.

There are three different ways of utilizing MapReduce for machine learning: 1) reading the data in multiple mappers and learning a model on a single reducer in an online-learning manner, without storing the points that are being streamed; 2) learning several local models on parts of the dataset in multiple mappers and combining the local models into a global model on a single reducer; and 3) learning models on multiple mappers, where the mappers can communicate to exchange local gradients.

In the first option, learning a model takes as long as learning on a single machine; the only benefit is in data storage, specifically the fact that the dataset never needs to be stored on a single computer's disk. The second option is typically used for batch machine learning algorithms. It requires running several MapReduce jobs, where the mappers compute gradients and the reducers sum them; thus one MapReduce job is analogous to one batch gradient descent update. This method may be ineffective because each iteration has large overheads (e.g. job scheduling, data transfer, data parsing), and at least a dozen iterations often need to be conducted to ensure convergence. The third option is known as AllReduce [9]. Several researchers in the field have argued that AllReduce is better suited for machine learning algorithms [10]. For solving problems of the form (5), AllReduce can accumulate local gradients for a gradient-based algorithm like gradient descent or L-BFGS [11]. A typical implementation imposes a tree structure on the communicating mappers, where local gradients are summed up the tree and then broadcast down to all mappers. In this approach, all computation is done in the mappers; reducers are never used, even though an implicit reduce is performed. VW contains an implementation of AllReduce compatible with Hadoop; for details see [9].
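As a rough illustration of the third option, the sketch below simulates a tree-structured AllReduce over local gradients in plain Python. The node topology and gradient values are invented for the example; a real implementation such as VW's handles networking and fault tolerance:

```python
# Toy simulation of tree AllReduce, assuming an in-process list of
# "mapper" gradients; a real system would communicate over the network.
import numpy as np

def tree_allreduce(local_grads):
    """Sum gradients up a binary tree, then broadcast the total down."""
    nodes = [np.asarray(g, dtype=float) for g in local_grads]
    while len(nodes) > 1:  # reduce phase: pair up and sum partial results
        paired = [nodes[i] + nodes[i + 1] for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2:
            paired.append(nodes[-1])
        nodes = paired
    total = nodes[0]
    return [total.copy() for _ in local_grads]  # broadcast phase

# Four "mappers", each with a local gradient over 3 parameters.
grads = [[0.1, -0.2, 0.0], [0.3, 0.1, -0.1],
         [0.0, 0.0, 0.2], [-0.1, 0.4, 0.1]]
synced = tree_allreduce(grads)
print(synced[0])  # every mapper now holds the global gradient [0.3 0.3 0.2]
```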
Figure 1: Comparison of ad-latency algorithms. Both panels plot the true positive rate (%) against the false positive rate (%). (a) Late ad definition consequences: ROC curves for models trained with late thresholds of 1 sec (AUC=0.8894), 2 sec (AUC=0.8465) and 3 sec (AUC=0.8350). (b) L1 regularization consequences: ROC curves for a 1 sec threshold with 10,272, 853 and 72 features.

Table II: Performance in terms of AUC

#mappers    avg map time    total time    AUC
1           7hr 32min       7hr 32min     .8894
100         1hr 07min       1hr 10min     .8893
500         12min           14min         .8891
1000        6min            8min          .8889
B. Runtime ad latency classification results

In runtime ad latency classification, the goal is to train a model that can predict in real time whether or not an ad will be late. If a late ad is detected before being served and an alternate ad predicted not to be late is available, the alternate ad is shown. The model is trained every day on the latest 7 days of data, and updated hourly as new data becomes available. VW allows for storing the model in a form that allows training to be continued. We observed that updating the model during the day helps in discovering newly emerging trends. To reduce the complexity of the trained model and ensure fast real-time prediction, we use $\ell_1$ regularization. The initial daily model is trained using VW's AllReduce distributed learning framework. The resulting model is saved on a single machine and the hourly training updates are made on that machine, since the hourly data are not large enough to benefit from distributed learning.
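This daily-plus-hourly schedule amounts to warm-starting each incremental update from the previously saved model state (VW exposes such continuation via saved models, e.g. its --save_resume option). A minimal Python sketch of the pattern, with synthetic data and a plain SGD pass standing in for the VW-based pipeline:

```python
# Sketch of warm-started hourly updates; the data, file paths and the plain
# SGD step are illustrative stand-ins for the VW-based production pipeline.
import numpy as np

def hourly_update(w, X, c, lr=0.01, lam1=1e-6):
    """One SGD pass over an hourly batch, with l1 shrinkage for sparsity."""
    for x_n, c_n in zip(X, c):
        p = 1.0 / (1.0 + np.exp(-(x_n @ w)))
        w = w - lr * (p - c_n) * x_n                             # log-loss gradient
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam1, 0.0)  # soft-thresholding
    return w

# w would normally be loaded from the daily AllReduce-trained model,
# e.g. w = np.load("model_daily.npy"); here a zero vector stands in.
w = np.zeros(50)
rng = np.random.default_rng(1)
X_hour = rng.normal(size=(1000, 50))              # stand-in hourly features
c_hour = (rng.random(1000) < 0.02).astype(float)  # ~2% late ads
w = hourly_update(w, X_hour, c_hour)
# np.save("model_daily.npy", w)  # the saved state seeds the next hourly update
```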
For the experimental evaluation of our ad latency model in this paper, we used a training set of 1.3 billion examples and a disjoint test set of 200 million examples.

Table II compares the time and performance of learning on a single machine vs. learning on different numbers of machines in a distributed manner. To ensure that Hadoop schedules M mappers, we compressed the data into M files. From Table II we can conclude that the running time improves drastically when we make use of the distributed learning implementation. Comparing different numbers of mappers, the performance decreases only slightly with a larger number of mappers, but with a very large decrease in training time.

Additional results are presented in Figure 1. Figure 1a shows the ROC curves of models trained on different definitions of late (1, 2 and 3 seconds). Figure 1b shows the ROC curves of models trained with a late definition of 1 second, but with different values of the $\ell_1$ regularization hyperparameter; larger $\ell_1$ values result in a sparser solution. Several observations can be made. First, it is easier to distinguish between ads that are more or less than 1 second late than between ads that are more or less than 2 or 3 seconds late. Second, as we increase the $\lambda_1$ hyperparameter, the performance degrades slightly but we get a much sparser solution, resulting in faster runtime prediction. Finally, we observe that at a very low FPR of 2-3%, we can detect and potentially eliminate 40% of late ads.
C. Offline ad-latency analysis results

For the offline ad latency analysis pipeline, we trained one logistic regression model for each ad network. This choice was made to better address any discovered issues. The models for all R ad networks are trained in a single MapReduce job, where R models are trained on R reducers (a schematic sketch of this partitioning appears below). The mappers read close to a billion data points observed in the latest 7 days. Ad network features are excluded from modeling because they are redundant given the partitioning. Moderate values of $\lambda_1$ were used for $\ell_1$ regularization. We observed that this discards features with little support, for instance operating systems, internet service providers, or browsers that very few people use, which is useful in interpreting the results; advertisers will rarely optimize their ads for devices or browsers that are not commonly used.

Once the models are trained for a specific day, they can be organized by ad network in the form of high-level features (e.g. device or browser), with an option to drill down to observe problematic and non-problematic categories, corresponding to features with large positive and large negative weights, respectively. In addition, geographic (geo) features can be displayed as a heat map. Table III shows a resulting visualization for some of the feature categories for one ad network; most categories were omitted due to data sensitivity.

Table III: Aggregation of results per feature type (feature, weight)

user device: tablet android 1.147, xbox 0.970, tablet other 0.446, ps3 0.438, iphone 0.433, mob. android 0.206, mob. other -0.349, desktop -0.675
user os: mac ppc 1.173, winphone7.5 0.854, ps3gameos 0.362, win2k 0.254, winxp 0.137, winphone8 0.005, os2 -0.030, unix -0.062, linux -0.174, win7 -0.247, win2k3 -0.313, win8 -0.435, mac osx
user browser: safari65 1.132, safari75 1.065, mob. safari 0.740, netfront 0.371, ie6 0.352, chrome4 0.318, chrome11 0.312, realplayer 0.309, firefox12 0.150, ie8 0.145, firefox3 0.144, safari5 0.137, chrome2 0.132, ie7 0.113, firefox23 0.102, firefox11 0.090, chrome28 -0.007, chrome26 -0.046, safari85 -0.080, ie10 -0.117, firefox21 -0.130, firefox17 -0.180, chrome27 -0.183
ad size: 425x600 0.127, 300x250 -0.086, 630x120 -0.149, 300x100 -0.727
hour (PST): 08h 0.131, 07h 0.046, 09h 0.024, 22h -0.002, 20h -0.014, 21h -0.028
user connection: oc3 1.226, satellite 0.888, mobile 0.826, dialup 0.227, wireless 0.167, broadband -0.156, cable -0.433, t1 -0.518

Figure 2: Feature types sorted by logistic regression weights (weights range from roughly -4 to 5; feature types include ISP, city, space ID, country, creative ID, state, browser, OS, CS, device, hour, ad size, ad location and colo).
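The per-network partitioning mentioned above can be pictured as a single MapReduce job in which mappers key each record by its ad network and each reducer fits one model over its partition. A schematic Python sketch; the record format, the stub training step and the in-process shuffle are illustrative assumptions, not a real Hadoop job:

```python
# Schematic mapper/reducer pair for training one model per ad network;
# record format and the inner training step are illustrative assumptions.
from collections import defaultdict

def train_logistic_regression(examples):
    """Stub training step; a real reducer would run the GD loop above or VW."""
    return f"model fit on {len(examples)} examples"

def mapper(records):
    """Key each impression by its ad network; ad network features themselves
    are dropped, since they are redundant given the partitioning."""
    for rec in records:
        yield rec["ad_network"], (rec["features"], rec["late"])

def reducer(network, examples):
    """Each reducer fits one per-network model."""
    return network, train_logistic_regression(examples)

# Hypothetical impression records; the shuffle below mimics what Hadoop does.
impression_log = [
    {"ad_network": "netA", "features": {"device": "iphone"}, "late": 1},
    {"ad_network": "netB", "features": {"device": "desktop"}, "late": 0},
    {"ad_network": "netA", "features": {"device": "xbox"}, "late": 1},
]
groups = defaultdict(list)
for key, value in mapper(impression_log):
    groups[key].append(value)
models = dict(reducer(net, exs) for net, exs in groups.items())
print(models)
```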
The analysis of the type shown in Table III is very useful for all parties: publishers can address the problematic web pages, advertisers can revisit the design of certain creatives and optimize them as needed, and the ad exchange can examine issues with specific servers, data centers or even ISPs.

If we sort the features by weight, as in Figure 2, we can further observe that ISP and geo features cover most of the top and bottom positions. Among the 100 features with the highest weights, 75 are ISPs, 13 are geo, 6 are space IDs and 1 is a creative. Of the 100 features with the lowest weights, 62 are ISPs, 21 are geo, 12 are space IDs, and 7 are from other feature types. Interestingly, the top negative position is held by a specific ad location on the page.

Overall, the offline ad-latency analysis pipeline brings value to the different players in the computational advertising business. The hope is that, over time, emerging ad serving and rendering problems can be detected early and resolved to some extent using the proposed method.

V. CONCLUSION

In this paper we presented the results of a large scale ad latency analysis. We described the methodology and high-level implementation behind a system that collects and analyzes ad impression data. Using user, advertiser and publisher features, we trained models to predict ad latency for runtime late-ad detection, as well as for offline analysis to help identify the features that are associated with late ads. In future work, we will concentrate on regression, model the interactions between features by creating cross-features, and develop an alternate modeling pipeline using decision trees. We also plan to explore the idea of a Pigovian tax [12], treating latency as a negative externality and penalizing an advertiser's bid while allowing the ad to show for a sufficiently high cost.

REFERENCES

[1] J. Ramsay, A. Barbesi, and J. Preece, "A psychological investigation of long retrieval times on the world wide web," Interacting with Computers, vol. 10, no. 1, pp. 77-86, 1998.
[2] H. Kim, W.-Y. Loh, Y.-S. Shih, and P. Chaudhuri, "Visualizable and interpretable regression models with good prediction power," IIE Transactions, vol. 39, no. 6, pp. 565-579, 2007.

[3] D. W. Hosmer and S. Lemeshow, Applied Logistic Regression. Wiley-Interscience, 2004, vol. 354.

[4] L. Breiman, Classification and Regression Trees. CRC Press, 1993.

[5] C.-J. Lin, R. C. Weng, and S. S. Keerthi, "Trust region Newton method for logistic regression," The Journal of Machine Learning Research, vol. 9, pp. 627-650, 2008.

[6] T. Fawcett, "ROC graphs: Notes and practical considerations for researchers," Machine Learning, vol. 31, pp. 1-38, 2004.

[7] J. Langford, L. Li, and T. Zhang, "Sparse online learning via truncated gradient," The Journal of Machine Learning Research, vol. 10, pp. 777-801, 2009.

[8] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

[9] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford, "A reliable effective terascale linear learning system," arXiv preprint arXiv:1110.4198, 2011.

[10] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein, "GraphLab: A new framework for parallel machine learning," in The 26th Conference on Uncertainty in Artificial Intelligence, 2010.

[11] J. Nocedal, "Updating quasi-Newton matrices with limited storage," Mathematics of Computation, vol. 35, no. 151, pp. 773-782, 1980.

[12] A. C. Pigou, Wealth and Welfare. Macmillan, 1912.