Geolocation with Subsampled Microblog Social Media

Miriam Cha (Harvard University), Youngjune Gwon∗ (Harvard University), H. T. Kung (Harvard University)

∗The author was a Ph.D. student at Harvard while the work was done. He now works for MIT Lincoln Laboratory in Lexington, MA.
ABSTRACT
We propose a data-driven geolocation method for microblog text. The key idea underlying our approach is sparse coding, an unsupervised learning algorithm. Unlike conventional positioning algorithms, we geolocate a user by identifying features extracted from her social media text. We also present an enhancement robust to the erasure of words in the text and report experimental results with uniformly and randomly subsampled microblog text. Our solution features a novel two-step procedure consisting of upconversion and iterative refinement by joint sparse coding. As a result, we can reduce the amount of input data required for geolocation while preserving good prediction accuracy. In the light of information preservation and privacy, we remark on potential applications of these results.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering; I.2.6 [Artificial Intelligence]: Learning—Unsupervised feature learning

Keywords
Geolocation; joint sparse coding; text subsampling; Twitter
1. INTRODUCTION
Traditionally, geolocation involves the detection and related computational processing of beacon signals used in a positioning system such as GPS. We consider a data-driven framework for geolocation that leverages the increased availability of geotagged social media data. In particular, we aim to develop an algorithm for estimating the geocoordinates of a social network user by learning features from the user's social media text. We also extend the algorithm to take lossy, subsampled text data as input while preserving accurate geolocation estimates. Geolocation information serves as valuable context for social media. Recent research [1, 2] has experimented with latent variable models trained on the Twitter microblog text
(“tweets”) for geolocation. To achieve decent geolocation accuracy, a geographic topic model must be trained with a sufficient amount of labeled training examples; the fact that only 2.2% of tweets are geotagged [3] indicates the difficulty of such supervised learning. In this paper, we describe a model-free geolocation method driven by a relatively small labeled dataset. Our key component is sparse coding, an unsupervised algorithm that learns a feature mapping for the raw text input. It is advantageous for prediction algorithms to operate on sparse (feature) representations of the raw data. We exploit the mapping to build a lookup table of reference geocoordinates and search it using the sparse feature vector computed on the text input, applying k-Nearest Neighbor (k-NN) similarity matching in the feature domain for geolocation.

This paper also studies the preservation of geographic information after words in the text are discarded. While the cause of such discarding could be unknown or randomly introduced, a purposeful usage would be to enhance the privacy of a user, especially if the computing for geolocation is done remotely (e.g., in the cloud). We demonstrate the robustness of our approach by experimenting with uniformly and randomly subsampled tweet text. While subsampling is a simple method of dimensionality reduction, we face a difficulty unique to text data: information erasure is highly irregular, depending on which word gets discarded. To address this challenge, we propose a novel two-step procedure consisting of upconversion and refinement inspired by joint sparse coding for image super-resolution [4]. We will show that the refinement step can be repeated iteratively to produce better geolocation estimates.

Geolocation based on comprehending social media text is an ongoing, hard research problem [1, 2, 5]. Prediction with heavily subsampled text (e.g., with only 50% of the words in the text) is even more challenging. Our work here assesses the feasibility of a learning approach robust to subsampling. The results of this paper may be useful for developing secure applications for devices with limited computing power.

The rest of this paper is organized as follows. In Section 2, we describe our data-driven framework for text-based geolocation. Section 3 presents our approach to geolocation from subsampled text data. In Section 4, we discuss the results of an experimental evaluation on the CMU GeoText dataset [6], and Section 5 concludes the paper.
2. TEXT-BASED GEOLOCATION VIA SPARSE CODING
To alleviate the scarcity of labeled training examples in practice, we focus on unsupervised feature learning based on sparse coding and dictionary training.
2.1 Sparse coding for text data

We use sparse coding as the basic means to extract features from text. A text document, however, cannot be fed directly to sparse coding. Instead, we convert text to a numeric form in a procedure called "embedding." Let vocab denote the collection of unique words appearing in the documents, with size V = |vocab|. In the binary-bag-of-words (BW) embedding scheme, a text document containing W words is represented as a bit vector w_BW ∈ {0, 1}^V. The ith element of w_BW is 1 if the word vocab[i] appears in the text. We also use word-sequence (WS) embedding w_WS ∈ {1, ..., V}^W, where w_i, the ith element of w_WS, represents vocab[w_i]. Sparse coding takes a unit of input called a patch drawn from the data. We denote a patch x ∈ R^N, a consecutive subvector taken from w_BW or w_WS, as the input to sparse coding. Given an input x ∈ R^N, sparse coding solves for a representation y ∈ R^K in the following optimization problem:

$$\min_{D,\,y} \; \|x - Dy\|_2^2 + \lambda \psi(y) \qquad (1)$$

where the input x is represented as a sparse linear combination of basis vectors in an overcomplete dictionary D ∈ R^{N×K} (K > N). The solution y is the feature representation of x. The system x = Dy is underdetermined (i.e., more unknowns than equations) and needs an extra constraint to assure a unique solution. As in the second term of Eq. (1), sparse coding regularizes with the ℓ0- or ℓ1-norm of y for ψ(.) with λ > 0. Because the ℓ0-norm of a vector is its number of nonzero elements, it serves the regularization purpose precisely. Finding the sparsest ℓ0-minimum solution in general, however, is known to be NP-hard, so ℓ1-minimization with LASSO [7] or Basis Pursuit [8] is often preferred. Greedy ℓ0-based algorithms such as Orthogonal Matching Pursuit (OMP) [9] are also known to run fast. Dictionary learning for sparse coding is an unsupervised, data-driven process alternating between two optimizations: it first computes the sparse code of each training example using the current dictionary, and then uses the reconstruction error from the computed sparse codes to update each basis vector in the dictionary. We use the K-SVD algorithm [10] for dictionary learning.
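To make the feature extraction step concrete, the following is a minimal sketch of dictionary learning and sparse coding on embedded text patches. It uses scikit-learn's MiniBatchDictionaryLearning with OMP encoding as a stand-in for the K-SVD/OMP combination described above; the toy data, dimensions, and parameter values are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Toy stand-in for patches drawn from binary-bag-of-words embedded text:
# 500 patches of dimension N = 64, each with a few active word indicators.
rng = np.random.default_rng(0)
X = (rng.random((500, 64)) < 0.05).astype(float)

K, T = 256, 8          # overcomplete dictionary size K > N and target sparsity (illustrative)
learner = MiniBatchDictionaryLearning(
    n_components=K,
    transform_algorithm="omp",          # greedy l0 encoder used for sparse coding
    transform_n_nonzero_coefs=T,
    random_state=0,
)
Y = learner.fit(X).transform(X)          # Y[i] is the sparse code y for patch x_i
print(Y.shape, int(np.count_nonzero(Y[0])))   # (500, 256), at most T nonzeros per code
```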
2.2 Preprocessing patches

We can enhance the quality of unsupervised learning by preprocessing patches. For example, binary-bag-of-words embedding produces highly sparse bit vectors from which sparse coding has difficulty learning meaningful features. We can preprocess the patches of embedded text by removing the mean value and scaling by the standard deviation. Another technique, whitening, makes the input data less redundant so that dictionary learning can be more effective. Combining these into one integrated procedure, we preprocess a batch of n input patches taken from embedded text vectors by whitening:
1. Remove the mean: x^(i) := x^(i) − (1/n) Σ_{i=1}^{n} x^(i);
2. Compute the covariance matrix C = (1/(n−1)) Σ_{i=1}^{n} x^(i) x^(i)⊤;
3. Do the eigendecomposition [U, Λ] = eig(C);
4. Compute x_white = (Λ + εI)^{−1/2} U⊤ x, where ε is a small positive value for regularization.
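A direct NumPy transcription of this preprocessing might look like the sketch below; the regularization value eps and the batch layout (one patch per row) are illustrative choices.

```python
import numpy as np

def whiten_patches(X, eps=0.1):
    """PCA-whiten a batch of patches X (n_patches x N), following steps 1-4 above."""
    X = X - X.mean(axis=0)                 # 1. remove the mean
    C = (X.T @ X) / (X.shape[0] - 1)       # 2. covariance matrix
    lam, U = np.linalg.eigh(C)             # 3. eigendecomposition C = U diag(lam) U^T
    W = U / np.sqrt(lam + eps)             # 4. U (Lambda + eps*I)^(-1/2), regularized
    return X @ W                           # each row becomes (Lambda + eps*I)^(-1/2) U^T x
```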
2.3 Baseline geolocation method

Our baseline geolocation method consists of the following steps in the training phase.
1. (Text embedding) Perform binary-bag-of-words or word-sequence embedding of the text data using vocab;
2. (Unsupervised learning) Feed patches drawn from unlabeled embedded text vectors to sparse coding and learn the basis vectors of the dictionary D;
3. (Feature extraction) Using the dictionary D learned during unsupervised learning, for the labeled training patches {(x^(1), l^(1)), (x^(2), l^(2)), ...}, perform sparse coding on {x^(1), x^(2), ...} and obtain sparse codes {y^(1), y^(2), ...} with their associated labels;
4. (Feature pooling) Aggregate features by max pooling over a group of M sparse codes extracted from the same document and obtain the pooled sparse code z such that the jth element of z is z_j = max(y_{1,j}, y_{2,j}, ..., y_{M,j}), where y_{i,j} is the jth element of y_i, the ith sparse code in the pooling group;
5. (Tabularization of reference geocoordinates) Build a lookup table of geocoordinates associated with the pooled sparse codes z from labeled data patches. Note that each label contains geocoordinates in the form l^(i) = {lat, lon}.

The baseline method works as follows in the testing phase. When text data (tweets) of unknown geocoordinates arrive, we perform preprocessing, feature extraction via sparse coding, and max pooling. Using the max-pooled sparse code of the tweets, we find the k pooled sparse codes in the lookup table that are closest in cosine similarity and take the average geocoordinates of the k nearest neighbors.
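As an illustration of steps 4-5 and the testing phase, the sketch below max-pools a document's sparse codes and looks up the k nearest reference entries by cosine similarity; the array names (table_z, table_latlon) and the value of k are hypothetical.

```python
import numpy as np

def max_pool(codes):
    """Step 4: pool M sparse codes from one document into a single feature z (elementwise max)."""
    return np.max(codes, axis=0)

def knn_geolocate(z, table_z, table_latlon, k=20):
    """Testing phase: average the geocoordinates of the k reference entries most similar to z."""
    sims = (table_z @ z) / (np.linalg.norm(table_z, axis=1) * np.linalg.norm(z) + 1e-12)
    nearest = np.argsort(-sims)[:k]               # indices of the k most cosine-similar entries
    return table_latlon[nearest].mean(axis=0)     # simple average of the k-NN (lat, lon)
```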
2.4 Grid-based voting scheme for k-NN

We accompany k-NN with a simple voting scheme. We lay out a grid over all the k-NN geocoordinates, and each nearest neighbor casts a vote for its corresponding grid cell. We identify the cell that receives the most votes and take the average of the geocoordinates in the selected cell as the final geolocation estimate.
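A possible realization of the voting step, given the k-NN geocoordinates returned by the baseline lookup, is sketched below; the grid cell size is an illustrative parameter, not a value specified by the paper.

```python
import numpy as np

def grid_vote(knn_latlon, cell_deg=5.0):
    """Bin the k-NN geocoordinates into grid cells, pick the most-voted cell, average within it."""
    pts = np.asarray(knn_latlon)
    cells = np.floor(pts / cell_deg).astype(int)          # grid cell index for each neighbor
    keys, inverse, counts = np.unique(cells, axis=0, return_inverse=True, return_counts=True)
    winner = np.argmax(counts)                            # cell receiving the most votes
    return pts[inverse == winner].mean(axis=0)            # final (lat, lon) estimate
```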
3. GEOLOCATION FROM SUBSAMPLED TEXT
Using a linear subsampling matrix (uniform or random), we consider two blind subsampling strategies: subsampling after binary-bag-of-words embedding, and subsampling the raw text before embedding. Blind text subsampling trades off the amount of input data required for geolocation against prediction accuracy. This section proposes an enhancement that is robust to the degradation in geolocation accuracy caused by blind text subsampling.
3.1 Subsampling binary-bag-of-words

Given w_BW, we perform either uniform or random subsampling. For uniform subsampling, every αth component of w_BW is discarded for some integer α. Similarly, random subsampling discards the equivalent number of components chosen at random from w_BW. As a result, the subsampled embedded vector has a smaller dimension than the original binary bag-of-words vector. Subsampling w_BW has the same effect as embedding the raw text with a subsampled vocab.
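The two subsampling modes can be realized as in the sketch below; the convention for which indices count as "every αth component" is an assumption of this sketch.

```python
import numpy as np

def subsample_bbow(w_bw, alpha=2, mode="uniform", rng=None):
    """Discard components of a binary-bag-of-words vector: every alpha-th index (uniform),
    or the same number of indices chosen at random (random)."""
    V = w_bw.shape[0]
    drop = np.arange(alpha - 1, V, alpha)                 # every alpha-th component
    if mode == "random":
        rng = rng or np.random.default_rng()
        drop = rng.choice(V, size=drop.size, replace=False)
    keep = np.setdiff1d(np.arange(V), drop)
    return w_bw[keep]                                     # lower-dimensional subsampled vector
```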
Figure 1: Patches {x_f, x_s} from full and subsampled text.

Figure 2: Full system pipelines.
3.2 Subsampling raw text

Unlike w_BW, w_WS converts text to a numeric form while retaining the ordering of words, so subsampling w_WS amounts to subsampling the raw text. Uniform subsampling discards every αth word, and random subsampling discards the same number of words at random. The subsampled raw text is then embedded with the binary-bag-of-words scheme, resulting in a final subsampled vector of dimension V. The effect of subsampling raw text is more devastating than subsampling binary-bag-of-words vectors. In this section, we propose a variation of joint sparse coding that can recover the original text and refine the recovery from the subsampled text for better geolocation performance.
3.3 Joint sparse coding for upconversion and refinement

Let us denote by {x_f, x_s} the pair of patches drawn from binary-bag-of-words vectors embedded on the full and subsampled text, as seen in Figure 1. The sparse coding problems for x_f ∈ R^N and its subsampled counterpart x_s ∈ R^N are

$$\min_{D_f,\,y_f} \; \|x_f - D_f y_f\|_2^2 + \lambda_f \|y_f\|_1 \qquad (2)$$

$$\min_{D_s,\,y_s} \; \|x_s - D_s y_s\|_2^2 + \lambda_s \|y_s\|_1 \qquad (3)$$

where D_f and D_s are dictionaries learned from the full and subsampled text, respectively. Image super-resolution [4] takes advantage of a sparse code shared between the high- and low-resolution patches of the same image, such that one can recover a high-resolution image from the low-resolution version. Similarly, if the sparse code for the full and subsampled patches is shared (i.e., y_f = y_s), we can attempt to recover the full patch from the subsampled one. We formulate a new joint optimization that forces the sharing of the sparse code v = y_f = y_s between the full and subsampled pair of patches

$$\min_{D_u,\,D_d,\,v} \; \|x_f - D_u v\|_2^2 + \|x_s - D_d v\|_2^2 + \lambda_v \|v\|_1 \qquad (4)$$

where D_u is the upconversion dictionary and D_d the downconversion dictionary. In the unsupervised learning stage, we first solve for D_u, D_d, and v via joint sparse coding of Eq. (4). Using the learned D_u, we obtain an upconversion estimate x̂_f = D_u v. We can refine x̂_f in another joint optimization
$$\min_{D_r,\,D_q,\,w} \; \|x_f - D_r w\|_2^2 + \|\hat{x}_f - D_q w\|_2^2 + \lambda_w \|w\|_1. \qquad (5)$$
Here, we explicitly look for D_r and D_q that make the refinement of x̂_f possible. Using the joint sparse code w and D_r, we obtain the refined upconversion x̃_f = D_r w; the refinement step can be repeated iteratively. In the supervised learning stage, we perform sparse coding on (labeled) subsampled data patches using the learned D_d and obtain joint sparse codes v. The joint sparse codes are applied to feature pooling and tabularization of reference geocoordinates in steps 4 and 5 of the baseline geolocation method. When subsampled tweets of unknown geocoordinates arrive, the geolocation pipeline consists of upconversion and refinement; for enhanced geolocation with subsampled text, we use the refined feature vector w. Figure 2 illustrates our full system pipelines.
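The sketch below illustrates the upconversion idea of Eq. (4): a joint dictionary is learned on stacked pairs [x_f, x_s] so that one sparse code reconstructs both halves, and a subsampled patch alone is then encoded against D_d to estimate the full patch as D_u v. It uses scikit-learn's dictionary learner and OMP in place of K-SVD, and the function names and parameter values are illustrative; the refinement of Eq. (5) would repeat the same stacking with [x_f, x̂_f].

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import orthogonal_mp

def train_joint_dictionaries(Xf, Xs, K=256, T=8):
    """Learn D_u (full half) and D_d (subsampled half) sharing one sparse code per patch pair."""
    stacked = np.hstack([Xf, Xs])                          # each row is a pair [x_f, x_s]
    dl = MiniBatchDictionaryLearning(n_components=K, transform_algorithm="omp",
                                     transform_n_nonzero_coefs=T, random_state=0).fit(stacked)
    D = dl.components_                                     # K x (N_f + N_s), one joint atom per row
    return D[:, :Xf.shape[1]], D[:, Xf.shape[1]:]          # D_u, D_d

def upconvert(xs, Du, Dd, T=8):
    """Encode a subsampled patch against D_d, then estimate the full patch as D_u v."""
    norms = np.linalg.norm(Dd, axis=1) + 1e-12             # renormalize the subsampled half's atoms
    v = orthogonal_mp(Dd.T / norms, xs, n_nonzero_coefs=T) / norms
    return Du.T @ v                                        # upconversion estimate x_hat_f
```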
4. EVALUATION

In this section, we empirically evaluate the proposed approaches using the CMU geo-tagged microblog corpus [6]. We train our baseline geolocation method using the full data samples. We also train our method on uniformly and randomly subsampled raw text and binary-bag-of-words (embedded) text vectors to analyze the effect of subsampling on the accuracy degradation. We then discuss the improved geolocation performance achieved by our joint sparse coding method for upconversion and refinement on subsampled text data.
4.1 Data

GeoText is a Twitter dataset comprising 377,616 tweets by 9,475 users from the 48 contiguous US states and Washington, D.C. Each document in the dataset is a concatenation of the entire tweets by a single user collected over one week. All documents include the user location information provided as GPS-assigned latitude and longitude values. Each document is a sequence of integer numbers ranging from 1 to 5,216, where each number represents a position in vocab.
4.2 Experimental methodology

In all our experiments, we cut the dataset into five folds such that fold = user_id % 5, following Eisenstein et al. [1]. Folds 1-4 are used for training, and fold 5 for testing. We embed the entire text data from each user into binary-bag-of-words and word-sequence vectors and apply uniform or random subsampling with α = 2, 3, 4, and 6. We precondition patches with PCA whitening before sparse coding. After numerous experiments, we have determined to use a patch size of N = 64. We use OMP, a greedy ℓ0 sparse encoder, and K-SVD for dictionary learning. Other sparse coding parameters are also determined experimentally. We have used a dictionary size K ≈ 10N and a sparsity
level S for 0.1N ≤ S ≤ 0.4N (S is the number of nonzero elements in y). We use max pooling factors M in the tens. In supervised learning, we build the table of reference geocoordinates for k-NN, using the max-pooled sparse code as the feature for lookup. This results in a lookup table of more than 7,500 entries. We find good geolocation accuracy with 10 ≤ k ≤ 40. As the performance metric, we use the median distance error between the predicted and ground-truth geocoordinates measured in kilometers. We note that related previous research has regarded the median distance error as more important than the mean.
Figure 3: Median geolocation errors of subsampling binary-bag-of-words against various sampling factors.

Figure 5: Median geolocation errors of uniform subsampling and enhancements against various sampling factors.
Figure 4: Median geolocation errors of uniform and random subsampling of the raw text with α = 2.
It is required to approximate the great-circle distance between any two locations on Earth because of its curved surface; we use the Haversine formula [11].
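For reference, a standard Haversine implementation is sketched below; the mean Earth radius of 6371 km is an assumption not stated in the paper.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, r_earth=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi, dlam = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * r_earth * asin(sqrt(a))
```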
4.3 Results and discussion
In Figure 3, we present the effect of the grid-based voting scheme on the geolocation errors using uniformly subsampled binary-bag-of-words text. Notice that the grid-based voting scheme is robust to subsampling of binary-bag-of-words text. The effect of subsampling raw text, however, is stronger than that of subsampling binary-bag-of-words vectors, and we have found that the grid-based voting scheme is not as effective when applied to subsampled raw text. We therefore use our proposed upconversion and iterative refinement scheme by joint sparse coding, which is robust to subsampling of raw text.

We present the median geolocation errors of uniform and random 2x subsampling (i.e., α = 2) of the raw text in Figure 4. We gradually increase the number of refinement steps and observe the change in geolocation error. Applying our baseline geolocation method to the full text gives a 568 km error. As expected, discarding 50% of the words significantly increases the geolocation error: 794 km for uniform subsampling and 757 km for random subsampling. Remarkably, multiple iterations of the refinement step help mitigate the geolocation error caused by heavy subsampling. After five iterations of refinement, the geolocation error for uniform subsampling decreases to 652 km and for random subsampling to 636 km. Notice that these geolocation errors are only 84 km and 68 km higher than with the full text.

Figure 5 depicts the median geolocation errors against various uniform sampling factors (in 1/α) for subsampling only, subsampling with enhancement by upconversion, and subsampling with enhancements by both upconversion and single or multiple (5) iterations of refinement. As uniform and random subsampling results are comparable, we report only the uniform subsampling results. Compared to subsampling only, upconversion and multiple iterations of refinement after subsampling are more robust to an increased sampling factor.
5. CONCLUSION
We have presented a geolocation method based on sparse coding of microblog text data. On the GeoText dataset, we achieve median geolocation errors of 568 km on the full text and 636 km even under 2x subsampling. The proposed upconversion and iterative refinement scheme by joint sparse coding proves successful in drawing out the correlation between the full and subsampled text pair. The geolocation accuracy degrades by only 12% even when half of the words in the text are discarded. The main contributions of this paper are the refinement scheme and its application to geolocation for subsampled microblog text. For future work, we plan to improve the feature learning scheme by introducing hierarchy and to evaluate our methods on both US and worldwide data.
Acknowledgments
This work is supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE1144152, the Naval Supply Systems Command award under the Naval Postgraduate School Agreement No. N00244-15-0050, and gifts from the Intel Corporation.
6. REFERENCES
[1] J. Eisenstein, B. O'Connor, N.A. Smith, and E.P. Xing. A Latent Variable Model for Geographic Lexical Variation. In EMNLP, 2010.
[2] L. Hong, A. Ahmed, S. Gurumurthy, A.J. Smola, and K. Tsioutsiouliklis. Discovering Geographical Topics in the Twitter Stream. In WWW, 2012.
[3] C. Weidemann. Social Media Location Intelligence: The Next Privacy Battle—An ArcGIS Add-in and Analysis of Geospatial Data Collected from Twitter.com. Journal of Geoinfo., 2013.
[4] J. Yang, J. Wright, T.S. Huang, and Y. Ma. Image Super-Resolution via Sparse Representation. IEEE Trans. on Image Processing, 19(11):2861–2873, 2010.
[5] M. Cha, Y. Gwon, and H. T. Kung. Twitter Geolocation and Regional Classification via Sparse Coding. In ICWSM, 2015.
[6] GeoText. CMU Geo-tagged Microblog Corpus. http://www.ark.cs.cmu.edu/GeoText/, 2010.
[7] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 1994.
[8] S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic Decomposition by Basis Pursuit. SIAM Rev., 2001.
[9] J.A. Tropp and A.C. Gilbert. Signal Recovery From Random Measurements via Orthogonal Matching Pursuit. IEEE Trans. on Information Theory, 2007.
[10] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Trans. on Sig. Proc., 54(11), 2006.
[11] R.W. Sinnott. Virtues of the Haversine. Sky and Telescope, 1984.