Latent Geographic Feature Extraction from Social Media
Christian Sengstock*, Michael Gertz
Database Systems Research Group
Heidelberg University, Germany
November 8, 2012
Definitions
Data Charac and Norm
Latent Geographic Feature Extraction
Experiments
Conclusions
Social Media is a huge and increasing source of unstructured and uncertain geographic information
Effort to make data usable:
(Structured) Information Extraction
  Place/event extraction from Flickr [Rattenbury SIGIR'07]
  Event trajectory extraction from Twitter [Sakaki WWW'10]
Spatial Analysis
  Spatio-temporal forecasting using Flickr [Jin MM'12]
  Study ecological phenomena [Zhang WWW'12]
General Motivation:
Extract spatial variables from unstructured and noisy geographic information sources (Flickr, Twitter, Wikipedia)
Figure: signal $\Phi(l)$ and its magnitude $|\Phi(l)|$ over geographic space (axes $l_1$, $l_2$)
This work:
Framework for unsupervised extraction of informative spatial variables (dimensions of geographic semantics) from Social Media
Outline
1 Definitions and Problem Statement
2 Data Characteristics and Normalization
3 Latent Geographic Feature Extraction
4 Experiments
5 Conclusions
Outline
1 Definitions and Problem Statement
  Geographic Feature
  Signal Estimation
  Problem Statement
2 Data Characteristics and Normalization
  Distribution Characteristics
  Normalization
  Geographic Feature Types
3 Latent Geographic Feature Extraction
  Dimensionality Reduction
  Framework
4 Experiments
  Technique Comparison
  Normalization Influence
  Exploration Task
5 Conclusions
Terminology
Geographic Feature $f$:
  A dimension representing some semantics of a location (e.g., temperature, population, number of restaurants)
  Sampled (measured) at any location $l$ in geographic space $W$ (→ spatial variable)
Geographic Feature Sensor $\phi$ and Signal $\phi(l)$ of $f$:
  $\phi: W \to \mathbb{R}^+, \quad l \mapsto \phi(l)$
Terminology
Set of geographic features $f_1, \dots, f_p$ defines a Multivariate Geographic Feature Sensor:
  $\Phi := (\phi_1, \dots, \phi_p)^T$
Spatial sampling scheme (measurements) $L = (l_1, \dots, l_n)$ defines a Location Sampling Matrix:
  $X_{n \times p} = (\Phi(l_1), \dots, \Phi(l_n))^T = \begin{pmatrix} \phi_1(l_1) & \cdots & \phi_p(l_1) \\ \vdots & \ddots & \vdots \\ \phi_1(l_n) & \cdots & \phi_p(l_n) \end{pmatrix}$
Terminology
A Social Media Collection $D$ consists of documents:
  $d_i = (X, u, l, t)$
  $X$: Bag of document features (terms, tags, image features, ...)
  $u$: User
  $l$: Location
  $t$: Timestamp
Assumption: Features with geographic meaning aggregate in subsets of geographic space → high signal
Signal Estimation
Every document feature $f_1, \dots, f_p$ is a possibly meaningful/meaningless geographic feature
Intuition of geographic feature signal $\phi_i(l)$:
  Number of users using feature $f_i$ around location $l \in W$ (motivated in next section)
Estimation of $\phi_i$ by non-parametric 2D-histogram estimator on regular grid $C$ of bandwidth $w$
  Small $w$ → Capture small scale variation/phenomena
  Large $w$ → Capture large scale variation/phenomena
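The histogram estimator described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the tuple layout `(features, user, (lon, lat))`, the function name `estimate_signals`, and the sample documents are all hypothetical, and the timestamp is left out because the count does not use it.

```python
import math
from collections import defaultdict


def estimate_signals(documents, w):
    """Count distinct users per (feature, grid cell).

    documents: iterable of (features, user, (lon, lat)) tuples (assumed layout).
    w: grid bandwidth, i.e. the cell width in degrees.
    Returns {feature: {cell: number_of_distinct_users}}.
    """
    users = defaultdict(set)  # (feature, cell) -> set of users seen there
    for features, user, (lon, lat) in documents:
        # Map the location to its cell on a regular grid of width w.
        cell = (math.floor(lon / w), math.floor(lat / w))
        for f in set(features):
            users[(f, cell)].add(user)

    signals = defaultdict(dict)
    for (f, cell), seen in users.items():
        signals[f][cell] = len(seen)
    return signals


# Hypothetical sample documents.
docs = [
    ({"beach", "sunset"}, "alice", (-118.49, 34.01)),
    ({"beach"}, "alice", (-118.49, 34.01)),  # same user, same cell: counted once
    ({"beach"}, "bob", (-118.49, 34.01)),
    ({"museum"}, "bob", (-118.24, 34.05)),
]
signals = estimate_signals(docs, w=0.01)
```

Calling `estimate_signals(docs, w=1.0)` on the same documents merges both locations into a single cell, which illustrates how a larger bandwidth captures larger-scale variation.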
Problem Statement
Problem: Given high-dimensional geographic feature signal $\Phi$ from a Social Media collection (all terms/tags)
  → Features might be meaningless, redundant, noisy
Goal: Unsupervised extraction of small number of informative geographic features
Applications:
  Prepare data for learning tasks that cannot handle high-dimensional data
  Discover hidden spatial variables in the data
Dataset
Two Flickr datasets covering US and LA
Document features: Tags (pre-filtered by minimum user frequency)
Spatial resolution: US (1.0 degree), LA (0.01 degree)
Spatial Distribution Characteristics
Figure: $F(l)$: Num of features, $D(l)$: Num of documents, $U(l)$: Num of users, $F_d(l)$: Num of distinct features.
Exponential characteristics of spatial feature distribution
Users ∼ distinct features / documents ∼ features
Spatial Feature Distribution: 'beach'
Figure: $F(l, f)$: Number of feature $f$ = beach, $U(l, f)$: Number of users using $f$ = beach
Some users contribute a large number of documents
Estimating the signal on the basis of users is less biased (more robust)
Normalization
Exponential distribution characteristics → Few locations dominate the signals' spatial distribution
Normalization transforms the signal into a more natural domain
Logging: $\phi'_i(l) := \log \phi_i(l) + 1$
Binarization: $\phi'_i(l) := \mathbf{1}\{\phi_i(l) > 0\}$
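The two normalizations can be illustrated with a short Python sketch. One assumption needs flagging: the logging formula on the slide reads "log φ_i(l) + 1", and the sketch takes this to mean log(φ_i(l) + 1), since that reading stays finite for cells with zero users; the slide text does not settle this. The function names and the sample counts are hypothetical.

```python
import math


def log_normalize(signal):
    # ASSUMPTION: the slide's formula is read as log(phi + 1), so that
    # cells with a zero count map to 0 instead of minus infinity.
    return [math.log1p(v) for v in signal]


def binarize(signal):
    # Indicator 1{phi > 0}: keep only whether the feature occurs in a cell.
    return [1 if v > 0 else 0 for v in signal]


# Hypothetical user counts for one feature over four grid cells;
# one cell dominates, as in an exponential-like distribution.
raw = [0, 1, 9, 999]
logged = log_normalize(raw)
binary = binarize(raw)
```

In this example the dominant cell is 999 times larger than the smallest non-zero cell before normalization; after logging the ratio shrinks to roughly 10, and after binarization all non-zero cells are equal.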
Geographic Feature Types
Geographic Feature Types: Classes of geographic features with similar geographic semantics [Sengstock ACMGIS'11]
Global: Same intensity as baseline distribution (number of users)
  → Not interesting to discriminate between locations
Regional: Widely spread in geographic space but different from baseline
  → Interesting to discriminate between large subsets in geographic space
Landmark: Occurring only in small subsets of geographic space
  → Interesting to discriminate between single small subset and the rest
Depends on area of interest $W$ and spatial resolution $w$.
Geographic Feature Types
Entropy over locations of spatial signal $X_i$ as geographic feature type statistic for $f_i$:
  large entropy → Signal widely spread / smoothly distributed
  small entropy → Signal peaky / occurs in small areas
Figure: Ordered entropies $H[X_i]$ for tag features of US Flickr dataset
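A small Python sketch of the entropy statistic follows. It assumes Shannon entropy (natural logarithm) of the signal after normalizing it to a probability distribution over locations; the slides do not spell out the normalization or the log base, so both are assumptions, as is the function name `spatial_entropy`.

```python
import math


def spatial_entropy(signal):
    """Shannon entropy of one feature's signal over locations.

    signal: non-negative values, one per location (a column X_i).
    ASSUMPTION: the signal is normalized to sum to 1 before the
    entropy is computed; an all-zero signal is given entropy 0.
    """
    total = sum(signal)
    if total == 0:
        return 0.0
    entropy = 0.0
    for v in signal:
        if v > 0:  # zero cells contribute nothing (0 * log 0 := 0)
            p = v / total
            entropy -= p * math.log(p)
    return entropy


# Hypothetical signals over four locations.
spread = spatial_entropy([1, 1, 1, 1])  # evenly spread feature
peaky = spatial_entropy([0, 0, 5, 0])   # feature confined to one cell
```

The evenly spread signal reaches the maximum entropy for four locations, log 4, while the signal confined to a single cell has entropy 0, which matches the spread versus peaky distinction on the slide.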
Dimensionality Reduction Describe high-dimensional data by k