Latent Geographic Feature Extraction from Social Media

Christian Sengstock*, Michael Gertz
Database Systems Research Group
Heidelberg University, Germany

November 8, 2012

Definitions

Data Charac and Norm

Latent Geographic Feature Extraction

Experiments

Conclusions

Social Media is a huge and increasing source of unstructured and uncertain geographic information

Effort to make data usable:

(Structured) Information Extraction
  Place/event extraction from Flickr [Rattenbury SIGIR'07]
  Event trajectory extraction from Twitter [Sakaki WWW'10]
Spatial Analysis
  Spatio-temporal forecasting using Flickr [Jin MM'12]
  Study ecological phenomena [Zhang WWW'12]


General Motivation:

Extract spatial variables from unstructured and noisy geographic information sources (Flickr, Twitter, Wikipedia)

[Figure: Φ(l) and |Φ(l)| over geographic space, axes l1 and l2]

This work:

Framework for unsupervised extraction of informative spatial variables (dimensions of geographic semantics) from Social Media


Outline

1 Definitions and Problem Statement
2 Data Characteristics and Normalization
3 Latent Geographic Feature Extraction
4 Experiments
5 Conclusions


Outline

1 Definitions and Problem Statement
    Geographic Feature
    Signal Estimation
    Problem Statement
2 Data Characteristics and Normalization
    Distribution Characteristics
    Normalization
    Geographic Feature Types
3 Latent Geographic Feature Extraction
    Dimensionality Reduction
    Framework
4 Experiments
    Technique Comparison
    Normalization Influence
    Exploration Task
5 Conclusions


Terminology

Geographic Feature f:
  A dimension representing some semantics of a location (e.g., temperature, population, number of restaurants)
  Sampled (measured) at any location l in geographic space W (→ spatial variable)

Geographic Feature Sensor φ and Signal φ(l) of f:

\[ \phi : W \to \mathbb{R}^{+}, \qquad l \mapsto \phi(l) \]


Terminology

A set of geographic features f_1, ..., f_p defines a Multivariate Geographic Feature Sensor:

\[ \Phi := (\phi_1, \dots, \phi_p)^T \]

A spatial sampling scheme (measurements) L = (l_1, ..., l_n) defines a Location Sampling Matrix:

\[
X_{n \times p} = (\Phi(l_1), \dots, \Phi(l_n))^T =
\begin{pmatrix}
\phi_1(l_1) & \cdots & \phi_p(l_1) \\
\vdots & \ddots & \vdots \\
\phi_1(l_n) & \cdots & \phi_p(l_n)
\end{pmatrix}
\]
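The definition of the Location Sampling Matrix can be illustrated with a minimal sketch. The two sensors below (`phi_1`, `phi_2`) and the sample locations are hypothetical toy values chosen only to show the shape of X; they are not taken from the slides.

```python
def sampling_matrix(sensors, locations):
    """Build the n x p matrix X with X[j][i] = phi_i(l_j).

    sensors:   list of p callables, each mapping a location to a value in R+
    locations: list of n locations (here: (l1, l2) coordinate pairs)
    """
    return [[phi(l) for phi in sensors] for l in locations]


# Hypothetical toy sensors, for illustration only.
def phi_1(l):
    return abs(l[0])


def phi_2(l):
    return l[1] ** 2


locations = [(0.0, 1.0), (2.0, 3.0), (-4.0, 0.5)]
X = sampling_matrix([phi_1, phi_2], locations)
# Row j is Phi(l_j)^T, column i is the spatial signal of feature f_i.
```

Each row corresponds to one sampled location and each column to one geographic feature, matching the n × p layout above.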


Terminology

A Social Media Collection D consists of documents:

\[ d_i = (X, u, l, t) \]

  X: Bag of document features (terms, tags, image features, ...)
  u: User
  l: Location
  t: Timestamp

Assumption: Features with geographic meaning aggregate in subsets of geographic space → high signal


Signal Estimation

Every document feature f_1, ..., f_p is a possibly meaningful/meaningless geographic feature

Intuition of geographic feature signal φ_i(l): Number of users using feature f_i around location l ∈ W [1]

Estimation of φ_i by a non-parametric 2D-histogram estimator on a regular grid C of bandwidth w
  Small w → Capture small scale variation/phenomena
  Large w → Capture large scale variation/phenomena

[1] motivated in next section
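A minimal sketch of such a histogram estimator, assuming documents are given as (features, user, (longitude, latitude)) tuples and that grid cells are squares of width w in degrees. The function name `estimate_signals` and the sample documents are hypothetical and only serve the illustration.

```python
import math
from collections import defaultdict


def estimate_signals(documents, w):
    """Count distinct users per (grid cell, feature).

    documents: iterable of (features, user, (lon, lat)) tuples
    w:         grid cell width (bandwidth) in degrees
    Returns a dict mapping ((cell_x, cell_y), feature) -> number of users.
    """
    users = defaultdict(set)
    for features, user, (lon, lat) in documents:
        # Assign the document to the cell of the regular grid it falls into.
        cell = (math.floor(lon / w), math.floor(lat / w))
        for f in set(features):
            users[(cell, f)].add(user)
    return {key: len(u) for key, u in users.items()}


# Hypothetical sample documents.
docs = [
    (["beach"], "u1", (-118.495, 34.015)),
    (["beach", "sunset"], "u1", (-118.493, 34.012)),
    (["beach"], "u2", (-118.497, 34.018)),
    (["beach"], "u3", (-118.205, 34.055)),
]

fine = estimate_signals(docs, 0.01)   # small w: small scale variation
coarse = estimate_signals(docs, 1.0)  # large w: large scale variation
```

Counting users rather than documents means that the two documents by `u1` contribute only once to the 'beach' signal of their cell. With the larger bandwidth, all four documents fall into a single cell.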


Problem Statement

Problem: Given high-dimensional geographic feature signal Φ from a Social Media collection (all terms/tags)
  → Features might be meaningless, redundant, noisy

Goal: Unsupervised extraction of a small number of informative geographic features

Applications:
  Prepare data for learning tasks that cannot handle high-dimensional data
  Discover hidden spatial variables in the data


Dataset

Two Flickr datasets covering US and LA
Document features: Tags (pre-filtered by minimum user frequency)
Spatial resolution: US (1.0 degree), LA (0.01 degree)


Spatial Distribution Characteristics

Figure: F(l): Num of features, D(l): Num of documents, U(l): Num of users, F_d(l): Num of distinct features.

Exponential characteristics of spatial feature distribution
Users ∼ distinct features / documents ∼ features


Spatial Feature Distribution: 'beach'

Figure: F(l, f): number of occurrences of feature f = beach, U(l, f): number of users using f = beach

Some users contribute a large number of documents
Estimating the signal on the basis of users is less biased (more robust)


Normalization

Exponential distribution characteristics → Few locations dominate the signals' spatial distribution
Normalization transforms the signal into a more natural domain

Logging:

\[ \phi'_i(l) := \log \phi_i(l) + 1 \]

Binarization:

\[ \phi'_i(l) := \mathbb{1}\{\phi_i(l) > 0\} \]
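The two normalizations can be sketched as follows. The sketch reads the logging rule as log(φ + 1), which is an assumption: it keeps cells with a zero count finite, whereas the slide's formula does not show explicit parentheses. The function names are hypothetical.

```python
import math


def log_normalize(signal):
    """Logging, read here as log(phi + 1) so that zero counts map to 0 (assumption)."""
    return [math.log(v + 1.0) for v in signal]


def binarize(signal):
    """Binarization: 1 where the signal is positive, 0 elsewhere."""
    return [1 if v > 0 else 0 for v in signal]


# Hypothetical user counts of one feature over five grid cells.
counts = [0, 1, 3, 250, 0]
logged = log_normalize(counts)
binary = binarize(counts)
```

Both transformations reduce the influence of the few locations with very large counts: logging compresses them, binarization discards the magnitude entirely.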


Geographic Feature Types

Geographic Feature Types: Classes of geographic features with similar geographic semantics [Sengstock ACMGIS'11]
  Global: Same intensity as baseline distribution (number of users) → Not interesting to discriminate between locations
  Regional: Widely spread in geographic space but different from baseline → Interesting to discriminate between large subsets in geographic space
  Landmark: Occurring only in small subsets of geographic space → Interesting to discriminate between single small subset and the rest

Depends on area of interest W and spatial resolution w.


Geographic Feature Types

Entropy over locations of spatial signal X_i as geographic feature type statistic for f_i:
  large entropy → Signal widely spread / smoothly distributed
  small entropy → Signal peaky / occurs in small areas

Figure: Ordered entropies H[X_i] for tag features of US Flickr dataset
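As a rough illustration of the entropy statistic, the sketch below normalizes a signal over locations to a probability distribution and computes its Shannon entropy. The normalization step and the use of the natural logarithm are assumptions; the slides do not state how H[X_i] is computed in detail. The function name `spatial_entropy` is hypothetical.

```python
import math


def spatial_entropy(signal):
    """Shannon entropy (natural log) of a non-negative signal over locations.

    The signal is first normalized to sum to 1 (assumption); empty or
    all-zero signals are given entropy 0.
    """
    total = sum(signal)
    if total <= 0:
        return 0.0
    h = 0.0
    for v in signal:
        if v > 0:
            p = v / total
            h -= p * math.log(p)
    return h


# Hypothetical signals over four grid cells.
widely_spread = [5, 5, 5, 5]  # same intensity everywhere
peaky = [20, 0, 0, 0]         # concentrated in a single cell
```

The widely spread signal reaches the maximum entropy for four cells, log 4, while the peaky signal has entropy 0, in line with the large/small entropy reading above.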


Dimensionality Reduction

Describe high-dimensional data by k