Towards Social User Profiling: Unified and Discriminative Influence Model for Inferring Home Locations
Rui Li, Shengjie Wang, Hongbo Deng, Rui Wang, Kevin Chen-chuan Chang
University of Illinois at Urbana-Champaign
User profiling infers users’ essential attributes and is important for many services.
Example: a profile of the user Richard (Job: Student; Location: Champaign) lets search engines offer personalized search and advertisers offer targeted advertisement, among many other services.
This paper aims to profile Twitter users' home locations from both tweets and the following network.
Input: a user's tweets and following network.
Output: the user's home location (e.g., Location: Champaign).
A user's home location is defined as the place where most of his activities happen. It is different from a real-time geo position (e.g., the Starbucks on Green Street).
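In the context of Twitter, two types of resources are available:
User-centric data (tweets)
Social network data (the following network)
[Figure: example following network with the users Lady Gaga, TechCrunch, Richard, Cindy, Rob, and Jessie.]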
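The problem is difficult due to the scarce signal challenge.
Tweets: only 6% of messages contain location-related terms!
Following network: only 16% of users have locations on their profiles!
[Figure: the example network with profile locations where available. Node labels: Lady Gaga (New York), TechCrunch (unknown), Richard, Rob (unknown), Cindy, Jessie; the locations Champaign and San Francisco also appear.]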
The problem is difficult due to the noisy signal challenge.
Tweets: a user tweets about locations different from his home location.
Following network: a user follows friends who live in locations different from his home location.
[Figure: the same example network as on the previous slide.]
We propose a unified and discriminative probabilistic framework.
Scarce signal challenge: unify two types of resources as a Twitter graph.
Noisy signal challenge: model the likelihood of an edge between two nodes via a discriminative influence model.
Profile locations by maximizing the likelihood of observing the graph.
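As a hedged reading of the last step, and assuming that edges are treated as independent given the node locations (an assumption, not stated on the slide), the objective can be written as a sum of log edge probabilities over the edge set E:

\[
\hat{L} = \arg\max_{L} \sum_{e\langle n_j, n_i\rangle \in E} \log P\big(e\langle n_j, n_i\rangle \mid \theta_{n_i}, L(n_i)\big)
\]

Here L assigns a location to every node, the locations of labeled nodes are held fixed, and only the locations of unlabeled nodes are optimized. The symbols E and \hat{L} are introduced for illustration only.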
We unify two types of resources as a directed heterogeneous graph.
We unify two types of resources as nodes on a heterogeneous graph.
We model it as a directed graph, in which each edge runs from a tail node to a head node.
We associate locations with the nodes.
We aim to infer the locations of unlabeled nodes from the locations of labeled nodes.
[Figure: example graph with nodes u1 to u6, v1, and v2. Labeled nodes carry the locations New York, Beijing, Champaign, and San Francisco; unlabeled nodes are marked "?".]
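One possible in-memory representation of such a graph is sketched below. The class name `ProfilingGraph`, its methods, and the example coordinates are illustrative assumptions, not part of the original work; node ids only mirror the style of the figure labels.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Location = Tuple[float, float]  # assumed to be a (latitude, longitude) pair


@dataclass
class ProfilingGraph:
    """Hypothetical container for a directed graph with optional node locations."""
    # node id -> known location, or None for an unlabeled node
    locations: Dict[str, Optional[Location]] = field(default_factory=dict)
    # directed edges stored as (tail, head) pairs
    edges: List[Tuple[str, str]] = field(default_factory=list)

    def add_node(self, node: str, location: Optional[Location] = None) -> None:
        self.locations[node] = location

    def add_edge(self, tail: str, head: str) -> None:
        self.edges.append((tail, head))

    def unlabeled_nodes(self) -> List[str]:
        return [n for n, loc in self.locations.items() if loc is None]

    def labeled_neighbors(self, node: str) -> List[str]:
        """Labeled nodes that share an edge with `node`, in either direction."""
        found = []
        for tail, head in self.edges:
            other = head if tail == node else tail if head == node else None
            if other is not None and self.locations[other] is not None:
                found.append(other)
        return found


# Illustrative data only: approximate coordinates, made-up edges.
graph = ProfilingGraph()
graph.add_node("u1")                    # unlabeled node
graph.add_node("u2", (40.71, -74.01))   # roughly New York
graph.add_node("v1", (40.12, -88.24))   # roughly Champaign
graph.add_edge("u1", "u2")
graph.add_edge("u1", "v1")
```

Storing edges as (tail, head) pairs keeps the direction explicit, which matters later because the influence scope belongs to the head node of each edge.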
We observe two key characteristics of the probability of an edge between two nodes: how likely is it that a tail node n_j at L(n_j) builds an edge e to a head node n_i at L(n_i)?
[Figure: Spread of the word "Champaign". Axes: longitude, latitude, count.]
Observation 1: The probability decreases as the distance between the two nodes increases.
Observation 2: At the same distance, different head nodes (e.g., Chicago, Champaign) have different probabilities of attracting tail nodes.
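We propose a discriminative influence model to capture the two key characteristics.
Conceptual level (discriminative influence model): each node n_i has an influence parameter θ_{n_i}. Influence probabilities decrease from the center. Different nodes have different influence scopes.
Mathematical level (Gaussian model):
\[
P\big(e\langle n_j, n_i\rangle \mid \theta_{n_i}, L(n_i)\big) = \frac{1}{2\pi\sigma_{n_i}^{2}} \exp\left(-\frac{(x_{u_i} - x_{u_j})^{2} + (y_{u_i} - y_{u_j})^{2}}{2\sigma_{n_i}^{2}}\right)
\]
where (x_{u_i}, y_{u_i}) and (x_{u_j}, y_{u_j}) are the coordinates of L(n_i) and L(n_j), and σ_{n_i} is the influence scope of n_i given by θ_{n_i}.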
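The Gaussian model can be sketched in a few lines, assuming it is an isotropic two-dimensional Gaussian density centered at the head node, with the influence scope as its standard deviation. The function name `edge_probability` and the use of planar (x, y) coordinates are assumptions made for illustration.

```python
import math


def edge_probability(tail_xy, head_xy, sigma):
    """Isotropic 2D Gaussian density of the tail location around the head node.

    `sigma` plays the role of the head node's influence scope (assumed > 0).
    """
    dx = head_xy[0] - tail_xy[0]
    dy = head_xy[1] - tail_xy[1]
    squared_distance = dx * dx + dy * dy
    normalizer = 2.0 * math.pi * sigma ** 2
    return math.exp(-squared_distance / (2.0 * sigma ** 2)) / normalizer


# Observation 1: the value drops as the distance grows.
near = edge_probability((0.0, 0.0), (1.0, 0.0), sigma=1.0)
far = edge_probability((0.0, 0.0), (3.0, 0.0), sigma=1.0)

# Observation 2: at the same distance, a wider scope gives a different value.
far_wide_scope = edge_probability((0.0, 0.0), (3.0, 0.0), sigma=3.0)
```

A head node with a large scope spreads its probability mass over a wide area, so it is comparatively more likely to attract distant tail nodes and less likely to attract any particular nearby one.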
A local profiling algorithm profiles the location of a user via the edges from and to his labeled neighbors.
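It has a simple but efficient closed-form solution:
Influence scope: the average distance of a user's followers.
User location: a weighted average of the different resources.
[Figure: example graph with nodes u1, u2, u4, u5, v1, and v2; labeled nodes carry the locations New York, Beijing, Champaign, and San Francisco, and the node being profiled is marked "?".]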
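The two closed-form quantities can be sketched as follows. This is a simplified illustration, not the authors' exact estimator: it assumes planar coordinates, takes the influence scope to be the plain mean distance to labeled followers, and weights each labeled neighbor by 1/sigma^2, which is the maximizer of a sum of isotropic Gaussian log-likelihoods when the scopes are held fixed. The function names are hypothetical.

```python
import math


def influence_scope(head_xy, follower_xys):
    """Mean distance from a head node to its labeled followers (non-empty list)."""
    distances = [math.hypot(x - head_xy[0], y - head_xy[1]) for x, y in follower_xys]
    return sum(distances) / len(distances)


def local_profile(neighbors):
    """Estimate a location from labeled neighbors.

    `neighbors` is a non-empty list of ((x, y), sigma) pairs, where sigma is
    the influence scope that applies to the edge shared with that neighbor.
    Returns the precision-weighted average of the neighbor locations.
    """
    weights = [1.0 / (sigma ** 2) for _, sigma in neighbors]
    total = sum(weights)
    x = sum(w * xy[0] for w, (xy, _) in zip(weights, neighbors)) / total
    y = sum(w * xy[1] for w, (xy, _) in zip(weights, neighbors)) / total
    return (x, y)


# Two neighbors with equal scope: the estimate is their midpoint.
midpoint = local_profile([((0.0, 0.0), 1.0), ((2.0, 0.0), 1.0)])

# A neighbor with a wide scope pulls the estimate less.
skewed = local_profile([((0.0, 0.0), 1.0), ((3.0, 0.0), 2.0)])
```

Neighbors with a small influence scope are strong evidence for a nearby home location, so they receive large weights; neighbors with a wide scope (for example, celebrities followed from everywhere) contribute little.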
A global algorithm profiles all the users’ locations together via all the edges in the graph.
The local algorithm only uses limited information.
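Our global algorithm aims to use all the information. It is a complex but accurate iterative algorithm.
[Figure: example graph with nodes u1 to u6, v1, and v2; labeled nodes carry the locations Beijing, New York, Champaign, and San Francisco, and several unlabeled nodes are marked "?".]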
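One way to picture an iterative scheme of this kind is coordinate ascent: repeatedly re-estimate each unlabeled node from the current estimates of all its neighbors until nothing moves. The sketch below is an assumption-laden illustration, not the authors' algorithm: it keeps every influence scope fixed, uses planar coordinates, and starts unlabeled nodes at the centroid of the labeled ones. The function name `global_profile` and its parameters are hypothetical.

```python
def global_profile(locations, edges, sigma, max_iters=100, tol=1e-9):
    """Jointly estimate locations of unlabeled nodes.

    locations: dict node -> (x, y), or None for unlabeled nodes
               (at least one node must be labeled).
    edges:     list of (tail, head) pairs.
    sigma:     dict node -> influence scope, used when the node is a head.
    """
    labeled = [loc for loc in locations.values() if loc is not None]
    centroid = (sum(p[0] for p in labeled) / len(labeled),
                sum(p[1] for p in labeled) / len(labeled))
    unlabeled = [n for n, loc in locations.items() if loc is None]
    estimate = {n: (loc if loc is not None else centroid)
                for n, loc in locations.items()}

    # Each edge contributes weight 1 / sigma_head^2 to both of its endpoints.
    adjacent = {n: [] for n in locations}
    for tail, head in edges:
        weight = 1.0 / (sigma[head] ** 2)
        adjacent[tail].append((head, weight))
        adjacent[head].append((tail, weight))

    for _iteration in range(max_iters):
        largest_shift = 0.0
        for node in unlabeled:
            if not adjacent[node]:
                continue  # isolated node: keep the initial guess
            total = sum(w for _, w in adjacent[node])
            x = sum(w * estimate[m][0] for m, w in adjacent[node]) / total
            y = sum(w * estimate[m][1] for m, w in adjacent[node]) / total
            shift = abs(x - estimate[node][0]) + abs(y - estimate[node][1])
            largest_shift = max(largest_shift, shift)
            estimate[node] = (x, y)
        if largest_shift < tol:
            break
    return estimate


# Chain a - b - c - d with only the endpoints labeled.
result = global_profile(
    locations={"a": (0.0, 0.0), "b": None, "c": None, "d": (3.0, 0.0)},
    edges=[("b", "a"), ("c", "b"), ("c", "d")],
    sigma={"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0},
)
```

In the chain example, node b has only one labeled neighbor, so a purely local method would place it on top of a. Propagating estimates through the unlabeled node c lets information from d reach b as well.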
We incorporate additional knowledge as constraints for maximizing the likelihood function.
Additional Knowledge: e.g., users only live in cities or towns
Constrained optimization: we maximize the likelihood in each method under these constraints.
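With a constraint such as "users only live in cities or towns", the simplest form of constrained maximization is to restrict the search to a finite candidate list and pick the candidate with the highest likelihood. The sketch below assumes the Gaussian edge model with fixed scopes and planar coordinates; the function names and the candidate data are hypothetical.

```python
import math


def log_edge_probability(xy, neighbor_xy, sigma):
    """Log of an isotropic 2D Gaussian density with scale sigma."""
    dx = xy[0] - neighbor_xy[0]
    dy = xy[1] - neighbor_xy[1]
    return -math.log(2.0 * math.pi * sigma ** 2) - (dx * dx + dy * dy) / (2.0 * sigma ** 2)


def best_candidate(candidates, neighbors):
    """Pick the candidate place that maximizes the summed log-likelihood.

    candidates: dict name -> (x, y) of allowed places (the constraint set).
    neighbors:  list of ((x, y), sigma) pairs for labeled neighbors.
    """
    def score(xy):
        return sum(log_edge_probability(xy, nxy, s) for nxy, s in neighbors)

    return max(candidates, key=lambda name: score(candidates[name]))


# Made-up example: two allowed places, three labeled neighbors.
places = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
evidence = [((1.0, 0.0), 1.0), ((2.0, 0.0), 1.0), ((9.0, 0.0), 1.0)]
choice = best_candidate(places, evidence)
```

The unconstrained weighted average for this example would fall between the two places, where no city exists; restricting the search to the candidate set returns a valid place instead.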
We compare our methods with state-of-the-art methods on a large Twitter corpus.
Data set: We crawled a subset of Twitter and used the users who have locations on their profiles. There are 139K users, 50 million tweets, and 2 million following relationships.
Methods: user-based location profiling and content-based location profiling.
Our algorithms outperform the baseline methods because we model edges discriminatively.
Our algorithms can take advantage of modeling two different types of resources.
The global profiling algorithm further improves on the local profiling algorithm.
Conclusion and Future Work
We explore both social network and user-centric data for profiling users' locations in a unified approach.
We introduce a discriminative influence model.
We develop two effective profiling methods and extend them by modeling constraints.
The framework could be further extended to profile other attributes.