Geometric & Network based Summarization
Atanu Roy and Akash Agrawal
Department of Computer Science & Engineering, University of Minnesota
October 14, 2011
Abstract
Given a spatial network, a spatial neighborhood, a collection of activities, and a number of clusters to identify, our proposed problem is to advise users whether to use a network-based summarization or a geometry-based one. Network summarization is an important societal problem and has been used by authorities to curb crime [16, 7, 13] and respond to natural disasters [18, 5, 1]. Oliver et al. [18] showed that, in specific cases, network-based summarization is a better approach than geometry-based summarization, whereas the use of geometry-based approaches over the last two decades [10] indicates their acceptance among environmental criminologists and the authorities [7]. Thus, defining a classifier that gives the authorities insight into which summarization to use in a specific network is a problem of high societal importance. Previous works have focused on either geometry-based or network-based approaches; none of the referenced works combines the potential of both approaches or provides a decision metric. Network summarization is computationally challenging because of the exponential number, $\binom{n}{k}$, of k-subsets of all possible paths in a spatial network.
1 Problem
1.1 Definition
Eck et al. [7] proposed using the K-means [15] clustering algorithm to cluster crime hot spots. Oliver et al. [18] argued that in places with low network connectivity, KMR is a better alternative for law enforcement agencies than [13]. In this research we are interested in the following questions:
1. At what crime hot spot density is the KMR approach of Oliver et al. [18] better than the geometry-based approaches of Eck et al., or vice versa?
2. How can we generate a measure/classifier that helps us make the above-mentioned classification?
In a nutshell, given a spatial network and the number of clusters (be they geometric or network-based), the algorithm has to decide dynamically which areas would be better served by a geometry-based summarization and which by a network-based summarization. In crime analysis, an activity can be the location of a crime (e.g. assault). In disaster response-related applications, an activity might be the location of a request for relief supplies.
Figure 1: Example of our problem
Figure 1 is an example of a typical dataset we intend to use for this project. The output marks the densely connected region with a geometry-based cluster and the sparsely connected region with a network-based summarization. In a subsequent section, we will provide a sample solution to the dataset used in Figure 1.
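As one way to make the inputs and outputs of this problem concrete, the following Python sketch writes them down as a small data model. All names (ProblemInput, Summary, SummaryKind, is_valid) are hypothetical and chosen only for illustration; the proposal itself does not fix a representation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple

Node = str
Point = Tuple[float, float]


class SummaryKind(Enum):
    GEOMETRIC = "geometric"  # e.g. an ellipse around a dense cluster
    NETWORK = "network"      # e.g. a route in the spatial network


@dataclass
class ProblemInput:
    nodes: Dict[Node, Point]        # node id -> planar coordinates
    edges: List[Tuple[Node, Node]]  # undirected segments between nodes
    activities: List[Point]         # e.g. crime or relief-request locations
    k: int                          # number of clusters to identify


@dataclass
class Summary:
    kind: SummaryKind
    activities: List[Point]         # activities covered by this summary


def is_valid(problem: ProblemInput) -> bool:
    """Return True if k is positive and every edge joins two known nodes."""
    if problem.k < 1:
        return False
    return all(u in problem.nodes and v in problem.nodes
               for u, v in problem.edges)
```

Under this sketch, a solution would be a list of Summary objects whose kinds may differ from one region of the map to another.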
1.2 Importance
Crime analysis using crime hot spots [13] and geometry-based summarization is a well-studied problem. Its long existence [10] and the continued research on the problem [16, 13, 12, 7] over the last few decades indicate [7, 13] its acceptance and popularity among authorities and criminologists alike. Oliver et al. suggested that geometry-based summarization is limited when the underlying spatial network is not well connected. They [18] proposed a solution for such scenarios using spatial network activity summarization. We hypothesize that combining both approaches will improve solution quality, measured by the number of activities that are clustered. The authorities can then use the solution to monitor hot spots in any given spatial neighborhood, whether for patrolling crime hot spots or for reaching a larger population while providing aid.
1.3 Challenges
Figure 2: Classification of Related Works
Spatial network activity summarization (SNAS) is NP-complete (see footnote 1). In this problem we propose to find k paths out of a possible n paths. This is computationally challenging since the search space has size $\binom{n}{k}$. Thus the number of possible choices of k paths is exponentially large.
Footnote 1: NP-completeness of SNAS is proven in an unpublished article submitted to a peer-reviewed journal. Thus we are unable to cite its source.
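To give a sense of how quickly this search space grows, the following short Python sketch evaluates $\binom{n}{k}$, the number of ways to choose k paths out of n candidate paths. The function name and the values of n and k are illustrative assumptions, not figures taken from our datasets.

```python
import math


def search_space_size(n: int, k: int) -> int:
    """Number of ways to choose k summary paths out of n candidate paths."""
    return math.comb(n, k)


# Print the number of candidate k-subsets for a few illustrative sizes.
for n in (20, 100, 1000):
    print(n, search_space_size(n, 5))
```

Even for a small k, the count rises steeply with n, which is why an exhaustive search over all k-subsets is impractical.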
2 Approaches
2.1 Description
Geometry-based summarization has received a lot of research attention from both theoretical and application domains. Recent years, however, have seen a considerable surge in network-based summarization research [17, 6, 4, 18]. An intuitive and rudimentary approach is to combine both algorithms. First, we propose to apply the KMR algorithm of Oliver et al. [18] over the whole spatial network. Next, we take the output and calculate the density of the paths reported by KMR, where density is defined as the number of activities per unit area. If the density of a path is above a certain threshold (θ), we use CrimeStat [13] to identify our geometry-based clusters. With this approach, we are able to identify both the k network paths and the densely active ellipses.
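The thresholding step can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than our final method: the activities assigned to each KMR path are taken as given, the "unit area" is approximated by the axis-aligned bounding box of those activities, and the names choose_summarization and bounding_box_area are hypothetical. Neither KMR nor CrimeStat is invoked here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def bounding_box_area(points: Sequence[Point]) -> float:
    """Area of the axis-aligned bounding box of the points (an assumed area measure)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def choose_summarization(path_activities: Sequence[Sequence[Point]],
                         theta: float) -> List[str]:
    """Label each path 'geometric' or 'network' by its activity density."""
    labels = []
    for points in path_activities:
        area = bounding_box_area(points) if points else 0.0
        if area == 0.0:
            # No measurable area (empty, single point, or axis-aligned line):
            # keep the network-based summary. This default is an arbitrary choice.
            labels.append("network")
            continue
        density = len(points) / area  # activities per unit area
        labels.append("geometric" if density > theta else "network")
    return labels


# Toy input: one tight group and one spread-out group of activities.
dense = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]       # 4 activities, area 1
sparse = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]  # 4 activities, area 100
print(choose_summarization([dense, sparse], theta=1.0))  # prints ['geometric', 'network']
```

The choice of area measure and of θ strongly affects the labels, so both would need to be settled during experimentation.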
2.2 Novelty
Summarization of activities using some kind of grouping is a well-studied problem [6, 18, 7, 8, 17, 19, 14, 4, 21]. Previous approaches can be divided into two broad categories, namely geometry-based and network-based [18]. Figure 2 shows a classification tree of the related works in this area. Geometry-based approaches like CrimeStat [12, 7] cluster data based on the geometric location of the points. They use a regular geometric shape, such as a circle or an ellipse, to group high-activity areas in domains like crime analysis. CrimeStat uses K-means [15] and hierarchical nearest-neighbour clustering [11] to cluster these activities based on the Euclidean distances between them. Eck et al. [7] is a comprehensive report presented to the US Department of Justice, in which the authors perform an experimental evaluation on data from the London Metropolitan Police Force's Crime Report Information System for Hackney Borough Police for the period June 1999 through August 1999. This approach uses Euclidean distance and does not account for the underlying spatial network.

At the other end of the spectrum are network-based approaches. Instead of grouping points based on their distribution in planar space, these algorithms [18, 6, 4, 20, 17] exploit the regularities in the underlying spatial network to group activities. They can take into account graph properties such as edge connectivity and directionality while clustering hot spots. Approaches like [4, 20, 17] find a single path with the objective of maximizing the activities on the output path. On the other hand, approaches like [6, 18] can output k paths, where k is a user-defined value. Celik et al. [6], in their work on mean streets, use a Poisson distribution to model criminal hot spots. Oliver et al. [18] use KMR, which allocates activities to the nearest summary path and computes the clusters of hot spots along the given paths, with the objective of outputting k paths having the maximum activity.

In conclusion, both approaches suffer from their own deficiencies. Geometry-based approaches fail to capture the underlying spatial network and thus form clusters that can be exploited only in a highly connected network. The clusters produced by network-based approaches will not be as dense as those of geometry-based approaches, but they can exploit the underlying spatial network. Gutteridge et al. [9] showed in 2003 that spatial clustering can be used not only to group hot spots on geographic maps, but also in conjunction with neural networks to predict the location of active sites in enzymes.
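The difference between the two families comes down largely to the distance being used. The following Python sketch contrasts the Euclidean distance between two locations with their shortest-path distance over a toy road network, computed with Dijkstra's algorithm. The network, the node names and the helper functions are invented for illustration and do not come from any of the cited systems.

```python
import heapq
import math
from typing import Dict, List, Tuple

Node = str
Point = Tuple[float, float]


def euclidean(a: Point, b: Point) -> float:
    """Straight-line distance in the plane."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def network_distance(coords: Dict[Node, Point],
                     edges: List[Tuple[Node, Node]],
                     source: Node, target: Node) -> float:
    """Shortest-path length along undirected edges (Dijkstra's algorithm)."""
    adjacency: Dict[Node, List[Tuple[Node, float]]] = {n: [] for n in coords}
    for u, v in edges:
        w = euclidean(coords[u], coords[v])  # edge length = segment length
        adjacency[u].append((v, w))
        adjacency[v].append((u, w))
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:
            return d
        if d > best.get(node, math.inf):
            continue  # stale queue entry
        for nxt, w in adjacency[node]:
            nd = d + w
            if nd < best.get(nxt, math.inf):
                best[nxt] = nd
                heapq.heappush(queue, (nd, nxt))
    return math.inf  # target not reachable from source


# A U-shaped road: A and D are close in the plane but far apart by road.
coords = {"A": (0.0, 0.0), "B": (0.0, 3.0), "C": (1.0, 3.0), "D": (1.0, 0.0)}
edges = [("A", "B"), ("B", "C"), ("C", "D")]
print(euclidean(coords["A"], coords["D"]))          # 1.0
print(network_distance(coords, edges, "A", "D"))    # 7.0
```

In this toy case a Euclidean method would group activities at A and D together, while a network-aware method would treat them as far apart.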
2.3 Better
We have already discussed in the previous sections the importance of having an algorithm that can "correctly" summarize a spatial neighborhood depending on the density of the region. Since geometry-based and network-based approaches are each crafted to suit a particular type of network density [18] (network-based for low density and geometry-based for high density), we hypothesize that combining both approaches will increase the quality of the solution. For example, in Figure 1, we would ideally like to apply a geometry-based approach to the top half and a network-based approach to the bottom half. Neither [7] nor [18] supports heterogeneous models on a single map, because each is limited to a single type of summarization. In contrast, our approach can output both types of summarization, leading to a solution in which the top half is reported as an ellipse and the bottom half as a network. A possible solution is shown in Figure 3, where the geometry-based cluster is shown as an ellipse with a broken line and the network-based summarization is shown by highlighting the x possible routes, where x ≤ k.
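As an illustration of how an elliptical summary of a dense region might be derived from its activity locations, the sketch below computes a one-standard-deviation ellipse from the covariance of the points. This is a generic covariance ellipse offered as an assumption for illustration; it is not claimed to match the exact ellipse formulas used by CrimeStat, and the function name is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def covariance_ellipse(points: List[Point]) -> Tuple[Point, float, float, float]:
    """Return (center, semi_major, semi_minor, angle_radians) of a 1-sigma ellipse."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Population variances and covariance of the point cloud.
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Closed-form eigenvalues of the symmetric 2x2 covariance matrix.
    half_trace = (var_x + var_y) / 2.0
    spread = math.hypot((var_x - var_y) / 2.0, cov_xy)
    semi_major = math.sqrt(half_trace + spread)
    semi_minor = math.sqrt(max(half_trace - spread, 0.0))
    # Orientation of the major axis, measured from the x-axis.
    angle = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    return (mean_x, mean_y), semi_major, semi_minor, angle


# Toy cluster stretched along the x-axis.
center, a, b, theta = covariance_ellipse([(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)])
print(center, a, b, theta)
```

For the toy cluster, the major axis lies along the x-axis, reflecting the direction in which the activities are most spread out.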
3 Tasks
Apart from finding an efficient heuristic to solve the problem, the single most important task is to find an appropriate dataset that can be used for experimentation. We are aware of the existence of datasets such as the real crime datasets from Nebraska and Houston, TX [3, 2]. We can also use the Crisis Map of Haiti, a geospatial and temporal dataset [1] describing the events after the devastating earthquake of 2010 rocked the island country. The dataset is a snapshot of victims who were affected by the earthquake and are in dire need of help. Our planned schedule is shown in Table 1.
Figure 3: A possible solution
Table 1: Tasks and Schedule
Dates & Schedule | Tasks
10/10 - 10/17    | Formal problem specification
10/17 - 11/01    | Experimentation with novel approaches & preparation of the first draft
11/01 - 11/15    | Working with real world datasets & finalizing the scope of the semester
11/15 - 11/22    | First draft of transparencies
11/22 - 11/29    | Incorporation of peer reviews and submit for final feedback
11/29 - End      | Finalizing the report and working on the final transparencies
4 Deliverables
At the end of the project we propose to deliver an algorithm that can decide on the type of summarization to be used in a spatial network. We will submit a report comprising our research findings at the end of the current semester.
References
[1] Crisis map of Haiti, http://haiti.ushahidi.com/.
[2] Houston crime maps: A free browsable database of Houston crimes, http://houstoncrimemaps.com/.
[3] Lincoln City Police Department, "Lincoln city crime records", http://www.lincoln.ne.gov/city/police/, 2008.
[4] K. Buchin, S. Cabello, J. Gudmundsson, M. Löffler, J. Luo, G. Rote, R. I. Silveira, B. Speckmann, and T. Wolle, Detecting hotspots in geographic networks, Advances in GIScience (2009), 217–231.
[5] W.N. Carter, Disaster management: A disaster manager's handbook, (1991).
[6] M. Celik, S. Shekhar, B. George, J.P. Rogers, and J.A. Shine, Discovering and quantifying mean streets: A summary of results, (2007).
[7] J.E. Eck, S. Chainey, J.G. Cameron, M. Leitner, and R.E. Wilson, Mapping crime: Understanding hot spots, (2005).
[8] M. Ester, H.P. Kriegel, J. Sander, and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data mining, vol. 1996, Portland: AAAI Press, 1996, pp. 226–231.
[9] A. Gutteridge, G.J. Bartlett, and J.M. Thornton, Using a neural network and spatial clustering to predict the location of active sites in enzymes, Journal of molecular biology 330 (2003), no. 4, 719–734.
[10] A.K. Jain, M.N. Murty, and P.J. Flynn, Data clustering: a review, ACM computing surveys (CSUR) 31 (1999), no. 3, 264–323.
[11] G. Karypis, E.H. Han, and V. Kumar, Chameleon: Hierarchical clustering using dynamic modeling, Computer 32 (1999), no. 8, 68–75.
[12] N. Levine, Crimestat: A spatial statistics program for the analysis of crime incident locations (v 2.0), Ned Levine & Associates, Houston, TX, and the National Institute of Justice, Washington, DC (2002).
[13] N. Levine, Crimestat version 3.3 update notes: Part I: Fixes, Getis-Ord G, Bayesian journey-to-crime, Ned Levine & Associates, Houston, TX, and the National Institute of Justice, Washington, DC (2010).
[14] Xiaolei Li, Jiawei Han, Jae-Gil Lee, and Hector Gonzalez, Traffic density-based discovery of hot routes in road networks, SSTD, 2007, pp. 441–459.
[15] J. MacQueen et al., Some methods for classification and analysis of multivariate observations, Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1, California, USA, 1967, p. 14.
[16] P. Mohan, S. Shekhar, J.A. Shine, and J.P. Rogers, Cascading spatio-temporal pattern discovery, IEEE Transactions on Knowledge and Data Engineering (2011).
[17] A. Okabe, K.I. Okunuki, and S. Shiode, The sanet toolbox: New methods for network spatial analysis, Transactions in GIS 10 (2006), no. 4, 535–550.
[18] Dev Oliver, Abdussalam Bannur, James M. Kang, Shashi Shekhar, and Renee Bousselaire, A k-main routes approach to spatial network activity summarization: A summary of results, ICDM Workshops, 2010, pp. 265–272.
[19] S.A. Roach, The theory of random clumping, Methuen London, 1968.
[20] S. Shiode and A. Okabe, Network variable clumping method for analyzing point patterns on a network, Unpublished paper presented at the Annual Meeting of the Associations of American Geographers, Philadelphia, Pennsylvania, 2004.
[21] K. Sugihara, A. Okabe, and T. Satoh, Computational method for the point cluster analysis on networks, GeoInformatica 15 (2011), no. 1, 167–189.