Tropical Cyclone Event Sequence Similarity Search via Dimensionality Reduction and Metric Learning

Shen-Shyang Ho, Wenqing Tang, W. Timothy Liu
Jet Propulsion Laboratory, California Institute of Technology
4800 Oak Grove Dr., 300-323, Pasadena, CA 91109
[email protected], [email protected], [email protected]

ABSTRACT
The Earth Observing System Data and Information System (EOSDIS) is a comprehensive data and information system that archives, manages, and distributes Earth science data from the EOS spacecraft. One capability missing from the EOSDIS is the retrieval of satellite sensor data based on the output of weather event (such as tropical cyclone) similarity queries. In this paper, we propose a framework to solve the similarity search problem given user-defined instance-level constraints for tropical cyclone events, represented by arbitrary length multidimensional spatiotemporal data sequences. A critical component of such a problem is the similarity/metric function used to compare the data sequences. We describe a novel Longest Common Subsequence (LCSS) parameter learning approach driven by nonlinear dimensionality reduction and distance metric learning. Intuitively, arbitrary length multidimensional data sequences are projected into a fixed dimensional manifold for LCSS parameter learning. Similarity search is achieved through consensus among the (similar) instance-level constraints based on ranking orders computed using the LCSS-based similarity measure. Experimental results using a combination of synthetic and real tropical cyclone event data sequences are presented to demonstrate the feasibility of our parameter learning approach and its robustness to variability in the instance constraints. We then use a similarity query example on real tropical cyclone event data sequences from 2000 to 2008 to discuss (i) a problem of scientific interest, and (ii) challenges and issues related to the weather event similarity search problem.

Categories and Subject Descriptors
I.2.6 [Computing Methodologies]: Artificial Intelligence—Learning: Parameter Learning; I.5.4 [Computing Methodologies]: Pattern Recognition—Applications; J.2 [Computer Applications]: Physical Sciences and Engineering—Earth and atmospheric sciences

General Terms
Algorithm, Design

Keywords
Metric learning, parameter learning, spatio-temporal data mining, mining multi-dimensional data sequences, similarity search, ensemble method, dimensionality reduction, embedding method, atmospheric events, tropical cyclones

Copyright 2010 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. KDD'10, July 25–28, 2010, Washington, DC, USA. Copyright 2010 ACM 978-1-4503-0055-1/10/07 ...$10.00.

1. INTRODUCTION
The Earth Observing System Data and Information System (EOSDIS)^1 is a comprehensive data and information system that archives, manages, and distributes Earth science data from the EOS spacecraft (a.k.a. satellite sensors) [10]. A challenge for EOSDIS is how to "help users find the data that they need and how to get it to them" [2]. One missing capability in the EOSDIS is the accurate and efficient ad-hoc query and retrieval of Earth science satellite sensor data for dynamic atmospheric events, such as tropical cyclones, based on ad-hoc user-defined criteria and event instances. Recently, we introduced a fast and effective data retrieval framework driven by a spatio-temporal partitioning scheme on the moving satellite trajectories [12]. The main input to the data retrieval framework is the set of [similar] atmospheric event instances. In this paper, we propose a framework to solve the similarity search problem given user-defined instance-level constraints for tropical cyclone events, represented by arbitrary length multidimensional spatiotemporal data sequences.

A tropical cyclone event is a "non-frontal synoptic scale low-pressure system over tropical or sub-tropical waters with organized convection and definite cyclonic surface wind circulation"^2. The frequently used term hurricane describes a high-intensity tropical cyclone with sustained surface wind intensity equal to or greater than 119 km/h. Some examples of similarity of interest to scientists and meteorologists are track similarity (Hurricane Audrey (1957) and Hurricane Rita (2005)^3), strength similarity (Hurricane Katrina (2005) and Hurricane Camille (1969) [13]), and hurricane origin (Cape Verde hurricanes: Hurricane Isabel (2003), Hurricane Floyd (1999), and Hurricane Hugo (1989)^4). These similarity searches can be transformed into the queries shown below.

Q1. Track: List all tropical cyclones that passed through regions X1, X2, ..., Xn (specified by a set of minimum and maximum latitudes and longitudes).

Q2. Strength: List all tropical cyclones that have intensity at least Y km/h.

Q3. Origin: List all tropical cyclones that evolve from region Z.

The queries can be implemented using a standard structured query language (Q2) or a spatial query language (Q1 and Q3) [11]. One similarity query that is impossible using current database query languages is one whose condition contains instance-level constraints [22], shown below (Q4).

Q4. Instances: List all tropical cyclones that are similar to tropical cyclones s1, s2, ..., sk and dissimilar to tropical cyclones d1, d2, ..., dl.

This query type requires solutions beyond set-theoretic (Q2) and geometric (Q1 and Q3) algorithms. It enables users to perform (i) event data sequence clustering or categorization based on knowledge of a limited number of events, and (ii) identification of similar events for data retrieval and analysis. For example, a scientist provides the query system with a small set of tropical cyclones that have similar trajectories and wind intensity time series but travel at different speeds. The system returns tropical cyclone events from the past 20 years that exhibit similar characteristics, together with related satellite data, for the scientist to further analyze and compare the events. The instance-level constraints query type can be extended to other arbitrary length multidimensional event (or object) data sequences with diverse feature attributes in the earth and ocean sciences. The similarity query and search problem can be solved in a metric learning framework [24] where the instance-level constraints are split into two sets: dissimilar and similar.

^1 http://esdis.eosdis.nasa.gov
^2 http://www.aoml.noaa.gov/hrd/tcfaq/A1.html
^3 National Weather Service Forecast Office. http://www.srh.noaa.gov/lch/rita/rita_audrey.php
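Queries Q1 and Q2 reduce to simple predicates over a cyclone's track and intensity series, which is why standard query languages suffice for them; Q4, by contrast, has no such closed-form condition. The sketch below is purely illustrative (the function names and sample values are ours, not from any query system):

```python
def passes_through(track, boxes):
    """Q1: does the track visit every region (lat/lon bounding box)?"""
    return all(
        any(lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
            for lat, lon in track)
        for (lat_min, lat_max, lon_min, lon_max) in boxes)

def at_least_intensity(winds, y_kmh):
    """Q2: does the cyclone ever reach sustained intensity >= y_kmh?"""
    return max(winds) >= y_kmh

# Q2 example: a hurricane has sustained winds of at least 119 km/h.
storm_winds = [60, 95, 130, 150, 120]        # km/h, illustrative values
assert at_least_intensity(storm_winds, 119)  # this storm is a hurricane
```

No analogous predicate exists for Q4, since "similar to s1, ..., sk" is defined only through the learned similarity measure developed in the rest of the paper.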
The main challenges for this similarity (parameter) learning problem are (i) the arbitrary length and multidimensional nature of the spatiotemporal data sequences, and (ii) the non-metric similarity measure. In this paper, the main technical contribution is a similarity (parameter) learning framework for general (arbitrary length and multidimensional) data sequences which integrates both supervised metric learning and dimensionality reduction (which is some form of unsupervised metric learning) to learn the similarity measure based on longest common subsequences (LCSS) [21]. The main purpose for the dimensionality reduction is to identify a fixed dimension manifold, with nonlinear geometric structure derived from the arbitrary length multi-dimensional data sequences, to perform similarity learning. In Figure 1, one observes the projection of the data sequences into the fixed 2-D metric space induced by the LCSS-based similarity measure. Similarity learning is performed in the 2-D space to learn the best LCSS parameters satisfying the similar and dissimilar data sequence constraints. The identification of similar data sequences from
Figure 1: Projecting arbitrary length multidimensional sequences into a fixed 2-D space for similarity learning. Dotted lines (below) and corresponding circles in square boxes in 2-D space (above) represent dissimilar cyclone trajectories; solid lines (below) and corresponding circles in 2-D space (above) represent user-defined similar cyclone trajectories. Feature values (e.g., wind intensity and pressure) are not shown in the figure.
a pool of unlabeled data sequences is based on a majority vote among the user-defined similar data sequences. The vote from each user-defined similar data sequence is based on the ranking order of the unlabeled data sequences relative to that user-defined similar data sequence. In particular, if an unlabeled data sequence is ranked among the C data sequences most similar to a user-defined similar data sequence, it receives a vote from that data sequence. The outline of this paper is as follows. In Section 2, we briefly review similarity measures for data sequences and distance metric learning. In Section 3, the tropical cyclone data sequences used in the paper are described. In Section 4, the longest common subsequence (LCSS) based similarity measure is introduced and generalized. In Section 5, we describe and discuss our proposed similarity (parameter) learning and search approach in detail. In Section 6, experimental results using a combination of synthetic and real tropical cyclone event data sequences are presented to demonstrate the feasibility of our parameter learning approach and its robustness to variability in the instance-level constraints. We then use a similarity query example on real tropical cyclone event data sequences from 2000 to 2008 to discuss (i) a problem of scientific interest, and (ii) challenges and issues related to our proposed approach.
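The top-C majority vote just described can be sketched in a few lines (a minimal NumPy illustration; the distance matrix is assumed to be precomputed, e.g., Euclidean distances in the learned embedding space):

```python
import numpy as np

def vote_similar(D, C):
    """Select unlabeled sequences by top-C majority vote.

    D[s, u] = distance from user-defined similar sequence s to
    unlabeled sequence u. Returns the indices of the unlabeled
    sequences that receive votes from more than half of the seeds."""
    n_seeds, n_unlabeled = D.shape
    votes = np.zeros(n_unlabeled, dtype=int)
    for s in range(n_seeds):
        top_c = np.argsort(D[s])[:C]   # C closest unlabeled sequences to s
        votes[top_c] += 1
    # majority rule: more than half of the seed sequences must agree
    return np.where(votes > n_seeds / 2)[0]
```

For example, with three seed sequences and C = 1, an unlabeled sequence is returned only if it is the nearest neighbor of at least two seeds.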
^4 National Climatic Data Center (NCDC). Climate of 2003: Comparison of Hurricane Floyd, Hugo, and Isabel. http://www.ncdc.noaa.gov/oa/climate/research/2003/fl-hu-is-comp.html
2. RELATED WORK
Many similarity measures for data sequences have been proposed (see [9] and references therein). The two main categories are the Lp-norm based similarity measures and the elastic similarity measures. The former are metrics, but they assume fixed length data sequences and do not support local time shifting; the latter can be used to compare arbitrary length data sequences and support local time shifting, but they are not metrics. The common Lp-norm based similarity measures use the L1, L2, or L∞ norm. The classical elastic measure first used to overcome the weaknesses of the Lp norms is Dynamic Time Warping (DTW) [3]. The Longest Common Subsequence (LCSS) based similarity measure was proposed to handle two- and three-dimensional arbitrary length data sequences; it is robust to noise and gives more weight to the similar portions of the sequences [21]. The Edit Distance on Real sequence (EDR) is robust to noise, shifts, and scaling of data [7]. Both the Time Warp Edit Distance (TWED) [17] and ERP [6] (which combines the L1-norm and the edit distance) support local time shifting and are metrics. However, both ERP and TWED are derived for 1-D time series and have only been evaluated empirically on 1-D time series.

Distance metric learning [24, 23, 16, 1] aims to learn a distance metric for the input data space from a collection of similar/dissimilar points that preserves the distance relations among the training data. However, not all similarity functions satisfy the metric properties. Moreover, it has been shown empirically that non-metric similarity functions perform better than metric similarity functions for problems such as similarity search over time series or data sequences [9]. Recently, a hashing approach has been proposed that takes into account the learned Mahalanobis distance metric for scalable similarity search on image and systems data sets [15].
To the best of the authors' knowledge, metric learning has never been applied to similarity search for weather event data sequences consisting of both trajectory and feature attributes. The most closely related work is by Yu and Gertz [25], who proposed learning the DTW distance by direct application of Xing et al.'s approach for the Mahalanobis distance [24], with the two trajectories interpolated to the same length. Even though our proposed framework is motivated by [24], it is fundamentally different from [25]: the similarity (parameter) learning problem is solved in a fixed dimensional metric space which preserves the nonlinear geometric structure of the data sequence input space using an LCSS-based similarity measure.
3. DATA DESCRIPTION

A tropical cyclone event data sequence consists of both trajectory and feature attributes. A trajectory is the path a moving object follows through space and time. It is described by (i) spatial attributes (latitude and longitude) and (ii) temporal attributes (year, day, time). The feature attributes are (i) the maximum sustained wind intensity (knots) and (ii) the minimum central pressure (mb). Two consecutive data vectors in a data sequence are six hours apart. One can retrieve tropical cyclone and some subtropical cyclone event data sequences from the NOAA Coastal Services Center website^5 for both the North Atlantic Ocean and the Eastern North Pacific Ocean from 1851 to 2008.

For this paper, 116 tropical cyclones occurring in the North Atlantic Ocean from 2000 to 2008 and synthetic data sequences generated based on the 116 tropical cyclones are used in our experiments. The tropical cyclone data sequences have arbitrary length. From Figure 2, one sees that most of the data sequences consist of between ten and sixty data vectors.

Figure 2: Histogram of the data sequence lengths of tropical cyclones occurring from 2000 to 2008.

The top left graph in Figure 3 shows the trajectories of the tropical cyclones. From the bottom right graph, one observes an anti-correlation between the minimum central pressure and the maximum sustained wind intensity. Hence, we need only use one of the two feature attributes in our similarity learning and search problem; in this paper, we use the maximum sustained wind intensity.

Figure 3: Relationship between any two attributes in the data sequences for tropical cyclones occurring in the North Atlantic Ocean from 2000 to 2008.

In the next section, we describe the Longest Common Subsequence (LCSS) similarity measure, which supports local time shifting to handle arbitrary length data sequences.
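The record layout described above can be pictured as follows (a minimal sketch; the type and field names are ours, not from the NOAA data format):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrackPoint:
    """One 6-hourly observation in a tropical cyclone data sequence."""
    t: float         # hours since the first fix (derived from year/day/time)
    lat: float       # latitude, degrees north
    lon: float       # longitude, degrees east
    wind: float      # maximum sustained wind intensity (knots)
    pressure: float  # minimum central pressure (mb)

# A data sequence is an ordered list of 6-hourly points; sequences
# have arbitrary length (illustrative values below).
Sequence = List[TrackPoint]

example: Sequence = [
    TrackPoint(0.0, 16.0, -30.1, 30, 1008),
    TrackPoint(6.0, 16.4, -31.5, 35, 1005),
    TrackPoint(12.0, 16.9, -33.0, 45, 1000),
]
```

In the notation of the next section, each point contributes a time stamp plus m feature values to the sequence.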
^5 http://csc-s-maps-q.csc.noaa.gov/hurricanes/

4. SIMILARITY MEASURE BASED ON LONGEST COMMON SUBSEQUENCE

It has been empirically shown that no one similarity measure outperforms the others for time series [9]. Moreover, proper parameter values are critical for the effectiveness of a similarity measure. In this paper, we use the Longest Common Subsequence (LCSS), an edit distance based elastic similarity measure, for similarity measure learning, and we extend the LCSS-based similarity measure from multidimensional (at most three) trajectories [21] to general multidimensional data sequences.

Consider two arbitrary length multidimensional spatiotemporal data sequences

A = (t_{a,1}, a_{1,1}, ..., a_{m,1}), ..., (t_{a,n}, a_{1,n}, ..., a_{m,n})
B = (t_{b,1}, b_{1,1}, ..., b_{m,1}), ..., (t_{b,l}, b_{1,l}, ..., b_{m,l})

with m attributes and of length n and l, respectively. Define the similarity function S1 between A and B, given δ and E = (ε_1, ε_2, ..., ε_m), by

S1(A, B, δ, E) = LCSS_{δ,E}(A, B) / min(|A|, |B|)    (1)

with the generalized LCSS defined by

LCSS_{δ,E}(A, B) =
  0,                                                     if |A| = 0 or |B| = 0;
  1 + LCSS_{δ,E}(Head(A), Head(B)),                      if c_k > 0, ∀ c_k, and |t_i − t_j| < δ is satisfied;
  max(LCSS_{δ,E}(Head(A), B), LCSS_{δ,E}(A, Head(B))),   otherwise;

such that Head(A) is the sequence (t_{a,1}, a_{1,1}, ..., a_{m,1}), ..., (t_{a,n−1}, a_{1,n−1}, ..., a_{m,n−1}) for any data sequence A of length n, and

(c_1, ..., c_m)^T = (ε_1 − |a_{1,t_i} − b_{1,t_j}|, ..., ε_m − |a_{m,t_i} − b_{m,t_j}|)^T

for some predefined δ and E. To have good performance, the parameters δ and E have to be tuned for the specific application. One concludes from Example 1 that the LCSS-based similarity measure S1 is sensitive to the parameters δ and E.

Example 1. Given A = (0_1, 0_2), (0.5, 1_1), (1, 3), (1.5, 1_2) and B = (0, 1_3), (1, 2_1), (2_2, 1_4) where d = 1 and m = 0.

1. δ = 1 and E = (1): LCSS_{δ,E}(A, B) = 3; S1(A, B, δ, E) = 0.75. Correspondence: 0_2 → 1_3, 1_1 → 1_3, 3 → 2_1, 1_2 → 1_4.

2. δ = 0.25 and E = (1): LCSS_{δ,E}(A, B) = 2; S1(A, B, δ, E) = 0.67. Correspondence: 0_2 → 1_3, 3 → 2_1.

3. δ = 1 and E = (0.5): LCSS_{δ,E}(A, B) = 2; S1(A, B, δ, E) = 0.67. Correspondence: 1_1 → 1_3, 1_2 → 1_4.

4. δ = 0 and E = (0): LCSS_{δ,E}(A, B) = 0; S1(A, B, δ, E) = 0.

5. METHODOLOGY

In Section 5.1, we describe our proposed approach for learning LCSS-based similarity parameters for arbitrary length multidimensional data sequences. In Section 5.2, we describe a voting method to select the most similar data sequences from a pool of unlabeled data sequences based on the user-defined similar data sequences and the learned LCSS-based similarity measure.

5.1 LCSS-based Similarity Parameter Learning

The most intuitive distance metric learning approach for our problem is the one proposed by Xing et al. [24]. In their approach, the metric learning problem is posed as a convex optimization problem with constraints given by the "must-link" data pairs and the "cannot-link" data pairs. The objective is to find the matrix representing a Mahalanobis metric that keeps the similar data points close to one another and the dissimilar data points far away from the similar data points. The main differences in our problem setting are (i) the non-metric LCSS-based similarity S1 used in our problem, and (ii) the arbitrary length multidimensional data sequence pair constraints. Moreover, the optimization step has to be modified for the integer time parameter δ in LCSS.

Xing et al.'s metric learning framework is extended to learning the parameters of a non-metric similarity for generic (arbitrary length and multidimensional) data sequences. A dimensionality reduction component is integrated into Xing et al.'s framework to facilitate similarity learning in a fixed low dimensional metric space induced by the non-metric S1. Let f_{S1} be a transformation from the data sequence input space to a fixed low dimensional space M induced by S1. Let S and D be the set of "must-link" pairs and the set of "cannot-link" pairs, respectively. To perform the similarity learning, we use a variant of the objective function introduced in [24]:

  min_{E,δ}  Σ_{(x_i,x_j)∈S} ||f_{S1}(x_i) − f_{S1}(x_j)||²
  such that  Σ_{(x_i,x_j)∈D} ||f_{S1}(x_i) − f_{S1}(x_j)|| ≥ 1
  and        P > 0

where P = (ε_1, ε_2, ..., ε_m, δ) ∈ (R⁺)^m × Z⁺. We solve the unconstrained minimization problem:

  g(S, D, P) = Σ_{(x_i,x_j)∈S} ||f_{S1}(x_i) − f_{S1}(x_j)||²
               − log ( Σ_{(x_i,x_j)∈D} ||f_{S1}(x_i) − f_{S1}(x_j)|| )
               − Σ_{j=1}^{m} log ε_j − log δ    (2)

where ||·|| is some metric norm (e.g., the l2 norm) in M. The additional terms are used to control the magnitude of the parameter vector P. The coordinate descent method [5] is used for the minimization step to avoid gradient computation; g(·) is minimized along one coordinate direction at each iteration. In our implementation, the coordinate is selected based on

  arg min_{P_i ∈ I} g(S, D, P_i)    (3)

such that I = {P_i = (ε_1^k, ..., ε_i^{k+1}, ..., ε_m^k, δ) | ε_i^{k+1} = ε_i^k + h_i}, i.e., P_i = P^k + h_i e_i, where e_i is the i-th unit vector and h_i is a fixed small value. We fix δ and allow search in the ε_i space first. When the global minimum (at fixed δ) is achieved, we perform minimization with stepsize 1 on δ. Initialization of P starts near the zero vector. One notes that as the ε_i and δ increase, the similarity value between two data sequences increases.
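The generalized LCSS and the similarity S1 of Equation (1), which the learned parameters (E, δ) plug into, can be sketched as a standard dynamic program (an illustrative re-implementation, not the authors' code; the tuple layout — time stamp first, then the m attribute values — is our reading of the definition):

```python
def lcss(A, B, delta, E):
    """Generalized LCSS_{delta,E} for arbitrary length sequences.

    A, B: lists of tuples (t, v_1, ..., v_m). A pair of elements
    matches when every attribute difference is strictly below its
    epsilon in E and the time stamps are within delta."""
    n, l = len(A), len(B)
    dp = [[0] * (l + 1) for _ in range(n + 1)]  # dp[i][j]: LCSS of prefixes
    for i in range(1, n + 1):
        for j in range(1, l + 1):
            (ta, *va), (tb, *vb) = A[i - 1], B[j - 1]
            match = abs(ta - tb) < delta and all(
                eps - abs(a - b) > 0 for eps, a, b in zip(E, va, vb))
            if match:
                dp[i][j] = 1 + dp[i - 1][j - 1]
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][l]

def s1(A, B, delta, E):
    """Similarity S1 of Equation (1): LCSS normalized by the shorter length."""
    return lcss(A, B, delta, E) / min(len(A), len(B))
```

The dynamic program runs in O(nl) time in general; the O(δ·max(n, l)) bound quoted from [21] comes from restricting the inner loop to the band |t_i − t_j| < δ, omitted here for clarity. Note that δ = 0 (or E = 0) admits no matches, consistent with item 4 of Example 1.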
Nonlinear dimensionality reduction is used to transform (f_{S1}) the original data sequences, based on S1, into representations in a fixed low dimensional space. In this paper, we use Isometric Feature Mapping (ISOMAP), an extension of linear embedding approaches (e.g., multidimensional scaling, MDS [8]), which learns the global nonlinear geometric structure of the input data [20]. One capability of ISOMAP is its ability to discover nonlinear degrees of freedom that underlie complex natural observations such as the trajectory and feature attributes of tropical cyclones. On the other hand, one notes that other types of linear or nonlinear dimensionality reduction approaches can also be used in place of ISOMAP for the transformation f_{S1}. The ISOMAP algorithm consists of two main steps:

1. Estimate the geodesic distance between points in a low-dimensional manifold M with respect to a graph (e.g., a k-nearest neighbor graph) constructed for the data points.

2. Apply MDS to find points in M such that the distance between any two points in M matches the distance computed in Step 1.

ISOMAP is able to discover the d-dimensional manifold embedded in a high dimensional input space. In our problem, we assume that there exists some intrinsic low dimensional (Euclidean) structure embedded in the high dimensional "data sequence space". Bernstein et al. [4] established that the estimated geodesic distance between points in the original space converges to the true geodesic distance between points in M. If we assume that the data sequences are sampled from a convex Euclidean domain, then the geodesic distances between points in M are equal to the Euclidean distances in that domain. With the two conditions satisfied, the MDS step in ISOMAP will asymptotically recover the embedded (Euclidean) data structure [4]. Hence, by adding this dimensionality reduction step to similarity learning, we ensure that similarity learning is performed in a fixed dimensional metric space which preserves the nonlinear geometric structure of the "data sequence space". This property ensures that the learned LCSS measure is robust to variation in the user-defined similar data sequences. In other words, the learned similarity measure can still distinguish similar and dissimilar data sequences accurately as the feature and spatio-temporal variations of the user-defined similar data sequences increase.

Algorithm 1 provides a high level description of the LCSS-based similarity parameter learning procedure.

Input: S′, similarity set; D′, dissimilar set; K.
Output: parameter vector P.
1: Initialize P := [0.1, ..., 0.1, 1];
2: Construct the "must-link" pair set S and the "cannot-link" pair set D;
3: Compute the K-nearest neighbor graph using S1 defined by P for the data sequences in S′ and D′;
4: Compute the shortest path distance between all data sequences using Dijkstra's algorithm and S1 defined by P;
5: Apply MDS to construct a fixed low dimensional manifold M;
6: Compute the Euclidean distances for the data sequence pairs in S and in D, separately, in M;
7: Compute objective function (2);
8: Update P;
9: Repeat Steps 3 to 8 until |g_{i+1} − g_i| < γ.

Algorithm 1: LCSS-based similarity parameter learning.

In Line 2, we construct the "must-link" pairs by pairing up all the data sequences in the set S′ of similar data sequences; hence |S| = |S′|(|S′| − 1)/2. To construct the "cannot-link" pairs, the data sequences in the set D′ of dissimilar data sequences are all paired up first, and then each data sequence in S′ is paired with all the data sequences in D′; hence |D| = |D′|(|D′| − 1)/2 + |S′|·|D′|. Lines 3 to 5 are the steps of the ISOMAP algorithm. Line 3 computes the K-nearest neighbor graph using S1 defined by P for the data sequences in S′ and D′. Line 4 computes the geodesic distance between all data sequences using Dijkstra's algorithm and S1 defined by P. Line 5 constructs the low dimensional manifold M using MDS. Line 7 computes the objective function (2). Line 8 updates the parameter vector P. Line 9 is the stopping criterion based on the absolute difference between two consecutive objective function values, g_{i+1} and g_i; Algorithm 1 halts when the criterion value is less than γ.

The computational complexity of the LCSS-based similarity parameter learning algorithm is analyzed by breaking it down into three components: dissimilarity matrix construction by computing S1 values for all sequence pairs (Line 3), ISOMAP (Lines 4 and 5), and the coordinate descent method (Lines 6 to 8). Based on Lemma 1 in [21], the (dis)similarity matrix can be constructed in O(s²δl), where s is the number of data sequences and l = 2·max(l_1, ..., l_s), with l_i, i = 1, ..., s, the sequence lengths. For the ISOMAP algorithm, the computational complexity is O(s³). The convergence rate of the coordinate descent method is similar to that of steepest descent; even though it can be slow, it is still effective for practical purposes [5]. During each iteration, one needs to construct the dissimilarity matrix only once and run the ISOMAP algorithm. From a practical implementation perspective, the dissimilarity matrix construction (Line 3) is the most expensive step since s < l and δ ≥ 1; s < l because the number of user-defined data sequences for the similarity query is assumed to be limited.
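The ISOMAP steps of Algorithm 1 (k-nearest-neighbor graph, Dijkstra shortest paths, classical MDS) can be sketched as below, starting from a precomputed dissimilarity matrix (e.g., dissimilarities derived from S1 values between all sequence pairs). This is an illustrative re-implementation with NumPy/SciPy, not the authors' code:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap_embed(D, k=5, dim=2):
    """ISOMAP embedding from a precomputed dissimilarity matrix D."""
    n = D.shape[0]
    # Line 3: k-nearest-neighbor graph (inf marks a missing edge)
    G = np.full((n, n), np.inf)
    for i in range(n):
        nearest = np.argsort(D[i])[:k + 1]  # the k neighbors plus i itself
        G[i, nearest] = D[i, nearest]
    G = np.minimum(G, G.T)                  # symmetrize the graph
    # Line 4: geodesic distances via Dijkstra's algorithm
    geo = shortest_path(G, method="D", directed=False)
    # Line 5: classical MDS on the squared geodesic distances
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    Bmat = -0.5 * J @ (geo ** 2) @ J
    w, V = np.linalg.eigh(Bmat)
    top = np.argsort(w)[::-1][:dim]         # largest eigenvalues first
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))
```

In the full learning loop, this embedding is recomputed for each candidate parameter vector P, the Euclidean distances in the embedding feed objective (2), and coordinate descent updates P.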
5.2 Data Sequence Similarity Search To select the most similar data sequences from a set U of unlabeled data sequences, we use a voting scheme to decide among the data sequences in the similar set S using the LCSS-based similarity and the learned parameter vector P . The voting scheme is a combination of ranking the unlabeled data sequences and a majority vote decision based on the ranking. For each data sequence s ∈ S , the unlabeled data sequences in U are ordered based on their Euclidean distances from s. If an unlabeled data sequence u ∈ U is ranked as among the C most similar (or closest) data sequences to s, it will receive a vote from s. If u received more than |S2 | votes, it is considered similar to the data sequences in S . Algorithm 2 shows our voting approach for similar data sequences selection using the learned LCSS-based similarity measure S1 from Algorithm 1. Lines 1 to 3 are the steps for the ISOMAP algorithm using the learned LCSS-based similarity measure S1. Line 4 computes the Euclidean distances
139
1. Randomly pick one data sequence sq from the real tropical cyclone event sequence set6 .
Input: S , similarity set; U , the set of unlabeled data sequences; P , learned parameter vector; C, user-defined ranking cut-off; K. Output: O, the set of similar data sequences. 1: Compute the K−nearest neighbor graph using S1 defined by P for data sequences in S and D ; 2: Compute the shortest path distance between all data sequences using Dijkstra’s algorithm and S1 defined by P ; 3: Apply MDS to construct a fixed low dimensional manifold M ; |U | 4: Compute Euclidean distance vector Ds = {dsu }u=1
2. Specify a threshold γa for each trajectory and feature attribute a so that sq is bounded by a volume tube with radius γa in each attribute dimension and sq is the tube center. 3. Specify a translational threshold η so that a generated data sequence can shift at most η. 4. Generate each of the 10 new similar data sequences as follows.
dsu = ||s − u||2
(a) Randomly assigned an integer length lnew to a new data sequence so that lnew is between l − t and l + t, where l is the length of sq and t is a fixed integer.
for each s ∈ S , ∀u ∈ U in M ; ¯ s = sort(Ds ) = {du¯s1 , . . . , du¯ , . . . , du¯ } such 5: D sC s|U | that du¯s1 < du¯s2 < · · · < du¯sC < · · · < du¯s|U | , for each s ∈ S; 6: R = {¯ usi |¯ usi ∈ U , du¯si ≤ du¯sC , s ∈ S }; 7: Nu = #{v ∈ R : v = u} for all u ∈ U ; 8: O = {u|u ∈ U , Nu > |S2 | };
(b) Randomly generate lnew points such that they fell in the volume tube described in Step 2. (c) Randomly shift the generated points satisfying the constraint in Step 3. Then, we randomly pick 30 data sequences from the tropical cyclone event sequence set, excluding sq , and include them into the dissimilar set D . In our experiment, since |S | = 10 and |D | = 30, |S| = 10·9 = 45 and |D| = 30·29 + 10 · 30 = 2 2 2040, respectively, according to Section 5.1. Moreover, γa is varied from 1 to 5 for all a, η = 1, and t = l. For each γa value, we perform 20 trials. Using the above procedure for similar set generation, we generate another 100 positive testing examples. 85 data sequences, excluding the 30 in D and sq , from the real tropical cyclone event sequence set are used as negative testing examples. Hence, we have |U | = 185. K is set to 10 for the ISOMAP algorithm, the manifold dimensions for the similar and dissimilar data sequences are set to 10 and 30, respectively. This is to ensure that we can capture sufficient nonlinear geometric information without the manifold dimensionality higher than the number of data sequences. Accuracy of a similarity measure is computed as follows. First, similarity values between sq and all the data sequences in U are computed. Then, the similarity values are sorted. The positive testing examples should be among the 100 closest to sq . Hence,
Algorithm 2: Data sequence voting-based similarity search.
from all the unlabeled data sequences in U to each data sequence in the similar set S in the fixed manifold M . Line 5 sorts and ranks the unlabeled data sequences for each data sequences in S using the computed Euclidean distances in Step 4. Line 6 gathers the C most similar unlabeled data sequences for each data sequence in S into a single set. Line 7 counts the number of times an unlabeled data sequence is among the top C unlabeled data sequences closest to each data sequence in S . Line 8 is a voting scheme which selects a data sequence if it is ranked among the top C data sequences for more than half the data sequences in S . Other variants of the voting scheme can also be used for the similarity search. In particular, vote counts can be based on the C farthest from the data sequences in the dissimilar set or vote counts based on ranking data sequences from both similar and dissimilar sets.
6.
EXPERIMENTAL RESULTS
In Section 6.1, we show the feasibility of our LCSS-based similarity parameter learning approach (Algorithm 1) and its robustness to variability in the instance-level constraints. In Section 6.2, we present a scenario when a user provides a similar set S and a dissimilar set D , and uses our proposed approach to search for similar data sequences.
Accuracy
=
True Positive
=
#{p|p ∈ P e, Sp ≤ S|P e| } + #{n|n ∈ N e, Sn > S|P e| } |P e| + |N e| #{p|p ∈ P e, Sp ≤ S|P e| } |P e|
where P e and N e are the sets of positive and negative testing examples, respectively; Sp and Sn are the similarity values for a positive example p and a negative example n, respectively, and S|P e| is the similarity value of the sorted value at position |P e| (assuming sorting in increasing order). Since accuracy is based on the ranking for all the data sequences, true positive (or true negative) is a sufficient performance indicator if we know both |P e| and |N e|. Figure 4 shows the experimental result when no parameter learning is performed and a fixed P = (1, 1, 5, 1) is used instead. As γa increases, the number of true positive decreases. In
6.1 LCSS-based Similarity Parameter Learning In the first experiment, the thresholds for both the trajectory and feature attribute values of the data sequences in the similar set S are increased from 1 to 5 to test the robustness of our approach. In this experiment, synthetic data sequences are generated based on real tropical cyclone sequences. For each experimental trial, one similar set S of synthetic tropical cyclone sequences is generated as follows.
6
As described earlier in Section 3, 116 tropical cyclones occurring in the North Atlantic Ocean from 2000 to 2008 are used in our experiments.
Figure 4: No similarity learning.
Figure 7: The five data sequences [solid lines] in the similar set S and the ten data sequences [dashed lines] in the dissimilar set D (trajectories only).
In other words, as the data sequences in S become more diverse, the LCSS-based similarity measure using a fixed P is less likely to measure similarity accurately for data sequences generated from the same distribution as those in S. Moreover, one observes that the true positive variance among the 20 trials increases as γa increases.
Next, we compare the LCSS-based similarity parameter learning approach with (i) the direct application of Xing et al.'s metric learning approach [24] to learn the LCSS parameter, and (ii) no learning at all. Figures 5 and 6 show the performance comparison for the three approaches based on (i) the S1 measure in the data sequence input space, and (ii) Euclidean distance in the fixed low dimensional space, respectively. One unexpected observation from Figure 5 is the poor performance of Xing et al.'s approach in learning the LCSS-based similarity measure for the data sequences. The performance is even worse than using a good fixed P for the S1 measure. From Figure 6, the accuracy of Xing et al.'s approach improves when sorting and ranking are done using Euclidean distance in the fixed low dimensional space derived using ISOMAP based on the learned S1 similarity measure. From both figures, one observes that our approach has favorable performance and that its performance remains consistent as variability increases. Moreover, from Figure 6, one also observes that the performance of the other two approaches becomes more consistent in the fixed low dimensional space as variability increases.

Figure 5: Accuracy using the S1 measure in the data sequence input space.

Figure 6: Accuracy using Euclidean distance in a low dimensional space.

6.2 Similarity Query and Search
We present a scenario in which a user provides a similar set S and a dissimilar set D, and uses our proposed approach to search for similar data sequences. This corresponds to the query: List all tropical cyclones that are similar to the tropical cyclones in S and dissimilar to the tropical cyclones in D. First, we include 5 tropical cyclone event data sequences from the real tropical cyclone event data sequence set in S and another 10 data sequences in D. Figure 7 shows the sets S and D used for LCSS-based similarity parameter learning (Algorithm 1). We then include the other 101 tropical cyclone events in U. For both Algorithms 1 and 2, K = 15. For Algorithm 2, C = 5. Figure 8 shows the 5 most similar data sequences for each of the five data sequences in S based on Line 6 in Algorithm 2. Figure 9 shows the two trajectories [solid lines] and wind intensity time series of the output from Algorithm 2 together with the trajectories of the 5 similar data sequences. Figure 10 shows some seemingly similar data sequences to those in S and their corresponding Nu values (see Line 7, Algorithm 2). Note that Nu must be greater than 2 for a data sequence to be output by Algorithm 2. One observes from Figures 9 and 10 that the initial subsequences of data sequences with higher Nu (greater than or equal to 2) tend to be very similar to the initial subsequences of data sequences in S. Also, from Figure 8, one observes that a data sequence may be similar to some data sequences that "look" different (e.g., the third and the fourth sequences). This affects the Nu value used in making the selection decision in Algorithm 2. Moreover, the user-defined ranking cut-off C also affects the number of data sequences selected. In other words, with a higher C and a fixed |S|, more data sequences will be selected (see Figure 11).

Figure 8: The five data sequences (trajectories [top, solid lines] and intensities [bottom]) in the similar set S and their corresponding five most similar unlabeled data sequences (trajectories [top, dashed lines]).
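Both algorithms rely on a K-nearest-neighbor graph (K = 15 above) when deriving the ISOMAP manifold. A minimal sketch of building such an adjacency list from pairwise distances follows; `dist` is a hypothetical symmetric distance lookup, not the paper's data.

```python
def knn_graph(dist, K):
    """Adjacency list of the K nearest neighbours of each sequence,
    as used for an ISOMAP-style neighbourhood graph.
    dist[a][b]: (hypothetical) pairwise distance between sequences a and b."""
    graph = {}
    for a in dist:
        # sort the other sequences by distance to a, keep the K closest
        neighbours = sorted((b for b in dist[a] if b != a), key=dist[a].get)
        graph[a] = neighbours[:K]
    return graph
```

ISOMAP then approximates geodesic distances by shortest paths over this graph before embedding the sequences in the fixed low dimensional space.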
Figure 11: Number of output data sequences vs. the user-defined ranking cut-off C, when |S| = 5.
Figure 9: Top: The five data sequences (dashed lines) in the similar set S and the two output trajectories [solid lines] using Algorithm 2. Bottom: Wind intensity time series from the similar set S and the two wind intensity time series from the output data sequences.

7. DISCUSSION
Using the similarity query and search example in Section 6.2, we discuss (i) a problem of scientific interest, and (ii) challenges and issues related to our proposed framework and approach.

Science Problem
The target users for the query described in Section 6.2 are scientists studying weather events. The objective of such a query is to support scientists in narrowing down their searches for weather events with specific characteristics described by data sequences serving as instance-level constraints. This capability is currently not available in standard database query and search functionalities. Such queries can sometimes pose new and interesting scientific problems. For example, our similarity search approach identifies two tropical cyclones similar to the data sequences in S. However, there are both similarities and differences between the two tropical cyclones. From Figure 9, one observes that their trajectories and wind intensity time series are similar during the first few time instances. However, one of the two (Hurricane Helene, September 12 to 27, 2006) became a Category 3 hurricane while the other (Tropical Storm Josephine, September 2 to 9, 2008) weakened after three days. Why did one intensify while the other died out in a few days? This example illustrates the tropical cyclone intensification problem, in which one attempts to discover the factors and conditions under which a tropical cyclone intensifies. Using each of the two tropical cyclones separately, one can further perform similarity search to identify two groups of similar tropical cyclones, retrieve the corresponding satellite data, and analyze them. By integrating the query type described in Section 6.2 into a satellite data retrieval system [12], one can retrieve satellite data based on the output from the query for analysis. Moreover, this framework can be easily extended to other event (or object) data sequences in earth and ocean sciences.

Figure 10: Some "similar" data sequences (solid lines) not selected by Algorithm 2. Dashed lines are data sequences in S.

Challenges and Issues
Some technical challenges and (open) issues are briefly discussed here.
1. Which dimensionality reduction approach is the best? While the emphasis of this paper is a parameter learning approach driven by dimensionality reduction and distance metric learning, we did not answer the question of which dimensionality reduction approach is best suited for our proposed parameter learning framework. We use nonlinear ISOMAP to derive a manifold using a neighborhood graph that retains the geometric structure of the data sequence input space. A linear dimensionality reduction approach such as Principal Component Analysis (PCA) [14] can be used to derive a low dimensional space that maximizes the variance in the data sequences. Other nonlinear dimensionality reduction methods, such as Locally-Linear Embedding (LLE) [18] and Kernelized PCA [19], can also be used. Each method comes with its own strengths and weaknesses. For example, LLE is computationally more efficient than ISOMAP, but it has difficulty handling non-uniformly sampled data. This may affect the performance of our approach if the variation in the user-defined similar data sequences increases.
2. Applying our approach to other data sequence similarity measures. We tested our approach on the LCSS-based similarity measure S1. In Section 2, we briefly discuss a few other metric and non-metric similarity measures. One challenge is to extend some of these similarity measures to be applicable to multidimensional spatiotemporal data sequences. They would then be useful for our application domain and for performance comparison purposes.
3. Better approaches for similarity search. One weakness of Algorithm 2 is that its output depends on the user-defined ranking cut-off C. If C increases with |S| fixed, the number of selected data sequences increases (see Figure 11). There is no rule for deciding what C to choose.
4. How can local/partial similarity search be achieved? From Figure 9, one observes that the two output data sequences from Algorithm 2 are similar during the initial time instances. One useful capability is to search for such partial (or local) similarity when the data sequences are similar only in some subsequences of the data sequences in S. One possible solution is to partition the data sequences in S into shorter subsequences and then categorize the subsequences. After that, parameter learning is performed on each group category. One then searches for similar data sequences within each group category.
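The partitioning step suggested for partial similarity search can be sketched as a sliding window over each data sequence; the window length and step below are hypothetical choices, not values from the paper.

```python
def partition(seq, window, step):
    """Cut a data sequence into overlapping fixed-length subsequences,
    a possible first step toward partial (local) similarity search.
    window: subsequence length; step: offset between windows (both hypothetical)."""
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, step)]
```

The resulting subsequences could then be grouped, with LCSS parameter learning applied per group as outlined in item 4.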
8. CONCLUSIONS
We propose a framework to solve the similarity search problem given user-defined instance-level constraints for tropical cyclone events, represented by arbitrary length multidimensional spatiotemporal data sequences. We describe a novel Longest Common Subsequence (LCSS) parameter learning approach driven by nonlinear dimensionality reduction and distance metric learning. Similarity search is achieved through consensus among the (similar) instance-level constraints based on ranking orders computed using the learned LCSS-based similarity. Experimental results using a combination of synthetic and real tropical cyclone event data sequences are presented to demonstrate the feasibility of our parameter learning approach and its robustness to variability in the instance constraints. We also present a similarity query example using real tropical cyclone event data sequences from 2000 to 2008. The main contributions of this paper are the introduction of novel problems/directions in three different research disciplines, namely: (i) [database] a new query type conditioned on user-defined instance-level constraints (not limited to data sequence instances, but applicable to any data structure), (ii) [machine learning] a new metric learning problem for arbitrary length multidimensional data sequences, and (iii) [climate science] supporting scientific research on weather events (e.g., tropical cyclones) via similarity search. Our future work includes (i) overcoming the technical challenges in our proposed approach, such as the need to select the dimensionality reduction algorithm and the parameter for the voting scheme, and (ii) theoretical study and generalization of our framework.

Acknowledgments
This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.

9. REFERENCES
[1] M. S. Baghshah and S. B. Shouraki. Semi-supervised metric learning using pairwise constraints. Proc. 21st IJCAI, pages 1217–1222, 2009.
[2] J. Behnke, T. H. Watts, B. Kobler, D. Lowe, S. Fox, and R. Meyer. EOSDIS petabyte archives: Tenth anniversary. Proc. 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST '05), pages 81–93, 2005.
[3] D. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. AAAI Workshop on Knowledge Discovery in Databases, pages 229–248, 1994.
[4] M. Bernstein, V. d. Silva, J. Langford, and J. Tenenbaum. Graph approximations to geodesics on embedded manifolds. Preprint, 2000.
[5] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.
[6] L. Chen and R. Ng. On the marriage of Lp-norms and edit distance. Proc. 30th VLDB Conference, pages 792–803, 2004.
[7] L. Chen, M. T. Ozsu, and V. Oria. Robust and fast similarity search for moving object trajectories. Proc. 24th SIGMOD Conference, pages 491–502, 2005.
[8] T. F. Cox and M. A. Cox. Multidimensional Scaling. Chapman and Hall/CRC, 2nd edition, 2001.
[9] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: Experimental comparison of representations and distance measures. Proc. VLDB Endowment, 1(2), pages 1542–1552, 2008.
[10] M. Esfandiari, H. Ramapriyan, J. Behnke, and E. Sofinowski. Earth Observing System (EOS) Data and Information System (EOSDIS) - evolution update and future. IEEE Inter. Geoscience and Remote Sensing Symposium, pages 4005–4008, 2007.
[11] R. H. Guting and M. Schneider. Moving Objects Databases. Morgan Kaufmann, 2005.
[12] S.-S. Ho, W. Tang, W. T. Liu, and M. Schneider. A framework for moving sensor data query and retrieval of dynamic atmospheric events. Proc. 22nd International Conference on Scientific and Statistical Database Management, pages 96–113, 2010.
[13] J. Hobgood. A comparison of Hurricanes Katrina (2005) and Camille (1969). 27th Conference on Hurricanes and Tropical Meteorology, 2006.
[14] I. T. Jolliffe. Principal Component Analysis. Springer, 2nd edition, 2002.
[15] B. Kulis, P. Jain, and K. Grauman. Fast similarity search for learned metrics. IEEE Trans. on Pattern Analysis and Machine Intelligence, 31(12):2143–2157, 2009.
[16] Z. Lu, P. Jain, and I. S. Dhillon. Geometry-aware metric learning. Proc. 26th Inter. Conf. on Machine Learning, pages 673–680, 2009.
[17] P.-F. Marteau. Time warp edit distance with stiffness adjustment for time series. IEEE Trans. on Pattern Analysis and Machine Intelligence, 31(2):306–318, 2009.
[18] S. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[19] B. Schoelkopf, A. Smola, and K. R. Muller. Kernel principal component analysis. Advances in Kernel Methods: Support Vector Learning, pages 327–352, 1999.
[20] J. Tenenbaum, V. d. Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[21] M. Vlachos, G. Kollios, and D. Gunopulos. Discovering similar multidimensional trajectories. Proc. 18th ICDE Conference, pages 673–684, 2002.
[22] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. Proc. 17th Inter. Conf. on Machine Learning, pages 1103–1110, 2000.
[23] S. Wang and R. Jin. An information geometry approach for distance metric learning. Proc. 12th AISTATS, pages 591–598, 2009.
[24] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. NIPS, pages 505–512, 2002.
[25] W. Yu and M. Gertz. Constraint-based learning of distance functions for object trajectories. Proc. 21st Int. Conf. on Scientific and Statistical Database Management, pages 627–645, 2009.