Translation and Rotation Invariant Mining of Frequent Trajectories: Application to Protein Unfolding Pathways

Alexander Andreopoulos¹, Bill Andreopoulos¹,², Aijun An¹, and Xiaogang Wang¹

¹ York University, Dept. of Computer Science, Toronto Ontario, M3J 1P3, Canada
² Biotechnological Centre, TU Dresden, Germany
{alekos,billa,aan}@cs.yorku.ca, [email protected]

Abstract. We present a framework for mining frequent trajectories, which are translated and/or rotated with respect to one another. We then discuss a multiresolution methodology, based on the wavelet transformation, for speeding up the discovery of frequent trajectories. We present experimental results using noisy protein unfolding trajectories and synthetic datasets. Our results demonstrate the effectiveness of the proposed approaches for finding frequent trajectories. A multiresolution mining strategy provides significant mining speed improvements.

1 Introduction

There exist many situations where we are confronted with trajectories describing the movement of various objects. We are often interested in mining the frequent trajectories that groups of such objects go through. Trajectory datasets arise in many real world situations, such as discovering biological patterns, mobility experiments, and surveillance [9,7,4]. Of special interest are trajectories representing protein unfolding pathways, which have been derived from high-throughput single molecule force spectroscopy experiments [8]. Such trajectories are represented on a two-dimensional force × distance grid. The y axis corresponds to the force (pN) involved in pulling the protein out of the cellular membrane via the tip of a mechanical cantilever; the x axis corresponds to the force-induced distance (nm) on the unfolding pathway of the protein. Such trajectories are often very noisy, which makes it difficult to distinguish the frequent subtrajectories from the deluge of irrelevant trajectories. Moreover, frequent subtrajectories may be translated or rotated with respect to one another. Our aim is to find such frequent subtrajectories in datasets resulting from high-throughput experiments. This is useful for identifying different protein unfolding pathways and, therefore, classifying proteins based on their structure.

The contributions of this paper are as follows: (i) We present a framework for finding frequent trajectories whose sampling interval is small enough to estimate their first and second order derivatives. (ii) We propose a robust framework for mining frequent translated trajectories and frequent trajectories that are both rotated and translated with respect to each other. (iii) We apply our method to find frequent trajectories in protein unfolding pathways. (iv) We present a multiresolution framework to speed up the mining process.

This paper is organized as follows. Section 2 presents some related work. Section 3 introduces the general framework we use for mining trajectories. Section 4 describes a method for mining translated and rotated trajectories. Section 5 offers an approach for optimizing the mining speed of frequent trajectories and dealing with noisy trajectories. Section 6 presents experiments testing the proposed approaches. Section 7 concludes the paper.

T. Washio et al. (Eds.): PAKDD 2007, LNAI 4819, pp. 174–185, 2007. © Springer-Verlag Berlin Heidelberg 2007

2 Related Work

In sequential pattern mining we are typically given a database containing sequences of transactions and we are interested in extracting the frequent sequences, where a sequence is frequent if the number of times it occurs in the database satisfies a minimum support threshold. Popular methods for mining such datasets include the GSP algorithm [2], which is an Apriori [1] based algorithm, and the PrefixSpan [10] algorithm. GSP can suffer from a high number of generated candidates and multiple database scans. Pattern growth methods such as PrefixSpan are more recent approaches for dealing with sequential pattern mining problems. They avoid the candidate generation step and focus the search on a restricted portion of the initial database, making them more efficient than GSP [2,10].

The problem that is most related to frequent trajectory mining is sometimes referred to in the literature as frequent spatio-temporal sequential pattern mining. The main difference between our work and previous work [9,7,4,5] is that our method assumes that we are dealing with densely sampled trajectories, i.e., trajectories whose sampling interval is small enough to allow us to extract the first and second derivatives of a trajectory. This allows us to define a neighborhood relation between the cells making up our trajectories, which in turn lets us perform various optimizations. There has been a significant amount of research on defining similarity measures for detecting whether two trajectories are similar [13,3]. However, this research has not focused on mining frequent trajectories. The previous work closest to our approach is [5], where the authors match two candidate subgraphs by comparing the set of angles made by the graph edges. This measure is similar to the curvature measure that we use later on to detect rotation and translation invariant trajectories. However, [5] is not suited for detecting trajectories that are translated but not rotated with respect to each other and does not address various robustness and speed improvements that are introduced in this paper.

3 Apriori Based Mining of Frequent Trajectories

We define a trajectory c as a continuous function c(s) = [x(s), y(s)] in the 2D case and as c(s) = [x(s), y(s), z(s)] in the 3D case. Similar extensions follow for higher dimensional trajectories. The function c(s) is an arc-length parameterization of a curve/trajectory. In other words, the parameter s denotes the length along the trajectory and c(s) denotes the position of the trajectory after traversing distance s. Our trajectories therefore do not depend on time, or on the speed with which the object/person traverses the trajectory. We assume independence from time and speed for mining the frequent trajectories and subtrajectories.

A trajectory c is frequent if the number of trajectories in $\{c_1, c_2, \ldots, c_o\}$ that pass through the path described by c satisfies a minimum support count (minsup). This definition requires only that there exist minsup subtrajectories of the trajectories in $\{c_1, c_2, \ldots, c_o\}$ that are identical to c; it does not require that $c = c_i$ for a sufficient number of $c_i$'s. More formally, we say that trajectory c over interval $[0, \tau]$ is frequent with respect to a dataset of trajectories if there exist minsup compact intervals $[\alpha_1, \alpha_1 + \tau], \ldots, [\alpha_{\mathit{minsup}}, \alpha_{\mathit{minsup}} + \tau]$ such that for all $i \in \{1, \ldots, \mathit{minsup}\}$ and for all $0 \le s \le \tau$ we have $c(s) = c_{\pi(i)}(\alpha_i + s)$ (where $\pi$ is a permutation function of $\{1, \ldots, o\}$).
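The support-based definition above can be illustrated on a discretized form of the data. The following is a minimal Python sketch, not the authors' implementation: it assumes each trajectory has already been reduced to a sequence of hashable positions (for example grid cells), and the names `contains_path`, `support`, and `is_frequent` are hypothetical. Each trajectory contributes at most once to the count, mirroring the role of the permutation π in the formal definition.

```python
from typing import Hashable, Sequence


def contains_path(trajectory: Sequence[Hashable], path: Sequence[Hashable]) -> bool:
    """Return True if `path` occurs as a contiguous run inside `trajectory`."""
    n, k = len(trajectory), len(path)
    if k == 0:
        return True
    target = tuple(path)
    # Slide a window of length k over the trajectory and compare.
    return any(tuple(trajectory[i:i + k]) == target for i in range(n - k + 1))


def support(dataset: Sequence[Sequence[Hashable]], path: Sequence[Hashable]) -> int:
    """Count the trajectories that pass through `path` (each counted once)."""
    return sum(1 for trajectory in dataset if contains_path(trajectory, path))


def is_frequent(dataset: Sequence[Sequence[Hashable]],
                path: Sequence[Hashable],
                minsup: int) -> bool:
    """A path is frequent if its support reaches the minimum support count."""
    return support(dataset, path) >= minsup


# Toy data: three discretized trajectories (assumed, for illustration only).
dataset = [
    [(1, 1), (2, 2), (3, 2)],
    [(0, 1), (1, 1), (2, 2)],
    [(2, 2), (1, 1)],
]
path = [(1, 1), (2, 2)]
print(support(dataset, path))         # 2: the third trajectory has the cells in reverse order
print(is_frequent(dataset, path, 2))  # True
```

Note that the path need not equal any whole trajectory; it only has to appear as a subtrajectory of at least minsup of them.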

Fig. 1. A cell sequence representation of a dense trajectory. The cell sequence representation of the dense trajectory consists of the gray cells (in order) that are intersected by the dense trajectory.
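As an illustration of the representation in Fig. 1, the sketch below maps a densely sampled 2D trajectory to its ordered cell sequence. It is a hedged example under stated assumptions: points are given as (x, y) pairs with non-negative coordinates, cells are squares of side `cell_size` indexed from 1, and the sampling is fine enough that no intersected cell is skipped between consecutive samples. The function name `cell_sequence` and its parameters are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cell_sequence(points: Sequence[Tuple[float, float]],
                  cell_size: float) -> List[Tuple[int, int]]:
    """Map dense (x, y) samples to the ordered list of grid cells they visit.

    Cells are 1-indexed (column, row) pairs. Consecutive samples that fall
    in the same cell are collapsed into a single entry, so the result
    records only the order in which cells are entered.
    """
    sequence: List[Tuple[int, int]] = []
    for x, y in points:
        cell = (math.floor(x / cell_size) + 1, math.floor(y / cell_size) + 1)
        if not sequence or sequence[-1] != cell:
            sequence.append(cell)
    return sequence


# Toy dense trajectory on a grid with unit cells (assumed data).
samples = [(0.2, 0.3), (0.6, 0.8), (1.1, 1.2), (1.6, 1.4), (2.3, 1.7)]
print(cell_sequence(samples, 1.0))  # [(1, 1), (2, 2), (3, 2)]
```

If the sampling were too coarse, consecutive entries could fail to be adjacent cells; interpolating and resampling the trajectory before calling such a function would be one way to avoid that.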

The frequent trajectory mining problem for 2D trajectories can be formulated as a sequential pattern mining problem in the following way. The 3D case is similar to the 2D case. Assume that we are observing a square region of size N × N over which all the trajectories occur. We split the region into a grid of square cells, as shown in Figure 1, and denote by $(x_i, y_j)$ the cell located at the $i$th column and $j$th row. A potential way of discretizing a region into a grid is by uniformly sampling along the two dimensions. In this paper we create the grid by uniform sampling, even though square cells are not necessary for our approach to work. Then we define:

(i) A trajectory c(s) is referred to as a dense trajectory if it is represented by a densely sampled set of points. The sampling interval depends on the problem at hand and should be small enough to obtain accurate first and second derivatives.

(ii) A dense trajectory's cell sequence refers to the sequence of cells $((x_{\pi^x(1)}, y_{\pi^y(1)}), \ldots, (x_{\pi^x(n)}, y_{\pi^y(n)}))$ intersected by the dense trajectory (where $\pi^x$ is a permutation function). The following conditions must hold: a. $\pi^x(i) \neq \pi^x(i+1)$ or $\pi^y(i) \neq \pi^y(i+1)$, and b. $|\pi^x(i) - \pi^x(i+1)| \le 1$ and $|\pi^y(i) - \pi^y(i+1)| \le 1$. Thus, we encode the order in which the dense trajectory intersects the cells. As we discuss below, in some situations it is preferable to also associate with each cell $(x_i, y_j)$ from the sequence the arc length/distance over which the trajectory falls in this cell.


(iii) The number of cells in a trajectory's cell sequence is its length. For example, the cell sequence $((x_4, y_3), (x_3, y_2), (x_3, y_3), (x_3, y_4), (x_2, y_5))$ has length 5.

(iv) A continuous subsequence $\omega$ of a trajectory c's cell sequence $((x_{\pi^x(1)}, y_{\pi^y(1)}), \ldots, (x_{\pi^x(n)}, y_{\pi^y(n)}))$ must satisfy $\omega = ((x_{\pi^x(i)}, y_{\pi^y(i)}), (x_{\pi^x(i+1)}, y_{\pi^y(i+1)}), \ldots, (x_{\pi^x(j)}, y_{\pi^y(j)}))$ where $1 \le i \le j \le n$.

Sometimes a trajectory c(s) might be represented by a small number of sample points. We can interpolate those points and subsample the interpolated trajectory to obtain the dense representation of those trajectories. Using the cell representation method to represent trajectories, the problem of mining frequent trajectories is defined as finding all the contiguous subsequences of the cell sequences in a database that satisfy a support threshold.

We first point out that frequent trajectories satisfy the Apriori property: any continuous subsequence of a frequent trajectory's cell sequence is frequent. We exploit this property to implement efficient algorithms for mining frequent cell sequences. If $(x_i, y_j)$ is our current cell position, the next allowable cell position $(x_k, y_l)$ must be one of its 8 neighboring cells, such that |i − k| ≤ 1 and |j − l| ≤ 1. We use this constraint to modify the GSP algorithm and generate a much lower number of candidates than the GSP algorithm would generate without this constraint.

Figure 2 shows the pseudocode for the Apriori based mining of frequent trajectories, where $L_k$ is the set of frequent length-k cell sequences found in the grid and $C_k$ is the set of candidate length-k cell sequences. The main difference between this algorithm and GSP lies in the trajectory() function for generating candidates of length k from frequent cell sequences of length k − 1 (Figure 2b).

Fig. 2. (a) Apriori based mining of frequent trajectories. (b) The trajectory() function.

When finding the candidate length-2 cell sequences, it suffices to only join two length-1 cell sequences, i.e., single cells, if the cells are neighboring/adjacent to each other, resulting in a much smaller number of candidates than if we had used GSP to accomplish this without using this neighborhood constraint. For k ≥ 2, when joining length-k cell sequences to find length-(k + 1) candidate cell sequences, it suffices to only join cell sequences a with b if the last k − 1 cells of a and the first k − 1 cells of b are identical. We notice that by joining two continuous paths, the resulting path is also continuous. Also, notice that there is no pruning step in the candidate generation process of our algorithm. This is because pruning may cause the loss of good candidates, since we are mining for frequent contiguous subsequences. This is another major difference between the standard GSP algorithm and this algorithm.

Assume there are r cells into which our N × N region has been split and there are b neighboring cells for each non-boundary cell. In our case b = 8, since each cell is surrounded by at most 8 other cells. Then, the upper bound on the number of length-(k + 1) candidate cell sequences generated is $|L_k| \times b$, since every sequence in $L_k$ can only be extended by its b neighboring cells on one end of the sequence. This is lower than the upper bound of $|L_k| \times r$ that GSP might generate if we were dealing with a sequential pattern mining problem where we could not apply this neighborhood constraint, since typically b