Technical Report - Computer Science & Engineering


Technical Report

Department of Computer Science and Engineering University of Minnesota 4-192 Keller Hall 200 Union Street SE Minneapolis, MN 55455-0159 USA

TR 15-020 Tripoles: A New Class of Climate Teleconnections Saurabh Agrawal, Gowtham Atluri, Stefan Liess, Snigdhansu Chatterjee, Vipin Kumar

December 11, 2015


Technical Report Tripoles: A New Class of Climate Teleconnections Saurabh Agrawal, Gowtham Atluri, Vipin Kumar∗, Stefan Liess†, and Snigdhansu Chatterjee‡ November 2015

Abstract Teleconnections in climate represent a persistent and large-scale temporal connection in a given climate variable between two distant geographical regions. They are known to impact and explain the variability in climate of many regions across the globe and have been a subject of interest to climatologists. Traditionally, climate teleconnections have been studied as a persistent relationship between a pair of geographical regions (e.g. North Atlantic Oscillation (NAO), and El-Nino Southern Oscillation (ENSO)). In this report, we define a new class of climate teleconnections which we refer to as tripoles that capture climatic relationships between three regions, in contrast to teleconnections that are traditionally defined using only two regions. We further provide a categorization of tripoles based on pairwise relationships between the three participating regions and propose a shared nearest neighbor (SNN) graph-based approach to find tripoles in a given spatio-temporal dataset.

1 Introduction

Climate teleconnections represent persistent connections between the climate anomalies observed at regions that are located far from each other [9]. Dipoles are one of the most widely studied categories of teleconnections in climate [4]; they are identified as pairs of regions that show opposite climate anomalies at the same time. Prominent and well-known examples of dipoles are the North Atlantic Oscillation (NAO) and the Southern Oscillation (SO). Figures 1 and 2 show the time series observed at the two ends of NAO and SO respectively. Dipoles have been shown to be important for explaining the variability in climate of many regions across the globe [6].

∗ Department of Computer Science, University of Minnesota
† Department of Earth Sciences, University of Minnesota
‡ Department of Statistics, University of Minnesota


Figure 1: North Atlantic Oscillation

Figure 2: Southern Oscillation


Climatic relationships can potentially exist among three regions, and such relationships cannot be captured using the traditional definition of teleconnections. For illustration, consider the three regions $R_1$, $R_2$, and $R_3$ shown in Figure 3a, where the area-averaged anomaly time series [footnote 1] ($T_1$, $T_2$, and $T_3$ respectively) of Sea Level Pressure (SLP) were observed over the winter months (December, January, February) during 1979-2011. Figure 3b shows the correlations between three pairs of time series: 1) the anomaly time series of $R_3$ and $R_1$, 2) the anomaly time series of $R_3$ and $R_2$, and 3) the anomaly time series of $R_3$ and the difference of the anomaly time series of $R_1$ and $R_2$ ($R_1 - R_2$), in the lower, middle and upper panels respectively. The time series of $R_3$ shows weak correlations with those of $R_1$ (0.28) and $R_2$ (-0.25). However, if the anomaly time series of $R_1$ and $R_2$ are subtracted, the resultant time series shows a much stronger correlation with that of $R_3$ (0.65). This indicates that there is a strong connection between the SLP anomalies observed at $R_3$ and the difference of the SLP anomalies observed at $R_1$ and $R_2$. This connection cannot be revealed if the pairwise relationships $(R_3, R_1)$ and $(R_3, R_2)$ alone are considered, so the traditional definition of teleconnection patterns is not suitable for capturing such relationships.

This motivated us to define a new teleconnection pattern called a tripole, which comprises three regions ($R_1$, $R_2$, and $R_3$) along with their corresponding time series $T_1$, $T_2$, and $T_3$, such that $T_3$ is strongly correlated with $T_{12}$, the resultant combination of $T_1$ and $T_2$. Combining the two time series $T_1$ and $T_2$ is justified only if the resultant time series $T_{12}$ shows a much stronger correlation with $T_3$ than either of the pairwise correlations between $T_1$ and $T_3$, and between $T_2$ and $T_3$.

It should be noted that the concept of a tripole is, by itself, more general and not limited to the spatio-temporal (or climate) domain. A combination of two time series may show a strong correlation with a third time series even when each of them is only weakly correlated with it individually; this can occur in any time series data, such as stock market data and gene-expression data. Figure 4 shows an example of a tripole found across the normalized time series of three stocks: Lennar Corporation (LEN), DaVita HealthCare Partners Inc. (DVA) and FMC Technologies, Inc. (FTI), observed on a weekly scale during the period 2002-2006. As shown in the figure, the time series of stock LEN is weakly correlated with those of stocks DVA (0.27) and FTI (-0.33), which does not indicate any obvious relationship between LEN and either of the other two stocks. However, the correlation between the time series of LEN and the difference of the time series of DVA and FTI is extremely strong (0.93), which is indicative of a strong relationship.

The organization of the report is as follows. The related work on finding teleconnections in climate data is discussed in Section 2. Notations and definitions related to tripoles are introduced in Section 3. A categorization of tripoles based on pairwise correlations between the three regions is proposed in Section 4. Section 5 discusses the challenges related to searching for tripoles in spatio-temporal data. In Section 6, we present a detailed description of our algorithm to search for tripoles in the data.

[footnote 1] Anomaly time series are computed as deviations from the mean to remove annual seasonality. See [10] for further details. The time series for a region is computed by averaging the time series observed at all the locations within the region.

(a) An example of a tripole

(b) Correlations between anomaly time series

Figure 3: An example of a tripole in Sea Level Pressure (SLP) data


Figure 4: An example of a tripole formed by time series of three stocks

2 Related Work

Most of the related work on searching for teleconnections has focused on finding dipoles, which involve a high negative correlation between two distant geographical regions. Empirical Orthogonal Function (EOF) analysis (also known as Principal Component Analysis (PCA)) is one of the most commonly used techniques. In it, a covariance matrix of the space-time data observed over a selected region during a given period of time is decomposed into a set of mutually orthogonal eigenvectors, and the two ends of dipoles are typically captured as dominant spatial patterns (with opposite signs) in the top k eigenvectors [8, 9]. However, the spatial patterns captured in eigenvectors are not guaranteed to capture dipoles or any other physically meaningful pattern in general. Further, different dipoles can get mixed across multiple EOFs [1].

Over the last decade, several graph-based approaches have been introduced for the automated discovery of teleconnections [7, 4, 3]. These approaches first construct a climate graph where each location is treated as a node and the edge weight between a pair of nodes is computed as the correlation between the anomaly time series of the corresponding locations. The regions participating in the teleconnections are identified by applying different clustering techniques on the climate graph, including Shared Nearest Neighbor (SNN) and Shared Reciprocal Nearest Neighbor (SRNN). However, none of these approaches is guaranteed to identify all the teleconnections. SNN approaches can be further extended to find tripoles, and one such extension is presented in this report.


3 Definitions and Notations

In this section, we introduce the notation used in this report and then define the notion of a tripole systematically.

3.1 Notations

Let a gridded spatio-temporal dataset $D$ contain a set of locations $\{l_1, l_2, \ldots, l_n\}$, and let the time series observed at these locations be $T^l_1, T^l_2, \ldots, T^l_n$. We treat a set of $m$ contiguous locations $\{l_{i1}, l_{i2}, \ldots, l_{im}\}$ as a region $R_i$ of size $m$. The area-averaged time series over all locations in $R_i$ is denoted by $T^r_i$. The variance of a time series $T^r_i$ is denoted by $var(T^r_i)$. We denote the correlation between two time series $T^r_i$ and $T^r_j$ of length $t$ ($T^r_i, T^r_j \in \mathbb{R}^t$) by $corr(T^r_i, T^r_j)$. A combining function $f(T^r_i, T^r_j)$ combines two time series $T^r_i$ and $T^r_j$ of length $t$ into a new time series $T^r_{comb}$ of the same length $t$.

3.2 Definitions

DEFINITION 1: Given three time series $T^r_1$, $T^r_2$, and $T^r_3$, and a combining function $f()$, the series $T^r_1$, $T^r_2$, and $T^r_3$ form a tripole if

\[ |\mathrm{corr}(f(T^r_1, T^r_2), T^r_3)| > \max\left(|\mathrm{corr}(T^r_1, T^r_3)|, |\mathrm{corr}(T^r_2, T^r_3)|\right) \tag{1} \]

The resultant tripole is denoted by $(T^r_3 : f(T^r_1, T^r_2))$, where $T^r_3$ is called the root of the tripole whereas $T^r_1$ and $T^r_2$ are called the leaves of the tripole. In this work, we limit our discussion to two simple combining functions: addition and subtraction. In the context of climate science, the three time series $T^r_1$, $T^r_2$, and $T^r_3$ in a tripole represent the average time series of three regions, where a region is defined as a set of spatially contiguous locations that are strongly correlated with each other. Based on the choice of the function, we categorize tripoles into
• Sumtripoles - All tripoles in which the combining function is $f(T^r_i, T^r_j) = T^r_i + T^r_j$.
• Difftripoles - All tripoles in which the combining function is $f(T^r_i, T^r_j) = T^r_i - T^r_j$.

To assess the interestingness of a tripole, we propose two measures:

1. Strength - The strength of a tripole $(T^r_3 : f(T^r_1, T^r_2))$, denoted by $Strength(T^r_3 : f(T^r_1, T^r_2))$, is computed as $|corr(f(T^r_1, T^r_2), T^r_3)|$, i.e. the magnitude of the correlation between $T^r_3$ and the combination of $T^r_1$ and $T^r_2$. An interesting tripole is one that captures a strong connection between the root ($T^r_3$) and the combination of its leaves $f(T^r_1, T^r_2)$, hence its strength should be high. Given the pairwise correlations between $T^r_1$, $T^r_2$, and $T^r_3$,

the strength of a sumtripole and a difftripole can be directly computed using the following relations:

\[ \mathrm{corr}(T^r_1 + T^r_2, T^r_3) = \frac{\mathrm{corr}(T^r_1, T^r_3) + \mathrm{corr}(T^r_2, T^r_3)}{\sqrt{2\,(1 + \mathrm{corr}(T^r_1, T^r_2))}} \tag{2} \]

\[ \mathrm{corr}(T^r_1 - T^r_2, T^r_3) = \frac{\mathrm{corr}(T^r_1, T^r_3) - \mathrm{corr}(T^r_2, T^r_3)}{\sqrt{2\,(1 - \mathrm{corr}(T^r_1, T^r_2))}} \tag{3} \]

The above relations hold only for standardized time series, i.e. when $T^r_1$, $T^r_2$, and $T^r_3$ are standardized to have zero mean and unit variance. These relations (Eqs. (2) and (3)) follow directly from the definition of covariance between two time series. A detailed proof is provided in Appendix A.

2. Delta - The delta of a tripole $(T^r_3 : f(T^r_1, T^r_2))$, denoted by $Delta(T^r_3 : f(T^r_1, T^r_2))$, is computed as

\[ Strength(T^r_3 : f(T^r_1, T^r_2)) - \max\left(|\mathrm{corr}(T^r_1, T^r_3)|, |\mathrm{corr}(T^r_2, T^r_3)|\right) \]

A high value of $Strength(T^r_3 : f(T^r_1, T^r_2))$ indicates a strong connection between $T^r_3$ and the combination of $T^r_1$ and $T^r_2$, which makes it an interesting tripole. Furthermore, a high value of delta indicates a drastic improvement in the connection between $T^r_3$ and the combination $f(T^r_1, T^r_2)$ when compared to the pairwise connections $(T^r_3, T^r_1)$ and $(T^r_3, T^r_2)$, which further characterizes the interestingness of the tripole. Therefore, a tripole is interesting if both its strength and its delta are high.

DEFINITION 2: A pair of regions $R_i$ and $R_j$ form a dipole if there exists a significant negative correlation between their corresponding area-averaged anomaly time series $T^r_i$ and $T^r_j$.
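The two measures and the closed-form relations (2) and (3) can be checked numerically. The following is a minimal sketch in Python with NumPy, not the authors' implementation; the function names and the synthetic series are assumptions made purely for illustration.

```python
import numpy as np

def standardize(x):
    """Rescale a series to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def corr(a, b):
    """Pearson correlation between two equal-length series."""
    return float(np.corrcoef(a, b)[0, 1])

def strength_and_delta(t1, t2, t3, kind="diff"):
    """Strength and delta of the tripole (t3 : f(t1, t2)), computed from the data."""
    t1, t2, t3 = standardize(t1), standardize(t2), standardize(t3)
    combined = t1 + t2 if kind == "sum" else t1 - t2
    strength = abs(corr(combined, t3))
    delta = strength - max(abs(corr(t1, t3)), abs(corr(t2, t3)))
    return strength, delta

def strength_from_pairwise(c13, c23, c12, kind="diff"):
    """Strength from pairwise correlations only, following Eq. (2) and Eq. (3)."""
    if kind == "sum":
        return abs(c13 + c23) / np.sqrt(2.0 * (1.0 + c12))
    return abs(c13 - c23) / np.sqrt(2.0 * (1.0 - c12))

# Synthetic illustration (hypothetical data, not taken from the report):
# the leaves share a common component and carry the root signal with opposite signs.
rng = np.random.default_rng(0)
shared = rng.standard_normal(500)
signal = rng.standard_normal(500)
t1 = shared + 0.4 * signal
t2 = shared - 0.4 * signal
t3 = signal + 0.3 * rng.standard_normal(500)

strength, delta = strength_and_delta(t1, t2, t3, kind="diff")
closed_form = strength_from_pairwise(corr(t1, t3), corr(t2, t3), corr(t1, t2), kind="diff")
print(round(strength, 3), round(delta, 3), round(closed_form, 3))
```

Because the leaves are standardized before being combined, the directly computed strength and the closed-form value agree up to floating-point error, which is the content of Eqs. (2) and (3).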

4 Categories of Triplets of Time Series

Triplets of time series $T^r_1$, $T^r_2$, $T^r_3$ can be categorized, based on the signs of their pairwise correlations, into the following four categories:
1. Positive Triplets - In a positive triplet $T^r_1$, $T^r_2$, $T^r_3$, all the pairwise correlations between the three time series are positive.
2. Mixed OPTN (One Positive Two Negative) Triplets - In a mixed OPTN triplet $T^r_1$, $T^r_2$, $T^r_3$, exactly two of the pairwise correlations between the three time series are negative.
3. Mixed ONTP (One Negative Two Positive) Triplets - In a mixed ONTP triplet $T^r_1$, $T^r_2$, $T^r_3$, exactly two of the pairwise correlations between the three time series are positive.

(a) Positive Sumtripole

(b) Positive Difftripole

(c) Mixed OPTN Sumtripole

(d) Mixed OPTN Difftripole

(e) Mixed ONTP Sumtripole

(f) Mixed ONTP Difftripole

(g) Negative Sumtripole

Figure 5: Examples of different categories of tripoles that can be found in data. The strengths of the pairwise correlations for all the three pairs are shown along the corresponding edges. The final strength of the tripole is indicated in the bold at the center.


4. Negative Triplets - In a negative triplet $T^r_1$, $T^r_2$, $T^r_3$, all the pairwise correlations between the three time series are negative.

Fig. 5 shows examples of sumtripoles and difftripoles that can be formed in the four categories of triplets. Note that a difftripole can never occur in a negative triplet, which follows directly from Eq. (3): when both leaves are negatively correlated with the root, the numerator $|corr(T^r_1, T^r_3) - corr(T^r_2, T^r_3)|$ cannot exceed the larger of the two pairwise magnitudes, and when the leaves are negatively correlated with each other the denominator exceeds $\sqrt{2}$, so the strength stays below $\max(|corr(T^r_1, T^r_3)|, |corr(T^r_2, T^r_3)|)$ and Definition 1 is violated.

Further analysis of Eq. (3) also indicates that a high strength of a difftripole $(T^r_3 : f(T^r_1, T^r_2))$ is obtained when 1) the leaves $T^r_1$ and $T^r_2$ are positively correlated, and 2) the leaves show correlations of opposite signs with the root $T^r_3$, i.e. one of the leaves shows a positive correlation with $T^r_3$ whereas the other shows a negative correlation with $T^r_3$. Among the different categories of difftripoles, only Mixed ONTP difftripoles satisfy the two conditions and are thus more likely to show higher strength. Similarly, the analysis of Eq. (2) indicates that a high strength of a sumtripole $(T^r_3 : f(T^r_1, T^r_2))$ is obtained when 1) the leaves $T^r_1$ and $T^r_2$ are negatively correlated, and 2) the leaves show correlations of the same sign with the root $T^r_3$, i.e. both leaves show either a positive or a negative correlation with $T^r_3$. Only Negative sumtripoles and Mixed ONTP sumtripoles satisfy the two conditions and thus are more likely to show higher strength. In this study, we limit our focus to Negative sumtripoles.
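The categorization, and the claim that a negative triplet cannot yield a difftripole, can be illustrated with a small sketch. The function names are hypothetical, a correlation of exactly zero is treated as positive (an assumption, since the report does not discuss that case), and the random triples below are not checked for being jointly realizable correlation values, which only makes the check stricter.

```python
import math
import random

def triplet_category(c12, c13, c23):
    """Category of a triplet from the signs of its three pairwise correlations."""
    n_negative = sum(c < 0 for c in (c12, c13, c23))
    return ("Positive", "Mixed ONTP", "Mixed OPTN", "Negative")[n_negative]

def difftripole_strength(c13, c23, c12):
    """Magnitude of Eq. (3) for standardized series."""
    return abs(c13 - c23) / math.sqrt(2.0 * (1.0 - c12))

print(triplet_category(0.6, 0.3, -0.2))    # one negative, two positive -> Mixed ONTP
print(triplet_category(-0.6, -0.3, -0.2))  # all negative -> Negative

# In a negative triplet, the difftripole strength never exceeds the larger
# pairwise magnitude, so the condition of Definition 1 is never met.
random.seed(0)
violations = 0
for _ in range(10000):
    c12, c13, c23 = (-random.random() for _ in range(3))
    if difftripole_strength(c13, c23, c12) > max(abs(c13), abs(c23)):
        violations += 1
print(violations)  # 0
```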

5 Challenges

Given a gridded climate dataset consisting of time series observed at $N$ grid points, the most simplistic approach would be to search for tripoles of grid points in a brute-force manner. However, this approach has several limitations. First, searching all possible combinations of three grid points has $O(N^3)$ time complexity and is hence computationally expensive. Second, due to high spatial autocorrelation, a large number of redundant tripoles would be discovered across multiple combinations of grid points from the same three regions, all potentially representing the same climate phenomenon. For illustration, if a tripole pattern exists across three regions, each of size $K$, then the number of tripoles formed would be $O(K^3)$. Therefore, one would need to design a clustering mechanism and a suitable similarity measure to group the tripoles appropriately. Furthermore, since climate phenomena span thousands of kilometres, tripoles involving regions are desired as opposed to tripoles involving individual grid points. Therefore, in this work, we focus on searching for tripoles involving three regions such that their area-averaged anomaly time series form a tripole, where a region is defined as a set of spatially coherent grid points whose anomaly time series are strongly correlated with each other. One possible approach would be to first cluster the data into spatially coherent clusters and then search all combinations of three regions. However, since the process of finding regions is then completely independent of the objective of finding relationships between them, the actual regions participating in a tripole can be missed.
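To give a sense of scale for the brute-force search, the number of unordered triples of grid points can be counted directly. This is only a back-of-the-envelope sketch: the grid size is the one reported for the dataset in Section 6.1.1, while the region size K below is a hypothetical value chosen for illustration.

```python
import math

n_grid_points = 10512                    # 2.5 x 2.5 degree global grid (Section 6.1.1)
n_triples = math.comb(n_grid_points, 3)  # unordered triples of grid points
print(f"{n_triples:.3e}")                # roughly 1.9e11 candidate triples

region_size = 50                         # hypothetical region size K
redundant = region_size ** 3             # grid-point tripoles spanning three such regions
print(redundant)                         # 125000
```

Every one of those triples would additionally require correlations over the full time series, and a single underlying pattern across three regions of 50 grid points each would already appear as on the order of 10^5 grid-point tripoles.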


6 Proposed Approach

On account of the issues discussed above, we propose a novel methodology based on the Shared Nearest Neighbor (SNN) clustering approach to discover tripoles. SNN clustering was also used in [3] for discovering dipoles in climate data. The major advantage of this approach is that the relationships and the corresponding regions are found simultaneously. Our proposed methodology to discover tripoles consists of two phases: we first find all the pairs of strongly correlated regions (negatively correlated regions for negative sumtripoles), which are treated as potential pairs of leaves of a tripole, and then, for each such pair of leaves, we search for root regions that could form a tripole with it. We first briefly describe the background and the preliminary data processing steps that precede our proposed algorithm for searching tripoles. Following that, we present a formal description of our algorithm for searching negative sumtripoles. At the end, we discuss the choice of the threshold parameters used in the algorithm.

6.1 Background and Data Preliminaries

6.1.1 Dataset

We used reanalysis datasets of Sea Level Pressure (SLP) to find tripoles. Reanalysis projects provide gridded global datasets by assimilating remote and in-situ sensor measurements using a numerical climate model to achieve physical consistency and interpolation for global coverage. Reanalysis datasets are typically considered the best available surrogate for real observations. We used the dataset provided by the NCEP/National Center for Atmospheric Research (NCAR) Reanalysis Project [5], which is available at a monthly scale spanning 1979-2014 at a spatial resolution of 2.5 × 2.5 degrees (10512 grid points). We present our results only for the winter months (December, January, and February), thereby resulting in 36 × 3 = 108 observations for each grid point.

6.1.2 Detrending

The presence of long-term increasing or decreasing linear trends often leads to an artificial increase in the strength of correlations [2], so such trends should be removed from the data. For each location $l_i$, we first compute the least-squares fit of a straight line to its corresponding time series $T^l_i$ and then subtract the fitted line from $T^l_i$ to remove any linear trend present in $T^l_i$.

6.1.3 Seasonality removal and Standardization

The annual cycle of seasonality is one of the strongest signals that can be observed in Earth science data. It is important to remove seasonality so that the discovered relationships are not an artifact of the annual cycle. The standard method to remove seasonality is to construct the anomaly time series from the raw data by removing the monthly mean values of the data. Given monthly data $x_y(i)$ between the years $startYear$ and $endYear$, where $x_y(i)$ represents the observation for the $i$th month in the $y$th year, the mean $\mu_i$ for each month $i$ is calculated as

\[ \mu_i = \frac{1}{endYear - startYear + 1} \sum_{y = startYear}^{endYear} x_y(i), \quad \forall i \in \{1, 2, \ldots, 12\} \tag{4} \]

The anomaly time series is then constructed by subtracting the monthly mean $\mu_i$ from all the observations for the $i$th month:

\[ x'_y(i) = x_y(i) - \mu_i, \quad \forall y \in \{startYear, \ldots, endYear\} \tag{5} \]

The observations in equatorial regions have much lower variance than those at the polar regions. To ensure similar treatment of every location, the anomaly time series at all locations are further standardized to have zero mean and unit variance. We denote the set of standardized anomaly time series at all locations by $D_{anom}$.

6.1.4 Network Construction

Once the seasonality and the linear trends are removed, a complete graph $G$ is constructed from the data using the approach used earlier in [3], in which the correlations between the anomaly time series of all pairs of locations are computed. The nodes in the graph represent locations on the Earth, and the edge weights represent the correlation between the anomaly time series of the two corresponding locations.
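The preprocessing steps of Section 6.1 (detrending, seasonality removal following Eqs. (4) and (5), standardization, and construction of the correlation graph) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' code: the data are assumed to be stored as a (time × locations) array, the graph is represented as a dense correlation matrix, all function names are hypothetical, and the input is synthetic.

```python
import numpy as np

def detrend(series):
    """Subtract the least-squares straight-line fit from every column (location)."""
    t = np.arange(series.shape[0])
    slope, intercept = np.polyfit(t, series, deg=1)  # one fit per column
    return series - (np.outer(t, slope) + intercept)

def monthly_anomalies(series, months):
    """Subtract each calendar month's mean over all years (Eqs. 4 and 5)."""
    out = np.array(series, dtype=float)
    for m in np.unique(months):
        rows = months == m
        out[rows] -= out[rows].mean(axis=0)
    return out

def standardize(series):
    """Zero mean and unit variance for every location."""
    return (series - series.mean(axis=0)) / series.std(axis=0)

def correlation_graph(series):
    """Edge-weight matrix of the complete graph: entry (i, j) is corr(location i, location j)."""
    return np.corrcoef(series, rowvar=False)

# Synthetic stand-in for 36 winters (Dec, Jan, Feb) at 5 locations.
rng = np.random.default_rng(0)
months = np.tile([12, 1, 2], 36)
t = np.arange(months.size)
seasonal = np.where(months == 12, 2.0, np.where(months == 1, -1.0, 0.5))
raw = rng.standard_normal((months.size, 5)) + 0.02 * t[:, None] + seasonal[:, None]

D_anom = standardize(monthly_anomalies(detrend(raw), months))
G = correlation_graph(D_anom)
print(D_anom.shape, G.shape)  # (108, 5) (5, 5)
```

At the full resolution of 10512 grid points the dense matrix has roughly 1.1e8 entries, which is one reason a real implementation might store it in single precision or in blocks; that choice is outside what the report specifies.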

6.2 Searching Negative Sumtripoles

Once the graph $G$ is constructed, we apply our algorithm on $G$ to search for negative sumtripoles. The proposed algorithm takes the graph $G$ as input and outputs a list of all groups of three regions whose average time series form a negative sumtripole. Our approach is summarized in Algorithm 1. Apart from the complete graph $G$, the algorithm requires as input the set of anomaly time series at all locations (denoted by $D_{anom}$) and six threshold parameters used at different stages of the algorithm: $NegDipTh$ (threshold applied on negative correlations while finding dipoles), $NegRootDipTh$ (threshold on negative correlations between locations of the root and the dipole ends), $NbThresh$ (threshold on positive correlations between a given location and all the locations in its neighborhood), $StrengthTh$ (lower bound on the strength of the tripole), $DeltaTh$ (lower bound on the delta of the tripole), and $MinSize$ (lower bound on the size of all the regions in a tripole).

Algorithm 1 is a two-phase approach. In the first phase, we find all strong dipoles, i.e. pairs of negatively correlated regions, using an SNN approach similar to [3]. In the second phase, each pair of regions obtained in the first phase is treated as a potential pair of leaves of a sumtripole, for which all the potential roots, i.e. regions that would form a sumtripole with the given pair, are obtained. In each of the two phases, different procedures are used for specific sub-tasks. We now describe each procedure in detail.

6.2.1 Procedure GET NEIGHB

This procedure (Algorithm 5) is used to construct a neighborhood $Neighb$ around a given location. For a given graph $G$, location $pt$, and threshold parameter $NbThresh$, GET NEIGHB outputs a set $Neighb$ that consists of all locations $l_j$ in $G$ that show a correlation ≥ $NbThresh$ with $pt$.

6.2.2 Procedure FIND DIP REGS

This procedure is used to find a pair of regions $R_1$ and $R_2$ around two given locations $l_1$ and $l_2$ respectively, such that $R_1$ and $R_2$ form a dipole. Along with the two given locations, it takes as additional inputs the complete graph $G$, the set of all anomaly time series $D_{anom}$, and the two parameters $NbThresh$ and $NegDipTh$. It outputs a dipole that includes the two regions $R_1$ and $R_2$, their average time series $T^r_1$ and $T^r_2$ respectively, and the strength of the dipole, which is equal to the correlation between $T^r_1$ and $T^r_2$. As described in Algorithm 3, first the neighborhoods $P_1$ and $P_2$ are obtained using the GET NEIGHB procedure (line 2). Then the region $R_1$ ($R_2$) is extracted from $P_1$ ($P_2$) by discarding all the locations in $P_1$ ($P_2$) that do not show a negative correlation (stronger than $NegDipTh$, the input parameter) with any of the locations in $P_2$ ($P_1$), as shown in lines 3-8. As a result, we get $R_1$ ($R_2$) as the set of all locations that show a strong positive correlation (≥ $NbThresh$) with $l_1$ ($l_2$) and a strong negative correlation (stronger than $NegDipTh$) with at least one location in $R_2$ ($R_1$).

6.2.3 Procedure FIND DIPOLES

This procedure is used in the first phase of the algorithm to find all dipoles in the graph $G$. It takes the graph $G$, the set of all anomaly time series $D_{anom}$, and the three parameters $NegDipTh$, $NbThresh$, and $MinSize$ as input and outputs a list of all pairs of negatively correlated regions, i.e. dipoles, found in $G$. In Algorithm 2, first all the negative edges in $G$ are sorted in ascending order of their strengths and are marked UNSEEN (lines 3-5). A pair of locations $l_1$ and $l_2$ corresponding to the strongest unseen negative edge is chosen (lines 8-13), for which a pair of negatively correlated regions $R_1$ and $R_2$ is obtained using procedure FIND DIP REGS (line 14). The pair $(R_1, R_2)$ is added to the list of discovered dipoles only if the negative correlation between the two regions is stronger than $NegDipTh$ and both regions are big enough in size (≥ $MinSize$) (lines 15-18). All the edges connecting locations in $R_1$ with locations in $R_2$ are marked as SEEN so that they are not re-examined in the search for the remaining dipoles in the graph (lines 19-20). The search for the remaining dipoles is continued by repeating lines 8-20 until all the negative edges are examined.
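The first phase (procedures GET NEIGHB, FIND DIP REGS and FIND DIPOLES) can be sketched as below. This is a simplified illustration rather than the authors' implementation: the graph is assumed to be a dense correlation matrix, region time series are plain (unweighted) averages over locations, SEEN flags are kept in a boolean matrix, and the synthetic data and all function names are hypothetical.

```python
import numpy as np

def get_neighb(G, loc, nb_thresh):
    """Locations whose correlation with `loc` is at least nb_thresh."""
    return np.flatnonzero(G[loc] >= nb_thresh)

def find_dip_regs(l1, l2, G, D, nb_thresh, neg_dip_th):
    """Regions around l1 and l2 whose members have a strong negative edge to the other side."""
    P1, P2 = get_neighb(G, l1, nb_thresh), get_neighb(G, l2, nb_thresh)
    strong = G[np.ix_(P1, P2)] <= -neg_dip_th      # cross correlations P1 vs P2
    R1, R2 = P1[strong.any(axis=1)], P2[strong.any(axis=0)]
    if R1.size == 0 or R2.size == 0:
        return R1, R2, 0.0
    strength = np.corrcoef(D[:, R1].mean(axis=1), D[:, R2].mean(axis=1))[0, 1]
    return R1, R2, float(strength)

def find_dipoles(G, D, neg_dip_th, nb_thresh, min_size):
    """Scan negative edges from strongest to weakest and grow a dipole around each unseen one."""
    n = G.shape[0]
    rows, cols = np.triu_indices(n, k=1)
    negative = G[rows, cols] < 0
    rows, cols = rows[negative], cols[negative]
    order = np.argsort(G[rows, cols])               # most negative edge first
    seen = np.zeros((n, n), dtype=bool)
    dipoles = []
    for k in order:
        l1, l2 = rows[k], cols[k]
        if seen[l1, l2]:
            continue
        R1, R2, strength = find_dip_regs(l1, l2, G, D, nb_thresh, neg_dip_th)
        if strength <= -neg_dip_th and R1.size >= min_size and R2.size >= min_size:
            dipoles.append((R1, R2, strength))
        seen[np.ix_(R1, R2)] = True                 # edges between the two regions
        seen[np.ix_(R2, R1)] = True
    return dipoles

# Synthetic data: locations 0-3 follow a signal, 4-7 its negative, 8-11 are noise.
rng = np.random.default_rng(0)
base = rng.standard_normal(200)
D = 0.3 * rng.standard_normal((200, 12))
D[:, 0:4] += base[:, None]
D[:, 4:8] -= base[:, None]
G = np.corrcoef(D, rowvar=False)

dipoles = find_dipoles(G, D, neg_dip_th=0.5, nb_thresh=0.5, min_size=2)
for R1, R2, strength in dipoles:
    print(R1.tolist(), R2.tolist(), round(strength, 2))
```

On this toy input the search returns a single dipole whose two ends are the two anti-correlated groups; the threshold values are arbitrary choices for the example, not recommendations from the report.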

6.2.4 Procedure FIND ROOT

This procedure is used in the second phase of Algorithm 1 to find all the regions that would form a negative sumtripole (with sufficient strength and delta) with a given dipole. Apart from a dipole, the procedure takes the graph $G$, the set of all anomaly time series $D_{anom}$, and five parameters, namely $NegRootDipTh$, $NbThresh$, $StrengthTh$, $DeltaTh$, and $MinSize$, as input. It outputs a list of regions, each of which forms a negative sumtripole with the given dipole, with the two ends $R_1$ and $R_2$ of the dipole as its leaves. As explained in Algorithm 4, first all the locations in $G$ that show negative correlations (stronger than $NegRootDipTh$, the input parameter) with at least one location in each of $R_1$ and $R_2$ are separated out as the set of potential locations for the third region (denoted by $PotLocsR_3$ in lines 2-4). Next, all locations in $PotLocsR_3$ are sorted in ascending order of the sum of their correlations with the average time series of each end of the given dipole and are marked UNSEEN (lines 5-8). An unseen location $Loc$ with the most negative sum is chosen (lines 10-14), a neighborhood $LocNeighb$ around $Loc$ is obtained using procedure GET NEIGHB, and the region $R_3$ is obtained as the intersection of $LocNeighb$ and $PotLocsR_3$ (lines 15-16). As a result, all the locations in $R_3$ show a negative correlation (stronger than $NegRootDipTh$) with at least one location in each end of the given dipole and show a strong positive correlation (≥ $NbThresh$) with $Loc$, the centre of the region $R_3$. After obtaining $R_3$, the strength and the delta of the resultant sumtripole $(T^r_3 : Dipole.T^r_1 + Dipole.T^r_2)$ formed between the average time series of the three regions are calculated (lines 17-18). Only those sumtripoles whose strength and delta are at least $StrengthTh$ and $DeltaTh$ respectively are considered to be interesting.

The region $R_3$ is added to the final output list only if it is big enough in size (≥ $MinSize$) and the resultant sumtripole is interesting (lines 19-22). All the locations in $R_3$ are then marked SEEN, and the process covered in lines 10-22 is repeated until all the locations in $PotLocsR_3$ are marked SEEN.
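The root search described above can be sketched as follows. As before, this is an illustrative sketch and not the authors' code: region series are unweighted averages, the two leaf series are standardized before being summed (an assumption made so that Eq. (2) applies), the dipole is passed in as two index arrays, and the synthetic data, thresholds and function names are hypothetical.

```python
import numpy as np

def zscore(x):
    """Rescale a series to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def find_root(R1, R2, G, D, nb_thresh, neg_root_dip_th, strength_th, delta_th, min_size):
    """Regions R3 forming a negative sumtripole (T3 : T1 + T2) with the dipole (R1, R2)."""
    T1, T2 = zscore(D[:, R1].mean(axis=1)), zscore(D[:, R2].mean(axis=1))
    neg_nbs_r1 = (G[:, R1] <= -neg_root_dip_th).any(axis=1)
    neg_nbs_r2 = (G[:, R2] <= -neg_root_dip_th).any(axis=1)
    pot_locs = np.flatnonzero(neg_nbs_r1 & neg_nbs_r2)   # potential locations for R3
    corr_sum = [corr(D[:, l], T1) + corr(D[:, l], T2) for l in pot_locs]
    sorted_locs = pot_locs[np.argsort(corr_sum)]         # most negative sum first
    seen = np.zeros(G.shape[0], dtype=bool)
    roots = []
    for loc in sorted_locs:
        if seen[loc]:
            continue
        neighb = np.flatnonzero(G[loc] >= nb_thresh)
        R3 = np.intersect1d(neighb, pot_locs)
        T3 = D[:, R3].mean(axis=1)
        strength = abs(corr(T3, T1 + T2))
        delta = strength - max(abs(corr(T3, T1)), abs(corr(T3, T2)))
        if strength >= strength_th and delta >= delta_th and R3.size >= min_size:
            roots.append((R3, strength, delta))
        seen[R3] = True                                   # do not revisit these locations
    return roots

# Synthetic negative triplet: leaves carry +a and -a plus a shared -s; the root carries s.
rng = np.random.default_rng(0)
T = 1000
a, s = rng.standard_normal(T), 0.5 * rng.standard_normal(T)
D = 0.2 * rng.standard_normal((T, 16))
D[:, 0:4] += (a - s)[:, None]     # leaf region R1
D[:, 4:8] += (-a - s)[:, None]    # leaf region R2
D[:, 8:12] += s[:, None]          # candidate root region; locations 12-15 are noise
G = np.corrcoef(D, rowvar=False)

roots = find_root(np.arange(0, 4), np.arange(4, 8), G, D, nb_thresh=0.5,
                  neg_root_dip_th=0.2, strength_th=0.8, delta_th=0.3, min_size=2)
for R3, strength, delta in roots:
    print(R3.tolist(), round(strength, 2), round(delta, 2))
```

In this toy setting each root location is only moderately anti-correlated with either leaf, yet the sum of the leaves cancels the large anti-phase component and leaves the shared signal, so the strength is high and the delta is large.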

Acknowledgements

This work was supported by NSF grants IIS-0905581 and IIS-1029771. Access to the computing facilities was provided by the University of Minnesota Supercomputing Institute.

References

[1] Dietmar Dommenget and Mojib Latif. “A cautionary note on the interpretation of EOFs”. In: Journal of Climate 15.2 (2002), pp. 216–225.

[2] Robert F Engle and Clive WJ Granger. “Co-integration and error correction: representation, estimation, and testing”. In: Econometrica: Journal of the Econometric Society (1987), pp. 251–276.

[3] Jaya Kawale, Michael Steinbach, and Vipin Kumar. “Discovering Dynamic Dipoles in Climate Data”. In: SDM. SIAM. 2011, pp. 107–118.

[4] Jaya Kawale et al. “A graph-based approach to find teleconnections in climate data”. In: Statistical Analysis and Data Mining: The ASA Data Science Journal 6.3 (2013), pp. 158–179.

[5] Robert Kistler et al. “The NCEP-NCAR 50-year reanalysis: Monthly means CD-ROM and documentation”. In: Bulletin of the American Meteorological Society 82.2 (2001), pp. 247–267.

[6] S Power et al. “Inter-decadal modulation of the impact of ENSO on Australia”. In: Climate Dynamics 15.5 (1999), pp. 319–324.

[7] Michael Steinbach et al. “Discovery of climate indices using clustering”. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2003, pp. 446–455.

[8] Hans Von Storch and Francis W Zwiers. Statistical analysis in climate research. Cambridge University Press, 2001.

[9] John M Wallace and David S Gutzler. “Teleconnections in the geopotential height field during the Northern Hemisphere winter”. In: Monthly Weather Review 109.4 (1981), pp. 784–812.

[10] Daniel S Wilks. Statistical methods in the atmospheric sciences. Vol. 100. Academic Press, 2011.


A Derivation of Strength of Sumtripoles and Difftripoles

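Consider three time series $T_1$, $T_2$, and $T_3$ with zero mean and unit variance. Then,

\[
\begin{aligned}
\mathrm{corr}(T_3, T_1 + T_2) &= \frac{\mathrm{cov}(T_3, T_1 + T_2)}{\sqrt{\mathrm{var}(T_3)\,\mathrm{var}(T_1 + T_2)}} \\
&= \frac{\mathrm{cov}(T_3, T_1) + \mathrm{cov}(T_3, T_2)}{\sqrt{\mathrm{var}(T_3)\,(\mathrm{var}(T_1) + \mathrm{var}(T_2) + 2\,\mathrm{cov}(T_1, T_2))}}
\end{aligned}
\]

Since $\mathrm{var}(T_1) = \mathrm{var}(T_2) = \mathrm{var}(T_3) = 1$, we have $\mathrm{corr}(T_3, T_1) = \mathrm{cov}(T_3, T_1)$, $\mathrm{corr}(T_3, T_2) = \mathrm{cov}(T_3, T_2)$ and $\mathrm{corr}(T_1, T_2) = \mathrm{cov}(T_1, T_2)$. Therefore, we get

\[
\mathrm{corr}(T_3, T_1 + T_2) = \frac{\mathrm{corr}(T_3, T_1) + \mathrm{corr}(T_3, T_2)}{\sqrt{2\,(1 + \mathrm{corr}(T_1, T_2))}}
\]

Similarly, one can write

\[
\begin{aligned}
\mathrm{corr}(T_3, T_1 - T_2) &= \frac{\mathrm{cov}(T_3, T_1 - T_2)}{\sqrt{\mathrm{var}(T_3)\,\mathrm{var}(T_1 - T_2)}} \\
&= \frac{\mathrm{cov}(T_3, T_1) - \mathrm{cov}(T_3, T_2)}{\sqrt{\mathrm{var}(T_3)\,(\mathrm{var}(T_1) + \mathrm{var}(T_2) - 2\,\mathrm{cov}(T_1, T_2))}} \\
&= \frac{\mathrm{corr}(T_3, T_1) - \mathrm{corr}(T_3, T_2)}{\sqrt{2\,(1 - \mathrm{corr}(T_1, T_2))}}
\end{aligned}
\]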
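As a rough worked check of the difftripole relation, the correlations quoted for the SLP example in the Introduction (0.28, -0.25 and 0.65) can be substituted into the last expression. The report does not state the correlation between the two leaves, so the value $c_{12}$ derived here is only what the relation would imply, under the assumptions that the quoted values are rounded and that the leaf series were standardized before being subtracted:

\[
\sqrt{2\,(1 - c_{12})} \approx \frac{0.28 - (-0.25)}{0.65} \approx 0.815
\quad\Longrightarrow\quad
c_{12} \approx 1 - \frac{0.815^2}{2} \approx 0.67
\]

The same substitution for the stock example (0.27, -0.33 and 0.93) gives $\sqrt{2\,(1 - c_{12})} \approx 0.60 / 0.93 \approx 0.645$ and hence $c_{12} \approx 0.79$. Both implied values are positive, which is in line with the observation in Section 4 that strong difftripoles arise when the leaves are positively correlated with each other and have correlations of opposite signs with the root.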


Algorithm 1 Finding sumtripoles in negative triplets
Input:
    G - A complete graph
    Danom - Set of anomaly time series at all locations
    NegDipTh - Threshold on negative correlations used in finding dipoles
    NegRootDipTh - Threshold on negative correlations between locations in dipole regions and root
    NbThresh - Threshold on positive correlations used in neighborhood construction
    StrengthTh - Lower bound on strength of sumtripole
    DeltaTh - Lower bound on delta measure of sumtripole
    MinSize - Lower bound on size of all three regions in a tripole
Output:
    AllNegSumTp - List of all negative sumtripoles found in G
 1: AllNegSumTp ← An empty list    ▷ Initialization
    FIRST PHASE:
 2: AllDipoles ← FIND DIPOLES(G, Danom, NegDipTh, NbThresh, MinSize)
    SECOND PHASE:
 3: for each Dipole in AllDipoles do
 4:     Roots ← FIND ROOT(Dipole, G, Danom, NbThresh, NegRootDipTh, StrengthTh, DeltaTh, MinSize)
 5:     Tripoles ← Tripoles formed by Roots with Dipole
 6:     AllNegSumTp ← APPEND(AllNegSumTp, Tripoles)
 7: end for


Algorithm 2 Finding dipoles in the graph
 1: procedure FIND DIPOLES
    Input: G, Danom, Parameters (NegDipTh, NbThresh, MinSize)
    Output: AllDipoles - List of all dipoles found in G
 2:     Initialize an empty list AllDipoles
 3:     NumEdges ← Number of negative edges in G
 4:     SortedEdges ← Negative edges in G sorted in ascending order of their strength
 5:     Mark all edges as UNSEEN
 6:     EdgCounter ← 0
 7:     while EdgCounter < NumEdges do    ▷ Loop over all the sorted edges
 8:         EdgCounter ← EdgCounter + 1
 9:         Edge ← SortedEdges(EdgCounter)    ▷ Next strongest edge
10:         if Edge is marked SEEN then
11:             continue
12:         end if
13:         l1, l2 ← End locations of Edge
14:         NewDipole ← FIND DIP REGS(l1, l2, G, Danom, NbThresh, NegDipTh)
15:         if NewDipole.Strength ≤ −NegDipTh and
16:                 SIZE(NewDipole.R1) ≥ MinSize and SIZE(NewDipole.R2) ≥ MinSize then
17:             AllDipoles ← APPEND(AllDipoles, NewDipole)
18:         end if
19:         Mark all edges between NewDipole.R1 and NewDipole.R2 as
20:                 SEEN
21:     end while
22:     return AllDipoles    ▷ The list of all dipoles found
23: end procedure


Algorithm 3 Find pair of regions forming a dipole for a given pair of locations l1 and l2
 1: procedure FIND DIP REGS
    Input: (l1, l2) - pair of locations, G, Danom, NegDipTh, NbThresh
    Output: NewDipole - Dipole found for given pair of locations l1 and l2
 2:     P1, P2 ← GET NEIGHB(G, l1, NbThresh), GET NEIGHB(G, l2, NbThresh)
 3:     P1vsP2 ← corr(P1, P2)    ▷ All cross correlations
 4:     StrongEdges ← Edges in P1vsP2 with strength ≤ −NegDipTh
 5:     AllEnd1 ← Ends of StrongEdges in P1
 6:     AllEnd2 ← Ends of StrongEdges in P2
 7:     NewDipole.R1 ← P1 ∩ (AllEnd1)
 8:     NewDipole.R2 ← P2 ∩ (AllEnd2)
 9:     NewDipole.Strength ← corr(T1r, T2r)    ▷ T1r, T2r: area-averaged time series of R1 and R2
10:     return NewDipole
11: end procedure


Algorithm 4 Finding all regions that form a negative sumtripole with a given dipole
 1: procedure FIND ROOT
    Input:
        Dipole - Given dipole
        G - A complete graph
        Danom - Set of anomaly time series at all locations
        User-provided parameters - NbThresh, NegRootDipTh, StrengthTh, DeltaTh, MinSize
    Output:
        AllThirdPoles - List of all regions that would form a negative sumtripole with Dipole
 2:     NegNbsR1 ← All locations that show correlation ≤ −NegRootDipTh with at least one location in Dipole.R1
 3:     NegNbsR2 ← All locations that show correlation ≤ −NegRootDipTh with at least one location in Dipole.R2
 4:     PotLocsR3 ← NegNbsR1 ∩ NegNbsR2    ▷ Potential Locs in R3
 5:     SortedLocsR3 ← sorted PotLocsR3 in increasing order of sum of their correlations with Dipole.T1r and Dipole.T2r
 6:     NumLocsR3 ← Number of locations in PotLocsR3
 7:     LocCounter ← 0
 8:     Mark all locs in SortedLocsR3 as UNSEEN
 9:     while LocCounter < NumLocsR3 do    ▷ Loop over SortedLocsR3
10:         LocCounter ← LocCounter + 1
11:         Loc ← SortedLocsR3(LocCounter)
12:         if Loc is marked SEEN then
13:             continue
14:         end if
15:         LocNeighb ← GET NEIGHB(G, Loc, NbThresh)
16:         R3 ← LocNeighb ∩ PotLocsR3
17:         TrSt ← corr(T3r, Dipole.T1r + Dipole.T2r)    ▷ Tripole Strength; T3r: area-averaged time series of R3
18:         Delta ← |TrSt| − max(|corr(T3r, Dipole.T1r)|, |corr(T3r, Dipole.T2r)|)
19:         if |TrSt| ≥ StrengthTh and SIZE(R3) ≥ MinSize
20:                 and Delta ≥ DeltaTh then
21:             AllThirdPoles ← APPEND(AllThirdPoles, R3)
22:         end if
            Mark all locations in R3 as SEEN
23:     end while
24:     return AllThirdPoles
25: end procedure


Algorithm 5 Constructing a neighborhood for a given location pt
 1: procedure GET NEIGHB
    Input: G, NbThresh, pt - given location
    Output: Neighb - A neighborhood constructed for pt
 2:     Neighb ← All vertices lj in G that show correlation ≥ NbThresh with pt
 3:     return Neighb
 4: end procedure
