Building Connected Neighborhood Graphs for Isometric Data Embedding

Li Yang
Department of Computer Science, Western Michigan University
Kalamazoo, Michigan 49008-5466
[email protected]

ABSTRACT
Neighborhood graph construction is usually the first step in algorithms for isometric data embedding and manifold learning, which address the problem of projecting high dimensional data to a low dimensional space. This paper begins by explaining the algorithmic fundamentals of techniques for isometric data embedding and derives a general classification of these techniques. We will see that the nearest neighbor approaches commonly used to construct neighborhood graphs do not guarantee connectedness of the constructed graphs and, consequently, may cause an algorithm to fail to project data to a single low dimensional coordinate system. In this paper, we review three existing methods to construct k-edge-connected neighborhood graphs and propose a new method to construct k-connected neighborhood graphs. These methods are applicable to a wide range of data, including data distributed among clusters. Their features are discussed and compared through experiments.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications — Data mining; I.5.1 [Pattern Recognition]: Models — Geometric, Statistical; G.2.2 [Discrete Mathematics]: Graph Theory — Graph algorithms

General Terms: Algorithms, Experimentation

Keywords: Data embedding, dimensionality reduction, manifold learning, graph connectivity
1. INTRODUCTION
Traditional methods for feature extraction and dimensionality reduction typically assume that the data reside on a hyperplane in high dimensional space. This assumption is too restrictive in many applications where data are distributed on nonlinear manifolds in high dimensional space. The problem of data embedding is defined as follows: given a set of high dimensional data points, project them to a lower dimensional Euclidean space so that the resulting configuration performs better than the original data in further processing such as clustering, classification, indexing, and searching. As a generic approach to dimensionality reduction, data embedding has applications in many areas where data are assumed to be distributed in high dimensional space, including data mining, pattern analysis, information retrieval, and multimedia data processing.

A majority of data embedding methods are isometric: they take a metric space (inter-point distances) as input and try to embed the data into a low dimensional Euclidean space such that the inter-point distances are preserved as much as possible. Earlier isometric data embedding methods include classical multidimensional scaling [2], Kruskal's metric multidimensional scaling (MDS) [11], Sammon's nonlinear mapping (NLM) [9], and curvilinear component analysis (CCA) [3]. A recent advance in isometric data embedding is the use of geodesic distances [1, 14, 18, 24]. In implementation, the geodesic distance between a pair of data points is usually estimated by the length of the shortest path between the pair, computed by applying Dijkstra's algorithm or Floyd's algorithm on a neighborhood graph. A neighborhood graph G = (V, E) is a graph where the vertex set V is the set of all data points and every edge in E connects a point to one of its neighbors. Once geodesic distances are estimated, an isometric data embedding method can be applied to them to produce a final configuration of the data. Example projection strategies include Isomap [18], which applies classical multidimensional scaling to the estimated geodesic distances, and curvilinear distance analysis (CDA) [14], which applies CCA to the estimated geodesic distances.

An issue common to all methods based on geodesic distances is how to construct the neighborhood graph. Existing methods use one of two approaches to decide whether two points are neighbors: the first connects each point to its k nearest neighbors (the k-NN approach); the second connects each point to all points within a pre-defined Euclidean distance ε (the ε-neighbor approach). The success of data embedding depends on how well the constructed neighborhood graph represents the underlying data manifold. One problem with both the k-NN and the ε-neighbor approach is that they do not guarantee connectedness of the constructed neighborhood graphs. Both approaches perform well when data are uniformly distributed, but fail when the data are under-sampled or spread among multiple clusters. With a disconnected neighborhood graph, geodesic-distance-based methods fail to estimate geodesic distances between data points across the disconnected components.
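To make the connectedness problem concrete, the following sketch builds a k-NN neighborhood graph and counts its connected components. It assumes NumPy and SciPy are available; the two-cluster data set and all function and variable names are illustrative, not part of the methods discussed in this paper.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def knn_graph(points, k):
    """Symmetric k-NN neighborhood graph as a sparse weighted adjacency matrix."""
    d = squareform(pdist(points))         # pairwise Euclidean distances
    adj = np.zeros_like(d)
    for i in range(d.shape[0]):
        nbrs = np.argsort(d[i])[1:k + 1]  # k nearest neighbors of point i, excluding itself
        adj[i, nbrs] = d[i, nbrs]
    return csr_matrix(np.maximum(adj, adj.T))  # connect i, j if either is a neighbor of the other

# two well-separated clusters: a small-k neighborhood graph is disconnected
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(20, 1, (50, 3))])
n_comp, _ = connected_components(knn_graph(points, k=2), directed=False)
print(n_comp)  # > 1: geodesic distances across the clusters cannot be estimated
```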
Consequently, the data cannot be projected into a single low dimensional coordinate system. A closely related problem is how to choose a proper value of the parameter k or ε. If the parameter is chosen too small, the neighborhood graph will not be connected. If it is chosen too large, a so-called "short-circuit" problem [1] occurs, where the constructed neighborhood graph contains short-circuit edges that do not follow the manifold. The choice of a proper value of k or ε reflects a fundamental problem: what is global and what is local. For many applications, no single value of k or ε avoids both problems.

In this paper, we present methods to construct connected neighborhood graphs. We review three existing methods and give a new one. The three existing methods construct k-edge-connected neighborhood graphs; the new method constructs k-connected neighborhood graphs. The first method, called k-MST, was reported in [20]. It works by repeatedly extracting minimum spanning trees (MSTs) from the complete Euclidean graph of all data points. The remaining three approaches use greedy algorithms; each starts by sorting all edges in non-decreasing order of edge length. The second method, referred to as Min-k-ST [23], works by finding k edge-disjoint spanning trees the sum of whose total lengths is a minimum. The third method, denoted k-EC [22], works by adding each edge, in non-decreasing order of edge length, to a partially formed neighborhood graph if the end vertices of the edge are not yet k-edge-connected on the graph. The fourth method, denoted k-VC, is a new approach proposed in this paper. It works in a similar way to k-EC, but adds an edge to the partially formed neighborhood graph if the two end vertices of the edge are not yet k-connected on the graph. Therefore, the k-VC approach guarantees that the set of edges obtained is a set of shortest possible edges that keeps the neighborhood graph k-connected.

Many data embedding algorithms have been developed, but they are described more or less separately in the literature. To give an overview, Section 2 identifies fundamental techniques for isometric data embedding; the resulting framework is used to systematize existing data embedding algorithms. In Section 3, we review the three existing methods to construct k-edge-connected neighborhood graphs and present the new method. We discuss results of experiments in Section 4 and compare these methods for constructing neighborhood graphs. We conclude with a summary of these methods and a short discussion of future work.
2. BASIC PRINCIPLE AND COMMON ALGORITHMS

Many data embedding methods are isometric. Among these methods, the simplest one is linear. Let X = [x_1, ..., x_n] be a p × n matrix of coordinates of n points in p-space. Given the distance between every pair of points, the inner-product matrix X^T · X can be obtained. A projection of the n points can be defined as Y = Λ^{1/2} · V, where Λ^{1/2} = diag(λ_1^{1/2}, ..., λ_q^{1/2}), λ_1, ..., λ_q are the q largest eigenvalues of X^T · X, and the rows of V are the eigenvectors of X^T · X corresponding to the eigenvalues λ_1, ..., λ_q. This linear embedding method is commonly referred to as classical multidimensional scaling (classical MDS) [2].
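As a concrete illustration of this construction, here is a minimal NumPy sketch of classical MDS, assuming an n × n matrix of pairwise distances as input; the double-centering step recovers the inner-product matrix from squared distances. The function name is illustrative.

```python
import numpy as np

def classical_mds(D, q):
    """Embed n points in q dimensions from an n x n matrix D of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (D ** 2) @ J             # double centering recovers the inner-product matrix
    evals, evecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:q]       # indices of the q largest eigenvalues
    L = np.sqrt(np.maximum(evals[idx], 0))  # clamp tiny negative eigenvalues to zero
    return (evecs[:, idx] * L).T            # q x n coordinate matrix Y = Lambda^{1/2} V
```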
There are many nonlinear approaches to isometric data embedding, including the tetrahedral method [19]. Some of them use iterative optimization; these include Kruskal's metric multidimensional scaling (MDS) [11], Sammon's nonlinear mapping (NLM) [9], and curvilinear component analysis (CCA) [3]. These iterative algorithms work in similar ways. Each starts with a random configuration of points in the destination q-space. It uses an error measure defined as a function of the differences between the input distances and the corresponding distances in the q-space. The algorithm iteratively reconfigures the coordinates of points in the q-space (usually by taking a steepest gradient descent step on the error measure) to minimize the error measure, and stops when the error measure falls below a user-defined threshold or the number of iterations exceeds a user-specified limit. These algorithms differ from each other in the error measures they use and in the ways they reconfigure points in each iteration.

Let d*_ij denote the distance between a pair of points i and j in the original p-space, and d_ij the distance between the projections of i and j in the destination q-space. The error measures used by MDS, NLM, and CCA are summarized in Table 1.

Algorithm   Error Measure
MDS         E_MDS = Σ_{i<j} (d*_ij − d_ij)² / Σ_{i<j} d*_ij²
NLM         E_NLM = (1 / Σ_{i<j} d*_ij) Σ_{i<j} (d*_ij − d_ij)² / d*_ij
CCA         E_CCA = Σ_{i<j} (d*_ij − d_ij)² F(d_ij, λ)

Table 1: Error measures used

We can see from Table 1 that NLM closely follows MDS. In fact, Kruskal [12] demonstrated how a configuration similar to NLM's could be produced from MDS. The difference between E_MDS and E_NLM is that each term of E_NLM is normalized by d*_ij. NLM therefore emphasizes the preservation of short distances, which is why NLM has the effect of unfolding a data manifold. Niemann and Weiss [16] further used (1 / Σ_{i<j} d*_ij^{p+2}) Σ_{i<j} (d*_ij − d_ij)² d*_ij^p as a generic error measure and proposed a way to improve the convergence of NLM. This error measure becomes E_NLM when p = −1 and E_MDS when p = 0; a value p < 0 favors the preservation of short distances, while a value p ≥ 0 favors the preservation of long distances. Compared with E_MDS and E_NLM, E_CCA uses a weight function F with a parameter λ. F has to be non-increasing in d_ij to make CCA favor the preservation of short distances. The function F used in [3] is binary: F(d_ij, λ) = 1 if d_ij ≤ λ and F(d_ij, λ) = 0 if d_ij > λ. It makes CCA completely ignore distances longer than λ. CCA has been reported to successfully unfold highly twisted data manifolds.
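The error measures in Table 1 translate directly into code. A small sketch, assuming NumPy and flattened arrays dstar and d holding the input and output distances over all pairs i < j; lam stands for the CCA parameter λ. All names are illustrative.

```python
import numpy as np

def stress_measures(dstar, d, lam):
    """E_MDS, E_NLM, and E_CCA over flattened pairwise-distance arrays (pairs i < j)."""
    e_mds = np.sum((dstar - d) ** 2) / np.sum(dstar ** 2)
    e_nlm = np.sum((dstar - d) ** 2 / dstar) / np.sum(dstar)
    weight = (d <= lam).astype(float)      # binary weight function F(d_ij, lambda) of [3]
    e_cca = np.sum((dstar - d) ** 2 * weight)
    return e_mds, e_nlm, e_cca
```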
The success of these algorithms for data embedding depends on the quality of the input data. Used alone, these algorithms usually take Euclidean distances as input. When the data are distributed on a twisted manifold, however, Euclidean distance is not a good measure of dissimilarity between data points. A significant recent development in this area is the use of geodesic distances [1, 14, 18, 24]. Intuitively, the geodesic distance between a pair of points on a manifold is the distance measured along the manifold. In implementation, the geodesic distance between a pair of points can be estimated by the graph distance, that is, the length of the shortest path between the pair on a neighborhood graph that connects every point to its neighboring points. As the number of data points increases, the graph distance is expected to converge asymptotically to the true geodesic distance. Once geodesic distances are estimated, an isometric data embedding algorithm can be applied to them. Example algorithms include Isomap [18], CDA [14], and GeoNLM [21]: Isomap uses classical MDS, CDA uses CCA, and GeoNLM uses NLM. These algorithms and their relationships are shown in Table 2.

Use Euclidean Distance   Use Geodesic Distance
Classical MDS            Isomap
NLM                      GeoNLM
CCA                      CDA

Table 2: Systematization of isometric data embedding algorithms

As long as good estimates of geodesic distances are obtained, all the above algorithms perform well, and there is no particular reason to choose one over another. CDA and GeoNLM have been reported to outperform Isomap. However, the performance gains come simply from the underlying CCA and NLM compensating better than classical MDS for badly estimated geodesic distances (especially the short ones) when the data manifold is highly twisted or folded. The construction of a quality neighborhood graph is thus the most important step in geodesic distance estimation and isometric data embedding.
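Putting these pieces together, an Isomap-style embedding is a short pipeline: estimate geodesic distances by shortest paths on a neighborhood graph, then apply classical MDS. A sketch reusing the illustrative knn_graph and classical_mds functions from the earlier snippets, and assuming the neighborhood graph is connected so that no shortest-path entry is infinite:

```python
from scipy.sparse.csgraph import shortest_path

# g: sparse neighborhood graph from knn_graph(points, k)
G = shortest_path(g, method='D', directed=False)  # Dijkstra: estimated geodesic distances
Y = classical_mds(G, q=2)                         # Isomap: classical MDS on geodesic distances
```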
3. BUILDING CONNECTED NEIGHBORHOOD GRAPHS

[Figure 1: A Swiss-roll data set. (a) the Swiss-roll data; (b) its 5-NN neighborhood graph.]
A neighborhood graph over a set of data points distributed on a manifold should be constructed in such a way that: (1) it contains only short edges, so that it follows the manifold; (2) its connectedness is guaranteed, so that the geodesic distance between every pair of data points can be estimated; and (3) it has multiple edge connections between any two partitions of the graph, so that there are multiple choices of paths between a pair of points across the partitions. An obvious choice for such a graph is a k-connected or k-edge-connected minimum spanning subgraph of the Euclidean graph of all data points. When k = 1, finding such a subgraph reduces to finding the minimum spanning tree, which is easily solved. When k ≥ 2, however, the problem is NP-hard [7, Problem GT31]. This is easy to understand from the fact that, when k = 2, the problem reduces to the classical traveling salesman problem. Geodesic distance estimation does not require a neighborhood graph to be strictly minimum spanning. Therefore, we relax the objective to finding a neighborhood graph that is k-edge-connected or k-connected and that can be computed efficiently. In this section, we briefly review three existing methods and the corresponding algorithms (k-MST [20], Min-k-ST [23], and k-EC [22]) for constructing k-edge-connected neighborhood graphs. We then present a new algorithm, k-VC, which constructs k-connected neighborhood graphs. These algorithms are summarized in Table 3.

For illustration, Figure 1(a) shows a synthetic data set of 1,000 points randomly distributed on a 4 × 1 rectangle which is then wrapped into a Swiss roll. Figure 1(b) displays its 5-NN (k = 5) neighborhood graph superimposed on the data. Figure 1(b) also illustrates the shortest path between an example pair of data points, A and B; the geodesic distance between A and B is estimated as the length of this shortest path.

The first algorithm, k-MST [20], constructs a k-edge-connected neighborhood graph by repeatedly extracting minimum spanning trees (MSTs) from the complete Euclidean graph of all data points. Algorithms to extract MSTs have been well studied (see [8] for a survey). k-MST works as follows. Let G = (V, E) denote the complete Euclidean graph of all data points, where the weight of each edge is the Euclidean distance. Let MST_1 denote the set of edges of an MST of G, MST_1 = mst(V, E). Let MST_2 denote the set of edges of an MST of G with MST_1 removed, that is, MST_2 = mst(V, E − MST_1). Similarly, let MST_i denote the set of edges of an MST of G with ∪_{j=1}^{i−1} MST_j removed, MST_i = mst(V, E − ∪_{j=1}^{i−1} MST_j). k-MST constructs the neighborhood graph as (V, ∪_{i=1}^{k} MST_i); a sketch is given below. The time complexity of Kruskal's classical MST algorithm [10] is O(|E| log |E|) in the worst case and is often close to O(|E| + |V| log |E|) in the average case. For complete graphs, where |E| = |V|(|V| − 1)/2, this average-case time complexity simplifies to O(|V|²). Therefore, k-MST has an average-case time complexity of O(k|V|²), the same as the time complexity of the k-NN and ε-neighbor approaches.
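A minimal sketch of k-MST under these definitions, using SciPy's minimum_spanning_tree; after each round, the edges of the extracted MST are removed from the remaining graph before the next extraction. Function and variable names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def k_mst_graph(points, k):
    """Union of k successively extracted MSTs of the complete Euclidean graph."""
    remaining = squareform(pdist(points))  # dense complete Euclidean graph (zero = no edge)
    union = np.zeros_like(remaining)
    for _ in range(k):
        mst = minimum_spanning_tree(csr_matrix(remaining)).toarray()
        mst = np.maximum(mst, mst.T)       # make MST_i's edge set symmetric
        union = np.maximum(union, mst)
        remaining[mst > 0] = 0             # remove MST_i before extracting MST_{i+1}
    return csr_matrix(union)
```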
The remaining three algorithms (Min-k-ST, k-EC, and k-VC) are greedy algorithms and work in similar ways. Let G = (V, E) denote the Euclidean graph of all data points, where the weight of each edge is the Euclidean distance. A greedy algorithm starts by sorting all edges in non-decreasing order of edge length. It initializes an empty edge set F and adds each edge e ∈ E, in non-decreasing order of edge length, to F if F does not yet satisfy the corresponding connectivity criterion for the end vertices of e. After all edges are processed, the resulting set F is a set of shortest edges (because edges are processed in non-decreasing order of edge length) satisfying the corresponding criterion of graph connectivity. Min-k-ST finds k spanning trees of minimum total length; note that the property of having k trees spanning a pair of vertices is an equivalence relation. k-EC builds a k-edge-connected neighborhood graph; the property of k-edge-connectivity between vertices in an undirected graph is also an equivalence relation.
Algorithm   Neighborhood Graph   Time Complexity      How It Works
k-MST       k-edge-connected     O(k|V|²)             Repeatedly extract k MSTs.
Min-k-ST    k-edge-connected     O(k²|V|²)            Extract k spanning trees with a minimum total length.
k-EC        k-edge-connected     O(k²|V|²)            Add edge (i, j) if i and j are not k-edge-connected.
k-VC        k-connected          O(|V|³ + k³|V|²)     Add edge (i, j) if i and j are not k-connected.

Table 3: Algorithms for constructing connected neighborhood graphs
For Min-k-ST and k-EC, therefore, performance can be improved by avoiding the test of the corresponding connectivity criterion for an incoming edge whose end vertices are known to be within one equivalence class. If the two end vertices of the incoming edge belong to different classes and the edge cannot be added to the resulting neighborhood graph, the equivalence classes connected by the edge are merged. The algorithm used for Min-k-ST and k-EC is given as Algorithm 1. It uses a min-heap to avoid sorting all edges at the beginning, and it continues until all vertices in the graph are in a single equivalence class. The algorithm is similar to Kruskal's MST algorithm [10], which is not surprising because the latter is also a greedy algorithm taking advantage of the equivalence relation of edge connectivity.

Algorithm 1 Greedy algorithm for Min-k-ST and k-EC
Input: G(V, E), a complete Euclidean graph
Output: G′(V, F), a k-edge-connected spanning subgraph
 1: F = ∅;
 2: nc = |V|;
 3: Assign each v ∈ V a unique equivalence class, class(v);
 4: Build a min-heap H of edges in E using edge length as the key;
 5: while nc > 1 do
 6:   (a, b) = remove-min(H);
 7:   if class(a) ≠ class(b) then
 8:     if F and {(a, b)} meet the connectivity criterion then
 9:       F = F ∪ {(a, b)}
10:     else
11:       Merge class(a) and class(b);
12:       nc = nc − 1;
13:     end if
14:   end if
15: end while
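A Python rendering of Algorithm 1's skeleton may help make the control flow concrete. This is a sketch with the connectivity criterion left as a pluggable predicate (it differs between Min-k-ST and k-EC); a union-find structure with path halving stands in for the equivalence classes, and all names are illustrative.

```python
import heapq

def greedy_neighborhood_graph(vertices, edges, criterion):
    """edges: iterable of (length, a, b); criterion(F, a, b) is True if (a, b) must be added."""
    parent = {v: v for v in vertices}   # union-find over equivalence classes

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    F, heap, nc = set(), list(edges), len(vertices)
    heapq.heapify(heap)                 # min-heap keyed on edge length
    while nc > 1:
        _, a, b = heapq.heappop(heap)
        ra, rb = find(a), find(b)
        if ra != rb:
            if criterion(F, a, b):      # e.g. "a and b are not yet k-edge-connected in F"
                F.add((a, b))
            else:
                parent[ra] = rb         # merge the two equivalence classes
                nc -= 1
    return F
```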
3.1 Min-k-ST

Min-k-ST [23] finds a set of k edge-disjoint spanning trees the sum of whose total edge lengths is guaranteed to be a minimum. In Min-k-ST, the set E of edges of G forms a matroid if we define a set F of edges to be independent if and only if F can be partitioned into k forests. Therefore, we can use the matroid greedy algorithm [13] to construct a neighborhood graph. After we run Algorithm 1, F is a basis (maximal independent set) of minimum total length; in other words, F is the union of k edge-disjoint spanning trees the sum of whose total lengths is a minimum.

The hard part of the Min-k-ST matroid greedy algorithm is line 8 of Algorithm 1, that is, testing the independence of F ∪ {(a, b)}. To do this we maintain a partition of F into k edge-disjoint forests F_1, ..., F_k. When an edge e_0 is added, the k forests are updated to accommodate e_0. The idea is to find an updating sequence ⟨e_0, e_1, ..., e_n⟩ such that e_0 replaces the edge e_1 in F_1, e_1 replaces the edge e_2 in F_2, and so on, until e_n can be inserted into a forest F_{(n mod k)+1} where the end vertices of e_n are in different trees. The updating sequence can be found by a breadth-first search: starting from e_0, we test whether e_0 spans two trees in F_1. If not, we test whether any edge in the loop created by e_0 in F_1 ∪ {e_0} spans two trees in F_2. If not, for each such e_1 in the loop, we test whether any edge in the loop created by e_1 in F_2 ∪ {e_1} spans two trees in F_3, and so on. We continue this breadth-first search until we find an edge e_n that spans two trees in the next forest. Once an updating sequence is found, the k edge-disjoint forests are updated to accommodate e_0 by following the sequence. Because each step of the update swaps in an edge to replace another edge in the loop created by it, each F_i is guaranteed to remain a forest after the update.

The computational complexity of Min-k-ST follows that of Kruskal's algorithm. Building a heap of edges takes O(|E|) time, and each call of remove-min() takes O(log |E|) time. Updating the k forests to accommodate an edge has an average complexity of O(k|V|). Using a set union/find algorithm, the union/find of equivalence classes takes nearly O(1) time. Since O(k|V|) edges are inserted into the resulting graph, the total time complexity of Min-k-ST is O(|E| + k|V| log |E| + k²|V|²). For a complete Euclidean graph, where |E| = |V|(|V| − 1)/2, this simplifies to O(k²|V|²).

Neighborhood graphs constructed by k-MST and Min-k-ST have the following properties: (1) each has exactly k(|V| − 1) edges; (2) each is minimally k-edge-connected; (3) if we cut such a graph into two partitions, the cut edges include the k shortest edges among all edges between the partitions. In particular, the k nearest neighbors of a point are always connected to the point. Therefore, neighborhood graphs constructed by k-MST and Min-k-ST are supergraphs of the corresponding neighborhood graphs constructed by k-NN.
3.2 k-EC

Like Min-k-ST, k-EC uses Algorithm 1 to produce k-edge-connected neighborhood graphs. The difference between k-EC and Min-k-ST lies in line 8 of Algorithm 1: k-EC defines the connectivity criterion as the pair a, b of vertices not yet being k-edge-connected in F. By Menger's theorem, this test can be done by querying whether there are fewer than k edge-disjoint paths between a and b in F. Such a test can be performed using network flow techniques [4, 17, 5] by assuming every edge in F has unit flow capacity. After Algorithm 1 runs, therefore, there are at least k edge-disjoint paths between every pair of data points. k-EC guarantees that the resulting edges in F form a set of shortest edges that keeps the neighborhood graph k-edge-connected.
Research Track Poster give a set of shortest edges to keep the neighborhood graph k-edge-connected. The classical algorithm to measure max-flow from a vertex to another vertex in a directed graph with flow capacity is the Ford-Fulkerson labeling algorithm [6]. Edmonds and Karp [4] modified the labeling algorithm so that each flow augmentation is taken along a path with the fewest number of edges from the source to the destination. There also exist more efficient network flow algorithms [5, 17] which find flow augmentation paths phase by phase. Because our objective is to test the existence of k paths rather than to find the actual maximum flow, the Ford-Fulkerson labeling algorithm [6] with the Edmonds-Karp algorithm [4] of finding augmenting paths would be a good choice for the k-connectivity test. Specifically, we construct a directed graph (V, F ′ ), where F ′ is a set of directed edges: for each e ∈ F , we have e′ and e′′ in F ′ where e′ and e′′ connect the two end vertices of e and are directed in opposite directions. All edges in F ′ have unit flow capacity. An edge is called useable in its direction if it has no flow. An edge is called usable in its opposite direction if it has a unit flow. The network flow algorithm uses breadth-first search on F ′ for the test. Specifically, a breadth-first search starts at a. The search goes along only usable edges until it reaches b. It is easy to see that this is the simple algorithm for finding the shortest path along useable edges when every edge is assumed having a unit length. A path between a and b is found by backtracking the breadth-first search. Once we found such a path, we augment every edge on the path with unit flow. Such an augmentation makes all the edges on the path not useable in the direction of the path in the next search. The breadth-first search is repeated k times, each one on the network with flows augmented by previous searches. If the k searches are successful, we know that a and b are k-connected in (V, F ). If any of the k searches fails to reach b, we know that a and b are not yet k-connected and the edge (a, b) has to be added to F . Computational complexity of k-EC follows that of Mink-ST. Each k-edge-connectivity test takes at most O(k|V |) steps because each breadth-first search in the network maxflow algorithm takes at most O(|V |) steps. Therefore, the time complexity of k-EC is the same as Min-k-ST. The total time complexity can be expressed as O(|E| + k|V | log |E| + k2 |V |2 ). For a Euclidean graph, the time complexity can be simplified as O(k2 |V |2 ).
3.3 k-VC

A k-edge-connected neighborhood graph may not perform well for geodesic distance estimation on clustered data, where a k-edge-connected graph may be only 1-connected. This section proposes a new approach, k-VC, to overcome this problem. k-VC constructs k-connected (k-vertex-connected) neighborhood graphs. It works in a similar way to k-EC; the major difference is querying for k-vertex-connectivity rather than k-edge-connectivity. Another difference is that k-vertex-connectivity between vertices is not an equivalence relation, so special attention has to be paid to keep the complexity of the algorithm down.

As in k-EC, the core of the k-VC algorithm is the test of the connectivity criterion. k-VC defines the connectivity criterion as the pair a, b of vertices not yet being k-connected in F. By Menger's theorem, this can be tested by querying whether there are fewer than k vertex-disjoint paths between a and b in F. Such a test can be performed using network flow techniques by assuming every vertex in F has unit flow capacity. This is done in a similar way as in k-EC, but on a different graph. Given the undirected graph (V, F), we construct a directed network flow graph (V′, F′) as follows: for each e ∈ F, we have e′ and e′′ in F′, where e′ and e′′ connect the end vertices of e and are directed in opposite directions. Each vertex other than a and b has a flow capacity of 1. This unit vertex capacity is translated into edge capacities in (V′, F′) as follows: each vertex v is split into two vertices, v′ and v′′; a new edge, with flow capacity 1, connects v′ to v′′; all edges that formerly led to v now lead to v′, and all edges that came from v now come from v′′. The new edge and its unit capacity in (V′, F′) encode the unit vertex flow capacity in (V, F). The query for k vertex-disjoint paths between a and b in (V, F) thus translates into a query for k edge-disjoint paths from a′′ to b′ in (V′, F′).

Because k-connectivity is not an equivalence relation, Algorithm 1 cannot be used directly in k-VC. However, k-connectivity has the property that any two k-connected blocks of vertices have no more than k − 1 vertices in common [15]. In particular, if both vertices a and b are k-connected to a set of more than k − 1 vertices, a and b must be k-connected. k-VC uses this property to avoid the k-connectivity test between vertices within a k-connected block.

The time complexity of k-VC follows that of k-EC. Sorting the edges takes O(|E| log |E|) time. The test of whether a and b are k-connected to a common set of at least k vertices takes at most O(|V|) time. k-VC performs O(k²|V|) applications of the k-connectivity test. Therefore, the total time complexity of k-VC is O(|E| log |E| + |E||V| + k³|V|²), which for a complete graph, where |E| = |V|(|V| − 1)/2, simplifies to O(|V|³ + k³|V|²).
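The vertex-splitting construction can be sketched as a small graph transformation; the resulting directed unit-capacity network can then be fed to a k-edge-disjoint-path test such as an arc-capacity variant of the k_edge_connected sketch above. Everything here is illustrative, with split vertices represented as (v, 'in') and (v, 'out') tuples; since a and b are not split, the query runs from a to b directly.

```python
def split_vertices(adj, a, b):
    """Directed unit-capacity network (V', F') for the k-vertex-connectivity query
    between a and b: every vertex v other than a and b becomes v_in -> v_out."""
    def v_in(v):
        return v if v in (a, b) else (v, 'in')

    def v_out(v):
        return v if v in (a, b) else (v, 'out')

    cap = {}

    def add_arc(u, v):
        cap.setdefault(u, {})[v] = 1

    for v in adj:
        if v not in (a, b):
            add_arc(v_in(v), v_out(v))   # internal arc encodes the unit vertex capacity
    for u, nbrs in adj.items():
        for v in nbrs:
            add_arc(v_out(u), v_in(v))   # each undirected edge yields two directed arcs
    return cap  # query: k edge-disjoint paths from a to b in this network
```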
4. EXPERIMENT AND DISCUSSION

[Figure 2: 2D projections of the Swiss-roll data using classical MDS on the estimated geodesic distances: (a) 5-NN; (b) 5-MST; (c) Min-5-ST; (d) 5-EC; (e) 5-VC.]

Figure 2 shows 2D projections of the synthetic Swiss-roll data of Figure 1, obtained by applying classical multidimensional scaling to the estimated geodesic distances. The corresponding neighborhood graph and two example shortest paths (A to B and C to D) are superimposed on each projection to illustrate how geodesic distances are estimated and preserved in the projections. Classical multidimensional scaling produces these projections so that Euclidean distances between projected points reflect the estimated geodesic distances; in other words, it makes the shortest paths as straight as possible. Because the neighborhood graphs contain many small holes, long geodesic distances (between A and B, for example) are generally better estimated than short geodesic distances (between C and D, for example). This explains why more advanced techniques (such as metric MDS, NLM, and CCA), which emphasize the preservation of short distances, may not outperform simple classical multidimensional scaling.

To give a quantitative comparison, we use residual variance as the error measure. Let E be the matrix of true geodesic distances, G the matrix of estimated geodesic distances, and d the matrix of Euclidean distances between projected data points.
[Figure 3: Statistics of data projections and neighborhood graphs. (a) residual variance 1 − R²(G, d) vs. dimensionality of projection for 5-NN, 5-MST, Min-5-ST, 5-EC, and 5-VC (log scale); (b) residual variance 1 − R²(E, G) vs. k for k-NN, k-MST, Min-k-ST, k-EC, and k-VC (log scale).]
The residual variance used in Isomap [18] is defined as 1 − R²(G, d), where R denotes the correlation coefficient. Using the synthetic Swiss-roll data as the test data, Figure 3 gives basic statistics of the data projections and neighborhood graphs constructed by the five approaches. Figure 3(a) shows the residual variance 1 − R²(G, d) of the five approaches as a function of the projection dimensionality, for projections produced from neighborhood graphs constructed with k = 5. For each approach, the residual variance decreases as the projection dimensionality increases. The intrinsic dimensionality (which is 2 for the synthetic Swiss-roll data) can be estimated by looking for the "elbow" at which the residual variance ceases to decrease significantly with added dimensions. Residual variances are shown in log scale in Figure 3(a), where we can see that k-MST, Min-k-ST, and k-VC produce smaller variances and stronger elbow effects than k-NN and k-EC.

Figure 3(b) reports the residual variance 1 − R²(E, G) of geodesic distance estimation by the five approaches as the value of k increases. It omits the variances of k-NN for k < 4, because k-NN fails to build a connected neighborhood graph when k < 4. The problem of disconnected neighborhood graphs would become even more serious for the k-NN approach if the data points were distributed among clusters. In contrast, all the other approaches guarantee connectedness of the constructed neighborhood graphs for any positive value of k. Residual variances in Figure 3(b) are also shown in log scale. Among these approaches, k-VC performs the best in estimating geodesic distances. As the value of k increases, the residual variances become stable and the approaches perform similarly.

All four approaches presented in this paper (k-MST, Min-k-ST, k-EC, and k-VC) outperform k-NN in the sense that they build connected neighborhood graphs when k is small or the data are distributed among clusters, cases in which k-NN fails. Among the four, we think k-VC performs the best. It outperforms the other approaches in geodesic distance estimation (as shown in Figure 3(b)) and, consequently, produces good results for data projection and intrinsic dimensionality estimation. Furthermore, k-VC constructs k-connected neighborhood graphs instead of k-edge-connected ones; therefore, k-VC easily outperforms the other approaches when data are distributed among clusters. A downside of k-VC is that it has a larger time complexity than the other approaches.
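For reference, the residual variance used above is a one-liner with NumPy; a sketch taking any two distance matrices of equal shape, with illustrative names.

```python
import numpy as np

def residual_variance(D1, D2):
    """1 - R^2 between two distance matrices, e.g. estimated geodesic vs. embedded."""
    r = np.corrcoef(D1.ravel(), D2.ravel())[0, 1]
    return 1.0 - r ** 2
```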
5. CONCLUSION

We have explained the fundamentals of techniques for isometric data embedding and derived a general classification of data embedding algorithms. We have identified neighborhood graph construction as the first and most important step in techniques based on geodesic distances. We have reviewed three existing methods for constructing k-edge-connected neighborhood graphs and proposed a new method for constructing k-connected neighborhood graphs. Because these methods guarantee connectedness of the constructed neighborhood graphs, they are applicable to a wide range of data, including data distributed among clusters. These methods have been compared through experiments, and their features discussed and summarized.

Future work will further explore and apply the k-VC approach and other methods for geodesic distance estimation. One interesting topic is how to characterize the dimensions of projected data. Another is how to simplify these methods and make them work with large data sets; computational complexity remains a major challenge in bringing these methods to practical application.
ACKNOWLEDGMENT
This work was supported in part by the US National Science Foundation under grants IIS-0414857, EIA-0215356, and EIA-0130857.
REFERENCES
[1] M. Balasubramanian, E. L. Schwartz, J. B. Tenenbaum, V. de Silva, and J. C. Langford. The Isomap algorithm and topological stability. Science, 295:7a, Jan. 2002.
[2] T. F. Cox and M. A. A. Cox. Multidimensional Scaling, 2nd Edition. Chapman & Hall, 2001.
[3] P. Demartines and J. Herault. Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Trans. Neural Networks, 8(1):148–154, Jan. 1997.
[4] J. Edmonds and R. M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. J. ACM, 19(2):248–264, Apr. 1972.
[5] S. Even and R. E. Tarjan. Network flow and testing graph connectivity. SIAM J. Computing, 4(4):507–518, Dec. 1975.
[6] L. R. Ford, Jr. and D. R. Fulkerson. Flows in Networks. Princeton University Press, Princeton, NJ, 1962.
[7] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, 1979.
[8] R. L. Graham and P. Hell. On the history of the minimum spanning tree problem. Annals of the History of Computing, 7(1):43–57, Jan. 1985.
[9] J. W. Sammon, Jr. A nonlinear mapping for data structure analysis. IEEE Trans. Computers, C-18(5):401–409, May 1969.
[10] J. Kruskal. On the shortest spanning subtree of a graph and the travelling salesman problem. Proc. Amer. Math. Soc., 7(1):48–50, 1956.
[11] J. Kruskal. Multidimensional scaling by optimizing goodness-of-fit to a nonmetric hypothesis. Psychometrika, 29:1–27, 1964.
[12] J. Kruskal. Comments on a nonlinear mapping for data structure analysis. IEEE Trans. Computers, C-20(12):1614, Dec. 1971.
[13] E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, New York, 1976.
[14] J. A. Lee, A. Lendasse, N. Donckers, and M. Verleysen. A robust nonlinear projection method. In Proc. 8th European Symp. Artificial Neural Networks (ESANN 2000), Bruges, Belgium, Apr. 2000.
[15] D. Matula. k-blocks and ultrablocks in graphs. Journal of Combinatorial Theory, Series B, 24(1):1–13, Feb. 1978.
[16] H. Niemann and J. Weiss. A fast converging algorithm for nonlinear mapping of high-dimensional data onto a plane. IEEE Trans. Computers, C-28(2):142–147, Feb. 1979.
[17] R. E. Tarjan. Testing graph connectivity. In Proc. 6th Annual ACM Symp. on Theory of Computing (STOC'74), pages 185–193, 1974.
[18] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, Dec. 2000.
[19] L. Yang. Distance-preserving projection of high dimensional data for nonlinear dimensionality reduction. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(9):1243–1246, Sept. 2004.
[20] L. Yang. k-edge connected neighborhood graph for geodesic distance estimation and nonlinear data projection. In Proc. 17th Inter. Conf. Pattern Recognition (ICPR'04), volume 1, pages 196–199, Cambridge, UK, Aug. 2004.
[21] L. Yang. Sammon's nonlinear mapping using geodesic distances. In Proc. 17th Inter. Conf. Pattern Recognition (ICPR'04), volume 2, pages 303–306, Cambridge, UK, Aug. 2004.
[22] L. Yang. Building k-edge-connected neighborhood graphs for distance-based data projection. Pattern Recognition Letters, 26, 2005. To appear.
[23] L. Yang. Building k edge-disjoint spanning trees of minimum total length for isometric data embedding. IEEE Trans. Pattern Analysis and Machine Intelligence, 27(10), Oct. 2005. To appear.
[24] H. Zha and Z. Zhang. Isometric embedding and continuum ISOMAP. In Proc. 20th Inter. Conf. Machine Learning (ICML'03), pages 864–871, Washington, DC, Aug. 2003.