
Incremental Construction of Neighborhood Graphs for Nonlinear Dimensionality Reduction

Dongfang Zhao    Li Yang
Department of Computer Science, Western Michigan University
{dzhao, yang}@cs.wmich.edu

Abstract

Most nonlinear data embedding methods use bottom-up approaches for capturing underlying structures of data distributed as points on nonlinear manifolds in high dimensional spaces. These methods usually start by designating neighbor points to each point. Neighbor points have to be designated in such a way that the constructed neighborhood graph is connected, so that the data can be projected to a single global coordinate system. In this paper, we present an incremental method for updating neighborhood graphs. The method guarantees k-edge-connectivity of the constructed neighborhood graph. Together with incremental approaches for geodesic distance estimation and multidimensional scaling, our method enables incremental embedding of high dimensional data streams. The method works even when the data are under-sampled or non-uniformly distributed. It has important applications in the processing of data streams and multimedia data.

1. Introduction

Dimensionality reduction has applications in many areas such as pattern analysis and data mining. Two classical methods for linear dimensionality reduction are principal component analysis (PCA) and classical multidimensional scaling (MDS). The two are dual to each other when the distances used in MDS are Euclidean distances. Both assume that the data lie on a hyperplane in high dimensional space; neither can capture the intrinsic structure of a nonlinear manifold. In recent years, methods [3, 4, 6] for nonlinear dimensionality reduction and manifold learning have been proposed. These methods usually work in two phases: the first phase collects local information and the second phase derives global coordinates from the arrangement of that local information. They always start by assigning neighbor points to each point. For example, Isomap [6] uses two steps to estimate geodesic distances: the first step constructs a neighborhood graph of all points and the second step computes the length of the shortest path between every pair of points. In the first step, Isomap assigns neighbors to each point using either its k nearest neighbors or a neighborhood of fixed radius. However, neither approach guarantees the connectivity of the constructed neighborhood graph. This can be a serious problem when the data are under-sampled or distributed among clusters, and it may eventually make the data embedding method fail.

Data embedding methods need to work in an incremental way in order to process data streams. When a new data point comes in, using Isomap as an example, we need mechanisms to update the neighborhood graph, to update the estimated geodesic distances, and to adjust the data configuration in the low dimensional space. Recently, an incremental Isomap [2] has been developed. Because it builds neighborhood graphs using k nearest neighbors, however, it offers no guarantee on the connectivity of the constructed neighborhood graph. In fact, the problem of disconnected neighborhood graphs is more serious in incremental data embedding: when new data points come in, a connected neighborhood graph may become disconnected, which causes the later data embedding steps to fail.

This paper proposes an incremental method that guarantees the connectivity of the constructed neighborhood graphs. Our method is based on the k-MST algorithm [7], which builds a k-edge-connected neighborhood graph by keeping k minimal spanning trees (MSTs). The k MSTs are sequentially updated when a new point comes in. Because our method guarantees connectivity of the constructed neighborhood graphs, it outperforms the original Isomap based on k nearest neighbors. The difference is significant when the data are under-sampled or non-uniformly distributed, which often happens in data stream processing applications.
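To make the connectivity problem concrete, the following small sketch (not from the paper; the point coordinates and helper names are made up for illustration) builds a k-nearest-neighbor graph on two well-separated clusters and checks whether it is connected:

```python
import math

def knn_graph(points, k):
    """Undirected k-nearest-neighbor graph, returned as a set of edges."""
    edges = set()
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for _, j in dists[:k]:
            edges.add(frozenset((i, j)))
    return edges

def is_connected(n, edges):
    """Depth-first search from vertex 0 over an undirected edge set."""
    adj = {i: [] for i in range(n)}
    for e in edges:
        a, b = tuple(e)
        adj[a].append(b)
        adj[b].append(a)
    seen, stack = {0}, [0]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n

# Two well-separated clusters: every point's nearest neighbors stay
# inside its own cluster, so the 2-NN graph splits into two components.
cluster_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cluster_b = [(100.0, 0.0), (101.0, 0.0), (100.0, 1.0), (101.0, 1.0)]
points = cluster_a + cluster_b
print(is_connected(len(points), knn_graph(points, 2)))  # False
```

A fixed-radius neighborhood has the same failure mode: any radius small enough to avoid short-circuit edges within a cluster leaves the clusters mutually unreachable.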

0-7695-2521-0/06/$20.00 (c) 2006 IEEE

2. Incremental k-MST

The k-MST algorithm [7] constructs a k-edge-connected neighborhood graph by repeatedly extracting k MSTs from the Euclidean graph of all data points. Let G = (V, E) denote the complete Euclidean graph of all data points, where the weight of each edge is the Euclidean distance between its endpoints. Let MST_1 denote the set of edges of an MST of G, MST_1 = mst(V, E). Let MST_2 denote the set of edges of an MST of G with MST_1 removed, that is, MST_2 = mst(V, E − MST_1). Similarly, let MST_i denote the set of edges of an MST of G with MST_1, ..., MST_{i−1} removed, MST_i = mst(V, E − ∪_{j=1}^{i−1} MST_j). k-MST constructs the k-edge-connected neighborhood graph as (V, ∪_{i=1}^{k} MST_i).

There exist efficient algorithms [1, 5] for updating a single MST. These algorithms have a time complexity of O(n) for vertex insertion. However, they require that all edges to be inserted be incident to the new vertex, which makes them inapplicable to the problem of updating an MST with an arbitrary set of edges. In this paper, we have k MSTs to update, and edges replaced in the i-th MST have to be considered when updating the (i+1)-th MST.

The basic idea of our algorithm is simple: suppose there are already n vertices; a new vertex brings in n new candidate edges. When a candidate edge is inserted into an MST, a cycle may be created, and the MST is then updated by dropping the longest edge in the cycle. Finding the cycle is easy in a rooted tree where each node except the root has one parent node. The n candidate edges are first offered to MST_1. The edges not inserted and the edges replaced are then offered to MST_2, and so on. The process stops after MST_k is updated. No further round of updating is needed because the edges left over by MST_k have no chance of being inserted into MST_1.

The incremental k-MST algorithm is given in Algorithm 1. Each MST has a root. A list elist keeps the edges to be inserted into each MST; it is initialized as the set of all edges incident to the new vertex v. When an edge (l, r) is inserted into an MST, the algorithm first tries to find the closest common ancestor of l and r.
The only case in which this fails is when v is first added to the MST. Otherwise, the algorithm finds the longest edge on the path from l to r. If the found edge is longer than (l, r), it is replaced by (l, r) in the MST, and it replaces (l, r) in elist. Finally, the parent relationships of the vertices on the affected path are reversed. An MST is updated once all edges in elist have been scanned. The process stops after all MSTs are updated.

Algorithm 1 Incremental k-MST
Input: k rooted MSTs {T_1(V, E_1), ..., T_k(V, E_k)} where V = {v_i}, i = 1, ..., n; a new vertex v
Output: the k MSTs, updated
 1: V ← V ∪ {v};
 2: build elist = {(v_1, v), ..., (v_n, v)}, a list of edges;
 3: for each MST T_i(V, E_i) do
 4:   for each edge (l, r) ∈ elist do
 5:     find a, the closest common ancestor of l and r;
 6:     if this fails {only happens when l or r is v} then
 7:       move (l, r) from elist to E_i; assign r the parent of v if l = v, or l the parent of v if r = v;
 8:     else
 9:       find (u_1, parent(u_1)), the longest edge on the path from l to a;
10:       find (u_2, parent(u_2)), the longest edge on the path from r to a;
11:       if (u_1, parent(u_1)) is longer than both (u_2, parent(u_2)) and (l, r) then
12:         swap (u_1, parent(u_1)) and (l, r) between E_i and elist; reverse parent relationships of vertices on the path from l to u_1; assign r the parent of l;
13:       else if (u_2, parent(u_2)) is longer than both (u_1, parent(u_1)) and (l, r) then
14:         swap (u_2, parent(u_2)) and (l, r) between E_i and elist; reverse parent relationships of vertices on the path from r to u_2; assign l the parent of r;
15:       end if
16:     end if
17:   end for
18: end for

The algorithm works because each MST is a rooted tree. In the implementation, each vertex keeps a pointer to its parent and the length of the edge to its parent. We arbitrarily choose the very first vertex as the root of all k MSTs; the algorithm starts as soon as the first point comes in. Finding the path between two vertices in a rooted tree takes O(log n) steps on average. The same complexity applies to finding the longest edge and to reversing the parent relationships. In summary, Lines 5 to 16 of Algorithm 1 have an average time complexity of O(log n). Since there are about n edges to be inserted into each MST and there are k MSTs to be updated, the time complexity of Algorithm 1 to accommodate a new vertex is O(kn log n).
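For reference, the k edge-disjoint MST construction that Algorithm 1 maintains incrementally can be sketched in batch form. The sketch below is illustrative only: it uses Kruskal's algorithm rather than the paper's incremental update, and the helper names (kruskal_mst, k_mst_graph) and sample coordinates are our own.

```python
import itertools
import math

def kruskal_mst(n, edges):
    """Kruskal's algorithm with union-find; edges = [(w, u, v)].
    Returns the edge list of an MST (or minimum spanning forest)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:            # edge joins two components: no cycle
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

def k_mst_graph(points, k):
    """Union of k edge-disjoint MSTs repeatedly extracted from the
    complete Euclidean graph, as in the batch k-MST construction."""
    n = len(points)
    pool = [(math.dist(points[u], points[v]), u, v)
            for u, v in itertools.combinations(range(n), 2)]
    neighborhood = []
    for _ in range(k):
        mst = kruskal_mst(n, pool)
        neighborhood.extend(mst)
        pool = [e for e in pool if e not in set(mst)]  # remove MST_i
    return neighborhood

# Two small clusters; a plain 1-NN or 2-NN graph would leave them apart.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
graph = k_mst_graph(points, 2)
print(len(graph))  # 8: two edge-disjoint spanning trees of 4 edges each
```

Each extracted tree spans all points, so the union is connected by construction; with k trees, any k−1 edge deletions leave at least one tree intact, which is the source of the k-edge-connectivity guarantee.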

3. Experiment

We implemented the incremental k-MST algorithm and ran experiments that use the constructed neighborhood graphs for geodesic-distance-based data embedding. We follow the same steps as incremental Isomap [2] for updating the shortest paths and the low dimensional coordinates. This section presents results of experiments on one synthetic data set, the Swiss Roll data set, and two real world data sets, the Iris and Olivetti faces data sets.

Figure 1 shows 2D embedding results of the Swiss Roll data by Isomap using neighborhood graphs constructed by incremental k-MST with k = 4. The figure shows how the incremental method captures the underlying structure of the data as the number of data points increases from 20 to 1,000. For each number of points, both the original 3D data and the corresponding 2D projection are drawn, with the neighborhood graph superimposed on the 3D data. The figure clearly shows that short-circuit edges disappear as the number of data points increases. When n = 700, in particular, two short-circuit edges tie two corners of the manifold together in the 2D projection. The problem disappears when n = 1,000.

[Figure 1. Incremental k-MST of the Swiss Roll data set when k = 4. Panels show n = 20, 50, 100, 200, 300, 500, 700, and 1,000.]

[Figure 2. Incremental k-MST of the Iris data set when k = 4. Panels show n = 25, 45, and 65.]

Our next experiment is on the classical Iris data. The data set consists of 150 points in three classes. The k-nearest-neighbor approach fails to build a connected neighborhood graph on these data. Figure 2 shows 2D embedding results by Isomap using the incremental k-MST algorithm with k = 4. Each point in the 2D projections is marked with its class label.

The last experiment is on 50 face images, shown in Figure 3, of five people from the Olivetti image data set. Each image has 64 × 64 pixels. Figure 4 gives six 2D projections. Markers are used in Figures 3 and 4 to denote different people. We can see that points of the same person move together and form clusters as the value of n increases.

[Figure 3. Olivetti image data.]

[Figure 4. Incremental k-MST of the Olivetti image data when k = 3. Panels show n = 15, 20, 25, 30, 40, and 50.]

4. Conclusion and Future Work

This paper has proposed an incremental algorithm for neighborhood graph construction that guarantees the k-edge-connectivity of the neighborhood graph while the graph is being expanded with new data points. The method works when the data are under-sampled or distributed among clusters, in which cases the conventional approaches of using k nearest neighbors or fixed-radius neighborhoods would fail. The method has applications in multimedia stream processing. Future work will be on incremental methods to build more rigid neighborhood graphs and on the development of software systems for dimensionality reduction and characterization of multimedia data streams.

References

[1] F. Chin and D. Houck. Algorithms for updating minimal spanning trees. J. Computer and System Sciences, 16(3):333–344, 1978.
[2] M. H. C. Law and A. K. Jain. Incremental nonlinear dimensionality reduction by manifold learning. IEEE Trans. Pattern Analysis and Machine Intelligence, 28(3):377–391, March 2006.
[3] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, Dec. 2000.
[4] L. K. Saul and S. T. Roweis. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. J. Machine Learning Research, 4:119–155, June 2003.
[5] P. M. Spira and A. Pan. On finding and updating spanning trees and shortest paths. SIAM J. Computing, 4(3):375–380, Sept. 1975.
[6] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, Dec. 2000.
[7] L. Yang. Building connected neighborhood graphs for isometric data embedding. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pages 722–728, Chicago, IL, Aug. 2005.
