MULTIPLE PENALIZED PRINCIPAL CURVES: ANALYSIS AND COMPUTATION
SLAV KIROV AND DEJAN SLEPČEV
Abstract. We study the problem of finding the one-dimensional structure in a given data set. In other words we consider ways to approximate a given measure (data) by curves. We consider an objective functional whose minimizers are a regularization of principal curves and introduce a new functional which allows for multiple curves. We prove the existence of minimizers and establish their basic properties. We develop an efficient algorithm for obtaining (near) minimizers of the functional. While both of the functionals used are nonconvex, we argue that enlarging the configuration space to allow for multiple curves leads to a simpler energy landscape with fewer undesirable (high-energy) local minima. Furthermore we note that the proposed approach is able to find the one-dimensional structure even for data with a considerable amount of noise.
Keywords. principal curves, geometry of data, curve fitting
Classification. 49M25, 65D10, 62G99, 65D18, 65K10, 49Q20

1. Introduction

We consider the problem of finding one-dimensional structures in data given as point clouds. This is a classical problem. It was studied in the 1980s by Hastie and Stuetzle [17], who introduced principal curves as the curves going through the "middle" of the data. A number of modifications of principal curves, which make them more stable and easier to compute, followed [8, 10, 15, 18, 25]. However there are very few precise mathematical results on the relation between the properties of principal curves (and their variants) and the geometry of the data. On the other hand, a rigorous mathematical setup has been developed for related problems studied in the context of optimal (transportation) network design [5, 6]. In particular the objective functional studied in network design, the average-distance functional, is closely related to a regularization of principal curves.
Our first aim is to carefully investigate a desirable variant of principal curves which is closely related to the average-distance problem. We call the minimizers penalized principal curves. We establish their basic properties (existence and basic regularity of minimizers) and investigate the sense in which they approximate the data and in which they represent the one-dimensional structure.
One of the shortcomings of principal curves is that they tend to overfit noisy data. Adding a regularization term to the objective functional minimized by principal curves is a popular way to address overfitting. The tradeoff is that doing so introduces bias: when the data lie on a smooth curve the minimizer is only going to approximate them. We investigate the relationship between the data and the minimizers and establish how the length scales present in the data and the parameters of the functional dictate the length scales seen in the minimizers. In particular we establish the critical length scale below which variations in the input data are treated as noise, and the typical error (bias) incurred when the input curve is smooth.
Our second aim is to introduce a functional which allows the data to be approximated by more than one curve (multiple penalized principal curves). The motivation is twofold. First, this allows one to appropriately represent and approximate data whose one-dimensional structure consists of more than one component. Second, it leads to a good scheme for computing penalized principal curves. Namely, for many data sets the penalized principal curves functional has a complicated energy
landscape with many local minima. This is a typical situation and an issue for virtually all present approaches to nonlinear principal components. As we explain below, enlarging the set over which the functional is considered (from a single curve to multiple curves) and appropriately penalizing the number of components leads to better behavior of energy-descent based approaches (they more often converge to low-energy local minima). We apply modern optimization algorithms based on ADMM [3] and closely related Bregman iterations [16] to compute approximate minimizers. We describe the algorithm in detail, discuss its complexity and present computational examples that both illustrate the theoretical findings and support the viability of the approach for finding complex one-dimensional structures in point clouds with tens of thousands of points in high dimensions.
Figure 1. Example of a point cloud generated by noisy samples of two curves (not shown): a section of a circle and a curved helix wrapping around it. The green curves shown represent the one dimensional approximation of the data cloud obtained by minimizing the proposed functional (MPPC) using the algorithm of section 4.
1.1. Outline. In section 2 we introduce the objective functionals (both for single and multiple curve approximation) and recall some of the related approaches. We establish the basic properties of the functionals, in particular the existence of minimizers and their regularity. Under an assumption of smoothness we derive the Euler-Lagrange equation for critical points of the functional. We furthermore consider the linear stability of straight-line critical points. In section 3 we investigate the relation between the data and the minimizers of the functionals. In particular we establish the relation between the length scales present in the data, the parameters of the functional, and the length scales present in the minimizers. We then discuss the parameter selection for the functional. In section 4 we describe the algorithm for computing the minimizers. In section 4.6 we discuss some further numerical examples which illustrate the applicability of the functionals. Section 5 contains the conclusion and a brief discussion of, and comparison with, other approaches for one-dimensional data cloud approximation. Appendix A contains the technical details of the analysis of a minimizer considered in section 3.
2. The functionals and basic properties

In this section we introduce the penalized principal curves functional and the multiple penalized principal curves functional. We recall and prove some of their basic properties.

2.1. Penalized principal curves. Given a measure (distribution of data) µ with compact support in R^d, λ > 0, and p ≥ 1, the penalized principal curves are minimizers of

(PPC)    E_µ^λ(γ) := ∫_{R^d} d(x, Γ)^p dµ(x) + λ L(γ)

over γ ∈ C := {γ : [0, a] → R^d : a ≥ 0, γ is Lipschitz with |γ'| ≤ 1 L^1-a.e.}, where Γ := γ([0, a]), d(x, Γ) is the distance from x to the set Γ, and L(γ) is the length of γ:

L(γ) := ||γ||_TV := sup { ∑_{i=2}^{n} |γ(x_i) − γ(x_{i−1})| : 0 ≤ x_1 < x_2 < ... < x_n ≤ a, n ∈ N }.
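To make the objects concrete, the following sketch (our own illustration, not part of the original text) evaluates the two terms of (PPC) for a discrete measure given by weighted sample points, with Γ approximated by a polygonal curve and d(x, Γ) computed as the distance to its segments; the function names and the choice p = 2 in the usage example are ours.

```python
import numpy as np

def point_segment_dist(x, a, b):
    """Euclidean distance from point x to the segment [a, b]."""
    ab = b - a
    denom = np.dot(ab, ab)
    t = 0.0 if denom == 0 else np.clip(np.dot(x - a, ab) / denom, 0.0, 1.0)
    return np.linalg.norm(x - (a + t * ab))

def ppc_energy(points, weights, curve, lam, p=2):
    """Discrete (PPC) energy: sum_i w_i d(x_i, Gamma)^p + lam * length(curve).
    `curve` is an (m, d) array of vertices of a polygonal curve."""
    fidelity = 0.0
    for x, w in zip(points, weights):
        d = min(point_segment_dist(x, curve[j], curve[j + 1])
                for j in range(len(curve) - 1))
        fidelity += w * d ** p
    length = np.sum(np.linalg.norm(np.diff(curve, axis=0), axis=1))
    return fidelity + lam * length

# tiny usage example: noisy samples near a horizontal segment
rng = np.random.default_rng(0)
pts = np.column_stack([np.linspace(0, 1, 200), 0.05 * rng.standard_normal(200)])
w = np.full(200, 1.0 / 200)
gamma = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]])
print(ppc_energy(pts, w, gamma, lam=0.01))
```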
The functional is closely related to the average-distance problem introduced by Buttazzo, Oudet, and Stepanov [5], having in mind applications to optimal transportation networks [6]. In this context, the first term can be viewed as the cost of a population to reach the network, and the second as a cost of the network itself. The authors considered general connected one-dimensional sets and, instead of a length penalty, imposed a length constraint. The functional restricted to curves, which we consider, has been studied by Lu and one of the authors [20]. Similar functionals have been considered in the statistics and machine learning literature as regularizations of the principal curves problem, by Tibshirani [26] (who introduces curvature penalization), Kegl, Krzyzak, Linder, and Zeger [18] (length constraint), Biau and Fischer [2] (length constraint), Smola, Mika, Schölkopf, and Williamson [25] (a variety of penalizations, including penalizing length as in (PPC)), and others.
The first term in (PPC) measures the approximation error, while the second one penalizes the complexity of the approximation. If µ has smooth density, or is concentrated on a smooth curve, the minimizer γ is typically supported on smooth curves. However this is not universally true. Namely, it was shown in [24] that minimizers of average-distance problems can have corners, even if µ has smooth density. An analogous argument applies to (PPC) (and (MPPC)). This raises important modeling questions regarding what the best functional is and whether further regularization is needed. We do not focus on these questions in this paper, but note that the information on length scales in the problem that we obtain can provide guidance regarding how much regularization is needed.
Existence of minimizers of (PPC) in C was shown in [20]. There it was also shown that any minimizer γ_min satisfies the total curvature bound

||γ'_min||_TV ≤ (p/λ) diam(supp(µ))^{p−1} µ(R^d).

The total variation (TV) above allows one to treat the curvature as a measure, with delta masses at the locations of corners, which is necessary in light of the possible lack of regularity. In [20] it was also shown that minimizing curves are injective (i.e. do not self-intersect) in dimension d = 2 if p ≥ 2.

2.2. Multiple penalized principal curves. Here we consider a functional related to (PPC) which allows for configurations which have more than one component. Since (PPC) can be made arbitrarily small by considering γ with many components, a penalty on the number of components is needed. Thus we propose the following functional for multiple curves
(MPPC)    E_µ^{λ1,λ2}(γ) := ∫_{R^d} d(x, Γ)^p dµ(x) + λ1 (L(γ) + λ2 (k(γ) − 1))

where we relax γ to now be piecewise Lipschitz and k(γ) is the number of curves used to parametrize γ. More precisely, we aim to minimize (MPPC) over the admissible set

A := {γ = {γ^i}_{i=1}^{k(γ)} : k(γ) ∈ N, γ^i ∈ C, i = 1, ..., k(γ)}.
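The following companion sketch (again our own illustration) evaluates the (MPPC) energy of a collection of polygonal curves, including the λ1 λ2 (k(γ) − 1) term; as in section 4 below, d(x, Γ) is approximated by the distance to the nearest vertex.

```python
import numpy as np

def mppc_energy(points, weights, curves, lam1, lam2, p=2):
    """Discrete (MPPC) energy of a list of polygonal curves.
    Each element of `curves` is an (m_i, d) array of vertices; d(x, Gamma) is
    approximated by the distance to the nearest vertex of any component."""
    vertices = np.vstack(curves)
    dists = np.linalg.norm(points[:, None, :] - vertices[None, :, :], axis=2).min(axis=1)
    fidelity = np.sum(weights * dists ** p)
    length = sum(np.sum(np.linalg.norm(np.diff(c, axis=0), axis=1)) for c in curves)
    k = len(curves)  # number of components k(gamma); singletons contribute zero length
    return fidelity + lam1 * (length + lam2 * (k - 1))

# e.g. two components: a segment and a singleton
print(mppc_energy(np.array([[0., 0.], [1., 1.]]), np.array([0.5, 0.5]),
                  [np.array([[0., 0.], [1., 0.]]), np.array([[1., 1.]])],
                  lam1=0.1, lam2=0.5))
```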
We call elements of A multiple curves. One may think of this functional as penalizing both zero- and one-dimensional complexities of approximations to µ. In particular we can recover the (PPC) functional by taking λ2 large enough. On the other hand, taking λ1 large enough leads to a k-means clustering problem which penalizes the number of clusters. This special case of (MPPC) has been encountered in [4, 19].
The main motivation for considering (MPPC), even if we are only searching for one curve, has to do with the non-convexity of (PPC). We will see that numerically minimizing (MPPC) often helps evade undesirable (i.e. high-energy) local minima of the (PPC) functional. In particular (MPPC) can be seen as a relaxation of (PPC) to a larger configuration space. The energy descent for (MPPC) allows for curve splitting and reconnecting, which is the mechanism that enables one to evade undesirable local minima of (PPC).

2.3. Existence of minimizers of (MPPC). We show that minimizers of (MPPC) exist in A. We follow the approach of [20], where existence of minimizers was shown for (PPC). We first cover some preliminaries, including defining the distance between curves. If γ1, γ2 ∈ C with respective domains [0, a1], [0, a2], where a1 ≤ a2, we define the extension of γ1 to [0, a2] as

γ̃1(t) = γ1(t) if t ∈ [0, a1],   and   γ̃1(t) = γ1(a1) if t ∈ (a1, a2].

We let

d_C(γ1, γ2) = max_{t ∈ [0, a2]} |γ̃1(t) − γ2(t)|.
We have the following lemma, and the subsequent existence of minimizers.

Lemma 2.1. Consider a measure µ ∈ M and λ1, λ2 > 0, p ≥ 1.
(i) For any minimizing sequence {γn} of (MPPC)
  (a) lim sup_{n→∞} k(γn) ≤ (1/(λ1 λ2)) (diam supp(µ))^p, and
  (b) lim sup_{n→∞} L(γn) ≤ (1/λ1) (diam supp(µ))^p.
(ii) There exists a minimizing sequence {γn} of (MPPC) such that for all n, Γn is contained in Conv(µ), the convex hull of the support of µ.

Proof. The first property follows by taking a singleton as a competitor. The second follows from projecting any minimizing sequence onto Conv(µ). Doing so can only decrease the energy, as shown in [6, 20]. The argument relies on the fact that projection onto a convex set decreases length.

Lemma 2.2. Given a positive measure µ ∈ M and λ1, λ2 > 0, p ≥ 1, the functional (MPPC) has a minimizer in A. Moreover, the image of any minimizer is contained in the convex hull of the support of µ.

Proof. The proof is an extension of the one found in [20] for (PPC). Let {γn}_{n∈N} be a minimizing sequence in A. Since the number of curves k(γn) is bounded, we can find a subsequence (which we take to be the whole sequence) with each member having the same number of curves k. We enumerate the curves in each member of the sequence as γn = {γn^i}_{i=1}^k. We assume that each curve γn^i is arc-length parametrized for all n ∈ N, i ≤ k. Since the lengths of the curves are uniformly bounded, let L = sup_{n,i} L(γn^i), and extend the parametrization of each curve in the way defined above. Then for each i ≤ k, the curves {γn^i}_{n∈N} satisfy the hypotheses of the Arzelà-Ascoli theorem. Hence for each i ≤ k, up to a subsequence, γn^i converge uniformly to a curve γ^i : [0, L] → R^d. Diagonalizing, we find a subsequence (which we take to be the whole sequence) for which the aforementioned convergence holds for all i ≤ k. Moreover, the limiting object is a collection of curves which are 1-Lipschitz since all of the curves in the sequence are. Thus γ := {γ^i}_{i=1}^k ∈ A.
In addition, the mapping Γ ↦ ∫_{R^d} d(x, Γ)^p dµ(x) is continuous and Γ ↦ L(Γ) is lower semicontinuous with respect to convergence in C. Thus lim inf_{n→∞} E_µ^{λ1,λ2}(γn) ≥ E_µ^{λ1,λ2}(γ), and so γ is a minimizer.

2.4. First variation. In this section we compute the first interior variation of the functional and state under what conditions a smooth multiple curve is a critical (i.e. stationary) point. Let µ be a compactly supported measure. Assume γ is a C^2 multiple curve. In the first variation we investigate, we only perturb the interiors of the curves, and not the endpoints (or their number). For that reason we can focus on a single component and in fact assume that the multiple curve has only one component, that is, that γ ∈ C^2([a, b], R^d). Without loss of generality we assume that |γ_s| = 1. We consider variations of γ of the form γ(s, t) = γ(s) + t v(s), where v ∈ C^2([a, b], R^d), v(a) = v(b) = 0. We note that one could allow v(a) and v(b) to be nonzero (as has been considered in, for example, [24]), but we do not need that for our analysis. One can furthermore assume without loss of generality (by reparametrizing the curves if necessary) that v(s) · γ_s(s) = 0 for all s ∈ [a, b], that is, that v is orthogonal to the curve. Let Γ_t be the image of γ( · , t), that is, let Γ_t = γ( · , t)([a, b]).
To compute how the energy (PPC) changes when γ is perturbed we need to describe how the distance of points to Γ_t changes with t. For any x ∈ supp µ let Π_t(x) be a point on Γ_t which is closest to x, that is, let Π_t(x) be a minimizer of |x − y| over y ∈ Γ_t. We assume that Π_t(x) is unique for all x ∈ supp µ. The general case in which the closest point is non-unique can also be considered [20, 24], but we omit it here for simplicity. We call Π_t the projection of the data onto Γ_t. For x ∈ supp µ let g(x, t) := d(x, Γ_t)^2 = |x − Π_t(x)|^2. Then
(2.1)
∂g/∂t = −2 (x − Π_t(x)) · γ_t,
∂²g/∂t² = 2 ( |γ_t|² − (x − Π_t(x)) · γ_tt − (γ_t · γ_s − (x − Π_t(x)) · γ_st)² / (|γ_s|² − (x − Π_t(x)) · γ_ss) )
where γ and its derivatives are evaluated at (s(t), t), with s(t) = γ( · , t)^{-1}(Π_t(x)), that is, s(t) = arg min_{r∈[a,b]} d(x, γ(r, t)). Note that for any s ∈ (a, b) the set of points projecting onto γ(s, t) satisfies Π_t^{-1}(γ(s, t)) ⊂ γ_s(s, t)^⊥.
Taking the derivative in t, and changing coordinates so that the approximation-error term is written as a double integral, we obtain

dE/dt = ∫_a^b ( λ1 (γ_s/|γ_s|) · γ_st − 2 α̃(s) ∫_{Π_t^{-1}(γ(s,t))} (x − Π_t(x)) · γ_t |1 − K⃗(s) · (x − Π_t(x))| dµ_s(x) ) ds

where we have suppressed the dependence on t (and in some places s), |1 + K⃗(s) · (γ(s) − x)| is the Jacobian of the change of coordinates, and K⃗ is the curvature vector of γ. Here we have used the disintegration theorem (see for example pages 78-80 of [9]) to rewrite an integral of µ over R^d as an iterated integral along slices orthogonal to the curve (more precisely over the sets of points that project to a given point on the curve): α̃ := d((γ( · , t)^{-1} ∘ Π_{Γ_t})_# µ)/dL^1 is the projected density of µ pulled back to the parametrization, and µ_s is a probability measure supported in Π_t^{-1}(γ(s, t)). Integrating by parts we obtain

dE/dt |_{t=0} = ∫_a^b ( −λ1 K⃗ · γ_t − 2 α̃(s) ∫_{Π_t^{-1}(γ(s,t))} (x − Π_t(x)) · γ_t |1 − K⃗ · (x − Π_t(x))| dµ_s(x) ) ds.
We conclude that γ is a stationary configuration if and only if

(2.2)    λ1 K⃗(s) = −2 α̃(s) ∫_{Π_t^{-1}(γ(s,t))} (x − Π_t(x)) |1 − K⃗ · (x − Π_t(x))| dµ_s(x)

for L^1-a.e. s ∈ (a, b).
2.5. Linear stability. In this section we compute the second variation to obtain conditions for linear stability. For simplicity we consider the situation in which a straight line segment is (a part of) a steady state and obtain conditions for its linear stability. This has important implications for determining when the minimizers of the chosen functional start to overfit the data. If γ is a straight line segment, K⃗ = 0, and (2.2) simplifies to

γ(s) = x̄(s) := ∫_{Π_t^{-1}(γ(s,t))} x dµ_s(x)

for L^1-a.e. s ∈ (a, b) such that α̃(s) ≠ 0. This simply states that a straight line is a critical point of the functional if and only if almost every point on the line is the mean of the points projecting there, that is, if it is a principal curve for the data set. The second variation of the length term is
(2.3)
d²/dt² L(γ) = ∫_a^b ( (γ_st/|γ_s| − (γ_s/|γ_s|²) ((γ_s/|γ_s|) · γ_st)) · γ_st + (γ_s/|γ_s|) · γ_stt ) ds
            = ∫_a^b (1/|γ_s|) ( |γ_st|² − ((γ_s/|γ_s|) · γ_st)² ) + (γ_s/|γ_s|) · γ_stt ds.
We note that 0 = (γ_s · γ_t)_s = γ_ss · γ_t + γ_s · γ_st; since γ is a straight line, γ_ss = 0 and |γ_s| = 1, and therefore γ_s · γ_st = 0. Moreover γ_stt = 0 because the variation is linear in t, so the second variation of the length term becomes just ∫_a^b |γ_st|² ds. Thus using (2.1) we get

(2.4)    d²E/dt² |_{t=0} = ∫_a^b ( λ1 |γ_st|² + 2 α̃(s) ∫_{Π_t^{-1}(γ(s,t))} ( |γ_t|² − ((x − Π_t(x)) · γ_st)² ) dµ_s(x) ) ds.
3. Relation between the minimizers and the data

Our goal is to identify the length scales that relate the parameters of the functional, the length scales present in the data, and the length scales seen in the minimizer. To do so we consider examples of data and corresponding minimizers, use the characterization of critical points of (MPPC), and perform linear stability analysis.

3.1. Examples and properties of minimizers. In this section we provide some insight as to how minimizers of (MPPC) behave. We first find minimizers of (MPPC) in some simple yet instructive cases.

3.1.1. Uniform density on a line. We consider the measure µ to have uniform density α on the line segment [0, L] ⊂ R. We relegate the technical details of the analysis to Appendix A; here we report the main conclusions. By (A.6) there is a critical density

α* = (4/3)² λ1/λ2²

such that if α > α* then the minimizing γ has one component and is itself a line segment contained in [0, L]. The minimizer γ is shorter than L by a length of √(λ1/α) on each side. If α < α* and L is long enough then the minimizer consists of regularly spaced points on [0, L] with the space between them approximately (because of finite-size effects)

(3.1)    gap ≈ 2 (3 λ1 λ2 / (4α))^{1/3}.
An example of this scenario is provided in Figure 2.
Figure 2. Numerical results show a minimizer for n = 1000 uniformly spaced points on a line segment, with total mass 1. Here λ1 = 1/16, λ2 = .6 and the critical value for connectedness is λ∗2 = 4/3. The optimal gap between the points is 1.6, compared to the approximation of ≈ 1.53 given by (3.1). The discrepancy is due to the finite length of the line segment considered in the example.
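As a quick sanity check (our own computation, using the parameters quoted in the caption of Figure 2), the predicted gap from (3.1) and the critical λ2 for connectedness can be evaluated directly:

```python
lam1, lam2 = 1 / 16, 0.6
alpha = 1 / 16                       # total mass 1 spread over a segment of length 16
gap = 2 * (3 * lam1 * lam2 / (4 * alpha)) ** (1 / 3)   # formula (3.1)
lam2_crit = (4 / 3) * (lam1 / alpha) ** 0.5            # connectedness threshold for p = 2
print(gap, lam2_crit)                # ~1.53 and 4/3, as reported in the caption
```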
3.1.2. Two parallel lines. Consider the data measure µ distributed on two parallel lines of length L, a distance h apart in R². Assume that the data density is uniform and equals α per unit length. Let γ be the arc-length parametrized straight line segment of length L, parallel to the data curves and located half-way between them. By (2.1) note that the first variation at γ is zero, since every point is the mean of the points that project there. We now evaluate the second variation, with γ_t(s) = v(s) = (v1(s), v2(s)), γ_st = v′. Since γ is horizontal, and we assume γ_s · γ_t = 0, we have v1(s) = 0. Thus,

d²E/dt² |_{t=0} = ∫_0^L λ1 (v2′)² + 4α ( v2² − (h/2)² (v2′)² ) ds

and we have linear stability if and only if

(3.2)    0 < ∫_0^L ( λ1 − 4α (h/2)² ) (v2′)² + 4α v2² ds.
Clearly linear stability holds if λ1 ≥ αh². This condition is sharp in the sense that if λ1 < αh² and L is long enough then the line segment γ is linearly unstable. To see this let v2(s) = sin(as). Then the RHS of (3.2) becomes

(λ1 − αh²) a² ∫_0^L cos²(as) ds + 4α ∫_0^L sin²(as) ds.
The first term dominates the second for a sufficiently large.

3.1.3. Data on a curve. In the special case that µ lies on a curve, and a local minimizer γ of (PPC) is "close" to µ, one can obtain an exact expression for the projection distance. More precisely, suppose that for all s̃ near some s ∈ (a, b) the set of x̃ ∈ supp µ such that s̃ minimizes |x̃ − γ(ŝ)| over ŝ ∈ [a, b] is a singleton. Then (2.2) simplifies to

λ1 K(s) = 2 α̃(s) l (1 + K(s) l)

where we have let l := |γ − x|, and consequently

(3.3)    l = (1/(2K)) ( √(1 + 2 λ1 K²/α̃) − 1 ) ≈ { √(λ1/(2α̃))   if 1/K ≪ √(λ1/α̃),
                                                   { λ1 K/(2α̃)    if 1/K ≫ √(λ1/α̃).

Note that always l ≤ √(λ1/(2α̃)). The transition of the projection distance l indicated in (3.3) can be observed in the example in Figure 5.
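The two regimes in (3.3) can also be read off numerically; the following sketch (our own illustration, with parameter values chosen only for demonstration) evaluates the exact expression and its two asymptotic approximations for a given curvature K and projected density α̃:

```python
import numpy as np

def projection_distance(K, alpha, lam1):
    """Exact l from (3.3) together with its two asymptotic approximations."""
    exact = (np.sqrt(1 + 2 * lam1 * K ** 2 / alpha) - 1) / (2 * K)
    flat = lam1 * K / (2 * alpha)          # regime 1/K >> sqrt(lam1/alpha)
    curved = np.sqrt(lam1 / (2 * alpha))   # regime 1/K << sqrt(lam1/alpha)
    return exact, flat, curved

print(projection_distance(K=0.1, alpha=1.0, lam1=0.01))   # flat regime dominates
print(projection_distance(K=50.0, alpha=1.0, lam1=0.01))  # curved regime dominates
```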
3.1.4. Uniform density in a rectangle. Now suppose that µ has uniform density α/h inside [0, L] × [0, h] with L ≫ 1. Linear instability here would indicate when a minimizer starts to overfit the data. We again consider γ to be the arc-length parametrized straight line in the middle, of length L. The second variation (2.4) then becomes

d²E/dt² |_{t=0} = ∫_0^L λ1 (v2′)² + 2 ( α v2² − (α/h) (v2′)² · 2 ∫_0^{h/2} x² dx ) ds
               = ∫_0^L ( λ1 − (1/6) α h² ) (v2′)² + 2α v2² ds

and using the same argument as previously we get linear instability if and only if λ1 < (1/6) α h².
Figure 3. Numerical results showing linear instability for n = 361 × 81 uniformly spaced points with total mass 1. Final curves with decreasing amplitude are shown for λ1 = 1/1000, 1/150, 1/50, 1/27, 1/23. For all solution curves λ2 = 1, to ensure connectedness, and by the results above the critical value for linear stability is λ1* = 1/24. The initial curve used for all results was a randomly perturbed straight line y0 with max y0 − min y0 ≈ 0.005, and the endpoints were kept fixed at (0, 0.5), (4, 0.5) to avoid boundary effects.

To illustrate how close the data are to the minimizing curve we plot the typical distance from the data to the curve as a function of √λ1 in Figure 4.

3.1.5. Vertical Gaussian noise. In this case we consider µ to have Gaussian noise with variance σ² orthogonal to a straight line of length L. We denote the uniform projected density on the line by α. Computations analogous to the previous examples show that there is linear instability if and only if λ1 < 2ασ².

3.2. Summary. Here we summarize how the length scales present in the minimizers are affected by the parameters λ1 and λ2 and the properties of the data. The following are the important length scales seen:

λ2 — connectivity threshold. It sets the minimum distance between components of the solution. Gaps of size λ2 or less in the data are not "seen" by the minimizer.

The length scales below explain the relation between one-dimensional data and the minimizers.

√(λ1/α) — smoothing length scale (discussed in Examples 3.1.3, 3.1.2 and 3.1.4). Consider data generated by a smooth curve with data density per length α, with added noise (high-frequency oscillations, uniform distribution in a neighborhood of the curve, etc.). Then the "noise" is ignored by the minimizer, as long as its amplitude (distance in space from the generating
Figure 4. A plot of the mean projection distance versus √λ1 for the solution curves shown in Figure 3. We expect the mean projection distance to be roughly √(λ1/α), where α is the projected density onto the solution curve. Note that α increases as λ1 increases, since the length of γ decreases. This explains why the slope of the plotted curve decreases as higher values of λ1 are reached.
curve) is less than a constant multiple (which depends on the noise model) of √(λ1/α). In other words, √(λ1/α) is the length scale over which the noise is averaged out: below it the minimizer neglects the noise, while above it the minimizer interprets it as signal that it needs to approximate. Put differently, if we think of the data as drawn by a pen, then 2√(λ1/α) is the widest the pen tip can be for the line to be considered a line.

λ1 K/α — bias or approximation-error length scale (discussed in Example 3.1.3). Consider again data generated by a smooth curve with data density per length α and curvature K. If the curvature is small (that is, if the radius of curvature and the reach of the curve are large compared to √(λ1/α)), then the distance from the curve to the minimizer is going to scale like λ1 K/α. That is, the typical error in the reconstruction of a smooth curve that a minimizer makes (due to the presence of the length penalty term) scales like λ1 K/α.

We further illustrate the length scales with an example below.

Example 3.1. Curve with decaying oscillations. Consider data uniformly spaced on the graph of the function x^5 sin(−4π log(x)), which ensures that the amplitude and period are decreasing at the same rate as x → 0+. In Figure 5, the linear density of the data is constant with total mass 1, and solution curves are shown for two different values of λ1. For x small enough the minimizing curve is flat, as it is not influenced by oscillations whose amplitude is less than √(λ1/α). As the amplitude of the oscillations grows beyond the smoothing length scale the minimizing curve starts to follow them. As x gets larger and K becomes smaller, how well the minimizing curves approximate the data scales linearly with λ1, and is governed by the approximation-error length scale.
In the case that the data do not lie exactly on a curve, and instead are noisy and supported in a neighborhood around a curve, we know from the linear stability analysis when minimizers will start to overfit. This occurs when λ1 < c α̃ h², where c depends on the distribution and h is the maximum projection distance. Any such curve will be locally linearly stable on a small enough scale (otherwise it would not be a local minimizer), and thus the mean projection distance in this case is
Figure 5. Numerical results shown for n = 3000 uniformly spaced data points (in gray) on the graph of x^5 sin(−4π log(x)) for x ∈ [.001, e^{3.25}], and two different values of λ1 (λ1 = .04 and λ1 = .01).
also on the order of √(λ1/α̃). In Figure 4 we show a plot of the mean projection distance versus √λ1 for the linear stability example shown in Figure 3.
In addition to the above length scales, there is a density threshold describing how dense the data need to be along a curve for the functional to interpret them as a continuum.

λ1/λ2² — density threshold (discussed in Example 3.1.1 and Appendix A; see (A.6)). Consider again data generated by a smooth curve with data density per length α. If α is smaller than α* = (4/3)² λ1/λ2², then it is cheaper for the data to be approximated by a series of points than by a continuous curve. That is, if there are too few data points the functional no longer sees them as a continuous curve. Basically, if α > α* then the minimizers of (PPC) and (MPPC) are expected to coincide, while if α < α* then the minimizer of (MPPC) will consist of points spaced at distance about (λ1 λ2/α)^{1/3}. Note that the condition α > α* can be written as (3/4) λ2 > √(λ1/α). So basically we expect that the minimizer will be a continuous curve when the smoothing length scale is below (a constant multiple of) the connectivity threshold λ2.

We also note the following scaling properties of the functionals. Note that E_{aµ}^{λ1,λ2} = a E_µ^{λ1/a, λ2} for any a > 0. Thus, when the "mass" of the data points is changed, λ1 should scale like |µ| to preserve
minimizers. Alternatively, if µ_L(A) := µ(A/L) for every A ⊆ R^d and some L > 0, one easily obtains that

E_{µ_L}^{λ1,λ2}(Lγ) = L² E_µ^{λ1/L, λ2/L}(γ).
3.3. Parameter selection. Understanding the length scales above allows one to choose proper parameters for a variety of tasks. Here we provide an example. Consider the problem of finding one-dimensional structures which have linear density at least ᾱ at a spatial resolution set by a length L. For this to be possible without overfitting it is necessary that L is greater than a constant multiple of the "vertical" noise, with the constant depending on the noise model (recall the linear instability examples 3.1.4 and 3.1.5). Proper values of the parameters λ1 and λ2 can then be set as follows. Pick λ1 so that the smoothing length scale √(λ1/ᾱ) is equal to L, that is, λ1 = ᾱ L². The value of λ2 is then set so that the density threshold α* = (4/3)² λ1/λ2² equals ᾱ, so λ2 = (4/3) √(λ1/ᾱ) = (4/3) L. Then structures with linear density greater than ᾱ will be represented by continuous lines, while those with lower density will be represented by isolated points.
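The parameter choice just described can be packaged as a small helper (our own sketch; `alpha_bar` and `L` denote the target linear density and spatial resolution, and p = 2 is assumed as in the examples above):

```python
def select_parameters(alpha_bar, L):
    """Choose (lam1, lam2) so that the smoothing length scale equals L and the
    density threshold equals alpha_bar (p = 2)."""
    lam1 = alpha_bar * L ** 2                      # sqrt(lam1 / alpha_bar) = L
    lam2 = (4 / 3) * (lam1 / alpha_bar) ** 0.5     # makes (4/3)^2 lam1 / lam2^2 = alpha_bar
    return lam1, lam2

print(select_parameters(alpha_bar=0.5, L=0.1))     # -> (0.005, 0.1333...)
```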
Example 3.2. Two arcs. Figure 6 illustrates a scenario in which two semicircles have different densities. The upper semicircle has half the density per area, but twice the density per unit length, of the other. Since the λ2 component of the energy only feels the projected density, we show a parameter setting in which only the upper semicircle is recovered by a continuous curve. We also show settings in which both are recovered, and in which only the lower semicircle is recovered.

4. Numerical algorithm for computing multiple penalized principal curves

For this section we assume the data measure µ is discrete, with points x1, x2, ..., xn ∈ R^d and corresponding weights w1, w2, ..., wn ≥ 0. The weights are uniform (1/n) for most applications, but we note our flexibility in this regard for cases when it is convenient to have otherwise. For a piecewise linear curve y = (y1, ..., ym), we will consider projections of the data to the yi's only. Hence, we approximate d(xi, y) ≈ min{|xi − yj| : j = 1, ..., m}. (Notation: when a is a vector, as are xi, yj in the previous line, |a| denotes the Euclidean norm.) Before addressing the minimization of (MPPC), we first consider (PPC). The discrete form is

(4.1)    ∑_{j=1}^m ∑_{i∈I_j} w_i |x_i − y_j|² + λ1 ∑_{j=1}^{m−1} |y_{j+1} − y_j|
where I_j := {i : |x_i − y_j| ≤ |x_i − y_{j′}|, j′ = 1, ..., m} (in case of ties the assignment is made arbitrarily but uniquely, so that I_1, ..., I_m partition {1, ..., n}).

4.1. Basic approach for PPC. Here we restrict our attention to performing energy-decreasing steps for the (PPC) functional. We emphasize again that this minimization problem is non-convex. The projection assignments I_1, ..., I_m depend on y itself. However, if the projection assignments are fixed, then the resulting minimization problem is convex. This suggests the following EM-type algorithm.

Algorithm 1
  Input: data x_1, ..., x_n, weights w_1, ..., w_n, initial curve y_1, ..., y_m, λ1 > 0
  repeat
    1. compute I_1, ..., I_m
    2. minimize (4.1) for I_1, ..., I_m fixed
  until convergence
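In code, the outer loop of Algorithm 1 might look as follows (a sketch under the nearest-vertex projection used throughout this section; `minimize_fixed_projections` stands for the inner solver of step 2, e.g. the ADMM routine of section 4.1.1, and is not implemented here):

```python
import numpy as np

def assign_projections(points, y):
    """Step 1: index of the nearest curve point y_j for every data point x_i."""
    d = np.linalg.norm(points[:, None, :] - y[None, :, :], axis=2)   # (n, m)
    return d.argmin(axis=1)

def algorithm1(points, weights, y, lam1, minimize_fixed_projections, max_iter=100):
    """EM-type loop: alternate projection assignment and fixed-assignment minimization."""
    labels = assign_projections(points, y)
    for _ in range(max_iter):
        y = minimize_fixed_projections(points, weights, labels, y, lam1)  # step 2
        new_labels = assign_projections(points, y)                        # step 1
        if np.array_equal(new_labels, labels):   # projections stabilized
            break
        labels = new_labels
    return y, labels
```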
Figure 6. Two semicircles with uniform noise shown for three parameter regimes. The lower semicircle has half as many data points and 1/4 (by area) of the support of the upper semicircle, so that its density per unit area is twice that of the upper semicircle, but its projected linear density is half as much. On the left, the parameters λ1 and λ2 are such that the density threshold α* = (4/3)² λ1/λ2² is in between the linear densities of the two semicircles. In the middle, it is below both. On the right, α* is the same, but λ1 is lowered so that the smoothing length scale √(λ1/α) is less than the half-width of the top semicircle. This causes the overfitting of the top semicircle, to the point that the linear density on the minimizer drops below α*.
Note that if the minimization of (4.1) is solved exactly, then Algorithm 1 converges to a local minimum in finitely many steps (since there are finitely many projection states, which cannot be visited more than once).

4.1.1. Minimizing the functional with projections fixed. We now discuss how to perform step 2 of Algorithm 1. We first address the minimization of (4.1) without the k(y) term. One may observe that with the projections fixed this minimization problem resembles that of a regression, and in particular the fused lasso [27]. To perform the minimization we apply the alternating direction method of multipliers (ADMM) [3], which is equivalent to split Bregman iterations [16] when the constraints are linear (our case) [11]. We rewrite the total variation term as ||Dy||_{1,2} := ∑_{i=1}^{m−1} |(Dy)_i|, where D is the difference operator, (Dy)_i = y_{i+1} − y_i, and | · | again denotes the Euclidean norm. An equivalent constrained minimization problem is then

min_{y,z : z = Dy}  ∑_{j=1}^m ∑_{i∈I_j} w_i |x_i − y_j|² + λ ||z||_{1,2}.

Expanding the quadratic term and neglecting the constant, we obtain

(4.2)    min_{y,z : z = Dy}  ||y||²_w̄ − 2 (y, x̄)_w̄ + λ ||z||_{1,2}
where notation was introduced for the total mass projecting to y_j, w̄_j = ∑_{i∈I_j} w_i, the center of mass x̄_j = (1/w̄_j) ∑_{i∈I_j} w_i x_i, and the weighted inner product (y, x̄)_w̄ = ∑_{j=1}^m w̄_j (y_j, x̄_j). One iteration of the ADMM algorithm then consists of the following updates:

(1) y^{k+1} = argmin_y ||y||²_w̄ − 2 (y, x̄)_w̄ + (ρ/2) ||Dy − z^k + b^k||²
(2) z^{k+1} = argmin_z λ ||z||_{1,2} + (ρ/2) ||Dy^{k+1} − z + b^k||²
(3) b^{k+1} = b^k + Dy^{k+1} − z^{k+1}

where ρ > 0 is a parameter that can be interpreted as penalizing violations of the constraint. As such, lower values of ρ tend to make the algorithm more adventurous, though the algorithm is known to converge for any fixed value of ρ > 0. The minimization in the first step is convex, and the first-order conditions yield a tridiagonal system for y. The tridiagonal matrix to be inverted is the same for all subsequent iterations, so only one inversion is necessary, which can be done in O(md) time. In the second step, z decouples, and the resulting solution is given by block soft thresholding

z_i^{k+1} = { v_i^k − (λ/ρ) v_i^k/||v_i^k||   if ||v_i^k|| > λ/ρ,
            { 0                               otherwise,

where we have let v_i^k = (Dy^{k+1})_i + b_i^k. We therefore see that ADMM applied to (4.2) is very fast. Note that one only needs the energy to decrease in this step for Algorithm 1 to converge to a local minimum. This is typically achieved after one iteration, so we suggest at most a few iterations for the minimization of (4.1), especially as finer precision gets lost as soon as the projections are updated. On the other hand, the projection step is more expensive, requiring O(nmd) operations to compute exactly. It may be worthwhile to investigate how to optimize the alternation of these steps, as well as more efficient methods for updating projections, especially when the changes in y are small. For our purposes, though, we apply the above iterations until the energy decreases, and then exactly recompute all projections.
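A sketch of one inner iteration (our own implementation of the three updates above; for simplicity the linear system is solved with a dense solver rather than the O(md) tridiagonal factorization mentioned in the text, and all function and variable names are ours):

```python
import numpy as np

def admm_step(y, z, b, w_bar, x_bar, lam, rho):
    """One ADMM iteration for (4.2): y-update (weighted tridiagonal system),
    z-update (block soft thresholding), and dual update for b.
    y, x_bar: (m, d); z, b: (m-1, d); w_bar: (m,) total mass projecting to each y_j."""
    m = y.shape[0]
    D = np.diff(np.eye(m), axis=0)                 # (m-1, m) difference operator
    A = 2 * np.diag(w_bar) + rho * D.T @ D         # tridiagonal system matrix
    rhs = 2 * w_bar[:, None] * x_bar + rho * D.T @ (z - b)
    y_new = np.linalg.solve(A, rhs)                # solves all d coordinates at once

    v = D @ y_new + b
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    shrink = np.maximum(1 - (lam / rho) / np.maximum(norms, 1e-12), 0.0)
    z_new = shrink * v                             # block soft thresholding

    b_new = b + D @ y_new - z_new
    return y_new, z_new, b_new
```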
4.2. Accounting for λ2 in MPPC. We now discuss how we perform steps that decrease the energy of the modified functional (MPPC). Since λ2 controls the level of connectedness of minimizers, we introduce routines that disconnect and connect curves based on the resulting change in energy.

4.2.1. Disconnecting and connecting curves. We first examine the energy contribution of each edge {i, i′} := [y_i, y_{i′}]. To do so we compare the energy of the present configuration to one where the particular edge is removed. It is straightforward to check that the energy contribution of the edge {i, i′} with respect to the continuum functional (MPPC) is

∆E_{i,i′} := λ1 |y_{i′} − y_i| − λ1 λ2 − ∑_{j∈I_{i,i′}} w_j min(|y_i − Π_{i,i′}(x_j)|, |y_{i′} − Π_{i,i′}(x_j)|)²

where I_{i,i′} is the set of data points projecting to the edge {i, i′}, and Π_{i,i′} is the orthogonal projection onto the edge {i, i′}. Our connecting and disconnecting routines will be based on the sign of ∆E_{i,i′}. We note that the above criterion is based on the variation of the continuum functional rather than of its discretization (4.1), on which Algorithm 1 is based. Doing so provides a stable criterion that would not change if we had a finer discretization of the line segment [y_i, y_{i′}]. While we use the discrete functional to simplify computations in approximating the optimal fitting of curves, we perform topological changes based on the continuum energy (MPPC).
We first discuss disconnecting. We compute the energy contribution of each edge and if ∆E_{i,i′} > 0 (that is, if removing the edge decreases the energy), then we remove the edge {i, i′}. Note this condition can only be true if the length of the edge is at least λ2. It may happen that all edge lengths are less than λ2, but that the energy may be decreased by removing a sequence of edges whose total length is greater than λ2. Thus, in addition to checking single edges, we implement an analogous check for sequences of edges. The energy contribution of
a sequence of k edges {i, i+1}, {i+1, i+2}, ..., {i+k−1, i+k} (including the corresponding interior vertices y_{i+1}, ..., y_{i+k−1}) is given by

∆E_{i:i+k} := λ1 ( ∑_{l=0}^{k−1} |y_{i+l+1} − y_{i+l}| − λ2 )
            + ∑_{l=0}^{k−1} ∑_{j∈I_{i+l,i+l+1}} w_j ( |x_j − Π_{i+l,i+l+1}(x_j)|² − (min{|x_j − y_i|, |x_j − y_{i+k}|})² ).
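For concreteness, a sketch of the single-edge check used by the disconnect/connect routines (our own illustration of the formula for ∆E_{i,i′} above; the data points currently projecting to the edge and their weights are assumed to be available as arrays):

```python
import numpy as np

def edge_energy_contribution(yi, yip, X_edge, w_edge, lam1, lam2):
    """Energy contribution Delta E_{i,i'} of the edge [yi, yip]:
    lam1*|edge| - lam1*lam2 - sum_j w_j * min(|yi - P(x_j)|, |yip - P(x_j)|)^2,
    where P is the orthogonal projection onto the edge."""
    e = yip - yi
    length = np.linalg.norm(e)
    t = np.clip((X_edge - yi) @ e / (length ** 2), 0.0, 1.0)   # projection parameters
    along = np.minimum(t, 1.0 - t) * length                    # distance of P(x_j) to nearer endpoint
    return lam1 * length - lam1 * lam2 - np.sum(w_edge * along ** 2)

# the edge is removed when its contribution is positive (removal lowers the energy)
```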
The routine for checking such edge sequences is outlined in Algorithm 2.

Algorithm 2 Check edge sequences for removal
  Input: data x_1, ..., x_n, weights w_1, ..., w_n, connected curve y_1, ..., y_m, projections I, λ1, λ2 > 0
  set i = 1
  repeat
    set k = 1, len = |y_{i+1} − y_i|
    repeat
      increment k = k + 1, len = len + |y_{i+k} − y_{i+k−1}|
    until len > λ2 (or i + k = m)
    compute ∆E_{i:i+k}
    if ∆E_{i:i+k} > 0 then
      remove edge sequence {i, i+1}, {i+1, i+2}, ..., {i+k−1, i+k}
      advance i = i + k − 1
    end if
    increment i = i + 1
  until i > m − 1

Connecting is again based on the energy contribution of potential new edges. We use a greedy approach to adding the edges. That is, we compute ∆E_{i,i′} for each potential edge {i, i′}, and add them in ascending order, connecting curves until no admissible energy-decreasing edges exist. We note that finding the globally optimal connections is essentially a traveling salesman problem, which is NP-hard. More sophisticated algorithms could be used here, but the greedy search is simple and has satisfactory performance for our purposes.

4.2.2. Management of singletons. Another aspect of the numerics of the modified functional regards singletons (curves whose range is just a single point in R^d). Even if one is only interested in recovering one-dimensional structures, singletons may play a vital role. Singletons can represent low-density regions of data, and divert the attraction of curves away from outliers and background clutter. This is especially important when low-density regions exist far from the one-dimensional structures. Below we provide effective routines for energy-decreasing transitions between configurations involving singletons.
For checking whether (and where) singletons should be added, we examine each point y_i individually. If y_i is itself not a singleton, we compute the expected change in energy resulting from disconnecting y_i from its curve, placing it at the mean x̄_i of the data that project to it, and reconnecting the neighbors of y_i, so the number of components only increases by one. The change in the fidelity term will be exactly −w̄_i |x̄_i − y_i|², where w̄_i = ∑_{j∈I_i} w_j is the total mass projecting to y_i. Thus we add a singleton when

λ1 λ2 < w̄_i |x̄_i − y_i|² + λ1 ( |y_i − y_{max(1,i−1)}| + |y_i − y_{min(m,i+1)}| − |y_{max(1,i−1)} − y_{min(m,i+1)}| ).

If y_i is itself a singleton, then one cannot exactly compute the change in energy due to adding another singleton in its neighborhood without knowing the optimal positions of both singletons. We
restrict our attention to the data which project onto y_i, and note that if those points are the only ones that project to the new singleton, then adding the singleton may be advantageous only if the fidelity term associated to y_i is greater than λ1 λ2. If that holds, we perturb y_i in the direction of one of its data points, place a new singleton opposite to y_i with respect to its original position, and apply a few iterations of Lloyd's k-means algorithm (with k = 2) to the data points that projected onto y_i. We keep the two new points if and only if the energy decreases below that of the configuration with only y_i.
A singleton y_i gets removed if doing so decreases the energy, that is, if

λ1 λ2 > ∑_{j∈I_i} w_j ( d(x_j, y_{−i})² − |x_j − y_i|² )

where d(x_j, y_{−i}) := min{|x_j − y_{i′}| : i′ ∈ [m], i′ ≠ i}. Since singletons are represented by just a single point and cannot grow by themselves, we also check whether transitioning from a singleton to a short curve is advantageous. To do so we enforce that the average projection distance d̃_i to a singleton y_i is less than √(λ1/α̃), which represents the expected spatial resolution, where α̃ = w̄_i/(4 d̃_i) is an approximation to the potential linear density. Thus we add a neighboring point to y_i if

d̃_i = ∑_{j∈I_i} w_j |x_j − y_i| > λ1/w̄_i.
Since this is based on an approximation, we also explicitly compute the posterior energy, to make sure that it has indeed decreased.
In summary, we have fast and simple ways to perform energy-decreasing steps for the k(y) term of the functional. Even when minimizers are expected to be connected, performing these steps changes the topological structure of the curve, keeping it in higher-density regions of the data, and consequently evading several potential local minima of the original functional (PPC).

4.3. Re-parametrization of y. In applying the algorithm described thus far, it may, and often does, occur that some regions of y are represented with fewer points y_i than others, even if an equal amount of data is projected to those regions. That is, there is nothing that forces the nodes y_i to be well spaced along the discretized curve. To address this, we introduce the criterion that l_i w̄_i be roughly constant for i = 1, ..., m, where l_i = (1/2) ∑ {|y_i − y_j| : j ∈ {i−1, i+1} ∩ [1, m]} and w̄_i is the total weight of the points projecting to y_i. This condition is motivated by finding, for fixed m, the optimal spacing of the y_i's that minimizes the fidelity term, under the assumption that the data are distributed with slowly changing density in a rectangular tube around a straight line y.

4.4. Criteria for well-resolved curves. Here we discuss criteria for when a curve can be considered well resolved with regard to the number of points m used to represent it. One would like to have an idea of what conditions give an acceptable degree of resolution, without requiring m to be too large and significantly increasing the computational time. We provide two such conditions. One is related to the objective of obtaining an accurate topological representation of the minimizer, specifically the number of components. In order to have confidence in recovering components at a scale λ2, the spacing between consecutive points on a discretized curve should be of the same scale. Thus we require that the average of the edge lengths is at most λ2/2. Another approach for determining the degree of resolution of a curve is to consider its curvature. One may calculate the average turning angle and require that it be less than some value (e.g. π/10). If λ2 is not small enough, the first condition will not guarantee small turning angles, and so we include this criterion as optional. We note that in light of the possible lack of regularity of minimizers [24] it would not be reasonable to limit the maximal possible turning angle.
4.5. Initialization. Finally, we discuss initialization. While the procedures described above enable the algorithm to evade undesirable local minima, they do not guarantee convergence to a global minimizer. For this reason initialization remains a relevant task. A natural idea that has been found to work well in practice is to initialize using disconnected segments or singletons randomly distributed in the data. The initial segments are then expected to grow and eventually connect along the data (under appropriate choices of λ1, λ2).

4.6. Further numerical examples. We present a couple of further computational examples which illustrate the behavior of the functionals and the algorithm.

Example 4.1. Noisy spiral. The data are generated as noisy samples of the spiral t ↦ (t cos(t), t sin(t)), t ∈ [3, 14], shown as a dashed line in Figure 7(a). More precisely, 2000 points are drawn uniformly along the spiral. To each of the points, noise drawn independently from the normal distribution 1.5 N_2(0, 1) is added. Figure 7(a) shows the computed minimizer of the (MPPC) functional. Figure 7(b) shows the result of the algorithm if splitting the curve were not allowed; in other words it shows the result of the energy-descent algorithm for the (PPC) functional. The initial curve (the same for both experiments) is the diagonal line, which is the first principal component of the data. The descent for (PPC) is attracted by a local minimum of (PPC). An advantage of (MPPC) is that this configuration is not a local minimum for (MPPC), since cuts are allowed. As seen in Figure 7(a), the descent for (MPPC) is able, via cutting and reconnecting, to converge to a curve which recovers the geometry of the dataset.
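For reference, data of the kind used in Example 4.1 can be generated as follows (our own sketch; the exact sampling along the spiral and the random seed of the experiment are of course not reproduced):

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.uniform(3.0, 14.0, size=2000)                  # parameters along the spiral
spiral = np.column_stack([t * np.cos(t), t * np.sin(t)])
data = spiral + 1.5 * rng.standard_normal((2000, 2))   # add 1.5 * N_2(0, 1) noise
```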
(a) Local minimizer of (MPPC)        (b) Local minimizer of (PPC)
Figure 7. Numerical results on data generated by a spiral (in purple) plus Gaussian noise. In panel (b), (PPC) is used to find the local minimizer given by the green curve, using the first principal component (blue) as initialization. In panel (a), (MPPC) is used to find the green curve, with critical linear density α* = .09. In both cases λ1 = 0.01.

Example 4.2. Noisy grid. Consider data with an underlying grid-like structure, shown in Figure 8. The data consist of 2400 points generated by four intersecting lines with Gaussian noise and 2400 points uniformly sampled from the background [0, 3] × [0, 3] × [−.75, .75]. The computed minimizer
consists of eight disjoint curves and a number of isolated points (small green dots). Our approach is not able to capture the intersections, due to the structure of the configurations we consider. Nevertheless, the algorithm succeeds in locating the one-dimensional structures. In essence the algorithm is able to identify a one-dimensional approximation of the data, albeit one where the global topology is not identified. It may be possible to use the approximation obtained as the input for approaches to recovering the correct topology of the data [23].
We observe that since the density of data in the background noise is relatively low, the computed minimizer approximates the data in the background by isolated points. For the parameters we used, this is predicted by the discussion of the density threshold in section 3. In other words, by choosing the parameters λ1 and λ2 so that the critical density threshold α* = (4/3)² λ1/λ2² is between the linear density of the background noise and the linear density of the data, the background noise will be represented by isolated points. That is, the proposed approach has some robustness to noise.
Figure 8. Computed minimizer on data generated from four intersecting lines forming a grid with Gaussian noise and background clutter in R3 , with λ1 = 7 × 10−4 , λ2 = 0.2. The minimizer consists of curves and isolated points (green). The larger blue dots represent the discretization (points yi ) of curves which are a part of the minimizer.
5. Discussion and Conclusions

In this paper, we proposed a new objective functional for finding one-dimensional structures in data which allows for representations made of several components. The functional introduced is based on the average-distance functional and can be seen as a regularization of principal curves. It penalizes the approximation error, the total length of the curves used, and the number of components.
We have investigated the relationship between data generated by a one-dimensional signal with noise, the parameters of the functional, and the minimizer. This provides guidance for the choice of parameters, and can further be used for multiscale representation of the data.
We have furthermore demonstrated that the zeroth-order term helps the energy-descent-based algorithms converge to desirable configurations. In particular, energy-descent-based approaches for (PPC) very often end up in undesirable local minima. The major reason for this is of a topological nature, namely the points on the approximate local minimizer "visit" the data points in an order which may be very different from the ordering of the data points on the (unknown) generating curve. The added flexibility of being able to split and reconnect the curves provides a way for such topological obstacles to be resolved.
Finally, we have developed a fast numerical algorithm for estimating minimizers of (MPPC). It has computational complexity O(mnd), where n is the number of data points in R^d, and m is the number of points along the approximating curve.
5.1. Relation to other approaches. We now briefly compare the proposed approach to other available approaches to finding one-dimensional structure in data. The original principal curves are prone to overfitting, as carefully explained in [15], and are difficult to compute numerically. The approach of Gerber and Whitaker [15] offers a more stable notion and an effective numerical implementation (in low dimensions). However the functional would still often overfit noisy data and there are many curves which minimize the functional (but do not represent the data well). We should mention that the experiments by the authors indicate that the algorithm often selects desirable minimizers. We also note that since the functional does not measure the approximation error, its low values are not a measure of how well the data are approximated. On the other hand, if the data are sampled from a smooth curve the minimizers can be expected to recover the curve exactly.
A number of works inspired by principal curves treat the problem by considering functionals which regularize the principal curves problem. Among them are the works of Tibshirani [26] (who introduces a squared-curvature penalization), Kegl, Krzyzak, Linder, and Zeger [18] (length constraint), Biau and Fischer [2] (length constraint), and Smola, Mika, Schölkopf, and Williamson [25] (a variety of penalizations, including penalizing length as in (PPC)). These works are in spirit similar to our approach. The work of Biau and Fischer [2] also discusses model-selection-based automated ways to choose the parameters of the functional for the specific data set. Wang and Lee [29] also use model selection to select parameters, but use a different way to ensure the regularity of the minimizer; namely, they model the points along the curve as an autoregressive series.
Regarding numerical approaches, early on Kegl, Krzyzak, Linder, and Zeger [18] proposed a polygonal-line algorithm that penalizes sharp angles. Feuersänger and Griebel employ sparse grids to minimize a functional similar to (PPC), with length-squared regularization [12] as in [25], for manifolds up to dimension three. While taking measures against overfitting data, these and other similar methods do not address the non-convexity of the associated minimization problem, resulting in performance that is very sensitive to the initialization of the algorithms. Verbeek, Vlassis and Kröse [28] approach this issue by iteratively inserting, fitting, and connecting line segments in the data. This approach is effective in some situations where others yield poor performance (e.g. a spiral in 2-d, some self-intersecting curves, and curvy data with little noise). However, in cases of higher noise the algorithm overfits if the number of segments is not significantly limited. A more precise understanding of the impact of the number of segments on the final configuration is still needed, despite some efforts to automatize the selection of this parameter [29].
A different class of approaches to finding one-dimensional structures is based on estimating the density function from the point cloud and finding the density ridges. The works of Ozertem and Erdogmus [21], Genovese, Perone-Pacifico, Verdinelli, and Wasserman [14], and Pulkkinen [22] take this approach. These approaches can often find the desired one-dimensional structure. Their limitation
is that typically many sample points are required, and that estimating the underlying density is a computationally expensive task, which limits the approach to low dimensions.
Recently there has been a significant effort to recover one-dimensional structures which are branching and intersecting, and in particular the connectivity network of the data set. The ability to recover graph structures and the topology of the data is very valuable, and facilitates a number of data analysis tasks. Several notable works are based on Reeb graphs and related structures [7, 13, 23]. We note that these approaches are sensitive to noise and furthermore the presence of noise significantly slows down the algorithms. We believe our approach and algorithm could be valuable as a pre-processing step for simplifying the data prior to applying graph-based approaches to find the connectivity network of the data set. For example, the computation in section 4.6 (Figure 8) illustrates how our approach correctly identifies and simplifies the one-dimensional structure present in the data (but does not identify the topological structure). The approaches mentioned should work much better on the simplified (green) data set than on the original point cloud.
Finally we mention the work of Arias-Castro, Donoho, and Huo [1], who studied optimal conditions and algorithms for detecting (sufficiently smooth) one-dimensional structures within uniform background noise.

Acknowledgements. We are grateful to Xin Yang Lu, Ryan Tibshirani, and Larry Wasserman for valuable discussions. The research for this work has been supported by the National Science Foundation under grants CIF 1421502 and DMS 1516677. We are furthermore thankful to the Center for Nonlinear Analysis (CNA) for its support.

Appendix A. Analysis of uniformly distributed data on a line segment

Consider data uniformly distributed with density α on a line segment [0, L]. The functional (MPPC) takes the form

(A.1)    E_µ^{λ1,λ2}(γ) := ∫_0^L d(x, γ)^p α dx + λ1 (L(γ) + λ2 k(γ))
where k(γ) is the number of components of γ minus 1. We restrict ourselves to γ such that {0, L} ⊂ range(γ), so that γ takes the form γ = ⋃_{i=1}^{k+1} [a_i, b_i], where a_1 = 0 and b_{k+1} = L. Define τ := ∑_{i=1}^{k+1} τ_i and g := ∑_{i=1}^{k} g_i, where τ_i := b_i − a_i and g_i := a_{i+1} − b_i. We make the following observations:

Lemma A.1. The energy E_µ^{λ1,λ2} is invariant under redistribution of the total length of γ, assuming that the number of components is k+1 and that the gap sizes remain constant. More precisely, if γ̄ = ⋃_{i=1}^{k+1} [ā_i, b̄_i], γ̃ = ⋃_{i=1}^{k+1} [ã_i, b̃_i] and there exists a permutation σ of {1, ..., k} such that ḡ_i = g̃_{σ(i)} for i = 1, ..., k, then E_µ^{λ1,λ2}(γ̄) = E_µ^{λ1,λ2}(γ̃).

Lemma A.2. For k > 0 fixed, the energy E_µ^{λ1,λ2} is minimized when the lengths of the gaps between the components are uniform. More precisely, consider an arbitrary γ = ⋃_{i=1}^{k+1} [a_i, b_i], with total gap g defined as above. Let γ̃ have k+1 components such that g̃_i = g/k, implying that g̃ = g. Then E_µ^{λ1,λ2}(γ̃) ≤ E_µ^{λ1,λ2}(γ), with equality only if g_i = g̃_i.

Proof. The result is trivial for k = 1. We prove the result for k = 2. Consider γ with g = g_1 + g_2. The fidelity part of the energy E_µ^{λ1,λ2}, as a function of g_1, is

F(g_1) = 2 ∫_0^{g_1/2} x^p α dx + 2 ∫_0^{(g−g_1)/2} x^p α dx.

Thus

dF/dg_1 = α ( (g_1/2)^p − ((g − g_1)/2)^p )
and

d²F/dg_1² = (pα/2) ( (g_1/2)^{p−1} + ((g − g_1)/2)^{p−1} ) ≥ 0.

By these we see that g_1 minimizes the energy if and only if g_1 = g/2 = g_2. The result for k > 2 follows since one can consider the above situation by looking at the gaps formed by three consecutive components.

Using Lemma A.1 we may assume that each component not containing the endpoints 0 or L has the same length l, and that the two components containing the endpoints are of length l/2. By Lemma A.2, the gaps between the components are (L − kl)/k. We first consider k > 0 fixed, and minimize the energy w.r.t. l in the range l ∈ [0, L/k). The energy

(A.2)    E = λ1 k l + 2k ∫_0^{(L−kl)/(2k)} x^p α dx + λ1 λ2 k
           = λ1 k l + (2k/(p+1)) ((L − kl)/(2k))^{p+1} α + λ1 λ2 k

is convex on [0, L/k). Taking a derivative in l we obtain

dE/dl = λ1 k − k ((L − kl)/(2k))^p α.

Setting the derivative to zero and solving for l, and noting that if there is no solution on [0, L/k) then E is a nondecreasing function of l, we get that the energy is minimized at
(A.3)
⎧ λ1 1/p L ⎪ ⎪ ⎪k − 2( α ) =⎨ ⎪ ⎪ ⎪ ⎩0
¯ = 1+ as we indicate above let k
L 2(
λ1 1/p ) α
k≤
if
L 1/p λ 2( α1 )
¯ =∶ k
else.
¯ plugging back into (A.2) we get . For k between 1 and k,
that the minimal energy is Emin (k) = λ1 L + λ1 λ2 k −
(A.4)
2p λ1 1/p ( ) k p+1 α
By direct inspection we verify that (A.4) is the (minimal) energy in the case that there is only one component (no breaks in the line). We note that (A.4) is linear in k, and hence for k between 0 and ¯ the minimizing value is at a boundary: k, ⎧ 2p λ1 1/p ⎪ (α) if λ2 ≥ p+1 ∗ ⎪0 k =⎨ ¯ ⎪ ⎪ ⎩⌊k⌋ otherwise ¯ when all components have length zero (l∗ = 0). The energy in this case is We now consider k > k k
El=0 (k) ∶=
2 L p+1 1 ( ) α + λ1 λ2 k p+1 2 kp
Considering k as a real variable we note that El=0 (k) is a convex function. Taking a derivative in k gives dEl=0 −2p L p+1 = ( ) α + λ1 λ2 . dk p + 1 2k If λ2 ≥
2p p+1
1/p
( λα1 )
then
dEl=0 dk
¯ ≥ 0 for k > k.
If λ2 < \frac{2p}{p+1} (\lambda_1/\alpha)^{1/p}, then setting the derivative to zero and solving for k gives

\bar{k}^*_{l=0} = \frac{L}{2} \left( \frac{2p\,\alpha}{(p+1)\lambda_1\lambda_2} \right)^{\frac{1}{p+1}},

which satisfies k̄*_{l=0} > k̄ and thus belongs to the range considered. If k̄*_{l=0} is an integer then it is the minimizer of the energy; otherwise the minimizer is in the set {⌊k̄*_{l=0}⌋, ⌊k̄*_{l=0}⌋ + 1}. In all cases let us denote by k*_{l=0} the minimizer of the energy: k*_{l=0} = arg min_{k = ⌊k̄*_{l=0}⌋, ⌈k̄*_{l=0}⌉} E_{l=0}(k).

We note that there is a special case in which k*_{l=0} < k̄. In that case the minimizer of the energy with exactly k*_{l=0} + 1 components will be the one considered in the analysis of the case 1 ≤ k ≤ k̄, and thus will have segments of positive length l given by formula (A.3). To summarize, the optimal number of components will be

(A.5)    \begin{cases} 1 & \text{if } \lambda_2 \ge \frac{2p}{p+1} \left( \frac{\lambda_1}{\alpha} \right)^{1/p}, \\ \lfloor \bar{k} \rfloor + 1 & \text{if } \lambda_2 < \frac{2p}{p+1} \left( \frac{\lambda_1}{\alpha} \right)^{1/p} \text{ and } k^*_{l=0} < \bar{k}, \\ k^*_{l=0} + 1 & \text{if } \lambda_2 < \frac{2p}{p+1} \left( \frac{\lambda_1}{\alpha} \right)^{1/p} \text{ and } k^*_{l=0} \ge \bar{k}. \end{cases}

In the first case there is just one single connected component. In the second case there are ⌊k̄⌋ + 1 components, each with equal positive length. We note that by Lemma A.1 there exists a configuration with the same energy where one of these components has positive length while the rest have zero length. In the third case each of the components has length zero. We point out that if k̄ is integer-valued and λ2 < \frac{2p}{p+1} (\lambda_1/\alpha)^{1/p}, then the minimizer will have k*_{l=0} + 1 components.

We can now draw conclusions about the structure of minimizers when L ≫ 1. From the above we conclude that the minimizer will have one component (and be a continuous line) if λ2 ≥ \frac{2p}{p+1} (\lambda_1/\alpha)^{1/p}, and will break up into at least ⌊k̄*_{l=0}⌋ + 1 components otherwise. Rearranging, the condition also provides the critical density at which topological changes (gaps) in minimizers occur:

(A.6)    \alpha^* = \left( \frac{2p}{p+1} \right)^{p} \frac{\lambda_1}{\lambda_2^{p}}.

Finally we note that the typical gap length is L/k̄*_{l=0}, that is,

(A.7)    L^* = 2 \left( \frac{(p+1)\lambda_1\lambda_2}{2p\,\alpha} \right)^{\frac{1}{p+1}}.
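As an illustrative check (not part of the original analysis), the short Python sketch below brute-forces the reduced energy (A.2) over the number of gaps k and the common segment length l, and compares the resulting optimal number of gaps with the closed-form prediction assembled from (A.3)–(A.5). The function names, parameter values, and grid resolution are arbitrary choices made for this example; only numpy is assumed.

```python
# Sketch: numerical check of the appendix formulas (illustrative; not from the paper).
# Brute-force the reduced energy (A.2) over the number of gaps k and segment length l,
# and compare with the closed-form prediction from (A.3)-(A.5). Parameters are arbitrary.
import numpy as np

def energy(k, l, L, alpha, lam1, lam2, p):
    """Reduced energy (A.2): k gaps (k+1 components), common segment length l, uniform gaps."""
    if k == 0:
        return lam1 * L                      # a single segment covering [0, L]
    gap_half = (L - k * l) / (2 * k)         # half-width of each of the k gaps
    fidelity = 2 * k * alpha * gap_half ** (p + 1) / (p + 1)
    return lam1 * k * l + fidelity + lam1 * lam2 * k

def predicted_gaps(L, alpha, lam1, lam2, p):
    """Optimal number of gaps k predicted by (A.3)-(A.5) (number of components is k+1)."""
    threshold = 2 * p / (p + 1) * (lam1 / alpha) ** (1 / p)
    if lam2 >= threshold:
        return 0                              # minimizer stays connected
    k_bar = L / (2 * (lam1 / alpha) ** (1 / p))
    k_l0_real = (L / 2) * (2 * p * alpha / ((p + 1) * lam1 * lam2)) ** (1 / (p + 1))
    k_l0 = int(min([np.floor(k_l0_real), np.ceil(k_l0_real)],
                   key=lambda k: energy(int(k), 0.0, L, alpha, lam1, lam2, p)))
    return int(np.floor(k_bar)) if k_l0 < k_bar else k_l0

L, alpha, lam1, lam2, p = 10.0, 1.0, 0.05, 0.1, 2

best_energy, best_k = min(
    (energy(k, l, L, alpha, lam1, lam2, p), k)
    for k in range(0, 200)
    for l in np.linspace(0.0, L / max(k, 1), 400))
print("brute-force optimal number of gaps:", best_k)
print("number of gaps predicted by (A.5):", predicted_gaps(L, alpha, lam1, lam2, p))
print("critical density alpha* from (A.6):", (2 * p / (p + 1)) ** p * lam1 / lam2 ** p)
```

For the sample parameters above the brute-force search and the closed form agree on the optimal number of gaps; raising the density α above the printed α* (or increasing λ2 past the threshold in (A.5)) makes both report zero gaps, i.e. a single connected component.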
References

[1] E. Arias-Castro, D. L. Donoho, and X. Huo, Adaptive multiscale detection of filamentary structures in a background of uniform random points, Ann. Statist., 34 (2006), pp. 326–349.
[2] G. Biau and A. Fischer, Parameter selection for principal curves, Information Theory, IEEE Transactions on, 58 (2012), pp. 1924–1939.
[3] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn., 3 (2011), pp. 1–122.
[4] T. Broderick, B. Kulis, and M. Jordan, MAD-Bayes: MAP-based asymptotic derivations from Bayes, in Proceedings of The 30th International Conference on Machine Learning, 2013, pp. 226–234.
[5] G. Buttazzo, E. Oudet, and E. Stepanov, Optimal transportation problems with free Dirichlet regions, in Variational Methods for Discontinuous Structures, G. dal Maso and F. Tomarelli, eds., vol. 51 of Progress in Nonlinear Differential Equations and Their Applications, Birkhäuser Basel, 2002, pp. 41–65.
[6] G. Buttazzo and E. Stepanov, Optimal transportation networks as free Dirichlet regions for the Monge-Kantorovich problem, Annali della Scuola Normale Superiore di Pisa - Classe di Scienze, 2 (2003), pp. 631–678.
[7] F. Chazal and J. Sun, Gromov-Hausdorff approximation of filament structure using Reeb-type graph, in Proceedings of the Thirtieth Annual Symposium on Computational Geometry, SOCG'14, New York, NY, USA, 2014, ACM, pp. 491:491–491:500.
[8] P. Delicado, Another look at principal curves and surfaces, J. Multivariate Anal., 77 (2001), pp. 84–116.
[9] C. Dellacherie and P. Meyer, Probabilities and Potential A, North-Holland Mathematics Studies, Elsevier Science, 1979.
[10] T. Duchamp and W. Stuetzle, Geometric properties of principal curves in the plane, in Robust statistics, data analysis, and computer intensive methods (Schloss Thurnau, 1994), vol. 109 of Lecture Notes in Statist., Springer, New York, 1996, pp. 135–152.
[11] E. Esser, Applications of Lagrangian-based alternating direction methods and connections to split Bregman, CAM Report, 31 (2009).
[12] C. Feuersänger and M. Griebel, Principal manifold learning by sparse grids, Computing, 85 (2009), pp. 267–299.
[13] X. Ge, I. I. Safa, M. Belkin, and Y. Wang, Data skeletonization via Reeb graphs, in Advances in Neural Information Processing Systems 24, Curran Associates, Inc., 2011, pp. 837–845.
[14] C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman, Nonparametric ridge estimation, Ann. Statist., 42 (2014), pp. 1511–1545.
[15] S. Gerber and R. Whitaker, Regularization-free principal curve estimation, The Journal of Machine Learning Research, 14 (2013), pp. 1285–1302.
[16] T. Goldstein and S. Osher, The split Bregman method for L1-regularized problems, SIAM J. Imaging Sci., 2 (2009), pp. 323–343.
[17] T. Hastie and W. Stuetzle, Principal curves, J. Amer. Statist. Assoc., 84 (1989), pp. 502–516.
[18] B. Kegl, A. Krzyzak, T. Linder, and K. Zeger, Learning and design of principal curves, Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22 (2000), pp. 281–297.
[19] B. Kulis and M. Jordan, Revisiting k-means: New algorithms via Bayesian nonparametrics, in Proceedings of the 29th International Conference on Machine Learning (ICML-12), J. Langford and J. Pineau, eds., ICML '12, New York, NY, USA, July 2012, Omnipress, pp. 513–520.
[20] X. Y. Lu and D. Slepčev, Average-distance problem for parameterized curves, ESAIM: COCV, (2015).
[21] U. Ozertem and D. Erdogmus, Locally defined principal curves and surfaces, The Journal of Machine Learning Research, 12 (2011), pp. 1249–1286.
[22] S. Pulkkinen, Ridge-based method for finding curvilinear structures from noisy data, Computational Statistics & Data Analysis, 82 (2015), pp. 89–109.
[23] G. Singh, F. Memoli, and G. Carlsson, Topological methods for the analysis of high dimensional data sets and 3D object recognition, in Eurographics Symposium on Point-Based Graphics, 2007, pp. 91–100.
[24] D. Slepčev, Counterexample to regularity in average-distance problem, Annales de l'Institut Henri Poincare (C) Non Linear Analysis, 31 (2014), pp. 169–184.
[25] A. J. Smola, S. Mika, B. Schölkopf, and R. C. Williamson, Regularized principal manifolds, J. Mach. Learn. Res., 1 (2001), pp. 179–209.
[26] R. Tibshirani, Principal curves revisited, Stat. Comput., 2 (1992), pp. 182–190.
[27] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight, Sparsity and smoothness via the fused lasso, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 (2005), pp. 91–108.
[28] J. Verbeek, N. Vlassis, and B. Krose, A k-segments algorithm for finding principal curves, Pattern Recognition Letters, 23 (2002), pp. 1009–1017.
[29] H. Wang and T. C. Lee, Automatic parameter selection for a k-segments algorithm for computing principal curves, Pattern Recognition Letters, 27 (2006), pp. 1142–1150.

Department of Mathematical Sciences, Carnegie Mellon University, Pittsburgh, PA, 15213, USA.
E-mail address:
[email protected],
[email protected]