Convergence Properties of Normalized Random Incremental Gradient Algorithms for Least-Squares Source Localization

Michael Rabbat
Electrical and Computer Engineering, McGill University, Montréal, Canada
Email: [email protected]

Angelia Nedić
Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
Email: [email protected]

Abstract—We consider the problem of localizing a single source using received signal strength measurements gathered at a number of sensors. We assume that the measurements follow the standard path loss model and are corrupted by additive white Gaussian noise. Under this model, the maximum likelihood solution to the source localization problem involves solving a non-linear least-squares optimization problem. We study the convergence properties of a normalized incremental gradient method for solving this problem. Remarkably, despite the fact that the problem is non-convex, the normalized incremental gradient method generates a sequence of iterates which are attracted to the global optimum under mild conditions.
I. PROBLEM STATEMENT

We consider the problem of localizing a source using received signal strength measurements gathered from a number of sensors; see [6] for a survey. A single source located at coordinates θ* ∈ R² emits a signal, and we would like to estimate the source position given received signal strength measurements obtained by sensors scattered around the source. The received signal strength follows the standard path loss model with amplitude α > 0 and path loss coefficient β > 0. The measurement y_j taken by a sensor at position x_j ∈ R² is Gaussian with mean

$$\mu_j = \frac{\alpha}{d_0 + \|\theta^* - x_j\|^\beta} \tag{1}$$

and variance σ², where d_0 > 0 is a reference parameter (so that the mean remains finite if x_j coincides with θ* for some j). We assume that there are n sensors at known locations x_1, ..., x_n, and each takes a measurement y_j according to the model described above. The maximum likelihood estimate is found by solving the least-squares optimization problem

$$\text{minimize } f(\theta) = \sum_{j=1}^{n} \left( \frac{\alpha}{d_0 + \|\theta - x_j\|^\beta} - y_j \right)^2 \quad \text{over } \theta \in \mathbb{R}^2. \tag{2}$$
Consider a term in the objective function corresponding to a single sensor,

$$f_j(\theta) = \left( \frac{\alpha}{d_0 + \|\theta - x_j\|^\beta} - y_j \right)^2. \tag{3}$$
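For concreteness, the model and objective in (1)–(3) can be written down in a few lines of Python. This is only an illustrative sketch; the helper names and the toy parameter values are ours, not from the paper:

```python
import numpy as np

def path_loss_mean(theta, x_j, alpha=100.0, beta=2.0, d0=1.0):
    """Mean received signal strength mu_j at a sensor located at x_j,
    per the path loss model (1)."""
    return alpha / (d0 + np.linalg.norm(theta - x_j) ** beta)

def f(theta, X, y, alpha=100.0, beta=2.0, d0=1.0):
    """Least-squares objective (2): sum of squared residuals over sensors."""
    return sum((path_loss_mean(theta, x_j, alpha, beta, d0) - y_j) ** 2
               for x_j, y_j in zip(X, y))

# Toy example: 3 sensors, noiseless measurements from a source at (2, 1).
theta_star = np.array([2.0, 1.0])
X = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 3.0])]
y = [path_loss_mean(theta_star, x_j) for x_j in X]

print(f(theta_star, X, y))  # 0.0 at the true source (noiseless measurements)
```

With noisy measurements the minimizer would no longer coincide exactly with θ*, which is the regime studied in the rest of the paper.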
Observe that f_j(θ) is constant over sets of the form {θ | ‖θ − x_j‖ = c} for c > 0. This property of the functions f_j is key to our development in this paper. Furthermore, since the property holds in any dimension, we consider a more general setting in Section III, where we give a more precise characterization of the functions f_j(θ) for θ ∈ R^m with an arbitrary integer m ≥ 1.

In general, the objective f(θ) is not convex, so problem (2) is not a convex optimization problem. In [7], several gradient descent methods were simulated on a specific instance of problem (2), one of which was the normalized random incremental gradient method; for a survey of incremental gradient methods see [1]. Remarkably, among these methods, the normalized random incremental gradient method always converged to the global minimum of problem (2). The goal of this paper is to establish the theoretical convergence properties of the normalized random incremental gradient descent method for this family of problems.

II. SIMULATIONS WITH OPTIMIZATION METHODS

We report simulations with several gradient-type methods for solving problem (2). We consider an example with n = 25 sensors and model parameters α = 100, β = 2, and d_0 = 1. The true location of the signal source is θ* = [75, 25]. Figure 1(a) shows log f(θ), the logarithm of the objective f(θ), as a function of θ. We show log f(θ) instead of f(θ) to emphasize the location of the optimal solution. The objective function f(θ) has a sharp drop near the minimum and sharp peaks at the location x_j of each sensor. Far away from these points, the function f(θ) is very flat. The ten X's in Figure 1(a) indicate the values of θ_0 used to initialize the methods discussed next.

To motivate this study, we try solving the optimization problem (2) using four different methods:
[Figure 1 appears here: panel (a) log f(θ) over θ; panel (b) normalized incremental gradient descent trajectories; panel (c) normalized incremental gradient descent cost f(θ_k) vs. iteration k.]
Fig. 1. (a) The log cost function log f (θ) as a function of θ for an example problem; the color corresponds to the value of log f (θ). The source is located at θ∗ = [75, 25], and measurements are taken from n = 25 sensors at random locations. The ten X’s indicate initial conditions θ0 used by different optimization algorithms. (b) No matter where the incremental gradient descent method is initialized, the algorithm (with constant step size) converges to (a neighborhood of) the globally optimal solution. (c) Cost as a function of iterations for the different initializations of the normalized incremental gradient descent method.
(1) Standard gradient descent on the objective function f(θ),

$$\theta_k = \theta_{k-1} - \gamma_k \nabla f(\theta_{k-1}),$$

with the stepsize γ_k obtained by a backtracking line search [5].

(2) The Levenberg–Marquardt trust-region method, a standard approach well suited to ill-conditioned nonlinear least-squares problems [5].

(3) The randomized incremental gradient method [3], [4],

$$\theta_k = \theta_{k-1} - \gamma_k \nabla f_{j_k}(\theta_{k-1}),$$

with a sensor j_k chosen uniformly i.i.d. at each iteration.

(4) Normalized randomized incremental gradient descent, similar to the previous method but with the gradient direction normalized whenever the gradient is non-zero; i.e.,

$$\theta_k = \begin{cases} \theta_{k-1} - \gamma_k \dfrac{\nabla f_{j_k}(\theta_{k-1})}{\|\nabla f_{j_k}(\theta_{k-1})\|} & \text{if } \nabla f_{j_k}(\theta_{k-1}) \neq 0, \\ \theta_{k-1} & \text{otherwise.} \end{cases}$$

Figures 1 and 2 show the results of running these four algorithms on the problem realization depicted in Fig. 1(a). In this problem there are n = 25 nodes and the target is located at θ* = [75, 25]. Each algorithm is executed ten times, with the different initializations indicated by the X's in Fig. 1(a). As seen in Fig. 2, the first three methods (gradient descent, Levenberg–Marquardt, and incremental gradient descent) in general converge to locally optimal solutions. This is not surprising, since the cost function is not convex. Surprisingly, the normalized incremental gradient descent method converges to the true target position regardless of where it is initialized. This behavior was first observed in [7]. In order to understand it, we embark on a theoretical analysis of the normalized random incremental gradient method for a more general form of the problem.

III. CLOSER LOOK AT THE COST FUNCTIONS

We begin by studying basic characteristics of the objective function. We consider here a general case where θ ∈ R^m for
some integer m ≥ 1. The cost function in (3) can be rescaled so as to have d_0 = 1 (we can do this by rescaling the variables x_j and θ with d_0, assuming d_0 > 0). We assume this has been done, so we are dealing with a function of the form

$$f_j(\theta) = \left( \frac{\alpha}{1 + \|\theta - x_j\|^\beta} - y_j \right)^2,$$

where α > 0 and β > 0. We can now pull α out as a common factor of the expression under the square, and thus write

$$f_j(\theta) = \alpha^2 \left( \frac{1}{1 + \|\theta - x_j\|^\beta} - \frac{y_j}{\alpha} \right)^2. \tag{4}$$

The gradient of f_j is given by

$$\nabla f_j(\theta) = -\frac{2\alpha^2 \beta (\theta - x_j)\|\theta - x_j\|^{\beta-2}}{\left(1 + \|\theta - x_j\|^\beta\right)^2} \left( \frac{1}{1 + \|\theta - x_j\|^\beta} - \frac{y_j}{\alpha} \right).$$

To find a set of local minima of this cost, we set the gradient of f_j(θ) to zero and obtain that ∇f_j(θ) = 0 if and only if

$$\frac{(\theta - x_j)\|\theta - x_j\|^{\beta-2}}{\left(1 + \|\theta - x_j\|^\beta\right)^2} \left( \frac{1}{1 + \|\theta - x_j\|^\beta} - \frac{y_j}{\alpha} \right) = 0. \tag{5}$$

Thus, it follows that ‖∇f_j(θ)‖ = 0 if and only if

$$\frac{\|\theta - x_j\|^{\beta-1}}{\left(1 + \|\theta - x_j\|^\beta\right)^2} \left| \frac{1}{1 + \|\theta - x_j\|^\beta} - \frac{y_j}{\alpha} \right| = 0. \tag{6}$$

For β ≥ 2, the gradient vanishes at θ = x_j, and it also vanishes at points θ satisfying ‖θ − x_j‖^β = α/y_j − 1, provided that α/y_j − 1 ≥ 0. For the remainder of this paper we focus on the particular case β = 2; most of the results below are easily generalized to the setting β > 1. We have

$$\nabla f_j(\theta) = -\alpha^2 \frac{4(\theta - x_j)}{\left(1 + \|\theta - x_j\|^2\right)^2} \left( \frac{1}{1 + \|\theta - x_j\|^2} - \frac{y_j}{\alpha} \right). \tag{7}$$
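The stationary points identified above are easy to verify numerically for β = 2. The snippet below is our own illustrative sketch (toy parameter values; `grad_fj` implements (7)):

```python
import numpy as np

def grad_fj(theta, x_j, y_j, alpha):
    """Gradient (7) of f_j for beta = 2 (after rescaling so that d0 = 1)."""
    diff = theta - x_j
    denom = 1.0 + diff @ diff
    return -alpha**2 * 4.0 * diff / denom**2 * (1.0 / denom - y_j / alpha)

alpha, y_j = 100.0, 20.0        # alpha / y_j = 5 > 1, so the circle C_j exists
x_j = np.array([1.0, 2.0])
r = np.sqrt(alpha / y_j - 1.0)  # radius of C_j: ||theta - x_j|| = 2

on_circle = x_j + r * np.array([np.cos(0.7), np.sin(0.7)])  # arbitrary direction
print(np.linalg.norm(grad_fj(x_j, x_j, y_j, alpha)))        # 0: vanishes at x_j
print(np.linalg.norm(grad_fj(on_circle, x_j, y_j, alpha)))  # ~0: vanishes on C_j
print(np.linalg.norm(grad_fj(x_j + 1.0, x_j, y_j, alpha)))  # nonzero elsewhere
```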
[Figure 2 appears here: trajectories for (a) gradient descent on f(θ) = Σ_{i=1}^n f_i(θ), (b) Levenberg–Marquardt, and (c) incremental gradient descent, with the corresponding costs f(θ_k) vs. iteration k in panels (d) gradient descent, (e) Levenberg–Marquardt, and (f) incremental gradient descent.]
Fig. 2. Trajectories and costs obtained by the gradient descent, Levenberg-Marquardt, and incremental gradient methods for the example problem shown in Figure 1(a). All three of these methods get trapped at local minima unless the algorithm is initialized within a basin of attraction near the global solution.
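The normalized method, which is the focus of the rest of the paper, is only a few lines of code. The following sketch is our own illustration of the update in method (4) (helper names are ours; the gradient is written for the β = 2 case via the chain rule):

```python
import numpy as np

def grad_fj(theta, x_j, y_j, alpha=100.0, d0=1.0):
    """Gradient of f_j(theta) = (alpha/(d0 + ||theta - x_j||^2) - y_j)^2,
    i.e., the beta = 2 case, computed via the chain rule."""
    diff = theta - x_j
    denom = d0 + diff @ diff
    resid = alpha / denom - y_j
    return 2.0 * resid * (-2.0 * alpha * diff / denom**2)

def normalized_incremental_step(theta, X, y, gamma, rng):
    """One iteration of method (4): sample a sensor uniformly at random and
    step a distance gamma along the negative normalized gradient."""
    j = rng.integers(len(X))
    g = grad_fj(theta, X[j], y[j])
    ng = np.linalg.norm(g)
    if ng == 0.0:
        return theta  # gradient vanished: stay put
    return theta - gamma * g / ng
```

Because the direction is normalized, every step has length exactly γ_k (whenever the sampled gradient is nonzero), which is the only difference from method (3).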
In fact, by examining the gradient ∇f_j(θ) and the Hessian ∇²f_j(θ) more carefully, we obtain the following result.

Lemma 1: Under the assumption that β = 2, the following statements hold.
(a) All stationary points of the function f_j are global minima or global maxima.
(b) If α/y_j ≤ 1, then θ = x_j is the unique minimum of f_j.
(c) If α/y_j > 1, then θ = x_j is the unique maximum of f_j, and the surface C_j given by

$$C_j = \left\{ \theta \;\middle|\; \|\theta - x_j\|^2 = \frac{\alpha}{y_j} - 1 \right\} \tag{8}$$

is the set of minima of f_j.

The proof of Lemma 1 is omitted due to space constraints; the results follow by direct analysis of the gradient and Hessian, and a proof will be included in an extended version of this work.

IV. GENERAL PROBLEM AND EXISTENCE OF SOLUTIONS

Based on the discussion of the preceding section, we now assume that d_0 = 1 and, rescaling the functions by 1/α², consider the problem

$$\text{minimize } f(\theta) = \frac{1}{n} \sum_{j=1}^{n} \left( \frac{1}{1 + \|\theta - x_j\|^2} - \frac{y_j}{\alpha} \right)^2 \quad \text{over } \theta \in \mathbb{R}^m. \tag{9}$$
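Lemma 1(c) can be checked numerically: with α/y_j > 1, the component cost (4) vanishes on C_j and is locally maximal at x_j. An illustrative sketch with our own toy values:

```python
import numpy as np

def fj(theta, x_j, y_j, alpha):
    """Component cost (4) with beta = 2 and d0 = 1."""
    return alpha**2 * (1.0 / (1.0 + np.sum((theta - x_j)**2)) - y_j / alpha)**2

alpha, y_j = 100.0, 20.0        # alpha / y_j = 5 > 1, as in Lemma 1(c)
x_j = np.array([0.0, 0.0])
r = np.sqrt(alpha / y_j - 1.0)  # radius of the circle C_j

print(fj(x_j + np.array([r, 0.0]), x_j, y_j, alpha))  # 0.0: C_j is the set of minima
print(fj(x_j, x_j, y_j, alpha))                       # (alpha - y_j)^2 = 6400 (up to rounding)
print(fj(x_j, x_j, y_j, alpha) > fj(x_j + 0.05, x_j, y_j, alpha))  # True: local max at x_j
```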
We work assuming that the following condition holds.

Assumption 2: We have α/y_j > 1 for all j = 1, ..., n.

This assumption holds when the RSS measurement at every node is below the transmitted amplitude α. According to Lemma 1, under this assumption the component f_j(θ) achieves its global maximum value at θ = x_j, and it achieves its global minimum value at all points on the surface C_j. Let f* and Θ* denote the optimal value and the solution set of problem (9):

$$f^* = \inf_{\theta \in \mathbb{R}^m} f(\theta), \qquad \Theta^* = \left\{ \theta^* \;\middle|\; f(\theta^*) = \min_{\theta \in \mathbb{R}^m} f(\theta) \right\}.$$
We now show that a global minimum of problem (9) exists. In the result, we use the notion of the convex hull of a collection of sets, defined as follows. For a collection of sets X_j, j = 1, ..., n, the convex hull X of the collection is given by

$$X = \left\{ \sum_{j=1}^{n} \lambda_j x_j \;\middle|\; x_j \in X_j,\ \lambda_j \geq 0,\ \sum_{j=1}^{n} \lambda_j = 1 \right\}.$$
We have the following result for the solutions of problem (9).

Theorem 3: Let Assumption 2 hold. Then, the set Θ* of global minima of problem (9) is not empty. Furthermore, Θ* is contained in the convex hull of the collection {C_j | j = 1, ..., n} of the surfaces given in (8).
Sketch of proof: Let C be the convex hull of the collection {C_j | j = 1, ..., n}. Since each C_j is compact and there are finitely many of these sets, the convex hull C is a compact convex set (Proposition 1.3.2, pg. 39 of [2]). Since f is continuous, it attains a minimum over C at a point in C. We next show that points θ outside the set C are worse in terms of function value, i.e., the values of f there are larger than within the set C. Let θ̂ ∉ C, and let Π_C[θ̂] ∈ C be the projection of θ̂ on the set C, i.e.,

$$\min_{\theta \in C} \|\theta - \hat{\theta}\| = \|\Pi_C[\hat{\theta}] - \hat{\theta}\|.$$

The projection Π_C[θ̂] exists and is unique, as C is closed and convex. Using the projection property, the fact that x_j ∈ C for all j, and the fact that the circle C_j has radius sqrt(α/y_j − 1), we find that f_j(θ̂) > f_j(Π_C[θ̂]) for all j. Summing these relations over j = 1, ..., n, we find that

$$f(\hat{\theta}) > f(\Pi_C[\hat{\theta}]).$$

Thus, the function f always has smaller values within the set C than outside it, i.e.,

$$f^* = \inf_{\theta \in \mathbb{R}^m} f(\theta) = \min_{\theta \in C} f(\theta),$$
where the infimum over C is attained, as C is compact and f is continuous. This also shows that Θ* ⊆ C.

Lemma 4: Let Assumption 2 hold. Suppose that the surfaces C_j, j = 1, ..., n, as defined in (8), have a common point. Then, the set of global minima Θ* of problem (9) is given by Θ* = ∩_{j=1}^n C_j.

Proof: The result follows immediately from Lemma 1.

V. RANDOMIZED INCREMENTAL GRADIENT METHOD WITH NORMALIZED GRADIENTS

The randomized incremental subgradient method with normalized gradients seeks a solution to problem (9) as follows. Let θ_0 ∈ R^m denote the initial point, and let {γ_k} denote a sequence of non-negative scalar stepsize parameters. The method generates updates for k ≥ 1 using the recursion

$$\theta_k = \begin{cases} \theta_{k-1} - \gamma_k \dfrac{\nabla f_{j_k}(\theta_{k-1})}{\|\nabla f_{j_k}(\theta_{k-1})\|} & \text{if } \nabla f_{j_k}(\theta_{k-1}) \neq 0, \\ \theta_{k-1} & \text{otherwise.} \end{cases} \tag{10}$$

The random indices {j_k}_{k=1}^∞ are drawn independently and uniformly from the index set {1, ..., n}.

Lemma 5: Let Assumption 2 hold. Then, for the iterates of method (10) we have

$$\|\theta_k - \theta\|^2 = \|\theta_{k-1} - \theta\|^2 + \phi_{j_k,k}(\theta_{k-1}, \theta) + \gamma_k^2$$

for all θ ∈ R^m and all k ≥ 1, where the functions φ_{j,k}(ξ, θ) are given, for all j and all ξ, θ ∈ R^m, by

$$\phi_{j,k}(\xi, \theta) = \begin{cases} -2\gamma_k c_j(\xi, \theta) & \text{if } \xi \notin B_j, \\ 2\gamma_k c_j(\xi, \theta) & \text{if } \xi \in B_j \setminus (\{x_j\} \cup C_j), \\ 0 & \text{if } \xi \in C_j \cup \{x_j\}, \end{cases}$$

with c_j(ξ, θ) = ‖ξ − θ‖ cos∠(ξ − x_j, ξ − θ).
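Before turning to the proof, note that the relation in Lemma 5 is an exact algebraic identity whenever the sampled gradient is nonzero, so it can be sanity-checked numerically. The sketch below uses our own illustrative values, with the rescaled f_j of (9) and its gradient in closed form:

```python
import numpy as np

def grad_fj(theta, x_j, y_j, alpha):
    """Gradient of the rescaled f_j (beta = 2, d0 = 1)."""
    diff = theta - x_j
    denom = 1.0 + diff @ diff
    return -4.0 * diff / denom**2 * (1.0 / denom - y_j / alpha)

def phi(xi, theta, x_j, y_j, alpha, gamma):
    """phi_{j,k}(xi, theta) from Lemma 5, written via c_j(xi, theta)."""
    d = xi - x_j
    c = d @ (xi - theta) / np.linalg.norm(d)  # = ||xi - theta|| cos(angle)
    inside = d @ d < alpha / y_j - 1.0        # strictly inside the ball B_j
    return 2.0 * gamma * c if inside else -2.0 * gamma * c

rng = np.random.default_rng(0)
alpha, y_j, gamma = 100.0, 20.0, 0.1
x_j, theta, prev = (rng.normal(size=2) for _ in range(3))

g = grad_fj(prev, x_j, y_j, alpha)
new = prev - gamma * g / np.linalg.norm(g)  # one step of method (10)
lhs = (new - theta) @ (new - theta)
rhs = (prev - theta) @ (prev - theta) + phi(prev, theta, x_j, y_j, alpha, gamma) + gamma**2
print(abs(lhs - rhs))  # ~0: the identity holds up to floating-point error
```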
Proof: By the definition of the algorithm, when ∇f_{j_k}(θ_{k−1}) ≠ 0 we have, for any θ ∈ R^m,

$$\|\theta_k - \theta\|^2 = \|\theta_{k-1} - \theta\|^2 - 2\gamma_k \frac{\nabla f_{j_k}(\theta_{k-1})'(\theta_{k-1} - \theta)}{\|\nabla f_{j_k}(\theta_{k-1})\|} + \gamma_k^2. \tag{11}$$

For the function $f_j(\theta) = \left(\frac{1}{1 + \|\theta - x_j\|^2} - \frac{y_j}{\alpha}\right)^2$ we have

$$\nabla f_j(\theta) = -\frac{4(\theta - x_j)}{\left(1 + \|\theta - x_j\|^2\right)^2} \left( \frac{1}{1 + \|\theta - x_j\|^2} - \frac{y_j}{\alpha} \right). \tag{12}$$

Note that the gradient ∇f_j(θ) is zero when θ = x_j or θ lies on the surface C_j. To rewrite the iterate process more compactly, let B_j be the ball centered at x_j with radius r_j = sqrt(α/y_j − 1), i.e.,

$$B_j = \{\theta \in \mathbb{R}^m \mid \|\theta - x_j\|^2 \leq r_j^2\}.$$

The surface C_j is the boundary of the ball B_j, and the gradient of f_j vanishes when θ is at the center x_j of the ball B_j and when θ is on its boundary. For θ ≠ x_j and θ ∉ C_j, from (12) we have

$$\frac{\nabla f_j(\theta)}{\|\nabla f_j(\theta)\|} = -\frac{\theta - x_j}{\|\theta - x_j\|} \,\mathrm{sgn}\!\left( \frac{1}{1 + \|\theta - x_j\|^2} - \frac{y_j}{\alpha} \right) = \frac{\theta - x_j}{\|\theta - x_j\|} \,\mathrm{sgn}\!\left( \frac{y_j}{\alpha} - \frac{1}{1 + \|\theta - x_j\|^2} \right).$$

Therefore

$$\frac{\nabla f_j(\theta)}{\|\nabla f_j(\theta)\|} = \begin{cases} \dfrac{\theta - x_j}{\|\theta - x_j\|} & \text{when } \theta \notin B_j, \\[1ex] -\dfrac{\theta - x_j}{\|\theta - x_j\|} & \text{when } \theta \in B_j \setminus (\{x_j\} \cup C_j). \end{cases} \tag{13}$$

Then, the iterate relations in (10) and (11) can be represented compactly as

$$\|\theta_k - \theta\|^2 = \|\theta_{k-1} - \theta\|^2 + \phi_{j_k,k}(\theta_{k-1}, \theta) + \gamma_k^2, \tag{14}$$

where φ_{j,k}(ξ, θ) is defined for all j and all ξ, θ ∈ R^m by

$$\phi_{j,k}(\xi, \theta) = \begin{cases} -2\gamma_k \dfrac{(\xi - x_j)'(\xi - \theta)}{\|\xi - x_j\|} & \text{if } \xi \notin B_j, \\[1ex] 2\gamma_k \dfrac{(\xi - x_j)'(\xi - \theta)}{\|\xi - x_j\|} & \text{if } \xi \in B_j \setminus (\{x_j\} \cup C_j), \\[1ex] 0 & \text{if } \xi \in C_j \cup \{x_j\}. \end{cases}$$

Noting that

$$\frac{(\xi - x_j)'(\xi - \theta)}{\|\xi - x_j\|} = \|\xi - \theta\| \cos\angle(\xi - x_j, \xi - \theta),$$

we obtain the expression for φ_{j,k}(ξ, θ) as stated in the lemma. ∎

We next explore a property of the functions φ_{j,k}(ξ, θ) which will help us show convergence of the method.

Lemma 6: For the functions φ_{j,k} as given in Lemma 5,
[Figure 3 appears here.]

Fig. 3. Alignment of the center x_j of the ball B_j (with the surface C_j) with a point ξ and its projection π_j on C_j: on the left, the case when ξ is outside the ball; on the right, the case when ξ is inside the ball.
when ξ ∉ C_j ∪ {x_j}, we have

$$\phi_{j,k}(\xi, \Pi_{C_j}(\xi)) = -2\gamma_k \|\xi - \Pi_{C_j}(\xi)\|.$$

Proof: Consider first the case ξ ∉ B_j. Then, the points ξ, Π_{C_j}(ξ) and x_j are collinear, with Π_{C_j}(ξ) lying in the line segment [x_j, ξ] that connects x_j and ξ. Thus, the vectors ξ − x_j and ξ − Π_{C_j}(ξ) are collinear and form a 0° angle (see Figure 3), implying that cos∠(ξ − x_j, ξ − Π_{C_j}(ξ)) = 1. Now consider the case ξ ∈ B_j but ξ ∉ C_j ∪ {x_j}. Then, again the points ξ, Π_{C_j}(ξ) and x_j are collinear, but in this case the point ξ lies in the line segment [x_j, Π_{C_j}(ξ)]. Therefore, ξ − x_j and ξ − Π_{C_j}(ξ) are collinear and form a 180° angle, yielding cos∠(ξ − x_j, ξ − Π_{C_j}(ξ)) = −1. ∎

At this point we define the σ-field associated with the past history of the method:

$$\mathcal{F}_k = \{\theta_0, j_1, \ldots, j_k\} \quad \text{for } k \geq 1,$$
and F_0 = {θ_0}. We can combine Lemma 5 and Lemma 6 to obtain the following result.

Lemma 7: Let Assumption 2 hold, and assume that the points x_j, j = 1, ..., n, are distinct. Then, given the past F_{k−1}, we have with probability 1

$$\mathbb{E}\left[ \sum_{j=1}^{n} \mathrm{dist}^2(\theta_k, C_j) \;\middle|\; \mathcal{F}_{k-1} \right] \leq \sum_{j=1}^{n} \mathrm{dist}^2(\theta_{k-1}, C_j) - 2\gamma_k \sum_{j=1}^{n} \mathrm{dist}(\theta_{k-1}, C_j) + \gamma_k^2$$

when θ_{k−1} ∉ ∩_{j=1}^n C_j, and θ_k = θ_{k−1} if θ_{k−1} ∈ ∩_{j=1}^n C_j.

Using Lemma 7, we obtain the following convergence guarantee for the randomized incremental gradient method with normalized gradients.

Theorem 8: Let Assumption 2 hold, assume that the points x_j, j = 1, ..., n, are distinct, and assume that ∩_{j=1}^n C_j is non-empty. Assume further that the stepsize sequence {γ_k} satisfies

$$\sum_{k=1}^{\infty} \gamma_k = \infty, \qquad \sum_{k=1}^{\infty} \gamma_k^2 < \infty.$$

Then the sequence {θ_k} generated by method (10) converges to an optimal solution with probability 1.
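To illustrate Theorem 8, the sketch below runs method (10) on a small noiseless instance (so that the circles C_j all pass through θ* and their intersection is non-empty) with the diminishing stepsize γ_k = 5/(k + 10), which satisfies both stepsize conditions. The instance and all parameter values are our own illustrative choices, not from the paper:

```python
import numpy as np

def grad_fj(theta, x_j, y_j, alpha):
    """Gradient of the rescaled component cost (beta = 2, d0 = 1)."""
    diff = theta - x_j
    denom = 1.0 + diff @ diff
    return -4.0 * diff / denom**2 * (1.0 / denom - y_j / alpha)

# Noiseless instance: every C_j passes through theta_star, so the
# intersection of the circles is non-empty (here exactly {theta_star}).
alpha = 100.0
theta_star = np.array([7.0, 3.0])
X = [np.array(p, dtype=float) for p in [(0, 0), (10, 0), (0, 10), (10, 10), (5, 1)]]
y = [alpha / (1.0 + np.sum((theta_star - x) ** 2)) for x in X]

rng = np.random.default_rng(1)
theta = np.array([0.5, 0.5])
for k in range(1, 30001):
    j = rng.integers(len(X))           # uniform random sensor index j_k
    g = grad_fj(theta, X[j], y[j], alpha)
    ng = np.linalg.norm(g)
    if ng > 0.0:
        theta = theta - (5.0 / (k + 10)) * g / ng  # gamma_k = 5/(k+10)

print(np.linalg.norm(theta - theta_star))  # small: the iterates approach theta*
```

With measurement noise the circles no longer share a common point, which is exactly the regime discussed in the conclusion below.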
The proof of Theorem 8 leverages Lemma 7 and the Supermartingale Convergence Theorem; see, e.g., [4] for a similar convergence proof for the (non-normalized) randomized incremental subgradient method. Similarly, if a constant stepsize is used, then we obtain convergence to within a region of the optimal solution.

Theorem 9: Let Assumption 2 hold, assume that the points x_j, j = 1, ..., n, are distinct, and assume that ∩_{j=1}^n C_j is non-empty. Then, for the sequence {θ_k} generated by method (10) with the stepsizes γ_k fixed to some positive constant γ, with probability 1 we have

$$\inf_{k \geq 0} f(\theta_k) \leq f^* + \frac{\gamma n}{2}.$$

The proof of Theorem 9 again uses Lemma 7, along with a stopped supermartingale argument similar to that used in [4]. These arguments can be extended to upper-bound the number of iterations required to converge to the (γn/2)-neighborhood of the optimal value f*.

VI. CONCLUSION

We have proved that the randomized incremental gradient algorithm with normalized gradients converges to the globally optimal solution of the source localization problem (9) when ∩_{j=1}^n C_j is non-empty. The condition ∩_{j=1}^n C_j ≠ ∅ holds when the measurement noise variance σ² = 0, i.e., when there is no measurement noise. More generally, one of two cases holds: the intersection of the balls ∩_{j=1}^n B_j is non-empty, or it is empty. Our future work aims to characterize the solution set and the algorithm behavior in these two cases, in order to obtain a complete characterization of the normalized incremental subgradient algorithm for solving source localization problems.

ACKNOWLEDGMENT

A. Nedić gratefully acknowledges that funding for this work was provided by the National Science Foundation under grants CMMI 07-42538 and CCF 11-11342. M. Rabbat gratefully acknowledges support from the Québec Ministry of Finance and Economy (MDEIE) and from the Natural Sciences and Engineering Research Council of Canada.

REFERENCES

[1] D.P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. In S. Sra, S. Nowozin, and S.J. Wright, editors, Optimization for Machine Learning, pages 85–119. MIT Press, Cambridge, MA, 2012.
[2] D.P. Bertsekas, A. Nedić, and A.E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.
[3] Z.Q. Luo. On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward networks. Neural Computation, 3(2):226–245, 1991.
[4] A. Nedić and D.P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12:109–138, 2001.
[5] J. Nocedal and S.J. Wright. Numerical Optimization. Springer, 2nd edition, 2006.
[6] N. Patwari, J.N. Ash, S. Kyperountas, A.O. Hero III, R.L. Moses, and N.S. Correal. Locating the nodes: Cooperative localization in wireless sensor networks. IEEE Signal Processing Magazine, 22(4):54–69, 2005.
[7] M. Rabbat and R. Nowak. Decentralized source localization and tracking. In Proc. ICASSP, Montreal, Canada, May 2004.