A Linearly Convergent First-order Algorithm for Total Variation Minimization in Image Processing

Cong D. Dang∗, Kaiyu Dai†, Guanghui Lan‡

October 29, 2012

∗ Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL, 32611 (email: [email protected]). This author was partially supported by a doctoral fellowship from the Vietnam International Education Development and NSF grant CMMI-1000347.
† Software School, Fudan University, Shanghai, China, 201203 (email: [email protected]). This author was partially supported by a visiting scholar fellowship from the China Scholarship Council.
‡ Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL, 32611 (email: [email protected]). This author was partially supported by NSF grant CMMI-1000347.
Abstract

We introduce a new formulation for total variation minimization in image denoising. We also present a linearly convergent first-order method for solving this reformulated problem and show that it possesses a nearly dimension-independent iteration complexity bound.

Keywords: Image denoising, Total variation, First-order methods, Complexity, Linear rate of convergence
1 Introduction
The restoration of images contaminated by noise is a fundamental problem in biomedical image processing and plays an important role in certain diagnostic techniques such as Magnetic Resonance Imaging (MRI) and functional Magnetic Resonance Imaging (fMRI). In 1992, Rudin, Osher and Fatemi (ROF) [17] proposed an influential optimization approach for image denoising by minimizing the total variation (TV). It turns out that the ROF model can preserve edges and important features in the original image.

In this paper we propose an alternative formulation (or relaxation) for minimizing the total variation, which leads to denoising quality comparable to the classical ROF model. Moreover, we show that the relaxed model can be solved very efficiently. In particular, we present a linearly convergent first-order algorithm for solving this new model, and demonstrate that it possesses an O(ln(1/ε)) iteration complexity for achieving the target accuracy ε. Since the aforementioned iteration complexity bound is almost dimension-independent and the iteration cost only depends linearly on the dimension, the total arithmetic complexity of our algorithm is O(N² ln(1/ε)) for processing an N × N image. Hence, our approach is scalable to very large-scale image denoising problems. By contrast, most existing approaches for solving the original ROF model are based on an equivalent dual or primal-dual formulation (see, e.g., Chan et al. [5], Chambolle [3], Beck and Teboulle [1, 2],
Chambolle and Pock [4]). All these algorithms converge sublinearly for solving the ROF model, and the best iteration complexity bound is given by O(1/√ε) [1, 4]. Moreover, these complexity results heavily depend on the dimension of the problem, the selection of the initial point and certain regularization parameters. We also refer to other algorithms recently developed for solving the ROF model (e.g., [6, 8, 7, 9, 12, 13, 19]) and references therein.

This paper is organized as follows. We review the classical ROF model and present a new strongly convex composite reformulation (or relaxation) for the ROF model in Section 2. An efficient algorithm for solving the reformulation is presented and analyzed in Section 3. We then report some promising numerical results and biomedical applications in Section 4. All the technical proofs are given in the Appendix.
2 A strongly convex composite reformulation for total variation minimization
In this section, we review the classical ROF model for image denoising and present a novel reformulation for it. We also show how these two TV-based models for image denoising are related.

For the sake of simplicity, let us assume that the images are 2-dimensional with N × N pixels. For any image u ∈ R^{N×N}, the discretized gradient operator ∇u is defined as

  (∇u)_{i,j} := ((∇u)¹_{i,j}, (∇u)²_{i,j}), ∀ i, j = 1, . . . , N,

where

  (∇u)¹_{i,j} := u_{i+1,j} − u_{i,j} if i < N, and 0 if i = N,   (2.1)
  (∇u)²_{i,j} := u_{i,j+1} − u_{i,j} if j < N, and 0 if j = N.
Then, the classical total variation minimization problem is given by

  ū = argmin_u { φ(u) := T(u) + (λ/2)‖u − f‖² },   (2.2)
where λ > 0 is a user-defined parameter, f is the observed noisy image and

  T(u) := Σ_{i,j} ‖( (∇u)¹_{i,j}, (∇u)²_{i,j} )‖.

Observe that the norm ‖·‖ in the definition of T(·) (and hereafter) can be either the l2 or the l1 norm. If ‖·‖ = ‖·‖₂, then problem (2.2) is exactly the original ROF model. It can be easily seen that the objective function φ(u) in (2.2) is a nonsmooth strongly convex function. It is known that oracle-based convex optimization techniques would require O(1/ε) iterations to find an ε-solution of (2.2), i.e., a point û such that φ(û) − φ* ≤ ε (see [10]). It has recently been shown that the above iteration complexity can be significantly improved to O(1/√ε) by using a dual or saddle point reformulation of (2.2) (e.g., [2, 4, 15]). Note, however, that all these algorithms converge sublinearly, and that their performance also heavily depends on the dimension N² and the selection of starting points.

In order to address these issues, we consider an alternative formulation of problem (2.2). The basic idea is to introduce an extra variable d ∈ R^{2N(N−1)}, which corresponds to the nonzero components
of the gradient operator ∇u, and then to impose the following set of constraints:

  d¹_{i,j} = u_{i+1,j} − u_{i,j}, ∀ i = 1, . . . , N − 1; j = 1, . . . , N,
  d²_{i,j} = u_{i,j+1} − u_{i,j}, ∀ i = 1, . . . , N; j = 1, . . . , N − 1.
Observe that the above constraints can be written in matrix form as

  Eu + d = 0,   (2.3)

where Eᵀ is a network flow matrix with N² nodes and 2N(N − 1) arcs, with each node having degree at most 4; every row of E contains exactly one entry equal to 1 and one entry equal to −1, i.e.,

      [  1  −1   0   0  ···  ···   0 ]
      [  0   1  −1   0  ···  ···   0 ]
  E = [  ⋮                ⋱        ⋮ ]
      [ −1   0   0  ···   1  ···   0 ]
      [  0   0   0  ···   1  ···  −1 ]
Now, denoting

  T̃(d) := Σ_{i,j} ‖( d¹_{i,j}, d²_{i,j} )‖,

we consider the following optimization problem:

  (u*, d*) = argmin_{u,d} { φ̃(u, d) := T̃(d) + (λ/2)[ ‖u − f‖² + q‖Eu + d‖² ] }   (2.4)
for some parameters λ, q > 0. Similar to φ(u) in (2.2), the new objective function φ̃(u, d) is also a nonsmooth strongly convex function. While the non-separable nonsmooth term T(·) makes problem (2.2) difficult to solve, the nonsmooth term T̃(·) in (2.4) is separable with respect to (d¹_{i,j}, d²_{i,j}). This fact will enable us to design a very efficient algorithm for solving problem (2.4) (see Section 3.2).

We would also like to provide some intuitive explanation of the reformulation in (2.4). Observe that both terms T̃(d) and ‖Eu + d‖² can be viewed as certain regularization terms. The first term T̃(d) enforces the sparsity of the vector d, i.e., the estimated gradient vector, and thus helps to smooth the recovered image, while the latter term ‖Eu + d‖² takes into account that the computation of d is not exact because of the stochastic noise. Introducing this extra regularization term into the optimization problem protects the image from being oversmoothed, as might happen for the original formulation in (2.2) (see Section 4).

It is interesting to observe some relations between problems (2.2) and (2.4).

Proposition 1 Let φ* and φ̃* be the optimal values of (2.2) and (2.4), respectively. We have

  φ̃* ≤ φ* ≤ φ̃* + N²/(2λq).   (2.5)
It follows from Proposition 1 that, for given λ and N, the parameter q in (2.4) should be large enough in order to approximately solve the original problem (2.2). Observe, however, that our goal is not to solve problem (2.2), but to recover the contaminated image. Due to the aforementioned role that the extra regularization term ‖Eu + d‖ plays in (2.4), we argue that it is not necessary to choose a very large q. Indeed, we observe from our computational experiments that q can be set to 2 or 4 in most cases, and that selecting a much larger value of q seems to be actually harmful to image denoising.
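To make the two formulations concrete, here is a minimal NumPy sketch (ours, not part of the paper; the function names are hypothetical) that evaluates the discrete gradient (2.1), the total variation term, and the objectives φ in (2.2) and φ̃ in (2.4). Note that choosing d = −Eu, i.e., setting d equal to the forward differences of u, makes the coupling term vanish, so that φ̃(u, −Eu) = φ(u), in line with the first inequality of (2.5).

import numpy as np

def forward_diff(u):
    """Forward differences (2.1) without the zero boundary entries.
    Stacking (d1, d2) gives the vector d satisfying d = -Eu in (2.3)."""
    d1 = u[1:, :] - u[:-1, :]   # d1[i, j] = u[i+1, j] - u[i, j]
    d2 = u[:, 1:] - u[:, :-1]   # d2[i, j] = u[i, j+1] - u[i, j]
    return d1, d2

def tv_iso(d1, d2):
    """Isotropic T~(d): sum over pixels of the l2 norm of (d1_ij, d2_ij),
    with the boundary entries that are zero in (2.1) padded back in."""
    g1 = np.zeros((d1.shape[0] + 1, d1.shape[1])); g1[:-1, :] = d1
    g2 = np.zeros((d2.shape[0], d2.shape[1] + 1)); g2[:, :-1] = d2
    return np.sqrt(g1**2 + g2**2).sum()

def phi(u, f, lam):
    """Original ROF objective (2.2)."""
    d1, d2 = forward_diff(u)
    return tv_iso(d1, d2) + 0.5 * lam * ((u - f)**2).sum()

def phi_tilde(u, d1, d2, f, lam, q):
    """Reformulated objective (2.4); d - forward_diff(u) plays the role of Eu + d."""
    r1, r2 = forward_diff(u)
    coupling = ((d1 - r1)**2).sum() + ((d2 - r2)**2).sum()
    return tv_iso(d1, d2) + 0.5 * lam * (((u - f)**2).sum() + q * coupling)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f = rng.random((8, 8))              # a tiny synthetic "noisy image"
    d1, d2 = forward_diff(f)
    # with u = f and d = -Eu the two objectives coincide
    print(phi(f, f, lam=16.0), phi_tilde(f, d1, d2, f, lam=16.0, q=2.0))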
3 A linearly convergent algorithm for TV denoising
In the previous section, we reformulated the TVM problem as the minimization of the sum of a relatively simple nonsmooth convex function and a smooth strongly convex function. Our goal in this section is to show that such a reformulation can be solved efficiently. More specifically, we first present an accelerated gradient descent (AC-GD) method based on Nesterov's smooth optimal method [14, 16] for solving a general class of strongly convex composite optimization problems. Then, we show that, by using this algorithm, one can solve problem (2.4) in O(√q ln(1/ε)) iterations. Our algorithm can be viewed as a variant of the well-known FISTA algorithm by Beck and Teboulle [1, 2]. However, since FISTA does not take advantage of the strong convexity of the problem, it possesses a much worse performance guarantee than the one mentioned above.
3.1 The accelerated gradient descent (AC-GD) algorithm
Consider the following general composite problem:

  Ψ* := min_{x∈X} { Ψ(x) := ψ(x) + X(x) },   (3.6)

where X ⊆ Rⁿ is a closed convex set, X : X → R is a simple convex function, and ψ : X → R is smooth and strongly convex with Lipschitz continuous gradient, i.e., there exist L ≥ 0 and μ ≥ 0 such that

  (μ/2)‖y − x‖² ≤ ψ(y) − ψ(x) − ⟨ψ'(x), y − x⟩ ≤ (L/2)‖y − x‖², ∀ x, y ∈ X.   (3.7)
The following AC-GD algorithm for solving (3.6) maintains the updating of three intertwined sequences, namely {x_t}, {x_t^ag} and {x_t^md}, at each iteration t. All these types of multi-step gradient algorithms originate from Nesterov's seminal work in [14] (see Tseng [18] for a summary). However, very few of these algorithms can make use of the special strongly convex composite structure in (3.6), except those in [10, 11].

The AC-GD method for strongly convex composite optimization.

Input: x_0 ∈ X, stepsize parameters {α_t}_{t≥1} and {γ_t}_{t≥1} s.t. α_1 = 1, α_t ∈ (0, 1) for any t ≥ 2, and γ_t ≥ 0 for any t ≥ 1.

0) Set the initial point x_0^ag = x_0 and t = 1;

1) Set

  x_t^md = [ (1 − α_t)(μ + γ_t) / (γ_t + μ(1 − α_t²)) ] x_{t−1}^ag + [ α_t(μ(1 − α_t) + γ_t) / (γ_t + μ(1 − α_t²)) ] x_{t−1};   (3.8)

2) Set

  x_{t−1}^+ = [ α_t μ / (μ + γ_t) ] x_t^md + [ ((1 − α_t)μ + γ_t) / (μ + γ_t) ] x_{t−1},   (3.9)
  x_t = argmin_{x∈X} { α_t⟨ψ'(x_t^md), x⟩ + α_t X(x) + ((μ + γ_t)/2)‖x_{t−1}^+ − x‖² },   (3.10)
  x_t^ag = α_t x_t + (1 − α_t) x_{t−1}^ag;   (3.11)

3) Set t ← t + 1 and go to step 1.

The AC-GD algorithm differs from a related accelerated stochastic approximation (AC-SA) algorithm for solving strongly convex composite optimization problems [10, 11] in the following aspects.
Firstly, the above algorithm is deterministic, while the one in [10] is stochastic. Secondly, the subproblem used to define xt in (3.10) is much simpler than the corresponding one in [10]. Finally, we show that the above simple AC-GD algorithm can achieve the optimal rate of convergence for solving strongly convex composite problems possessed by a more involved multi-stage algorithm in [11]. Theorem 2 below describes the main convergence properties of the above AC-GD algorithm. Theorem 2 Assume that {αt }t≥1 and {γt }t≥1 in the AC-GD algorithm are chosen such that
  μ + γ_t ≥ Lα_t²,   (3.12)
  γ_1/Γ_1 = γ_2/Γ_2 = . . . ,   (3.13)

where

  Γ_t := 1 if t = 1, and Γ_t := (1 − α_t)Γ_{t−1} if t ≥ 2.   (3.14)

Then we have, for any t ≥ 1,

  Ψ(x_t^ag) − Ψ* ≤ (Γ_t γ_1/2)‖x_0 − x*‖²,   (3.15)

where x* is an optimal solution of (3.6).

By properly choosing the stepsize parameters α_t and γ_t, we show that the above AC-GD algorithm can achieve the optimal rate of convergence for solving problem (3.6).

Corollary 3 Let {x_t^ag}_{t≥1} be computed by the AC-GD algorithm with

  α_t = max{ √(μ/L), 2/(t + 1) } and γ_t = 2LΓ_t,   (3.16)

where Γ_t is defined in (3.14). Then we have

  Ψ(x_t^ag) − Ψ* ≤ min{ (1 − √(μ/L))^{t−1}, 2/(t(t + 1)) } L ‖x_0 − x*‖², ∀ t ≥ 1.   (3.17)

Proof. The result follows by plugging the values of α_t and γ_t into (3.15) and noting that

  Γ_t ≤ min{ ∏_{τ=2}^t (1 − √(μ/L)), ∏_{τ=2}^t (1 − 2/(τ + 1)) } = min{ (1 − √(μ/L))^{t−1}, 2/(t(t + 1)) }.
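For concreteness, here is a minimal Python sketch (ours, not the authors' implementation) of the AC-GD iteration (3.8)-(3.11) with the stepsize policy (3.16) of Corollary 3. The caller supplies the gradient of ψ and a routine solving the subproblem (3.10); in the toy example below that subproblem reduces to soft-thresholding because X is taken to be a (hypothetical) l1 term.

import numpy as np

def ac_gd(grad_psi, prox, x0, L, mu, iters):
    """Sketch of the AC-GD method for min psi(x) + X(x) in (3.6), with mu < L.

    grad_psi(x): gradient of the smooth, strongly convex component psi.
    prox(xplus, g, alpha, tau): returns
        argmin_x  alpha*(<g, x> + X(x)) + (tau/2)*||x - xplus||^2,
    i.e., the subproblem (3.10) with tau = mu + gamma_t.
    """
    x = x_ag = np.array(x0, dtype=float)
    Gamma = 1.0
    for t in range(1, iters + 1):
        alpha = 1.0 if t == 1 else max(np.sqrt(mu / L), 2.0 / (t + 1))   # (3.16)
        Gamma = 1.0 if t == 1 else (1.0 - alpha) * Gamma                 # (3.14)
        gamma = 2.0 * L * Gamma                                          # (3.16)
        den = gamma + (1.0 - alpha**2) * mu
        x_md = ((1 - alpha) * (mu + gamma) * x_ag
                + alpha * (mu * (1 - alpha) + gamma) * x) / den          # (3.8)
        x_plus = (alpha * mu * x_md
                  + ((1 - alpha) * mu + gamma) * x) / (mu + gamma)       # (3.9)
        x = prox(x_plus, grad_psi(x_md), alpha, mu + gamma)              # (3.10)
        x_ag = alpha * x + (1 - alpha) * x_ag                            # (3.11)
    return x_ag

if __name__ == "__main__":
    # toy instance: psi(x) = 0.5*x'Dx - b'x with D = diag(1, 4), so mu = 1, L = 4,
    # and X(x) = 0.1*||x||_1, for which the prox step is a soft-thresholding.
    D, b = np.array([1.0, 4.0]), np.array([2.0, -1.0])
    grad = lambda x: D * x - b
    def prox(xplus, g, alpha, tau):
        z = xplus - (alpha / tau) * g
        return np.sign(z) * np.maximum(np.abs(z) - 0.1 * alpha / tau, 0.0)
    print(ac_gd(grad, prox, np.zeros(2), L=4.0, mu=1.0, iters=100))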
3.2 The AC-GD algorithm for total variation minimization
In this subsection, we discuss how to apply the above AC-GD algorithm to solve the reformulated TV minimization problem in (2.4).

First, observe that the objective function φ̃(·) in (2.4) can be written in the composite form, i.e., φ̃(u, d) = T̃(d) + ψ̃(u, d), where ψ̃(u, d) is given by

  ψ̃(u, d) := (λ/2) ‖ A (u; d) − (f; 0) ‖₂²,   A := [ I_u  0 ; √q E  √q I_d ].   (3.18)

Here I_u ∈ R^{N²×N²} and I_d ∈ R^{2N(N−1)×2N(N−1)} are identity matrices. Proposition 4 below summarizes some properties of ψ̃(·).
Proposition 4 The function ψ̃(·) in (3.18) is strongly convex with modulus

  μ_ψ̃ ≥ λ/(10 + 1/q).   (3.19)

Moreover, its gradient is Lipschitz continuous with constant

  L_ψ̃ ≤ λ(1 + 12q).   (3.20)
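Before moving on, here is a minimal NumPy sketch (ours, not from the paper) of how the gradient of ψ̃ in (3.18), namely λAᵀ(A(u; d) − (f; 0)), can be formed with a few array slices and without ever assembling the sparse matrix A; this is one way to realize the O(N²) gradient cost mentioned in the next paragraph. The pair (d1, d2) represents d as two difference fields.

import numpy as np

def apply_E(u):
    """E from (2.3): Eu stacks the negative forward differences, so that
    Eu + d = 0 recovers d1 = u[i+1,j]-u[i,j] and d2 = u[i,j+1]-u[i,j]."""
    return -(u[1:, :] - u[:-1, :]), -(u[:, 1:] - u[:, :-1])

def apply_ET(r1, r2):
    """Adjoint of apply_E (i.e., E transposed) acting on a pair of fields."""
    N = r1.shape[1]
    out = np.zeros((N, N))
    out[:-1, :] += r1; out[1:, :] -= r1
    out[:, :-1] += r2; out[:, 1:] -= r2
    return out

def grad_psi_tilde(u, d1, d2, f, lam, q):
    """Gradient of psi~ in (3.18): (lam*((u-f) + q*E'(Eu+d)), lam*q*(Eu+d))."""
    e1, e2 = apply_E(u)
    r1, r2 = e1 + d1, e2 + d2            # the residual Eu + d
    gu = lam * ((u - f) + q * apply_ET(r1, r2))
    return gu, lam * q * r1, lam * q * r2

The identity blocks I_u and I_d in (3.18) never need to be formed explicitly.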
In view of the composite structure of φ̃(·) and Proposition 4, we can apply the AC-GD algorithm for solving problem (2.4). Moreover, since A is very sparse, the computation of the gradient of ψ̃(·) only takes O(N²) arithmetic operations.

Second, it is worth noting that the subproblem (3.10) arising from the AC-GD method applied to problem (2.4) is easy to solve. Indeed, the subproblem (3.10) is given in the form of

  (u, d)_t = argmin_{u,d} { ⟨c, d⟩ + T̃(d) + (p/2)‖d − d⁺_{t−1}‖₂² + ⟨c̃, u⟩ + (p/2)‖u − u⁺_{t−1}‖₂² }   (3.21)

for some p > 0, c ∈ R^{2N(N−1)} and c̃ ∈ R^{N×N}. Suppose that the norm ‖·‖ in the definition of T̃(·) is the l2 norm. By examining the optimality condition of problem (3.21), we have the following explicit formulas (see Section A.4 for more details):

  u_t = (1/p)(p u⁺_{t−1} − c̃),   (3.22)

and

  d_{t,ij} = 0,  if ‖p d⁺_{t−1,ij} − c_{ij}‖ ≤ 1,
  d_{t,ij} = [ (‖p d⁺_{t−1,ij} − c_{ij}‖ − 1) / (p‖p d⁺_{t−1,ij} − c_{ij}‖) ] (p d⁺_{t−1,ij} − c_{ij}),  if ‖p d⁺_{t−1,ij} − c_{ij}‖ > 1,   (3.23)

where

  d_{t,ij} = ( (d_t)¹_{i,j}, (d_t)²_{i,j} )ᵀ,  d⁺_{t−1,ij} = ( (d⁺_{t−1})¹_{i,j}, (d⁺_{t−1})²_{i,j} )ᵀ  and  c_{ij} = ( c¹_{i,j}, c²_{i,j} )ᵀ.
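The d-update in (3.23) is a pixel-wise group soft-thresholding (shrinkage). A minimal NumPy sketch (ours) of the whole update (3.22)-(3.23), applied to all pixels at once, is given below; the arrays d1_prev, d2_prev, c1, c2 hold the two components of d⁺_{t−1} and c per pixel, and boundary components that do not appear in d are simply kept at zero.

import numpy as np

def solve_subproblem(u_prev, d1_prev, d2_prev, c_tilde, c1, c2, p):
    """Closed-form solution of (3.21) for the isotropic (l2) choice of the norm."""
    u_t = u_prev - c_tilde / p                      # (3.22): (p*u_prev - c_tilde)/p
    w1 = p * d1_prev - c1                           # w = p*d_prev - c, pixel-wise
    w2 = p * d2_prev - c2
    nrm = np.sqrt(w1**2 + w2**2)
    shrink = np.where(nrm > 1.0,
                      (nrm - 1.0) / (p * np.maximum(nrm, 1e-300)),   # (3.23), ||w|| > 1
                      0.0)                                           # (3.23), ||w|| <= 1
    return u_t, shrink * w1, shrink * w2

The update touches each pixel a constant number of times, so its cost is O(N²), consistent with the operation count stated next.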
Also note that one can write explicit solutions of (3.21) if ‖·‖ = ‖·‖₁. In both cases, solving the subproblem (3.10) requires only O(N²) arithmetic operations.

We are now ready to state our main results.

Theorem 5 Let (u_0, d_0) be an initial point of the AC-GD algorithm applied to problem (2.4), and D_0 := ‖(u_0, d_0) − (u*, d*)‖₂². Also assume that the parameter q ≥ 1 and that the stepsize parameters (α_t, γ_t), t ≥ 1, are set as in (3.16). Then the iteration complexity of the AC-GD algorithm for finding an ε-solution of (2.4) can be bounded by

  O( min{ √q ln(λqD_0/ε), √(λqD_0/ε) } ).   (3.24)

Moreover, its arithmetic complexity can be bounded by

  O( N² min{ √q ln(λqD_0/ε), √(λqD_0/ε) } ).   (3.25)
Proof. The bound (3.24) follows immediately from Corollary 3, Proposition 4 and the observation that L_ψ̃/μ_ψ̃ = O(q) when q ≥ 1. The bound (3.25) follows from (3.24) and the fact that the number of arithmetic operations in each iteration of the algorithm is bounded by O(N²).

Observe that the complexity of FISTA applied to problem (2.4) is O(√(λqD_0/ε)) (see [1, 2]), which is strictly worse than the bound in (3.24). In particular, suppose that q is a given constant; in view of Theorem 5, the complexity of the AC-GD algorithm only weakly depends on the accuracy ε, the parameter λ, as well as the distance D_0 (and thus the dimension of the problem). Moreover, its total arithmetic complexity is polynomial with a mild linear dependence on the problem dimension N².
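To connect Proposition 4, Corollary 3 and Theorem 5 numerically, the short sketch below (ours) evaluates the bounds (3.19)-(3.20) for given λ and q together with the rough iteration estimate √(L/μ) ln(1/ε) suggested by the linear rate (3.17); constants and the dependence on D_0 appearing in (3.24) are omitted.

import math

def acgd_constants(lam, q, eps=1e-6):
    """Strong convexity / Lipschitz bounds from Proposition 4 and a rough
    order-of-magnitude iteration estimate for the linear rate in (3.17)."""
    mu = lam / (10.0 + 1.0 / q)     # (3.19): lower bound on the modulus of psi~
    L = lam * (1.0 + 12.0 * q)      # (3.20): upper bound on the Lipschitz constant
    kappa = L / mu                  # O(q) for q >= 1, as used in the proof above
    return mu, L, kappa, math.ceil(math.sqrt(kappa) * math.log(1.0 / eps))

print(acgd_constants(lam=16.0, q=2.0))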
4 Numerical Results and Biomedical Application
In this section, we report our preliminary computational results where we compare our reformulation (RTVM) in (2.4) with the original TVM (OTVM) model in (2.2) for image denoising. We also compare the performance of two first-order algorithms for composite optimization, i.e., FISTA and AC-GD, applied to our reformulation. Furthermore, we discuss the application of the developed techniques for solving certain biomedical image denoising problems.
4.1 Numerical Study on General Image Denoising Problems
In this subsection, we conduct numerical experiments on a few classical image denoising problems. In our first experiment, we show that the reformulated TVM model is comparable to the original model in terms of the quality of the denoised images. Two image instance sets were used in this experiment. In the first instance set, we take the 256 × 256 Lena test image, whose pixels were scaled between 0 and 1. The noisy image is obtained by adding white Gaussian noise with zero mean and various standard deviations (σ). In the second instance set, we use the Peppers test images with different sizes, and the noise is added in the same way as for the first set with σ = 0.1. The original and noisy images for both instance sets are given in Figure 1. We set the parameter λ to 16 for both formulations in (2.2) and (2.4). We solve the original TVM model by using an efficient primal-dual algorithm [4] and also apply the AC-GD algorithm to the reformulated TVM model with different values of q. We then report the best (largest) value of the Peak Signal-to-Noise Ratio (PSNR) obtained after 200 iterations for both approaches. The results for these two instance sets are reported in Table 1 and Table 2, respectively. Moreover, Figure 2 and Figure 3, respectively, show the denoised Lena and Peppers images obtained from solving the original TVM model and the reformulated TVM model with q = 2 and 4.

Table 1: PSNR of denoised Lena images by OTVM and RTVM formulations

  Std. dev.  Noisy     OTVM     RTVM      RTVM     RTVM     RTVM     RTVM     RTVM
  σ          image              q = 0.5   q = 1    q = 2    q = 4    q = 8    q = 16
  0.05       28.7841   30.5001  31.4201   31.5248  31.5166  31.4811  31.4538  31.4301
  0.08       24.5702   28.5705  28.5887   28.9212  29.0479  29.0817  29.0749  29.0493
  0.10       22.6215   27.7510  27.4318   27.9536  28.1674  28.2284  28.2340  28.1978
  0.15       19.2589   26.0906  24.6397   25.4139  25.7594  25.8660  25.9108  25.9327
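For reference, PSNR values like those reported in Tables 1-4 can be computed with the standard definition for images scaled to [0, 1]; the exact convention used by the authors is not spelled out in the paper, so the following sketch (ours) is only indicative.

import numpy as np

def psnr(u, u_ref, peak=1.0):
    """Peak Signal-to-Noise Ratio in dB for intensities in [0, peak]."""
    mse = np.mean((np.asarray(u, float) - np.asarray(u_ref, float)) ** 2)
    return 10.0 * np.log10(peak**2 / mse)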
Original 256 × 256 Lena image
Noisy 256 × 256 Lena image, σ = 0.1
Original 256 × 256 Peppers image
Noisy 256 × 256 Peppers image, σ = 0.1
Figure 1: Original and noisy images
OTVM denoised image with λ = 16
RTVM denoised image with λ = 16, q = 2
RTVM denoised image with λ = 16, q = 4
Figure 2: Denoised Lena images obtained by the original and reformulated TVM models
Table 2: PSNR of denoised Peppers images by OTVM and RTVM formulations

  Image size  Noisy     OTVM     RTVM      RTVM     RTVM     RTVM     RTVM     RTVM
  N           image              q = 0.5   q = 1    q = 2    q = 4    q = 8    q = 16
  128         22.6652   25.8811  25.9709   26.1794  26.2507  26.2507  26.2232  26.1884
  256         22.8032   28.0891  27.4795   28.1125  28.3800  28.4470  28.4465  28.4195
  384         22.8388   29.5727  28.2988   29.2267  29.6905  29.8482  29.8730  29.8386
  512         22.8575   30.3808  28.6557   29.7541  30.3290  30.5400  30.5831  30.5541
OTVM denoised image with λ = 16
RTVM denoised image with λ = 16, q = 2
RTVM denoised image with λ = 16, q = 4
Figure 3: Denoised Peppers images obtained by the original and reformulated TVM models
It can be seen from Tables 1 and 2 that the PSNR values computed from the reformulated TVM model are not too sensitive to the choice of the parameter q under different selections of the noise level σ and the image size N. In practice, we can set q = 2 or q = 4 to achieve reasonably good solution quality. We also observe that the quality of the denoised images obtained by the reformulated TVM model is comparable to that obtained by the original model. In fact, at first glance, the denoised Lena image obtained with the original TVM model seems to be cleaner than those obtained with the reformulated model. However, a closer examination reveals that some undesirable oversmoothing effects, e.g., the disappearance of the texture on the hat and a few extra lines at the nose of the Lena image in Figure 2, were introduced by the original TVM model. These oversmoothing effects are not apparent in the denoised images obtained with the reformulated model. Moreover, no significant differences can be observed between the denoised Peppers images obtained with the original and reformulated TVM models.

In our second experiment, we demonstrate that AC-GD is faster than FISTA for solving the composite minimization problem in (2.4). From our discussion in Section 3.2, the convergence rate of AC-GD always dominates that of FISTA for solving strongly convex composite optimization problems. Our goal here is to verify this claim numerically. Figure 4 shows the convergence behavior of AC-GD and FISTA applied to the Lena image. More specifically, we report the optimality gap φ̃(u_k, d_k) − φ̃* for both algorithms, where the optimal value φ̃* was estimated by running FISTA for 10,000 iterations. As shown in Figure 4, after only 250 iterations the AC-GD method reaches 10⁻¹¹ accuracy. It can also be easily seen from Figure 4 that AC-GD converges linearly while FISTA converges sublinearly. This indeed reflects the difference between the theoretical
convergence rates for both algorithms applied to problem (2.4).

[Figure 4 plots the optimality gap φ̃(u_k, d_k) − φ̃* on a logarithmic scale (from about 10⁴ down to 10⁻¹²) against the iteration count k (from 0 to 250) for the FISTA and AC-GD algorithms.]

Figure 4: AC-GD vs. FISTA applied to denoise the Lena image
4.2 Applications in Biomedical Image Denoising
In this subsection, we apply the developed reformulation of TVM to magnetic resonance imaging (MRI), which provides detailed information about internal structures of the body. In comparison with other medical imaging techniques, MRI is most useful for brain and muscle imaging. Two image instance sets were used in this experiment. In the first instance set, we use the 256 × 256 Brain MRI test image, and noisy images are obtained by adding white Gaussian noise with zero mean and various standard deviations (σ). In the second instance set, we use the Knee MRI test images with different sizes, and the noisy images are generated with σ = 0.1. We applied the AC-GD algorithm with the same settings (λ = 16 and q = 0.5, 1, 2, 4, 8, 16) as in the previous experiment to solve the RTVM model for these two instance sets. As shown in Figures 5 and 6 and Tables 3 and 4, the results obtained for MRI images are consistent with those in Section 4.1.

Table 3: PSNR of denoised Brain MRIs by OTVM and RTVM formulations

  Std. dev.  Noisy     OTVM     RTVM      RTVM     RTVM     RTVM     RTVM     RTVM
  σ          image              q = 0.5   q = 1    q = 2    q = 4    q = 8    q = 16
  0.05       29.0079   29.4946  30.1097   30.0759  30.0148  29.9581  29.9246  29.9102
  0.08       25.2942   26.8318  27.0898   27.0372  26.9734  26.9341  26.9113  26.8969
  0.10       23.3351   25.4786  25.4708   25.4540  25.4179  25.3926  25.3732  25.3591
  0.15       19.9300   22.9899  22.8896   23.0580  23.1191  23.1391  23.1469  23.1383
256 × 256 Original Knee MRI
256 × 256 Noisy Knee MRI, σ = 0.1
Denoised Knee MRI, λ = 16, q = 2
Denoised Knee MRI, λ = 16, q = 4
Figure 5: Original, noisy and denoised Knee MRI
256 × 256 Original Brain MRI
256 × 256 Noisy Brain MRI, σ = 0.1
Denoised Brain MRI, λ = 16, q = 2
Denoised Brain MRI, λ = 16, q = 4
Figure 6: Original, noisy and denoised Brain MRI
Table 4: PSNR of denoised Knee MRIs by OTVM and RTVM formulations

  Image size  Noisy     OTVM     RTVM      RTVM     RTVM     RTVM     RTVM     RTVM
  N           image              q = 0.5   q = 1    q = 2    q = 4    q = 8    q = 16
  128         23.0303   25.0887  25.2891   25.3021  25.2828  25.2617  25.2387  25.2206
  256         23.1935   26.6968  26.4656   26.5326  26.5115  26.4684  26.4279  26.3931
  384         23.2989   28.1547  27.6806   27.9761  27.9956  27.9441  27.8966  27.8574
  512         23.3239   29.2226  28.3805   28.9444  29.0726  29.0379  28.9787  28.9145

5 Conclusions
In this paper, we introduced a strongly convex composite reformulation of the ROF model, a well-known approach in biomedical image processing. We showed that this reformulation is comparable with the original ROF model in terms of the quality of the denoised images. We presented a first-order algorithm that possesses a linear rate of convergence for solving the reformulated problem and scales to large-scale image denoising problems. We demonstrated through our numerical experiments that the developed algorithm, when applied to the reformulated model, compares favorably with existing first-order algorithms applied to the original model. In the future, we would like to generalize these reformulations to other image processing problems such as image deconvolution and image zooming.
References

[1] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Proc., 18:2419–2434, 2009.

[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2:183–202, 2009.

[3] A. Chambolle. An algorithm for total variation minimization and applications. J. Math. Imaging Vision, 20:89–97, 2004.

[4] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vision, 40:120–145, 2011.

[5] T.F. Chan, G.H. Golub, and P. Mulet. A nonlinear primal-dual method for total variation-based image restoration. SIAM J. Sci. Comput., 20(6):1964–1977, 1999.

[6] P.L. Combettes and J. Luo. An adaptive level set method for nondifferentiable constrained image recovery. IEEE Trans. Image Process., 11.

[7] J. Dahl, P.C. Hansen, S.H. Jensen, and T.L. Jensen. Algorithms and software for total variation image reconstruction via first-order methods. Numerical Algorithms, pages 67–92, 2009.

[8] F. Dibos and G. Koepfler. Global total variation minimization. SIAM J. Numer. Anal., pages 646–664, 2000.

[9] H.Y. Fu, M.K. Ng, M. Nikolova, and J.L. Barlow. Efficient minimization methods of mixed l2−l1 and l1−l1 norms for image restoration. SIAM Journal on Scientific Computing, 27(6):1881–1902, 2006.

[10] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework. Manuscript, Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611, USA, 2010. Submitted to SIAM Journal on Optimization.

[11] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. Manuscript, Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611, USA, 2010. Submitted to SIAM Journal on Optimization.

[12] D. Goldfarb and W. Yin. Second-order cone programming methods for total variation-based image restoration. SIAM Journal on Scientific Computing, 27:622–645, 2005.

[13] T. Goldstein and S. Osher. The split Bregman algorithm for l1 regularized problems. Manuscript, UCLA CAM Report (08-29), April 2008.

[14] Y.E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR, 269:543–547, 1983.

[15] Y.E. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16:235–249, 2005.

[16] Y.E. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103:127–152, 2005.

[17] L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.

[18] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Manuscript, University of Washington, Seattle, May 2008.

[19] Y. Wang, J. Yang, W. Yin, and Y. Zhang. A new alternating minimization algorithm for total variation image reconstruction. SIAM Journal on Imaging Sciences, 1(3):248–272, 2008.
Appendix

We provide the proofs for Proposition 1, Theorem 2 and Proposition 4 in Sections A.1, A.2 and A.3, respectively. We also discuss how to solve the subproblems arising from the AC-GD algorithm applied to problem (2.4) in Section A.4.
A1. Relation between two formulations (Proposition 1)

In this subsection, we provide the proof of Proposition 1, which shows how the optimal values of problems (2.2) and (2.4) are related.

Proof of Proposition 1. First note that, by the definitions of T(·) and T̃(·), we have

  T̃(d) = T(u),  if d = −Eu.   (5.26)

Let ū be an optimal solution of (2.2) and d̄ = −Eū. Using (2.2), (2.4) and (5.26), we have

  φ̃* ≤ φ̃(ū, d̄) = T̃(−Eū) + (λ/2)[ ‖ū − f‖² + q‖Eū − Eū‖² ] = T(ū) + (λ/2)‖ū − f‖² = φ*.

We now show the second relation in (2.5). Let (u*, d*) be an optimal solution of (2.4); then

  φ* ≤ φ(u*) = T(u*) + (λ/2)‖u* − f‖².

Observe that, by the definition of T̃(·), we have

  T̃(d + δ) = Σ_{i,j} ‖( d¹_{ij} + δ¹_{ij}, d²_{ij} + δ²_{ij} )‖ ≤ Σ_{i,j} ‖( d¹_{ij}, d²_{ij} )‖ + Σ_{i,j} ‖( δ¹_{ij}, δ²_{ij} )‖ ≤ T̃(d) + N‖δ‖,
A2. Concergence analysis for the AC-GD algorithm (Theorem 2) Our main goal in this subsection is to prove the convergence results of the AC-GD algorithm described in Theorem 2. We first establish two technical results. Lemma 6 states some properties of the subproblem (3.10) and Lemma 7 establishes an important recursion of the AC-GD algorithm. Theorem 2 then follows directly from Lemma 7. The first technical result below characterizes the solution of the projection step (3.10). 15
Lemma 6 Let X be a convex set and p : X → R be a convex function. Assume that û is an optimal solution of min{ p(u) + (μ/2)‖x̃ − u‖² : u ∈ X }, where x̃ ∈ X and μ > 0 are given. Then, for any u ∈ X,

  p(û) + (μ/2)‖x̃ − û‖² + (μ/2)‖û − u‖² ≤ p(u) + (μ/2)‖x̃ − u‖².   (5.27)

Proof. Denote q(u) := p(u) + (μ/2)‖x̃ − u‖². The result immediately follows from the strong convexity of q(u) and the optimality condition that ⟨q'(û), u − û⟩ ≥ 0 for any u ∈ X.

The following lemma establishes an important recursion for the AC-GD algorithm.

Lemma 7 Let (x_{t−1}, x_{t−1}^ag) ∈ X × X be given. Also let (x_t^md, x_t, x_t^ag) ∈ X × X × X be computed according to (3.8)-(3.11), and suppose that (3.12) holds for the given γ_t and α_t. Then, for any x ∈ X, we have

  Ψ(x_t^ag) + (μ/2)‖x_t − x‖² ≤ (1 − α_t)[ Ψ(x_{t−1}^ag) + (μ/2)‖x_{t−1} − x‖² ] + α_t Ψ(x) + (γ_t/2)[ ‖x_{t−1} − x‖² − ‖x_t − x‖² ].   (5.28)

Proof. We first establish some basic relations among the search points x_t^ag, x_t^md, x_t and x_{t−1}^+. Denote d_t := x_t^ag − x_t^md. It follows from (3.8), (3.11) and (3.9) that

  d_t = α_t x_t + (1 − α_t)x_{t−1}^ag − x_t^md
      = α_t x_t + [ ((1 − α_t²)μ + γ_t)/(μ + γ_t) − 1 ] x_t^md − [ α_t((1 − α_t)μ + γ_t)/(μ + γ_t) ] x_{t−1}
      = α_t [ x_t − (α_t μ/(μ + γ_t)) x_t^md − (((1 − α_t)μ + γ_t)/(μ + γ_t)) x_{t−1} ]
      = α_t (x_t − x_{t−1}^+).   (5.29)

Using the above result and the convexity of ψ, we have

  ψ(x_t^md) + ⟨ψ'(x_t^md), d_t⟩
    = ψ(x_t^md) + ⟨ψ'(x_t^md), α_t x_t + (1 − α_t)x_{t−1}^ag − x_t^md⟩
    = (1 − α_t)[ ψ(x_t^md) + ⟨ψ'(x_t^md), x_{t−1}^ag − x_t^md⟩ ] + α_t[ ψ(x_t^md) + ⟨ψ'(x_t^md), x_t − x_t^md⟩ ]
    ≤ (1 − α_t)ψ(x_{t−1}^ag) + α_t[ ψ(x_t^md) + ⟨ψ'(x_t^md), x_t − x_t^md⟩ ].   (5.30)

It then follows from the previous two observations, (3.7), (3.11) and the convexity of X(x) that

  Ψ(x_t^ag) = ψ(x_t^ag) + X(x_t^ag)
    ≤ ψ(x_t^md) + ⟨ψ'(x_t^md), d_t⟩ + (L/2)‖d_t‖² + (1 − α_t)X(x_{t−1}^ag) + α_t X(x_t)
    ≤ (1 − α_t)ψ(x_{t−1}^ag) + α_t[ ψ(x_t^md) + ⟨ψ'(x_t^md), x_t − x_t^md⟩ ] + (L/2)‖d_t‖² + (1 − α_t)X(x_{t−1}^ag) + α_t X(x_t)
    ≤ (1 − α_t)Ψ(x_{t−1}^ag) + α_t[ ψ(x_t^md) + ⟨ψ'(x_t^md), x_t − x_t^md⟩ + X(x_t) ] + (L/2)‖d_t‖²
    = (1 − α_t)Ψ(x_{t−1}^ag) + α_t[ ψ(x_t^md) + ⟨ψ'(x_t^md), x_t − x_t^md⟩ + X(x_t) ] + ((μ + γ_t)/(2α_t²))‖d_t‖² − ((μ + γ_t − Lα_t²)/(2α_t²))‖d_t‖².   (5.31)

Now let us apply the result regarding the projection step (3.10). Specifically, by using Lemma 6 with p(·) = α_t[ ⟨ψ'(x_t^md), ·⟩ + X(·) ], û = x_t and x̃ = x_{t−1}^+, we have, for any x ∈ X,

  α_t[ ψ(x_t^md) + ⟨ψ'(x_t^md), x_t − x_t^md⟩ + X(x_t) ] + ((μ + γ_t)/2)‖x_t − x_{t−1}^+‖² + ((μ + γ_t)/2)‖x_t − x‖²
    ≤ α_t[ ψ(x_t^md) + ⟨ψ'(x_t^md), x − x_t^md⟩ + X(x) ] + ((μ + γ_t)/2)‖x − x_{t−1}^+‖²
    ≤ α_t[ ψ(x_t^md) + ⟨ψ'(x_t^md), x − x_t^md⟩ + X(x) ] + (α_t μ/2)‖x − x_t^md‖² + (((1 − α_t)μ + γ_t)/2)‖x − x_{t−1}‖²
    ≤ α_t Ψ(x) + (((1 − α_t)μ + γ_t)/2)‖x − x_{t−1}‖²,   (5.32)

where the second inequality follows from (3.9) and the convexity of ‖·‖², and the last inequality follows from the strong convexity of Ψ(·). Combining (5.31) and (5.32), we obtain

  Ψ(x_t^ag) ≤ (1 − α_t)[ Ψ(x_{t−1}^ag) + (μ/2)‖x − x_{t−1}‖² ] + α_t Ψ(x) + (γ_t/2)[ ‖x − x_{t−1}‖² − ‖x − x_t‖² ] − (μ/2)‖x − x_t‖² − ((μ + γ_t − Lα_t²)/(2α_t²))‖d_t‖²,

which clearly implies (5.28) due to the assumption in (3.12).

We are now ready to prove Theorem 2.

Proof of Theorem 2. Dividing both sides of (5.28) by Γ_t, and using (3.14) and the fact that α_1 = 1, we have

  (1/Γ_t)[ Ψ(x_t^ag) + (μ/2)‖x − x_t‖² ] ≤ (1/Γ_{t−1})[ Ψ(x_{t−1}^ag) + (μ/2)‖x − x_{t−1}‖² ] + (α_t/Γ_t)Ψ(x) + (γ_t/(2Γ_t))[ ‖x − x_{t−1}‖² − ‖x − x_t‖² ], ∀ t ≥ 2,

and

  (1/Γ_1)[ Ψ(x_1^ag) + (μ/2)‖x − x_1‖² ] ≤ (α_1/Γ_1)Ψ(x) + (γ_1/(2Γ_1))[ ‖x − x_0‖² − ‖x − x_1‖² ].

Summing up the above inequalities, we obtain

  (1/Γ_t)[ Ψ(x_t^ag) + (μ/2)‖x − x_t‖² ] ≤ Σ_{τ=1}^t (α_τ/Γ_τ)Ψ(x) + Σ_{τ=1}^t (γ_τ/(2Γ_τ))[ ‖x − x_{τ−1}‖² − ‖x − x_τ‖² ].

Note that, by (3.14) and the fact that α_1 = 1, we have

  Σ_{τ=1}^t α_τ/Γ_τ = α_1/Γ_1 + Σ_{τ=2}^t (1/Γ_τ)(1 − Γ_τ/Γ_{τ−1}) = 1/Γ_1 + Σ_{τ=2}^t (1/Γ_τ − 1/Γ_{τ−1}) = 1/Γ_t.   (5.33)

Using the above two relations, condition (3.13) and the fact that Γ_1 = 1, we have

  (1/Γ_t)[ Ψ(x_t^ag) + (μ/2)‖x − x_t‖² ] ≤ (1/Γ_t)Ψ(x) + (γ_1/(2Γ_1))[ ‖x − x_0‖² − ‖x − x_t‖² ] ≤ (1/Γ_t)Ψ(x) + (γ_1/2)‖x − x_0‖²,   (5.34)

which clearly implies (3.15).
A3. Properties of the composite function (Proposition 4)

In this subsection, we provide the proof of Proposition 4, which gives estimates of the two crucial parameters μ_ψ̃ and L_ψ̃ for the smooth component ψ̃(·) of the composite function φ̃(·).

Proof of Proposition 4. Denote the maximum and minimum eigenvalues of M := AᵀA by λ_max and λ_min, respectively. Then it suffices to show that

  λ_max ≤ 1 + 12q,   (5.35)

and

  λ_min ≥ 1/(10 + 1/q).   (5.36)

We bound the eigenvalues of M by using Gershgorin's theorem. Observe that the network flow matrix E in (2.3) can be written explicitly as a block matrix built from the vector e_1 and the matrices K, P_i and L_{i,j}, where e_1 ∈ R^{N−1} is the unit vector (1, 0, . . . , 0)ᵀ, K ∈ R^{(N−1)×(N−1)} denotes the lower bidiagonal matrix with main diagonal entries equal to 1 and sub-diagonal entries equal to −1, P_i, 1 ≤ i ≤ N−1, denotes the matrix whose ith column equals e_1 and whose other entries equal 0, and L_{i,j} ∈ R^{(N−1)×(N−1)}, 1 ≤ i, j ≤ N−1, denotes the matrix whose ith column equals the jth column of K and whose other entries equal 0.

First, we find an upper bound for the maximum eigenvalue of M. It is easy to see that the 2Nth row of M has the largest sum of the absolute values of its entries, and

  Σ_{i=1}^{N²+2N(N−1)} |M_{2N,i}| = 1 + 12q,

which, in view of Gershgorin's theorem, implies (5.35).

Second, we find a lower bound for the minimum eigenvalue of M. Since 1/λ is an eigenvalue of M⁻¹ whenever λ is an eigenvalue of M, we instead find an upper bound for the maximum eigenvalue of M⁻¹. Note that M⁻¹ = A⁻¹(A⁻¹)ᵀ and that, by applying Gauss–Jordan elimination, we easily obtain the block formula

  A⁻¹ = [ I_u  0 ; −E  (1/√q) I_d ],

whose lower-left block −E is again built from the blocks −e_1, −K, −P_i and −L_{i,j}. It is easy to see that the (N² + 2N)th row of M⁻¹ has the largest sum of the absolute values of its entries. In particular, we have

  Σ_{i=1}^{N²+2N(N−1)} |(M⁻¹)_{N²+2N, i}| = 10 + 1/q,
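The Gershgorin bounds (5.35)-(5.36) can be sanity-checked numerically for small N. The dense NumPy sketch below (ours) builds E and A directly from the definitions in Section 2 and compares the extreme eigenvalues of M = AᵀA with the bounds.

import numpy as np

def build_E(N):
    """Dense E from (2.3): Eu = -d, where d stacks the forward differences (2.1)."""
    E = np.zeros((2 * N * (N - 1), N * N))
    idx = lambda i, j: i * N + j
    r = 0
    for i in range(N - 1):               # arcs for d1[i,j] = u[i+1,j] - u[i,j]
        for j in range(N):
            E[r, idx(i, j)], E[r, idx(i + 1, j)] = 1.0, -1.0
            r += 1
    for i in range(N):                   # arcs for d2[i,j] = u[i,j+1] - u[i,j]
        for j in range(N - 1):
            E[r, idx(i, j)], E[r, idx(i, j + 1)] = 1.0, -1.0
            r += 1
    return E

def check_bounds(N=8, q=2.0):
    E = build_E(N)
    m = E.shape[0]
    A = np.block([[np.eye(N * N), np.zeros((N * N, m))],
                  [np.sqrt(q) * E, np.sqrt(q) * np.eye(m)]])
    eig = np.linalg.eigvalsh(A.T @ A)
    print("lambda_min =", eig[0],  "  bound =", 1.0 / (10.0 + 1.0 / q))
    print("lambda_max =", eig[-1], "  bound =", 1.0 + 12.0 * q)

check_bounds()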
A4. Solving the subproblem (3.21) in the AC-GD algorithm

In this subsection, we derive the explicit solution to the subproblem (3.21) in Step 2 of the AC-GD algorithm. It is easy to see that (3.22) is true. By the definition of T̃(d), d_{t,ij} is the solution of

  min_{d_{ij}} ⟨c_{ij}, d_{ij}⟩ + ‖d_{ij}‖₂ + (p/2)‖d_{ij} − d⁺_{t−1,ij}‖₂²,

where d_{ij} = ( d¹_{i,j}, d²_{i,j} )ᵀ, d⁺_{t−1,ij} = ( (d⁺_{t−1})¹_{i,j}, (d⁺_{t−1})²_{i,j} )ᵀ and c_{ij} = ( c¹_{i,j}, c²_{i,j} )ᵀ.

For notational convenience, let us consider the following problem:

  min_y ⟨r, y⟩ + ‖y‖₂ + (p/2)‖y − q‖₂²,   (5.37)

where y, q, r ∈ R², p ∈ R and p > 0. We consider two cases.

Case 1: ‖pq − r‖₂ ≤ 1. We have

  ⟨r, y⟩ + ‖y‖₂ + (p/2)‖y − q‖₂²
    = ‖y‖₂ + (p/2)‖y‖₂² + (p/2)‖q‖₂² − p⟨q, y⟩ + ⟨r, y⟩
    = ‖y‖₂ + (p/2)‖y‖₂² + (p/2)‖q‖₂² − ⟨pq − r, y⟩
    ≥ ‖y‖₂ + (p/2)‖y‖₂² + (p/2)‖q‖₂² − ‖pq − r‖₂‖y‖₂
    ≥ (p/2)‖y‖₂² + (p/2)‖q‖₂²,

which implies that y = 0 is the solution of (5.37) when ‖pq − r‖₂ ≤ 1.

Case 2: ‖pq − r‖₂ > 1. By the optimality condition of (5.37), we have

  y/‖y‖₂ + py − pq + r = 0,

which means

  y₁/‖y‖₂ + p y₁ − p q₁ + r₁ = 0,
  y₂/‖y‖₂ + p y₂ − p q₂ + r₂ = 0.

Denoting t = ‖y‖₂, we then have

  y₁(1/t + p) = p q₁ − r₁,
  y₂(1/t + p) = p q₂ − r₂,

or equivalently,

  y₁ = (t/(1 + tp))(p q₁ − r₁),
  y₂ = (t/(1 + tp))(p q₂ − r₂).

Combining the above relation with t = ‖y‖₂, we have

  (t/(1 + tp)) ‖pq − r‖₂ = t,

or equivalently,

  t = (1/p)(‖pq − r‖₂ − 1),

which immediately implies that the optimal solution of (5.37) is given by

  y = [ (‖pq − r‖₂ − 1) / (p‖pq − r‖₂) ] (pq − r).

Replacing y, q and r by d_{ij}, d⁺_{t−1,ij} and c_{ij}, respectively, we obtain (3.23). It is worth noting that this formula still holds when y, q, r ∈ R.
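As a quick numerical sanity check of the closed-form minimizer of (5.37), the sketch below (ours) compares its objective value against the objective at randomly perturbed points.

import numpy as np

def closed_form(q, r, p):
    """Minimizer of <r, y> + ||y||_2 + (p/2)*||y - q||_2^2 derived above."""
    w = p * q - r
    nw = np.linalg.norm(w)
    return np.zeros_like(q) if nw <= 1.0 else (nw - 1.0) / (p * nw) * w

def objective(y, q, r, p):
    return r @ y + np.linalg.norm(y) + 0.5 * p * np.linalg.norm(y - q) ** 2

rng = np.random.default_rng(1)
for _ in range(5):
    q, r, p = rng.normal(size=2), rng.normal(size=2), rng.uniform(0.5, 5.0)
    y_star = closed_form(q, r, p)
    best_nearby = min(objective(y_star + rng.normal(scale=0.5, size=2), q, r, p)
                      for _ in range(2000))
    assert objective(y_star, q, r, p) <= best_nearby + 1e-12
print("closed-form solution of (5.37) passed the perturbation check")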