
Generalized Thresholding Sparsity-Aware Online Learning in a Union of Subspaces
Yannis Kopsinis, Member, IEEE, Konstantinos Slavakis, Member, IEEE, Sergios Theodoridis, Fellow, IEEE, Stephen McLaughlin, Fellow, IEEE
arXiv:1112.0665v2 [cs.IT] 11 Dec 2011

Abstract—This paper studies a non-convexly constrained, sparse inverse problem in time-varying environments from a set theoretic estimation perspective. A new theory is developed that allows for the incorporation, in a unifying way, of different thresholding rules to promote sparsity, which may even be related to non-convex penalty functions. The resulting generalized thresholding operator is embodied in an efficient online, sparsity-aware learning scheme. The algorithm is of low computational complexity, exhibiting a linear dependence on the number of free parameters. A convergence analysis of the proposed algorithm is conducted, and extensive experiments are presented in order to validate the novel methodology.
Index Terms—Inverse problems, sparsity, non-convex, partially quasi-nonexpansive mappings, online signal recovery.

I. INTRODUCTION
Sparsity-aware learning has been a topic at the forefront of research over the last ten years or so [1], [2]. Considerable research effort has been invested in developing efficient schemes for the recovery of sparse signals/parameter vectors. However, most of these efforts have been, and continue to be, focused on batch processing, where all measurements (training data) are available prior to the estimation task. It is only very recently that online, time-adaptive algorithms have been developed [3]–[9]. The major evolution of the latter activity is along the lines of greedy-like algorithms and around the ℓ1-norm regularization of the Least-Squares (LS) cost (LASSO). Both lead to hard and soft thresholding operators in order to achieve sparsification.

The thresholding operation is a critical step for many sparsity-aware algorithms, and a number of related operators have been studied, both theoretically and experimentally, mainly within the statistics community and in the context of the nonlinear wavelet regression task. Iterative hard thresholding methods for sparse recovery, in the compressed sensing framework, have recently been attracting a lot of attention [10]–[13]. It is by now well established that hard thresholding (a discontinuous operator) has a tendency towards a larger variance of the estimates. On the other hand, soft thresholding (a continuous operator) tends to introduce bias in the estimates. Moreover, due to its discontinuity, hard thresholding can be unstable, in the sense of being sensitive to small changes in the data [14]. To this end, alternative thresholding rules have been proposed in an effort to bypass these drawbacks. Generally, this involves modifying the term that regularizes the least-squares cost by using different penalty functions, which may even be non-convex.

Y. Kopsinis is with The University of Edinburgh, Institute for Digital Communications, King's Buildings, EH9 3JL, Edinburgh, UK. Email: [email protected]. K. Slavakis is with the University of Peloponnese, Dept. of Telecommunications Science and Technology, Karaiskaki St., Tripolis 22100, Greece. Email: [email protected]. Tel: +30.2710.37.2204, Fax: +30.2710.37.2160. S. Theodoridis is with the University of Athens, Dept. of Informatics and Telecommunications, Ilissia, Athens 15784, Greece. Email: [email protected]. Tel: +30.210.727.5328, Fax: +30.210.727.5337. S. McLaughlin is with the Heriot Watt University, School of Engineering and Physical Sciences, EH14 4AS, Edinburgh. Email: [email protected].


The aim of this paper is to develop an online scheme that builds upon such generalized thresholding operators, which may also be associated with non-convex costs. The scheme is generic and exploits a number of known thresholding rules. Furthermore, it provides the means to include new ones. As an example, a novel piecewise linear thresholding operator is developed, which leads to an efficient, sparsity-aware time-adaptive algorithm of reduced complexity. The framework in which the new scheme has been developed is that of set theoretic estimation [15] and, in particular, its online version first proposed in [16]–[18]. However, the treatment of the more general cases of thresholding rules, which is the focus of this work, dictates a generalization of the existing theory [18], which, so far, has been developed around convex sets and constraints and the associated projection operators [9]. In this paper, non-convex sets have to be considered, and the involved projection operator has to be replaced by more general mapping operators, in order to meet the needs of the problem. To this end, this paper has a value beyond sparsity-aware learning. It presents an online algorithmic framework that can deal with non-convex constraints, provided that these fall within the rationale of our newly proposed mapping operators. To be more specific, in this paper we constrain our solution to lie in a union of subspaces, which is known to be a non-convex region.

The remainder of the paper is organized as follows. In Section II, the inverse problem under consideration is described, and in Section III the principles of the set theoretic estimation approach to online learning are presented. In Section V, novel sparsity-aware mappings are introduced, which allow the convex and non-convex thresholding operators solving the penalized least-squares problem (discussed in Section IV) to be incorporated in the set theoretic estimation framework. The resulting algorithm is presented in Section VI, and it is evaluated with respect to both performance and computational complexity in Section VII. A number of appendices support theoretically the developments exposed throughout the paper. More specifically, in App. A some required elements of convex analysis are quoted, and in App. B and App. C the newly introduced partially quasi-nonexpansive mapping and the generalized thresholding operator are defined, respectively. Finally, in App. D, the convergence analysis of the proposed algorithm is conducted.

II. SYSTEM MODEL AND PROBLEM STATEMENT
We will denote the set of all non-negative integers, positive integers, and real numbers by N, N*, and R, respectively. Given j1, j2 ∈ N, such that j1 ≤ j2, let j1, j2 := {j1, j1 + 1, . . . , j2}. The stage of discussion will be the Euclidean space R^L, where L ∈ N*. Given any pair of vectors a1, a2 ∈ R^L, the inner product in R^L is defined as the classical vector-dot product ⟨a1, a2⟩ := a1^t a2, where the superscript t stands for vector/matrix transposition. The induced norm will be denoted by ‖·‖. Our task is to estimate the signal h∗ ∈ R^L, based on measurements that are sequentially generated by the linear regression model:
$$ y_n = x_n^t h_* + v_n, \quad \forall n \in \mathbb{N}, \qquad (1) $$

where the model outputs (observations) (y_n)_{n∈N} ⊂ R and the model input vectors (x_n)_{n∈N} ⊂ R^L comprise the measurement pairs (x_n, y_n)_{n∈N}, and (v_n)_{n∈N} is the noise process. The unknown signal parameter vector h∗ is "sensed" by a sequence of inner products (vector-dot products in a Euclidean space) with appropriately selected "sensing" vectors x_n, which are chosen to be instances of independent and identically distributed (i.i.d.) random variables. In this study, the signal h∗ is assumed to be sparsely represented in a basis Ψ ∈ R^{L×L}:
$$ h_* = \Psi a_*, \qquad (2) $$

where a∗ is sparse, i.e., most of its components are zero, and Ψ := [ψ_1, ψ_2, . . . , ψ_L] is a (usually) orthogonal matrix, with {ψ_i}_{i=1}^L being the basis vectors. If we define ‖a∗‖_0 to stand for the number of non-zero components of a∗, then the assumption


that a∗ is sparse can be equivalently written as K := ‖a∗‖_0 ≪ L. Hereafter, such signals will be referred to as K-sparse signals in the transform domain. The regression model (1), by the incorporation of (2), can be equivalently expressed as
$$ y_n = u_n^t a_* + v_n, \quad \forall n \in \mathbb{N}, \qquad (3) $$

where (u_n := Ψ^t x_n)_{n∈N} stand for the sensing vectors in the transform domain defined by Ψ. According to this formulation, the unknown parameter becomes the sparse vector a∗ ∈ R^L.

III. THE SET THEORETIC ESTIMATION APPROACH TO ONLINE LEARNING

The mainstream of sparsity-aware online methods exploits the training data (u_n, y_n)_{n∈N} in the context of classical adaptive filtering [19]; a quadratic objective function is used to quantify the designer's perception of loss. In the sequel, this convex differentiable function is regularized by a sparsity-promoting term, usually one that involves the ℓ1-norm penalty function, and a minimizer of the resulting optimization task is sought in either the RLS or the LMS rationale, e.g., [3]–[7]. Very recently, a novel online method for the recovery of sparse signals, referred to as the Adaptive Projection-based Algorithm using Weighted ℓ1-balls (APWL1), was developed in [9], based on set theoretic estimation arguments [15]. The set theoretic estimation philosophy departs from the more standard approach of constructing a loss function, and instead searches for a set of solutions which are in agreement with the available measurements as well as with the available a-priori knowledge, the latter given in the form of a set of constraints. More specifically, at each time instance, n ∈ N, the training data pair (u_n, y_n) is used to define a closed convex subset of R^L, which is considered to be the region where the unknown a∗ lies with high probability. Different alternatives exist on how to "construct" such convex regions. A popular choice takes the form of a hyperslab around (u_n, y_n), which is defined as:
$$ S_n[\epsilon] := \{ a \in \mathbb{R}^L : |u_n^t a - y_n| \le \epsilon \}, \quad \forall n \in \mathbb{N}, \qquad (4) $$

for some user-defined tolerance ε ≥ 0, and for u_n ≠ 0. The parameter ε determines, essentially, the width of the hyperslabs, and it implicitly models the effects of the noise, as well as various other uncertainties, like measurement inaccuracies, calibration errors, etc. Any point that lies in the hyperslab is considered to be in agreement with the corresponding measurement pair at the specific time instance. For example, if the noise were bounded, then an appropriate choice of ε would guarantee that the unknown solution lies within S_n[ε] in (4). Following the set theoretic estimation terminology [15], S_n[ε] is also called a property set.

Given the sequence of training pairs (u_n, y_n)_{n∈N}, our ultimate goal is to search for a point â∗ ∈ R^L that lies in the intersection of the hyperslabs (S_n[ε])_{n∈N}, which are defined by the training points. An algorithmic solution to the task of finding a point in the intersection of a finite number of closed convex sets is given by the celebrated method of Projections Onto Convex Sets (POCS) [15], [20]–[23]. In contrast to POCS, where only a finite number of sets is involved, our methodology, based on [16]–[18], deals with an infinite number of sets, which allows operation in online settings; in the latter case, a new convex set, e.g., a hyperslab, is formed as time evolves. In the online sparsity-promoting context of [9], the set theoretic estimation rationale is realized by the following recursion; for an arbitrary starting point a_0 ∈ R^L, define, ∀n ∈ N,
$$ a_{n+1} := P_{B_{\ell_1}[w_n,\delta]}\left( a_n + \mu_n \left( \sum_{i=n-q+1}^{n} \omega_i^{(n)} P_{S_i[\epsilon]}(a_n) - a_n \right) \right). \qquad (5) $$
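As an aside, the only geometric building block needed in (5), the projection P_{S_i[ε]} onto a hyperslab, has the simple closed form quoted later in App. A, eq. (15). A minimal Python sketch (ours, for illustration only; the function name is hypothetical) is:

import numpy as np

def project_onto_hyperslab(a, u, y, eps):
    # Metric projection of a onto S[eps] = {x : |u^T x - y| <= eps}, assuming u != 0
    # (cf. eq. (15) in App. A).
    r = u @ a - y
    if r > eps:                      # a lies on one side of the slab
        return a - ((r - eps) / (u @ u)) * u
    if r < -eps:                     # a lies on the other side
        return a - ((r + eps) / (u @ u)) * u
    return a                         # a is already inside: the projection is a itself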

In order not to disturb the flow of the discussion, we do not delve into the details of (5) here. Loosely speaking, (5) comprises a weighted average of q projections (see App. A) onto the hyperslabs {S_i[ε]}_{i=n-q+1}^{n}, which are formed by the q most recently received samples, and an extra projection onto the weighted ℓ1-ball, B_{ℓ1}[w_n, δ], which is the means to impose our a-priori knowledge with respect to the sparse nature of the unknown a∗. It can be verified that the projection P_{B_{ℓ1}[w_n,δ]} is equivalent to a soft thresholding operation [9], [24].

In this paper, we go one step further in the way the a-priori knowledge about sparsity is exploited. We will explore various possibilities of more general thresholding techniques, to replace the "soft" one. In such a context, simple projection operators onto convex sets, like P_{B_{ℓ1}[w_n,δ]}, are inappropriate to fully describe the theoretical ingredients needed to serve such more involved thresholding operations. A new family of mappings has to be introduced in order to unify existing thresholding techniques, and to give new directions towards a more efficient usage of the available a-priori knowledge in online set theoretic estimation problems. It will be seen in Sections V and VI that the present study also stands as a first attempt towards the generalization of the results of [16]–[18] to non-convexly constrained online learning tasks.

IV. THE PENALIZED LEAST-SQUARES THRESHOLDING OPERATORS (PLSTO)
Going back to (3), choose N ∈ N*, and define U_n := [u_n, u_{n-1}, . . . , u_{n-N+1}] ∈ R^{L×N}, as well as y_n := [y_n, y_{n-1}, . . . , y_{n-N+1}]^t ∈ R^N and v_n := [v_n, v_{n-1}, . . . , v_{n-N+1}]^t ∈ R^N. Then, it can be easily verified that (3) takes the form y_n = U_n^t a∗ + v_n, ∀n ∈ N. The mainstream of sparsity-aware algorithms belongs to batch techniques: freeze n, and utilize all the gathered N training data to solve, in most cases iteratively, the following penalized least-squares minimization task,
$$ \min_{a \in \mathbb{R}^L} \; \frac{1}{2}\left\| y_n - U_n^t a \right\|^2 + \lambda \sum_{i=1}^{L} p(|a_i|), \qquad (6) $$

where p : R → [0, ∞) stands for a sparsity-promoting penalty function, λ ∈ (0, ∞) is the regularization parameter, and a_i stands for the i-th coordinate of the vector a. Choices for p are numerous; if, for example, p(|a|) := χ_{R\{0}}(|a|), ∀a ∈ R, where χ_A stands for the characteristic function with respect to A ⊂ R (χ_A(α) := 1 if α ∈ A, and χ_A(α) := 0 if α ∉ A), then the regularization term Σ_{i=1}^{L} p(|a_i|) becomes the ℓ0-norm of a. In the case where p(|a|) := |a|, ∀a ∈ R, the regularization term is nothing but the ℓ1-norm ‖a‖_1 := Σ_{i=1}^{L} |a_i|, and the task (6) becomes the celebrated LASSO [25], [26]. It has been observed that if some of the LASSO's regularity conditions are violated, then the LASSO is sub-optimal for model selection [27]–[31]. Such a behavior has motivated the search for non-convex penalty functions p, which bridge the gap between the ℓ0- and the ℓ1-norm; for example, the ℓγ penalty, for γ ∈ [0, 1], [29], [32], the log- [29], [33], the SCAD [33], [34], the MC+ [30], [31], and the transformed ℓ1 [33], [35] penalties.

Recently, sparsity-aware coordinate-wise optimization techniques for solving the task (6) have been attracting a lot of interest [4], [31], [36]. Indeed, there are special cases [14], [33] where a coordinate-wise treatment of the general problem (6) is firmly justified; assuming, for example, N = L, that the matrix U_n is orthogonal, and defining ã_n := U_n y_n, (6) can be equivalently viewed as the following separable optimization task [14], [33],
$$ \min_{a \in \mathbb{R}^L} \; \sum_{i=1}^{L} \left\{ \frac{1}{2\lambda} (\tilde{a}_i - a_i)^2 + p(|a_i|) \right\}. \qquad (7) $$
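For concreteness, the coordinate-wise minimizers of (7) for two classical penalties, namely the ℓ1 penalty and the hard-thresholding penalty discussed next, can be sketched in Python as follows (our illustration, not code from the paper):

import numpy as np

def soft_threshold(a_tilde, lam):
    # Minimizer of (7) per coordinate for p(|a|) = |a| (the LASSO penalty):
    # shrink towards zero by lam and clip at zero.
    return np.sign(a_tilde) * np.maximum(np.abs(a_tilde) - lam, 0.0)

def hard_threshold(a_tilde, lam):
    # Minimizer of (7) per coordinate for the hard-thresholding penalty:
    # keep coordinates with magnitude above lam, zero out the rest.
    return np.where(np.abs(a_tilde) > lam, a_tilde, 0.0)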

Even if the original task (6) is not separable, a coordinate-wise treatment, as in (7), albeit in a sub-optimal context, is gaining a lot of attention, due to algorithmic implementations that exhibit low computational load and increased scalability to the dimension of the problem at hand [4], [31], [36].

Given ã, we will refer to the operator which maps ã to a minimizer of (7) as the Penalized Least-Squares Thresholding Operator (PLSTO). For example, if p(|a|) := (λ² − (|a| − λ)² χ_{[0,λ)}(|a|))/λ, ∀a ∈ R, then the resulting thresholding operator is the celebrated Hard Thresholding (HT) one [33], which is depicted in Fig. 1d. In the case of the penalty p that leads to the LASSO task, (7) generates the well-known Soft Thresholding (ST) operator (Fig. 1a). Due to the separability of the minimization task (7) in coordinates, a PLSTO can also be viewed as a function defined on the real axis. PLSTOs have played a key role in numerous applications, such as wavelet shrinkage and denoising [37], [38], regression [34], and variable selection [33]. Moreover, both the ST and HT operators have been effectively employed in iterative thresholding schemes for fast sparse signal recovery under the compressed sensing framework [10]–[13], [24].

Fig. 1. Plots of Penalized Least-Squares Thresholding Operators (PLSTO) for various choices of the penalty function p in (7).

Fig. 1 shows some of the most commonly employed PLSTOs. Figs. 1b and 1c depict the thresholding rules corresponding to the MC+ penalty [30], [31] and the Smoothly Clipped Absolute Deviation (SCAD) penalty [34], respectively. The latter resembles the firm shrinkage operator [39]. Both SCAD and MC+ leave large components unchanged, like HT does, while avoiding discontinuity and allowing a linear/gradual transition between the "kill" and the "keep" regimes of HT. HT is far from being the only discontinuous thresholding operator. Two examples are shown in Fig. 1e and Fig. 1f. The first one is the widely known Bridge threshold [32], which is related to the ℓγ penalty, γ < 1 [40]. The second stems from the log-penalty [31]. Note that these two thresholding rules comprise nonlinear segments. Continuous thresholding functions that contain nonlinear parts are shown in Fig. 1(g–i). More specifically, the nonnegative garrote [41] and representatives of the n-degree garrote threshold are shown in Fig. 1g, the hyperbolic shrinkage rule [42] is shown in Fig. 1h, and a representative of thresholding operators stemming from the nonlinear diffusive filtering approach, namely the Weickert threshold [14], is shown in Fig. 1i.

V. A NOVEL OPERATOR THEORETICAL FRAMEWORK WHICH PROMOTES SPARSITY

The key concept of this section is the notion of an operator or mapping defined on the Euclidean space R^L, i.e., T : R^L → R^L. Another concept of fundamental importance, associated to every mapping T, is its fixed point set Fix(T) := {a ∈ R^L : T(a) = a}. In other words, Fix(T) reveals the hidden modes of T, by putting together all those points unaffected by T. To leave no place for ambiguity, every Fix(T) that appears in the sequel is assumed nonempty. In the theory of convex analysis [43], [44], a number of mappings have been defined as tools to facilitate a number of theoretical arguments, including convergence results of iterative schemes. Such mappings range from more classical and widely known ones to less familiar ones. For the sake of completeness, and in order to make them available to the reader in a common frame, we provide the main definitions.

Definition 1.
1) Nonexpansive mappings. A mapping T : R^L → R^L is called nonexpansive if ‖T(x) − T(y)‖ ≤ ‖x − y‖, ∀x, y ∈ R^L [44], [45]. A celebrated example of this class of mappings, with a principal role in the theory and applications of convex analysis, is the (metric) projection mapping P_C onto a closed convex set C [43], [44] (see also App. A). The fixed point set Fix(T) is well-known to be closed and convex [44], [45]. For example, Fix(P_C) = C.
2) Strongly or η-attracting nonexpansive mappings. If there exists an η > 0 such that ∀x, y ∈ R^L,
$$ \eta \left\| (I_L - T)(x) - (I_L - T)(y) \right\|^2 \le \|x - y\|^2 - \|T(x) - T(y)\|^2, $$
where I_L stands for the identity matrix of dimension L, then T is called strongly or η-attracting nonexpansive. It is easy to verify that this class of operators is a subset of the class of nonexpansive mappings. The relaxed projection mapping T_C^{(λ)}, λ ∈ (0, 2), onto a closed convex set (see App. A) is such an example; indeed, it is well-known [44] that T_C^{(λ)} is ((2 − λ)/λ)-attracting nonexpansive. This class of operators includes compositions and convex combinations of (relaxed) projection mappings [44]. Such a composition of projection mappings is the engine behind the success of the celebrated Projections Onto Convex Sets (POCS) method [15], [20]–[23].
3) Quasi-nonexpansive mappings. The mapping T is called quasi-nonexpansive if ∀x ∈ R^L, ∀y ∈ Fix(T), ‖T(x) − y‖ ≤ ‖x − y‖. It can be verified that the fixed point set of any quasi-nonexpansive operator is convex [44]. It can be readily verified that this class of operators contains the class of nonexpansive mappings. An example, which is known to be quasi-nonexpansive but not nonexpansive, is the subgradient projection operator [44] (see (19) in App. D); that is the driving force behind the majority of the algorithms that employ the subgradients of a convex loss function in order to solve non-smooth minimization tasks.
4) Partially quasi-nonexpansive mappings. To the best of our knowledge, this class of mappings appears for the first time in the literature. A mapping T is called partially quasi-nonexpansive if ∀x ∈ R^L, ∃Y_x ⊂ Fix(T) : ∀y ∈ Y_x, ‖T(x) − y‖ ≤ ‖x − y‖. It can be seen that this class of operators contains, as special cases, all of the previously presented classes of mappings. To see this, notice that if we insert Y_x := Fix(T), ∀x ∈ R^L, then T becomes quasi-nonexpansive.

The notable characteristic of a partially quasi-nonexpansive mapping is that its fixed point set is no longer necessarily convex. An example of such a mapping is the Generalized Thresholding (GT) operator T_GT^{(K)}, which is defined in the following Section V-A. There, we will verify that Fix(T_GT^{(K)}) is a union of subspaces, which is indeed a non-convex set. Recall that at the heart of any sparsity-aware learning is the search for a solution that lies in a union of subspaces [46]. The reason for defining this new class of operators is twofold: (i) many members of this family promote sparsity, and (ii) this new class of operators introduces tools which help us to generalize the very recent results, obtained for the Adaptive Projected Subgradient Method (APSM) [18], to non-convexly constrained online learning tasks (see App. D). To the best of our knowledge, this is also the first time that convergence results are presented for an online, time-adaptive [19] algorithm (see Section VI) that deals with non-convex a-priori knowledge, at a computational complexity that scales linearly with the number of unknowns. Indicative sparsity-promoting members of the aforementioned class are the following.

• The projection operator onto weighted ℓ1-balls adopted in [9]: Given a time instant n, a vector of weights w_n ∈ R^L, where w_{n,i} > 0, ∀i ∈ 1, L, and a radius δ > 0, the weighted ℓ1-ball is defined as
$$ B_{\ell_1}[w_n, \delta] := \left\{ a \in \mathbb{R}^L : \sum_{i=1}^{L} w_{n,i} |a_i| \le \delta \right\}. \qquad (8) $$
The (metric) projection operator P_{B_{ℓ1}[w_n,δ]} onto B_{ℓ1}[w_n, δ], which is intimately related to the weighted ℓ1-norm [27], [47], was shown to promote sparsity in [9]. Since B_{ℓ1}[w_n, δ] is a closed convex set, the relaxed projection mapping T_{B_{ℓ1}[w_n,δ]}^{(λ)}, and thus P_{B_{ℓ1}[w_n,δ]} of [9], belongs to the class of partially quasi-nonexpansive mappings, according to Def. 1.2.
• The newly defined GT operator, T_GT^{(K)}, to be introduced in Section V-A, is indeed a member of the class of partially quasi-nonexpansive mappings, as is verified in App. C.

The univariate penalized least-squares problem of (7) can be alternatively viewed through the framework of Moreau envelopes and proximal mappings [43], [48]. A subclass of proximal mappings, in the case where p is lowersemicontinuous [44] and convex, has been studied in [44], [49], [50] and, in the context of time-adaptive learning, in [8]. It can be verified [44], [49], [50] that such proximal mappings are 1-attracting nonexpansive ones, and thus belong to the proposed class of partially quasi-nonexpansive operators, according to Def. 1.2.

A mapping T will be called partially strongly or η-attracting quasi-nonexpansive, if there exists an η > 0 such that 2

2

2

∀x ∈ RL , ∃∅ 6= Yx ⊂ Fix(T ) : ∀y ∈ Yx , η kx − T (x)k ≤ kx − yk − kT (x) − yk . In the case where Yx = Fix(T ), ∀x ∈ RL , then T is called strongly or η-attracting quasi-nonexpansive. (K)

Next, the novel generalized thresholding mapping, TGT , which can incorporate any PLSTO as a shrinkage function, is presented. (K)

A. The Generalized Thresholding (GT) operator TGT . (K)

Given a K ∈ 1, L and any a ∈ RL , the proposed GT operator output z := TGT (a) is obtained coordinate-wise as follows:   (K)  al , l ∈ Ja , ∀l ∈ 1, L, zl := (9)  (K) shr(al ), l ∈ / Ja , (K)

where Ja

contains all those positions (indices), which correspond to the K largest, in absolute value, components of the vector

a, and shr denotes a user-defined shrinkage function. In simple words, GT acts as follows: given the input vector a ∈ RL , identify, first, its K largest, in magnitude, components, while apply to the rest of them the shrinkage function shr (see Fig. 2

8

x3 a2 x2

a1 (K)

TGT (a1) M(1,2) M(2,3) (K)

THT (a1) x1 Fig. 2.

(K)

An illustration of TGT , for the case of a 3-dimensional space, i.e., L := 3, and K := 2. Take for example the point a1 . The K = 2 largest, in (K)

magnitude, coordinates of a1 are the two first ones, i.e., Ja1 function shr. If this third coordinate is set to 0, then

(K) TGT

(K)

= (1, 2). These coordinates stay unaffected by TGT , while the third one is shrinked by the

acts as a hard-thresholding mapping (K)

(K)

M(2,3) , i.e., its first coordinate is 0. Hence, a2 is a fixed point of TGT , and applying TGT

(K) THT .

On the other hand, the point a2 is already located in

to it will have no effect.

for an illustration, and Def. 5 in App. C for a more mathematically rigorous definition). The requirements for shr are mild, so (K)

that any PLSTO can be used in the place of shr, as Remark 1 of App. C suggests. It can be verified in App. C that TGT is partially strongly quasi-nonexpansive, and not quasi-nonexpansive, and that its fixed point set is a union of subspaces, which is a non-convex set. Bridge

0.5

(a) Fig. 3.

(b)

Plots of thresholding rules belonging to GT family; (a) Based on known PLSTO and (b) constructed arbitrarily.

Two examples of thresholding rules based on the GT are shown in Fig. 3a-b. In Fig. 3a, the shrinkage function shr has been chosen to be either the bridge ℓ0.5 or the 2-degree garrote thresholding functions, drawn with dashed and solid lines respectively. (K)

The number ξa

indicates the value of the smaller among the K larger components of the vector a. In Fig. 3b, a GT example,

where the shrinkage function has been selected arbitrarily without being related to any known penalty function, is shown. Note that the proposed GT is allowed to be highly discontinuous. VI. T HE P ROPOSED A LGORITHM Algorithm 1 (The Adaptive Projection-based Generalized Thresholding (APGT) algorithm). Given the desired sparsity level K ∈ 1, L, the parameter ǫ > 0 of the hyperslabs, the number q ∈ N∗ of the hyperslabs to be processed concurrently at every time instant, the function shr for the generalized thresholding operation, and an arbitrary initial point, a0 ∈ RL , execute the following, for every n ∈ N.

9

1) Define the sliding window Jn := max{0, n − q + 1}, n on the time axis, of size at most q. The set Jn defines all the indices corresponding to the hyperslabs, which are to be processed at the time instant n. 2) Having at our disposal the current estimate an , identify the indices In := {i ∈ Jn : PSi [ǫ] (an ) 6= an }, which correspond to the active hyperslabs. In other words, In identifies all those hyperslabs of Jn which can provide with “new” information, as opposed to the rest, which leave the current estimate an unaffected, i.e., ∀i ∈ Jn \ In , PSi [ǫ] (an ) = an ⇔ an ∈ Si [ǫ]. In other words, active hyperslabs are those which do not contain an . P (n) (n) (n) (n) 3) Choose {ωi }i∈In such that ωi ∈ (0, 1] and i∈In ωi = 1. That is, for every i ∈ In , the weight ωi grades the (n)

importance of the information carried by Si [ǫ]. For example, a fair approach is to let ωi

:= 1/|In |, ∀i ∈ In , where |In |

denotes the cardinality of In . 4) ∀i ∈ In , compute the projection PSi [ǫ] (an ) (15). 5) Choose an extrapolation parameter µn from the range (0, 2Mn ), where P 2 (n) P kPSi [ǫ] (an )−an k (n) i∈In ωi  

P 2

, if i∈In ωi PSi [ǫ] (an ) 6= an , (n)

i∈In ωi PSi [ǫ] (an )−an Mn :=  1, otherwise.

(10)

2

Notice that due to the convexity of the function k·k , we always have Mn ≥ 1. As such, our parameter µn can take

values larger or equal to 2. 6) Compute the next estimate by an+1 :=

(K) TGT

a n + µn

X

(n) ωi PSi [ǫ] (an )

i∈In

− an

!!

.

(11)

Under some assumptions, which can be found in App. D, a convergence analysis for APGT can be established, which states, among others, that the sequence (an )n∈N is (Fej´er) monotonous, the set of all cluster points of (an )n∈N is nonempty, all of these cluster points are sparse vectors, of sparsity level at most K, and the sequence of estimates approaches arbitrarily (K)

an intersection of hyperslabs Sn [ǫ]. Since the APGT is based on the mapping TGT , whose fixed point set is non-convex, the convergence analysis of App. D can also be seen as the first attempt to generalize the results of [16]–[18] to non-convexly constrained online learning tasks. VII. P ERFORMANCE S TUDY The sparsity-aware adaptive learning techniques can be split in two categories depending on their computational complexity requirements. On the one hand, the low complexity algorithms, e.g. [3], [7], need a number of mathematical operations which are comparable with that of the Least Mean Squares (LMS) algorithm. As expected, the reduced complexity operation is accompanied by performance sacrifices. On the other hand, the high computational complexity algorithms, e.g., [4], need O(L2 ) operations per iteration, resembling the recursive least-squares (RLS) algorithm. The proposed projection-based adaptive techniques have the merit to be capable of operating with tunable computational requirements via adjusting the parameter q, i.e., the number of hyperslabs, which are considered in each iteration. When q takes small values, say q ≤ 5, the method operates in the low complexity regime with complexity similar to LMS. Of course, adopting the low complexity operation the algorithm does not exhibit its full performance potential. When q is increased, convergence is accelerated up to a high enough q value, denoted by q∗ , where any further increase does not lead to noticeable improvements. The q∗ is hereafter characterized as best and specifies the high complexity regime operational mode of the algorithm. A method for estimating the best q as a function of L and K will be presented elsewhere. It has to be noted that for moderately sparse

10

signals, q∗ is much lower compared to the signal dimension L. Moreover, it has to be stressed out that the algorithm exhibits low sensitivity with respect to the best q value. Experimentally, deviations of even up to 10% from q∗ are hardly noticeable in the performance. Furthermore, noticeable performance improvements w.r.t. LMS type algorithms can be obtained for values of q, as low as 10, for typical application scenarios. A. Performance Study of PLSTO-based GT The performance of the proposed algorithm when the shrinkage function, shr, in (9) assumes some well known PLSTOs is ˆ (K)

studied next in both high and low computational complexity regimes of operation. The resulted sparsity inducing operator TGT ˆ components unaltered, where K ˆ is an estimate of the actual K := ka∗ k . The signal under consideration has leaves the K 0

L = 1024 and K = 100, and the noise variance equals to σ 2 = 0.1. Moreover, the extrapolation parameter µn , is set equal to (n)

1.8 × Mn , ωj

are all getting the same value and the hyperslabs parameter ǫ was set equal to σ.

12 12

12

Fig. 4. Performance results of the proposed method employing PLSTO-based generalized thresholds. (a-b) correspond to q∗ , i.e. operation in high computational complexity mode, whereas (c) corresponds to q = 2, i.e. operation in low computational complexity mode.

Fig. 4a corresponds to q = 390, which turned out to be the best q for the specific L, K combination. In other words, for this value the algorithms achieve their best convergence speed. Each Mean Square Error (MSE) performance curve, is constructed based on 50, ensemble averaged, independent realizations, i.e. MSE(n) = (r)

(r)

50

2 1 X

(r)

a∗ − a(r) n , 50L r=1

(12)

where an denotes the estimate, of a∗ , obtained based on the first n samples of the r-th realization. Curves marked with asterisks, crosses and x-crosses correspond to the proposed adaptive projection (AP) method based on the SCAD (APSCAD) thresholding rule, which operates entry-wise according to the following; given the input vector a ∈ RL , the i-th coordinate, i ∈ 1, L, of the output vector z := TSCAD (a) are given by the next rule:    0, if |ai | ≤ λ,      sgn(ai )(|ai | − λ), if λ < |ai | ≤ 2λ, zi =  (α−1)ai −αλ sgn(ai )  , if 2λ < |ai | ≤ αλ,  α−2     a , if |ai | > αλ, i (K)

where α and λ are user defined parameters. It is obvious that when ξa

(13)

> αλ, K ≥ 1, then (13) follows the proposed GT

model (9). In this case, all the components of x, which are greater than αλ, remain unaltered. In the penalized LS based variable

11

selection problem, α, often takes the value 3.7 or slightly larger (up to 5) depending on whether L is smaller or larger than 100 [34]. The curves with crosses and x-crosses correspond to fixed λ values and particularly, to the values 0.04 and 0.01 respectively. The α was set equal to 12 since it led to somewhat better results compared to the aforementioned α values. In any case, it is apparent that the obtained performance is significantly inferior compared to the APWL1 method, which is indicated with squares. Recall that APWL1 [9] uses projections onto weighted ℓ1 -balls instead of the GT. In the previous experiment, the shrinkage function is fixed throughout the learning task. However, the developed GT operator can adaptively change from iteration to iteration, a property, which is supported by the theoretical analysis already discussed. This can be exploited in order to embody a priori known information about the sparsity level of the signal under consideration ˆ ˆ of the true sparsity level K. Indeed, for α fixed, if λ is recursively set equal to (ξa(K) − δ)α, where based on the estimate K n ˆ larger δ an arbitrarily small value, then in each iteration the thresholding operator is adapted in order to shrink all but the K ˆ = K, is referred to as APSCAD adapt and it achieves performance very components. In Fig. 4a, the specific method, setting K close to APWL1. The sensitivity of the proposed methods in inaccurate estimates of K is investigated in section VII-C. The curve exhibiting the best convergence speed, which is indicated with triangles, uses as shr function the one corresponding (K)

to the ℓ0.5 penalty. In this particular case, K in (9) was set equal to 1 in order for TGT to resemble as much as possible the original bridge threshold as it is shown in Fig. 1e. The regularization parameter, λ, which is used in the corresponding univariate (K)

penalized least-squares problem, is, once again, sparsity-level dependent. In particular λ = bξan . The curve given in the figure is optimized, via simulations, with respect to b (particularly b = 0.06). The less complex thesholding rule is inevitably the hard thresholding (HT) one. In the general context of our GT formulation, ˆ of the actual sparsity level K, HT sets all but the largest (in magnitude) K ˆ the HT rule becomes: Having an estimate K components of a to zero. In the cases where the K larger components might not be uniquely defined, then the smallest possible indices are chosen. The corresponding performance curve is marked with circles. Based on Fig. 4a, we can conclude that in the high computational mode regime the performance obtained by all the PLSTObased thresholding operators studied here, do not exhibit large differences, provided that the thresholding operator is not fixed ˆ A we will see next, the performance achieved is similar to that obtained solving the LASSO but adaptively obtained using K. problem. In Fig. 4b, the proposed methods operating in full complexity mode, i.e. with q = 390, are compared against the state-of-the-art sparsity-aware online algorithms, such as the Online Cyclic Coordinate Descent - Time Weighted Lasso (OCCD-TWL) and Time and Norm Weighted LASSO (OCCD-TNWL) [4] indicated with triangles and asterisks respectively. These online algorithms are of great interest since they succeed in attaining performance similar to LASSO using closed form adaptation equations. Their regularization parameters are optimized such as in [9]. As a reference for comparison, the APWL1 and APHT performance curves have been transferred from Fig. 4a to Fig. 4b. OCCD-TWL exhibits similar convergence speed to that of the proposed methods, attaining, however, lower MSE values after about 1200 samples. OCCD-TWL performance in the specific example settings is virtually coincident to the solution of the batch LASSO problem. The OCCD-TNWL incorporates in OCCD-TWL a SCAD-based reweighting scheme along the principles of the local linear approximation method [28]. Interestingly, it can be seen that OCCD-TNWL succeeds in outperforming the rest of the algorithms only after about L iterations. This is due to the fact that only after L iterations the method can have reliable information about the weights, since they are obtained based on tentative estimates, which are provided by a Recursive Least-Squares (RLS) algorithm, which runs in parallel. Indeed, reliable unconstraint Least Squares estimates can only be achieved based on at least L observations. It should be stressed out that the proposed methods are considerably less complex than both OCCD algorithms, since, as it will be discussed in VII-E, the latter

12

schemes are of O(L2 ) complexity. In the same figure, the curve exhibiting the worst performance and marked with crosses, corresponds to the low-complexity SpAdOMP [7] algorithm. Apparently, it cannot compete with the proposed methods, when they function in the high-complexity regime. The performance of the proposed methods when they are operating in reduced computational complexity mode, by setting q = 2, is studied in Fig. 4c. We can observe that the situation has significantly changed compared to the q = q∗ case since, at this low complexity operation regime, the performance can drastically change using different thresholding operators. More specifically, both APHT as well as APWL1 converge much slower than the rest of the methods, such as those using SCAD (curve with asterisks) and bridge based on ℓ0.05 penalty1 (curve marked with triangles). Similarly to the case of Fig. 4a, the best results of APSCAD and AP-ℓ0.05 were achieved with sparsity-level dependent parametrization. The latter algorithm achieved its best performance with b = 0.001. A thorough study of the forms that GT can take, their implications in practice as well as techniques for efficient parametrization of the methods is beyond the scope of this paper. However, in order to exhibit the flexibility and potential of the proposed GT operator in combination with the set-theoretic projection based framework, we next focus on sparsity-inducing thresholding operators, which lead to reduced computational complexity implementations, compared to APWL1. B. The Piecewise linear thresholding operator Many of the proposed thresholding rules, as it is shown in Fig. 1 and Fig. 3, are discontinuous, something that, in many cases, is considered a disadvantage. For example, it has been reported that such thresholding rules are not suitable in coordinate descent algorithms [31] and might result in instabilities in model prediction [27]. However, in our case, we never noticed any problematic behavior using discontinuous thresholding functions in the proposed set theoretic framework. On the contrary, as it can be deduced from the previous examples, they often lead to the best performances. Motivated by this and the advantages regarding complexity requirements of the HT, we next propose and study a novel piece-wise linear threshold in an attempt to achieve enhanced performance under reduced computational complexity demands. The sparsity promotion via projections onto weighted ℓ1 balls [9] has the following drawbacks with respect to computational complexity requirements: a) it does not lead to estimates with a fixed and predefined sparsity level at each iteration, b) it requires O(L) multiplications and divisions, and c) it requires a full sorting of the unknown vector values, which takes O(L log2 L) sorting operators. On the other hand, HT is a multiplications and divisions-free operator. It only requires the detection of the ˆ K-th order statistic of x, which can be effectively performed in linear complexity, O(L), using, for example, the median-ofmedians selection algorithm [51]. Moreover, HT leads to estimates per iteration which have fixed sparsity level and equal to the ˆ predetermined value K. ˆ smaller components. However, such a “strict” strategy A major drawback of HT is that it sets to exactly zero all the L − K may push to zero coefficients which, actually, belong to the support, especially at the early iterations of the algorithm, prior to convergence, where the obtained estimates may not be good. 
A more “gentle” treatment, would be, to shrink their values ˆ values. Several such strategies have been proposed in the literature, with the to some degree, instead of zeroing these L − K SCAD penalty being among the most favorite ones. Here, we propose and study a thresholding rule, which is simpler than the SCAD, referred to as Piecewise Linear Thresholing (PLT) rule, which operates as follows; given a ∈ RL , the i-th coordinate, 1 In

this low complexity configuration, bridge ℓ0.5 exhibited degraded performance.

13

(K,b)

i ∈ 1, L, of the output z := TPLT (a) is defined by the next relations:    0, if |ai | ≤ P2 ,    zi = sgn(ai )(|ai | − bP2 ), if P2 < |ai | ≤ P1 , ,     x , if |ai | > P 1, i (2K)

(K)

where P1 = ξa , P2 = ξa

(14)

, and b is a free parameter taking values in [0, 1].

0

Fig. 5.

Plot of the piecewise linear thresholding operator.

Schematically, the PLT operator is shown in Fig. 5. It can be observed that it shares common attributes with both hard and soft thresholding. The major advantage of PLT, is that it can be nearly considered as a multiplication free operator (it needs 1 ˆ nonzero multiplication only). Moreover, similarly to HT, PLT leads to estimates of fixed sparsity level, equal to, at most, 2K components. Next we study the behavior of PLT operator, for different b values when applied to the q = 2 and the q = 390 cases depicted in Fig. 6(a,b) and Fig. 6(c,d) respectively. More specifically, b is allowed to take distinct values in the interval [0, 1] with a step of 0.05. The study concerns both the steady state performance and the convergence speed. Error floor (q=2)

Error floor (Best q)

Convergence speed (q=2)

b

b

Convergence speed (Best q)

b

b

(d) Fig. 6.

Study for the optimization of the b parameter used in PLT.

Let us first focus in Fig. 6a, where the behavior of APPLT with respect to the error floor for different values of b is considered. The curves denoted with circles, asterisks and squares correspond to simulation examples of sparsity ratio, i.e. ρ = K/L ∗ 100% equal to 1%, 10% and 30% respectively. In all the simulations shown in Fig. 6, the (L, K) combination pairs used were the

14

(200, 2), (200, 20), (200, 60) leading to the aforementioned ρ values. Different combinations of the same sparsity ratio produced similar results. Fig. 6a, shows the percentage deviation between the performance achieved using any b and the best one of them, say ¯b. In other words, for any b value, the corresponding curve takes the value (eb − e¯b )/e¯b ∗ 100, where eb = MSEb (N ), MSEb is the MSE curve obtained with the specific b value. N takes a value large enough in order to guarantee that the steady state regime has been reached. The following conclusions can be drawn: a) error floor performance close to the best one is achieved for b close around to 0.2, b) the larger the sparsity ratio, the less sensitive the method is to the choice of parameter b. Indeed, when ρ = 10%, any b value leads to error floors which are less that 10% worse compared to the one achieved with the best b value. Fig. 6b is obtained in a similar way, with the difference that instead of eb , the iteration number, say cb ∈ [1, N ], at which the algorithm is entering the converging phase, is used. Here, we assume that this phase starts when the algorithm reaches the 60% of its overall MSE performance2 expressed in logarithmic scale. Precisely, we use as cb the iteration number at which log10 (MSEb (cb )) takes a value lower than log10 (MSEb (1)) + (log10 (MSEb (N )) − log10 (MSEb (1)))60/100 for the first time. Indicatively, the APWL1 of Fig. 4a is assumed to enter the converging stage when log10 (MSE) < −3. Getting back to Fig. 6b, it is observed that smaller values of b are favored over the larger ones. We conclude that in the low complexity regime, i.e. with q as small as 2, a low b value leads to enhanced performance in terms of both error floor and convergence speed. Fig. 6(c,d) shows the corresponding curves when the method operates in the high computational regime using q = q∗ which, for the specific (L, K) settings, amount to q equal to 27, 93 and 167 for the 1%, 10% and 30% sparsity ratios respectively. We observe, that in contrast to the q = 2 case, b parameter need to acquire different values depending on whether a low error floor or a fast convergence speed is of preference. For example, the best convergence speed is achieved with b = 0.8, whereas the lower error floors, with the exemption of the extensively sparse case of ρ = 1%, are achieved with b values lower than 0.5. Note, however, that the convergence speed is more sensitive to deviations from the best b values, since a low b value can lead to up to 50% decrease of the convergence speed. As a conclusion, when the method operates in the high computational complexity regime, a large b value (around 0.8) is preferred. The performance attained by the proposed PLT-based AP method (APPLT), compared to the methods discussed before, is shown with dashed curves indicated with diamonds in Fig. 4b and Fig. 4c, for the high and the low computational regimes respectively. It has been observed that APPLT exhibits a better performance when a somewhat larger value than the true sparsity ˆ (K,b) ˆ := K + 3 is adopted in the aforementioned simulation examples. We , where K level K was used. As a result, the PLT T PLT

observe that in the high computational complexity mode, APPLT achieves performance similar to APWL1, whereas in the low one (with q = 2), APPLT exhibits the best performance. In any case, APPLT performs better than APHT, without requiring significant computational burden increases. A short discussion on computational complexities is held in section VII-E. The SpAdOMP method [7], represents the state of the art of low complexity sparsity-aware algorithms, therefore a detailed comparison of APPLT with it is of high interest. In Fig. 4c, APPLT with q = 2 appeared to outperform SpAdOMP. Note that in the specific example, ρ ≈ 10% since L = 1024 and K = 100. As it will be seen next, the relative performance between the two algorithms highly depends on the sparsity ratio ρ. Fig. 7a depicts a performance comparison of the two methods when ρ = 10%. For this specific experiment, the values of L = 600 and K = 60 where used. however, the conclusions are similar for other L,K combinations, as long as the sparsity ratio remains of the same order. The curves indicated with triangles, circles and squares correspond to the proposed APPLT using q = 1, 2 and 5 respectively. The gradual convergence speed enhancement, that ˆ ˆ in T (K,b) was set equal to K + 3. The curves indicated with asterisks is achieved as q increases, is readily seen. As before, K PLT

2 The

60% value is merely arbitrarily chosen. In fact anything from 30% to 80% lead to similar conclusions.

15

(a) Fig. 7.

(b)

Performance comparison between APPLT and SpAdOMP for the cases of (a) sparsity ratio equal to 0.1 and (b) sparsity ratio equal to 0.03.

and x-crosses correspond to the SpAdOMP method for two different values of the step size α. Similarly to LMS algorithm, larger step values lead to faster convergence but higher error floors and vice versa. We can observe, that when α = 0.2 is chosen so as to achieve the same error floor with APPLT then SpAdOMP converges with speed similar to the APPLT with q = 1. It should be stressed out that when q = 1, APPLT has slightly lower computational complexity than SpAdOMP. When the step size of SpAdOMP is increased to α = 0.3 in order to achieve convergence speed closer to the q = 5 setting, then the error floor is significantly increased. When the signal under consideration is very sparse, then SpAdOMP is favored over APPLT. This is shown in Fig. 7b, where K decreases to 18, leading to a sparsity ratio equal to 3%. We observe, that in this very sparse signal case, q = 5 or even larger is needed in order for APPLT to perform similarly to SpAdOMP. However, it might be questioned whether such low sparsity ratios are realistic in practice or not. With respect to the user-defined parameters, that one has to tune in each method, SpAdOMP needs the tuning of the step size and a forgetting factor, so the difficulty is similar to that of the LMS and the RLS algorithms. On the other hand, APPLT needs the tuning of parameters µ, ǫ, q and b. In a first glance, it might seem that APPLT is more demanding regarding fine tuning. However, this is not true. Parameter q is an integer, which is set according to the available computational power and in general, the larger the better. Parameter µ, as long as it takes a value close to 2, has little influence on the performance; so the value 1.8 was chosen here and it was kept the same for all experiments in this paper. The method is not particularly sensitive to parameter ǫ either. A rough estimate of the noise standard deviation is enough. Accordingly, parameter b, is the only parameter ˆ is needed in both which needs a somewhat careful setting according to the study in Fig. 6. An estimate of the sparsity level K, methods. C. Robustness against inaccurate sparsity level estimates ˆ was set equal to the true sparsity level K. However in practice, the exact sparsity In the examples reported before, K level might not be known in advance, so the sensitivity in such inaccuracies is studied next. Our attention is turned to the computationally efficient APPLT algorithm. However, for the shake of comparison with PLSTO-based GT, results of the AP-ℓγ are also shown. Fig. 8 corresponds to L = 1024, K = 100 and shows the performance degradation resulting from underestimated and overestimated K values for the APPLT and AP-ℓγ algorithms. The achieved performance when the algorithms adopt the true sparsity level is shown with the curves indicated with x-crosses and triangles for the APPLT and AP-ℓγ . In the sub-figures (a) and (b), the q = 390 and q = 2 cases are depicted, respectively and the γ and b parameters, when applicable, were taken

16

(a)

Fig. 8.

(b)

Study of the sensitivity that the proposed methods exhibit to inaccurate estimates of the true sparsity level.

to be the same as the values in sections VII-A and VII-B, respectively. Between the two computational complexity modes, a difference in the APPLT behavior is noticeable. In the best q case on the left figure, an overestimation of K (curve indicated with diamonds) affects the convergence speed of the algorithm. On the contrary, when q = 2 the error floor is affected. Moreover, in this latter case, APPLT appears to be less sensitive to K overestimation. Indeed even a 100% overestimation - curve marked with circles - results in limited elevation of the error floor. On the other hand, AP-ℓγ exhibits somewhat more robust behavior on the overestimation of K. Moreover, in general, both methods are more sensitive on the underestimation of K. Such a behavior is not surprising since it has also been observed for APWL1 as well [9]. D. Time varying example

Fig. 9.

Performance evaluation in time-varying conditions.

Fig. 9 shows the ability of the tested algorithms to track an abrupt change, which is realized after 1500 observations. This is a typical setting used in adaptive filtering community to study the tracking agility of an algorithm. In the first half, the signal under consideration has the characteristics of the one discussed in Fig. 4a. However, at the mid-time point of 1500, ten randomly selected components, change their values from 0 to a randomly selected nonzero one. Thus, the signal after 1500 has a sparsity level of 110. As a result, in the second half, the AP methods operate with an under-estimate of the true sparsity level. It can be concluded that AP-ℓ0.5 and APHT appear to be more sensitive compared to the rest of the proposed methods. Moreover, the APPLT, with b = 0.1, reaches lower error floors, albeit at a slower convergence speed. OCCD-TWT performance deteriorates compared to Fig. 4b, due to the fact that a forgetting factor lower than 1 is adopted, in order to succeed in re-estimating the

17

unknown signal after the abrupt change. In all the cases before, the proposed algorithm operated in the high complexity regime using q = 390. In fact, this abrupt change case can not be handled effectively by the low complexity algorithms. As an example, the performance of SpAdOMP is displayed with the curve indicated with crosses. E. Computational complexity issues (n)

Under, the commonly adopted set up in practice, that all ωi

are given the same value, and kuk = 1, the basic recursive

scheme of (11), including (10), at each iteration, requires (q + 2)Kn + qL + 2L multiplications, when q > 1, or 2L, for q = 1. We denote by Kn the sparsity level of the estimate in the current iteration n. For the APWL1 case, where the sparsity level is not fixed in each iteration, then the worse case scenario, i.e. Kn = L, should be considered. On the contrary, for thresholding ˆ and 2K, ˆ respectively. rules such as HT and PLT, Kn equals to K The sparsity promoting approach necessitates some extra computational burden for all the methods discussed here. In particular, the projection onto the weighted ℓ1 -ball needs 3L multiplications and 2L divisions on top of O(L log L) sorting operations. The hard-thresholding operation needs just O(L) comparisons, whereas the PLT needs a single multiplication plus O(L) comparisons ˆ and 2K ˆ order statistics. SCAD and bridge ℓγ are computationally more demanding. for the detection of the K OCCD-TWL needs 2L2 +2L+(L+1)Kn multiplications and Kn+1 divisions, and OCCD-TNWL requires 4L2 +6L+(L+1)Kn multiplications and L+Kn+1 divisions. It should be stressed out that in the case considered in this paper, i.e., signal reconstruction of signals that are sparse in a transform domain, translation time-invariance cannot be exploited in order to reduced the complexity of the OCCD-based methods. Finally, the low complexity SpAdOMP method, in the case of nearly orthonormal sampling vectors and normalized LMS ˆ multiplications, one division, and O(L) comparisons for the retrieving of the 2K ˆ order statistic. updates, needs 2L + 7K VIII. C ONCLUSIONS A novel online sparsity-aware learning scheme has been presented, that can deal in a unifying way with different thresholding rules, including nonconvex ones. Several variants of the algorithm were studied and compared with state-of-the-art online methods of both low and high computational complexity. The proposed scheme exhibited enhanced complexity-performance trade-off in both time-constant and time-variant environments requiring a limited computational complexity burden. A PPENDIX A E LEMENTS OF C ONVEX A NALYSIS A subset C of RL will be called convex if every line segment {λa + (1 − λ)a′ : λ ∈ [0, 1]}, with endpoints any a, a′ ∈ C, lies in C. Given any set C ⊂ RL , define the (metric) distance function d(·, C) : RL → R to C as follows: ∀a ∈ RL , d(a, C) := inf{ka − vk : v ∈ C}. If we further assume now that C is closed and convex, then the (metric) projection onto C is defined as the mapping PC : RL → C, which maps to an a ∈ RL the unique PC (a) ∈ C, such that ka − PC (a)k = d(a, C). For example, the metric projection mapping PSn [ǫ] onto the hyperslab Sn [ǫ] (4) is given by the following simple analytic formula:  yn −ǫ−utn a   un , if yn − ǫ > utn a, 2    kun k PSn [ǫ] (a) = a + 0, (15) if |utn a − yn | ≤ ǫ,      yn +ǫ−u2tn a u , if y + ǫ < ut a. n n n ku k n

Given a projection mapping, PC , onto the closed convex set C, and a λ ∈ (0, 2), the relaxed projection mapping is defined (λ)

as the mapping TC (λ) Fix(TC )

:= IL + λ(PC − IL ), where IL stands for the identity matrix of dimension L. It is well-known that

= Fix(PC ) = C [44].

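Similarly, a relaxed projection can be obtained by a thin wrapper around any projection routine; the following is only a sketch, reusing the hypothetical project_hyperslab helper above.

def relaxed_projection(project, a, lam):
    """Relaxed projection T_C^{(lam)}(a) = a + lam * (P_C(a) - a), with lam in (0, 2)."""
    return a + lam * (project(a) - a)

# e.g., relaxed_projection(lambda x: project_hyperslab(x, u, y, eps), a, lam=1.5)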

APPENDIX B
COMPOSITION OF PARTIALLY QUASI-NONEXPANSIVE MAPPINGS

Next is a lemma which will be useful in the analysis of App. D.

Lemma 1 (Composition of partially quasi-nonexpansive mappings). Given the mappings T_1, T_2 : R^L → R^L, with Fix(T_1) ∩ Fix(T_2) ≠ ∅, assume that there exist η_1, η_2 > 0 such that ∀x ∈ R^L, ∃Y_x ⊂ Fix(T_1) ∩ Fix(T_2) : ∀y ∈ Y_x,

\eta_1 \|x - T_1(x)\|^2 \le \|x - y\|^2 - \|T_1(x) - y\|^2, \qquad \eta_2 \|x - T_2(x)\|^2 \le \|x - y\|^2 - \|T_2(x) - y\|^2.

Notice that the previous condition implies that both T_1 and T_2 are partially strongly attracting quasi-nonexpansive. Then, the composition T_2 T_1 is partially ((η_1 η_2)/(η_1 + η_2))-attracting quasi-nonexpansive, i.e., ∀x ∈ R^L, ∃Y_x ⊂ Fix(T_2 T_1) such that ∀y ∈ Y_x,

\frac{\eta_1 \eta_2}{\eta_1 + \eta_2} \|x - T_2 T_1(x)\|^2 \le \|x - y\|^2 - \|T_2 T_1(x) - y\|^2.

Proof: The condition in the statement of the lemma and the steps, which can be found in [52, Prop. 1.d-i], guarantee that Fix(T_2 T_1) = Fix(T_1) ∩ Fix(T_2). Let Y_x take the place of Fix(T_1) ∩ Fix(T_2) and follow exactly the same steps as in [52, Prop. 1.d-iii] to establish the claim of Lemma 1.

APPENDIX C
THE GENERALIZED THRESHOLDING MAPPING AND ITS PROPERTIES

Definition 2 (The ordered tuple notation). Given L ∈ N*, the notation 1, L will stand for the set {1, 2, . . . , L}. Given K ∈ 1, L, define the set of all ascending tuples of length K, out of the alphabet 1, L, as T(K, L) := {(l_(1), l_(2), . . . , l_(K)) : l_(1), l_(2), . . . , l_(K) ∈ 1, L, 1 ≤ l_(1) < l_(2) < . . . < l_(K) ≤ L}. The parentheses in the subscripts of the indices, taken from the order statistics literature, are used in order to emphasize the fact that those indices have been sorted in ascending order. It is easy to verify that the cardinality of T(K, L) is $\binom{L}{K}$.

To save space in the sequel, we abuse the previous notation and allow ourselves to treat tuples also as sets. For example, whenever we write l ∈ J, for some J ∈ T(K, L), we mean that l is one of the K components of J. Whenever l ∉ J appears, we mean that l ∈ 1, L \ J. In the case where J_1 ⊂ J_2, for some J_1 ∈ T(K_1, L) and J_2 ∈ T(K_2, L), with K_1 ≤ K_2, we mean that each component of J_1 is also a component of J_2. In this sense, and given a vector x ∈ R^L, the support of x is defined as the ordered tuple supp(x) := (l ∈ 1, L : x_l ≠ 0). If |supp(x)| stands for the cardinality of supp(x), then clearly supp(x) ∈ T(|supp(x)|, L).

Definition 3 (Subspace associated to a tuple). Given J ∈ T(K, L), let M_J := {a ∈ R^L : a_l = 0, ∀l ∉ J}. Clearly, M_J is a linear subspace of R^L. Moreover, notice that if J_1 ⊂ J_2, then M_{J_1} ⊂ M_{J_2}. In particular, if supp(x_*) ⊂ J, then x_* ∈ M_J. An illustration of M_J can be found in Fig. 2.

Definition 4 (Finding the largest, in absolute value, components of a vector). Fix a K ∈ 1, L. Given a vector x ∈ R^L, define inductively the following indices: ∀k ∈ 1, K, l_k := min{arg max{|x_l| : l ∉ {l_1, . . . , l_{k−1}}}}, where, in order to avoid confusion, we set {l_1, . . . , l_0} := ∅ for the case of k = 1. Then, define J_x^{(K)} := (l_(1), l_(2), . . . , l_(K)), where the parentheses appearing in the subscripts mean that the indices {l_1, l_2, . . . , l_K} have been sorted in ascending order. In other words, J_x^{(K)} contains all those positions which correspond to the K largest, in absolute value, components of the vector x. The presence of the min operator is to avoid non-uniqueness problems; we always choose the smallest index when the same absolute value appears simultaneously in multiple positions.
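As a small illustrative aside (not part of the paper), Def. 4 can be realized with a stable sort so that ties in magnitude are resolved in favor of the smallest index; the helper name k_largest_support is ours.

def k_largest_support(x, K):
    """J_x^(K) of Def. 4: positions of the K largest |x_l|, with ties resolved
    in favor of the smallest index, returned in ascending order."""
    order = sorted(range(len(x)), key=lambda l: (-abs(x[l]), l))
    return tuple(sorted(order[:K]))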


Definition 5 (Generalized Thresholding (GT) operator T_GT^{(K)}). Given a K ∈ 1, L, we define T_GT^{(K)} : R^L → R^L as follows. For any x ∈ R^L, the output z := T_GT^{(K)}(x) is obtained according to the following steps:

1) Compute, first, the tuple J_x^{(K)} = (l_(1), l_(2), . . . , l_(K)) ∈ T(K, L) of Def. 4.
2) Define ξ_x^{(K)} := min{|x_l| : l ∈ J_x^{(K)}}. That is, ξ_x^{(K)} stands for the K-th largest, in magnitude, component of the vector x. Clearly, ∀l ∉ J_x^{(K)}, |x_l| ≤ ξ_x^{(K)}.
3) Compute the components of z as: ∀l ∈ 1, L,

z_l := \begin{cases} x_l, & l \in J_x^{(K)}, \\ \operatorname{shr}(x_l), & l \notin J_x^{(K)}, \end{cases}

where the function shr : R → R satisfies the following properties:
4) τ shr(τ) ≥ 0, ∀τ ∈ R.
5) |shr(τ)| ≤ |τ|, ∀τ ∈ R. In other words, the function shr shrinks all but the K largest, in magnitude, elements of the vector x.
6) Given ǫ > 0, there exists a δ > 0 such that, for all τ satisfying ǫ ≤ |τ| ≤ ξ_x^{(K)}, we have |shr(τ)| ≤ |τ| − δ. That is, shr acts as a strict shrinkage operator over all intervals which do not include 0.

Notice here that we do not impose any regularity conditions, like continuity, on shr. Notice, also, that most of the known PLSTO [29]–[32], [34], [35] are continuous, as can be readily verified in Fig. 1.

Remark 1 (Any PLSTO can be used as the shrinkage function shr in T_GT^{(K)}). A characterization of the PLSTO can be found in [33, Thm. 1]. According to [33, Thm. 1], the basic requirements on the penalty function of (7) are that p should be nonnegative, nondecreasing, and differentiable on (0, ∞). Then, surely, the minimization task of (7) possesses a unique solution [33]. In a coordinate-wise sense, the operator which maps a given ã_i to the unique minimizer of (7), i.e., the PLSTO, becomes a function, denoted here by θ̂. It was proved in [33, Thm. 1], among other properties, that θ̂ is antisymmetric and that it satisfies Defs. 5.4 and 5.5. Moreover, for an appropriately defined real value p_0, it was observed in [33, Thm. 1] that ∀|τ| ≤ p_0, |θ̂(τ)| = 0, i.e., θ̂ acts as a thresholder. These properties can also be verified in Fig. 1 for all the known PLSTO [29]–[32], [34], [35].

Notice that, since p is assumed to be nondecreasing, then p′(|τ|) ≥ 0, ∀τ ∈ R. It was also shown in [33, Thm. 1] that if p′ is nonincreasing, then ∀|τ| > p_0, |θ̂(τ)| ≤ |τ| − λp′(|τ|). This means that for all those τ such that |τ| > p_0 and p′(|τ|) > 0, the PLSTO θ̂ strictly shrinks the magnitude of τ. For |τ| ≤ p_0, the strict shrinkage property of θ̂ is obvious, due to its thresholding properties. This reminds us of Def. 5.6. Indeed, if p is chosen such that p′ is continuous on (0, ∞) and p′(|τ|) > 0, ∀τ ∈ R, as is the case for most of the known penalties in PLSTO, then Def. 5.6 holds true if we define δ := min_{|τ| ∈ [ǫ, ξ_x^{(K)}]} p′(|τ|) > 0. For penalties that do not satisfy such conditions, such as those which produce the HT and the SCAD, it can be verified in Fig. 1 that we can always define their parameters in a way such that they achieve strict shrinkage over the interval [ǫ, ξ_x^{(K)}] in Def. 5.6. Summarizing, any known PLSTO [29]–[32], [34], [35] can serve as a candidate for the function shr.
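To make the definition concrete, the following sketch instantiates T_GT^{(K)} with soft thresholding playing the role of shr; the function gt_operator and the threshold parameter thr are our own illustrative choices, not the paper's implementation, and any other PLSTO could replace the shrinkage line.

import numpy as np

def gt_operator(x, K, thr):
    """Generalized thresholding T_GT^(K) (Def. 5): keep the K largest-magnitude
    entries untouched and apply shr to the rest; here shr is soft thresholding."""
    x = np.asarray(x, dtype=float)
    z = np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)   # shr applied coordinate-wise
    keep = np.argsort(-np.abs(x), kind="stable")[:K]    # J_x^(K); ties -> smallest index
    z[keep] = x[keep]                                    # the K dominant entries pass through
    return z

For instance, gt_operator([3.0, -0.2, 0.05, 1.5], K=2, thr=0.1) leaves 3.0 and 1.5 untouched, shrinks −0.2 to −0.1, and maps 0.05 to 0.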

Theorem 1 (Properties of the generalized thresholding operator T_GT^{(K)}).

1) ∀x ∈ R^L, if z := T_GT^{(K)}(x), then J_z^{(K)} = J_x^{(K)}.
2) Fix(T_GT^{(K)}) = ⋃_{J ∈ T(K,L)} M_J. Notice, here, that as a union of subspaces, Fix(T_GT^{(K)}) is non-convex.
3) Assuming that K < L, a sequence (x_n)_{n∈N} ⊂ R^L and an x_* ∈ R^L,

if \begin{cases} \lim_{n\to\infty} x_n = x_*, \\ \lim_{n\to\infty} (I - T_{GT}^{(K)})(x_n) = 0, \end{cases} then x_* ∈ Fix(T_GT^{(K)}).

This property can be rephrased as I − T_GT^{(K)} being demiclosed at 0 [45].
4) T_GT^{(K)} is partially 1-attracting quasi-nonexpansive, i.e., ∀x ∈ R^L, ∀y ∈ M_{J_x^{(K)}},

\|x - T_{GT}^{(K)}(x)\|^2 \le \|x - y\|^2 - \|T_{GT}^{(K)}(x) - y\|^2.

The previous property obviously does not hold true if we put Fix(T_GT^{(K)}) in the place of M_{J_x^{(K)}}. Therefore, the existence of T_GT^{(K)} suggests that the class of partially quasi-nonexpansive mappings strictly contains the class of quasi-nonexpansive ones.
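A quick numerical sanity check of property 4, under the soft-thresholding instance of shr sketched above, could read as follows; it relies on the hypothetical helpers gt_operator and k_largest_support introduced earlier and is meant only as an illustration.

import numpy as np

rng = np.random.default_rng(0)
L, K, thr = 20, 5, 0.3
x = rng.standard_normal(L)
z = gt_operator(x, K, thr)

# any y in M_{J_x^(K)}: zero everywhere outside the K dominant positions of x
y = rng.standard_normal(L)
mask = np.zeros(L, dtype=bool)
mask[list(k_largest_support(x, K))] = True
y[~mask] = 0.0

lhs = np.sum((x - z) ** 2)
rhs = np.sum((x - y) ** 2) - np.sum((z - y) ** 2)
assert lhs <= rhs + 1e-12   # ||x - T(x)||^2 <= ||x - y||^2 - ||T(x) - y||^2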

Proof:

1) By Def. 4, the claim of Thm. 1.1 is false if and only if one of the following two cases holds true: either (i) ∃l_0′ ∉ J_x^{(K)} such that |z_{l_0′}| > |z_l|, ∀l ∈ J_x^{(K)}, or (ii) ∃l_0 ∈ J_x^{(K)}, ∃l_0′ ∉ J_x^{(K)} such that |z_{l_0}| = |z_{l_0′}| and l_0 > l_0′.
For the first case, since l_0′ ∉ J_x^{(K)}, then ∀l ∈ J_x^{(K)}, |x_{l_0′}| ≥ |shr(x_{l_0′})| = |z_{l_0′}| > |z_l| = |x_l| ≥ |x_{l_0′}|, which is absurd.
For the second case, since l_0 ∈ J_x^{(K)} and l_0′ ∉ J_x^{(K)}, we have |x_{l_0′}| ≥ |shr(x_{l_0′})| = |z_{l_0′}| = |z_{l_0}| = |x_{l_0}| ≥ |x_{l_0′}|, which implies |x_{l_0′}| = |x_{l_0}| and, by Def. 4, l_0 < l_0′. This contradicts our assumption that l_0 > l_0′. The previous two contradictions establish the claim of Thm. 1.1.

2) Pick any x ∈ ⋃_{J ∈ T(K,L)} M_J. It is easy to verify by Def. 5 that T_GT^{(K)}(x) = x, i.e., x ∈ Fix(T_GT^{(K)}). To prove the opposite inclusion, assume any x ∈ Fix(T_GT^{(K)}), i.e., T_GT^{(K)}(x) = x. Since, ∀l ∈ J_x^{(K)}, the relation T_GT^{(K)}(x) = x leads to the trivial result x_l = x_l, we deal here only with the more interesting case of l ∉ J_x^{(K)}. For such an l, according to Def. 5, we must have shr(x_l) = x_l, which implies that |shr(x_l)| = |x_l|. However, by the properties of shr, given in Defs. 5.5 and 5.6, we necessarily obtain that x_l = 0. Since this holds ∀l ∉ J_x^{(K)}, Def. 3 suggests that x ∈ M_{J_x^{(K)}}. Now, recall that J_x^{(K)} ∈ T(K, L) to establish the inclusion x ∈ ⋃_{J ∈ T(K,L)} M_J.

3) The proof will be developed along a sequence of steps.

a) Since K < L, it becomes clear that, ∀n, we can always find an l ∉ J_{x_n}^{(K)}.

b) Assume, for a contradiction, that there exist an ǫ > 0, a subsequence (n_k)_{k∈N}, and an l_{n_k} ∉ J_{x_{n_k}}^{(K)}, ∀k ∈ N, such that |x_{n_k, l_{n_k}}| ≥ ǫ, ∀k ∈ N. By Def. 5.6, ∃δ > 0 such that |shr(x_{n_k, l_{n_k}})| ≤ |x_{n_k, l_{n_k}}| − δ, ∀k. Then, it is easy to verify that, ∀k, |x_{n_k, l_{n_k}} − shr(x_{n_k, l_{n_k}})| ≥ |x_{n_k, l_{n_k}}| − |shr(x_{n_k, l_{n_k}})| ≥ |x_{n_k, l_{n_k}}| − |x_{n_k, l_{n_k}}| + δ = δ. This in turn implies that, ∀k,

\sum_{l \notin J_{x_{n_k}}^{(K)}} (x_{n_k,l} - \operatorname{shr}(x_{n_k,l}))^2 \ge (x_{n_k,l_{n_k}} - \operatorname{shr}(x_{n_k,l_{n_k}}))^2 \ge \delta^2. \qquad (16)

Notice that ‖(I − T_GT^{(K)})(x_n)‖² = Σ_{l ∉ J_{x_n}^{(K)}} (x_{n,l} − shr(x_{n,l}))². Hence, the assumption that lim_{n→∞} ‖(I − T_GT^{(K)})(x_n)‖² = 0 implies that, for the δ of (16), ∃n_0 ∈ N such that ∀n ≥ n_0,

\sum_{l \notin J_{x_n}^{(K)}} (x_{n,l} - \operatorname{shr}(x_{n,l}))^2 < \delta^2.

This contradicts (16). In other words, our initial claim is wrong, and the contrary proposition becomes: ∀ǫ > 0, there exists an n_0 ∈ N such that ∀n ≥ n_0 and ∀l ∉ J_{x_n}^{(K)}, |x_{n,l}| < ǫ. This can be equivalently written in a more compact form as follows:

\lim_{n\to\infty} \max\{|x_{n,l}| : l \notin J_{x_n}^{(K)}\} = 0. \qquad (17)

c) Pick arbitrarily any l ∈ J_{x_*}^{(K)}. There are two possible cases regarding l and the sequence of tuples (J_{x_n}^{(K)})_{n∈N}.

i) Assume that ∃n_0 ∈ N such that ∀n ≥ n_0, l ∈ J_{x_n}^{(K)}. This implies that ∀n ≥ n_0, J_{x_*}^{(K)} ⊂ J_{x_n}^{(K)}. However, since the cardinality of both J_{x_*}^{(K)} and J_{x_n}^{(K)} is K, we actually have that ∀n ≥ n_0, J_{x_*}^{(K)} = J_{x_n}^{(K)}. Choose, now, any l′ ∉ J_{x_*}^{(K)}. Recall our initial assumption that lim_{n→∞} x_n = x_*, the fact that l′ ∉ J_{x_n}^{(K)}, ∀n ≥ n_0, and (17), to deduce that x_{*,l′} = 0. Since l′ was chosen arbitrarily from the complement of J_{x_*}^{(K)}, we necessarily have that x_* ∈ ⋃_{J ∈ T(K,L)} M_J.

ii) Assume, now, that ∄n_0 ∈ N such that ∀n ≥ n_0, l ∈ J_{x_n}^{(K)}. Equivalently, this can be written as: ∃(n_k)_{k∈N} such that ∀k ∈ N, l ∉ J_{x_{n_k}}^{(K)}. However, our initial assumption that lim_{n→∞} x_n = x_* and (17) imply that x_{*,l} = 0. Since l ∈ J_{x_*}^{(K)}, i.e., x_{*,l} is one of the K largest, in absolute value, components of x_*, it is easy to verify that ∀l′ ∉ J_{x_*}^{(K)}, we must have x_{*,l′} = 0. In other words, x_* ∈ ⋃_{J ∈ T(K,L)} M_J. This establishes the claim of Thm. 1.3.

4) Define R_K := 2T_GT^{(K)} − I. Given any x ∈ R^L, let z := T_GT^{(K)}(x), as in Def. 5. Then, verify that ∀y ∈ M_{J_x^{(K)}},

\|R_K(x) - y\|^2 = \sum_{l \in J_x^{(K)}} (2z_l - x_l - y_l)^2 + \sum_{l \notin J_x^{(K)}} (2z_l - x_l - y_l)^2
                 = \sum_{l \in J_x^{(K)}} (x_l - y_l)^2 + \sum_{l \notin J_x^{(K)}} (2\operatorname{shr}(x_l) - x_l)^2
                 \le \sum_{l \in J_x^{(K)}} (x_l - y_l)^2 + \sum_{l \notin J_x^{(K)}} (x_l - 0)^2 = \|x - y\|^2.

The previous inequality is obtained from the observation that the properties of shr in Def. 5 suggest shr²(x_l) ≤ x_l shr(x_l), and from the following elementary calculations: (2 shr(x_l) − x_l)² = 4 shr²(x_l) + x_l² − 4x_l shr(x_l) ≤ 4 shr²(x_l) + x_l² − 4 shr²(x_l) = x_l². Hence, ∀x ∈ R^L, ∀y ∈ M_{J_x^{(K)}},

\|R_K(x) - y\|^2 \le \|x - y\|^2
\Leftrightarrow \|2T_{GT}^{(K)}(x) - x - y\|^2 \le \|x - y\|^2
\Leftrightarrow \|2(T_{GT}^{(K)}(x) - y) - (x - y)\|^2 \le \|x - y\|^2
\Leftrightarrow \|x - T_{GT}^{(K)}(x)\|^2 \le \|x - y\|^2 - \|T_{GT}^{(K)}(x) - y\|^2,

where, in order to obtain the last equivalence, we used some elementary algebra and the fact that 2⟨x − y, T_GT^{(K)}(x) − y⟩ = ‖x − y‖² + ‖T_GT^{(K)}(x) − y‖² − ‖x − T_GT^{(K)}(x)‖². This establishes the claim of Thm. 1.4.

APPENDIX D
CONVERGENCE ANALYSIS OF THE PROPOSED ALGORITHM

The convergence analysis of the APGT, i.e., the following Thm. 2, is based on the next set of assumptions.

Assumption 1.

1) Assume that ∃n_0 ∈ N such that ∀n ≥ n_0, Ω_n := M_{J_{a_n}^{(K)}} ∩ (⋂_{i ∈ I_n} S_i[ǫ]) ≠ ∅. Recall, here, that {S_i[ǫ]}_{i∈I_n} is the set of all active hyperslabs (Alg. 1) at the time instant n. In other words, after some time n_0, the linear subspace M_{J_{a_n}^{(K)}}, defined by the K largest components of a_n, owns a non-empty intersection with the active hyperslabs {S_i[ǫ]}_{i∈I_n}. Provided that the designer chooses an appropriate ǫ for the hyperslabs, this is a natural assumption to make due to the following reason. For such an ǫ, the hyperslabs contain, with high probability, the desired a_*. As time goes by, and due to a long sequence of projections, (a_n)_{n∈N} is attracted closer and closer to some intersection of hyperslabs; thus, also to a_*. Assuming that K ≥ |supp(a_*)|, and that ǫ is properly defined, ⋂_{i ∈ I_n} S_i[ǫ] has a non-empty intersection with M_{J_{a_*}^{(K)}}, and hopefully with M_{J_{a_n}^{(K)}}.

2) Assume that ∃n_0′ ∈ N such that J_∞ := ⋂_{n=n_0′}^{∞} J_{a_n}^{(K)} ≠ ∅ and M_{J_∞} ∩ (⋂_{n=n_0′}^{∞} S_n[ǫ]) ≠ ∅. The existence of J_∞ means that, for every time instant after n_0′, the K largest, in magnitude, components of a_n share a common pattern. The motivation of this assumption follows a similar rationale to that of Assump. 1.1. Moreover, the non-empty intersection ⋂_{n=n_0′}^{∞} S_n[ǫ] guarantees that all of the active hyperslabs {S_i[ǫ]}_{i∈I_n} own a common intersection, ∀n ≥ n_0′ + q − 1.

3) Assume an ǫ′ ∈ (0, 1] such that ∀n ∈ N, µ_n/M_n ∈ [ǫ′, 2 − ǫ′]. We stress here that such an ǫ′ always exists, since (µ_n)_{n∈N} are user-defined quantities.

4) Assume that ω̌ := inf{ω_i^{(n)} : i ∈ I_n, n ∈ N} > 0. Notice, here, that such an ω̌ always exists, since (ω_i^{(n)})_{i∈I_n, n∈N} are user-defined parameters.

Theorem 2 (Properties of the algorithm).

1) Let Assumption 1.1 hold true. Then, ∀n ≥ n_0, d(a_{n+1}, Ω_n) ≤ d(a_n, Ω_n).

2) Let Assumption 1.2 hold true. Define n_0 := n_0′ + q − 1, and let Ω := ⋂_{n ≥ n_0} Ω_n. Recall that q is the maximum size of the sliding window J_n := max{0, n − q + 1}, n, and that I_n ⊂ J_n, in order to verify that the reason behind the definition of n_0 is to guarantee that ⋂_{i ∈ I_n} S_i[ǫ] ≠ ∅, ∀n ≥ n_0′.
a) Then, ∀n ≥ n_0, d(a_{n+1}, Ω) ≤ d(a_n, Ω). In other words, the sequence (a_n)_{n∈N} is Fejér monotone [44] with respect to Ω. This means that every step of the APGT prevents (a_n)_{n∈N} from moving away from Ω.
b) ∀y ∈ Ω, the sequence (‖a_n − y‖)_{n∈N} converges.
c) The set of all cluster points C((a_n)_{n∈N}) of the sequence (a_n)_{n∈N} is nonempty.

3) Let Assumptions 1.2 and 1.3 hold true. Then,
a) C((a_n)_{n∈N}) ⊂ Fix(T_GT^{(K)}) = ⋃_{J ∈ T(K,L)} M_J. In other words, the APGT generates a sequence of estimates (a_n)_{n∈N} whose cluster points are sparse vectors, of sparsity level no larger than K.
b) If there exists an â_* ∈ C((a_n)_{n∈N}) such that |supp(â_*)| = K, then J_∞ ⊂ supp(â_*).
c) If C((a_n)_{n∈N}) = {â_*}, then there exists an n_0 ∈ N such that supp(â_*) ⊂ ⋂_{n=n_0}^{∞} J_{a_n}^{(K)}.

4) Let Assumptions 1.2, 1.3, and 1.4 hold true. Then,
a) lim_{n→∞} d(a_n, S_n[ǫ]) = 0. In other words, as the number of the APGT's recursions tends to infinity, the sequence (a_n)_{n∈N} moves closer and closer to the hyperslabs (S_n[ǫ])_{n∈N}.
b) C((a_n)_{n∈N}) ⊂ lim sup_{n→∞} S_n[ǫ]. If, in particular, C((a_n)_{n∈N}) = {â_*}, then the previous result can be strengthened as â_* ∈ lim inf_{n→∞} S_n[ǫ], where the inner limit lim inf_{n→∞} S_n[ǫ] and the outer limit lim sup_{n→∞} S_n[ǫ] of the sequence (S_n[ǫ])_{n∈N} are defined as [43]

\liminf_{n\to\infty} S_n[\epsilon] = \bigcap_{\epsilon>0} \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} (S_k[\epsilon] + B[0,\epsilon]), \qquad \limsup_{n\to\infty} S_n[\epsilon] := \bigcap_{\epsilon>0} \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} (S_k[\epsilon] + B[0,\epsilon]),

and where S_k[ǫ] + B[0, ǫ] := {s + b : s ∈ S_k[ǫ], b ∈ B[0, ǫ]}.

Proof: The proof is based on the very recent extensions of the Adaptive Projected Subgradient Method (APSM) [18]. Thus, to avoid a repetition of lengthy arguments, the proof of Thm. 2 will not be given here in full detail, since most of the results can be straightforwardly obtained by following exactly the same steps as in [18]. However, to make the proof of Thm. 2 as clear as possible, we will underline here the main similarities and differences between the present study and [18]. It can be shown after some algebra (see, for example, [18, Section 4.2], [53, App. E]) that the main recursion (11) of the proposed algorithm is equivalent to the following form:

a_{n+1} := \begin{cases} T_{GT}^{(K)}\left( a_n - \lambda_n \dfrac{\Theta_n(a_n)}{\|\Theta_n'(a_n)\|^2}\, \Theta_n'(a_n) \right), & \text{if } \Theta_n'(a_n) \ne 0, \\ T_{GT}^{(K)}(a_n), & \text{if } \Theta_n'(a_n) = 0, \end{cases} \qquad (18)

where λ_n := µ_n/M_n ∈ (0, 2), and Θ_n'(a_n) stands for the subgradient [43], [44] of the convex function Θ_n at a_n, which is defined inductively as follows; given a_n, let ∀a ∈ R^L,

\Theta_n(a) := \begin{cases} \sum_{i \in I_n} \dfrac{\omega_i^{(n)}\, d(a_n, S_i)}{\sum_{j \in I_n} \omega_j^{(n)}\, d(a_n, S_j)}\, d(a, S_i), & \text{if } I_n \ne \emptyset, \\ 0, & \text{if } I_n = \emptyset. \end{cases}

Under Assumption 1, (18) can be alternatively viewed as a_{n+1} = T_GT^{(K)} T_{Θ_n}^{(λ_n)}(a_n), ∀n ≥ 0, where T_{Θ_n}^{(λ_n)} is a principal notion in the minimization tasks of non-smooth objective functions in convex analysis; assuming that the set {a ∈ R^L : Θ_n(a) ≤ 0} is nonempty, the relaxed subgradient projection mapping [44] is defined as follows; ∀a ∈ R^L, ∀λ_n ∈ (0, 2),

T_{\Theta_n}^{(\lambda_n)}(a) := \begin{cases} a - \lambda_n \dfrac{\Theta_n(a)}{\|\Theta_n'(a)\|^2}\, \Theta_n'(a), & \text{if } \Theta_n(a) > 0, \\ a, & \text{if } \Theta_n(a) \le 0. \end{cases} \qquad (19)
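To fix ideas, a rough NumPy sketch of a single recursion in the form (18)-(19) is given below. The names apgt_step, active, thr, and lam are ours, the hypothetical helpers project_hyperslab and gt_operator were sketched earlier, equal weights ω_i^{(n)} and normalized sampling vectors (‖u_i‖ = 1) are assumed, and the code is an illustrative approximation rather than the paper's reference implementation.

import numpy as np

def apgt_step(a, active, eps, K, thr, lam=1.0):
    """One APGT recursion in the form (18)-(19): a relaxed subgradient projection
    with respect to Theta_n, followed by generalized thresholding.
    `active` is a list of (u_i, y_i) pairs defining the active hyperslabs."""
    a = np.asarray(a, dtype=float)
    dists = np.array([max(abs(u @ a - y) - eps, 0.0) for (u, y) in active])  # d(a_n, S_i[eps]), ||u_i|| = 1
    if dists.sum() == 0.0:                      # a_n already lies in every active hyperslab
        return gt_operator(a, K, thr)
    w = dists / dists.sum()                     # weights proportional to omega_i * d(a_n, S_i)
    theta = float(w @ dists)                    # Theta_n(a_n)
    grad = np.zeros_like(a)                     # subgradient Theta_n'(a_n)
    for wi, di, (u, y) in zip(w, dists, active):
        if di > 0.0:
            grad += wi * (a - project_hyperslab(a, u, y, eps)) / di
    if grad @ grad > 0.0:
        a = a - lam * theta / (grad @ grad) * grad
    return gt_operator(a, K, thr)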

Remarkably, T_{Θ_n}^{(λ_n)} is also a member of the toolbox of operators given in Def. 1; it is in fact a quasi-nonexpansive mapping [44]. Although (18) and the main algorithmic object of [18] look very similar, there is a large underlying difference. The current state of the APSM [18] uses strongly attracting quasi-nonexpansive mappings in the place of T_GT^{(K)} in (18). Notice, however, that the presently used T_GT^{(K)} belongs to the class of partially quasi-nonexpansive mappings, and, thus, (18) can also be seen as a first step towards generalizing [16]–[18] to online learning tasks under non-convex constraints. Although the gap between convex constraints [18] and non-convex ones, i.e., Fix(T_GT^{(K)}), might seem impassable in general, the special case of Fix(T_GT^{(K)}), which is nothing but a union of linear subspaces (closed convex sets), helps us to apply the arguments of [18] to the present case. The trick behind the present proof lies in the observation that if we isolate one of the linear subspaces which constitute Fix(T_GT^{(K)}), i.e., if we view Fix(T_GT^{(K)}) partially, and if a non-empty intersection of this specific subspace (a closed convex set) with a countable number of hyperslabs is assumed, then the whole discussion dismantles into convex analysis arguments. This is where [18] comes into play.

1) This result follows from Thm. 18.1 and Lem. 28.2 of [18]. Let Ω_n take the place of Y_x in Lemma 1, and follow exactly the same steps as in the proof of [18, Thm. 18.1] to establish the claim of Thm. 2.1.

2) a) The claim follows from [18, Thm. 18.2].
b) This is an immediate consequence of [18, Thm. 18.3].
c) This is a result of [18, Thm. 18.3] and the fact that the strong and weak topologies coincide in the Euclidean R^L.

3) a) The inclusion C((a_n)_{n∈N}) ⊂ Fix(T_GT^{(K)}) is a result of [18, Thm. 18.10]. In order to mobilize [18, Thm. 18.10], [18, Thm. 18.8] is necessary. However, there is a delicate point in the proof of [18, Thm. 18.8] that needs to be verified in the present context. Let us make the connections here between the present study and the proof of [18, Thm. 18.8]. The sequence (a_n)_{n∈N} takes the place of (u_n)_{n∈N}, v stands for y, and T_n is substituted by T_GT^{(K)}. Recall, also, that the proposed algorithm can be alternatively viewed as a_{n+1} = T_GT^{(K)} T_{Θ_n}^{(λ_n)}(a_n), where T_{Θ_n}^{(λ_n)} is introduced in (19).

The delicate point here is whether the inequalities which appear in the proof of [18, Thm. 18.8] still hold in the present context. To be more precise, under Assumption 1, can we guarantee that there exists a y such that those inequalities hold true? Such a question would have a positive answer if we could find an n_0″ ∈ N such that

\bigcap_{n=n_0''}^{\infty} \left( M_{J^{(K)}_{T_{\Theta_n}^{(\lambda_n)}(a_n)}} \cap \Big( \bigcap_{i \in I_n} S_i[\epsilon] \Big) \right) \ne \emptyset.

Any point y, then, from such an intersection could replace v in the proof of [18, Thm. 18.8] and render those inequalities feasible.

The key observation to answer the previous question is Thm. 1.1. Since a_{n+1} = T_GT^{(K)} T_{Θ_n}^{(λ_n)}(a_n), Thm. 1.1 suggests that J_{a_{n+1}}^{(K)} = J^{(K)}_{T_{Θ_n}^{(λ_n)}(a_n)}. Thus, the previous question can be recast as follows; is there an n_0″ ∈ N such that ⋂_{n=n_0″}^{∞} ( M_{J_{a_{n+1}}^{(K)}} ∩ ( ⋂_{i ∈ I_n} S_i[ǫ] ) ) ≠ ∅? The answer is positive if one recalls Assumption 1.2 and considers any n_0″ ≥ n_0, where n_0 is defined in the statement of Thm. 2.2.

b) By definition, ∃(n_k)_{k∈N} such that lim_{k→∞} a_{n_k} = â_*. Recall here, by Thm. 2.3a, that ∀l′ ∉ J_{â_*}^{(K)}, â_{*,l′} = 0. Define ξ_{â_*}^{(K)} := min{|â_{*,l}| : l ∈ supp(â_*)}. Clearly, ∀l ∈ J_{â_*}^{(K)}, ∀l′ ∉ J_{â_*}^{(K)}, 0 = |â_{*,l′}| < ξ_{â_*}^{(K)} ≤ |â_{*,l}|. Hence, there exists k_0 ∈ N such that ∀k ≥ k_0, ∀l ∈ J_{â_*}^{(K)}, ∀l′ ∉ J_{â_*}^{(K)}, |a_{n_k,l′}| < ξ_{â_*}^{(K)}/2 < |a_{n_k,l}|. Since, by assumption, the cardinality of J_{â_*}^{(K)} is K, the previous result suggests that ∀k ≥ k_0,

J_{\hat a_*}^{(K)} = J_{a_{n_k}}^{(K)}. \qquad (20)

It becomes clear, by Assumption 1.2, that if we take k_0 sufficiently large, we obtain that ∀k ≥ k_0, J_∞ ⊂ J_{a_{n_k}}^{(K)}. By (20), J_∞ ⊂ J_{â_*}^{(K)}, which establishes Thm. 2.3b.

c) Define, also here, ξ_{â_*}^{(K)} := min{|â_{*,l}| : l ∈ supp(â_*)}. Since lim_{n→∞} a_n = â_*, there exists n_0 ∈ N such that ∀n ≥ n_0, ∀l ∈ supp(â_*), ∀l′ ∉ supp(â_*), |a_{n,l′}| < ξ_{â_*}^{(K)}/2 < |a_{n,l}|. Now, since |supp(â_*)| ≤ K, then ∀n ≥ n_0, supp(â_*) ⊂ J_{a_n}^{(K)}, which establishes Thm. 2.3c.

4) a) The claim is a direct consequence of Thm. 31.2 and Def. 27 of [18].
b) The result follows from [18, Thm. 31.3].
This completes the proof.

REFERENCES

[1] E. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Inform. Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006.
[2] D. L. Donoho, "Compressed sensing," IEEE Trans. Inform. Theory, vol. 52, pp. 1289–1306, 2006.
[3] Y. Chen, Y. Gu, and A. O. Hero, "Sparse LMS for system identification," in Proceedings of the IEEE ICASSP, 2009, pp. 3125–3128.
[4] D. Angelosante, J. A. Bazerque, and G. B. Giannakis, "Online adaptive estimation of sparse signals: Where RLS meets the ℓ1-norm," IEEE Trans. Signal Proc., vol. 58, no. 7, pp. 3436–3447, July 2010.
[5] B. Babadi, N. Kalouptsidis, and V. Tarokh, "SPARLS: The sparse RLS algorithm," IEEE Trans. Signal Proc., vol. 58, no. 8, pp. 4013–4025, Aug. 2010.
[6] ——, "Asymptotic achievability of the Cramer-Rao bound for noisy compressive sampling," IEEE Trans. Signal Processing, vol. 57, no. 3, pp. 1233–1236, 2009.
[7] G. Mileounis, B. Babadi, N. Kalouptsidis, and V. Tarokh, "An adaptive greedy algorithm with application to nonlinear communications," IEEE Trans. Signal Proc., vol. 58, no. 6, pp. 2998–3007, June 2010.
[8] Y. Murakami, M. Yamagishi, M. Yukawa, and I. Yamada, "A sparse adaptive filtering using time-varying soft-thresholding techniques," in Proceedings of the IEEE ICASSP, Dallas: USA, Mar. 2010, pp. 3734–3737.
[9] Y. Kopsinis, K. Slavakis, and S. Theodoridis, "Online sparse system identification and signal reconstruction using projections onto weighted ℓ1 balls," IEEE Trans. Signal Proc., vol. 59, no. 3, pp. 905–930, Mar. 2011.
[10] T. Blumensath and M. E. Davies, "Normalized iterative hard thresholding: Guaranteed stability and performance," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2, pp. 298–309, Apr. 2010.
[11] T. Blumensath, "Accelerated iterative hard thresholding," 2011, preprint.
[12] S. Foucart, "Hard thresholding pursuit: an algorithm for compressive sensing," SIAM J. Numerical Analysis, 2011, to appear.
[13] V. Cevher, "On accelerated hard thresholding methods for sparse approximation," EPFL, Tech. Rep., 2011.
[14] A. Antoniadis, "Wavelet methods in statistics: some recent developments and their applications," Statistics Surveys, vol. 1, pp. 16–55, 2007.
[15] P. L. Combettes, "The foundations of set theoretic estimation," Proc. IEEE, vol. 81, no. 2, pp. 182–208, 1993.
[16] I. Yamada and N. Ogura, "Adaptive projected subgradient method for asymptotic minimization of sequence of nonnegative convex functions," Numerical Functional Analysis and Optimization, vol. 25, no. 7&8, pp. 593–617, 2004.
[17] K. Slavakis, I. Yamada, and N. Ogura, "The adaptive projected subgradient method over the fixed point set of strongly attracting nonexpansive mappings," Numerical Functional Analysis and Optimization, vol. 27, no. 7&8, pp. 905–930, 2006.
[18] K. Slavakis and I. Yamada, "The adaptive projected subgradient method constrained by families of quasi-nonexpansive mappings and its application to online learning," 2011, conditionally accepted for publication in the SIAM J. Optimization. [Online]. Available: http://arxiv.org/abs/1008.5231
[19] A. H. Sayed, Fundamentals of Adaptive Filtering. New Jersey: John Wiley & Sons, 2003.
[20] L. M. Bregman, "The method of successive projections for finding a common point of convex sets," Soviet Math. Dokl., vol. 6, pp. 688–692, 1965.
[21] L. G. Gubin, B. T. Polyak, and E. V. Raik, "The method of projections for finding the common point of convex sets," USSR Comput. Math. Phys., vol. 7, pp. 1–24, 1967.
[22] D. C. Youla and H. Webb, "Image restoration by the method of convex projections: part 1—theory," IEEE Trans. Med. Imaging, vol. MI-1, pp. 81–94, Oct. 1982.
[23] H. H. Bauschke and J. M. Borwein, "On projection algorithms for solving convex feasibility problems," SIAM Review, vol. 38, no. 3, pp. 367–426, Sept. 1996.
[24] I. Daubechies, M. Fornasier, and I. Loris, "Accelerated projected gradient method for linear inverse problems with sparsity constraints," Journal of Fourier Analysis and Applications, vol. 14, no. 5-6, pp. 764–792, Dec. 2008.
[25] R. Tibshirani, "Regression shrinkage and selection via the LASSO," J. Royal. Statist. Soc. B., vol. 58, no. 1, pp. 267–288, 1996.
[26] E. v. Berg and M. P. Friedlander, "SPGL1: A solver for large-scale sparse reconstruction," June 2007. [Online]. Available: http://www.cs.ubc.ca/labs/scl/spgl1
[27] H. Zou, "The adaptive LASSO and its oracle properties," Journal of the American Statistical Association, vol. 101, pp. 1418–1429, Dec. 2006.
[28] H. Zou and R. Li, "One-step sparse estimates in nonconcave penalized likelihood models," The Annals of Statistics, vol. 36, no. 4, 2008.
[29] J. H. Friedman, "Fast sparse regression and classification," Dept. of Statistics, Stanford University, Tech. Rep., 2008.
[30] C.-H. Zhang, "Nearly unbiased variable selection under minimax concave penalty," Annals of Statistics, vol. 38, no. 6, pp. 894–942, 2010.
[31] R. Mazumder, J. H. Friedman, and T. Hastie, "SPARSENET: Coordinate descent with nonconvex penalties," J. American Statistical Association, 2011, to appear.
[32] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109–135, 1993.
[33] A. Antoniadis and J. Fan, "Regularization of wavelet approximations," J. American Statistical Association, vol. 96, pp. 939–967, 2001.
[34] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, vol. 96, pp. 1348–1360, 2001.
[35] M. Nikolova, "Local strong homogeneity of a regularized estimator," SIAM Journal of Applied Mathematics, vol. 61, pp. 633–658, 2000.
[36] N. Simon, J. H. Friedman, T. Hastie, and R. Tibshirani, "Regularization paths for Cox's proportional hazards model via coordinate descent," Journal of Statistical Software, vol. 39, no. 5, pp. 1–15, 2011.
[37] D. L. Donoho and J. M. Johnstone, "Ideal spatial adaptation by wavelet shrinkage," Biometrika, vol. 81, no. 3, pp. 425–455, 1994.
[38] S. Mallat, A Wavelet Tour of Signal Processing, 2nd ed. San Diego: Academic Press, 1999.
[39] H.-Y. Gao and A. G. Bruce, "Waveshrink with firm shrinkage," Statistica Sinica, vol. 7, no. 4, pp. 875–892, 1997.
[40] D. A. Lorenz, "Convergence rates and source conditions for Tikhonov regularization with sparsity constraints," Journal of Inverse and Ill-posed Problems, vol. 16, no. 5, pp. 463–478, 2008.
[41] H.-Y. Gao, "Wavelet shrinkage denoising using the non-negative garrote," Journal of Computational and Graphical Statistics, vol. 7, no. 4, pp. 469–488, Dec. 1998.
[42] T. Tao and B. Vidakovic, "Almost everywhere behavior of general wavelet shrinkage operators," Applied and Computational Harmonic Analysis, vol. 9, no. 1, pp. 72–82, 2000.
[43] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis. Berlin: Springer, 2004.
[44] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.
[45] K. Goebel and W. A. Kirk, Topics in Metric Fixed Point Theory. Cambridge University Press, 1990.
[46] Y. Lu and M. Do, "A theory for sampling signals from a union of subspaces," IEEE Transactions on Signal Processing, vol. 56, no. 6, pp. 2334–2345, June 2008.
[47] E. Candès, M. B. Wakin, and S. P. Boyd, "Enhancing sparsity by reweighted ℓ1 minimization," J. Fourier Anal. Appl., vol. 14, pp. 877–905, 2008.
[48] J.-J. Moreau, "Fonctions convexes duales et points proximaux dans un espace Hilbertien," Acad. Sci. Paris Sér. A Math., vol. 255, pp. 2897–2899, 1962.
[49] P. L. Combettes and V. R. Wajs, "Signal recovery by proximal forward-backward splitting," Multiscale Model. Simul., vol. 4, pp. 1168–1200, 2005.
[50] P. L. Combettes and J.-C. Pesquet, "Proximal splitting methods in signal processing," in Fixed-Point Algorithms for Inverse Problems in Science and Engineering, H. H. Bauschke, R. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, Eds. Springer-Verlag, 2011.
[51] M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan, "Time bounds for selection," Journal of Computer and System Sciences, vol. 7, no. 4, pp. 448–461, 1973.
[52] I. Yamada and N. Ogura, "Hybrid steepest descent method for variational inequality problem over the fixed point set of certain quasi-nonexpansive mappings," Numerical Functional Analysis and Optimization, vol. 25, no. 7&8, pp. 619–655, 2004.
[53] K. Slavakis, S. Theodoridis, and I. Yamada, "Adaptive constrained learning in reproducing kernel Hilbert spaces: the robust beamforming case," IEEE Trans. Signal Processing, vol. 57, no. 12, pp. 4744–4764, Dec. 2009.