Parametric Dictionary Learning for Graph Signals

Dorina Thanou, David I Shuman, and Pascal Frossard
Signal Processing Laboratory (LTS4), Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
Email: {dorina.thanou, david.shuman, pascal.frossard}@epfl.ch
Abstract—We propose a parametric dictionary learning algorithm to design structured dictionaries that sparsely represent graph signals. We incorporate the graph structure by forcing the learned dictionaries to be concatenations of subdictionaries that are polynomials of the graph Laplacian matrix. The resulting atoms capture the main spatial and spectral components of the graph signals of interest, leading to adaptive representations with effective implementations. Experimental results demonstrate the efficiency of the proposed algorithm for the sparse approximation of graph signals.
I. INTRODUCTION

Graphs are flexible data representation tools, suitable for modeling the geometric structure of signals that live on topologically complicated domains. Examples of such signals can be found in social, transportation, energy, and sensor networks [1]. In these applications, the vertices of the graph represent the discrete data domain, and the edge weights capture the pairwise relationships between the vertices. A graph signal is then defined as a function that assigns a real value to each vertex. A simple example of a graph signal is the current temperature at each location in a sensor network.

We are interested in finding meaningful graph signal representations that i) capture the most important characteristics of the graph signals, and ii) are sparse. That is, given a weighted graph and a class of signals on that graph, we want to construct an overcomplete dictionary of atoms that can sparsely represent graph signals from the given class as linear combinations of only a few atoms in the dictionary. An additional challenge when designing dictionaries for graph signals is that, in order to identify and exploit structure in the data, we need to account for the intrinsic geometric structure of the underlying weighted graph. This is because signal characteristics such as smoothness depend on the topology of the graph on which the signal resides (see, e.g., [1, Example 1]).

For signals on Euclidean domains as well as signals on irregular data domains such as graphs, the choice of the dictionary often involves a tradeoff between two desirable properties: the ability to adapt to specific signal data, and a fast implementation of the dictionary [2]. In the dictionary learning or dictionary training approach to dictionary design, many numerical algorithms such as K-SVD and the Method of Optimal Directions (MOD) (see [2, Section IV] and references therein) have been constructed to learn a dictionary from a set of realizations of the data (training signals). The learned dictionaries are highly adapted to the given class of signals and therefore usually exhibit good representation performance. However, the learned dictionaries are highly non-structured, and therefore costly to apply in various signal processing tasks. On the other hand, analytic dictionaries based on signal transforms such as the Fourier, Gabor, wavelet, and curvelet transforms are based on mathematical models of signal classes. These structured dictionaries often feature fast implementations, but they are not adapted to specific realizations of the data. Therefore, their ability to efficiently represent the data depends on the accuracy of the mathematical model of the data.
Returning specifically to the graph setting, numerical approaches such as K-SVD and MOD can certainly be applied to graph signals, which can be viewed as vectors in R^N. However, the learned dictionaries will neither feature a fast implementation, nor explicitly incorporate the underlying graph structure. Meanwhile, a number of transform-based dictionaries for graph signals have recently been proposed. For example, the graph Fourier transform has been shown to sparsely represent smooth graph signals, wavelet transforms such as diffusion wavelets and spectral graph wavelets target piecewise-smooth graph signals, and the windowed graph Fourier transform can be used to analyze signal content at specific vertex and frequency locations (see [1] for an overview and complete list of references). These dictionaries feature pre-defined structures derived from the graph, and some of them can be efficiently implemented; however, they are generally not adapted to the signals at hand (one exception is the diffusion wavelet packets of [3], which feature extra adaptivity). The recent work in [4] tries to bridge the gap between the graph-based transform methods and the purely numerical dictionary learning algorithms. The learned dictionary in [4] has a structure derived from the graph topology, while its parameters are learned from the data. However, it does not necessarily lead to efficient implementations.

In this work, we capitalize on the benefits of both numerical and analytic approaches by learning a dictionary that incorporates the graph structure and can be implemented efficiently. We assume that the graph signals consist of local patterns, describing localized events on the graph. We incorporate the underlying graph structure into the dictionary through the graph Laplacian operator, which encodes the connectivity. In order to ensure that the atoms are localized in the graph vertex domain, we impose the constraint that our dictionary is a concatenation of subdictionaries that are polynomials of the graph Laplacian [5]. We then learn the coefficients of the polynomial kernels via numerical optimization. As such, our approach falls into the category of parametric dictionary learning [2, Section IV.E].

II. BASIC DEFINITIONS ON GRAPHS

We consider a weighted and undirected graph G = (V, E, W), where V and E represent the vertex and edge sets of the graph, and W contains the weights corresponding to the edges. We assume that the graph is connected. The graph Laplacian operator is defined as L = D − W, where D is the diagonal degree matrix, and its normalized form is defined as L = D^{−1/2}(D − W)D^{−1/2}; in the remainder of the paper, L denotes the normalized Laplacian. Both operators are real symmetric matrices, and they have a complete set of orthonormal eigenvectors with corresponding non-negative eigenvalues. We denote the eigenvectors of the normalized Laplacian by χ = [χ_0, χ_1, ..., χ_{N−1}] and its eigenvalues by 0 = λ_0 < λ_1 ≤ λ_2 ≤ ... ≤ λ_{N−1} ≤ 2. A graph signal y in the vertex domain is a real-valued function defined on the vertices of the graph G, such that y(v) is the value of the function at vertex v ∈ V. In the graph setting, as is often the case in classical settings, the spectral domain representation can provide significant information about the characteristics of signals. It
has been shown that the eigenvectors of the Laplacian operators can be used to perform harmonic analysis of signals that live on the graph, and the corresponding eigenvalues carry a notion of frequency [1]. Throughout the paper, we use the normalized Laplacian eigenvectors as the Fourier basis in order to avoid some computational issues that arise when taking large powers of the unnormalized graph Laplacian. In particular, for any function y defined on the vertices of the graph, the graph Fourier transform ŷ at frequency λ_ℓ is defined as

ŷ(λ_ℓ) = ⟨y, χ_ℓ⟩ = Σ_{n=1}^{N} y(n) χ_ℓ^*(n).
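For concreteness, these definitions translate directly into a few lines of NumPy; the small graph and signal below are illustrative placeholders, not data from the paper:

```python
import numpy as np

# Small illustrative weighted, undirected graph on 4 vertices (weights are arbitrary).
W = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.5, 1.0, 0.0, 2.0],
              [0.0, 0.0, 2.0, 0.0]])
D = np.diag(W.sum(axis=1))                         # diagonal degree matrix
L_comb = D - W                                     # combinatorial Laplacian L = D - W
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L = D_inv_sqrt @ L_comb @ D_inv_sqrt               # normalized Laplacian D^{-1/2} (D - W) D^{-1/2}

# Eigendecomposition: the columns of chi form the graph Fourier basis.
lam, chi = np.linalg.eigh(L)                       # 0 = lam[0] <= ... <= lam[-1] <= 2

# Graph Fourier transform of a signal y: y_hat[l] = <y, chi_l>.
y = np.array([1.0, 0.2, -0.5, 0.3])
y_hat = chi.T @ y
```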
Besides its use in harmonic analysis, the graph Fourier transform is also useful in defining the translation of a signal on the graph. The generalized translation operator can be defined as a generalized convolution with a delta function centered at vertex n [1]:

T_n g = √N (g ∗ δ_n) = √N Σ_{ℓ=0}^{N−1} ĝ(λ_ℓ) χ_ℓ^*(n) χ_ℓ.    (1)
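A minimal sketch of (1), again with an illustrative graph and an arbitrary smooth kernel (the eigenvectors of a real symmetric Laplacian are real, so the conjugate in (1) is immaterial here):

```python
import numpy as np

# Illustrative 5-vertex path graph; any connected weighted graph would do.
W = np.diag(np.ones(4), 1)
W = W + W.T
Dg = np.diag(W.sum(axis=1))
L = np.diag(1 / np.sqrt(np.diag(Dg))) @ (Dg - W) @ np.diag(1 / np.sqrt(np.diag(Dg)))
lam, chi = np.linalg.eigh(L)
N = W.shape[0]

# A kernel defined directly in the spectral domain (a smooth low-pass shape, chosen arbitrarily).
g_hat = np.exp(-2.0 * lam)

# Generalized translation to vertex n, eq. (1):
# T_n g = sqrt(N) * sum_l g_hat(lam_l) chi_l(n) chi_l = sqrt(N) * chi diag(g_hat) chi^T delta_n.
n = 2
T_n_g = np.sqrt(N) * chi @ np.diag(g_hat) @ chi.T[:, n]
```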
The right-hand side of (1) allows us to interpret the generalized translation as an operator acting on the kernel ĝ(·), which is defined directly in the graph spectral domain, and the localization of T_n g around the center vertex n is controlled by the smoothness of the kernel ĝ(·). One can thus design atoms T_n g that are localized around n in the vertex domain by taking the kernel ĝ(·) in (1) to be a smooth polynomial function of degree K:

ĝ(λ_ℓ) = Σ_{k=0}^{K} α_k λ_ℓ^k,   ℓ = 0, ..., N − 1.    (2)
In particular, the kth power of the Laplacian L is exactly k-hop localized on the graph topology [5]. Combining (1) and (2), translating a polynomial kernel to each vertex in the graph yields a set of N localized atoms, which are the columns of

T_g = √N ĝ(L) = √N χ ĝ(Λ) χ^T = √N Σ_{k=0}^{K} α_k L^k,    (3)

where Λ is the diagonal matrix of the eigenvalues.
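The equality of the spectral and vertex-domain expressions in (3) is easy to check numerically; the following self-contained sketch uses an illustrative ring graph and arbitrary polynomial coefficients:

```python
import numpy as np

# Illustrative 6-vertex ring graph.
N, K = 6, 2
W = np.roll(np.eye(N), 1, axis=1)
W = W + W.T
Dg = np.diag(W.sum(axis=1))
L = np.diag(1 / np.sqrt(np.diag(Dg))) @ (Dg - W) @ np.diag(1 / np.sqrt(np.diag(Dg)))
lam, chi = np.linalg.eigh(L)

# A degree-K polynomial kernel g_hat(lambda) = sum_k alpha_k lambda^k, eq. (2);
# the coefficients below are arbitrary illustrative values.
alpha = np.array([1.0, -0.8, 0.15])

# Eq. (3): T_g = sqrt(N) g_hat(L), computed in the spectral and in the vertex domain.
g_hat_lam = sum(a * lam**k for k, a in enumerate(alpha))
Tg_spectral = np.sqrt(N) * chi @ np.diag(g_hat_lam) @ chi.T
Tg_vertex = np.sqrt(N) * sum(a * np.linalg.matrix_power(L, k) for k, a in enumerate(alpha))
assert np.allclose(Tg_spectral, Tg_vertex)

# Column n of T_g is the atom T_n g; since L^k is k-hop localized, its support
# is contained in a K-hop neighborhood of vertex n.
```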
III. PARAMETRIC DICTIONARY LEARNING ON GRAPHS

Given a set of training signals on a weighted graph, our objective is to learn a structured dictionary that sparsely represents similar graph signals. In this section, we describe the dictionary structure and learning algorithm, as well as computational issues related to the use of the learned dictionary.

A. Dictionary Structure

We learn a structured graph dictionary D = [D_1, D_2, ..., D_S] that is a concatenation of a set of S subdictionaries of the form

D_s = ĝ_s(L) = χ ( Σ_{k=0}^{K} α_{sk} Λ^k ) χ^T = Σ_{k=0}^{K} α_{sk} L^k.    (4)

Note that the atom given by column n of subdictionary D_s is equal to (1/√N) T_n g_s; i.e., the order-K polynomial ĝ_s(·) translated to vertex n. The polynomial structure of the kernel ĝ_s(·) ensures that the resulting atom given by column n of subdictionary D_s has its support contained in a K-hop neighborhood of vertex n.

We also impose two constraints on the kernels {ĝ_s(·)}_{s=1,2,...,S}. First, we require that the kernels are non-negative and uniformly bounded by a given constant c; i.e., 0 ≤ ĝ_s(λ) ≤ c for all λ ∈ [0, λ_max], or, equivalently,

0 ⪯ D_s ⪯ cI,   ∀s ∈ {1, 2, ..., S},    (5)

where I is the N × N identity matrix; i.e., each subdictionary D_s is a positive semi-definite matrix whose maximum eigenvalue is upper bounded by c. Second, since the training signals usually contain frequency components that are spread across the entire spectrum, the learned kernels {ĝ_s(·)}_{s=1,2,...,S} should also cover the spectrum. One way to ensure this would be to impose a constraint that Σ_{s=1}^{S} |ĝ_s(λ)|² is constant for all λ ∈ [0, λ_max], which, following a slight generalization of [5, Theorem 5.6], also guarantees that the resulting dictionary D is a tight frame. However, such a constraint would lead to a significantly more difficult optimization problem for the learning phase discussed in the next subsection. Therefore, we instead impose the constraint c − ε ≤ Σ_{s=1}^{S} ĝ_s(λ) ≤ c + ε for all λ ∈ [0, λ_max], or, equivalently,

(c − ε)I ⪯ Σ_{s=1}^{S} D_s ⪯ (c + ε)I,    (6)

where ε is a small positive constant. To summarize, the dictionary D is a parametric dictionary that depends on the parameters {α_{sk}}_{s=1,2,...,S; k=0,1,...,K}, and the constraints (5) and (6) can be viewed as constraints on these parameters.

B. Dictionary Learning Algorithm

Given a set of training signals Y = [y_1, y_2, ..., y_M] ∈ R^{N×M}, all living on the graph G, we aim at learning a graph dictionary D ∈ R^{N×NS} with the structure described in Section III-A that can efficiently represent all of the signals in Y as linear combinations of only a few of its atoms. Since D has the form (4), this is equivalent to learning the parameters {α_{sk}}_{s=1,2,...,S; k=0,1,...,K} that characterize the set of generating kernels {ĝ_s(·)}_{s=1,2,...,S}. We denote these parameters in vector form as α = [α_1; ...; α_S], where α_s is a column vector with (K + 1) entries. To learn the dictionary parameters, we solve the following optimization problem:
argmin_{α ∈ R^{(K+1)S}, X ∈ R^{SN×M}}   ||Y − DX||_F^2 + µ||α||_2^2
subject to   ||x_m||_0 ≤ T_0,   ∀m ∈ {1, ..., M},
             0 ⪯ D_s ⪯ cI,   ∀s ∈ {1, 2, ..., S},
             (c − ε)I ⪯ Σ_{s=1}^{S} D_s ⪯ (c + ε)I,    (7)
where x_m corresponds to column m of the coefficient matrix X, and T_0 is the sparsity level of the signal. Note that in the objective of (7), we penalize the norm of the polynomial coefficients α in order to i) promote smoothness in the learned polynomial kernels, and ii) improve the numerical stability of the learning algorithm.

The optimization problem (7) is not convex, but it can be approximately solved in a computationally efficient manner by alternating between the sparse coding and dictionary update steps. In the first step, we fix the dictionary D and solve

min_X ||Y − DX||_F^2   subject to   ||x_m||_0 ≤ T_0,   ∀m ∈ {1, ..., M},
using orthogonal matching pursuit (OMP). Note that other methods for solving the sparse coding step such as matching pursuit (MP) or iterative soft thresholding could be used as well. In the second step, we fix the coefficients X and update the dictionary D by finding the
vector of parameters, α, that solves
argmin_{α ∈ R^{(K+1)S}}   ||Y − DX||_F^2 + µ||α||_2^2
subject to   0 ⪯ D_s ⪯ cI,   ∀s ∈ {1, 2, ..., S},
             (c − ε)I ⪯ Σ_{s=1}^{S} D_s ⪯ (c + ε)I.    (8)

Note that α determines D_s and D according to (4). Problem (8) is convex and can be written as a quadratic program.
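To make the reduction explicit (it follows directly from (4)): since D_s = χ ĝ_s(Λ) χ^T, the eigenvalues of D_s are exactly the values ĝ_s(λ_ℓ) = Σ_{k=0}^{K} α_{sk} λ_ℓ^k, so the spectral constraints in (7)-(8) are equivalent to the linear inequalities 0 ≤ Σ_k α_{sk} λ_ℓ^k ≤ c and c − ε ≤ Σ_s Σ_k α_{sk} λ_ℓ^k ≤ c + ε for ℓ = 0, ..., N − 1, while the objective is a convex quadratic function of α once X is fixed. The sketch below illustrates the resulting alternating scheme; it assumes scikit-learn's OMP routine and the cvxpy modeling package as stand-ins for the solvers (the paper's own experiments use OMP together with the sdpt3/yalmip toolbox, as noted in Section IV), and the function and variable names are illustrative:

```python
import numpy as np
import cvxpy as cp
from sklearn.linear_model import orthogonal_mp


def learn_polynomial_dictionary(Y, L, S=2, K=5, T0=4, c=1.0, eps=0.01, mu=1e-4, n_iter=10):
    """Sketch of the alternating scheme for problem (7): OMP sparse coding
    followed by a convex update of the polynomial coefficients alpha."""
    N = L.shape[0]
    lam = np.linalg.eigvalsh(L)                          # spectrum of the (normalized) Laplacian
    V = np.vander(lam, K + 1, increasing=True)           # V[l, k] = lam_l ** k
    Lpow = [np.linalg.matrix_power(L, k) for k in range(K + 1)]
    alpha = np.random.rand(S, K + 1) / (K + 1)           # initial kernel coefficients

    def build_D(a):
        # D = [D_1, ..., D_S] with D_s = sum_k a[s, k] L^k, as in (4).
        return np.hstack([sum(a[s, k] * Lpow[k] for k in range(K + 1)) for s in range(S)])

    for _ in range(n_iter):
        # Step 1: sparse coding with the dictionary fixed.
        X = orthogonal_mp(build_D(alpha), Y, n_nonzero_coefs=T0)    # shape (S*N, M)

        # Step 2: dictionary update with the codes fixed (problem (8) as a QP).
        a = cp.Variable((S, K + 1))
        Xs = [X[s * N:(s + 1) * N, :] for s in range(S)]
        Y_hat = sum(a[s, k] * (Lpow[k] @ Xs[s]) for s in range(S) for k in range(K + 1))
        G = V @ a.T                                      # G[l, s] = g_s(lam_l)
        constraints = [G >= 0, G <= c,                   # 0 <= D_s <= cI, eq. (5)
                       cp.sum(G, axis=1) >= c - eps,     # (c - eps)I <= sum_s D_s
                       cp.sum(G, axis=1) <= c + eps]     # sum_s D_s <= (c + eps)I, eq. (6)
        objective = cp.Minimize(cp.sum_squares(Y - Y_hat) + mu * cp.sum_squares(a))
        cp.Problem(objective, constraints).solve()
        alpha = a.value

    return alpha, build_D(alpha)
```

In this form, each dictionary update optimizes only the (K + 1)S polynomial coefficients; the products L^k X_s that carry the dependence on the graph size are precomputed constants.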
C. Computational Efficiency of the Learned Dictionary

The structural properties of the proposed class of dictionaries lead to compact representations and computationally efficient implementations, which we elaborate on briefly in this section. First, the number of free parameters depends on the number S of subdictionaries and the degree K of the polynomials. The total number of parameters is (K + 1)S, and since K and S are small in practice, the dictionary is compact and easy to store. Second, contrary to non-structured dictionaries, the dictionary forward and adjoint operators can be efficiently computed when the graph is sparse, as is usually the case in practice. Recall from (4) that D^T y = [D_1 y; D_2 y; ...; D_S y], where D_s y = Σ_{k=0}^{K} α_{sk} L^k y (each D_s is symmetric). The computational cost of the iterative sparse matrix-vector multiplication required to compute {L^k y}_{k=1,2,...,K} is O(K|E|). Therefore, the total computational cost to compute D^T y is O(K|E| + NSK). We further note that, by following a procedure similar to the one in [5, Section 6.1], the term DD^T y can also be computed in a fast way by exploiting the fact that DD^T y = Σ_{s=1}^{S} ĝ_s^2(L) y. This leads to a polynomial of degree K′ = 2K that can be efficiently computed. Both operators D^T y and DD^T y are important components of most sparse coding techniques. Therefore, these efficient implementations are very useful in numerous signal processing tasks, and comprise one of the main advantages of learning structured parametric dictionaries.
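As a rough illustration of this point, the adjoint operator can be applied without ever forming D, using only repeated sparse matrix-vector products with L; the minimal sketch below assumes SciPy's sparse matrices and uses illustrative sizes and data:

```python
import numpy as np
from scipy import sparse


def apply_DT(alpha, L, y):
    """Compute D^T y = [D_1 y; ...; D_S y] with D_s = sum_k alpha[s, k] L^k
    (each D_s is symmetric), using only K sparse matrix-vector products."""
    powers = [y]
    for _ in range(alpha.shape[1] - 1):
        powers.append(L @ powers[-1])          # builds L^k y iteratively, k = 0, ..., K
    return np.concatenate([sum(a_k * p for a_k, p in zip(alpha[s], powers))
                           for s in range(alpha.shape[0])])


# Tiny illustrative example: a random sparse symmetric matrix stands in for L.
rng = np.random.default_rng(0)
A = sparse.random(100, 100, density=0.05, random_state=0)
L = sparse.csr_matrix((A + A.T) / 2)
alpha = rng.random((2, 6))                     # S = 2 subdictionaries, degree K = 5
y = rng.random(100)
z = apply_DT(alpha, L, y)                      # length S*N = 200, without ever forming D
```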
IV. EXPERIMENTAL RESULTS

In the following experiments, we quantify the performance of the proposed dictionary on both synthetic and real data.

A. Synthetic Examples

We generate a graph by randomly placing N = 80 vertices in the unit square. We design the edge weights based on a thresholded Gaussian kernel function, in such a way that W(i, j) = exp(−[dist(i, j)]² / (2θ²)) if the physical distance between vertices i and j is less than or equal to κ, and zero otherwise. We fix θ = 0.9 and κ = 0.5. To generate training and testing signals, we divide the spectrum of the graph into four bands: [λ_0 : λ_19], [(λ_20 : λ_29) ∪ (λ_70 : λ_79)], [λ_30 : λ_49], and [λ_50 : λ_69]. We then generate a synthetic dictionary of J = 320 atoms, each with a spectral representation that is concentrated exclusively in one of the four bands. In particular, each atom j is of the form

d_j = ĥ_j(L) δ_n = χ ĥ_j(Λ) χ^T δ_n,    (9)

and is independently generated in the following way. We randomly pick one of the four bands and generate 20 uniformly random coefficients that cover the diagonal entries of ĥ_j(Λ) corresponding to the indices of the chosen band. The rest of the entries of ĥ_j(Λ) are set to zero. The center vertex n in (9) is then chosen randomly as well, so that the synthetic atoms {d_j}_{j=1,2,...,J} are centered at random areas in the graph. Note that the obtained atoms are not guaranteed to be well localized in the graph, since the discrete values of ĥ_j(Λ) are chosen randomly and are not necessarily derived from a smooth kernel. We generate a set of 1000 training signals by linearly combining (with random coefficients) T_0 = 4 random atoms from the dictionary.

Next, we use these training signals to learn a dictionary of S = 4 subdictionaries using our graph polynomial dictionary learning algorithm, with a polynomial degree of K = 20 and a sparsity level of T_0 = 4. (To solve (8), we use the sdpt3 solver in the yalmip optimization toolbox, which is publicly available at http://users.isy.liu.se/johanl/yalmip/.) The kernel bounds in (7) are chosen as c = 1 and ε = 0.01. The obtained kernels {ĝ_s(·)}_{s=1,2,3,4} are shown in Fig. 1(a). We observe that each learned kernel captures one of the four bands defined in the synthetic dictionary, following a filter-like behavior. Our algorithm leads to kernels that can be selective in the spectral domain since they combine different parts of the spectrum (see, e.g., ĝ_2(λ)).

Fig. 1. (a) Learned kernels {ĝ_s(·)}_{s=1,2,3,4} with the polynomial dictionary algorithm. (b) Comparison of the polynomial dictionary with SGWT [5], K-SVD [6], the structured graph dictionary [4], and its polynomial approximation, in terms of approximation performance on synthetic data.

In Fig. 2, we illustrate the atoms obtained in each of the four subdictionaries, positioned at the same location. We can see that the support of the atoms adapts to the graph topology. The atoms can either be smooth around a particular vertex, as for example in Fig. 2(a), or highly localized, as in Fig. 2(d).

Fig. 2. Learned atoms ĝ_s(L)δ_21, s = 1, 2, 3, 4, centered on vertex n = 21, from each of the subdictionaries.

We test the approximation performance of the learned dictionary on a set of 2000 testing signals, generated in the same way as the training signals. We compare the approximation performance of our algorithm for a polynomial degree of K = {5, 10, 20} to that obtained with (i) graph-based transform methods such as the spectral graph wavelet transform (SGWT) [5], (ii) purely numerical dictionary learning methods such as K-SVD [6], and (iii) the graph-based dictionary learning algorithm presented in [4]. The sparse coding step in the testing phase is performed using orthogonal matching pursuit (OMP) in all cases. The average approximation error is defined as ||Y_test − DX||_F^2 / |Y_test|, where |Y_test| is the size of the testing set, and is measured for a fixed sparsity level. Fig. 1(b) shows that the approximation performance obtained with our algorithm is
consistently better than that of the SGWT, which demonstrates the benefits of the learning process. The improvement is attributed to the fact that the spectral graph wavelet kernels are designed a priori to cover the whole spectrum, while with our dictionary learning algorithm, we learn the shape of the kernels from the data. As expected, the gain increases as we increase the polynomial degree. On the other hand, we see that K-SVD, which is blind to the graph structure, outperforms our algorithm in terms of approximation performance. This is not surprising, as the K-SVD algorithm has more flexibility to adapt the learned dictionary to the set of signals at hand, especially when the number of training signals is relatively high. However, the K-SVD dictionary is highly non-structured and quite complex to apply. These results illustrate the tradeoff between the adaptability and the complexity of a dictionary. Finally, the algorithm proposed in [4] is an intermediate step between K-SVD and our algorithm. It learns a dictionary that consists of subdictionaries of the form χĝ_s(Λ)χ^T, where ĝ_s does not follow any particular function model, but rather consists of some discrete values. Thus, the obtained dictionary is better in terms of approximation, but it does not benefit from the polynomial structure described in Section III-C. In order to obtain an efficiently implementable dictionary, we fit a polynomial function to the discrete values ĝ_s learned with [4]. We observe that our polynomial dictionary outperforms the polynomial approximation of [4] in terms of approximation performance.

B. Real World Examples

We now provide experiments with signals from the real world that are well localized in the graph. We consider the daily number of distinct Flickr users that have taken photos at different geographical locations around Trafalgar Square in London, between January 2010 and June 2012. Each vertex of the graph represents a geographical area of 10 × 10 meters. The graph is computed by assigning an edge between two locations when the distance between them is shorter than 30 meters, and the edge weight is set to be inversely proportional to the distance. We threshold the weights in order to obtain a sparse graph. The number of vertices of the graph is N = 245. We set S = 2 and K = 10. We have a total of 913 signals, and we use 700 of them for training and the rest for testing. Fig. 3 compares the approximation performance of the polynomial dictionary with the dictionaries of the SGWT, K-SVD, and the structured graph dictionary.

Fig. 3. Comparison of the polynomial dictionary with the SGWT, K-SVD, and the structured graph dictionary [4] in terms of approximation performance on the Flickr dataset.

We notice that the polynomial dictionary again
outperforms the SGWT. It can even achieve better performance than K-SVD when the sparsity increases. In particular, we observe that K-SVD outperforms both graph-structured algorithms for small sparsity levels, as it learns atoms of global support that can smoothly approximate the whole signal. On the other hand, the polynomial dictionary consists of localized atoms with a support concentrated on the close neighborhood of a vertex, which implies that only a few atoms may not be able to cover the whole graph. However, as the sparsity level increases, the localization property clearly becomes beneficial. Thus, in datasets that are characterized by spatially localized events, the polynomial dictionary can capture important points of interest and landmarks.

V. CONCLUSIONS

We have proposed a structured dictionary model and an algorithm for learning the dictionary parameters for the approximation of signals on graphs. We incorporate the graph structure into the dictionary by considering polynomials of the Laplacian operator. The learned dictionary consists of localized atoms that capture the important spectral and spatial components of the graph signals. Finally, its polynomial structure leads to dictionaries that are efficient in terms of storage and computational complexity, while at the same time well adapted to the signals at hand.

ACKNOWLEDGEMENT

The authors would like to thank Xiaowen Dong for the fruitful discussions and for providing the Flickr data.

REFERENCES

[1] D. I Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, May 2013.
[2] R. Rubinstein, A. M. Bruckstein, and M. Elad, "Dictionaries for sparse representation modeling," Proc. IEEE, vol. 98, no. 6, pp. 1045–1057, Apr. 2010.
[3] J. C. Bremer, R. R. Coifman, M. Maggioni, and A. D. Szlam, "Diffusion wavelet packets," Appl. Comput. Harmon. Anal., vol. 21, no. 1, pp. 95–112, 2006.
[4] X. Zhang, X. Dong, and P. Frossard, "Learning of structured graph dictionaries," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Processing, Kyoto, Japan, Mar. 2012, pp. 3373–3376.
[5] D. Hammond, P. Vandergheynst, and R. Gribonval, "Wavelets on graphs via spectral graph theory," Appl. Comput. Harmon. Anal., vol. 30, no. 2, pp. 129–150, March 2010.
[6] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, 2006.