Over-complete Dictionary Learning via Neural Network Models
Tae Hyung Kim
Electrical Engineering and Computer Science, University of Michigan
[email protected]

Abstract—The over-complete dictionary model is a useful framework for handling under-determined or ill-conditioned problems. Because it represents signals as sparse linear combinations of dictionary atoms, this approach is well suited to image restoration, compression, and sparse coding. The performance of the model depends on how the dictionary atoms are designed. In general, dictionaries are either derived from a mathematical basis or learned from existing data; common choices are the discrete cosine transform (DCT) and K-SVD. In this project, I propose using the restricted Boltzmann machine (RBM) and the sparse autoencoder for dictionary learning. Both are unsupervised neural network models that learn representations from training data, which makes them natural candidates for dictionary learning. I apply the RBM and sparse autoencoder dictionaries to image denoising, inpainting, and compression problems, and compare their performance with K-SVD and DCT dictionaries.
I. INTRODUCTION

Image processing problems are often under-determined or ill-conditioned, and a synthesis-based approach is a good way to handle such problems. The synthesis model represents or approximates a signal as a linear combination of basis vectors:

x = Bz \quad \text{or} \quad x \approx Bz    (1)

The over-complete dictionary model is one kind of synthesis model in which the number of dictionary atoms K is much larger than the number of pixels n_p (K >> n_p):

x \approx Dz, \quad ||z||_0 < r    (2)

This over-complete dictionary-based approach is particularly good at representing an image sparsely, and it is used in many applications such as image restoration and compression.

There are many choices of dictionary. It can be mathematical, such as the discrete cosine transform (DCT), or learned from training data, such as K-SVD. In previous research, Aharon et al.[1][2] demonstrated that the K-SVD dictionary is very powerful in denoising, inpainting, and compression problems compared to the DCT and Haar dictionaries. In this project, I investigate the use of neural network models for training dictionaries; specifically, I employ the restricted Boltzmann machine (RBM) and the sparse autoencoder as dictionary learning methods.

II. MOTIVATION

The performance of the over-complete dictionary model depends on how the dictionary is designed[1][2], so a deliberate choice of dictionary may improve image reconstruction, restoration, and compression. In computer vision, much research on image classification has been built on dictionaries; this is known as the bag-of-words technique[3]. In recent image classification research, an RBM dictionary outperforms a K-means clustered dictionary[8], since the RBM dictionary is more effective at representing data distinctly. I expected a similar principle to apply to dictionary-based image processing, which motivated me to investigate neural network models such as the RBM and the sparse autoencoder for over-complete dictionary learning.

III. PRELIMINARIES: NEURAL NETWORK MODELS

A. General Description

The neural network model[7] is a graphical model mimicking the signal propagation of neurons. In the graph, each vertex holds a value, and the edges represent relations between vertices. Figure 1 shows an example neural network.

Fig. 1. An example of a neural network

Let the input data be X. The number of layers n and the number of vertices in each layer depend on the purpose. The first-layer activation a_1 is the input X itself. When the signal a_1 propagates to the right, it is multiplied by the weight matrix W_1 attached to the edges, so the second-layer activation becomes a function of the propagated signal, a_2 = f(W_1 \cdot X). Subsequent layers are computed in the same manner, so the k-th layer activation is

a_k = f(W_{k-1} \cdot a_{k-1})    (3)
The function f is called the activation function. The objective of this model is to minimize the distance between the desired output Y and the last layer's activation a_n by adjusting the weight matrices W:

\min_{W_1, W_2, \ldots, W_{n-1}} \; ||Y - a_n||^2 + R(W)    (4)

where R(W) is a regularization term on W_1, \ldots, W_{n-1}. This objective can be optimized by the back-propagation technique[7], which is derived from the gradient descent method.
B. Sparse Autoencoder

The sparse autoencoder[5] has the same structure as figure 1, but the desired output is set equal to the input, Y = X. In this way the training becomes unsupervised, and the network learns a common representation of the training images. I use the sigmoid function as the activation function.
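A minimal sketch of this idea: the same back-propagation machinery with the target set to the input, Y = X. Tied weights (W2 = W1^T), the training schedule, and the omission of the sparsity penalty are simplifications I introduce here; the report does not specify these details.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder(X, n_hidden=256, lr=0.1, n_epochs=200, seed=0):
    """Sigmoid autoencoder trained with target Y = X (tied weights W2 = W1.T).
    X: (n_samples, n_pixels) patches scaled to [0, 1]. Returns W1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    W1 = 0.01 * rng.standard_normal((n_hidden, X.shape[1]))
    for _ in range(n_epochs):
        A2 = sigmoid(X @ W1.T)            # hidden code
        A3 = sigmoid(A2 @ W1)             # reconstruction of X
        D3 = (A3 - X) * A3 * (1.0 - A3)   # backprop through the output sigmoid
        D2 = (D3 @ W1.T) * A2 * (1.0 - A2)
        grad = (A2.T @ D3 + D2.T @ X) / n # tied weights: decoder + encoder terms
        W1 -= lr * grad
    return W1   # rows of W1: learned features, usable as dictionary atoms
```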
C. Restricted Boltzmann Machine

The restricted Boltzmann machine[4][9] is a bipartite, undirected graphical model. It consists of a visible layer, to which the input data is assigned, and a hidden layer, which is estimated by a Monte Carlo method.

Fig. 2. The restricted Boltzmann machine

The objective function of this model is the Boltzmann distribution with respect to v, h, and W:

P(v, h) = \frac{e^{-E(v,h)}}{Z}    (5)

Visible and hidden units are either binary or real-valued, and the definition of E(v, h) depends on their types. If the visible units are real-valued and the hidden units are binary, the model is called a Gaussian RBM, and the energy function is

E(v, h) = \sum_{i=1}^{n_v} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{i=1}^{n_v} \sum_{j=1}^{n_h} \frac{v_i}{\sigma_i^2} W_{ij} h_j - \sum_{j=1}^{n_h} c_j h_j    (6)

From this objective function it is feasible to derive the conditional probabilities used to estimate the hidden-layer activations and the visible-layer values[11]:

p(h_j = 1 \mid v) = \mathrm{sigmoid}\Big(c_j + \sum_i W_{ij} \frac{v_i}{\sigma_i^2}\Big)    (7)

p(v_i = v \mid h) = \mathcal{N}\Big(v \;\Big|\; b_i + \sum_j h_j W_{ij}, \; \sigma_i^2\Big)    (8)

The goal of training the RBM is to maximize the likelihood of this probability distribution with respect to W, b, and c. This can be done by a sampling-based approximation to the gradient, such as contrastive divergence[10]. The update rules[11] are

\nabla W_{ij} = \Big(\frac{1}{\sigma_i^2} v_i h_j\Big)_{data} - \Big(\frac{1}{\sigma_i^2} v_i h_j\Big)_{recon}    (9)

\nabla b_i = \Big(\frac{1}{\sigma_i^2} v_i\Big)_{data} - \Big(\frac{1}{\sigma_i^2} v_i\Big)_{recon}    (10)

\nabla c_j = (h_j)_{data} - (h_j)_{recon}    (11)

where (\cdot)_{data} is computed from the data and (\cdot)_{recon} is computed from samples drawn using the conditional probabilities in equations (7) and (8). By training the restricted Boltzmann machine and the sparse autoencoder, I obtain optimal weight matrices W that contain common features of the input images; these W's are used as the dictionary.
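The update rules (9)-(11) can be illustrated with a short CD-1 sketch. Using hidden probabilities rather than binary samples for the statistics, keeping the variances fixed, and skipping the Gaussian sampling noise in the reconstruction are common simplifications that I assume here; they are not prescribed by the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(V, W, b, c, sigma2, lr=1e-3, rng=None):
    """One contrastive-divergence (CD-1) update for a Gaussian-Bernoulli RBM.
    V: (n_samples, n_visible) real-valued patches, W: (n_visible, n_hidden),
    b: visible biases, c: hidden biases, sigma2: per-visible-unit variances."""
    rng = rng or np.random.default_rng(0)
    # Positive phase: hidden probabilities from eq. (7), then a binary sample.
    ph_data = sigmoid(c + (V / sigma2) @ W)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: mean reconstruction from eq. (8), then hidden probabilities again.
    v_recon = b + h_sample @ W.T
    ph_recon = sigmoid(c + (v_recon / sigma2) @ W)
    # Gradient estimates, eqs. (9)-(11): data statistics minus reconstruction statistics.
    n = V.shape[0]
    dW = ((V / sigma2).T @ ph_data - (v_recon / sigma2).T @ ph_recon) / n
    db = np.mean(V / sigma2 - v_recon / sigma2, axis=0)
    dc = np.mean(ph_data - ph_recon, axis=0)
    return W + lr * dW, b + lr * db, c + lr * dc
```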
IV. EXPERIMENT

A. Experiment Setting

In the experiments, I employ the Sheffield face dataset[12], which consists of 564 face images (112x92 pixels) of 20 individuals.

Fig. 3. Some face images in the Sheffield face dataset

I sample 100,000 random 8x8 patches to train the RBM, the sparse autoencoder, and K-SVD. For each method, 256 dictionary atoms are trained (a 4x over-complete dictionary for 8x8 patches), as shown in figure 4.

Fig. 4. RBM, sparse autoencoder, K-SVD, and DCT dictionaries trained from the Sheffield face dataset

I carry out several experiments with these dictionaries, applying them to denoising, inpainting, and compression problems. For all three problems I choose the objective

\hat{z} = \arg\min_{z \in \mathbb{R}^K} ||z||_0 \quad \text{subject to} \quad ||x - Dz|| \le \epsilon    (12)

and optimize it using the orthogonal matching pursuit (OMP) algorithm[6] with a fixed error tolerance \epsilon.
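Equation (12) is solved greedily by OMP[6]. The sketch below is a generic, error-constrained OMP rather than the author's implementation; it assumes the dictionary columns are normalized to unit norm.

```python
import numpy as np

def omp(x, D, eps):
    """Greedy OMP for eq. (12): a sparse z (found greedily) with ||x - D z||_2 <= eps.
    x: signal of length n_pixels, D: (n_pixels, K) dictionary with unit-norm columns."""
    residual = x.astype(float).copy()
    support = []
    z = np.zeros(D.shape[1])
    while np.linalg.norm(residual) > eps and len(support) < D.shape[0]:
        k = int(np.argmax(np.abs(D.T @ residual)))   # atom most correlated with residual
        if k in support:                             # no further progress possible
            break
        support.append(k)
        # Re-fit all selected coefficients by least squares on the current support.
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        z[:] = 0.0
        z[support] = coeffs
        residual = x - D[:, support] @ coeffs
    return z
```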
B. Denoising

For the denoising problem, I follow the procedure specified in [2]. Gaussian noise with \sigma = 10 is added to a face image to generate a noisy image. I divide the image into 8x8 overlapping blocks, then apply equation (12) with error tolerance \epsilon = 9 to restore each block. Once the reconstructed blocks are obtained, I restore the image by averaging the overlapping pixels between blocks. The results in figure 5 show that RBM and K-SVD have superior performance, followed by DCT and the autoencoder.

Fig. 5. Denoising results; RBM and K-SVD show similar performance, followed by DCT and the autoencoder
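A sketch of the block-wise denoising step, assuming the omp function from the previous sketch. This simplified version codes every overlapping patch and averages the overlaps; the full procedure of [2] includes additional weighting against the noisy image that I omit here.

```python
import numpy as np

def denoise(noisy, D, eps=9.0, patch=8):
    """Code every overlapping 8x8 block of `noisy` with eq. (12) (via `omp` above)
    and average the overlapping reconstructions. D: (64, K) dictionary."""
    H, W = noisy.shape
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    for i in range(H - patch + 1):
        for j in range(W - patch + 1):
            block = noisy[i:i + patch, j:j + patch].ravel()
            rec = (D @ omp(block, D, eps)).reshape(patch, patch)
            acc[i:i + patch, j:j + patch] += rec     # accumulate reconstructions
            cnt[i:i + patch, j:j + patch] += 1.0     # count overlaps per pixel
    return acc / cnt
```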
C. Inpainting: Filling in Missing Pixels

The procedure follows [1]. A fraction \alpha of the original image's pixels is removed, and the image is divided into 154 non-overlapping 8x8 blocks. For each block, I apply equation (12) to find the optimal coefficients, ignoring the missing pixels during the optimization; each block is then restored from its coefficients. I set the error level to \epsilon = 1.15 and test a low missing rate \alpha = 0.3 and a high missing rate \alpha = 0.7. Figure 6 illustrates that K-SVD gives the highest quality at the low missing rate, followed by DCT, RBM, and the autoencoder. At the high missing rate, RBM performs best, followed by K-SVD, the autoencoder, and DCT.

Fig. 6. Inpainting results
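One way to "ignore the missing pixels" in equation (12) is to restrict the fit to the observed rows of the dictionary and then synthesize the full block from the resulting code. This sketch assumes the omp function above; renormalizing the sub-sampled atoms is my own detail, not something specified in the text.

```python
import numpy as np

def inpaint_block(block, mask, D, eps=1.15):
    """Restore one 8x8 block with missing pixels. `mask` is True where pixels
    are observed; only those rows of the dictionary are used for the fit."""
    known = mask.ravel()
    Dk = D[known, :]
    norms = np.linalg.norm(Dk, axis=0) + 1e-12
    z = omp(block.ravel()[known], Dk / norms, eps)   # fit on observed pixels only
    z = z / norms                                    # undo the column rescaling
    return (D @ z).reshape(block.shape)              # synthesize the full block
```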
a · #Blocks + #coef s · Q #pixels
(13)
(14)
•
a is the required number of bits to code the number of coefficients for each block
•
#Blocks = 154 non-overlapping blocks
•
#coefs is total number of coefficients for representing the entire image
•
Q is the number of bits per coefficient. I assume 64 bits double precision for computational convenience.
•
#pixels = 112x88 = 9856 pixels
The bpp values vary by a, #coefs. In each test, I set those values by choosing the number of activation of z in the equation (12). (choose l = ||z||0 ) Figure 7 illustrates the result. RBM, autoencoder, DCT, and K-SVD are marked as red, blue, green, and black, respectively. In the low bpp range between 0 to 0.1, K-SVD dictionary shows the best performance. RBM and DCT follows, while autoencoder is the worst. In higher bpp levels, DCT is the best but RBM proves similar performance. K-SVD is the worst in this bpp level. V.
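A small sketch of how equations (13) and (14) translate to code. The particular choice of a (enough bits to index the largest per-block coefficient count) is my assumption; the report only states what a represents.

```python
import numpy as np

def psnr(original, reconstructed):
    """PSNR in dB, eq. (13): 20*log10(255 / RMSE), assuming an 8-bit pixel range."""
    rmse = np.sqrt(np.mean((original.astype(float) - reconstructed.astype(float)) ** 2))
    return 20.0 * np.log10(255.0 / rmse)

def bits_per_pixel(coefs_per_block, n_pixels, Q=64):
    """bpp, eq. (14). `coefs_per_block` lists the number of coefficients kept for
    each non-overlapping block; Q = 64 bits per coefficient (double precision)."""
    n_blocks = len(coefs_per_block)
    a = int(np.ceil(np.log2(max(coefs_per_block) + 1)))  # bits to code each block's count
    return (a * n_blocks + sum(coefs_per_block) * Q) / n_pixels
```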
V. CONCLUSION

In summary, the experimental results indicate that the RBM and the sparse autoencoder are promising dictionary learning methods for image denoising, inpainting, and compression tasks. In particular, the RBM dictionary is outstanding, with performance comparable to the K-SVD and DCT dictionaries. Results from the sparse autoencoder dictionary are somewhat inferior. I believe this is because the sigmoid function is used as the activation function. Better results are expected if the identity function f(W \cdot X) = W \cdot X is used, so that the objective function becomes

\arg\min_{W_1, W_2} \sum_i ||X_i - W_2 W_1 X_i||_2^2 + R(W)    (15)

where R(W) is a regularization term added for sparsity. W_1 can then be used as the sparse autoencoder dictionary.
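A sketch of what gradient descent on equation (15) could look like, with identity activations and, as one possible choice of R(W), an L1 penalty on both weight matrices; the hidden size, penalty weight, and learning rate are assumptions for illustration.

```python
import numpy as np

def train_linear_autoencoder(X, n_hidden=256, lam=1e-3, lr=1e-2, n_epochs=100, seed=0):
    """Gradient descent on eq. (15): sum_i ||x_i - W2 W1 x_i||^2 + R(W), with
    identity activations and R(W) taken here as an L1 penalty on W1 and W2.
    X: (n_samples, n_pixels). Returns (W1, W2)."""
    rng = np.random.default_rng(seed)
    n, n_pix = X.shape
    W1 = 0.01 * rng.standard_normal((n_hidden, n_pix))
    W2 = 0.01 * rng.standard_normal((n_pix, n_hidden))
    for _ in range(n_epochs):
        Z = X @ W1.T                  # codes W1 x_i (one per row)
        Xhat = Z @ W2.T               # reconstructions W2 W1 x_i
        E = Xhat - X
        gW2 = 2.0 * E.T @ Z / n + lam * np.sign(W2)
        gW1 = 2.0 * (E @ W2).T @ X / n + lam * np.sign(W1)
        W2 -= lr * gW2
        W1 -= lr * gW1
    return W1, W2   # rows of W1 serve as the dictionary atoms
```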
Neural network models also have disadvantages: (1) they often take longer to train than K-SVD, and (2) they are hard to train if the input data are highly correlated or the amount of input data is insufficient. For example, unlike K-SVD, it is hard to extract an over-complete dictionary from a single image.

VI. FUTURE WORK
In this project, I applied existing neural network models to denoising, inpainting, and compression problems. However, I believe it is possible to design a novel neural network model for a specific problem, which could be an interesting research topic.

REFERENCES
[1] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, Nov. 2006.
[2] M. Elad and M. Aharon, "Image Denoising Via Sparse and Redundant Representations Over Learned Dictionaries," IEEE Transactions on Image Processing, vol. 15, no. 12, Dec. 2006.
[3] L. Fei-Fei and P. Perona, "A Bayesian Hierarchical Model for Learning Natural Scene Categories," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[4] P. Smolensky, "Information processing in dynamical systems: Foundations of harmony theory," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pp. 194-281, 1986.
[5] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and Composing Robust Features with Denoising Autoencoders," ICML 2008.
[6] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in Conf. Rec. 27th Asilomar Conf. Signals, Syst. Comput., vol. 1, 1993.
[7] A. Ng, "Unsupervised Feature Learning and Deep Learning Tutorial," http://ufldl.stanford.edu/wiki/index.php/
[8] K. Sohn, D. Y. Jung, H. Lee, and A. Hero III, "Efficient Learning of Sparse, Distributed, Convolutional Feature Representations for Object Recognition," in Proceedings of the 13th International Conference on Computer Vision (ICCV), 2011.
[9] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
[10] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771-1800, 2002.
[11] K. Cho, A. Ilin, and T. Raiko, "Improved learning of Gaussian-Bernoulli restricted Boltzmann machines," in Artificial Neural Networks and Machine Learning - ICANN 2011, Springer Berlin Heidelberg, 2011, pp. 10-17.
[12] D. B. Graham and N. M. Allinson, "Characterizing Virtual Eigensignatures for General Purpose Face Recognition," in Face Recognition: From Theory to Applications, NATO ASI Series F, Computer and Systems Sciences, vol. 163, H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman-Soulie, and T. S. Huang (eds.), pp. 446-456, 1998.
[13] C. C. Tan and C. Eswaran, "Reconstruction of handwritten digit images using autoencoder neural networks," Canadian Conference on Electrical and Computer Engineering (CCECE), IEEE, 2008.
[14] T. Peleg, Y. C. Eldar, and M. Elad, "Exploiting statistical dependencies in sparse representations for signal recovery," IEEE Transactions on Signal Processing, vol. 60, no. 5, pp. 2286-2303, 2012.
[15] L. N. Smith and M. Elad, "Improving Dictionary Learning: Multiple Dictionary Updates and Coefficient Reuse," IEEE Signal Processing Letters, vol. 20, no. 1, pp. 79-82, 2013.
[16] Y. Kim, H. Lee, and E. Mower Provost, "Deep Learning for Robust Feature Generation in Audiovisual Emotion Recognition."
[17] H. Lee, C. Ekanadham, and A. Ng, "Sparse deep belief net model for visual area V2," Advances in Neural Information Processing Systems, 2007.
[18] G. Hinton, "A practical guide to training restricted Boltzmann machines," Momentum, vol. 9, no. 1, 2010.
[19] F. Agostinelli, M. R. Anderson, and H. Lee, "Robust Image Denoising with Multi-Column Deep Neural Networks," Advances in Neural Information Processing Systems, 2013.
[20] B. A. Olshausen, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607-609, 1996.
[21] H. Lee et al., "Efficient sparse coding algorithms," Advances in Neural Information Processing Systems, 2006.
[22] H. Lee et al., "Unsupervised learning of hierarchical representations with convolutional deep belief networks," Communications of the ACM, vol. 54, no. 10, pp. 95-103, 2011.
[23] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Cognitive Modeling, vol. 1, p. 213, 2002.