Proceedings of 2010 IEEE 17th International Conference on Image Processing

September 26-29, 2010, Hong Kong

LEARNING SPARSE IMAGE REPRESENTATION WITH SUPPORT VECTOR REGRESSION FOR SINGLE-IMAGE SUPER-RESOLUTION

Ming-Chun Yang1,2, Chao-Tsung Chu2, and Yu-Chiang Frank Wang2,3

1 Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
2 Research Center for Information Technology Innovation and 3 Institute of Information Science, Academia Sinica, Taipei, Taiwan
[email protected], {alrtui, ycwang}@citi.sinica.edu.tw

ABSTRACT

Learning-based approaches for super-resolution (SR) have been studied in the past few years. In this paper, a novel single-image SR framework based on the learning of sparse image representations with support vector regression (SVR) is presented. SVR is known to offer excellent generalization ability in predicting output labels for input data. Given a low-resolution image, we approach the SR problem as the estimation of pixel labels in its high-resolution version. The feature considered in this work is the sparse representation of different types of image patches; prior studies have shown that this feature is robust to noise and occlusions present in image data. Experimental results show that our method is quantitatively more effective than prior work using bicubic interpolation or SVR methods, and that our computation time is significantly less than that of existing SVR-based methods due to the use of sparse image representations.

Index Terms— Super-resolution, sparse representation, support vector regression (SVR)

1. INTRODUCTION

Super-resolution (SR) is the process of generating a high-resolution image from one or several low-resolution versions, and it has been an active research topic in image processing and computer vision. Conventional SR methods are based on the registration and alignment of multiple low-resolution images of the same scene at sub-pixel accuracy. These methods can be regarded as solving an inverse problem, which recovers the high-resolution image as a linear operation on its low-resolution versions. However, this type of approach suffers from ill-conditioned registration and inappropriate blurring-operator assumptions due to an insufficient number of low-resolution images [1]. Several regularized methods [2, 3, 4] have been proposed to address this concern. Nonetheless, their results degrade if only a limited number of low-resolution images are available, or if a large image magnification factor is needed. In practice, the magnification factor of reconstruction-based approaches is limited to less than 2 [1, 5].

978-1-4244-7994-8/10/$26.00 ©2010 IEEE


In recent years, much attention has been drawn to single-image SR methods. Protter et al. [6] proposed a non-local-means approach which exploits self-similarities of patches in natural images for SR. Its performance varies with the similarity between different image patches, and is thus subject to the image of interest. Learning-based (or example-based) SR methods [7, 8] also address single-image SR problems. With the aid of training data consisting of low- and high-resolution image pairs, the relationship between images at different resolutions can be determined. Ni et al. [9] recently proposed an SR method using support vector regression (SVR) to fit low-resolution image patches in the spatial or DCT domain. Due to the excellent generalization of support-vector-based methods, no assumption on the data, such as the distribution of different image patch categories, is needed. Although competitive results were reported in [9], the computational complexity of non-linear SVR prohibits SR on large-scale images. Originally applied to signal recovery, compressed sensing [10] was adapted to SR problems by Yang et al. [11], who assumed that patches from high-resolution images admit a sparse representation with respect to an over-complete dictionary of signal atoms. They showed that, under mild conditions, the sparse representation of high-resolution images can be recovered from the low-resolution image patches. However, they used a set of selected image patches for training, implying that their method only applies to images with similar statistical nature. By considering image regions with both high and low spatial frequencies, we do not limit ourselves to the case of natural images. Our approach learns sparse representations for image patches from low-resolution images, and uses SVR to learn/predict the associated pixel labels. Details of our proposed SR framework are discussed in Sect. 2.

2. SINGLE-IMAGE SR FROM SPARSITY

Fig. 1 shows the flow chart of our method. It consists of three steps: image patch categorization, image sparse representation, and SVR for the SR output. We now detail these steps.
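The three-step flow can be outlined as follows. This is a minimal sketch of the pipeline only; the helper functions (`upscale`, `categorize`, `sparse_code`) and the per-category SVR predictors are hypothetical placeholders, not the authors' code:

```python
import numpy as np

def super_resolve(lr_img, dicts, svrs, upscale, categorize, sparse_code):
    """Sketch of the pipeline: bicubic upscaling, per-patch categorization,
    sparse coding over the matching dictionary, and SVR prediction of each
    patch's center pixel."""
    hr = upscale(lr_img)                      # initial high-resolution guess
    out = hr.copy()
    r = 2                                     # 5x5 patches -> radius 2
    for i in range(r, hr.shape[0] - r):
        for j in range(r, hr.shape[1] - r):
            patch = hr[i - r:i + r + 1, j - r:j + r + 1]
            c = categorize(patch)             # 'high' or 'low' spatial frequency
            alpha = sparse_code(dicts[c], patch.ravel())
            out[i, j] = svrs[c](alpha)        # refined center-pixel value
    return out
```

With trivial stand-ins (identity upscaling, a single category, a mean-based predictor), a constant image passes through unchanged, which is a quick sanity check of the plumbing.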


Fig. 1. The flow chart of our single-image SR approach.

2.1. Image Patch Categorization

In the training stage of our method, we first collect high- and low-resolution image pairs. To achieve better SR performance, we do not train a single SVR on all image patch pairs as [12] did; instead, we propose to learn separate SVR models for high and low spatial-frequency patches, respectively. For a low-resolution image, we first use bicubic interpolation to synthesize its high-resolution version, and extract all 5 × 5 patches from this synthesized image. To determine whether each patch corresponds to a region with high or low spatial frequency, we perform over-segmentation on the original low-resolution image to locate pixels on boundaries or corners. The associated pixels in the synthesized image are those with more texture detail (and with missing information), and thus correspond to regions with high spatial frequencies. Next, if the center of an extracted patch from this synthesized image is part of an image edge or corner, that patch is assigned to the set of regions with high spatial frequencies; otherwise, it belongs to the set of those with low spatial frequencies. Our experimental results will show that, compared to the method using only one SVR model, this step improves the SR results. In our work, we exploit the mean shift algorithm [13] to over-segment the image for the determination of the above image patch sets (see Fig. 2 for an example). We note that we do not limit our method to any specific segmentation algorithm; one could use other types of edge detection or Fourier transform methods for this step.

2.2. Sparse Representation of Image Patches

Instead of working directly with the image patches sampled from low-resolution images, we learn compact representations Dh and Dl for patches with high and low spatial frequencies, respectively.
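As a rough illustration of the patch categorization in Sect. 2.1, the edge/corner test can be approximated by a gradient-magnitude check at each patch center. This is a sketch only: the paper uses mean shift over-segmentation, and the threshold below is a hypothetical stand-in:

```python
import numpy as np

def categorize_patches(img, patch_size=5, thresh=30.0):
    """Label each patch center as 'high' or 'low' spatial frequency.
    A gradient-magnitude threshold stands in for the edge/corner test
    derived from over-segmentation in the paper."""
    gy, gx = np.gradient(img.astype(float))   # per-axis central differences
    mag = np.hypot(gx, gy)                    # gradient magnitude per pixel
    r = patch_size // 2
    labels = {}
    for i in range(r, img.shape[0] - r):
        for j in range(r, img.shape[1] - r):
            labels[(i, j)] = 'high' if mag[i, j] > thresh else 'low'
    return labels
```

On an image with a vertical step edge, centers on the edge are labeled 'high' while centers in flat regions are labeled 'low'.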
This is motivated by recent progress in compressed sensing, of which image processing has been one of the main beneficiaries. In the training stage, we apply the sparse coding tool developed in [14] to learn the dictionaries Dh and Dl. We determine the associated sparse coefficient vectors αh and αl, which minimize the reconstruction error with a small number of non-zero coefficients. Since the optimization of sparse coding is beyond the scope of this paper, we only briefly discuss this step. Taking the high spatial-frequency patches as an example, the sparse coding problem can be formulated as


Fig. 2. Image patch categorization. Left: the original image. Middle: The over-segmented result. Right: High-resolution image by bicubic interpolation. The red (blue) rectangle denotes an example high (low) spatial frequency patch.

$$\min_{\alpha_h} \|\alpha_h\|_1 \quad \text{s.t.} \quad \|D_h \alpha_h - y_h\|_2^2 \le \epsilon, \qquad (1)$$

where yh is the training image patch, Dh is an over-complete dictionary to be determined, and αh is the sparse coefficient vector. A small, positive ε accounts for the possibility of noise in the image data. Equivalently, we solve the unconstrained optimization problem

$$\min_{\alpha_h} \frac{1}{2}\|D_h \alpha_h - y_h\|_2^2 + \lambda \|\alpha_h\|_1, \qquad (2)$$

where the Lagrange multiplier λ balances the sparsity of αh against the ℓ2-norm reconstruction error. Similar remarks apply to the learning of Dl for low spatial-frequency patches. Once the above process is complete, we use the sparse coefficients αh and αl as the features for our SVR models, which learn the mapping functions between these input features and the associated pixel labels in high-resolution images. In testing, we first calculate the αh and αl of the test image, and use the predicted outputs to refine its high-resolution version. Details of the SVR learning and prediction processes are discussed in the next subsection.

2.3. Support Vector Regression

2.3.1. SVR Learning

Support vector regression (SVR) [15] is the regression extension of the support vector machine. Using the kernel trick, SVR employs nonlinear functions to linearly estimate the output function in a high-dimensional feature space. As with SVMs, this generalization ability makes SVR very powerful in predicting unknown outputs. In training, our SVR solves the following problem:

$$\min_{w, b, \xi, \xi^*} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \qquad (3)$$
$$\text{s.t.} \quad y_i - (w^T \phi(\alpha_i) + b) \le \epsilon + \xi_i,$$
$$(w^T \phi(\alpha_i) + b) - y_i \le \epsilon + \xi_i^*,$$
$$\xi_i, \xi_i^* \ge 0, \quad i = 1, \dots, n,$$

where y_i is the associated pixel label (at the same location as the center of the patch considered) in the high-resolution image, n is the number of training instances, φ(α_i) is the sparse image patch representation in the transformed space, and w represents the nonlinear mapping function to be learned. C controls the tradeoff between generalization and the upper and lower training errors ξ_i and ξ_i*, subject to the threshold ε. We note that Gaussian kernels are used in all our SVRs, and their parameters are selected via cross-validation. It is worth mentioning that, in our implementation, we subtract the mean value of each patch from its pixel values before calculating the sparse coefficient α; this mean value is also subtracted from the corresponding pixel label y in the high-resolution image. This is because our method learns local pixel-value variations, not absolute pixel-value outputs. In testing, the mean value of each patch is added back to the predicted output pixel value y.

2.3.2. SVR Prediction

After the SVR models for high and low spatial-frequency patches are learned, we use them to predict the high-resolution image of a given low-resolution test input. Following the flow shown in Fig. 1, we first synthesize the high-resolution version of the test input using bicubic interpolation, and categorize all image patches accordingly (as discussed in Sect. 2.1). Based on the categorization results, we use the learned dictionary Dh or Dl to calculate the corresponding sparse coefficient vector α for each image patch. Finally, we update the pixel values in the synthesized image using the previously learned SVRs in the sparse representation domain and obtain the final SR image.

3. EXPERIMENTAL RESULTS

Images from the USC-SIPI database are used in our experiments (http://sipi.usc.edu/database).
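As a side note, the sparse coding problem (2), which produces the SVR input features at both training and test time, can be solved with a basic iterative shrinkage-thresholding (ISTA) loop. The sketch below is illustrative only; the paper itself uses the online dictionary learning tool of [14]:

```python
import numpy as np

def ista(D, y, lam=0.1, n_iter=300):
    """Minimize (1/2)||D a - y||_2^2 + lam * ||a||_1 (problem (2))
    by iterative shrinkage-thresholding with step size 1/L."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - y) / L        # gradient step on the quadratic term
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return a
```

Starting from a = 0, each iteration monotonically decreases the objective, and for a signal that is truly sparse in D the residual becomes small after a few hundred iterations.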
We start with the original images as high-resolution versions, and degrade them in a manner similar to the degradation we plan to undo in the images to be super-resolved. It is worth repeating that we do not limit our approach to any specific category of images; we only use the high- and low-resolution image pair of lena for training, and the learned SVR models are used to super-resolve all test images. The SVR models are trained with LIBSVM [16], and the dictionaries for image sparse representation are learned with the algorithm developed in [14]. We use


Table 1. PSNR values of test images using different methods. The training time T_Tr for each SR method is listed (in min.).

            Bicubic   SVR [12]   SC+SVR   SC [11]   Our method
  boat      20.09     20.76      21.13    20.77     21.20
  bridge    18.40     18.73      18.96    19.32     18.97
  person    19.05     19.70      20.16    20.74     20.19
  cars      21.94     22.62      22.91    21.23     22.92
  skyView   17.24     17.67      18.19    18.18     18.11
  T_Tr      N/A       ~40        ~10      ~660      ~3.5

the Matlab function imresize to synthesize high-resolution images with bicubic interpolation.

The PSNR values of five different test images are reported in Table 1 (the magnification factor is 2). To compare our approach with baseline and existing learning-based SR methods, we consider bicubic interpolation, SVR in the pixel domain [12], the proposed SVR with image sparse representation but without image patch categorization, and the sparse-coding-based SR method of Yang et al. [11] (denoted as Bicubic, SVR, SC+SVR, and SC in Table 1, respectively). It is worth noting that the SR package developed in [11] uses nearly one hundred training images. Our approach, while using only the single training image lena, still gives better or comparable PSNR results. Moreover, compared with bicubic interpolation, we achieve an improvement of 3.1% to 6% in PSNR, while only about a 1.2% improvement was reported in [9] (which applied SVR for SR in the DCT domain). Fig. 3 shows example high-resolution images of the ground truth and those synthesized by different methods.

We now discuss the training time of the above learning-based SR methods. All of them use the same single training image pair to produce the SR results except for the method of [11], so it is not surprising that [11] required significantly longer computation time to learn its SR models (see Table 1). We also note that learning a single SVR using all image patches in the pixel domain is very computationally expensive; our approach required the shortest training time among all learning-based methods. The runtime estimates in Table 1 were obtained on an Intel Quad Core PC with 2.33 GHz processors and 2 GB RAM.

Finally, we conduct a more challenging experiment with a larger magnification factor of 4. We use the image boat for our tests, and the SR results using bicubic interpolation and our method are shown in Fig. 4.
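The PSNR values quoted above follow the standard definition for 8-bit images; for reference, a minimal implementation:

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    diff = reference.astype(float) - estimate.astype(float)
    mse = np.mean(diff ** 2)                  # mean squared error
    return 10.0 * np.log10(peak ** 2 / mse)   # higher is better
```

For example, an estimate uniformly off by 16 gray levels yields 10·log10(255²/256) ≈ 24.05 dB.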
We found that, compared to bicubic interpolation, we improved the PSNR by 9% (20.92 vs. 19.19). As can be seen from Fig. 4, our sparse-representation-based SVR method produces an SR image with less noise and fewer artifacts than bicubic interpolation does.

4. CONCLUSION

A novel single-image super-resolution framework based on learning sparse image representations with SVR is proposed in this paper. Given a low-resolution image, we use sparse representation to describe image patches corresponding to low and high spatial frequencies, and learn the associated SVR models to refine the pixel labels in its high-resolution version.

Fig. 3. Example high-resolution images: (a) ground truth, (b) bicubic interpolation, (c) our method. Note that the face regions are scaled for detailed comparisons.

Fig. 4. High-resolution images magnified by a factor of 4: (a) bicubic interpolation, (b) our method. The PSNR values are (a) 19.19 and (b) 20.92. Both images are scaled for illustration.

Our approach produced very attractive SR images with better PSNR than those obtained with bicubic interpolation or other learning-based methods. Compared to prior SR methods using SVR, the complexity of our method is significantly reduced due to the use of sparse image representation. Future research will be directed at extensions of our approach to multi-scale SR problems (i.e., larger magnification factors). Another interesting observation from our experiments is that our SVR models are not limited to same-scale SR problems; we found that the SC+SVR approach is able to super-resolve input images at other, higher resolutions with excellent PSNR values. We will also pursue this issue in future work.

5. REFERENCES

[1] S. Baker and T. Kanade, "Limits on super-resolution and how to break them," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 9, pp. 1167–1183, 2002.
[2] S. Farsiu, M. Robinson, M. Elad, and P. Milanfar, "Fast and robust multiframe super resolution," IEEE Trans. Image Processing, vol. 13, no. 10, pp. 1327–1344, 2004.
[3] R. C. Hardie, K. J. Barnard, and E. E. Armstrong, "Joint MAP registration and high resolution image estimation using a sequence of undersampled images," IEEE Trans. Image Processing, vol. 6, no. 12, pp. 1621–1633, 1997.
[4] M. E. Tipping and C. M. Bishop, "Bayesian image super-resolution," in NIPS, 2002, pp. 1279–1286.
[5] H. Y. Shum and Z. C. Lin, "Fundamental limits of reconstruction-based superresolution algorithms under local translation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 847–847, 2006.
[6] M. Protter et al., "Generalizing the non-local-means to super-resolution reconstruction," IEEE Trans. Image Processing, vol. 18, no. 1, pp. 36–51, Jan. 2009.
[7] W. T. Freeman, T. R. Jones, and E. C. Pasztor, "Example-based super-resolution," IEEE Computer Graphics and Applications, vol. 22, no. 2, pp. 56–65, 2002.
[8] H. Chang et al., "Super-resolution through neighbor embedding," in CVPR, 2004, pp. 275–282.
[9] K. S. Ni and T. Q. Nguyen, "Image superresolution using support vector regression," IEEE Trans. Image Processing, vol. 16, no. 6, pp. 1596–1610, June 2007.
[10] D. L. Donoho, "Compressed sensing," IEEE Trans. Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[11] J. C. Yang, J. Wright, T. S. Huang, and Y. Ma, "Image super-resolution as sparse representation of raw image patches," in CVPR, 2008.
[12] D. Li, S. Simske, and R. M. Mersereau, "Single image super-resolution based on support vector regression," in IJCNN, 2007, pp. 2898–2901.
[13] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603–619, 2002.
[14] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," Journal of Machine Learning Research, 2009.
[15] V. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998.
[16] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, 2001.