LEARNING CONTEXT-AWARE SPARSE REPRESENTATION FOR SINGLE IMAGE SUPER-RESOLUTION

Min-Chun Yang^{1,2}, Chang-Heng Wang^{1,3}, Ting-Yao Hu^{1,3}, and Yu-Chiang Frank Wang^{1}

^{1} Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
^{2} Dept. Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
^{3} Dept. Electrical Engineering, National Taiwan University, Taipei, Taiwan

ABSTRACT
This paper presents a novel learning-based method for single image super-resolution (SR). Given an input low-resolution image and its image pyramid, we propose to perform context-constrained image segmentation and construct an image segment dataset with different context categories. By learning context-specific image sparse representation, our method aims to model the relationship between the interpolated image patches and their ground truth pixel values from different context categories via support vector regression (SVR). To synthesize the final SR output, we upsample the input image by bicubic interpolation, followed by the refinement of each image patch using the SVR model learned from the associated context category. Unlike prior learning-based SR methods, our approach requires neither the reoccurrence of similar image patches (within or across image scales) nor the collection of training low- and high-resolution image data in advance. Empirical results show that our proposed method is quantitatively and qualitatively more effective than existing interpolation or learning-based SR approaches.

Index Terms— Super-resolution, sparse representation, support vector regression, self-learning

1. INTRODUCTION

Super-resolution (SR) is an inverse process of producing a high-resolution (HR) image from a single or multiple low-resolution (LR) inputs. Conventional reconstruction-based SR methods require alignment and registration of several LR images in sub-pixel accuracy [1, 2]; however, ill-conditioned registration and inappropriate blurring operator assumptions limit the scalability of this type of approach. While methods which introduce additional regularization alleviate the above problems [1, 2, 3], their performance is still limited by the number of LR images/patches available. As pointed out in [4, 5], the magnification factor is typically limited to less than 2 for this type of approach.

Single-image SR is more practical for real-world applications, since it only requires one LR input to determine its HR version. The nonlocal-means (NLM) method is a representative single-image SR technique, which utilizes the reoccurrence (i.e., self-similarity) of image patches for synthesizing the HR version. Much attention has also been directed to example- or learning-based single-image SR approaches (e.g., [6, 7]). For a LR input, example-based methods search for similar image patches from training LR image data, and use their corresponding HR versions to produce the final SR output. Learning-based approaches, on the other hand, focus on modeling the relationship between images of different resolutions by observing priors of specific images [8, 9, 10, 11]. For example, Ma et al. [9] applied sparse coding techniques [12] and proposed to learn sparse image representation for SR; Yang et al. [11] further extended this idea by introducing group sparsity constraints when learning sparse image representation for SR. Recently, Irani et al. [13] proposed an image pyramid structure which downsamples an input image into several lower-resolution versions, and integrated both classical and example-based approaches for SR. This method overcomes the limitation of example/learning-based approaches which require the collection of training image data in advance. Although promising SR results were reported in [13], the assumption of image patch self-similarity within or across image scales might not hold in practice.

Motivated by [13], we propose a novel self-learning SR framework which requires neither the reoccurrence of image patches nor the collection of training LR/HR image data in advance. We apply the image pyramid in [13] and learn context-aware sparse representation for SR. The flowchart of our proposed method is shown in Fig. 1. We note that, while prior SR methods utilizing context or texture information exist, they typically applied classical or example-based approaches with training data. For example, Turgay and Akar [14] considered image gradient and texture information and applied the maximum a posteriori technique for SR; Sun et al. [15] collected a super-pixel database from training data, and used those with similar texture information to synthesize the SR output accordingly. Later in Sect. 2, we will discuss how we automatically extract image segments and construct an image segment dataset with different context category information from a single input LR image. Sect. 3 will detail our self-learning and prediction algorithms using sparse coding and support vector regression techniques for SR.
Fig. 1. Flowchart of our SR framework. (a) Input image I_0, its lower-resolution versions (e.g., I_{-1}, etc.), and the SR output I_SR. (b) Extraction and clustering of image segments in terms of context information by affinity propagation. (c) Synthesized higher-resolution images {U_{i+1}} from {I_i} in (a) using bicubic interpolation. (d) Learning of support vector regression (SVR) models for image segments in each context category. (e) Refining U_1 into I_SR with the associated SVR models.
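To make the pyramid construction in Fig. 1(a) and (c) concrete, the following Python sketch builds the nearest-neighbor-downscaled pyramid {I_i} and the bicubically interpolated pyramid {U_{i+1}}. Only the choice of nearest-neighbor downscaling and bicubic upsampling comes from the text; the number of levels, the dictionary layout, and the per-level scale factor of 2 are our assumptions for illustration (the paper reports results at a magnification factor of 2 but does not state the per-level ratio).

```python
import cv2

MAG = 2  # per-level scale factor (assumption; matches the reported magnification)

def build_pyramids(I0, n_levels=3):
    """Build the LR pyramid {I_i}, i <= 0 (Fig. 1(a)), and the bicubic
    pyramid {U_{i+1}} (Fig. 1(c)), where U_{i+1} upsamples I_i by one level."""
    I = {0: I0}
    for i in range(-1, -n_levels - 1, -1):
        h, w = I[i + 1].shape[:2]
        # Nearest-neighbor downscaling, as used in the paper.
        I[i] = cv2.resize(I[i + 1], (w // MAG, h // MAG),
                          interpolation=cv2.INTER_NEAREST)
    U = {}
    for i in range(-n_levels, 0):
        h, w = I[i + 1].shape[:2]
        # Bicubic interpolation of I_i to the grid of I_{i+1}.
        U[i + 1] = cv2.resize(I[i], (w, h), interpolation=cv2.INTER_CUBIC)
    h, w = I0.shape[:2]
    # U_1 interpolates the input I_0 to the target SR resolution.
    U[1] = cv2.resize(I0, (w * MAG, h * MAG), interpolation=cv2.INTER_CUBIC)
    return I, U  # each U[i], i <= 0, is paired with ground truth I[i]
```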
2. CONTEXT-CONSTRAINED IMAGE SEGMENTATION AND CATEGORIZATION

We refer to image segmentation with the constraint of context information as context-constrained image segmentation and categorization. In this section, we focus on extracting and identifying image segments with similar textural context for image sparse representation and super-resolution.

2.1. Construction of an Image Segment Database

Traditional learning-based SR methods need to collect training LR/HR image data when synthesizing the SR output for a LR input image. Although Irani et al. [13] proposed to generate down-scaled versions of a LR input for constructing an image patch (ground truth) database, their SR method still requires the existence of image patch self-similarities. To address this problem, we apply the same image pyramid structure and focus on constructing an image segment database using the associated context information.

We now discuss the details of our image segment database construction. For a LR input, we first downgrade its resolution with a nearest-neighbor approach and form an image pyramid {I_i}, as shown in Fig. 1(a). Using mean-shift [16], images in this pyramid are further divided into several segments. We consider the textural context information of each segment, and categorize the segments into different groups using the extracted context information accordingly. To describe the textural features of each segment, we calculate the responses of derivative filter banks with 6 different orientations and at 3 different scales (as suggested in [15]). For each filter with scale s and orientation o, its response is quantized into a histogram with 20 bins. As a result, the final textural feature vector h for each image segment is of size 6 × 3 × 20 = 360.
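As an illustration, below is a minimal Python sketch of such a texture descriptor. The 6 orientations, 3 scales, and 20-bin quantization follow the text; the exact derivative-of-Gaussian kernel design, the scale values, and the histogram range are our assumptions, since the paper does not specify them.

```python
import numpy as np
from scipy.ndimage import convolve

def derivative_filter(scale, theta, size=15):
    # Oriented first-derivative-of-Gaussian kernel (assumed filter design).
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    u = xx * np.cos(theta) + yy * np.sin(theta)        # rotated coordinate
    g = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * scale ** 2))
    k = -u / scale ** 2 * g                            # derivative along u
    return k / np.abs(k).sum()

def texture_feature(image, mask):
    """360-d descriptor: 6 orientations x 3 scales x 20-bin histograms,
    computed over the pixels of one mean-shift segment (given by `mask`)."""
    feats = []
    for scale in (1.0, 2.0, 4.0):                      # 3 scales (assumed values)
        for o in range(6):                             # 6 orientations
            theta = o * np.pi / 6.0
            resp = convolve(image.astype(float), derivative_filter(scale, theta))
            vals = resp[mask]                          # responses inside the segment
            hist, _ = np.histogram(vals, bins=20, range=(-1.0, 1.0))
            feats.append(hist / max(vals.size, 1))     # normalized 20-bin histogram
    return np.concatenate(feats)                       # shape (360,)
```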
2.2. Context-Constrained Image Segmentation and Categorization

Once the image segment database is constructed from the image pyramid {I_i} as ground truth, an automatic approach is needed to divide the image segments into different categories using their textural features. We apply affinity propagation (AP) [17], since AP is an unsupervised clustering algorithm which groups the data by identifying the exemplars of each cluster. One of the major advantages of AP is that, unlike prior clustering methods such as k-means, it does not require the number of clusters k as prior knowledge. Therefore, the use of AP allows us to categorize the image segments into different context categories automatically, while no user interaction such as prior knowledge on k is needed. When using AP, we apply the χ²-distance to measure the difference between two image segments S^i and S^j in terms of their 360-dimensional textural features h, i.e.,

\chi^2(S^i, S^j) = \sum_{n=1}^{360} \frac{(h^i_n - h^j_n)^2}{h^i_n + h^j_n}.   (1)

Assuming that there exist N image segments in the image pyramid {I_i}, we determine the optimal clustering configuration by maximizing the net similarity NS between image segments, which is calculated as:

NS = \sum_{i=1}^{N} \sum_{j=1}^{N} c_{ij}\, s(S^i, S^j)
     - \alpha \sum_{i=1}^{N} (1 - c_{ii}) \Big( \sum_{j=1}^{N} c_{ij} \Big)
     - \alpha \sum_{j=1}^{N} \Big| \Big( \sum_{i=1}^{N} c_{ij} \Big) - 1 \Big|.   (2)

In (2), s(S^i, S^j) = \exp(-\chi^2(S^i, S^j)) measures the similarity between segments S^i and S^j. The coefficient c_{ij} = 1 indicates that the segment S^i is the exemplar (i.e., cluster representative) of the segment S^j. In such cases, S^j is categorized to cluster i, and c_{ii} equals 1 since the segment S^i itself is an exemplar. The first term in (2) calculates the similarity between segments within each cluster, while the second term penalizes the case when segments are assigned to an empty cluster i (i.e., c_{ii} = 0 but with \sum_{j=1}^{N} c_{ij} ≥ 1), and the third term penalizes the condition when a segment belongs to more than one cluster, or no cluster label is assigned; the parameter α is set to +∞ to avoid these two scenarios. More details of AP can be found in [17], and the process of our context-constrained image categorization is illustrated in Fig. 1(b).
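The following sketch clusters the segment descriptors with scikit-learn's AffinityPropagation as a stand-in for the AP implementation of [17], using the similarity s(S^i, S^j) = exp(-χ²(S^i, S^j)) from Eq. (2); the small eps term guarding against empty histogram bins is our addition.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def chi2_distance(h_i, h_j, eps=1e-10):
    # Chi-square distance between two 360-d texture histograms, Eq. (1).
    return np.sum((h_i - h_j) ** 2 / (h_i + h_j + eps))

def cluster_segments(H):
    """H: (N, 360) array of segment descriptors.
    Returns a context-category label per segment and the exemplar indices."""
    N = H.shape[0]
    S = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            S[i, j] = np.exp(-chi2_distance(H[i], H[j]))  # s(S^i, S^j)
    ap = AffinityPropagation(affinity="precomputed").fit(S)
    return ap.labels_, ap.cluster_centers_indices_
```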
3. LEARNING CONTEXT-AWARE SPARSE REPRESENTATION FOR SUPER-RESOLUTION

3.1. Learning of Context-Aware Sparse Representation

Besides the image pyramid {I_i}, we use bicubic interpolation to synthesize the associated higher-resolution versions {U_{i+1}}, as shown in Fig. 1(c). Consequently, for each synthesized image U_i (i ≤ 0), we have the ground truth image I_i for training purposes. Using this framework, we propose to utilize the property (i.e., context information) present in the target image and its image pyramid to design an image-specific SR algorithm. Unlike prior learning-based SR work, we need not collect training LR/HR image data beforehand.

As discussed in Sect. 2, we extract image segments from {I_i} and automatically categorize them into different groups depending on their textural context. For each ground truth image segment in {I_i}, the associated segment in {U_i} is considered as the input of our learning algorithm. Thus, what we aim to learn is the relationship between a segment in {U_i} and its ground truth label (i.e., pixel value) in {I_i}; for an image pyramid with c different context categories, we train c different models accordingly. As previously proposed in [10], we apply sparse representation when learning the support vector regression (SVR) [18] models between the input image patch and its ground truth data; such SVR models are successfully used to predict the output pixel value for SR purposes. In this work, we choose to learn the dictionary for each context category using the associated input patches in {U_i}. We determine their sparse representation by solving the following optimization problem:

\min_{D_k, \alpha_k} \frac{1}{2} \| x_k - D_k \alpha_k \|_2^2 + \lambda \| \alpha_k \|_1, \quad k = 1, 2, \ldots, c,   (3)

where x_k is an image patch of context category k (out of c categories defined from {I_i}), D_k is the over-complete dictionary to be learned, α_k is the resulting sparse coefficient vector, and the parameter λ controls the sparsity of α_k.
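A minimal sketch of this per-category dictionary learning step is given below. The paper uses SPAMS for this step; scikit-learn's DictionaryLearning solves the same ℓ1-regularized objective and stands in for it here, and the dictionary size of 256 atoms and the value of λ are our assumptions.

```python
from sklearn.decomposition import DictionaryLearning

def learn_category_dictionaries(patches_by_category, n_atoms=256, lam=0.1):
    """patches_by_category: dict {k: (n_k, d) array of vectorized patches x_k
    from {U_i}}. Learns one over-complete dictionary D_k per category, Eq. (3)."""
    models = {}
    for k, X in patches_by_category.items():
        dl = DictionaryLearning(n_components=n_atoms, alpha=lam,
                                transform_algorithm="lasso_lars")
        A = dl.fit_transform(X)   # rows are the sparse codes alpha_k
        models[k] = (dl, A)       # keep the fitted model to encode new patches
    return models
```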
3.2. SVR Learning and Prediction for SR

We use the above sparse representation as the feature to learn SVR models, which capture the relationship between the patches from the pyramid {U_i} and the associated ground truth pixel values in {I_i}, as shown in Fig. 1(d). We apply SVR for SR learning due to its excellent generalization ability in predicting the output labels for input data. Our SVR solves the following optimization problem:

\min_{w, b, \xi, \xi^*} \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)   (4)

s.t.  y_i - (w^T \phi(\alpha_i^k) + b) \le \epsilon + \xi_i,  \quad \xi_i \ge 0,
      (w^T \phi(\alpha_i^k) + b) - y_i \le \epsilon + \xi_i^*,  \quad \xi_i^* \ge 0.
In (4), y_i is the pixel value (at the same location as the center of the input patch) in the associated ground truth image, n is the number of patches in context category k, and φ(α_i^k) is the sparse representation of the input patch in the transformed space. The weight vector w represents the SVR model, and C is the tradeoff between the generalization and the upper/lower training errors ξ_i/ξ_i^* with precision ε.
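A sketch of the per-category training step follows, using scikit-learn's LinearSVR in place of LIBSVM (the paper trains with LIBSVM and linear kernels); the values of C and ε are assumptions.

```python
from sklearn.svm import LinearSVR

def train_category_svrs(models, targets_by_category, C=1.0, eps=0.1):
    """targets_by_category[k][i]: ground-truth center-pixel value y_i in {I_i}
    for the i-th patch of category k. Fits one epsilon-SVR per category, Eq. (4)."""
    svrs = {}
    for k, (dl, A) in models.items():
        svrs[k] = LinearSVR(C=C, epsilon=eps).fit(A, targets_by_category[k])
    return svrs
```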
After the SVR models for each context category are learned, we use them to refine the synthesized HR image U_1 into the final SR output I_SR, as depicted in Fig. 1(e). More precisely, we first segment U_1 into several regions by the same mean-shift algorithm (see Sect. 2.1), and each image segment is described by our textural context feature. For each of these segments, we search for its nearest context category exemplar determined from the image pyramid {I_i} (i.e., those with red rectangles in Fig. 1(b)), and we determine the context category k of this segment accordingly. Once this category information is obtained, we extract the sparse representation α_k of each patch and apply the associated SVR model to refine/predict the pixel values of U_1, producing the final SR result.
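A sketch of this prediction stage is shown below, reusing chi2_distance, texture_feature, and the fitted models from the earlier sketches. The patch-extraction helper and the representation of segments as boolean masks are our assumptions; the paper does not give these implementation details.

```python
import numpy as np

def refine_image(U1, segment_masks, exemplar_feats, exemplar_cats,
                 models, svrs, extract_patches):
    """Refine the interpolated image U_1 into I_SR, segment by segment.
    extract_patches(image, mask) is a hypothetical helper yielding
    ((row, col), vectorized_patch) for every pixel inside the segment."""
    I_sr = U1.copy()
    for mask in segment_masks:                      # mean-shift segments of U_1
        h = texture_feature(U1, mask)
        d = [chi2_distance(h, e) for e in exemplar_feats]
        k = exemplar_cats[int(np.argmin(d))]        # nearest context category
        dl, _ = models[k]
        for (r, c), patch in extract_patches(U1, mask):
            alpha = dl.transform(patch.reshape(1, -1))   # sparse code, Eq. (3)
            I_sr[r, c] = svrs[k].predict(alpha)[0]       # SVR-refined center pixel
    return I_sr
```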
4. EXPERIMENTAL RESULTS

We collect images from USC-SIPI (http://sipi.usc.edu/database) and Fattal et al. (available at http://www.cs.huji.ac.il/~yoavhacohen/upsampling/) for our experiments. We apply SPAMS (software available at http://www.di.ens.fr/willow/SPAMS/) to learn sparse representation, and our SVR models are trained by LIBSVM (C.-C. Chang and C.-J. Lin, LIBSVM: a library for SVMs, 2001). Only linear kernels are used in this paper.

The PSNR values of six different images are reported in Table 1, all with a magnification factor of 2. To compare our approach with other existing interpolation or learning-based SR methods, we consider bicubic interpolation, locally linear embedding (LLE) for SR [7], sparse representation for SR [9], our previous SVR-based approach [10], and the approach of Irani et al. [13]. For fair comparisons, lower-resolution images were all produced in a nearest-neighbor (NN) way, and no back-projection operation was performed for any method in Table 1.

Table 1. PSNR values (in dB) of SR images produced by different methods.

Image     Bicubic   LLE [7]   Ma et al. [9]   Wang et al. [10]   Irani et al. [13]   Our method
boat      26.59     25.25     26.27           26.81              25.02               28.52
cars      27.46     26.59     27.63           27.75              26.03               28.81
skyView   23.69     22.30     23.49           23.83              22.05               25.50
lena      29.70     27.43     28.86           29.89              28.31               32.02
fruit     32.56     29.84     32.47           31.83              34.42               32.43
station   22.82     21.63     23.57           22.73              21.34               24.09

From Table 1, we see that our approach outperforms the other SR methods in PSNR except for fruit. Compared to bicubic interpolation, we obtained an average PSNR improvement of 5.5%, which is remarkably better than the other learning-based SR methods. Fig. 2 shows example high-resolution images of the ground truth and the SR results produced by Irani et al. [13] and our approach. It is clear that our SR image is qualitatively better, especially in highly textured parts (e.g., the hat regions). Moreover, we consider a larger magnification factor of 4 and compare the SR performance with Ma et al. [9] in Fig. 3.
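For reference, the PSNR values reported in Table 1 follow the standard definition, which can be computed as below; the 8-bit peak value is assumed.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between a ground-truth HR image
    and an SR estimate."""
    mse = np.mean((reference.astype(float) - estimate.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```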
Fig. 2. Example SR images with a magnification factor of 2: (a) ground truth, (b) Irani et al. [13], (c) our method. Note that the hat regions are scaled for detailed comparisons.

Fig. 3. Example SR images with a magnification factor of 4: (a) Ma et al. [9], (b) our method.

We see from Fig. 3 that our approach produces less noise and fewer artifacts, and thus a more satisfactory SR result is achieved.

5. CONCLUSION

We proposed a context-constrained self-learning approach for single image SR in this paper. Our method utilizes the image pyramid produced from an LR input, and performs context-constrained image segmentation to categorize image segments into different categories using their textural context information. We utilize such image-specific context information, and focus on learning SVR models between the interpolated image patches and the ground truth pixel values in the image pyramid. Our proposed method is unique in that we do not require the assumption of image patch self-similarity, and we need not collect training image data in advance. Given a single input image, our SR method constructs an image segment database and automatically determines the context categories of interest. Without the need for user interaction or expert prior knowledge, the SVR models learned from different context categories are used to refine the interpolated target image into its final SR version. Our experimental results verified that our approach outperforms state-of-the-art learning-based SR methods qualitatively and quantitatively. This confirms the feasibility of our method for practical SR applications, in which multiple LR images or LR/HR training image data might not be available.

Acknowledgements. This work is supported in part by the National Science Council of Taiwan via NSC 99-2221-E-001-020 and NSC 100-2631-H-001-013.

6. REFERENCES
[1] R. C. Hardie et al., "Joint MAP registration and high-resolution image estimation using a sequence of undersampled images," IEEE Trans. Image Processing, 1997.
[2] S. Farsiu et al., "Fast and robust multi-frame super-resolution," IEEE Trans. Image Processing, 2003.
[3] M. E. Tipping and C. M. Bishop, "Bayesian image super-resolution," in NIPS, 2002.
[4] S. Baker and T. Kanade, "Limits on super-resolution and how to break them," IEEE PAMI, 2002.
[5] H. Y. Shum and Z. C. Lin, "Fundamental limits of reconstruction-based super-resolution algorithms under local translation," IEEE PAMI, 2006.
[6] W. T. Freeman, T. Jones, and E. Pasztor, "Example-based super-resolution," IEEE Computer Graphics and Applications, 2002.
[7] H. Chang, D.-Y. Yeung, and Y. Xiong, "Super-resolution through neighbor embedding," in IEEE CVPR, 2004.
[8] K. S. Ni and T. Q. Nguyen, "Image superresolution using support vector regression," IEEE Trans. Image Processing, 2007.
[9] J. Yang, J. Wright, T. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Trans. Image Processing, 2010.
[10] M.-C. Yang, C.-T. Chu, and Y.-C. F. Wang, "Learning sparse image representation with support vector regression for single-image super-resolution," in IEEE ICIP, 2010.
[11] C.-Y. Yang et al., "Exploiting self-similarities for single frame super-resolution," in Asian Conf. Computer Vision, 2010.
[12] D. L. Donoho, "Compressed sensing," IEEE Trans. Information Theory, 2006.
[13] D. Glasner, S. Bagon, and M. Irani, "Super-resolution from a single image," in IEEE ICCV, 2009.
[14] E. Turgay and G. B. Akar, "Context based super resolution image reconstruction," in IEEE Workshop on LNLA, 2009.
[15] J. Sun, J. Zhu, and M. F. Tappen, "Context-constrained hallucination for image super-resolution," in IEEE CVPR, 2010.
[16] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE PAMI, 2002.
[17] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, 2007.
[18] V. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998.