IEEE SIGNAL PROCESSING LETTERS, VOL. 20, NO. 1, JANUARY 2013
Face Hallucination via Similarity Constraints

Hongliang Li, Senior Member, IEEE, Linfeng Xu, Member, IEEE, and Guanghui Liu
Abstract—In this letter, we present a new face hallucination method based on similarity constraints to produce a high-resolution (HR) face image from an input low-resolution (LR) face image. The method is modeled as a local linear filtering process that incorporates four constraint functions at the patch level. The first two constraints check whether the training images are similar to the input face image. The third is defined on the HR face image and imposes a smoothness constraint between neighboring hallucinated patches. The final constraint uses the spatial distance to reduce the effect of patches that are far from the patch being hallucinated. Experimental evaluation on a number of face images demonstrates the good performance of the proposed method on the face hallucination task.

Index Terms—Face hallucination, local linear filtering, super-resolution.
I. INTRODUCTION
In many cases the face images captured by live cameras are of low resolution due to environment or equipment limitations. Automatically recovering high-resolution human faces has therefore become an important problem for subsequent tasks such as face analysis and recognition, and many methods for generating a high-resolution face image have been proposed over the last decade. Learning-based face hallucination can be traced back to the work of Baker and Kanade [1], which first introduced a probabilistic-model-based face hallucination algorithm that uses the information contained in a collection of recognition decisions. Motivated by an MRF-based image super-resolution framework [2], Liu et al. [3] proposed a two-step approach to hallucinate low-resolution face images by decomposing the face appearance into a global parametric model and a local Markov network model. A major drawback of the probabilistic-model-based methods is their high computational cost and heavy memory requirements. Unlike the probabilistic models [1], [3], Wang and Tang [4] proposed a face hallucination method using eigen-transformation, which treats the input face image as a linear combination of the low-resolution face images in the training set based on Principal Component Analysis. Park et al. [5] presented a face hallucination method that reconstructs a high-resolution facial image by combining an example-based reconstruction method
Manuscript received September 07, 2012; accepted October 25, 2012. Date of current version November 16, 2012. This work was supported in part by the NSFC under Grants 60972109 and 61271289 and by the Ph.D. Programs Foundation of the Ministry of Education of China under Grant 20110185110002. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Xiao-Ping Zhang. The authors are with the School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, China (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LSP.2012.2227113
with an extended morphable face model. Jia et al. [6] proposed a generalized approach to hallucinate high-resolution face images across multiple modalities based on a hierarchical tensor space representation. Inspired by the position-patch based face hallucination method [7], Ma et al. [8] presented a simple multiview face hallucination (MFH) method to generate high-resolution multiview faces from a single low-resolution one. Jung et al. [9] improved this method by using convex optimization instead of least-squares estimation. In addition, Li et al. [10] improved face hallucination based on manifold alignment by projecting the point-pairs from the original coupled manifolds into the embeddings of a common manifold. Recently, Zhang et al. [11] proposed a learning-based face hallucination method in the Discrete Cosine Transform (DCT) domain, which can produce a high-resolution face image from a single low-resolution one. Hu et al. [12] used an input low-resolution face image to search a face database for similar example high-resolution faces in order to learn the local pixel structures of the target high-resolution face.

In this letter, a new face hallucination approach based on similarity constraints is proposed to hallucinate a high-resolution face image from an input low-resolution face image. The proposed method formulates face hallucination as a local linear filtering process based on training LR-HR face image pairs.

II. PROPOSED METHOD

A. Framework of the Proposed Method

Let $T^{l}$ and $T^{h}$ denote the low-resolution and high-resolution training face images, respectively, where $T^{l}$ is downsampled from $T^{h}$ by an integer factor. In our work, we take the default downsampling factor as 4 unless specified otherwise. Let $X^{l}$ be an input low-resolution face image, while $X^{h}$ represents the high-resolution face image to be hallucinated. Fig. 1 shows the overview of our proposed method, which automatically reconstructs a high-resolution face image by a linear filtering process. Three stages are involved in this work. For a given low-resolution face image patch, e.g., the mouth patch with the red outline at the top of Fig. 1, we first search a LR-HR face database in which all patches are stored beforehand. Then, the similarities between the input patch and each pair of LR-HR face patches are measured under different constraint conditions. Finally, we hallucinate a high-resolution image by inferring the lost details within the input low-resolution image.

Assume each image has been divided into overlapping patches with identical spacing. Let $\{(t^{l}_{i,j}, t^{h}_{i,j})\}$ denote the set of pairs of training LR-HR patches, where $i$ and $j$ are the training image and patch indices. For an input LR face patch $x^{l}$, our goal is to utilize the training patch pairs to recover the missing high-frequency details in the hallucinated patch $x^{h}$. Inspired by the guided and patch-based synthesis schemes [13]–[15], the hallucinated patch can be formulated as
$$x^{h} = \bar{x}^{l} + \sum_{(i,j)\in \mathcal{N}(x^{l})} W_{i,j}\left(t^{h}_{i,j} - \bar{t}^{h}_{i,j}\right) \qquad (1)$$
Fig. 1. Framework of our face hallucination approach. Here, $t^{l}$ and $t^{h}$ are the low-resolution and high-resolution training face patches, respectively; $x^{l}$ and $x^{h}$ are the input low-resolution face patch and the high-resolution face patch to be hallucinated.
where $\bar{x}^{l}$ and $\bar{t}^{h}_{i,j}$ are the mean values of the input LR patch $x^{l}$ and the HR patch $t^{h}_{i,j}$, respectively, and $\mathcal{N}(x^{l})$ represents the neighborhood of patch $x^{l}$. Here, the second term performs a normalization by subtracting the mean from the HR patch $t^{h}_{i,j}$. $W_{i,j}$ is defined as a filter kernel that depends on $x^{l}$, $t^{l}_{i,j}$, and $t^{h}_{i,j}$. According to the relationship among them, we take different constraints to build this kernel, which can be expressed as

$$W_{i,j} = \frac{1}{Z}\, w_{ll}\left(x^{l}, t^{l}_{i,j}\right) w_{lh}\left(x^{l}, t^{h}_{i,j}\right) w_{s}\left(t^{h}_{i,j}\right) w_{d}\left(x^{l}, t^{l}_{i,j}\right) \qquad (2)$$

where the normalizing term $Z$ ensures that the sum of $W_{i,j}$ over all candidate training patch pairs is equal to one. Notice that four terms are defined in the kernel $W_{i,j}$, which perform the similarity constraints, i.e., the LR-LR similarity $w_{ll}$, the LR-HR similarity $w_{lh}$, the smoothness constraint $w_{s}$, and the spatial similarity $w_{d}$, all defined in Section II-B.
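To make the filtering process concrete, the following minimal sketch (Python/NumPy; the function and variable names are ours, not the paper's) computes one hallucinated patch from (1) and (2), given the kernel values of the four constraint functions:

```python
import numpy as np

def hallucinate_patch(x_l, hr_candidates, weights):
    """Eq. (1): the LR patch mean plus a weighted sum of
    mean-subtracted HR training patches. `weights` holds the
    product of the four constraint terms for each candidate,
    not yet normalized; dividing by their sum is the 1/Z in (2)."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()                                  # normalizing term Z in (2)
    details = np.stack([t_h - t_h.mean() for t_h in hr_candidates])
    return x_l.mean() + np.tensordot(w, details, axes=1)
```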
B. Similarity Constraints Computation

1) LR-LR Similarity Constraint: Given a LR training face image, we have stored its corresponding HR training image beforehand, which means that all the missing high-frequency details in the LR image can be accurately estimated from its HR counterpart. Based on this consideration, we define the LR-LR similarity to compare an input patch $x^{l}$ with a LR patch $t^{l}_{i,j}$ taken from the training set. Let $d(\cdot,\cdot)$ denote the distance function. In this work, we characterize this similarity constraint as

$$w_{ll}\left(x^{l}, t^{l}_{i,j}\right) = \exp\left(-\frac{d\left(x^{l}, t^{l}_{i,j}\right)}{\sigma_{1}}\right) \qquad (3)$$

where the control parameter $\sigma_{1}$ adjusts the range of intensity similarity: a smaller $\sigma_{1}$ enforces a stricter match, while a larger one allows larger changes between the two LR patches. A straightforward choice for $d$ is the Euclidean distance, which may perform poorly under significant lighting variation or noise corruption. To achieve robustness, we incorporate a normalization process into the distance computation by subtracting the mean value from each patch, which can be viewed as a lighting balance process. The distance can be expressed as

$$d\left(x^{l}, t^{l}_{i,j}\right) = \left\| \left(x^{l} - \bar{x}^{l}\right) - \left(t^{l}_{i,j} - \bar{t}^{l}_{i,j}\right) \right\|_{1} \qquad (4)$$

where $\|\cdot\|_{1}$ denotes the $\ell_{1}$-norm distance.
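Continuing the NumPy sketch above (with an illustrative default for $\sigma_{1}$, which the paper tunes empirically), the LR-LR constraint of (3) and (4) amounts to a Laplacian kernel on a mean-normalized $\ell_{1}$ distance:

```python
def w_ll(x_l, t_l, sigma1=1.0):
    """LR-LR similarity (3)-(4): mean-subtract both LR patches
    (a simple lighting balance), take their l1 distance, and map
    it through a Laplacian kernel. sigma1 is an assumed value."""
    d = np.abs((x_l - x_l.mean()) - (t_l - t_l.mean())).sum()
    return np.exp(-d / sigma1)
```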
2) LR-HR Similarity Constraint: The LR-HR constraint is designed to measure the similarity between an input patch $x^{l}$ and a HR patch $t^{h}_{i,j}$. Since HR patches usually contain a great deal of high-frequency content that is missing from the LR patches, it is difficult to compare their similarity directly by their difference. In this work, we design a new descriptor, called the local appearance similarity (LAS) descriptor, to measure the similarity between LR and HR patches. This descriptor is generated from patch-pair similarities within a local region, as illustrated in Fig. 2. Given a LR patch $x^{l}$ and a HR training patch $t^{h}_{i,j}$, i.e., the patches marked with solid yellow lines, the LR-HR constraint is defined to measure the similarity between them. To achieve this goal, a similarity matrix is computed separately for patch $x^{l}$ or $t^{h}_{i,j}$ with respect to all patches in a search window (e.g., the red window in the left column of Fig. 2). The final LAS descriptor of a patch is the concatenation of the matrix elements in raster scan order. Since this descriptor is computed from a local appearance representation, it is able to capture the similarity across different resolutions. Let $f(x^{l})$ and $f(t^{h}_{i,j})$ denote the LAS descriptors of patches $x^{l}$ and $t^{h}_{i,j}$. In our work, the LR-HR constraint can be formulated as

$$w_{lh}\left(x^{l}, t^{h}_{i,j}\right) = \exp\left(-\frac{\left\| f(x^{l}) - f(t^{h}_{i,j}) \right\|_{1}}{\sigma_{2}}\right) \qquad (5)$$

with

$$f_{m}(x^{l}) = \exp\left(-\frac{\left\| x^{l} - x^{l}_{m} \right\|_{1}}{\delta_{1}}\right), \quad m \in \mathcal{S}(x^{l}) \qquad (6)$$

$$f_{m}(t^{h}_{i,j}) = \exp\left(-\frac{\left\| t^{h}_{i,j} - t^{h}_{i,m} \right\|_{1}}{\delta_{2}}\right), \quad m \in \mathcal{S}(t^{h}_{i,j}) \qquad (7)$$

where the parameters $\delta_{1}$ and $\delta_{2}$ adjust the descriptor similarity, and $\mathcal{S}(x^{l})$ and $\mathcal{S}(t^{h}_{i,j})$ denote the search-window neighborhoods of patches $x^{l}$ and $t^{h}_{i,j}$, respectively. In our work, we use a $5 \times 5$ window of patch positions unless otherwise specified, so the final LAS descriptor is a 25-dimensional vector.
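The LAS computation of (5)–(7) might look as follows (continuing the sketch, with assumed parameter values; `neighbors` is the list of the 25 patches in the search window, in raster scan order):

```python
def las_descriptor(patch, neighbors, delta=1.0):
    """LAS descriptor (6)-(7): Laplacian similarities between a
    patch and every patch in its local 5x5 search window, giving
    a 25-dimensional vector that is comparable across resolutions."""
    return np.array([np.exp(-np.abs(patch - n).sum() / delta)
                     for n in neighbors])

def w_lh(f_l, f_h, sigma2=1.0):
    """LR-HR similarity (5) on two precomputed LAS descriptors."""
    return np.exp(-np.abs(f_l - f_h).sum() / sigma2)
```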
3) HR Smoothness Constraint: Next, we design a constraint that checks whether similar patches are also compatible with their neighbors. We call $w_{s}$ the smoothness term; it imposes a smoothness constraint between neighboring hallucinated patches. Since we perform the face hallucination in a raster scan manner, the hallucinating patch should be compatible, within the overlapping regions, with the patches above and to its left. Let $p_{T}$ and $p_{L}$ represent the top and left patches with respect to the hallucinating patch, respectively. The HR smoothness constraint can be formulated as

$$w_{s}\left(t^{h}_{i,j}\right) = \exp\left(-\frac{\left\| \mathcal{O}_{T}\left(t^{h}_{i,j}\right) - \mathcal{O}_{T}\left(p_{T}\right) \right\|_{1} + \left\| \mathcal{O}_{L}\left(t^{h}_{i,j}\right) - \mathcal{O}_{L}\left(p_{L}\right) \right\|_{1}}{\sigma_{3}}\right) \qquad (8)$$

where $\mathcal{O}_{T}$ and $\mathcal{O}_{L}$ denote the top and left overlapping regions for the pairs of patches $(t^{h}_{i,j}, p_{T})$ and $(t^{h}_{i,j}, p_{L})$, respectively. Here, $\sigma_{3}$ is used to control the range of smoothness variation. From (8), we can see that the smoothness term measures the difference within the overlapping areas, which achieves a smooth variation between neighboring hallucinated patches. Note that a blurring effect is introduced if the overlapping regions are set too large; conversely, too small an overlap leads to distinct blocking artifacts due to the weak smoothness constraint.
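A sketch of (8) follows (assumed names and defaults; with 4 × 4 LR patches, a 3-pixel LR overlap, and an upsampling factor of 4, the HR overlap strips are 12 pixels wide, as noted in Section III):

```python
def w_s(t_h, top, left, overlap=12, sigma3=1.0):
    """HR smoothness (8): l1 difference between the candidate HR
    patch and the already hallucinated top/left neighbors inside
    the overlapping strips. `top`/`left` are None at the image
    border, where the corresponding term simply drops out."""
    d = 0.0
    if top is not None:    # top rows of t_h vs bottom rows of `top`
        d += np.abs(t_h[:overlap, :] - top[-overlap:, :]).sum()
    if left is not None:   # left cols of t_h vs right cols of `left`
        d += np.abs(t_h[:, :overlap] - left[:, -overlap:]).sum()
    return np.exp(-d / sigma3)
```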
Fig. 2. Illustration of the LAS descriptor computation, where the red window marks the search region and the yellow windows with solid lines denote the LR and HR patches. Top row: LAS descriptor computation in the input LR image. Bottom row: LAS descriptor computation in a HR training image.
4) Spatial Similarity: It is reasonable to assign small weights to those patches that are far from the hallucinating patch. In this work, we define a constraint that computes the similarity between $x^{l}$ and $t^{l}_{i,j}$ based on their spatial distance. If $c$ and $c_{i,j}$ represent the coordinates of patches $x^{l}$ and $t^{l}_{i,j}$, respectively, we have

$$w_{d}\left(x^{l}, t^{l}_{i,j}\right) = g\left(c, c_{i,j}\right) \exp\left(-\frac{\left\| c - c_{i,j} \right\|_{2}}{\sigma_{4}}\right) \qquad (9)$$

with

$$g\left(c, c_{i,j}\right) = \begin{cases} 1, & \text{if } c_{i,j} \in \mathcal{W}(c) \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$

where the parameter $\sigma_{4}$ adjusts the spatial similarity, and $g$ is a spatial window function defined by the neighborhood $\mathcal{W}(c)$ of $c$. The window size is fixed in our work; by increasing the window $\mathcal{W}$, we can incorporate more training patches into the hallucination of the HR patch.
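The spatial term of (9) and (10) can be sketched as follows (assumed window radius; `sigma4=np.inf` reproduces the uniform setting adopted in Section III):

```python
def w_d(c, c_ij, radius=2, sigma4=np.inf):
    """Spatial similarity (9)-(10): zero outside the spatial window
    W(c), Laplacian decay with the Euclidean distance between patch
    coordinates inside it; an infinite sigma4 treats all patches in
    the window uniformly."""
    if max(abs(c[0] - c_ij[0]), abs(c[1] - c_ij[1])) > radius:
        return 0.0                     # window function g in (10)
    dist = np.hypot(c[0] - c_ij[0], c[1] - c_ij[1])
    return np.exp(-dist / sigma4)
```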
III. EXPERIMENTS

In our work, all the face images are first aligned based on three marked positions: the centers of the two eyes and the center of the mouth. The original HR images are then cropped to a size of 128 × 96, where the distance between the eye centers is fixed at 40 pixels and the vertical distance between the eyes and the mouth is fixed at 48 pixels. The corresponding 32 × 24 LR images are obtained by downsampling. Given an input LR face image, we divide it into a number of overlapping patches of size 4 × 4. The overlap is set to 3 pixels, which corresponds to 12 pixels in the HR face image. In this letter, we employ the Laplacian cost function, i.e., $e^{-|x|}$, to compute the similarity constraints. The four control parameters $\sigma_{1}$, $\sigma_{2}$, $\sigma_{3}$, and $\sigma_{4}$ defined in (3), (5), (8), and (9) are set to fixed values that showed good performance in our empirical study. To handle possible alignment errors, we simply let $\sigma_{4} \to \infty$, which treats the spatial effect of the neighboring patches uniformly.
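Tying the pieces together, a raster-scan driver for the whole image might look as follows (a sketch under our assumed names; `train_pairs[(y, x)]` is a hypothetical lookup returning the candidate `(c_ij, t_l, t_h)` triples stored near patch position `(y, x)`, and the LAS term is omitted from the weight product only for brevity):

```python
def patch_grid(img, size=4, step=1):
    """Overlapping 4x4 LR patches with a 3-pixel overlap (stride 1),
    visited in the raster scan order assumed by the smoothness term."""
    H, W = img.shape
    return [((y, x), img[y:y + size, x:x + size])
            for y in range(0, H - size + 1, step)
            for x in range(0, W - size + 1, step)]

def hallucinate_image(X_l, train_pairs, scale=4, size=4, step=1):
    acc = np.zeros((X_l.shape[0] * scale, X_l.shape[1] * scale))
    cnt = np.zeros_like(acc)
    done = {}                              # already hallucinated HR patches
    for (y, x), x_l in patch_grid(X_l, size, step):
        top, left = done.get((y - step, x)), done.get((y, x - step))
        cands, weights = [], []
        for c_ij, t_l, t_h in train_pairs[(y, x)]:
            w = (w_ll(x_l, t_l) * w_s(t_h, top, left)
                 * w_d((y, x), c_ij))      # w_lh multiplies in likewise
            cands.append(t_h)
            weights.append(w)
        x_h = hallucinate_patch(x_l, cands, weights)
        done[(y, x)] = x_h
        ys, xs = y * scale, x * scale
        acc[ys:ys + size * scale, xs:xs + size * scale] += x_h
        cnt[ys:ys + size * scale, xs:xs + size * scale] += 1
    return acc / cnt                       # average the overlapping HR pixels
```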
We first perform an evaluation on a large number of face images taken from the FERET face database. About 1200 images of 873 persons were selected as training images and 300 images of 227 persons for testing; the training and testing sets contain different persons. We compare our method with state-of-the-art methods, including bicubic interpolation, Liu et al. [3], Wang et al. [4], Ma et al. [7], and Zhang et al. [11]. Some comparison results are presented in Fig. 3, where the input LR face images and the corresponding original HR face images are shown in the first and last columns, respectively. As shown in the second column of Fig. 3, distinct blurring and jagged effects can be observed for the bicubic interpolation algorithm. The results of Wang et al.'s algorithm suffer from noise in the hallucinated HR face images, while the results of Liu et al. [3] remain blurred. In contrast to these methods, Zhang et al. [11] achieve a better visual quality for the hallucinated face images. Our results are presented in column 7 of Fig. 3, which shows that good visual quality is achieved for most of the face images hallucinated by our method. To highlight the differences, two enlarged results, i.e., the eye and mouth regions of the last face images, are shown in Fig. 3(b): more details in the eyelid and mouth are recovered by our method compared with the existing methods. In addition, we also evaluate our proposed method on some face images taken from the CMU+MIT face database. Fig. 4 shows the experimental results, where the input raw image is on the left, the manually aligned LR image in the middle, and the hallucinated result on the right. Our proposed method produces reasonable results even though the test images differ from the training examples. For example, the girl in the last image is degraded by noise, yet our method still hallucinates the facial details successfully. We also perform an objective evaluation of our method. Two quantitative measures are used to compare the original HR face image and the hallucinated one, namely the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [11]. The default SSIM parameters (constant term, local window size, and dynamic range of the pixel values) follow the values recommended by its authors.
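For reference, both measures are available off the shelf; a minimal snippet (assuming the scikit-image library, which is our choice rather than the paper's) is:

```python
# Objective evaluation of a hallucinated face against the original HR
# image; data_range matches 8-bit face images.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

psnr = peak_signal_noise_ratio(hr_true, hr_hallucinated, data_range=255)
ssim = structural_similarity(hr_true, hr_hallucinated, data_range=255)
```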
The results on the FERET data set are listed in Table I, which shows that our method is superior or competitive to the state-of-the-art methods in terms of both PSNR and SSIM. Compared with the recent work by Zhang et al. [11], our method achieves gains of about 0.5626 dB in PSNR and 0.0367 in SSIM. However, as discussed in [11] and [12], we also observed that PSNR and SSIM are not always consistent with human perceptual quality. Therefore, following [13], we perform a subjective viewing test by juxtaposing the hallucination results for the viewers. We invited 19 subjects to rate the hallucination quality with a mean opinion score (MOS): Excellent (80–100), Good (60–80), Fair (40–60), Poor (20–40), and Bad (0–20). The mean opinion scores over all test images are given in Table I, which shows that our method still achieves a higher mean opinion score (75.2781 on average) than the existing methods. Note that since four similarity constraints are considered in our face hallucination method, its computational complexity is relatively high; it may be reduced by using the weight terms for early rejection of bad candidates.
Fig. 3. (a) Some examples of face hallucination results. (b) Locally enlarged results for the last two face images. Column 1: input LR face images. Columns 2–7: results by bicubic interpolation, Liu et al. [3], Wang et al. [4], Ma et al. [7], Zhang et al. [11], and our method. Last column: original face images.
IV. CONCLUSION

In this letter, we have presented a new method to generate a HR face image from an input LR face image. Inspired by our guided synthesis framework, the method provides an effective way to infer the missing high-frequency details of the input LR face image based on similarity constraints. Given the training set, four constraint functions are designed to learn the lost information from the most similar training examples. The first computes the similarity between an input LR face image and a training LR face image, while the second describes the local structure similarity between an input LR face image and a HR training image. The third imposes a smoothness constraint between neighboring hallucinated patches. The final constraint reduces the effect of patches that are far from the hallucinating patch. Experimental evaluation demonstrates the good performance of the proposed method on the face hallucination task.

REFERENCES
Fig. 4. Experimental results on some LR face images. For each example, the input image is on the left, the aligned LR image in the middle, and the hallucinated result on the right.

TABLE I. OBJECTIVE AND SUBJECTIVE COMPARISON RESULTS
[1] S. Baker and T. Kanade, "Limits on super-resolution and how to break them," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1167–1183, Sep. 2002.
[2] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, "Learning low-level vision," Int. J. Comput. Vis., vol. 40, no. 1, pp. 25–47, 2000.
[3] C. Liu, H. Y. Shum, and W. T. Freeman, "Face hallucination: Theory and practice," Int. J. Comput. Vis., vol. 75, no. 1, pp. 115–134, 2007.
[4] X. Wang and X. Tang, "Hallucinating face by eigentransformation," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 35, no. 3, pp. 425–434, Aug. 2005.
[5] J.-S. Park and S.-W. Lee, "An example-based face hallucination method for single-frame, low-resolution facial images," IEEE Trans. Image Process., vol. 17, no. 10, pp. 1806–1816, Oct. 2008.
[6] K. Jia and S. Gong, "Generalized face super-resolution," IEEE Trans. Image Process., vol. 17, no. 6, pp. 873–886, Jun. 2008.
[7] X. Ma, J. Zhang, and C. Qi, "Hallucinating face by position-patch," Pattern Recognit., vol. 43, pp. 2224–2236, 2010.
[8] X. Ma, H. Huang, S. Wang, and C. Qi, "A simple approach to multiview face hallucination," IEEE Signal Process. Lett., vol. 17, no. 6, pp. 579–582, Jun. 2010.
[9] C. Jung, L. Jiao, B. Liu, and M. Gong, "Position-patch based face hallucination using convex optimization," IEEE Signal Process. Lett., vol. 18, no. 6, pp. 367–370, Jun. 2011.
[10] B. Li, H. Chang, S. Shan, and X. Chen, "Aligning coupled manifolds for face hallucination," IEEE Signal Process. Lett., vol. 16, no. 11, pp. 957–960, Nov. 2009.
[11] W. Zhang and W.-K. Cham, "Hallucinating face in the DCT domain," IEEE Trans. Image Process., vol. 20, no. 10, pp. 2769–2779, Oct. 2011.
[12] Y. Hu, K.-M. Lam, G. Qiu, and T. Shen, "From local pixel structure to global image super-resolution: A new face hallucination framework," IEEE Trans. Image Process., vol. 20, no. 2, pp. 433–445, Feb. 2011.
[13] H. Li, G. Liu, and K. N. Ngan, "Guided face cartoon synthesis," IEEE Trans. Multimedia, vol. 13, no. 6, 2011.
[14] K. He, J. Sun, and X. Tang, "Guided image filtering," in Proc. 11th Eur. Conf. Computer Vision, Greece, Sep. 5–11, 2010, pp. 1–14.
[15] W. Zhang, X. Wang, and X. Tang, "Lighting and pose robust face sketch synthesis," in Proc. Eur. Conf. Computer Vision, Greece, Sep. 5–11, 2010, pp. 420–433.