Coarse to Fine Face Detection Based on Skin Color Adaption
Hichem Sahbi and Nozha Boujemaa
INRIA Rocquencourt, BP 105, 78153 Le Chesnay, France
{Hichem.Sahbi, Nozha.Boujemaa}@INRIA.fr
http://www-rocq.inria.fr/imedia/
Abstract. In this paper we present a skin color approach for fast and accurate face detection which combines skin color learning and image segmentation. The approach starts from a coarse segmentation which provides regions of homogeneous statistical color distribution. Some of these regions represent parts of human skin; they are selected by minimizing an error between the color distribution of each region and the output of a compression-decompression neural network, which learns the skin color distribution for several populations of different ethnicity. This ANN is used to find a collection of skin regions, from which the parameters of the Gaussian models are re-estimated using a 2-means fuzzy clustering in order to adapt these parameters to the context of the input image. A Bayesian framework is used to perform a finer classification and makes the skin and face detection process invariant to scale and lighting conditions. Finally, a face shape based model is used to validate or reject the face hypothesis on each skin region.
1 Introduction
With the growth of the Internet, more and more data is becoming available on the Web. How can this data be organized so that information is retrieved accurately and in a reasonable time? Visual information retrieval systems use multiple generic and specific descriptors. Applying specific face descriptors to databases containing human faces is meaningful only if these descriptors are applied to the regions of interest, which means that a face localization process is required. Several methods for face detection are discussed in the literature. Rowley et al. [1] test for the existence of faces of different sizes and rotations at each image position using a neural network. Osuna et al. [2] use a support vector machine to build a face / no-face classifier by maximizing the margin between the two classes. Leung et al. [3] use a graph-matching method to find probable faces from detected facial features: graphs are generated from the detected features, and the true faces are identified among the candidates by random graph matching. Cai and Goshtasby [4] use the chrominance-invariant color space (ab) to learn skin color with Gaussian models; face detection is then performed using a template matching process.

M. Tistarelli, J. Bigun, A.K. Jain (Eds.): Biometric Authentication, LNCS 2359, pp. 112–120, 2002. © Springer-Verlag Berlin Heidelberg 2002
Fleuret et al. [5] present a coarse-to-fine face detector which is entirely based on edge configurations. This algorithm visits a hierarchical partition of the face pose space, and a detection is declared only when a complete chain of classifiers from the root to one leaf is found. Recently, Viola et al. [6] proposed a face detection method that computes features quickly using an integral image and combines classifiers in a cascade, which allows background regions to be rejected quickly. Their learning approach is purely statistical and is based on AdaBoost (see footnote 1). In this paper we present an approach for precise skin and face detection based on the use of color space properties. This approach aims to track the variation in skin color distribution from one person to another. A coarse skin color learning stage over a very large population is performed offline. This color model is used to select skin regions, and a finer color learning step is then performed using a maximum confidence scheme in order to adapt the parameters of the skin model to the persons present in the scene. In the final stage, a skin/no-skin classification is performed using a Gaussian model and a Bayesian framework. A face shape based model is used to validate the face hypothesis.
2 Skin Region Selection
To adapt the skin color learning process to the conditions of the input image, we search for a distribution of pixel colors which is the most likely to come from human skin. An exhaustive method, which examines every subset of the image pixels and measures the distance of each one to a given skin color model, is very time consuming. We therefore start with a coarse initial segmentation such as the DFDM method [7]. This segmentation provides connected regions which have a homogeneous local color distribution in the image space. Among these regions $R_i$, the skin parts (denoted $SR_i$) are detected using a distance $E$ given by:

$$E(R_i) = \frac{1}{|R_i|} \sum_{(x,y) \in R_i} \left( \Phi(c_{(x,y)}) - c_{(x,y)} \right)^2 \qquad (1)$$
Here $c_{(x,y)}$ is the color of a pixel $(x, y)$ represented in the RGB color space, and $\Phi$ is the output of a neural network trained over a large population of skin colors collected from the World Wide Web. A quadratic or, more generally, a nonlinear function such as a one-hidden-layer neural network is a good choice for a satisfactory approximation of the skin color distribution (Fig. 1). The learning is performed using the traditional back-propagation algorithm [8], and our network performs a nonlinear PCA, since for every color $c_{(x,y)}$ in the training set the difference between the input and the output is minimized. For each candidate region, we decide whether it is a skin part using a decision rule based on the empirical error between this region $R_i$ and the learned model, computed using (1).

Footnote 1: Selecting a small number of critical visual features from a learning set.

Fig. 1. (a) Neural network architecture used for skin color learning. (b) 3D distribution of skin color in the RGB color space.

Our approach does not aim to classify each pixel directly as skin or no-skin according to the ANN only. Indeed, a decision based on applying the error function $E$ directly to each pixel color can increase the number of false positives and false negatives because of noisy data and lighting variations (cf. Fig. 5(A)). In order to reduce these effects, we learn skin color under the conditions of the input image (differences in lighting and melanin). The goal is therefore to obtain a set of skin-colored pixels, which may even be small, on which a second color training process is performed for more accurate classification.
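The compression-decompression network Φ described in this section can be illustrated with a small autoencoder trained by back-propagation. This is only a sketch under assumptions that are not in the paper: the hidden layer size (2 units), the tanh activation, the learning rate, the number of epochs and the function name `train_autoencoder` are all illustrative choices, and colors are assumed to be scaled to [0, 1].

```python
import numpy as np

def train_autoencoder(colors, hidden=2, lr=0.1, epochs=300, seed=0):
    """Fit a 3-hidden-3 network so that phi(c) stays close to c on skin colors.

    Hypothetical helper: layer sizes and hyper-parameters are assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = colors.shape
    W1 = 0.1 * rng.standard_normal((d, hidden))
    b1 = np.zeros(hidden)
    W2 = 0.1 * rng.standard_normal((hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        h = np.tanh(colors @ W1 + b1)   # compression
        out = h @ W2 + b2               # decompression
        err = out - colors
        losses.append(float(np.mean(np.sum(err ** 2, axis=1))))
        # Back-propagation of the mean squared reconstruction error.
        g_out = 2.0 * err / n
        g_W2 = h.T @ g_out
        g_b2 = g_out.sum(axis=0)
        g_h = (g_out @ W2.T) * (1.0 - h ** 2)
        g_W1 = colors.T @ g_h
        g_b1 = g_h.sum(axis=0)
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        W2 -= lr * g_W2
        b2 -= lr * g_b2

    def phi(c):
        """Reconstruct one color of shape (3,) or a batch of shape (n, 3)."""
        c = np.asarray(c, dtype=float)
        return np.tanh(c @ W1 + b1) @ W2 + b2

    return phi, losses
```

Colors that resemble the training population are reconstructed with a small error, while colors far from it are not, which is the property that the distance E in (1) relies on.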
Fig. 2. The whole diagram of skin region selection. (a) The input image of Clinton. (b) Segmented image using the DFDM. (c) Selection of skin regions marked with white. [Diagram labels: Database; N. Network; Skin Region Detection; Gaussian model parameter learning; Off-line skin color learning; Fast skin region detection; Online skin color learning; The finer skin color classification.]

The coarse-to-fine algorithm shown in Fig. 2 is summarized as follows:
1. Learn the neural network weights from a skin color population which is different from the input image (off-line step).
2. For every query image I (on-line step):
– Do a coarse segmentation to obtain a collection of candidate regions Ri, i = 0...L.
– Classify each candidate region Ri as a skin or no-skin region by keeping the regions where the error E(Ri) is below a given threshold.
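The on-line region selection step can be sketched as follows. The sketch assumes that the segmentation is available as an integer label map and that Φ is any callable mapping an array of colors to an array of reconstructed colors; the names `region_error` and `select_skin_regions` and the threshold value are hypothetical, since the paper does not give them.

```python
import numpy as np

def region_error(colors, phi):
    """Equation (1): mean squared difference between phi(c) and c over a region."""
    colors = np.asarray(colors, dtype=float)
    diff = phi(colors) - colors
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def select_skin_regions(image, labels, phi, threshold):
    """Return the labels of the regions whose error E(R_i) is below the threshold."""
    selected = []
    for region_id in np.unique(labels):
        colors = image[labels == region_id]   # (n_pixels, 3) RGB values
        if region_error(colors, phi) < threshold:
            selected.append(int(region_id))
    return selected
```

Averaging the error over a whole region, rather than thresholding it per pixel, is what makes the selection less sensitive to isolated noisy pixels.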
3 Accurate Face Detection
By considering K and L−K clusters for the skin and noisy regions respectively, we define a decision rule for whether a pixel $(x, y)$ is a skin point given its color observation $c_{(x,y)}$. This decision rule is based on the following condition:

$$P\big(Y(c_{(x,y)}) = 1 \mid c_{(x,y)}\big) > P\big(Y(c_{(x,y)}) = 0 \mid c_{(x,y)}\big) \qquad (2)$$

Here $Y(c_{(x,y)}) = 1$ (resp. $Y(c_{(x,y)}) = 0$) denotes the event that the color $c_{(x,y)}$ is a skin (resp. no-skin) color, and $X(c_{(x,y)}) = s_i$ (resp. $X(c_{(x,y)}) = n_i$) denotes the event that the color $c_{(x,y)}$ is a skin color from the region $SR_i$ (resp. a color from the noisy region $N_i$). In what follows, we denote by $c$ the color $c_{(x,y)}$ of the pixel $(x, y)$. The two sides of inequality (2) are given by:

$$P(Y(c) = 1 \mid c) = \sum_{i=1}^{K} \frac{P(c \mid X(c) = s_i)\, P(X(c) = s_i)}{P(c)} \qquad (3)$$

$$P(Y(c) = 0 \mid c) = \sum_{i=1}^{L-K} \frac{P(c \mid X(c) = n_i)\, P(X(c) = n_i)}{P(c)} \qquad (4)$$
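Rule (2) with the posteriors (3) and (4) can be sketched as below. Two assumptions are made for the sketch: all priors are equal, so that they cancel together with P(c) and only the summed likelihoods are compared, and the noisy regions are modeled with Gaussians in the same way as the skin regions, which the paper does not state explicitly. The function names are hypothetical.

```python
import numpy as np

def gaussian_density(c, mean, cov):
    """Multivariate normal density evaluated at one color or a batch of colors."""
    c = np.atleast_2d(np.asarray(c, dtype=float))
    d = c.shape[1]
    diff = c - np.asarray(mean, dtype=float)
    inv = np.linalg.inv(cov)
    maha = np.einsum("ni,ij,nj->n", diff, inv, diff)   # squared Mahalanobis distance
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def is_skin(c, skin_models, noise_models):
    """Rule (2) under equal priors: compare the summed class likelihoods.

    skin_models and noise_models are lists of (mean, covariance) pairs.
    """
    skin = sum(gaussian_density(c, m, s) for m, s in skin_models)
    noise = sum(gaussian_density(c, m, s) for m, s in noise_models)
    return skin > noise
```

The function returns a boolean array with one entry per input color, so it can be applied to all pixels of an image reshaped to (n_pixels, 3).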
We can set the priors $P(X(c) = s_i)$ and $P(X(c) = n_i)$ to be equal. The density function $P(c \mid X(c) = s_i)$ is modeled as a Gaussian whose parameters are estimated as explained in the following section.

3.1 Accurate Online Training Model
Let $c_1, ..., c_k, ..., c_{M_i}$ be a quantization of the colors in a skin region $SR_i$, and $h_1, ..., h_k, ..., h_{M_i}$ the related histogram, which gives the color frequencies. The mean $\mu_i$ and the variance-covariance matrix $\Sigma_i$ of the related color distribution are respectively given by:

$$\mu_i = \frac{\sum_{k=1}^{M_i} h_k c_k}{\sum_{k=1}^{M_i} h_k}, \qquad \Sigma_i = \frac{\sum_{k=1}^{M_i} h_k (c_k - \mu_i)(c_k - \mu_i)^T}{\sum_{k=1}^{M_i} h_k}$$

When the parameters of the Gaussian model are estimated, noisy points in a skin region $SR_i$ degrade the quality of the $\mu_i$ and $\Sigma_i$ estimates; these points come from no-skin parts such as hair or glasses. In order to reduce the effect of outliers, we model each skin region as two clusters which contain relevant and noisy skin points respectively. We apply the fuzzy clustering approach [9] to compute a confidence coefficient for each color in $SR_i$. This coefficient is given by:

$$U_{p,c_k} = \frac{1}{\sum_{q=1}^{2} \left[ (d_{p,c_k})^2 / (d_{q,c_k})^2 \right]^{\frac{1}{m-1}}} \qquad (5)$$

$$J(U, v) = \sum_{p=1}^{2} \sum_{k=1}^{M_i} (U_{p,c_k})^m (d_{p,c_k})^2 \qquad (6)$$
Here $U_{p,c_k}$ expresses the membership of the color $c_k$ to the cluster $p$ ($p$ is either the skin or the noisy cluster) and $d_{p,c_k}$ is the Mahalanobis distance of the color $c_k$ to the cluster $p$. Following [9], we perform a 2-means fuzzy clustering of the points present in each skin region into noisy and relevant skin points. This is carried out by minimizing the functional (6), which reaches its global minimum when each color $c_k$ is assigned to its relevant (noisy or skin) cluster. This preprocessing step makes the learned parameters of the Gaussian model considerably more accurate; they are now modified as follows:

$$\mu_i = \frac{\sum_{k=1}^{M_i} h_k U_{skin,c_k} c_k}{\sum_{k=1}^{M_i} h_k U_{skin,c_k}}, \qquad \Sigma_i = \frac{\sum_{k=1}^{M_i} h_k U_{skin,c_k} (c_k - \mu_i)(c_k - \mu_i)^T}{\sum_{k=1}^{M_i} h_k U_{skin,c_k}} \qquad (7)$$
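Equations (5) and (7) can be sketched as below. The squared distances are taken as given: the paper uses Mahalanobis distances to the skin and noisy clusters, and how those clusters are initialized and iterated is not reproduced here. The fuzzifier default m = 2 and the function names are assumptions of the sketch.

```python
import numpy as np

def memberships(sq_dist, m=2.0):
    """Equation (5): fuzzy memberships from squared distances of shape (2, M).

    Row p holds the squared distance of every color to cluster p.
    """
    sq_dist = np.maximum(np.asarray(sq_dist, dtype=float), 1e-12)  # avoid 0/0
    ratio = (sq_dist[:, None, :] / sq_dist[None, :, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)

def weighted_gaussian(colors, hist, u_skin):
    """Equation (7): mean and covariance weighted by h_k * U_skin,c_k."""
    colors = np.asarray(colors, dtype=float)
    w = np.asarray(hist, dtype=float) * np.asarray(u_skin, dtype=float)
    mu = (w[:, None] * colors).sum(axis=0) / w.sum()
    diff = colors - mu
    outer = diff[:, :, None] * diff[:, None, :]        # (M, 3, 3) outer products
    sigma = (w[:, None, None] * outer).sum(axis=0) / w.sum()
    return mu, sigma
```

Setting all memberships to 1 in `weighted_gaussian` gives back the unweighted histogram estimates of the mean and covariance stated at the beginning of this section.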
The coefficients $U_{skin,c_k}$ are introduced as weighting values to reduce the noise effects when computing the Gaussian model's parameters.

3.2 Validating the Face Hypothesis
Given a skin region, a shape model is used to decide whether this region is a face or not. We compute two histograms corresponding to the horizontal and vertical sums of gray level information along the X and Y coordinates, as shown in Fig. 3(b). These two histograms are smoothed using a Gaussian filter to eliminate high frequency components. This process is summarized as follows:
– Construct an entropy map using a snapshot descriptor [10] (such as the gray level histogram) on each window $w(x, y)$ of the skin region. Assuming that each descriptor takes values in $c_1, ..., c_r$, the computed entropy is given by:

$$H(w(x, y)) = -\sum_{i=1}^{r} Pr_{w(x,y)}(c_i) \log_2 Pr_{w(x,y)}(c_i) \qquad (8)$$

– The Y and X histograms are computed using equations (9):

$$y_i = \sum_{j=1}^{T_x} H(w(j, i)), \qquad x_j = \sum_{i=1}^{T_y} H(w(j, i)) \qquad (9)$$
– Perform a progressive filtering to extract, respectively, the principal y and x coordinates corresponding to the lowest frequencies, i.e. the principal variation modes, of the X and Y histograms.

A skin region is taken to be a frontal face if the following two conditions are satisfied:
– The number of local extrema is three in both the horizontal and the vertical histograms; they are denoted x1, x2, x3 and y1, y2, y3 respectively (cf. Fig. 3(b)).
– We estimate the likelihood for (x1, y1), (x3, y1), (x2, y2) and (x2, y3) to be respectively the eyes, nose and mouth coordinates using a learned model. A Gaussian mixture model is used where each cluster attempts to capture the statistical distribution of the (xi, yj) coordinates of the related feature. The decision is made using the maximum likelihood principle.
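The entropy map (8), the projections (9) and the extrema count of the first condition can be sketched as follows. The window size, the number of histogram bins, the smoothing width and all function names are assumptions made for the sketch; the Gaussian mixture step that scores the eye, nose and mouth coordinates is not sketched. Gray levels are assumed to lie in [0, 255].

```python
import numpy as np

def window_entropy(window, bins=16):
    """Equation (8): entropy of the gray-level histogram of one window."""
    hist, _ = np.histogram(window, bins=bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def entropy_map(gray, half=2, bins=16):
    """Entropy of the window centred on every pixel (windows are clipped at borders)."""
    rows, cols = gray.shape
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            win = gray[max(i - half, 0):i + half + 1, max(j - half, 0):j + half + 1]
            out[i, j] = window_entropy(win, bins)
    return out

def projections(ent):
    """Equation (9): y_i sums row i over x, x_j sums column j over y."""
    return ent.sum(axis=1), ent.sum(axis=0)   # (Y histogram, X histogram)

def smooth(signal, sigma=2.0):
    """Gaussian low-pass filter; assumes the signal is longer than the kernel."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    return np.convolve(signal, kernel, mode="same")

def count_local_extrema(signal):
    """Count the sign changes of the first difference (interior minima and maxima)."""
    d = np.sign(np.diff(signal))
    d = d[d != 0]
    return int(np.sum(d[1:] != d[:-1]))
```

A region would pass the first condition when `count_local_extrema` returns 3 for both smoothed projections.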
4 Experiments
To train our neural network, we collected a set of skin maps from the World Wide Web (Fig. 3(a)). These images were chosen to span a wide range of environmental conditions (blur, noise, etc.), with people of different ethnicity and various skin colors. We tested our algorithm on the database of the French TV channel TF1. The detection performance is estimated using precision-recall curves (cf. equations (10) and (11)) with respect to the acceptance rate σ, which represents the fraction of skin colors accepted and used (considered as relevant) during the online fuzzy learning step.

$$\text{Precision} = \frac{\text{relevant detected skin pixels}}{\text{detected skin pixels}} \qquad (10)$$

$$\text{Recall} = \frac{\text{relevant detected skin pixels}}{\text{all correct skin pixels}} \qquad (11)$$
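Equations (10) and (11) can be computed from two binary masks, as in the short sketch below. It assumes that a ground-truth skin mask is available for each test image; the function name `precision_recall` and the convention of returning 0 for an empty denominator are choices made for the sketch.

```python
import numpy as np

def precision_recall(detected, truth):
    """Equations (10) and (11) from a detected mask and a ground-truth mask."""
    detected = np.asarray(detected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    relevant = np.logical_and(detected, truth).sum()   # relevant detected skin pixels
    precision = relevant / detected.sum() if detected.sum() else 0.0
    recall = relevant / truth.sum() if truth.sum() else 0.0
    return float(precision), float(recall)
```

Evaluating this function for several values of the acceptance rate σ yields one point of the precision-recall curve per value.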
Fig. 3. (a) A sample of skin maps from the WWW used during the off-line learning process. (b) X and Y gray level histogram projections used for frontal face feature detection.

According to the results (cf. Fig. 5), even when the segmentation algorithm does not provide a good result, each detected skin region contains a significant part of the skin color distribution, which is sufficient to perform a successful learning process. Figure 4 presents the precision-recall curves for both direct color filtering (using the ANN directly) and the coarse-to-fine approach. This diagram shows a considerable improvement in both precision and recall for our method with respect to using the ANN directly as a skin filter. In our experiments, the acceptance rate σ ranges roughly between 30% and 60%, and over this range both precision and recall improve with respect to the ANN filter (cf. Fig. 4). Processing time is another aspect which has been evaluated. For images of 400 × 300 pixels, the face extraction process took 0.8 s on a standard Pentium II 450 MHz, so face detection is carried out interactively and can be used to bootstrap a face tracking system.

Fig. 4. Recall and precision of skin classification for both (1) ANN direct classification (2) The coarse to fine approach.
5 Conclusion
A "coarse to fine" method is presented for accurate skin and face detection based on the combination of two coarse approaches. The approach starts from a coarse segmentation which subdivides an image into regions of homogeneous statistical color properties; a neural network skin detector then provides a vote to select regions of interest, on which a second online training step is performed to improve the skin model parameters. We are currently investigating the use of our skin classifier as an input to an SVM classifier [11] to perform the face validation step. This can be done by applying the SVM function only in the skin regions detected by our algorithm rather than sliding a window over the whole image space. This SVM is considered as a shape model which is able to handle large variations in face pose to decide whether a skin region is a face or not. Combining a fast skin detection with an SVM face detector allows us to build a face localizer which is faster and more accurate than many other existing methods.

Fig. 5. (A) Skin detection using the ANN. (B) Segmentation using the DFDM followed by a skin region selection. (C) Face detection using the coarse to fine approach followed by the application of the frontal face shape model.

Acknowledgment: We would like to thank TF1, the French TV channel, for providing us with images for tests.
References
1. H. Rowley, S. Baluja and T. Kanade: Neural network-based face detection. In IEEE Trans. on PAMI. Vol. 20, Num. 1. (1998) 23–38.
2. E. Osuna, R. Freund and F. Girosi: Training support vector machines: an application to face detection. In IEEE CVPR. (1997) 130–136.
3. T. Leung, M.C. Burl and P. Perona: Finding faces in cluttered scenes using random labelled graph matching. In ICCV. (1995).
4. J. Cai and A. Goshtasby: Detecting human faces in color images. Image and Vision Computing. Vol. 18, Num. 1. (2000) 63–75.
5. F. Fleuret and D. Geman: Coarse-to-fine visual selection. In IJCV. Vol. 41, Num. 2. (2001).
6. P. Viola and M. Jones: Robust real-time object detection. In Second International Workshop on Statistical and Computational Theories of Vision: Modeling, Learning, Computing and Sampling. (2001).
7. A. Winter and C. Nastar: Differential feature distribution maps for image segmentation and region queries in image databases. CBAIVL workshop at CVPR. (1999).
8. C.M. Bishop: Neural networks for pattern recognition. Clarendon Press, Oxford. (1995).
9. Rajesh N. Dave: Characterization and detection of noise in clustering. Pattern Recognition. Vol. 12, Num. 11. (1995) 545–561.
10. S. Gilles: Robust description and matching of images. Oxford University. (1998).
11. H. Sahbi, D. Geman and N. Boujemaa: Face detection using coarse-to-fine support vector classifiers. Submitted to the IEEE ICIP. (2002).