LIP CONTOUR EXTRACTION USING A DEFORMABLE MODEL

Alan W.C. Liew, S.H. Leung and W.H. Lau
Department of Electronic Engineering, City University of Hong Kong,
Tat Chee Avenue, Kowloon Tong, Hong Kong.
Email: {eewcliew, eeugshl, eewhlau}@cityu.edu.hk

ABSTRACT

The use of visual information from lip movements can improve the accuracy and robustness of a speech recognition system. Accurate extraction of visual features associated with the lips is thus very important. In this paper, we present a deformable model-based technique for lip contour extraction from color lip images. The geometric lip model we propose is able to fit different lip shapes. Our method starts by generating a probability map of the lip image using spatial fuzzy clustering. Then the optimum set of model parameters that partitions the lip probability map into lip and non-lip regions is found, such that the joint probability of the two regions is maximised. Experimental results indicate that accurate and robust lip contour extraction is possible using this approach.

2. LIP MODEL

Given the lip model shown in Fig. 1, the equations describing the lip shape are given by

for x ∈ [−w, w], with origin at (0, 0). The parameter s describes the skewness of the lip shape, and the skewing is achieved by using a shear transformation matrix. The exponent δ describes the deviation of y2 from a quadratic curve and allows lips with different degrees of curvature to be captured accurately. When the origin of the model is at location (xc, yc) and the lip is inclined at an angle θ (where θ is positive for counter-clockwise rotation) with respect to (xc, yc), the x and y in (1) and (2) are replaced by (x − xc)cos θ + (y − yc)sin θ and (y − yc)cos θ − (x − xc)sin θ, respectively. The set of parameters for the geometric lip model is given by


1. INTRODUCTION

Lip movement information from a speaker can significantly improve the performance of an automatic speech recognition system, especially in a noisy environment [1,2]. However, a major challenge in incorporating visual information into an acoustic speech recognition system is to find a robust and accurate method for extracting visual speech features. This is a difficult task due to the large variation caused by different speakers, illumination, and the low contrast between lip and non-lip regions. In this paper, we propose a deformable model-based approach for lip contour extraction from color lip sequences. The use of color information is important, since the intensity value alone in many lip images does not have sufficient contrast to robustly discriminate lip from non-lip regions. The use of a parametric lip model as a template has the advantage of being more robust, since the allowable shape is pre-constrained [3,4]. In addition, the lip shape can be described using only a small number of parameters, and the parameters all have clear physical interpretations. Once the equations defining the lip shape are given, the lip contour extraction task is formulated as finding the optimum partition of a given lip image into lip and non-lip regions such that the joint probability of the two regions is maximised.

α = {xc, yc, w, h1, h2, xoff, δ, s, θ}.

Figure 1. Geometric Lip Model
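One curve family consistent with the parameter descriptions in this section, with a quadratic upper lip and a power-δ lower lip, can be sketched as follows; this is an assumed form for illustration, not necessarily the authors' exact parameterization of (1) and (2):

```latex
% Assumed illustrative form of the lip curves (not the authors' exact
% equations): quadratic upper lip, power-delta lower lip.
\begin{align}
  y_1(x) &= h_1\left(1 - \frac{x^2}{w^2}\right), \\
  y_2(x) &= -h_2\left[1 - \left(\frac{x^2}{w^2}\right)^{\delta}\right],
  \qquad x \in [-w, w].
\end{align}
```

With δ = 1 the lower curve reduces to a quadratic, matching the description of δ as the deviation of y2 from a quadratic curve; the skewness s would enter through a shear of the x coordinate, and xoff would offset the peak of the upper curve.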

0-7803-6297-7/00/$10.00 © 2000 IEEE

3. PROBABILITY MAP GENERATION

We use spatial fuzzy clustering [5] to generate a probability map of a color lip image, where the value at a pixel location gives the probability of that pixel being a lip pixel. The clustering algorithm attempts to assign a probability value to each pixel, based on the distribution of the pixels in feature space and the spatial interaction of each pixel with its 8-neighborhood, such that the inter-cluster and intra-cluster distances are maximized and minimized, respectively. The clustering uses both the luminance and the chrominance information as features. The lip image, originally in the non-uniform RGB format, is transformed into the approximately uniform CIELAB and CIELUV color spaces [6] such that color distance has a Euclidean measure. A color feature vector consisting of {L, a, b, u, v}, where L is the luminance information and a, b, u, v are the chrominance information, is then generated for each pixel and used in the clustering.

For our application, we are interested in clustering the lip image into two classes, i.e., a lip class and a non-lip class. The presence of teeth in some images will make the two-class clustering problem less accurate, since outliers (i.e., teeth pixels) in the image data will bias the cluster centroids. As the chrominance value for teeth pixels is fairly consistent across speakers, we can estimate a threshold to mask out possible teeth regions so that the teeth pixels do not enter into the clustering process. Low-luminance regions, in which the chrominance information is less reliable, are also masked out.

Since the lip region usually has a lower luminance value than the skin region, we can use this information to further refine and enhance the probability map. To do this, we first compute the mean and variance of the luminance in the skin region. We then set the luminance threshold to be 4 standard deviations below the mean. Pixels having a luminance value below this threshold and a lip probability greater than or equal to 0.5 are very likely to be lip pixels.
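As an illustrative sketch only (not the authors' implementation), the two-class clustering and the luminance-based refinement described above might look as follows. Plain fuzzy c-means is used, omitting the spatial-continuity term of [5]; the fuzzifier m = 2, the centroid seeding, and the square-root stretching function are assumed placeholders:

```python
import math

def fcm_two_class(features, iters=20, m=2.0):
    """Two-class fuzzy c-means membership map (spatial term of [5] omitted).

    features: one {L, a, b, u, v} feature vector per pixel. Returns the
    per-pixel membership in the "lip" cluster, used as that pixel's lip
    probability.
    """
    def dist(p, q):
        # Euclidean distance in the (approximately uniform) color space.
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    # Seed the two centroids from two extreme pixels (an arbitrary choice).
    c_lip, c_bg = list(features[0]), list(features[-1])
    memb = [0.5] * len(features)
    for _ in range(iters):
        # Membership update: u = 1 / (1 + (d_lip / d_bg)^(2/(m-1))).
        for i, f in enumerate(features):
            d1, d2 = dist(f, c_lip), dist(f, c_bg)
            if d1 == 0.0:
                memb[i] = 1.0
            elif d2 == 0.0:
                memb[i] = 0.0
            else:
                memb[i] = 1.0 / (1.0 + (d1 / d2) ** (2.0 / (m - 1.0)))
        # Centroid update: mean weighted by u^m and (1 - u)^m.
        for c, w in ((c_lip, [u ** m for u in memb]),
                     (c_bg, [(1 - u) ** m for u in memb])):
            total = sum(w)
            if total > 0.0:
                for d in range(len(c)):
                    c[d] = sum(wi * f[d] for wi, f in zip(w, features)) / total
    return memb

def refine_probability(prob, lum, skin_lum):
    """Stretch the probability of dark pixels that are already lip-like."""
    mean = sum(skin_lum) / len(skin_lum)
    std = math.sqrt(sum((v - mean) ** 2 for v in skin_lum) / len(skin_lum))
    threshold = mean - 4.0 * std          # 4 standard deviations below mean
    # p -> sqrt(p) stands in for the paper's predefined non-linear mapping.
    return [math.sqrt(p) if v < threshold and p >= 0.5 else p
            for p, v in zip(prob, lum)]
```

Pixels near the lip centroid in feature space receive memberships near 1, and dark pixels with probability at least 0.5 are pushed further toward 1.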
The probability value of such a pixel can then be stretched to a higher value using a predefined non-linear mapping function.

Sometimes the probability map may have several blobs of high probability, of which only the largest corresponds to the lip. We use morphological opening and closing [7] to smooth the blob boundary and clean up erroneous blobs of small area. To locate the mouth region and remove any erroneous blobs that are left behind after morphological filtering, we compute the best-fit ellipse and weigh down high-probability pixels outside of the best-fit ellipse. The parameters of the best-fit ellipse, i.e., the center of mass (xm, ym), the inclination θ about the center of mass, and the semi-major and semi-minor axes, xa and ya, are computed by

x=l y=l

1

x=l y=l

,=I y=l

1

x=ly=l

(3) where M,N are the dimensions of the column and row, respectively. The (p,q)-th order central moment p p q, the column and row moments of inertia, I, and Iy, are given by ppq

$$\mu_{pq} = \sum_{x=1}^{M}\sum_{y=1}^{N} (x - x_m)^p (y - y_m)^q\,\mathrm{prob}(x,y),$$

$$I_x = \sum_{x=1}^{M}\sum_{y=1}^{N} \big((y - y_m)\cos\theta - (x - x_m)\sin\theta\big)^2\,\mathrm{prob}(x,y),$$

$$I_y = \sum_{x=1}^{M}\sum_{y=1}^{N} \big((y - y_m)\sin\theta + (x - x_m)\cos\theta\big)^2\,\mathrm{prob}(x,y). \qquad (4)$$

Equations for xa and ya in (3) can be obtained by evaluating the moments of inertia of an ellipse of constant unit mass density. The inclination angle θ is obtained by minimizing I_x in (4) with respect to θ, such that the semi-major and semi-minor axes are maximized and minimized, respectively.
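As a sketch of this computation, assuming a brute-force angle search in place of the closed-form minimizer:

```python
import math

def best_fit_ellipse_pose(prob):
    """Center of mass and inclination of the best-fit ellipse.

    prob: 2-D list prob[x][y] of lip probabilities. Returns
    (x_m, y_m, theta), where theta is the angle minimizing the moment
    of inertia I_x of (4) about the candidate major axis.
    """
    M, N = len(prob), len(prob[0])
    pts = [(x, y, prob[x][y]) for x in range(M) for y in range(N)]
    mass = sum(p for _, _, p in pts)
    xm = sum(x * p for x, _, p in pts) / mass
    ym = sum(y * p for _, y, p in pts) / mass

    def inertia(theta):
        # I_x(theta): inertia about the line through (xm, ym) at angle theta.
        c, s = math.cos(theta), math.sin(theta)
        return sum(((y - ym) * c - (x - xm) * s) ** 2 * p for x, y, p in pts)

    # Coarse 1-degree search over [0, pi) for the minimizing angle.
    theta = min((k * math.pi / 180.0 for k in range(180)), key=inertia)
    return xm, ym, theta
```

For a blob elongated along the image x axis, the search returns an inclination of zero, i.e., the major axis aligned with the elongation.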

4. LIP CONTOUR EXTRACTION

The objective is to find the optimum model parameters such that the region enclosed by the lip model has high probability and the region outside has low probability. Assuming that the probability associated with each pixel is independent of the probabilities of other pixels, the optimum partition is found when the joint probability of the lip and non-lip regions is maximized, i.e.,

$$\max_{\alpha}\;\prod_{(x,y)\in R_1}\mathrm{prob}_1(x,y)\prod_{(x,y)\in R_2}\mathrm{prob}_2(x,y) \qquad (5)$$

where α denotes the model parameters for the parametric lip model of (1) and (2), and R1 and R2 are the region enclosed by the lip model and the region outside it, respectively,


prob1(x,y) is the probability of the pixel at location (x,y) being a lip pixel and prob2(x,y) = 1 − prob1(x,y) is the probability of it being a non-lip pixel. By taking the logarithm and extending to a continuous domain, the maximization of (5) can be formulated as minimizing

$$\min_{\alpha}\;E(\alpha) = -\int_{x_c-w}^{x_c+w}\int_{y_2(\alpha,x)}^{y_1(\alpha,x)} f(x,y)\,dy\,dx \qquad (6)$$



where f(x,y) is set equal to log prob1(x,y) − log prob2(x,y) and the origin of the lip model is assumed to be at (x,y) = (xc, yc). The lip model parameters to be determined are: the translation parameters xc, yc, the model parameters w, h1, h2, xoff, δ, s, and the rotation angle θ with respect to (xc, yc). To simplify the parameter estimation task, we propose to optimize a reduced parameter set ρ = {yc, h1, h2, xoff, δ, s}. The other parameters xc, w, θ are determined beforehand and their values fixed once the two corner points of the lip are found. Before the optimization task is carried out, the image is rotated such that the inclination angle θ is zero with respect to the model. We use a conjugate gradient search [8] to estimate the optimum parameter values for ρ. By fixing xc and w, the partial derivative with respect to the parameter p_k is given by (see Appendix)

$$\frac{\partial E}{\partial p_k} = -\int_{x_c-w}^{x_c+w}\left( f(x, y_1)\frac{\partial y_1}{\partial p_k} - f(x, y_2)\frac{\partial y_2}{\partial p_k}\right) dx,$$

i.e., the integration is carried out along the lip contours y1 and y2. Since the image data are discrete, bilinear interpolation is used to obtain a continuous surface f(x,y) when implementing the conjugate gradient optimization routine. The optimization problem can then be solved iteratively in the continuous domain without resorting to discrete approximation.
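A discrete sketch of evaluating the cost in (6), with a plain pixel sum standing in for the bilinear-interpolated continuous integral (the clamping constant eps is an implementation guard, not from the paper):

```python
import math

def region_cost(prob1, y1, y2, xc, yc, w, eps=1e-6):
    """Discrete version of E in (6): minus the sum of
    f(x, y) = log prob1 - log prob2 over pixels inside the lip model.

    prob1: 2-D list prob1[x][y]; y1, y2: upper/lower model curves as
    callables of the model-centered abscissa x - xc.
    """
    cost = 0.0
    for x in range(len(prob1)):
        dx = x - xc
        if not -w <= dx <= w:
            continue
        for y in range(len(prob1[0])):
            p = min(max(prob1[x][y], eps), 1.0 - eps)   # clamp away from 0/1
            f = math.log(p) - math.log(1.0 - p)
            if y2(dx) <= y - yc <= y1(dx):
                cost -= f
    return cost
```

A model aligned with the high-probability region yields a lower cost than a displaced one, which is the behavior the conjugate gradient search exploits.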

5. RESULTS

We present in Fig. 2 some lip extraction results using the proposed method. As can be seen, the lip contour can be extracted accurately for different mouth shapes. We performed lip extraction on a large number of lip images collected from different speakers with unadorned lips, uttering the digits 0 to 9 in Cantonese. Overall, we were able to achieve above a 95% success rate in extracting a good lip contour. Geometric features useful for visual lipreading, such as the vertical opening, width and area of the mouth, can be computed easily from the estimated model parameters. The computation time required for extracting the lip contour from an RGB lip image of size about 70 by 100 pixels on a PIII 500MHz PC is about 0.05 to 0.1 seconds. For image sequences, the parameters from the previous frame are used as initial estimates for the current frame. At a frame rate of 25 frames per second, this simple initialisation is found to be adequate. We observed that the parameter optimization in our approach is not sensitive to the initial parameter settings. The initial parameter values are not required to be close to the actual values, since a region-based formulation of the cost function has a larger region of influence than an edge-based formulation. Also, a region-based formulation is inherently more tolerant to noise and image artifacts than an edge-based formulation.


Let u = y1(α, x) and v = y2(α, x) such that

$$F(\alpha) = \int_{v}^{u} f(x, y)\,dy = G(u, v).$$

Then, by the chain rule,

$$\frac{\partial F}{\partial p_k} = \frac{\partial G}{\partial u}\frac{\partial u}{\partial p_k} + \frac{\partial G}{\partial v}\frac{\partial v}{\partial p_k},$$

where p_k is the k-th element of α. Using the fundamental theorem of calculus, we have

$$\frac{\partial G}{\partial u} = f(x, u), \qquad \frac{\partial G}{\partial v} = -f(x, v).$$

Thus, we obtain

$$\frac{\partial F}{\partial p_k} = f(x, y_1)\frac{\partial y_1}{\partial p_k} - f(x, y_2)\frac{\partial y_2}{\partial p_k}.$$
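The identity above can be sanity-checked numerically with a toy integrand and toy curves u(p) = p^2 and v(p) = -p, both arbitrary choices for illustration:

```python
import math

def f(y):
    # Toy integrand; x is held fixed and suppressed.
    return math.exp(-y * y)

def F(u, v, n=2000):
    # Midpoint-rule quadrature of the integral of f from v to u.
    h = (u - v) / n
    return sum(f(v + (i + 0.5) * h) for i in range(n)) * h

def analytic_dF(p):
    # dF/dp = f(u) du/dp - f(v) dv/dp, with u = p^2 and v = -p.
    return f(p * p) * 2.0 * p - f(-p) * (-1.0)

def numeric_dF(p, h=1e-5):
    # Central finite difference of F along p.
    return (F((p + h) ** 2, -(p + h)) - F((p - h) ** 2, -(p - h))) / (2.0 * h)
```

The analytic gradient from the boundary terms matches the finite-difference gradient of the quadrature to within the discretization error.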

9. REFERENCES

[1] E.D. Petajan, "Automatic Lipreading to Enhance Speech Recognition", in Proceedings of the IEEE Global Telecommunications Conference, Atlanta, Georgia, pp. 265-272, 1984.
[2] C. Bregler, H. Hild, S. Manke and A. Waibel, "Improving Connected Letter Recognition by Lipreading", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 557-560, 1993.
[3] T. Coianiz, L. Torresani and B. Caprile, "2D Deformable Models for Visual Speech Analysis", in Speechreading by Humans and Machines, D.G. Stork and M.E. Hennecke, Eds., Springer, NY, 1996.
[4] M.E. Hennecke, K.V. Prasad and D.G. Stork, "Using deformable templates to infer visual speech dynamics", in Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers, pp. 578-582, 1995.
[5] A.W.C. Liew, S.H. Leung and W.H. Lau, "Fuzzy Image Clustering Incorporating Spatial Continuity", to appear in IEE Proc.-Vision, Image and Signal Processing.
[6] R.W.G. Hunt, "Measuring Colour", 2nd Ed., Ellis Horwood Series in Applied Science and Industrial Technology, Ellis Horwood Ltd., 1991.
[7] R. Klette and P. Zamperoni, "Handbook of Image Processing Operators", John Wiley and Sons Ltd., 1996.
[8] W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P. Flannery, "Numerical Recipes in C", 2nd Edition, Cambridge University Press, 1992.

Figure 2. Lip extraction results for different lip shapes

6. CONCLUSIONS

We present a robust deformable model-based technique for lip contour extraction from a color RGB lip image. The method uses a region-based stochastic cost function to find an optimum partition of a given lip image into lip and non-lip regions. Spatial fuzzy clustering using both luminance and chrominance features from the CIELAB and CIELUV color spaces is used to produce a probability map of the lip image. The optimum model parameters are then found by performing a conjugate gradient search on the cost function. Unlike edge-based methods, our method is not sensitive to initial parameter settings and image artifacts. Extensive experimental results show the feasibility of our approach.

7. ACKNOWLEDGEMENT

The authors wish to acknowledge that this work is supported by CERG under grant number 9040272.

8. APPENDIX

Given F(α) as defined from (6), the partial derivatives of the cost function with respect to the parameters p_k follow from the chain rule and the fundamental theorem of calculus, as derived above.
