MORPHOLOGICAL NORMALIZATION OF VOCAL TRACT SHAPE
Jianguo Wei 1,2 and Jianwu Dang 1,2
1 Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1211, Japan
2 School of Computer Science, Tianjin University, China
[email protected]; [email protected]
ABSTRACT
Articulatory databases are not used as widely as acoustic databases. One reason is the difficulty of reducing morphological variation among subjects. To reduce morphological differences in the speech organs among speakers while retaining their speech dynamics, this study proposes a framework for normalizing the vocal tract using the thin-plate spline (TPS) method. Electromagnetic midsagittal articulographic (EMMA) data from three subjects were used. The normalization template was obtained by averaging the palate and tongue shapes of all three subjects. The landmarks of the template and of each subject were defined according to a gridline system of the vocal tract. The results show that the cross-subject variation was reduced by 0.8 mm in the horizontal direction and 2.4 mm in the vertical direction. The similar vowel structure of the data before and after normalization indicates that speaker-specific characteristics are maintained by this framework. The effects of the normalization in acoustic space were also investigated using a physiological articulatory model; the results show that the variation was reduced in acoustic space as well.
Index Terms: Vocal tract normalization, articulatory data, thin-plate spline
1. INTRODUCTION
In recent years, more and more articulatory data have been collected and used in speech research alongside acoustic data. However, articulatory data are still not used as widely. Besides the difficulty of acquiring the data, one reason is that normalization of the vocal tract is a bottleneck for multi-subject articulatory studies. To discover the essential articulatory and kinematic properties shared by different speakers, inter-subject normalization of articulatory data is a necessary step for reducing the morphological variability across subjects. Since speech articulation involves large deformations, it is difficult to handle with the affine transformations used for simple rigid objects.
Several vocal tract normalization techniques have been proposed in articulatory space. Beckman et al. [1] straightened the vocal tract wall to transform the coordinates of MRI data. Hashi et al. [2] normalized the vowel posture for an x-ray microbeam
978-1-4244-4296-6/10/$25.00 ©2010 IEEE
database. Both methods straighten the palate wall to normalize the vocal tract length, which cannot guarantee that the relationship between the palate and the tongue surface is preserved after transformation. In addition, these methods cannot reflect the nonlinear relationship between subjects, especially for vocal tracts with strong local deformation.
Since the vocal tract shape usually reflects local and nonlinear deformations, it can be treated as an elastic deformation. Accordingly, this research proposes a framework to normalize inter-subject EMMA data using thin-plate spline (TPS) warping [3], a transformation function widely applied in image alignment and shape matching. We adopted TPS to realize point-based normalization of different subjects' EMMA data, with the mean shape of the vocal tract serving as the template. A gridline system was used to define the landmarks. This framework keeps the relation between the palate and the tongue position, and so maintains the kinematic properties during normalization. Reducing the morphological variability of the vocal tracts is expected to facilitate analysis, allowing a single kinematic representation to be computed for an entire group of subjects, or the kinematic representations of two different groups of subjects to be compared.
Articulatory and acoustic data of three subjects from the NTT electromagnetic midsagittal articulographic (EMMA) database [4] were used in the analysis. We evaluated the performance of our method in articulatory space. The acoustic effects of the normalization were also investigated, for which a physiological articulatory model [5] was used to generate speech from the normalized articulatory postures.
2. ELASTIC DEFORMATIONS OF VOCAL TRACT
Changes in the vocal tract (VT) shape are caused by the deformation of the elastic tissues of the tongue and the movements of the jaw.
As a result, the vocal tract shapes of different subjects do not fit properly after rigid registration. To reduce the morphological variation among speakers, some methods [1,2] normalize the vocal tract by straightening it and then normalizing its length; these methods are therefore mainly considered to be
ICASSP 2010
vocal tract length normalization. According to the results in [6], however, inter-speaker variability is related not only to the vocal tract length but also to the volumes of the back and front cavities of the vocal tract. Another drawback of straightening the vocal tract is that it does not take the nonlinear elastic nature of vocal tract deformation into account. Furthermore, the relative positions of the sensors attached to the articulators are lost after normalization, which may destroy the kinematic properties of the articulators.
A number of non-rigid normalization approaches have been proposed in the image processing field. Among them, thin-plate splines are a class of non-rigid spline mapping functions with several properties desirable for our application: they are globally smooth, separable into affine and non-affine components, and transform the data only according to landmarks that reflect the physical features shared by the source and target data. Given a set of n corresponding 2D points, the TPS warp is described by 2(n+3) parameters, comprising 6 global affine motion parameters and 2n coefficients for the correspondences of the control points. These parameters are computed by solving a linear system [7].
Suppose $(\hat{x}_i, \hat{y}_i) \in \mathbb{R}^2$, i = 1, …, n, are n control points in the plane with corresponding function values $\hat{v}_i$. The thin-plate spline interpolation $f(x, y)$ is then a mapping $f: \mathbb{R}^2 \to \mathbb{R}$, defined by
$$f(x, y) = a_1 + a_2 x + a_3 y + \sum_{i=1}^{n} w_i \, r_i^2 \ln r_i^2 \tag{1}$$
where $r_i^2 = (x - \hat{x}_i)^2 + (y - \hat{y}_i)^2$. Eq. (1) is the equation of a plate of infinite extent deforming under loads $w_i$ centered at $(\hat{x}_i, \hat{y}_i)$; the plate deflects under these loads to take the values $\hat{v}_i$ at the control points [7]. The interpolating spline consists of two parts: an affine part given by the first three terms, and a warping part given by the sum. The function f minimizes the bending energy $E_f$ over the class of such interpolants, where $E_f$ is defined as
$$E_f = \iint \left( \left( \frac{\partial^2 f}{\partial x^2} \right)^2 + 2 \left( \frac{\partial^2 f}{\partial x \, \partial y} \right)^2 + \left( \frac{\partial^2 f}{\partial y^2} \right)^2 \right) dx \, dy \tag{2}$$
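As a brief aside, Eq. (1) leaves implicit the value of the kernel term at a control point itself, where $r_i = 0$ and $\ln r_i^2$ diverges. The short derivation below is a sketch under the usual convention for this kernel: it shows that the product has a finite limit, so the term can be taken as zero there.

```latex
% With t = r_i^2, apply L'Hopital's rule to t ln t as t -> 0+.
\lim_{t \to 0^+} t \ln t
  = \lim_{t \to 0^+} \frac{\ln t}{1/t}
  = \lim_{t \to 0^+} \frac{1/t}{-1/t^{2}}
  = \lim_{t \to 0^+} (-t)
  = 0,
\qquad t = r_i^{2}.
```

With this convention, each control point contributes nothing to the warping sum at its own location, and the kernel can be evaluated numerically without special handling other than setting it to zero at $r_i = 0$.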
Three more equations are obtained using the following three constraints:
$$\sum_{i=1}^{n} w_i = 0 \tag{3}$$
$$\sum_{i=1}^{n} \hat{x}_i w_i = 0 \tag{4}$$
$$\sum_{i=1}^{n} \hat{y}_i w_i = 0 \tag{5}$$
Constraint (3) shows that the sum of the loads applied to the plate should be zero. This is needed to ensure that the plate would not move under the imposition of the loads but remain stationary. Constraints (4) and (5) require that moments with respect to x and y axes are zero, ensuring that the plate would not rotate under the imposition of the loads. The TPS parameter vectors a including a1, a2 and a3, and w including wi, can be computed by solving the following linear equation:
$$\begin{bmatrix} A & P \\ P^{T} & O \end{bmatrix} \begin{bmatrix} w \\ a \end{bmatrix} = \begin{bmatrix} v \\ 0 \end{bmatrix} \tag{6}$$
where $A_{ij} = r_{ij}^2 \ln r_{ij}^2$, with $r_{ij}$ the distance between landmarks i and j, and i, j = 1, …, n (n being the number of landmarks); the i-th row of P is $(1, \hat{x}_i, \hat{y}_i)$; O is a 3×3 matrix of zeros; and the 0 on the right-hand side of Eq. (6) is a 3×1 zero vector. The vectors w, a and v are formed from the $w_i$, from $a_1, a_2, a_3$ and from the $\hat{v}_i$, respectively. The (n+3)×(n+3) matrix on the left is denoted K hereafter.
In this research, we focus on mapping points $(x, y)$ of the EMMA data to template coordinates $(x', y')$, given the landmarks $(\hat{x}_i, \hat{y}_i)$ defined on one subject's EMMA data and the corresponding landmarks $(\hat{x}'_i, \hat{y}'_i)$ defined on the template. We are therefore interested in warping 2D points using a TPS defined by pairs of control points. To that end, we applied TPS functions to the x and y coordinates separately. From Eq. (6), the TPS warp that maps $(\hat{x}_i, \hat{y}_i)$ to $(\hat{x}'_i, \hat{y}'_i)$ can be recovered by
$$\begin{bmatrix} w_x & w_y \\ a_x & a_y \end{bmatrix} = K^{-1} \begin{bmatrix} \hat{x}' & \hat{y}' \\ 0 & 0 \end{bmatrix} \tag{7}$$
where $\hat{x}'$ and $\hat{y}'$ are the vectors formed from the $\hat{x}'_i$ and $\hat{y}'_i$, respectively, and each 0 is a 3×1 zero vector. Here $w_x$ and $a_x$ are the parameters for the x-dimension, and $w_y$ and $a_y$ those for the y-dimension. The transformed coordinates $(x'_j, y'_j)$ of the points $(x_j, y_j)$ are given by
$$\begin{bmatrix} x' & y' \end{bmatrix} = \begin{bmatrix} B & Q \end{bmatrix} \begin{bmatrix} w_x & w_y \\ a_x & a_y \end{bmatrix} \tag{8}$$
where $B_{ji} = \left( (x_j - \hat{x}_i)^2 + (y_j - \hat{y}_i)^2 \right) \ln \left( (x_j - \hat{x}_i)^2 + (y_j - \hat{y}_i)^2 \right)$, i = 1, …, n, j = 1, …, m (m being the number of raw data points to be transformed). The j-th row of Q is $(1, x_j, y_j)$, and the j-th rows of the resulting vectors $x'$ and $y'$ are the interpolated coordinates $x'_j$ and $y'_j$, respectively [8].
3. LANDMARK SELECTION
Articulatory data recorded by EMMA, however, do not show the vocal tract shape as sharply as data recorded by imaging systems such as magnetic resonance imaging (MRI) or X-ray. It is not easy to find corresponding points along the vocal tract that have a clear morphological meaning across subjects. To overcome this problem, we defined the landmarks in the vocal tract space by a gridline system modified from [9], which has been used to measure the morphology of the vocal tract for describing its acoustical properties.
In this research, we used the mean shape of the vocal tracts obtained from the three subjects' EMMA data as the template for normalization. A set of landmarks is first defined on the template, and landmarks are then defined on the EMMA data of each subject. There is no explicit way to define corresponding feature points between the template and the subjects; the most identifiable feature points are those marking the sensors on the tongue surface from the tongue tip to the tongue rear, named T1 to T4. In the processing, we first calculated the average tongue positions along the tongue surface (from tongue tip to tongue rear) over all the vowels in the database, and then the centroid point of the
tongue surface. The gridline system is constructed from the tongue surface and its centroid point, with equal fan sections covering the region of tongue movement. The edges of the ten sub-fans intersect the palate line, the middle line, the tongue surface and the line below the tongue surface, giving 44 intersection points in total, which served as the landmarks in the normalization. The landmarks of each subject were defined by the same procedure. The results are shown in Fig. 1.
Fig. 3. The data after normalization. The symbols have the same meanings as in Fig. 2.
[Fig. 1 panels: The landmarks of subject 1; The landmarks of subject 2; The landmarks of subject 3; The landmarks of template. Axes: Anterior-posterior (cm), Vertical (cm).]
Fig. 1. The landmarks of the three subjects and the template. Each panel shows the palate, the tongue surface and the grid lines. The lips point to the left. The circles are the landmarks.
4. EXPERIMENTS
In our experiments, we extracted 320 configurations of the vocal tract for each vowel in different contexts from the EMMA data, which involved 5 Japanese vowels and 8 consonants. Fig. 2 shows the distribution of the original configurations, taken from stable segments of the vowels. Fig. 3 shows the normalized vowels from the three subjects' EMMA data. Compared with Fig. 2, the variation between subjects is reduced, and the palate curve of each subject almost overlaps the palate of the template.
[Figure panels: pre-normalized /a/, /e/, /o/, /u/, /i/.]
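The normalization applied in these experiments rests on the TPS fit of Eqs. (6) and (7) and the warp of Eq. (8). The NumPy sketch below illustrates those two steps under stated assumptions; it is not the authors' implementation. The function names `tps_kernel`, `tps_fit` and `tps_warp` are hypothetical, the kernel is taken as zero at zero distance, and the linear system is solved directly without regularization.

```python
import numpy as np


def tps_kernel(sq_dist):
    """Kernel r^2 * ln(r^2), given squared distances; set to 0 where r = 0."""
    sq_dist = np.asarray(sq_dist, dtype=float)
    out = np.zeros_like(sq_dist)
    mask = sq_dist > 0
    out[mask] = sq_dist[mask] * np.log(sq_dist[mask])
    return out


def pairwise_sq_dist(p, q):
    """Squared Euclidean distances between rows of p (m x 2) and q (n x 2)."""
    diff = p[:, None, :] - q[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def tps_fit(src_landmarks, dst_landmarks):
    """Solve Eq. (7): return (w, a) with w of shape (n, 2) and a of shape (3, 2).

    Column 0 holds the x-dimension parameters, column 1 the y-dimension ones.
    """
    n = src_landmarks.shape[0]
    A = tps_kernel(pairwise_sq_dist(src_landmarks, src_landmarks))
    P = np.hstack([np.ones((n, 1)), src_landmarks])
    # Assemble K = [[A, P], [P^T, O]] as in Eq. (6).
    K = np.zeros((n + 3, n + 3))
    K[:n, :n] = A
    K[:n, n:] = P
    K[n:, :n] = P.T
    # Right-hand side: target landmark coordinates stacked over a 3 x 2 zero block.
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = dst_landmarks
    params = np.linalg.solve(K, rhs)
    return params[:n], params[n:]


def tps_warp(points, src_landmarks, w, a):
    """Apply Eq. (8): map points (m x 2) into the template coordinates."""
    B = tps_kernel(pairwise_sq_dist(points, src_landmarks))
    Q = np.hstack([np.ones((points.shape[0], 1)), points])
    return B @ w + Q @ a
```

In the setting of this paper, `src_landmarks` would hold the 44 landmarks of one subject and `dst_landmarks` the corresponding template landmarks, and `tps_warp` would then be applied to every recorded sensor position of that subject. The landmarks must be distinct and not all collinear, otherwise K is singular.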
5. EVALUATIONS
The normalization results are evaluated in articulatory space and in acoustic space.
5.1. Evaluation in articulatory space
To evaluate our method, we chose the straightening palate wall method [1], a dominant method in this field, as the baseline for comparison. Fig. 4 shows the results of applying the straightening palate wall method to the same articulatory data set. Compared with the results of our TPS-based method in Fig. 3, the results of the straightening palate wall method show more variance in the distribution.
The cross-subject standard deviations (SD) of the sensors on the tongue surface are shown in Fig. 5 for the raw data, for the data normalized by the straightening palate wall method, and for the data normalized by the TPS-based method. With the TPS method, the cross-subject standard deviations were reduced by about 0.8 mm in the X-dimension and 2.4 mm in the Y-dimension, averaged over all sensors and all five Japanese vowels. Fig. 5 also shows that the straightening palate wall method makes the standard deviations in the X-dimension even larger than in the raw data. The reason is that the method distorts the data in the X-dimension when it straightens the palate wall while keeping the observed points perpendicular to the palate.
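The cross-subject SD comparison above can be sketched as follows. The text does not state exactly how the SD was computed, so this illustration assumes that each subject is first summarized by a mean position per sensor (for one vowel), that the SD is then taken across subjects with the sample estimator (`ddof=1`), and that the reported reduction is the difference of SDs averaged over sensors. The function names are hypothetical.

```python
import numpy as np


def cross_subject_sd(subject_means):
    """SD across subjects for each sensor and dimension.

    subject_means: array of shape (n_subjects, n_sensors, 2), in mm.
    Returns an array of shape (n_sensors, 2); ddof=1 is an assumption.
    """
    subject_means = np.asarray(subject_means, dtype=float)
    return subject_means.std(axis=0, ddof=1)


def mean_sd_reduction(raw_means, normalized_means):
    """Average over sensors of (raw SD - normalized SD), per dimension (x, y)."""
    diff = cross_subject_sd(raw_means) - cross_subject_sd(normalized_means)
    return diff.mean(axis=0)
```

A positive entry in the returned pair means that normalization reduced the cross-subject spread in that dimension; averaging the result over the five vowels would give a single figure per dimension comparable to the values quoted in the text.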
[Plot residue: vowel scatter panels (symbols a, i, u, e, o). Axes: Anterior-posterior (cm), Vertical (cm).]
Fig. 2. The raw data before normalization. Each panel shows the data for one vowel, with the three subjects each denoted by a different color. Stars denote the tongue tip, crosses the tongue blade, triangles the tongue dorsum and circles the tongue rear. The contour of a physiological articulatory model is drawn with a dashed line in the last panel for reference.
Fig. 4. The data after normalization with the straightening palate wall method. The symbols have the same meanings as in Fig. 2.
[Fig. 5 panels: SD (mm) on the X-dimension and on the Y-dimension of Tip, Blade, Dorsum and Rear, each over the vowels /a/, /i/, /u/, /e/, /o/. Legend: Raw, Straighten, TPS.]
Fig. 5. Comparison of the cross-subject standard deviations of raw and normalized data. The blue bars denote the raw EMMA data, the green bars the straightening palate wall method, and the red bars the TPS-based method.
5.2. Evaluation in acoustic space
To evaluate the effects of the TPS-based normalization method on acoustics, we investigated the acoustic features of the raw and normalized data. We used the normalized data as input to a physiological articulatory model to generate the full vocal tract shape, and then synthesized each vowel in 320 contexts. The first three formants of the synthesized vowels were calculated and compared with those of the original sounds in the EMMA database. Table 1 shows the first three formants averaged over the EMMA data and over the synthesized vowels. The results imply that the TPS-based normalization maintains the acoustic characteristics of the vowels.
Table 1. The average and SD of formants (Hz)
TPS        /a/    /i/    /u/    /e/    /o/
F1         667    357    323    531    503
STD_F1      11     43     46     32     30
F2        1296   1956   1235   1785    887
STD_F2      41     93    132    106     98
F3        2607   2790   2207   2534   2602
STD_F3      35     62    121     49     86
EMMA       /a/    /i/    /u/    /e/    /o/
F1         626    315    366    458    424
STD_F1      60     50     55     49     48
F2        1277   2141   1319   1832    798
STD_F2     111    151    140    139    104
F3        2544   2941   2462   2568   2688
STD_F3     529    470    387    437    582
5.3. Evaluation of the preservation of speech dynamics
To evaluate whether the speech dynamics were preserved after normalization, we plotted the vowel diagrams of the raw and normalized data of each subject, as shown in Fig. 6. The shapes of the vowel structure of the normalized data on T1 to T4 are clearly very similar to the original ones for each subject. This result indicates that speaker-specific characteristics are maintained while the inter-speaker dynamic range is reduced.
[Fig. 6 panels: The pre-normalization vowel structure of subjects; The post-normalization vowel structure of subjects. Axes: Anterior-posterior (cm), Vertical (cm).]
Fig. 6. The vowel diagram of each subject before and after normalization.
6. CONCLUSION
In this research, we proposed a framework for normalizing articulatory data across subjects by means of the TPS transformation, and evaluated its performance. The evaluation showed that inter-subject variation was reduced in articulatory space as well as in acoustic space. For vowels, the standard deviations averaged over all tongue sensors were reduced by around 0.8 mm in the horizontal direction and 2.4 mm in the vertical direction. The vowel diagrams indicate that the method is capable of maintaining speaker-specific characteristics. Correctly minimizing the morphological variation would be of great help in discovering and modeling the essential properties of articulatory movements.
7. ACKNOWLEDGMENTS
The authors especially thank NTT Communication Laboratories for permission to use the articulatory data. This study was supported in part by SCOPE (071705001) of the Ministry of Internal Affairs and Communications (MIC), Japan.
8. REFERENCES
[1] M. E. Beckman, T.-P. Jung, S.-h. Lee, K. de Jong, A. K. Krishnamurthy, S. C. Ahalt, K. B. Cohen, and M. J. Collins, "Variability in the production of quantal vowels revisited," J. Acoust. Soc. Am., vol. 97, pp. 471-490, 1995.
[2] M. Hashi, J. R. Westbury, and K. Honda, "Vowel posture normalization," J. Acoust. Soc. Am., vol. 104, pp. 2426-2437, 1998.
[3] F. L. Bookstein, "Principal warps: Thin plate splines and the decomposition of deformations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, pp. 567-585, 1989.
[4] T. Okadome and M. Honda, "Generation of articulatory movements by using a kinematic triphone model," J. Acoust. Soc. Am., pp. 453-463, 2001.
[5] J. Dang and K. Honda, "Construction and control of a physiological articulatory model," J. Acoust. Soc. Am., vol. 115, pp. 853-870, 2004.
[6] C.-S. Yang and H. Kasuya, "Uniform and non-uniform normalization of vocal tracts measured by MRI across male, female and child," IEICE Trans. Inf. & Syst., vol. E78-D, no. 6, pp. 732-737, 1995.
[7] L. Zagorchev and A. Goshtasby, "A comparative study of transformation functions for nonrigid image registration," IEEE Trans. Image Processing, vol. 15, pp. 529-538, 2006.
[8] J. Lim and M. H. Yang, "A direct method for modeling non-rigid motion with thin plate spline," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[9] D. Beautemps, P. Badin, and R. Laboissière, "Deriving vocal-tract area function from midsagittal profiles and formants frequencies: A new model for vowels and fricative consonants based on experimental data," Speech Communication, vol. 16, pp. 27-47, 1995.