Speech Synthesis Based on a Physiological Articulatory Model

Qiang Fang 1,2, Jianwu Dang 1

1 IIPL, School of Information Science, Japan Advanced Institute of Science and Technology
2 Phonetics Lab., Institute of Linguistics, Chinese Academy of Social Sciences
[email protected], [email protected]

Abstract. In this paper, a framework for speech synthesis based on a physiological articulatory model is proposed to replicate the human process of speech production. Within this framework, muscle activation patterns are first estimated from given articulatory targets by accounting for both the equilibrium characteristics and the dynamics of the muscles; the estimated activations contract the corresponding muscles and drive the articulatory model to generate a time-varying vocal tract shape corresponding to the targets. A transmission line model is then applied to the time-varying vocal tract to produce the speech sound. Finally, a preliminary experiment is carried out to synthesize the single vowels and diphthongs of Chinese with the physiological-articulatory-model-based synthesizer. The results show that the spectra of the synthetic single vowels are consistent with those of real speech, and that proper acoustic characteristics are obtained for most diphthongs.

Keywords: physiological articulatory model, speech production, acoustic model, speech synthesis, Chinese vowel
1. Introduction
The first speech synthesizer was constructed by Kratzenstein in 1779; it was a mechanical model composed of the vocal tract, glottis and lungs. Since then, speech synthesis has gone through several stages: mechanical machines, circuit-based methods, electronic facilities, and ultimately computer-based algorithms. Before the 1990s, owing to the limitation of computing power, the mainstream of speech synthesis was formant-based synthesis, which requires little memory and few computational resources. With the development of computer science and speech technology, concatenative synthesis based on large-scale corpora has become popular due to its advantage in synthesizing fairly intelligible and natural speech. However, this kind of synthesizer depends heavily on the prerecorded speech corpus and lacks the flexibility to generate various styles of speech with high quality, especially emotional and personalized speech. One alternative that addresses these problems is articulatory synthesis, which generates speech sounds by imitating the mechanisms of human speech production. In this paper, a framework for articulatory synthesis is presented based on a physiological model.
2. Physiological Model

A partial 3D model with a thick sagittal layer of the tongue has been constructed from volumetric MR images using an extended finite element method (X-FEM), where the MR images were obtained from a male Japanese speaker. The outlines of the tongue are extracted from two sagittal slices: one is the midsagittal plane and the other is a plane 1.0 cm to the left of the midsagittal plane. The outline of the left side is duplicated to the right side under the assumption that the left and right sides of the tongue are symmetric. The mesh segmentation of the tongue tissue roughly copies the fiber orientation of the genioglossus: the outline in each sagittal plane is divided into 10 radial sections that fan out from the attachment of the genioglossus on the jaw to the tongue surface, while in the perpendicular direction the tongue is divided into 6 sections. A 3D mesh model is then built by connecting the nodes in the midsagittal plane to the corresponding nodes on the left and right sides, so that each mesh cell is a brick with 8 corners. Fig. 1 illustrates the tongue model based on this segmentation. Ultimately, the tongue tissue is represented by 120 hexahedrons; each hexahedron is modeled by 28 viscoelastic cylinders (12 edges, 2 cross-wise connections in each of the 6 surfaces of the hexahedron, and 4 connections between the 8 diagonally opposite vertices inside the hexahedron), which have not only masses but also volumes. A sketch of this 28-cylinder connectivity is given after Fig. 1.
Fig. 1 The oblique view of the physiological articulatory model
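The 28-cylinder structure of each hexahedral element follows directly from its combinatorics: every pair of the 8 vertices is connected, since C(8,2) = 28 = 12 edges + 12 face diagonals + 4 body diagonals. The Python sketch below (ours, for illustration; not part of the original model code) enumerates this connectivity using the standard unit-cube vertex numbering, where bit k of the vertex index gives the coordinate along axis k.

```python
from itertools import combinations

def brick_cylinders():
    """Return the 28 vertex pairs of a hexahedron connected by cylinders."""
    def coords(v):                       # vertex index -> (x, y, z) in {0, 1}
        return ((v >> 0) & 1, (v >> 1) & 1, (v >> 2) & 1)

    kinds = {1: "edge", 2: "face diagonal", 3: "body diagonal"}
    return [(a, b, kinds[sum(ca != cb for ca, cb in zip(coords(a), coords(b)))])
            for a, b in combinations(range(8), 2)]

cyls = brick_cylinders()
assert len(cyls) == 28                                     # 12 + 12 + 4
assert sum(1 for *_, k in cyls if k == "edge") == 12
assert sum(1 for *_, k in cyls if k == "body diagonal") == 4
```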
To generate the shape of the vocal tract, an articulatory model should include the lips, teeth, tongue, hard palate, soft palate, pharyngeal wall and larynx. At present, the lips and soft palate are taken into account when constructing the area function of the vocal tract, though they are not modeled physiologically. Outlines of the vocal tract wall and mandibular symphysis are extracted from MR images in the midsagittal and parasagittal planes (0.7 and 1.4 cm from the midsagittal plane on the left side), and the configuration of the left side is then copied to the right side. The model of the articulators is shown in Fig. 1 [2].
3. Control Mechanism

The extrinsic muscles (genioglossus, geniohyoid, hyoglossus, styloglossus) and intrinsic muscles (superior longitudinalis, inferior longitudinalis, transversus, and verticalis) of the tongue, as well as the rigid organs (jaw and hyoid bone), are taken into account for manipulating the motion of the articulatory model.

To drive the physiological articulatory model, a target-based control strategy has been developed. It consists of two parts: a muscle workspace [3] and an equilibrium position mapping (EP-map) [2]. Several representative points, namely control points, are chosen to represent the motion of the model; they are used to control the shape and/or position of the tongue and the jaw. They are: the apex of the tongue in the midsagittal plane for the tongue tip; a weighted average of the highest three points of the midsagittal outline in the vocalic configuration for the tongue dorsum; and a point 0.5 cm inferior to the tip of the mandibular incisor for the jaw.

The EP-map associates muscle forces with the equilibrium positions of the control points, regardless of the starting position. That is to say, if a certain force is applied to a specific muscle, the control points converge to their equilibrium positions no matter where they start from (see [2] for details). It thus provides the static force component for controlling the model. Fig. 2 shows the EP-maps for the tongue tip and tongue dorsum; a toy illustration of an EP-map lookup follows the figure.
Fig. 2 EP-maps of muscle activation and articulator location: (a) EP-map for the tongue tip; (b) EP-map for the tongue dorsum. The curves spreading out from the central point are the trajectories of the equilibrium positions of the control points as the activation force increases from 0 to 6 N.
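Conceptually, the EP-map can be used in reverse: given a desired target position, find the muscle and force level whose tabulated equilibrium lies closest to it. A minimal sketch of such a lookup is shown below; the muscle names and table values are made-up placeholders, not the model's actual EP-map.

```python
import numpy as np

# ep_map[muscle] = rows of (force in N, equilibrium x in cm, equilibrium y in cm);
# placeholder numbers for two hypothetical muscles, sampled from 0 to 6 N.
ep_map = {
    "GGp": np.array([[0.0, 5.6, 1.3], [3.0, 5.0, 1.9], [6.0, 4.6, 2.2]]),
    "HG":  np.array([[0.0, 5.6, 1.3], [3.0, 6.2, 0.6], [6.0, 6.6, 0.1]]),
}

def nearest_equilibrium(target_xy, ep_map):
    """Return the (muscle, force) whose equilibrium position is nearest the target."""
    best = None
    for muscle, rows in ep_map.items():
        d = np.linalg.norm(rows[:, 1:] - np.asarray(target_xy), axis=1)
        i = int(d.argmin())
        if best is None or d[i] < best[0]:
            best = (d[i], muscle, float(rows[i, 0]))
    return best[1], best[2]

print(nearest_equilibrium((6.0, 0.8), ep_map))   # -> ('HG', 3.0)
```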
The muscle workspace describes the relationship between muscle activation and the displacement of the control points of the articulators. It accounts for the dynamic characteristics of articulation by reducing the distance between the current position and the target of each control point in a stepwise manner. First, four typical muscle workspaces are set up for the tongue tip and the tongue dorsum respectively, and two for the jaw, as shown in the left panel of Fig. 3. Then, a dynamic muscle workspace for the current position is derived by nonlinear interpolation among the typical muscle workspaces. When the articulatory vector from the current point (Pc) to the target (Tg) is projected onto the dynamic muscle workspace, a force projection is obtained for each muscle. Only the projections that correlate positively with the articulatory vector are taken into account for the control (see [3], [4] for details); a minimal sketch of this projection follows Fig. 3.
Fig. 3 Muscle workspace. (a) Typical muscle workspaces for the tongue tip, tongue dorsum and jaw. (b) An example of the estimation of the force vector based on the muscle workspace.
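The projection step can be summarized as follows: project the error vector Pc→Tg onto each muscle's displacement direction and recruit only muscles with a positive projection. The sketch below is our illustration under that reading of [3], [4]; the muscle names and direction vectors are hypothetical, and the real model interpolates the directions nonlinearly from the typical workspaces rather than holding them fixed.

```python
import numpy as np

def muscle_forces(pc, tg, workspace, gain=1.0):
    """workspace: muscle name -> (unit) displacement direction of the control point."""
    err = np.asarray(tg, float) - np.asarray(pc, float)   # articulatory vector Pc -> Tg
    forces = {}
    for muscle, direction in workspace.items():
        proj = float(np.dot(err, direction))              # signed force projection
        if proj > 0.0:                                    # keep positive correlations only
            forces[muscle] = gain * proj
    return forces

# Hypothetical 2-D (x, y) workspace directions for the tongue dorsum:
ws = {"GGp": np.array([-0.6, 0.8]),
      "SG":  np.array([0.9, 0.4]),
      "HG":  np.array([0.5, -0.9])}
print(muscle_forces(pc=(5.6, 1.3), tg=(4.8, 2.1), workspace=ws))  # ~{'GGp': 1.12}
```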
4. Underlying acoustic model

So far, the physiological model and its control strategy have been briefly introduced. For given targets, the articulators are driven to the desired positions by appropriate muscle contractions under this control strategy, and the shape of the vocal tract is formed by the surfaces of the articulators. To facilitate estimating the acoustic features of the vocal tract, a gridline system is adopted to describe the width of the vocal tract in the midsagittal and parasagittal planes (the thin lines in the left panel of Fig. 4), from which the area function is estimated with the improved α−β model (see [5] for details). The vocal tract is ultimately divided into 30 sections according to this gridline-based representation. The nasal cavity is divided into 12 sections, most of which have constant cross-sectional areas, except the sections around the nasal-pharyngeal port. For each section, in both the nasal cavity and the vocal tract, a transmission line model is adopted to simulate its characteristics, described by the following equations:
$(P_r - P_j)\,A_j + \rho_0 A_j \frac{x_j}{2}\frac{\partial v}{\partial t} + \frac{r S_j x_j}{2}\, v = 0$   (1)

$(U_{j+1} - U_j)\,\rho_0\,\Delta t + [(\rho_0 + \Delta\rho_j)(A_j + \Delta A_j) - \rho_0 A_j]\, x_j = 0$   (2)

$P V^{\gamma} = \mathrm{const}$   (3)

$m_j \frac{\partial^2 y}{\partial t^2} + b_j \frac{\partial y}{\partial t} + k_j y = P S_j$   (4)

$A_j = A_{0j} + y S_j$   (5)
where $P_j$ and $P_r$ are the pressures at the middle and at the right end of the $j$th sub-tube respectively, $A_j$ is the cross-sectional area of the $j$th sub-tube, $A_{0j}$ is its cross-sectional area when the vocal tract wall is at its equilibrium position, $v$ is the velocity of the air particles within the $j$th sub-tube, $S_j$ is the perimeter of the $j$th sub-tube, and $\rho_0$ is the air density at the equilibrium state. $U_j$ and $U_{j+1}$ are volume velocities related to the particle velocities by $U_j = v_j A_j$ and $U_{j+1} = v_{j+1} A_{j+1}$, where $v_j$ and $v_{j+1}$ are the particle velocities at the inlet and outlet of the $j$th sub-tube respectively; $m_j$, $b_j$ and $k_j$ are the mass, viscosity, and stiffness of the wall per unit length of the $j$th sub-tube respectively, and $y$ is the displacement of the vocal tract wall. Equation (1) expresses Newton's second law, equation (2) the law of mass conservation, equation (3) the gas law, and equation (4) the vibration of the wall. To simplify equations (1) and (2), the volume velocity $U$ within the $j$th section is represented by $U_j$ for the left half and $U_{j+1}$ for the right half, while the pressure $P$ within the sub-tube is represented by the pressure $P_j$ at the middle of the sub-tube. The following equations are thereby derived for the transmission line model of a single uniform sub-tube:
$P_j - P_r = \frac{\rho_0 x_j}{2 A_j}\frac{\partial U_{j+1}}{\partial t} + \frac{r S_j x_j}{2 A_j^2}\, U_{j+1}$   (6)

$U_j - U_{j+1} = \frac{A_j x_j}{\rho_0 c^2}\frac{\partial P_j}{\partial t} + x_j \frac{\partial A_{0j}}{\partial t} + x_j S_j \frac{\partial y}{\partial t}$   (7)

Let $L_j = \frac{\rho_0 x_j}{2 A_j}$, $R_j = \frac{r S_j x_j}{2 A_j^2}$, $C_j = \frac{A_j x_j}{\rho_0 c^2}$, $U_{dj} = x_j \frac{\partial A_{0j}}{\partial t}$, $L_{wj} = \frac{m_j}{x_j S_j^2}$, $R_{wj} = \frac{b_j}{x_j S_j^2}$, and $C_{wj} = \frac{x_j S_j^2}{k_j}$; then the equivalent circuit unit is the one shown in the right panel of Fig. 4. A transmission line model for the supra-glottal system is obtained by cascading all the sub-tubes (bottom panel of Fig. 4); a numerical sketch of the element values follows the figure. The branch for the nasal cavity is included only when producing nasal sounds. The details of calculating the volume velocity and pressure in each sub-tube, and the performance of the acoustic system, are described in [9].
Fig. 4 The supra-glottal system and its transmission line model. (a) The profile of the vocal tract for producing the sound /È/; the thin lines form the gridline system, which partitions the vocal tract into 3 major parts: a polar part, a horizontal part and a vertical part. (b) Transmission line model for a sub-tube. (c) Transmission line model for the supra-glottal system (adapted from [9]).
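The sketch below (ours) evaluates the per-section element values $L_j$, $R_j$, $C_j$ defined above from an area function. The physical constants are standard, but the viscous coefficient $r$ and the uniform 4 cm$^2$, 17 cm tract are illustrative assumptions, not the model's output.

```python
import numpy as np

RHO0 = 1.14e-3        # air density, g/cm^3
C_SOUND = 3.5e4       # speed of sound, cm/s

def section_elements(A, x, S, r=4.0e-3):
    """L, R, C of one sub-tube: area A (cm^2), length x (cm), perimeter S (cm);
    r is the viscous loss coefficient (assumed value)."""
    L = RHO0 * x / (2.0 * A)            # air-mass inductance, from eq. (6)
    R = r * S * x / (2.0 * A ** 2)      # viscous resistance, from eq. (6)
    C = A * x / (RHO0 * C_SOUND ** 2)   # air compliance, from eq. (7)
    return L, R, C

areas = np.full(30, 4.0)                      # 30 sections of 4 cm^2 (illustration)
x_j = 17.0 / 30.0                             # section length for a 17 cm tract
perims = 2.0 * np.sqrt(np.pi * areas)         # perimeters of circular sections
ladder = [section_elements(A, x_j, S) for A, S in zip(areas, perims)]
```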
The piriform fossa, a side branch behind the larynx, and the nasal sinuses, whose details were reported in [6], [7], and [8], are modeled as Helmholtz resonators. For a Helmholtz resonator, let the cross-sectional area of the neck be $A$, the length of the neck be $l$, and the volume of the container be $V$; the following equations are derived:
$F = P_{in} A - A^2 \frac{dP}{dV}\, x - R v$   (8)

$\frac{dP}{dV} = -\frac{r P_0}{V}$   (9)
According to Newton's second law, the following equation is formulated:

$\rho l A \frac{d^2 x}{dt^2} + \frac{R}{A}\frac{dx}{dt} + \frac{A^2 r P_0}{V}\, x = P_{in} A$   (10)
where $r$ is the heat capacity ratio, $\rho$ is the air density, $R$ is the viscous resistance caused by the wall of the neck, $P_{in}$ is the pressure at the inlet of the Helmholtz resonator, $P_0$ is the undisturbed pressure inside the Helmholtz resonator, and $x$ is the displacement of the air column within the neck. Let $U = A \frac{dx}{dt}$; then $x = \int \frac{U}{A}\, dt$ and $\frac{d^2 x}{dt^2} = \frac{1}{A}\frac{dU}{dt}$, hence a new equation is generated:
$\rho l \frac{dU}{dt} + \frac{R}{A^2}\, U + \frac{A r P_0}{V} \int U\, dt = P_{in}$   (11)

Fig. 4 The Helmholtz resonator: $l$ is the length of the neck, $A$ is the cross-sectional area of the neck, and $V$ is the volume of the Helmholtz resonator.
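Dropping the loss term of eq. (11) and using $c^2 = r P_0 / \rho$ yields the familiar resonance frequency of a Helmholtz resonator, $f = \frac{c}{2\pi}\sqrt{A/(lV)}$. The sketch below evaluates it; the neck and cavity dimensions are rough assumptions for illustration, not the measurements of [6]-[8].

```python
import math

def helmholtz_freq(A, l, V, c=3.5e4):
    """Resonance frequency (Hz) for a neck of area A (cm^2) and length l (cm)
    feeding a cavity of volume V (cm^3); c is the sound speed in cm/s."""
    return c / (2.0 * math.pi) * math.sqrt(A / (l * V))

# Rough, assumed piriform-fossa-like dimensions:
print(round(helmholtz_freq(A=0.3, l=0.5, V=1.0)))   # ~4315 Hz
```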
At the glottis, a glottal waveform model is used to generate the sound source for voiced sounds, whereas a noise source is generated at the constriction along the vocal tract for voiceless (turbulent) sounds [9].
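The paper does not specify the form of the glottal waveform model; as a stand-in, the sketch below generates a Rosenberg-type glottal pulse train, a classical volume-velocity waveform. The opening and closing phase fractions are assumed values.

```python
import numpy as np

def rosenberg_pulses(f0=120.0, fs=16000, dur=0.5, open_frac=0.4, close_frac=0.16):
    """Rosenberg-type glottal pulse train: f0 in Hz, fs in Hz, dur in seconds."""
    t0 = 1.0 / f0                               # pitch period, s
    n = np.arange(int(dur * fs)) / fs
    phase = n % t0                              # time within the current period
    tp, tn = open_frac * t0, close_frac * t0    # opening / closing durations
    g = np.zeros_like(phase)                    # closed phase stays at zero
    opening = phase < tp
    closing = (phase >= tp) & (phase < tp + tn)
    g[opening] = 0.5 * (1.0 - np.cos(np.pi * phase[opening] / tp))
    g[closing] = np.cos(np.pi * (phase[closing] - tp) / (2.0 * tn))
    return g                                    # volume-velocity-like source
```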
5. Speech synthesis

In the above sections, each part of the physiological-articulatory-model-based speech synthesizer has been described individually. In this section, the flowchart of speech synthesis is given systematically, and synthesis experiments are carried out on Chinese single vowels and diphthongs.

5.1 Flowchart for synthesis

To produce a specific speech sound, a speaker maintains a set of targets, e.g. articulatory targets, moves the articulators by activating certain muscles according to those targets so as to generate a specific vocal tract shape, and simultaneously excites the vocal tract with the proper sources. This is the process of human speech production. Since the purpose of this study is to generate speech sounds by simulating the human mechanism, the speech synthesizer is designed to realize this procedure. Figure 5 gives the flowchart of the processes involved in the proposed speech synthesizer. First, the articulatory targets of the control points, as well as the parameters of the lip tube and the source, are set according to the properties of the phonemes; the latter are used in calculating the acoustic characteristics. Then, the static forces are estimated by the EP-map at the beginning and used to activate the muscles, while the dynamic forces are calculated stepwise from the muscle workspace during the articulatory movement. As a result, a time-varying vocal tract is obtained by extracting the outlines of the articulators and the side branch of the nasal cavity (if nasals are planned). An acoustic model is constructed by calculating the area function and adopting the transmission line model. Speech sounds are generated by applying a subglottal pressure to the acoustic model; in this study, a subglottal pressure of 8 cm H2O is employed.
Fig. 5 Flowchart for speech synthesis by applying the physiological acoustic model
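For orientation, the control flow of Fig. 5 can be summarized as the skeleton below. Every stage is a stub standing in for the component described in Sections 2-4; only the sequencing mirrors the synthesizer, none of the physics is implemented, and all names are placeholders.

```python
import numpy as np

def estimate_static_forces(targets):           # EP-map (Section 3), stubbed
    return {m: 1.0 for m in ("GGp", "GGa", "SG", "HG")}

def estimate_dynamic_forces(state, targets):   # muscle workspace (Section 3), stubbed
    return {m: 0.1 for m in ("GGp", "GGa", "SG", "HG")}

def advance_articulators(state, f_static, f_dynamic, dt=5e-3):
    return state + dt * sum(f_static.values()) # placeholder dynamics

def area_function(state):                      # gridline + alpha-beta model (Section 4)
    return np.full(30, 4.0)                    # 30 sections, cm^2 (stub)

def synthesize(targets, n_steps=100, p_sub=8.0):    # p_sub: 8 cm H2O
    state, areas = 0.0, []
    f_static = estimate_static_forces(targets)
    for _ in range(n_steps):
        f_dyn = estimate_dynamic_forces(state, targets)
        state = advance_articulators(state, f_static, f_dyn)
        areas.append(area_function(state))
    return np.stack(areas)   # time-varying area function -> transmission line + source
```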
5.2 Synthesis of Chinese vowels and diphthongs

In this section, we attempt to synthesize vowels and diphthongs of Chinese with the proposed synthesizer. The first step is to define a target set for the basic elements: vowels and consonants. At present, only the vowels (/a/, /o/, /È/, /i/, /u/, /y/) and diphthongs (/ai/, /§u/, /ei/, /Èu/, /ia/, /iQ/, /u§/, /uo/, /yQ/) are taken into account [10].

The physiological articulatory model was derived from a Japanese speaker. To obtain the targets for the Chinese vowels, the differences between the Japanese vowels and their corresponding Chinese vowels were investigated at the articulatory level. The Japanese vowels /a/, /o/ and /i/ are almost the same as the corresponding vowels of Chinese. For the Chinese vowel /u/, however, the lips protrude and the tongue moves further backward, which results in different positions of the articulators. For Japanese /e/, the corresponding Chinese vowel is /È/, which has a more neutral position, with the profile of the vocal tract resembling a uniform tube. There is no corresponding vowel in Japanese for the Chinese vowel /y/; the articulatory targets for this vowel can, however, be derived from those of the Chinese vowel /i/ by protruding the lips and moving the highest point of the tongue forward.

Table 1 The articulatory targets for the Chinese vowels (/a/, /o/, /È/, /i/, /u/, /y/). Tt and Td represent the tongue tip and the tongue dorsum respectively; the origin is at the apex of the upper incisor. (Unit: cm)
        /a/       /o/       /È/       /i/       /u/       /y/
Jaw_x   0.7728    0.6328    0.3828    0.3828    0.4528    0.4028
Jaw_y  -1.5782   -1.3182   -0.5182   -0.4582   -0.7382   -0.4882
Tt_x    1.5428    2.4128    1.1228    1.0828    1.8728    1.1128
Tt_y   -1.5282   -0.6582   -0.5682   -0.7182   -0.3182   -0.4782
Td_x    6.3428    6.7428    5.6128    4.8328    7.8728    4.3728
Td_y    0.7318    1.2618    1.2818    2.1218    1.6818    2.1218
After examining the differences between the Chinese vowels and their corresponding Japanese vowels at the articulatory level, the targets for the Chinese vowels were estimated from the targets of the Japanese vowels manually, by means of an analysis-by-synthesis method. The resulting targets are listed in Table 1, where the origin is at the apex of the upper incisor. Fig. 6 compares the spectra of the synthetic vowels with those of real speech, both calculated by means of LPC. It demonstrates that the spectra of the synthetic vowels are consistent with those of the real speech.
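For reference, a 16th-order LPC envelope of the kind plotted in Fig. 6 can be computed with the standard autocorrelation method; the sketch below is ours (the paper does not give its analysis code), and the input frames are assumed to be supplied by the caller.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def lpc_envelope(x, order=16, n_freq=512):
    """Return (normalized frequencies, envelope in dB) of an LPC fit to frame x."""
    x = np.asarray(x, float) * np.hamming(len(x))
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation r[0], r[1], ...
    a = solve_toeplitz(r[:order], r[1:order + 1])      # normal equations R a = r
    w, h = freqz([1.0], np.concatenate(([1.0], -a)), worN=n_freq)
    return w, 20.0 * np.log10(np.abs(h) + 1e-12)

# e.g.: w, env_syn = lpc_envelope(synthetic_frame); w, env_real = lpc_envelope(real_frame)
```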
Fig. 6 LPC-based spectra for the single vowels of Chinese (/a/, /o/, /È/, /i/, /u/, /y/). The solid line represents the spectral envelope of the synthetic speech and the dashed line that of the real speech (the LPC analysis order is 16).

As for the diphthongs, the targets are derived from those of the single vowels, taking into account the coarticulation that occurs between the vowels constituting each diphthong. Moreover, in Chinese the coarticulation between vowels is not always symmetric, because one of the vowels in a diphthong or triphthong is more dominant than the others. The degree of coarticulation of each vowel within a triphthong or diphthong therefore needs to be quantified, and it is reflected by the deviation of the vowel from its typical target. The targets for the diphthongs are generated based on these considerations; a sketch of one possible derivation is given below. Fig. 7 gives the spectrograms of the synthesized diphthongs.
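One simple way to realize the asymmetric deviation described above is a weighted pull of each vowel's target toward the other, with the dominant vowel keeping more of its typical target. This is our illustrative reading, not the paper's actual quantification, and the weights are assumed values.

```python
import numpy as np

# Tongue-dorsum (Td_x, Td_y) targets from Table 1, cm:
targets = {"a": np.array([6.3428, 0.7318]),
           "i": np.array([4.8328, 2.1218])}

def diphthong_targets(v1, v2, w1=0.9, w2=0.6):
    """Realized targets for the diphthong v1+v2, with v1 dominant; w1, w2 are
    the weights each vowel keeps on its own typical target (assumed values)."""
    t1, t2 = targets[v1], targets[v2]
    return w1 * t1 + (1.0 - w1) * t2, w2 * t2 + (1.0 - w2) * t1

r_a, r_i = diphthong_targets("a", "i")   # /ai/: /a/ nearly canonical, /i/ reduced
print(r_a, r_i)
```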
Fig. 7 Spectrograms for the diphthongs of Chinese (/ai/, /§u/, /ei/, /Èu/, /ia/, /iQ/, /u§/, /uo/, /yQ/).
6. Summary

The goal of this study is to construct a corpus-independent speech synthesizer that faithfully realizes the mechanism of speech production, and that can thereby provide a way to synthesize speech sounds in a variety of styles. In this paper, a physiological-articulatory-model-based speech synthesizer is proposed and used to synthesize the single vowels and diphthongs of Chinese.

As described above, the physiological articulatory model aims to replicate the human process of speech production. For given articulatory targets, the muscle activation patterns are estimated by the EP-map and the muscle workspace and employed to drive the articulators to their targets. In this way, a time-varying vocal tract is generated, and its area function is estimated from the widths of the vocal tract in the sagittal planes. Eventually, the sound is produced by applying the transmission line model with a proper sound source.

As a preliminary examination, this framework was employed to synthesize Chinese vowels and diphthongs. The results show that, for the single vowels, the synthetic sounds have spectra consistent with those of real speech. Most of the diphthongs show proper characteristics in their spectrograms. However, for some diphthongs, such as /Èu/ and /yQ/, there seem to be problems with both the transitions and the durations of the individual phonemes. These problems may be caused by a number of factors, such as the given targets, the coarticulation between the vowels, and the control strategy of the articulatory model. In future work, we will clarify the causes using MRI and electromagnetic articulography and improve our speech synthesizer.
7. Acknowledgements

This research was conducted as part of the "21st Century COE Program" for promoting science and technology of the Ministry of Education, Culture, Sports, Science and Technology. This study was also supported in part by the MSRA/IJARC project and in part by a Grant-in-Aid for Scientific Research of Japan (No. 17300182).
References
1. Du, G., Zhu, Z., Gong, X.: Foundation of Acoustics. 2nd edn. Nanjing University Publishing House (2001)
2. Dang, J., Honda, K.: Construction and control of a physiological articulatory model. J. Acoust. Soc. Am. 115(2) (2004) 853-870
3. Dang, J., Honda, K.: Estimation of vocal tract shape from sounds via a physiological articulatory model. J. Phonetics 30 (2002) 511-532
4. Dang, J., Honda, K.: A physiological model of a dynamic vocal tract for speech production. J. Acoust. Soc. Jpn (E) 22 (2001) 415-425
5. Dang, J., Honda, K.: Speech production of vowel sequences using a physiological articulatory model. Proc. ICSLP (1998)
6. Dang, J., Honda, K.: Acoustic characteristics of the piriform fossa in models and humans. J. Acoust. Soc. Am. 101 (1997) 456-465
7. Dang, J., Honda, K.: Acoustic characteristics of the human paranasal sinuses derived from transmission characteristic measurement and morphological observation. J. Acoust. Soc. Am. 100 (1996) 3374-3383
8. Dang, J., Honda, K., Suzuki, H.: Morphological and acoustical analysis of the nasal and the paranasal cavities. J. Acoust. Soc. Am. 96 (1994) 2088-2100
9. Maeda, S.: A digital simulation method of the vocal-tract system. Speech Communication 1 (1982) 199-229
10. Wu, Z., Lin, M.: Outline of Experimental Phonetics. Higher Education Press (1988)