Statistics-based Direction Finding for Training Vowels

Ilsuh Bak, Cheolwoo Jo
SASPL, School of Mechatronics, Changwon National University, Korea
[email protected], [email protected]

Abstract

In this paper, we develop a vowel training assistance method based on vowel formant statistics. Formant statistics were obtained from a PBW set consisting of 452 words spoken by 8 speakers. We then calculated the distance from the input formants to the center of each vowel in the formant space. Based on these distances, the method suggests directions for correcting the speaker's manner of articulation, i.e. the position of the jaw and tongue.

1. Introduction

The importance of communicating with correct pronunciation is increasing in the information age. Correction of the pronunciation of people with speech disabilities is performed by a small number of speech pathologists, and becoming such an expert requires much time and training. Training software is one way to overcome these barriers. Computer-based training can give visual feedback and can give the trainee detailed goals and motivation, so such software is being developed in various forms. [1][2][3][4]

The purpose of this paper is to develop assistant software for the automatic correction and training of speech disabilities, focused on vowels, using speech signal processing methods. We used formant statistics from a large speech database. We then propose a method that gives information to guide vowel articulatory movements in the right direction, based on the measured distance from the center of each vowel in the formant map.

2. Overview of Proposed Method

The proposed method is based on the formant space of vowels. Figure 1 shows the vowel rectangle of Korean. It is well known that the formant (F1, F2) space is similar to the articulatory vowel map. If we can detect the formant differences between the speaker's vowel and the normal vowel, it is possible to compute the relative difference in the location of the tongue and jaw. [1]

Figure 2 shows the concept of the proposed method. First, we compute formant statistics from a large database of normal pronunciation. To obtain good results, a sufficiently large database has to be collected, and the database has to be grouped by sex and age. From the formant statistics, the mean and deviation of each vowel are computed. The mean of F1 and F2 of each vowel thus becomes the numerical center of that vowel in the F1-F2 space. These centers are marked 'o' in Figure 2.
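The statistics step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `vowel_centers`, the dictionary layout, and the sample values are hypothetical, and the population standard deviation is assumed as the deviation measure.

```python
from collections import defaultdict
from statistics import mean, pstdev


def vowel_centers(samples):
    """Group (vowel, F1, F2) measurements by vowel and compute per-vowel statistics."""
    grouped = defaultdict(list)
    for vowel, f1, f2 in samples:
        grouped[vowel].append((f1, f2))

    stats = {}
    for vowel, points in grouped.items():
        f1_values = [p[0] for p in points]
        f2_values = [p[1] for p in points]
        stats[vowel] = {
            # The (mean F1, mean F2) pair is the numerical center of the vowel.
            "center": (mean(f1_values), mean(f2_values)),
            # Assumption: population standard deviation as the deviation measure.
            "deviation": (pstdev(f1_values), pstdev(f2_values)),
            "count": len(points),
        }
    return stats


# Hypothetical measurements in Hz, for illustration only (not taken from the database).
samples = [
    ("eoc", 490.0, 1180.0),
    ("eoc", 510.0, 1200.0),
    ("oxc", 380.0, 1080.0),
    ("oxc", 400.0, 1100.0),
]
stats = vowel_centers(samples)
print(stats["eoc"]["center"])
```

In practice one such table would be computed per speaker group (sex, age), as the text requires.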

Figure 1. Articulation Map of Korean Vowel

When an unknown speech sample is input, its formant values are calculated as well, and the distance between the input formants and the mean formants of each vowel is computed. Listing the computed distances from the smallest gives the order of closeness between the input voice and the standard vowels.

[Diagram in the F1-F2 plane with the points labeled INPUT, TARGET and Near POINT]

Figure 2. Tracking of Target Voice from Input

If the target is not the vowel closest to the input, the method computes the distance between the input and the target and tells the speaker how to move the pronunciation toward the target by moving the mouth and jaw. The movement is computed from the differences in F1 and F2. [5] The value of F1 is inversely related to tongue height: F1 tends to be larger for a lower jaw position. The value of F2 is related to the front-back position of the tongue: F2 tends to be larger for front vowels and smaller for back vowels. Based on this knowledge, we can suggest directions toward the correct pronunciation by comparing the formant positions of the input vowel and the target vowel.

In general, the F1-F2 space varies widely with age and gender, so some kind of normalization procedure is required. In this experiment, however, we confined the age range to 20 to 30, and the database was collected from male announcers in that age range. To analyze the collected voices, we computed the formant frequencies of the vowels in the database using linear predictive analysis of order 17. A parabolic interpolation method was used to compute the F1 and F2 values, and the Euclidean distance was used to measure the distance between the target and the input.
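The paper does not detail how the parabolic interpolation was applied. A common choice, assumed here, is to fit a parabola through each local maximum of the sampled LPC spectral envelope and its two neighbors, so that the formant frequency is not limited to the frequency grid. The function names and the synthetic envelope below are hypothetical.

```python
def parabolic_peak(y_prev, y_peak, y_next):
    """Vertex of the parabola through three equally spaced samples.

    Returns (offset, height): offset is the vertex position relative to the
    middle sample, in units of the sample spacing.
    """
    denom = y_prev - 2.0 * y_peak + y_next
    if denom == 0.0:
        # Flat neighborhood: keep the grid position.
        return 0.0, y_peak
    offset = 0.5 * (y_prev - y_next) / denom
    height = y_peak - 0.25 * (y_prev - y_next) * offset
    return offset, height


def refine_peaks(spectrum, bin_hz):
    """Return interpolated peak frequencies (Hz) of a sampled spectral envelope."""
    peaks = []
    for k in range(1, len(spectrum) - 1):
        if spectrum[k] > spectrum[k - 1] and spectrum[k] >= spectrum[k + 1]:
            offset, _ = parabolic_peak(spectrum[k - 1], spectrum[k], spectrum[k + 1])
            peaks.append((k + offset) * bin_hz)
    return peaks


# Synthetic envelope with a single parabolic peak at 502 Hz, sampled every 50 Hz.
spectrum = [-((k * 50.0 - 502.0) ** 2) for k in range(21)]
peaks = refine_peaks(spectrum, 50.0)
print(peaks)
```

With a real envelope, the two lowest refined peaks would be taken as F1 and F2; that selection step is also an assumption and is not shown.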

3. Experimental Results

The data used in this experiment are a subset of SITEC's PBW DB, which includes 452 isolated words or word compositions recorded from 8 male announcers aged between 20 and 30. The speakers have normal voices that can be regarded as standard. We used only the vowel parts, obtained by automatic HMM-based segmentation of each phoneme. The recording equipment and conditions were as follows:

- Microphone: Sennheiser HMD224X
- Place: soundproof booth
- Sampling: 16 kHz, 16 bit

The recordings were stored on digital audio tape and then A/D-converted in a PC environment; the AD/DA module used was KAY's CSL 4300B. Table 1 shows the mean and variance of each vowel formant, obtained from the database using linear prediction coefficients.

Table 1. Mean and variance of the vowel formants

Vowel      F1                  F2
(Eng)    mean  variance     mean  variance   Frequency   Vowel (Kor)
aec       468        89     1895       157        1441   ㅐㅔ
axc       662       115     1440       168        1357   ㅏ
eoc       502        90     1193       218        1090   ㅓ
euc       409       144     1604       309         829   ㅡ
euic      337        42     2166       321         336   ㅢ
ixc       347       112     2174       234        1198   ㅣ
jac       665        93     1496       175         266   ㅑ
jec       404        72     2047       177         105   ㅖ
jeoc      493        79     1449       237         609   ㅕ
joc       395        64     1375       265         277   ㅛ
juc       334        73     1850       263         217   ㅠ
oxc       388        75     1087       261         932   ㅗ
uxc       372       151     1360       401         708   ㅜ
wac       651       103     1292       191         343   ㅘ
wec       445        70     1799       162         468   ㅞ
wic       332        59     2108       195         259   ㅟ
woc       465        51     1062       153         217   ㅝ

Figure 3. Formant Distribution of Utterance Database

From the database we used only the Korean single vowels, which are /axc/, /eoc/, /oxc/, /uxc/, /ixc/, /euc/ and /aec/. Figure 3 shows the formant distribution of the utterance database of the 8 male announcers. In the figure, the separate regions of the vowels appear in different gray levels. From the distribution, the center values of F1 and F2 are computed. These values are marked 'o' in Figure 4. According to Table 1, the F1 value is the smallest for the vowel /ixc/ and the F2 value is the smallest for the vowel /oxc/.

To implement the suggested method, two separate input vowels were spoken: one is the correct vowel and the other is a simulated wrong vowel. The vowels were recorded at the same sampling rate and resolution. We then calculated the formants of the recorded voice and computed the distance from the standard center values of each vowel.

Figure 4. Statistical Vowel Center Map

From the computed difference in F1, the direction of the tongue's movement is decided as high or low. Here the difference is the vowel center value minus the input value, as in Tables 2 and 3. The rules are as follows. If the F1 difference is positive, the tongue has to be lowered; if it is negative, the tongue has to be raised. If the magnitude of the F1 difference is smaller than 20, the tongue has to be moved only a small amount; if it is bigger than 20, it has to be moved much more. The F2 difference decides whether the tongue moves to the front or to the back. If the F2 difference is positive, the tongue has to move to the front; if it is negative, the tongue has to move back. This agrees with the wrong-vowel case, where a positive F2 difference leads to the suggestion to move to the front. If the magnitude of the F2 difference is smaller than 200, the movement becomes smaller.

Table 2 and the following figures display two cases of input vowels for the target vowel /eoc/: in one the input is a similar vowel, and in the other it is a different vowel. First, in the similar-vowel case, Table 2 confirms that the input has the smallest distance to /eoc/.

Table 2. Distances for identical vowel input

Vowel      F1      F2   Distance
axc       197     283        345
eoc        37      36         52
oxc       -77     -70        104
uxc       -93     203        223
euc       -56     447        450
ixc      -118    1017       1024
aec         3     738        738
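The distances in Table 2 can be reproduced with a short sketch. The vowel centers are copied from Table 1. The input formants (465 Hz, 1157 Hz) are not stated in the paper; they are inferred here by subtracting the Table 2 differences from the Table 1 centers, which also implies that each difference is the center value minus the input value. The function name `rank_vowels` is hypothetical.

```python
import math

# Vowel centers (mean F1, mean F2) in Hz, copied from Table 1.
CENTERS = {
    "axc": (662, 1440),
    "eoc": (502, 1193),
    "oxc": (388, 1087),
    "uxc": (372, 1360),
    "euc": (409, 1604),
    "ixc": (347, 2174),
    "aec": (468, 1895),
}


def rank_vowels(f1, f2, centers=CENTERS):
    """Return (vowel, dF1, dF2, distance) rows sorted by Euclidean distance."""
    rows = []
    for vowel, (c1, c2) in centers.items():
        d1, d2 = c1 - f1, c2 - f2  # center minus input
        rows.append((vowel, d1, d2, math.hypot(d1, d2)))
    return sorted(rows, key=lambda row: row[3])


# Assumed input, inferred from Tables 1 and 2 (not given explicitly in the paper).
ranking = rank_vowels(465, 1157)
for vowel, d1, d2, dist in ranking:
    print(f"{vowel}: dF1={d1:+d} dF2={d2:+d} distance={dist:.1f}")
```

The first row of the ranking is /eoc/, matching the statement that the input is closest to the target vowel.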

In this case, the screenshot of the implemented trainer is shown in Figure 5. It shows that the speaker's utterance is closest to the target vowel and is within the normal range.
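The direction rules of this section can be sketched as a small function. Several points are assumptions rather than statements from the paper: the differences are taken as target center minus input (the convention the table values follow), a positive F2 difference is mapped to "front" (consistent with the wrong-vowel case, where the F2 difference to /eoc/ is +326 and the trainer suggests moving to the front), and a difference exactly at a threshold is treated as large. The function name and the return format are hypothetical.

```python
F1_SMALL_HZ = 20   # F1 threshold stated in the text
F2_SMALL_HZ = 200  # F2 threshold stated in the text


def suggest_movement(d_f1, d_f2):
    """Map formant differences (target center minus input, in Hz) to tongue hints.

    Returns a list of (direction, amount) pairs.
    """
    hints = []
    if d_f1 != 0:
        direction = "lower" if d_f1 > 0 else "higher"
        # Assumption: a difference equal to the threshold counts as large.
        amount = "small" if abs(d_f1) < F1_SMALL_HZ else "large"
        hints.append((direction, amount))
    if d_f2 != 0:
        # Assumption: positive F2 difference means the target is more front.
        direction = "front" if d_f2 > 0 else "back"
        amount = "small" if abs(d_f2) < F2_SMALL_HZ else "large"
        hints.append((direction, amount))
    return hints


# Differences to the target /eoc/ in the wrong-vowel case (F1 = 104, F2 = 326).
print(suggest_movement(104, 326))
```

The "normal range" decision shown by the trainer is not specified in the text and is therefore not modeled here.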

Figure 5. Case of correct pronunciation for vowel /eoc/

The next case is when the input differs from the target vowel /eoc/. Table 3 shows the computed distance values and Figure 6 shows the screenshot of the trainer.

Table 3. Distances for wrong vowel input

Vowel      F1      F2   Distance
axc       264     573        630
eoc       104     326        342
oxc       -10     220        220
uxc       -26     493        493
euc        11     737        737
ixc       -51    1307       1308
aec        70    1028       1030

Figure 6. Case of wrong pronunciation for vowel /eoc/

In this case, the trainer suggests moving the tongue lower and to the front.

4. Conclusions

In this paper, we calculated the statistical characteristics of formant values from a normal voice database and, based on them, proposed a method to suggest directions for correct pronunciation using the differences in the F1 and F2 frequencies. A vowel trainer was constructed and showed the usefulness of the method. The suggested method can be used to teach and train vowel pronunciation for people with speech disabilities. The limitation of this system is that the current formant distribution is not normalized and varies with age group and gender. To reduce this drawback, separate statistics have to be computed from a large data set and a proper normalization method has to be derived.

5. References

[1] Cheolwoo Jo, Ilsuh Bak, Euntae Jung, 'Development of Vowel Training Assistant Method Using Formant Statistics', Proceeding of the 2003 Korean Signal Processing Conference, vol. 16, no. 1, pp. 325-328, 2003
[2] Alan D. Blair, John Ingram, 'Learning to Predict the Phonological Structure of English Loanwords in Japanese', Applied Intelligence 19, pp. 101-108, 2003
[3] K. Vicsi, P. Roach, 'A Multimedia, Multilingual Teaching and Training System for Children with Speech Disorders', International Journal of Speech Technology 3, pp. 289-300, 2000
[4] Blamey P.J., Sarant J.Z., Paatsch L.E., 'Effects of Articulation Training on the Production of Trained and Untrained Phonemes in Conversations and Formal Tests', Journal of Deaf Studies and Deaf Education, vol. 6, no. 1, pp. 32-42, 2001
[5] Yang, B., 'An acoustical study of Korean monophthongs', J. Acoust. Soc. Am. 91, 4, pp. 2280-83, 1992