CROSS-LANGUAGE SPEECH PERCEPTION: SWEDISH, ENGLISH, AND SPANISH SPEAKERS’ PERCEPTION OF FRONT ROUNDED VOWELS
Raquel Willerman, Patricia K. Kuhl
Dept. of Speech & Hearing Sciences, University of Washington, Box 357920, Seattle, WA 98195-7920
ABSTRACT
Cross-language research on adult speech perception demonstrates a strong effect of linguistic experience on consonant perception but not on vowel perception. Our paper re-examines the effect of linguistic experience on adults’ vowel perception. First, identification and goodness functions for the high front quadrant of the vowel space were mapped for speakers of Swedish, English, and Spanish. Second, speakers performed a discrimination task for one vector in this vowel space. Stimuli along this vector were identified by Swedish speakers as belonging to the Swedish front rounded vowel series /ʉ:/ - /ö:/. However, English and Spanish speakers reported that the stimuli were not in their language. Significant differences in discriminability of these stimuli were observed across speakers of different languages. Our results show that linguistic experience plays a significant role in vowel discrimination.

1. INTRODUCTION
Research on speech perception shows that linguistic experience plays a substantial role in the perception of consonants in adults (Miyawaki et al., 1975; Iverson & Kuhl, 1996). The effect of linguistic experience on vowel perception, on the other hand, has seemed less robust in adults (although see Kuhl et al., 1992 for robust effects of linguistic experience on vowel perception in infants). In a study comparing English and Swedish speakers’ discrimination of front rounded vowels (phonemic in Swedish but not in English), Stevens et al. (1969) concluded that “[T]he listeners’ linguistic experience has essentially no effect upon their ability to discriminate small differences in vowel formant frequencies” (p. 12). However, task variables may have obscured the effects of language experience on vowel perception in that study. Our goal is to better understand the relationship between linguistic experience, phonetic labeling, and acoustic sensitivity in vowels. We ask two main questions: 1) How do speakers of different languages map vowel space? 2) How is vowel discrimination affected by experience? Figures 1, 2, and 3 display the front, high-mid corner of the vowel space. Mean production data from three different studies in three different languages (Swedish, English, and Spanish) are reproduced here. Note that below F2 = 1900 Hz, Swedish has two vowels, but English and Spanish have none. How do English and Spanish speakers map this part of the vowel space, and how does this mapping affect their discrimination functions?

Figure 1: Swedish male speakers’ mean production data from Fant, 1973. (Axes: F1 in Hz, F2 in Hz; vowels plotted: i:, y:, e:, ö:, ʉ:.)
Figure 2: English male speakers’ mean production data from Hillenbrand et al., 1995. (Axes: F1 in Hz, F2 in Hz; vowels plotted: i, I, e.)
Figure 3: Spanish male speakers’ mean production data from Godinez, 1978. (Axes: F1 in Hz, F2 in Hz; vowels plotted: i, e.)

Two experiments address these issues. In Experiment I, Swedish, English, and Spanish speakers are asked to label and rate the goodness of an area of the vowel space approximately equivalent to that shown in Figures 1, 2, and 3. In Experiment II, the same speakers perform a discrimination task for a subset of vowels in Experiment I which cross a category boundary for Swedish speakers but not for English or Spanish speakers.

2. EXPERIMENT I: METHODS
2.1. Participants
Fifteen Swedish speakers, fifteen American English speakers, and eleven Spanish speakers participated in this study. The Swedish speakers were run at Stockholm University in Sweden, the English speakers at the University of Washington in Seattle, and the Spanish speakers at the Texas Intensive English Program in Austin. All participants were adult native speakers of their particular language who reported having no known hearing impairments. All speakers were paid for their participation.

2.2. Apparatus
The stimuli were synthesized at a 10 kHz sampling rate with 16-bit resolution. They were presented by a SoundBlaster 16 digital audio board controlled by a Compaq 486 microcomputer and played through the right earphone of a pair of Telephonics TDH 39P headphones. The participants sat in a quiet room, and responses were entered and recorded using the computer that controlled the presentation of the stimuli.

2.3. Stimuli
There were sixty-six stimuli in an F1-F2 grid (6 F1 values by 11 F2 values), all equally spaced along the mel scale. F1 ranged from 250 mels (189 Hz) to 500 mels (414 Hz) and F2 ranged from 1300 mels (1462 Hz) to 1800 mels (2482 Hz). The stimuli were synthesized as monophthongs at 50 mel steps in each dimension. F3 was calculated using a regression formula on F1 and F2 that was published in Nearey (1989) and shown to be a good fit for both Swedish and English vowels. The function is as follows: F3 (Hz) = 0.522 × F1 (Hz) + 1.197 × F2 (Hz) + 57. The tokens were equalized in RMS intensity and were 615 ms long. Other synthesis parameters were the same for all tokens.

2.4. Procedure
Identification and goodness proceeded in three stages. In the first stage, participants were asked whether the sound was a vowel in their native language or not. If the answer was yes, then in the second stage the listener identified the vowel with reference to his/her native language categories. The listener did this by referring to a list of common words in that particular language which exemplified all the vowel categories. In the third stage, the listener rated the goodness of the vowel on a scale of 1 (poor) to 7 (good). If in the first stage the participant said that the vowel was not in their language, then he/she was still asked to choose the vowel that best matched what they heard, although they did not have to rate the goodness of that sound. The instructions for the task and the computer interface were all in the native language of the participant. Each participant completed a block of ten practice trials presented in random order. After the practice, they completed an experimental session of 198 trials (3 blocks of 66 tokens) with the order of trials randomized within each block. Participants were allowed to hear the sounds as many times as they needed in order to make their judgments.
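The stimulus grid of Section 2.3 and the blocked trial order of Section 2.4 can be sketched as follows. The paper does not say which mel formula was used; the Hz values it quotes are consistent with Fant's formula, mel = 1000 × log2(1 + f/1000), so that formula is assumed here. The function names, the use of unrounded Hz values in the F3 regression, and the seeding are illustrative choices, not details reported by the authors.

```python
import random


def mel_to_hz(mel):
    """Convert mels to Hz, assuming Fant's formula mel = 1000 * log2(1 + f / 1000)."""
    return 1000.0 * (2.0 ** (mel / 1000.0) - 1.0)


def f3_from_f1_f2(f1_hz, f2_hz):
    """F3 regression quoted in Section 2.3 (all values in Hz)."""
    return 0.522 * f1_hz + 1.197 * f2_hz + 57.0


def build_grid(f1_mels=range(250, 501, 50), f2_mels=range(1300, 1801, 50)):
    """Return one dict of formant values per stimulus in the F1-F2 grid."""
    grid = []
    for f2_mel in f2_mels:
        for f1_mel in f1_mels:
            f1 = mel_to_hz(f1_mel)
            f2 = mel_to_hz(f2_mel)
            # Assumption: F3 is computed from the unrounded Hz values.
            grid.append({"f1_hz": f1, "f2_hz": f2, "f3_hz": f3_from_f1_f2(f1, f2)})
    return grid


def trial_order(n_tokens, n_blocks=3, seed=None):
    """Token indices for one session, shuffled independently within each block."""
    rng = random.Random(seed)
    order = []
    for _ in range(n_blocks):
        block = list(range(n_tokens))
        rng.shuffle(block)
        order.extend(block)
    return order


grid = build_grid()
print(len(grid))                                   # 66
print(sorted({round(s["f1_hz"]) for s in grid}))   # [189, 231, 275, 320, 366, 414]
print(len(trial_order(len(grid), seed=0)))         # 198
```

The rounded F1 and F2 values produced this way match the tick values of the stimulus grid (for example 1828 Hz for 1500 mels), which is the basis for assuming Fant's formula.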
2.5. Results
A stimulus had to be reported as belonging to a native language vowel category in stage 1 of the experiment at least 50% of the time in order to undergo identification and goodness calculations. Stimuli that did not reach this criterion are represented by an empty space in Figures 4, 5, and 6. The category identified the majority of the time was assigned to that token. Goodness ratings for the majority category were averaged across subjects and are shown for each token. The identification responses observed in Figures 4, 5, and 6 are similar to the production data in Figures 1, 2, and 3. The main departure from the production data is that English speakers identify part of the front rounded vowel space as the back rounded vowel /u/ as in boot, though they do not give this category a very high goodness rating. Spanish speakers, however, do not assimilate the front rounded vowel to their back rounded vowel /u/. In addition, English speakers did not use the lax /I/ vowel in their labeling. Although the grid encompasses formant frequencies normally associated with /I/, it may be that the duration of the stimuli (615 ms) favors the identification of longer, tense vowels. Goodness ratings of 4 and above for English /i/ and /e/, Spanish /i/ and /e/, and Swedish /i:/, /e:/, /y:/, /ʉ:/, and /ö:/ show that all three language groups found at least one good example of these vowel categories in the grid.
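The scoring rule described above (a 50% native-category criterion, then a majority label, then a mean goodness rating) can be sketched for a single stimulus as below. The response format and the function name are hypothetical. Two details are assumptions because the text does not specify them: the majority label is counted over all labeling responses, and ratings are pooled over responses rather than averaged per subject first.

```python
from collections import Counter


def score_stimulus(responses, criterion=0.5):
    """Score one stimulus from a list of (in_language, label, goodness) tuples.

    in_language: bool from stage 1; label: vowel chosen in stage 2;
    goodness: 1-7 rating from stage 3, or None when no rating was given.
    Returns (majority_label, mean_goodness), or None if the stimulus
    falls below the native-category criterion (an empty cell in the figures).
    """
    if not responses:
        return None
    n_native = sum(1 for in_lang, _, _ in responses if in_lang)
    if n_native / len(responses) < criterion:
        return None
    # Assumption: the majority is taken over all labels; ties go to the
    # label encountered first.
    counts = Counter(label for _, label, _ in responses)
    majority_label = counts.most_common(1)[0][0]
    ratings = [g for _, label, g in responses
               if label == majority_label and g is not None]
    mean_goodness = sum(ratings) / len(ratings) if ratings else None
    return majority_label, mean_goodness


# Invented example data, for illustration only.
example = [(True, "i", 6), (True, "i", 4), (True, "e", 5), (False, "i", None)]
print(score_stimulus(example))  # ('i', 5.0)
```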
Figure 4: Swedish listeners’ identification and goodness ratings. (Grid axes: F1 in Hz, F2 in Hz.)
Figure 5: American English listeners’ identification and goodness ratings. (Grid axes: F1 in Hz, F2 in Hz.)
Figure 6: Spanish listeners’ identification and goodness ratings. (Grid axes: F1 in Hz, F2 in Hz.)
Identification functions appear to be very different across the three languages. If language experience affects vowel discrimination, Swedish speakers should show greater sensitivity than English and Spanish speakers between F1 = 320 and 366 Hz, which straddles the boundary of a Swedish vowel category. Furthermore, English speakers should show greater sensitivity than Spanish speakers due to the fact that they separate /u/ and /er/ at this juncture for these stimuli. To test this, we ran a discrimination experiment on stimuli which differed across languages as to whether they contained a category boundary or not.
3. EXPERIMENT II: METHODS
3.1. Participants, Apparatus, & Stimuli
These are the same as in Experiment I.
3.2. Procedure
The horizontal vector corresponding to F2 = 1828 Hz was selected for the discrimination experiment because Swedish speakers show a category boundary for /ʉ:/ - /ö:/ between F1 = 320 Hz and F1 = 366 Hz, whereas English and Spanish speakers do not. The five pairs of adjacent stimuli along this vector were used in an AX discrimination task with a roving design. On half the trials, participants heard the same stimulus presented twice. On the other half of the trials, participants heard two different stimuli. The offset-to-onset interval was 250 ms for each stimulus pair. After each response, participants received feedback indicating whether their response was correct. Instructions, feedback, and other aspects of the computer interface were in the native language of the participant. Participants first completed a practice session of 5 trials. The experimental session contained 10 blocks of 20 stimulus pairs in random order.
3.3. Results
Sensitivity (d’) was measured for each stimulus pair for each subject through the application of signal detection theory (Macmillan & Creelman, 1991). Figure 7 displays the mean discrimination sensitivity for each pair of adjacent stimuli in Swedish, English, and Spanish. Notice that although the stimuli are physically equidistant from each other, sensitivity varies for each pair. A two-factor ANOVA (Language × Pair) confirmed that the effect of Pair was significant (p < .001). Language was not significant, but there was a significant Language × Pair interaction (p < .05). Analysis of the simple effects shows that Language is significant at pairs 4 and 5 (p < .01). Also, a planned comparison between the means of the three languages reveals that Spanish is significantly different from English and Swedish (p < .05).

Figure 7: Mean sensitivity (d’) functions across subjects for each pair in Swedish, English, and Spanish. (Axes: stimulus pair 1 to 5, d’ from 1 to 4.)
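As a rough illustration of how a sensitivity score is obtained from AX responses, the sketch below computes d’ as z(hit rate) minus z(false-alarm rate). This is a simplification and an assumption: the paper cites Macmillan & Creelman (1991) but does not state which detection model was applied, and same-different tasks with a roving design are often analyzed with a differencing model that yields different d’ values. The function name and the log-linear correction for rates of 0 or 1 are likewise illustrative choices.

```python
from statistics import NormalDist


def dprime(hits, n_different, false_alarms, n_same):
    """Approximate d' for one stimulus pair, as z(H) - z(F).

    hits: "different" responses on different trials;
    false_alarms: "different" responses on same trials.
    A log-linear correction (add 0.5 to counts, 1 to totals) keeps
    the z-transform finite when a raw rate would be 0 or 1.
    """
    hit_rate = (hits + 0.5) / (n_different + 1)
    fa_rate = (false_alarms + 0.5) / (n_same + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Invented counts, for illustration only.
print(dprime(10, 20, 10, 20))  # 0.0: responses do not distinguish trial types
print(dprime(18, 20, 2, 20))   # positive: "different" is said mostly on different trials
```

A per-subject, per-pair value computed along these lines is the kind of quantity that would then enter the Language × Pair ANOVA.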
4. GENERAL DISCUSSION
The results of this experiment demonstrate an effect of language experience on adults’ perception and discrimination of vowels. Spanish speakers show significantly lower sensitivity to vowel contrasts than Swedish and English speakers, indicating that linguistic experience affects the ability to discriminate small differences in vowels. English speakers assimilated the /ʉ:/ vowel space to their native /u/. That English speakers perceive a category here, albeit not a good one, may explain why the English sensitivity function is not significantly different from the Swedish. Recall that Stevens et al. (1969) also found no significant differences between Swedes and Americans in sensitivity functions. In Stevens et al., however, Americans did not use their native categories in the labeling task, thereby obscuring the relationship between phonetic labeling and discrimination.

There are, however, differences in acoustic sensitivity that phonetic labeling and goodness ratings do not account for. For instance, although Spanish speakers had no phonetic label for the stimuli along the vector, their sensitivity function was not monotonic. The peak in sensitivity at pair 3 may reflect a general nonlinearity of the auditory system. Or, it may reflect other aspects of the Spanish vowel system. Figure 6 suggests that the /i/ - /e/ boundary for Spanish speakers is at F1 between 275 and 320 Hz. This may have contributed to the peak in sensitivity in Figure 7. Moreover, English speakers may show a peak at pair 4 because this is where Figure 5 suggests their /i/ - /e/ boundary is. Similarly, Swedish listeners’ excellent discrimination of pairs 3 and 4 may be explained by their use of F1 to separate the /i:/ - /e:/ and the /ʉ:/ - /ö:/ contrasts in these respective locations. The fact that the vector varied only in F1 may have focused listeners’ attention on the F1 dimension. On this view, vowel discrimination is not just a function of the phonetic labels for a single pair of vowels but depends on the entire vowel map.

Similar thinking can be applied to the trough in sensitivity for all languages at pair 2. This too may be a nonlinearity inherent in the auditory system. Or, it may be that something like a “magnet effect” is produced by a systemic warping of the vowel space which lowers sensitivity in this area of F1 because it lies within the /i/ category for all three languages. Several processes may converge to produce vowel labeling and discrimination. Nevertheless, our data suggest a strong relationship between linguistic experience and vowel perception.
5. REFERENCES
1. Fant, G. (1973). Speech sounds and features, Cambridge: MIT.
2. Godinez, M. (1978). “A comparative study of some Romance vowels,” UCLA Working Papers in Phonetics, 41, 3-19.
3. Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). “Acoustic characteristics of American English vowels,” Journal of the Acoustical Society of America, 97, 3099-3111.
4. Iverson, P., & Kuhl, P. K. (1996). “Influences of phonetic identification and category goodness on American listeners’ perception of /r/ and /l/,” Journal of the Acoustical Society of America, 99, 1130-1140.
5. Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., & Lindblom, B. (1992). “Linguistic experience alters phonetic perception in infants by 6 months of age,” Science, 255, 606-608.
6. Macmillan, N. A., & Creelman, D. (1991). Detection theory: A user’s guide, New York: Cambridge.
7. Miyawaki, K., Strange, W., Verbrugge, R., Liberman, A. M., Jenkins, J. J., & Fujimura, O. (1975). “An effect of linguistic experience: The discrimination of [r] and [l] by native speakers of Japanese and English,” Perception & Psychophysics, 18, 331-340.
8. Nearey, T. (1989). “Static, dynamic, and relational properties in vowel perception,” Journal of the Acoustical Society of America, 85, 2088-2113.
9. Stevens, K. N., Liberman, A. M., Studdert-Kennedy, M., & Öhman, S. E. G. (1969). “Crosslanguage study of vowel perception,” Language and Speech, 12, 1-23.