IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 5, JULY 2010
Maximum Entropy-Based Reinforcement Learning Using a Confidence Measure in Speech Recognition for Telephone Speech

Carlos Molina, Student Member, IEEE, Nestor Becerra Yoma, Member, IEEE, Fernando Huenupán, Claudio Garretón, and Jorge Wuth
Abstract—In this paper, a novel confidence-based reinforcement learning (RL) scheme to correct observation log-likelihoods and to address the problem of unsupervised compensation with limited estimation data is proposed. A two-step Viterbi decoding is presented which estimates a correction factor for the observation log-likelihoods that makes the recognized and neighboring HMMs more or less likely by using a confidence score. If regions in the output delivered by the recognizer exhibit low confidence scores, the second Viterbi decoding will tend to focus the search on neighboring models. In contrast, if recognized regions exhibit high confidence scores, the second Viterbi decoding will tend to retain the recognition output obtained at the first step. The proposed RL mechanism is modeled as the linear combination of two metrics or information sources: the acoustic model log-likelihood and the logarithm of a confidence metric. A criterion based on incremental conditional entropy maximization to optimize a linear combination of metrics or information sources online is also presented. The method requires only one utterance, as short as 0.7 s, and can lead to significant reductions in word error rate (WER) between 3% and 18%, depending on the task, training-testing conditions, and method used to optimize the proposed RL scheme. In contrast to ordinary feature compensation and model parameter adaptation methods, the confidence-based RL method takes place in the frame log-likelihood domain. Consequently, as shown in the results presented here, it is complementary to feature compensation and to model adaptation techniques.
Index Terms—Confidence measure, incremental conditional entropy, reinforcement learning, robust automatic speech recognition, telephone speech.
I. INTRODUCTION

The mismatch between the training and testing conditions has been widely studied in the field of automatic speech recognition (ASR) due to its relevance in practical applications of the technology. This problem has been addressed with model adaptation/compensation techniques or with methods for noise removal from the corrupted signal.

Manuscript received March 25, 2009; revised August 20, 2009. First published September 18, 2009; current version published June 16, 2010. This work was supported by Conicyt-Chile under Grants Fondecyt No. 1070382 and Fondef No. D05I-10243. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Haizhou Li. The authors are with the Speech Processing and Transmission Laboratory, Department of Electrical Engineering, Universidad de Chile, Santiago 837-0451, Chile (e-mail: [email protected]). Digital Object Identifier 10.1109/TASL.2009.2032618

Conventional adaptation/
compensation techniques (e.g., MAP¹ and MLLR²) degrade dramatically when only a few adaptation utterances are available [1]–[3]. One problem is the number of parameters that need to be reestimated: the higher the number of model parameters, the larger the amount of required adaptation data. For example, in [1] classic MLLR adaptation does not always lead to improvements in word error rate (WER) with five adaptation utterances at moderate or high signal-to-noise ratio (SNR) (≥15 dB). To alleviate the problem of limited estimation data, the use of structures of hidden Markov models (HMMs) or adaptation clusters has been proposed [3]–[7]. Another possibility to reduce the number of required adaptation utterances would be to directly reestimate the observation log-likelihoods instead of the model parameters (i.e., mean vectors and covariance matrices). The effectiveness of unsupervised adaptation is also significantly degraded with respect to supervised adaptation [3], [8], [9]. In MLLR and MAP, this result may be considered a consequence of the fact that the recognized labels may be incorrect, hence violating the basic assumption of data-driven learning. This problem is also typically counteracted by making use of structures of HMMs or adaptation clusters [3]–[7]. Unsupervised adaptation performance should be improved by means of a confidence measure. The applicability of a reliability measure to model adaptation has already been addressed in the literature by using straightforward approaches [10]–[12]. In [10], a 5% reduction in WER, when compared with standard MLLR, is achieved when frames belonging to words with confidence scores higher than a given threshold are selected to estimate the MLLR transforms in unsupervised adaptation. A similar procedure to select frames according to a confidence measure is applied in [11] to improve unsupervised MLLR.
In [12], a higher reduction in WER, when compared with ordinary MAP adaptation, is achieved by using a confidence score to dynamically estimate the learning parameter. Observe that in all these cases no restriction on the number of adaptation utterances was imposed. At this point, it is worth highlighting that this paper attempts to propose a theoretical framework where a confidence measure can be used to improve recognition accuracy by itself, without the restrictions of ordinary unsupervised adaptation methods. In this sense, in [13] a Bayes-based confidence measure (BBCM) was proposed to address some of the limitations of ordinary confidence metrics. For instance, BBCM is itself a probability and incorporates information about speech recognition performance.

¹Maximum a posteriori.
²Maximum-likelihood linear regression.
1558-7916/$26.00 © 2010 IEEE
The model adaptation techniques presented in the literature improve the discrimination of the HMMs selected by supervised or unsupervised methods, but they do not attempt to specifically make the rest of the models more or less likely. Hence, these methods could be improved by applying the principle of reinforcement learning (RL). RL can be defined by the following quote [14]: “if an action taken by a learning system is followed by a satisfactory state of affairs, then the tendency of the system to produce that particular action is strengthened or reinforced. Otherwise, the tendency of the system to produce that action is weakened” [15], [16]. Surprisingly, the applicability of RL to speech recognition has not been widely addressed in the literature. In [17], RL was employed to modify the learning rate in MAP adaptation. In this paper, the concept of RL is proposed to estimate a correction factor for the observation log-likelihoods based on a confidence measure. The idea is to reward or penalize models depending on their reliability in the first Viterbi decoding step. According to RL, hypotheses with higher confidence or BBCM will be strengthened, whereas hypotheses with lower confidence will be weakened. In this paper, confidence-based RL is expressed as the linear combination of the observation log-likelihood and the logarithm of a confidence-based (e.g., BBCM-based) correction factor. A linear combination of metrics is a model used by RL in other contexts and certainly appears in several pattern recognition applications [18], [19]. For instance, multimodal classifiers and multisensor systems need to address the problem of optimally combining metrics and information sources [20]–[22].
Ordinary metric fusion methods (e.g., Bayes combination) in multiclassifier systems usually show the following limitations: they need a priori multivariate distributions that require a large amount of training data to be estimated [18], and they assume matching conditions between the training and testing procedures. The first restriction is problematic when adaptation data are limited. Moreover, the second restriction is rarely satisfied in real applications, where testing conditions (user, telephone channel, additive noise, and low-bit-rate coding-decoding distortion) change from call to call and are generally different from those found in the training process. In this paper, an information theory-based criterion is proposed to address the problem of online optimization of a metric fusion. The presented approach, called incremental conditional entropy maximization (ICEM), needs no a priori distribution, makes no assumptions about training/testing matching conditions, and leads to highly significant reductions in WER with as few as one adaptation utterance as short as 0.7 s. The contributions of this paper include: 1) a confidence-based RL scheme to correct the observation log-likelihood and to address the problem of unsupervised adaptation with limited estimation data; 2) a model to implement the proposed confidence-based RL scheme; 3) a novel criterion based on incremental entropy (ICEM) to optimize the linear combination of metrics or information sources during testing; 4) a polynomial approximation-based algorithm to improve the computational efficiency of ICEM; and 5) an answer to the question of how confidence metrics can improve speech recognition accuracy. The proposed method requires only one adaptation utterance, as short as 0.7 s, and can lead to a reduction in WER between 3% and 18%, depending on the task, the training-testing conditions, and the method used to optimize the proposed RL scheme.
The testing restriction imposed here (i.e., using only one short utterance) has not been addressed exhaustively elsewhere. In ASR, compensation/adaptation techniques usually take place in the feature or model parameter domain. The proposed RL mechanism is implemented as a compensation bias to the frame log-likelihood, is complementary to conventional feature compensation or model adaptation schemes, and is applicable in combination with methods such as vocal tract length normalization (VTLN) and MLLR. Also, the proposed scheme takes place in the testing procedure and is not comparable with discriminative training, where models are trained. Finally, the results presented should be relevant to commercial applications such as telephone dialogue systems.

II. BAYES-BASED CONFIDENCE MEASURE

In this section, three ASR confidence measures employed in the literature are mentioned. Then, a brief introduction to BBCM [13] is presented. Observe that BBCM could be applied to any confidence metric.

A. Confidence Measures or Word Features

Several word-based features can be extracted from the Viterbi decoding [23]. They are based on word length, word acoustic score, word density, or word posterior probability. The following confidence measures or word features delivered by the ASR procedure are employed here: the word density confidence measure (WDCM) [23], the maximum hypothesis log-likelihood (ML) [13], and the word posterior probability (WPP) [24]–[26].

B. Definition of the Bayes-Based Confidence Measure (BBCM)

If WF denotes a given word feature, and considering that the list of words that appear in the N-best hypotheses is denoted by {w_i},
BBCM is defined as in [13] as

    BBCM(WF(w_i)) = Pr(OK | WF(w_i))
                  = p(WF(w_i) | OK) · Pr(OK) / { p(WF(w_i) | OK) · Pr(OK) + p(WF(w_i) | ¬OK) · Pr(¬OK) }    (1)

where “OK”, which substitutes “correct” in [13], denotes the fact that a word was properly recognized (i.e., it is in the transcription of the testing utterance), and WF(w_i) denotes the word feature associated with word w_i. Notice that BBCM(WF(w_i)) is a probability itself. Moreover, the distributions p(WF | OK) and p(WF | ¬OK), and the probability Pr(OK), provide information about the recognition engine performance. Pr(OK) is defined as the accuracy of the baseline system.
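The Bayes combination in (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Gaussian class-conditional densities, their parameters, and the Pr(OK) value are hypothetical stand-ins for the distributions that would be estimated on evaluation data.

```python
import math

def bbcm(wf, pdf_ok, pdf_not_ok, p_ok):
    """Bayes-based confidence measure, Eq. (1):
    Pr(OK | WF) = p(WF|OK) Pr(OK) / [p(WF|OK) Pr(OK) + p(WF|~OK) Pr(~OK)]."""
    num = pdf_ok(wf) * p_ok
    den = num + pdf_not_ok(wf) * (1.0 - p_ok)
    return num / den

def gaussian_pdf(mu, sigma):
    # Hypothetical class-conditional density of the word feature.
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

p_ok = 0.87                              # Pr(OK): baseline accuracy (illustrative)
pdf_ok = gaussian_pdf(-50.0, 10.0)       # p(WF | OK), toy parameters
pdf_not_ok = gaussian_pdf(-80.0, 15.0)   # p(WF | not OK), toy parameters

# A word feature close to the "OK" mode yields a high confidence.
confidence = bbcm(-55.0, pdf_ok, pdf_not_ok, p_ok)
```

Because the output is a posterior probability, it stays in [0, 1] and folds in the baseline accuracy Pr(OK), which is the property the text emphasizes.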
III. REINFORCEMENT LEARNING BASED ON BBCM

The motivation is to apply the RL principle to correct or reestimate the observation log-likelihoods. Basically, the idea is to strengthen an action that produces a satisfactory state of the system; otherwise, this action should be weakened [14]. A confidence measure is one way to assess the stability or reliability of the output delivered by recognizers. Accordingly, if the output shows low confidence in the first search, the method should make the recognized HMMs less likely and should increase the likelihood of the rest of the models in the second search. In contrast, if the delivered output presents high confidence in the first decoding step, the method should make the recognized HMMs more likely and should penalize the likelihood of the rest of the models in the second search. Given the recognized string of words and triphone models, “neighboring models” at frame t denotes all the triphones in the task that do not correspond to the recognized model at this frame. As explained above, BBCM is certainly a candidate for implementing the proposed RL approach because BBCM is a probability, and so is defined in the interval [0,1], and it incorporates information about ASR performance. However, as shown later, other confidence metrics such as WPP could also be employed by the RL scheme presented in this paper. A BBCM score (or WPP) can easily be associated with every recognized word and triphone model. Those HMMs that do not appear in the output delivered by the recognizer will not be directly associated with a confidence measure. However, the probability of neighboring HMMs being correct can be estimated as the complement of the BBCM (or WPP) probability.

As can be seen in Fig. 1, the recognized model at time t, denoted by λ_R(t), corresponds to the model aligned with the acoustic feature vector o_t at time t as a result of the Viterbi decoding. Observe that the recognized hypothesis or sequence of models is defined here as the first hypothesis in the N-best list. Also in Fig. 1, the neighboring HMMs at time t, indicated by λ_N(t, p) with 1 ≤ p ≤ P − 1, are all the models except λ_R(t), where P is the total number of models in the task.

[Fig. 1. Definition of λ_R(t) and λ_N(t, p) given a Viterbi alignment.]

RL can be applied as a correction factor in the observation log-likelihood. This correction factor can also be realized as a result of ordinary model adaptation techniques. In fact, the result of HMM adaptation can also be modeled as an additive component of the frame log-likelihood. Given a model adaptation method, and λ_{j,s} the HMM parameters of triphone model j and state s, then

    log[Pr(o_t | λ'_{j,s})] = log[Pr(o_t | λ_{j,s})] + C_A(o_t, j, s)    (2)

where λ'_{j,s} corresponds to the HMM parameters after adaptation; o_t denotes frame t in the observation sequence O = (o_1, ..., o_T), and T is the number of frames; and C_A(o_t, j, s) is the additive correction component in the log domain, which is a function of o_t, HMM j, state s, and the model adaptation technique itself. In a similar way, the effect of a noise cancellation or feature compensation technique could also be expressed as

    log[Pr(o'_t | λ_{j,s})] = log[Pr(o_t | λ_{j,s})] + C_F(o_t, j, s)    (3)

where o'_t is the frame that results from the noise cancellation or feature compensation method, and C_F(o_t, j, s) is a function of o_t, the noise, HMM j, state s, and the noise removing technique. As can be seen, the noise cancellation or feature compensation technique should strengthen the likelihood of the observation vector given a model and a state. As a consequence, the problems of HMM adaptation and feature compensation could be interpreted as the adequate estimation of C_A or C_F, which does not mean that this correction can be estimated easily.

In this paper, a two-step RL procedure is proposed to estimate the additive correction component of the observation log-probabilities. Fig. 2 summarizes the proposed approach. First, a Viterbi-based N-best analysis delivers a set of hypotheses, recognized words, HMMs, and confidence measures. The recognized string of words corresponds to the first or most likely hypothesis in the N-best list. Then, a second Viterbi decoding provides the sequence of recognized words by making the correction a function of a probabilistic confidence score (e.g., BBCM or WPP) or its complement, depending on whether or not the HMM appears in the first-step recognition output. As a result, this second Viterbi step delivers the recognized words by using the principle of RL. The principal difference with model adaptation/compensation techniques is that, in RL, recognized models are made more or less likely when compared with neighboring HMMs depending on the confidence scores achieved in the first recognition step. If log[Pr_1(o_t | λ_{j,s})] and log[Pr_2(o_t | λ_{j,s})] are the observation log-likelihoods of frame o_t given state s in HMM j in the first and second Viterbi decoding steps, respectively, then the proposed RL scheme can be expressed as

    log[Pr_2(o_t | λ_{j,s})] = log[Pr_1(o_t | λ_{j,s})] + β · r(t, j, s)    (4)

where r(t, j, s) is a confidence-based re-scoring factor defined in Section III-C, and β could be seen as a scaling or weighting factor. As with any other variable in a speech recognition engine (e.g., the language model scaling factor), β could be tuned on a task-by-task basis. However, expression (4) can be interpreted as the linear combination of two metrics or information sources. As discussed later, this paper proposes the optimization of this linear combination by maximizing the incremental conditional entropy, which is also presented here.
[Fig. 2. Block diagram of the proposed RL scheme.]

A. Rescoring Recognized HMMs

In this subsection, the confidence-based correction factor for the frame log-likelihood is estimated for the recognized HMMs (i.e., the HMMs contained in the word string output by the recognizer). Consider that h_n is the nth hypothesis in the N-best Viterbi list, which is composed of N hypotheses. Every hypothesis corresponds to an alignment that allocates frames to a state in a given HMM. The list of words w(t, n) defines the set of HMMs that are contained in the N-best list, where 1 ≤ t ≤ T and 1 ≤ n ≤ N, and the pair (t, n) denotes frame t and hypothesis n. Recognized HMMs, λ_R(t), are defined as those models contained in the first or most likely hypothesis in the N-best list. The idea is to reward or penalize recognized HMMs according to their confidence measure: if the recognition output is not reliable, HMMs in the neighborhood should be made more likely in the second Viterbi decoding. By employing the principle of RL [14] and using the fact that BBCM is a probability, the re-scoring factor r in (4) can be expressed as

    r(t, λ_R(t), s) = log[Pr(OK | WF(t))]    (5)

and

    WF(t) = WF(w(t, 1))    (6)

where w(t, 1) gives the word allocated at frame t in the first hypothesis (the most likely one) of the N-best list; WF(w(t, 1)) denotes the word-based feature associated with the word recognized at instant t; and w(t, n) indicates the word at instant t in hypothesis n. Observe that the recognized word string corresponds to the first hypothesis in the N-best list. As shown later, this paper also presents results when BBCM is replaced with WPP in (6).

B. Rescoring Neighboring HMMs

In this subsection, the confidence-based correction factor for the frame log-likelihood is estimated for the neighboring HMMs. Neighboring HMMs are defined at a given instant t and are denoted by λ_N(t, p), where 1 ≤ p ≤ P − 1 is defined above. Consequently, r in (4) can be expressed as

    r(t, λ_N(t, p), s) = log[Pr_N(t, p)]    (7)

where

    Pr_N(t, p) = 1 − Pr(OK | WF(t)).    (8)

As in (6), BBCM can be replaced with WPP in (8). At this point, it is worth mentioning that Pr_N(t, p) in (7) could include further information about λ_R(t) and λ_N(t, p), such as the distance between both models and their word contexts. Nevertheless, these improvements are secondary with respect to the main contributions presented in this paper.

C. RL Re-Scoring

Summarizing, the proposed RL scheme described in (4) requires a re-scoring factor as a function of model λ, state s, and frame t, r(t, λ, s), which is estimated as

    r(t, λ, s) = { log[Pr(OK | WF(t))]    if λ = λ_R(t)
                 { log[Pr_N(t, p)]        if λ = λ_N(t, p)    (9)

where 1 ≤ s ≤ S, S is the number of states within an HMM, and Pr(OK | WF(t)) and Pr_N(t, p) are defined as in (5) and (7).
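The re-scoring in (4)-(9) can be sketched as below. This is a minimal sketch under stated assumptions: the log-likelihood matrix, the frame alignment, and the confidence values are toy inputs, and the per-frame confidence score is taken as already computed (e.g., a BBCM or WPP value).

```python
import math

def rl_rescore(loglik, recognized, conf, beta):
    """Second-pass frame log-likelihoods, Eq. (4).
    loglik[t][m]  -- first-pass log-likelihood of frame t under model m
    recognized[t] -- index of the model aligned at frame t in the first pass
    conf[t]       -- probabilistic confidence (e.g., BBCM or WPP) at frame t
    beta          -- scaling/weighting factor
    The recognized model receives log(conf) and neighboring models receive
    log(1 - conf), following Eqs. (5)-(9)."""
    rescored = []
    for t, frame in enumerate(loglik):
        row = []
        for m, ll in enumerate(frame):
            r = math.log(conf[t]) if m == recognized[t] else math.log(1.0 - conf[t])
            row.append(ll + beta * r)
        rescored.append(row)
    return rescored

# High confidence (0.9) widens the margin of the recognized model over its
# neighbor; a low confidence (e.g., 0.2) shrinks or reverses it.
scores = rl_rescore([[-10.0, -11.0]], recognized=[0], conf=[0.9], beta=1.0)
```

Note that the correction acts purely in the frame log-likelihood domain, which is why it can be stacked on top of feature compensation or model adaptation.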
IV. LINEAR COMBINATION AND INCREMENTAL ENTROPY

To simplify the notation and the analysis that follows, consider the definition of X(t) and Y(t) as X(t) = log[Pr_1(o_t | λ_{j,s})] and Y(t) = r(t, j, s). Consequently, the proposed RL scheme based on BBCM defined in (4) can be rewritten as

    Z_β(t) = X(t) + β · Y(t)    (10)

where Y(t) is defined in (9). Observe that the estimation of the β in (10) that minimizes the error rate can be seen as a metric or information source fusion problem. Classical information fusion techniques are based on the probability theory associated with Bayes-based decision approaches [18]. Despite the fact that Bayes theory leads to an optimal solution from the classification error point of view, it shows several limitations [18], [27]. For instance, Bayes-based approaches usually require a large amount of data to estimate a priori distributions. Also, the use of a priori distributions implicitly assumes matching conditions between training and testing, which in turn is not always true in real applications. Notice that the RL mechanism in (10) is highly nonstationary, because confidence is certainly utterance and frame dependent: every utterance may be recognized with a different degree of confidence and, within an utterance, every word (i.e., a set of frames) may also be recognized with a different degree of confidence.

In order to counteract some of the limitations presented by Bayes-based fusion techniques, information theory-based criteria have been proposed by several authors [19], [27], [28]. The idea is to take into consideration the uncertainty of every information source and then to improve the accuracy of the a priori conditional distribution estimation. Those methods apply criteria based on maximum entropy [19], [28]–[30], on maximum mutual information [31], [32], and on conditional entropy [27], [28]. However, all those methods usually assume matching conditions between training and testing data, with variable requirements on the size of the estimation data.
Given the nature of the RL scheme presented here, this paper proposes an online optimization, in the decoding process, of the metric combination in (10) without any a priori distribution. Also, the proposed approach attempts to reduce the mismatch between the training and testing conditions in ASR; consequently, the training-testing matching hypothesis is no longer an issue and should not be required. Accordingly, the method may require as little as one short adaptation utterance. In this scenario, an information theory criterion is proposed in this paper to optimize the linear combination in (4) that models the RL process described here. It is worth highlighting that linear combination is also the most straightforward procedure to combine metrics in a classification problem.

The maximization of entropy is a popular approach that is employed in several problems [19], [28]–[30]. In this case, however, the estimation of β according to the entropy maximization of Z_β in (10), considered as a stochastic variable independently of X, is not applicable. If the probability density function (pdf) of Z_β is considered Gaussian, the entropy of Z_β is a function of its variance [33]. Consequently, it can easily be shown that the entropy of Z_β is monotonically decreasing or increasing between the entropies of X and Y if these are also modeled with Gaussian distributions. Actually, many authors have used entropy-based methods to choose or weight those parameters with the highest discrimination ability [28], [34], because this discrimination ability depends on intra- and inter-class parameter dispersion. On the other hand, conditional entropy and mutual information [27], [28], [31], [32] could be interesting candidates to optimize the linear combination in (10) by taking into consideration the incremental information of Y with respect to X. However, the estimation of the required joint pdfs would be highly dependent on the size of the estimation data; as a result, the online optimization of (10) in the decoding procedure would be unfeasible. In order to avoid these restrictions, this paper proposes the optimization of (10) based on the conditional entropy of Z_β given X. As shown later, the pdf of Z_β given X can be estimated with the distributions of X and Y, can be analytically expressed as a function of β, and requires less estimation data than the joint pdf of Z_β and X.

Given a frame t, f_X(x; μ_X(t), σ²_X(t)) and f_Y(y; μ_Y(t), σ²_Y(t)) are the pdfs of X(t) and Y(t), respectively, where μ_X(t) and σ²_X(t) are the mean and variance of X at frame t, and μ_Y(t) and σ²_Y(t) are the mean and variance of Y at frame t. The mean and variance of X and Y are estimated in the first Viterbi step as the sample statistics

    μ_X(t) = (1/M_t) Σ_{m=1}^{M_t} X_m(t),    σ²_X(t) = (1/M_t) Σ_{m=1}^{M_t} [X_m(t) − μ_X(t)]²

and analogously for Y, where X_m(t), 1 ≤ m ≤ M_t, denotes the samples of X(t) available at frame t. Considering X and Y as two sources of information, the optimal β can be estimated by optimizing the incremental information provided by Y, or Z_β, with respect to X. First, the samples of Y are shifted by a constant at each frame in order to make μ_Y(t) = μ_X(t). This procedure attempts to equalize the effect of two metrics that span different ranges of numerical values in the estimation of the optimal β. Observe that Y is not shifted in the linear combination defined in (10).

This paper proposes that the RL scheme, defined in (10) as the combination of metrics or information sources, can be optimized by maximizing the incremental conditional entropy of Z_β given X. This conditional entropy is defined as

    H(Z_β(t) | X(t)) = −∫∫ f_{Z_β,X}(z, x; t) · log f_{Z_β|X}(z | x; t) dz dx.    (11)

Then, the incremental entropy accumulated over the utterance can be written as

    ΔH(β) = Σ_{t=1}^{T} H(Z_β(t) | X(t)).    (12)

Notice that Z_β is equivalent to X when β = 0, so maximizing ΔH(β) maximizes the information that Y adds to X. Accordingly, the optimal β can be estimated before the second Viterbi step as

    β* = argmax_{0 ≤ β ≤ 1} ΔH(β).    (13)

As a result, the optimization problem defined in (13) could be solved by estimating the derivative of ΔH(β) with respect to β and equaling it to zero:

    d ΔH(β) / dβ = 0.    (14)

However, (14) does not lead to an analytical solution, and the following polynomial approximation-based procedure is adopted on an utterance-by-utterance basis.

Step 1) Obtain a set of K samples of ΔH(β_k), with β_k, 1 ≤ k ≤ K, equally spaced in the interval [0,1]. For each β_k, the mean and variance of the pdf of Z_{β_k}(t) at frame t are estimated as

    μ_{Z_{β_k}}(t) = (1/N_s) Σ_{n=1}^{N_s} z_n(t)    (15)

    σ²_{Z_{β_k}}(t) = (1/N_s) Σ_{n=1}^{N_s} [z_n(t) − μ_{Z_{β_k}}(t)]²    (16)

where N_s is the number of samples and z_n(t) = x_n(t) + β_k · y_n(t) is generated with the samples of X(t) and Y(t); ΔH(β_k) is then evaluated as in (11) and (12).

Step 2) Approximate the curve of ΔH(β_k) versus β_k with a polynomial of order Q by using the linear least-squares fitting method [35]. Then, ΔH(β) can be written as

    ΔH(β) ≈ Σ_{q=0}^{Q} a_q · β^q    (17)

where a_q, 0 ≤ q ≤ Q, denotes the polynomial coefficients.

Step 3) By making use of the polynomial approximation in the previous step, estimate the derivative of ΔH(β) with respect to β and equal it to zero:

    Σ_{q=1}^{Q} q · a_q · β^{q−1} = 0.    (18)

Step 4) Solve (18) with the Newton–Raphson method [36] to obtain the optimal β. Consider that β* is the solution of (18); if 0 ≤ β* ≤ 1, then β = β*; otherwise, β is set to the sampled β_k that maximizes ΔH(β_k).

Although it is not shown in this paper, f_X could be modeled as a Gaussian or a gamma distribution. However, the gamma pdf seems a better approximation than the Gaussian function, and this result is corroborated by estimating the log-likelihood of X given both distributions. The resulting samples of Z_β can also be modeled with a gamma function, and this result was employed to estimate H(Z_β | X) as defined in (11).
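Steps 1)-4) amount to sampling an objective curve on a β grid, fitting a polynomial by least squares, and locating the stationary point by Newton-Raphson. The sketch below exercises only this numerical machinery on a toy concave stand-in for the sampled entropy curve; the grid size, polynomial order, curve shape, and the [0,1] fallback rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy stand-in for the sampled incremental-entropy curve dH(beta): a concave
# function with an interior maximum near beta = 0.6 (purely illustrative).
def h_curve(b):
    return -(b - 0.6) ** 2 + 0.25

# Step 1: K equally spaced samples of beta in [0, 1].
K = 21
betas = np.linspace(0.0, 1.0, K)
h = h_curve(betas)

# Step 2: order-Q least-squares polynomial fit, as in Eq. (17).
Q = 4
p = np.poly1d(np.polyfit(betas, h, Q))

# Step 3: derivative of the fitted polynomial, set to zero as in Eq. (18).
dp = p.deriv()
ddp = dp.deriv()

# Step 4: Newton-Raphson on dp; keep the root if it lies in [0, 1], otherwise
# fall back to the best sampled beta (reconstructed safeguard).
b = 0.5
for _ in range(100):
    step = dp(b) / ddp(b)
    b -= step
    if abs(step) < 1e-10:
        break
beta_opt = b if 0.0 <= b <= 1.0 else float(betas[np.argmax(h)])
```

With this toy curve, the fitted stationary point recovers the maximum near β = 0.6; a root outside [0, 1] would trigger the fallback to the best sampled β.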
V. EXPERIMENTS

The approach proposed in this paper was tested with three different training-testing database configurations.

A. Training-Testing Database Configurations

Three tasks were considered: a small-vocabulary task recorded over the telephone, a medium-vocabulary task recorded in a clean environment, and a medium-vocabulary task recorded over the telephone.
1) Cinema Enquiry Telephone Experiment (CETE): A Spanish database recorded over a telephone line was used. In CETE, users phoned an ASR-based cinema information system implemented with Galaxy II [37] at the Speech Processing and Transmission Laboratory, Universidad de Chile. The dialogue sequence was the following: first, the system asked the user to choose one film from a list composed of 80 films; second, the system prompted for the name and neighborhood of the cinema; and finally, the user had to say whether he/she wanted to go to the cinema in the morning, afternoon, or evening. The vocabulary is composed of 221 words. The training database corresponds to 12 494 utterances, recorded by approximately 150 speakers; all the training data was used to train the CDHMMs. The a priori pdfs employed in (1) were estimated with an evaluation database composed of 1036 utterances (1437 words). The testing database corresponds to 3261 sentences (4566 words). Each utterance is 0.7 s long on average, and the training and testing material correspond to 2.4 h and 0.6 h, respectively, of recorded speech.

2) LATINO-40 Clean Condition Experiment (LA40CE): Speaker-independent continuous speech recognition experiments using the LATINO-40 database (LDC, 1995) are also presented in this paper. This database is composed of continuous speech from 40 Latin American native speakers, each reading 125 sentences from newspapers in Spanish. The training data corresponds to 4500 sentences provided by 36 speakers. The vocabulary is composed of almost 6000 words. The a priori pdfs employed in (1) were estimated with a subset of the training data composed of 1000 utterances. The testing database contains 500 utterances (4000 words) provided by four testing speakers (two females and two males) not contained in the training speaker set. Each utterance is 4.6 s long on average, and the training and testing material correspond to 5.8 h and 0.6 h, respectively, of recorded speech.

3) LATINO-40 Telephone Experiment (LA40TE): The training data set is composed of the training speech material from CETE (Section V-A1) plus 1485 utterances from 99 Spanish native speakers (50 males and 49 females), 15 utterances per speaker. These utterances were recorded over the telephone line by asking the speakers to read a subset of the training sentences from the LATINO-40 database. The testing data set corresponds to the same 500 testing sentences employed in LA40CE, recorded by 20 Spanish native speakers (ten males and ten females), 25 utterances per speaker, over the telephone line. Each testing utterance is 6.7 s long on average, and the training and testing material correspond to 5 h and 0.9 h, respectively, of recorded speech.

B. Experimental Setup

The recognized sentence corresponds to the first hypothesis (the most likely one) within the N-best list obtained from Viterbi decoding. As mentioned in Section III-B, the neighboring models are defined with respect to the model in the output delivered by the recognizer. Confidence scores are based on the N-best analysis, and the maximum number of hypotheses was made equal to ten (N = 10). It was observed that the number of
hypotheses output by Viterbi decoding is almost always fewer than ten, which in turn does not justify an exhaustive analysis of the improvements given by the proposed approach versus the size of the N-best list. Thirty-three MFCC parameters (static, delta, and delta-delta coefficients) per frame were computed, and cepstral mean normalization (CMN) was also employed. Each triphone was modeled with a three-state left-to-right topology without skip-state transitions, with eight multivariate Gaussian densities per state with diagonal covariance matrices. The HMMs were trained by using HTK, and a trigram language model was employed during recognition. The Viterbi decoding is obtained with a recognition engine implemented at the Speech Processing and Transmission Laboratory, Universidad de Chile. The polynomial approximation in (17) was implemented with the linear least-squares fitting method [35], and several combinations of polynomial order and number of samples were compared. As mentioned in Section IV, the variance defined in (11) was estimated by approximating the distribution with a gamma function. The baseline system gave a WER equal to 13.06%, 3.08%, and 14.33% with CETE, LA40CE, and LA40TE, respectively. The baseline oracle error rates are equal to 8.85%, 2.93%, and 13.53% with CETE, LA40CE, and LA40TE, respectively. BBCM was implemented according to (1) with the WDCM and ML word-based features by using the approximation described in [13]. This paper also presents results by employing WPP and by applying BBCM to WPP as confidence metrics. WPP was implemented according to [24] and was estimated with the N-best list. As suggested in [24], WPP is obtained by aligning the recognized output with each N-best hypothesis by using a dynamic time warping algorithm. Standard MLLR with mean vector adaptation [2] and VTLN [38], [39] were also implemented to be compared and combined with the approach proposed in this paper. The transforms in MLLR were estimated by using the toolkit provided by HTK.
DISCUSSIONS A. Reinforcement Learning Re-Scoring Scheme Based on Confidence Measures Fig. 3 and Table I presents results by directly tuning in (10) by using BBCM(WDCM,ML) and WPP with the whole testing data in CETE, LA40CE, and LA40TE. As can be seen in Fig. 3(a) and Table I, the proposed RL scheme with BBCM(WDCM,ML) can lead to reductions in WER , 18.83% , and as high as 11.64% when compared to the baseline system 8.51% with, respectively, CETE, LA40CE, and LA40TE. Significance analysis using McNamar’s test [40] shows that these results are . Observe that the WER achieved by significant the proposed RL mechanism is lower than the baseline oracle error rate with two of the three training-testing conditions (i.e., LA40CE, and LA40TE). This may be due to the fact that the RL scheme attempts to correct frame log-likelihood of recognized and neighboring models. As a consequence, the second Viterbi search may provide a recognized output that is not contained in
1048
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 5, JULY 2010
TABLE II WER (%) OBTAINED BY EMPLOYING THE ICEM (INCREMENTAL CONDITIONAL ENTROPY MAXIMIZATION) CRITERION AND USING AND WPP WITH CETE, LA40CE, AND LA40TE
BBCM(WDCM,ML)
tions in WER with BBCM WPP and WPP are equal to 9.9% and 7.1%, respectively. These results nicely corroborate the assumption made by BBCM about including a priori information of the recognition performance. However, WPP seems to be an interesting choice because it is inherently more robust to mismatch between training and testing condition. C. ICEM Criterion to RL Re-Scoring Scheme
Fig. 3. Reduction in WER (%) versus Defined in (4) or (10) ), and LA40TE ( ) by using: with CETE (1 1 1 1), LA40CE ( (a) BBCM (WDCM ; ML) and (b) WPP as a confidence metric. TABLE I WER (%) OBTAINED BY DIRECTLY TUNING AND USING BBCM (WDCM ; ML) AND WPP WITH CETE, LA40CE, AND LA40TE
the N-best list obtained from the first decoding step. According to Fig. 3(b), when BBCM(WDCM,ML) is replaced with WPP in (6) and (8), the proposed RL scheme is also able to lead to highly significant reductions in WER equal to 10.64% , 15.58% , and 8.51% , when compared with the baseline system, with CETE, LA40CE, and LA40TE, respectively. B. Confidence Measures Performance Table III compares the reduction in WER when RL is applied with BBCM WPP and WPP with CETE. As can be seen in Table III, BBCM WPP provides a significantly higher reduction in WER than WPP where compared with the baseline system. By directly tuning , the reductions in WER with BBCM WPP and WPP are equal to 13.9% and 10.6%, respectively. By estimating with the ICEM criterion, the reduc-
Results with the proposed ICEM criterion, defined in Section IV, to optimize the RL mechanism presented here with BBCM WDCM ML and WPP are shown in Table II. The polynomial approximation of in (17) was done with and . As can with ICEM be seen in Table II, the online estimation of using BBCM(WDCM,ML) led to reductions in WER, when compared with the baseline system, equal to 8.35%, 17.21%, and 6.28% with CETE, LA40CE, and LA40TE, respectively. Significance analysis using McNamar’s test [40] shows that . Also in Table II, the these results are significant online estimation of with ICEM using WPP led to reductions in WER, when compared with the baseline system, equal to 7.12%, 2.60%, and 3.84% with CETE, LA40CE, and LA40TE, respectively. According to both sets of experiments, the ICEM criterion was able to lead to significant improvements in all the tasks employed in this paper on an utterance-by-utterance basis, with no prior information about the task, independently of the training-testing matching conditions, and using only one utterance as short as 0.7 s. Moreover, ICEM represents a novel paradigm in the framework of multi-classifier fusion, which may be applicable to other pattern recognition problems. However, the improvements achieved by ICEM are lower than those obtained by directly tuning in (10) on the testing data. This could be a consequence of the following restrictions: first, the gamma distribution fits better and than the Gaussian pdf, but it is still an approximation of the observed distribution of and ; second, the linear combination of feature log-likelihood and confidence metric represented by (4) or (10) should be analyzed from a wider perspective than the information theory point of view; and finally, the polymay introduce nomial model for further approximation error in the optimization of . D. Comparison With Standard MLLR and VTLN According to Fig. 
4, standard MLLR can lead to reductions in WER equal to 22.7% and 28.0%, depending on the number of adaptation utterances, with LA40CE and LA40TE, respectively. However, MLLR does not lead to any improvement with only one adaptation utterance. In contrast, the proposed RL scheme can provide reductions in WER equal to 15.6% and 8.5% by directly tuning the combination parameter with LA40CE and LA40TE, respectively, with only one adaptation utterance. Observe that confidence measures are defined on an utterance-by-utterance basis in the recognition engine. As a result, the proposed RL scheme is always applied to every utterance individually, even when the number of adaptation sentences is greater than one. When MLLR is combined and applied in sequence with the RL scheme, the reduction in WER increases from 22.7% to 43.1% and from 28.0% to 37.3% with LA40CE and LA40TE, respectively. These results corroborate the hypothesis mentioned above that the proposed scheme is complementary to conventional model adaptation techniques.

Fig. 4. Reduction in WER (%) when compared to the baseline system: MLLR; the proposed RL scheme; and MLLR plus the proposed RL scheme, with: (a) LA40CE and (b) LA40TE. The results with the RL scheme are obtained by directly tuning the combination parameter.

As can be seen in Table IV, VTLN can lead to a reduction of 16.2% in WER with only one adaptation utterance with LA40CE. It is worth highlighting that the improvement due to VTLN does not rise dramatically when the number of adaptation utterances increases: the reduction in WER with five and 25 utterances is equal to 18%. The proposed RL scheme provides a reduction in WER almost as high as that of VTLN with one adaptation utterance. Moreover, when VTLN is combined and applied in sequence with the RL scheme, the reduction in WER increases from 16.2% to 19.5% with LA40CE. Also according to Table IV, the reduction in WER achieved by VTLN is lower than the one obtained by the proposed RL scheme with telephone speech (CETE): VTLN and the RL mechanism lead to reductions in WER equal to 6.34% and 10.64%, respectively. This result must be due to the fact that telephone speech presents both speaker and channel mismatch, while VTLN mostly attempts to reduce the speaker mismatch effect. The highest reduction in WER with CETE is also observed when the RL scheme and VTLN are combined. Finally, these results with LA40CE and CETE also validate the hypothesis mentioned above that the proposed scheme is complementary to feature compensation techniques.

TABLE III: REDUCTION IN WER (%) ACHIEVED WITH CETE BY DIRECTLY TUNING THE COMBINATION PARAMETER AND BY APPLYING THE ICEM CRITERION, USING BBCM(WDCM,ML), WPP, AND BBCM(WPP) IN (6) AND (8)

TABLE IV: WER (%) AND REDUCTION IN WER (%) WHEN COMPARED WITH THE BASELINE SYSTEM. THE RESULTS WITH THE RL SCHEME ARE OBTAINED BY DIRECTLY TUNING THE COMBINATION PARAMETER

E. Utterance Length Dependency

Observe that, according to the results presented here, the improvements due to the confidence-based RL scheme do not depend on the utterance length. In contrast, conventional unsupervised adaptation/compensation techniques usually improve the recognition accuracy asymptotically as the number of adaptation utterances increases [5], [42], [43] (Fig. 4), except when overfitting is observed [44]. On the other hand, the scheme proposed here is memoryless, and so it cannot make use of previous information or accumulated data. Nevertheless, as shown above, it is important to emphasize that the confidence-based RL mechanism can reduce both speaker and channel mismatch more effectively than VTLN (Table IV).

F. Difference With Standard Re-Scoring Schemes

This manuscript proposes: a new paradigm (RL rescoring); a confidence-based model to implement the RL rescoring scheme, represented by (5) and (7); and an incremental entropy-based method to estimate the combination parameter defined by the RL model. Notice that the proposed RL scheme is not equivalent to N-best/lattice rescoring [7], [26], [45]. In the latter case, only the N-best hypotheses are rescored using, among other features, confidence measures. Nevertheless, it is interesting to compare the proposed approach with the family of N-best or lattice rescoring methods. The oracle error rate, or oracle lattice error rate, is defined as the error rate that would be achieved if an "oracle" chose the best option from each hypothesis list [46]. This gives a lower bound on the error that could be achieved for a given N-best list of hypotheses. Accordingly, the lowest WER that can be achieved by an N-best list or lattice rescoring algorithm corresponds to the oracle error rate. It is worth highlighting that, according to the literature [47]-[49], the highest reduction in WER reported for N-best list or lattice rescoring approaches is around 3%-4%. The proposed RL mechanism, in contrast, rescores all the hypotheses, including those that are pruned and lost in the first Viterbi decoding. This is a result of considering all models, even those contained in the lost hypotheses, whereas ordinary N-best rescoring methods focus the analysis only on the hypotheses and models within the N-best list. Also, as a result of the RL rescoring method, a new N-best list can be obtained, which in turn explains why the method can achieve an accuracy higher than the oracle one in some tasks. This behavior cannot be observed in N-best/lattice rescoring schemes [7], [26], [45]. It is also worth highlighting that the proposed RL method would be comparable to classical N-best rescoring if all the possible hypotheses were generated without any pruning. Although this scenario is not feasible in continuous speech recognition tasks, it can still be used to explain why the proposed technique can achieve an accuracy higher than the oracle one. Consider a very large M-list with all the possible candidates, where the N-best list is at the top.
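The oracle error rate discussed above can be computed by scoring every hypothesis in each N-best list against the reference transcription and keeping the best one per utterance. A minimal sketch in Python follows; the function names and data layout are illustrative, not taken from the paper's implementation:

```python
def word_errors(ref, hyp):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))          # rolling DP row
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i             # prev holds d[i-1][j-1]
        for j in range(1, len(h) + 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r[i - 1] != h[j - 1]))  # substitution or match
            prev, d[j] = d[j], cur
    return d[len(h)]

def oracle_wer(references, nbest_lists):
    """WER of an 'oracle' that picks, for each utterance, the hypothesis
    in the N-best list with the fewest word errors."""
    errors = sum(min(word_errors(ref, hyp) for hyp in nbest)
                 for ref, nbest in zip(references, nbest_lists))
    words = sum(len(ref.split()) for ref in references)
    return errors / words
```

Because the minimum is taken inside each list, no rescoring of the N-best hypotheses alone can go below this bound, which is why an output outside the first-pass N-best list, as produced by the RL scheme, can beat it.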
The proposed RL mechanism rescores the recognized hypothesis and the remaining hypotheses according to the confidence metric. The rescoring process aims to make the recognized hypothesis more or less likely with respect to the remaining hypotheses in the second Viterbi search, so as to reduce the error rate. At this point, the reader should be aware that discriminating, for instance, between N-best and non-N-best models, instead of recognized and non-recognized models as done here, is secondary in the context of the contribution presented in this paper.

VII. CONCLUSION

A novel confidence-based reinforcement learning scheme to correct observation log-likelihoods and to address the problem of unsupervised compensation with limited estimation data has been proposed. The reinforcement learning mechanism is modeled here as a linear combination of metrics or information sources, a paradigm that appears in the fields of pattern recognition and distributed systems. Moreover, a new criterion based on incremental conditional entropy maximization, to optimize the linear combination of metrics or information sources in the recognition procedure, and a polynomial approximation for this conditional entropy were also presented in this paper. The results reported here show that confidence-based reinforcement learning can lead to significant reductions in WER, between 3% and 18%, depending on the task, the training-testing conditions, and the method used to optimize the linear combination of metrics.
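The linear combination at the heart of the scheme can be sketched as follows. This is a generic illustration rather than the paper's exact formulation in (10): the convex-combination form, the weight `lam`, and the confidence values are assumptions introduced here for the example.

```python
import math

def corrected_log_likelihood(acoustic_ll, confidence, lam):
    """Linearly combine an acoustic log-likelihood with the logarithm of a
    confidence score in [0, 1]. lam in [0, 1] is the combination weight that
    direct tuning, or a criterion such as ICEM, would have to supply;
    the floor avoids log(0) for zero-confidence hypotheses."""
    return (1.0 - lam) * acoustic_ll + lam * math.log(max(confidence, 1e-12))

# A high-confidence region keeps the recognized model likely in the second
# Viterbi pass; a low-confidence region penalizes it, shifting the search
# toward neighboring models.
score_confident = corrected_log_likelihood(-50.0, 0.9, 0.3)
score_doubtful = corrected_log_likelihood(-50.0, 0.1, 0.3)
```

With `lam = 0` the correction vanishes and the first-pass acoustic score is recovered, which matches the behavior described for the RL scheme when the confidence term is switched off.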
In contrast with ordinary feature compensation and model parameter adaptation techniques, the confidence-based reinforcement scheme operates in the acoustic log-likelihood domain. As a result, it is applicable in cascade with any feature compensation or model adaptation method. Also, the proposed approach offers a novel "top-to-bottom" framework in speech recognition, whereas adaptation and noise canceling methods usually operate on a "bottom-to-top" basis. Moreover, problems such as multistream speech recognition and the combination of language and acoustic models, besides several other problems in pattern recognition and distributed systems, could also be addressed by applying the incremental conditional entropy maximization presented here. Finally, the following topics could be proposed for future research: the combination of the reinforcement learning mechanism described here with adaptation or noise removal techniques on a "top-to-bottom" basis; the applicability of the reinforcement learning model discussed in this paper to other problems in pattern recognition; the use of incremental conditional entropy maximization to optimize the linear combination of metrics or information sources in the fields of pattern recognition and multisensor systems; the optimization of the computational load required by the reinforcement learning scheme; and the introduction of further information to discriminate neighboring HMMs.

ACKNOWLEDGMENT

The authors would like to thank Dr. S. King, from CSTR/University of Edinburgh, U.K., for having proofread this manuscript.

REFERENCES

[1] X. Cui and A. Alwan, “Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR,” IEEE Trans. Speech Audio Process., vol. 13, no. 6, pp. 1161–1172, Nov. 2005.
[2] C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Comput. Speech Lang., vol. 9, no. 2, pp.
171–185, 1995.
[3] T. Myrvoll, O. Siohan, C.-H. Lee, and W. Chou, “Structural maximum a posteriori linear regression for unsupervised speaker adaptation,” in Proc. ICSLP, Beijing, China, 2000, vol. 4, pp. 540–543.
[4] X. Mu, S. Zhang, and B. Xu, “Multi-layer structure MLLR adaptation algorithm with subspace regression classes and tying,” in Proc. Interspeech, Jeju, Korea, 2004.
[5] K. Shinoda and C.-H. Lee, “A structural Bayes approach to speaker adaptation,” IEEE Trans. Speech Audio Process., vol. 9, no. 4, pp. 276–287, May 2001.
[6] O. Bellot, D. Matrouf, P. Nocera, G. Linares, and J.-F. Bonastre, “Structural speaker adaptation using maximum a posteriori approach and a Gaussian distributions merging technique,” in Proc. ICASSP, 2003, vol. 2, pp. 121–124.
[7] M. Padmanabhan, G. Saon, and G. Zweig, “Lattice-based unsupervised MLLR for speaker adaptation,” in Proc. ISCA ITRW ASR, Paris, France, 2000, pp. 128–131.
[8] M. Afify, Y. Gong, and J. Haton, “A general joint additive and convolutive bias compensation approach applied to noisy Lombard speech recognition,” IEEE Trans. Speech Audio Process., vol. 6, no. 6, pp. 524–538, Nov. 1998.
[9] L. F. Uebel and P. C. Woodland, “Improvements in linear transform based speaker adaptation,” in Proc. ICASSP, Salt Lake City, UT, 2001, pp. 49–52.
[10] M. Pitz, F. Wessel, and H. Ney, “Improved MLLR speaker adaptation using confidence measure for conversational speech recognition,” in Proc. ICSLP, Beijing, China, 2000.
[11] L. F. Uebel and P. C. Woodland, “Speaker adaptation using lattice-based MLLR,” in Proc. ITRW Adaptation Methods for Speech Recognition, Sophia Antipolis, France, 2001.
[12] D. Wang and S. Narayanan, “A confidence-score based unsupervised MAP adaptation for speech recognition,” in Proc. 36th Asilomar Conf. Signals, Syst., Comput., 2002, vol. 1, pp. 222–226.
[13] N. B. Yoma, J. Carrasco, and C. Molina, “Bayes-based confidence measure in speech recognition,” IEEE Signal Process. Lett., vol. 12, no. 11, pp. 745–748, Nov. 2005.
[14] S. Haykin, Neural Networks: A Comprehensive Foundation. New York: Macmillan, 1994.
[15] A. G. Barto, “Reinforcement learning and adaptive critic methods,” in Handbook of Intelligent Control. New York: Van Nostrand-Reinhold, 1992, pp. 469–491.
[16] R. S. Sutton, A. G. Barto, and R. J. Williams, “Reinforcement learning is direct adaptive optimal control,” in Proc. Amer. Control Conf., Boston, MA, 1991, pp. 2143–2146.
[17] M. Nishida, Y. Mamiya, Y. Horiuchi, and A. Ichikawa, “On-line incremental adaptation based on reinforcement learning for robust speech recognition,” in Proc. Interspeech, Jeju, Korea, 2004.
[18] J. Kittler, M. Hatef, R. P. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[19] F. Fouss and M. Saerens, “A maximum entropy approach to multiple classifiers combination,” Tech. Rep., IAG, Universite Catholique de Louvain, 2004.
[20] S. Tamura, K. Iwano, and S. Furui, “Toward robust multimodal speech recognition,” in Proc. LKR2005, Tokyo, Japan, 2005, pp. 163–166.
[21] Y. Zhou and H. Leung, “Minimum entropy approach for multisensor data fusion,” in Proc. IEEE Signal Process. Workshop Higher-Order Statist., 1997, pp. 336–339.
[22] A. Chung and H. Shen, “Dependence in sensory data combination,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 1998, vol. 3, pp. 1676–1681.
[23] K. Y. Kwan, T. Lee, and C. Yang, “Unsupervised N-best based model adaptation using model-level confidence measures,” in Proc. ICSLP, Denver, CO, 2002, pp. 69–72.
[24] F. Wessel, R. Schlüter, K. Macherey, and H.
Ney, “Confidence measures for large vocabulary continuous speech recognition,” IEEE Trans. Speech Audio Process., vol. 9, no. 3, pp. 288–298, Mar. 2001.
[25] G. Evermann and P. Woodland, “Large vocabulary decoding and confidence estimation using word posterior probabilities,” in Proc. ICASSP, Istanbul, Turkey, 2000, pp. 2366–2369.
[26] F. Wessel, R. Schlüter, and H. Ney, “Using posterior word probabilities for improved speech recognition,” in Proc. ICASSP, Istanbul, Turkey, 2000, vol. 3, pp. 1587–1590.
[27] B. Fassinut and J.-B. Choquel, “A new probabilistic and entropy fusion approach for management of information sources,” Inf. Fusion, vol. 5, no. 1, pp. 35–47, 2004.
[28] Y. Chen, C. Wan, and L. Lee, “Entropy-based feature parameter weighting for robust speech recognition,” in Proc. ICASSP, Toulouse, France, 2006, pp. 41–44.
[29] B. Nasersharif and A. Akbari, “Improved HMM entropy for robust subband speech recognition,” in Proc. EUSIPCO, Turkey, 2005.
[30] A. L. Berger, S. D. Della Pietra, and V. J. Della Pietra, “A maximum entropy approach to natural language processing,” Comput. Linguist., vol. 22, no. 1, pp. 39–71, 1996.
[31] M. Kamal, K. Chen, M. Hasegawa-Johnson, and V. Brandman, “An evaluation of using mutual information for selection of acoustic-features representation of phonemes for speech recognition,” in Proc. ICSLP, Denver, CO, 2002, pp. 2129–2132.
[32] M. Matton, M. De Wachter, D. Van Compernolle, and R. Cools, “Maximum mutual information training of distance measures for template based speech recognition,” in Proc. Int. Conf. Speech Comput., Patras, Greece, 2005, pp. 511–514.
[33] A. C. Lazo and P. N. Rathie, “On the entropy of continuous probability distributions,” IEEE Trans. Inf. Theory, vol. IT-24, no. 1, pp. 120–122, Jan. 1978.
[34] F. Valente and C. J. Wellekens, “Maximum entropy discrimination (MED) feature subset selection for speech recognition,” in Proc. IEEE ASR Understanding Workshop, St.
Thomas, Virgin Islands, 2003.
[35] J. Rice, Mathematical Statistics and Data Analysis, 2nd ed. Belmont, CA: Wadsworth, 1995, pp. 507–570.
[36] R. L. Burden and J. D. Faires, Numerical Analysis. Boston, MA: PWS, 1988.
[37] S. Seneff, E. Hurley, R. Lau, C. Pao, P. Schmid, and V. Zue, “Galaxy-II: A reference architecture for conversational system development,” in Proc. ICSLP, Sydney, Australia, 1998, pp. 931–934.
[38] L. Lee and R. Rose, “A frequency warping approach to speaker normalization,” IEEE Trans. Speech Audio Process., vol. 6, no. 1, pp. 49–60, Jan. 1998.
[39] S. Panchapagesan and A. Alwan, “Frequency warping for VTLN and speaker adaptation by linear transformation of standard MFCC,” Comput. Speech Lang., vol. 23, no. 1, pp. 42–64, Jan. 2009.
[40] L. Gillick and S. J. Cox, “Some statistical issues in the comparison of speech recognition algorithms,” in Proc. ICASSP, Glasgow, U.K., 1989, pp. 532–535.
[41] N. B. Yoma, I. Brito, and J. Silva, “Language model accuracy and uncertainty in noise canceling in the stochastic weighted Viterbi algorithm,” in Proc. Eurospeech, Geneva, Switzerland, 2003.
[42] S. Wang, X. Cui, and A. Alwan, “Speaker adaptation with limited data using regression-tree-based spectral peak alignment,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 8, pp. 2454–2464, Nov. 2007.
[43] O. Siohan, C. Chesta, and C.-H. Lee, “Joint maximum a posteriori adaptation of transformation and HMM parameters,” IEEE Trans. Speech Audio Process., vol. 9, no. 4, pp. 417–428, May 2001.
[44] O. Siohan, C. Chesta, and C.-H. Lee, “Hidden Markov model adaptation using maximum a posteriori linear regression,” in Proc. Workshop Robust Methods for Speech Recognition in Adverse Conditions, 1999, pp. 147–150.
[45] B.-H. Tran, F. Seide, and T. Steinbiss, “A word graph based N-best search in continuous speech recognition,” in Proc. 4th Int. Conf. Spoken Lang. Process. (ICSLP), Philadelphia, PA, 1996, pp. 2127–2130.
[46] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 2008, pp. 339–340.
[47] B. Wu, J. Guo, and G. Liu, “Research on confusion network algorithm for Mandarin large vocabulary continuous speech recognition,” in Proc. 6th Int. Conf. Inf., Commun. Signal Process., Singapore, 2007, pp. 1–5.
[48] V. Goel, S. Kumar, and W. Byrne, “Segmental minimum Bayes-risk decoding for automatic speech recognition,” IEEE Trans.
Speech Audio Process., vol. 12, no. 3, pp. 234–249, May 2004.
[49] L. Mangu, E. Brill, and A. Stolcke, “Finding consensus in speech recognition: Word error minimization and other applications of confusion networks,” Comput. Speech Lang., pp. 373–400, 2000.

Carlos Molina (S’08) received the B.Sc. and M.Sc. degrees in electrical engineering from the Department of Electrical Engineering, Universidad de Chile, Santiago, in 2003 and 2005, respectively. He is currently pursuing the Ph.D. degree in electrical engineering at the Universidad de Chile. Since 2003, he has been a Research Student at the Speech Processing and Transmission Laboratory, Universidad de Chile, where he is currently working on noise canceling and speaker adaptation techniques for speech recognition, computer-aided pronunciation training, and second language learning. He is the coauthor of six journal articles and first author of three conference papers. His research interests include robustness in automatic speech recognition and second language learning. Mr. Molina is a student member of the International Speech Communication Association.

Néstor Becerra Yoma (M’04) received the B.Sc. and M.Sc. degrees in electrical engineering from UNICAMP, Sao Paulo, Brazil, in 1986 and 1993, respectively, and the Ph.D. degree in electrical engineering from the University of Edinburgh, Edinburgh, U.K., in 1998. In 1998 and 1999, he was a Postdoctoral Researcher at UNICAMP and a full-time Professor at Mackenzie University, Sao Paulo. From 2000 to 2002, he was an Assistant Professor at the Department of Electrical Engineering, Universidad de Chile, Santiago, where he is currently lecturing on telecommunications and speech processing, and working on robust speech recognition/speaker verification, computer-aided pronunciation training and language learning, dialogue systems, and voice-over IP.
At the Universidad de Chile, he has set up the Speech Processing and Transmission Laboratory to do research on speech technology applications on the Internet and telephone line. He has been an Associate Professor since 2003 and is the coauthor of 20 journal articles and 27 conference papers. His research interests include speech processing, language learning, real-time Internet protocols, QoS, and usability evaluation of human–machine interfaces.
Fernando Huenupán received the B.Sc. degree in electronic engineering from the Universidad de La Frontera, Temuco, Chile, in 2004. He is currently pursuing the Ph.D. degree in electrical engineering at the Universidad de Chile, Santiago. Since 2003, he has been a Research Student at the Speech Processing and Transmission Laboratory where he is currently working on speaker verification and multiple classifier systems. He is the coauthor of two journal articles and five conference papers. His research interests include robustness in speaker verification and multiple classifier systems.
Claudio Garreton received the B.Sc. and M.Sc. degrees in electrical engineering from the Department of Electrical Engineering, Universidad de Chile, Santiago, in 2005 and 2007, respectively. He is currently pursuing the Ph.D. degree in electrical engineering at the Universidad de Chile. Since 2005, he has been a Research Student at the Speech Processing and Transmission Laboratory where he is currently working on techniques for channel distortion and noise canceling in speech recognition, speaker verification, and second language learning. He is the coauthor of two journal articles and six conference papers in the last four years. His general research interests include channel and noise robustness in speech applications on the Internet and telephone line.
Jorge Wuth received the B.Sc. degree in electrical engineering with highest distinction from the Universidad de Chile, Santiago, in 2007. He has been a Research Associate at the Speech Processing and Transmission Laboratory at the Universidad de Chile since 2006, where he is currently working on the implementation of speech recognition-based telephone dialogue systems and computer-aided language learning platforms. His main research interests include speech recognition-based systems and the design of human–machine interfaces.