JOURNAL OF MULTIMEDIA, VOL. 6, NO. 5, OCTOBER 2011, pp. 395-403. doi:10.4304/jmm.6.5.395-403
Inhibition/Enhancement Network Based ASR using Multiple DPF Extractors

Foyzul Hassan, Department of CSE, United International University, Dhaka, Bangladesh ([email protected])
Mohammed Rokibul Alam Kotwal, Department of CSE, United International University, Dhaka, Bangladesh ([email protected])
Mohammad Mahedi Hasan, Blueliner Bangladesh, Dhaka, Bangladesh ([email protected])
Ghulam Muhammad, Department of CE, College of CIS, King Saud University, Riyadh, Kingdom of Saudi Arabia ([email protected])
Mohammad Nurul Huda, Department of CSE, United International University, Dhaka, Bangladesh ([email protected])

Abstract—This paper describes an evaluation of the Inhibition/Enhancement (In/En) network for robust automatic speech recognition (ASR). In distinctive phonetic feature (DPF) based speech recognition using neural networks, an In/En network is needed to discriminate whether the dynamic patterns of DPF trajectories are convex or concave. The network achieves categorical DPF movement by enhancing DPF peak patterns (convex patterns) and inhibiting DPF dip patterns (concave patterns). We analyze the effectiveness of the In/En algorithm by incorporating it into a system consisting of three stages: a) multilayer neural networks (MLNs), b) an In/En network, and c) Gram-Schmidt (GS) orthogonalization. From experiments using the Japanese Newspaper Article Sentences (JNAS) database in clean and noisy acoustic environments, it is observed that the In/En network plays a significant role in improving phoneme recognition performance. Moreover, the In/En network reduces the required number of mixture components in the hidden Markov models (HMMs).

Index Terms—Articulatory Features, Hidden Markov Model, Inhibition/Enhancement Network, Local Features, Multilayer Neural Network, Distinctive Phonetic Features.

I. INTRODUCTION

Various methods of phoneme recognition based on the Inhibition/Enhancement (In/En) network were proposed by Huda et al. [1-4]. These papers introduced In/En functionality to discriminate whether the dynamic patterns of distinctive phonetic feature (DPF) trajectories are convex or concave. The In/En network was used to achieve categorical DPF movement by enhancing DPF peak patterns (convex patterns) and inhibiting DPF dip patterns (concave patterns). These papers showed that the In/En network improves phoneme recognition performance in a clean acoustic environment, but its impact in practical conditions was not analyzed in [1-4]. We subsequently analyzed the effectiveness of the In/En algorithm [5] in practical conditions by incorporating it into a system consisting of three stages: a) multilayer neural networks (MLNs) to extract DPFs from acoustic features, namely local features (LFs) [6], b) an In/En network, and c) Gram-Schmidt (GS) orthogonalization. The objectives of that paper were to reduce computation time by reducing the number of mixture components in hidden Markov models (HMMs) and to increase the phoneme correct rate (PCR) by introducing the In/En network in different acoustic environments (clean, 0 dB, 5 dB, 10 dB and 20 dB). Experiments evaluating PCR with and without the In/En network in a clean acoustic environment were designed using two-stage MLN(s), and phoneme recognition performance was also evaluated for real environments at different SNRs.

In this study, we propose a DPF extraction method for constructing a more accurate phoneme recognizer.
This method incorporates (i) three MLNs instead of two [5]: the first MLN is the same as in [5], while the other two are MLNcntxt, which reduces phoneme boundary errors, and MLNDyn, which restricts DPF dynamics. It then embeds (ii) an In/En network to obtain more precise DPF patterns for an HMM-based classifier by enhancing convex patterns (DPF peaks) and inhibiting concave patterns (DPF dips). The three MLNs and the In/En network together increase phoneme recognition performance significantly over the other investigated methods. A low-cost phoneme recognition method can also be obtained by reducing the required number of Gaussian mixture components in the HMMs. In this study, we investigate and evaluate three types of DPF extraction methods along with the conventional MFCC-based method from the viewpoint of phoneme recognition performance. These methods are (i) DPF using an MLN, (ii) DPF using two MLNs [5], and (iii) our proposed method using three MLNs and an In/En network.

The paper is organized as follows: Sections II and III discuss the articulatory features and local features, respectively. Section IV explains the system configuration of the proposed phoneme recognition method based on the In/En network. The experimental database and setup are provided in Section V, while experimental results are analyzed in Section VI. Finally, Section VII draws conclusions and provides remarks on future work.

II. ARTICULATORY FEATURES

A phoneme can easily be identified by its unique set of DPFs [7-11]. The Japanese balanced DPF set [12-15] for classifying phonemes has 15 elements: vocalic, high, low, intermediate between high and low, anterior, back, intermediate between anterior and back, coronal, plosive, affricate, continuant, voiced, unvoiced, nasal and semivowel.



Figure 1. Examples of LFs.


III. LOCAL FEATURES AND THEIR EXTRACTION PROCEDURE

At the acoustic feature extraction stage, input speech is first converted into local features (LFs) that represent variation in the spectrum along the time and frequency axes. Two LFs are extracted by applying three-point linear regression (LR) along the time (t) and frequency (f) axes of a time-spectrum pattern, respectively. Figure 1 shows an example of LFs for an input utterance. After compressing these two 24-dimensional LFs into 12 dimensions each using the discrete cosine transform (DCT), a 25-dimensional feature vector (12 Δt, 12 Δf and ΔP, where P stands for the log power of the raw speech signal) named LF is extracted (see Fig. 2).

Figure 2. LFs extraction procedure.
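As a rough illustration of this procedure, the following numpy sketch computes the 25-dimensional LF vector from a precomputed log time-spectrum pattern. The 24-channel input shape, the edge handling, and all function names are our assumptions, not details from the paper; for the three-point case, the linear regression slope reduces to a central difference.

```python
import numpy as np
from scipy.fftpack import dct

def three_point_lr(x, axis):
    """Three-point linear regression along the given axis, which for a
    three-point window reduces to (x[t+1] - x[t-1]) / 2, edges replicated."""
    xp = np.moveaxis(x, axis, 0)
    padded = np.concatenate([xp[:1], xp, xp[-1:]], axis=0)
    return np.moveaxis((padded[2:] - padded[:-2]) / 2.0, 0, axis)

def extract_lf(log_spectrum, log_power):
    """log_spectrum: (T, 24) time-spectrum pattern; log_power: (T,).
    Returns (T, 25) LF vectors: 12 delta-t + 12 delta-f + delta-P."""
    dt = three_point_lr(log_spectrum, axis=0)             # variation along time
    df = three_point_lr(log_spectrum, axis=1)             # variation along frequency
    dt12 = dct(dt, type=2, norm='ortho', axis=1)[:, :12]  # compress 24 -> 12 dims
    df12 = dct(df, type=2, norm='ortho', axis=1)[:, :12]
    dP = three_point_lr(log_power, axis=0)[:, None]       # delta of log power
    return np.hstack([dt12, df12, dP])
```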

IV. EXTENDED PHONEME RECOGNIZER BASED ON THE INHIBITION/ENHANCEMENT NETWORK

Figure 3 shows the proposed feature extraction method, which comprises three stages. The first stage extracts 45-dimensional DPF vectors from the LFs of an input speech using three MLNs. The second stage incorporates In/En functionality to obtain modified DPF patterns. The third stage decorrelates the DPF vectors using GS orthogonalization [12] before connecting to an HMM-based classifier.

A. DPF Extractor

In this method, three MLNs instead of a single MLN are used to construct the DPF extractor. The first MLN, MLNLF-DPF, maps acoustic features (LFs) onto discrete DPF features [13, 14, 15]; the second MLN, MLNcntxt, reduces misclassification at phoneme boundaries; and the third MLN, MLNDyn, restricts the DPF dynamics. Here, MLNLF-DPF has the same architecture as that described in [13] and is trained using the same learning algorithm. The 45-dimensional context-dependent DPF vector provided by MLNLF-DPF at time t is fed into MLNcntxt, which consists of five layers, including three hidden layers of 90, 180, and 90 units, and generates a 45-dimensional DPF vector with fewer errors at phoneme boundaries. This 45-dimensional DPF vector and its corresponding ΔDPF and ΔΔDPF vectors, calculated by three-point LR (given in Equation 1), are fed into the subsequent MLNDyn, which consists of four layers, including two hidden layers of 300 and 100 units, and outputs a 45-dimensional DPF vector with reduced fluctuations and dynamics. Both MLNcntxt and MLNDyn are trained using the standard back-propagation algorithm.


[Figure 3 block diagram: speech signal → local feature extraction → MLNLF-DPF (mapping LF to DPF) → MLNcntxt → MLNDyn (restricting DPF dynamics) → Inhibition/Enhancement network → Gram-Schmidt orthogonalization → HMM classifier with acoustic models (AMs) and a syllable language model (LM) → phoneme string. Input for the first MLN: 25 dims × 3 frames (xt-3, xt, xt+3); 45-dimensional DPF, ΔDPF and ΔΔDPF vectors.]

Figure 3. Proposed Phoneme Recognition Method.

The three-point LR used to compute the ΔDPF and ΔΔDPF parameters is

$$\Delta x_t = \frac{\sum_{i=1}^{n} i \, (x_{t+i} - x_{t-i})}{2 \sum_{i=1}^{n} i^{2}} \qquad (1)$$

where x_t is a DPF element at frame t and n is the width of the regression window.
B. Inhibition/Enhancement Network

The DPF extractor, MLNLF-DPF + MLNcntxt + MLNDyn, generates 45 DPF patterns (15 preceding-context, 15 current-context and 15 following-context DPF patterns) for each input speech frame. Because these 45 DPF patterns may not follow the input DPF patterns of a phoneme string exactly, ambiguity arises among some phonemes when the HMM-based classifier tries to identify the target phoneme; consequently, some phonemes are not correctly recognized. Ambiguity typically occurs when the values of consecutive DPF peaks and DPF dips in the DPF time pattern of a phoneme string are close to each other. For example, suppose the left-peak, middle-dip, and right-peak values generated by a DPF extractor are 0.7, 0.4, and 0.7, respectively. The classifier then faces a problem deciding whether this pattern is (1, 1, 1) or (1, 0, 1), while the input DPF pattern was (1, 0, 1): the value 0.4 may be read as either zero or one, while the value 0.7 is read as one. There must therefore be a clear distinction between a DPF peak and a DPF dip along the time axis. If a mechanism exists that enhances DPF peak values up to a certain level and suppresses DPF dip values accordingly, such a distinction is obtained. We have incorporated an In/En network to get this effect. The algorithm for this network is given below:

Step 1: For each element of the DPF vectors, find the acceleration (ΔΔ) parameters using three-point LR.
Step 2: Check whether ΔΔ is positive (concave pattern), negative (convex pattern), or zero (steady state).


Step 3: Calculate f(ΔΔ):

$$\text{If the pattern is convex:} \quad f(\Delta\Delta) = c_1 - \frac{c_1 - 1}{1 + e^{-\beta\Delta\Delta}}$$

$$\text{If the pattern is concave:} \quad f(\Delta\Delta) = c_2 + \frac{1 - c_2}{1 + e^{\beta\Delta\Delta}}$$

$$\text{If steady state:} \quad f(\Delta\Delta) = 1.0$$

Here, c1, c2 and β represent the enhancement, inhibition and steepness coefficients, respectively.

Step 4: Find the modified DPF patterns by multiplying the DPF patterns by f(ΔΔ).
Figure 4 shows the working mechanism of the In/En network using the "anterior" DPF pattern of the input utterance /mam/ along the time axis. In the figure, the five curves, which represent the labeled "anterior" DPF for the input utterance, the corresponding output "anterior" generated by a neural network, ΔΔ for the output "anterior" values, f(ΔΔ) for the ΔΔ values, and the modified "anterior" DPF, are indicated by (a), (b), (c), (d), and (e), respectively. Curve (e) is obtained by multiplying curve (b) by curve (d). After applying the In/En network algorithm to curve (b), the DPF values of frames 1-6 and 13-19 (convex patterns, or DPF peaks) are enhanced, and those of frames 7-11 (concave pattern, or DPF dip) are inhibited.

V. EXPERIMENTS

A. Speech Databases

The following data sets are used in our experiments.

Clean data sets:
D1. Training data set for MLNLF-DPF: a subset of the Acoustic Society of Japan (ASJ) Continuous Speech Database comprising 4503 sentences uttered by 30 different male speakers (16 kHz, 16 bit) [16].
D2. Training data set for MLNcntxt: 5000 sentences taken from the Japanese Newspaper Article Sentences (JNAS) [17] Continuous Speech Database, uttered by 33 different male speakers (16 kHz, 16 bit).


Figure 4. Working mechanism of the In/En network. The five curves are: a) labeled "anterior" DPF for the input utterance /mam/, b) output "anterior" DPF from a neural network, c) ΔΔ for the output "anterior", d) f(ΔΔ) for ΔΔ, and e) modified "anterior" obtained by multiplying curve (b) by curve (d).

D3. Training data set for MLNDyn: 5000 JNAS [17] sentences uttered by 33 different male speakers (16 kHz, 16 bit). The speakers of this data set are different from those of D2.
D4. Training data set for the HMM classifier: 5000 JNAS [17] sentences uttered by 33 different male speakers (16 kHz, 16 bit). The speakers of this data set are different from those of D2 and D3.
D5. Test data set: 2379 JNAS [17] sentences uttered by 16 different male speakers (16 kHz, 16 bit). Since open tests are performed, different training data sets are used for the neural network-based and HMM-based classifiers.

Noisy data set:
D6. Noisy test data set: the 2379 JNAS [17] test utterances of D5, uttered by 16 male speakers, corrupted by car noise. Noise from the Japan Electronic Industries Development Association (JEIDA) Noise Database [18] is added to the clean JNAS data set D5 at different SNR conditions (0 dB, 5 dB, 10 dB, 20 dB); for each SNR there are 2379 utterances. The sampling rate is 16 kHz.

B. Experimental Setup

The frame length and frame rate are set to 25 ms and 10 ms, respectively, to obtain acoustic features from the input speech. Each LF is a 25-dimensional vector consisting of 12 delta coefficients along the time axis, 12 delta coefficients along the frequency axis, and a delta coefficient of the log power of the raw speech signal [6]. For the phoneme recognizer using two MLNs [5], PCRs for the D5 and D6 data sets are evaluated using an HMM-based classifier. The D1 data set is used to design 38 Japanese monophone HMMs with five states, three loops, and a left-to-right topology. Input features for the classifier are orthogonalized DPFs. In the HMMs, the output probabilities are represented by Gaussian mixtures with diagonal covariance matrices, and the number of mixture components is set to 1, 2, 4, 8, and 16.
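The paper builds its monophone HMMs with HTK-style tools. Purely as an illustration (this library is not used in the paper), a comparable left-to-right, diagonal-covariance Gaussian-mixture HMM can be sketched with Python's hmmlearn package; note that HTK's five-state, three-loop topology corresponds to three emitting states, which is what hmmlearn counts.

```python
import numpy as np
from hmmlearn import hmm

def left_to_right_hmm(n_states=3, n_mix=16):
    """One monophone HMM: Gaussian-mixture outputs with diagonal
    covariances; transitions limited to self-loop or move-right."""
    model = hmm.GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type='diag',
                       init_params='mcw',   # keep our start/transition setup
                       params='tmcw')       # zero transitions stay zero in EM
    model.startprob_ = np.eye(n_states)[0]           # always start in state 0
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):
        trans[i, i] = 0.5                             # self-loop
        trans[i, min(i + 1, n_states - 1)] += 0.5     # advance (last state absorbs)
    model.transmat_ = trans
    return model

# e.g., one model per monophone, trained on orthogonalized 45-dim DPF vectors:
models = {p: left_to_right_hmm() for p in ['a', 'i', 'u', 'e', 'o', 'N']}  # etc.
```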


In our MLN experiments, the non-linear function is a sigmoid from 0 to 1 (1/(1+exp(-x))) for the hidden and output layers. For the In/En network, the value of the enhancement coefficient, C1, is set to 4.0 after evaluating the method MLN+MLN+In/En+GS for different values of C1 (2, 4, and 6), and the value of the steepness coefficient, β, is set to 80. The value of the inhibition coefficient, C2, is fixed at 0.25 after observing the DPF data patterns, to keep the values of f(ΔΔ) between 0.25 and 1.0.

The following experiments, which input a 45-dimensional feature vector to the HMM classifier, are designed to evaluate PCR on the D5 data set and obtain the optimum value of C1 among the investigated values:
(1) MLN+MLN+In/En+GS, C1=2.0
(2) MLN+MLN+In/En+GS, C1=4.0
(3) MLN+MLN+In/En+GS, C1=6.0.

To observe the effectiveness of the In/En network, we have evaluated the following phoneme recognition methods using the D5 data set, where the input feature for the HMM-based classifier has 45 dimensions:
(f) DPF(MLN+GS, dim: 45)
(o) DPF(MLN+In/En+GS, dim: 45)
(j) DPF(MLN+MLN+GS, dim: 45)
(r) DPF(MLN+MLN+In/En+GS, dim: 45).

The impact of the In/En network in the car noise corrupted environment at different SNRs (0 dB, 5 dB, 10 dB and 20 dB) is also evaluated using the D6 data set. The following experiments are designed for this purpose:
(n) Car.DPF(MLN+MLN+GS, dim:45) at 0 dB
(s) Car.DPF(MLN+MLN+In/En+GS, dim:45) at 0 dB
(t) Car.DPF(MLN+MLN+GS, dim:45) at 5 dB
(u) Car.DPF(MLN+MLN+In/En+GS, dim:45) at 5 dB
(v) Car.DPF(MLN+MLN+GS, dim:45) at 10 dB
(w) Car.DPF(MLN+MLN+In/En+GS, dim:45) at 10 dB
(k) Car.DPF(MLN+MLN+GS, dim:45) at 20 dB
(l) Car.DPF(MLN+MLN+In/En+GS, dim:45) at 20 dB.

For evaluating the performance of a DPF extractor, we measure the DPF correct rate (DCR) using the D5 data set. Here, a DPF value in the current frame (the middle 15 of the 45-dimensional output vector) below 0.5 is considered a negative feature; otherwise, it is a positive feature.


The overall DCR is obtained by the following equation, after counting the total number of correctly recognized DPF values, Nc, and the total number of DPF values, N, over all the phonemes:

$$\mathrm{DCR} = \frac{N_c}{N} \times 100 \ (\%)$$
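A minimal sketch of this DCR computation follows, assuming the MLN outputs and the frame-level 0/1 DPF references are available as arrays (the names are illustrative):

```python
import numpy as np

def dpf_correct_rate(outputs, labels, threshold=0.5):
    """DCR over current-frame DPFs (middle 15 of the 45-dim vectors).
    outputs: (frames, 45) MLN outputs; labels: (frames, 45) 0/1 targets."""
    pred = outputs[:, 15:30] >= threshold     # below 0.5 -> negative feature
    ref = labels[:, 15:30].astype(bool)
    return 100.0 * (pred == ref).sum() / ref.size
```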

For the In/En network in the proposed method, DPF(3-MLN+In/En+GS, dim:45), the enhancement coefficient C1 is likewise set to 4.0 after evaluating values of 2, 4, and 6; the steepness coefficient β is set to 80; and the inhibition coefficient C2 is fixed at 0.25 after observing the DPF data patterns, keeping the values of f(ΔΔ) between 0.25 and 1.0.

Since our goal is to design a more accurate phoneme recognizer, PCRs for the D5 data set are evaluated using an HMM-based classifier. The D4 data set is used to design 38 Japanese monophone HMMs for the extended proposed method and the existing methods (a)-(g); the HMM configurations are the same as for the two-MLN method [5]. In our experiments, phoneme recognition tests are carried out to compare the performance of the proposed DPF extractor with the baseline MFCC-based system, which inputs a 38-dimensional feature vector to the HMM classifier, and with the method proposed by T. Fukuda. In addition, some experiments are done to observe the effects of the In/En network on phoneme recognition performance over the MLN(s)+GS methods. The DPF-based methods input a 45-dimensional DPF vector to the classifier, and the following methods along with the baseline are evaluated:
(a) MFCC(baseline, dim:38)
(b) DPF(MLN+GS, dim:45)
(c) DPF(2-MLN+GS, dim:45)
(d) DPF(3-MLN+GS, dim:45)
(e) DPF(MLN+In/En+GS, dim:45)
(f) DPF(2-MLN+In/En+GS, dim:45)
(g) DPF(3-MLN+In/En+GS, dim:45) [proposed].

It is noted for the clean acoustic environment that the In/En network plays a significant role in increasing PCR.


Figure 5. Variation of phoneme recognition performance for different values of the enhancement coefficient, C1 (2.0, 4.0, 6.0), for clean data; PCR (%) vs. number of mixture components (1, 2, 4, 8, 16).

Figure 6. Effectiveness of In/En network on phoneme recognition performance over MLN+GS.

VI. EXPERIMENTAL RESULTS AND DISCUSSION

The variation in performance of the proposed method for different values of the enhancement coefficient, C1, is shown in Fig. 5. At all mixture components except 1, C1=4.0 exhibits the highest PCR among the investigated values, and hence C1 is set to 4.0 in our experiments. The impact of the In/En network on the methods (b) and (c) is shown in Figs. 6 and 7, respectively, for all the mixture components using the D5 data set. It is observed from the figures that the methods incorporating the In/En network, (e) and (f), provide higher phoneme recognition accuracy for all investigated mixture components than the methods (b) and (c), respectively. For example, at mixture component 16, the phoneme recognition accuracies for the methods (b) and (c), which do not incorporate the In/En network, are 79.19% and 81.32%, respectively, while the corresponding values for the methods (e) and (f) are 81.9% and 83.33%.


Figure 7. Effectiveness of In/En network on phoneme recognition performance over MLN+MLN+GS.

The PCR comparison between the In/En-based methods (e) and (f) is given in Table I. From the table, it is seen that the proposed method (f) provides the best PCR for all the investigated mixture components.


For example, at mixture component 8, the proposed method (f) exhibits 83.30% PCR, while the method (e) shows 81.35% PCR.

TABLE I. OVERALL PHONEME CORRECT RATES FOR IN/EN-BASED METHODS

Figure 8. Effect of In/En network on MLN+MLN+GS (car-noise, SNR=0dB).

Figure 9. Effect of In/En network on MLN+MLN+GS (car-noise, SNR=5dB).

Figure 10. Effect of In/En network on MLN+MLN+GS (car-noise, SNR=10dB).

Figure 11. Effect of In/En network on MLN+MLN+GS (car-noise, SNR=20dB).

For all the mixture components using the D6 data set, the effect of the In/En network in the car noise corrupted environment at different SNRs (0 dB, 5 dB, 10 dB and 20 dB) over the methods (n), (t), (v) and (k) is shown in Figs. 8, 9, 10 and 11, respectively. The methods (s), (u), (w) and (l), which embed the In/En network, provide higher phoneme recognition performance for all the investigated mixture components than the methods (n), (t), (v) and (k), respectively. The methods (n), (t), (v) and (k), which do not incorporate the In/En network, exhibit PCRs of 51.34%, 60.13%, 70.99% and 81.58%, respectively, at 16 mixture components, while the corresponding values for the In/En-based methods (s), (u), (w) and (l) are 53.48%, 62.55%, 73.36% and 83.46%. From the experiments in the car noise corrupted environment, it can be claimed that an improvement of PCR is obtained by using the In/En network.


The SNR-wise (0 dB, 5 dB, 10 dB, 20 dB and clean) PCRs for the proposed method, DPF(MLN+MLN+In/En+GS, dim:45), and the method DPF(MLN+MLN+GS, dim:45) are shown in Fig. 12 for mixture component 16. For all the investigated SNRs, the proposed method shows higher recognition performance. For example, at SNR 20 dB and in clean conditions, the proposed method gives 83.46% and 83.33% PCRs, respectively, while 81.58% and 81.32% PCRs are obtained by DPF(MLN+MLN+GS, dim:45). From all the observations in clean and car noise corrupted environments, it is clear that the In/En network has a significant effect in increasing phoneme recognition performance.


It can be claimed that the In/En network reduces the number of mixture components in the HMMs and hence the computation time. For example, at SNR = 5 dB in Fig. 9, approximately 60% PCR is obtained by the In/En-based method (u) at mixture component one, while the same PCR is achieved only at 16 mixture components by the method (t), which does not incorporate the In/En network. To obtain 60% PCR, the methods (u) and (t) therefore require 5K (= 1 × 5² × 200, using m × S² × T, where m, S and T represent the number of mixture components, states, and observation sequences, respectively) and 80K (= 16 × 5² × 200) multiplications, respectively, assuming an observation sequence of 200 frames. Hence, the In/En-based method reduces computation time. On the other hand, our proposed method, which incorporates three MLNs, increases the overall computation time in comparison with the method that incorporates a single MLN. For example, a single MLN requires 1000 × (75 × 256 + 256 × 96 + 96 × 45) = 48,096,000 (≈ 48.1M) multiplications, whereas the three MLNs require 48,096,000 + 1000 × (45 × 90 + 90 × 180 + 180 × 90 + 90 × 45) + 1000 × (45 × 3 × 300 + 300 × 100 + 100 × 45) = 163,596,000 (≈ 163.6M) multiplications.
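These counts can be checked directly; in the short Python check below, the 1000 multiplier is read as the number of frames assumed for the MLN comparison:

```python
# Verifying the multiplication counts quoted above.
T, S = 200, 5                        # observation frames, HMM states
hmm_mults = lambda m: m * S**2 * T   # m = number of mixture components
print(hmm_mults(1), hmm_mults(16))   # 5000 (~5K) and 80000 (~80K)

frames = 1000                        # frame count assumed for the MLN figures
single_mln = frames * (75*256 + 256*96 + 96*45)           # 48,096,000
cntxt_mln  = frames * (45*90 + 90*180 + 180*90 + 90*45)   # 40,500,000
dyn_mln    = frames * (45*3*300 + 300*100 + 100*45)       # 75,000,000
print(single_mln, single_mln + cntxt_mln + dyn_mln)       # ~48.1M and ~163.6M
```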

Figure 13. Segmentation of the utterance /jiNkoese/ using MLN.

Figure 14. Segmentation of the utterance /jiNkoese/ using 3-MLN.

Figure 12. Effect of In/En network for 16 mixture components on MLN+MLN+GS for different SNRs using car noise.

The segmentation of a clean /jiNkoese/ utterance is shown in Figs. 13 and 14 for the balanced DPF set using a single MLN and the 3-MLN, respectively. In both figures, the solid thin line and the solid bold line represent the ideal segmentation and the output segmentation, respectively; "nasal", "nil(high/low)", and "high" of phoneme /N/, and "unvoiced", "coronal", and "anterior" of phoneme /s/ are denoted by (1), (2), (3), (4), (5), and (6), respectively. Observing these marked places, we can say that the 3-MLN exhibits more precise segmentation (less deviation from the ideal boundaries) than the single MLN, reduces some of the fluctuations caused by the first MLN, MLNLF-DPF [8], and provides smoother DPF curves; hence, it misclassifies fewer phonemes. The overall DCRs for the MLN, 2-MLN and 3-MLN are shown in Table II; the 3-MLN exhibits a 1.6% improvement in DCR over the DPF extractor implemented with a single MLN.


TABLE II. OVERALL DPF CORRECT RATE FOR DPF EXTRACTORS

The phoneme recognition performance after applying GS orthogonalization for the methods (b), (c), and (d) is given in Fig. 15; (d) exhibits its highest performance (82.94%) at mixture component 16, while (b) and (c) show 79.06% and 80.66% PCRs, respectively. At mixture component 16, (d) thus shows an improvement of 3.88% over the method (b). After incorporating the In/En network into the methods (b), (c) and (d) of Fig. 15, we can evaluate the orthogonalized DPFs; the recognition results are shown in Fig. 16. From the figure, we observe that the proposed method (g) provides a higher PCR than the methods (e) and (f) for all mixture components.


Figure 15. Phoneme recognition performance using GS: (b) DPF(MLN+GS, dim:45) [8], (c) DPF(2-MLN+GS, dim:45), (d) DPF(3-MLN+GS, dim:45); PCR (%) vs. number of mixture components (1, 2, 4, 8, 16).

Figure 16. Phoneme recognition performance using In/En+GS: (e) DPF(MLN+In/En+GS, dim:45), (f) DPF(2-MLN+In/En+GS, dim:45), (g) DPF(3-MLN+In/En+GS, dim:45) [proposed]; PCR (%) vs. number of mixture components.


The proposed method exhibits its best PCR (84.52%) at mixture component 4. From Figs. 15 and 16, an improvement of 1.91% PCR at mixture component 4 by the proposed method over the method (d) illustrates the advantage of the In/En network. Fig. 17 compares the phoneme recognition performance of the proposed method with the baseline (a) and the method (b) proposed by T. Fukuda for the investigated mixture components. The proposed method outperforms the baseline at all mixture components; for example, at mixture component 16, the proposed method (84.50% PCR) improves the performance by 4.92% in comparison with the baseline (79.58% PCR). At the same mixture component, an improvement of 5.44% is achieved by the proposed method in comparison with the method proposed by T. Fukuda (79.06% PCR). Moreover, the proposed method requires fewer mixture components in the HMMs.

Figure 17. Comparison among the proposed and existing methods: (a) MFCC(baseline, dim:38), (b) DPF(MLN+GS, dim:45) [8], (g) DPF(3-MLN+In/En+GS, dim:45) [proposed]; PCR (%) vs. number of mixture components.

VII. CONCLUSION

This paper has presented the significance of the Inhibition/Enhancement network for ASR. The following conclusions are drawn from the study:
1. The addition of extra MLNs provides a higher PCR.
2. The proposed method, DPF(3-MLN+In/En+GS, dim:45), provides a higher PCR in clean acoustic environments than the other investigated methods.
3. The methods that incorporate the Inhibition/Enhancement network increase phoneme recognition performance significantly.
4. The Inhibition/Enhancement network reduces the number of mixture components in the HMMs; consequently, the HMMs require less computation time.

In future work, the authors would like to perform the same experiments on Bengali speech databases with the same network architecture, and to evaluate noise-corrupted speech data in different acoustic environments at different signal-to-noise ratios (SNRs) using the proposed DPF extractor.

REFERENCES

[1] M. N. Huda, H. Kawashima, and T. Nitta, "Distinctive phonetic feature (DPF) extraction based on MLNs and Inhibition/Enhancement network," IEICE Trans. Inf. & Syst., vol. E92-D, no. 4, April 2009.
[2] M. N. Huda, H. Kawashima, K. Katsurada, and T. Nitta, "Distinctive phonetic feature (DPF) based phoneme recognition using MLNs and Inhibition/Enhancement network for noise robust ASR," Proc. NCSP'09, Honolulu, Hawaii, USA, March 2009.
[3] M. N. Huda, H. Kawashima, and T. Nitta, "Distinctive phonetic feature extraction based on 3-stage MLNs and Inhibition/Enhancement network," Technical Report of IEICE, SP08, December 2008.
[4] M. N. Huda et al., "Phoneme recognition based on hybrid neural network with inhibition/enhancement of distinctive phonetic feature (DPF) trajectories," Proc. InterSpeech08, Brisbane, Australia, September 2008.
[5] M. N. Huda, M. S. Hossain, F. Hassan, M. M. Hassan, N. J. Lisa, and G. Muhammad, "An Inhibition/Enhancement

JOURNAL OF MULTIMEDIA, VOL. 6, NO. 5, OCTOBER 2011

[6] [7] [8] [9] [10] [11] [12] [13]

[14] [15]

[16] [17] [18]

Network for Noise Robust ASR,”” Proc. ICCIT’’10, Dhaka, Bangladesh, December 2010. T. Nitta, "Feature extraction for speech recognition based on orthogonal acoustic-feature planes and LDA," Proc. ICASSP’’99, pp.421-424, 1999. S. King and P. Taylor, "Detection of Phonological Features in Continuous Speech using Neural Networks," Computer Speech and Language 14 (4), pp. 333-345, 2000. E. Eide, "Distinctive Features for Use in an Automatic Speech Recognition System," Proc. Eurospeech 2001, vol.III, pp.1613-1616, 2001. S. King, et. al, ““Speech recognition via phonetically features syllables,”” Proc ICSLP’’98, Sydney, Australia, 1998. K. Kirchhoff, et. al, "Combining acoustic and articulatory feature information for robust speech recognition," Speech Commun.,vol.37, pp.303-319, 2002. K. Kirchhoffs, ““ Robust Speech Recognition Using Articulatory information,”” Ph.D thesis, University of Bielefeld, Germany, July 1999. T. Fukuda, ““A study on feature extraction and canonicalization for robust speech recognition,”” Ph.D thesis, March 2004. T. Fukuda and T. Nitta, "Orthogonalized Distinctive Phonetic Feature Extraction for Noise-Robust Automatic Speech Recognition," The Institute of Electronics, Information and Communication Engineers (IEICE) Transactions on Information and Systems, Vol. E87-D, No.5, pp. 1110-1118, 2004. T. Fukuda, W. Yamamoto, and T. Nitta, "Distinctive Phonetic feature Extraction for robust speech recognition," Proc. ICASSP'03, vol.II, pp.25-28, 2003. T. Fukuda and T. Nitta, ““Noise-robust Automatic Speech Recognition Using Orthogonalized Distinctive Phonetic Feature Vectors,”” Proc. Eurospeech 2003, Vol.III, pp.2189-2192, Sep. 2003. T. Kobayashi, et al. "ASJ Continuous Speech Corpus for Research," Acoustic Society of Japan Trans. Vol.48, No.12, pp.888-893, 1992. JNAS: Japanese Newspaper Article Sentences. http://www.milab.is.tsukuba.ac.jp/jnas/instruct.htm Itahashi, ““A noise database and Japanese common speech data corpus,”” J. Acoust. Soc. Jpn., vol. 47, no. 12, pp. 951–– 953, 1991 (in Japanese).

Foyzul Hassan was born in Khulna, Bangladesh, in 1985. He completed his B.Sc. in Computer Science and Engineering at the Military Institute of Science and Technology (MIST), Dhaka, Bangladesh, in 2006. He has participated in several national and ACM regional programming contests. He is currently pursuing an M.Sc. in CSE at United International University, Dhaka, Bangladesh. His research interests include speech recognition, robotics and software engineering.


Mohammed Rokibul Alam Kotwal was born in Dhaka, Bangladesh, in 1983. He completed his B.Sc. in Computer Science and Engineering (CSE) at Ahsanullah University of Science and Technology, Dhaka, Bangladesh. He is currently an M.Sc. in CSE student at United International University, Dhaka, Bangladesh. His research interests include neural networks, phonetics, automatic speech recognition, fuzzy logic systems, pattern classification, data mining and software engineering. He is a member of IEEE, the IEEE Communication Society and the Institution of Engineers, Bangladesh (IEB).

Mohammad Mahedi Hasan was born in Dhaka, Bangladesh, in 1984. He received his B.Sc. in Computer Science and Engineering from the Military Institute of Science and Technology (MIST), University of Dhaka, Bangladesh, in 2007. He is currently working as a Senior Software Engineer at Blueliner Bangladesh, Dhaka. Mr. Hasan is currently doing research in speech recognition, speech synthesis and natural language processing in general. His research interests also include artificial intelligence, parallel programming, data mining, software engineering and distributed computing.

Ghulam Muhammad was born in Rajshahi, Bangladesh, in 1973. He received his B.Sc. in Computer Science and Engineering from Bangladesh University of Engineering & Technology (BUET), Dhaka, in 1997. He completed his M.E. and Ph.D. at the Department of Electronics and Information Engineering, Toyohashi University of Technology, Aichi, Japan, in 2003 and 2006, respectively. He is now working as an Assistant Professor at King Saud University, Riyadh, Saudi Arabia. His research interests include automatic speech recognition and human-computer interfaces. He is a member of IEEE.

Mohammad Nurul Huda was born in Lakshmipur, Bangladesh, in 1973. He received his B.Sc. and M.Sc. in Computer Science and Engineering from Bangladesh University of Engineering & Technology (BUET), Dhaka, in 1997 and 2004, respectively. He completed his Ph.D. at the Department of Electronics and Information Engineering, Toyohashi University of Technology, Aichi, Japan. He is now working as an Associate Professor at United International University, Dhaka, Bangladesh. His research fields include phonetics, automatic speech recognition, neural networks, artificial intelligence and algorithms. He is a member of the International Speech Communication Association (ISCA).