
Speaker-Independent Recognition of Isolated Words Using Rough Sets ANDRZEJ CZYZEWSKI

Sound Engineering Department, Faculty of Electronics, Technical University of Gdansk, 80-952 Gdansk, Poland

ABSTRACT

The aim of the presented research is to elaborate and test a speaker-independent system for man-machine voice interfacing. A trajectory tracking method was implemented for feature vector extraction from the speech signal. A statistical method was employed for the quantization of attribute values. The rough set method was used to derive decision rules for the recognition of speech patterns. Results and conclusions on the effectiveness of the implemented algorithmic solutions are presented. © Elsevier Science Inc. 1998

1. INTRODUCTION

The task of recognizing speech utterances independently of the speaker has not yet been solved in the general case. Hence, there is a need to search for new or modified algorithms capable of dealing with speech signal redundancy and irrelevance, and with the unrepeatability and uncertainty of speech data. This paper presents an approach to the creation of a speaker-independent system for the recognition of isolated words based on the rough set decision algorithm. The described experiments allow one to study the results obtained using the elaborated method.

2. PREPROCESSING

Input speech was sampled at a sampling frequency of 22.05 kHz, quantized using 8-bit linear quantization, and recorded on the hard disk of a UNIX workstation. Subsequently, some initial processing procedures were executed in order to isolate words.

INFORMATION SCIENCES 104, 3-14 (1998)


The elaborated procedure for word end-point detection is based on tracking the frequency of time-domain envelope evolutions. The model for the utterance envelope is based on the assumption that the envelope of a word is trapezoidal. The values k_1, k_2, representing the word boundary time indices, provide the boundaries of the trapezoid basis, while the easily discernible values t_1, t_2 are localized outside the utterance (in the silence area). Consequently, the centroid value c is found using the following expression,

c = \frac{\int_{t_1}^{t_2} t\, s(t)\, dt}{\int_{t_1}^{t_2} s(t)\, dt},   (1)

where s(t) is the signal envelope peak density function, determined on the basis of a simple threshold algorithm for the analysis of envelope variations. The determination of the approximate values k_1, k_2 demands the dispersion to be calculated as follows,

d = \sqrt{\frac{\int_{t_1}^{t_2} (t - c)^2\, s(t)\, dt}{\int_{t_1}^{t_2} s(t)\, dt}},   (2)

Then,

k_1 \approx c - wd, \qquad k_2 \approx c + wd,   (3)

where w is an experimentally optimized coefficient, set equal to w = 1.7. The foregoing procedure, using the statistical approach to the detection of word boundaries, permits determination of the beginning and end points of words with the expected accuracy. Subsequently, the signal samples are divided into packets containing a predefined number of samples (equal to 256). Moreover, adjoining packets are grouped into segments (of width depending on the utterance length). Due to the use of the logarithm of the spectrum energy and of an orthogonal trajectory representation in the feature extraction system, the information on the energy of utterances and on their time duration is no longer directly exploited. Consequently, no additional operations are needed at the preprocessing stage to ensure the normalization of utterances.
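Under the stated model, the boundary computation of Eqs. (1)-(3) can be sketched in a few lines. The discrete time grid, the toy rectangular envelope, and the function name below are illustrative assumptions only, since the paper derives s(t) from a threshold analysis of envelope variations:

```python
# Sketch of the statistical word-boundary detector of Eqs. (1)-(3).
import numpy as np

def word_boundaries(s, w=1.7):
    """Estimate boundary indices k1, k2 from an envelope peak
    density function s sampled on a uniform time grid."""
    t = np.arange(len(s), dtype=float)
    mass = s.sum()
    c = (t * s).sum() / mass                      # centroid, Eq. (1)
    d = np.sqrt(((t - c) ** 2 * s).sum() / mass)  # dispersion, Eq. (2)
    k1 = max(0, int(round(c - w * d)))            # Eq. (3)
    k2 = min(len(s) - 1, int(round(c + w * d)))
    return k1, k2

# Toy envelope: peak density concentrated in the middle of the recording.
s = np.zeros(200)
s[60:140] = 1.0
k1, k2 = word_boundaries(s)
```

For this toy envelope the centroid falls at 99.5 and the dispersion at about 23.1 samples, so with w = 1.7 the detected boundaries nearly coincide with the true extent of the pulse.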

3. FEATURE VECTOR EXTRACTION

Various methods for feature vector extraction were previously tried by the authors [1], [2]. The cepstrum representation, allowing one to determine cepstral coefficients calculated using a nonlinear frequency scale (the mel scale), is particularly applicable to the presented experiments [4]. In order to determine these parameters, the spectrum is calculated based on the typical Hamming windowing procedure and a 256-point DFT (Discrete Fourier Transform) computation. Subsequently, the mel-frequency cepstrum coefficients (MFCC) are determined for each packet on the basis of [4],

M_i = \sum_{k=1}^{20} X_k \cos[i(k - 0.5)\pi/20],   (4)

where X_k is a result of the DFT calculation as follows,

X_k = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k (n/N)} = \sum_{n=0}^{N-1} x(n) \left[ \cos\!\left(\frac{2\pi k n}{N}\right) - j \sin\!\left(\frac{2\pi k n}{N}\right) \right],   (5)

and: F is the number of filters, i is the number of the cepstrum coefficient, k is the number of the frequency subband, N is the number of points in the Hamming window (equal to 256), x(n) is the product of the speech sample and the nth point of the Hamming window,

and j = \sqrt{-1}. Results of the spectral analysis X_k were located in 20 frequency subbands on the mel-frequency scale (F was set to 20). The MFCC representation (which may be called, shorter, the mel-cepstrum representation) is particularly useful as a source of distinctive parameters of the speech signal. Nevertheless, a wide margin is left for the intelligent decision algorithms when speech patterns are to be recognized on this basis. To observe this fact, it may be useful to analyze the plots presented in Figures 1 and 2. As can be seen in Figures 1(a) and 1(b), a digit pronounced two times by the same speaker has a similar, but not identical,

Fig. 1. First mel-cepstrum coefficient for the digit "zero" pronounced (in Polish) three times: voice of the same speaker (a) and (b); voice of another speaker (c).

mel-cepstrum representation. Another speaker produces speech of a considerably different representation (Fig. 1(c)); however, the overall character of the plot reveals some similarities, while the representations of another digit (Fig. 2) are definitively different. Unfortunately, similarities among representations that are discernible to the human observer remain difficult to describe algorithmically. The feature vector also contains two other parameters reflecting the time-domain characteristics of the signal. These are: the density of local

time-envelope peaks and the relative amplitude midpoint value P_A, defined as follows,

P_A = \frac{\sum_i i\, A_i}{\sum_i A_i},   (6)

where i is the number of the packet and A_i is the relative average amplitude for packet i. Similarly to the mel-cepstrum coefficients, the foregoing parameters are independent of the signal energy.

Fig. 2. First (a), second (b), and third (c) mel-cepstrum coefficient for the digit "six" (pronounced in Polish).
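A minimal sketch of the per-packet analysis of Eqs. (4)-(5) follows. The rectangular mel-spaced bands, their edges, and the use of log band energies as the X_k of Eq. (4) are assumptions drawn from common MFCC practice; the paper does not spell out the filter shapes:

```python
# Sketch of the mel-cepstrum computation of Eqs. (4)-(5) for one
# 256-sample packet, using the paper's 22.05 kHz sampling rate.
import numpy as np

N, F = 256, 20            # window length, number of mel bands
FS = 22050                # sampling frequency

def mel(f):               # Hz -> mel
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):           # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(packet, n_coef=6):
    x = packet * np.hamming(N)                 # Hamming windowing
    spec = np.abs(np.fft.rfft(x, N)) ** 2      # |X_k|^2 via Eq. (5)
    freqs = np.fft.rfftfreq(N, 1.0 / FS)
    # F log band energies on a mel-spaced grid (assumed band shape)
    edges = mel_inv(np.linspace(mel(0.0), mel(FS / 2), F + 1))
    bands = np.array([
        spec[(freqs >= lo) & (freqs < hi)].sum() + 1e-12
        for lo, hi in zip(edges[:-1], edges[1:])])
    logE = np.log(bands)
    # Eq. (4): cosine transform of the log band energies
    k = np.arange(1, F + 1)
    return np.array([
        (logE * np.cos(i * (k - 0.5) * np.pi / F)).sum()
        for i in range(1, n_coef + 1)])

packet = np.sin(2 * np.pi * 440.0 * np.arange(N) / FS)
coeffs = mfcc(packet)
```

The cosine transform decorrelates the band energies, which is what makes the first few coefficients usable as a compact feature vector.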


There is another possibility to compress the data before they are utilized for building the knowledge base containing speech patterns. Namely, the parameters calculated for individual signal packets (consecutive 256-sample sets) may be replaced by trajectories reflecting the parameter evolutions in the whole segment containing a certain number of packets. Consequently, each parameter is to be represented by its i-point evolutionary path. Time-domain parameters may be represented by positive numbers only, while the mel-cepstrum coefficients may have both positive and negative values. An orthogonal system, such as the Fourier basis, should be used to describe the parameter trajectories fully. In the elaborated parametrization system, the even parts of the trigonometric cosine function were employed for this task. Consequently, the description of the parameter trajectory T is determined on the basis of the following relationship, which formally resembles the previous equation (4),

T_i = \sum_{j=1}^{n} P_j \cos[i(j - 0.5)\pi/n],   (7)

where: i is the number of the trajectory coefficient (i = 1, 2, ..., 6), j is the number of the segment, n is the number of segments in the whole utterance, and P_j is the parameter calculated for the jth segment. It is noted that after the calculation of the trajectory coefficients, the information on the utterance time duration is no longer exploited. Hence, this way of parametrization also comprises the time-normalization of speech patterns.

4. QUANTIZING OF ATTRIBUTES

A rule-based classification system can operate only on quantized parameter values. The essence of the quantization procedure applied to the feature vector parameters is to replace numbers by ranges. Due to the nonlinear character of most speech parameters, "blind" quantization, consisting of dividing the parameters into arbitrarily chosen ranges of the same width, cannot be particularly effective; however, even such a simplified approach may lead to acceptable recognition scores when the rough set classifier is employed [2], [3]. The quantization of real-value attributes constitutes one of the central problems related to rule-based decision systems; thus it has been investigated by many authors [5], [6]. The presented system for speech recognition uses an original method for the scaling of attribute values based on the statistical approach to the determination of


ranges of the nonlinear feature vector parameters. The same statistics were employed as in the previous experiments with speech recognition using neural networks. However, the purpose of this approach is different: the statistics serve as a tool for the determination of division points while determining the attribute ranges on the basis of the clustered parameter values. Feature vectors can be paired on the basis of the Behrens-Fisher statistic calculated as follows,

V = \frac{\bar{X} - \bar{Y}}{\sqrt{S_X^2/n + S_Y^2/m}},   (8)

where,

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad \bar{Y} = \frac{1}{m} \sum_{i=1}^{m} Y_i,   (9)

are the arithmetic averages of the observed parameter values X_i, Y_i, and,

S_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2, \qquad S_Y^2 = \frac{1}{m-1} \sum_{i=1}^{m} (Y_i - \bar{Y})^2,   (10)

are the estimators of the variances of the corresponding random variables; n, m are the cardinalities of the test sets of the populations X and Y. In the case of fixed cardinalities n and m, the statistic serves as a distance measure between the compared classes for the individual parameters (in most cases n = m). The possibility of discerning between patterns is more probable for pairs giving higher values of this statistic, i.e., for lower dispersions and bigger differences between the average values calculated on the basis of the data, considered as random tests drawn from a population of normal distribution. In order to find the discriminator, it is necessary to use the distribution estimators of the examined parameters. Provided these estimated distributions have the same dispersions, the discriminator value may be calculated on the basis of the following equation,

d_{xy} = \frac{\bar{X} + \bar{Y}}{2},   (11)

where d_{xy} is the discriminator value and \bar{X}, \bar{Y} are the observed arithmetic averages. The described situation is illustrated in Figure 3.
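The pairwise separation measure of Eqs. (8)-(11) can be sketched as follows; the synthetic Gaussian samples and the function names are illustrative assumptions:

```python
# Sketch of the Behrens-Fisher class-separation statistic, Eq. (8),
# with the unbiased variance estimators of Eq. (10), and the
# equal-dispersion discriminator of Eq. (11).
import numpy as np

def behrens_fisher(x, y):
    """Eq. (8): distance measure between two parameter populations."""
    n, m = len(x), len(y)
    sx2 = x.var(ddof=1)          # S_X^2, Eq. (10)
    sy2 = y.var(ddof=1)          # S_Y^2, Eq. (10)
    return (x.mean() - y.mean()) / np.sqrt(sx2 / n + sy2 / m)

def discriminator(x, y):
    """Eq. (11): division point for equal-dispersion distributions."""
    return (x.mean() + y.mean()) / 2.0

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 50)     # parameter values observed for class X
y = rng.normal(4.0, 1.0, 50)     # parameter values observed for class Y
v = behrens_fisher(x, y)         # large |v| => easily discernible pair
d = discriminator(x, y)          # division point near the midpoint, ~2
```

A large magnitude of v flags a parameter as useful for discerning the pair of classes, and d then becomes a quantization boundary for that parameter.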


Fig. 3. Probability density plots of two Gaussian distributions with the same dispersions and different mean values.

For the case of unequal dispersions, the discriminator should be closer to the mean value of the distribution having the lower dispersion; thus, the following condition is to be fulfilled,

P(x > d_{xy}) = P(y < d_{xy}),

where P(x > d_{xy}) is the probability that the random variable x fulfils the condition x > d_{xy}, and P(y < d_{xy}) is the probability that the random variable y fulfils the condition y < d_{xy}.
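For two Gaussian distributions this equal-tail condition reduces to equating the z-scores of the candidate division point, which yields a closed form. The derivation below is a sketch added here under that Gaussian assumption, not a formula given in the paper:

```python
# Division point with equal tail probabilities for two Gaussians
# N(mx, sx^2) and N(my, sy^2) with mx < my: the condition
# P(x > d) = P(y < d) is equivalent to (d - mx)/sx = (my - d)/sy,
# which solves to d = (sy*mx + sx*my) / (sx + sy).

def discriminator_unequal(mx, sx, my, sy):
    """Equal-tail-probability discriminator for unequal dispersions."""
    return (sy * mx + sx * my) / (sx + sy)

d = discriminator_unequal(0.0, 0.5, 4.0, 2.0)   # sx < sy
# d = 0.8: pulled toward the mean of the tighter distribution
```

Note that for sx = sy the formula collapses to the midpoint of the two means, recovering Eq. (11).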