
SIMULTANEOUS RECOGNITION OF MULTIPLE SOUND SOURCES BASED ON 3-D N-BEST SEARCH USING MICROPHONE ARRAY

Panikos HERACLEOUS, Takeshi YAMADA, Satoshi NAKAMURA, Kiyohiro SHIKANO

Graduate School of Information Science, Nara Institute of Science and Technology
8916-5, Takayama-cho, Ikoma-shi, Nara, 630-0101, JAPAN

ABSTRACT

Recognition of distant-talking speech in noisy and reverberant environments is a key issue in any speech recognition system. A so-called hands-free speech recognition system plays an important role in a natural and friendly human-machine interface. Considering the practical use of a speech recognition system, we realize that such a system has to deal with the presence of multiple sound sources, including multiple talkers and noise sources. This paper proposes a method which recognizes multiple talkers simultaneously in real environments by extending the 3-D Viterbi search to a 3-D N-best search algorithm. While the 3-D Viterbi method finds the most likely path in the 3-D trellis space, the proposed method considers multiple hypotheses for each direction in [...]. Combinations of the direction sequence and the phoneme sequence of multiple sources are included in the N-best list. The paper investigates the performance of the proposed method through experiments [...] utterances of multiple talkers.

[...] systems are microphone [...] The use of the microphone array [...] fact that the microphone array [...] of the use of the spatial information [...] sources to suppress noise signals [...]

[...] environments. One way to solve this problem is to integrate microphone array processing and speech recognition. Last year, a speech recognition algorithm based on a 3-D Viterbi search was proposed [4][5]. A direction-frame sequence of parameter vectors (e.g. mel-frequency cepstrum coefficients) can be obtained by steering a beamformer to each direction in every frame. The parameter vectors in the talker direction are extracted from high-quality speech. Therefore, the talker direction may be estimated by matching the direction-frame sequence of parameter vectors against HMMs. The 3-D Viterbi method performs talker localization and speech recognition simultaneously by finding the most likely path in a 3-dimensional trellis space composed of talker directions, input frames and HMM states. Speaker-dependent isolated-word recognition experiments have shown that, for a moving talker in a real room, the word recognition rates of the 3-D Viterbi method with adaptive beamforming are drastically improved compared with those of a single remote microphone [5].

Although the 3-D Viterbi search method is a promising way to realize hands-free speech recognition in a real environment, it is applicable only to situations with a single talker. For practical use, however, it is also necessary to deal with multiple-talker situations. This paper proposes a new algorithm for simultaneous recognition of multiple talkers by introducing the N-best paradigm into the 3-D Viterbi search.
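The trellis search described above can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `search_3d`, its arguments and the toy scores are hypothetical. The sketch assumes precomputed log scores, a path that starts in the first HMM state and ends in the last one, and no prior on the initial direction. It keeps the `n_best` highest-scoring arrivals in every (state, direction) cell, so `n_best=1` corresponds to a single-path Viterbi search and larger values give a simplified form of the N-best idea.

```python
import heapq
import numpy as np

def search_3d(log_b, log_a1, log_a2, n_best=1):
    """Search a (frame, direction, state) trellis.

    log_b[n, d, q] : log output probability of state q for the parameter
                     vector extracted in direction d at frame n
    log_a1[qp, q]  : log state-transition probability
    log_a2[dp, d]  : log direction-transition probability
    Returns up to n_best hypotheses (score, states, directions), best first.
    """
    n_frames, n_dirs, n_states = log_b.shape
    # Assumption: every path starts in state 0; any start direction is allowed.
    cells = {(0, d): [(float(log_b[0, d, 0]), (0,), (d,))] for d in range(n_dirs)}
    for n in range(1, n_frames):
        new_cells = {}
        for q in range(n_states):
            for d in range(n_dirs):
                # Collect every hypothesis that can arrive in cell (q, d).
                arrivals = []
                for (qp, dp), hyps in cells.items():
                    step = log_a1[qp, q] + log_a2[dp, d]
                    if not np.isfinite(step):
                        continue  # forbidden transition
                    for score, states, dirs in hyps:
                        arrivals.append((score + step, states + (q,), dirs + (d,)))
                # Keep the n_best arrivals, then add the output log probability.
                kept = heapq.nlargest(n_best, arrivals, key=lambda h: h[0])
                if kept:
                    new_cells[(q, d)] = [(s + float(log_b[n, d, q]), st, ds)
                                         for s, st, ds in kept]
        cells = new_cells
    # Assumption: a complete path must end in the last HMM state.
    finals = [h for (q, _), hyps in cells.items() if q == n_states - 1 for h in hyps]
    return heapq.nlargest(n_best, finals, key=lambda h: h[0])

# Toy example (hypothetical numbers): 3 frames, 2 directions, 2 left-to-right states.
log_a1 = np.array([[np.log(0.5), np.log(0.5)], [-np.inf, 0.0]])
log_a2 = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_b = np.full((3, 2, 2), -6.0)  # direction 0 matches poorly everywhere
log_b[:, 1, :] = [[-1.0, -4.0], [-1.0, -4.0], [-4.0, -1.0]]  # direction 1 matches well

best = search_3d(log_b, log_a1, log_a2, n_best=1)
two_best = search_3d(log_b, log_a1, log_a2, n_best=2)
```

In the toy data only direction 1 matches the model well, so the best hypothesis stays in that direction for all three frames. Storing full state and direction sequences in every hypothesis keeps the sketch short; a practical implementation would use back-pointers instead.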

2. 3-D VITERBI SEARCH

Because localization errors affect recognition performance, several methods have been proposed to solve the localization problem. Most of them are based on extracting the direction with the maximum power. The speech recognition algorithm based on the 3-D Viterbi search approaches the problem in a different way. By steering a beamformer to every direction in each frame, a direction-frame sequence of parameter vectors is obtained. Because the parameter vectors in the talker direction are extracted with high quality, the talker can be localized by matching the direction-frame sequence of parameter vectors against the HMMs.

In the 3-D Viterbi search based algorithm, the extraction of the direction-frame sequence of parameter vectors is followed by the Viterbi search, which is performed in the 3-D trellis space (Figure 1) composed of talker directions, input frames and HMM states. Based on maximum likelihood, an optimal path can be found, and in this way a combination of the talker direction sequence and the phoneme sequence of the speech can be obtained.

Figure 1. 3-D trellis space (axes: direction, HMM state, frame)

The optimal combination of the direction and state sequence $(\hat{d}, \hat{q})$ can be found by using the formula

$$(\hat{q}, \hat{d}) = \underset{q,d}{\operatorname{argmax}} \Pr(x \mid d, q, M). \qquad (1)$$

This likelihood can be calculated using the Viterbi formula

$$\alpha(q,d,n) = \max_{q',d'} \{\alpha(q',d',n-1) + \log a_1(q',q) + \log a_2(d',d)\} + \log b(q, x(d,n)), \qquad (2)$$

where $M$ is the model, $q$, $d$, $n$ are the state, direction and frame index respectively, $b$ is the output probability, $a_1(q',q)$ is the transition probability from state $q'$ to state $q$ and $a_2(d',d)$ is the transition probability from direction $d'$ to direction $d$. The probability $a_2(d',d)$ represents how likely the talker is to move.

3. 3-D N-BEST SEARCH

As in the conventional 3-D Viterbi search based method, the direction-frame sequence of parameter vectors is extracted by steering a beamformer to each direction in every frame.

The 3-D N-best search considers multiple hypotheses for each state and direction $(q,d)$. The hypotheses are found by considering all the predecessor hypotheses which end in $(q,d)$ at frame [...]. The arrival hypotheses are merged, and the hypotheses with different direction sequences are sorted to find the N-best hypotheses. Formula (3) gives the general way to calculate the likelihood of the N-best hypotheses.

$$g_N(q,d,n) = \underset{q',d'}{N\text{-}\max} \{g_N(q',d',n-1) + \log a_1(q',q) + \log a_2(d',d)\} + \log b(q, x(d,n)) \qquad (3)$$

As a result of the 3-D N-best search, multiple hypotheses can be obtained, and in this way multiple sources can be localized and recognized simultaneously.

The beamformer is steered to each direction and multiple hypotheses are taken into account. The system has to deal with a huge number of hypotheses, which results in high memory requirements and low recognition speed. Conventional beam pruning can be applied in order to reduce the number of considered hypotheses.

An additional problem which the described N-best search faces is the case when the [...] the correct direction is lower than that in other directions. In this case the performance is degraded. The effect of this problem can be reduced by introducing a weight function based on the power, which raises the likelihood in directions with [...]-like characteristics. The introduced weight function, which results in higher recognition rates, is given by the following formula:

$$w(d,n) = \log \frac{\sum_{n'=n-(\nu-1)}^{n} \{p(d;n')\}^{\mu}}{\sum_{d'=1}^{D} \sum_{n'=n-(\nu-1)}^{n} \{p(d';n')\}^{\mu}}, \qquad (4)$$

where $p(d;n)$ is the power, extracted for direction $d$ and frame index $n$; $\mu$ is the parameter that controls the weight effect; $\nu$ is the parameter for adjusting the continuation; and $D$ is the number of directions.

ESCA. Eurospeech99. Budapest. Hungary. ISSN 1018-4074. Page 69
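The power-based weight described above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: the function name `direction_weight` and the toy power values are hypothetical. It assumes that the numerator sums the power of direction d, raised to μ, over the last ν frames, and that the denominator is the same quantity summed over all D directions. It also assumes that the window is simply shortened for the first frames, which the text does not specify.

```python
import numpy as np

def direction_weight(power, mu=1.0, nu=3):
    """Log share of windowed power for each direction and frame.

    power[d, n] : beamformer output power for direction d at frame n
    mu          : exponent controlling the strength of the weight
    nu          : number of frames in the smoothing window
    Returns w[d, n], the log of direction d's share of the total power.
    """
    p = np.asarray(power, dtype=float) ** mu
    n_dirs, n_frames = p.shape
    w = np.empty_like(p)
    for n in range(n_frames):
        # Assumption: the window is truncated at the start of the utterance.
        start = max(0, n - (nu - 1))
        window = p[:, start:n + 1].sum(axis=1)  # one value per direction
        w[:, n] = np.log(window / window.sum())  # normalize over directions
    return w

# Toy example (hypothetical numbers): direction 0 carries four times the power.
power = np.array([[4.0, 4.0, 4.0],
                  [1.0, 1.0, 1.0]])
w = direction_weight(power, mu=1.0, nu=2)
```

Because the weight is a log of a normalized share, it is never positive, and exponentiating it gives values that sum to one over the directions in every frame. How the weight enters the search score is not stated in the surviving text, so the sketch stops at computing it.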