Spoken Inquiry Discrimination Using Bag-of-Words for Speech-Oriented Guidance System

Haruka Majima1, Rafael Torres1, Yoko Fujita1, Hiromichi Kawanami1, Tomoko Matsui2, Hiroshi Saruwatari1, Kiyohiro Shikano1

1 Graduate School of Information Science, Nara Institute of Science and Technology, Japan
2 Department of Statistical Modeling, The Institute of Statistical Mathematics, Japan

{haruka-m, rafael-t, kawanami, sawatari, shikano}@is.naist.jp,
[email protected]

Abstract

We investigate a method for discriminating between invalid and valid spoken inquiries received by a speech-oriented guidance system operating in a real environment. Invalid spoken inquiries include background voices, which are not directly uttered to the system, and nonsense utterances. Such spoken inquiries should be rejected beforehand. We have previously reported a method using the likelihood values of Gaussian mixture models (GMMs) to discriminate invalid spoken inquiries from valid ones. In this paper, we improve the performance by utilizing not only the likelihood values but also other information in spoken inquiries, such as bag-of-words (BOW), utterance duration, and signal-to-noise ratio (SNR). To combine these multiple sources of information, we use a support vector machine (SVM) with a radial basis function (RBF) kernel and the maximum entropy (ME) method, and compare their performance. In the experiments, we achieve an F-measure of 86.6% with SVM and 84.2% with ME, while the F-measure of the GMM-based method is 81.7%.

Index Terms: speech-oriented guidance system, spoken inquiry discrimination, support vector machine, maximum entropy, bag-of-words

1. Introduction

Automatic speech recognition (ASR) has been widely applied to dictation, Voice Search, and car navigation, to name a few. In this paper, we describe the speech-oriented guidance system Takemaru-kun [1], which aims to realize a natural speech interface using ASR. Takemaru-kun is a real-environment speech-oriented guidance system adopting example-based question answering (QA), which can respond flexibly to users' questions on demand. An answer to a user's question is selected by referring to the [...]

Moreover, we employ it together with the likelihood values of GMMs and the duration and SNR information of input utterances to improve the overall performance of discrimination between invalid and valid inputs.

In this paper, we describe effective feature combinations for invalid input discrimination, using the BOW from the 10-best ASR results to train models with either SVM or ME. The outline of the paper is as follows. First, we give an overview of the Takemaru-kun system. Then, we describe our proposed method with several effective features using SVM or ME, and show the experimental results using real users' utterances. Finally, we conclude our proposal.

2. Speech-oriented guidance system Takemaru-kun

The Takemaru-kun system [1] (Figure 1) is a real-environment speech-oriented guidance system, placed inside the entrance hall of the Ikoma City North Community Center located in the Prefecture of Nara, Japan. The system has been operated daily since Nov. 2002, providing guidance to visitors regarding the center facilities, services, neighboring sightseeing, weather forecast, news, and about the agent itself, among other information. Users can also activate a Web search feature that allows searching for Web pages over the Internet containing the uttered keywords. This system is also aimed at serving as a field test of a speech interface, and at collecting actual utterance data [...]

The system displays an animated agent at the front monitor, which is the mascot character of Ikoma city, Takemaru-kun. The interaction with the system follows a one-question-[...]
Figure 5: F-measures of SVM and ME for combination patterns (1) and (5) to (7). [Horizontal axis: combination of features, i.e. (1) GMM, BOW; (5) GMM, BOW, duration; (6) GMM, BOW, SNR; (7) GMM, BOW, duration, SNR.]
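The F-measure reported in the figures is conventionally the harmonic mean of precision and recall. The following is a minimal sketch of that computation, assuming that "invalid" is treated as the positive class; this excerpt does not state which class the paper scores, and the function name `f_measure` and the toy labels are hypothetical.

```python
def f_measure(y_true, y_pred, positive="invalid"):
    """Return (precision, recall, F-measure) for one class treated as positive.

    Assumption: the balanced F-measure (harmonic mean of precision and
    recall) is used; the choice of positive class is illustrative only.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels invented for illustration; these are not data from the paper.
y_true = ["invalid", "invalid", "valid", "valid", "invalid", "valid"]
y_pred = ["invalid", "valid", "valid", "invalid", "invalid", "valid"]
precision, recall, f1 = f_measure(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} F={f1:.3f}")
```

With these toy labels there are two true positives, one false positive, and one false negative, so precision, recall, and F-measure all equal 2/3.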
5. Conclusions

We investigated discrimination between invalid and valid spoken inquiries using multiple features: the likelihood values of GMMs, BOWs, utterance durations, and SNRs. The SVM and ME methods were compared in terms of F-measure, and both outperformed the conventional method, which uses the likelihood values of GMMs. SVM performed best, achieving an F-measure of 86.6%. Our future work includes further investigation of effective methods for combining different kinds of features.
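To make the idea of combining these features concrete, the sketch below builds one feature vector per utterance by concatenating GMM likelihood values, a bag-of-words count vector taken over the N-best ASR hypotheses, the utterance duration, and the SNR. It is only an illustration under stated assumptions: the use of two GMM log-likelihoods (valid and invalid), raw word counts, whitespace-segmented hypotheses, and all names and values (`bow_vector`, `combine_features`, the vocabulary, the numbers) are hypothetical and not taken from the paper.

```python
from collections import Counter


def bow_vector(nbest, vocabulary):
    """Count vocabulary words over all N-best hypotheses.

    Assumption: hypotheses are whitespace-segmented strings; words outside
    the vocabulary are ignored.
    """
    counts = Counter(word for hyp in nbest for word in hyp.split())
    return [float(counts[w]) for w in vocabulary]


def combine_features(gmm_loglik_valid, gmm_loglik_invalid, nbest, vocabulary,
                     duration_sec, snr_db):
    """Concatenate GMM likelihoods, BOW counts, duration, and SNR."""
    return ([gmm_loglik_valid, gmm_loglik_invalid]
            + bow_vector(nbest, vocabulary)
            + [duration_sec, snr_db])


# Hypothetical example: vocabulary and all numeric values are invented.
vocabulary = ["weather", "library", "hello", "where"]
nbest = ["where is the library", "where is library", "hello library"]
x = combine_features(-61.2, -65.8, nbest, vocabulary,
                     duration_sec=1.4, snr_db=18.0)
print(x)
```

A vector of this form could then be passed to a classifier such as an SVM with an RBF kernel or an ME model. Because the components have very different ranges, scaling them before training an RBF-kernel SVM is usually advisable.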
6. Acknowledgements

This work was partially supported by CREST (Core Research for Evolutional Science and Technology), Japan Science and Technology Agency (JST).
Table 4: Combinations of features

Pattern   Combination of features
(1)       GMM likelihood, BOW
(2)       GMM likelihood, duration
(3)       GMM likelihood, SNR
(4)       GMM likelihood, duration, SNR
(5)       GMM likelihood, BOW, duration
(6)       GMM likelihood, BOW, SNR
(7)       GMM likelihood, BOW, duration, SNR

Figure 4: F-measures of SVM and ME for combination patterns (1) to (4)

7. References

[1] R. Nisimura, A. Lee, H. Saruwatari and K. Shikano, "Public Speech-Oriented Guidance System with Adult and Child Discrimination Capability", In Proc. ICASSP2004, vol. 1, pp. 433-436, 2004.
[2] H. Sakai, T. Cincarek, H. Kawanami, H. Saruwatari, K. Shikano, A. Lee, "Voice Activity Detection applied to hands-free spoken dialogue robot based on decoding using acoustic and language model", Proceedings of the 1st international conference on Robot communication and coordination (ROBOCOMM2007), Article No. 16, 8 pages, 2007.
[3] A. Lee, K. Nakamura, R. Nishimura, H. Saruwatari, and K. Shikano, "Noise Robust Real World Spoken Dialogue System using GMM Based Rejection of Unintended Inputs", In Proc. International Conference on Spoken Language Processing, pp. 847-850, 2004.
[4] C. Chang and C. Lin, "LIBSVM: a Library for Support Vector Machines", 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[5] C. Manning and D. Klein, "Optimization, Maxent Models, and Conditional Estimation without Magic", Tutorial at HLT-NAACL 2003 and ACL 2003. Software available at http://nlp.stanford.edu/software/classifier.shtml
[6] A. Lee, T. Kawahara and K. Shikano, "Julius - an Open Source Realtime Large Vocabulary Recognition Engine", Proc. Eurospeech2001, pp. 1691-1694, 2001.