FEATURE EXTRACTION AND SENTENCE RECOGNITION ALGORITHM IN SPEECH INPUT SYSTEM

K. Shirai
Department of Electrical Engineering, Waseda University, Tokyo, Japan

Abstract

A feature extraction method for speech waves and an algorithm for sentence recognition are studied. The feature extraction is based on an articulatory model constructed from the statistical analysis of X-ray data. The model implicitly holds the physiological constraints and makes it possible to estimate the state of the articulatory mechanism. The estimated articulatory parameters provide a set of good features for speech recognition. The sentence recognition problem is mathematically formulated as an optimization problem with constraints by introducing sentence structures from syntactic and semantic considerations. The algorithm gives an optimal solution in the Bayesian sense.

Introduction

In this paper two major components of the speech understanding system are discussed. One is a feature extraction method for the speech wave and the other is a sentence recognition algorithm.

Many speech pattern recognition systems have been made to classify spoken words, and a few have tried to treat spoken sentences. A speech recognition system which employs information from all levels, from the acoustic to the semantic, to understand the meaning is of interest. 1)-5) It is true that the research at the higher levels, such as the syntactic and semantic levels, has not been sufficiently applied to speech recognition. However, the balance of each level of the system is important. The application of the information at the higher levels might improve the total performance of the system, but it should be noted that a communication system in which little redundancy remains is deprived of variability and adaptability. Though the information at the upper level may be used in the manner of human speech understanding, this does not mean that a human cannot correctly recognize a single word which is pronounced clearly. Therefore, more effort is necessary for the recognition of words or phonemes if a speech understanding system is the objective.

In this study, first, a feature extraction method for the speech wave is treated as the most elemental problem. Recently, several authors have studied the estimation of the vocal tract shape from the speech wave. But till now the results are not necessarily satisfactory because, as is well known, the spectral characteristics of the speech wave do not correspond one to one to the vocal tract shape as an acoustic tube, and because the speech wave carries the effects of the vocal cord oscillation, noise from the turbulence and so on. It is therefore desirable to apply the knowledge from physiological and phonological study to the estimation of the vocal tract shape. Such research has the potential to make possible the estimation of the motor commands that move the articulatory mechanism. It is doubtful that, if the motor commands are precisely estimated, the phoneme recognition becomes easy. However, they may be good features for speech recognition. In this study, an articulatory model is constructed on the basis of X-ray data and a nonlinear regression method is used to estimate the articulatory parameters. The results of the estimation preserve the typical nature of each phoneme, and the method would provide a useful feature extraction technique.

Second, the problem of sentence recognition is mathematically formulated as an optimization problem with the constraint of sentence structures and is solved by a method of dynamic programming. Here, it might be necessary to define what recognition of a sentence is. The fundamental situation of speech is conversation. In the conversation between A and B, the representation "B understands the sentence spoken by A" has too much content. We therefore consider only "B responds to the sentence spoken by A." The response is described by a state transition of the machine. Of course, the machine may exhibit some outputs at the transition. Each moment of the conversation is accompanied by a scene, and usually the speaker and the receiver have a common recognition of the scene. The concept of the scene is naturally taken into account by the state of the machine. The probable sentences that may appear under a state are limited, and the effective number of sentences that affects the recognition score is reduced. In a practical application, the purpose and the ability of the machine are always limited, and the contents of the conversation may be finite. It is then allowable to set sentence structures to order words in restricted ways. In the framework mentioned above, sentence recognition can be considered as an extension of a classification problem, and an effective optimal algorithm for sentence recognition can be obtained.
Feature Extraction via Estimation of Articulatory Parameters

Construction of Articulatory Model
An articulatory model presents an effective representation for the structures of the articulatory mechanism and the dynamical characteristics of the articulatory motion, and further it relates the articulatory parameters to the acoustic ones. 8), 9) The configuration of the articulatory model is shown in Fig. 1. The midsagittal vocal tract outline can be represented by the variables specifying the positions of the movable structures, i.e. jaw, tongue and lips. The maxilla, rear pharyngeal wall and larynx outlines are fixed and approximated by a sequence of circular arcs and straight lines.

The jaw is assumed to rotate with a fixed radius about the fixed point $F_j$, and its location is given by the angle with respect to the reference line which is tangent to the hard palate. The jaw movement has a passive effect on the position of the tongue and lips, and influences not only the mouth opening area but the overall vocal tract shape.

The lip shape is specified by the height $L_h$ and the protrusion $L_p$ relative to the jaw position on the midsagittal plane. Only the lip protrusion parameter may be necessary to specify the lip movement for the vowels, but both are required to explain the different gestures in usual speech containing labial consonants.

The tongue contour is described in terms of a semi-polar coordinate system defined with reference to the jaw position. The center $F_t$ rotates synchronously with the jaw movement. 11) Therefore, the tongue contour is measured in the jaw-based coordinate system. Though the tongue may be able to form various shapes, it has limited freedom to move about in the articulatory process on account of the physiological and phonological constraints. These constraints can be expressed by the strong correlation of the position of each segment along the tongue contour and may be extracted from the statistical analysis of the X-ray data.

It is known that for the tongue articulation of vowels, the extrinsic muscle activity is more significant than the intrinsic one. The principal components for the extrinsic activity are therefore obtained from the tongue contour data for vowels, and the tongue contour vector for vowels $X_v$ can be expressed in the linear form as,

(1)

where the expansion uses the eigenvectors and a mean vector for vowels, which corresponds to the neutral tongue contour. The eigenvectors are calculated from the next equation.

(2)

and $\lambda$ is the corresponding eigenvalue to satisfy the characteristic equation.

For the consonants, the effect of the intrinsic muscle activity appears particularly in the front part of the tongue, but it is difficult to separate precisely the intrinsic muscle activity from the extrinsic one. At first the contribution which comes from the extrinsic components is subtracted by projecting the tongue contour vector $X_c$ onto the vowel space, which is spanned by the eigenvectors for the vowels. The remainder is calculated as,

(5)

where the subtracted term is the projection of $X_c$ onto the vowel space. Again the principal component analysis is performed on the remainder, and the eigenvectors ($k = 1, 2, \ldots, q$) are calculated in the same manner as for the vowels. Finally the expression for the consonants can be obtained as,

The first component represents the movement of the tongue body between the rear-pharyngeal wall and the hard palate direction and produces mainly an antisymmetric perturbation of the vocal tract. It indicates the opposite feature of back and front vowels, i.e. [a] vs [i]. The second component represents the movement of the tongue towards the velum and produces a symmetric perturbation that is effective for the rounded vowel [u]. The third component is less clearly explained and may be interpreted as the tongue deformation resulting from the contraction of the posterior fibers of the genioglossus and the intrinsic muscle of the tongue tip. The fourth component is an intrinsic component and represents the tongue tip retroflex. As an example, the loci of the tongue movement on the $a_1$-$a_2$ space for three utterances /hətV/ (V: a, i, u) are illustrated in Fig. 3. The points A on the loci indicate the onset and the offset of the tongue tip closure.

From the above discussions, it is seen that the following articulatory parameters are enough to describe the midsagittal vocal tract outline, i.e. the jaw angle, the lip protrusion $L_p$ and the weighting coefficients of the tongue components $a_i$, $b_k$. For the vowels the lip height is dependent on the lip protrusion and can be approximated by,

$L_h = 0.3 - 0.25\,(L_p - 1.0)$. (7)

The relation (7) was determined from the analysis of the front and side photographs.

The cross-sectional dimension along the vocal tract is determined from a semi-polar coordinate system fixed with regard to the maxilla and the rear-pharyngeal wall. The relation between the cross dimension $d$ and the cross-sectional area $S$ is approximated by the power function $S = 2 d^{1.5}$. In the labial region the area is approximated by an ellipse with the width given by

where $L_s$ is the vertical separation of the lips.

The vocal tract is divided into 30 uniform cylindrical tubes and the reflection coefficients between the adjoining sections are calculated. Regarding the losses at the glottis, lips and within the tract, a transmission-line model is constructed and the transfer function is expressed using the z-transform,

where $z$ is defined through an exponential whose argument involves the length of one section and the velocity of sound $c$. The formant frequencies are calculated from Eq. (9) by the Fibonacci searching method.
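The tube discretization described above can be illustrated with a minimal sketch. It applies the power law $S = 2 d^{1.5}$ from the text to a list of cross dimensions and computes one reflection coefficient per junction between adjoining sections. The sign convention $r_k = (A_{k+1} - A_k)/(A_{k+1} + A_k)$, the function names and the cross-dimension values are assumptions made for illustration; the paper's loss terms and transfer function are not reproduced.

```python
def cross_dimension_to_area(d):
    """Area from cross dimension using the power law S = 2 * d**1.5 from the text."""
    return 2.0 * d ** 1.5


def reflection_coefficients(areas):
    """One coefficient per junction between adjoining tube sections.

    Assumed convention: r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k).
    """
    return [(a_next - a) / (a_next + a) for a, a_next in zip(areas, areas[1:])]


# Hypothetical cross dimensions for 30 uniform-length sections (not measured data):
# a tract that widens smoothly from the glottis end to the lip end.
dims = [1.0 + 0.5 * (k / 29) for k in range(30)]
areas = [cross_dimension_to_area(d) for d in dims]
refl = reflection_coefficients(areas)
```

With 30 sections there are 29 junctions, and a perfectly uniform tube gives all-zero coefficients, which is a quick sanity check on any implementation of this step.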
Estimation of Articulatory Parameters

It is well known that the vocal tract shape is not uniquely determined from the spectral characteristics of the speech wave without additional constraints in terms of the speech production process. Such constraints must be considered from the physiological, phonological and personality points of view. The constraints can be reflected in the articulatory model in two ways. One is in the physical dimensions of the articulatory organs and in the component vectors $V_i$ and $O_k$. The other is the manner of the control of the articulatory parameters. The latter is considered only a little in this study.

It is clear that the number of the articulatory parameters is much smaller than the number of cylindrical tubes needed to describe the vocal tract shape, and this fact makes the estimation easy. Therefore, the vocal tract shape is determined by estimating these articulatory parameters. The present model is useful for the estimation from the speech wave, because it is constructed not only to describe the articulatory state strictly but also to be directly related to the variation of the vocal tract shape. However, there remains a possibility of a freedom in the articulatory parameters for a certain region of the spectral characteristics of a speech sound. Therefore, it is desirable to apply the cooperative relation between the articulators in static and dynamic senses to avoid such a freedom.

The time constants of the articulatory motion are large compared with the sound propagation phenomena, and in the case of the articulation of vowels, the power spectral density of the speech sound is considered to be stationary in a small interval. Therefore, the static correspondence between the articulatory parameters and the acoustic ones is very important. The relationship between the articulatory parameters and the acoustic features is formulated in nonlinear regression as,

where $C_y$ and $C$ are the covariance matrices, and the regression yields the estimated value of the articulatory parameters.

Four variables are used as the articulatory parameters, namely the jaw opening, the weighting coefficients of the principal components of the tongue contour $a_1$ and $a_2$, and the lip protrusion $L_p$, which were introduced in the preceding section. As the acoustic features, the first two formant frequencies $F_1$ and $F_2$ are used, because they are the most significant features closely related to the vocal tract shape for vowel-like sounds. The data from which the regression coefficients are determined consist of samples distributed around the five Japanese vowels and real ones obtained from the X-ray pictures, and the total number of samples is 300. By utilizing the real articulatory data for the estimation, some cooperative relation between the articulators is included in the regression coefficients. The formant frequencies are calculated according to the algorithm presented in the preceding section.

The estimated result for the synthesized continuous speech /aiueo/ is shown in Fig. 4 in comparison with the original articulatory parameters. Solid lines show the original articulatory motion. The first and the second formant frequencies at every moment are calculated for this articulatory movement, and conversely the articulatory parameters are estimated by Eq. (11) for those formant frequencies. The estimated values agree with the original ones except for a slight deviation in the tongue parameters $a_1$ and $a_2$. This result means that if the model is fitted to the speaker, the estimation by the nonlinear regression method would be successful.

The result for the real speech data /aiueo/ is shown in Fig. 5. The speaker is different from the person whose data were used to construct the model. Although the estimated values cannot be compared with the true ones because the X-ray data were not taken for this utterance, they convey the typical nature of each vowel. For example, vowel [u] is characterized by the most positive value of $a_2$ and the lip protrusion, and is clearly distinguished from [o] by the differences of the jaw opening and the tongue parameter. Front vowel [i] is characterized by the most negative value of $a_1$ and distinguished from [e] by the difference of the degree of the jaw opening.

In Fig. 4 and Fig. 5, the marks which indicate the estimated values of the formant frequencies mean the calculated formant values corresponding to the estimated articulatory parameters. It is seen that the correspondence in the formant domain is very good. Nevertheless, there are slight deviations in the articulatory parameters in Fig. 4. This indicates that the first two formant frequencies are not sufficient to decide precisely the articulatory parameters in some regions.

From a practical point of view, the outputs of a filter bank are more convenient than the formant frequencies as a set of acoustic parameters, because the calculation of the formant frequencies is not an easy matter. The estimation using the output of the filter bank was therefore tried. The output signals of the filter bank were reduced to 3 or 4 components by the principal components analysis, and those main components were used as the acoustic parameters in the nonlinear regression. The result is shown in Fig. 6 for the synthesized voice /aiueo/. Compared with Fig. 4, the accuracy is almost the same. The original and the estimated spectral patterns are shown in Fig. 7. Formant frequencies from first to fourth are in good agreement. It may be concluded that the articulatory parameters estimated by the nonlinear regression method can be employed as a feature vector for speech recognition.
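The idea of estimating an articulatory parameter from acoustic features by nonlinear regression can be sketched as follows. This is not the paper's estimator, whose equations did not survive extraction. The sketch assumes a second-order polynomial expansion of two acoustic features and an ordinary least-squares fit through the normal equations. The training grid, the relation `true_param` and all function names are hypothetical, chosen only so that the example is self-checking.

```python
def quad_features(f1, f2):
    """Second-order polynomial expansion of two acoustic features (assumed form)."""
    return [1.0, f1, f2, f1 * f1, f1 * f2, f2 * f2]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit(samples, targets):
    """Least-squares regression coefficients via the normal equations."""
    phi = [quad_features(f1, f2) for f1, f2 in samples]
    k = len(phi[0])
    gram = [[sum(p[i] * p[j] for p in phi) for j in range(k)] for i in range(k)]
    rhs = [sum(p[i] * t for p, t in zip(phi, targets)) for i in range(k)]
    return solve_linear(gram, rhs)


def predict(coef, f1, f2):
    """Estimated articulatory parameter for one pair of acoustic features."""
    return sum(c * x for c, x in zip(coef, quad_features(f1, f2)))


def true_param(f1, f2):
    """Hypothetical articulatory parameter as a function of F1, F2 in kHz."""
    return 0.5 - 1.2 * f1 + 0.8 * f2 + 0.3 * f1 * f2


# Synthetic training samples on a grid of (F1, F2) values in kHz.
grid = [(0.3 + 0.1 * i, 0.8 + 0.3 * j) for i in range(6) for j in range(6)]
coef = fit(grid, [true_param(f1, f2) for f1, f2 in grid])
estimate = predict(coef, 0.55, 1.45)
```

Because the synthetic relation lies inside the polynomial family, the fit recovers it almost exactly. With real data, as the text notes, two formants may leave some articulatory parameters underdetermined in certain regions.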
Sentence Recognition Algorithm

Formulation of the Sentence Recognition Problem

In this section the sentence recognition problem will be given a mathematical formulation. Sentence structures mean the categorization of the types of sentences according to their syntactic and semantic contents. The method of setting the sentence structures depends upon the scale and the complexity of the problem. In this study, from the practical viewpoint, it is assumed that the number of words is not so large and that the language can be described by a context-free grammar (C.F.G.). The construction of the C.F.G. for the given problem becomes important. It may be difficult to find a general procedure. However, if an appropriate restriction on the speaker is settled, the categorization of the sentences is not so hard in a small scale problem.

First, a set of parts of speech $W_i$ ($i = 1, \ldots, m$) is introduced. The concept of a part of speech may be different from the linguistic one. It is defined so as to include its meaning in addition to its role in a sentence. The j-th word of the i-th part of speech is denoted by $w_j^i$. It may happen that a word is registered in two or more parts. The total vocabulary of the system is the union of all the parts of speech.

Second, sentence structures are described by a context-free grammar which gives the arrangement of $W_i$ in a sentence. The grammar consists of a set of non-terminal symbols, a set of terminal symbols $V_t$, an initial symbol $z_a$ and a set of production rules. The set $V_t$ is the set of parts of speech. The initial symbol $z_a$ means that the machine is in the a-th state, and the states of the machine form a finite set. The grammar produces the sentence structures which can appear under the a-th state. It is assumed that the grammar is not ambiguous. Therefore one sentence structure corresponds to only one leftmost derivation and is described by the sequence of the production rules used. The set of the sentences which is generated by a sequence of the production rules is denoted by

where $L_\beta$ shows the length of the sentence. An element means the $\gamma$-th sentence with the sentence structure which may arise under the a-th state, and can be expressed as,

where one index indicates the part of speech of the h-th word in the sentence and a further index addresses a word in $W_i$. The fact that a sequence of words has meaning is regarded as the sequence being one possible element of this set. The purpose of introducing the sentence structure is that the strong mutual dependence of words in a sentence is absorbed in the sentence structure, so that the words become nearly statistically independent within a given structure. The operation of the machine can be shown as in Fig. 8, which is an example of giving orders to a robot in a dialogue to move forward or turn in an assigned manner.

The first term of Eq. (19) comes from the word recognition, and the second terms of Eq. (18) and Eq. (19) carry information on the context and the situation. Then the sentence recognition problem
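The role of sentence structures in the formulation above can be illustrated with a minimal sketch. It assumes that the utterance has already been segmented into words and that a word recognizer supplies a likelihood for each candidate word in each segment. Under those assumptions, and treating words as independent within a structure, the Bayesian choice reduces to picking the best admissible word per position and adding the log prior of the structure. The paper's dynamic programming is not reproduced here. The vocabulary, the structures, their priors and the function name `recognize` are all hypothetical.

```python
import math

# Hypothetical parts of speech and their words (a robot-command vocabulary).
PARTS = {
    "VERB": ["move", "turn"],
    "DIR": ["forward", "left", "right"],
    "NUM": ["one", "two"],
}

# Hypothetical sentence structures allowed in one machine state, with priors.
STRUCTURES = {
    ("VERB", "DIR"): 0.6,
    ("VERB", "DIR", "NUM"): 0.4,
}


def recognize(segments, structures=STRUCTURES, parts=PARTS):
    """Return (log score, words) of the best sentence.

    segments[h][w] is the assumed likelihood of the h-th segment given word w.
    """
    best_score, best_words = -math.inf, None
    for structure, prior in structures.items():
        if len(structure) != len(segments):
            continue  # this structure cannot explain the observed word count
        score, words = math.log(prior), []
        for pos, seg in zip(structure, segments):
            # Only words of the required part of speech are admissible here.
            word = max(parts[pos], key=lambda w: seg.get(w, 0.0))
            likelihood = seg.get(word, 0.0)
            if likelihood <= 0.0:
                score = -math.inf
                break
            score += math.log(likelihood)
            words.append(word)
        if score > best_score:
            best_score, best_words = score, words
    return best_score, best_words


# The second segment sounds most like "two", but only a direction is allowed there.
observed = [
    {"turn": 0.6, "move": 0.3},
    {"two": 0.5, "left": 0.3, "right": 0.1},
]
score, words = recognize(observed)
```

In this example the structure constraint overrides the acoustically best word in the second position, which is the effect the text attributes to restricting the sentences that may appear under a state.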
An important problem arises in the above formulation. That is the aberration of the scene recognition between the speaker and the machine. The speaker does not know the state of the machine, or the speaker utters by mistake a sentence that should not be permitted in that situation. These phenomena often occur in actual conversation. Particularly in the case that the machine has made a misrecognition and gone to a state unexpected by the speaker, it is difficult for the speaker to take a suitable action. In such an aberration condition the above algorithm cannot work satisfactorily. This phenomenon always appears when the relation between the speaker and the receiver is made tight to improve the recognition score.

Conclusion

In this study two important parts of the speech understanding system were considered. The feature extraction method that utilizes the physiological and the phonological constraints was proposed. It will be effective for speech recognition by improving the word or phoneme recognition score. Recently, several attempts have been made to estimate the cross-sectional area function of the vocal tract using the state space expression of the acoustic wave in the vocal tract. However, in those frameworks, considering the various constraints of the articulatory motion would make the problem too complicated. If the dynamics is taken into account in any sense, the dynamic character of the articulatory motion should be considered first, and the state space expression of the acoustic level may be ignored because of the difference in their time constants.

The sentence recognition algorithm is well formulated and very compact. It therefore makes the real-time operation of speech understanding easy without using special hardware.

The two methods were discussed separately in this paper. But, if a suitable algorithm to decide phonemes from the extracted feature is added, the system will be completed.

The author thanks Dr. H. Fujisawa and Mr. M. Honda for their cooperation. This research was partly supported by Kawakami Memorial Foundation.

References

1) D. R. Reddy et al.: A Model and System for Machine Recognition of Speech, IEEE Trans. AU-21, June 1973.
2) M. Kohda, K. Shikano: Speech Recognition of Arithmetic Statements Utilizing Syntactic Information, IECE Japan, Report EA 73-54, March 1974.
3) W. A. Woods: Motivation and Overview of SPEECHLIS: An Experimental Prototype for Speech Understanding Research, IEEE Trans. ASSP-23, Feb. 1975.
4) V. R. Lesser et al.: Organization of Hearsay II Speech Understanding System, IEEE Trans. ASSP-23, Feb. 1975.
5) J. K. Baker: The DRAGON System - An Overview, IEEE Trans. ASSP-23, Feb. 1975.
6) T. Nakajima et al.: Estimation of Vocal Tract Area Functions by Adaptive Inverse Filtering Methods, Bull. of Electrotechnical Lab. Japan, Vol. 37, No. 4, 1973.
7) H. Wakita: Direct Estimation of the Vocal Tract Shape by Inverse Filtering of Acoustic Speech Waveforms, IEEE Trans. Vol. AU-21, No. 5, Oct. 1973.
8) C. Coker: Speech Synthesis with a Parametric Articulatory Model, Speech Sympo., Kyoto, 1968.
9) P. Mermelstein: Articulatory Model for the Study of Speech Production, J.A.S.A., No. 53, 1973.
10) S. Hiki, K. Niyata: Articulatory Model for Vowel Production, Speech Data Processing, Tokyo Univ. Press, 1973.
11) B. Lindblom, J. Sundberg: Acoustic Consequences of Lip, Tongue, Jaw and Larynx Movement, J.A.S.A., No. 50, 1971.
12) J. S. Perkell: Physiology of Speech Production, Monograph 53, MIT Press, 1969.
13) R. Houde: A Study of Tongue Body Motion during Selected Speech Sounds, SCRL Mon., 2, 1968.
14) K. Shirai, H. Fujisawa, Y. Sakai: Ear and Voice of the Wabot, Bull. Sci. & Eng. Research Lab. Waseda Univ., No. 62, 1973.
15) K. Shirai, H. Fujisawa: An Algorithm for Spoken Sentence Recognition and Its Application to the Speech Input-Output System, IEEE Trans., Vol. SMC-4, No. 5, Sept. 1974.
16) A. Newell et al.: Speech Understanding Systems, North-Holland, 1973.