Advanced Robotics 25 (2011) 2189–2206

brill.nl/ar

Full paper

Grounding of Word Meanings in Latent Dirichlet Allocation-Based Multimodal Concepts

Tomoaki Nakamura a,*, Takaya Araki a, Takayuki Nagai a and Naoto Iwahashi b

a Department of Electronic Engineering, University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585, Japan
b NICT Knowledge Creating Communication Research Center, 2-2-2 Hikaridai, Seika-cho, Souraku-gun, Kyoto 619-0288, Japan

Received 27 December 2010; accepted 22 February 2011

Abstract
In this paper we propose a latent Dirichlet allocation (LDA)-based framework for multimodal categorization and word grounding by robots. The robot uses its physical embodiment to grasp and observe an object from various viewpoints, as well as to listen to the sound it makes during the observation period. This multimodal information is used for categorizing and forming multimodal concepts using multimodal LDA. At the same time, the words acquired during the observation period are connected to the related concepts, which are represented by the multimodal LDA. We also provide a relevance measure that encodes the degree of connection between words and modalities. The proposed algorithm is implemented on a robot platform and some experiments are carried out to evaluate the algorithm. We also demonstrate simple conversation between a user and the robot based on the learned model.

Keywords: Multimodal categorization, concept formation, symbol grounding, latent Dirichlet allocation

1. Introduction

It is well known that the capability of categorizing objects is essential for human-like intelligence [1–3]. For example, we can eat a tomato because we can infer, through its category, that it is edible [3]. The resulting categories form the basis of our concepts, and each word works as a label for a specific category. Categorization is therefore also very important for language understanding. These facts motivate us to pursue categorization abilities and a symbol grounding algorithm for intelligent robots.

* To whom correspondence should be addressed. E-mail: [email protected]

© Koninklijke Brill NV, Leiden and The Robotics Society of Japan, 2011

DOI:10.1163/016918611X595035

In this paper we examine an algorithm for the grounding of word meanings. To achieve this goal we take a two-step procedure: categorization, and the mapping of categories to linguistic labels. Categorization can be considered a problem of unsupervised learning. Unsupervised learning of objects using only images has been studied extensively in the field of computer vision [4–8]. Such an unsupervised framework enhances the flexibility of an object recognition system in various environments. However, it is obvious that object categories do not depend only on visual information, but also on various other types of information. In Ref. [9], we proposed multimodal categorization based on probabilistic latent semantic analysis (pLSA) [10]. This multimodal categorization has been shown to categorize objects in the same way as humans do, without any supervision. Although the samples (objects) used in the experiments were limited, the robot could recognize the category of an unseen object with almost 100% accuracy. The real value of the method is, rather, its ability to infer properties of an object from limited observations. For example, the robot can stochastically infer the sound that an object makes and/or its hardness from visual information alone. This kind of inference is required in day-to-day situations. However, pLSA requires heuristics to deal with novel input data, since it is a point estimation [9]. To solve this problem, latent Dirichlet allocation (LDA) [11] has been extended to multimodal LDA for multimodal categorization [12]. In contrast to multimodal pLSA, multimodal LDA requires no heuristics, since it is based on Bayesian learning. We utilize multimodal LDA for constructing concepts as the first step.

In the second step, words are associated with the corresponding categories. The most important assumption of this paper is that the robot has a speech recognition system. If this is not the case, a word acquisition method (e.g., Refs [13, 14]) must be used for constructing a lexicon. Since we assume that the robot has a lexicon, the problem considered here is purely a correspondence problem. Multimodal LDA is also involved in the proposed framework; hence, the robot can stochastically recall words from observations and vice versa. We also consider the problem of measuring the connection between words and modalities. For example, the word 'soft' represents haptic information; hence, a strong connection between the word and the haptic channel can be observed by the proposed measure. This measure is important when the robot describes an object property regarding a certain modality.

Related works include unsupervised categorization of objects using visual information [4–8], as mentioned earlier. Language acquisition is an active research field [13, 15] and is closely related to this paper. However, multimodal categorization is not involved in these works. In Ref. [16], the same problem (i.e., categorization and grounding) has been tackled. However, Ref. [16] does not utilize multimodal categorization. Moreover, it considers only object names as words to be connected, while this paper deals with adjectives as well as nouns.

This paper is organized as follows. In Section 2, multimodal LDA is described as the basis for the proposed symbol grounding model. In Section 3, we explain the word

grounding method, and then a degree of connection between words and modalities is introduced. The validity of the proposed method is shown through experiments in Section 4. Finally, Section 5 summarizes this paper.

2. Multimodal Categorization Using LDA

The proposed method consists of two steps. In this section, LDA-based multimodal categorization is described as the first step.

2.1. Overview of LDA-Based Multimodal Categorization

A robot can grasp an object and observe it from different viewpoints. During the observation, the identity of the object is guaranteed. This fact motivates us to use the occurrence frequencies of visual, audio and haptic data, which are collected over the observation period of a single object. This is nothing short of the 'bag of words' model when each feature is considered as a 'word'. This idea makes it possible for robots to discover object categories using multimodal information.

The robot is equipped with cameras, microphones, an arm and a hand with pressure sensors. Therefore, the robot can actually grasp each object and gather information such as a sequence of images and signals from the microphones and pressure sensors. The objects are then categorized based on similarities in their appearance, the sounds they make in motion and their hardness.

The graphical model of the proposed LDA-based multimodal categorization is shown in Fig. 1. In Fig. 1, w^v, w^a and w^h represent visual, audio and haptic information, respectively. They are assumed to be drawn from multinomial distributions parameterized by β^v, β^a and β^h, respectively. z denotes the category and is chosen from a multinomial distribution parameterized by θ, which is drawn from the Dirichlet prior distribution Dir(α).
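To make the generative story of Fig. 1 concrete, the following is a minimal sketch of how one object's multimodal 'bags of features' would be sampled under this model. The codebook sizes (500 visual, 50 audio and five haptic clusters) come from Section 2.2 below; the number of categories and the per-object feature counts are illustrative assumptions, and the code is ours rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 8                                                   # number of categories (Section 4 uses 8)
V = {"visual": 500, "audio": 50, "haptic": 5}           # codebook sizes from Section 2.2
alpha = np.full(K, 1.0)                                 # Dirichlet prior over category proportions
beta = {m: rng.dirichlet(np.ones(dim), size=K) for m, dim in V.items()}   # p(w^m | z)

def generate_object(n_feats={"visual": 1000, "audio": 200, "haptic": 10}):
    """Sample one object's per-modality feature histograms from the Fig. 1 model."""
    theta = rng.dirichlet(alpha)                        # θ ~ Dir(α): this object's category mixture
    hists = {}
    for m, dim in V.items():
        counts = np.zeros(dim, dtype=int)
        for _ in range(n_feats[m]):
            z = rng.choice(K, p=theta)                  # z ~ Multinomial(θ)
            w = rng.choice(dim, p=beta[m][z])           # w^m ~ Multinomial(β^m_z)
            counts[w] += 1
        hists[m] = counts                               # 'bag of features' histogram for modality m
    return hists

hists = generate_object()
```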

Figure 1. Graphical model for multimodal LDA.

2.2. Signal Processing for Multimodal Categorization

Each signal is preprocessed for multimodal categorization as follows.

2.2.1. Visual Information
The robot has a stereo camera attached to its head, and images are grabbed while it grasps and observes an object. These images are used as visual information (100 images are used in the later experiment). For each image, 128-dimensional scale-invariant feature transform (SIFT) descriptors [17] are computed. Each feature vector is then vector quantized using a codebook with 500 clusters. To cope with occlusion by the robot's own hand, images of the robot hand are collected in advance and their features (codebook indices) are computed. These features are always removed from the visual information.

2.2.2. Audio Information
As for audio information, sound is recorded while the robot grasps and shakes an object. The audio signal is divided into frames, which are then transformed into 13-dimensional Mel-frequency cepstrum coefficients (MFCCs) as feature vectors. Finally, the feature vectors are vector quantized using a codebook with 50 clusters.

2.2.3. Haptic Information
Haptic information is obtained through the two-finger robotic hand with four pressure sensors. When the robot grasps an object, the sum of the digitized voltages from these pressure sensors, which encodes the hardness of the object, is obtained. During the two-finger grasp, the robot presses the object with two fingers, and the change in the angle between the base and the left finger is measured. This change in angle can be considered as the 'softness' of the object. Thus, we obtain two-dimensional feature vectors as haptic information. The feature vectors are finally vector quantized using a codebook with five clusters.

2.3. Multimodal LDA

Categorization is carried out as parameter estimation of the graphical model in Fig. 1 using the multimodal information observed by the robot. The parameters are estimated so that the log-likelihood of the multimodal information under the given model is maximized. Since direct computation of the log-likelihood is intractable, we apply variational inference, which provides a tractable lower bound on the log-likelihood using Jensen's inequality [18]. For given multimodal information w^v, w^a and w^h, the log-likelihood can be written as:

log p(w^v, w^a, w^h | α, β^v, β^a, β^h)
  = log ∫ Σ_z [ p(θ, z, w^v, w^a, w^h | α, β^v, β^a, β^h) / q(θ, z | γ, φ^v, φ^a, φ^h) ] q(θ, z | γ, φ^v, φ^a, φ^h) dθ
  ≥ ∫ Σ_z q(θ, z | γ, φ^v, φ^a, φ^h) log p(θ, z, w^v, w^a, w^h | α, β^v, β^a, β^h) dθ
    − ∫ Σ_z q(θ, z | γ, φ^v, φ^a, φ^h) log q(θ, z | γ, φ^v, φ^a, φ^h) dθ,   (1)

where q(θ, z | γ, φ^v, φ^a, φ^h) is the variational distribution, which approximates p(θ, z | w^v, w^a, w^h, α, β^v, β^a, β^h) and is assumed to be a product of independent terms. φ^∗ denotes the variational parameter of the multinomial distribution from which z is sampled, and γ is the variational parameter of the Dirichlet distribution from which the multinomial parameter θ is drawn. The variational expectation-maximization (EM) algorithm for the proposed multimodal LDA is as follows.

[E-step] The following updates are repeated until convergence for each object d:

φ^v_{dw^v k} ∝ β^v_{k w^v} exp( ψ(γ_{dk}) − ψ(Σ_{k'} γ_{dk'}) )   (2)
φ^a_{dw^a k} ∝ β^a_{k w^a} exp( ψ(γ_{dk}) − ψ(Σ_{k'} γ_{dk'}) )   (3)
φ^h_{dw^h k} ∝ β^h_{k w^h} exp( ψ(γ_{dk}) − ψ(Σ_{k'} γ_{dk'}) )   (4)
γ_{dk} = α_k + Σ_{w^v} n_{dw^v} φ^v_{dw^v k} + Σ_{w^a} n_{dw^a} φ^a_{dw^a k} + Σ_{w^h} n_{dw^h} φ^h_{dw^h k}   (5)

[M-step]

β^v_{k w^v} ∝ Σ_d n_{dw^v} φ^v_{dw^v k}   (6)
β^a_{k w^a} ∝ Σ_d n_{dw^a} φ^a_{dw^a k}   (7)
β^h_{k w^h} ∝ Σ_d n_{dw^h} φ^h_{dw^h k}   (8)
∂L/∂α_k = N ( ψ(Σ_{k'} α_{k'}) − ψ(α_k) ) + Σ_d ( ψ(γ_{dk}) − ψ(Σ_{k'} γ_{dk'}) ),   (9)

where d (= 1, . . . , N), k and n_{dw^∗} represent the index of the object, the index of the category and the occurrence count of a feature w^∗ for object d, respectively. N represents the number of objects. α_k is computed using the Newton–Raphson method so that the log-likelihood L is maximized.
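The E-step and M-step above translate almost directly into code. Below is a minimal sketch of one variational EM iteration (the E-step, eqs (2)–(5), and the β part of the M-step, eqs (6)–(8)), assuming the per-object count histograms from Section 2.2 are stacked into matrices n[m] of shape (number of objects × codebook size of modality m). α is kept fixed here rather than updated by Newton–Raphson using (9), and all function and variable names are our own assumptions, not the paper's.

```python
import numpy as np
from scipy.special import digamma

def e_step(n, beta, alpha, n_iter=50):
    """Variational E-step: update phi (eqs (2)-(4)) and gamma (eq (5)) for every object.
    n[m]: (N, V_m) count matrix; beta[m]: (K, V_m) per-category feature distribution."""
    N = next(iter(n.values())).shape[0]
    gamma = np.tile(alpha, (N, 1)) + 1.0                      # simple initialization
    phi = {}
    for _ in range(n_iter):
        # exp( psi(gamma_dk) - psi(sum_k' gamma_dk') ), shape (N, K)
        e_log_theta = np.exp(digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True)))
        gamma = np.tile(alpha, (N, 1))
        for m in n:
            # phi[m][d, w, k] ∝ beta[m][k, w] * exp(E[log theta_dk])   (eqs (2)-(4))
            p = beta[m].T[None, :, :] * e_log_theta[:, None, :]
            p /= p.sum(axis=2, keepdims=True)
            phi[m] = p
            gamma += np.einsum('dw,dwk->dk', n[m], p)         # eq (5): count-weighted phi
    return phi, gamma

def m_step(n, phi):
    """Variational M-step: re-estimate beta (eqs (6)-(8)). The alpha update would use the
    gradient in eq (9) with Newton-Raphson; it is omitted in this sketch."""
    beta = {}
    for m in n:
        b = np.einsum('dw,dwk->kw', n[m], phi[m])             # sum_d n_dw * phi_dwk
        beta[m] = (b + 1e-10) / (b + 1e-10).sum(axis=1, keepdims=True)   # keep strictly positive
    return beta
```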

2.4. Category Recognition for Unseen Objects

By using the learned model, the category of an unseen object can be recognized. For given multimodal information w^v_obs, w^a_obs and w^h_obs of the unseen object, its category can be determined as the z that maximizes p(z | w^v_obs, w^a_obs, w^h_obs). Therefore, the category can be found by computing:

ẑ = argmax_z p(z | w^v_obs, w^a_obs, w^h_obs)
  = argmax_z ∫ p(z | θ) p(θ | w^v_obs, w^a_obs, w^h_obs) dθ,   (10)

where p(θ | w^v_obs, w^a_obs, w^h_obs) is determined by recalculating α using the variational EM algorithm described above, while the learned β^v, β^a and β^h are kept fixed.

2.5. Inference Among Modalities

From visual information, we can infer the hardness of an object, whether it makes a sound or not, and so on. Such inference among modalities is a very important capability for robots as well as for us humans. Let us think about the inference of auditory information w^a only from the observed visual information w^v_obs:

p(w^a | w^v_obs) = Σ_z ∫ p(w^a | z) p(z | θ) p(θ | w^v_obs) dθ.   (11)

In the above equation, p(θ | w^v_obs) should be recomputed in the same way as before. It should be noted that this recomputation implies that the inference is carried out through categories, since the probability of generating category z from w^v_obs is recomputed. Furthermore, one can see that (11) performs Bayesian inference, which is the essential difference between LDA and pLSA.
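A minimal sketch of how (10) and (11) could be computed with the quantities from Section 2.3: the Dirichlet posterior Dir(γ_obs), obtained by re-running the E-step on the observed histograms with β fixed, stands in for p(θ | w_obs), so the integrals reduce to expectations under that Dirichlet. The function names, and the reuse of `e_step` from the earlier sketch, are our own assumptions.

```python
import numpy as np

def infer_gamma(n_obs, beta, alpha):
    """Re-run the variational E-step on one object to get Dir(gamma) approximating p(theta | w_obs)."""
    n_single = {m: h[None, :] for m, h in n_obs.items()}     # add an object axis
    _, gamma = e_step(n_single, beta, alpha)                  # e_step from the earlier EM sketch
    return gamma[0]

def recognize_category(n_obs, beta, alpha):
    """Eq (10): argmax_z of E[p(z | theta)] under the approximate posterior over theta."""
    gamma = infer_gamma(n_obs, beta, alpha)
    p_z = gamma / gamma.sum()                                 # E[theta_z] under Dir(gamma)
    return int(np.argmax(p_z))

def infer_modality(n_obs_visual, beta, alpha, target="audio"):
    """Eq (11): p(w^a | w^v_obs) = sum_z p(w^a | z) * E[theta_z | w^v_obs]."""
    gamma = infer_gamma({"visual": n_obs_visual}, beta, alpha)
    p_z = gamma / gamma.sum()
    return p_z @ beta[target]                                 # distribution over the target codebook
```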

Figure 2. Graphical model for the proposed word grounding.

3. Grounding of Word Meanings in Multimodal Concepts

In the foregoing section, we discussed the algorithm for forming multimodal concepts using LDA. In this section, we propose a method for grounding word meanings in the multimodal concepts formed by multimodal LDA. The multimodal LDA framework is also involved here, as shown in Fig. 2. Hence, symbol grounding becomes the problem of the following parameter estimation.

3.1. Parameter Estimation

Figure 2 shows the proposed graphical model for the grounding of word meanings in multimodal concepts. The part in the dashed box has been learned by the multimodal categorization described in the previous section. In this model, w^w denotes word information, which is represented by the 'bag of words' model. Hence, similar to the perceptual information, word information is modeled by occurrence frequency and assumed to be chosen from a multinomial distribution parameterized by β^w. First, the robot collects sentences uttered by a user during the observation period of multimodal information. Continuous speech recognition and morphological analysis are utilized to convert the speech signals into sequences of words. Only nouns and adjectives are extracted from these sequences of words, and each is represented as a numerical ID. Finally, the robot obtains a set of words corresponding to each object. Estimation of the parameter β^w is straightforward, since the category z is not a latent variable at this point:

β^w_{zw} = n_{w,z} / Σ_{w'} n_{w',z},   (12)

where n_{w,z} represents the occurrence count of word w for category z.

3.2. Inference of Words

Now all of the parameters in Fig. 2 can be estimated. This means that the robot is ready to infer word meanings (i.e., the robot can recall highly probable perceptual information from input words). Conversely, the robot can also describe input multimodal perceptual information (e.g., a scene) using suitable words. These processes are realized by computing p(w^∗ | w^w_obs) and p(w^w | w^∗_obs) in the same way as in (11). Here, let us focus our attention on the example of inferring words w^w only from visual information w^v_obs. Such inference can be carried out as:

p(w^w | w^v_obs) = Σ_z ∫ p(w^w | z) p(z | θ) p(θ | w^v_obs) dθ.   (13)

We get p(θ | w^v_obs) by recomputing α using the variational EM algorithm as before.
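A minimal sketch of (12) and (13) under the same assumptions as the previous sketches: `word_counts` is an (objects × vocabulary) matrix of noun/adjective counts, `gamma` holds the per-object Dirichlet parameters from multimodal LDA, and `infer_gamma` is the hypothetical helper defined above. Treating each object's category as its most probable one is our simplification of "z is not a latent variable at this point".

```python
import numpy as np

def estimate_beta_w(word_counts, gamma, K):
    """Eq (12): beta^w_{zw} from word-object co-occurrences and the objects' categories.
    word_counts: (N, W) noun/adjective counts per object; gamma: (N, K) from multimodal LDA."""
    z_hat = gamma.argmax(axis=1)                         # treat each object's category as known
    n_wz = np.zeros((K, word_counts.shape[1]))
    for d, z in enumerate(z_hat):
        n_wz[z] += word_counts[d]
    n_wz += 1e-10                                        # avoid division by zero for empty categories
    return n_wz / n_wz.sum(axis=1, keepdims=True)

def infer_words(n_obs_visual, beta, beta_w, alpha, top=3):
    """Eq (13): p(w^w | w^v_obs) = sum_z p(w^w | z) * E[theta_z | w^v_obs]; return top word IDs."""
    gamma = infer_gamma({"visual": n_obs_visual}, beta, alpha)
    p_z = gamma / gamma.sum()
    p_words = p_z @ beta_w                               # mix the per-category word distributions
    return np.argsort(p_words)[::-1][:top]
```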

3.3. Degree of Connection Between Word and Modality

There are words that represent rather abstract concepts, such as 'round', 'soft', 'shape' and so on. Many of these words (e.g., adjectives) are strongly connected to a particular modality. For example, the word 'shape' is connected to visual information. If the robot is aware of the connection between a word and a modality, it can pay attention to the appropriate modality when a certain word is input. Therefore, the problem here is to measure the degree of connection between words and modalities. To do this, we pay attention to the fact that the features represented by such a word are shared among the relevant categories. For instance, the word 'hard' is connected to the haptic modality. Hence, similar haptic information would appear in the categories that are connected to 'hard'. On the other hand, audio-visual information is not shared among these categories.

For this reason, we propose a relevance measure C_m(w^w_obs), which encodes the degree of connection between the word w^w_obs and modality m (m ∈ {audio, visual, haptic}), as:

C_m(w^w_obs) = Σ_z p(z | w^w_obs) Σ_{i=1}^{N_m} min( p(w̄^m_i | w^w_obs), β^m_{zi} )
             − (1/N_m) Σ_z Σ_{i=1}^{N_m} min( p(w̄^m_i | w^w_obs), β^m_{zi} ),   (14)

where w̄^m_i and N_m denote the ith component of modality m and the number of dimensions of modality m, respectively. Σ_i min(a_i, b_i) represents a similarity measure between a and b, called the intersection, which ranges from 0 to 1 when a and b are normalized to unity; the more similar a and b are, the closer the value is to unity. C_m can be interpreted as a difference between the average of the similarity measure over all categories and the weighted average of the similarity considering p(z | w^w_obs).
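A minimal sketch of (14) under the same assumptions as the earlier sketches: `p_z_given_word` is p(z | w^w_obs) (obtainable by running the E-step on the single input word, as in Section 3.2), `p_feat_given_word[m]` is the inferred distribution p(w̄^m | w^w_obs) over modality m's codebook, and the normalization of the second term follows our reconstruction of (14).

```python
import numpy as np

def relevance(p_z_given_word, p_feat_given_word, beta, modality):
    """Eq (14): weighted-minus-averaged histogram intersection between the word's inferred
    modality distribution and each category's distribution beta[modality][z]."""
    b = beta[modality]                                   # shape (K, N_m)
    p = p_feat_given_word[modality]                      # shape (N_m,), sums to 1
    inter = np.minimum(p[None, :], b).sum(axis=1)        # intersection with each category z
    weighted = np.dot(p_z_given_word, inter)             # weighted by p(z | w^w_obs)
    return weighted - inter.sum() / b.shape[1]           # minus the 1/N_m-scaled sum over z

# Usage sketch: compare a word's relevance across modalities
# (thresholds as in Section 4.5 would then decide which modalities the word is tied to).
# scores = {m: relevance(p_z, p_feats, beta, m) for m in ("visual", "audio", "haptic")}
```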

4. Experiment

The proposed algorithm has been implemented on the robot shown in Fig. 3. The robot consists of a 6-d.o.f. arm, a 4-d.o.f. hand and a 2-d.o.f. head. There are four pressure sensors on its left fingertip, as mentioned earlier. A microphone is mounted on the right fingertip as well to capture audio information when the robot grasps and shakes an object. Five experiments were carried out to evaluate the proposed algorithm using the robot. Before the tests, we asked eight volunteers to classify 50 toys according to their own criteria. Although the results differed from person to person, they had eight categories with 40 objects in common. These objects are shown in Fig. 4. In the following experiments, we use these 40 objects.

Figure 3. Robot platform used in the experiments.

Figure 4. Eight categories consisting of 40 toys. Category 1: animal rattle; category 2: maraca; category 3: hard rattle; category 4: tambourine; category 5: sandbox toy; category 6: rubber doll; category 7: plushie; category 8: ball.

Figure 5. Results of categorization: (a) hand categorization; (b) visual-only categorization; (c) audio-only categorization; (d) haptic-only categorization; (e) audio-visual categorization; (f) visual-haptic categorization; (g) audio-haptic categorization; (h) categorization using all modalities (audio, visual and haptic).

4.1. Results of Categorization

The results of categorization under various conditions are given in Fig. 5, where the horizontal and vertical axes indicate category and object indices, respectively; a white bar indicates that the object is classified into that category. Figure 5a shows the ground truth, which is the common result of hand categorization by the eight volunteers. Figure 5b–d shows visual-only categorization, audio-only categorization and haptic-only categorization, respectively. From Fig. 5b, one can see that visual-only categorization failed to classify sounding objects such as rattles and maracas. Moreover, the rubber dolls (category 6) were divided into two groups since they have a similar appearance to the plushies (category 7). Not surprisingly, the audio-only categorization generated a single class of objects that make no sound, while the sounding objects were over-segmented. The haptic-only categorization

gave no reasonable result. Figure 5e–g shows audio-visual categorization, visual-haptic categorization and audio-haptic categorization, respectively. Although some of these combinations improve the categorization result, the combination of two modalities is not powerful enough to categorize the objects perfectly. On the other hand, the categorization using all modalities gives the perfect result, as shown in Fig. 5h. These results clearly show that all three modalities are required for the object categorization.

4.2. Category Recognition for Unseen Objects

To evaluate the performance of category recognition for unseen objects, leave-one-out cross-validation was carried out using the above-mentioned 40 objects. All 40 objects were recognized correctly.

4.3. Inference of Words

We carried out an experiment to evaluate the proposed word grounding algorithm using the 40 objects shown in Fig. 4. It should be noted that this experiment was conducted in Japanese. One object was selected randomly from each category (i.e., eight objects in total). Three users were asked to describe each object with some words (nouns and adjectives) that represent the category. Therefore, eight of the 40 objects are associated with words at this point. A word histogram was generated for each object. Then, the learning process was carried out using the method in Section 3. In the test phase, an object was put in front of the robot and the three words with the highest p(w^w | w^v_obs) were inferred from visual information. In order to judge whether a word is suitable for the object, we asked the three users to give suitable words for each of the 32 remaining objects. If an inferred word is included in the set of words given by the users, then the word is considered suitable for describing the object. Finally, we calculate the accuracy of all inferred words for describing the object category. We compare the results for each user and the result of the case where all data (all words given by the three volunteers) were used for learning by the robot.

The words that were freely given by the users in this experiment are listed in Table 1. Figure 6 shows the word histograms for each user.

Table 1.
Words used in the experiment: big head, rubber, tambourine, maraca, handle, soft, hole, rubber doll, hard rattle, sound, hard, plain color, plushie, sponge, plastic, instrument, sandbox toy, long, animal rattle, flat, ball, round, animal

Figure 6. Histograms of words: (a) user 1; (b) user 2; (c) user 3; and (d) data of all users.

Figure 7. Results of word inference.

Although a slight difference among these histograms is observed, the words that describe specific categories are used in common. Figure 7 shows the accuracy of correctly inferred words for each case. Almost 80% of the inferred words are suitable for describing the object, regardless of the user who taught the words. Furthermore, the robot learned words using all three users' data. This result is also given in Fig. 7 and is comparable (77.5%) to the individual cases. From these results, we can confirm that the proposed model makes it possible to infer suitable words for objects through their categories. We emphasize that words are given to only a single object per category. This means that recollection of words for an unseen object is possible through the multimodal concepts represented by the proposed model. Figure 8 shows some examples of word inference. From Fig. 8, one can see that suitable adjectives are recalled as well as nouns that represent specific object categories. Almost all unsuccessful examples arise from recognition failure of the category due to similarity in appearance, as shown in Fig. 8.

4.4. Connection Between Word and Modality

Next, the degree of connection between a word and a modality, C_m, is calculated for each word according to (14). We divide the words into two groups: (i) words that are assigned to multiple categories, and (ii) words that are assigned to a single category. Figure 9 shows the degree of connection between word and modality for Group (i).

Figure 8. Examples of recalled words. Falsely recalled words are given in bold.

Figure 9. Degree of connection between the word and modality for Group (i). (a) Connection between the word and visual modality, (b) connection between the word and audio modality, (c) connection between the word and haptic modality.

First of all, words describing appearance (e.g., 'big head', 'round', 'long' and 'animal') obtain relatively high values of C_visual, which are obviously reasonable results. As for the connection between words and auditory information, auditory-related words (e.g., 'sound', 'instrument', etc.) give high C_audio. It can also be seen that high values of C_haptic are observed for the haptic-related words (e.g., 'soft', 'hard' and 'plastic').

Figure 10. Degree of connection between the word and modality for Group (ii). (a) Connection between the word and visual modality, (b) connection between the word and audio modality, and (c) connection between the word and haptic modality.

On the other hand, 'long' and 'sound' take high C_haptic values, which is against our intuition. This is because these words were assigned only to hard objects. This problem could be resolved if these words were also used to describe soft objects. Therefore, the robot needs considerable experience to learn the correct connections between words and modalities.

The results for Group (ii) are shown in Fig. 10. Many of the words in this group are used to represent a single category, such as maraca, sandbox toy and so forth. It can be clearly seen that these words have strong connections to both the visual and haptic modalities. Moreover, the words that are related to sounding objects have a strong connection to the auditory modality as well. However, some words failed to connect to the correct modality. For example, the words 'sponge' and 'rubber' should take high C_haptic and low C_visual, since they are concepts regarding the haptic modality. The words 'hole', 'handle', 'plain color' and 'flat' also wrongly have high C_haptic. The reason for these false connections is that these words were used only once, for describing a single object category. These words should be used for describing more than one object category so that the robot can learn the correct connections between words and modalities.

4.5. Dialog with the Robot

Finally, we realized simple conversation between a user and the robot through some objects. The robot has the grounded words and concepts described above.

Figure 11. Overview of the implemented LDA-based dialog system.

U1: The user shows a plushie to the robot. R1: (‘plushie’, ‘soft’, and ‘animal’ are recalled.) ‘This is a plushie’. U2: ‘What does this look like?’ R2: ‘This looks like animal’. U3: ‘How hard is this?’ R3: ‘This is soft’. U4: ‘How does this sound?’ R4: ‘Nothing’.

Figure 12. Example of conversation 1 (U: user; R: robot).

Figure 11 illustrates an overview of the implemented dialog system using multimodal LDA. The system can take multimodal information and/or the user's utterances as input. Nouns and adjectives are extracted from the utterances using morphological analysis. The robot then infers some words from the input using multimodal LDA. The input information is also used for behavior generation, such as grasping an object, facial expression, utterance generation and so on. For instance, the robot tries to grasp an object to obtain haptic information when the object is put in front of it. The robot's utterance is generated from a sentence template, which is selected according to the user's utterance, and the inferred words based on the multimodal LDA. We set a threshold for each C_m: C_visual = 0.08, C_audio = 0.2 and C_haptic = 0.2. Thus, the robot can judge whether a word is related to modality m or not according to C_m('word'). The robot responds to questions and pays attention to a certain modality according to keywords (e.g., 'look like'). The robot can then answer the question using a sentence template and recalled words, as sketched below.
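A minimal sketch of the response-selection logic just described, using the thresholds from the text; the keyword lists, sentence templates and function names are our illustrative assumptions, not the authors' implementation.

```python
# Thresholds from the text: a word counts as related to modality m if C_m(word) exceeds these.
THRESHOLDS = {"visual": 0.08, "audio": 0.2, "haptic": 0.2}

# Hypothetical keyword-to-modality map and sentence templates (illustrative only).
KEYWORD_TO_MODALITY = {"look like": "visual", "hard": "haptic", "sound": "audio"}
TEMPLATES = {"visual": "This looks like {}", "haptic": "This is {}", "audio": "This sounds like {}"}

def related_modalities(word, relevance):
    """relevance[word][m] holds C_m(word); return the set of modalities the word is tied to."""
    return {m for m, c in relevance[word].items() if c > THRESHOLDS[m]}

def answer(question, recalled_words, relevance):
    """Answer with a recalled word that is related only to the modality the question targets."""
    for keyword, modality in KEYWORD_TO_MODALITY.items():
        if keyword in question:
            for w in recalled_words:                        # words recalled via eq (13)
                if related_modalities(w, relevance) == {modality}:
                    return TEMPLATES[modality].format(w)
            return "Nothing"                                # no recalled word fits that modality
    return "This is a {}".format(recalled_words[0])         # default: state the inferred name
```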

U1: The user made a sound of the ‘animal rattle’ (the robot listened to it). R1: (‘tambourine’, ‘animal rattle’, and ‘sound’ are recalled.) ‘This is a tambourine’. U2: ‘This is soft’. R2: (‘animal rattle’, ‘tambourine’, and ‘sound’ are recalled.) ‘This is an animal rattle’.

Figure 13. Example of conversation 2 (U: user; R: robot).

U1: The user shows a maraca to the robot. R1: (‘hard’, ‘long’, and ‘sandbox toy’ are recalled.) ‘This is a sandbox toy’. U2: (Do nothing.) R2: The robot grasps the object and shakes it. R3: (‘hard’, ‘sound’, and ‘maraca’ are recalled.) ‘This is a maraca’.

Figure 14. Example of conversation 3 (U: user; R: robot).

U1: ‘This is an instrument’. R1: (‘instrument’, ‘tambourine’, and ‘maraca’ are recalled.) ‘This is a tambourine’. U2: ‘This is long’. R2: (‘instrument’, and ‘maraca’, and ‘long’ are recalled.) ‘This is a maraca’.

Figure 15. Example of conversation 4 (U: user; R: robot).

Figure 12 shows an actual example of the conversation. In this example, the robot first saw a plushie and recalled some words ('plushie', 'soft' and 'animal'). Since the word 'plushie' has very large C_visual and C_haptic, the robot inferred that the name of the object is 'plushie'. When the user asks a question regarding appearance, the robot is designed to answer with a recalled word that is related only to visual information. In the example, the robot answers 'This looks like animal', since 'animal' is the word that is related only to visual information. When the user asked about the sound of the object, the robot answered 'Nothing' since no auditory-related word was recalled.

In Fig. 13, the user made a sound with the 'animal rattle'. However, the robot falsely responded 'This is a tambourine' since it recognized the sound as a tambourine. The user then gave the information that the object is soft. This additional information led to correct recognition of the object. In the next example, the robot recognized a maraca as a sandbox toy because of the similarity in appearance between them. The robot then grasped the object, shook it and correctly recognized it as a maraca, as shown in Fig. 14. Finally, in Fig. 15, the user said 'This is an instrument' without showing anything to the robot. The robot inferred the word 'tambourine'. The user then gave the word 'long' and the robot recalled the word 'maraca'.

Although the current conversation system is simple, the robot has been shown to be able to use perceptually grounded words within the proposed framework. Even though 23 words are insufficient for practical conversation in daily life, we think that the proposed system can deal with a larger vocabulary

(consisting of 10 000 or more words) if the robot can have enough linguistic experience for learning such a huge vocabulary. In fact, LDA has been successfully applied to language modeling dealing with over 30 000 words. Unfortunately, it seems hard to teach the robot hundreds of words in the current system. This issue is worth pursuing in the future. How to acquire new associations between an unknown word and multimodal concepts through human–robot interaction is another interesting research direction to pursue. The batch-type learning algorithm also leads to severe limitations of the current system. In Ref. [19], online LDA, which makes it possible to update parameters in an incremental manner, has been proposed. Therefore, online learning of multimodal concepts and word associations could become possible through dialog between the user and the robot by extending online LDA to online multimodal LDA. This issue is left for future work.

5. Conclusions

In this paper, multimodal object categorization was explored. The symbol grounding problem was then examined based on the concepts formed by the multimodal categorization. The proposed framework is an extension of LDA. Experimental results with 40 objects (eight categories) show that the proposed algorithm works better than visual-only categorization. We also demonstrated the possibility of a conversation between a user and the robot based on the grounded words.

Currently, the number of categories must be given by hand in our LDA-based framework. This number should be determined automatically. We are now planning to solve this problem by introducing hierarchical Dirichlet processes [20]. The granularity of categories is another problem to be solved. Obviously, categories are not fixed, but vary according to context from abstract to concrete. We believe that selective attention is a key to modeling this granularity of categories [21]. These problems are issues for the future. We are also planning to expand the experimental scale (i.e., categories, objects, users, etc.) to evaluate the proposed framework under more severe environmental conditions.

Acknowledgements

This work was supported by a Grant-in-Aid for Scientific Research (C) (20500186, 20500179), a Grant-in-Aid for Scientific Research on Innovative Areas (The study on the neural dynamics for understanding communication in terms of complex hetero systems) and the National Institute of Informatics Joint Study Program.

References

1. F. G. Ashby and W. T. Maddox, Human category learning, Annu. Rev. Psychol. 56, 149–178 (2005).
2. D. J. Freedman and J. A. Assad, Experience-dependent representation of visual categories in parietal cortex, Nature 443, 85–88 (2006).

3. P. Bloom, Descartes' Baby: How the Science of Child Development Explains what Makes Us Human. Basic Books, London (2004).
4. J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman and W. T. Freeman, Discovering object categories in image collections, in: Proc. Int. Conf. on Computer Vision, Beijing, pp. 370–377 (2005).
5. R. Fergus, P. Perona and A. Zisserman, Object class recognition by unsupervised scale-invariant learning, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Madison, WI, vol. 2, pp. 264–271 (2003).
6. B. C. Russell, A. A. Efros, J. Sivic, W. T. Freeman and A. Zisserman, Using multiple segmentations to discover objects and their extent in image collections, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, New York, NY, pp. 17–22 (2006).
7. F. F. Li, A Bayesian hierarchical model for learning natural scene categories, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Diego, CA, pp. 524–531 (2005).
8. F. F. Li, R. Fergus and P. Perona, One-shot learning of object categories, Pattern Anal. Mach. Intell. 28, 594–611 (2006).
9. T. Nakamura, T. Nagai and N. Iwahashi, Multimodal object categorization by a robot, in: Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, San Diego, CA, pp. 2415–2420 (2007).
10. T. Hofmann, Unsupervised learning by probabilistic latent semantic analysis, Mach. Learn. 42, 177–196 (2001).
11. D. M. Blei, A. Y. Ng and M. I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3, 993–1022 (2003).
12. T. Nakamura, T. Nagai and N. Iwahashi, Grounding of word meanings in multimodal concepts using LDA, in: Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, St. Louis, MO, pp. 3943–3948 (2009).
13. N. Iwahashi, Robots that learn language: developmental approach to human–machine conversations, in: Proc. 3rd Int. Workshop on the Emergence and Evolution of Linguistic Communication, Rome, pp. 143–167 (2006).
14. M. Attamimi, A. Mizutani, T. Nakamura, K. Sugiura, T. Nagai, N. Iwahashi, H. Okada and T. Omori, Learning novel objects using out-of-vocabulary word segmentation and object extraction for home assistant robots, in: Proc. IEEE Int. Conf. on Robotics and Automation, Anchorage, AK, pp. 745–750 (2010).
15. D. Roy and A. Pentland, Learning words from sights and sounds: a computational model, Cogn. Sci. 26, 113–146 (2002).
16. C. Yu and D. H. Ballard, On the integration of grounding language and learning objects, in: Proc. 19th Natl. Conf. on Artificial Intelligence, San Jose, CA, pp. 488–493 (2004).
17. D. G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comp. Vis. 60, 91–110 (2004).
18. M. I. Jordan, Z. Ghahramani, T. Jaakkola and L. K. Saul, An introduction to variational methods for graphical models, Mach. Learn. 37, 183–233 (1999).
19. M. Hoffman, D. M. Blei and F. Bach, Online learning for latent Dirichlet allocation, in: Proc. 24th Annu. Conf. on Neural Information Processing Systems, Vancouver, pp. 856–864 (2010).
20. Y. W. Teh, M. I. Jordan, M. J. Beal and D. M. Blei, Hierarchical Dirichlet processes, J. Am. Stat. Ass. 101, 1566–1581 (2006).
21. T. Nakamura, T. Nagai and N. Iwahashi, Forming abstract concept using multiple multimodal LDA models, in: Proc. 24th Annu. Conf. of the Japanese Society for Artificial Intelligence, Nagasaki, 1J1-OS13-3 (2010) (in Japanese).

About the Authors

Tomoaki Nakamura received his Bachelor's degree from the University of Electro-Communications in 2007. He received his Master's degree from the Graduate School of Electro-Communications, University of Electro-Communications, in 2009. He is currently a PhD candidate at the Graduate School of Electro-Communications, University of Electro-Communications. He is a Research Fellow of the Japan Society for the Promotion of Science, and a Member of the RSJ and IEICE.

Takaya Araki received his Bachelor's degree from the University of Electro-Communications in 2011. He is currently in a Master's course at the Graduate School of Informatics and Engineering, University of Electro-Communications. He is a Member of the IEICE.

Takayuki Nagai received his BE, ME and DE degrees from the Department of Electrical Engineering, Keio University, in 1993, 1995 and 1997, respectively. Since 1998, he has been with the University of Electro-Communications, where he is currently an Associate Professor at the Graduate School of Informatics and Engineering. From 2002 to 2003, he was a Visiting Scholar at the Department of Electrical and Computer Engineering, University of California, San Diego. Since 2011, he has also been a Visiting Researcher at the Tamagawa University Brain Science Institute. He is a Member of the IEEE, RSJ, JSAI, IEICE and IPSJ.

Naoto Iwahashi received the BE degree in Engineering from Keio University, Yokohama, Japan, in 1985. He received the PhD degree in Engineering from the Tokyo Institute of Technology in 2001. In 1985, he joined Sony Corp., Tokyo, Japan. From 1990 to 1993, he was at the Advanced Telecommunications Research Institute International (ATR), Kyoto, Japan. From 1998 to 2003, he was with Sony Computer Science Laboratories Inc., Tokyo, Japan. From 2004 to 2010, he was with ATR. In 2008, he joined the National Institute of Information and Communications Technology, Kyoto, Japan. His research areas include machine learning, spoken language processing, human–robot interaction, developmental multimodal dialog systems and language acquisition robots.