A Hybrid Grammar-Based Approach to Multimodal Languages Specification

Arianna D’Ulizia, Fernando Ferri, and Patrizia Grifoni

IRPPS-CNR, via Nizza 128, 00198 Roma, Italy
{arianna.dulizia,fernando.ferri,patrizia.grifoni}@irpps.cnr.it

Abstract. Multimodal interaction represents the future of human-computer interaction. It aims at enabling users to communicate more freely and naturally with automated systems. Current approaches to multimodal dialogue systems are based on two different strategies for performing modality fusion: the former implements multimodal fusion at the dialogue management level, whereas the latter does so at the grammar level. In this paper we propose a new strategy based on a hybrid multimodal grammar that combines formal grammars and logical calculus in order to overcome the drawbacks of grammar-based and dialogue-based methods. Moreover, an application of the proposed approach in the context of a multimodal phone book is given, and the experiments that we performed to evaluate the approach are described.

Keywords: Multimodal languages, hybrid grammars, human-computer interaction, multimodal fusion.

1 Introduction

Human-computer interaction is a discipline that aims at enabling the interaction between humans and computational machines. In recent years several efforts have been made to make this interaction more intuitive and natural. In this direction, multimodal interaction has emerged as the future paradigm of human-computer interaction. In a multimodal system the user communicates with the computer through the simultaneous or alternative use of several input/output channels. Such systems offer a more flexible, efficient and usable environment, allowing the user to interact through input modalities, such as speech, handwriting, hand gesture and gaze, and to receive information from the system through output modalities, such as speech synthesis and smart graphics.

The studies of Oviatt et al. [9] demonstrate that multimodal systems enhance accessibility, usability, flexibility and efficiency compared to unimodal ones. In particular, the use of multimodality leads to three benefits over unimodality. First, multimodality improves accessibility to the device, as it provides users with the means to choose among available modalities according to the device’s specific usability constraints. Second, multimodality improves accessibility by encompassing a broader spectrum of users, enabling those of different ages and skill levels as well as users with disabilities to access technological devices. Finally, it offers improved flexibility, usability and interaction efficiency.

In multimodal systems the two main challenges to face are: to combine and integrate information from different input modalities (fusion process), and to generate appropriate output information (fission process) in order to enable a natural dialogue between users and computer systems. Our specific concern in this paper is with the fusion of multiple input modalities.

In the literature, two different approaches to the fusion process have been proposed. The first one, which we refer to as the grammar-based approach, combines the multimodal inputs at the grammar level. This means that the different unimodal inputs are considered as a unique multimodal input by using the multimodal grammar specification. Subsequently, the dialogue parser applies the grammar rules to interpret the multimodal sentence. The second strategy, which we refer to as the dialogue-based approach, combines the multimodal inputs at the dialogue level. This means that the different unimodal inputs are interpreted separately and then combined by the dialogue management system. A comparison of these two approaches has been made by Manchón et al. [7]. They concluded that the grammar-based approach is the more natural one, as it is more coherent with the human-human communication paradigm, in which the dialogue is seen as a unique, multimodal communication act. Moreover, this approach allows easier inter-modality disambiguation. These are the reasons that have led us to follow the grammar-based paradigm.

In particular, in this paper we aim at defining a grammar-based approach for specifying multimodal languages. To achieve this, we propose a hybrid multimodal grammar that provides a logical level on top of multi-dimensional grammars. Specifically, we start from multi-dimensional grammars, which are well suited to cover the formal definition of multimodal language syntax, as they yield a formalism that can encode the structure of a variety of multimodal inputs. As the main shortcoming of grammar-based methods is that they provide little support in processing semantics, we improve the grammatical specification through the use of a logical approach to represent aspects not directly specified in the grammar. This hybrid approach is the most suitable for dealing with syntax without omitting the language’s semantic aspects.

The paper is organized as follows. Section 2 briefly describes existing approaches to multimodal fusion. Section 3 describes our theoretical approach based on a hybrid multimodal grammar, which encodes Constraint Multiset Grammars (CMGs) into Linear Logic. In Section 4 an application of this grammar in the context of a multimodal phone book is illustrated. In Section 5 the experiments that we performed to validate the proposed approach are described. Finally, conclusions and future work are given in Section 6.

2 Related Work

In this section an analysis of current fusion approaches, classified as dialogue-based and grammar-based, is given, also providing the general architecture they rely on.


2.1 The Dialogue-Based Approach

In multimodal systems, both dialogue- and grammar-based, it is common to have separate recognizers for each modality managed by the system. This means that the recognition phase (see Figures 1 and 2) is the same for the two approaches. The difference lies in the interpretation and integration phases. In dialogue-based fusion, the outcomes of each recognizer are separately interpreted and then sent to the dialogue manager, which performs their fusion by using integration mechanisms such as, for example, statistical integration techniques, agent theory, hidden Markov models, and artificial neural networks. Figure 1 illustrates a general architecture for this strategy.

Fig. 1. The architecture of a dialogue-based fusion system
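To make the data flow of Figure 1 concrete, the following is a minimal sketch of dialogue-level fusion; the class and method names are hypothetical and do not come from any of the cited systems. Each modality is interpreted on its own, and a dialogue manager later groups the interpreted inputs that fall within the same time window.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class InterpretedInput:
    """Output of a single-modality interpreter (hypothetical structure)."""
    modality: str   # e.g. "speech" or "handwriting"
    meaning: dict   # modality-specific semantic frame
    start: float    # timestamps in seconds
    end: float

class DialogueManager:
    """Fuses independently interpreted inputs that are close in time."""
    def __init__(self, time_window: float = 1.5):
        self.time_window = time_window
        self.input_pool: List[InterpretedInput] = []

    def receive(self, item: InterpretedInput) -> None:
        self.input_pool.append(item)

    def fuse(self) -> List[List[InterpretedInput]]:
        # Group inputs whose start times fall within the same window;
        # a real system would also rank alternative interpretations.
        self.input_pool.sort(key=lambda i: i.start)
        groups, current = [], []
        for item in self.input_pool:
            if current and item.start - current[0].start > self.time_window:
                groups.append(current)
                current = []
            current.append(item)
        if current:
            groups.append(current)
        return groups
```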

This approach was followed by several authors in the literature. Corradini et al. [3] propose a dialogue-based approach to modality fusion in which the fusion engine combines the time-stamped information received from the recognizers, selects the most probable multimodal interpretation, and passes it on to the dialogue manager. The selection of the most probable interpretation is carried out by the dialogue manager, which rules out inconsistent information both by binding the semantic attributes of different modalities and by using environment content to disambiguate information from the single modalities.

Another dialogue-based approach is ICARE [1]. This considers both pure modalities, described through elementary components, and combined modalities, specified by the designer through composition components. The fusion is performed within the dialogue manager by using a technique based on agents (PAC agents). Moreover, it is characterized by three steps and makes use of a particular data structure named the melting pot. The first step, referred to as microtemporal fusion, combines information that is produced either in parallel or over overlapping time intervals. Further, macrotemporal fusion takes care of either sequential inputs or time intervals that do not overlap but belong to the same temporal window. Eventually, contextual fusion serves to combine input according to contextual constraints, without attention to temporal constraints.

Another approach to multimodal fusion that combines inputs coming from different multimodal channels at the dialogue level is the MIMUS approach [10]. It starts from the assumption that each individual input can be considered as an independent dialogue move. In this approach, the multimodal input pool receives and stores all inputs, including information such as time and modality. The dialogue manager checks the input pool regularly to retrieve the corresponding input. If more than one input is received during a certain time frame, further analysis is performed in order to determine whether those independent multimodal inputs are truly related or not.

2.2 The Grammar-Based Approach

In the grammar-based approach, the outcomes of each recognizer are considered as terminal symbols of a formal grammar and are consequently recognized by the parser as a unique multimodal sentence. Therefore, the fusion of the multimodal inputs occurs at the grammar level. Finally, in the interpretation phase the parser uses the grammar specification (production rules) to interpret the sentence. Figure 2 illustrates a general architecture for this strategy.

Fig. 2. The architecture of a grammar-based fusion system
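In contrast to the dialogue-level sketch above, grammar-level fusion hands the parser a single, time-ordered stream of terminal symbols coming from all recognizers. The fragment below is a hedged illustration of that merging step only; the names, timestamps, and example words are hypothetical, no particular grammar formalism is assumed, and the parser itself is treated as a black box.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Terminal:
    """A recognizer outcome treated as a terminal symbol of the grammar."""
    modality: str   # e.g. "speech" or "handwriting"
    concept: str    # recognized word or symbol
    start: float
    end: float

def merge_streams(streams: Iterable[List[Terminal]]) -> List[Terminal]:
    """Interleave the per-modality outputs into one multimodal sentence,
    ordered by start time, so a single parser can process them together."""
    merged = [t for stream in streams for t in stream]
    return sorted(merged, key=lambda t: t.start)

# Illustrative input: "phone this person" spoken while a name is written.
speech = [Terminal("speech", "Phone", 0.0, 0.4),
          Terminal("speech", "this", 0.5, 0.7),
          Terminal("speech", "person", 0.8, 1.2)]
handwriting = [Terminal("handwriting", "John", 0.6, 1.5),
               Terminal("handwriting", "Smith", 1.6, 2.4)]
multimodal_sentence = merge_streams([speech, handwriting])
```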

The grammar-based approach was followed by Johnston et al. [6]. The authors perform multimodal parsing and understanding by using a weighted finite-state machine. Modality integration is carried out by merging and encoding both semantic and syntactic content from multiple streams into a finite-state device. In this way, the structure and the interpretation of multimodal utterances can be captured declaratively in a context-free multimodal grammar.


Another grammar-based approach is MUMIF [11]. The fusion module of MUMIF applies a multimodal grammar to unify the recognized unimodal inputs into a unique multimodal input, which is represented using the Typed Feature Structures (TFS) formalism proposed by Carpenter [2]. The MUMIF multimodal grammar is composed of two kinds of rules: lexical rules, which are used to specify the TFS representation, and grammar rules, which constrain the unification of inputs.
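As a rough idea of what unifying inputs into a feature structure means, the sketch below shows generic feature-structure unification over nested dictionaries; it is not MUMIF's actual implementation, and the example structures are invented for illustration.

```python
def unify(fs1: dict, fs2: dict):
    """Unify two feature structures represented as nested dicts.
    Returns the merged structure, or None if the features clash."""
    result = dict(fs1)
    for key, value in fs2.items():
        if key not in result:
            result[key] = value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            sub = unify(result[key], value)
            if sub is None:
                return None
            result[key] = sub
        elif result[key] != value:
            return None   # conflicting atomic values: unification fails
    return result

# A spoken command and a written name contribute complementary features.
speech_fs = {"act": "call", "object": {"type": "person"}}
writing_fs = {"object": {"type": "person", "name": "John Smith"}}
print(unify(speech_fs, writing_fs))
# {'act': 'call', 'object': {'type': 'person', 'name': 'John Smith'}}
```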

3 The Hybrid Grammar-Based Approach

The approach presented below follows the grammar-based paradigm. Therefore, our aim is to enable the fusion of multimodal inputs by using a hybrid grammar that also allows the formalization of multimodal interaction languages. This hybrid grammar consists of a logical level on top of multi-dimensional grammars. This is because multi-dimensional grammars are well suited to cover the formal definition of multimodal language syntax, as they yield a formalism that can encode the structure of a variety of multimodal inputs, but they provide little support in processing semantics. To overcome this drawback, the use of a logical approach can improve the grammatical specification by representing aspects not directly specified in the grammar. This hybrid approach is thus the most suitable for dealing with syntax without omitting the language’s semantic aspects.

In order to give a formal definition of a hybrid grammar we first have to define what a multi-dimensional grammar is. In particular, we consider a kind of attributed multiset grammar, termed Constraint Multiset Grammars (CMGs) [5] [8]. These grammars have been extensively used for the specification of visual languages and, as they are able to capture language multi-dimensionality, which is one of the main characteristics of multimodal languages, we aim at adapting them to the formalization of multimodal languages.

Definition 1. A CMG grammar is a quadruple G = (I, T, P, X), where:
- I is a finite set of non-terminal symbols, also called the variable alphabet;
- T is a finite set of terminal symbols not in I, also called the object alphabet;
- P is a finite set of production rules, which have the form:

i ::= t1, ..., tn where (C) {E}     (1)

indicating that the non-terminal symbol i can be recognized from the terminal symbols t1, ..., tn whenever the attributes of t1, ..., tn satisfy the constraints C. The attributes of i are computed using the assignment expression E.
- X is the start symbol, belonging to I.

At this point we can define the hybrid grammar by embedding the CMG definition into Linear Logic [4] in order to give a logical level to the grammar. We choose this kind of logic rather than first-order logic because it allows us to model and manage different interpretations of the same multimodal sentence. Linear logic is a logic of resources and actions. Two main operators are used: the linear implication ⊸ and the linear connective ⊗. The former is used in the notation X ⊸ Y to indicate that the resource X can be consumed to give Y. The latter is used in the notation X ⊗ Y to indicate that we have both the resource X and the resource Y.
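As a purely illustrative reading, and not the exact encoding used by the approach, a CMG production of the form (1) can be viewed as a linear implication that consumes the terminal resources to produce the non-terminal:

t1 ⊗ t2 ⊗ ... ⊗ tn ⊸ i, provided the constraints C on the attributes hold.

Because linear implication consumes its antecedents, each recognized terminal is used exactly once in a derivation, which is what makes it possible to keep competing interpretations of the same multimodal sentence apart as alternative proofs.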


In the next section we illustrate an example of the application of this approach to define a hybrid multimodal grammar for a speech-handwriting application. An advantage of the hybrid approach over grammar-only approaches is that it allows reasoning about multimodal languages, providing a prerequisite for being able to resolve ambiguities and for integrating parsing with semantic theories. Another major advantage is that linear logic has a well-developed proof and model theory that makes it possible to prove abstract properties of grammars.

4 The Multimodal Phone Book Application

To clarify the formal definitions given in the previous section, we have applied our approach in the context of a multimodal phone book in which the user can interact through the simultaneous use of the speech and handwriting modalities. For example, the user might say “phone this person” and write the name of the person on a touch-screen display. This is an interesting scenario that might occur if the user wants to preserve his/her privacy.

The architecture of the multimodal fusion component is shown in Figure 3. In particular, we provide speech and handwriting recognition modules that output the utterance of each recognized spoken word and the text of each recognized written word, respectively. These outputs are the terminal symbols of our hybrid grammar.

Fig. 3. The architecture of the multimodal fusion component

For defining the hybrid grammar for the multimodal phone book, we start from the definition of a CMG grammar (see Definition 1). The set of terminal symbols and their associated attributes for this CMG grammar is:

T = {SpokenWord(concept:string, start:time, end:time),
     WrittenWord(concept:string, start:time, end:time)}.
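A direct way to mirror these terminal types in code is sketched below; it only captures the attributes carried by each terminal, with floating-point timestamps standing in for the time attribute type.

```python
from dataclasses import dataclass

@dataclass
class SpokenWord:
    """Terminal produced by the speech recognizer."""
    concept: str
    start: float   # stand-in for the 'time' attribute type
    end: float

@dataclass
class WrittenWord:
    """Terminal produced by the handwriting recognizer."""
    concept: str
    start: float
    end: float
```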


The set of non-terminal symbols is:

I = {call(name:string, surname:string),
     search(name:string, surname:string),
     insert(name:string, surname:string, number:string),
     modify(name:string, surname:string, number:string)},

where insert is the start type.

As an example of a production rule in a CMG, consider the recognition of the multimodal sentence composed of the speech “phone this person” and the writing of the name of the person on a touch-screen display (see Figure 4). This multimodal sentence is made up of three spoken words S1, S2 and S3 and two written words W1 and W2, corresponding to the name and surname of the person the user wants to call (see Figure 5). These terminal symbols have to satisfy the following constraints: the concept of S1 is “Phone”, and S1 starts before S2 and S3. Therefore, the production rule for the call is:

C:call ::= S1:SpokenWord, S2:SpokenWord, S3:SpokenWord, W1:WrittenWord, W2:WrittenWord (S1.concept = “Phone” and S1.start
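Since the rule above is cut off in the available text, the following sketch implements only the constraints that are stated explicitly in the prose (the concept of S1 is “Phone”, and S1 starts before S2 and S3). The reduction to a call non-terminal follows the assignment pattern of Definition 1, but the exact expression E is an assumption here, and the SpokenWord and WrittenWord classes are the ones sketched earlier.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Call:
    """Non-terminal call(name, surname) of the phone book grammar."""
    name: str
    surname: str

def reduce_call(s1: SpokenWord, s2: SpokenWord, s3: SpokenWord,
                w1: WrittenWord, w2: WrittenWord) -> Optional[Call]:
    """Apply the call production: check the constraints C and, if they hold,
    build the non-terminal via the assignment expression E (assumed here to
    copy the written words into the name and surname attributes)."""
    constraints_hold = (
        s1.concept == "Phone"
        and s1.start < s2.start
        and s1.start < s3.start
        # the remaining constraints of the rule are not shown in the excerpt
    )
    if not constraints_hold:
        return None
    return Call(name=w1.concept, surname=w2.concept)

# Usage with the "phone this person" example; timestamps and the written
# name are illustrative only.
c = reduce_call(SpokenWord("Phone", 0.0, 0.4),
                SpokenWord("this", 0.5, 0.7),
                SpokenWord("person", 0.8, 1.2),
                WrittenWord("John", 0.6, 1.5),
                WrittenWord("Smith", 1.6, 2.4))
print(c)   # Call(name='John', surname='Smith')
```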