PAPER
An Integrated Dialogue Analysis Model for Determining Speech Acts and Discourse Structures

Won Seug CHOI†, Harksoo KIM†, and Jungyun SEO††, Members
SUMMARY  Analysis of speech acts and discourse structures is essential to a dialogue understanding system because speech acts and discourse structures are closely tied to the speaker's intention. However, it has been difficult to infer a speech act and a discourse structure from a surface utterance because they depend heavily on the context of the utterance. We propose a statistical dialogue analysis model that determines discourse structures as well as speech acts using a maximum entropy model. The model can automatically acquire probabilistic discourse knowledge from an annotated dialogue corpus. Moreover, the model can analyze speech acts and discourse structures within one framework. In our experiments, the model showed better performance than previous works.
key words: speech act analysis, discourse structure analysis, maximum entropy model, statistical dialogue analysis model
1. Introduction

A goal-oriented dialogue consists of a sequence of goal-oriented utterances. Each utterance represents a linguistic action intended by a speaker [1]. For example, utterance (1) in Table 1 shows that a customer requests an agent to reserve a room. The requesting action conveyed by utterance (1) is a speech act. Since identifying the speech acts of utterances is very important for identifying a speaker's intention, it is essential to a dialogue analysis system. It is difficult, however, to infer the speech act from a surface utterance, since an utterance may represent more than one speech act depending on the context.

Previous approaches to dialogue analysis have analyzed speech acts based on knowledge such as recipes for plan inference and domain-specific knowledge [2]–[7]. Since these knowledge-based models depend on costly hand-crafted knowledge, they are difficult to scale up and expand to other domains. Recently, machine learning models using a discourse-tagged corpus have been used to analyze speech acts in order to overcome such problems [8]–[15]. Machine learning offers promise as a means of associating features of utterances with particular speech acts, since computers can automatically analyze large quantities of data and consider many different feature interactions. These models are based on features such as cue phrases, change of speaker, length of an utterance, speech act tag n-grams, and word n-grams. In many cases, the speech act of an utterance is influenced by the context of the utterance, i.e., the previous utterances, so it is very important for the model to reflect this contextual information.

Discourse structures of dialogues are usually represented as hierarchical structures, which reflect embedded sub-dialogues [16] and provide very useful context for speech act analysis. For example, utterance (7) in Table 1 has several plausible surface speech acts, such as acknowledge, inform, and response. Such an ambiguity can be resolved by analyzing the context. If we consider the n utterances linearly adjacent to utterance (7), i.e., utterances (6), (5), etc., as context, we will get acknowledge or inform with high probability as the speech act of utterance (7). However, as shown in Table 1, utterance (7) is a response to utterance (2), which is hierarchically recent to utterance (7) according to the discourse structure of the dialogue. If we know the discourse structure of the dialogue, we can determine the speech act of utterance (7) to be response.

Some researchers have used the structural information of discourse for speech act analysis [7], [13]. It is not, however, enough to cover various dialogues, since they used a restricted rule-based model such as RDTN (Recursive Dialogue Transition Networks) for discourse structure analysis. Most of the previous related works, to our knowledge, tried to determine the speech act of an utterance but did not address statistical models for determining the discourse structure of a dialogue.

In this paper, we propose a dialogue analysis model that determines both the speech acts of utterances and the discourse structure of a dialogue using a maximum entropy model.
Table 1  Dialogue example annotated with speech acts.
Manuscript received March 25, 2004.
Manuscript revised July 16, 2004.
† The authors are with Diquest Inc., 7F, Sindo B/D, 1604–22, Seocho-dong, Seocho-gu, Seoul, 137–070, Korea.
†† The author is with the Dept. of Computer Science, Sogang University, Sinsu-dong 1, Mapo-gu, Seoul, 121–742, Korea.
Copyright © 2005 The Institute of Electronics, Information and Communication Engineers
In the proposed model, the speech act analysis and the discourse structure analysis are combined in one framework so that they can easily provide feedback to each other. For the discourse structure analysis, we suggest a statistical model with discourse structure tags (DSTs), similar to the idea of gaps suggested for statistical parsing [17]. For training, we use a corpus annotated with various discourse knowledge. To overcome the problem of data sparseness, which is common in corpus-based work, we use split partial context as well as the whole context.

We discuss a statistical model for speech act and discourse structure analysis in Sect. 2. In Sect. 3, we propose a dialogue analysis model in which the speech act and discourse structure analysis models are integrated using a maximum entropy model. In Sect. 4, we explain the experimental results. Finally, we conclude in Sect. 5.

2. Statistical Model

We construct two statistical models: one for speech act analysis and the other for discourse structure analysis. We integrate the two models using a maximum entropy model. In the following subsections, we describe these models in detail.

2.1 Speech Act Analysis Model

Let U_{1,n} denote a dialogue consisting of a sequence of n utterances, U_1, U_2, \ldots, U_n, and let S_{1,n} denote the speech act sequence of U_{1,n}. Then the speech act tagging problem can be formally defined as finding the speech act sequence S_{1,n} that is the result of the following equation:

  S(U_{1,n}) \stackrel{def}{=} \arg\max_{S_{1,n}} P(S_{1,n} | U_{1,n})
            = \arg\max_{S_{1,n}} \frac{P(S_{1,n}, U_{1,n})}{P(U_{1,n})}
            = \arg\max_{S_{1,n}} P(S_{1,n}, U_{1,n})    (1)

In Eq. (1), we dropped P(U_{1,n}) because it is constant for all S_{1,n}. Next, we break Eq. (1) into "bite-size" pieces about which we can collect statistics, as shown in Eq. (2):

  P(S_{1,n}, U_{1,n}) = P(U_1) P(S_1 | U_1) P(U_2 | S_1, U_1) P(S_2 | S_1, U_{1,2}) \cdots
                        P(U_n | S_{1,n-1}, U_{1,n-1}) P(S_n | S_{1,n-1}, U_{1,n})
                      = P(U_1) P(S_1 | U_1) \prod_{i=2}^{n} P(U_i | S_{1,i-1}, U_{1,i-1}) P(S_i | S_{1,i-1}, U_{1,i})
                      = \prod_{i=1}^{n} P(U_i | S_{1,i-1}, U_{1,i-1}) P(S_i | S_{1,i-1}, U_{1,i})    (2)

We derived Eq. (2) by first breaking out P(U_1) from P(S_{1,n}, U_{1,n}) and suitably defining terms like S_{1,0} and their probabilities. If we instead break out P(S_1) first, we get the following equation:

  P(S_{1,n}, U_{1,n}) = \prod_{i=1}^{n} P(S_i | S_{1,i-1}, U_{1,i-1}) P(U_i | S_{1,i}, U_{1,i-1})    (3)

Now we simplify Eq. (3) by making the following two Markov assumptions:

  P(S_i | S_{1,i-1}, U_{1,i-1}) \approx P(S_i | S_{1,i-1})    (4)

  P(U_i | S_{1,i}, U_{1,i-1}) \approx P(U_i | S_i)    (5)

That is, we assume that the current speech act is independent of the previous utterances and depends only on the previous speech acts. Similarly, we assume that the current utterance is independent of everything except knowledge of its own speech act. With these assumptions we get the following equation [18], [19]:

  S(U_{1,n}) \stackrel{def}{=} \arg\max_{S_{1,n}} P(S_{1,n} | U_{1,n})
            \approx \arg\max_{S_{1,n}} \prod_{i=1}^{n} P(S_i | S_{1,i-1}) P(U_i | S_i)    (6)
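As an illustration of how Eq. (6) can be decoded, the following is a minimal sketch, not the authors' implementation, of Viterbi-style dynamic programming over candidate speech act sequences. The act inventory and probability tables are placeholders, and the simple bigram context stands in for the hierarchically recent trigram context introduced below in Eq. (8).

```python
import math

# Toy speech act inventory; the paper's corpus distinguishes 17 acts (Table 5).
SPEECH_ACTS = ["request", "ask-ref", "response", "inform"]

def viterbi(utterances, trans_prob, sent_prob):
    """Maximize prod_i P(S_i|S_{i-1}) * P(U_i|S_i) over speech act sequences.

    trans_prob[(prev_act, act)] stands in for the contextual probability and
    sent_prob[(utterance, act)] for the sentential probability of Eq. (6).
    """
    best = {"<s>": (0.0, [])}        # act -> (log prob, best sequence so far)
    for utt in utterances:
        new_best = {}
        for act in SPEECH_ACTS:
            for prev, (logp, seq) in best.items():
                score = (logp
                         + math.log(trans_prob.get((prev, act), 1e-6))
                         + math.log(sent_prob.get((utt, act), 1e-6)))
                if act not in new_best or score > new_best[act][0]:
                    new_best[act] = (score, seq + [act])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]
```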
It has been widely believed that there is a strong relation between the speaker's speech act and the surface utterance expressing that speech act [20], [21]. That is, the speaker utters the sentence that best expresses his or her intention (speech act), so that the hearer can easily infer what the speaker's speech act is. The sentential probability P(U_i | S_i) represents this relationship between speech acts and the features of surface sentences. However, it is impossible to compute the sentential probability directly, because sentences vary too much for identical surface forms to recur. In a real dialogue, a speaker expresses identical contents with various surface utterances according to his or her personal linguistic sense. To overcome this problem, we assume that a syntactic pattern generalizes these surface utterances using syntactic features [13]. Therefore, we approximate the sentential probability using the syntactic pattern P_i:

  P(U_i | S_i) \approx P(P_i | S_i)    (7)

In this paper, we use a syntactic pattern that consists of six features: sentence type, main-verb, tense, negative, aux-verb, and clue-word. Table 2 shows the syntactic features of a syntactic pattern, with their possible values. The syntactic features are automatically extracted from the corpus using a conventional parser [22]. For example, the syntactic pattern of utterance (1) in Fig. 1 can be represented as [decl, reserve, present, no, want, none]. In the pattern, decl stands for a declarative sentence, reserve is the type of the main verb, present denotes the tense of the utterance, no in the fourth element indicates a positive (non-negated) sentence, want is the modality type, and the last element, none, means that there is no predefined clue word in the utterance.

Table 2  Syntactic features used in the syntactic pattern.

  Syntactic feature | Values | Notes
  Sentence Type     | decl, imperative, wh question, yn question | The mood of an utterance.
  Main-Verb         | pvg, pvd, paa, pad, be, know, ask, etc. (total 88 kinds) | The type of the main verb. For special verbs, lexical items are used.
  Tense             | past, present, future | The tense of an utterance.
  Negative Sentence | yes or no | Yes, if an utterance is negative.
  Aux-Verb          | serve, seem, want, will, etc. (total 31 kinds) | The modality of an utterance.
  Clue-Word         | yes, no, ok, etc. (total 26 kinds) | The special word used in utterances having particular speech acts.
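To make Eq. (7) concrete, here is a minimal sketch of our own: it represents an utterance by the six-feature syntactic pattern of Table 2 and estimates P(P_i | S_i) by relative-frequency counting over (pattern, speech act) pairs. In the paper, the probabilities are ultimately estimated with the maximum entropy model of Sect. 3, and the features themselves come from the parser of [22].

```python
from collections import Counter, namedtuple

# The six syntactic features of Table 2.
Pattern = namedtuple("Pattern",
                     "sentence_type main_verb tense negative aux_verb clue_word")

def estimate_sentential(tagged_pairs):
    """Relative-frequency estimate of P(P_i | S_i) from (pattern, act) pairs."""
    joint, marginal = Counter(), Counter()
    for pattern, act in tagged_pairs:
        joint[(pattern, act)] += 1
        marginal[act] += 1
    return lambda pattern, act: (
        joint[(pattern, act)] / marginal[act] if marginal[act] else 0.0)

# The pattern of utterance (1) given in the text.
p1 = Pattern("decl", "reserve", "present", "no", "want", "none")
p_sent = estimate_sentential([(p1, "request")])
print(p_sent(p1, "request"))   # 1.0 on this one-pair toy corpus
```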
The contextual probability P(S_i | S_{1,i-1}) is the probability that an utterance with speech act S_i is uttered given that utterances with speech acts S_1, S_2, \ldots, S_{i-1} were uttered previously. Since it is impossible to consider all preceding utterances S_1, S_2, \ldots, S_{i-1} as contextual information, we use an n-gram model. Generally, dialogues have a hierarchical discourse structure, so we approximate the context as the speech acts of the n utterances that are hierarchically recent to the current utterance. An utterance (A) is hierarchically recent to an utterance (B) if (A) is adjacent to (B) in the tree structure of the discourse [23], as shown in Fig. 1. Equation (8) represents the approximated contextual probability in the trigram case, where U_j and U_k are hierarchically recent to the utterance U_i and 1 ≤ j ≤ k ≤ i − 1:

  P(S_i | S_{1,i-1}) \approx P(S_i | S_j, S_k)    (8)

Fig. 1  An example of a hierarchical discourse structure.
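The sketch below shows one way, under our reading of hierarchical recency [23], to obtain the predecessors U_j and U_k of Eq. (8): the discourse structure built so far is kept as a stack of open segments, and the context consists of the most recent utterances on the still-open path of the tree. The DST tags used here are defined in Sect. 2.2.1, and the speech acts in the example are invented, since the dialogue of Table 1 is not reproduced here.

```python
def attach(stack, dst):
    """Apply a hypothesized DST (Table 3) to the stack of open segments and
    return the stack; the new utterance then joins the innermost segment."""
    if dst == "DE":                       # a new dialogue starts
        stack[:] = [[]]
    elif dst == "SS":                     # a sub-dialogue starts
        stack.append([])
    elif dst.endswith("B"):               # "nB": close n levels, then SS
        del stack[len(stack) - int(dst[:-1]):]
        stack.append([])
    elif dst.endswith("E"):               # "nE": close n levels
        del stack[len(stack) - int(dst[:-1]):]
    return stack

def context_acts(stack, n=2):
    """The speech acts hierarchically recent to the next utterance."""
    return [act for segment in stack for act in segment][-n:]

# Utterances (1)-(6) with the DSTs of the Table 4 walkthrough (acts invented;
# the NULL tag of the first utterance is simplified to the no-op "DC"):
stack = [[]]
for act, dst in [("opening", "DC"), ("request", "SS"), ("ask-ref", "SS"),
                 ("response", "DC"), ("ask-ref", "1B"), ("response", "DC")]:
    attach(stack, dst)
    stack[-1].append(act)

# Hypothesizing DST "1E" for utterance (7) closes the sub-dialogue of (5)-(6),
# so utterance (2) ("request") becomes part of the trigram context:
print(context_acts(attach(stack, "1E")))   # ['opening', 'request']
```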
As a result, the statistical model for speech act analysis is represented as Eq. (9):

  S(U_{1,n}) \approx \arg\max_{S_{1,n}} \prod_{i=1}^{n} P(S_i | S_{1,i-1}) P(U_i | S_i)
             \approx \arg\max_{S_{1,n}} \prod_{i=1}^{n} P(S_i | S_j, S_k) P(P_i | S_i)    (9)

2.2 Discourse Structure Analysis Model

2.2.1 Discourse Structure Tagging

We define a set of discourse structure tags (DSTs) as the markers for discourse structure tagging. A DST represents the relationship between two consecutive utterances in a dialogue. Table 3 shows the DSTs and their meanings, and Table 4 shows an example of a DST-tagged dialogue.

Table 3  DSTs and their meanings.

  DST | Meaning
  DE  | Start a new dialogue
  DC  | Continue a dialogue
  SS  | Start a sub-dialogue
  nE  | End n level sub-dialogues
  nB  | nE and then SS

Table 4  An example of DST tagging.

Since the DST of an utterance represents the relationship between that utterance and the previous utterance, the DST of utterance (1) in the example dialogue is NULL. By comparing utterance (2) with utterance (1) in Table 4, we see that a new sub-dialogue starts at utterance (2), so the DST of utterance (2) is SS. Similarly, the DST of utterance (3) is SS. Since utterance (4) is a response to utterance (3), utterances (3) and (4) belong to the same discourse segment, and the DST of utterance (4) is therefore DC. Since the one-level sub-dialogue (the discourse segment 1.1.1) consisting of utterances (3) and (4) ends, and a new sub-dialogue starts at utterance (5), the DST of utterance (5) is 1B. Finally, utterance (7) is a response to utterance (2); i.e., the sub-dialogue consisting of utterances (5) and (6) ends and the segment 1.1 is resumed. Therefore, the DST of utterance (7) is 1E.
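Since the corpus annotates each utterance with a hierarchical segment index (Sect. 4.1) rather than with DSTs directly, a converter along the following lines can derive the tags. This is our sketch of the transformation, with the index notation inferred from the walkthrough above.

```python
def ds_index_to_dst(prev, cur):
    """Derive the DST (Table 3) of the current utterance by comparing its
    discourse segment index, e.g. "1.1.2", with the previous utterance's."""
    if prev is None:
        return None                      # first utterance: no DST (NULL)
    p, c = prev.split("."), cur.split(".")
    if p[0] != c[0]:
        return "DE"                      # a new dialogue starts
    if c == p:
        return "DC"                      # the same segment continues
    if c[:-1] == p:
        return "SS"                      # a sub-dialogue starts one level down
    if c == p[:len(c)]:
        return f"{len(p) - len(c)}E"     # n enclosing sub-dialogues end
    return f"{len(p) - len(c) + 1}B"     # close to the common level, then SS

# The dialogue of the walkthrough: (1) in segment 1, (2) in 1.1, (3)-(4) in
# 1.1.1, (5)-(6) in 1.1.2, and (7) back in 1.1.
indices = ["1", "1.1", "1.1.1", "1.1.1", "1.1.2", "1.1.2", "1.1"]
print([ds_index_to_dst(prev, cur)
       for prev, cur in zip([None] + indices, indices)])
# [None, 'SS', 'SS', 'DC', '1B', 'DC', '1E']
```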
2.2.2 Statistical Model for Discourse Structure Analysis

We construct a statistical model for discourse structure analysis using DSTs. In the training phase, the model transforms the discourse structure (DS) information in the corpus into DSTs by comparing the DS information of each utterance with that of the previous utterance. After this transformation, we estimate probabilities for the DSTs. In the analysis phase, the goal of the system is simply to determine the DST of the current utterance using these probabilities. Now we describe the model in detail.

Let G_{1,n} denote the DST sequence of U_{1,n}. Then the discourse structure tagging problem can be formally defined as finding the DST sequence G_{1,n} that is the result of the following equation:

  G(U_{1,n}) \stackrel{def}{=} \arg\max_{G_{1,n}} P(G_{1,n} | U_{1,n})
            = \arg\max_{G_{1,n}} \frac{P(G_{1,n}, U_{1,n})}{P(U_{1,n})}
            = \arg\max_{G_{1,n}} P(G_{1,n}, U_{1,n})    (10)

Using Eq. (3), we can reformulate P(G_{1,n}, U_{1,n}) as the following equation:

  P(G_{1,n}, U_{1,n}) = \prod_{i=1}^{n} P(G_i | G_{1,i-1}, U_{1,i-1}) P(U_i | G_{1,i}, U_{1,i-1})    (11)

Now we simplify Eq. (11) by making the following Markov assumption:

  P(U_i | G_{1,i}, U_{1,i-1}) \approx P(U_i | G_i)    (12)

That is, we assume that the current utterance is independent of everything except knowledge of its own DST. With this assumption, we get the following equation:

  G(U_{1,n}) \stackrel{def}{=} \arg\max_{G_{1,n}} P(G_{1,n} | U_{1,n})
            \approx \arg\max_{G_{1,n}} \prod_{i=1}^{n} P(G_i | G_{1,i-1}, U_{1,i-1}) P(U_i | G_i)    (13)

In order to analyze the discourse structure, we consider the speech act of each corresponding utterance. Thus, we can approximate each utterance by its speech act in the sentential probability P(U_i | G_i), as shown in Eq. (14):

  P(U_i | G_i) \approx P(S_i | G_i)    (14)

Similarly, we can approximate the previous utterances by the corresponding speech acts in the contextual probability. With this assumption, we get the following equation:

  P(G_i | G_{1,i-1}, U_{1,i-1}) \approx P(G_i | G_{1,i-1}, S_{1,i-1})    (15)

Let F_i be the pair of the speech act and DST of U_i, to simplify notation:

  F_i \equiv (S_i, G_i)    (16)

We can approximate the contextual probability P(G_i | U_{1,i-1}, G_{1,i-1}) as in Eq. (17), in the trigram case:

  P(G_i | U_{1,i-1}, G_{1,i-1}) \approx P(G_i | F_{1,i-1}) \approx P(G_i | F_{i-2}, F_{i-1})    (17)

As a result, the statistical model for discourse structure analysis is represented as Eq. (18):

  G(U_{1,n}) \approx \arg\max_{G_{1,n}} \prod_{i=1}^{n} P(G_i | G_{1,i-1}, U_{1,i-1}) P(U_i | G_i)
            \approx \arg\max_{G_{1,n}} \prod_{i=1}^{n} P(G_i | F_{i-2}, F_{i-1}) P(S_i | G_i)    (18)
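For clarity, a relative-frequency version of the two factors of Eq. (18) can be written as below. This is our sketch; in the paper itself both terms are estimated with the maximum entropy model of Sect. 3 rather than by plain counting.

```python
from collections import Counter

def estimate_ds_model(dialogues):
    """Count-based estimates of P(G_i | F_{i-2}, F_{i-1}) and P(S_i | G_i),
    where each dialogue is a list of F_i = (speech act, DST) pairs."""
    ctx, ctx_marg = Counter(), Counter()
    sent, sent_marg = Counter(), Counter()
    for dialogue in dialogues:
        padded = [("<pad>", "<pad>")] * 2 + dialogue
        for i in range(2, len(padded)):
            act, dst = padded[i]
            history = (padded[i - 2], padded[i - 1])
            ctx[(dst, history)] += 1
            ctx_marg[history] += 1
            sent[(act, dst)] += 1
            sent_marg[dst] += 1
    p_ctx = lambda dst, hist: (ctx[(dst, hist)] / ctx_marg[hist]
                               if ctx_marg[hist] else 0.0)
    p_sent = lambda act, dst: (sent[(act, dst)] / sent_marg[dst]
                               if sent_marg[dst] else 0.0)
    return p_ctx, p_sent
```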
3. Integrated Dialogue Analysis Model

In general, speech acts and discourse structures are tightly coupled with each other. As shown in Sect. 2, discourse structures play an important role in the analysis of speech acts, and vice versa. Therefore, we propose an integrated dialogue analysis model which can analyze both speech acts and discourse structures at the same time from surface utterances. Now we describe the integrated model in detail.

Given a dialogue U_{1,n}, let S_{1,n} denote the speech act sequence of U_{1,n}, and G_{1,n} the DST sequence of U_{1,n}. Then the integrated dialogue analysis model can be formally defined as the following equation:

  D(U_{1,n}) \stackrel{def}{=} \arg\max_{S_{1,n}, G_{1,n}} P(S_{1,n}, G_{1,n} | U_{1,n})    (19)

By the chain rule, we can rewrite the probability as in Eq. (20):

  D(U_{1,n}) = \arg\max_{S_{1,n}, G_{1,n}} P(S_{1,n} | U_{1,n}) \times P(G_{1,n} | S_{1,n}, U_{1,n})    (20)

On the right-hand side (RHS) of Eq. (20), the first term is equal to the speech act analysis model of Sect. 2.1. The second term can be approximated by the discourse structure analysis model of Sect. 2.2, because the discourse structure analysis model is formulated by considering utterances and speech acts together. Finally, the integrated dialogue analysis model can be formulated as the product of the speech act analysis model and the discourse structure analysis model:

  D(U_{1,n}) \stackrel{def}{=} \arg\max_{S_{1,n}, G_{1,n}} P(S_{1,n}, G_{1,n} | U_{1,n})
            \approx \arg\max_{S_{1,n}, G_{1,n}} P(S_{1,n} | U_{1,n}) P(G_{1,n} | U_{1,n})
            \approx \arg\max_{S_{1,n}, G_{1,n}} \prod_{i=1}^{n} P(S_i | S_j, S_k) P(P_i | S_i) \times P(G_i | F_{i-2}, F_{i-1}) P(S_i | G_i)    (21)
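The paper reports (Sect. 4.2) that after the system determines the speech act and DST of an utterance, it feeds the results forward to process the next utterance. A greedy reading of that procedure, scoring each candidate (S_i, G_i) pair with the four factors of Eq. (21), might look as follows. This is our sketch: for brevity it uses the two linearly preceding pairs as context, whereas the full model selects hierarchically recent predecessors as in Eq. (8).

```python
import math

def analyze(patterns, p_sa_ctx, p_sa_sent, p_ds_ctx, p_ds_sent,
            speech_acts, dsts):
    """Greedy decoding of Eq. (21): per utterance, pick the (S_i, G_i) pair
    maximizing P(S_i|S_j,S_k) P(P_i|S_i) P(G_i|F_{i-2},F_{i-1}) P(S_i|G_i)."""
    pad = ("<pad>", "<pad>")
    history = [pad, pad]                  # the last two F_i = (S_i, G_i) pairs
    result = []
    for pattern in patterns:              # each utterance as a syntactic pattern
        f2, f1 = history[-2], history[-1]
        best = max(
            ((math.log(p_sa_ctx(s, f2[0], f1[0]) + 1e-12)
              + math.log(p_sa_sent(pattern, s) + 1e-12)
              + math.log(p_ds_ctx(g, f2, f1) + 1e-12)
              + math.log(p_ds_sent(s, g) + 1e-12), (s, g))
             for s in speech_acts for g in dsts),
            key=lambda scored: scored[0])
        result.append(best[1])
        history.append(best[1])           # feed the decision forward
    return result
```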
All terms on the RHS of Eq. (21) are conditional probabilities. We estimate the probability of each term using the following representative equation:

  P(a | b) = \frac{P(a, b)}{\sum_{a'} P(a', b)}    (22)

We can evaluate P(a, b) using a maximum entropy model
shown in Eq. (23) [24]:

  P(a, b) = \pi \prod_{i=1}^{k} \alpha_i^{f_i(a,b)},  where 0 < \alpha_i < \infty for i = 1, 2, \ldots, k    (23)

In Eq. (23), a is either a speech act or a DST depending on the term, b is the context (or history) of a, \pi is a normalization constant, and \alpha_i is the model parameter corresponding to the feature function f_i. We use two kinds of feature functions: unified feature functions and separated feature functions. The former use the whole context b as shown in Eq. (22), and the latter use partial contexts split off from the whole context to cope with data sparseness problems. Equations (24) and (25) show examples of these feature functions for estimating the sentential probability of the speech act analysis model:

  f(a, b) = 1  if a = response and b = User: [decl, pvd, future, no, will, then];
            0  otherwise    (24)

  f(a, b) = 1  if a = response and SentenceType(b) = User: decl;
            0  otherwise    (25)

Equation (24) represents a unified feature function constructed from a syntactic pattern having all the syntactic features, and Eq. (25) represents a separated feature function constructed from only one feature, Sentence Type, among the syntactic features in the pattern. The interpretation of the unified feature function in Eq. (24) is: if the current utterance is uttered by "User", the syntactic pattern of the utterance is [decl, pvd, future, no, will, then], and the speech act of the current utterance is response, then f(a, b) = 1; otherwise f(a, b) = 0. We can construct five more separated feature functions using the other syntactic features.

The feature functions for the contextual probability can be constructed in a similar way to those for the sentential probability. They are unified feature functions over feature trigrams and separated feature functions over distance-1 bigrams and distance-2 bigrams. Equation (26) shows an example of a unified feature function, and Eqs. (27) and (28), which are derived by separating the condition on b in Eq. (26), show examples of separated feature functions for the contextual probability of the speech act analysis model:

  f(a, b) = 1  if a = response and b = (User: request, Agent: ask-ref);
            0  otherwise,
  where b is the information of U_j and U_k defined in Eq. (8)    (26)

  f(a, b) = 1  if a = response and b_{-1} = Agent: ask-ref;
            0  otherwise,
  where b_{-1} is the information of U_k defined in Eq. (8)    (27)

  f(a, b) = 1  if a = response and b_{-2} = User: request;
            0  otherwise,
  where b_{-2} is the information of U_j defined in Eq. (8)    (28)

Similarly, we can construct feature functions for the discourse structure analysis model. For the sentential probability of the discourse structure analysis model, the unified feature function is identical to the separated feature function, since the whole context consists only of a speech act. Using the separated feature functions, we can alleviate the data sparseness problem when there are not enough training examples to which the unified feature function is applicable.
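Written as code, the indicator feature functions of Eqs. (24)–(28) are simply Boolean tests on the (a, b) pair. The sketch below is ours and is independent of the toolkit of [25]; a context b is modeled as a (speaker, syntactic pattern) pair for the sentential features and as the pair of hierarchically recent (speaker, speech act) items (U_j, U_k) for the contextual features.

```python
def unified_sentential(a, b):
    """Eq. (24): fires on the whole syntactic pattern of the utterance."""
    return int(a == "response" and
               b == ("User", ("decl", "pvd", "future", "no", "will", "then")))

def separated_sentence_type(a, b):
    """Eq. (25): fires on the Sentence Type feature alone."""
    speaker, pattern = b
    return int(a == "response" and (speaker, pattern[0]) == ("User", "decl"))

def unified_contextual(a, b_ctx):
    """Eq. (26): fires on the whole trigram history (U_j, U_k of Eq. (8))."""
    return int(a == "response" and
               b_ctx == (("User", "request"), ("Agent", "ask-ref")))

def separated_distance1(a, b_ctx):
    """Eq. (27): distance-1 bigram feature, testing U_k only."""
    return int(a == "response" and b_ctx[1] == ("Agent", "ask-ref"))

def separated_distance2(a, b_ctx):
    """Eq. (28): distance-2 bigram feature, testing U_j only."""
    return int(a == "response" and b_ctx[0] == ("User", "request"))
```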
4. Empirical Evaluation

4.1 Data Sets
We use a Korean dialogue corpus transcribed from recordings in real fields such as hotel reservation, airline reservation, and tour reservation. This corpus consists of 528 dialogues and 10,285 utterances (19.48 utterances per dialogue). We annotated each utterance in the dialogues in two steps. In the first step, we automatically annotated each utterance with its syntactic pattern (ST) using a conventional parser [22], as mentioned in Sect. 2.1. In the second step, we manually annotated each utterance with discourse knowledge such as the speech act (SA) and discourse structure (DS) information. Figure 2 shows a part of the annotated dialogue corpus. In Fig. 2, KS represents the Korean sentence and EN the translated English sentence. SP has the value "User" or "Agent", depending on the speaker.

Fig. 2  A part of the annotated dialogue corpus.

Manual tagging of the speech acts and discourse structure information was done by graduate students majoring in dialogue analysis and was post-processed for consistency. The inter-coder agreement scores of speech act tagging and discourse structure tagging were 92% and 88%, respectively. The classification of speech acts is very subjective, without an agreed criterion; in this paper, we classified the 17 types of speech acts that appear in the dialogue corpus. Table 5 shows the distribution of speech acts in the annotated dialogue corpus.

Table 5  The distribution of speech acts in the corpus.

  Speech act type | Ratio (%) | Speech act type     | Ratio (%)
  Accept          | 2.49      | Introducing-oneself | 6.75
  Acknowledge     | 5.75      | Offer               | 0.40
  Ask-confirm     | 3.16      | Opening             | 6.58
  Ask-if          | 5.36      | Promise             | 2.42
  Ask-ref         | 13.39     | Reject              | 1.07
  Closing         | 3.39      | Request             | 4.96
  Correct         | 0.03      | Response            | 24.73
  Expressive      | 5.64      | Suggest             | 1.98
  Inform          | 11.90     | Total               | 100.00

Discourse structures are determined by focusing on the subject of the current dialogue and are hierarchically constructed according to the subject. The discourse structure information tagged in the corpus is an index that represents the hierarchical structure of the discourse, reflecting the depth of indentation of the discourse segments. The proposed system transforms this index information into DST information
to acquire various statistical information.

To evaluate the proposed model, we divided the annotated dialogue corpus into a training corpus and a testing corpus at a ratio of four (422 dialogues) to one (106 dialogues), and performed 5-fold cross validation. Using the Maximum Entropy Modeling Toolkit [25], we estimated the model parameter \alpha_i corresponding to each feature function f_i in Eq. (23). We experimented with two variants of each analysis model: Model-I uses only the unified feature functions, and Model-II uses the unified and separated feature functions together. Among the possible ways to combine the unified and separated feature functions, we chose the combination in which a separated feature function is used only when there is no training example to which the unified feature function is applicable, as sketched below.
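Model-II's combination rule can be read as a simple back-off at scoring time. The sketch below is our interpretation, with invented names; the paper does not give the implementation.

```python
def score(a, b, unified_feats, separated_feats, alphas, seen_contexts):
    """Back-off scoring in the spirit of Model-II: evaluate Eq. (23) with the
    unified feature functions when the whole context b occurred in training,
    otherwise with the separated (partial-context) feature functions.
    alphas maps each feature function to its learned parameter alpha_i."""
    feats = unified_feats if b in seen_contexts else separated_feats
    p = 1.0
    for f in feats:
        p *= alphas[f] ** f(a, b)
    return p   # unnormalized; normalize over all a' to get P(a|b) (Eq. (22))
```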
4.2 Performance Evaluation

First, we tested the speech act analysis model and the discourse structure analysis model using the same training and test sets. Table 6 shows the results for each analysis model. In Table 6, the results of speech act analysis were obtained using the correct structural information of the discourse, i.e., the DSTs as marked in the annotated dialogue corpus. Similarly, the results of discourse structure analysis were obtained using the correct speech act information from the annotated dialogue corpus. As shown in Table 6, the proposed models show better results than previous works such as Lee (1997) [13], Kim (2003) [26], and Kim (1998) [27]. Model-II shows better results than Model-I in all cases, which reveals that the separated feature functions are effective against the data sparseness problem.

Table 6  Results of speech act and discourse structure analysis.

  Candidates | Average precision of speech act analysis (%) | Average precision of discourse structure analysis (%)
  Lee (1997) | 75.1  | –
  Kim (2003) | 77.11 | –
  Kim (1998) | –     | 81.00
  Model I    | 81.81 | 81.41
  Model II   | 83.57 | 83.11

We then tested the integrated dialogue analysis model, in which the speech act and discourse structure analysis models are combined. The integrated model uses Model-II for each analysis model because it showed the better performance. In this model, after the system determines the speech act and DST of an utterance, it uses the results to process the next utterance, recursively. The experimental results are shown in Table 7. As shown in Table 7, the results of the integrated model are worse than those of the individual analysis models: the performance of speech act analysis fell by about 2.89% and that of discourse structure analysis by about 7.07%. Nevertheless, the integrated model still shows better performance than previous work on speech act analysis.

Table 7  Results of the integrated dialogue analysis model.

  Average precision of speech act analysis (%) | Average precision of discourse structure analysis (%)
  80.68 | 76.04

We analyzed the cases in which the speech act analysis model and the discourse structure analysis model failed to return correct results. The failures of speech act analysis have the following causes.

• Generalization error: The conventional parser may convert utterances with different speech acts into an identical syntactic pattern. For example, "I want to reserve a ticket for a flight to New York." and "I want to obtain travel information." have different speech acts (request and inform, respectively), but may be converted into the identical syntactic pattern [decl, pvg, present, no, want, none]. In this case, the speech act analysis model fails
to discriminate between the two utterances. To resolve this problem, we should supplement the pattern with more informative syntactic features.

• Approximation error: We assumed that the speech act of the current utterance is affected by the syntactic pattern of the utterance and the previous speech acts. However, we found that we should sometimes consider the previous utterances themselves, as well as the previous speech acts, in order to determine the correct speech acts. In the following example, utterance (3) has several plausible surface speech acts, such as inform and response.

(1) Agent: What are your credit card type and its number? (cf. Cash or charge?)
(2) User: A VISA card.
(3) User: 123-456-7890

Such an ambiguity can be resolved by considering the content of utterance (1). If we consider only the speech act of utterance (1), we cannot determine the speech act of utterance (3) to be response. If the agent had said "Cash or charge?", the speech act of utterance (3) would be inform.

Most failures of the discourse structure analysis occurred when the speech act of the current utterance was inform or request. After analyzing the failure cases, we found that the failures were caused by biased probabilities. In other words, if an utterance has inform or request as its speech act, the utterance will probably be annotated with DE or SS, because the speech act affects the determination of the DST with high probability compared with contextual information such as the previous speech acts and the previous discourse structures. In the following example, utterance (2) should be annotated with DC if we consider the correlation of content between utterances (1) and (2). However, the proposed model may determine the DST of utterance (2) to be DE or SS, because request, the speech act of utterance (2), is tightly associated with DE and SS.

(1) User: I made a reservation for a flight to Seoul on 15 September. (Inform)
(2) User: I'd like to cancel it. (Request)

To reduce this problem, we should consider more informative contextual information.

5. Conclusion

We proposed a statistical dialogue analysis model that can perform both speech act analysis and discourse structure analysis using a maximum entropy model. We defined DSTs to represent the structural relationship of discourse between two consecutive utterances in a dialogue, and we used them to analyze statistically both the speech act of an utterance and the discourse structure of a dialogue. In the experiments, the proposed model showed better results than previous works, and Model-II showed better results than Model-I in all cases. The experiments also showed that, by using the
separated feature functions together with the unified feature functions, the proposed model alleviates the data sparseness problem and improves performance. We believe that the model can analyze dialogues more effectively than previous works because it handles speech act analysis and discourse structure analysis at the same time within the same framework.

Acknowledgments

This research was supported by the Brain Neuroinformatics Research Program sponsored by the Korean Ministry of Science and Technology.

References

[1] J. Allen, Natural Language Understanding, The Benjamin/Cummings Publishing Company, 1987.
[2] D.J. Litman and J.F. Allen, "A plan recognition model for subdialogues in conversations," Cognitive Science, vol.11, no.2, pp.163–200, 1987.
[3] S. Carberry, "A pragmatics-based approach to ellipsis resolution," Computational Linguistics, vol.15, no.2, pp.75–96, 1989.
[4] E.A. Hinkelman, Linguistic and Pragmatic Constraints on Utterance Interpretation, Ph.D. Dissertation, University of Rochester, Rochester, New York, 1990.
[5] L. Lambert and S. Carberry, "A tripartite plan-based model of dialogue," Proc. ACL, pp.47–54, 1991.
[6] L. Lambert, Recognizing Complex Discourse Acts: A Tripartite Plan-Based Model of Dialogue, Ph.D. Dissertation, The University of Delaware, Newark, Delaware, 1993.
[7] H. Lee, J. Lee, and J. Seo, "Speech act analysis model of Korean utterances for automatic dialog translation," J. Korea Information Science Society (B): Software and Applications, vol.25, no.10, pp.1443–1552, 1998.
[8] H. Kim and J. Seo, "Automatic extraction of a syntactic pattern for an analysis of speech act: A neural network model," Proc. International Conference on Neural Information Processing, 2000.
[9] S. Lee and J. Seo, "An analysis of Korean speech act using hidden Markov model with decision trees," Proc. ICCPOL 2001, pp.397–400, 2001.
[10] M. Nagata and T. Morimoto, "First steps toward statistical modeling of dialogue to predict the speech act type of the next utterance," Speech Commun., vol.15, pp.193–203, 1994.
[11] M. Nagata and T. Morimoto, "An information-theoretic model of discourse for next utterance type prediction," Trans. Information Processing Society of Japan, vol.35, no.6, pp.1050–1061, 1994.
[12] N. Reithinger and M. Klesen, "Dialogue act classification using language models," Proc. EuroSpeech-97, pp.2235–2238, 1997.
[13] J. Lee, J. Seo, and G.C. Kim, "A dialogue analysis model with statistical speech act processing for dialogue machine translation," Proc. Spoken Language Translation (Workshop in conjunction with (E)ACL '97), pp.10–15, 1997.
[14] K. Samuel, S. Carberry, and K. Vijay-Shanker, "Computing dialogue acts from features with transformation-based learning," Applying Machine Learning to Discourse Processing: Papers from the 1998 AAAI Spring Symposium, pp.90–97, Stanford, California, 1998.
[15] Y.-H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, Reading, MA, 1989.
[16] B.J. Grosz and C.L. Sidner, "Attention, intentions, and the structure of discourse," Computational Linguistics, vol.12, no.3, pp.175–204, 1986.
[17] M.J. Collins, "A new statistical parser based on bigram lexical dependencies," Proc. 34th Annual Meeting of the Association for Computational Linguistics, pp.184–191, 1996.
[18] E. Charniak, Statistical Language Learning, A Bradford Book, The MIT Press, Cambridge, Massachusetts, 1993.
[19] E. Charniak, C. Hendrickson, N. Jacobson, and M. Perkowitz, "Equations for part-of-speech tagging," Proc. Eleventh National Conference on Artificial Intelligence, pp.784–789, 1993.
[20] E.A. Hinkelman and J.F. Allen, "Two constraints on speech act ambiguity," Proc. 27th Annual Meeting of the Association for Computational Linguistics, pp.212–219, 1989.
[21] T. Andernach, "A machine learning approach to the classification of dialogue utterances," Proc. NeMLaP-2, pp.98–109, 1996.
[22] C. Kim, J. Kim, J. Seo, and G.C. Kim, "A right-to-left chart parsing for dependency grammar using headable path," Proc. 1994 International Conference on Computer Processing of Oriental Languages (ICCPOL), pp.175–180, 1994.
[23] M.A. Walker, "Limited attention and discourse structure," Computational Linguistics, vol.22, no.2, pp.255–264, 1996.
[24] J.C. Reynar and A. Ratnaparkhi, "A maximum entropy approach to identifying sentence boundaries," Proc. Fifth Conference on Applied Natural Language Processing, pp.16–19, 1997.
[25] E. Ristad, "Maximum entropy modeling toolkit," Technical Report, Department of Computer Science, Princeton University, 1996.
[26] H. Kim and J. Seo, "An efficient trigram model for speech act analysis in small training corpus," J. Cognitive Science (JCS), vol.4, no.1, pp.107–120, 2003.
[27] H. Kim, H. Lee, and J. Seo, "Analysis of discourse structure using neural network in dialogue sentences," Proc. 15th KSCSP, vol.15, no.1, pp.419–424, 1998.
Won Seug Choi received the B.S. and M.S. degrees in Physics and Computer Science from Sogang University in 1994 and 1999, respectively. He is now a Ph.D. student in the natural language processing laboratory at Sogang University. His research interests include natural language processing, information retrieval, statistical methods for NLP, dialogue understanding, and natural language interfaces to databases.
Harksoo Kim is a principal research engineer at Diquest Inc. He received his Bachelor of Science degree in Computer Science from Konkuk University, Seoul, in 1996, the Master of Science degree in Computer Science from Sogang University in 1998, and the Ph.D. in Computer Science, with a major in Natural Language Processing, from Sogang University in 2003. He has developed a practical question-answering system and a dialogue understanding system at Diquest Inc. in Seoul. His research interests include natural language processing, information retrieval, question answering, and dialogue understanding.
Jungyun Seo is a full professor of computer science at Sogang University. He was educated at Sogang University, where he obtained a B.S. degree in Mathematics in 1981. He continued his studies at the Department of Computer Science of the University of Texas at Austin, receiving an M.S. and a Ph.D. in Computer Science in 1985 and 1990, respectively. He returned to Korea in 1991 to join the faculty of the Korea Advanced Institute of Science and Technology (KAIST) in Taejon, where he led the Natural Language Processing Laboratory in the Computer Science Department. In 1995, he moved to Sogang University in Seoul, and he became a full professor in March 2001. His research interests include multi-modal dialogues, statistical methods for NLP, machine translation, and information retrieval.