Syntactic Disambiguation by Using Categorial Parsing in a DOOD Framework

Werner Winiwarter, Osami Kagawa, Yahiko Kambayashi
Department of Information Science, Kyoto University, Sakyo, Kyoto 606-01, Japan

Abstract. We present a natural language interface for Japanese that relies on semantically driven parsing in that it applies syntactic analysis only if necessary for disambiguation. For this purpose we utilize a categorial parser which also analyzes incomplete or ungrammatical input efficiently. The complete linguistic analysis is performed by means of deductive object-oriented database (DOOD) technology so that we achieve an integrated framework with the target application. The interface has been applied successfully to the question support facilities of the VIENA Classroom hypermedia teaching system.

1 Introduction

In spite of the large amount of work on natural language interfaces, they are still far from widespread practical use (for a good recent survey see [1]). The reason is the many remaining limitations, which are caused by two main factors: missing customization, resulting in unexpected restrictions, and missing integration, responsible for insufficient performance and wrong interpretation [10]. Concerning the former, we stress the importance of empirically collecting the training and test data so that the system can be based on realistic input data to cover all relevant linguistic constructs. With regard to missing integration, we adapt the Integrated Deductive Approach [15] by designing the natural language interface as a component of a deductive object-oriented database system. We thereby achieve a complete integration of the linguistic analysis with the target application, which guarantees the consistent mapping of the user input to the corresponding semantic representation.

Another important reason why many previous attempts to build successful natural language interfaces failed is that the characteristics that set them apart from other applications of natural language processing were neglected: specific application domains with well-defined semantics, rather small delimited vocabularies, mappings to simple target representations, and short input sentences without complex linguistic phenomena but including misspellings and ungrammatical or incomplete statements. There has been a remarkable recent change of attitude in research on natural language processing and computational linguistics, away from the "toy problem syndrome" and towards building real-world end-user applications [6].

Following this line of argument, we developed a natural language interface for Japanese which has been applied successfully to the question support facilities of the VIENA Classroom system. VIENA Classroom is a distance education system in which the teaching material is prepared as hypermedia documents and presented to the students within a CSCW environment. The computed semantic representations are used for the retrieval of corresponding answers from a FAQ knowledge base or the collection of semantically equivalent new questions [16].

This paper focuses on the syntactic analysis in our interface, which is only applied if necessary for disambiguation. For this purpose we utilize a robust categorial parser which also analyzes incomplete or ungrammatical input efficiently (see also [2, 9]). The rest of this paper is organized as follows. First, we provide a brief overview of the interface architecture before we deal in detail with the developed categorial grammar formalism. Finally, we give an insight into the parsing algorithm and its implementation as well as into the applied methods for syntactic disambiguation.

2 Interface Architecture

The interface architecture shown in Fig. 1 consists of three main modules: morphological and lexical analysis; unknown value list (UVL) analysis and spelling error correction; and semantic, syntactic and pragmatic analysis.

[Fig. 1. Interface architecture. Processing chain: input; morphological analysis and lexical analysis (domain-independent lexicon, producing the DFL); UVL analysis and spelling error correction (domain-specific lexicon, producing the UVL); semantic analysis (SAM), syntactic analysis (grammar) and pragmatic analysis (focus); semantic representation.]

Morphological and lexical analysis operate directly on the Japanese questions which are entered by utilizing standard Japanese editor functionality, i.e. the students type the pronunciation of words after which the most probable choices for Kanji characters are presented for selection. Since Japanese script uses no spaces to separate individual words, tokenization is here no longer a trivial task.

By accessing a domain-independent lexicon, the input is transformed into a deep form list (DFL) which indicates for each token the surface form, category, subcategory and semantic deep form. For the creation of the domain-independent lexicon we adapt the lexical approach in that we store only canonical forms and assign to them all syntactic and semantic features. We also support a hierarchical structure of the lexicon by making use of the inheritance mechanisms of object-oriented database technology.

The second module includes UVL analysis and spelling error correction. The former deals with domain-specific terms that are part of the input. By separating the analysis of domain-independent and domain-specific terminology, we guarantee that the interface can easily be ported to other application domains. This also forms the basis for an efficient application of spelling error correction, which is restricted to domain-specific terms because they are much more susceptible to spelling errors and are also particularly important for the sentence meaning (for a more detailed description see [17]).

Semantic, syntactic and pragmatic analysis generate the semantic representation of the sentence by accessing the semantic application model (SAM). The latter provides activation rules for the selection of the correct semantic category on the basis of the DFL and UVL. If a semantic category is activated, the domain-specific terms of the UVL are used as arguments to fill the corresponding parameters of a semantic template, resulting in the semantic representation. Syntactic analysis is only applied if it is necessary for disambiguation (see Sect. 5). Finally, pragmatic analysis deals with incomplete input by keeping track of the actual focus of the user session.

As the implementation platform for the interface we use the deductive object-oriented database system ROCK & ROLL [4], which was developed at Heriot-Watt University. It solves the problem of updates in deductive databases by neatly separating the declarative logic query language ROLL from the imperative data manipulation language ROCK within the context of a common object-oriented data model. Another characteristic of ROCK & ROLL is that the data definition language makes a clean distinction between:

1. type declarations: describe the structural characteristics of a set of instance objects and the methods that can be applied to them,
2. class definitions: specify how the methods associated with a type are implemented.

3 Grammar

The categorial grammar formalism has a long history that reaches back to the early work of Bar-Hillel in the 1960s [3]. There exist many variations in notation and methodology such as Combinatory Categorial Grammar [12], Categorial Unification Grammar [14] or Lambek Categorial Grammar [8]. For recent extensions to the categorial grammar formalism see [5, 7]. The main common difference to other grammar formalisms is that all grammar rules are assigned to so-called categories, which can be divided into:

1. basic categories: associated with the entries in the lexicon,
2. complex categories: derived through the application of grammar rules.

The original categorial theory consisted of only two combinatory rules for the formation of complex categories (the rules read as: if category $A/B$ ($A\backslash B$) is directly followed (preceded) by $B$, it can be transformed into $A$):

1. forward functional application:
   $A/B \quad B \;\rightarrow\; A$,   (1)
2. backward functional application:
   $B \quad A\backslash B \;\rightarrow\; A$.   (2)
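The two original rules can be illustrated with a minimal Python sketch. It is not part of the system described in this paper; the class `Functor`, the function names and the categories `S` and `NP` are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Functor:
    """A functor category that seeks an argument to its right ("/") or left ("\\")."""
    result: "Category"
    slash: str
    arg: "Category"


# Basic categories are plain strings; everything else is a Functor.
Category = Union[str, Functor]


def forward_apply(left: Category, right: Category) -> Optional[Category]:
    """Rule (1): A/B directly followed by B yields A; otherwise None."""
    if isinstance(left, Functor) and left.slash == "/" and left.arg == right:
        return left.result
    return None


def backward_apply(left: Category, right: Category) -> Optional[Category]:
    """Rule (2): B directly followed by A\\B yields A; otherwise None."""
    if isinstance(right, Functor) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


# Hypothetical example categories.
s_seeking_np_right = Functor("S", "/", "NP")
s_seeking_np_left = Functor("S", "\\", "NP")

print(forward_apply(s_seeking_np_right, "NP"))
print(backward_apply("NP", s_seeking_np_left))
print(forward_apply(s_seeking_np_left, "NP"))
```

Each call combines exactly two adjacent categories, which is the restriction that the extensions introduced next are meant to remove.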

One main shortcoming of the original notation of categorial grammar is that a single application of a grammar rule can combine only two adjacent categories into a new category. To eliminate this deficiency we introduce several powerful extensions to the categorial grammar formalism which are not specifically designed for use with Japanese input but with a view to applications to a broad class of languages. As a first basic step we change the notation for the two functional applications:

1. if category $C$ is directly followed by $B$, then it can be transformed into $A$:
   $C : A \;\; /B$,   (3)
2. if category $C$ is directly preceded by $B$, then it can be transformed into $A$:
   $C : A \;\; \backslash B$.   (4)

With this notation it is possible to permit more than one category on the right side of the rule, so that several functional applications can be applied in one step, e.g.:

   $C : D \;\; \backslash A \;\; \backslash B$.   (5)
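The following Python sketch shows one possible reading of rule (5), in which the applications on the right side are checked one after the other, each moving one position further away from $C$. The tuple encoding of rules and the function `apply_rule` are assumptions made for this illustration; only the two direct sequence conditions are covered.

```python
from typing import List, Optional, Tuple

# A rule "C : D \A \B" is encoded as (source, derived, right side); each
# right-side entry is (sequence condition, applied category). Hypothetical encoding.
Rule = Tuple[str, str, List[Tuple[str, str]]]


def apply_rule(cats: List[str], pos: int, rule: Rule) -> Optional[List[str]]:
    """Apply `rule` to the category at index `pos`; return the new list or None."""
    source, derived, right_side = rule
    if cats[pos] != source:
        return None
    left, right = pos, pos              # outermost positions consumed so far
    for sequence, applied in right_side:
        if sequence == "\\":            # direct precedence: one step to the left
            left -= 1
            if left < 0 or cats[left] != applied:
                return None
        elif sequence == "/":           # direct succession: one step to the right
            right += 1
            if right >= len(cats) or cats[right] != applied:
                return None
        else:
            raise ValueError(f"unsupported sequence condition: {sequence}")
    # All consumed categories collapse into the single derived category.
    return cats[:left] + [derived] + cats[right + 1:]


rule_5 = ("C", "D", [("\\", "A"), ("\\", "B")])
print(apply_rule(["X", "B", "A", "C", "Y"], 3, rule_5))
print(apply_rule(["A", "B", "C"], 2, rule_5))
```

Under this reading the first call reduces `B A C` to `D` in one step, while the second call fails because the category directly preceding `C` is not `A`. A rule with an empty right side simply renames the category.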

The right side of the rule can also be left empty, which provides an easy way of specifying type raising [13]. Besides direct succession "/" and direct precedence "\" we provide the following additional sequence conditions to cover cases of free word order and long distance dependencies efficiently (see also [11]):

1. no condition on sequence of $C$ and $B$:
   $C : A \;\; B$,   (6)
2. indirect precedence, i.e. $B$ must precede $C$ but there can be several categories shifted between them:
   $C : A \;\; {>}B$,   (7)
3. indirect succession:
   $C : A$ [...]

[Fig. 3. Example grammar. Recoverable rules: QUE: Q \VP (>SUB) (/PER), priority 3; SAV: VP (\DO); SV: SAV \N; SV: V; TOP: SUB \NP. Further priority values shown in the figure: 5, 17, 15, 7.]

Figure 3 gives a small Japanese example grammar which we use throughout this paper. The dominance of the direct precedence sequence condition as well as the important role of particles as post-positional syntactic function words are characteristic of the Japanese language. Sahen verbs are derived from nouns by adding the irregular verb suru (to do). As an example of the internal representation, the instance objects for the question particle are displayed in Fig. 4.

[Fig. 4. Example of internal representation. Instance objects for the question particle: basic category with symbol QUE (complex_category: false) and mapping map_cat = particle, map_subcat = question; associated rule with derived_category Q, priority 3 and a right side of three applications: applied_category VP with sequence "\" and occurrence r, applied_category SUB with sequence ">" and occurrence o, applied_category PER with sequence "/" and occurrence o.]

4 Parser

The central object type for the parsing process is the constituent, which is defined according to Fig. 5. As an initializing step, all tokens in the DFL are assigned to a linear list of constituents by associating them with the corresponding basic categories according to the mapping rules (see Fig. 6 for an example). If more than one basic category applies (e.g. V and SV in our example), the more specific one is selected. An auxiliary start constituent and end constituent are inserted at the beginning and at the end of the list, and all references to child constituents are initialized with a nil constituent.

    type constituent:                   type declaration for constituents
      properties:
        public:
          constcat: category,           constituent category
          consttoken: token,            associated token from DFL
          constsucc: constituent,       right sibling
          constpred: constituent,       left sibling
          constsub: constituent;        first child
      ...
    end-type

Fig. 5. ROCK & ROLL type declaration for constituents

[Fig. 6. Example of assignment of tokens to constituents. Input: "What does the process save ?". The DFL lists surface form, category, sub-category and deep form for each token; the token categories are unknown, particle, pronoun, particle, noun, verb, particle, punctuation, and the resulting constituent list is ST N TOP PRO OBJ N SV QUE PER END.]

The actual parsing algorithm follows a bottom-up strategy. It avoids spurious ambiguity [18] by making use of the priority values assigned to each rule. Therefore, the deliberate choice of these priority values is of crucial importance to the efficiency of the parser. As a basic heuristic, the priority values decrease from the word to the sentence level and favor more complex constructs or exceptions (e.g. sahen verbs). The parsing algorithm consists of the following basic steps:

    repeat
      retrieve the set of candidate grammar rules associated with the current list of constituents;
      while set of grammar rules not empty and no successful derivation do
        check applicability of rule with maximum priority value;
        if applicable then
          derive new category and perform transformation of parsing tree
        else
          remove rule from set of candidates;
    until no more successful derivation of new category.
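The control loop above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the ROCK & ROLL implementation: constituents are reduced to a flat list of category symbols, every application is treated as required (the optional applications written in parentheses in the grammar are made mandatory), only the sequence conditions "\", "/" and ">" are handled, and all priority values in `GRAMMAR` except the value 3 for the question particle rule are invented for the example.

```python
from typing import List, Optional, Tuple

Application = Tuple[str, str]                    # (sequence condition, applied category)
Rule = Tuple[int, str, str, List[Application]]   # (priority, source, derived, right side)

# Rules taken from the running example; most priority values are hypothetical.
GRAMMAR: List[Rule] = [
    (20, "PRO", "NP", []),
    (17, "SV", "SAV", [("\\", "N")]),
    (10, "N", "NP", []),
    (8, "OBJ", "DO", [("\\", "NP")]),
    (7, "TOP", "SUB", [("\\", "NP")]),
    (5, "SAV", "VP", [("\\", "DO")]),
    (3, "QUE", "Q", [("\\", "VP"), (">", "SUB"), ("/", "PER")]),
]


def try_rule(cats: List[str], pos: int, rule: Rule) -> Optional[List[str]]:
    """Return the category list after applying `rule` at `pos`, or None."""
    _, source, derived, right_side = rule
    if cats[pos] != source:
        return None
    used = {pos}
    left = right = pos
    for sequence, applied in right_side:
        if sequence == "\\":                     # direct precedence
            left -= 1
            idx = left
        elif sequence == "/":                    # direct succession
            right += 1
            idx = right
        elif sequence == ">":                    # indirect precedence: search leftwards
            idx = next((i for i in range(left - 1, -1, -1)
                        if cats[i] == applied and i not in used), -1)
        else:
            raise ValueError(f"unsupported sequence condition: {sequence}")
        if not 0 <= idx < len(cats) or cats[idx] != applied:
            return None
        used.add(idx)
    return [derived if i == pos else cat
            for i, cat in enumerate(cats) if i == pos or i not in used]


def parse(cats: List[str], rules: List[Rule]) -> List[str]:
    """Repeat: try candidate rules in priority order until none derives anything."""
    cats = list(cats)
    derived_something = True
    while derived_something:
        derived_something = False
        candidates = [(rule, pos) for rule in rules
                      for pos, cat in enumerate(cats) if rule[1] == cat]
        candidates.sort(key=lambda item: item[0][0], reverse=True)
        for rule, pos in candidates:
            result = try_rule(cats, pos, rule)
            if result is not None:               # successful derivation: start over
                cats = result
                derived_something = True
                break
    return cats


print(parse(["N", "TOP", "PRO", "OBJ", "N", "SV", "QUE", "PER"], GRAMMAR))
```

The example also shows why the ordering of priorities matters: the sahen verb rule SV: SAV \N has to fire before N: NP, otherwise the noun preceding the verb would already have been turned into NP and the sahen verb could no longer be derived.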

Figure 7 shows the object type declaration of candidate, which is used to store the set of candidate grammar rules, and the associated ROCK method for checking the applicability of a candidate rule.

    type candidate:                                      type declaration for candidate rules
      properties:
        public:
          candconst: constituent,                        candidate constituent
          candrule: rule;                                candidate rule
      ROCK:                                              ROCK methods
        ...,
        applicable(cst: constituent,                     method for checking applicability of candidate rule
                   cend: constituent,                    and performing transformation of parsing tree
                   cnil: constituent): bool;
    end-type

    class E.candidate                                    persistent class definition
    public:
      ...
      applicable(cst: constituent, cend: constituent, cnil: constituent): bool
      begin
        var appl: bool;                                  applicability flag
        var cc: constituent;                             candidate constituent
        var cr: rule;                                    candidate rule
        var rs: applicationlist;                         right side of candidate rule
        var newcat: category;                            derived category from candidate rule
        var clist: [constituent];                        list of new child constituents
        var c: constituent;                              current constituent
        var seq: string;                                 sequence condition
        var occ: string;                                 occurrence condition
        var applcat: category;                           applied category
        var continue: bool;                              control flag for repetitive applications
        var ccat: category;                              current category
        appl := true;                                    initialize applicability flag
        cc := get_candconst@self;                        retrieve candidate constituent cc
        cr := get_candrule@self;                         retrieve candidate rule
        rs := get_rightside@cr;                          retrieve right side of candidate rule
        newcat := get_derived_category@cr;               retrieve derived category
        clist := [];                                     initialize list of new child constituents
        c := cc;                                         assign cc to current constituent c
        foreach x in rs do                               for each application in right side do
        begin
          seq := get_sequence@x;                         retrieve sequence condition
          occ := get_occurrence@x;                       retrieve occurrence condition
          applcat := get_applied_category@x;             retrieve applied category
          continue := true;                              initialize control flag
          if (seq = "\") then                            if sequence condition equals "\", then
            while (c ≠ cst) and (continue) do            while not arrived at start constituent and control flag is set
            begin
              continue := false;                         reset control flag
              c := get_constpred@c;                      retrieve left sibling of c
              ccat := get_constcat@c;                    retrieve current category ccat
              if (ccat = applcat) then                   if ccat equals applied category, then
              begin
                clist ++ c;                              insert c into clist
                if (occ = "*") then continue := true;    if occurrence condition equals "*", then set control flag for repetition
              end
              else                                       else (ccat differs from applcat)
                if (occ = "r") then appl := false;       if occurrence condition equals "r", then reset applicability flag
            end
          ...
        end
        if (appl) then ...                               if rule is applicable, then (transformation of parsing tree)
        appl                                             return applicability flag
      end
    end-class

Fig. 7. ROCK & ROLL code segment for test of applicability of grammar rule

[Fig. 8. Example of transformations of parsing tree. Starting from the constituent list ST N TOP PRO OBJ N SV QUE PER END, the figure shows the parsing tree after each application of the rules PRO: NP, SV: SAV \N, N: NP \MP*, OBJ: DO \NP, TOP: SUB \NP, SAV: VP (\DO) and QUE: Q \VP (>SUB) (/PER), until the sentence is reduced to the single category Q.]

The transformation of the parsing tree is performed by the following steps:

1. create new constituent $C_D$ for the derived category,
2. add candidate constituent $C_C$ as first child to $C_D$,
3. remove all entries $C$ in the list of new child constituents $C_{CH}$ from parsing tree,
4. replace $C_C$ by $C_D$ in parsing tree,
5. add all $C \in C_{CH}$ as right siblings to $C_C$.
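The five steps can be sketched in Python on a doubly linked sibling list with first-child pointers, mirroring the properties of the constituent type in Fig. 5. The class and function names are assumptions for illustration, and `None` stands in for the nil constituent used in the actual system.

```python
from typing import List, Optional


class Constituent:
    """Hypothetical Python counterpart of the constituent type in Fig. 5."""

    def __init__(self, cat: str):
        self.cat = cat
        self.pred: Optional["Constituent"] = None   # left sibling
        self.succ: Optional["Constituent"] = None   # right sibling
        self.sub: Optional["Constituent"] = None    # first child


def link(cats: List[str]) -> List[Constituent]:
    """Build a linear list of constituents from category symbols."""
    nodes = [Constituent(cat) for cat in cats]
    for a, b in zip(nodes, nodes[1:]):
        a.succ, b.pred = b, a
    return nodes


def unlink(c: Constituent) -> None:
    """Remove `c` from its sibling list."""
    if c.pred is not None:
        c.pred.succ = c.succ
    if c.succ is not None:
        c.succ.pred = c.pred
    c.pred = c.succ = None


def transform(cc: Constituent, new_children: List[Constituent],
              derived_cat: str) -> Constituent:
    cd = Constituent(derived_cat)          # step 1: new constituent for derived category
    cd.sub = cc                            # step 2: candidate becomes first child
    for c in new_children:                 # step 3: remove new children from the tree
        unlink(c)
    cd.pred, cd.succ = cc.pred, cc.succ    # step 4: derived constituent replaces candidate
    if cd.pred is not None:
        cd.pred.succ = cd
    if cd.succ is not None:
        cd.succ.pred = cd
    cc.pred = cc.succ = None
    last = cc                              # step 5: new children become right siblings
    for c in new_children:
        last.succ, c.pred = c, last
        last = c
    return cd


def siblings(first: Optional[Constituent]) -> List[str]:
    """Collect the category symbols of a sibling list from left to right."""
    out = []
    while first is not None:
        out.append(first.cat)
        first = first.succ
    return out


# Apply SV: SAV \N to the list ST N SV QUE END.
nodes = link(["ST", "N", "SV", "QUE", "END"])
sav = transform(nodes[2], [nodes[1]], "SAV")
print(siblings(nodes[0]), siblings(sav.sub))
```

After the call the top-level list reads ST SAV QUE END, and the new SAV constituent has SV as its first child with N as right sibling, which matches the shape of the sav entries in the tree notation of Fig. 9.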

Figure 8 shows a detailed example of the transformations of the parsing tree during the syntactic analysis of the example sentence from Fig. 6.

5 Syntactic Disambiguation

In our architecture syntactic analysis is applied to disambiguating sentences which are semantically so closely related that the difference cannot be decided only on the basis of semantic information (see Fig. 9 for an example).

    What does the process save ?
    [what, save, X]

    [(q, c, [
       (que, b, ),
       (vp, c, [
          (sav, c, [ (sv, b, ), (n, b, ) ]),
          (do, c, [ (obj, b, ), (np, c, [ (pro, b, ) ]) ]) ]),
       (sub, c, [ (top, b, ), (np, c, [ (n, b, ) ]) ]),
       (per, b, ) ])]

    What kind of process do you save ?
    [what_kind, X, save]

    [(q, c, [
       (que, b, ),
       (vp, c, [
          (sav, c, [ (sv, b, ), (n, b, ) ]),
          (do, c, [ (obj, b, ),
                    (np, c, [ (n, b, ),
                              (mp, c, [ (mod, b, ), (np, c, [ (pro, b, ) ]) ]) ]) ]) ]),
       (per, b, ) ])]

Fig. 9. Example of two ambiguous sentences

Now, syntactic disambiguation rules are used to distinguish between the two cases. As an especially useful operator we defined syntactic dominance as follows (in contrast to the corresponding tree-theoretic concept). $X$ dominates $Y$ if one of the following conditions on the associated constituents $C_X$ and $C_Y$ is satisfied:

1. $C_X = C_Y$, or
2. $C_R$ is the set of right siblings of $C_X$ and $C_Y \in C_R$, or
3. $C_Y$ is a descendant of $C_X$, or
4. $C_Y$ is a descendant of any $C$, $C \in C_R$.

For the example in Fig. 9 the rule "X dominates what.df" and its negation are applied, where X signifies the domain-specific term "process" and what.df the value "what" for the deep form feature. Figure 10 gives the corresponding ROLL method; its invocation is formulated here as: dominates("what","df")@$C_X$.
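The dominance conditions above can be sketched in Python and tried on the two trees of Fig. 9. The sketch follows the recursive structure of a check over the constituent itself, its right siblings and its children. Because the surface tokens are missing from the tree notation, the deep forms attached to the basic constituents below (including "of" for the modifier particle) are assumptions for illustration, as are all class and function names.

```python
from typing import Optional


class Constituent:
    """Hypothetical constituent with category, deep form, right sibling, first child."""

    def __init__(self, cat: str, deep_form: Optional[str] = None):
        self.cat = cat
        self.deep_form = deep_form      # set for basic constituents only
        self.succ: Optional["Constituent"] = None
        self.sub: Optional["Constituent"] = None


def build(specs: list) -> Optional[Constituent]:
    """Link a list of (category, "b" or "c", payload) siblings; return the first one."""
    first = prev = None
    for cat, kind, payload in specs:
        node = Constituent(cat, payload if kind == "b" else None)
        if kind == "c":
            node.sub = build(payload)
        if prev is None:
            first = node
        else:
            prev.succ = node
        prev = node
    return first


def dominates(c: Optional[Constituent], deep_form: str) -> bool:
    """True if c, one of its right siblings, or any of their descendants agrees."""
    if c is None:
        return False
    return (c.deep_form == deep_form
            or dominates(c.succ, deep_form)
            or dominates(c.sub, deep_form))


def find(c: Optional[Constituent], deep_form: str) -> Optional[Constituent]:
    """Depth-first search for the basic constituent carrying `deep_form`."""
    if c is None:
        return None
    if c.deep_form == deep_form:
        return c
    return find(c.sub, deep_form) or find(c.succ, deep_form)


SAV = ("sav", "c", [("sv", "b", "do"), ("n", "b", "save")])

WHAT = [("q", "c", [
    ("que", "b", "question"),
    ("vp", "c", [SAV, ("do", "c", [("obj", "b", "object"),
                                   ("np", "c", [("pro", "b", "what")])])]),
    ("sub", "c", [("top", "b", "topic"), ("np", "c", [("n", "b", "process")])]),
    ("per", "b", "period")])]

WHAT_KIND = [("q", "c", [
    ("que", "b", "question"),
    ("vp", "c", [SAV, ("do", "c", [
        ("obj", "b", "object"),
        ("np", "c", [("n", "b", "process"),
                     ("mp", "c", [("mod", "b", "of"),
                                  ("np", "c", [("pro", "b", "what")])])])])]),
    ("per", "b", "period")])]

for tree in (WHAT, WHAT_KIND):
    c_x = find(build(tree), "process")
    print("what_kind" if dominates(c_x, "what") else "what")
```

In the first tree the constituent for "process" has neither right siblings nor children, so it does not dominate "what"; in the second tree the modifier phrase containing the pronoun is a right sibling of "process", so the rule succeeds and selects the what_kind reading.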

    class E.constituent
    public:
      ...
      agree(string, string)                        method for test of agreement
      begin
        agree(S, "df") :-
          Cat == get_constcat@Self,                retrieve constituent category
          get_complex_category@Cat == false,       fails if category is complex
          Token == get_consttoken@Self,            retrieve associated token
          get_tdeepform@Token == S;                test if deep form of token unifies with S
        ...
      end
      dominates(string, string)                    method for testing syntactic dominance
      begin
        dominates(S, T) :- agree(S, T)@Self;       test of agreement
        dominates(S, T) :-
          Succ == get_constsucc@Self,              retrieve right sibling
          Cat == get_constcat@Succ,                retrieve constituent category
          get_symbol@Cat =\= "nil",                fails if no right sibling
          dominates(S, T)@Succ;                    recursive call of method
        dominates(S, T) :-
          Sub == get_constsub@Self,                retrieve first child of constituent
          Cat == get_constcat@Sub,                 retrieve constituent category
          get_symbol@Cat =\= "nil",                fails if no child
          dominates(S, T)@Sub;                     recursive call of method
      end
    end-class

Fig. 10. ROCK & ROLL code segment for syntactic disambiguation

6 Conclusion

For quite some time AI approaches to syntactic analysis were dominated by the investigation of artificial, small-scale "toy problems". The understandable frustration of developers of real-world systems often resulted in the complete rejection of syntactic analysis, leading the way to oversimplified brute-force approaches. We think that we have found a reasonable compromise in that we apply syntactic analysis only where it is really necessary for disambiguation. By combining this concept with a powerful categorial grammar formalism, we developed an efficient parser within a deductive object-oriented framework. A first successful test of the feasibility of this approach is the use of our interface architecture as a component of the VIENA Classroom system.

Future work will concentrate on a detailed evaluation study of the coverage and performance of the parser in practical use. In particular we want to analyze the consequences of the proposed extensions to the categorial formalism as well as the degree of generality of the syntactic disambiguation rules.

References

1. Androutsopoulos, I., Ritchie, G.D., Thanisch, P.: Natural Language Interfaces to Databases – an Introduction. Journal of Natural Language Eng. (1994)
2. Ballim, A., Russell, G.: LHIP: Extended DCGs for Configurable Robust Parsing. Proc. of the Intl. Conf. on Computational Linguistics (1994)
3. Bar-Hillel, Y.: On Categorial and Phrase Structure Grammars. Bar-Hillel, Y. (ed): Language and Information. Addison-Wesley, Reading (1964)
4. Barja, M.L. et al.: An Effective Deductive Object-Oriented Database Through Language Integration. Proc. of the Intl. Conf. on Very Large Data Bases (1994)
5. Bouma, G., Noord, G.: Constraint-Based Categorial Grammar. Proc. of the Annual Meeting of the ACL (1994)
6. Cunningham, H., Gaizauskas, R.J., Wilks, Y.: A General Architecture for Language Engineering (GATE). Techn. Rep. CS-95-21, University of Sheffield (1996)
7. Hoffman, B.: The Formal Consequences of Using Variables in CCG Categories. Proc. of the Annual Meeting of the ACL (1993)
8. Lambek, J.: The Mathematics of Sentence Structure. Buszkowski, W., Marciszewski, W., Benthem, J. (eds): Categorial Grammar. John Benjamins, Amsterdam (1988)
9. Lavie, A.: An Integrated Heuristic Scheme for Partial Parse Evaluation. Proc. of the Annual Meeting of the ACL (1994)
10. McFetridge, P., Groeneboer, C.: Novel Terms and Coordination in a Natural Language Interface. Rhamani, S., Chandrasekar, R., Anjaneyulu, K.S.R. (eds): Knowledge Based Computer Systems. Springer, Berlin (1990)
11. Morrill, G., Solias, T.: Tuples, Discontinuity, and Gapping in Categorial Grammar. Proc. of the Conf. of the European Chapter of the ACL (1993)
12. Steedman, M.: Dependency and Coordination in the Grammar of Dutch and English. Language, Vol. 61 (1985)
13. Steedman, M.: Type-Raising & Directionality in Combinatory Grammar. Proc. of the Annual Meeting of the ACL (1991)
14. Uszkoreit, H.: Categorial Unification Grammars. Proc. of the Intl. Conf. on Computational Linguistics (1986)
15. Winiwarter, W.: The Integrated Deductive Approach to Natural Language Interfaces. Diss., University of Vienna (1994)
16. Winiwarter, W. et al.: Collaborative Hypermedia Education with the VIENA Classroom System. Proc. of the Australasian Conf. on Computer Science Education (to appear)
17. Winiwarter, W., Kagawa, O., Kambayashi, Y.: Multimodal Natural Language Interfaces for Hypermedia Distance Education – the VIENA Classroom System. Proc. of the Intl. Congress on Terminology and Knowledge Eng. (to appear)
18. Wittenburg, K.: Natural Language Parsing with Combinatory Categorial Grammar in a Graph-Unification Based Formalism. Diss., University of Texas (1986)