
Overcoming incomplete information in NLP systems - verb subcategorization

J. G. Pereira Lopes¹ and João Balsa²

¹ DI, FCT-UNL, Quinta da Torre, 2825 Monte da Caparica, Portugal
[email protected]

² DI, FC-UL, Bloco C5, Piso 1, Campo Grande, 1700 Lisboa, Portugal
[email protected]

Abstract. This paper presents a new methodology for overcoming the incomplete information available to current natural language parsers. Although our aim is more ambitious, we focus here on incomplete descriptions of the subcategorization classes of verbs and sketch a proposal for overcoming the same problem for other syntactic categories. We assume a hierarchical multi-agent system architecture in which each bottom-layer agent has specialised knowledge (a perspective) about the problems a given feature (e.g. verb subcategorization) of a syntactic category may exhibit. Each agent has a declarative description of those problems and can find better solutions for the parsing problem once it has obtained an explanation for it. We assume logic-based diagnosis agents. Each theoretically plausible hypothesis found must then be statistically validated. The pruning obtained and the ordering of the validated hypotheses then lead to a learning problem that must be solved in order to enable a natural evolution of parsers (and their lexicons).

1 Introduction/Motivation

In this paper we present a new methodology for overcoming the incomplete information available to current natural language parsers. In order to motivate our approach, it is necessary to situate it in the context in which we are developing it. The parsing system we are using involves three main components: a part-of-speech tagger ([6]), a pre-processing module and a chart-parser ([9]). The pre-processor uses the lexicon and some specialised grammars (for proper nouns, dates, numbers, ...). Whenever a sentence isn't completely parsed, the chart-parser connects to the diagnosis module. The architecture is shown in figure 1. Our concern in this paper is the diagnosis component, or at least part of it. When the chart-parser recognises, for a given input, only a sequence of partial parses, it communicates these results to the diagnosis module, which will then try to find possible explanations for the problem. The fact that the parser couldn't assign a single category classifying the input may occur either because the input has errors (lexical, syntactic, semantic, ...) or because the system itself is flawed (the lexicon and/or the grammar are either incomplete or incorrectly coded, or the tagger assigned an incorrect tag to a word - the existing tagger has 97% precision in the best cases). In order to overcome this problem we treat it as a diagnosis problem and solve it accordingly. It should be stressed that our approach brings together symbolic processing and probabilistic modelling of corpora for validating symbolically obtained results. In this paper, however, we will expand only on the symbolic side of NLP. The statistical modelling of corpora will be worked out in a companion paper by the first author and another author. One of the main motivations for our work is the fact that the lexicon we have available is incomplete with regard to word subcategorization information. As a matter of fact, all words have been assigned a single subcategorization class (they subcategorise nothing). As a result, the a priori subcategorization classes assigned to the input words are frequently wrong or incomplete (in the sense that some classes are more general than others - see below for some example definitions). To surmount the problems that result from flaws in the lexicon we devised a methodology with three main goals:

- to find theoretically plausible explanations for the problems, these explanations being statistically validated and ordered;
- to update the system as new problems are solved, incrementally eliminating its incompleteness; and
- to achieve the previous goals in an efficient way.

Obviously, due to the diversity of problems that can occur, the previously mentioned diagnosis module is somewhat complex. We therefore adopt a multi-agent approach ([2]) that enables some problems to be tackled in a much simpler way. The idea is to have a hierarchical multi-agent architecture whose head is the "chief" of the diagnosis process and whose bottom-layer agents have very specialised perspectives on the problems that may occur. In this paper we focus on a group of agents that have very specialised knowledge about the errors that may occur regarding a given feature of some syntactic category. More precisely, we illustrate our approach by describing the agent whose perspective is that a problem occurred in the subcategorization feature of verbs. The general architecture that supports our methodology is described elsewhere. From now on, unless otherwise mentioned, we will be describing the functioning of this "verb subcategorization" fault finder agent.

1.1 A Detailed Example

For illustrative purposes, we will assume the following very simple grammar:

s → np, vp.
np → det, np_nucleus.
np → np_nucleus.
np_nucleus → n(I), args(I).
vp → v(I), args(I).
pp → prep, np.
sub_clause → sub_conj, s.

Fig. 1. Context of the diagnosis process

where s, np, vp, det, np_nucleus, n, v, pp, prep, sub_clause and sub_conj stand respectively for sentence, noun phrase, verb phrase, determiner, noun phrase nucleus, noun, verb, prepositional phrase, preposition, subordinated clause and subordinated conjunction, and where args(I) stands for the arguments a verb or a noun should have if it belongs to subcategorization class I, where I is an identifier (a number, for example) for a subcategorization class. I ranges from 0 to some natural number. The args/1 predicate will be defined according to the subcategorization classes we want to consider. For instance, if we want to cover the classes:

0 - no arguments
1 - noun phrase or subordinated clause
2 - prepositional phrase
3 - noun phrase and prepositional phrase

then we will have to add the following rules to our grammar:

args(0) → [].
args(1) → [].
args(1) → np.
args(1) → sub_clause.
args(2) → [].
args(2) → pp.
args(3) → np, pp.
args(3) → pp, np.
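These definitions can be encoded compactly. The sketch below is a hypothetical Python rendering (not part of the system described in this paper): each class is represented by the set of argument sequences it licenses, so that "class i is more general than class j" becomes a proper-superset test.

```python
# Hypothetical encoding of the four subcategorization classes above:
# each class maps to the set of argument-category sequences it licenses.
ARGS = {
    0: {()},                                # no arguments
    1: {(), ("np",), ("sub_clause",)},      # optional np or sub_clause
    2: {(), ("pp",)},                       # optional pp
    3: {("np", "pp"), ("pp", "np")},        # np and pp, in either order
}

def more_general(i, j):
    """Class i is more general than class j if every argument frame
    licensed by j is also licensed by i (and i licenses more)."""
    return ARGS[j] < ARGS[i]
```

Under this encoding more_general(1, 0) and more_general(2, 0) both hold, matching the remark below that classes 1 and 2 are more general than class 0, while class 3 (which does not admit an empty argument list) is not.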

According to these definitions, it is clear that classes 1 and 2 are more general than class 0. Consider now the situation where, due to the lexicon's incompleteness, it is assumed that no verb subcategorises any syntactic category (every verb thus belonging to subcategorization class 0). The system will then be faced with situations like the one illustrated by the following example, which shows the partial parses obtained (bottom-up) for a sequence of words extending between positions 0 and v. We write c(i,j) for a partial parse of category c stretching from position i to position j; for verbs the subcategorization class comes first, as in v(0,p,q):

(1)  np(0,p), v(0,p,q), sub_conj(q,r), np(r,s), v(0,s,t), np(t,u), pp(u,v)

Assuming the previously mentioned subcategorization classes, the "verb subcategorization" agent will try to reduce the number of partial parses needed to cover positions 0 to v. It will do this by exploring alternative subcategorization classes for each of the verbs that occur in the sentence. As we have 4 classes and one has already been considered (class 0), we will have 3 alternatives for each verb. In general, for a given verb v(k,i,j), we will say that SCᵢ is a candidate fault type for it if assuming it belongs to class i will reduce the number of partial parses of the input, where i is any class other than k. Back to our example, it is clear that verb v(0,p,q) has only one candidate fault type, namely SC1. Let's recall this candidate:

(2)  v(0,p,q) → SC1

Assuming this fault type, it would be possible to parse a subordinated clause, sub_clause(q,t), and, as a consequence, a sentence, s(0,t). However, the words stretching from t to v are still not covered. Nevertheless the number of partial parses was reduced from 4¹ to 3:

(3)  s(0,t), np(t,u), pp(u,v)

On the other hand, verb v(0,s,t) has two candidate fault types, SC1 and SC3. With SC1, the only gain would be to parse a verb phrase vp(s,u) and a sentence s(r,u):

(4)  v(0,s,t) → SC1

while assuming SC3, it would be possible to parse a verb phrase, vp(s,v), and also a sentence, s(r,v):

(5)  v(0,s,t) → SC3

Yet, even with this choice it is not possible to parse a sentence from 0 to v. That will only be possible if we consider candidates (2) and (5) simultaneously. That will enable the parse of the sentence s(r,v), the verb phrase vp(p,v) and finally the sentence s(0,v), as desired. Although this agent is able, by itself alone, to solve the problem, the solutions it finds might not be the best ones. The whole process assumes that other agents are exploring other perspectives. For instance, another agent might be working on noun subcategorization. This agent will also find its own candidate solutions². Both agents will inform their immediate superior of their findings, and the latter will proceed with the statistical validation of the hypotheses in order to find the most plausible explanation. The statistical validation takes into account a process of inference of subcategorization classes from text corpora ([7]). While the statistical validation is performed, the previous solutions are sent (selectively) to other bottom-layer agents, so that they can improve them. In our example, although the "noun subcategorization" agent couldn't find a global solution, its solution in conjunction with (1) and (2) would allow the "verb subcategorization" agent to find additional solutions, including another full parse for the input. See figure 2 for a sketch of the global architecture. In the next sections we explain how the suggested approach is formalised (section 2); then, in section 3, we describe how the formal model was implemented, and in section 4 some conclusions are drawn, together with our perspectives for future work.

¹ Note that the example above (1) could be covered with only 4 partial parses. But, since we are interested in changing verb subcategorization, we showed the example with the 7 partial parses that are relevant for our illustration.
² For this example, the "noun subcategorization" fault finder agent could find a candidate fault type, SC2, for a possible n(0) in np(t,u).
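The candidate search just illustrated can be sketched in a few lines. The following is a simplified, hypothetical Python rendering (not the logic-programming implementation described in section 3): the already-parsed chunks of example (1) are treated as terminals, args(I) is expanded into one vp rule per class, and a dynamic program computes the minimal number of partial parses covering the input for a given assignment of classes to the verbs.

```python
from functools import lru_cache

# Re-creation of example (1): the noun phrases, pp and sub_conj are taken
# as already-parsed chunks (contextual data), so the token sequence below
# corresponds to the boundary positions 0, p, q, r, s, t, u, v.
TOKENS = ["np", "v", "sub_conj", "np", "v", "np", "pp"]

# vp -> v(I), args(I), with args(I) expanded per subcategorization class.
RULES = {
    "s": [["np", "vp"]],
    "sub_clause": [["sub_conj", "s"]],
    "vp": [["v0"],
           ["v1", "np"], ["v1", "sub_clause"],
           ["v2"], ["v2", "pp"],
           ["v3", "np", "pp"], ["v3", "pp", "np"]],
}
CATS = list(RULES) + ["np", "pp", "sub_conj", "v0", "v1", "v2", "v3"]

def min_cover(classes):
    """classes maps a verb's position to its assumed class; returns the
    minimal number of partial parses covering the whole input."""
    n = len(TOKENS)

    @lru_cache(maxsize=None)
    def ends(cat, i):
        # All positions j such that category cat spans positions i..j.
        out = set()
        if i < n:
            tok = TOKENS[i]
            wanted = f"v{classes[i]}" if tok == "v" else tok
            if cat == wanted:
                out.add(i + 1)
        for rhs in RULES.get(cat, []):
            frontier = {i}
            for sym in rhs:
                frontier = {j for k in frontier for j in ends(sym, k)}
            out |= frontier
        return frozenset(out)

    best = [0] * (n + 1)        # best[i]: fewest parses covering i..n
    for i in range(n - 1, -1, -1):
        best[i] = 1 + min(best[j] for c in CATS for j in ends(c, i))
    return best[0]
```

With both verbs left in class 0 the best cover uses 4 partial parses; moving the first verb to class 1 (candidate (2)) leaves 3; additionally moving the second verb to class 3 (candidate (5)) yields a single full parse.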

Fig. 2. Global diagnosis architecture

2 Formalization

As already mentioned, each agent performs a diagnosis task. Although the results of the various tasks are ultimately co-ordinated by a higher-level agent, each bottom-layer agent has a well defined methodology. We will now describe how the diagnosis task of a single bottom-layer agent is performed. Our supporting example is the "verb subcategorization" agent. As the goal is to find what went wrong with the parsing, that is to say, to diagnose a problem, we formalise the process as a Diagnostic Problem in the sense defined by Console and Torasso ([4]). Moreover, we solve the diagnostic problem using an abductive approach; thus what we need is to define an Abduction Problem (AP) with integrity constraints. Detailed definitions of these concepts, which were used in earlier experiments, can also be found in [3].

2.1 Defining the Abduction Problem

For the definition of a Diagnostic Problem (DP), it is necessary to provide three things: a description of the system to be diagnosed, a set (CXT) of contextual data that will be used in the diagnosis process, and a set (OBS) of observations to explain. The goal is to find explanations for the observations, according to some criteria. An Abduction Problem (with integrity constraints) is just a reformulation of a DP. Instead of the set OBS, in an AP there are two sets built from OBS: Ψ⁺, a subset of OBS, that includes the observations that must be covered³ by the explanations; and Ψ⁻, a set of negated atoms with which the explanation must be consistent. The set Ψ⁻ corresponds to the integrity constraints. Let's see these definitions in more detail.

System description. The system description has two parts: the behavior model (BM) and the definition of the system components (COMP). The behavior model must describe both the correct and the faulty behavior of the system. Firstly, to model the correct behavior, BM includes a simplified version of the grammar. As each agent has a specific perspective on the problem, it need not store all grammar rules but only those that influence its perspective. For instance, the "verb subcategorization" agent doesn't need the rules internal to categories other than vp. That is to say that, relative to the grammar presented earlier, this agent doesn't need details about noun phrases. Secondly, to model the faulty behavior, BM also needs a set of specific rules that will allow the system to recognize the transformed sentence as a result of the new assumptions that must be made (the explanations found). COMP is the set of system components, i.e., the items of the system that might be faulty; here, these are the partial parses the agent is interested in. For each sentence being analyzed a new set of components is generated. Considering the example of the previous section, the components of the system will be the two partial parses for the verbs: v(0,p,q) and v(0,s,t). Each component has associated with it a set of behavior modes: the correct mode and an incorrect mode (fault mode) for each fault type it admits. In our example, the fault modes for both components are fault_SC1, fault_SC2 and fault_SC3, where fault_SCᵢ corresponds to the fault: the subcategorization class should be i (instead of 0, in these examples).

³ The observations are covered by an explanation if they are derivable from the program together with the explanation.

Contextual Data. Contextual data are used for solving the problem but don't provide direct explanations for the observations. In the example, all partial parses for categories other than verbs are contextual data. Some predicates that are needed for the rules of BM are also part of this set (see section 3).

Observations. In our formalization we have only one observation that needs to be explained: the one corresponding to the fact that the sentence was only partially parsed. As we are interested in a purely abductive approach ([8]), for the definition of the AP we will have the previously mentioned sets defined as Ψ⁺ = OBS and Ψ⁻ = {} (since in our example we don't need any additional integrity constraint).
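Schematically, and assuming atoms are represented as plain strings, the DP-to-AP reformulation used here can be written as follows (the class and field names are hypothetical, for illustration only):

```python
from dataclasses import dataclass

# Schematic rendering of the definitions above; atoms are plain strings.
@dataclass
class DiagnosticProblem:
    description: object   # behavior model BM plus components COMP
    cxt: set              # contextual data
    obs: set              # observations to explain

@dataclass
class AbductionProblem:
    description: object
    cxt: set
    psi_plus: set         # observations that must be covered
    psi_minus: set        # negated atoms (integrity constraints)

def purely_abductive(dp):
    # The purely abductive reading used in this paper:
    # cover every observation, with no extra integrity constraints.
    return AbductionProblem(dp.description, dp.cxt, set(dp.obs), set())
```

For the running example, obs would hold the single "sentence only partially parsed" observation, and psi_minus stays empty.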

2.2 Building Explanations

From the components of the system and the fault modes they may assume, a set of terms is built. These terms have the form behavior_mode(comp). In our example, such terms will be, for instance, correct(v(0,p,q)) or fault_SC1(v(0,p,q)). These terms, which correspond to the application of a behavior mode to a component, are the abducibles, i.e., the elements of an explanation, and the ones abduced (assumed true) when the system runs will represent the explanations found. A set of abducibles is an explanation only if it covers the observations, i.e., if it is possible to derive the observations when they are assumed true. Note that, although each component has several possible behavior modes, at a particular moment it must assume one and only one. So, a solution for the Abduction Problem is an explanation (a set of abducibles) for the observations defined.
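A brute-force reading of this definition can be sketched as follows (hypothetical Python, for illustration; the actual system uses the abductive interpreter of section 3): assign exactly one behavior mode to each component and keep the assignments under which the observation is derivable.

```python
from itertools import product

COMPONENTS = ["v(0,p,q)", "v(0,s,t)"]
MODES = ["correct", "fault_SC1", "fault_SC2", "fault_SC3"]

def explanations(covers):
    """covers: predicate standing in for derivability of the observation
    from BM + CXT together with the assumed abducibles."""
    found = []
    for assignment in product(MODES, repeat=len(COMPONENTS)):
        # One and only one behavior mode per component.
        assumed = {f"{m}({c})" for c, m in zip(COMPONENTS, assignment)}
        if covers(assumed):
            found.append(assumed)
    return found
```

For instance, if derivability required exactly candidates (2) and (5) of section 1.1, covers would test for fault_SC1(v(0,p,q)) and fault_SC3(v(0,s,t)), and a single explanation would be returned out of the 16 possible mode assignments.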

3 Implementation

In order to implement the process formalized in the previous section, we take advantage of a program transformation defined by Alferes and Pereira ([1]). This transformation takes an abduction problem (as defined in section 2) and transforms it into a logic program extended with explicit negation and integrity constraints. The advantage of using this transformation is that we can use an existing interpreter, defined by Damasio et al ([5]), for this kind of logic programs. As a result of applying the interpreter to the program defined, the admissible explanations are obtained. The program transformation does two main things. For each component, comp, it adds a rule for each behavior mode it may assume. A rule⁴ for the correct mode:

correct(comp) ← not ab(comp).

and a rule for each of the fault modes:

faultX(comp) ← ab(comp), fault_mode(comp, faultX).

In the previous rules, comp stands for a generic component, faultX for a generic fault mode, and ab for abnormal. The second thing the transformation does is to add an integrity constraint that guarantees that the observation must be explained. It is a rule with the form:

⊥ ← not obs(...).

This is to say that if the observation cannot be proved there will be a contradiction. So the task of the interpreter will be to find the assumptions that will allow the observation to be proved. Besides this, there is also a set of integrity constraints that will prevent any component from assuming more than one behavior mode.

⁴ We use the symbol ← to define the rules of the logic program.
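The rule-generation step of the transformation can be sketched as a simple string rewriting (a hypothetical helper, not the actual implementation; "<-" stands in for the rule symbol):

```python
def mode_rules(comp, fault_modes):
    """Emit, as strings, the rules the transformation adds for one
    component: one for the correct mode and one per fault mode."""
    rules = [f"correct({comp}) <- not ab({comp})."]
    for fm in fault_modes:
        rules.append(f"{fm}({comp}) <- ab({comp}), fault_mode({comp}, {fm}).")
    return rules
```

Calling mode_rules("v(0,p,q)", ["fault_SC1", "fault_SC2", "fault_SC3"]) produces four rules, one per behavior mode of that component.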

3.1 The example implemented

Let's now see how the transformation defined applies to our problem. First, we assume that a representation of the grammar is already part of the logic program. As mentioned, we use only a simplified version of the previous grammar:

s → np, vp.
vp → rv(I), args(I).
pp → prep, np.
sub_clause → sub_conj, s.

Note that some of the original rules are omitted, as already explained. Note also the change of v to rv (recognized verb) to establish the connection with the subsequent rules. The rules for args/1 are the same as the previous ones. The second part (specific rules) will be, in our example agent, rules that enable the recognition of verbs under the new assumptions. We need only two rules:

rv(C,P1,P2) ← v(C,P1,P2), correct(v(C,P1,P2)).
rv(C,P1,P3) ← change_class(K,P1,P2,C,P1,P3), fault_sc(C,v(K,P1,P2)).

The first rule corresponds to the case where it is useful to assume that the subcategorization class is correct (the verb that is between positions P1 and P2 keeps its class, C). The second rule covers the faulty situation where the verb must change from class K to class C and, as a consequence, will then stretch between P1 and P3, instead of P1 and P2 as originally. These are generic rules for this agent. The predicate change_class/6 is part of CXT, as defined in the previous section. The rules that result directly from the program transformation depend on the sentence being parsed. For component v(0,p,q) we need the rules that cover each possible behavior mode:

correct(v(0,p,q)) ← not ab(v(0,p,q)).
fault_sc(K, v(0,p,q)) ← ab(v(0,p,q)), fault_mode(v(0,p,q), fault_sc(K)).

where K can be 1, 2 or 3 to represent each of the fault modes SC1, SC2 and SC3, and where p and q would be instantiated with the actual positions of the verb in the sentence. Finally, we must add the integrity constraint that will fire the process:

vp(C,p,P2), P2 ≠ q ← true.

This is an indirect (more efficient) way to define the integrity constraint that forces the system (the interpreter) to find the alternative candidates, i.e., to find parses for vp that improve on (stretch between more distant positions than) the ones already found.
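Operationally, this constraint demands that some vp parse starting at the verb's position improve on the one already found. As a hypothetical check over (start, end) spans, using numeric positions purely for illustration:

```python
def fires(vp_spans, p, q):
    """True when some vp parse starts at p and ends somewhere other than
    the originally found position q, i.e. an alternative candidate for
    the verb's subcategorization class exists."""
    return any(start == p and end != q for start, end in vp_spans)
```

With only the original vp span found, say (1, 2) for p = 1 and q = 2, the constraint is not satisfied; once a fault mode adds a longer span such as (1, 5), it is.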

4 Conclusions and Future Work

We described a methodology that can be used to overcome the problem of having incomplete or incorrectly coded natural language resources. To simplify the process, we use a distributed approach, in such a way that each agent has a very specific perspective on the problem to solve. This way, the overall complexity of the task is reduced (since in many cases good solutions can be found with a single perspective), and it is possible to take advantage of the parallelism allowed by a multi-agent system. We illustrated this by showing how the diagnosis process of the agent that deals with verb subcategorization works. Note that, since the perspective of each bottom-layer agent is very specific, the candidate solutions are found very quickly. The advantage is that all the agents work in parallel, and only in the worst case, when candidates from more than one agent must be combined, does an agent have to diagnose the same sentence more than once. Besides, we make a statistical validation of the hypotheses formulated by the bottom-layer agents. This validation is made using data from text corpora. One of the advantages of our methodology is that it enables a constructive improvement of the diagnosis system, in the sense that the addition of more perspectives will not interfere with those already defined. Another important feature of the system is that past experience is also taken into account, in the sense that if changing the subcategorization class of a verb leads to good results, that fact is recorded and transmitted to the Learning module. This will not be tackled in this paper.

References

1. J. J. Alferes and L. M. Pereira. Reasoning with Logic Programming. Springer Verlag, 1996.
2. J. Balsa. A hierarchical multi-agent system for natural language diagnosis. In Proceedings of the 13th European Conference on Artificial Intelligence (ECAI'98), pages 195-196. John Wiley and Sons, 1998.
3. J. Balsa, V. Dahl, and J. G. P. Lopes. Datalog grammars for syntactic error diagnosis and repair. In Proc. of the 5th NLULP Workshop, 1995.
4. L. Console and P. Torasso. A spectrum of logical definitions of model-based diagnosis. Computational Intelligence, 7(3):133-141, 1991.
5. C. Damasio, W. Nejdl, and L. M. Pereira. Revise: An extended logic programming system for revising knowledge bases. In Proc. of KR'94. Morgan Kaufmann, 1994.
6. N. M. Marques and J. G. P. Lopes. Using neural nets for Portuguese part-of-speech tagging. In Proc. of the 5th CSNLP Conference, September 1996.
7. N. M. Marques, J. G. P. Lopes, and C. Coelho. Learning verbal transitivity using loglinear models. In Proc. of the 10th ECML Conference. Springer Verlag, 1998.
8. D. Poole. Normality and faults in logic-based diagnosis. In Proceedings of the 11th International Joint Conference on Artificial Intelligence (IJCAI'89), 1989.
9. V. Rocio and J. G. P. Lopes. A layered approach to robust syntactic parsing. Working paper, 1998.