Computers" and the Humanities 22 (1988) 43--56. © 1988 by Kluwer Academic Publishers'.

Computerized Correction of Phonographic Errors

Jean Véronis

Groupe Représentation et Traitement des Connaissances, Centre National de la Recherche Scientifique, 31 chemin Joseph Aiguier, 13402 Marseille Cedex 9, France

Jean Véronis is a researcher in the Groupe Représentation et Traitement des Connaissances at the Centre National de la Recherche Scientifique in Marseille.


Abstract: When computers are confronted with text (C.A.I., lexicography, machine translation, information retrieval, office automation, etc.) spelling-error bottlenecks greatly decrease the efficiency of systems. In this paper, we demonstrate how spelling errors can be efficiently handled from an algorithmic, computer science point of view. We first show that the various types of spelling errors must receive different treatments, and we especially focus on phonographic errors, on which relatively little work has been done. We provide some quantitative data on the phonographic structure of language and develop a mathematical framework for modeling the various types of errors, including phonographic ones. Finally, we outline an application to fast retrieval of misspelt words in dictionaries, which can be implemented in natural-language interfaces to make computers interact more gracefully with non-specialists.

Key Words: Man-machine communication, natural-language processing, error correction, spelling, phonetics.

I. Introduction

The spelling errors that sometimes complicate human understanding of written texts have demanded both time and effort of pupils and teachers. They even become, from time to time, a matter of national concern, with regularly flourishing (and always unsuccessful) spelling reform movements. But even a text riddled with errors is generally understood by human readers, who are normally able to make the appropriate corrections, often by means of inferences from the context. When, however, computers are confronted with text (and they are in an expanding number of fields, such as computer-aided learning of languages, lexicography, machine translation, information retrieval and office automation), spelling-error bottlenecks greatly decrease the efficiency of systems. At the first garbled word they encounter, computers generally fail abruptly, frustrating users who are not computer scientists. Spelling errors may even lead to wrong results (in lexicography, for example). Moreover, thanks to intelligent and expert systems, artificial intelligence methods, etc., computers now have the potential to assist teachers in first- and second-language learning. But how can computer-aided instruction (C.A.I.) apply to language learning if computers are not able to handle language errors?

This paper intends to demonstrate how spelling errors can be efficiently handled from an algorithmic, computer science point of view. We first show that the various types of spelling errors must receive different treatments, and we especially focus on phonographic errors, on which relatively little work has been done. We provide some quantitative data on the phonographic structure of language and develop a mathematical framework for modeling the various types of errors, including phonographic ones. The proposed algorithm can be used in various tasks, such as C.A.I. systems for spelling learning with automatic correction and, particularly, automatic diagnosis of errors. Such systems could automatically memorize and classify the types of mistakes made by each student and provide computerized profiles and charts which should prove useful to teachers. Finally, we outline an application to fast retrieval of misspelt words in dictionaries, which can be implemented in natural-language interfaces to make computers interact more gracefully with non-specialists.



II. Typology of spelling errors

If texts are seen as strings of symbols (or, simply, strings), spelling errors can generally be thought of as alterations of strings. Several types of errors, however, due to completely different causes, must receive appropriate modeling and treatment (Figure 1). Since at least 1957, the very beginning of computer science, a considerable literature has been devoted to error recovery from garbled strings (for a review, see Peterson, 1980, and Pollock, 1982). The early works were oriented towards two particular types of errors. The first one is mainly due to technical problems in equipment, such as errors in input devices (e.g., optical character recognition), transmission errors, or information storage problems. These errors superimpose some 'noise' on texts, and we will call them noise errors. The second type is due to human mistyping on keyboards (a key is typed twice, the finger slips to a neighboring key, two keys are inverted, etc.). We will call these errors typographical errors.

In an often referred-to study, Damerau (1964) shows that 80% of errors in words belong to one of the following categories (Figure 2):
-- a letter is replaced by another,
-- a letter is omitted,
-- a letter is added,
-- two adjacent letters are reversed.

The first three errors can result from either of the previously mentioned causes, but the fourth is specifically a typographical one. In writing computer programs or indexing documents by means of keywords, these errors are often the only ones which occur, because the words used form limited lists of between a few dozen and a few hundred words. The same words are constantly repeated, and the operator (a specialist) knows exactly how to spell them. But when the general public uses computer services, the problems raised are of a different nature.

cause                 category         main types
equipment             noise            deletion, insertion, substitution
user (performance)    typographical    deletion, insertion, substitution, inversion
user (competence)     phonographic     grapheme substitution
user (competence)     grammatical      agreements, grammatical homophones

Figure 1: Typology of spelling errors.


Figure 2: Noise and typographical errors. [The figure illustrates insertion, deletion, inversion and substitution on a reference word R (poteau) and the resulting garbled words G.]

While noise and typographical errors are still present (noise-error frequency decreases, due to better equipment), they are coupled with other types of spelling errors. First, the writer may not know how to spell certain words, and the mistakes made are not only of a mechanical nature. For example, in French, ippeauttainnuze (!) instead of hypoténuse would be an approximate spelling of the word based on its phonetic form. We will call these errors phonographic errors. Second, grammatical errors can occur (agreements, etc.). In man-machine communication, the correction of these errors is far more important than the correction of noise or typographical errors. In fact, following an error message, the writer can correct noise errors, for which he is not responsible, and also typographical errors, which are his own but are only performance errors. On the contrary, he generally cannot correct phonographic or grammatical errors, which are competence errors.

Hereafter, we will leave aside grammatical errors, since their recovery requires an accurate syntactic analysis of texts by computers, which in turn requires accurate semantic processors and global understanding. These are at present long-term goals. Hence, we will focus on phonographic errors. As a final remark about this typology, we note that it is not always possible to determine precisely the source of errors from the resulting string. For example, letter doubling may result from noise, or from typographical or phonographic causes. Nonetheless, since it preserves recognition (e.g., potteau would be pronounced exactly like poteau), it can be treated without problem as phonographic. In fact, judging from the result, two major categories can be distinguished: 1) errors altering pronunciation, and hence recognition (most noise and typographical ones); 2) errors preserving pronunciation (mainly phonographic ones, or errors accidentally due to other sources but leading to the same result).


III. Comparison of strings: a tutorial

Comparing two strings of symbols may mean determining if they are equal or different. This easy-to-compute point of view is not at all sufficient. The French word poteau would probably be judged by many people as close to the garbled word ptotreau, whereas it would be judged as distant from azwerq. Equality or difference cannot handle this kind of judgment, which involves a similarity/dissimilarity point of view, i.e., a relation saying that poteau and ptotreau look more like each other than do poteau and azwerq. Handling such relations is a typical human ability which proves very difficult to reproduce on computers. A basic problem lies in quantifying the above-mentioned similarity relation, that is to say, in defining a distance between strings such that the distance between poteau and ptotreau, for example, will be smaller than that between poteau and azwerq. In fact, mathematicians offer us many possible distances, and the choice between them is difficult. The chosen distance must reflect the way humans judge strings as more or less similar. Despite many psychological studies, however, we cannot at present state human criteria with complete accuracy. Pragmatically, the idea of directly reproducing human strategies can be forsaken, and interest directed towards the causes of errors. A modeling of the error process can inform us of both the number of errors committed and their type. This is a first step towards quantification. Wagner and Fischer (1974) suggest a modeling by means of the edit operations needed to change one string into another. The allowed operations are:
-- substitution of one letter for another (e.g., poteau → poxeau);
-- insertion of a letter (e.g., poteau → potxeau);
-- deletion of a letter (e.g., poteau → poeau).

These operations model noise errors. Lowrance and Wagner (1975) propose an extension by including a fourth operation, the inversion of two adjacent letters. Hence, the set of four operations takes into account noise errors as well as typographical ones. For the sake of simplicity, we will outline the computation process with respect to the first three edit operations only; the reader can refer to the above-mentioned paper to see how it can be slightly modified so as to take inversions into account.

With each edit operation, Wagner and Fischer (1974) suggest associating a cost, which can be chosen according to the probability of errors. Figure 3 shows the most probable candidates for substituting the letter D, with regard either to a noise error (here, due to transmission) or to a typographical error (hitting the wrong key). In practice, a satisfactory simplification lies in assigning the same cost to each edit operation (i.e., each error), say one, for example.

Nevertheless, there remains the difficulty of determining the right sequence of operations for going from one string to the other. In the example poteau → ptotreau, there are many possible sequences, such as those represented in Figure 4. Some sequences are better than others, the worst being the deletion of all letters from the first string followed by the rewriting of all letters from the second! A good way to define the distance between strings is to consider the cost of the minimal-cost sequence of edit operations needed to change the first string into the other.

There is no need to try all possible sequences in turn: a very efficient computation can be based on dynamic programming (Bellman, 1957). This technique has proved useful in many fields, such as speech recognition. The distance is computed step by step. Given two strings x and y, we will denote by D[i, j] the distance between the two substrings composed of the first i characters of x and the first j characters of y, and by x[i] and y[j] the ith and jth characters of the respective strings. The basic idea is that the distance at point (i, j) of the analysis depends only upon:
-- what has already been computed at previous steps (D[i-1, j], D[i, j-1], D[i-1, j-1]);
-- the cost of the edit operations enabling us to go from these previous steps to the present one (Figure 5).


Figure 3: Nearest neighbours of the letter D. [Two panels. Noise in transmission: in computers, symbols are coded by a pattern of 8 binary digits (bits), and a very probable error consists in the alteration of one bit of the code for D. Slip of the finger: on keyboards, a very probable substitution error consists in hitting a key surrounding D.]

Put in a formula, this gives:

D[i, j] = min(
    D[i-1, j-1] + 0 if x[i] = y[j] (equal letters), or + 1 if x[i] ≠ y[j] (substitution),
    D[i-1, j] + 1 (deletion),
    D[i, j-1] + 1 (insertion) )

We can therefore calculate the final distance between x and y iteratively from D[0, 0] = 0, as shown in Figure 6.

At each step, the computer can memorize the hypothesis and provide, after complete computation, the best sequence of edit operations (there are sometimes several ex aequo candidates).
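Put as code, the recurrence can be sketched as follows in Python (our illustration, not the paper's implementation, which was written in Turbo Pascal; the function name and test values are ours):

def edit_distance(x: str, y: str) -> int:
    """Wagner-Fischer dynamic programming over the three edit operations.

    D[i][j] is the distance between the first i characters of x
    and the first j characters of y; every edit operation costs 1.
    """
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i  # i deletions turn a prefix of x into the empty string
    for j in range(1, n + 1):
        D[0][j] = j  # j insertions build a prefix of y from the empty string
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i-1][j-1] + (0 if x[i-1] == y[j-1] else 1),  # equal letters / substitution
                D[i-1][j] + 1,   # deletion
                D[i][j-1] + 1,   # insertion
            )
    return D[m][n]

print(edit_distance("poteau", "ptotreau"))  # 2: insert t, insert r
print(edit_distance("poteau", "azwerq"))    # 5: five substitutions

Keeping a back-pointer at each cell would additionally recover the best sequence of edit operations mentioned in the text.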


Figure 4: Sequences of edit operations. [Two of the many possible sequences of edit operations from poteau to ptotreau: one of total cost 4 (delete o, insert o, insert t, insert r) and one of total cost 2 (insert t, insert r).]

Figure 5: Computation of D[i, j] at a given step. [The present state D[i, j] is reached from previous states by an edit-operation hypothesis: substitution e → o gives D[i, j] = D[i-1, j-1] + 1; deletion of e gives D[i, j] = D[i-1, j] + 1; insertion of o gives D[i, j] = D[i, j-1] + 1.]

This mathematical framework is very badly suited to phonographic errors. For example, the wrong spelling ippeauttainnuze is very far from the right one, hypoténuse, with regard to the previous distance definition. In fact, the two strings are completely different, though it is obvious to any French speaker that they are pronounced exactly the same. We must therefore extend the notion of proximity between strings to take phonetic similarity into account. Before doing this, we need to perform an accurate observation of the phonographic structure of language, i.e., the system of relationships between sound and spelling.

Figure 6: Computation of the distance between two strings. [The dynamic-programming matrix for poteau against ptotreau; the lines correspond to edit operations: delete (cost 1), insert (cost 1), substitute (cost 1), equal letters (cost 0); the best match is shown in bold lines.]

IV. Phonographic structure of language

Precise quantitative data on French spelling (distribution of graphemes, etc.) are sorely needed in any attempt to computerize phonographic correction. Although we probably cannot aspire to exhaustiveness, since spelling-to-sound correspondences in French, as well as in English, are very complex (with a great number of very rare cases, and in perpetual motion through the adoption of foreign words and spellings), systems must aim for maximal efficiency (for example, 99% correction) with as little computational effort as possible. Attaining such a goal requires an accurate observation of the spelling material of the language concerned. To establish a precise quantitative inventory of phonographic material, we performed a computerized treatment of a lexicon containing the 3724 most frequent words in French. This gathering of data has subsequently proved useful to psycholinguists and to teachers (Véronis, 1986).

Before presenting the main results of this work, we must say a word about the French spelling system (much of which can also apply to English). Several studies (Gak, 1959, 1962; Horejsi, 1962, 1970, 1972; Blanche-Benveniste and Chervel, 1969; Thimonnier, 1967, 1970; Catach, 1973) have clearly stated that French spelling should be regarded as a system, even if it contains some local confused areas (exceptions). What emerged from these studies was the idea that the methods of analysis and formalization which had been developed for speech should be applied to writing. Consequently, many linguists and psycholinguists began to speak of graphemes, a word which has now joined the linguistic metalanguage. However, this word seems to be applied to various types of units, depending on the authors, and in the end has come to denote any letter or group of letters, without any functional criterion. We will adopt hereafter the following definition, provided by Horejsi (1970, 1972): the grapheme is the smallest


unit which has a phonemic counterpart in the spoken form of the word, that is to say, which cannot be broken down into smaller units having themselves phonemic counterparts. The phonemic counterpart of a grapheme is either a phoneme, or a group of phonemes, or even a "phonemic zero" (silent graphemes). For example, in French, sc is a grapheme in science (it is pronounced /s/ and cannot be broken down), but is not in escalade, where the graphemes are s (= /s/) + c (= /k/). Depending on the length of the grapheme, we will distinguish simple graphemes (s, o, p, etc.) and complex graphemes (sc, eau, pp, etc.). What makes spelling so difficult is that, on the one hand, the same phoneme can correspond to various graphemes (for example, /s/ = s, ss, sc, c, ç, x, etc.), and on the other hand, the same grapheme can correspond to various phonemes (s = /s/ in herse, but = /z/ in pose). Hence, only ordered pairs composed of a grapheme and its phonemic counterpart are representative of the function of letters or groups of letters in words. We will call these couples graphonemes, and we focused on them in the data gathering. Complete tables concerning the data gathering are published elsewhere (Véronis, 1986); herein we will focus on the main points of interest for spelling correction. First, we note the extreme inequality in the distribution of the various graphonemes. Among the 22317 graphoneme occurrences of the lexicon, 141 different graphonemes appear, but (r, /R/) (as in père) occurs 2153 times (that is, nearly 10% of the total), (rr, /R/) (as in terre) occurs only 72 times, and some graphonemes have only one occurrence (for example, (x, /z/) in deuxième). Figure 7 shows the cumulative frequency of the different graphonemes. We can see that: 1) the 11 most frequent graphonemes enable us to write 50% of the lexicon; 2) 42 graphonemes enable us to write 90% of the lexicon; 3) 90 graphonemes enable us to write 99% of the lexicon. The 51 less frequent graphonemes therefore account for only the remaining 1%. In addition, other graphonemes (about 30) do not even appear in any of the 3724 words of the lexicon, being restricted to one or two words in French (for example, capharnaüm).

Figure 7: Cumulative frequency of graphonemes. [The curve of % of the lexicon covered against the number of graphonemes rises steeply: 11 graphonemes cover 50%, 42 cover 90%, and 90 cover 99% of the lexicon, out of 141 graphonemes in all.]

As far as we can see, analogous statistics would apply to other alphabetic languages, such as English. These data are very important within the framework of phonographic correction. In fact, they mean that, despite its apparent complexity, the French phonographic material is nearly closed, and composed of fewer than 100 productive graphonemes. In man-machine communication, as well as in C.A.I., a system which could correctly handle this kernel would be assured of more than 99% efficiency. This clearly demonstrates that there is no need for a very large data base, provided it is chosen with accuracy.

V. Extension of the string-to-string correction problem to phonographic errors

Although many studies have aimed at detecting and correcting spelling errors, the very few that attack the problem of phonographic correction are based on more or less ad hoc methods. A rather simple idea consists in transcoding words into some phonetic form by means of grapheme-to-phoneme rules. But many studies (see the survey by Catach, 1984) have shown the difficulties of this approach: numerous rules, the need for syntactic information, large dictionaries of exceptions, etc. In addition, such rules can work only on words which respect the phonographic system, whereas garbled words, by definition, do not respect this system. This is particularly obvious when noise and typographical errors occur. For example, poteau will be transcribed to /poto/, but the garbled word poteu (deletion of a) to a different code, /potø/. Hence, we must find other methods.

A very early approach of some interest is the


Soundex method (Odell and Russell, 1918, 1922), which reduces all strings to a "Soundex code" of one letter and three digits, in such a way that similarly pronounced words correspond to the same code. The first letter of the code is the first letter of the word. Each following letter is replaced by a digit according to the table presented in Figure 8 (it applies to English). Finally, zeros are removed, identical consecutive digits are replaced by a single one, and digits beyond the third are deleted. Nevertheless, this approach is not highly satisfactory. Words such as wages and wadges, though pronounced in the same way, are not assigned the same code, and the method cannot work if noise, typographical and phonographic errors are combined, for the resulting codes are completely wrong. Other methods (Blair, 1960; Davidson, 1962) try to reduce words to skeleton keys, but they all suffer from analogous limitations. In addition, these methods are more or less makeshift, and we must find methods based on more theoretical grounds.

table                       example
0  A E I O U H W Y          SOUNDEX  → S005302  → S532
1  B F P V                  SONDECKS → S0530222 → S532
2  C G J K Q S X Z
3  D T
4  L
5  M N
6  R

Figure 8: The Soundex code.
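As an illustration, the coding just described can be sketched in a few lines of Python (our own sketch, following the order of operations stated above; the table is the English one of Figure 8, and all names are ours):

SOUNDEX_TABLE = {c: d for d, letters in
                 [("0", "AEIOUHWY"), ("1", "BFPV"), ("2", "CGJKQSXZ"),
                  ("3", "DT"), ("4", "L"), ("5", "MN"), ("6", "R")]
                 for c in letters}

def soundex(word: str) -> str:
    """One letter plus up to three digits, as in Figure 8."""
    word = word.upper()
    digits = [SOUNDEX_TABLE[c] for c in word[1:] if c in SOUNDEX_TABLE]
    digits = [d for d in digits if d != "0"]      # remove zeros
    merged = [d for k, d in enumerate(digits)     # replace identical
              if k == 0 or digits[k - 1] != d]    # consecutive digits by one
    return word[0] + "".join(merged[:3])          # keep at most three digits

print(soundex("SOUNDEX"), soundex("SONDECKS"))    # S532 S532
print(soundex("WAGES"), soundex("WADGES"))        # W2 W32

The last line reproduces the criticism made in the text: wages and wadges, though pronounced alike, receive different codes (W2 and W32 under this sketch).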

From a mathematical point of view, methods like Soundex are based on an equivalence relation between letters: equivalent letters are assigned the same code. This is a wrong assumption with regard to the phonographic structure of language. First, it is not letters which must be related, but substrings (roughly speaking, graphemes): in French, eau must be related to au, o, etc. Second, a relation ~ is said to be an equivalence relation if:
-- x ~ x (reflexivity);
-- x ~ y implies y ~ x (symmetry);
-- x ~ y and y ~ z implies x ~ z (transitivity).


In fact, transitivity must not be assumed. We have, in French, s ~ c (pronounced /s/, e.g., herse, cerise) and c ~ k (pronounced /k/, e.g., case, képi), but not s ~ k. Even symmetry is doubtful. We must not forget that the problem is not a symmetrical one. One string (we will call it the reference word) is a correct one, and thus corresponds to the usual phonographic rules, but the other is a wrong one, and thus corresponds only to what the writer assumes to be rules. For example, in French, we have a way to write the sound /wa/ as oê (e.g., poêle). Nevertheless, while the word poêle can very often be misspelt as poile, we can hardly expect the reverse mistake, i.e., a word such as moi misspelt as moê. This group of letters thus behaves as a "fossilized" one. Third, we must find a method which handles the possibility of noise and typographical errors combined with phonographic ones, as we have previously said.

We thus propose a comparison of strings based on the calculation of a dissimilarity index. We do not use the word 'distance', for the index does not satisfy the precise mathematical definition of one, but this is of no relevance to our concern. We first need to define a similarity relation between substrings. Similar substrings, denoted x ~ y, are two minimal substrings having the same phonetic value. By minimal, we mean that these substrings cannot be broken down into smaller similar substrings. Thus, we have p ~ p, o ~ au, t ~ tt, etc., but not po ~ pau, oto ~ autto, etc. In a majority of cases these correspond to graphemes, but sometimes they may involve larger units, e.g., in French, gn ~ ni, due to the instability of the phoneme /ɲ/, often replaced by /nj/. The table in Figure 9 gives a part of this relation for French. The complete table comprises about a hundred entries.

The index computation algorithm is an extension of Wagner and Fischer's process, to which we add another type of edit operation, consisting in replacing a substring by a similar substring. We thus have two groups of edit operations, corresponding to errors altering recognition (noise and typographical errors) or not altering recognition (phonographic errors). Basically, errors altering recognition are modeled by high-cost edit operations, whereas errors preserving recognition are modeled by low-cost edit operations. In order to simplify the rest of the

Figure 9: Similar substrings. [A fragment of the similarity table for French: the rows and columns are the substrings a, au, e, eau, o, p, pp, t, tt, u, and crosses mark similar pairs such as a ~ a, au ~ eau, au ~ o, eau ~ o, p ~ pp, t ~ tt.]

paper, we will assume high costs = 1 and low costs = 0, but more elaborate assumptions can be made (for example, on the basis of the probability of errors). For the same reason, we will leave aside inversions, but they are very easy to add. The algorithm which enables us to calculate the dissimilarity index between two strings is also based on dynamic programming. Given the comparison of x and y, if similar substrings exist which end at point (i, j), let pk, qk (k = 1 to n) be the lengths of these different couples of substrings. The evaluation of the index D[i, j] is the following (Figure 10):

D[i, j] = min(
    D[i-p1, j-q1], ..., D[i-pn, j-qn]   (similar substrings 1 to n, only if any exist),
    D[i-1, j-1] + 1   (substitution, only if x[i] ≠ y[j]),
    D[i-1, j] + 1   (deletion),
    D[i, j-1] + 1   (insertion) )

We can therefore calculate the final dissimilarity index between the two strings iteratively from D[0, 0] = 0, as shown in Figure 11. The computer can keep track of the best sequence of edit operations, and we will show hereafter that this provides an error diagnosis useful in C.A.I. This method can be refined in various ways, in particular by taking the context of substrings into account (for example, in French, g is pronounced as j only before e, i, y). This feature should be added in a nonsymmetrical way, and should concern only reference words, since it is well

known that many errors result from the nonobservation of these contexts. In addition, practical methods can be proposed to find similar substrings as quickly as possible at any point.
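A minimal Python sketch of this computation, assuming high cost = 1 and low cost = 0 as above, might read as follows; the SIMILAR table is only a tiny fragment of the hundred-entry French table, ordered as (reference substring, garbled substring), and all names are ours:

# Fragment of the similarity relation: (reference substring, garbled substring).
SIMILAR = [("eau", "o"), ("eau", "au"), ("au", "o"), ("o", "au"),
           ("t", "tt"), ("p", "pp")]

def dissimilarity(ref: str, garbled: str) -> int:
    """Dissimilarity index between a reference word and a garbled word."""
    m, n = len(ref), len(garbled)
    INF = m + n + 1          # larger than any possible index
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = INF
            if i > 0 and j > 0:  # equal letters (cost 0) or substitution (cost 1)
                best = min(best, D[i-1][j-1] + (0 if ref[i-1] == garbled[j-1] else 1))
            if i > 0:
                best = min(best, D[i-1][j] + 1)  # deletion
            if j > 0:
                best = min(best, D[i][j-1] + 1)  # insertion
            for u, v in SIMILAR:  # similar substrings ending at (i, j): cost 0
                p, q = len(u), len(v)
                if i >= p and j >= q and ref[i-p:i] == u and garbled[j-q:j] == v:
                    best = min(best, D[i-p][j-q])
            D[i][j] = best
    return D[m][n]

print(dissimilarity("poteau", "potteau"))  # 0: tt ~ t preserves pronunciation
print(dissimilarity("poteau", "poteu"))    # 1: one noise/typographical deletion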

VI. Application to man-machine dialogue

The aim of the proposed algorithm was essentially to establish the theoretical concept of a dissimilarity index between strings, enlarged to include phonographic errors. In its outlined form, it can be used in tasks such as error diagnosis in C.A.I. for spelling learning. Figure 12 shows how such a system could be designed. The computer has a bank of dictations and exercises, which are offered to students. Typically, a text is dictated phrase by phrase by a speech synthesizer, and the student types each one on the keyboard. The wrong words are immediately indicated to the student, who can interactively correct them. Errors can be detected and intelligently diagnosed by the previous algorithm, and recorded for the teacher's study, with automatic classification and statistics. In addition, various types of help can be provided (precise location of errors inside words, grammar rules, meaning, etc.).

Nevertheless, this algorithm is hardly more adapted than that of Wagner and Fischer (1974) to the retrieval of garbled words in rather long lists (several hundred words), such as dictionaries in man-machine dialogue. To achieve a more efficient algorithm, we have to add appropriate restrictions. We begin with a known algorithm (Morgan, 1970; Durham, Lamb and Saxe, 1983) whose aim is to look up, in a dictionary, words which have typographical errors only, and we adapt it to both typographical and phonographic errors. The only condition stipulated by these authors is that the given string has no more than one typographical mistake. This assumption can be made, since it covers the large majority of cases: two typographical errors rarely occur in the same word (Pollock and Zamora, 1983). We will adopt the same condition, no more than one typographical error per word, while within a word we accept an unlimited number of phonographic errors. Words which are as incorrectly spelt as ippeauttainnuze (hypoténuse) must be perfectly

recognized.

Figure 10: Computation of D[i, j] with similar substrings. [At the point (i=6, j=7) of an analysis, the present state can be reached at cost 0 through the similar substrings eau ~ o (D[i, j] = D[i-3, j-1]) or au ~ o (D[i, j] = D[i-2, j-1]), or at cost 1 through deletion of u (D[i, j] = D[i-1, j] + 1), insertion of o (D[i, j] = D[i, j-1] + 1), or substitution u → o (D[i, j] = D[i-1, j-1] + 1).]

The substance of the algorithm proposed by Morgan (1970) and Durham et al. (1983) is the following (a Python sketch follows the list):

1) the longest common initial substring is established, e.g.:
   pot | eau
   pot | reau

2) the four following hypotheses are tested:
   -- the next two adjacent letters are transposed,
   -- the next letter is missing,
   -- the next letter is added (as in the example),
   -- the next letter is substituted;

3) in each case, the tail substrings are matched, e.g.:
   pot  | eau
   potr | eau
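A minimal Python rendering of this one-error test (our own sketch; the function name is ours, and the real algorithm interleaves the test with the dictionary lookup):

def one_typo_match(ref: str, garbled: str) -> bool:
    """At most one typographical error (Damerau's four categories)?"""
    # 1) longest common initial substring
    i = 0
    while i < min(len(ref), len(garbled)) and ref[i] == garbled[i]:
        i += 1
    r, g = ref[i:], garbled[i:]          # tails after the common head
    if not r and not g:
        return True                      # the strings are identical
    # 2)-3) four hypotheses, each followed by an exact match of the tails
    return (r[:2] == g[1::-1] and r[2:] == g[2:]   # two letters transposed
            or r[1:] == g                          # a letter is missing
            or r == g[1:]                          # a letter is added
            or r[1:] == g[1:])                     # a letter is substituted

print(one_typo_match("poteau", "potreau"))   # True: r added
print(one_typo_match("poteau", "ptotreau"))  # False: two errors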


p ""1

O

t

e

.... 2 :.... 3 . ' " ' 4 :

a

u

The lines correspond to edit operations :

.... 5 _ ' " ' 6

P

delete (cost 1)

t

2 ~ :"(~

" " - 1 .... 1 : " ' , ' ! ..... 3 : ' ' 4 "

insert (cost 1)

a

substitute (cost 1)

U

similar substrings (cost 0)

t t

In bold lines : best match

O Figure 11 : Computation of the dissimilarity index between two strings.

Figure 12: General architecture of a C.A.I. system for spelling. [An exercise and dictation bank drives a speech synthesizer; the student's typed answers are passed to a spelling checker; diagnoses are accumulated as statistics made available to the teacher.]


The essential difference, in the algorithm that we propose, is the fact that we no longer read the strings from left to right by simply testing at each point (i, j) whether the symbols x[i] and y[j] are the same, but rather by examining the strings to determine whether these symbols constitute the beginning of similar substrings. If several couples of similar substrings occur at some point, we will consider only the longest similar substrings. This is not, in practice, a restrictive condition, and it avoids a useless combinatorial analysis. The problem is to find, as quickly as possible, the longest similar substrings at every point (i, j) of the analysis. Whereas in dynamic programming it was necessary to find the similar substrings which preceded the point (i, j), we now wish to determine those which follow it. We use a preliminary transcoding of the two strings, which consists in replacing each character by a code which stands for the longest substring that begins with this character and can be involved in some similarity relation (a sketch is given below).

Dictionary retrieval time can be greatly reduced by not allowing typographical errors in the first letter of the word (we still accept the possibility of phonographic errors at this point). This hypothesis is not very restrictive, as typographical errors at the beginning of words seem to be much less frequent than in the middle or at the end. The typist is undoubtedly more attentive when beginning to type a word. In addition, as we have previously stated, phonographic correction is far more important in man-machine communication. This makes it possible to organize the dictionary in such a way as to limit considerably the number of entries to be looked up. Let us take, for example, a French garbled word beginning with eau. Only certain parts of the dictionary need to be looked up, corresponding to reference words beginning with some substring similar to e, to ea (there is no substring similar to this one), or to eau.

We have implemented this algorithm in Turbo Pascal on a micro-computer (MS-DOS, 8086 microprocessor, 8 MHz). In the case of misspelt words, the access time for the correct entry in a 500-word French dictionary (average word length 7.05 letters) is between 24 and 80 ms. The time taken hardly depends on the length of the word or on the number of phonographic errors it contains.
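To make the preliminary transcoding step above concrete, here is a possible sketch in Python (our own reading of the description; the PREFIXES fragment and all names are assumptions):

# Substrings that can take part in the similarity relation.
PREFIXES = {"a", "au", "e", "eau", "o", "p", "pp", "t", "tt", "u"}
MAX_LEN = max(len(s) for s in PREFIXES)

def transcode(word: str) -> list[str]:
    """Annotate each position with the longest similarity-relation
    substring beginning there (the character itself by default)."""
    out = []
    for i in range(len(word)):
        best = word[i]
        for k in range(MAX_LEN, 1, -1):   # try the longest substrings first
            seg = word[i:i + k]
            if len(seg) == k and seg in PREFIXES:
                best = seg
                break
        out.append(best)
    return out

print(transcode("poteau"))  # ['p', 'o', 't', 'eau', 'au', 'u']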


VII. Conclusion

The contribution we have outlined throughout this paper shows how, in addition to classical noise and typographical error correction, phonographic correction can be carried out, and how it can be applied to C.A.I. or to general man-machine dialogue. Since computers now have the capacity to handle most types of spelling errors, it is up to software designers to implement such an improvement. Yet there remains a nearly untouched and very difficult field of human error in text creation: everything concerning grammar. We fear that it will take some time to handle such phenomena, since this will probably necessitate the understanding of text by computers, which is also a rather long-term goal.

References

Bellman, R. E. Dynamic Programming. Princeton, NJ: Princeton University Press, 1957.
Blair, Ch. R. A Program for Correcting Spelling Errors. Information and Control, 3 (1960), 60--67.
Blanche-Benveniste, C., Chervel, A. L'orthographe. Paris: Maspero, 1969.
Catach, N. Que faut-il entendre par système graphique du français? Langue Française, 20 (1973), 30--44.
Catach, N. La phonétisation automatique du français. Paris: CNRS, 1984.
Damerau, F. J. A Technique for Computer Detection and Correction of Spelling Errors. Comm. A.C.M., 7, 3 (1964), 171--76.
Davidson, L. Retrieval of Misspelled Names in an Airline's Passenger Record System. Comm. A.C.M., 5, 3 (1962), 169--71.
Durham, I., Lamb, D. A., Saxe, J. B. Spelling Correction in User Interfaces. Comm. A.C.M., 26, 10 (1983), 764--73.
Gak, V. G. Francuzkaya ortografia. Moscow, 1959. Trans. L'orthographe du français. Paris: SELAF, 1976.
Gak, V. G. Ortografia v svete strukturnogo analiza. In Problemi strukturnoi lingvistiki. Moscow, 1962.
Horejsi, V. Analyse structurale de l'orthographe française. Philologica Pragensia, V (1962), 225--36.
Horejsi, V. Formes parlées, formes écrites et système orthographique des langues. Folia Linguistica, V, 1/2 (1970), 195--203.
Horejsi, V. Les graphonèmes en français et leurs parties composantes. Études de Linguistique Appliquée, 8 (1972), 10--17.
Lowrance, R., Wagner, R. A. An Extension of the String-to-String Correction Problem. Journal A.C.M., 22, 2 (1975), 177--83.
Morgan, H. L. Spelling Correction in System Programs. Comm. A.C.M., 13, 2 (1970), 90--94.
Odell, M. K., Russell, R. C. U.S. Patents nos. 1,261,167 (1918) and 1,435,663 (1922).


Peterson, J. L. Computer Programs for Detecting and Correcting Spelling Errors. Comm. A.C.M., 23, 12 (1980), 676--87.
Pollock, J. J. Spelling Error Detection and Correction by a Computer: Some Notes and a Bibliography. J. Doc., 38, 4 (1982), 282--91.
Pollock, J. J., Zamora, A. Collection and Characterization of Spelling Errors in Scientific and Scholarly Texts. Journal of the American Society for Information Science, 34, 1 (1983), 51--58.
Thimonnier, R. Le système graphique du français. Paris: Plon, 1967.
Thimonnier, R. Code orthographique et grammatical. Paris: Hatier, 1970.
Véronis, J. Étude quantitative sur le système graphique et phonographique du français. Cahiers de Psychologie Cognitive, 6, 5 (1986), 501--31.
Wagner, R. A., Fischer, M. J. The String-to-String Correction Problem. Journal A.C.M., 21, 1 (1974), 168--73.