External word segmentation of off-line handwritten text lines
Giovanni Seni and Edward Cohen
Center of Excellence for Document Analysis and Recognition
State University of New York at Buffalo, Buffalo, NY
ABSTRACT

This paper describes techniques to separate a line of unconstrained (written in a natural manner) handwritten text into words. When the writing style is unconstrained, recognition of individual components may be unreliable, so they must be grouped together into word hypotheses before recognition algorithms (which may require dictionaries) can be used. Our system uses original algorithms to determine distances between components in a text line and to detect punctuation. The algorithms are tested on nearly 3000 handwritten text lines extracted from postal address blocks. We give a detailed performance analysis of the complete system and its components.

Key Words: text understanding, pattern recognition, handwritten text recognition
1. INTRODUCTION

This research focuses on separating a line of handwritten text into words by determining the location of inter-word gaps (gaps between words). This paper describes and evaluates distance measurement algorithms, punctuation detection algorithms, and an algorithm that combines the two to produce word hypotheses. These techniques were developed for English text, but should be applicable to any Latin-based language (and the general approach should be applicable to other languages). This research is put forward as a step towards developing high-performance, computationally-efficient techniques for creating word hypotheses that are useful for text recognition algorithms. Text consists of one or more words, spatially arranged so that the reader can isolate them and determine their ordering. We assume the words are ordered left-to-right in a series of horizontal lines (text lines). To interpret the text a reader must normally locate a text line and separate it into words. In this paper, we address the problem of separating a located text line into words. Separating handwritten text into words is challenging because handwritten text lacks the uniform spacing normally found in machine-printed text. Machine-printed text typically has inter-word gaps that are much larger than inter-character gaps (gaps between characters). In addition, the inter-word gaps normally contain no text and extend the height of the text line. If we consider the text line to
be a series of connected components, gap distances for machine-printed text can be determined by computing the horizontal distance between components (Figure 1). The gaps between words and characters in handwritten text vary much more. The examples in Figure 2 illustrate the difficulties of finding a distance algorithm that can easily distinguish between inter-word gaps and inter-character gaps. We assume that the process of dividing the text into text lines has been completed (this process is performed automatically), and our goal is to divide those lines into words. In handwriting, a text line is not strictly horizontal or linear (Figure 3), but we can consider it to be a left-to-right ordered set of connected components. Distances between pairs of connected components, combined with punctuation detection information, can often indicate the inter-word gap locations. These inter-word gaps can divide the ordered set of connected components into words. This paper describes three separate algorithms that are combined to separate text lines into words. The first algorithm (a technique for determining distances between adjacent connected components) is described fully in a separate publication and is only summarized here. The second algorithm detects punctuation marks that are useful for word separation (periods and commas) using a set of fuzzy features and a discriminant classification function. We give details for each of the 10 features and the discriminant function. The third algorithm combines the first two algorithms to rank the gaps from largest to smallest and then determines which gaps are inter-character gaps and which are inter-word gaps.

Figure 1: Machine-printed word gaps and their distances (inter-character gap: 0.02 inch; inter-word gaps: 0.12 inch). The inter-character gaps are much smaller than the inter-word gaps.
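As a rough illustration of how these three algorithms fit together, the following sketch ranks inter-component gaps and splits a line at the largest ones. All function names are ours, and the simple "split at the k largest gaps" rule is an illustrative assumption; the paper's actual gap classifier is described later.

```python
def separate_line_into_words(components, distance, is_punct, k):
    """Sketch of the three-stage approach: (1) compute distances between
    adjacent components, (2) flag punctuation, (3) rank the gaps and split
    at the k largest ones.  `distance` and `is_punct` are caller-supplied
    callables (hypothetical interfaces, not the paper's actual ones)."""
    gaps = []
    for i in range(len(components) - 1):
        d = distance(components[i], components[i + 1])
        if is_punct(components[i]):      # punctuation marks force a word break
            d = float("inf")
        gaps.append((d, i))
    # Keep the k largest gaps as word boundaries.
    word_breaks = sorted(i for _, i in sorted(gaps, reverse=True)[:k])
    words, start = [], 0
    for i in word_breaks:
        words.append(components[start:i + 1])
        start = i + 1
    words.append(components[start:])
    return words
```

In practice, deciding *how many* gaps are inter-word gaps is the hard part; the fixed `k` here stands in for the gap classification algorithm of Section 4.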
Most previous work in recognizing text assumed that the words are already isolated or that isolation is trivial (e.g., words are separated by large amounts of white space, or words are written in boxes whose location is known). Only a few published papers describe methods of locating word gaps in unconstrained handwriting. However, these papers only mention word gap location as part of a larger text interpretation system and do not thoroughly describe their word isolation algorithms or compare the strengths and weaknesses of different word isolation algorithms. Brady showed how certain filters can highlight the word gaps in grey-level machine-printed text images. His focus was on determining a preprocessing step in reading and on offering an explanation
Figure 2: Examples of handwritten text lines with unusual spatial gaps: (a) shows words that overlap horizontally, (b) shows an inter-character gap (between the digits 2 and 7) that is larger than an inter-word gap (between the character A and the digit 5), and (c) shows a text line where many inter-character gaps and inter-word gaps have similar size.
Figure 3: Examples of handwritten text lines that are not linear or horizontal. Some slant correction algorithms can make the lines more linear and horizontal, but in our experience, no slant correction algorithm works very well on unconstrained handwritten text. These algorithms may improve the image quality of some images, but they also reduce the image quality in others.
for certain psychological text-interpretation studies. His method was not tested on handwritten text. The remainder of this paper is divided into four sections describing our techniques and results in developing algorithms to separate lines of text. Section 2 summarizes the work on distance measures between connected components. Section 3 discusses how ten features were developed for punctuation detection and how these features are combined to determine the location of commas and periods. Section 4 describes how the distance measures and punctuation detection are combined to classify gaps. The final section summarizes the paper and discusses future work.
2. DESCRIPTION OF DISTANCE MEASURES

We explored eight algorithms that compute the distances between pairs of connected components. In these experiments, we wanted to isolate the spatial aspect of inter-word gaps and used only that information to compute the distances, realizing that using recognition information (e.g., detecting punctuation) can improve performance. The input to our system is a binarized text line image. The connected components are ordered left-to-right (the ordering technique is discussed later). We assume that each connected component consists of either a word, a part of a word (e.g., a character, character fragment, or cluster of characters), a punctuation mark, or noise. This means that no connected component belongs to more than one word (an assumption supported by the test data). All gaps between pairs of adjacent connected components (adjacent refers to their ordering in the list) are ordered from largest to smallest using distance measures, and we present quantitative measures to determine how good a particular gap ordering is. These measures are then used to compare the performance of eight different distance algorithms. We also briefly examine the computational efficiency of the algorithms. This part of our research focuses specifically on comparing the performance and speed of different spatial distance computation algorithms for locating inter-word gaps in binary text images. Three types of distance measures are shown in Figure 4, and these types are used as the basis for our eight distance algorithms.

1. The bounding box method computes the minimum horizontal distance between the bounding boxes of the connected components. The distance between bounding boxes that overlap horizontally is considered to be zero.

2. The Euclidean method computes the minimum Euclidean distance between connected components.

3. The minimum run-length method computes the horizontal run-lengths between portions of the connected components that overlap vertically.
The minimum of the run-lengths is used as the distance measure. If the connected components do not overlap vertically, the bounding box distance method is used.
Note: Bounding boxes (or connected components) overlap horizontally if the right-most edge of the box (or component) on the left extends to (or past) the left-most edge of the box (or component) on the right. Vertical overlapping is defined similarly along the vertical axis. In this paper, a run-length is the distance along a straight line between two connected components. We consider only horizontal run-lengths because they are useful and are often explicitly available from the image format (e.g., fax images).
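The three basic measures can be sketched as follows. This is a minimal illustration, assuming each component is represented as an (N, 2) array of (x, y) pixel coordinates with the components ordered left to right; the representation and function names are ours, not the paper's.

```python
import numpy as np

def bbox_distance(a, b):
    """Horizontal gap between bounding boxes; zero if they overlap
    horizontally.  `a` is the left component, `b` the right one."""
    return max(0, b[:, 0].min() - a[:, 0].max())

def euclidean_distance(a, b):
    """Minimum Euclidean distance between any pixel of `a` and any pixel
    of `b` (brute force; the paper notes this is the slowest measure)."""
    diffs = a[:, None, :] - b[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(axis=2).min()))

def min_runlength_distance(a, b):
    """Minimum horizontal run-length over rows where the components
    overlap vertically; falls back to the bounding-box distance when
    there is no vertical overlap, as described above."""
    rows = set(a[:, 1]) & set(b[:, 1])
    if not rows:
        return bbox_distance(a, b)
    gaps = [b[b[:, 1] == y, 0].min() - a[a[:, 1] == y, 0].max()
            for y in rows]
    return max(0, min(gaps))
```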
Figure 4: Examples of types of gap measures between a pair of connected components. The original image is shown in (a). The bounding boxes are shown in (b), and the horizontal bounding box distance is the horizontal distance between the boxes. The Euclidean distance is shown in (c), where the arrow indicates the distance between the closest points in each component. The run-length distance is represented in (d), whose arrows show some run-lengths between the components. Different distance algorithms could use the average or minimum of the run-lengths.

4. The average run-length method is the same as the minimum run-length method except that the average of the run-lengths is used as the distance measure when components overlap vertically.

5. The run-length with bounding box heuristic method (RLBB) uses the minimum run-length method when components overlap vertically and uses heuristics (with bounding box distances) otherwise. If the connected components do not overlap horizontally or vertically, the horizontal bounding box distance is used. If one connected component is within the horizontal extent of the other, the distance is set to zero. Finally, if the connected components overlap horizontally but not vertically, the vertical distance between the bounding boxes is used.

6. The run-length with Euclidean heuristic method (RLE) uses the minimum run-length method if the connected components overlap vertically by more than a threshold (we use 0.133 of an inch); otherwise, the minimum Euclidean distance is used.

7. The RLE with 1 heuristic method (RLE(H1)) is the same as the RLE method with one heuristic added. If one connected component completely overlaps the other horizontally, the distance between the two is set to zero. A typical situation where this occurs is when a capital T is written as two separate strokes (a horizontal stroke and a vertical stroke). The RLE(H1) method ensures that these components will not be separated by a gap.

8.
The RLE with 2 heuristics method (RLE(H2)) is the same as the RLE(H1) method with an additional heuristic. Sometimes connected components are close and overlap horizontally, but the run-length method computes a non-representative distance due to the way the components are positioned. With the new heuristic, when the run-length method is used (i.e., the vertical overlap threshold is
exceeded) and the bounding boxes of the two components horizontally overlap by more than a fixed amount (we use 0.133 of an inch), the computed distance between the components is reduced by 60%. Essentially, this heuristic says that if two components are close (according to the run-length measurement) and overlap horizontally, make them closer. In our testing (summarized in Table 2), the RLE(H2) method has the best performance at 90.3%. The RLE(H1) and RLE methods were next best, but did not differ significantly from the Euclidean and the RLBB algorithms (which were fourth and fifth best). However, on average, the Euclidean method takes significantly longer to compute. The bounding box method takes the least time to compute and performs better than the minimum and average run-length methods. The RLBB method has a computational cost less than the RLE and Euclidean algorithms (but more than the bounding box algorithm) and performs as well as all but the RLE(H2) algorithm. Additional performance results are given in Section 2.2.
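The RLE(H2) decision logic can be sketched as below. The pixel-array representation, the 40-pixel threshold (roughly 0.133 inch at 300 dpi), and the function name are our assumptions; only the branching structure follows the description above.

```python
import numpy as np

def rle_h2_distance(a, b, overlap_threshold=40, reduce_factor=0.40):
    """Sketch of the RLE(H2) measure: `a`, `b` are (N, 2) integer arrays of
    (x, y) pixels of two components ordered left to right (y grows downward)."""
    ax0, ax1 = a[:, 0].min(), a[:, 0].max()
    bx0, bx1 = b[:, 0].min(), b[:, 0].max()
    ay0, ay1 = a[:, 1].min(), a[:, 1].max()
    by0, by1 = b[:, 1].min(), b[:, 1].max()

    # H1 heuristic: one component horizontally contains the other.
    if (ax0 <= bx0 and bx1 <= ax1) or (bx0 <= ax0 and ax1 <= bx1):
        return 0.0

    v_overlap = min(ay1, by1) - max(ay0, by0)
    if v_overlap > overlap_threshold:
        # Minimum horizontal run-length over the shared rows.
        rows = np.intersect1d(a[:, 1], b[:, 1])
        d = min(b[b[:, 1] == y, 0].min() - a[a[:, 1] == y, 0].max()
                for y in rows)
        d = float(max(0, d))
        # H2 heuristic: boxes also overlap horizontally by a fixed amount,
        # so the computed distance is "reduced by 60%".
        if min(ax1, bx1) - max(ax0, bx0) > overlap_threshold:
            d *= reduce_factor
        return d

    # Otherwise fall back to the minimum Euclidean distance.
    diffs = a[:, None, :] - b[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(axis=2).min()))
```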
2.1. Distance Measures Testing Methods
The gap distance algorithms (and the algorithms described in Sections 3 and 4) were tested on two sets of text lines collected from postal addresses (1453 text lines used for training and testing and 1084 text lines used only for testing). The locations of inter-word gaps in each text line were manually determined, and each distance algorithm was run on all pairs of adjacent connected components in the text lines. Four different success measures were used to determine the effectiveness of each gap distance algorithm.
Test images

The text lines were extracted from 1000 address images. These images are 300-pixel-per-inch, 8-bit grey-level images collected by scanning original mail pieces with an Eikonix CCD camera and storing them in HIPS image format. The images were manually cropped (with a bounding rectangle) in an effort to have only the destination address text appear in the image; however, other text or marks (e.g., postmarks, return address text) also frequently appeared in the cropped image. All images were collected from live mail at the Buffalo post office (the sorting center for western New York State) and were selected to represent a sample across the United States. Each of the 1000 images was binarized, passed through a guide line removal program (that removes pre-printed underlining in the image), and separated into text lines. The details of these three preprocessing steps are described elsewhere. Due to performance degradation of the line separation algorithm in the upper lines of the addresses, at most the bottom four lines were separated. Automatic line separation was used since it quickly produced many text line images. The resulting 3634 text line images were manually examined, and images not properly separated into text lines were discarded because we did not want improper line separation affecting the results. We also removed images with one or fewer words (we wanted only images that contained inter-word gaps); the remaining 2537 images formed our testing set (1453 images for training and testing of the distance metrics and 1084 images for testing of the word separation algorithm).
Truthing of test images
The connected components in each text line were automatically sorted left-to-right based on their mean-x value. Each gap between components in the text lines was then manually classified as one of four types.

1. Primary gaps are gaps between semantic units (e.g., city, state, ZIP Code, street number, street name, apartment number) with no commas, periods, or dashes on either side of the gap.

2. Secondary gaps are gaps between words within semantic units (e.g., a gap between the words New and York is a secondary gap if the words form a semantic unit) with no commas, periods, or dashes on either side of the gap.

3. Punctuation gaps are the gaps between words on either side of commas, periods, or dashes.

4. Inter-character gaps are all other gaps. These are gaps that separate components within a word.

Table 1 shows the quantity of gaps found in the 1453 text images. The inter-word gaps consist of primary, secondary, and punctuation gaps. The punctuation gaps are ignored in our distance measures testing (they count towards neither the correct rate nor the failure rate) because punctuation placed between words substantially reduces the inter-word gaps (at least between connected component pairs). Locating these inter-word gaps is best done with the help of a punctuation detector (described in Section 3). The primary and secondary gaps are separated because people often leave wider gaps between semantic units than between words of the same semantic unit. Tests showed primary gaps were, on average, 1.5 times larger than secondary gaps. We give results that consider inter-word gaps to be both primary gaps alone and primary and secondary gaps together.
Gap Type      Total     Minimum occurrences     Maximum occurrences
              Number    in a single text line   in a single text line
Primary       2150      0                       6
Secondary     275       0                       4
Punctuation   1490      0                       6
Character     12400     3                       39

Table 1: Quantity of each gap type.
2.2. Results of Distance Measures

The performance differences among the eight distance algorithms are shown in the test results in this section. These two tests show how successfully the distances from each algorithm ordered the gaps, where success is defined as having inter-word gaps larger than inter-character gaps. Tests 1 and 2 (Tables 2 and 3, respectively) show the number of text lines and inter-character gaps that are correctly ordered using each distance algorithm. (Note: the mean-x value was based on each connected component's bounding box and was calculated as mean_x = (min_x + max_x)/2.) The tables show the ranking of each
algorithm, the number and percentage (out of 1453) of correctly ordered text lines (i.e., lines in which all inter-word gaps are larger than all inter-character gaps), and the number and percentage (out of 12,400) of correctly ordered inter-character gaps (i.e., inter-character gaps that are smaller than all inter-word gaps in their respective text lines). Test 1 assumes only primary gaps are inter-word gaps (secondary and punctuation gaps are ignored). Test 2 assumes primary and secondary gaps are inter-word gaps (punctuation gaps are ignored).
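The two success measures can be sketched directly from their definitions. The tuple encoding of gaps is our assumption (punctuation gaps are assumed to be excluded before the call, as the tests specify):

```python
def line_correctly_ordered(gaps):
    """A text line is correctly ordered when every inter-word gap distance
    is larger than every inter-character gap distance.

    `gaps` is a list of (distance, kind) pairs, kind in {"word", "char"}."""
    word = [d for d, k in gaps if k == "word"]
    char = [d for d, k in gaps if k == "char"]
    if not word or not char:
        return True
    return min(word) > max(char)

def count_correct_char_gaps(gaps):
    """Count inter-character gaps smaller than every inter-word gap in
    their line (the second measure reported in Tables 2 and 3)."""
    word = [d for d, k in gaps if k == "word"]
    if not word:
        return sum(1 for _, k in gaps if k == "char")
    return sum(1 for d, k in gaps if k == "char" and d < min(word))
```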
Rank  Method         Correct ordering of       Correct ordering of
                     primary gaps in           inter-character gaps
                     text lines
                     Qty.     %                Qty.     %
1     RLE(H2)        1312     90.3             12135    97.9
2     RLE(H1)        1301     89.5             12120    97.7
3     RLE            1297     89.3             12112    97.7
4     Euclidean      1298     89.3             12105    97.6
5     RLBB           1298     89.3             12081    97.4
6     Bounding Box   1283     88.3             12006    96.8
7     Minimum RL     1261     86.8             12004    96.8
8     Average RL     1183     81.4             11941    96.3

Table 2: Test 1 results showing word gap detections where only primary gaps are considered to be inter-word gaps.
Rank  Method         Correct ordering of       Correct ordering of
                     inter-word gaps in        inter-character gaps
                     text lines
                     Qty.     %                Qty.     %
1     RLE(H2)        1270     87.4             12036    97.1
2     RLE(H1)        1258     86.6             12010    96.9
3     RLE            1252     86.2             11992    96.7
4     RLBB           1257     86.5             11971    96.5
5     Euclidean      1252     86.2             11991    96.7
6     Bounding Box   1231     84.7             11870    95.7
7     Minimum RL     1216     83.7             11888    95.9
8     Average RL     1140     78.5             11823    95.3

Table 3: Test 2 results showing word gap detections where primary and secondary gaps are considered to be inter-word gaps.
2.3. Discussion of Distance Measures

The goal of the tests was to indicate which distance algorithms would be most useful in determining the inter-word gaps in a text line. The main conclusions from the tests are:
1. Spatial information from the distance algorithms provides indications of where some inter-word gaps are.

2. Different distance algorithms have different levels of correct performance.

3. The RLE(H2) method performed best, with a 90.3% rate of correctly ordering gaps in a text line.

4. The other RLE algorithms, the Euclidean algorithm, and the RLBB algorithm performed next best, but did not differ significantly statistically from each other (at a 95% confidence level for a sample size of 1453).

The tests showed that the spatial distance measures computed did indicate how gaps in a text line should be ordered. The performance of the algorithms varied from 81.4% to 90.3%. Determining the best distance algorithm should be based on correct performance and computational complexity. The RLE(H2) algorithm has the highest correct performance for ordering the text lines correctly. The other RLE algorithms, the Euclidean algorithm, and the RLBB algorithm were slightly worse, but this difference is statistically significant. All RLE algorithms perform faster than the Euclidean method, on average, although the worst-case time complexities are the same. This is because in the worst case (when the vertical overlap of all connected components is below the threshold), all RLE methods perform the Euclidean calculation for all pairs. However, in our tests, the RLE method performed the run-length calculation 44% of the time, the bounding box distance calculation 5% of the time, and the Euclidean distance calculation 51% of the time. The run-length methods without heuristics (minimum RL and average RL) are the worst of all methods. While it is initially surprising that the run-length methods performed worse than the bounding box method, further analyses show the reason for this. There are a number of instances where two connected components overlap vertically (allowing us to compute the run-lengths), but the run-lengths do not capture the intuitive measure of gaps that we are seeking (see Figure 5).
Average RL is especially prone to this behavior.
Figure 5: Example of a run-length distance that does not match our intuitive notion of gap distance. In this fraction of a text line, the run-lengths (shown by the arrows) do not match up well with the two connected components. This produces a run-length distance that is longer than the gap we would find intuitively.
The difference between primary and secondary gaps can be determined by comparing Tables 2 and 3. The most obvious difference is that the correct rate drops in test 2. This is expected because test 2 is more stringent than test 1, and secondary gaps are smaller and harder to distinguish from inter-character gaps. Test 2 is more stringent because, in addition to the test 1 restrictions, test 2 also checks whether secondary gaps are ordered correctly. The ordering of the distance algorithms in test 2 is similar to test 1 (the order of the Euclidean and RLBB methods has switched, but this difference in placement is not statistically significant at a 95% confidence level).
3. DESCRIPTION OF PUNCTUATION DETECTION

Distance measures alone are insufficient to locate inter-word gaps, since some connected components that reduce gap size are supposed to indicate gap locations (see Figure 6). The role of some punctuation marks (e.g., hyphens, dashes, apostrophes) as word separators is not always clear (e.g., sometimes a dash separates words and sometimes it joins words). Because of the uncertain role of some punctuation marks and because of the limited types of punctuation marks in our test sets, we only developed detection systems for commas and periods. Commas always indicate a word break, and periods usually do (however, consider the phrase "an I.O.U.").
Figure 6: Example of connected components that reduce gap size but indicate word gaps.

Punctuation marks are usually written as short lines or simple curves whose interpretation is dependent on their location in the text line. We developed a set of fuzzy features where each feature describes a shape or location aspect of a connected component. Each feature algorithm returns a confidence value between 0.0 and 1.0 (inclusive), where 0.0 means the feature is unlikely to be present and increasing values indicate an increasing likelihood that the feature is present. Combinations of these features can indicate which connected components have the shape and location characteristics of commas and periods (we believe these same features can also be used for many other types of punctuation).
3.1. Features for punctuation detection
Shape Features

The shape features we use are narrow, small, short, full-height, and perimeter-to-area ratio. Since the punctuation marks we were trying to detect are simple (i.e., no complicated curves or loops), we
feel that these features can distinguish between different punctuation marks and other marks in the text (e.g., characters, word-fragments, digits). The narrow feature compares the height with the maximum width of each connected component. If the connected component has a height smaller than two times the width, it is assigned a value of 0.0. Otherwise, the value is the ratio of height to maximum width (with a maximum of 1.0). The small feature compares the size (i.e., number of pixels) of the connected component with the size of other connected components in the text line. If the current connected component is as big as or bigger than the average component in its line, it is assigned a value of 0.0. If the current connected component is as small as or smaller than the average comma encountered in the training set, it is assigned a value of 1.0. Otherwise, a value between 0.0 and 1.0 is assigned depending on whether the size of the given component is closer to the average component size or to the average comma size: (comp_size - avg_comp_size) / (avg_comma_size - avg_comp_size). The short feature compares the height of the connected component with the height of comma components encountered in the training set. If the current connected component is as short as or shorter than the average comma in the training set, it is assigned a value of 1.0. If it is taller than the average component height, it is assigned a value of 0.0. Otherwise, a value between 0.0 and 1.0 is assigned depending on whether the height of the given component is closer to the average component height or to the average comma height: (comp_height - avg_comp_height) / (avg_comma_height - avg_comp_height). The full-height feature determines if a connected component has the height of an "average" character by comparing the height of the connected component with the height of other connected components in the text line.
If the current connected component is as tall as or taller than the average component in its line, it is assigned a value of 1.0. Otherwise, the ratio of comp_height to avg_comp_height is used. The perimeter-to-area ratio feature complements the narrow feature and gives another indication of how thick a connected component is. The perimeter is considered to be the number of pixels present on the external border of the connected component, and the area is the space contained within the exterior contour (including area covered by "holes" in the connected component). A circle has a low ratio, and a one-pixel-wide line segment has a very high ratio. This feature compensates for "comma-shaped" components that are mostly horizontally oriented (e.g., a comma that has the shape of a dash). These components have a low value for their narrow feature, since they are much wider than they are tall. However, their perimeter-to-area ratio is typically high. Values greater than 1.0 are set to 1.0.

Spatial Features

The spatial location is necessary since the same mark in different locations can mean different things, e.g., consider a comma (,) and an apostrophe ('). Our spatial features determine if a connected component is located low on the line, in the middle of the line, or in the upper portion of the line. These features indicate a connected component's vertical location with respect to the reference lines (lines that show where the ascenders and descenders are separated from the main body of the text). In our tests, many of the text lines were curved or slanted so that horizontal lines could usually not be used as reference lines. In some cases, the text was written horizontally with a sudden shift in vertical position midway through the text line. Because of these shifts, any global reference line would
be inaccurate for some portion of the text (see Figure 7). To overcome this difficulty, we relied on local information (from one or two neighboring connected components) to determine the spatial features. The projection of the two neighboring components on the vertical axis renders a histogram that gives the position of the upper-half and lower-half reference lines. Such a method is sensitive to character skew and may be deceived by T-crossings. Therefore, we have included two spatial features that do not rely on these reference lines for their computation.

Figure 7: Difficulties of assigning reference lines are illustrated: (a) a good fit, (b) a poor fit due to a sudden shift, and (c) a poor fit due to a slanting line. (The figure marks the upper bound, upper half, middle region, lower half, and lower bound of each line.)

The spatial features are low-on-line, near-midline, near-baseline, extend-beneath-left-neighbor, and extend-beneath-right-neighbor; they are illustrated in Figure 8. The low-on-line feature gives an indication of how low in a line a connected component is. If the given component is located below the lower-half reference line (i.e., it falls totally within the lower portion of the image made up by its two neighbors), it is assigned a value of 1.0. If it extends above the upper-half line, it is assigned a value of 0.0. Otherwise, as the minimum-y coordinate (upper edge) of the current component moves closer to the upper-half line and farther from the lower-half line, the value decreases as computed by the formula (comp_miny - upper_half) / (lower_half - upper_half). The near-midline feature gives an indication of how much of the component falls within the middle portion, or how much of the middle portion is covered by the current component. If the component falls totally above the upper-half line or below the lower-half line, then the assigned value is 0.0.
Figure 8: Examples of spatial features shown with reference lines: (a1) shows a comma located low on line, (a2) shows a dot not located low on line, (b1) shows a period located near midline, (b2) shows a comma not located near midline, (c1) shows a comma located near baseline, (c2) shows a comma not located near baseline, (d1) shows a period that extends beneath its left neighbor, and (d2) shows a component (the letter Y) that does not extend beneath its left neighbor. Extend-beneath-right-neighbor is assigned similarly.
Otherwise, the maximum of the percentage of the middle region's height that is covered by the component and the percentage of the component's height that falls inside the middle region is assigned to the near-midline feature. These two cases are necessary to differentiate between short components that fall totally within a wide middle region and tall components that extend beyond the two boundaries of a narrow middle zone. The near-baseline feature indicates what portion of the current component falls below the lower-half reference line. This feature's value is the ratio between the height of the component's portion that extends below the lower-half line and the component's height. The extend-beneath-left-neighbor and extend-beneath-right-neighbor features indicate where the top of the current component is located relative to its left and right neighbor components, respectively. If such components are tall enough to be used as references (small noisy components are eliminated), the ratio of the distance between the component's top and the neighbor's bottom to the neighbor's height is assigned: (comp_miny - neighbor_miny) / (neighbor_maxy - neighbor_miny). Checking the height of the neighbors is necessary to prevent all components that come after or before components located high on the line (e.g., the horizontal stroke of a disconnected `T', or the dot on an `i') from being assigned a high value. In such cases, we move one component to the left or to the right looking for a neighbor with an appropriate height. In our current implementation, a given component is considered to have an appropriate height if it is greater than 35% of the average component height in the line.
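Several of the fuzzy features above are linear interpolations clamped to [0, 1]. The following sketch shows three of them; the scalar-argument interfaces are our simplification (the paper computes the averages and reference lines from the line and training data), and y is assumed to grow downward:

```python
def clamp01(x):
    """Clamp a value to the [0.0, 1.0] confidence range used by all features."""
    return max(0.0, min(1.0, x))

def small(comp_size, avg_comp_size, avg_comma_size):
    """Small feature: 1.0 at or below the average training-set comma size,
    0.0 at or above the line's average component size, linear in between."""
    return clamp01((comp_size - avg_comp_size) /
                   (avg_comma_size - avg_comp_size))

def short(comp_height, avg_comp_height, avg_comma_height):
    """Short feature: the same interpolation applied to heights."""
    return clamp01((comp_height - avg_comp_height) /
                   (avg_comma_height - avg_comp_height))

def low_on_line(comp_min_y, upper_half, lower_half):
    """Low-on-line feature: 1.0 wholly below the lower-half reference line,
    0.0 above the upper-half line, interpolated on the component's top edge."""
    return clamp01((comp_min_y - upper_half) / (lower_half - upper_half))
```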
3.2. Combining Features for Punctuation Detection
While the previous sub-section describes the features used, this section describes how those features are combined to indicate whether a particular component is a comma or a period. Originally, we were trying to develop fuzzy features and hoped to use a fuzzy classifier to combine them. However, simple fuzzy classification strategies yielded poor results and we switched to more classical strategies: the logistic regression model and the non-parametric K-nearest-neighbor method (KNN). Parametric methods were avoided based on an intuitive non-Gaussianness of the features (i.e., there are characteristic rather than random variations within each class). Since our word separation algorithm needs to distinguish between commas and periods (the presence of a period does not always indicate a word gap), we began our tests assuming we had a 3-class problem; that is, discriminating between commas, periods, and others (i.e., non-commas/periods). Our training set consisted of 11396 connected components extracted from 1885 handwritten lines (432 lines, each containing punctuation, were added to the original training set of 1453 lines). In this set, 766 components were commas and 601 components were periods. The remaining 10029 components consisted of digits, characters, entire and partial cursive words, and other types of punctuation (e.g., dashes). A separate testing set of 5567 connected components, extracted from 1084 handwritten lines, was also collected. In this set, 351 components were commas, 351 were periods, and 4865 were non-commas/periods. We tried three different formulations of our classification problem: a pure 3-class approach, a combination approach using the majority vote rule, and a combination method using the winner-take-all rule. In the 3-class case, we used a 4-KNN algorithm to discriminate between commas, periods, and others (we do not detail the logistic regression results since logistic regression is primarily used for 2-class discrimination and our experiments showed it performed poorly). In the majority rule case, we developed three discriminant functions using the logistic regression model (commas vs. periods, commas vs. others, and periods vs. others)†. The results of these three discriminant functions were combined using the majority vote rule. That is, when at least 2 out of the 3 functions agree on their classification decision, an answer is returned. Otherwise, a reject is returned‡. Similarly, in the winner-take-all rule case we also developed three discriminant functions using the logistic regression model (commas vs. others ∪ periods, others vs. commas ∪ periods, and periods vs. others ∪ commas). The results of these three discriminant functions are combined using the winner-take-all rule. That is, the best decision is given by the function with the maximum a-posteriori probability. The best performance for our training set was with the four nearest neighbors using the Mahalanobis distance based on the pooled covariance matrix. The nearest-neighbor technique was tested on the training set with the leave-one-out method and gave a 97.84% correct rate. The majority rule gave a 97.50% correct rate, and the maximum rule gave a 97.38% correct rate. Although the three approaches do not significantly differ from each other in the correct rate, they do differ in the false-positive, false-negative, and sensitivity rates.
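The two combination rules can be sketched as follows (an illustrative sketch, not the authors' code; the three logistic discriminant functions are assumed to return labels and posterior probabilities in the forms shown, and all names are hypothetical):

```python
from collections import Counter

def majority_vote(decisions):
    """decisions: the three labels returned by the pairwise discriminants.
    Return the label at least 2 out of 3 agree on, else 'reject'."""
    label, count = Counter(decisions).most_common(1)[0]
    return label if count >= 2 else 'reject'

def winner_take_all(decisions):
    """decisions: (label, posterior) pairs from the three one-vs-rest
    discriminants. Return the label with the maximum a-posteriori score."""
    return max(decisions, key=lambda d: d[1])[0]
```

A usage example: `majority_vote(['comma', 'comma', 'period'])` returns `'comma'`, while three mutually disagreeing labels produce a reject.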
TRUTH     Comma          Other           Period          Reject        Total
          #      %       #       %       #       %       #     %       #        %
Comma     696    90.86   27      3.53    40      5.22    3     0.39    766      100.0
Other     37     0.37    9905    98.76   83      0.83    4     0.04    10029    100.0
Period    22     3.66    33      5.49    545     90.68   1     0.17    601      100.0
Total     755    6.63    9965    87.44   668     5.86    8     0.07    11396    100.0

Table 4: 4-NN results for training set.
TRUTH     Comma          Other           Period          Reject        Total
          #      %       #       %       #       %       #     %       #        %
Comma     678    88.51   47      6.14    36      4.70    5     0.65    766      100.0
Other     37     0.37    9903    98.74   84      0.84    5     0.05    10029    100.0
Period    24     3.99    51      8.49    525     87.35   1     0.17    601      100.0
Total     739    6.48    10001   87.76   645     5.66    11    0.1     11396    100.0

Table 5: Majority rule results for training set.

† In this case, the set others contained all types of connected components except periods and commas.
‡ Since every component needs to be classified as punctuation or non-punctuation, a reject is treated as an others classification.
TRUTH     Comma          Other           Period          Total
          #      %       #       %       #       %       #        %
Comma     682    89.03   45      5.87    39      5.09    766      100.0
Other     40     0.40    9898    98.69   91      0.91    10029    100.0
Period    23     3.83    61      10.15   517     86.02   601      100.0
Total     745    6.54    10004   87.79   647     5.68    11396    100.0

Table 6: Maximum rule results for training set.
TRUTH     Comma          Other           Period          Reject        Total
          #      %       #       %       #       %       #     %       #        %
Comma     311    88.6    23      6.55    15      4.27    2     0.39    351      100.0
Other     27     0.55    4817    99.01   21      0.43    0     0.0     4865     100.0
Period    17     4.84    13      3.7     321     91.45   0     0.0     351      100.0
Total     355    6.38    4853    87.17   357     6.41    2     0.036   5567     100.0

Table 7: Majority rule results for testing set.
TRUTH     Comma          Other           Period          Total
          #      %       #       %       #       %       #        %
Comma     311    88.6    28      7.98    12      3.42    351      100.0
Other     25     0.51    4817    99.01   23      0.47    4865     100.0
Period    19     5.41    13      3.7     319     90.88   351      100.0
Total     355    6.38    4858    87.26   354     6.36    5567     100.0

Table 8: Maximum rule results for testing set.
TRUTH     Comma          Other           Period          Reject        Total
          #      %       #       %       #       %       #     %       #        %
Comma     306    87.18   9       2.56    15      4.27    21    5.98    351      100.0
Other     14     0.29    4800    98.66   24      0.49    27    0.55    4865     100.0
Period    17     4.84    14      3.99    275     78.35   45    12.82   351      100.0
Total     337    6.05    4823    86.64   314     5.64    93    1.67    5567     100.0

Table 9: 4-NN results for testing set.
On our separate testing set, the 4-NN method performed slightly worse than the logistic regression models, with a 97.14% correct rate. The majority rule gave a 97.88% correct rate, and the maximum rule gave a 97.84% correct rate.
3.3. Discussion of Punctuation Detection
The high performance levels indicate that the features and combination methods offer significant discrimination capabilities. We found no other figures for this kind of discrimination, so quantitative comparisons to other techniques are difficult. Certainly, additional features and combination methods could be explored, but we have already reached an area of diminishing returns. In addition to higher correct performance, the logistic regression model has significant run-time computation advantages. For these reasons, the majority rule was the clear choice for use in our combined system. We believe our punctuation detection system will work for non-address punctuation since the feature set can describe many forms of punctuation. For instance, a dash can be described as not(narrow) ∧ small ∧ short ∧ not(full height) ∧ not(low on line) ∧ near midline.
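As a toy illustration of the conjunction above (in the system the features are graded values rather than booleans; the dictionary layout is an assumption for this sketch):

```python
def looks_like_dash(f):
    """Boolean version of the dash description:
    not(narrow) and small and short and not(full height)
    and not(low on line) and near midline."""
    return (not f['narrow'] and f['small'] and f['short']
            and not f['full_height'] and not f['low_on_line']
            and f['near_midline'])
```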
4. COMBINING SPATIAL AND PUNCTUATION DETECTION INFORMATION

This section describes how gap distance and punctuation detection can be combined to develop word segmentation hypotheses. While context can give valuable word-break clues (e.g., the number of words in the line, the role of punctuation, the average size of inter-character and inter-word gaps), we designed our algorithm to be mostly context independent. Towards this goal, we based our algorithm on only two assumptions. First, we assumed that, aside from punctuation gaps, for a given text line inter-word gaps should be larger than inter-character gaps and that there would be a significant size difference between the two types of gaps. Second, we assumed that the presence of punctuation (periods and commas) in a location increases the likelihood that the given location is an inter-word gap. Punctuation marks boost the confidence of a gap being a word break. That is, when a component is classified as a comma, the gap to the right of this component becomes a very likely inter-word break. To increase the chances that such a gap will be selected as a word gap, we increase its size. Ideally, such an increment should be proportional to the confidence of the punctuation recognition result. A more straightforward approach is to add the sizes of the gaps on each side of the comma. In our implementation we used the formula d[i] ← d[i] + d[i−1] to increase the size of the gap to the right. The size of the gap to the left of the comma (i.e., d[i−1]) is reduced by 60% to reduce the possibility of selecting that gap as a word gap. Periods must be treated more carefully since they are not always word break indicators. Usually, this duality is present when they appear inside an abbreviation like "P.O.", which stands for Post Office. Here, the first period is not intended to indicate a word break but the second one is. This example suggests that additional heuristic rules are needed to decide when periods indicate word breaks.
One such rule could be that every time a period is found, we check two positions ahead for the presence of another period; if another period is found then the current one is ignored in the word hypothesis process. Otherwise, we modify the size of the gap to the right of the period using the formula d[i] ← (3/2)(d[i] + d[i−1]). The size of the gap to the left of the period (i.e., d[i−1]) is also reduced by 60% for the reason given above. Our word segmentation algorithm starts by using the RLE(H2) method to measure the length of all inter-component gaps. It then utilizes the period and comma recognition results to adjust the size of some of the gaps (as described above). The resulting list of gaps is then ordered from biggest to smallest. After that, the goal is to find a dividing line that splits this list into two sets, the inter-word gaps set and the inter-character gaps set. This task is performed as follows:
Step I   Assume the dividing line between inter-word and inter-character gaps is at the first gap.
Step II  Compute the average distance of all inter-character gaps (i.e., gaps smaller than the current dividing gap).
Step III Compute the average distance of all inter-word gaps (i.e., gaps greater than or equal to the current dividing gap).
Step IV  Compute the ratio avg_inter_word_gaps / avg_inter_character_gaps. If this ratio increases (over the last measurement in this line), try the next gap as the dividing gap, and go back to Step II. Otherwise use this gap as the dividing gap, and stop.

Essentially, the algorithm finds the dividing line where the big gaps exceed the smaller gaps by the largest relative margin. It then considers the big gaps to be word gaps and the smaller gaps to be character gaps.
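The punctuation adjustment and the dividing-line search can be sketched as follows (a minimal sketch, not the authors' implementation; the mapping of punctuation marks to gap indices, the handling of the first gap, and the list layout are assumptions, while the formulas and the 60% reduction come from the text):

```python
def adjust_gaps(gaps, punct):
    """Adjust gap sizes around recognized punctuation.
    gaps:  RLE(H2) inter-component distances d[0..n-1].
    punct: punct[i] labels the component left of gap i as
           'comma', 'period', or None."""
    d = list(gaps)
    for i, mark in enumerate(punct):
        if i == 0 or mark is None:
            continue
        # A period followed by another period two positions ahead is
        # likely inside an abbreviation such as "P.O.": ignore it.
        if mark == 'period' and i + 2 < len(punct) and punct[i + 2] == 'period':
            continue
        if mark == 'comma':
            d[i] = d[i] + d[i - 1]          # boost gap right of the comma
        else:                                # period
            d[i] = 1.5 * (d[i] + d[i - 1])  # boost gap right of the period
        d[i - 1] *= 0.4                      # shrink left gap by 60%
    return d

def find_word_gap_threshold(gaps):
    """Steps I-IV: sort the gaps from biggest to smallest and move the
    dividing point while the ratio avg(inter-word) / avg(inter-character)
    keeps increasing. Gaps >= the returned value are inter-word gaps."""
    s = sorted(gaps, reverse=True)
    best_ratio, best_k = 0.0, 1
    for k in range(1, len(s)):
        word_avg = sum(s[:k]) / k
        char_avg = sum(s[k:]) / (len(s) - k)
        if char_avg == 0:
            break
        ratio = word_avg / char_avg
        if ratio > best_ratio:
            best_ratio, best_k = ratio, k
        else:
            break
    return s[best_k - 1]
```

For example, for the gap list `[2, 3, 10, 2, 12, 3]` the search stops after separating `{12, 10}` from the rest, so gaps of 10 or more are taken as word gaps.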
4.1. Combining Information Results
The word segmentation algorithm was tested on an independent set of 1084 text lines (this set was not used in the training of the distance and punctuation algorithms). Table 10 shows the quantity of gaps found in these text images.
Gap Type     Total Number
Primary      1672
Secondary    135
Comma        351
Character    8612
Total        10770

Table 10: Quantity of each gap type.

Tables 11 and 12 show the results of four different tests to compare the performance of the proposed method against the traditional bounding box (BB) approach.
           False        Missed       Missed      Missed      Correctly
Method     Positives    Primary      Comma       Secondary   parsed lines
           #      %     #     %      #     %     #    %      #     %
BB         1004   9.32  584   5.42   125   1.16  73   0.68   239   22.05
RLE(H2)    616    5.72  440   4.09   135   1.25  77   0.71   343   31.64

Table 11: Performance of the word segmentation algorithm using the Bounding Box and RLE(H2) distance methods without punctuation recognition.

           False        Missed       Missed      Missed      Correctly
Method     Positives    Primary      Comma       Secondary   parsed lines
           #      %     #     %      #     %     #    %      #     %
BB         1066   9.90  596   5.53   79    0.73  73   0.68   261   24.08
RLE(H2)    569    5.28  456   4.23   30    0.28  79   0.73   425   39.21

Table 12: Performance of the word segmentation algorithm using the Bounding Box and RLE(H2) distance methods and punctuation adjustment.

4.2. Combining Information Discussion

Our results clearly show that our system can correctly determine the gap types of over 90% of the gaps in our testing data. This system would allow nearly 40% of the text lines to be properly separated into words. With some trivial adjustments, we could use this system to generate word hypotheses for a text understanding system. The RLE(H2) method showed significantly better performance than the bounding box method (which is applicable to most machine-printed text). The punctuation detection reduced the false positives and the number of missed comma gaps significantly while causing minimal increases in the number of missed primary and secondary gaps. Further testing could show how generating multiple hypotheses (instead of one answer per text line) could increase the likelihood of generating the correct parse. These results show that a significant number of words in unconstrained handwritten text lines can be correctly segmented using only spatial information and punctuation detection.
5. SUMMARY AND FUTURE WORK

We have presented a set of algorithms for separating words in a text line and have shown different ways of measuring their performance. The algorithms were tested on a large number of images and have been shown to be useful in this domain. We believe this in-depth analysis is needed to develop a robust text recognition system. A complete word separation algorithm must incorporate context (i.e., text interpretation) to determine all word groupings. However, given the computational complexity and accuracy of current handwritten word recognition algorithms, it is likely that most full handwritten text processing systems will use some preprocessing word separation algorithms such as those described in this paper.
ACKNOWLEDGMENT
The authors wish to thank several people at CEDAR: Dr. Sargur Srihari for his support and encouragement; Professor Peter D. Scott, Evelyn Kleinberg, and Dar-Shyang Lee for their many suggestions and assistance; Ronald Curtis, who provided the initial code for computing the fuzzy attributes used in punctuation detection; and Keith Bettinger, who wrote an X utility that greatly improved the speed of the truthing process. This work was supported by the United States Postal Service Office of Advanced Technology under Task Order 104230-91O-5329.
References

[1] H.S. Baird and K. Thompson. Reading chess. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:552–559, Jun 1990.
[2] A. Gardin Du Boisdulier, Z. Bichri, and F. Tourand. Post office box detection on handwritten addresses. USPS Advanced Technology Conference, pages 585–603, Nov 1990.
[3] M. Brady. Toward a computational theory of early visual processing in reading. Visible Language, 15(2):183–215, 1981.
[4] E. Cohen, J.J. Hull, and S.N. Srihari. Understanding handwritten text in a structured environment: Determining ZIP Codes from addresses. International Journal of Pattern Recognition and Artificial Intelligence, 5(1&2):221–264, 1991.
[5] A.C. Downton, R.W.S. Tregidgo, C.G. Leedham, and Hendrawan. Recognition of handwritten British postal addresses. From Pixels to Features III: Frontiers in Handwriting Recognition, pages 129–144, 1992.
[6] R.O. Duda and P.E. Hart. Experiments in the recognition of hand-printed text: Part II, context analysis. AFIPS Conference Proceedings, pages 1139–1149, 1968.
[7] J.J. Hull. A computational theory of visual word recognition. Technical Report 88-07, State University of New York at Buffalo, Department of Computer Science, Feb 1988.
[8] F. Kimura, A.Z. Chen, and M. Shridhar. An integrated character recognition algorithm for locating and recognizing ZIP Codes. USPS Advanced Technology Conference, pages 605–619, Nov 1990.
[9] M.S. Landy, Y. Cohen, and G. Sperling. HIPS: A UNIX-based image processing system. Computer Vision, Graphics, and Image Processing, 25:331–347, 1984.
[10] K.M. Sayre. Machine recognition of handwritten words: A project report. Pattern Recognition, 5:213–228, 1973.
[11] G. Seni and E. Cohen. Segmenting handwritten text lines into words using distance algorithms. SPIE-IS&T Conference Proceedings, pages 1000–1110, 1992.