Using Hierarchical Shape Models to Spot Keywords in Cursive Handwriting Data

M.C. Burl (1,2) and P. Perona (2,3)

(1) Jet Propulsion Laboratory, MS 525-3660, Pasadena, CA 91109, [email protected]
(2) California Institute of Technology, MS 136-93, Pasadena, CA 91125, [email protected]
(3) Universita di Padova, Padova, Italy

Abstract

Different instances of a handwritten word consist of the same basic features (humps, cusps, crossings, etc.) arranged in a deformable spatial pattern. Thus, keywords in cursive text can be detected by looking for the appropriate features in the "correct" spatial configuration. A keyword can be modeled hierarchically as a set of word fragments, each of which consists of lower-level features. To allow flexibility, the spatial configuration of keypoints within a fragment is modeled using a Dryden-Mardia (DM) probability density over the shape of the configuration. In a writer-dependent test on a transcription of the Declaration of Independence (1,300 words, 7,500 characters), the method detected all eleven instances of the keyword "government" with only four false positives.

1 Introduction


Handwriting offers a more natural human-computer interface than the traditional keyboard. A keyboard is frustrating for children and other novice users who possess limited typing skills. Even advanced users encounter difficulty when entering visually formatted material such as mathematical equations or sketches. Pictorial languages such as Japanese also pose significant problems. A further drawback of the keyboard is that its minimum size is limited by the size of the human hands and fingers. To realize truly miniature computers, alternative methods of input must be considered. Digitizing pads, tablets, and scanners enable a user to enter handwriting data into a computer. However, the data is in a raw form that cannot be easily manipulated by the user. Some progress has been made with on-line, hand-printed text, as evidenced by the emergence of commercial products such as the Newton [13]. However, the more difficult problems of recognizing connected cursive text and off-line text have not been adequately resolved.


One task of interest is keyword spotting [1]: given a set of handwritten notes, find all instances of a specified word or symbol. We are experimenting with an approach to keyword spotting that is a generalization of a method we developed for locating human faces in still-frame images [4, 5]. In principle, the method is applicable to both on-line and off-line data. (At this point, however, we have only tested it with on-line data.) Keywords are modeled hierarchically as a spatial arrangement of meta-features (word fragments or individual letters), and each fragment is modeled as a spatial arrangement of lower-level features (for example, keypoints such as humps, cusps, and crossings). Spatial arrangements are represented probabilistically with Dryden-Mardia (DM) shape densities, where the word "shape" is used in the sense of Bookstein [2] and Kendall [9] to refer to the information in a spatial configuration after translation, rotation, and scaling have been normalized away.

1.1 Related Work

The leading paradigm for handwriting recognition has been the Hidden Markov Model (HMM); see for example [14, 10, 11, 3]. One advantage of the shape-based method we describe is that, in principle, it is applicable to both on-line and off-line data. Also, under the shape model, a feature's position can depend on the positions of a number of other local features, while in HMMs only first- or second-order dependence is typically assumed. A disadvantage, however, is that in order to learn the appropriate statistics, the shape method requires many ground-truthed training examples, while HMMs can be trained from a relatively small number of unmarked examples. Also, HMMs provide a model for the entire writing trajectory, while the shape method, as currently applied, models only the positions of keypoints.

Figure 1: Portion of a digitized curve. At each sample point (xn, yn), the direction θn of the segment to (xn+1, yn+1) is measured, and the change in direction δθn = θn − θn−1 is calculated.

2 Low-level Feature Detectors

2.1 Detector Descriptions

Lifts and Drops: In our experiments, handwriting data is collected using a pressure-sensitive WACOM digitizing tablet. The tablet provides a discrete-time sequence of x, y, and p (pressure) samples. Pen lifts and drops can be detected by thresholding the samples pn into a binary sequence bn, where zero represents pen up (no pressure) and one represents pen down (full pressure). A pen drop occurs at time n if bn−1 = 0 and bn = 1. Similarly, a pen lift occurs if bn = 1 and bn+1 = 0.

Humps and Cusps: At each point (xn, yn) of the digitized handwriting trajectory, the angle θn of the segment from (xn, yn) to (xn+1, yn+1) is measured with respect to horizontal, as shown in Figure 1. The change in angle is the important quantity, so we form the sequence δθn = θn − θn−1. Noise is suppressed by smoothing δθ with a narrow Gaussian kernel. Keypoints are then detected as local extrema over a three-sample window on the smoothed sequence. A local maximum must be greater than an absolute threshold of 30°, while a local minimum must be less than −30°. The state of the pen (up or down) is also checked. This process is illustrated in Figure 2. Features are labeled as "right-turn" or "left-turn" based on the sign of the smoothed δθ at the keypoint. For sharp hair-pin turns, the change in angle may be close to 180°, leading to some instability in the label. Such points should probably be given a distinct label, but we have not done so in our experiments.

Crossings: Crossings occur when writing overlaps with itself, as in a cursive lowercase ell. Since digitized handwriting consists of a number of line segments, we simply need to check whether the current line segment intersects any previously written segments, which is a standard problem in computer graphics [8]. To avoid spurious detections from pauses and other anomalies, only segments within a given time window are checked. The time and position of each crossing are interpolated from the crossing segments.
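The pen-event and hump/cusp detectors above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name, pressure threshold, and kernel width are our own illustrative choices.

```python
import numpy as np

def detect_events(x, y, p, p_thresh=0.5, angle_thresh=30.0, sigma=1.0):
    """Sketch of the low-level detectors: pen lifts/drops from pressure,
    humps/cusps as extrema of the smoothed direction-change sequence."""
    p = np.asarray(p, dtype=float)

    # Threshold pressure into a binary up/down sequence b_n.
    b = (p > p_thresh).astype(int)
    drops = [n for n in range(1, len(b)) if b[n - 1] == 0 and b[n] == 1]
    lifts = [n for n in range(len(b) - 1) if b[n] == 1 and b[n + 1] == 0]

    # Direction of each segment, then change in direction delta-theta.
    theta = np.degrees(np.arctan2(np.diff(y), np.diff(x)))
    dtheta = np.diff(theta)
    dtheta = (dtheta + 180.0) % 360.0 - 180.0  # wrap into (-180, 180]

    # Suppress noise with a narrow Gaussian kernel.
    k = np.exp(-0.5 * (np.arange(-3, 4) / sigma) ** 2)
    smooth = np.convolve(dtheta, k / k.sum(), mode="same")

    # Keypoints: strict local extrema over a three-sample window that
    # exceed +/- angle_thresh degrees, with the pen down.
    keypoints = []
    for n in range(1, len(smooth) - 1):
        if b[n] == 0:
            continue  # pen is up; skip
        if smooth[n] > max(smooth[n - 1], smooth[n + 1]) and smooth[n] > angle_thresh:
            keypoints.append((n, "left-turn"))
        elif smooth[n] < min(smooth[n - 1], smooth[n + 1]) and smooth[n] < -angle_thresh:
            keypoints.append((n, "right-turn"))
    return lifts, drops, keypoints
```

For a straight stroke with the pen dropped at sample 2 and lifted at sample 4, the detector reports one drop, one lift, and no turn keypoints.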

Figure 2: (a) Cursive letter "G" with important time instants marked. (b) Detection of humps and cusps from the smoothed δθn sequence (vertical axis: smoothed δθ in degrees; horizontal axis: time). The peaks for t < t0 were not identified because the pen was up.

Figure 3: Feature detector performance on a cursive handwriting sample. Detections are numbered by feature type: (1) pen lift, (2) left turn, (3) right turn, (4) pen drop, (5) crossing.

2.2 Detector Performance

For illustration, Figure 3 shows the performance of the feature detectors on a cursive handwriting sample. Individual segments of the handwriting are shaded according to the pen pressure: light gray indicates little or no pressure, while dark gray indicates normal to heavy pressure. The detections are labeled according to the feature type: (1) pen lift, (2) left turn, (3) right turn, (4) pen drop, and (5) crossing.

3 Shape Models

Since the five basic features occur in many places in a sample of handwriting, we must rely on the spatial configuration of the features to find a particular word or word fragment. A configuration of N points in a plane can be represented as a 2N-dimensional vector X that contains the x and y coordinates of each point. As discussed in [4], a more useful representation for recognition involves separating the translation, rotation, and scaling of the configuration from its shape U. This conversion can be accomplished by mapping two points to fixed reference positions; the coordinates of the remaining N − 2 points then represent the shape of the configuration. Figure 4a shows a cursive letter G with eleven numbered locations, which were manually identified as potential object parts, i.e., places likely to be found by the feature detectors.

Figure 4: (a) A cursive letter G with definitions of hand-selected object parts. (b) Uncertainty regions in shape space (features 1 and 2 used as reference).

Figure 4b shows the superposition in shape space of part locations from one hundred realizations of the letter G. The clouds represent the uncertainty in the positions of parts 3-11 when parts 1 and 2 are mapped to fixed positions. The solid line marks the entire shape-space trajectory for one of the samples. This figure shows only the marginal shape-space density for each part, not the joint density; i.e., it does not show how the shape-space position of part i (i ≥ 3) affects the shape-space position of part j (j ≥ 3). We hypothesize that the joint density over the shape variables can be well modeled using a Dryden-Mardia density [7, 4, 5], which we denote by pU(U; μ, Σ). The parameters μ and Σ are estimated from the figure-space positions of the detected features after translation is removed. For this procedure to work, the training examples must have been collected at one scale and orientation. (The reasons are somewhat involved and are discussed in the references. The basic idea is that if the figure-space density is Gaussian, then the density induced in shape space by normalizing will be a Dryden-Mardia density. Limited variation in the scale and orientation of the training examples helps ensure that the examples look approximately Gaussian in figure space.) Parts that are not detected reliably or do not have a ground-truth location in every training example (e.g., part 7 and part 9 of the G exist in only 20% of the examples) are omitted from the model. For the G this leaves a model consisting of just six parts: 1, 2, 3, 5, 8, and 10. Training is done from the detected positions of the parts rather than the ground-truth positions, so that the effects of detector localization errors are included in the model. A side effect, however, is that some parts are not detected in every example.
These missing values could be imputed with the EM algorithm [6], but instead we simply replace these values with the corresponding ground truth coordinates.
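The normalization described above (mapping two reference points to fixed positions) can be sketched in a few lines. The complex-number formulation, the choice of (0,0) and (1,0) as the fixed positions, and the function name are our own illustrative choices.

```python
import numpy as np

def to_shape_space(pts):
    """Map an (N, 2) array of points into shape space by sending the first
    two points to (0,0) and (1,0); the remaining N-2 points then encode
    the shape, with translation, rotation, and scale normalized away."""
    pts = np.asarray(pts, dtype=float)
    z = pts[:, 0] + 1j * pts[:, 1]   # points as complex numbers
    base = z[1] - z[0]               # reference edge (parts 1 and 2)
    w = (z - z[0]) / base            # translate, then rotate/scale by the edge
    shape = w[2:]                    # first two points map to 0 and 1 exactly
    return np.column_stack([shape.real, shape.imag])
```

Any similarity transform (translation, rotation, uniform scaling) of a configuration maps to the same shape-space coordinates, which is exactly the invariance the model needs.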

Figure 5: Hierarchical model for the word George consisting of three fragments G, eor, and ge. Each fragment consists of 6-8 features indicated by the large dots.

3.1 Hierarchical Models

In principle, we could attempt to model the spatial configuration of keypoints in an entire keyword with a single probabilistic shape model. A better approach, however, is to divide the keyword into smaller word fragments and model the configuration of keypoints within each fragment with a Dryden-Mardia density. Splitting the word is a good idea for a number of reasons: (1) if we allow for missing (undetected) parts, the number of hypotheses generated by the whole keyword would be too large; (2) the assumptions behind the Dryden-Mardia density may be more appropriate when applied to word fragments; (3) the full joint density over the entire word cannot be reliably estimated from a limited amount of training data, whereas by splitting the word into pieces we effectively assume a number of covariance parameters to be zero (reducing the number of parameters relative to the number of examples); and (4) we do not expect the detailed arrangement of features at the end of the word to depend much on the detailed arrangement at the beginning, except through the global scale and orientation of the keyword.

4 Algorithm

To find a word fragment such as the letter G, we apply the basic feature detectors and obtain a set of candidates for each type of feature. Next, candidates are grouped into hypotheses. The grouping process uses knowledge about the spatial layout of the parts to avoid exploring all combinations of candidates. The grouping process is also given invariance hints, which specify the extent to which a keyword in the test data may differ from the training data in orientation, scale, and time duration. In addition, we impose a time-ordering constraint so that part candidates must obey the same time ordering as the true parts. (Note that the time-duration invariance hint and the time-ordering constraint would not be available

for off-line data.) Each hypothesis H generated by the grouping procedure is evaluated according to the following scoring function:

G0(H) = P(H) · pU(U; μ, Σ) / pU(U; 0, I)    (1)

where U is the shape-space configuration of H and P(H) is a factor that penalizes hypotheses with missing features. The numerator of the second term is the probability that the configuration was generated from a cursive G, while the denominator is the probability that the configuration was generated by random placement of features on the plane. Justification for this scoring function is given in [5]. The output of the fragment-detection process is a set of candidate locations for each fragment. Conjunctions of word fragments in the correct spatial arrangement are then used to locate keywords.
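The scoring function of Eq. (1) can be sketched in log form. As an illustration only, we substitute an ordinary multivariate Gaussian for the Dryden-Mardia density and use a simple per-missing-feature penalty for P(H); both substitutions are our own assumptions, not the paper's exact formulation.

```python
import numpy as np

def score_hypothesis(U, mu, Sigma, n_missing=0, p_miss=0.3):
    """Log of the likelihood-ratio score G0(H) = P(H) * pU(U; mu, Sigma)
    / pU(U; 0, I), with a Gaussian standing in for the Dryden-Mardia
    density and p_miss (illustrative) penalizing each missing feature."""
    U = np.asarray(U, dtype=float)
    d = U.size

    def log_gauss(x, m, S):
        # Log-density of a d-dimensional Gaussian N(m, S) at x.
        diff = x - m
        _, logdet = np.linalg.slogdet(S)
        return -0.5 * (d * np.log(2 * np.pi) + logdet
                       + diff @ np.linalg.solve(S, diff))

    log_num = log_gauss(U, mu, Sigma)               # pU(U; mu, Sigma)
    log_den = log_gauss(U, np.zeros(d), np.eye(d))  # background: pU(U; 0, I)
    log_prior = n_missing * np.log(p_miss)          # P(H) penalty term
    return log_prior + log_num - log_den            # log G0(H)
```

When the fragment model coincides with the background model (mu = 0, Sigma = I) and no features are missing, the log score is zero, i.e., the hypothesis is no more likely under the fragment than under random feature placement; each missing feature then lowers the score by a fixed amount.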

5 Experimental Results

We have conducted two experiments to demonstrate the potential of the hierarchical shape approach for spotting word fragments and keywords in cursive text. The basic scenario is that a user has recorded some handwritten notes on a graphics tablet, and later he would like to look back and find all locations in his notes where a particular query word occurs.

5.1 Mount Rushmore Passage

Figure 6 shows a small section of cursively written notes about Mount Rushmore. Suppose the user wants to search his notes to locate the word George. From a training set of one hundred (ground-truthed) instances of the word George, we constructed a hierarchical shape model consisting of three fragments: G, eor, and ge. Separate Dryden-Mardia densities were used to describe the feature positions in each fragment. (The fragments were hand-selected to approximately equalize the number of reliable features in each.) The best G hypothesis and the best ge hypothesis occur in George. The best eor hypothesis, however, occurs on the eod in Theodore. This mistake is actually quite reasonable, since one of the features on the eor in George was missed by the feature detectors, causing a slightly lower ranking. Given detected word fragments, we can look for an assembly of these meta-features having the correct spatial configuration. Although the eod in Theodore is the best eor candidate, there are no G or ge candidates nearby, so this location would not be detected as a keyword. The true word, however, has the best G candidate, a strong (although not best) eor candidate, and the best ge candidate, so it would be identified as a likely instance of the keyword.

Figure 6: Handwritten notes about Mount Rushmore recorded on a graphics tablet.

5.2 Declaration of Independence

The Rushmore passage contains approximately 20 words and 115 characters. To test our algorithm under more stringent conditions, we asked a (naive) user to transcribe the Declaration of Independence on a WACOM tablet. This document contains approximately 1,300 words and 7,500 characters. The word government was selected as a keyword since it appears eleven times. The same user provided training data by writing the word government one hundred times. The letters g, m, and t were chosen as word fragments. A model for the spatial arrangement of keypoints in each fragment was estimated from the training data. Figure 7a shows sample results for the g detector on a well-known passage of the document. Dark boxes show correct hits, light boxes show false positives, and the dashed boxes show misses. At a threshold of 1.5, 12 of the 13 g's were detected, along with five false positives. Some of the false positives occurred on the letter y, which does in fact look like a g with an open top. Errors also occurred on the combination ap, since the finish of the a and the start of the p together look like an open g. Figure 7b shows sample results for the t detector on the same passage. At a threshold of 4.0, 44 of the 62 t's were detected with 22 false positives. Figure 8 shows the detection performance for the g and t over the entire document using ROC (receiver operating characteristic) curves. An ROC curve shows the tradeoff between the probability of detection and the number of false alarms. The total number of detection opportunities was approximately 125 for g and slightly less than 600 for t. It turns out that from the two letters g and t, we can reliably detect the word government just by looking for words that have a leading g and a terminal t. Over the entire document, this method correctly identifies all eleven instances of government with only four

false positives: object, Assent, appealed, and Great. A more rigorous test on the length of the word, or on the characters in the middle of the word, would probably reject these false positives. As mentioned, we also attempted to detect the letter m with the shape-based method, but these experiments were less successful for two reasons. First, the model specifies only the arrangement of the keypoints, not the pen trajectory between them. With the limited set of features we have used, the keypoints are not descriptive enough of the m, so many false positives occur. Second, the test handwriting shows significantly more vertical compression and slant than the training data, so the correct hypotheses are often scored poorly. It is not clear whether providing the algorithm with a wider variety of training data would eliminate this problem. Affine-invariant shape descriptions [12] may be a solution.

Figure 7: (a) Detection of g on a section of the Declaration of Independence. Dark boxes show correct hits, light boxes show false positives, and dashed boxes show missed positives. (b) Detection of t.

Figure 8: ROC performance over the Declaration of Independence: (a) lowercase g, (b) lowercase t. Each curve plots probability of detection against number of false alarms.

6 Discussion and conclusion

We have proposed a novel method for spotting keywords in cursive handwriting data. A keyword is modeled hierarchically as a set of meta-features (word fragments), each of which is composed of lower-level features (keypoints). The spatial arrangement of the keypoints within a fragment is modeled probabilistically using a Dryden-Mardia density over the shape of the configuration. In a writer-dependent test on a transcribed version of the Declaration of Independence, the method detected all eleven instances of the word government with just four false positives. Although more experimentation is needed, we are encouraged because this approach appears to be quite general; for example, the same technique (with different low-level features) has worked well for locating human faces in cluttered scenes [4, 5]. One drawback of the proposed method, however, is that on the order of a hundred ground-truthed examples are necessary to learn the proper shape models. For keyword spotting, this implies the user would have to write the query keyword many times (not practical), or the system would have to synthesize a model from individual letters or pairs of letters. A second difficulty is that the current feature set does not provide a rich enough description to uniquely represent each letter of the alphabet. A shape model that provides a dense representation of the entire writing trajectory would be useful.

Acknowledgements

This research has been carried out and/or sponsored in part by (i) the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration, (ii) the Caltech Center for Neuromorphic Systems Engineering as a part of the NSF Engineering Research Center Program, and (iii) the California Trade and Commerce Agency, Office of Strategic Technology. The authors also wish to thank Mario Munich for his assistance.

References

[1] O.E. Agazzi and S. Kuo. "Pseudo Two-Dimensional Hidden Markov Models for Document Recognition". AT&T Technical Journal, pages 60-72, 1993.
[2] F.L. Bookstein. "Size and Shape Spaces for Landmark Data in Two Dimensions". Statistical Science, 1(2):181-242, 1986.
[3] C.B. Bose and S. Kuo. "Connected and Degraded Text Recognition Using Hidden Markov Model". Pattern Recognition, 27(10):1345-1363, 1994.
[4] M.C. Burl, T.K. Leung, and P. Perona. "Face Localization via Shape Statistics". In Intl. Workshop on Automatic Face and Gesture Recognition, 1995.
[5] M.C. Burl, T.K. Leung, and P. Perona. "Recognition of Planar Object Classes". In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 1996.
[6] A.P. Dempster, N.M. Laird, and D.B. Rubin. "Maximum Likelihood from Incomplete Data via the EM Algorithm". J. Royal Stat. Soc. B, 39:1-38, 1977.
[7] I.L. Dryden and K.V. Mardia. "General Shape Distributions in a Plane". Adv. Appl. Prob., 23:259-276, 1991.
[8] J.D. Foley and A. van Dam. Fundamentals of Interactive Computer Graphics. Addison-Wesley, 1984.
[9] D.G. Kendall. "A Survey of the Statistical Theory of Shape". Statistical Science, 4(2):87-120, 1989.
[10] G.E. Kopec and P.A. Chou. "Document Image Decoding Using Markov Source Models". IEEE Trans. Pattern Anal. Mach. Intell., 16(6):602-617, 1994.
[11] S. Kuo and O.E. Agazzi. "Keyword Spotting in Poorly Printed Documents Using Pseudo 2-D Hidden Markov Models". IEEE Trans. Pattern Anal. Mach. Intell., 16(8):842-848, 1994.
[12] T.K. Leung, M.C. Burl, and P. Perona. "Probabilistic Affine Invariants". In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 1998.
[13] L.S. Yaeger, B.J. Webb, and R.F. Lyon. "Combining Neural Networks and Context-Driven Search for On-line, Printed Handwriting Recognition in the Newton". AI Magazine, 19(1):73-89, 1998.
[14] L. Yang, B.K. Widjaja, and R. Prasad. "Application of Hidden Markov Models for Signature Verification". Pattern Recognition, 28(2):161-170, 1995.