Handwritten Numeral String Recognition: Character-Level vs. String-Level Classifier Training

Cheng-Lin Liu and Katsumi Marukawa
Central Research Laboratory, Hitachi, Ltd.
1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan
{liucl,marukawa}@crl.hitachi.co.jp

Abstract

The performance of handwritten numeral string recognition integrating segmentation and classification relies on the classification accuracy and the non-character resistance of the underlying classifier. The classifier can be trained at either the character level (with character and non-character samples) or the string level (with string samples). We show that both character-level and string-level training yield superior string recognition performance. String-level training improves segmentation but deteriorates classification. By combining the character-level trained classifier and the string-level trained classifier, we have achieved higher string recognition performance. We report the experimental results of three classifier structures on the numeral strings of NIST Special Database 19.
1. Introduction

The ambiguity of segmentation in character string recognition is commonly solved by integrating classification into segmentation [1, 2]. A character classifier is embedded to assign class scores to the candidate patterns generated by pre-segmentation (over-segmentation). The classification scores are combined to evaluate the segmentation paths, and the optimal path gives the segmentation result. The string recognition performance relies on the classification accuracy and the resistance to non-characters of the underlying classifier. The classifier is usually trained at the character level with segmented characters, and adding non-characters to the training set can improve the non-character resistance [3, 4]. String-level training of classifiers is supposed to yield higher string recognition performance and has been tried in early works [1, 5]. In recent years, promising results have been reported using string-level discriminative training, which aims to minimize the string recognition error [6, 7]. String-level training commonly adopts the Minimum Classification Error (MCE) criterion (objective) of Juang et al. [8], which has been widely used in speech recognition [9].

In this paper, we give an implementation of string-level training for handwritten numeral string recognition and compare its performance with character-level training. A classifier initially trained at the character level is adjusted on string samples by optimizing the string-level MCE criterion. During training, the string samples are dynamically segmented by candidate pattern classification and path search. The classifier parameters are adjusted on the segmented character/non-character patterns by stochastic gradient descent. We experiment with a variety of classifier structures, including a Multi-Layer Perceptron (MLP), a Polynomial Classifier [10], and a Discriminative Learning Quadratic Discriminant Function (DLQDF) [11]. The results on the numeral string images of NIST Special Database 19 (SD19) show that for each classifier, string-level training yields higher string recognition accuracy than character-level training by decreasing the mis-segmentation of string images. On the other hand, the character-level trained classifier gives higher classification accuracy on segmented characters. This inspires the combination of the string-level and character-level trained classifiers for higher string recognition performance. We achieved this goal using a simple combination rule.

The numeral string recognition system was introduced in [4] and updated in [11]. A segmentation path (composed of a sequence of candidate patterns) is scored by the average of the distances (called the normalized cost) of its constituent patterns, and the minimum-cost paths are found by beam search. Rejection is based on the difference of accumulated cost between the top two result strings. Our system yields high string recognition accuracies even though we use a simple pre-segmentation procedure and totally ignore the geometric context.
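To make the search concrete, the following is a minimal sketch of normalized-cost recognition over an over-segmentation lattice. The interface dist(i, j, c), the span limit, and the beam width are hypothetical placeholders, not the system's actual parameters; the real pre-segmentation and pruning details are described in [4, 11].

```python
import heapq

def beam_search_strings(n_prims, dist, n_classes=10, max_span=3, beam=10):
    """Sketch of normalized-cost string recognition by beam search.
    dist(i, j, c): hypothetical interface giving the classifier's
    dissimilarity d(x, C_c) for the candidate pattern merging primitive
    segments i..j-1. A partial path ending at boundary t carries
    (accumulated cost, character count, labels) and is ranked by
    normalized cost = accumulated cost / character count."""
    beams = {0: [(0.0, 0, ())]}
    for t in range(n_prims):
        for cost, n_chars, labels in beams.pop(t, []):
            for j in range(t + 1, min(t + max_span, n_prims) + 1):
                for c in range(n_classes):      # class hypothesis for (t, j)
                    entry = (cost + dist(t, j, c), n_chars + 1, labels + (c,))
                    beams.setdefault(j, []).append(entry)
        if t + 1 in beams:                       # prune to the beam width
            beams[t + 1] = heapq.nsmallest(
                beam, beams[t + 1], key=lambda p: p[0] / p[1])
    # complete paths, minimum normalized cost first; rejection can compare
    # the accumulated costs of the top two result strings
    return sorted(beams.get(n_prims, []), key=lambda p: p[0] / p[1])
```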
2. String-Level Classifier Training

A character string comprises a sequence of candidate patterns X = x_1 ... x_n. A character pattern is classified into M defined classes {C_1, ..., C_M} (M = 10 for numerals), and the string class (which we also call a class string) is denoted by a sequence of character classes S = C_{i_1} ... C_{i_n}. A character classifier is embedded in the string recognition system for classifying the constituent patterns of string X. Each pattern is assigned similarity/dissimilarity scores to the defined classes: y_i(x) = -d(x, C_i) = f(x, \Theta_i), where \Theta_i denotes the set of parameters connected to class C_i.
2.1 MCE criterion

Assume classifying an object X into M classes {C_1, ..., C_M}; M is very large for string recognition. Each class is assigned a discriminant score g_i(X, \Theta). A sample set D_X = \{(X^n, c^n) \mid n = 1, \dots, N_X\} (c^n denotes the class label of sample X^n) is used to estimate the parameters \Theta. Following Juang et al. [8], the misclassification measure on a pattern from class C_c is given by

\[ h_c(X, \Theta) = -g_c(X, \Theta) + \log \Big[ \frac{1}{M-1} \sum_{i \ne c} e^{\eta g_i(X, \Theta)} \Big]^{1/\eta}, \tag{1} \]

where \eta is a positive number. When \eta \to \infty,

\[ h_c(X, \Theta) = -g_c(X, \Theta) + g_r(X, \Theta), \tag{2} \]

where g_r(X, \Theta) is the score of the closest rival class:

\[ g_r(X, \Theta) = \max_{i \ne c} g_i(X, \Theta). \]

The loss of misclassification is computed by

\[ l_c(X, \Theta) = l_c(h_c) = \frac{1}{1 + e^{-\xi h_c}}. \tag{3} \]

On the training sample set, the empirical loss is

\[ L_0 = \frac{1}{N_X} \sum_{n=1}^{N_X} \sum_{i=1}^{M} l_i(X^n, \Theta)\, I(X^n \in C_i) = \frac{1}{N_X} \sum_{n=1}^{N_X} l_c(X^n, \Theta), \tag{4} \]

where I(\cdot) is an indicator function. By stochastic gradient descent, the classifier parameters are updated on each input sample by

\[ \Theta(t+1) = \Theta(t) - \epsilon(t)\, U\, \nabla l_c(X, \Theta)\big|_{\Theta=\Theta(t)}, \tag{5} \]

where \epsilon(t) is the learning step. U is related to the inverse of the Hessian matrix and is usually approximated to be diagonal.
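As a minimal illustration of (2) and (3) in the \eta \to \infty case, the following sketch computes the misclassification measure and the sigmoid loss from a plain list of class scores; the slope \xi is a free parameter.

```python
import math

def mce_loss(scores, genuine, xi=1.0):
    """Misclassification measure (eq. (2)) and sigmoid loss (eq. (3)).
    scores: discriminant scores g_i for all classes (higher is better);
    genuine: index of the true class; xi: slope of the sigmoid."""
    g_c = scores[genuine]
    g_r = max(s for i, s in enumerate(scores) if i != genuine)  # closest rival
    h_c = -g_c + g_r                    # > 0 iff the sample is misclassified
    l_c = 1.0 / (1.0 + math.exp(-xi * h_c))   # smooth 0/1 loss
    return h_c, l_c
```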
2.2 Specification to numeral string recognition

In numeral string recognition, the discriminant score of a string class S = C_{i_1} ... C_{i_n} is the negative normalized cost [4, 11]:

\[ g(X, S, \Theta) = -ND(X, S) = -\frac{1}{n} \sum_{j=1}^{n} d(x_j, C_{i_j}). \tag{6} \]

In MCE training, the string sample is segmented into pattern sequences by path search based on the current classifier parameters \Theta(t). The pattern sequence corresponding to the genuine class string is found by string matching using dynamic programming, while the competing class strings are found by beam search. Denoting the genuine class string by S_c = C_{i_1} ... C_{i_n} and the closest competing string by S_r = C_{j_1} ... C_{j_m}, we compute the misclassification measure by

\[ h_c(X, \Theta) = n \cdot [ND(X, S_c) - ND(X, S_r)], \tag{7} \]

in which the difference of normalized costs is multiplied by the genuine string length. This is based on the fact that the two class strings often correspond to the same pattern sequence and differ only in the class label of one constituent pattern, so the difference of accumulated costs depends little on the string length.

During training, the string sample X is segmented into a sequence x_1^c ... x_n^c corresponding to the genuine class string and a sequence x_1^r ... x_m^r corresponding to the competing class string. Consequently, the string scores are given by ND(X, S_c) = \frac{1}{n} \sum_{k=1}^{n} d(x_k^c, C_{i_k}) and ND(X, S_r) = \frac{1}{m} \sum_{k=1}^{m} d(x_k^r, C_{j_k}). Then the misclassification measure is

\[ h_c(X, \Theta) = \sum_{k=1}^{n} d(x_k^c, C_{i_k}) - \frac{n}{m} \sum_{k=1}^{m} d(x_k^r, C_{j_k}). \tag{8} \]

By stochastic gradient descent, the classifier parameters are updated on training sample X by

\[
\begin{aligned}
\Theta(t+1) &= \Theta(t) - \epsilon(t)\,\frac{\partial l_c(X,\Theta)}{\partial \Theta}\Big|_{\Theta=\Theta(t)} \\
&= \Theta(t) - \epsilon(t)\,\xi\, l_c(1-l_c)\,\frac{\partial h_c(X,\Theta)}{\partial \Theta}\Big|_{\Theta=\Theta(t)} \\
&= \Theta(t) - \epsilon(t)\,\xi\, l_c(1-l_c)\Big[\sum_{k=1}^{n}\frac{\partial d(x_k^c, C_{i_k})}{\partial \Theta} - \frac{n}{m}\sum_{k=1}^{m}\frac{\partial d(x_k^r, C_{j_k})}{\partial \Theta}\Big]\Big|_{\Theta=\Theta(t)}.
\end{aligned} \tag{9}
\]

The candidate patterns in the two sequences corresponding to S_c and S_r can be partitioned into three disjoint subsets: P_c \cup P_r \cup P_s. P_s contains the patterns shared by the two class strings, P_c contains the patterns of S_c that are not shared, and P_r contains the patterns of S_r that are not shared (Fig. 1). The shared patterns are further partitioned into two subsets P_s = P_{s1} \cup P_{s2}: the patterns in P_{s1} are assigned different character classes in the two class strings, while the patterns in P_{s2} are assigned the same class. Since S_r is the closest rival to S_c, a pattern is assigned different character classes only when the two class strings share an identical pattern sequence.

Figure 1. Candidate patterns corresponding to two class strings

Denoting a shared pattern x_i^c = x_j^r by x_k, and its common class (for patterns in P_{s2}) by C_{ij_k}, the partial derivative in the second row of (9) is specified to

\[ \frac{\partial h_c(X,\Theta)}{\partial \Theta} = \sum_{x_k^c \in P_c} \frac{\partial d(x_k^c, C_{i_k})}{\partial \Theta} - \frac{n}{m} \sum_{x_k^r \in P_r} \frac{\partial d(x_k^r, C_{j_k})}{\partial \Theta} + \sum_{x_k \in P_{s1}} \Big[ \frac{\partial d(x_k, C_{i_k})}{\partial \Theta} - \frac{n}{m} \frac{\partial d(x_k, C_{j_k})}{\partial \Theta} \Big] + \Big(1 - \frac{n}{m}\Big) \sum_{x_k \in P_{s2}} \frac{\partial d(x_k, C_{ij_k})}{\partial \Theta}. \tag{10} \]
Since the partial derivative on a candidate pattern is independent of the other patterns, we perform the update of (9) in multiple steps, sequentially on each candidate pattern in P_c \cup P_r \cup P_{s1} \cup P_{s2}. On each pattern, the parameters connected to the assigned class(es) are adjusted, as sketched below.
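Here is a minimal sketch of one string-level update per eq. (9), assuming a hypothetical classifier object exposing its parameters and the per-pattern gradients \partial d(x, C_c)/\partial\Theta. Applying the update over both full sequences reproduces the grouping of (10) automatically, since shared patterns simply receive both contributions.

```python
import numpy as np

def string_mce_update(clf, genuine_seq, rival_seq, h_c, eps, xi=1.0):
    """One stochastic update of eq. (9), performed pattern by pattern.
    genuine_seq / rival_seq: lists of (pattern, class) pairs for S_c and
    S_r, obtained by DP matching and beam search. Hypothetical interface:
    clf.grad(x, c) -> {name: d d(x, C_c) / d param}, clf.params[name]."""
    n, m = len(genuine_seq), len(rival_seq)
    l_c = 1.0 / (1.0 + np.exp(-xi * h_c))
    scale = eps * xi * l_c * (1.0 - l_c)      # common factor in eq. (9)
    for x, c in genuine_seq:                  # genuine patterns: weight +1
        for name, g in clf.grad(x, c).items():
            clf.params[name] -= scale * g
    for x, c in rival_seq:                    # rival patterns: weight -n/m
        for name, g in clf.grad(x, c).items():
            clf.params[name] += scale * (n / m) * g
```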
2.3 Specification to classifier structures

The computation of the partial derivatives in (10) is specified to classifier structures according to the discriminant function d(x, C_i) (for dissimilarity classifiers) or y_i(x) (for neural classifiers). The classifier structures tested in our experiments include a Multi-Layer Perceptron (MLP), a Polynomial Classifier (PC) [10], and a Discriminative Learning Quadratic Discriminant Function (DLQDF) [11]. All three classifiers have been evaluated in [11], where they were trained at the character level. The MLP has one hidden layer. The PC takes the binomial terms of the principal components of the features as the inputs of a single-layer network. In character-level training, the parameters of the MLP and PC are adjusted by minimizing the mean square error between the sigmoid outputs and target outputs. In string recognition and (string-level) MCE training, we take the linear outputs of the neural classifiers as the class scores. The DLQDF is inherently resistant to non-characters because of its adherence to the Gaussian density assumption, and it can also be trained with non-character samples by viewing a threshold as the discriminant score of an additional "non-character" class. In (either character-level or string-level) MCE training, the parameters are attracted to the ML (maximum likelihood) estimate by regularization. We omit the details of the classifier-specific gradient computation because of the page limit.
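As one hedged instance of the omitted gradients, consider the MLP case. The paper specifies only one hidden layer with linear outputs used as class scores; assuming sigmoid hidden units h = \sigma(Wx + a), output scores y_i(x) = v_i^{\top} h + b_i, and d(x, C_i) = -y_i(x) (assumptions, not the authors' exact architecture), the terms entering (10) are:

\[
\frac{\partial d(x, C_i)}{\partial v_i} = -h, \qquad
\frac{\partial d(x, C_i)}{\partial b_i} = -1, \qquad
\frac{\partial d(x, C_i)}{\partial W} = -\big(v_i \odot h \odot (1 - h)\big)\, x^{\top}, \qquad
\frac{\partial d(x, C_i)}{\partial a} = -\,v_i \odot h \odot (1 - h),
\]

where \odot denotes the elementwise product.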
3 Experimental Results

The numeral strings of NIST SD19 have been widely tested in previous works [12, 13]; nevertheless, the detailed indices of their test samples are not available. Our test set contains 1,476 3-digit strings and 1,471 6-digit strings collected from the page images of 300 writers, no. 1800-2099 [11]. A dataset containing 66,214 digit images of 600 writers (no. 0-399 and no. 2100-2299) is used for character-level training, and a set of 45,398 digit images of 400 writers is used for evaluating the classification accuracy [14]. To improve the non-character resistance of character-level training, we collected 17,338 non-character images from the 2-digit and 3-digit strings of the 600 writers by over-segmentation. For string-level training, we collected 8,931 string samples, containing 44,644 digit patterns, from the numeral strings of length 4-6 of the same 600 writers. In training and recognition, either a digit pattern or a non-character is represented by a 100-dimensional feature vector of chaincode orientation [14]. The MLP has 300 hidden units, the PC uses 70 principal components, and the DLQDF has 40 eigenvectors for each class. A classifier trained with non-character samples is called the enhanced version (EMLP, EPC, and EDLQDF). The classifier parameters after character-level training are copied as the initial parameters of string-level training.
Owing to the limit of space, we only report the recognition results on 6-digit strings; our recognition method does not rely on the string length, however. Table 1 shows the string recognition rates of character-level and string-level classifier training. The correct rates are counted at no rejection (forced), 1% error, and 0.5% error. We can see that for each classifier structure, character-level training with non-character samples yields higher string recognition accuracy than training without non-characters. On the other hand, string-level training improves the string recognition accuracy of each classifier, especially for the ones without non-character training.

Table 1. Correct rates (%) of numeral string recognition

                  Character-level              String-level
Classifier   Forced    1%E    0.5%E      Forced    1%E    0.5%E
MLP           63.70   14.14   11.35       96.46   91.77   86.27
EMLP          95.58   88.51   84.09       96.74   93.00   90.14
PC            55.61    7.00    3.33       97.08   94.43   91.43
EPC           96.46   93.20   91.71       96.94   94.02   90.82
DLQDF         95.24   89.46   85.66       96.87   93.81   88.51
EDLQDF        96.19   91.16   87.70       96.80   93.81   88.72
To investigate how string-level training improves the string recognition accuracy, we list in Table 2 the classification accuracies (on segmented test digits) and the numbers of wrong-length segmentations (which account for the majority of mis-segmentations and can be easily counted). We can see that character-level training gives higher classification accuracies, while string-level training gives lower segmentation error. The inspection of string recognition errors (some examples of the DLQDF are shown in Fig. 2) confirms this result.

Table 2. Classification accuracies (%) and numbers of wrong-length (WL) segmentations

                 Character-level        String-level
Classifier   Accuracy     #WL       Accuracy     #WL
MLP             98.93     484          98.77       8
EMLP            98.90      15          98.79       4
PC              99.11     633          98.93       4
EPC             99.09      10          98.92       4
DLQDF           99.14      26          98.90      13
EDLQDF          99.13      15          98.88       7

Figure 2. Examples of string recognition error by DLQDF
The high classification accuracy of the character-level trained classifier and the low segmentation error of the string-level trained classifier inspire us to combine the two classifiers to achieve higher string recognition performance. In our combination scheme, the string-level trained classifier is used to segment the string image into character patterns corresponding to the two class strings of minimum costs. When the two class strings correspond to the same pattern sequence (this appeared in about 90% of cases in our experiments), we apply the character-level trained classifier to the segmented patterns. Each segmented pattern is re-scored by averaging the outputs of both classifiers and is re-assigned the two classes of maximum combined scores; the pattern sequence is then re-assigned the two class strings of maximum path scores (a sketch of the rule follows below). The string recognition rates of combining string-level and character-level trained classifiers are shown in Table 3. We can see that combining the two classifiers mostly improves the string recognition accuracy compared to the string-level trained classifier alone. Our results compare favorably to the best ones reported in the literature, though the test samples do not coincide (the indices of previously tested samples are not available). On NIST 6-digit strings, the correct rates of Ha et al. [12] are 90.3% (forced), 75.5% (1% error), and 66.5% (0.5% error), while the correct rates of Oliveira et al. [13] are 93.12%, 85.68%, and 84.03%, respectively. Our recognition rates are significantly higher.
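A minimal sketch of this combination rule for the shared-segmentation case, assuming each classifier exposes a hypothetical outputs(x) method returning class scores (higher is better); the final re-derivation of the two class strings by maximum path score is omitted.

```python
def combine_rescore(patterns, string_clf, char_clf, n_classes=10):
    """Re-score patterns segmented by the string-level trained classifier
    (case where the top two class strings share one pattern sequence).
    Hypothetical interface: clf.outputs(x) -> list of n_classes scores."""
    rescored = []
    for x in patterns:
        s, c = string_clf.outputs(x), char_clf.outputs(x)
        combined = [(s[i] + c[i]) / 2.0 for i in range(n_classes)]  # average
        # keep the two classes of maximum combined score per pattern; the
        # two result class strings are then re-derived by path scoring
        top2 = sorted(range(n_classes), key=combined.__getitem__,
                      reverse=True)[:2]
        rescored.append((top2, combined))
    return rescored
```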
Table 3. Correct rates (%) of numeral string recognition by combining character-level and string-level trained classifiers

Classifier   Forced     1%E    0.5%E
MLP           96.60   93.00    88.10
EMLP          97.08   92.93    89.06
PC            97.42   94.97    91.77
EPC           97.21   95.58    93.07
DLQDF         96.74   94.53    90.62
EDLQDF        97.35   94.43    91.50
4 Conclusion

We have shown in the experiments that both character-level and string-level classifier training give superior string recognition performance. String-level training results in lower segmentation error but sacrifices classification accuracy; hence, the string recognition accuracy is further improved by combining the string-level trained classifier and the character-level trained classifier. Character-level training can always benefit from the availability of a large set of segmented characters. A wiser combination of classifiers for character string recognition is a task for future work.

References

[1] C. Burges, et al., Shortest path segmentation: a method for training a neural network to recognize character strings, Proc. IJCNN, Baltimore, 1992, Vol. 3, pp. 165-172.
[2] Y. Saifullah, M.T. Manry, Classification-based segmentation of ZIP codes, IEEE Trans. Systems, Man, and Cybernetics, 23(5): 1437-1443, 1993.
[3] J. Bromley, J.S. Denker, Improving rejection performance on handwritten digits by training with rubbish, Neural Computation, 5: 367-370, 1993.
[4] C.-L. Liu, H. Sako, H. Fujisawa, Integrated segmentation and recognition of handwritten numerals: comparison of classification algorithms, Proc. 8th IWFHR, Ontario, Canada, 2002, pp. 303-308.
[5] Y. LeCun, Y. Bengio, Word-level training of a handwritten word recognizer based on convolutional neural networks, Proc. 12th ICPR, Jerusalem, 1994, Vol. 2, pp. 88-92.
[6] W.-T. Chen, P. Gader, Word level discriminative training for handwritten word recognition, Proc. 7th IWFHR, Amsterdam, 2000, pp. 393-402.
[7] A. Biem, Minimum classification error training for online handwritten word recognition, Proc. 8th IWFHR, Ontario, Canada, 2002, pp. 67-71.
[8] B.-H. Juang, W. Chou, C.-H. Lee, Minimum classification error rate methods for speech recognition, IEEE Trans. Speech and Audio Processing, 5(3): 257-265, 1997.
[9] W. Chou, Discriminant-function-based minimum recognition error pattern-recognition approach to speech recognition, Proc. IEEE, 88(8): 1201-1223, 2000.
[10] U. Kreßel, J. Schürmann, Pattern classification techniques based on function approximation, Handbook of Character Recognition and Document Image Analysis, H. Bunke and P.S.P. Wang (Eds.), World Scientific, 1997, pp. 49-78.
[11] C.-L. Liu, H. Sako, H. Fujisawa, Discriminative learning quadratic discriminant function for handwriting recognition, IEEE Trans. Neural Networks, 15(2): 430-444, 2004.
[12] T.M. Ha, J. Zimmermann, H. Bunke, Off-line handwritten numeral string recognition by combining segmentation-based and segmentation-free methods, Pattern Recognition, 31(3): 257-272, 1998.
[13] L.S. Oliveira, R. Sabourin, F. Bortolozzi, C.Y. Suen, Automatic recognition of handwritten numeral strings: a recognition and verification strategy, IEEE Trans. Pattern Analysis and Machine Intelligence, 24(11): 1438-1454, 2002.
[14] C.-L. Liu, H. Sako, H. Fujisawa, Performance evaluation of pattern classifiers for handwritten character recognition, Int. J. Document Analysis and Recognition, 4(3): 191-204, 2002.