Methods, Report and Survey for the Comparison of Diverse Isolated Character Recognition Results on the UNIPEN Database

Eugene H. Ratzlaff
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA
[email protected]

Abstract

A framework of data organization methods and corresponding recognition results for UNIPEN databases is presented to enable the comparison of recognition results from different isolated character recognizers. A reproducible method for splitting the Train-R01/V07 data into an array of multi-writer and omni-writer training and testing pairs is proposed. Recognition results and uncertainties are provided for each pair, as well as results for the DevTest-R01/V02 character subsets, using an online scanning n-tuple recognizer. Several other published results are surveyed within this context. In sum, this report provides the reader multiple points of reference useful for comparing a number of published recognition results, together with a proposed framework that similarly allows private evaluation of unpublished recognition results.
1. Introduction

The International UNIPEN Foundation has provided the research community with a valuable public database, the UNIPEN Train-R01/V07 distribution of online handwritten data [6]. Several authors have published recognition results using this or related UNIPEN databases, thereby affording readers some reference points for comparing the performance of their recognizers. Unfortunately, it has typically been difficult for readers to directly compare recognition rates in these reports. At times, there is no other result published for the same database subset category. In other cases, direct comparison is difficult because of the diversity of methods used to clean the data or to split the database category subsets into training and testing sets. Others have reported results after training on a complete Train-R01/V07 category subset and testing on the corresponding DevTest-R01/V02 subset, which is not publicly available. This paper provides a context for comparing published recognition rates for the six UNIPEN isolated character subsets: categories 1a, 1b, 1c, 1d, 2 and 3. Simple methods that can be easily reproduced by future researchers are proposed for splitting the Train-R01/V07 subsets into either multi-writer or omni-writer train and test sets, as well as for rudimentary cleaning of the data. These
methods are applied to the data, and a scanning n-tuple classifier is used to create a large array of recognition results. This framework of methods and results creates a generalized comparative context within which several published results are compared with one another.
2. The online scanning n-tuple classifier

The offline and online scanning n-tuple classifiers have been described elsewhere [13, 16]. The online scanning n-tuple classifier (OnSNT) first extracts both static and dynamic features as chain codes; a sliding window is then scanned across each chain code, sampling it into the n-tuples that become the features (or "addresses") presented to a probabilistic n-tuple classifier. In order to balance good performance against reasonable memory requirements, the OnSNT parameters are those given in [16]: method 26 when applied to category 1a data, and method 27 for all other categories. As before [16], each time the OnSNT was trained, a bootstrap process was applied to generate allograph clusters. The OnSNT was initially trained on a very small set of allograph archetypes, manually selected for their apparent visual and dynamic distinctiveness. This bootstrap training was then used to subclass the designated training data into allograph clusters; the OnSNT was then trained on this allograph-clustered data.
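For readers unfamiliar with scanning n-tuple classification, the core idea can be sketched as follows. This is only an illustrative sketch: the 8-direction chain code, window length n, and Laplace smoothing constant are assumptions made here for brevity, not the published OnSNT parameters, and the sketch omits the dynamic features and allograph clustering described above.

```python
# Minimal sketch of a scanning n-tuple classifier over pen-trajectory chain codes.
import math
from collections import defaultdict

def chain_code(points):
    """Quantize successive pen displacements into 8 direction codes."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        angle = math.atan2(y1 - y0, x1 - x0)
        codes.append(int(round(4 * angle / math.pi)) % 8)
    return codes

def n_tuples(codes, n=3):
    """Scan a window of length n across the chain code to form tuple features."""
    return [tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)]

class ScanningNTuple:
    def __init__(self, n=3, alpha=1.0):
        self.n, self.alpha = n, alpha            # alpha: Laplace smoothing (assumed)
        self.counts = defaultdict(lambda: defaultdict(int))  # class -> tuple -> count
        self.totals = defaultdict(int)           # class -> total tuple count

    def train(self, points, label):
        for t in n_tuples(chain_code(points), self.n):
            self.counts[label][t] += 1
            self.totals[label] += 1

    def classify(self, points):
        feats = n_tuples(chain_code(points), self.n)
        def log_likelihood(label):
            total = self.totals[label] + self.alpha * 8 ** self.n
            return sum(math.log((self.counts[label].get(t, 0) + self.alpha) / total)
                       for t in feats)
        return max(self.counts, key=log_likelihood)
```

Each class is thus modeled by smoothed n-tuple frequencies, and classification picks the class maximizing the summed log-probability of the scanned tuples.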
3. Creating train and test subsets

The OnSNT was trained and tested with many different data arrangements in order to create an array of externally comparable conditions and results. Each Train-R01/V07 category was split eight times to create four multi-writer and four omni-writer jackknife sets; each set contains N subsets, where N is 2, 3, 5, and 10. To create each set, the original data files (one per writer) were indexed in a file list in alphanumeric order according to their full path/filename specification. For each multi-writer set, characters were drawn in character-by-character, file-after-file order as given in the file list and distributed in sequence (modulus N) into the N different files of the jackknife set. For the omni-writer sets, characters from each writer were drawn as a group, file-after-file
in the order given by the file list, and each group was distributed in turn into the N files of the set. It is important to note that although these sets were generated in a reproducible way, so that they are readily duplicated by others, the data are effectively randomly distributed between subsets. (A sketch of the split procedure follows Table 1.) To analyze these sets, the OnSNT was trained on a single jackknife subset containing 1/Nth of the data and tested on the balance of the data. This was repeated for each of the N subsets. Larger training subsets were created from the N=3 and N=10 jackknife sets by merging all but one subset for training and testing on the remaining subset. In addition to the jackknife sets, the OnSNT was trained on each of the six complete Train-R01/V07 category sets and tested on the corresponding DevTest-R01/V02 sets. A seventh pair, hereafter referred to as category 3lc, comprises the lowercase letters in the category 3 pair; it was processed in the same way. A few DevTest files were found to be corrupt and were only partially read. Table 1 indicates the number of characters read from each Train and DevTest category, and the number of training characters actually used after cleaning (described in the following section).

Table 1. Character count for UNIPEN data.

Category   Train Read   Train Used   DevTest
1a              15953        15914       8598
1b              28069        27806      16401
1c              61360        61293      37122
1d              17286        17070       9895
2              122668       122083      71999
3               67352        67179      44001
3lc             40092        40083      26210
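To make the split procedure above concrete, a minimal sketch follows. The UNIPEN file parsing is abstracted behind a hypothetical helper, read_characters; only the alphanumeric ordering and the modulus-N dealing come directly from the text.

```python
# Sketch of the multi-writer and omni-writer jackknife splits described above.
# read_characters(path) is a hypothetical UNIPEN reader returning the isolated
# characters contained in one writer's data file.

def multiwriter_split(files, n):
    """Deal characters one at a time, file after file, into n subsets."""
    subsets = [[] for _ in range(n)]
    i = 0
    for path in sorted(files):               # alphanumeric order on full path/filename
        for char in read_characters(path):
            subsets[i % n].append(char)      # distribute in sequence (modulus n)
            i += 1
    return subsets

def omniwriter_split(files, n):
    """Deal each writer's characters as a whole group into n subsets."""
    subsets = [[] for _ in range(n)]
    for i, path in enumerate(sorted(files)): # one file per writer
        subsets[i % n].extend(read_characters(path))
    return subsets
```

The multi-writer split scatters every writer across all subsets, while the omni-writer split keeps each writer's characters together, so that the test subsets contain writers unseen in training.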
4. Cleaning the data when training

Not surprisingly, the Train-R01/V07 database has several characters that one or another researcher would identify as mislabeled; many of these appear to be character fragments. Removing small amounts of mislabeled data may be inconsequential for the training of some classifiers; however, it is arguable that a simple method for removing indisputably mislabeled data could be beneficial for some systems and should therefore be employed. In this report, "outlier" characters with extremely large or small aspect ratios (horizontal or vertical lines, respectively) are dynamically removed during training if the character clearly should not have such an aspect ratio. This filters out many of the obvious character fragments. The chosen limits were based on a visual survey of the training data and common sense. When the aspect ratio exceeded 4.0 (very wide), all but the 14 characters "+,-.=MW^_`mw~ were rejected; the characters M and W were rejected only when the aspect ratio exceeded 5.0. All but the 36 characters !"$%'(),./169:;?I[\]`bdfhijklpqty{|} were rejected if the aspect ratio was less than 0.20 (very narrow). Table 1 lists the number of training characters used after some were removed during this cleaning stage.
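The filter above transcribes directly into code. The one assumption made here is that aspect ratio means the width-to-height ratio of the character's bounding box, which matches the wide/narrow reading in the text.

```python
# Aspect-ratio outlier filter for training data, as described above.
# Assumes aspect ratio = bounding-box width / height.

WIDE_OK   = set('"+,-.=MW^_`mw~')                      # the 14 characters allowed to be very wide
NARROW_OK = set('!"$%\'(),./169:;?I[\\]`bdfhijklpqty{|}')  # the 36 allowed to be very narrow

def keep_for_training(label, width, height):
    """Return False for characters whose aspect ratio contradicts their label."""
    ratio = width / max(height, 1e-6)
    if ratio > 4.0 and label not in WIDE_OK:
        return False                 # very wide, but label should not be wide
    if ratio > 5.0 and label in 'MW':
        return False                 # M and W get a looser, but finite, wide limit
    if ratio < 0.20 and label not in NARROW_OK:
        return False                 # very narrow, but label should not be narrow
    return True
```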
5. Results and discussion

Results for the OnSNT are reported using the methods described in Sections 3 and 4. Cleaning had no statistically significant impact on the OnSNT: the effect was usually barely discernible, and not always for the better. Error rates are reported in Table 2 for isolated character data according to the following category labels: 1a digits, 1b uppercase, 1c lowercase, 1d punctuation and other symbols, 2 mixed and 3 mixed, with 10, 26, 26, 32, 94, and 94 classes, respectively. Training size is evaluated as the number of training characters used divided by the total usable characters in Train-R01/V07; DT indicates that DevTest-R01/V02 was used as test data. For the OnSNT jackknife set results, the mean error rate is given on the left and the error rate for a single pass on the right. For the 10%, 20%, 33% and 50% training results, the single-pass result is trained on the first jackknife subset; for the 66% and 90% results, it is trained on the first N-1 subsets. 95% confidence limits are given where available. The average number of models or prototypes used for each class (M/C) is provided or estimated where possible. The reader is referred to the cited papers for clarification of the surveyed methods. Figures 1 and 2 provide a graphic comparison of the results for categories 1a, 1b and 1c in multiwriter and omniwriter modes, respectively. Error bars reflect 95% confidence intervals when available. A smoothed cubic spline, weighted by the uncertainties, is fit to the OnSNT jackknife set results for visual context. Other results are labeled by reference, with DT indicating DevTest results. This paper is not intended to review methods; rather, it surveys results, providing the reader a structure for assessment. Readers are reminded that there are vast differences in the developers' assumptions, goals and designs. These impact resource requirements, speed, ease of integration with other methods, and extensibility to continuous cursive writing. As expected, the results demonstrate that training and test sets should be well matched and their sizes maximized to minimize coverage and measurement uncertainties. To this end, some authors have used the DevTest sets. Unfortunately, this set is private and appears to be slightly mismatched to the Train set, being somewhat more difficult. This study suggests that a new, publicly released test set would be a useful and desirable complement to the UNIPEN Train database.
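The paper does not spell out how the 95% confidence limits on the jackknife means were computed; a conventional choice, sketched here under that assumption, is a Student-t interval over the N per-pass error rates.

```python
# Mean error and a 95% confidence half-width over N jackknife passes,
# using a two-sided Student-t interval (an assumed, conventional method).
from statistics import mean, stdev

# Two-sided 95% t critical values for N-1 degrees of freedom, N = 2, 3, 5, 10.
T95 = {2: 12.71, 3: 4.30, 5: 2.78, 10: 2.26}

def summarize(pass_error_rates):
    n = len(pass_error_rates)
    m = mean(pass_error_rates)
    half_width = T95[n] * stdev(pass_error_rates) / n ** 0.5
    return m, half_width             # report as m +/- half_width

print(summarize([1.5, 1.7, 1.6]))    # e.g., the three passes of an N=3 jackknife set
```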
Table 2. Multiwriter, omniwriter and DevTest recognition error rates. For OnSNT jackknife rows, entries are mean / single-pass error (%), with 95% confidence limits (±) where available; DT marks results on DevTest-R01/V02; M/C is the average number of models or prototypes per class.

Category 1a (digits, 10 classes); OnSNT M/C 6.5:
  OnSNT multiwriter: 10%: 1.6 / 1.5 ±0.2; 20%: 1.3 / 1.3 ±0.2; 33%: 1.2 / 1.2 ±0.1; 50%: 1.2 / 1.2; 66%: 1.1 / 1.1 ±0.02; 90%: 1.1 / 1.1 ±0.5
  OnSNT omniwriter: 10%: 3.8 / 4.1 ±1.6; 20%: 2.9 / 2.3 ±1.6; 33%: 2.2 / 2.3 ±1.2; 50%: 1.9 / 2.1; 66%: 1.7 / 2.4; 90%: 1.8 / 2.9; 100% DT: 2.6
  Surveyed (at training sizes of 21, 24, 40 and 63%, or on DevTest): [3][4] HMM-SDTW 4.5 and 3.2 (multiwriter); [4] DAG-SVM-GTDW 4.0 and 3.8 (multiwriter); [12] HMM (R01/V05) 8.2; [11] GP/MLP 4.4 (DT); [15] KNN 1.4 (M/C 237); [10] HMM (R01/V06, cleaned; M/C 43) 3.4; [7] KP-NN 3.7 (DT); [14] MLP 3.0 (DT)

Category 1b (uppercase, 26 classes); OnSNT M/C 5.1:
  OnSNT multiwriter: 10%: 8.3 / 7.5; 20%: 6.3 / 6.6; 33%: 5.1 / 5.6; 50%: 4.9 / 5.1 ±2.3; 66%: 4.5 / 4.3 ±0.7; 90%: 4.4 / 3.5 ±0.8
  OnSNT omniwriter: 10%: 11.1 / 11.5; 20%: 8.2 / 9.2; 33%: 6.6 / 6.5; 50%: 6.0 / 6.1; 66%: 5.6 / 5.2; 90%: 5.6 / 3.5; 100% DT: 8.7
  Surveyed (at training sizes of 35, 40, 46 and 48%): [3][4] HMM-SDTW 10.0 and 8.0 (multiwriter); [4] DAG-SVM-GTDW 7.6 and 7.6 (multiwriter); [17] KNN 8.3; [17] KNN/SVC 5.2; [15] KNN 6.4 (M/C 237); [10] HMM (R01/V06, cleaned) 20.5

Category 1c (lowercase, 26 classes); OnSNT M/C 5.1:
  OnSNT multiwriter: 10%: 11.4 / 10.9; 20%: 9.4 / 9.3; 33%: 8.5 / 8.4; 50%: 7.8 / 7.9; 66%: 7.7 / 7.8; 90%: 8.0 / 7.8
  OnSNT omniwriter: 10%: 15.0 / 14.3; 20%: 12.2 / 12.2 ±1.3; 33%: 10.7 / 10.2 ±0.9; 50%: 9.8 / 9.3; 66%: 9.5 / 9.5 ±0.6; 90%: 9.3 / 7.7; 100% DT: 11.5
  Surveyed (at training sizes of 39, 49, 52 and 83%, or on DevTest): [3][4] HMM-SDTW 13.0 and 11.4 (multiwriter); [4] DAG-SVM-GTDW 11.7 and 12.1 (multiwriter); [5] HMM-NN (R01/V06) 13.2; [15] KNN 9.7 (M/C 237); [1][2] HMM/MLP (R01/V06) 30.2; [10] HMM (R01/V06, cleaned; M/C 43) 14.1; [8] KP-NN 18.8 (DT); [14] MLP 14.4 (DT)
Category 1d (punctuation and symbols, 32 classes); OnSNT M/C 1.7:
  OnSNT multiwriter: 10%: 23.9 / 24.4 ±1.0; 20%: 22.9 / 23.1 ±0.3; 33%: 22.5 / 23.2 ±1.3; 50%: 22.0 / 22.4; 66%: 22.1 / 21.2 ±1.5; 90%: 21.9 / 20.4 ±1.8
  OnSNT omniwriter: 10%: 26.2 / 25.9 ±2.5; 20%: 24.1 / 23.7 ±2.3; 33%: 23.5 / 22.4 ±3.1; 50%: 23.1 / 22.5; 66%: 23.5 / 25.3 ±5.4; 90%: 23.5 / 24.4 ±7.0; 100% DT: 26.4

Category 2 (mixed, 94 classes); OnSNT M/C 3.6:
  OnSNT multiwriter: 10%: 26.5 / 27.0 ±0.8; 20%: 24.1 / 24.4 ±0.4; 33%: 23.4 / 23.5 ±0.2; 50%: 23.2 / 23.0; 66%: 23.0 / 22.9 ±0.4; 90%: 23.1 / 23.6 ±0.9
  OnSNT omniwriter: 10%: 31.2 / 32.8 ±1.8; 20%: 27.6 / 28.0 ±1.5; 33%: 26.3 / 26.2 ±0.4; 50%: 25.9 / 25.5; 66%: 25.6 / 26.8 ±2.2; 90%: 25.8 / 25.7 ±2.1; 100% DT: 27.4

Category 3 (mixed, 94 classes); OnSNT M/C 3.6:
  OnSNT multiwriter: 10%: 28.6 / 30.0 ±1.5; 20%: 25.8 / 26.5 ±0.9; 33%: 24.8 / 24.6 ±0.6; 50%: 24.0 / 24.0; 66%: 23.4 / 23.6 ±0.5; 90%: 23.1 / 23.7 ±1.1
  OnSNT omniwriter: 10%: 30.5 / 34.8 ±3.5; 20%: 27.1 / 28.4 ±1.9; 33%: 25.3 / 25.5 ±0.5; 50%: 24.6 / 24.5; 66%: 24.3 / 24.3 ±0.9; 90%: 23.9 / 24.6 ±1.4; 100% DT: 27.4

Category 3lc (lowercase letters of category 3), 100% DT only: OnSNT (M/C 5.1) 11.8; [9] KPNN 16.2; [14] MLP 15.6
Figure 1. Multiwriter error rates for categories 1a, 1b and 1c (error rate (%) versus training data used (%); OnSNT jackknife results with fitted splines, and labeled results from [3], [4], [5], [12] and [17]).
Figure 2. Omniwriter and DevTest error rates for categories 1a, 1b and 1c (error rate (%) versus training data used (%); OnSNT results with labeled results from [7], [8], [10], [11], [14] and [15]; DT marks DevTest results).
6. References
[1] T. Artières, J-M. Marchand, P. Gallinari, B. Dorizzi, Multimodal segmental models for on-line handwriting recognition, Proc. of the 15th ICPR, (2000).
[2] T. Artières, J-M. Marchand, P. Gallinari, B. Dorizzi, Stroke level modeling of on line handwriting through multi-modal segmental models, Proc. of the 7th IWFHR, 93-102 (2000).
[3] C. Bahlmann, H. Burkhardt, Measuring HMM similarity with the Bayes probability of error and its application to online handwriting recognition, Proc. of the 6th ICDAR, (2001).
[4] C. Bahlmann, B. Haasdonk, H. Burkhardt, On-line handwriting recognition with support vector machines – a kernel approach, Proc. of the 8th IWFHR, 49-54 (2002).
[5] N. Gauthier, T. Artières, B. Dorizzi, P. Gallinari, Strategies for combining on-line and off-line information in an on-line handwriting recognition system, Proc. of the 6th ICDAR, (2001).
[6] I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, S. Janet, UNIPEN project of on-line data exchange and recognizer benchmarks, Proc. of the 12th Int'l. Conf. on Pattern Recognition, ICPR'94, 29-33 (1994). IAPR-IEEE.
[7] J. Hébert, M. Parizeau, N. Ghazzali, A new fuzzy geometric representation for on-line isolated character recognition, Proc. of the 14th ICPR, 1121-1123 (1998).
[8] J. Hébert, M. Parizeau, N. Ghazzali, Learning to segment cursive words using isolated characters, Vision Interface, (1999).
[9] J. Hébert, M. Parizeau, N. Ghazzali, Cursive character detection using incremental learning, Proc. of the 5th ICDAR, (1999).
[10] J. Hu, S.G. Lim, M.K. Brown, Writer independent on-line handwriting recognition using an HMM approach, Pattern Recognition, 33:133-147 (2000).
[11] A. Lemieux, C. Gagné, M. Parizeau, Genetical engineering of handwriting representations, Proc. of the 8th IWFHR, (2002).
[12] X. Li, M. Parizeau, R. Plamondon, Model-based on-line handwritten digit recognition, Proc. of the 14th ICPR, (1998).
[13] S. Lucas, A. Amiri, Statistical syntactic methods for high performance OCR, IEE Proc.-Vis. Image Signal Process., 143(1), 23-30 (1996).
[14] M. Parizeau, A. Lemieux, C. Gagné, Character recognition experiments using UNIPEN data, Proc. of the 6th ICDAR, (2001).
[15] L. Prevost, M. Milgram, Modelizing character allographs in omni-scriptor frame: a new non-supervised clustering algorithm, Pattern Recognition Letters, 295-302 (2000).
[16] E.H. Ratzlaff, A scanning n-tuple classifier for online recognition of handwritten digits, Proc. of the 6th ICDAR, (2001).
[17] L. Vuurpijl, L. Schomaker, Two-stage character classification: a combined approach of clustering and support vector classifiers, Proc. of the 7th IWFHR, 423-432 (2000).