2005 IEEE International Conference on Systems, Man and Cybernetics, Waikoloa, Hawaii, October 10-12, 2005
Ensemble Methods for Handwritten Text Line Recognition Systems

Roman Bertolami and Horst Bunke
Institute of Computer Science and Applied Mathematics, University of Bern, Neubrückstrasse 10, CH-3012 Bern, Switzerland
{bertolam, bunke}@iam.unibe.ch
Abstract - This paper investigates the generation and use of classifier ensembles for offline handwritten text recognition. The ensembles are derived from the integration of a language model in the hidden Markov model based recognition system. The word sequences output by the ensemble members are aligned and combined according to the ROVER framework. The addressed task is particularly difficult because of the existence of a large number of word classes; moreover, the recognisers do not produce single output classes but sequences of classes. Experiments conducted on the IAM database show that the ensemble methods are able to produce statistically significant improvements in the word level accuracy when compared to the base recogniser.

Keywords: Classifier Ensemble Methods, Handwritten Text Line Recognition, Statistical Language Model, Hidden Markov Model.
1 Introduction
Today, offline handwriting recognition is still a field with many open challenges. Good recognition rates are achieved for character or numeral recognition, where the number of classes is rather small. But as the number of classes increases, as for example in isolated word recognition, the recognition rates drop significantly. An even more difficult task is the recognition of general handwritten text lines or sentences. Here, the lexicon usually contains a huge number of word classes and the correct number of words in the image is unknown in advance, which leads to additional errors. In this field, recognition rates between 50% and 80% are reported in the literature, depending on the experimental setup [6, 12, 14, 15, 23].

Classifier ensemble methods are used in many pattern recognition problems to improve the classification accuracy. Various experiments have shown that a set of classifiers has the potential to achieve better results than the best single classifier [10]. Two major problems have to be solved when ensemble methods are used. First, it is desirable to create multiple classifiers automatically from one given base classifier. Second, an adequate combination strategy must be found to benefit from the different classifiers' results.
Various ensemble methods, e.g. boosting, bagging, and random subspace, have successfully been applied to handwritten character [4, 13] or word recognition [2, 3]. A potentially large, but fixed number of classes is considered in each of these methods, and the final result is selected among the results produced by the ensemble members. However, most of these methods cannot be applied to text line recognition, because the output of a text line recognition system is a sequence of word classes instead of just a single word class. We do not want to select a word sequence produced by just one of the ensemble members as the final result, but to derive the latter from the results of multiple ensemble members by appropriate combination. Therefore, to combine multiple text line recognisers, an alignment procedure based on dynamic programming or similar techniques has to be applied in the first step of the combination. Once the results are aligned, various voting algorithms can be used to derive the final recognition result.

To the best of the authors' knowledge, no paper on multiple classifier systems for offline handwritten text line recognition has been published in the literature yet. Related work in handwritten digit and word recognition as well as continuous speech recognition is surveyed in the following paragraph.

A framework called StrCombo has been presented in [20] for numeric string recognition. This graph-based combination approach uses each geometric segment of the individual recognisers as a node in a graph. The best path through this graph then provides the final recognition result. For isolated word recognition, another approach has been proposed in [19]. Here, no word classes are used, but words are treated as sequences of character classes. A combination framework is presented which uses a weighted opinion pool. A system called ROVER has been proposed in [1] in the domain of continuous speech recognition. The main goal of this system is to reduce the word error rate by aligning and combining the results of multiple speech recognisers. An extension of the ROVER system, where language model information supports the combination process, has been presented in [11].

In this paper, ensemble methods are applied to offline handwritten text line recognition for the first time. We propose a new method to automatically generate ensemble members out of one base classifier.
Figure 1. Normalisation of the handwritten text line image. The first line shows the original image, while the normalised image is shown on the second line.
The proposed method is based on the integration of a statistical language model. Another novel feature is the alignment of the output results before combination procedures, such as voting, are applied. The application we are dealing with in this paper is extremely difficult from two different points of view. First, a classification task involving a very large number of classes (more than 12,000 in the experiments described in Sect. 4) is considered. Second, the output to be produced by our system is not a single class, but a sequence of classes of variable length, with no given segmentation of the input signal.

The remaining part of the paper is organised as follows. Section 2 introduces novel methods to generate ensembles of recognisers from a base recogniser, which can then be combined as described in Sect. 3. Experiments and results are presented in Sect. 4, whereas conclusions are drawn in the last section of the paper.
2 Generation of ensembles
To obtain ensembles of classifiers we first build a hidden Markov model (HMM) based text line recogniser which we then use as the base recogniser. Multiple recognition results are then produced by the integration of a statistical language model. The proposed strategy can only be applied if a language model supports the recognition process. In contrast to other ensemble generation methods, e.g. boosting or bagging, our strategy would not work with a character or isolated word recognition system. However, for these tasks multiple classifier approaches have been proposed before [2, 3, 4, 13].
2.1 Handwritten text line recogniser
Based on the system described in [7], we create an offline handwritten text line recognition system which we then use as the base recogniser. The system can be divided into three parts: preprocessing and feature extraction, HMM based recognition, and postprocessing. The handwriting images are normalised with respect to skew, slant, and baseline position in the preprocessing phase. This normalisation reduces the impact of different writing styles. An example of the normalisation procedure is shown in Fig. 1. After preprocessing, each image of a handwritten text line is converted into a sequence of feature vectors. A sliding window, moving one pixel per step from left to right over a line of text, is used for this purpose. A number of
features are extracted at each position of the sliding window. Further details about preprocessing and feature extraction are provided in [7]. For the HMM based recognition process, each character is modelled by an HMM. For each of the HMMs we chose a linear topology. The character HMMs have an individual number of states [22] and a mixture of twelve Gaussians is used to model the output distribution in each state. The HMMs are trained according to the Baum-Welch algorithm [9]. The Viterbi decoding procedure with integrated language model support is used to perform the recognition [16]. During the postprocessing phase, a confidence measure is computed for each recognised word. The confidence measure is derived from alternative candidate word sequences according to the procedure presented in [21].
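To make the sliding-window step more concrete, the following minimal sketch turns a normalised, binary text line image into a sequence of simple per-window features. It is an illustration only: the image conventions (ink pixels set to 1), the one-pixel window step, and the four features shown here are assumptions made for this sketch and do not reproduce the exact nine geometric features used in [7].

```python
import numpy as np

def sliding_window_features(binary_image, window_width=1):
    """Convert a normalised text line image (2-D array, ink = 1, background = 0)
    into a sequence of feature vectors using a sliding window.

    Illustrative sketch only; the actual features of [7] are not reproduced here.
    """
    height, width = binary_image.shape
    features = []
    for x in range(0, width - window_width + 1):      # one pixel per step
        window = binary_image[:, x:x + window_width]
        rows = np.nonzero(window.any(axis=1))[0]      # rows containing ink
        if rows.size > 0:
            upper, lower, centre = rows.min(), rows.max(), rows.mean()
        else:
            upper, lower, centre = height, 0, height / 2.0
        features.append([
            window.mean(),       # fraction of ink pixels in the window
            centre / height,     # centre of gravity (normalised)
            upper / height,      # upper contour position
            lower / height,      # lower contour position
        ])
    return np.asarray(features)
```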
2.2 Multiple recognition results
Once we have created the base recognition system, we are able to generate multiple recognisers by the integration of a language model. In the field of speech recognition, it has been shown that those parts of a recognised word sequence that are very sensitive to a specific integration of the language model are often recognised incorrectly. For these parts we try to find alternative interpretations to improve the recognition accuracy. The integration of the language model into an HMM based recognition system is accomplished according to the following formula:

Ŵ = argmax_W { log p(X|W) + α log p(W) + m β }    (1)
We try to find the most likely word sequence Ŵ = (w1, ..., wm) for a given observation sequence X. The likelihood p(X|W), computed by the HMM decoding, is combined with the likelihood p(W), provided by the language model. Two parameters, α and β, are required to control the combination, because the HMM decoding as well as the language model produce merely approximations of probabilities. The parameter α is called the Grammar Scale Factor and weights the impact of the statistical language model. The parameter β is called the Word Insertion Penalty and controls the segmentation rate of the recogniser. If we select n different parameter pairs (αi, βi), i = 1, ..., n, we are able to produce n different recognition systems, which produce n recognition results from the same input, i.e. the same image of a handwritten text line. For the selection of the m parameter pairs (αi, βi), m ≤ n, that are used to build an ensemble, we propose three different strategies:

m best: We measure the performance of n different parameter pairs on a validation set and select the m pairs that individually perform best.

forward search: We select the m best performing pairs on a
α     β      Recognition Result
10    -10    ' he going - out of the love
20    -100   the going - out of the love
20    80     " be going - out of the love
30    110    " the going - out of the love
40    -40    the going - out of the lack
40    140    " the going - out of the lack
50    110    ' he going - out of the lack

Figure 2. Example of multiple recognition results produced by different values of α and β.
validation set by a forward search procedure. We start with the best performing pair (α1, β1). Then, in iteration i, we evaluate the combination ((α1, β1), ..., (αi−1, βi−1), (α, β)) for each of the remaining pairs (α, β), and the best performing pair is used as (αi, βi). The procedure terminates once i = m.

backward search: The starting point is the set given by the n best performing pairs on the validation set. Iteratively, we leave one of the members out and measure the performance of the remaining set; the best performing set is used to continue. The procedure terminates once the best performing subset of m pairs has been found.

To reduce the computational cost, the HMM based recogniser produces not only single recognised word sequences, but whole recognition lattices [23]. These lattices contain the part of the search space that has been explored during the Viterbi decoding step. This means that the Viterbi decoding step has to be applied only once instead of m times. A lattice rescoring procedure using the different values of α and β then produces the different recognition results. A simplified sketch of this rescoring step is given below. An example of multiple recognition results for the handwritten text "the going-out of the land", produced by different α and β values, is shown in Fig. 2.
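The following minimal sketch applies Eq. (1) to a list of candidate word sequences for several (α, β) pairs. It is only an illustration under simplifying assumptions: the candidate paths, their precomputed log-likelihoods, and all function and field names are hypothetical, and a real implementation would rescore the full recognition lattice rather than an enumerated list of paths.

```python
def rescore_paths(paths, alpha, beta):
    """Pick the best word sequence according to Eq. (1).

    `paths` is a list of dicts with keys 'words' (list of words),
    'log_hmm' (log p(X|W) from HMM decoding) and 'log_lm'
    (log p(W) from the bigram language model).
    """
    best = max(
        paths,
        key=lambda p: p['log_hmm'] + alpha * p['log_lm'] + beta * len(p['words'])
    )
    return best['words']

def generate_ensemble_results(paths, parameter_pairs):
    """Produce one recognition result per (alpha, beta) pair."""
    return [rescore_paths(paths, a, b) for a, b in parameter_pairs]

# Hypothetical usage with two candidate paths and three parameter pairs
paths = [
    {'words': ['the', 'going', '-', 'out'], 'log_hmm': -120.4, 'log_lm': -18.2},
    {'words': ['he', 'going', '-', 'out'],  'log_hmm': -119.8, 'log_lm': -21.5},
]
results = generate_ensemble_results(paths, [(10, -10), (20, 80), (40, 140)])
```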
3 Combination of multiple results
To combine the multiple recognition results we apply the Recogniser Output Voting Error Reduction (ROVER) framework [1]. This system was developed in the domain of speech recognition and first used to combine multiple continuous speech recognisers. The ROVER system can be divided into two modules: the alignment module and the voting module. These modules will be described next.
3.1 Alignment module
In the alignment module we have to find an alignment of n word sequences. Because finding the optimal alignment is an NP-complete problem [18], we use an iterative, approximate solution.
Figure 3. Example of iteratively aligning multiple recognition results. The four recognised word sequences for the handwritten text "Face ( ours is on" are W1: "Face court is on", W2: "Race course is on", W3: "Face ( ours it on", and W4: "Face ( ours if on". They are aligned incrementally, WTN1 = W1 + W2, WTN2 = WTN1 + W3, WTN3 = WTN2 + W4, with missing words represented by null transitions ε.
This means that the word sequences are aligned incrementally. At the beginning, the first two sequences are aligned using a standard string matching algorithm [17]. The result of this alignment is a Word Transition Network (WTN). The third word sequence is then aligned with this WTN, resulting in a new WTN, which is then aligned with the fourth word sequence, and so on. We refer to [1] for further details. In general, this iterative alignment procedure does not deliver an optimal solution, because the result is affected by the order in which the word sequences are considered. In practice, however, the suboptimal alignment produced by the ROVER alignment module often provides an adequate solution at a much lower computational cost. An example of this alignment algorithm is shown in Fig. 3. The image of the handwritten text "Face ( ours is on" is analysed by four different recognisers, which results in four different word sequences W1, W2, W3, and W4. None of these word sequences is a correct transcription. In the first step, W1 and W2 are aligned into WTN1. Next, W3 is aligned with WTN1, resulting in WTN2. Subsequently, W4 is added, resulting in WTN3. If we now select the correct path through WTN3, we can perfectly transcribe the handwritten text image.
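The following sketch illustrates the incremental alignment idea in simplified form. It aligns each new word sequence against a single reference string derived from the first hypothesis (extended with inserted words), whereas the actual ROVER module aligns against the full WTN; the representation of the WTN as a list of segments and the 'EPS' marker for null transitions are assumptions made for illustration.

```python
def align(ref, hyp):
    """Wagner-Fischer edit-distance alignment [17] between two word lists.
    Returns a list of (ref_word_or_None, hyp_word_or_None) pairs."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    pairs, i, j = [], n, m                      # backtrack through the DP table
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            pairs.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((ref[i - 1], None)); i -= 1
        else:
            pairs.append((None, hyp[j - 1])); j -= 1
    return list(reversed(pairs))

def build_wtn(word_sequences):
    """Iteratively align word sequences into a Word Transition Network,
    represented as a list of segments; each segment holds one word
    (or 'EPS' for a null transition) per combined sequence."""
    wtn = [[w] for w in word_sequences[0]]      # WTN of the first sequence
    anchor = list(word_sequences[0])            # reference string used for alignment
    for seq in word_sequences[1:]:
        new_wtn, new_anchor, k = [], [], 0      # k indexes existing WTN segments
        for ref_w, hyp_w in align(anchor, seq):
            if ref_w is None:                   # inserted word: new segment, EPS for old members
                new_wtn.append(['EPS'] * len(wtn[0]) + [hyp_w])
                new_anchor.append(hyp_w)
            else:
                new_wtn.append(wtn[k] + [hyp_w if hyp_w is not None else 'EPS'])
                new_anchor.append(ref_w)
                k += 1
        wtn, anchor = new_wtn, new_anchor
    return wtn

# Hypothetical usage with the four sequences of Fig. 3
seqs = [["Face", "court", "is", "on"],
        ["Race", "course", "is", "on"],
        ["Face", "(", "ours", "it", "on"],
        ["Face", "(", "ours", "if", "on"]]
wtn = build_wtn(seqs)
```

Running build_wtn on the same sequences in a different order can yield a different WTN, which reflects the order dependence of the iterative alignment noted above.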
3.2 Voting module

The voting module combines the different word sequences once they are aligned in a WTN. The goal is to identify the best scoring word sequence in the WTN and extract it as the final result. The decisions are made independently for each segment of the WTN; thus, neither of the adjacent segments has any effect on the current decision. Each decision depends only on the number m of recognition outputs, on the number of occurrences, mw, of a word w, and on the confidence measure, cw, of word w.
Figure 4. Validation of different values of the grammar scale factor (α) and the word insertion penalty (β): word level accuracy [%] on the validation set, plotted over α and β.

Figure 5. Validation of the number of ensemble members for the different selection strategies: word level accuracy on the validation set as a function of the number of ensemble members (2-24) for the m best, forward search, and backward search strategies, compared to the base recogniser.
The confidence measure cw is defined as the maximum confidence measure among all occurrences of w at the current position in the WTN. For each possible word class w, we calculate the score sw as follows:

sw = λ (mw / m) + (1 − λ) cw    (2)

We then select, for the current segment, the word class w with the highest score sw. To apply Eq. 2 we have to determine the value of λ experimentally. The parameter λ weights the impact of the number of occurrences against the confidence measure. Additionally, we have to experimentally determine the confidence measure cε for null transition arcs, because no confidence score is associated with a null transition ε. For this purpose we probe various values of λ and cε on a validation set.
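As an illustration of the voting rule, the following sketch applies Eq. (2) to a WTN given as a list of segments, where each segment holds one (word, confidence) pair per combined recognition output and null transitions are marked 'EPS'. The data layout, the function names, and the default value of cε are assumptions; λ and cε would be tuned on the validation set as described above.

```python
from collections import defaultdict

def vote_segment(segment, m, lam, c_eps=0.7):
    """Select the winning word of one WTN segment according to Eq. (2).

    `segment` is a list of (word, confidence) pairs, one per combined output;
    null transitions are given as ('EPS', None).  The default c_eps is a
    placeholder, not a value from the paper.
    """
    occurrences = defaultdict(list)
    for word, conf in segment:
        occurrences[word].append(c_eps if word == 'EPS' else conf)
    best_word, best_score = None, float('-inf')
    for word, confs in occurrences.items():
        score = lam * (len(confs) / m) + (1.0 - lam) * max(confs)   # Eq. (2)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

def vote_wtn(wtn_segments, m, lam, c_eps=0.7):
    """Extract the final transcription from a WTN, dropping null transitions."""
    result = [vote_segment(seg, m, lam, c_eps) for seg in wtn_segments]
    return [w for w in result if w != 'EPS']
```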
4 Experiments and results

Each of the experiments reported in this section makes use of the base recogniser described in Sect. 2. This offline handwritten text line recogniser is supported by a statistical bigram language model that has been extracted from the LOB corpus [5]. We considered a writer-independent task, where none of the writers in the test set has been used for the training or the validation of the system. The text lines used in the experiments originate from the IAM database [8], which consists of a large number of handwritten English forms. We use 6,166 text lines written by 283 writers to train the HMMs of the base recogniser. The ensemble methods are validated on 941 text lines written by 43 writers, whereas the test set contains 1,863 text lines from 128 writers. Training, validation and test set are disjoint, i.e. each writer has contributed to only one set. The lexicon contains 12,502 words and is given by the union of the words occurring in the training, validation, and test sets. The character HMMs of the base recogniser are trained
according to the Baum-Welch algorithm. This algorithm iteratively optimises the parameters of the character models and is a special instance of the Expectation-Maximisation algorithm. Once the base recogniser is trained, we measure the performance of a large number of (α, β) value pairs on the validation set. The result of this measurement is shown in Fig. 4. We then choose the 24 best performing pairs to be considered for combination. Next, we apply the three selection strategies for (αi, βi), described in Sect. 2.2, where i = 1, ..., m. We do not only have to determine which (α, β) values should be used for combination, but also the number of ensemble members m, i.e. the number of (α, β) pairs to be used. For this purpose we measure the performance for all possible values of m on the validation set. Simultaneously, we optimise the parameters λ and cε. The result of this validation procedure is shown in Fig. 5 for each of the selection strategies. Finally, we measure the performance of the optimised systems on the test set. No forward or backward selection of recognisers is done directly on the test set. The performance of the systems on the test set is as follows. The base recognition system attains a word level accuracy of 67.35%. The m best strategy achieves a word level accuracy of 67.82%, whereas the forward search strategy achieves 68.03%. The backward search strategy reaches a word level accuracy of 68.09%. The latter is the best performing strategy; it achieves a statistically significant improvement of 0.74% over the base recogniser at a significance level of 95%. The significance was assessed by applying a statistical z-test to the text line level accuracies. The results on the test set are summarised in Tab. 1.
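The paper does not specify the exact form of the z-test, so the following sketch shows one plausible reading: a two-proportion z-test on the text line level correctness of two systems evaluated on the same test set. All names are hypothetical, and a matched-pairs test on the same lines would be an equally reasonable choice.

```python
import math

def two_proportion_z_test(correct_a, correct_b, n_lines):
    """Two-proportion z-test comparing the text line level accuracies of two
    systems evaluated on the same number of lines.  One plausible reading of
    the significance test mentioned in the paper, not the authors' exact
    procedure."""
    p_a = correct_a / n_lines
    p_b = correct_b / n_lines
    p = (correct_a + correct_b) / (2 * n_lines)      # pooled proportion
    se = math.sqrt(2 * p * (1 - p) / n_lines)        # standard error of the difference
    if se == 0.0:
        return 0.0
    z = (p_b - p_a) / se
    return z  # |z| > 1.96 corresponds to significance at the 95% level (two-sided)
```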
5 Conclusions
We have proposed a novel method to generate and combine ensemble members for offline handwritten text line recognition. Handwritten text line recognition is an extremely challenging field, due to the existence of a large number of word classes, and because the correct number of words in a text line is unknown in advance. Multiple recognisers were generated by the integration of a statistical language model in the hidden Markov model based text line recognition system. We presented three different strategies for ensemble member selection. The results of the individual ensemble members were combined according to the ROVER combination scheme. First, an iterative alignment algorithm is applied to align the results in a single word transition network. Second, the final result is built by extracting the best scoring transcription from the word transition network.

In the experiments, conducted on a large set of text lines extracted from the IAM database, the proposed ensemble methods were able to achieve improvements in the word level accuracy when compared to the base recogniser. The absolute improvement, which is less than 1%, is moderate. However, this phenomenon is common in handwriting recognition, where often enormous effort is needed to achieve any improvement at all. Nevertheless, we note that the obtained improvement is statistically significant. Future work will include the consideration of additional base recognition systems as well as the investigation of other alignment and voting strategies.

Table 1. Results on the test set.

Selection Strategy    Word Level Accuracy
Base recogniser       67.35%
M best                67.82%
Forward search        68.03%
Backward search       68.09%
Acknowledgement

This research was supported by the Swiss National Science Foundation (Nr. 20-52087.97). Additional funding was provided by the Swiss National Science Foundation NCCR program "Interactive Multimodal Information Management (IM)2" in the Individual Project "Scene Analysis".
References

[1] J. Fiscus. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Santa Barbara, pages 347-352, 1997.

[2] P. Gader, M. Mohamed, and J. Keller. Fusion of handwritten word classifiers. Pattern Recognition Letters, 17:577-584, 1996.

[3] S. Günter and H. Bunke. Ensembles of classifiers for handwritten word recognition. International Journal on Document Analysis and Recognition, 5(4):224-232, 2003.

[4] T. Huang and C. Suen. Combination of multiple experts for the recognition of unconstrained handwritten numerals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:90-94, 1995.
[5] S. Johansson, E. Atwell, R. Garside, and G. Leech. The Tagged LOB Corpus, User's Manual. Norwegian Computing Center for the Humanities, Bergen, Norway, 1986.

[6] G. Kim, V. Govindaraju, and S. Srihari. Architecture for handwritten text recognition systems. In S.-W. Lee, editor, Advances in Handwriting Recognition, pages 163-172. World Scientific Publ. Co., 1999.

[7] U.-V. Marti and H. Bunke. Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system. International Journal of Pattern Recognition and Artificial Intelligence, 15:65-90, 2001.

[8] U.-V. Marti and H. Bunke. The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5:39-46, 2002.

[9] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of the IEEE, 77(2):257-286, 1989.

[10] F. Roli, J. Kittler, and T. Windeatt, editors. Proc. of the 5th International Workshop on Multiple Classifier Systems, Cagliari, Italy, 2004.

[11] H. Schwenk and J. Gauvain. Combining multiple speech recognizers using voting & language model information. In International Conference on Speech and Language Processing (ICSLP), Beijing, China, pages 915-918, 2000.

[12] A. Senior and A. Robinson. An off-line cursive handwriting recognition system. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):309-321, 1998.

[13] K. Sirlantzis, M. Fairhurst, and M. Hoque. Genetic algorithm for multiple classifier configuration: A case study in character recognition. In J. Kittler and F. Roli, editors, 2nd International Workshop on Multiple Classifier Systems (MCS), Cambridge, England, pages 99-108, 2001.

[14] A. Vinciarelli, S. Bengio, and H. Bunke. Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):709-720, 2004.
[15] A. Vinciarelli and J. Luettin. Off-line cursive script recognition based on continuous density HMM. In 7th International Workshop on Frontiers in Handwriting Recognition, Amsterdam, The Netherlands, pages 493-498, 2000.

[16] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260-269, 1967.

[17] R. Wagner and M. Fischer. The string-to-string correction problem. Journal of the ACM, 21(1):168-173, 1974.

[18] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1(4):337-348, 1994.

[19] W. Wang, A. Brakensiek, and G. Rigoll. Combination of multiple classifiers for handwritten word recognition. In 8th International Workshop on Frontiers in Handwriting Recognition, Niagara-on-the-Lake, Canada, pages 117-122, 2002.

[20] X. Ye, M. Cheriet, and C. Y. Suen. StrCombo: combination of string recognizers. Pattern Recognition Letters, 23:381-394, 2002.

[21] M. Zimmermann, R. Bertolami, and H. Bunke. Rejection strategies for offline handwritten sentence recognition. In 17th International Conference on Pattern Recognition, Cambridge, England, volume 2, pages 550-553, 2004.

[22] M. Zimmermann and H. Bunke. Hidden Markov model length optimization for handwriting recognition systems. In 8th International Workshop on Frontiers in Handwriting Recognition, Niagara-on-the-Lake, Canada, pages 369-374, 2002.

[23] M. Zimmermann and H. Bunke. Optimizing the integration of a statistical language model in HMM based offline handwriting text recognition. In 17th International Conference on Pattern Recognition, Cambridge, England, volume 2, pages 541-544, 2004.