2010 International Conference on Pattern Recognition
Comparing Several Techniques for Offline Recognition of Printed Mathematical Symbols

Joan Andreu Sánchez
Instituto Tecnológico de Informática
Universidad Politécnica de Valencia
Valencia, Spain
[email protected]

Francisco Álvaro
Instituto Tecnológico de Informática
Universidad Politécnica de Valencia
Valencia, Spain
[email protected]

1051-4651/10 $26.00 © 2010 IEEE. DOI 10.1109/ICPR.2010.481

Abstract—Automatic recognition of printed mathematical symbols is a fundamental problem in the recognition of mathematical expressions. Several classification techniques have been used previously, but very few works compare different classification techniques on the same database and under the same experimental conditions. In this work we have tested both classical and novel classification techniques for mathematical symbol recognition on two databases.

Keywords—Mathematical symbol classification; hidden Markov models; weighted nearest neighbour

I. INTRODUCTION

Automatic recognition of mathematical expressions (ME) constitutes an important research problem in the analysis of scientific documents [1], since it is a key step in their transcription. An adequate solution to this problem would also make it possible to search for ME in scientific documents. Online handwriting recognition techniques have been studied extensively over the last two decades [2]. When recognition is carried out from a document image, offline techniques must be used instead [3], [4]; in this case, most of the research has focused on typeset ME. The main difference between online and offline recognition of ME is the temporal information available in the former, which is lost in the latter.

The recognition of ME comprises two main steps: the recognition of the mathematical symbols (MS) and the structural recognition of the ME [3], [4], [2], [5]. It is worth noting that an adequate selection of efficient techniques for the recognition of MS can greatly help the recognition of ME. Recognition of typeset MS is a difficult problem for several reasons: first, there is a large number of different symbols; second, several font types can appear in the same ME (e.g., roman, italic, and calligraphic); and third, symbols can have different sizes in the same ME.

Several techniques have been proposed for the offline recognition of printed MS, and a good review can be found in [6]. A combination of classifiers was proposed in [6] that achieved very good results on a large database. However, a comparison of several techniques on the same database would be interesting. In this work we compare several classification techniques for the recognition of printed MS that have proved to be very efficient in other classification tasks. The proposed techniques are evaluated on two different databases: the UW-III database [7], a small database with degraded images, and the InftyCDB-1 database [8], a large database with good-quality images. In the following section we briefly describe the classification techniques considered in this work. Experimental results are reported in Section III.
II. OFFLINE RECOGNITION OF MATHEMATICAL SYMBOLS

We intended to compare classical and very efficient classification techniques with other techniques not yet well explored for MS classification. Four classification techniques were considered.

The k-Nearest-Neighbour (k-NN) rule is a very popular pattern classification rule that provides good results when the number of prototypes is large. It is a standard classification technique that has also been tested for MS classification [2]. The support vector machine (SVM) is a maximum-margin classifier that has proved to be a powerful formalism for recognition tasks. In this work we used the multi-class SVM described in [9]. This technique has previously been used for MS recognition with successful results [5]. In addition to these two, we considered two further classification techniques that have not previously been used for MS recognition. These are explained below in more detail.

A. Weighted nearest neighbour

The Weighted Nearest Neighbour (WNN) technique is an improvement of the classical 1-NN rule [10]. A discriminative technique is used to learn a weighted distance from a training set using the 1-NN rule. The distance-weighting scheme can independently emphasize prototypes and/or features. Several alternatives are considered in [10]:
using a different weight for each prototype, using a different weight for each class and feature, or a combination of the previous alternatives. In this work we used the last alternative, that is, a different weight for each prototype combined with a different weight for each class and feature. The reason is that some samples in a training set are more representative than others, and, in the symbol representation, each pixel has a different importance. Consequently, it is reasonable to weight both the prototypes and the features for each class. The experimental results reported in [10] on the UCI and Statlog corpora were comparable to or better than those obtained by other state-of-the-art classification methods.
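As a rough illustration of the combined weighting scheme of [10] (the discriminative learning of the weights is omitted, and all function and variable names here are hypothetical, not from the cited work), a 1-NN rule under a prototype- and class/feature-weighted distance can be sketched as:

```python
import numpy as np

def weighted_distance(x, prototype, proto_weight, class_feature_weights, proto_class):
    """Combined weighting: one scalar weight per prototype plus one
    weight per (class, feature) pair, applied inside a Euclidean distance.
    Illustrative sketch only; weights would be learned as in [10]."""
    w = class_feature_weights[proto_class]   # feature weights of the prototype's class
    diff = x - prototype
    return proto_weight * np.sqrt(np.sum(w * diff * diff))

def wnn_classify(x, prototypes, labels, proto_weights, class_feature_weights):
    """1-NN rule under the learned weighted distance."""
    dists = [weighted_distance(x, p, pw, class_feature_weights, c)
             for p, c, pw in zip(prototypes, labels, proto_weights)]
    return labels[int(np.argmin(dists))]
```

With all weights set to 1 this reduces to the plain Euclidean 1-NN rule; the learned weights then shrink the influence of unreliable prototypes and uninformative pixels.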
B. Hidden Markov models
Hidden Markov models (HMM) have been widely used for MS classification in online ME recognition [2]. However, their use in offline ME recognition remains unexplored. In recent years, HMM have been successfully used for offline handwritten text recognition [11]. In this work we apply the technique described in [11], slightly adapted, to printed MS recognition. Given the novelty of this approach for printed MS recognition, we explain it in more detail (see [11] for additional details).

In the preprocessing stage, noise reduction is carried out on the MS image, which is then normalized to a fixed height, keeping the aspect ratio (see Figure 1.a). The image is transformed into a sequence of fixed-dimension feature vectors as follows: the image is divided into a grid of small square cells, each sized a small fraction of the image height. Each cell is characterized by three features: normalized grey level (see Figure 1.b), horizontal grey-level derivative (Figure 1.c), and vertical grey-level derivative (Figure 1.d). To obtain smoothed values of these features, feature extraction is extended to a 5 × 5 cell window centred at the current cell, weighted by a bidimensional Gaussian function for the grey level and by a unidimensional Gaussian function for the derivatives. The derivatives are computed by fitting a linear function by least squares. Columns of cells are processed from left to right, and a feature vector is built for each column by stacking the features computed in its constituent cells. Figure 1.e shows a graphical representation of the obtained values.
Figure 1. Feature extraction for HMM recognition.
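A simplified version of this column-wise feature extraction (without the Gaussian-windowed smoothing and least-squares derivatives of [11]; simple central differences are used instead, and all names are illustrative) might look like:

```python
import numpy as np

def column_features(img, cell=4):
    """Turn a grey-level symbol image (2-D array, values in [0, 1]) into a
    left-to-right sequence of feature vectors, one per column of cells.
    Each cell contributes its mean grey level plus horizontal and vertical
    grey-level derivatives (central differences here, rather than the
    windowed least-squares fit used in [11])."""
    h, w = img.shape
    rows, cols = h // cell, w // cell
    # mean grey level of every cell
    grey = img[:rows * cell, :cols * cell].reshape(rows, cell, cols, cell).mean(axis=(1, 3))
    dx = np.gradient(grey, axis=1)   # horizontal derivative
    dy = np.gradient(grey, axis=0)   # vertical derivative
    # stack the three features of all cells in a column into one vector
    return [np.concatenate([grey[:, j], dx[:, j], dy[:, j]]) for j in range(cols)]
```

The resulting vector sequence is what a left-to-right HMM consumes, one vector per column, which is what makes the temporal-model machinery of handwriting recognition applicable to static symbol images.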
III. EXPERIMENTS

A. Databases

There are very few publicly available databases with the information needed to isolate MS adequately. Two different databases were used to test the classification techniques described in the previous section: the first was the UW-III database [7], which has not been previously used for MS classification [3]; the second was the InftyCDB-1 database [8], which has also been used for MS classification in other works [5].

The UW-III database [7] is a set of document images from different fields that includes 25 journal document pages containing mathematical formulae. Some of the images come from blurred photocopies. Each image has annotations for the zones where the ME are located, but the MS are not isolated; the annotated zones are not embedded in the text. For this work, we isolated and classified the MS manually1. The complete database had 2,233 symbols. From this set, we removed touching symbols and symbols that appeared fewer than four times, leaving a total of 2,076 symbols. Table I shows the main statistics of this data set. Some symbols are expressed in LaTeX notation to avoid confusion. In the classification experiments, we distinguished analogous symbols with different meanings; thus, the minus symbol and the fraction-line symbol were assigned to different classes.

The InftyCDB-1 database [8] is a set of document images comprising articles on pure mathematics. The database can be used for MS recognition: each MS is manually annotated with its bounding box and its class tag. The number of MS, both in embedded text and in displayed ME, is 157K. Given the large size of this database, we limited the maximum amount of training and test data. We composed four training sets of increasing size (5K, 10K, 20K, and 50K) and one test set (5K). These training and test sets were chosen at random, but keeping the distribution of MS of the original data set. The total number of classes for the experiments was 183.

1 Available at http://www.dsic.upv.es/~jandreu/UW-III-MS.tgz

Table I
UW-III MS AND ABSOLUTE FREQUENCY OF EACH SYMBOL.

1     128    (     122    )      119    =    104
−      81    2      80    \frac   72    p     52
+      48    m      47    [       42    ,     42
]      41    0      40    |       39    θ     37
3      35    ω      34    x       30    .     29
f      28    n      27    h       27    s     26
D      26    u      25    i       25    r     24
\sum   23    d      23    v       21    ∂     21
√      20    µ      20    S       19    R     19
z      17    V      17    t       17    I     17
l      16    k      16    sin     15    α     15
e      15    cos    15    R       13    /     13
0      12    a      12    ∼       11    C     11
O      11    j      11    η       11    c     11
4      11    λ      10    ¨       10    ρ     10
X       9    N       9    K        9    w      8
∈       8    ˆ       8    ∩        8    L      7
T       7    Q       7    π        7    {      7
γ       7    ∆       7    }        7    ¯      7
y       6    Φ       6    E        6    5      6
q       5    ×       4    G        4           4

B. Results

The classification techniques described in Section II were tested on the data sets described above. For the k-NN classification technique, we used several values of k. We did not use any prototype-removal technique. We used the Euclidean distance between two images, taking into account the difference between each pixel. We initially divided the set of prototypes into three classes based on the aspect ratio, which reduced the number of comparisons to a large extent; without this division, the classification error rate remained the same. The MS images in each of these three classes were then adequately normalized. Each new prototype to be classified was compared only with the prototypes of the class corresponding to its aspect ratio.

For the HMM classification technique, we tested different numbers of Gaussian distributions per state; the best results were obtained with 8 Gaussian distributions, so we only report results for that configuration. We used left-to-right models, with the number of states in each model depending on the average width of the symbols of its class, ranging from 1 (vertical bar) to 15 (trigonometric and logarithm functions or the square-root tail). The HTK toolkit2 was used for these experiments. Using the same number of states for all HMM gave clearly worse classification results.

2 http://htk.eng.cam.ac.uk/

For the SVM classification, we used SVMmulticlass with default options3. In the experiments on MS classification with the InftyCDB-1 database reported in [5], a linear kernel obtained better classification results than a Gaussian kernel and a cubic polynomial kernel; therefore, only a linear kernel was used. For the UW-III data set, we used 25% of the data for test and 75% for training, composed at random. Given that the UW-III data set was very small, we repeated this process 100 times. Column ≥4 in Table II shows the average error rate obtained in these experiments. Note that we only considered classes with at least 4 prototypes.

Table II
AVERAGE CLASSIFICATION ERROR RATE FOR THE UW-III DATA SET.

         ≥4            ≥8            ≥16
1-NN     6.34±0.08     5.44±0.08     5.05±0.07
3-NN     8.27±0.09     6.71±0.07     5.70±0.08
5-NN     9.60±0.08     7.78±0.07     6.57±0.09
WNN      5.88±0.07     5.23±0.08     5.02±0.09
SVM      4.85±0.05     4.27±0.04     4.24±0.05
HMM-8   12.19±0.08    10.24±0.08     9.42±0.10

We can see that the best results were obtained with the SVM technique. Note also that the classification error of the k-NN rule increased as k increased. The reason is that, as k increased, there were not enough "similar" prototypes to choose from in classes with few prototypes; in other words, the technique is very sensitive to small displacements of the bounding box. We tested this hypothesis by removing the classes with fewer than 8 prototypes (column ≥8 in Table II) and fewer than 16 prototypes (column ≥16 in Table II). In column ≥4 the difference between rows 1-NN and 5-NN was 3.26, while in column ≥16 it was 1.52, which confirmed our hypothesis (experiments with the InftyCDB-1 data set also confirmed it). We also observed that the WNN classification technique was very competitive. The worst results were obtained with HMM, perhaps due to the low amount of training data; indeed, the difference between this technique and the others decreased as more samples became available for training.

Table III shows the results on the InftyCDB-1 data set. In all cases the results clearly improved as the size of the training set increased. The best results were obtained with SVM, but WNN obtained comparably competitive results. Note also that with a large number of prototypes (column 50K in Table III), the k-NN classification rule obtained similar values for the different values of k. The worst results in this experiment were also obtained with HMM.
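The aspect-ratio pre-partitioning used to cut down the number of k-NN comparisons can be sketched as follows (the thresholds and all names are illustrative, not taken from the experiments; images within a bucket are assumed to be normalized to a common size beforehand):

```python
import numpy as np

def ar_bucket(width, height, low=0.75, high=1.33):
    """Assign a symbol image to one of three aspect-ratio classes:
    tall, roughly square, or wide. Thresholds are illustrative."""
    ratio = width / height
    if ratio < low:
        return "tall"
    if ratio > high:
        return "wide"
    return "square"

class BucketedNN:
    """1-NN classifier that compares a query only against prototypes in
    the same aspect-ratio bucket, using plain Euclidean distance."""

    def __init__(self):
        self.buckets = {"tall": [], "square": [], "wide": []}

    def add(self, img, label, width, height):
        self.buckets[ar_bucket(width, height)].append((img, label))

    def classify(self, img, width, height):
        protos = self.buckets[ar_bucket(width, height)]
        dists = [np.sum((img - p) ** 2) for p, _ in protos]
        return protos[int(np.argmin(dists))][1]
```

Since a wide symbol such as a fraction line is never a plausible match for a tall one such as a vertical bar, restricting comparisons this way cuts cost without affecting the error rate, consistent with the observation above.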
3 http://svmlight.joachims.org/svm_multiclass.html

Table III
CLASSIFICATION ERROR RATE FOR THE INFTYCDB-1 DATA SET.

        5K     10K    20K    50K
1-NN    6.3    4.5    4.3    3.3
3-NN    7.0    5.0    4.3    3.2
5-NN    7.9    5.5    4.4    3.5
WNN     4.8    3.5    3.4    2.8
SVM     4.5    3.4    3.0    2.6
HMM-8   8.1    7.6    7.4    7.3

In all four classification techniques, approximately 50% of the errors involved the overline, minus, fraction-line, underline, and hyphen symbols. These symbols are visually identical, and they should be distinguished by structural methods.

C. Related work

A recognition accuracy of 98.5% on the InftyCDB-1 database is reported in [4]. However, it should be taken into account that the MS classification there was aided by structural information. Other MS classification techniques are reported in [6] on a different database. In that work, symbols are first classified confidently if certain special strokes appear in the MS; this allows the system to classify 39% of the symbols with an accuracy of about 98.3%. For the remaining symbols, the best result obtained was 93.8% accuracy.

IV. CONCLUSIONS

In this work we have compared several techniques for MS classification. The best results were obtained with SVM classifiers and the weighted nearest neighbour technique. The worst results were obtained with HMM using feature extraction techniques that had been successfully applied to handwritten text recognition. Although HMM did not seem appropriate for this classification problem, in future work we intend to explore this technique for the classification of touching symbols in ME, since it requires no explicit symbol segmentation.

ACKNOWLEDGMENTS

Work supported by the EC (FEDER/FSE) and the Spanish MEC/MICINN under the MIPRCV "Consolider Ingenio 2010" program (CSD2007-00018), the MITTRAL (TIN2009-14633-C03-01) project, and by the Generalitat Valenciana under grant Prometeo/2009/014.

REFERENCES

[1] K. Chan and D. Yeung, "Mathematical expression recognition: a survey," International Journal on Document Analysis and Recognition, vol. 3, pp. 3–15, 2000.
[2] U. Garain and B. Chaudhuri, "Recognition of online handwritten mathematical expressions," IEEE Trans. on Systems, Man, and Cybernetics - Part B: Cybernetics, vol. 34, no. 6, pp. 2366–2376, 2004.
[3] R. Zanibbi, D. Blostein, and J. Cordy, "Recognizing mathematical expressions using tree transformation," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 11, pp. 1–13, 2002.
[4] M. Suzuki, F. Tamari, R. Fukuda, S. Uchida, and T. Kanahori, "Infty: an integrated OCR system for mathematical documents," in Proc. of ACM Symposium on Document Engineering, C. Vanoirbeek, C. Roisin, and E. Munson, Eds., Grenoble, 2003, pp. 95–104.
[5] C. Malon, S. Uchida, and M. Suzuki, "Mathematical symbol recognition with support vector machines," Pattern Recognition Letters, vol. 29, no. 9, pp. 1326–1332, 2008.
[6] U. Garain, B. Chaudhuri, and R. Ghosh, "A multiple-classifier system for recognition of printed mathematical symbols," in Proc. ICPR, vol. 1, 2004, pp. 380–383.
[7] I. Phillips, "Methodologies for using UW databases for OCR and image understanding systems," in Proc. SPIE, Document Recognition V, vol. 3305, 1998, pp. 112–127.
[8] M. Suzuki, S. Uchida, and A. Nomura, "A ground-truthed mathematical character and symbol image database," in Proc. 8th International Conference on Document Analysis and Recognition (ICDAR'05), 2005, pp. 675–679.
[9] K. Crammer and Y. Singer, "On the algorithmic implementation of multiclass kernel-based vector machines," J. Mach. Learn. Res., vol. 2, pp. 265–292, 2002.
[10] R. Paredes and E. Vidal, "Learning weighted metrics to minimize nearest-neighbor classification error," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 28, no. 7, 2006.
[11] A. Toselli, A. Juan, and E. Vidal, "Spontaneous handwriting recognition and classification," in Proc. of the 17th International Conference on Pattern Recognition, Cambridge, UK, August 2004, pp. 433–436.