2010 International Conference on Pattern Recognition

Holistic Urdu Handwritten Word Recognition Using Support Vector Machine

Malik Waqas Sagheer, Chun Lei He, Nicola Nobile, Ching Y. Suen
CENPARMI (Centre for Pattern Recognition and Machine Intelligence)
Computer Science and Software Engineering Department, Concordia University
Montreal, Quebec, Canada
{m_saghee, cl_he, nicola, suen}@cenparmi.concordia.ca

Abstract: Since the Urdu language has more isolated letters than Arabic and Farsi, research on Urdu handwritten word recognition is needed. This paper presents a novel approach that uses compound features and a Support Vector Machine (SVM) for offline Urdu word recognition. Because Urdu script is cursive, a holistic approach to classification is adopted. Compound feature sets, which comprise structural and gradient (directional) features, are extracted from each Urdu word. Experiments were conducted on the CENPARMI Urdu Words Database, and a high recognition accuracy of 97.00% was achieved.

I. INTRODUCTION

Offline Handwritten Word Recognition (HWR), especially cursive handwritten word recognition, has become a challenging area in pattern recognition. Urdu, which is also written cursively, is one of the most widely used languages of South Asia. Its alphabet was derived from the Farsi alphabet, which itself is derived from the Arabic alphabet. Urdu is one of the 23 official languages of India and one of the two official languages of Pakistan. Urdu is also one of the most spoken languages in the world [15]. Like Arabic, Urdu is written from right to left. However, Urdu has more isolated letters (38) than Arabic (28) and Persian (32), as shown in Fig. 1. Very little research has been done on Urdu handwriting recognition: the existing work addresses only online recognition ([1-2]), and there is no work on offline Urdu handwritten word recognition.

In this paper, we design an offline word recognition system for Urdu. The approach is new in that we extract two different feature sets and use a Support Vector Machine (SVM) for classification. Because Urdu script is cursive, classification using a holistic approach is more efficient. Although SVMs often outperform other classifiers, they are difficult to apply in segmentation-based word recognition, where the number of features varies. Accordingly, word recognition without segmentation (the holistic approach) makes it possible to use an SVM for classification. Compound feature sets, which comprise structural and gradient (directional) features, are extracted from each Urdu word. Since gradient features retain both position and direction information, the system achieves good performance with them [8]. Combining the structural features with them improved the performance further.

In Section II, the pre-processing procedures are described; the compound feature sets and the classification using an SVM are described in Sections III and IV, respectively.

1051-4651/10 $26.00 © 2010 IEEE
DOI 10.1109/ICPR.2010.468

Figure 1. Urdu, Farsi and Arabic Characters

The experiments and results are given in Section V, and we conclude our work in Section VI.

II. PRE-PROCESSING

Recognition results rely on the pre-processing phase. Even a little noise in the image can result in incorrect recognition. Our pre-processing phase consists of the following steps:

1) During database creation, writers used either blue or black ink. Sometimes the color of the ink was very light, or the intensity of the handwriting was lost during the scanning process. We keep both binarized and greyscale images in pre-processing so that different features can be extracted from them.

2) Median filters were applied to the binary images to reduce salt-and-pepper noise and to smooth the images.

3) Since white space around the written word does not help recognition, it was eliminated by cropping each image to the smallest bounding box of the word.

4) Before extracting the features, it was necessary to normalize the size of all images for the training and testing phases. Size normalization is an important pre-processing operation. Each image was normalized to two different sizes, 64 x 64 pixels and 128 x 128 pixels, while preserving the aspect ratio, for the different feature sets.
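Steps 2) to 4) above can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, ink pixels are assumed to be 1 and background 0, the median window is assumed to be 3 x 3, and the normalized word is assumed to be centered on the square canvas with nearest-neighbour resampling.

```python
from statistics import median

def median_filter(img, k=3):
    """k x k median filter on a 2D list; border pixels use the neighbours that exist."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = int(median(window))
    return out

def crop_to_bounding_box(img):
    """Remove the white space around the word (ink = 1, background = 0)."""
    rows = [y for y, row in enumerate(img) if any(row)]
    cols = [x for x, col in enumerate(zip(*img)) if any(col)]
    if not rows:  # blank image: nothing to crop
        return img
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]

def normalize_size(img, size):
    """Scale to fit a size x size canvas, keeping the aspect ratio.

    Assumption: nearest-neighbour resampling and centering on the canvas.
    """
    h, w = len(img), len(img[0])
    scale = size / max(h, w)
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas = [[0] * size for _ in range(size)]
    for y in range(new_h):
        src_y = min(h - 1, int(y / scale))
        for x in range(new_w):
            src_x = min(w - 1, int(x / scale))
            canvas[top + y][left + x] = img[src_y][src_x]
    return canvas
```

Calling `normalize_size(cropped, 64)` and `normalize_size(cropped, 128)` on the same cropped word would give the two sizes used for the two feature sets.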

III. FEATURE EXTRACTION

For handwriting recognition, different features such as chain codes, structural features, statistical features, curvature features, projection profiles, and gradient features (directional features) have been used in different studies ([3-4], [7-8], [10]). In our study, we extracted two different types of features from each image: gradient features and structural features. We conducted various experiments with different feature sets and found that the combination of these two feature sets produced good results for our Urdu HWR system.

A. Gradient Feature Extraction

Gradient features are directional features and can be extracted from the gradient of a greyscale image of handwritten text. A gradient vector of discrete direction is decomposed into two components by specifying a number of standard directions (chain code directions). In our gradient feature extraction phase, after the pre-processing phase, each image of size 128 x 128 pixels was converted back into a greyscale image. The following steps were applied for the extraction of features:

• Roberts filter masks were applied to the images. These masks are shown in Fig. 2.

Figure 2. Roberts Filter Masks for Extracting Gradient Features

• Let I(x, y) be an input image of size 128 x 128 pixels. After applying the Roberts filter masks, the horizontal gradient component ($g_x$) and vertical gradient component ($g_y$) were calculated as follows:

$g_x = I(x+1, y+1) - I(x, y)$    (1)

$g_y = I(x+1, y) - I(x, y+1)$    (2)

• The gradient strength and direction of each pixel I(x, y) were calculated as follows:

Strength: $f(x, y) = \sqrt{g_x^2 + g_y^2}$    (3)

Direction: $\theta(x, y) = \tan^{-1}(g_y / g_x)$    (4)

Some examples of gradient images of Urdu words are shown in Fig. 3.

Figure 3. (a) Greyscale image of size 128x128, (b) Gradient Strength, (c) Gradient Direction

In Equation 4, θ(x, y) returns the direction of the vector ($g_x$, $g_y$) in the range [-π, π]. These gradient directions were quantized into 32 intervals of π/16 each. The gradient image was divided into 81 blocks, with 9 vertical blocks and 9 horizontal blocks, as shown in Fig. 4. For each block, the gradient strength was accumulated in 32 directions. After this step, the feature vector has a total size of 9 x 9 x 32 = 2592. A 5 x 5 Gaussian filter was applied to reduce the spatial resolution by downsampling the number of blocks from 9 x 9 to 5 x 5. Also, the number of directions was reduced from 32 to 16 by downsampling with a weight vector [1 4 6 4 1], to obtain a feature vector of size 400 (5 horizontal blocks x 5 vertical blocks x 16 directions).

Figure 4. Division of Gradient Image into 9-by-9 Blocks.
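The gradient feature pipeline described in this subsection (Roberts gradients, strength and direction, 32-direction quantization, 9 x 9 block accumulation, and reduction to 400 values) can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, `img[y][x]` is taken to hold I(x, y), `atan2` stands in for the inverse tangent so that the full [-π, π] range is covered, the 5 x 5 Gaussian filter is approximated by the outer product of [1 4 6 4 1] with zero padding at the borders, both reductions sample every second block or direction, and the weights are applied unnormalized.

```python
import math

WEIGHTS = (1, 4, 6, 4, 1)

def roberts_gradients(img):
    """Horizontal and vertical gradient components; img[y][x] holds I(x, y)."""
    h, w = len(img), len(img[0])
    gx = [[img[y + 1][x + 1] - img[y][x] for x in range(w - 1)]
          for y in range(h - 1)]
    gy = [[img[y][x + 1] - img[y + 1][x] for x in range(w - 1)]
          for y in range(h - 1)]
    return gx, gy

def block_histogram(gx, gy, n_blocks=9, n_dirs=32):
    """Accumulate gradient strength per block and quantized direction."""
    h, w = len(gx), len(gx[0])
    step = 2 * math.pi / n_dirs  # pi/16 for 32 directions
    hist = [[[0.0] * n_dirs for _ in range(n_blocks)] for _ in range(n_blocks)]
    for y in range(h):
        for x in range(w):
            strength = math.hypot(gx[y][x], gy[y][x])
            if strength == 0:
                continue
            theta = math.atan2(gy[y][x], gx[y][x])  # in [-pi, pi]
            d = int((theta + math.pi) / step) % n_dirs
            hist[y * n_blocks // h][x * n_blocks // w][d] += strength
    return hist

def reduce_blocks(hist):
    """9 x 9 -> 5 x 5 blocks (assumed: separable weights, zero padding)."""
    n, n_dirs, half = len(hist), len(hist[0][0]), len(WEIGHTS) // 2
    out = []
    for cy in range(0, n, 2):
        row = []
        for cx in range(0, n, 2):
            acc = [0.0] * n_dirs
            for j, wy in enumerate(WEIGHTS):
                for i, wx in enumerate(WEIGHTS):
                    by, bx = cy + j - half, cx + i - half
                    if 0 <= by < n and 0 <= bx < n:
                        for d in range(n_dirs):
                            acc[d] += wy * wx * hist[by][bx][d]
            row.append(acc)
        out.append(row)
    return out

def reduce_directions(bins):
    """32 -> 16 directions with circular [1 4 6 4 1] weighting."""
    n, half = len(bins), len(WEIGHTS) // 2
    return [sum(wt * bins[(c + k - half) % n] for k, wt in enumerate(WEIGHTS))
            for c in range(0, n, 2)]

def gradient_features(img):
    gx, gy = roberts_gradients(img)
    hist = block_histogram(gx, gy)   # 9 x 9 x 32 = 2592 values
    coarse = reduce_blocks(hist)     # 5 x 5 x 32
    return [v for row in coarse for cell in row
            for v in reduce_directions(cell)]  # 5 x 5 x 16 = 400 values
```

For a 128 x 128 greyscale image given as a list of rows, `gradient_features(img)` returns a flat list of 400 values.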

extraction, the different features, and the different number of training samples. Different experiments were performed with different image sizes. Table I shows the comparison of the various normalization sizes; the 128-by-128 size performed best. We also tested the Sobel filter for extracting the gradient features, but we achieved a higher accuracy with the Roberts filter: the performance increased from 95.72% to 96.02% when testing on the same feature sets. Recognition results with different feature sets are shown in Table II. We also compared our results with a different training set. By increasing the size of the training set by approximately 3800 samples, the recognition rate increased from 96.63% to 97.00%.

B. Structural Feature Extraction

In our study, we extracted only the upper profile features and the well-known horizontal and vertical projection profiles from each normalized (64 x 64) binary image. Structural features are physical attributes of an image. They include projection profiles, topological features [14], upper and lower profiles [10], and so on. The shape outline of a handwritten or printed text is captured by the upper and lower profile features [4-8]. The upper profile features capture the outline shape of the top part of the word. Each image was converted into a two-dimensional array. Each column was examined from right to left. For each column, the distance from the top of the image to the first ink pixel was calculated. This step returns a distance in the range of 0 to 64. The outline shape of a word captured by the upper profile is shown in Fig. 5(b).
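The upper profile and the projection profiles used as structural features can be sketched as follows. The function names are hypothetical, ink pixels are assumed to be 1, and a column without any ink is assumed to take the full image height (64 for a 64 x 64 image), which is one way to obtain the stated range of 0 to 64.

```python
def upper_profile(img):
    """Distance from the top edge to the first ink pixel, per column, right to left."""
    h, w = len(img), len(img[0])
    profile = []
    for x in range(w - 1, -1, -1):  # columns are scanned from right to left
        dist = h                    # assumption: empty column -> image height
        for y in range(h):
            if img[y][x]:
                dist = y
                break
        profile.append(dist)
    return profile

def projection_profiles(img):
    """Ink-pixel counts per row (horizontal) and per column (vertical)."""
    horizontal = [sum(row) for row in img]
    vertical = [sum(col) for col in zip(*img)]
    return horizontal, vertical
```

On a 64 x 64 binary image this yields 64 upper-profile values plus 64 horizontal and 64 vertical projection values.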

VI. CONCLUSION

In this study, we designed a holistic handwritten Urdu word recognition system that achieves a high recognition rate (97.00%) on the CENPARMI Urdu Database. The holistic approach makes it possible to use a high-performance classifier (SVM) for cursive word recognition. Compound feature sets, which comprise structural and gradient (directional) features, are extracted effectively in this study.

Figure 5. A Sample of an Urdu Word and its Outline Shape

A variable transformation ( y = . ) was applied to all features to make their distribution Gaussian-like [8] and to bring all features to the same type and scale.

IV. CLASSIFICATION WITH AN SVM

TABLE I: EXPERIMENTAL RESULTS WITH DIFFERENT IMAGE SIZES FOR GRADIENT FEATURES

Applying SVMs to word recognition in Arabic script is a new approach, since the HMM classifier has been the state-of-the-art classifier for Arabic/Farsi word recognition ([4-7]). In our study, we used an SVM classifier for recognition. With a suitable kernel function, an SVM can transform a non-linearly separable problem into a linearly separable one by projecting the data into the feature space, where it then finds the optimal separating hyperplane. The Radial Basis Function (RBF) was chosen as the kernel in this research. The parameters were chosen by cross-validation.
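As a small illustration of the two ingredients named above, the sketch below evaluates an RBF kernel and generates k-fold splits of the kind that could be used to score candidate parameters. It is not a full SVM trainer and not the authors' setup: the function names, the kernel width `gamma`, and the number of folds are assumptions made for the example.

```python
import math

def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2) for two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]
```

In a parameter search, each candidate setting would be trained on every `train` split and scored on the matching validation split, and the setting with the best average score would be kept.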

TABLE II: COMPARISON OF RECOGNITION RESULTS WITH DIFFERENT FEATURES

V. EXPERIMENTS AND RECOGNITION RESULTS

We used the CENPARMI Urdu word dataset [9]. This dataset consists of 57 words (mostly finance-related), including words for weights, measurements, and currencies. Samples of some of these Urdu words are shown in [9]. The samples were divided into training (60%), validation (20%), and testing (20%) sets. We combined the validation set with the training set, giving a new training set of 14,407 samples. The testing set contained 3,770 word images. In our experiments, recognition results were compared across different image sizes for normalization, different operators for gradient feature


The best recognition result was achieved when normalizing the images to 128 x 128, applying the Roberts filter to extract gradient features, using the compound feature set of gradient and structural features, and training the classifier with all samples in both the training and validation sets. He et al. [16] have shown that the optimal normalization size for numerals (one component) is about 32 x 32. Since most Urdu words have 4 or fewer connected components (excluding diacritics), normalizing Urdu words to 128 x 128 (four components at roughly 32 pixels each) is appropriate.

REFERENCES

[1] S. Malik and S. A. Khan, “Urdu Online Handwriting Recognition,” in the Proceedings of the URSI, Oct 2005.
[2] M. I. Razzak, S. A. Hussain, M. Sher and Z. S. Khan, “Combining Offline and Online Preprocessing for Online Urdu Character Recognition,” in the Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS 2009), Vol. I, March 18-20, 2009, Hong Kong.
[3] A. Benouareth, A. Ennaji and M. Sellami, “Arabic Handwritten Word Recognition Using HMMs with Explicit State Duration,” EURASIP Journal on Advances in Signal Processing, Vol. 2008, pp. 1-13.
[4] M. Dehghan, K. Faez, M. Ahmadi and M. Shridhar, “Handwritten Farsi (Arabic) Word Recognition: A Holistic Approach Using Discrete HMM,” Pattern Recognition, vol. 34, pp. 1057-1065, 2001.
[5] R. El-Hajj, L. Likforman-Sulem and C. Mokbel, “Arabic Handwriting Recognition Using Baseline Dependant Features and Hidden Markov Modeling,” in the Proceedings of the Int’l Conf. Document Analysis and Recognition, pp. 893-897, 2005.
[6] M. S. Khorsheed, “Recognising Handwritten Arabic Manuscripts Using a Single Hidden Markov Model,” Pattern Recognition Letters, vol. 24, pp. 2235-2242, 2003.
[7] M. Pechwitz and V. Märgner, “HMM Based Approach for Handwritten Arabic Word Recognition Using the IFN/ENIT-Database,” in the Proceedings of the Int’l Conf. Document Analysis and Recognition, pp. 890-894, 2003.
[8] M. Shi, Y. Fujisawa, T. Wakabayashi, and F. Kimura, “Handwritten numeral recognition using gradient and curvature of gray scale image,” Pattern Recognition, vol. 35, no. 10, pp. 2051-2059, 2002.
[9] M. W. Sagheer, C. L. He, N. Nobile and C. Y. Suen, “A New Large Urdu Database for Off-Line Handwriting Recognition,” in the Proceedings of the ICIAP (International Conference on Image Analysis and Processing), Salerno, Italy, pp. 538-546, Sept 8-11, 2009.
[10] Z. Aghbari and S. Brook, “HAH manuscripts: A holistic paradigm for classifying and retrieving historical Arabic handwritten documents,” Expert Systems with Applications: An International Journal 36(8), pp. 10942-10951, 2009.
[11] B. Gatos, I. Pratikakis, A. L. Kesidis and S. J. Perantonis, “Efficient Off-Line Cursive Handwriting Word Recognition,” in the Proceedings of the Tenth International Workshop on Frontiers in Handwriting Recognition (IWFHR), La Baule, October 23-26, 2006.
[12] R. Hamdi, F. Bouchareb, and M. Bedda, “Handwritten Arabic character recognition based on SVM Classifier,” Information and Communication Technologies: From Theory to Applications, 2008 (ICTTA), pp. 1-4, 2008.
[13] A. R. Ahmad, C. Viard-Gaudin and M. Khalid, “Lexicon-Based Word Recognition Using Support Vector Machine and Hidden Markov Model,” 10th International Conference on Document Analysis and Recognition (ICDAR), pp. 161-165, 2009.
[14] M. Cheriet, N. Kharma, C. L. Liu, and C. Y. Suen, “Character recognition systems: a guide for students and practitioners,” Wiley-Interscience, 2007.
[15] C. L. He, P. Zhang, J. X. Dong, C. Y. Suen, and T. D. Bui, “The Role of Size Normalization on the Recognition Rate of Handwritten Numerals,” Proc. of IAPR TC3 Workshop of 8th International Conference on Document Analysis and Recognition (ICDAR’05): Neural Networks and Learning in Document Analysis and Recognition, Seoul, Korea, pp. 8-12, 2005.
