A Fast Recognition System for Isolated Arabic Characters

Dr. JOHN COWELL
Dept. of Computer Science, De Montfort University, The Gateway, Leicester, LE1 9BH, ENGLAND
[email protected]

Dr. FIAZ HUSSAIN
Dept. of Computing & Information Systems, University of Luton, Luton, LU1 3JU, ENGLAND
[email protected]

Abstract
This paper presents a very fast multi-stage algorithm for the recognition of non-Latin script. Although the examples use Arabic script, the system could be adapted in minutes to deal with any character set, in particular non-Latin characters for which no commercial OCR systems are available. The approach normalises isolated characters for size, extracts an image signature based on the number of black pixels in the rows and columns of the character, and compares these values to a set of signatures for typical characters of the set. This technique identifies not only the closest match but also gives the closeness of match to all other characters in the set, which is expressed in a triangular Confusion Matrix.

Keywords: Arabic, fonts, normalisation, OCR, pattern recognition, confusion matrix, image signatures.

1. Introduction
Optical character recognition (OCR) systems for Latin script are widely available and perform well on high quality printed text. OCR systems for non-Latin script including Arabic, however, are not commercially available and this is still an important area for research. A wide variety of approaches have been considered, but these can broadly be categorised into on-line and off-line techniques. On-line techniques record the movement of a pen on a drawing tablet and use its speed and direction of movement as part of the recognition process. There are numerous papers which cover on-line techniques and a comprehensive survey is given by Nouboud et al [1]. Although on-line recognition tends to give better results than the off-line approaches, often the text to be recognised is only available as hard copy and therefore on-line recognition is not possible. Mori et al [2] and Tappert et al [3] give an extensive review of the literature for off-line recognition. Plamondon and Srihari [4] review both on-line and off-line approaches.

The recognition of non-Latin script, and Arabic script in particular, has been far less extensively studied [6-14]. Amin [15] provides an extensive survey of off-line approaches to recognition of Arabic script. El-Wakil [16] reviews on-line recognition of Arabic script. Both areas are reviewed by Al-Badr et al [17].

The structural approach is popular because it is widely used for Latin script [24-31]. However, while such techniques are successful for Latin script, they are not as effective for cursive script such as Arabic. The approach requires the character to be thinned to a skeletal form and structural information, such as the number and position of holes and strokes, to be extracted. There are numerous problems with thinning large, poor quality cursive characters, as described by the authors [18]. In addition, the process of thinning and feature extraction is relatively slow. Furthermore, the number of dots and their position does not determine the meaning of Latin characters, but this is highly significant for many non-Latin scripts. In Arabic, there may be one, two or three dots, which may be positioned above, below or in the middle of the character.

The recognition of isolated Arabic characters using template comparison is a proven approach to achieving a robust recognition system [19-21]. One of the main problems, however, is the speed of recognition. A 12-point character scanned at 600 dpi will yield a pattern of up to 100 pixels high and 100 pixels wide. Comparing patterns 10,000 pixels in size with a large set of templates is a time-consuming process and has been one of the main criticisms of this approach. However, an important advantage of this approach is that it can be adapted in a few hours to deal with any character set, provided that typical members of the set are available to produce the template set.
The characters may require normalisation for size and orientation. In addition, not only is the closest match identified, but the closeness of match to all the other characters is also calculated and can be expressed in a triangular Confusion Matrix [21-23].
Proceedings of the Sixth International Conference on Information Visualisation (IV’02) 1093-9547/02 $17.00 © 2002 IEEE
The main drawback of this approach is that it is slow: each input pattern, typically 100×100 pixels, must be compared against a template of the same size for every character in the set. This paper presents an approach using simple image signatures which can be rapidly derived and compared against a set of signatures typical of the characters in the character set. The approach has been successfully employed for classification and identification of similar images in large image databases [32]. Signatures must be capable of being derived quickly and must be much smaller than the originating images. This approach retains all of the benefits of template comparison but is typically two orders of magnitude faster. The new approach makes use of four phases, as outlined below:
• Read the input character.
• Normalise the character to a standard size. A resolution of 100×100 pixels has been used in the implementation.
• Count the number of black pixels in every row and every column.
• Compare the character signature against the signature templates.
This process yields a good separation of characters and requires only 200 values to be compared for each template rather than 10,000, a 50-fold reduction in the number of comparisons.
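The factor of 50 follows from a simple count. The sketch below assumes that cost is measured only by the number of values compared per template; the symbol N (the side of the normalised character, 100 here) is introduced for illustration and does not appear in the paper.

```latex
\[
\underbrace{N^2}_{\text{template comparison}} = 100^2 = 10{,}000,
\qquad
\underbrace{2N}_{\text{signature comparison}} = 2 \times 100 = 200,
\qquad
\frac{N^2}{2N} = \frac{N}{2} = 50 .
\]
```

Under this count the saving grows linearly with N, so a larger normalised size would widen the gap between the two methods.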
2. Normalising for Size
The first stage in the process is to ensure that the input character has the same dimensions as the characters used to create the signatures. This is achieved by stretching or compressing the character so that it has an exact resolution of 100 × 100 pixels. Figure 1 depicts the case for the Arabic character hamza. The figure on the right shows the character as it is scanned; the figure on the left shows how that character is expanded so that it has exactly the size required and therefore touches each side of the 100×100 square as shown.
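The normalisation step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the character is held as a list of rows of 0 (white) and 1 (black), crops to the bounding box of the black pixels, and resamples with nearest-neighbour sampling. The paper does not state which resampling method was used, and the function name `normalise` is hypothetical.

```python
def normalise(image, size=100):
    """Stretch or compress the inked area of a binary image to size x size.

    `image` is a list of rows, each a list of 0 (white) or 1 (black).
    Nearest-neighbour sampling is an assumption made for this sketch.
    """
    # Rows and columns that contain at least one black pixel.
    ink_rows = [r for r, row in enumerate(image) if any(row)]
    ink_cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    if not ink_rows:
        raise ValueError("image contains no black pixels")

    # Bounding box of the character.
    top, left = ink_rows[0], ink_cols[0]
    height = ink_rows[-1] - top + 1
    width = ink_cols[-1] - left + 1

    # Map each output pixel back to a source pixel inside the bounding box.
    return [
        [image[top + (y * height) // size][left + (x * width) // size]
         for x in range(size)]
        for y in range(size)
    ]
```

Cropping to the bounding box before scaling is what makes the stretched character reach the edges of the square, as described for Figure 1.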
Figure 1 Normalised characters and their original form.
Figure 2 Normalised characters and their original form.
Figure 2 shows the typical characters for the entire Arabic character set. The right pattern of each pair is the character as scanned and the character on the left of every pair is the normalised form. The table should be read from right to left, as is the convention with Arabic text.
3. Counting pixels in rows and columns
The signature for each character is produced through a process of iteration. We loop to count the number of black pixels in each of the 100 rows and then in each of the 100 columns. These counts are compared against the corresponding counts of black pixels for each template in the set. For each row, the absolute value of the difference between the two counts is calculated and these values are summed. The process is repeated for the columns, and the two sums (one for rows and the other for columns) are added together. A complete match yields a sum of zero, while the other extreme yields a value of 20,000 when a 100% exclusive-OR of the input character with a template occurs (for example, an all-black pattern compared with an all-white one). This outcome can be more readily appreciated by dividing the resulting difference value by 200, which converts it to a value between 0 and 100.
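The signature and its comparison can be sketched in a few lines. This is an illustrative sketch under the same assumed representation as before (a list of rows of 0 and 1); the function names `signature`, `difference` and `scaled_difference` are hypothetical and not taken from the authors' system.

```python
def signature(image):
    """Return (row_counts, column_counts) of black pixels in a binary image."""
    row_counts = [sum(row) for row in image]
    col_counts = [sum(col) for col in zip(*image)]
    return row_counts, col_counts


def difference(sig_a, sig_b):
    """Sum of absolute differences over all rows plus all columns."""
    rows_a, cols_a = sig_a
    rows_b, cols_b = sig_b
    row_diff = sum(abs(a - b) for a, b in zip(rows_a, rows_b))
    col_diff = sum(abs(a - b) for a, b in zip(cols_a, cols_b))
    return row_diff + col_diff


def scaled_difference(sig_a, sig_b, size=100):
    """Scale the raw difference to 0..100 (a division by 200 when size is 100)."""
    return difference(sig_a, sig_b) / (2 * size * size / 100)
```

With 100×100 images, two identical characters give a scaled difference of 0 and an all-black pattern against an all-white one gives 100, matching the two extremes described above.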
4. The implementation of the system
The system described above has been fully implemented and is shown running in figure 3. It works as follows. When the application is executed, the first step is to select the Choose Configuration File option from the Language menu. This specifies a simple text file containing information such as the location of the template files and the number and names of the characters in the character set. Although figure 3 shows the application running with the Arabic character set, it would run without changes for any character set for which a set of sample characters is available. A configuration file for any character set can be produced in a matter of a few minutes and can then be used to recognise the respective characters.

The file name of the character to be recognised is specified in the text box (its file location is specified in the configuration file). The recognition process is initiated by clicking on the Do All button. Figure 3 contains three images in the top half. From left to right, these are the character submitted to the recognition system, the normalised form of that character and the template character found to have the closest match.

The bottom half of the signature recognition system in figure 3 gives the closeness of match between the input character and all other characters in the character set. This is one of the strengths of this approach: not only is the closest match found, but also how closely the other characters match the input character. If a second template returns a similar match, additional processing can be used to give confidence to the result.
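Ranking every template against the input, and flagging a near tie between the best two, might look like the sketch below. It assumes each signature is a flat list of counts (rows followed by columns) and that templates are held in a dictionary keyed by character name. The `margin` threshold and all function names are hypothetical; the paper does not specify how "a similar match" is decided.

```python
def signature_difference(sig_a, sig_b):
    """Sum of absolute differences between two flat signatures."""
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b))


def rank_templates(input_sig, templates):
    """Return (name, raw_difference) pairs, closest template first.

    `templates` maps a character name to its signature.
    """
    scored = sorted(
        (signature_difference(input_sig, sig), name)
        for name, sig in templates.items()
    )
    return [(name, score) for score, name in scored]


def is_ambiguous(ranking, margin=200):
    """True when the runner-up is within `margin` of the best match.

    `margin` is an illustrative value, not one given in the paper.
    """
    return len(ranking) > 1 and ranking[1][1] - ranking[0][1] <= margin
```

A caller would take `ranking[0]` as the recognised character and pass the input on to further processing, such as the dot analysis discussed later, only when `is_ambiguous` returns True.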
Figure 3 The signature recognition system.
The application produces its result very quickly. The present prototype, which is a research tool and is not optimised for speed, recognises about 100 characters per second. This could be significantly improved by applying standard software engineering techniques to reduce disk reads and writes and by removing the diagnostic debugging code.
5. The Confusion Matrix
The closeness of match between every pair of characters can be found by comparing every template against every other template. This produces a triangular matrix called the Confusion Matrix [33]. An application to calculate the Confusion Matrix using this algorithm has been implemented and is shown running in figure 4. As for the recognition system, a Configuration File is specified which contains the locations of the required template files. Provided that a set of templates and a simple Configuration File are available, this application can be run without any additional work for other character sets.

Another useful component of the application is the visualisation of the closeness of match. In figure 5, the horizontal axis shows the closeness of match, a value between 0 and 100. The vertical axis is the number of pairs of compared templates which have a particular closeness of match. The graph is displayed by choosing View | Show Graph of Distribution.
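A triangular matrix of this kind, together with the distribution plotted in figure 5, can be sketched as follows. Two assumptions are made and should be read as such: signatures are flat lists of counts, and closeness is taken as 100 minus the scaled difference, so that similar characters score near 100. The paper reports high ratings for similar pairs but does not give this formula explicitly. The function names and the bin width are hypothetical.

```python
from itertools import combinations


def confusion_pairs(templates, size=100):
    """Closeness (0..100) for every unordered pair of templates.

    Each pair appears once, which is why the matrix is triangular.
    Closeness = 100 * (1 - raw / max) is an assumed convention.
    """
    max_diff = 2 * size * size  # 20,000 for 100x100 characters
    result = {}
    for (name_a, sig_a), (name_b, sig_b) in combinations(templates.items(), 2):
        raw = sum(abs(a - b) for a, b in zip(sig_a, sig_b))
        result[(name_a, name_b)] = 100 * (1 - raw / max_diff)
    return result


def distribution(pairs, bin_width=10):
    """Count how many template pairs fall into each closeness bin."""
    bins = [0] * (100 // bin_width + 1)
    for closeness in pairs.values():
        bins[int(closeness // bin_width)] += 1
    return bins
```

For n templates this performs n(n-1)/2 signature comparisons, each over 200 values when the characters are 100×100.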
6. Discussion of results
There are twelve pairs of characters which have confusion ratings of 90 or more. These are as follows:
• 97: haa/jeem.
• 93: waw/raa.
• 91: thaa/taa, sad/sheen, zein/raa and noon/faa.
• 90: thaa/noon, dad/sheen, qaf/faa, sad/sheen, yaa/qaf, and sad/seen.

These results were as expected. In many non-Latin character sets including Arabic, small groups of characters are identical apart from the number and position of dots. In Arabic up to three dots are found. These dots may be positioned entirely above the main body of the character, entirely below it, or in the middle of it (that is, no part of the dot is either above or below the main part of the character). The characters which are most likely to be confused by this approach can be distinguished by examining features such as dot position and number, and possibly other features such as the number of occlusions.
Figure 4 The Confusion Matrix system.
Figure 5 The distribution of results. (Horizontal axis: closeness of match, 0 to 100; vertical axis: number of template pairs.)
7. Conclusions
This paper describes a fast recognition system based on creating image signatures which can be used for any character set. It has been demonstrated using the Arabic character set but could read any character set with a small amount of work to create the signatures for idealised characters. The system is very rapid and even in the prototype stage is able to identify in the region of 100 characters per second. This performance could be improved significantly by addressing implementation issues. The system not only identifies a character but also gives a measure of how close other characters are to the one recognised. The Confusion Matrix gives the degree of similarity between characters. Additional work is needed to consider the benefits of pre-processing to identify other features, such as dot position and number and the number of occlusions, in order to separate characters which the Confusion Matrix indicates are most likely to be confused.

8. Bibliography
[1] Nouboud F. and Plamondon R., On-line recognition of handprint characters: Survey and beta tests, Pattern Recognition 25, 1031-1044 (1990).
[2] Mori S., Suen C.Y. and Yamamoto K., Historical review of OCR research and development, Proceedings IEEE 80, 1029-1058 (1992).
[3] Tappert C.C., Suen C.Y. and Wakahara T., On-line handwriting recognition - a survey, Proceedings 9th ICPR International Conference on Pattern Recognition ICPR9, Rome, Italy, IEEE, New York, N.Y., USA, 1123-1132 (1988).
[4] Plamondon R. and Srihari S.N., On-line and off-line Handwriting Recognition: A Comprehensive Survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, 63-84 (2000).
[5] Amin A. and Al-Sadoun H.B., Hand printed Arabic character recognition system, Proceedings 12th IAPR International Conference on Pattern Recognition, volume 2 (1994).
[6] Atici A.A. and Yarman-Vural F.T., A heuristic algorithm for optimal character recognition of Arabic Script, Signal Processing, vol. 62, no. 1, 87-99 (1997).
[7] Al-Yousefi and Udpa S.S., Recognition of Arabic characters, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 8 (1992).
[8] Bazzi I., Schwartz R. and Makhoul J., An Omnifont Open-Vocabulary OCR System for English and Arabic, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 6, 495-504 (1999).
[9] Abuhaiba I.S.I., Mahmoud S.A. and Green R.J., Recognition of hand-written cursive Arabic characters, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 6, 664-672 (1994).
[10] Romeo-Pakker K., Ameur A., Olivier C. and Lecourtier Y., Structural analysis of Arabic handwriting: segmentation and recognition, Machine Vision and Applications, vol. 8, no. 4 (1995).
[11] El-Dabi S.S., Ramsis R. and Kamel A., Arabic character recognition system: a statistical approach for recognising cursive hand-written text, Pattern Recognition, vol. 23, no. 5, 485-495 (1990).
[12] El-Sheikh and El-Taweel, Real-time Arabic hand-written character recognition, Pattern Recognition, vol. 23, no. 12, 1323-1332 (1990).
[13] Al-Emami S. and Usher M., On-line recognition of hand-written Arabic characters, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 7, 704-710 (1990).
[14] Cowell J. and Hussain F., Extracting Features from Arabic Characters, CGIM2001 Computer Graphics and Imaging conference, IASTED, Hawaii, USA.
[15] Amin A., Off-line Arabic character recognition - the state of the art, Pattern Recognition, vol. 31, no. 5, 517-530 (1998).
[16] El-Wakil M.S. and Shoukry A.A., On-line recognition of hand-written isolated Arabic characters, Pattern Recognition, vol. 22, no. 2, 97-106 (1989).
[17] Al-Badr B. and Mahmoud S.A., Survey and bibliography of Arabic optical text recognition, Signal Processing, vol. 41, no. 1, 49-77 (1995).
[18] Cowell J. and Hussain F., Thinning Arabic Characters for Feature Extraction, IV2001 Proceedings IEEE Conference on Information Visualisation 2001, London, UK, July 2001.
[19] Hussain F. and Cowell J., Character recognition of Arabic and Latin Script, IV2000 Proceedings IEEE Information Visualisation, London, UK, July 2000.
[20] Cowell J. and Hussain F., A Multi-Stage Algorithm for Character Recognition, Proceedings: GITIS 2000 Conference, Chamber of Commerce & Industry, Dubai, UAE, 140-146 (2000).
[21] Hussain F. and Cowell J., A Generic Recognition Algorithm for Latin and Arabic Scripts, 3rd Workshop on Information & Computer Science, KFUPM, Saudi Arabia, 55-62 (2000).
[22] Cowell J. and Hussain F., The Confusion Matrix: Identifying Conflicts in Arabic and Latin Character Recognition, CGIM2000, Las Vegas (2000).
[23] Cowell J. and Hussain F., Resolving Conflicts in Arabic and Latin Character Recognition, 19th Eurographics UK Conference, London, UK, April 2001.
[24] Atici A.A. and Yarman-Vural F.T., A heuristic algorithm for optimal character recognition of Arabic Script, Signal Processing, vol. 62, no. 1, 87-99 (1997).
[25] Al-Yousefi and Udpa S.S., Recognition of Arabic characters, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 8 (1992).
[26] Bazzi I., Schwartz R. and Makhoul J., An Omnifont Open-Vocabulary OCR System for English and Arabic, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 6, 495-504 (1999).
[27] Abuhaiba I.S.I., Mahmoud S.A. and Green R.J., Recognition of hand-written cursive Arabic characters, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 6, 664-672 (1994).
[28] Cowell J., Syntactic pattern recognizer for vehicle identification numbers, Image and Vision Computing, vol. 13, no. 1, 13-19 (1995).
[29] Baptista G. and Kulkarni K.M., A high accuracy algorithm for recognition of hand-written numerals, Pattern Recognition 21, 287-291 (1988).
[30] Nishida N. and Mori S., An algebraic approach to automatic construction of structural models, IEEE Trans. Pattern Analysis Mach. Intelligence 15(12), 1298-1311 (1993).
[31] Lam L. and Suen C.Y., Structural classification and relaxation matching of totally unconstrained handwritten ZIP-code numbers, Pattern Recognition 21, 19-31 (1988).
[32] Kinser Jason, Image signatures: Ontology and classification, CGIM2001 Computer Graphics and Imaging conference, IASTED, Hawaii, USA.
[33] Cowell J. and Hussain F., The Confusion Matrix: Identifying Conflicts in Arabic and Latin Character Recognition, CGIM2000 Computer Graphics and Imaging conference, IASTED, Las Vegas, USA.