Multi-Font Arabic Word Recognition Using Spectral Features
Mohammad S Khorsheed and William F Clocksin
Computer Laboratory, University of Cambridge
New Museums Site, Pembroke Street, Cambridge CB2 3QG, England
e-mail: [email protected]

Abstract
In this paper we present a new technique for recognising cursive Arabic words from scanned images of text. The approach is segmentation-free, and is applied to four different Arabic typefaces, where ligatures and overlaps pose challenges to segmentation-based methods. We transform each word into a normalised polar image, then apply a two-dimensional Fourier transform to the polar image. The resultant spectrum tolerates variations in size, rotation or displacement. Each word is represented by a template that includes a set of Fourier coefficients. Recognition is based on a normalised Euclidean distance from those templates.

1. Introduction
The Arabic alphabet is used by several widespread languages such as Arabic, Persian and Urdu. Arabic script presents challenges because all orthography is cursive and letter shape is context sensitive. Previous research on script recognition has used an optical character recognition approach that depends on segmentation of words. The word is first segmented into either primitives [7] or characters [2], then features are extracted from the segments, then the word is classified by comparing the features with a model. Another technique [5, 6] is to decompose the word skeleton into small strokes (pieces of the character), which are then transformed into a sequence of observations that is fed to a hidden Markov model [9]. The Viterbi algorithm [3] is used to find the best path through the model which represents the word to be recognised. However, these studies have demonstrated the difficulties in attempting to segment Arabic words into individual characters [1] owing to context sensitivity, co-articulation effects, and stylistic features such as overlaps. A global approach, which has been applied to English script [8] but not to Arabic, treats the word as a whole. Features are extracted from the unsegmented word and compared to a model. While [8] used word shape profile as a feature, we required a model which is invariant to rotation, translation and dilation. In this paper we implement a global approach to recognising cursive Arabic words. Each word is represented by a set of Fourier coefficients extracted from the word image. The recognition of an unknown word is based on a normalised Euclidean distance between the coefficient set and a model.

2. Feature Extraction
Given a scanned image in the form of a bitmap, our system first isolates each Arabic word (separated by white space), then finds a feature set for the word which is invariant to the Poincaré group of transformations: dilation, translation and rotation. A normalisation process uses a log-polar transform to transform rotation into translation (Figure 1). Complex numbers are used to represent 2D coordinates. Let $g(x, y)$ be a word image with $Y$ rows and $X$ columns. Each dark (inked) pixel in the image may be represented using a complex number $z$:
0-7695-0750-6/00 $10.00 © 2000 IEEE
$$z = x + jy, \qquad 0 \le x < X,\; 0 \le y < Y \tag{1}$$
Let $O = (O_x, O_y)$ be the centroid of all dark pixels in the word image and $D_{\max}$ be the maximum Euclidean distance between $O$ and all pixels:
$$D_{\max} = \max_{z} \lVert z - O \rVert = \max_{x,y} \sqrt{(x - O_x)^2 + (y - O_y)^2} \tag{2}$$
A coordinate transform, $U$, exists for each pixel which transforms the object to normalised polar coordinates with
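exponential term. Take the magnitude:
$$\left|F(\omega_x, \omega_y)\, e^{-j\theta}\right| = \left|F(\omega_x, \omega_y)\right| \left|\cos\theta - j\sin\theta\right| = \left|F(\omega_x, \omega_y)\right| \left[\cos^2\theta + \sin^2\theta\right]^{1/2} = \left|F(\omega_x, \omega_y)\right| \tag{5}$$
so a shift in $f(x, y)$ does not affect the magnitude of its Fourier transform.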
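The invariance stated in Eq. 5 can be checked numerically. The sketch below is illustrative only: it applies a naive discrete Fourier transform to a small made-up 4 × 4 binary image and to a cyclically shifted copy, then compares the magnitudes. The function names (`dft2`, `circular_shift`, `magnitudes`) and the test image are hypothetical and are not part of the system described here.

```python
import cmath


def dft2(image):
    """Naive 2D discrete Fourier transform of a square list-of-lists image."""
    n = len(image)
    return [
        [
            sum(
                image[y][x] * cmath.exp(-2j * cmath.pi * (u * x + v * y) / n)
                for y in range(n)
                for x in range(n)
            )
            for u in range(n)
        ]
        for v in range(n)
    ]


def circular_shift(image, dx, dy):
    """Shift the image cyclically by dx columns and dy rows."""
    n = len(image)
    return [[image[(y - dy) % n][(x - dx) % n] for x in range(n)] for y in range(n)]


def magnitudes(spectrum):
    """Element-wise magnitude of a complex spectrum."""
    return [[abs(c) for c in row] for row in spectrum]


# Made-up test pattern, not taken from the paper.
image = [
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
original = magnitudes(dft2(image))
shifted = magnitudes(dft2(circular_shift(image, dx=1, dy=2)))
max_diff = max(
    abs(a - b) for ra, rb in zip(original, shifted) for a, b in zip(ra, rb)
)
print(f"largest magnitude difference: {max_diff:.2e}")
```

The shift is cyclic on purpose: a rotation of the word moves the pattern along the angular axis of the polar image, and that axis wraps around.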
Figure 1. Three images (a), (b), and (c) of the word قال (qAl, 'say'), and below each one is the normalised polar equivalent.
origin $O$:
$$U(z) = r + j\theta \tag{3}$$
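A minimal sketch of the normalisation in Eqs. 1 to 3 follows. It rests on several assumptions that the text does not spell out: the radius is taken as the distance from the centroid divided by the maximum distance, the angle is the argument of $z - O$, and the rasterisation into an N × N grid uses simple binning. The text also mentions a log-polar transform; the linear radius used here is chosen for brevity. The names `normalised_polar` and `polar_image` are hypothetical.

```python
import cmath
import math


def normalised_polar(pixels):
    """Map dark-pixel coordinates (x, y) to (r, theta) about their centroid.

    r is divided by the maximum distance so it lies in [0, 1];
    theta is the phase of z - O as returned by cmath.phase.
    """
    zs = [complex(x, y) for x, y in pixels]       # z = x + jy (Eq. 1)
    origin = sum(zs) / len(zs)                    # centroid O
    d_max = max(abs(z - origin) for z in zs)      # maximum distance (Eq. 2)
    if d_max == 0:                                # degenerate single-point word
        return [(0.0, 0.0) for _ in zs]
    return [(abs(z - origin) / d_max, cmath.phase(z - origin)) for z in zs]


def polar_image(pixels, n=64):
    """Rasterise the (r, theta) pairs into an n x n binary image (assumed binning)."""
    image = [[0] * n for _ in range(n)]
    for r, theta in normalised_polar(pixels):
        row = min(int(r * n), n - 1)
        col = int((theta + math.pi) / (2 * math.pi) * n) % n
        image[row][col] = 1
    return image
```

Because every distance is divided by the maximum distance and measured from the centroid, scaling or translating the input pixels leaves the (r, θ) pairs unchanged, while a rotation only adds a constant to θ.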
Figure 2. Three reconstructed polar images for the polar image shown in Fig 1-e, using: (a) half the Fourier spectrum, (b) the lower 16 frequency bands, (c) the lower 8 frequency bands, (d) polar raster of samples in the Fourier domain.
This results in a normalised polar image $f(x, y)$ which has $N \times N$ pixels (here $N = 64$). The application of the Fourier transform to $f(x, y)$ can be represented as a double sum over $x$ and $y$, each running from 0 to $N-1$.
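For orientation, one common way of writing the two-dimensional discrete Fourier transform of an $N \times N$ image is sketched below. The sign of the exponent and the absence of a normalising constant are assumptions, since conventions differ between texts.

```latex
F(\omega_x, \omega_y) = \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi (\omega_x x + \omega_y y)/N}
```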
Jain [4] shows how Fourier descriptors can be used for image reconstruction and shape matching. Since the word image is a real signal, half the Fourier spectrum is sufficient to reconstruct the word image completely. Moreover, a selection of lower frequencies is sufficient for modelling words. Fig 2 shows three reconstructed polar images for that shown in Fig 1-e.
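The two points above (half the spectrum suffices for a real image, and low frequencies can serve as features) can be illustrated with a short sketch. The selection rule used here, a half-disc of integer frequencies around the origin, is an assumption and is not claimed to reproduce the frequency bands or coefficient counts used in the experiments. Using coefficient magnitudes as features is likewise an assumption, and all function names are hypothetical.

```python
import cmath


def dft_coefficient(image, u, v):
    """One coefficient of the naive 2D DFT at signed frequency (u, v)."""
    n = len(image)
    return sum(
        image[y][x] * cmath.exp(-2j * cmath.pi * (u * x + v * y) / n)
        for y in range(n)
        for x in range(n)
    )


def low_band_frequencies(band):
    """Signed (u, v) pairs within `band` of the origin, one per conjugate pair."""
    pairs = []
    for v in range(0, band + 1):
        for u in range(-band, band + 1):
            # Keep the upper half-plane plus the non-negative part of the u axis.
            in_half_plane = v > 0 or u >= 0
            if in_half_plane and u * u + v * v <= band * band:
                pairs.append((u, v))
    return pairs


def feature_vector(image, band):
    """Magnitudes of the retained low-frequency coefficients."""
    return [abs(dft_coefficient(image, u, v)) for u, v in low_band_frequencies(band)]


# Made-up 4 x 4 binary image used only to exercise the functions.
image = [
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
features = feature_vector(image, band=2)
print(len(features))
```

For a real image the coefficient at (-u, -v) is the complex conjugate of the one at (u, v), which is why only one member of each pair is kept.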
Any rotation in the word image is transformed into a translation in the normalised polar image, so we need to consider the translation property of the Fourier transform, which is represented as
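In discrete form the translation property is usually stated as below. This is a sketch in assumed notation: $(x_0, y_0)$ denotes the shift, the transform is taken to be the $N$-point DFT with cyclic shifts, and the phase $\theta$ simply collects the exponent.

```latex
f(x - x_0,\, y - y_0) \;\longleftrightarrow\; F(\omega_x, \omega_y)\, e^{-j\theta},
\qquad \theta = \frac{2\pi}{N}\,(\omega_x x_0 + \omega_y y_0)
```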
3. Results
More than 1700 sample words were used to assess the performance of the proposed method. The samples were printed using four different fonts: Simplified Arabic, Thuluth, Andalus, and Arabic Traditional, illustrated in Fig 3.
This shows that a shift in the origin of the function $f(x, y)$ results in multiplying $F(\omega_x, \omega_y)$ by the indicated
The samples were rendered at random angles ranging from 0 to $2\pi$, at random sizes ranging from 18 to 48 pt, and at random translations of up to twice the size of the sampled word. Images of the text were captured using a scanner with a resolution of 300 dpi. Each word image was transformed using Eq. 3 into a normalised polar image. Next, the fast Fourier transform was applied to the polar image, and a set of Fourier coefficients was extracted as the feature vector representing that word image.
One observation from Fig 4 is that the templates formed using the lower 16 frequency bands had a lower error rate than those formed using the lower 8 frequency bands. Using more features increases the computational burden, and it can also lower the recognition rate due to the so-called peaking phenomenon. Here, using 16 frequency bands gives a higher recognition rate.
[Figure 4 plot: error rate (axis marks at 5% and 10%) against words in the lexicon (0 to 150), with one curve for the 8-frequency band and one for the 16-frequency band.]
Figure 4. Error rate against lexicon size using Euclidean distance measure.
Figure 3. Four different Arabic fonts: (a) Simplified Arabic, (b) Thuluth, (c) Andalus, and (d) Arabic Transparent. Each displays the words (alCih, ‘Sheik’) and (alzlAt, ‘Earth tremor’).
In an attempt to improve the recognition rate, a more sophisticated comparison function, the Standard measure, can be obtained by normalising by the sample variance. Let $\mu_i$ be the average word vector calculated in equation (6), and $\sigma_i$ the standard deviation vector computed below:
Each word in the lexicon was represented by a template formed from the average coefficient values of a sample of training words.
$$\sigma_i = \frac{1}{N_i} \left[ \sum_{j=1}^{N_i} (x_j - \mu_i)^2 \right]^{1/2}, \qquad i = 1, 2, \ldots, M \tag{7}$$
then the Standard distance is calculated as follows
$i = 1, 2, \ldots, M$, where $M$ is the number of words and $N_i$ is the number of word images representing word $w_i$ that are used to calculate this template. Here the template is formed from one sample of each of the four fonts. To perform recognition of sampled words, four samples of each word, at different sizes, rotations and translations, were compared with each template using one of two comparison functions. Fig 4 summarises the error rates when the Euclidean distance is used to compare a sample word with every template. Two graphs are shown: one in which the lower 8 frequency bands, which include 106 coefficients, were used to construct the templates, and one in which the lower 16 frequency bands, which include 402 coefficients, were used. The error rate is the number of misrecognised words divided by the total number of comparisons.
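The template construction and Euclidean comparison described above can be sketched as follows. This is a toy illustration under stated assumptions: feature vectors are plain lists of numbers, the distance is the unnormalised Euclidean distance (any further normalisation is not specified here), and the error rate divides by the number of test samples. The labels, numbers and function names are made up.

```python
import math


def make_template(samples):
    """Average equal-length feature vectors component by component."""
    count = len(samples)
    return [sum(values) / count for values in zip(*samples)]


def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def classify(features, templates):
    """Return the label of the template closest to `features`."""
    return min(templates, key=lambda label: euclidean(features, templates[label]))


def error_rate(tests, templates):
    """Fraction of (features, true_label) pairs that are misrecognised."""
    wrong = sum(1 for features, label in tests if classify(features, templates) != label)
    return wrong / len(tests)


# Toy three-dimensional features; the labels and numbers are made up.
templates = {
    "word_a": make_template([[1.0, 0.0, 2.0], [1.2, 0.2, 1.8]]),
    "word_b": make_template([[4.0, 3.0, 0.0], [3.6, 3.2, 0.4]]),
}
tests = [
    ([1.1, 0.0, 2.1], "word_a"),
    ([3.9, 3.0, 0.1], "word_b"),
    ([3.0, 2.5, 0.5], "word_a"),  # lies closer to word_b, so it is misrecognised
]
print(error_rate(tests, templates))
```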
Fig 5 summarises the recognition error rates based on the Standard distance measure. One observation is that the templates formed using the lower 16 frequency bands performed worse than those formed using the lower 8 frequency bands. It is likely that this is an example of the peaking phenomenon, where increasing the dimensionality of the feature space leads to lower recognition rates, probably due to over-fitting. This observation concurs with the general experience that some classification functions are much more sensitive to feature dimensionality than others. It should also be noted that because templates are formed from
[Figure 5 plot: error rate (axis mark at 10%), with one curve for the 8-frequency band and one for the 16-frequency band.]
Figure 5. Error rate against lexicon size using Standard distance measure.
samples written in multiple fonts, $\sigma$ will be large, having a dominant effect on SD. Fig 6 shows the confusion matrices for all the 145 words in the lexicon. The templates were formed using the lower 8 and 16 frequency bands, respectively.
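A variance-normalised comparison of the kind discussed above can be sketched as follows. The exact form of the Standard distance is an assumption here: each component difference is divided by the corresponding standard deviation before the Euclidean norm is taken, and the population standard deviation from Python's `statistics` module is used, which may differ from Eq. 7. The function names and the `eps` guard against zero spread are hypothetical.

```python
import math
import statistics


def template_stats(samples):
    """Per-component mean and population standard deviation of the samples."""
    columns = list(zip(*samples))
    mean = [statistics.fmean(col) for col in columns]
    std = [statistics.pstdev(col) for col in columns]  # assumed estimator
    return mean, std


def standard_distance(features, mean, std, eps=1e-12):
    """Euclidean distance with each component divided by its standard deviation."""
    return math.sqrt(
        sum(((x - m) / max(s, eps)) ** 2 for x, m, s in zip(features, mean, std))
    )


# Made-up two-dimensional training samples for one word.
mean, std = template_stats([[1.0, 10.0], [3.0, 30.0]])
print(standard_distance([3.0, 40.0], mean, std))
```

With this form, a component whose training samples are widely spread contributes little to the distance, which is one way a large spread across fonts can come to dominate the comparison.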
4. Conclusion
A new method for recognising cursive Arabic words has been presented. This method is based on transforming a word image into a ‘thumbnail’ pattern which is invariant to dilation, translation, and rotation. The method showed recognition rates of over 90% while demonstrating the variation in sensitivity to feature dimensionality depending on the comparison function used. This method may not be as suitable if many more fonts are used, particularly if handwriting samples are used. In these situations it is likely that the Hidden Markov Model technique can improve the recognition rate [5].
[Figure 6 panels (a) and (b); axes: words in the lexicon.]
Figure 6. Three-dimensional depictions of confusion matrices for the 145 words subjected to 252300 recognition tests: (a) lower 8 frequency bands and (b) lower 16 frequency bands.
References
[1] B. Al-Badr and S. Mahmoud. Survey and bibliography of arabic optical text recognition. Signal Processing, 41:49-77, 1995.
[2] A. Amin and J. Mari. Machine recognition and correction of printed arabic text. IEEE Trans. on Systems, Man, and Cybernetics, 19(5):1300-1306, 1989.
[3] G. Forney. The viterbi algorithm. Proceedings of The IEEE, 61(3):268-278, 1973.
[4] A. Jain. Fundamentals of Digital Image Processing. Prentice Hall, 1989.
[5] M. S. Khorsheed and W. F. Clocksin. Off-line arabic word recognition using a hidden markov model. In Statistical Methods For Image Processing - A Satellite Conference Of The 52nd ISI Session In Helsinki, Uppsala, Sweden, 1999.
[6] M. S. Khorsheed and W. F. Clocksin. Structural features of cursive arabic script. In Proceeding of The Tenth British Machine Vision Conference, volume 2, pages 422-431, University Of Nottingham, UK, 1999.
[7] B. Parhami and M. Taraghi. Automatic recognition of printed farsi texts. Pattern Recognition, 14(6):395-403, 1981.
[8] C. Parisse. Global word shape processing in off-line recognition of handwriting. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(4):460-464, 1996.
[9] L. Rabiner. A tutorial on hmm and selected applications in speech recognition. Proceedings of The IEEE, 77(2):257-286, Feb. 1989.