Discriminative HMM Training with GA for Handwritten Word Recognition

Tapan K Bhowmik
CVPR Unit, Indian Statistical Institute, Kolkata, India
[email protected]

Swapan K Parui
CVPR Unit, Indian Statistical Institute, Kolkata, India
[email protected]

Utpal Roy
Dept. of IT, School of Technology, Assam University, India
[email protected]

Abstract

This paper presents a recognition system for isolated handwritten Bangla words, with a fixed lexicon, using a left-right Hidden Markov Model (HMM). A stochastic search method, namely the Genetic Algorithm (GA), is used to train the HMM. New shape-based direction encoding features have been developed and introduced in our recognition system. Both non-discriminative and discriminative training procedures have been applied iteratively to optimize the parameters of the HMM.

1. Introduction

An off-line handwriting recognition system generally consists of four basic modules: preprocessing, feature extraction, classification and post-processing. The classification module is the brain of the system. Several classifiers are used for this purpose, but the HMM is among the most effective because it can capture time-varying signals. Though off-line handwriting is represented as a static image that carries no time-varying signal, the sliding window technique is commonly used to generate an observation sequence from it [1]. The performance of an HMM largely depends on its parameters, so training the HMM to estimate the optimum parameters is crucial. Conventionally, the non-discriminative Maximum Likelihood (ML) training method is used for this purpose, but discriminative training of the HMM improves the performance of the recognition system. The approach to discriminative training is still an open research problem and many of its aspects have not been explored yet. ML estimation finds, for each class individually, the parameters most likely to give rise to the observed data. Discriminative training, on the other hand, aims to separate the classes by taking into account the samples of the other, competing classes. It attempts to adjust the model parameters to produce a better decision boundary by optimizing some objective function. Two discriminative training criteria, namely Maximum Mutual Information (MMI) [1] and Minimum Classification Error (MCE), are used to define such objective functions.

The present work deals with handwritten word recognition in Bangla with a holistic approach [2]. New shape-based direction encoding features have been developed for our recognition system. One HMM is constructed for each word class. Both non-discriminative and discriminative training procedures have been applied iteratively to optimize the HMM parameters; in both, GA is used. A comparative study reveals that discriminative training improves the performance of the HMM when applied to a database of Bangla handwritten words containing 119 word classes, each class containing 250 training and 50 test samples.

2. HMM for Word Recognition

An HMM consists of three sets of parameters: $\pi = \{\pi_i\}$, $A = \{a_{ij}\}$ and $B = \{b_{jk}\}$, with $1 \le i, j \le N$ and $1 \le k \le M$, where $\pi$ is the initial state probability distribution, $A$ is the state transition probability distribution matrix and $B$ is the observation symbol probability distribution matrix. Here $N$ is the number of states and $M$ is the total number of distinct observation symbols per state. The complete notation of an HMM is $\lambda = (\pi, A, B)$. Each handwritten word is represented by a sequence of observations $O = O_1, O_2, \ldots, O_T$ where $O_t$ is the observation symbol observed at time $t$. In the training stage one HMM $\lambda_i$ is constructed for each word class, and in the classification stage, given an unknown input sequence $O = O_1, O_2, \ldots, O_T$, the probability $P(O \mid \lambda_i)$ is computed for each model $\lambda_i$; $O$ is assigned to the class whose model yields the highest likelihood $P(O \mid \lambda_i)$.
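As an illustration of this decision rule, here is a minimal sketch (Python/NumPy, not the authors' code) that scores an observation sequence against every word-class model with the forward algorithm and picks the most likely class. For brevity it uses the unscaled recursion; the scaled variant needed for long sequences appears with Section 5.2.

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """P(O | lambda) by the forward algorithm (unscaled, for clarity).
    obs: sequence of symbol indices; pi: (N,); A: (N, N); B: (N, M)."""
    alpha = pi * B[:, obs[0]]            # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction
    return alpha.sum()                   # termination

def classify(obs, models):
    """Assign obs to the class whose HMM gives the highest likelihood.
    models: list of (pi, A, B) triples, one per word class."""
    return int(np.argmax([forward_likelihood(obs, *m) for m in models]))
```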

3. Feature Extraction

Feature extraction is one of the key modules in any recognition system. Several features may be extracted from a word image, but structural features are the most effective in a problem like the present one. Here the structural feature, called the direction encoding feature, is extracted by tracing the outer and inner boundaries of a word image in the anticlockwise direction, starting from the left-most boundary point. Let such a boundary be represented as $(d_1, x_1, y_1), (d_2, x_2, y_2), \ldots, (d_i, x_i, y_i), \ldots$ where $d_i \in \{1, 2, \ldots, 8\}$ is a directional code (Fig. 1) and $(x_i, y_i)$ is the corresponding position. This positional information is used to obtain the position of each pixel along the contour. The original image is smoothed sufficiently that $e_i = d_{i+1} - d_i \pmod 8$ is $+1$, $0$ or $-1$. Our goal now is to determine the pixels at which there is a significant change in direction along the boundary; these changes are then encoded into numbers based on their directions.

Fig. 1. 8-directional chain codes

Now, in a digital straight line, there may be two different directional codes. To avoid such spurious changes and find the genuine changes in direction, we consider a subsequence of codes instead of individual pairs of consecutive codes. Let $d_1, d_2, \ldots, d_l$ be a subsequence of consecutive codes generated while traversing the boundaries of a word image in the anticlockwise direction. Consider the digital curve in Fig. 2. The directional chain code of the curve when traversed from $P_0$ to $P_1$ is 343343. The lengths corresponding to these codes are not the same: the even and odd codes have lengths $1$ and $\sqrt{2}$ respectively. We therefore define a derived chain code in which each code represents nearly an equal length. Since $\sqrt{2}$ is very close to 1.4, we repeat each even code 5 times and each odd code 7 times. For example, from the directional code 343343 we get 33333334444433333333333333444443333333 as the derived chain code.

Fig. 2. A digital curve

Let $\theta_1$ be the angle that the line $P_0P_1$ makes with the x-axis ($\theta_1 = 303.6^{\circ}$ in Fig. 2). Similarly, the angle $\theta_2$ (which $P_0P_2$ makes with the x-axis) is $333.5^{\circ}$. If $\alpha > t_\theta$, where $\alpha = (\theta_{i+2} - \theta_{i+1}) \pmod{360}$, we say there is a change in direction ($t_\theta$ being a threshold value). So there is a directional change between $P_0$ and $P_2$ if $29.9 > t_\theta$, and the middle point $(X, Y)$ of the curve $P_0P_1P_2$ is taken as the position of the change. To represent the directional change, we divide the space $0^{\circ} \le \theta < 360^{\circ}$ into a number of equal segments, say $M$, each of length $\beta$. When $\beta = 10^{\circ}$, $M = 36$ and each segment has a unique number $D$ from 1 to $M$; the segment in which $\theta_{i+2}$ lies gives the encoded direction value ($D$ is 33 in Fig. 2). After a change in direction is encountered, the value of $\theta_{i+1}$ is replaced by $\theta_{i+2}$ and a new value of $\theta_{i+2}$ is found for the next change in direction. This process continues until the starting pixel is reached again, and is repeated for both the inner and outer boundaries of the image. Thus, each directional change along the contour is encoded as $(D, X, Y)$. The contour representation of a word image of "UDAYNARAYANPUR", together with the feature points observed when traversing in the anticlockwise direction, is shown in Fig. 3.

Fig. 3. 302 feature points are extracted from the Bangla word image "UDAYNARAYANPUR" in the anticlockwise direction when $t_\theta = 20$.
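The two steps above can be sketched as follows (a Python illustration, not the authors' code). The boundary-tracing interface, the reset rule after a detected change, and the 1-based segment-numbering convention are our assumptions.

```python
import math

def derived_chain_code(codes):
    """Expand a chain code so each symbol spans a nearly equal length:
    even codes (axis moves, length 1) are repeated 5 times, odd codes
    (diagonal moves, length sqrt(2) ~ 1.4) 7 times, since 7/5 = 1.4."""
    out = []
    for d in codes:
        out.extend([d] * (5 if d % 2 == 0 else 7))
    return out

def angle(p, q):
    """Angle in [0, 360) that the line p -> q makes with the x-axis."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 360.0

def direction_changes(points, t_theta, beta=10.0):
    """Detect significant direction changes along a traced boundary.
    points: at least three (x, y) boundary pixels in anticlockwise order
    (assumed interface); returns (D, X, Y) triples with D in 1..360/beta."""
    changes = []
    base, i = 0, 2                      # chords are measured from points[base]
    theta_prev = angle(points[base], points[base + 1])
    while i < len(points):
        theta_new = angle(points[base], points[i])
        alpha = (theta_new - theta_prev) % 360.0
        if min(alpha, 360.0 - alpha) > t_theta:       # significant turn
            mx, my = points[(base + i) // 2]          # midpoint of the arc
            changes.append((int(theta_new // beta) + 1, mx, my))
            base = i                                  # start a new arc here
            if base + 2 >= len(points):
                break
            theta_prev = angle(points[base], points[base + 1])
            i = base + 2
        else:
            theta_prev = theta_new    # theta_{i+1} is replaced by theta_{i+2}
            i += 1
    return changes
```

For the curve of Fig. 2, `derived_chain_code([3, 4, 3, 3, 4, 3])` reproduces the digit string given above.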

4. HMM State Definition

The HMMs here have a left-to-right topology in which each state has a transition to itself and to the next state. To define the states, overlapping windows of size $h \times w_z$ are considered, where $h$ is the height of the word image and $w_z$ is the height of its middle zone. The length of the overlapping portion is taken as $\lfloor w_z / 2 \rfloor$. Thus the first state starts from the first column of the first window, the second state starts from the middle column of the first window, the third state starts from the middle column of the second window, and so on. In case the right-most column of a window crosses the image width, the remainder of that window is merged with the previous one.

Middle Zone ($w_z$) Detection: We determine the zones of the binary word image by analyzing the horizontal projection, which is obtained in the form of $\alpha_i$ as follows:

$$\alpha_i = \begin{cases} 1, & \text{if } \left( m_i - \frac{1}{B} \sum_j m_j \right) > 0 \\ 0, & \text{otherwise} \end{cases}$$

where $m_i$ is the number of object pixels in the $i$-th row of the word image, $B$ is the number of object pixel rows of the word image, and $i$ ranges from $0$ to $H - 1$ ($H$ being the number of rows of the word image). The minimum and maximum indices of positive $\alpha_i$ give the middle zone boundaries of the word image. $\alpha_i > 0$ indicates that the $i$-th row is significant. The first significant row from the top (say, $\mathrm{row}_u$) and the first from the bottom (say, $\mathrm{row}_l$) define the middle zone, whose height is thus $w_z = \mathrm{row}_l - \mathrm{row}_u + 1$.

In most cases, the above measurement works satisfactorily. But in Indian scripts such as Bangla and Hindi, when a vowel modifier itself has a long horizontal run of object pixels, a significant row (one with $\alpha_i > 0$) may appear outside the desired position. To circumvent this problem we adopt the following procedure. Let

$$\alpha_i' = \begin{cases} 1, & \text{if } \left( \frac{\alpha_i^{Run}}{\alpha^{RunMax}} - \zeta \right) > 0 \\ 0, & \text{otherwise} \end{cases}$$

where $\alpha_i^{Run}$ is the length of the positive run at the $i$-th position in $(\alpha_0, \alpha_1, \ldots, \alpha_{H-1})$, $\alpha^{RunMax}$ is the maximum length of a positive run in $(\alpha_0, \alpha_1, \ldots, \alpha_{H-1})$, and $\zeta \ge 0$ is a bias term. It is observed from our database that any value between $\zeta = 0.3$ and $\zeta = 0.5$ is quite suitable for identifying whether an index $i$ lies in the middle zone or outside it. The maximum and minimum indices of positive $\alpha_i'$ give the middle zone boundary.
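Putting the two projections together, a NumPy sketch of the middle-zone detection might read as follows (our illustration, assuming the image contains object pixels; $\zeta = 0.4$ is picked from the reported working range [0.3, 0.5]):

```python
import numpy as np

def middle_zone(image, zeta=0.4):
    """Find the middle zone of a binary word image (object pixels = 1).
    Returns (row_u, row_l, w_z)."""
    H = image.shape[0]
    m = image.sum(axis=1)                    # object pixels per row (m_i)
    B = np.count_nonzero(m)                  # number of object pixel rows
    alpha = (m - m.sum() / B) > 0            # significant rows (alpha_i)

    run = np.zeros(H, dtype=int)             # positive-run length at each row
    i = 0
    while i < H:
        if alpha[i]:
            j = i
            while j < H and alpha[j]:
                j += 1
            run[i:j] = j - i                 # the whole run gets its length
            i = j
        else:
            i += 1

    # keep only rows whose run is long enough relative to the longest run
    alpha_prime = (run / run.max() - zeta) > 0
    rows = np.flatnonzero(alpha_prime)
    row_u, row_l = rows[0], rows[-1]         # middle zone boundaries
    return row_u, row_l, row_l - row_u + 1   # w_z = row_l - row_u + 1
```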

5. Estimation of HMM Parameters with GA

The genetic algorithm is an optimization technique widely used in classification problems to determine, for example, optimal class boundaries or optimal density parameters. GA is also used in HMM-based classification problems for estimating the HMM parameters by optimizing a suitable fitness function. The basic steps involved in an iteration of GA are encoding/decoding, selection, pairing, crossover and mutation.

5.1. Encoding Mechanism

To construct a chromosome, all the parameters of an HMM are arranged in a sequence. In this study, we have used a left-right HMM with no jumps. A 4-state left-right HMM of this kind is shown in Fig. 4.

Fig. 4. A 4-state left-right HMM

Note that the initial state distribution here is fixed as $\pi = \{1, 0, 0, \ldots\}$, so $\pi$ is not included in the GA search space. The observation symbol set here is $\{1, 2, \ldots, 36\}$ (for $\beta = 10^{\circ}$), the set of change codes; hence the number of distinct observation symbols is 36. The number of states, however, varies from one HMM to another depending upon the length of the word: within a word class the number of states varies from sample to sample, and the maximum such number is taken as the number of states of the HMM of that word class. In a 4-state HMM, the state transition probability distribution $A = \{a_{ij}\}$ has only 7 entries with $a_{ij} > 0$; the other $a_{ij}$ are always 0. The observation symbol probability distribution matrix $B = \{b_{jk}\}$ contains $4 \times 36 = 144$ parameters $b_{jk} > 0$. The total number of parameters of a 4-state HMM is therefore $7 + 144 = 151$. Training an HMM involves searching for the best values of these 151 parameters, each of which ranges from 0 to 1. So, in the GA-HMM training, the chromosome consists of two parts, one for matrix $A = \{a_{ij}\}$ and the other for matrix $B = \{b_{jk}\}$, and we encode one chromosome as 151 real numbers.
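As an illustration, a chromosome can be decoded into the matrices $A$ and $B$ as below (our sketch in Python; the paper does not say how the stochastic constraints on the rows are maintained during the search, so here each row is simply normalized at decode time):

```python
import numpy as np

N, M = 4, 36    # states and distinct change codes (beta = 10 degrees)

def decode(chromosome):
    """Decode a 151-gene chromosome into (A, B) for a 4-state left-right HMM.
    Genes are raw reals in [0, 1]; each row is normalized here so that A and
    B are stochastic -- our assumption about handling the constraints."""
    genes = np.asarray(chromosome, dtype=float)
    assert genes.size == (2 * N - 1) + N * M          # 7 + 144 = 151

    A = np.zeros((N, N))
    t = genes[:2 * N - 1]                             # 7 transition genes
    for i in range(N - 1):
        pair = t[2 * i: 2 * i + 2]                    # (a_ii, a_i,i+1) genes
        A[i, i], A[i, i + 1] = pair / pair.sum()
    A[N - 1, N - 1] = 1.0                             # last state: self loop only

    B = genes[2 * N - 1:].reshape(N, M)               # 4 x 36 emission genes
    B = B / B.sum(axis=1, keepdims=True)
    return A, B
```

With this decoding, the GA can manipulate the 151 raw genes freely while every fitness evaluation still sees valid probability distributions.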

5.2. Fitness Functions

Consider the set of $R$ observation sequences $O = \{O^{(1)}, \ldots, O^{(r)}, \ldots, O^{(R)}\}$ obtained from the $R$ samples in the whole training set, and let $C = \{C_1, \ldots, C_r, \ldots, C_R\}$ be the corresponding true class labels. Each $O^{(r)}$ is to be classified into one of the $K$ classes $\{C_1, \ldots, C_K\}$, and $C_r \in \{C_1, \ldots, C_K\}$ is the correct class of $O^{(r)}$. The objective of ML training is to maximize

$$\sum_{r=1}^{R} \log P(O^{(r)} \mid C_r)$$

whereas MMI discriminative training maximizes

$$\frac{1}{R} \sum_{r=1}^{R} \left( \log P(O^{(r)} \mid C_r) - \log \sum_{i=1}^{K} P(O^{(r)} \mid C_i) P(C_i) \right).$$

So, for non-discriminative training, the fitness function is defined as

$$f_{ML} = \sum_{r=1}^{R} P(O^{(r)} \mid \lambda_r)$$

while for MMI discriminative training the fitness function is defined as

$$f_{MMI} = \frac{1}{R} \sum_{r=1}^{R} \left( \log P(O^{(r)} \mid \lambda_r) - \log \sum_{i=1}^{K} P(O^{(r)} \mid \lambda_i) P(\lambda_i) \right)$$

where $\lambda_r$ is the HMM for the word class $C_r$ and the $\lambda_i$ are the HMMs of the competing classes. Note that $P(O \mid \lambda)$ is calculated by the well-known forward algorithm [3], incorporating scaling coefficients.
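A sketch of the two fitness functions, assuming `models` is a list of $(\pi, A, B)$ triples indexed by class, `samples` is a list of (observation sequence, true-class index) pairs, and `priors` holds the $P(\lambda_i)$ (all interface assumptions of ours); `log_forward` is the scaled forward algorithm of [3]:

```python
import numpy as np

def log_forward(obs, pi, A, B):
    """Scaled forward algorithm [3]: returns log P(O | lambda)."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()                          # scaling coefficient c_1
    alpha = alpha / c
    log_p = np.log(c)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()                      # scaling coefficient c_t
        alpha = alpha / c
        log_p += np.log(c)
    return log_p                             # sum of log c_t = log P(O|lambda)

def logsumexp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def f_ml(samples, models):
    """Sum over samples of P(O^(r) | lambda_r), the true-class likelihood."""
    return sum(np.exp(log_forward(obs, *models[r])) for obs, r in samples)

def f_mmi(samples, models, priors):
    """Average log margin of the true model over the prior-weighted mixture."""
    total = 0.0
    for obs, r in samples:
        log_ps = np.array([log_forward(obs, *m) for m in models])
        total += log_ps[r] - logsumexp(log_ps + np.log(priors))
    return total / len(samples)
```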

5.3. Training with GA

Consider the set of feature vectors of the form $(O, X, Y)$ extracted from all the training samples in a word class, where $O$ is a change code and $(X, Y)$ is the position of the corresponding feature point. Each $O$ belongs to one state or two states depending on $(X, Y)$. For each state, $b_{jk}$ is computed as the relative frequency of the $k$-th change code in the $j$-th state. The estimated $B = \{b_{jk}\}$ is the initial estimate of the observation symbol probability distribution, and it is incorporated in each of the $L$ chromosomes (in this study, $L$ is taken as 20). For estimating the $a_{ij}$, a $k$-th symbol (change code) is randomly chosen from among the $M$ (here 36) symbols. The transition probabilities associated with the $i$-th state are then estimated as

$$a_{ii} = \frac{b_{ik}}{b_{ik} + b_{i+1,k}} \quad \text{and} \quad a_{i,i+1} = \frac{b_{i+1,k}}{b_{ik} + b_{i+1,k}}.$$

The above is repeated for all the states, and the rest of the chromosomes of the population are initialized similarly. After this initial estimation of the parameters of an HMM, GA [2] is used to re-estimate them by optimizing the fitness function $f_{ML}$. This re-estimation is done for all the word-class HMMs. Once it is over for all the models, the HMMs are retrained in a discriminative fashion by optimizing the fitness function $f_{MMI}$; the initial population of chromosomes used in the discriminative stage is the output of the ML re-estimation. The two re-estimation processes are applied alternately. Since the number of word classes (119) is not large here, for MMI discriminative training we have treated each individual word-class HMM against all the remaining HMMs as competing HMMs.
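One GA generation over the population of 151-gene chromosomes might look like the following skeleton (our illustration); the tournament selection, single-point crossover and the rates shown are assumptions, not choices documented in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def ga_step(population, fitness, cx_rate=0.9, mut_rate=0.01):
    """One GA generation: selection, pairing, crossover, mutation.
    population: (L, 151) array of chromosomes; fitness: callable scoring
    one chromosome (f_ML in the first stage, f_MMI in the second)."""
    L, n = population.shape
    scores = np.array([fitness(c) for c in population])

    # tournament selection (size 2): the better of two random picks survives
    idx = rng.integers(0, L, size=(L, 2))
    winners = np.where(scores[idx[:, 0]] >= scores[idx[:, 1]],
                       idx[:, 0], idx[:, 1])
    parents = population[winners]

    # pairing + single-point crossover on consecutive pairs
    children = parents.copy()
    for i in range(0, L - 1, 2):
        if rng.random() < cx_rate:
            cut = rng.integers(1, n)
            children[i, cut:], children[i + 1, cut:] = (
                parents[i + 1, cut:], parents[i, cut:])

    # mutation: replace a gene with a fresh value in [0, 1]
    mask = rng.random(children.shape) < mut_rate
    children[mask] = rng.random(mask.sum())
    return children
```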

6. Results and Discussions

The proposed scheme has been tested on a database consisting of 119 town names of West Bengal, each with 300 different handwritten samples. The training set consists of 29,750 word images while the test set consists of 5,950 word images. The overall recognition accuracies obtained from ML and MMI training on the test set are shown in Table 1. In the present experiment, the value of $\beta$ is taken as $10^{\circ}$. We plan to test other values of $\beta$ and of $t_\theta$ in order to achieve better accuracy.

Table 1: Recognition accuracy

$t_\theta$   ML (%)   MMI (%)
10           75.27    78.33
20           75.09    79.12

References

[1] R. Nopsuwanchai and D. Povey: Discriminative Training for HMM-Based Off-line Handwritten Character Recognition, Proc. 7th ICDAR, Vol. 1, pp. 114-118, 2003.
[2] T. K. Bhowmik, S. K. Parui, M. Kar and U. Roy: HMM Parameter Estimation with Genetic Algorithm for Handwritten Word Recognition, Proc. 2nd PReMI, Springer-Verlag, pp. 536-544, 2007.
[3] L. R. Rabiner: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, 77(2), pp. 257-286, 1989.