Pattern Recognition 47 (2014) 1187–1201
Recognition of Bangla compound characters using structural decomposition

Soumen Bag (a,*), Gaurav Harit (b), Partha Bhowmick (c)

a Department of Computer Science and Engineering, International Institute of Information Technology Bhubaneswar, Bhubaneswar-751003, India
b Center for Information and Communication Technology, Indian Institute of Technology Jodhpur, Jodhpur-342011, India
c Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur-721302, India
Article history: Received 19 July 2012; Received in revised form 4 August 2013; Accepted 29 August 2013; Available online 13 September 2013

Keywords: Compound character recognition; Decomposition rules; Printed and handwritten Bangla character; Topological feature; Template matching

Abstract: In this paper we propose a novel character recognition method for Bangla compound characters. Accurate recognition of compound characters is a difficult problem due to their complex shapes. Our strategy is to decompose a compound character into skeletal segments. The compound character is then recognized by extracting the convex shape primitives and using a template matching scheme. The novelty of our approach lies in the formulation of appropriate rules of character decomposition for segmenting the character skeleton into stroke segments and then grouping them for extraction of meaningful shape components. Our technique is applicable to both printed and handwritten characters. The proposed method performs well for complex-shaped compound characters, which were confusing to the existing methods.
1. Introduction

Optical character recognition (OCR) is the process of automatic computer recognition of optically scanned and digitized character images. Several OCR systems are available commercially in the market [1,2]. OCR is a necessary step for tasks like converting books, documents, and office records into electronic form [3], which allows the widely available text processing tools to be used for retrieval and dissemination of information [4]. The electronic text takes up less storage space compared to the image, can be edited, searched [5,6,9], and formatted for better display and printing. It can be machine translated [7] and converted to speech [8]. OCR systems are available for the Roman or English script [10] and for a few Asian scripts, such as Chinese [11,12], Japanese [13], Korean [14,15], and Arabic [16,17]. In the last two decades, several OCR works have been reported on different Indian scripts, such as Bangla [18], Devanagari [19], Tamil [20], Malayalam [21], Oriya [22], Telugu [23], Kannada [24], Gurmukhi [25], and Gujarati [26]. These works mainly deal with recognizing basic characters. However, the main challenge in designing an OCR for Indian scripts is to handle compound (also known as 'conjunct') characters, which are formed by combining two or more basic characters.
* Corresponding author. Tel.: +91 6743016079.
E-mail addresses: [email protected], [email protected] (S. Bag), [email protected] (G. Harit), [email protected] (P. Bhowmick).
http://dx.doi.org/10.1016/j.patcog.2013.08.026
The complex shapes of these characters make the problem more difficult. In this paper, we address the problem of compound character recognition in Bangla, which is the second-most popular language in India and among the top ten languages all over the world [27]. The Bangla script is used to write the Assamese and Bengali (also called 'Bangla') languages. There are a large number (approximately 250) of compound characters in Bangla. However, some of these characters are obsolete nowadays. Hence, in our work, we have considered the most familiar character set (about 165 in number [28]) used in the Bangla literature. Many of them are very complex in shape compared to the Devanagari compound characters [29,30]. Prior work on Bangla OCR includes [18,31–34] for printed basic characters and [35–40] for handwritten basic characters. However, as observed in the literature, reports of work on recognizing Bangla compound characters [18,36,41,42,34] are few.

1.1. Related work

The research on Bangla compound character recognition can be categorized into two sets of methods, developed for printed and handwritten characters. Chaudhuri and Pal [18] have proposed a template matching approach for printed Bangla compound character recognition. In this method, text digitization, noise removal, skew detection, and correction are done as part of preprocessing. The text documents are segmented into lines, words, and characters using horizontal–vertical projection profile analysis and
head line removal techniques. They have used eight stroke-based features for representing a character and a filled-circle feature for representing a dot. Garain and Chaudhuri [41] have proposed a template matching technique for recognizing Bangla printed compound characters. Run number vectors are computed using horizontal and vertical scans organized with respect to the centroid of the pattern. The vector is normalized and abbreviated so as to make it invariant to scaling, insensitive to character style variation, and more effective for complex-shaped characters. Matching is performed within a group of compound characters. Sural and Das [34] have used the concept of fuzzy sets for recognizing printed Bangla compound characters. The Hough transform is used to extract lines and circles. Attributes such as length, position, and orientation are used to define a number of fuzzy sets. A three-stage multi-layer perceptron (MLP) classifier, trained with a number of linguistic set memberships derived from the intersections on the basic fuzzy sets, can recognize the characters by their similarities to the different fuzzy pattern classes. Pal et al. [42] have proposed an off-line Bangla handwritten compound character recognition method using a modified quadratic discriminant function (MQDF) classifier. The features used are mainly based on the directional information obtained from the arc tangent of the gradient. Das et al. [36] have recognized Bangla handwritten basic and compound characters using two different classifiers: a multi-layer perceptron (MLP) and a support vector machine (SVM). The features used are based on shadow, longest run, and quad-tree. The MLP classifier is used for recognizing different groups of characters. A confusion matrix is prepared for the recognition results of the MLP classifier. Classes having a high degree of mutual misclassification are further handled using an SVM classifier, which gives a better accuracy.

1.2. Motivation of this work

Proper recognition of compound characters for Bangla script is a difficult problem because of the complex structural characteristics of these characters. We highlight some typical characteristics of Bangla compound characters which render the problem quite difficult and challenging.

1. Certain compound characters are very similar in shape and are referred to as confusing characters. Fig. 1 shows a representative set of pairs of confusing characters.
2. A few compound characters, such as ( + + ), ( + + ), ( + + ), have very complex shapes. It is seen that when a compound character is formed by three basic characters, its shape becomes very complex.
3. The shapes of handwritten versions of certain compound characters are quite different from their printed styles.
To address the aforementioned challenges we propose a novel character recognition method for Bangla compound characters, using topological features extracted by analyzing the structural convexities of the script aksharas. We handle the complex shape of a compound character by decomposing it into convex shape primitives. The topological characteristics of the character are represented in the form of the layout of the shape primitives. We recognize the compound character by matching the extracted topological feature with the standard feature templates that we formulate for the compound characters. A unique aspect of this work is the formulation of character decomposition rules for getting simpler shapes within the character skeleton.

The rest of this paper is organized as follows. Section 2 describes the module for detecting compound characters in a dataset containing both basic and compound characters. The decomposition of compound characters into shape components is explained in Section 3. The skeletal segments are decomposed into strokes and represented as shape primitives using the method given in Section 4. In Section 5 we discuss the formulation of topological features and the similarity measure for feature templates for recognizing compound characters. Experimental results and related discussion are reported in Section 6. The concluding notes are given in Section 7.

2. Detection of compound characters

Compound characters can be detected and recognized by certain typical structural characteristics which distinguish them from the basic characters. The printing style and font information do not contribute to the character topology. Hence we prefer to work with the simplest representation of the character topology, namely its skeleton. For our purpose of detecting and recognizing compound characters it suffices to have a topological representation which is able to distinguish between even the confusing pairs, e.g. those in Fig. 1. Therefore, we further simplify the skeleton and approximate it by using straight line segments. We discuss the pre-processing steps for obtaining the approximated skeleton in the following subsection.

2.1. Preprocessing

Our detection and recognition modules extract the structural shape features from thinned (skeletonized) characters. Given a scanned character image, we first binarize it by using an adaptive binarization method [43]. This is followed by thinning of the character image. We apply a medial-axis based thinning strategy [44] which produces a one-pixel thick skeleton without distortion or spurious branches. A result of thinning is shown in Fig. 2(b). We then apply a straight line approximation method [45] on these thinned images. The result of straight line approximation for Fig. 2(b) is shown in Fig. 2(c). The goodness of the approximation depends on the permissible error between a skeleton segment and the approximated straight line for that segment. The approximation method [45] uses a parameter τ which governs the error tolerance for the straight line approximation. Smaller values of τ imply a better approximation to the skeleton, but with an increased number of approximation points (vertices) and straight line segments. Fig. 3 shows the changing shapes of approximated skeletons for different values of τ. Based on the experiments discussed in Section 6.4 we have
Fig. 1. Examples of confusing pairs in Bangla handwritten compound characters.
Fig. 2. Straight line approximation of a compound character. (a) Input image; (b) skeleton image; (c) straight line approximated image.
chosen an optimal value of τ as 4. We refer to the straight line approximated skeleton as the polygonized skeleton. The approximation points are the vertices of the skeleton. The polygonized skeleton closely resembles the character shape. The pre-processing methods we have used are designed to handle the noise arising due to binarization and thinning. The binarization scheme [43] has been designed to handle degraded images. The thinning method [44] applied on the binarized image has been specifically formulated to avoid spurious branches in the character skeleton. The character skeleton forms the input to our system shown in Fig. 4. The first module (Section 2.2) classifies the given character as a compound or a basic character. The identified basic characters are processed using the method described in our previous work [52]. In this paper we address the recognition of compound characters. The skeleton of a compound character is decomposed into skeletal segments (Section 3), which are then analyzed to identify the strokes. We then identify the convex segments in these strokes (Section 4). The layout of the convex segments in the skeleton constitutes the topological feature (Section 5.1). Recognition is done by assessing the similarity of the given feature with the standard templates formulated for the compound characters (Section 5.2).
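For illustration, the sketch below strings together common library routines as rough stand-ins for this preprocessing chain (adaptive thresholding, morphological skeletonization, and polygonal approximation with a tolerance playing the role of τ). These are not the specific methods [43–45] used in the paper, and the contour-based branch extraction is only a crude substitute for proper skeleton traversal.

# Rough stand-in for the preprocessing of Section 2.1 (not the cited methods [43]-[45]).
import cv2
import numpy as np
from skimage.morphology import skeletonize

def polygonized_skeleton(gray_image, tau=4.0):
    # Adaptive binarization (foreground strokes become white).
    binary = cv2.adaptiveThreshold(gray_image, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY_INV, 25, 10)
    # One-pixel-thick skeleton of the character.
    skeleton = skeletonize(binary > 0).astype(np.uint8)
    # Crude branch extraction: contour tracing stands in for proper skeleton traversal.
    contours, _ = cv2.findContours(skeleton, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    # Approximate each traced curve by straight line segments; a larger tau gives
    # fewer approximation points (vertices), as in Fig. 3.
    return [cv2.approxPolyDP(c, tau, False).reshape(-1, 2) for c in contours]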
2.2. Detecting compound characters

We can identify a compound character based on certain structural characteristics. The input to the module is a polygonized character skeleton. In the discussion that follows, we refer to a junction point on the polygonized skeleton as a vertex where three or more branches meet. A split point is a pixel which, when removed, will split the skeleton into two (disconnected) segments. The center of mass of a skeletal segment is the mean value of the x and y coordinates of all its pixels. A character is identified as a compound if it satisfies any of the four conditions listed below. Before examining the conditions we apply a two-step preprocessing to the polygonized skeleton: (i) the branches terminating into endpoints are eroded up to the junction point; (ii) the headline (shirorekha) is removed prior to checking conditions 1–3.
Fig. 3. Changing shapes of straight line approximated character images for different values of τ. 1st column: input image; 2nd–7th columns: approximated shapes for τ = 1, 2, 3, 4, 10, 13.
However, the headline is retained when checking condition 4. We examine the skeleton for the following observations:

1. The skeleton has two or more junction points and has a closed region that has one of its vertices as the lower-most vertex in the skeleton. Examples are ( + ), ( + ), ( + ), ( + ).
2. The skeleton has two consecutive closed regions present side-by-side. Examples are ( + ), ( + ), ( + ).
3. The character skeleton can be split into two segments aligned vertically such that (i) the upper (lower) segment has its center of mass above (below) the split point, and (ii) both the segments have at least one junction point. Examples are ( + ), ( + ), ( + ). We identify the split point by examining each pixel on the skeleton and verifying that the segments obtained by removal of the pixel are vertically aligned as stated above.
4. The character skeleton can be split into two segments aligned horizontally such that (i) the segment towards the left (right) has its center of mass towards the left (right) of the split point, and (ii) both the segments have at least one junction point. Some examples are ( + ), ( + ), ( + ). We identify the split point by examining each pixel on the skeleton and verifying that the segments obtained by removal of the pixel are horizontally aligned as stated above.

If the skeleton exhibits any of the aforementioned observations, then it is identified as a compound character. The observations primarily rely on simple features like junction points and closed regions, which can be detected reliably in the polygonized skeleton. The accuracy of the detection method is around 97% for printed and 96.17% for handwritten characters (see Table 3). Variability in the font and the writing style are the primary factors affecting the accuracy. A sketch of the condition-3 check is given below. In the following Sections 3–5 we describe the compound character recognition process.
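The following sketch illustrates only the condition-3 test, on a hypothetical representation of the skeleton as an undirected graph whose nodes are (x, y) vertex coordinates (networkx is used for the graph bookkeeping). It assumes image coordinates with y growing downward, so "above" corresponds to a smaller mean y.

import networkx as nx

def satisfies_condition_3(skeleton_graph):
    # Junction points: vertices where three or more branches meet.
    junctions = {v for v in skeleton_graph if skeleton_graph.degree(v) >= 3}
    # Split-point candidates: vertices whose removal disconnects the skeleton.
    for split in nx.articulation_points(skeleton_graph):
        reduced = skeleton_graph.copy()
        reduced.remove_node(split)
        parts = list(nx.connected_components(reduced))
        if len(parts) != 2:
            continue
        mean_y = lambda part: sum(y for _, y in part) / len(part)
        upper, lower = sorted(parts, key=mean_y)
        # (i) the upper/lower segments have their centres of mass above/below the
        #     split point, and (ii) both segments contain at least one junction point.
        if mean_y(upper) < split[1] < mean_y(lower) and junctions & upper and junctions & lower:
            return True
    return False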
3. Skeletal decomposition of compound characters

In this section we explain our methodology to decompose the polygonized skeleton of a compound character. We look for simpler skeletal segments which tend to form cohesive or meaningful units. We present the decomposition rules for breaking a compound character into simpler structures. Character recognition using decomposition into simpler primitives has been used in the past. Hu et al. [50] have used singular points such as terminal, intersection, bend and directional points to decompose a character and its strokes into primitives. With a similar approach Mozaffari et al. [46] have used shape primitives to obtain a global code which characterizes the topological structure of handwritten numerals. Foggia et al. [51] have also decomposed characters into structural primitives (circular arcs and straight segments) and used statistical descriptions to describe the primitives. Other
Fig. 4. Processing stages of the proposed method: character image (basic + compound) → detecting the compound characters → decomposition into skeletal segments → extracting and segmenting character strokes → topological feature extraction → similarity measurement with standard templates → character label.
Fig. 5. Illustration of the pathways, convexity labels, and substructures.
methods like [48,49] have decomposed characters to formulate reliable stroke extraction from Chinese characters. Lee and Huang [47] have decomposed the character into strokes and then analyzed consecutive strokes for possible combination using fuzzy rules. They obtain a unique radical for a Chinese character which can then be used for pre-classification. Like many of the previous works, we also harness decomposition primarily to simplify the structure, so that the smaller segments can be processed, analyzed, and represented more easily and effectively than the complete skeleton. We differ from the previous work in the way we decompose the skeleton. We intend to capture the topological structure of the character skeleton using convex shape primitives. Our objective is to identify segments which will remain fairly stable in spite of the variations in the writing styles of the character. A compound character, represented as a thinned skeleton with straight line approximation (Fig. 2(c)), can be decomposed into simpler shapes by removing a split point on the skeleton. The obtained skeletal segments can be further decomposed into simpler shapes by identifying other split points. We identify the split points by using our proposed decomposition rules. We now define some terms that will be used to explain the decomposition rules.
Pathway: A pathway is a sequence of vertices on the skeleton.
The first and the last point (called end-points) of a pathway can be either a junction vertex or an end-point vertex. Fig. 5(c) shows three pathways w1, w2, and w3. Pathway w1 extends from the junction point p to the end-point vertex q1. Pathway w2 extends from junction p to junction q2, and w3 extends from junction p to end point q3.

Traversal of a pathway: A pathway can be traversed starting from any of its end-points. Vertices along the pathway are visited in sequence. As we cross a vertex the path turns either in the clockwise direction or in the anti-clockwise direction, as illustrated in Fig. 5(a, b). In the former case we label the vertex as 'R' and in the latter case we label it as 'L'. We refer to these labels as the convexity labels. An endpoint of the pathway is assigned the same label as that of its adjacent vertex on the same pathway. The convexity labels assigned to the vertices would change if the pathway is traversed in the opposite direction.

Simple and complex pathways: A pathway is considered simple if there is no change in the convexity label ('R' to 'L' or 'L' to 'R') along the path. Fig. 5(c) shows w1 and w2 as simple pathways. A complex pathway has both the 'R' and 'L' labels for its vertices. Fig. 5(c) shows w3 as a complex pathway. When we concatenate two pathways which are connected to the same junction point, we get a composite pathway. Fig. 5(c) shows the concatenated [w1, w2], or [w1, w3], as examples of composite pathways.

Sub-structures connected to a junction point: A sub-structure is a skeletal segment that connects to the main skeleton at a junction point. In Fig. 5(c), there are three sub-structures connected to the junction point p. A sub-structure can be a simple pathway (like w1), a complex pathway (like w3), or can have branches (like the segment with vertices p, q2, q4, q5).

Pathway incident on a junction point: A pathway for which one of its end-points is the junction point concerned; the other end point can be some other junction vertex or an end-point vertex. In Fig. 5(c), w1, w2, and w3 are the incident pathways on the junction point p.
We use the qualifiers simple, complex, and composite for pathways and sub-structures as follows:

Pathway: simple (has the same convexity label for all vertices), complex (has both 'R' and 'L' labels present), composite (two pathways merged across a junction point).
Sub-structure: simple (a simple pathway), complex (a complex pathway, or a segment with branching), composite (two sub-structures merged at a vertex).
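A minimal sketch of the convexity labelling used throughout this section is given below. It assigns 'L'/'R' to the vertices of a pathway from the sign of the cross product of consecutive direction vectors. The sign convention shown assumes image coordinates (y growing downward) and a pathway with at least three vertices; both are assumptions, not details taken from the paper.

def convexity_labels(vertices):
    # vertices: list of (x, y) points along a pathway, in traversal order.
    labels = [None] * len(vertices)
    for i in range(1, len(vertices) - 1):
        (x0, y0), (x1, y1), (x2, y2) = vertices[i - 1], vertices[i], vertices[i + 1]
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        # With y growing downward, a positive cross product is a clockwise turn.
        # Collinear vertices (cross == 0) are arbitrarily labelled 'L' here.
        labels[i] = 'R' if cross > 0 else 'L'
    labels[0], labels[-1] = labels[1], labels[-2]   # end-points copy their neighbour's label
    return labels

def is_simple(vertices):
    # A pathway is simple when all vertices carry the same convexity label.
    return len(set(convexity_labels(vertices))) == 1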
We now describe the rules for skeletal decomposition. We have designed three decomposition rules, namely R1, R2, and R3, based on the topological characteristics typical to the Bangla characters. Each rule has a precondition which indicates when the rule is applicable, an action which defines how the split point is chosen, and an effect which gives the kind of skeletal segments obtained. These are explained below.

R1: Precondition: The character skeleton has a vertical line with two junction points, each connected to sub-structures (simple, complex, or branched). The sub-structures connected to the upper and the lower junction points are disconnected from each other. Action: The mid-point of the vertical line segment which connects the two junctions acts as the split point. Effect: The skeleton is split into two parts, the upper and the lower part. Examples are given in Fig. 6(a).

R2: This is a set of two rules, R21 and R22, which govern the splitting of the pathways incident on a junction point.

R21: Precondition: A junction point has an incident pathway which terminates on the same junction (i.e. forms a closed loop). Action: The junction point acts as a split point, and is duplicated for the closed-loop pathway. Effect: The pathway forming the loop is separated from the skeleton. Examples are given in Fig. 6(b).

R22: Precondition: A pathway incident on a junction point is complex. Action: The junction point acts as a split point, and is duplicated for the complex pathway.
Fig. 6. Decomposition of compound characters. 1st column: input image; 2nd column: skeleton image; 3rd column: straight line approximated image; 4th column: shape components. (a) Decomposition using R1; (b) applying the rule R2 (top row shows R21, bottom row shows R22); (c) applying the rule R3 on the lower junction.
Effect: The complex pathway gets dissociated from the skeleton. Examples are given in Fig. 6(b).

R3: Precondition: Two pathways meeting at a junction point p are matched, in the sense that they have similar orientations at p. Matched pathways are identified by (i) concatenating the simple pathways across the junction point to form a composite pathway, (ii) choosing a direction of traversal for this composite pathway and assigning convexity labels to all the vertices, and (iii) examining the convexity labels to verify that the label remains the same for all the vertices of one of the pathways (in the composite pathway), the junction point, and at least one vertex adjacent to the junction point on the other pathway. Action: The junction point acts as a split point and is duplicated for the composite pathway. Effect: The matched pathways are grouped together. Examples are given in Fig. 6(c). A sketch of this matching test is given below.

The rules are applied in the sequence R1, followed by R2, followed by R3. The rules R1 and R2 are concerned with the splitting of the compound character into segments, and the rule R3 is concerned with the grouping of eligible segments. The decomposition obtained in this way remains unique as long as the simple pathways do not get distorted to the extent that they become complex pathways due to handwriting variations. The effectiveness of the aforementioned rules primarily depends on the reliability of detection of the junction points, loops, and convexity labels on the pathways. Samples of handwritten compound characters and their corresponding decomposition rules are shown in Table 1. As demonstrated in Section 6, the proposed skeletal decomposition improves the effectiveness of feature extraction and leads to an increase in the recognition rate for compound characters.
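A small sketch of the rule-R3 matching test, reusing the convexity_labels() helper from the earlier sketch, might look as follows. It assumes that path_a and path_b are simple pathways represented so that the shared junction is the last vertex of path_a and the first vertex of path_b; this representation is an assumption for illustration.

def r3_matched(path_a, path_b):
    # Concatenate the two simple pathways across their shared junction point p.
    composite = path_a + path_b[1:]
    labels = convexity_labels(composite)
    j = len(path_a) - 1                       # index of the junction p in the composite
    same_a = len(set(labels[:j + 1])) == 1    # all of path_a's vertices and the junction
    same_b = len(set(labels[j:])) == 1        # all of path_b's vertices and the junction
    # Matched: one whole pathway, the junction, and at least one vertex adjacent to the
    # junction on the other pathway carry the same convexity label.
    return (same_a and labels[j + 1] == labels[j]) or (same_b and labels[j - 1] == labels[j])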
Unlike past work, the shape primitives we obtain are not simple strokes. We avoided simple strokes because that kind of decomposition makes the representation very clumsy and susceptible to noise, and it loses some structural information if simple layout representations are used. We next discuss our proposed way of describing the shape components.
4. Identifying convex shapes in a skeletal segment

The skeletal segments obtained till now may have junction points and branches. This section describes the further steps applied on the skeletal segments so as to identify the convex shapes. In Section 4.1 we describe how to trace paths (strokes) in a skeletal segment. A stroke is a sequence of vertices which does not exhibit branching. Identifying convex segments from a stroke is discussed in Section 4.2. Larger convex segments are further broken down to obtain smaller convex segments.

4.1. Identifying strokes in a skeletal segment

We traverse the possible paths in each of the skeletal segments. A path is a linear sequence of vertices traced in the skeleton or a skeletal segment. Each path is labeled with a unique path identifier (ID). A path can be initiated from an end vertex of the segment and it terminates when it reaches another end vertex. Traversal visits a linear sequence of vertices. When a junction vertex is encountered, one of the other branches emanating from the vertex needs to be selected for further traversal. We select the
Table 1. Decomposition rules applied on handwritten compound characters. For each sample the table shows the applied rules (R1, R21, R22, R2, R3) together with the handwritten character, its skeleton image, the straight line approximated image, and the resulting shape components.
first branch in the anti-clockwise direction for progressing the traversal through the junction point. This is illustrated in Fig. 7. In a skeletal segment with junction points, each vertex will have two paths passing through it, in opposite directions. If the skeletal segment also has a closed loop, then we initiate a new path to traverse this closed loop. Any vertex on this closed loop can be chosen as the starting vertex for traversal, and the path terminates at the same vertex. If two paths have exactly the same set of vertices, then they are considered duplicates and one of them is discarded. The traversal results for a compound character are shown in Fig. 8. The identified character strokes (paths) need to be segmented further to obtain the convex segments.

4.2. Identifying convex segments in a skeleton stroke

The identified strokes are segmented into smaller sub-strokes based on the nature of their convexity. We identify the convex segments within each stroke. A convex segment constitutes a
Fig. 7. Traversal of a path through a junction point.
subset of vertices of the stroke. We traverse the stroke and label each vertex as 'R' or 'L' as described in Section 3. We group the consecutive 'L' vertices or 'R' vertices and form individual convex segments. Fig. 8 shows the identified convex segments. Some of these segments, e.g. C2 and C5 in Fig. 8, are too long. To enable better approximation of the convex segments with our standard set of convex shape primitives (discussed in the next section) we need to break down these larger convex segments into smaller convex segments. Given a long convex segment, say S, shown in Fig. 9(a) with a sequence of points p1, p2, ..., pn, we use the following steps for each of the points pk (k = 1 to n−2):

1. Assume a straight line Lk passing through the points pk and pk+1. We use the line Lk as the reference line. Consider the next point pi (i = k+2) on the segment for step 2.
2. Compute the angle θi between the line Lk and the line joining pk to pi, as shown in Fig. 9(b).
3. If θi ≤ 120° and the sequence of θi values (for i ≥ k+2) computed with Lk as the reference remains monotonically increasing, then move to the next index i and repeat steps 2 and 3; otherwise go to step 4.
4. The portion between pk and pi is identified as a smaller convex segment s of the larger segment S. If the set of vertices in s forms a subset of the vertices in a previously identified smaller segment of S, then we discard the new segment s; otherwise we add it to the list of smaller segments associated with S. Referring to Fig. 9, the segments (d) and (e) are contained in segment (c), and hence are discarded. The final set of smaller convex segments will be (c) and (f).
5. Increment k and go to step 1.

A code sketch of this grouping and splitting procedure is given below. To summarize, we have (i) decomposed the polygonized skeleton into segments, (ii) traced strokes (paths) for each segment, (iii) identified convex segments in each stroke, and (iv) further broken down the larger convex segments. We have thus obtained the set of convex shape segments for the entire skeleton of the character. To represent the spatial layout of these convex segments we represent the character with a set of shape primitives which best matches the convex segments. This is discussed in the next section.
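The sketch below illustrates the grouping of consecutive same-label vertices into convex segments and the splitting of long segments by the angle test of steps 1–5. It reuses convexity_labels() from the sketch in Section 3; points are assumed to be (x, y) tuples, strokes are assumed to have at least three vertices, and the handling of transition vertices and of the containment check is a simplification.

import math

def convex_segments(stroke):
    # Group maximal runs of vertices carrying the same convexity label.
    labels = convexity_labels(stroke)
    segments, start = [], 0
    for i in range(1, len(stroke)):
        if labels[i] != labels[start]:
            segments.append(stroke[start:i + 1])   # include the transition vertex
            start = i
    segments.append(stroke[start:])
    return [s for s in segments if len(s) >= 2]

def split_long_convex_segment(points, angle_limit=120.0):
    # Steps 1-5 of Section 4.2: grow a sub-segment from p_k while the angle theta_i
    # stays below the limit and increases monotonically; keep maximal sub-segments.
    pieces = []
    n = len(points)
    for k in range(n - 2):
        xk, yk = points[k]
        ref = math.atan2(points[k + 1][1] - yk, points[k + 1][0] - xk)   # direction of L_k
        last_theta, end = -1.0, k + 1
        for i in range(k + 2, n):
            theta = abs(math.degrees(math.atan2(points[i][1] - yk, points[i][0] - xk) - ref))
            theta = min(theta, 360.0 - theta)      # fold the angle into [0, 180]
            if theta <= angle_limit and theta > last_theta:
                last_theta, end = theta, i
            else:
                break
        piece = points[k:end + 1]
        # Discard pieces whose vertices are a subset of an already kept piece (step 4).
        if len(piece) >= 3 and not any(set(piece) <= set(p) for p in pieces):
            pieces.append(piece)
    return pieces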
5. Recognition using topological features

Each convex segment of a character is labeled with (or mapped to) its best matching shape primitive. Our repertoire of shape primitives comprises nine shapes, as shown in Fig. 10, which allow us to have a good representation of the convex shapes. The procedure to identify the matching shape primitive for each convex segment is discussed next (Fig. 11). Consider a convex segment with k vertex points, namely p1, p2, ..., pk. The end points p1(x1, y1) and pk(xk, yk) together form the
Fig. 8. Detection of convex segment of printed and handwritten compound characters. Top: straight line approximated shape components of a handwritten and a printed Bangla compound character after applying the decomposition rules (Section 3). Middle: different paths with unique path identifiers (ID). Bottom: identified convex segments and the sequences of approximation points belonging to these segments.
Fig. 9. Segmentation of a large convex region. (a) Given convex segment; (b) angles θk+2 and θk+3; (c, d, e, f) identified convex regions. Finally retained segments are (c) and (f).
Fig. 10. Different convex-shape primitives (SC implies closed, SU implies U-shaped, and SL implies L-shaped).
Fig. 11. Structural description of the various shape primitives.
opening of the convex shape. A closed segment does not have any opening because the sequence of vertices terminates on the starting vertex. Such segments are labeled with shape ID SC. Our labeling of the remaining shape primitives (in the groups SU and SL) is primarily based on examining the deviation of the remaining vertices pi (1 < i < k) on the convex segment from the opening vertices, along
the horizontal (x) and vertical (y) directions. Since the opening has two end points p1 and pk, the x and y deviations of a vertex pi on the convex segment are taken as the minimum of the deviations from either opening point, as follows:

xd(i) = min(|xi − x1|, |xi − xk|)  and  yd(i) = min(|yi − y1|, |yi − yk|).

Next we compute the maximum x, y deviation values for the entire convex segment w.r.t. its opening (p1 and pk). We denote these values as xd and yd, computed as follows:

xd = max{xd(i) : 1 < i < k}  and  yd = max{yd(i) : 1 < i < k}.

The horizontal and vertical deviation values xd and yd, respectively, are used to determine the shape ID of the given shape primitive. To identify the SU set of primitives we examine the truth of the predicates given below:

SU1: xd > yd and x1 > xi and xk > xi.
SU2: xd > yd and x1 < xi and xk < xi.
SU3: xd < yd and y1 < yi and yk < yi.
SU4: xd < yd and y1 > yi and yk > yi.

To identify the SL set of primitives we ensure that the predicate

xd < (1/4) max(|x1 − xk|, |y1 − yk|)  and  yd < (1/4) max(|x1 − xk|, |y1 − yk|)

is true, and further we ensure

xi > xe and yi > ye′ for SL1.
xi < xe and yi > ye′ for SL2.
xi > xe and yi < ye′ for SL3.
xi < xe and yi < ye′ for SL4.

where e and e′ are the end-point indices taking different values, i.e. either (1, k) or (k, 1).
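A hedged sketch of the shape-ID assignment follows. It implements the SC and SU predicates literally and only indicates where the SL sub-cases (SL1–SL4) would be disambiguated; the short-segment guard and the fall-through behaviour are assumptions for illustration.

def shape_id(segment):
    # segment: list of (x, y) vertices of a convex segment, as tuples.
    if segment[0] == segment[-1]:
        return 'SC'                      # closed: the vertex sequence ends on its start
    if len(segment) < 3:
        return None                      # too short to classify in this sketch
    (x1, y1), (xk, yk) = segment[0], segment[-1]
    interior = segment[1:-1]
    xd = max(min(abs(x - x1), abs(x - xk)) for x, _ in interior)
    yd = max(min(abs(y - y1), abs(y - yk)) for _, y in interior)
    if xd > yd:
        if all(x1 > x and xk > x for x, _ in interior):
            return 'SU1'
        if all(x1 < x and xk < x for x, _ in interior):
            return 'SU2'
    else:
        if all(y1 < y and yk < y for _, y in interior):
            return 'SU3'
        if all(y1 > y and yk > y for _, y in interior):
            return 'SU4'
    span = max(abs(x1 - xk), abs(y1 - yk))
    if xd < span / 4 and yd < span / 4:
        # SL1..SL4 are separated by comparing each interior vertex with the end points
        # (indices e, e') as in the predicates above; the sub-case test is omitted here.
        return 'SL'
    return None                          # no primitive matched (left undecided in this sketch)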
5.1. Representation of topological features

Having identified the shape primitives, the next task is to represent the character as a spatial layout of the shape primitives. We use an adjacency list structure to represent the spatial layout. Two shape primitives are considered adjacent if they share a common vertex. Each shape primitive is tagged with the list of adjacent shape primitives and their relative orientations with respect to the given primitive. For example, if the shape primitive sp1 has its adjacent shape primitives sp2, sp3, sp4, with orientations θ21, θ31, θ41 w.r.t. sp1, then the adjacency list for shape primitive sp1 is written as

sp1 → <(sp2, θ21), (sp3, θ31), (sp4, θ41)>     (1)
The relative orientation, say θk1, is the angle between a horizontal line passing through the center of mass of the shape sp1 and the line connecting the center of mass of sp1 to the center of mass of spk. The complete skeleton is represented as the adjacency list of all the shape primitives identified in that skeleton. Thus the topology of the character is captured as the spatial layout of the shape primitives. Through experiments we found that the shape primitive SC plays a more prominent role compared to SU and SL in the correct identification of the characters. As discussed in the experimental section, we arrived at the weight proportions 5:2:1 for the primitives SC, SU, and SL. A higher weight implies that the shape primitive has a higher invariance across writing styles and captures the unique structural properties of the character; it thus reflects the effectiveness of the shape primitive for recognizing the character.
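A compact sketch of this representation is given below: each convex segment carries its shape ID, two primitives are adjacent when their segments share a vertex, and the relative orientation is taken between the centres of mass as described above. Segments are assumed to be lists of (x, y) tuples.

import math

def center_of_mass(segment):
    xs, ys = zip(*segment)
    return sum(xs) / len(xs), sum(ys) / len(ys)

def build_adjacency(segments, shape_ids):
    # Returns a list aligned with `segments`:
    # [(shape_id, [(neighbour_shape_id, relative_orientation_deg), ...]), ...]
    adjacency = []
    for i, seg_i in enumerate(segments):
        cx, cy = center_of_mass(seg_i)
        neighbours = []
        for j, seg_j in enumerate(segments):
            if i != j and set(seg_i) & set(seg_j):        # shared vertex -> adjacent
                ox, oy = center_of_mass(seg_j)
                theta = math.degrees(math.atan2(oy - cy, ox - cx))
                neighbours.append((shape_ids[j], theta))
        adjacency.append((shape_ids[i], neighbours))
    return adjacency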
5.2. Similarity of feature templates

Our approach to recognize a character involves template-based matching. We construct a standard feature template (adjacency list) for each distinct compound character. The template captures the spatial layout of the constituent shape primitives of the skeleton. The repertoire of standard templates thus constructed is used for matching and labeling the compound characters. To label a given character image we compute the match score of its feature template (adjacency list) with each of the available standard templates. The standard template which gives the highest match score is used to label the character.

Consider a given character with its adjacency list representation AC = {a1, ..., ai, ..., aNC} and a standard template with adjacency list AT = {t1, ..., tj, ..., tNT}, where ai and tj take the form shown in expression (1). The number of shape primitives NC identified for the character can be different from the number of shape primitives NT in the standard template. In the discussion that follows we use ai or tj to refer to the shape primitive itself, and use <ai> or <tj> to represent the list of its adjacent shape primitives. Each primitive ai in the list AC votes for each primitive tj in the list AT. A vote by ai for tj is denoted as Vk, a 2-tuple having components (Vk^w, Vk^s), where Vk^w is the weight of the vote, i.e. a measure of importance of the vote, and Vk^s is the match strength of the vote, i.e. a measure of the degree of match between ai and tj. The subscript k varies to cover all the possible (ai, tj) pairs. The weight of the shape primitive tj is taken as the weight Vk^w of the vote. To compute the strength Vk^s of the vote we examine the shape IDs of ai and tj. If they are different, then Vk^s is taken as 0. If the shape IDs of ai and tj are exactly the same, then the match strength of the vote is computed as

Vk^s = (1 / max(|<ai>|, |<tj>|)) · match(<ai>, <tj>).

Here the function match(<ai>, <tj>) counts the number of matched shape primitives in the adjacency lists <ai> and <tj>, and |<ai>| and |<tj>| denote the total number of shape primitives in the corresponding lists. The shape primitives in the adjacency lists are considered matched if they have the same shape ID and their relative orientations are within ±45° of each other. Having computed all the votes, the normalized matching score between a character image and a standard template is computed using the following steps:

1. Arrange the votes in a sequence (list) sorted in descending order based on the vote strength Vk^s.
2. First consider the highest strength vote. The shape primitives ai and tj concerned with this vote are marked as used. The vote is retained.
3. Consider the next vote in the sequence. If any of its concerned shape primitives ai or tj is already marked as used, then the vote is discarded and deleted from the list. Otherwise, ai and tj are marked as used and the vote is retained. This step is repeated to process all the votes.
4. Scan the lists AC and AT to find the primitives which have still not been marked as used. For each such unused primitive we add a new vote to the vote list. The new vote has strength Vk^s = 0 and weight Vk^w equal to the weight of the shape primitive.

Given the final vote list, we compute the normalized match score between AC and AT as

MS = (Σ Vk^s · Vk^w) / (Σ Vk^w),

where the summation is over all the retained votes in the list. The match score lies in the range [0, 1]. Step 4 above ensures some penalty for the shape primitives which do not find a match in the standard template, or for the shape primitives in the standard template which are missing in the character. The standard template which computes the highest match score is used to label the given character.
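Putting the pieces of Section 5.2 together, a sketch of the greedy, weighted vote matching might look as follows. The 5:2:1 weights anticipate the tuning of Section 6.4 and the ±45° tolerance follows the text; the handling of empty neighbour lists and the neglect of angle wrap-around are simplifying assumptions.

def match_score(A_C, A_T, weights={'SC': 5, 'SU': 2, 'SL': 1}):
    # A_C and A_T: adjacency lists as produced by build_adjacency(), i.e. lists of
    # (shape_id, [(neighbour_shape_id, orientation_deg), ...]) entries.
    def weight_of(shape):
        return weights[shape[:2]]
    def nbr_match(la, lb):
        # Count neighbours with the same shape ID and orientation within +/-45 degrees
        # (angle wrap-around is ignored in this sketch).
        return sum(1 for (sa, ta) in la
                   if any(sa == sb and abs(ta - tb) <= 45 for (sb, tb) in lb))
    votes = []
    for i, (sid_a, nbrs_a) in enumerate(A_C):
        for j, (sid_t, nbrs_t) in enumerate(A_T):
            strength = 0.0
            if sid_a == sid_t and max(len(nbrs_a), len(nbrs_t)) > 0:
                strength = nbr_match(nbrs_a, nbrs_t) / max(len(nbrs_a), len(nbrs_t))
            votes.append((strength, weight_of(sid_t), i, j))
    votes.sort(key=lambda v: v[0], reverse=True)        # highest strength first (step 1)
    used_a, used_t, kept = set(), set(), []
    for strength, weight, i, j in votes:                # greedy retention (steps 2-3)
        if i not in used_a and j not in used_t:
            used_a.add(i); used_t.add(j)
            kept.append((strength, weight))
    # Step 4: unmatched primitives contribute zero-strength votes as a penalty.
    kept += [(0.0, weight_of(A_C[i][0])) for i in range(len(A_C)) if i not in used_a]
    kept += [(0.0, weight_of(A_T[j][0])) for j in range(len(A_T)) if j not in used_t]
    return sum(s * w for s, w in kept) / sum(w for s, w in kept)

A character would then be labelled with the template attaining the highest match_score, subject to the rejection threshold T discussed in Section 6.2.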
Table 2. Details of our two datasets. B and C in parentheses denote the sets of basic and compound characters.

Dataset                      Dataset type         No. of samples            No. of distinct characters   No. of samples per character
Dataset 1 (for detection)    Printed (B, C)       6200 (B), 6600 (C)        50 (B), 165 (C)              124 (B), 40 (C)
Dataset 1 (for detection)    Handwritten (B, C)   17,350 (B), 21,450 (C)    50 (B), 165 (C)              347 (B), 130 (C)
Dataset 2 (for recognition)  Printed (C)          4950 (C)                  165 (C)                      30 (C)
Dataset 2 (for recognition)  Handwritten (C)      19,800 (C)                165 (C)                      120 (C)
Table 3. Performance of compound character detection on Dataset 1.

Data type     Sample size (B, C)   No. of compound characters   No. of misclassified samples   No. of correctly classified samples   Detection rate (%)
Printed       12,800               6,600 (51.56%)               376                            12,424                                97.06
Handwritten   38,800               21,450 (55.28%)              1,485                          37,315                                96.17
Fig. 12. Examples of handwritten compound characters from the dataset.
6. Experimental results and discussion

We have implemented the Bangla compound character detection and recognition system in the C programming language on Fedora 10, running on an Intel Core 2 Duo 2.20 GHz machine with 1 GB RAM. We collected printed characters from several heterogeneous printed documents. The handwritten documents were collected from individuals of different ages and professions. Samples were collected on normal writing paper using standard ball-point pens, gel pens, and ink pens. We avoided pens which produce thick strokes, like sketch pens. All the document pages were scanned using an HP Scanjet 5590 scanner at 300 dpi. The Bangla characters (basic and compound) were manually isolated. Table 2 gives a summary of our dataset of isolated Bangla characters. The dataset was collected at IIT Kharagpur. Dataset 1 is used for evaluating the detection performance for the compound characters. It contains both the basic and the compound characters. Dataset 2 is used for evaluating the recognition performance for the compound characters. It contains only the compound characters. Dataset 2 is a subset of Dataset 1. Fig. 12 shows some samples from our handwritten compound character dataset. We first report the performance of our system for the detection (Section 6.1) and the recognition (Section 6.2) of compound characters using optimal values of certain parameters used in our system. In Section 6.4 we discuss how we have chosen the appropriate values of those parameters.
Table 4. Recognition performance for different values of T on Dataset 2.

Value of T   Printed: rejection rate (%)   Printed: recognition rate (%)   Handwritten: rejection rate (%)   Handwritten: recognition rate (%)
0.00         0.00                          90.20                           0.00                              86.74
0.15         4.50                          90.95                           9.50                              87.10
0.20         10.75                         91.46                           14.18                             87.75
0.25         15.42                         91.85                           20.68                             88.16
0.30         21.20                         92.34                           25.30                             88.45
0.35         27.34                         92.80                           33.50                             89.13
0.45         34.66                         93.78                           42.40                             90.58
6.1. Performance of compound character detection

Table 3 reports the performance of compound character detection as evaluated on Dataset 1. The sample size indicates the total number of basic and compound characters which we used for classification. Compound characters constituted more than 50% of the sample size. Characters which satisfied any of the conditions given in Section 2.2 were classified as compound characters, and all others were classified as basic characters. The classification accuracy was 97.06% and 96.17% for printed and handwritten characters, respectively. We observed that the misclassification happened only for the compound characters.

6.2. Performance on compound character recognition

To recognize a given compound character we compute its match score with all the standard feature templates we construct for the compound characters. The one which computes the highest match score is taken as the label for the given character. But a lower value of the highest match score (i.e. a value less than some threshold T) would imply a lesser similarity of the given character with the best matching template. The recognizer will not have enough confidence to classify such an input sample, and it is highly probable that the classification will be incorrect. Such input samples are therefore rejected for classification. Table 4 shows the recognition performance on Dataset 2 with varying values of the threshold T. When we use a higher value of T, a larger proportion of the input samples get rejected. We report the recognition rate as the fraction of the number of samples correctly
Table 5. Recognition statistics of some printed and handwritten characters in Dataset 2.

Printed: samples recognized (out of 30), recognition rate (%)   Handwritten: samples recognized (out of 120), recognition rate (%)
29, 96.67                                                        115, 95.83
29, 96.67                                                        115, 95.83
29, 96.67                                                        113, 94.17
28, 93.33                                                        112, 93.33
28, 93.33                                                        109, 90.83
28, 93.33                                                        108, 90.00
27, 90.00                                                        106, 88.33
27, 90.00                                                        105, 87.50
27, 90.00                                                        103, 85.83
26, 86.67                                                        103, 85.83
25, 83.33                                                        99, 82.50
24, 80.00                                                        93, 77.50
23, 76.67                                                        89, 74.17
Table 6. Performance for detection and recognition of compound characters in Dataset 1.

Data type              Sample size (basic and compound)   Correctly detected compound characters   Correctly recognized compound characters   Recognition rate excluding detection errors (%)   Recognition rate including detection errors (%)
Printed (12,800)       6600 (C), 6200 (B)                 6224                                     5588                                       89.78                                              84.67
Handwritten (38,800)   21,450 (C), 17,350 (B)             19,965                                   16,995                                     85.12                                              79.23
recognized to the total number of samples classified (excluding the rejected samples). The rejection rate is the fraction of input characters rejected by the recognizer. With no rejection we see that the recognition rate is 90.2% for printed and 86.74% for handwritten characters in Dataset 2. Table 5 shows the recognition rates for some specific examples of printed and handwritten compound characters. For printed characters the recognition performance is highest at 96.67% for some characters, but falls to lower values such as 80.00% and 76.67% for others. For handwritten characters the highest recognition rate achieved is 95.83%, while some characters have lower recognition rates of 77.50% and 74.17%. Table 6 gives the performance of the complete system which performs the detection and recognition of compound characters in Dataset 1. We report the final recognition rate when including the errors in the detection phase. To check the discrimination power of the proposed topological features we now examine the top 5 best matches, instead of only the first one which gives the highest matching score. If the correct template occurs in the list of top 5 best matches, then we say that the recognizer has succeeded in bringing the correct template into the extended list of top 5 best matches. We call the success rate of the classifier when considering the extended list the extended recognition rate. The extended recognition rate is always higher than when considering just the single best match. Table 7 shows the extended recognition rates when considering extended lists of varying sizes: with the top 2, 3, 4, and 5 best matches. We see that the extended recognition performance deviates by only about 3% compared to the single best choice. This small rise in accuracy when using the extended list can be considered insignificant,
Table 7. Compound character recognition rate on Dataset 2 for extended lists of best matches.

Input type    Top 1 choice (%)   Top 2 choices (%)   Top 3 choices (%)   Top 4 choices (%)   Top 5 choices (%)
Printed       90.20              91.72               92.85               93.17               93.31
Handwritten   86.74              87.51               88.96               89.39               89.42
thus indicating that our features provide a good discrimination between characters of similar shapes. Most of the time the correct match appears as the top scoring match. Fig. 13 shows some examples of the top three matches for some characters.

6.3. Effectiveness of skeletal decomposition

In Section 4 we identified the convex shapes in the skeletal segments. Instead of the skeletal segments, we can also give the complete skeleton as an input to the stroke extraction method discussed in Section 4.1. This would identify the convex shapes in the complete skeleton. The detailed methodology used for character recognition without character decomposition is reported in our preliminary work [52] for recognizing the basic Bangla characters. We now report the recognition performance for the compound characters on our Dataset 2, but when using the complete skeleton, i.e., without doing the skeletal decomposition. Table 8 compares the recognition performance with and without applying the skeletal decomposition. It can be seen that the performance is better when applying skeletal decomposition prior to convex shape extraction. The performance improves by 21.35%
Fig. 13. Match results for some printed and handwritten Bangla compound characters. We show the top three matches for each sample. The matching score (MS) is also indicated.
Table 8. Performance comparison of the proposed method with and without using character decomposition. Evaluation was done on Dataset 2.

Input pattern   Recognition rate (%) without skeletal decomposition   Recognition rate (%) with skeletal decomposition
Printed         68.85                                                 90.20
Handwritten     59.46                                                 86.74
Table 9. Recognition performance for different values of the polygonization parameter τ, reported on the validation dataset.

τ value                1       2       3       4       5       6       7      8      9      10     11     12     13
Recognition rate (%)   85.53   86.16   90.53   90.80   55.51   13.68   9.55   9.10   8.49   8.16   6.45   6.45   6.45
for printed and 27.28% for handwritten compound characters. It is worth considering why the decomposition of a compound character into simpler shape components leads to an improved recognition performance. It seems that after skeletal decomposition the obtained convex shape components can be better approximated by the shape primitives (in our repertoire), compared to when using the complete skeleton. For complex skeletons, like those of many compound characters, the topological features extracted from the complete skeleton may not be well represented by the simple shape primitives we have used; the simple primitives may be adding more clutter than providing the discriminative information necessary for correct recognition. Skeletal decomposition seems to help here. Even though skeletal decomposition leads to some loss in the topological information, resulting in fewer and simpler convex shapes when compared to using the complete skeleton, still, as the results show, the performance with decomposition is better than the performance without decomposition. Hence, we can conclude that decomposing a compound character
into shape components improves the effectiveness of topological features.

6.4. Parameter tuning on the validation dataset

We considered a training set comprising 30% of the printed character instances randomly sampled for each class of compound characters. The training set had 9 characters per class to be used for making the standard templates. In the training phase we manually constructed the feature template for each character instance in the training set of a class. We picked the template which was common to the majority of the 9 training instances as the standard template for that class. The recognition performance depends on certain parameters, like the value of τ used in the polygonal approximation and the weights for the primitives SC, SU, and SL. For the purpose of parameter tuning we considered a validation dataset comprising a randomly sampled set of 30% of the handwritten instances in Dataset 2. We used only the handwritten characters for validation because they exhibit more variation in shape compared to the printed characters. The test set comprised all the remaining characters in Dataset 2, i.e. excluding the training set and the validation set. As discussed in Section 2.1, the straight line approximation of a character skeleton plays an important role in designing the rules for detecting compound characters and for skeletal decomposition. The quality of the polygonal approximation is affected by the number of approximation points, which may, in turn, affect the applicability of these rules. Our main objective is to make an approximated shape of the thinned character with fewer approximation points while still being able to represent all the convex segments in the true shape. We tune the straight line approximation method [45] with respect to the number of approximation points by selecting an optimal value for the parameter τ. In Fig. 3 we observed that a smaller value (τ = 1, 2) leads to a very large number of segments and a larger value (τ = 10, 13) distorts the skeleton shape. Such a distortion is not acceptable because structural analysis of a character image is a key part of our technique. The polygonized skeleton affects the feature formulation and thus the final recognition performance. Table 9 gives the recognition performance on the validation dataset for different values of τ. The recognition performance improves when increasing the value of τ and peaks for the values τ = 3, 4. For larger values the polygonized skeleton gets distorted and the recognition performance drops drastically. Table 10 shows the effect of varying τ on two handwritten compound characters. Higher values of τ result in a decrease in the number of approximation points
Table 10. Sensitivity analysis of τ with respect to the number of straight lines present in the approximated skeleton for two compound characters. The numbers have been averaged over all instances of the character in the validation dataset.

First character:
τ                      1       2       3       4       5       7       10      13
Number of points       29      18      16      14      13      12      12      9
Number of segments     25      17      16      14      13      12      11      9
Recognition rate (%)   92.14   92.78   95.35   95.83   71.68   48.50   43.46   5

Second character:
τ                      1       2       3       4       5       7       10      13
Number of points       36      23      19      18      17      14      13      13
Number of segments     33      22      19      18      17      14      12      11
Recognition rate (%)   81.45   82.16   84.65   84.86   61.17   27.58   0.0     0.0
Table 11. Performance evaluation using equal weights for the shape primitives.

Experiment setting       Recognition rate (%) for handwritten characters   Percentage reduction in the recognition rate compared to baseline
Baseline performance     83.26                                             NIL
Omitting SC segments     39.62                                             52.41
Omitting SU segments     65.14                                             21.76
Omitting SL segments     74.33                                             10.73
(vertices), straight line segments, and the recognition performance. It is clearly seen that the optimal value of τ is 3 or 4. Next we discuss the tuning of the weights to be used for the three types of shape primitives SC, SU, and SL described in Section 5. We obtained the baseline recognition performance of the system (see Table 11) on the validation dataset when using equal weights (1:1:1) assigned to the shape primitives SC, SU, and SL. We observed that if we discard the SC primitives appearing in all the feature formulations for characters in the dataset and the standard templates, then there is a 52.41% fall in the recognition rate when compared to the baseline performance. Likewise, the omission of SU primitives leads to a reduction of 21.76% and the omission of SL primitives results in a reduction of 10.73% in the recognition rate when compared to the baseline performance. This shows that the SC primitives play a more prominent role in the correct recognition of compound characters. Likewise, the SU primitives are more effective compared to the SL primitives. We therefore fixed the weights as 5:2:1 for the SC, SU, and SL primitives, in proportion to the percentage reduction in recognition rate effected by the omission of those primitives. With the value of τ fixed as 4 and the weight ratios of the primitives chosen as 5:2:1, we report in Table 12 the recognition performance on the test dataset. We repeated the cross validation 10 times, each time picking a random subset of Dataset 2 for training (randomly chosen 30% of the printed samples of each class) and validation (randomly chosen 30% of the handwritten
Table 12. Compound character recognition rate on the test dataset when considering extended lists of top matches. The test set had a total of 3465 printed and 13,860 handwritten compound characters.

Input type    Top 1 choice (%)   Top 2 choices (%)   Top 3 choices (%)   Top 4 choices (%)   Top 5 choices (%)
Printed       90.82              92.10               93.25               93.38               93.62
Handwritten   87.12              88.31               89.85               90.25               90.41
samples). We found that the standard templates for the compound characters, the optimal τ value, and the shape-primitive weights remained the same each time. The recognition performance on the complete Dataset 2 using the optimal value of τ and the weights has been shown in Table 7.

6.5. Comparison with other works
We provide a comparison of our compound character recognition method with four well-known methods proposed by Chaudhuri and Pal [18], Sural and Das [34], Pal et al. [42], and Das et al. [36]. Table 13 summarizes the features and classifiers used in these works. A summary of the recognition performance, number of output classes, and dataset size, as reported in the respective works, is given in Table 14. In Table 15, we show some of the confusing pairs reported by the existing methods. Our method could successfully discriminate several such confusing pairs. It may be noted that the authors of [18,34] have pointed out
that their methods do not perform well due to ornamental patterns of some compound characters and the existence of close similarities between character patterns. However, they have not reported any specific confusing pairs or failure cases.

6.6. Discussion

In our previous work [52] we had reported results on Bangla basic character recognition using topological features. We could as well apply skeletal decomposition to the basic characters prior to extracting the topological features. However, it was seen that the overall performance showed an insignificant change (< 1%) for basic character recognition when we used skeletal decomposition. It seems that, due to the simpler shapes of the basic characters, the decomposition did not improve the effectiveness of the topological features modeled with the shape primitives. Hence we apply skeletal decomposition only when we detect that the given character is a compound character. We now discuss some failure scenarios for our method. Fig. 14 shows some example characters which were not correctly recognized by our method. The reasons for failure are listed below:
1. Over-writing: This generally leads to the formation of extraneous closed regions and convex shapes which change the topological characteristics. Examples are shown in Fig. 14(a).
2. Low contrast: If the scanned image is noisy and has low contrast, then the skeleton may have broken components due to problems in binarization. This changes the topological characteristics. Examples are shown in Fig. 14(b).
Table 13
Comparison among different Bangla compound character recognition systems. 'P' stands for printed and 'H' stands for handwritten input.

Method | P/H | Preprocessing | Feature set | Classifier
Chaudhuri and Pal [18] | P | Text digitization and noise cleaning; skew detection and correction; line, word, and character segmentation | Structural | Decision tree, template matching
Sural and Das [34] | P | Skew correction and noise removal; line, word, and character segmentation | Fuzzy set with line- and curve-based features | MLP
Pal et al. [42] | H | Otsu's binarization [53]; bounding box extraction; median filtering; size normalization; mean filtering; gradient image using Robert's filters | Gradient | MQDF
Das et al. [36] | H | Not mentioned | Shadow, longest-run, quad-tree | MLP + SVM
Proposed | P, H | Image binarization; thinning; straight line approximation | Topological | Template matching
Table 14
Comparison among different Bangla compound character recognition systems.

Method | P/H | Recognition rate (%) | Number of output classes | Dataset size
Chaudhuri and Pal [18] | P | 82.00 | Not reported | Not reported
Sural and Das [34] | P | 85.40 | 128 | Not reported
Proposed | P | 90.20 | 165 | 4950
Pal et al. [42] | H | 85.90 | 138 | 20,543
Das et al. [36] | H | 79.25, 80.51* | 43 | 9,768
Proposed | H | 86.74 | 165 | 19,800

* 79.25 and 80.51 are the recognition rates for the MLP and SVM classifiers, respectively.
Table 15
Performance of the existing methods in terms of failure cases.

Method | Confusing pairs
Chaudhuri and Pal [18] | Not reported
Sural and Das [34] | Not reported
Pal et al. [42] | Bangla glyph pairs (shown as images in the original; not reproducible in text)
Das et al. [36] | Bangla glyph pairs (shown as images in the original; not reproducible in text)
3. Inappropriate decomposition: Some of the strokes may be inordinately long or quite short. The longer strokes may be over-segmented into convex regions, and the shorter ones may be under-segmented. Variations in writing style can also turn simple pathways into complex ones (see Section 3). Examples are shown in Fig. 14(c); in each variant pair, the first character acquires extra shape primitives.
4. Variations in handwritten characters: The same character can be written in different ways by different individuals. This problem can be addressed by keeping multiple standard templates for the same compound character. Examples are shown in Fig. 14(d).
Our recognition scheme attempts to handle inappropriate decomposition by using non-uniform weights for the shape primitives, so that the effect of over-segmentation or under-segmentation on the match score remains small. To a limited extent, we have also addressed the variations in writing style by increasing the number of standard templates for each compound character.
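As a rough illustration of the multiple-template strategy, the sketch below (our own illustration with hypothetical names; the scoring function is passed in, e.g. the weighted_match_score sketched earlier) keeps several standard templates per compound character and assigns the class whose best-matching template gives the highest score.

```python
def classify(candidate, template_bank, score_fn):
    """Assign the class whose best-matching standard template scores highest.

    candidate     : bag (list) of shape primitives from the decomposed input
    template_bank : dict mapping class label -> list of primitive bags,
                    one bag per standard template of that compound character
    score_fn      : similarity function between two primitive bags
    """
    best_label, best_score = None, float("-inf")
    for label, templates in template_bank.items():
        # Several templates per class absorb writing-style variation:
        # only the best-matching template of each class counts.
        score = max(score_fn(candidate, t) for t in templates)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```

Adding more templates for a class can only raise that class's best score, which is why extra writing-style variants can be accommodated without changing the matching rule itself.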
Fig. 14. Examples of some erroneous samples. (a) Over-writing; (b) low contrast; (c) inappropriate decomposition; (d) variations in handwritten characters. Note that all the samples are given with their convex segments.
7. Conclusion

In this paper we have proposed novel topological features for recognizing Bangla compound characters. We have formulated decomposition rules to break a compound character into simpler shape components. The decomposition improves the efficacy of the features used and yields a better recognition performance. Our recognition scheme does not require any training with real examples. This is an advantage because many Bangla compound characters are used rarely, and finding a sufficient number of examples for use as training data is very difficult. The proposed method has been used for recognizing printed and handwritten Bangla compound characters with promising results. The decomposition technique and the topological features based on convexity analysis have a broader applicability and may be useful for recognizing complex characters in other scripts as well, especially Devanagari, Tamil, and Kannada.
Conflict of interest None declared.
References

[1] H. Fujisawa, Forty years of research in character and document recognition: an industrial perspective, Pattern Recognition 41 (8) (2008) 2435–2446.
[2] V.K. Govindan, Character recognition: a review, Pattern Recognition 23 (7) (1990) 671–683.
[3] P. Sarkar, Document image analysis for digital libraries, in: International Workshop on Research Issues in Digital Libraries, 2006, Article 12.
[4] T. Kameshiro, T. Hirano, Y. Okada, F. Yoda, A document image retrieval method tolerating recognition and segmentation errors of OCR using shape-feature and multiple candidates, in: International Conference on Document Analysis and Recognition, 1999, pp. 681–684.
[5] A. Kumar, C.V. Jawahar, R. Manmatha, Efficient search in document image collections, in: Asian Conference on Computer Vision, 2007, pp. 586–595.
[6] P.B. Pati, A.G. Ramakrishnan, Word level multi-script identification, Pattern Recognition Letters 29 (2008) 1218–1229.
[7] D. Genzel, A.C. Popat, N. Spasojevic, M. Jahr, A. Senior, E. Ie, F.Y. Tang, Translation-inspired OCR, in: International Conference on Document Analysis and Recognition, 2011, pp. 1339–1343.
[8] A. Bahrampour, W. Barkhoda, B.Z. Azami, Implementation of three text to speech systems for Kurdish language, in: Iberoamerican Congress on Pattern Recognition, 2009, pp. 321–328.
[9] S. Laroum, N. Bechet, H. Hamza, M. Roche, HYBRED: an OCR document representation for classification tasks, International Journal of Computer Science Issues 8 (3) (2011) 1–8.
[10] M. Cheriet, M. El Yacoubi, H. Fujisawa, D. Lopresti, G. Lorette, Handwritten recognition research: twenty years of achievement… and beyond, Pattern Recognition 42 (12) (2009) 3131–3135.
[11] T.H. Su, T.W. Zhang, D.J. Guan, H.J. Huang, Off-line recognition of realistic Chinese handwriting using segmentation-free strategy, Pattern Recognition 42 (1) (2009) 167–182.
[12] P.K. Wong, C. Chan, Off-line handwritten Chinese character recognition as a compound Bayes decision problem, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (9) (1998) 1016–1023.
[13] F. Kimura, OCR technologies for machine printed and hand printed Japanese text, in: Digital Document Processing: Major Directions and Recent Advances, 2007, pp. 49–71.
[14] H.J. Kim, P.K. Kim, Recognition of off-line handwritten Korean characters, Pattern Recognition 29 (2) (1996) 245–254.
[15] J.O. Kwon, B. Sin, J.H. Kim, Recognition of on-line cursive Korean characters combining statistical and structural methods, Pattern Recognition 30 (8) (1997) 1255–1263.
[16] A. Amin, Off-line Arabic character recognition: a survey, in: International Conference on Document Analysis and Recognition, 1997, pp. 596–599.
[17] M.S. Khorsheed, Off-line Arabic character recognition: a review, Pattern Analysis and Applications 5 (1) (2002) 31–45.
[18] B.B. Chaudhuri, U. Pal, A complete printed Bangla OCR system, Pattern Recognition 31 (5) (1998) 531–549.
[19] R. Jayadevan, S.R. Kolhe, P.M. Patil, U. Pal, Offline recognition of Devanagari script: a survey, IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews 41 (6) (2011) 782–796.
[20] R.J. Kannan, A comparative study of optical character recognition for Tamil script, European Journal of Scientific Research 35 (4) (2009) 570–582.
[21] M.A. Rahiman, M.S. Rajasree, Printed Malayalam character recognition using back-propagation neural networks, in: International Advance Computing Conference, 2009, pp. 197–201.
[22] B.B. Chaudhuri, U. Pal, M. Mitra, Automatic recognition of printed Oriya script, Sādhanā 27 (1) (2002) 23–34.
[23] P.P. Kumar, C. Bhagvati, A. Negi, A. Agarwal, B.L. Deekshatulu, Towards improving the accuracy of Telugu OCR systems, in: International Conference on Document Analysis and Recognition, 2011, pp. 910–914.
[24] T.V. Ashwin, P.S. Sastry, A font and size-independent OCR system for printed Kannada documents using support vector machines, Sādhanā 27 (1) (2002) 35–58.
[25] G.S. Lehal, C. Singh, A Gurmukhi script recognition system, in: International Conference on Pattern Recognition, 2000, pp. 557–560.
[26] S. Antani, L. Agnihotri, Gujarati character recognition, in: International Conference on Document Analysis and Recognition, 1999, pp. 218–221.
[27] Constitution of India, Government of India, Ministry of Law and Justice, 2007.
[28] U. Pal, B.B. Chaudhuri, Indian script character recognition: a survey, Pattern Recognition 37 (9) (2004) 1887–1899.
[29] V. Bansal, R.M.K. Sinha, Integrating knowledge sources in Devanagari text recognition system, IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans 30 (4) (2000) 500–505.
[30] H. Ma, D. Doermann, Adaptive Hindi OCR using generalized Hausdorff image comparison, ACM Transactions on Asian Language Information Processing 2 (2003) 193–218.
[31] A. Dutta, S. Chaudhury, Bengali alpha-numeric character recognition using curvature features, Pattern Recognition 26 (12) (1993) 1757–1770.
[32] J.U. Mahmud, M.F. Raihan, C.M. Rahman, A complete OCR system for continuous Bengali characters, in: TENCON, 2003, pp. 1372–1376.
[33] A. Majumdar, Bangla basic character recognition using digital curvelet transform, Journal of Pattern Recognition Research 2 (1) (2007) 17–26.
[34] S. Sural, P.K. Das, An MLP using Hough transform based fuzzy feature extraction for Bengali script recognition, Pattern Recognition Letters 20 (8) (1999) 771–782.
[35] S. Basu, N. Das, R. Sarkar, M. Kundu, M. Nasipuri, D.K. Basu, A hierarchical approach to recognition of handwritten Bangla characters, Pattern Recognition 42 (2009) 1467–1484.
[36] N. Das, B. Das, R. Sarkar, S. Basu, M. Kundu, M. Nasipuri, Handwritten Bangla basic and compound character recognition using MLP and SVM classifier, Journal of Computing 2 (2) (2010) 109–115.
[37] U. Pal, B.B. Chaudhuri, Automatic recognition of unconstrained off-line Bangla handwritten numerals, in: International Conference on Advances in Multimodal Interfaces, 2000, pp. 371–378.
[38] U. Pal, K. Roy, F. Kimura, A lexicon-driven handwritten city-name recognition scheme for Indian postal automation, IEICE Transactions on Information and Systems E 92-D (5) (2009) 1146–1158.
[39] T.K. Bhowmik, P. Ghanty, A. Roy, S.K. Parui, SVM-based hierarchical architectures for handwritten Bangla character recognition, International Journal on Document Analysis and Recognition 12 (2) (2009) 97–108.
[40] B.B. Chaudhuri, A. Majumdar, Curvelet-based multi SVM recognizer for offline handwritten Bangla: a major Indian script, in: International Conference on Document Analysis and Recognition, 2007, pp. 491–495.
[41] U. Garain, B.B. Chaudhuri, Compound character recognition by run-number-based metric distance, in: IS&T/SPIE International Symposium on Electronic Imaging: Science and Technology, vol. 3305, 1998, pp. 90–97.
[42] U. Pal, T. Wakabayashi, F. Kimura, Handwritten Bangla compound character recognition using gradient feature, in: International Conference on Information Technology, 2007, pp. 208–213.
[43] S. Bag, P. Bhowmick, P. Behera, G. Harit, Robust binarization of degraded documents using adaptive-cum-interpolative thresholding in a multi-scale framework, in: International Conference on Image Information Processing, 2011, pp. 1–6.
[44] S. Bag, G. Harit, An improved contour-based thinning method for character images, Pattern Recognition Letters 32 (14) (2011) 1836–1842.
[45] P. Bhowmick, B.B. Bhattacharya, Fast polygonal approximation of digital curves using relaxed straightness properties, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (9) (2007) 1590–1602.
[46] S. Mozaffari, K. Faez, M. Ziaratban, Structural decomposition and statistical description of Farsi/Arabic handwritten numeric characters, in: International Conference on Document Analysis and Recognition, 2005, pp. 237–241.
[47] H. Lee, C. Huang, A fuzzy rule-based system for structure decomposition on handwritten Chinese characters, in: International Conference on Systems, Man, and Cybernetics, 'Humans, Information and Technology', 1994, pp. 487–492.
[48] Z. Su, Z. Cao, Y. Wang, Stroke extraction based on ambiguous zone detection: a preprocessing step to recover dynamic information from handwritten Chinese characters, International Journal on Document Analysis and Recognition 12 (2009) 109–121.
[49] J. Zeng, Z. Liu, Stroke segmentation of Chinese characters using Markov random fields, in: International Conference on Pattern Recognition, 2006, pp. 868–871.
[50] J. Hu, H. Yan, Structural decomposition and description of printed and handwritten characters, in: International Conference on Pattern Recognition, 1996, pp. 230–234.
[51] P. Foggia, Character recognition by geometrical moments on structural decompositions, in: International Conference on Document Analysis and Recognition, 1997, pp. 6–10.
[52] S. Bag, G. Harit, P. Bhowmick, Topological features for recognizing printed and handwritten Bangla characters, in: Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data, 2011.
[53] N. Otsu, A threshold selection method from gray-level histogram, IEEE Transactions on Systems, Man, and Cybernetics 9 (1979) 62–66.
Soumen Bag received the B.E. and M.Tech. degrees in Computer Science and Engineering from National Institute of Technology (NIT) Durgapur, India, in 2003 and 2008 respectively. From January 2004 to June 2006, he worked as a lecturer in the Department of Computer Science and Engineering at BCET Durgapur, India. He received his Ph.D. from Indian Institute of Technology (IIT) Kharagpur in 2013. Since August 2012, he has been working as an Assistant Professor at International Institute of Information Technology (IIIT), Bhubaneswar, India. He is the recipient of the Institute Gold Medal for First Class in his Master's degree. His research interests are in the areas of OCR for Indian scripts, document image analysis, image processing, and pattern recognition.
Gaurav Harit received his Ph.D. from Indian Institute of Technology Delhi in 2007. He worked as an Assistant Professor at IIT Kharagpur from 2008 to 2010, and has been an Assistant Professor at IIT Jodhpur since July 2010. His areas of interest include document image analysis, image analysis, and computer vision.
Partha Bhowmick received his B.Tech. from IIT Kharagpur and his Master's and Ph.D. degrees from ISI Kolkata. Presently he is an Associate Professor in the CSE Department, IIT Kharagpur. His primary research interests are in digital geometry, computer graphics, low-level image processing, approximate pattern matching, shape analysis, document image analysis, GIS, and biometrics.