IJDAR DOI 10.1007/s10032-010-0142-4
ORIGINAL PAPER
Binarization of degraded document image based on feature space partitioning and classification Morteza Valizadeh · Ehsanollah Kabir
Received: 29 November 2009 / Revised: 21 November 2010 / Accepted: 2 December 2010 © Springer-Verlag 2010
Abstract In this paper, we propose a new algorithm for the binarization of degraded document images. We map the image into a 2D feature space in which the text and background pixels are separable, and then we partition this feature space into small regions. These regions are labeled as text or background using the result of a basic binarization algorithm applied to the original image. Finally, each pixel of the image is classified as either text or background based on the label of its corresponding region in the feature space. Our algorithm splits the feature space into text and background regions without using any training dataset. In addition, it needs no parameter setting by the user and is appropriate for various types of degraded document images. The proposed algorithm demonstrated superior performance against six well-known algorithms on three datasets.

Keywords Degraded document · Binarization · Mode association clustering · Structural contrast · Feature space partitioning
M. Valizadeh (B) · E. Kabir
Department of Electrical Engineering, Tarbiat Modares University, Tehran, Iran
e-mail: [email protected]; [email protected]
E. Kabir e-mail: [email protected]

1 Introduction

Document image analysis is an important field of image processing and pattern recognition. It consists of image capturing, binarization, layout analysis, and character recognition. Image binarization aims to convert a gray-scale image into a binary one, and its quality affects the overall performance of document analysis systems. Although various thresholding algorithms have been developed, binarization of document images with poor and variable contrast, shadow, smudge, and variable foreground and background intensities is still a challenging problem.

The binarization methods reported in the literature are generally global or local. Global methods find a single threshold value using some criterion based on the gray levels of the image. These methods compare the gray level of each pixel with this threshold and classify it as a text or background pixel. Global thresholding based on clustering [1], entropy minimization [2], and valley seeking in the intensity histogram [3], as well as feature-based [4] and model-based [5] methods, have been proposed. These methods are efficient for images in which the gray levels of the text and background pixels are separable. If the histogram of the text overlaps with that of the background, they produce improper binary images. To overcome the disadvantages of global methods, various local binarization methods have been presented. These methods use local information around a pixel to classify it as either text or background. We divide local binarization methods into two categories. Methods in the first category use the gray level of each pixel and its neighborhood and apply some predetermined rules to binarize the image locally [6–13]. The second category contains methods that use local information to extract local features and apply a thresholding algorithm to them to obtain the binary image [14–17]. Finding an appropriate threshold is an important stage of these methods. Recently, some binarization algorithms based on classification methods have been presented. These algorithms use training samples to improve the binarization results [11,18,19]. The main limitation of these algorithms is that they are efficient only for the specific types of images included in the training set.
Finding a proper boundary or threshold for separating text and background pixels in the feature space is a challenging problem in binarization algorithms. Some methods minimize a global criterion to find a threshold [14,15,17], while the algorithm in [16] uses two global attributes of the text and background pixels, labeled by another binarization algorithm, to calculate an adaptive threshold. When the features of the text and background pixels vary strongly in different regions of the image, thresholding based on global criteria or global attributes leads to errors.

In this paper, we use a mode-association clustering algorithm based on hill climbing to partition a 2D feature space into small regions. The feature space is partitioned in such a way that almost every region is occupied by instances of either text or background pixels only, resulting in many pure regions. Then, we employ the result of Niblack's algorithm to classify these regions as text or background. Classifying a region in the feature space, instead of classifying its points individually, makes our method robust against the errors of Niblack's algorithm. Each pixel is then classified as either text or background according to its corresponding region in the feature space. Our algorithm is applicable to various types of degraded document images.

The rest of this paper is organized as follows. Section 2 briefly reviews some related work on local binarization methods and their drawbacks. Section 3 describes the proposed binarization algorithm. Experimental results and a comparison with some well-known algorithms are discussed in Sect. 4, and conclusions are given in Sect. 5.
2 Survey of local binarization algorithms

In general, document image binarization algorithms are categorized into global and local. Since local algorithms yield better results for degraded images, we concentrate on them. Niblack proposed a dynamic thresholding algorithm that calculates a separate threshold for each pixel by shifting a window across the image [6]. The threshold T(x, y) for the center of the window is computed from local information:

T(x, y) = m(x, y) + k s(x, y)   (1)
where m(x, y) and s(x, y) are the local mean and standard deviation in the window centered on pixel (x, y). The window size and k are the predetermined parameters of this algorithm; k is set to −0.2. This method can separate text from background in the areas around the text, but wherever there is no text inside the local window, some parts of the background are regarded as text and background noise is magnified. Sauvola addressed this problem of Niblack's method by assuming that the text and background pixels have gray values close to 0 and 255, respectively [13]. He proposed a threshold
criterion as follows:

T(x, y) = m(x, y) [1 − k (1 − s(x, y)/R)]   (2)
where R is a constant set to 128 for an image with 256 gray levels and k is set to 0.5. This procedure gives a satisfactory binary image when the contrast between foreground and background is high. However, the optimal values of R and k are proportional to the contrast of the text; for poor-contrast images, if the parameters are not set properly, text regions are missed. Chen proposed an algorithm for locally setting the binarization parameters [11]. This algorithm is implemented in two stages. In the first stage, a feature representing the region contrast is extracted, and using this feature, the original image is decomposed into sub-regions. In the second stage, three features are extracted from each sub-region. These features are used to examine the sub-regions and classify them into four classes: background, faint strokes, heavy strokes, and heavy and faint strokes. For each sub-region, appropriate parameters are set according to its class:

T(x, y) = w m(x, y) + k G_N(x, y)   (3)

where (w, k) = (0, 0) for background, (0.5, −1.1) for faint strokes, (0.7, −0.8) for heavy strokes, and (0.7, −1.1) for heavy with faint strokes.
G_N(x, y) represents the mean-gradient value of the sub-region in the direction of the stroke slant. The parameters w and k are set experimentally and do not carry over to different types of images. The logical-level thresholding technique uses not only the image gray levels but also the stroke width of the characters to improve the binarization quality [9]. This algorithm is based on the idea of comparing the gray level of each pixel with some local averages in its neighborhood. These comparisons need a threshold to produce logical values, which are then used to generate the binary image. Yang [12] proposed an adaptive threshold calculation method to improve the logical-level technique, but this threshold is proportional to a predetermined parameter, so the quality of the final binary image depends on the parameter setting by the user. Gatos estimated the background surface of the document image and compared the differences between the original gray levels and this surface with an adaptive threshold to label each pixel as either text or background [16]. To estimate the background surface, he used Sauvola's binarization algorithm to roughly extract the text pixels and, for them, calculated the background surface by interpolating the intensities of neighboring background pixels. For the other pixels, the background surface is set to the gray level of the original image. In this algorithm, the average distance between foreground and
Fig. 1 Block diagram of the proposed binarization algorithm
background, the average background value of the background surface, and three predetermined parameters are used to calculate the adaptive threshold. It yields satisfactory results for various types of degraded images. However, for document images with uneven background, the parameters should be adapted to obtain better performance. A stroke-model-based method uses two attributes of a text pixel to extract characters [14]: (i) its gray level is lower than that of its neighbors, and (ii) it belongs to a thin connected component with a width less than the stroke width. Based on these two attributes, the gray-scale image is mapped into a double-edge feature image. This mapping increases the separation of text and background pixels, and a global thresholding followed by a reliable post-processing extracts the text. In the double-edge image, the separability of text and background depends on the contrast of the text in the original image. Global thresholding of the double-edge image is not suitable for images with variable foreground and background intensities, where the low-contrast texts are missed.
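As a concrete reference for the two surveyed thresholds, Eqs. (1) and (2) can be sketched with plain numpy. The odd window size and the 0 = text / 1 = background coding are our illustrative choices, not part of the original formulations:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_stats(img, size):
    """Local mean and standard deviation over a size x size window
    (size must be odd; the image is edge-padded to keep the shape)."""
    pad = size // 2
    p = np.pad(np.asarray(img, float), pad, mode="edge")
    w = sliding_window_view(p, (size, size))
    return w.mean(axis=(-2, -1)), w.std(axis=(-2, -1))

def niblack(img, size=15, k=-0.2):
    """Eq. (1): T = m + k*s. Pixels darker than T become text (0)."""
    m, s = local_stats(img, size)
    return np.where(img < m + k * s, 0, 1)

def sauvola(img, size=15, k=0.5, R=128.0):
    """Eq. (2): T = m * (1 - k*(1 - s/R))."""
    m, s = local_stats(img, size)
    return np.where(img < m * (1.0 - k * (1.0 - s / R)), 0, 1)
```

Note that the paper later applies Niblack with a 60 × 60 window (Sect. 3.3.1); this sketch keeps the window odd for simplicity.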
3 Proposed binarization algorithm

The diagram of the proposed binarization algorithm is illustrated in Fig. 1. The algorithm consists of feature extraction, feature space partitioning, partition classification, and pixel classification stages. The details of each stage are described in this section.

3.1 Feature extraction

Feature extraction is one of the most important steps in pattern recognition applications. Mapping the objects into an appropriate feature space leads to simple and accurate classification or clustering algorithms. Therefore, we try to map the document image into a feature space in which the text and background pixels are separable. We propose a new feature named structural contrast and use it together with the gray level. For document binarization, the most powerful features are those that take into account the structural characteristics of the characters. The stroke width is an important structural characteristic that helps us to extract reliable features. In the logical-level technique [9], based on the stroke width and gray
level of the image, eight logical values are generated for each pixel. These values are placed in a logical rule to classify each pixel as either text or background. In this work, we modify the logical-level technique to extract the structural contrast. Since this feature takes into account the structural characteristics of the text, it increases the discrimination of text and non-text pixels.

Fig. 2 The neighboring pixels utilized to extract structural contrast

Suppose we want to extract the structural contrast, SC, at pixel (x0, y0) shown in Fig. 2. It is defined as follows:

SC(x0, y0) = max_{k=0,...,3} min(M_SW(p_k), M_SW(p_{k+1}), M_SW(p′_k), M_SW(p′_{k+1})) − G(x0, y0)
M_SW(p_k) = [Σ_{i=−SW}^{SW} Σ_{j=−SW}^{SW} G(p_kx − i, p_ky − j)] / (2 SW + 1)^2   (4)

where G(x0, y0) represents the gray level of pixel (x0, y0), (p_kx, p_ky) are the coordinates of p_k, p′_i = p_{(i+4) mod 8}, and SW denotes the stroke width of the characters, determined automatically. We use the structural contrast as the first feature in our application. Figure 3 shows a degraded image and its corresponding structural contrast feature. In conventional histogram-based binarization algorithms, the gray level of each pixel is used as the feature. Although, in degraded document images, this feature alone cannot separate the text and background pixels, it contains valuable information, and using it beside the structural contrast makes the text pixels more separable from the background. Therefore, we use the gray level as the second feature in this work. The pixel (x, y) is mapped into the feature space f = [f1, f2], where f1 = SC(x, y) and f2 = G(x, y).
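A sketch of Eq. (4) follows. The eight probes p0..p7 surround each pixel (Fig. 2); we place them at distance d = SW + 1, which is our assumption about the exact geometry of the figure, and take p′ as the diametrically opposite probe:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def structural_contrast(gray, sw):
    """Sketch of Eq. (4). M_SW(p) is the mean gray level in a
    (2*SW+1)^2 window centred on p; SC is the largest of the four
    min(M(p_k), M(p_k+1), M(p'_k), M(p'_k+1)) values minus the centre
    gray level, so dark strokes on light background score high."""
    g = np.asarray(gray, float)
    h, w = g.shape
    d = sw + 1                      # probe distance (our assumption)
    p = np.pad(g, sw, mode="edge")
    msw = sliding_window_view(p, (2 * sw + 1, 2 * sw + 1)).mean(axis=(-2, -1))
    pm = np.pad(msw, d, mode="edge")
    # probe offsets p0..p7, clockwise around the centre pixel
    offs = [(0, d), (d, d), (d, 0), (d, -d), (0, -d), (-d, -d), (-d, 0), (-d, d)]
    M = [pm[d + dy:d + dy + h, d + dx:d + dx + w] for dy, dx in offs]
    sc = np.full((h, w), -np.inf)
    for k in range(4):              # p'_i = p_{(i+4) mod 8}
        cand = np.minimum.reduce([M[k], M[k + 1], M[(k + 4) % 8], M[(k + 5) % 8]])
        sc = np.maximum(sc, cand)
    return sc - g
```

On a uniform background all probe means equal the centre gray level, so SC is zero there and large on stroke pixels.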
Fig. 3 a Original image, b structural contrast feature image
Fig. 4 a A typical degraded image, b some pixels of the image mapped into the 2D feature space
3.1.1 Finding the stroke width

Stroke width, SW, is a useful characteristic of the text and is used to extract the feature SC. We use our previous method [20] to find SW automatically. This method computes the histogram of the distances between two successive edge pixels in each horizontal scan line. Suppose that h(d) is a one-dimensional array denoting the distance histogram, with d ∈ {2, . . . , L}, where L is the maximum distance to be counted. SW is defined as the distance d with the highest value in h(d). This method finds SW in degraded images accurately enough for our requirements: its tolerance in finding SW is less than 20%, whereas we observed experimentally that our binarization algorithm works efficiently if SW is found within a 40% tolerance. The details of this experiment are given in Appendix 1.
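The distance-histogram estimator can be sketched as follows; the simple gradient-based edge detector here is our stand-in for the edge map actually used in [20]:

```python
import numpy as np

def stroke_width(gray, edge_thresh=50, max_dist=20):
    """Sketch of the SW estimator [20]: build a histogram of distances
    between successive edge pixels on each horizontal scan line and
    return the most frequent distance d in 2..max_dist."""
    g = np.asarray(gray, float)
    edges = np.abs(np.diff(g, axis=1)) > edge_thresh   # column-wise edges
    h = np.zeros(max_dist + 1, dtype=int)
    for row in edges:
        cols = np.flatnonzero(row)
        for d in np.diff(cols):
            if 2 <= d <= max_dist:
                h[d] += 1
    return int(np.argmax(h)) if h.any() else 0
```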
3.1.2 Discrimination power of the 2D feature space

To illustrate the discrimination power of the proposed features, we map some randomly chosen pixels of the typical degraded image shown in Fig. 4a into the 2D feature space and show them in Fig. 4b.

3.2 Feature space partitioning

In pattern recognition applications, clustering algorithms are used to group similar objects. In our work, we encounter a large number of objects; for example, an image of 2,000 × 3,000 pixels contains 6,000,000 objects in the feature space. Clustering such a large number of objects is not a trivial task and is very time consuming. Instead of clustering the objects, we partition the feature space using the mode-association clustering technique [21]. Our partitioning algorithm consists of the following stages. Histogram estimation: the feature space is divided into bins of equal size, and the number of objects inside each bin is counted. In the 2D feature space, we define a 1 × 1 square as a bin. Mode-association clustering: starting from any point in the feature space, we use a hill-climbing algorithm to find the local maxima of the histogram. The points that climb to the same local maximum are grouped into one partition or region. This algorithm partitions the feature space R = {f | H(f) > 0} into N small regions such that R = ∪_{i=1}^{N} R_i, where R_i = {f | f climbs to the ith local maximum} and H(f) denotes the number of pixels in bin f. Figure 5 shows an example of the resulting regions in the feature space.
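The mode-association partitioning, including the 5 × 5 search window and the path labeling detailed in Sect. 3.2.2, can be sketched as follows (the strictly-increasing climb rule is our tie-breaking choice):

```python
import numpy as np

def partition_feature_space(hist, win=2):
    """Every non-empty bin of the 2D histogram hill-climbs to a local
    maximum; bins that reach the same maximum form one region. win=2
    gives the 5x5 search window of Sect. 3.2.2. Returns region ids
    (-1 for empty bins)."""
    H, W = hist.shape
    region = -np.ones((H, W), dtype=int)
    modes = {}                          # local-maximum bin -> region id
    for y, x in zip(*np.nonzero(hist)):
        path, cy, cx = [], y, x
        while region[cy, cx] == -1:
            path.append((cy, cx))
            y0, y1 = max(cy - win, 0), min(cy + win + 1, H)
            x0, x1 = max(cx - win, 0), min(cx + win + 1, W)
            sub = hist[y0:y1, x0:x1]
            ny, nx = np.unravel_index(np.argmax(sub), sub.shape)
            ny, nx = y0 + ny, x0 + nx
            if hist[ny, nx] <= hist[cy, cx]:   # no higher bin: local maximum
                rid = modes.setdefault((cy, cx), len(modes))
                break
            cy, cx = ny, nx
        else:
            rid = region[cy, cx]               # hit an already-labelled bin
        for p in path:                          # label the whole climb path
            region[p] = rid
    return region
```

Because every bin on a climb path is labeled, each bin is visited essentially once, matching the complexity argument of Sect. 3.2.2.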
Fig. 5 Partitioning of feature space into small regions

3.2.1 Purity of the regions

In the proposed binarization algorithm, we partition the feature space into many small regions and classify those regions. Text and background pixels inside a wrongly classified region are misclassified in the final binarization stage. Therefore, the purity of the regions is important in our application. Like other clustering algorithms, mode-association clustering does not guarantee that all the objects assigned to a specific cluster belong to the same class. However, for the following reasons, it results in relatively pure regions. (i) Text and background pixels are dense in the regions of the feature space associated with their classes and sparse in the regions that lie between these classes. In most cases, there are clear gaps between the text and background classes in the feature space. During clustering, each point gradually climbs to a local maximum of the correct class and does not jump across the gaps to an incorrect region. (ii) In this clustering algorithm, the number of clusters is not predetermined; instead, it is determined by the distribution of the pixels in the feature space. This leads to a large number of small regions with high purity. To examine our algorithm, we used 30 document images from three datasets and their ground truth to measure the purity of the regions experimentally. We first partitioned the feature space of each image into small regions and then counted the number of text and background pixels in each region. The purities of the text and background regions are measured as follows:

PTR = [Σ_{i∈text regions} Nt(i) / Σ_{i∈text regions} (Nt(i) + Nb(i))] × 100
PBR = [Σ_{i∈background regions} Nb(i) / Σ_{i∈background regions} (Nt(i) + Nb(i))] × 100   (5)

where Nt(i) and Nb(i) denote the number of text and background pixels inside the ith region. PTR and PBR represent the purities of the text and background regions, respectively. This experiment yielded PTR = 95.7% and PBR = 99.2%, showing acceptable purity. Since manual labeling of the boundary pixels of the text is subjective, some differences between the ground truth and the results of our algorithm are negligible.

3.2.2 Complexity of clustering
In our method, the complexity of the clustering algorithm depends on the size of the feature space rather than the number of pixels. To give a measure of complexity, we briefly explain the implementation of the clustering algorithm. We start from an arbitrary point in the feature space and place a 5 × 5 window at this point. The maximum value of the histogram inside this window is found, and the search window is moved to it. This process continues until a local maximum is reached. The starting point, the local maximum, and all points on the path between them are labeled and considered as one group. Then, we repeat this process from another point not yet labeled. However, we terminate the search when we reach either a new local maximum or a previously labeled point. If we reach a labeled point, the starting point and all points on the path receive the same label as the destination point. Otherwise, the starting point, the new local maximum, and the points on the path between them are labeled as a new group. This process continues until all points in the feature space are labeled. Using this method, the search window is placed on each point only once. We need 25 comparisons to find the maximum point inside each search window and one comparison to determine whether the central point has already been labeled. Our feature space has 256 × 256 points, so the number of comparisons is 256 × 256 × 26. For an image of 2,000 × 3,000 pixels, this amounts to about 0.28 comparisons per pixel. Feature space partitioning therefore consumes only a small portion of the time needed to binarize an image.

3.3 Region classification

After partitioning the feature space into small regions, we use the result of an auxiliary binarization algorithm to classify each region as either text or background. In this way, an auxiliary binary image is produced to generate the primary labels of the pixels.
To classify region Ri, the text and background pixels of the auxiliary binary image whose feature vectors lie in Ri are counted. Suppose Nab(Ri) and Nat(Ri) are the numbers of pixels in Ri labeled as background and text in the
Fig. 6 a Percentage of background pixels in the regions labeled as background, b percentage of text pixels in the regions labeled as text
auxiliary image, respectively. Ri is classified as follows:

Class(Ri) = text if Nat(Ri) > Nab(Ri), background otherwise   (6)

where Class(Ri) represents the class of Ri. Although this method accurately classifies the partitions, using it to classify single points in the feature space leads to classification errors, because the primary labels contain errors. This is why we first partition the feature space and then classify the resulting regions instead of classifying all single points in the feature space.

3.3.1 Choosing the auxiliary binarization algorithm

Our region classification method performs well when the auxiliary binarization algorithm correctly labels more than 50% of the pixels inside each region. We found that Niblack's method is appropriate for this purpose. As mentioned in Sect. 2, Niblack's algorithm extracts the text objects effectively, but it classifies some parts of the background as text. The classification of background pixels depends on the following conditions.

1. Both background and text pixels inside the sliding window are used to compute the local threshold.
2. Only background pixels exist inside the sliding window.
In the first condition, almost all the background pixels inside the sliding window are classified correctly. In the second condition, in most cases, the number of correctly classified background pixels is larger than the number of falsely classified ones. Therefore, more than 50% of the background pixels inside a small area of the image are classified correctly. The background pixels inside each small area are similar and are expected to map into the same region of the feature space. Therefore, it is likely that more than 50% of the pixels inside each region of the feature space are classified correctly if the sliding window is small enough to maintain the local attributes of the image. To justify this, we carried out the following experiment. We applied our partitioning algorithm to 30 document images and classified the small regions in the feature space manually. Then, we measured the percentage of pixels correctly classified by Niblack's algorithm in each region as follows:

TP = Nat(Ri) / (Nat(Ri) + Nab(Ri)), for class(Ri) = text
TN = Nab(Ri) / (Nat(Ri) + Nab(Ri)), for class(Ri) = background   (7)

where TP and TN are calculated for the text and background regions, respectively. The results of this experiment showed that Niblack's algorithm correctly classifies more than 50% of the pixels that lie in the same region (Fig. 6). Therefore, this algorithm satisfies our requirement. In our work, a very large window in Niblack's algorithm fails to eliminate some smudges, while a very small window may miss some large characters. We used a 60 × 60 sliding window, which is appropriate for a wide range of character sizes. We use Niblack's algorithm to generate the primary labels and apply our region classification algorithm to separate the regions into text and background. Figure 7 illustrates the result of our region classification algorithm applied to the regions in Fig. 5.

3.4 Final binarization
We use the classification results of the regions to binarize the document image. Suppose G(x, y) is mapped into [f1, f2] in the feature space, where [f1, f2] ∈ Ri. The binary image B(x, y) is obtained as follows:

B(x, y) = 0 if class(Ri) = text, 1 if class(Ri) = background   (8)

In this way, we obtain a binarization algorithm that can deal with document images suffering from uneven background, shadows, non-uniform illumination, and low contrast.
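Eqs. (6) and (8) amount to a majority vote per feature-space region followed by a lookup per pixel. A sketch, assuming every pixel falls into a region with a non-negative id:

```python
import numpy as np

def classify_and_binarize(f1, f2, region, auxiliary):
    """f1, f2: integer feature images (structural contrast, gray level);
    region: region-id array over the feature space; auxiliary: primary
    labels from Niblack's algorithm (0 = text, 1 = background).
    Each region is voted text/background by the auxiliary labels of the
    pixels falling into it (Eq. 6); each pixel then inherits its
    region's class (Eq. 8)."""
    rid = region[f1, f2]                        # region id of every pixel
    n = region.max() + 1
    n_text = np.bincount(rid[auxiliary == 0], minlength=n)
    n_bg = np.bincount(rid[auxiliary == 1], minlength=n)
    is_text = n_text > n_bg                     # Eq. (6), majority vote
    return np.where(is_text[rid], 0, 1)         # Eq. (8)
```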
Fig. 7 Feature space classification result for Fig. 5
Partitioning the feature space into small regions and classifying them, rather than directly dividing the feature space into text and background regions, is the main reason for the success of our algorithm. The binarization result of our algorithm applied to the image in Fig. 4a is shown in Fig. 8.

Fig. 8 Result of our algorithm applied to the image shown in Fig. 4a

4 Experimental results

We carried out a comparative study to evaluate the performance of our algorithm. In this experiment, Otsu's global thresholding [1], Niblack's local binarization [6], Sauvola's local binarization [13], Oh's algorithm [15], Ye's algorithm [14], and Gatos' method [16] were implemented, and their results were used as benchmarks. For Sauvola's algorithm, as recommended in [22], we set the parameter k to 0.34; this value yields better results in our experiments. Both visual and quantitative criteria were used in our evaluation. We used three datasets of document images. Each dataset corresponds to specific degradations and demonstrates the efficiency of the different algorithms in dealing with them. Dataset I is the ICDAR 2009 dataset [23], which includes 10 historical document images suffering from degradations such as smear, smudge, variable foreground and background intensities, and bleeding through. In this dataset, most of the degradations are due to aging. Dataset II includes 10 document images selected from the Media Team dataset [24]. The images of this dataset have degradations such as textured background, shadow-through, variable foreground and background intensities, and low-contrast text. These images are not severely degraded, and most binarization algorithms can binarize them with small errors. Dataset III contains 10 document images we captured under bad illumination conditions. The images in this dataset suffer from severely variable foreground and background intensities and low contrast due to the image acquisition conditions. Binarization of these images is not a trivial task, and most algorithms fail to binarize them properly.

4.1 Visual evaluation

We applied the binarization algorithms to the document images of the different datasets and visually compared the results. These experiments showed that, compared with the six well-known algorithms, our algorithm yields superior performance on most document images.

4.1.1 Experiment 1
In this experiment, the evaluation is carried out on the images of Dataset I. Figure 9 shows two examples of this experiment. The evaluation results are summarized as follows. Otsu's algorithm regards some background regions as text and misses some texts; it yields the worst results. Niblack's algorithm retains all texts but labels a large number of noisy blobs as text. Oh's algorithm misses the low-contrast texts. Sauvola's algorithm labels some parts of the background as text. Gatos' and Ye's algorithms result in binary images with small degradations. Our algorithm yields the best binary images in most cases. Only in the case of severe bleeding-through degradation, our algorithm does not eliminate some bleeding-through texts, and Sauvola's algorithm outperforms ours. The binarization results of our algorithm for the 8 remaining images of Dataset I are given in Appendix 2.

4.1.2 Experiment 2

In this experiment, Dataset II is used for evaluation. The images in this dataset are not highly degraded; therefore, most binarization algorithms can produce clean images with small degradations. We visually compared the results. Figure 10 shows a document image with shadow-through degradation and the binarization results. Niblack's and Oh's algorithms are not efficient in dealing with this degradation and label many noisy blobs as text. Although other
Fig. 9 Binarization results of two typical historical images a Original images; b Otsu; c Sauvola; d Niblack; e Oh; f Ye; g Gatos; h our method
evaluated algorithms correct the shadow-through, they miss some details. For example, they fill the small holes of characters ‘a’ and ‘e’ in the word “Telefax”, magnified for better visualization. Our algorithm corrects this degradation while retaining the details.
4.1.3 Experiment 3

In this experiment, we used the images of Dataset III. These images contain degradations such as blurring, low-contrast text, and severely variable foreground and background intensities. Figure 11 shows a low-contrast and blurred image and the binarization results of the different algorithms. Using Otsu's algorithm, the text is properly extracted from the background; however, it yields broken and touching characters in some parts of the image. Sauvola's algorithm gives a binary image in which many characters are missed and some are broken. Niblack's algorithm extracts many background blobs as text. In addition, it yields the highest number of touching characters. Oh's and Gatos' algorithms result in similar binary images with a large number of broken characters. Ye's algorithm and ours produce clean binary images.
Figure 12 shows a document image with non-uniform illumination and the binarization results of the different algorithms. Otsu's algorithm cannot extract the text from the background because the gray levels of the text overlap with those of the background. Sauvola's algorithm assumes that the gray levels of text and background are close to 0 and 255, respectively; low-contrast texts with gray levels close to 255 do not satisfy this assumption and are missed or broken up. Niblack's algorithm labels many background blobs as text. Oh's, Ye's, and Gatos' algorithms miss the text in the shadow area. Our algorithm produces a clean binary image. Dealing with document images captured with a camera, our algorithm substantially outperforms the other algorithms evaluated.
4.2 Quantitative evaluation by F-measure

In this experiment, we used the F-measure to compare the performance of the binarization algorithms. This criterion was used in the first document image binarization contest [25]. It is defined as follows:

F-measure = (2 × Recall × Precision) / (Recall + Precision)   (9)
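A direct pixel-count implementation of Eq. (9), taking 0 as the text label as in Eq. (8), with Recall and Precision derived from the TP/FP/FN counts defined below:

```python
import numpy as np

def f_measure(binary, truth):
    """Eq. (9) from pixel counts; 0 = text in both images."""
    tp = np.sum((binary == 0) & (truth == 0))   # true positives
    fp = np.sum((binary == 0) & (truth == 1))   # false positives
    fn = np.sum((binary == 1) & (truth == 0))   # false negatives
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)
```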
Fig. 10 Binarization results of a document image with shadow-through degradation a Original image; b Otsu; c Sauvola; d Niblack; e Oh; f Ye; g Gatos; h our method. The word “Telefax” is magnified for better visualization
Fig. 11 Binarization results of a low-contrast and blurred image a Original image; b Otsu; c Sauvola; d Niblack; e Oh; f Ye; g Gatos; h our method

where Recall = TP/(TP + FN), Precision = TP/(TP + FP), and TP, FP, and FN denote the true-positive, false-positive, and false-negative counts, respectively. We used all three datasets for evaluation. The ground truth of Dataset I is available from [23], and the ground truths of the two other datasets were manually
generated. The comparison of the 7 algorithms is given in Table 1. Our algorithm yields the best results for all datasets. Comparing the results of our algorithm on Dataset I with those of the algorithms that participated in DIBCO 2009 shows that our
123
M. Valizadeh, E. Kabir
Fig. 12 Binarization results of a document image with non-uniform illumination a Original image; b Otsu; c Sauvola; d Niblack; e Oh; f Ye; g Gatos; h our method Table 1 Comparison of F-measure for 7 algorithms
Method Dataset
Gatos [16]
Ye [14]
Niblack [6]
Sauvola [13]
Otsu [1]
Oh [15]
Ours
Dataset I
0.873
0.871
0.566
0.854
0.798
0.597
0.903
Dataset II
0.868
0.885
0.567
0.746
0.734
0.712
0.889
Dataset III
0.871
0.832
0.553
0.811
0.391
0.632
0.918
algorithms stands in the second step after the winner based on the F-measure. 4.3 Evaluation based on quality of text extraction We used the following five criteria to compare the binarization algorithms quantitatively. The criteria 1, 2, and 3 are chosen from reference [26], and the others are proposed by us.
1. Broken characters: characters with undesirable gaps, so that one character appears as two or more connected components.
2. Connected characters: characters that touch adjacent ones, so that two or more characters form one connected component.
3. Deformed characters: characters that are binarized so poorly that they are unrecognizable.
4. Missing characters: characters that cannot be extracted from the binary image.
Table 2 Evaluation results of binarization algorithms based on five criteria

Criterion                     Gatos [16]  Ye [14]  Niblack [6]  Sauvola [13]  Otsu [1]  Oh [15]  Ours
Broken characters             209         148      347          198           346       69       0
Connected characters          19          57       613          85            97        43       73
Deformed characters           216         218      84           245           136       142      49
Missing characters            519         617      0            674           2117      384      8
Blobs in background region    4           0        Too many     35            24        0        5
5. Blobs in background region: false objects extracted as text.

We applied the different binarization algorithms to the images of Dataset III and measured these criteria manually. The test images contain 6,304 characters in total. The results are given in Table 2. Our observations are as follows:

• Otsu’s algorithm is not suitable for images with an uneven background (e.g., Fig. 12). With this algorithm, many faint characters, and characters on a dark background, are missed, and some faint characters become broken.
• Niblack’s method does not miss text, but it yields many touching characters in low-contrast images. In addition, it labels a large number of background blobs as text.
• Sauvola’s method eliminates the background properly but, in low-contrast document images, breaks up faint characters or removes them entirely.
• Gatos’ algorithm uses three predetermined parameters and two global attributes of the document image to calculate the adaptive threshold. It misses or breaks up some characters in badly illuminated images if its parameters are not adapted to these types of images.
• Ye’s algorithm maps the original image into the double-edge feature image and separates text from background by global thresholding. Therefore, when the contrast of the text varies across the image, low-contrast characters are missed.
• Oh’s algorithm uses a global threshold to extract the regions of interest, ROIs, in which rain falls. Some ROIs are missed, and the corresponding text is lost. In addition, when the variation of gray levels inside a character is high, the algorithm breaks it up. This algorithm is not appropriate for images with an uneven background.
• Our algorithm maps the pixels into an appropriate feature space and classifies this space using the result of Niblack’s algorithm. It produces clean binary images for most of the test images.
However, under severe bleed-through degradation, some spurious components are labeled as text.
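Although these criteria were measured manually, the first two are grounded in connected-component counts: a broken character appears as two or more components, while touching characters merge into one. A small flood-fill counter (the choice of 8-connectivity is ours) illustrates the idea:

```python
import numpy as np
from collections import deque

def count_components(img):
    """Count 8-connected components of text pixels (value 1) via BFS flood fill."""
    h, w = img.shape
    seen = np.zeros((h, w), dtype=bool)
    n = 0
    for i in range(h):
        for j in range(w):
            if img[i, j] == 1 and not seen[i, j]:
                n += 1
                seen[i, j] = True
                q = deque([(i, j)])
                while q:
                    y, x = q.popleft()
                    # Visit all 8 neighbors of the current pixel
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny, nx] == 1 and not seen[ny, nx]):
                                seen[ny, nx] = True
                                q.append((ny, nx))
    return n

# An intact stroke is one component; a broken stroke splits into two.
intact = np.array([[1, 1, 1, 1, 1]])
broken = np.array([[1, 1, 0, 1, 1]])
print(count_components(intact), count_components(broken))  # 1 2
```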
5 Conclusion

In this paper, we proposed a new algorithm for the binarization of document images. We presented a novel feature, the structural contrast, and mapped the document image into a 2D feature space in which the text and background pixels are separable. We then partitioned this space into small regions, which were classified as text or background using the result of Niblack’s algorithm. Each pixel is then classified as either text or background based on its location in the feature space. Our binarization algorithm does not need any parameter setting by the user and is appropriate for various types of degraded images. The main advantage of our algorithm over previous classification-based algorithms is that it does not need any training dataset, relying instead on the result of Niblack’s algorithm. In different experiments, the proposed algorithm demonstrated superior performance against six well-known algorithms on three datasets of degraded images.

Acknowledgments This work was supported in part by the Iran Telecommunication Research Center, ITRC, under contract T500-4683.
Appendix 1

In our work, the stroke width, SW, is used as a structural characteristic of the text, and the accuracy of its value affects the efficiency of the feature SC. It is therefore important to determine how sensitive the binarization results are to errors in the estimation of SW. We carried out the following experiment to measure the quality of the binarization for different values of this error. We manually determined the SW of the text for 30 document images and used [α × SW] in our binarization algorithm instead of the automatic estimate of SW, where 0.2 ≤ α ≤ 1.8 and [A] denotes the rounded value of A. The F-measure was calculated for different values of α, as shown in Fig. 13. The F-measure varies little for α ≥ 0.6 and decreases rapidly for α < 0.6. Visual inspection of the resulting binary images confirmed this: for 0.6 ≤ α ≤ 1.5, the binary images have almost the same quality. However, setting α < 0.6 breaks
or misses some low-contrast characters, and setting α > 1.5 fills small holes in characters and labels some small smudges in the images as text. Therefore, the efficiency of our algorithm decreases only if the stroke width estimate is off by more than approximately 40%.
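The F-measure curve in Fig. 13 is computed from Eq. (9); given a binary result and its ground truth, the calculation takes only a few array operations. A minimal sketch (encoding text pixels as 1 is our assumption, not a convention from the paper):

```python
import numpy as np

def f_measure(binary, truth):
    """F-measure of a binary image against ground truth (text pixels = 1)."""
    tp = np.sum((binary == 1) & (truth == 1))  # true positives
    fp = np.sum((binary == 1) & (truth == 0))  # false positives
    fn = np.sum((binary == 0) & (truth == 1))  # false negatives
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * recall * precision / (recall + precision)

# Toy 2x4 example: TP = 3, FP = 1, FN = 1 -> Recall = Precision = 0.75
truth  = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0]])
binary = np.array([[1, 1, 1, 0],
                   [1, 0, 0, 0]])
print(f_measure(binary, truth))  # 0.75
```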
Fig. 13 Comparison of F-measure for different values of α

Appendix 2
To give the reader a better sense of the performance of our algorithm, we illustrate its results on the 8 remaining images of the ICDAR 2009 dataset in Fig. 14. Although SC is sensitive to character size, in these images
Fig. 14 Binarization results of our algorithm on the images of Dataset I: (a)–(h) show binary images for H01–H03 and P01–P05, respectively
our algorithm yields satisfactory results even for documents containing text with different font sizes, as in Fig. 14e and f. In these images, the gray level discriminates the text from the background. Therefore, our algorithm can handle these cases even though SC is not a good feature for them.
References

1. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybernet. 9, 62–66 (1979)
2. Kapur, J.N., Sahoo, P.K., Wong, A.K.C.: A new method for gray-level picture thresholding using the entropy of the histogram. Comput. Vis. Graph. Image Process. 29, 273–285 (1985)
3. Weszka, J.S., Rosenfeld, A.: Histogram modification for threshold selection. IEEE Trans. Syst. Man Cybernet. 9, 38–52 (1979)
4. Dawoud, A., Kamel, M.S.: Iterative multimodel subimage binarization for handwritten character segmentation. IEEE Trans. Image Process. 13, 1223–1230 (2004)
5. Liu, Y., Srihari, S.N.: Document image binarization based on texture features. IEEE Trans. Pattern Anal. Mach. Intell. 19, 540–544 (1997)
6. Niblack, W.: An Introduction to Digital Image Processing. Prentice Hall, Englewood Cliffs, NJ, USA (1986)
7. White, J.M., Rohrer, G.D.: Image thresholding for optical character recognition and other applications requiring character image extraction. IBM J. Res. Dev. 27, 400–411 (1983)
8. Parker, J.R.: Gray level thresholding in badly illuminated images. IEEE Trans. Pattern Anal. Mach. Intell. 13, 813–819 (1991)
9. Kamel, M., Zhao, A.: Extraction of binary character/graphics images from grayscale document images. Graph. Models Image Process. 55, 203–217 (1993)
10. Bernsen, J.: Dynamic thresholding of grey-level images. In: Proceedings of the 8th International Conference on Pattern Recognition, Paris, pp. 1251–1255 (1986)
11. Chen, Y., Leedham, G.: Decompose algorithm for thresholding degraded historical document images. IEE Proc. Vis. Image Signal Process. 152, 702–714 (2005)
12. Yang, Y., Yan, H.: An adaptive logical method for binarization of degraded document images. Pattern Recognit. 33, 787–807 (2000)
13. Sauvola, J., Pietikainen, M.: Adaptive document image binarization. Pattern Recognit. 33, 225–236 (2000)
14. Ye, X., Cheriet, M., Suen, C.Y.: Stroke-model-based character extraction from gray-level document images. IEEE Trans. Image Process. 10, 1152–1161 (2001)
15. Oh, H.H., Lim, K.T., Chien, S.I.: An improved binarization algorithm based on a water flow model for document images with inhomogeneous backgrounds. Pattern Recognit. 38, 2612–2625 (2005)
16. Gatos, B., Pratikakis, I., Perantonis, S.J.: Adaptive degraded document image binarization. Pattern Recognit. 39, 317–327 (2006)
17. Lu, S., Tan, C.L.: Binarization of badly illuminated document images through shading estimation and compensation. In: 9th International Conference on Document Analysis and Recognition, Brazil, pp. 312–316 (2007)
18. Huang, S., Ahmadi, M., Sid-Ahmed, M.A.: A hidden Markov model-based character extraction method. Pattern Recognit. 41, 2890–2900 (2008)
19. Chou, C.H., Lin, W.H., Chang, F.: A binarization method with learning-built rules for document images produced by cameras. Pattern Recognit. 43, 1518–1530 (2010)
20. Valizadeh, M., Komeili, M., Armanfard, N., Kabir, E.: Degraded document image binarization based on combination of two complementary algorithms. In: International Conference on Advances in Computational Tools for Engineering Applications, Lebanon, pp. 595–599 (2009)
21. Li, J., Ray, S., Lindsay, B.G.: A nonparametric statistical approach to clustering via mode identification. J. Mach. Learn. Res. 8, 1687–1723 (2007)
22. Badekas, E., Papamarkos, N.: Automatic evaluation of document binarization results. In: Proceedings of the 10th Iberoamerican Congress on Pattern Recognition, Havana, pp. 1005–1014 (2005)
23. First international document image binarization contest (http://users.iit.demokritos.gr/~bgat/DIBCO2009/benchmark/)
24. Media Team Oulu Document database (http://www.mediateam.oulu.fi/MTDB/)
25. Gatos, B., Ntirogiannis, K., Pratikakis, I.: ICDAR 2009 document image binarization contest (DIBCO 2009). In: 10th International Conference on Document Analysis and Recognition, Spain, pp. 1375–1382 (2009)
26. Solihin, Y., Leedham, C.G.: Integral ratio: a new class of global thresholding techniques for handwriting images. IEEE Trans. Pattern Anal. Mach. Intell. 21, 761–768 (1999)