Self Learning Classification for Degraded Document Images by Sparse Representation

Bolan Su1, Shuangxuan Tian1, Shijian Lu2, Thien Anh Dinh1 and Chew Lim Tan1
1 School of Computing, National University of Singapore, Computing 1, 13 Computing Drive, Singapore 117417
Email: {subolan,tians,thienanh,tancl}@comp.nus.edu.sg
2 Institute for Infocomm Research, 1 Fusionopolis Way, 21-01 Connexis, Singapore 138632
Email:
[email protected]

Abstract—Document image binarization is a technique to segment text out from the background region of a document image, which is a challenging task due to high intensity variations of the document foreground and background. Recently, a series of document image binarization contests (DIBCOs) have been held that have drawn great research interest in this area. Several document binarization techniques have been proposed that achieve great performance on the contest datasets. However, those techniques may not perform well on all kinds of degraded document images, because it is difficult to design a classification method that correctly models the non-uniform degraded document background and text foreground simultaneously. In this paper, we propose a self learning classification framework that combines the binary outputs of different binarization methods. The proposed framework makes use of sparse representation to re-classify the document pixels and produces better binary results. Experimental results on the recent DIBCO contests show the strong performance and robustness of our proposed framework on different kinds of degraded document images.
I. INTRODUCTION

Document image binarization is usually performed in the document preprocessing stage. It converts a gray scale document image into a binary version that contains only the text pixels. It is an important preprocessing step for many document image processing tasks such as optical character recognition and document image retrieval.

Many document binarization methods [1] have been reported in the literature. Generally speaking, assigning a global threshold to the whole document image based on intensity is straightforward, because in clean document images the document text usually lies at different intensity levels from the document background. These global thresholding methods [2] are widely used in many document analysis techniques for their simplicity and efficiency. However, this clear bimodal pattern does not exist in many degraded document images due to the intensity variation of the foreground text and document background. Therefore, local thresholding methods [3] that make use of the mean and standard deviation of the intensities within a local region have been proposed.
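To make the contrast between global and local thresholding concrete, the sketch below computes a Sauvola-style local threshold T(x, y) = m(x, y) * (1 + k * (s(x, y) / R - 1)) from the local mean m and standard deviation s; it is a minimal illustration, and the window size w, sensitivity k and dynamic range R below are illustrative defaults rather than the settings of [3].

```python
# A minimal sketch of Sauvola-style local thresholding, assuming a gray scale
# image in [0, 255]. Window size w, sensitivity k and dynamic range R are
# illustrative defaults, not the values used in [3].
import numpy as np
from scipy.ndimage import uniform_filter

def sauvola_binarize(image, w=25, k=0.3, R=128.0):
    img = image.astype(np.float64)
    mean = uniform_filter(img, size=w)                 # local mean m(x, y)
    sq_mean = uniform_filter(img * img, size=w)        # local mean of squares
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0))  # local standard deviation s(x, y)
    threshold = mean * (1.0 + k * (std / R - 1.0))     # local threshold T(x, y)
    # Text pixels (darker than the local threshold) -> 0, background -> 1.
    return (img > threshold).astype(np.uint8)
```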
Although these local thresholding methods are usually better approaches for binarization of degraded document images without uniform background and foreground intensity distributions, they may not perform well on low-contrast document image regions. Another drawback of these approaches is that the thresholding performance depends on the local window size, which is hard to choose. Several approaches have been developed to compensate for non-uniform intensity and contrast, including background subtraction [4], which reduces the intensity and contrast variation by subtracting an estimated document background.

The document image binarization contest (DIBCO) series [5], [6], [7], [8] has shown some recent progress on these issues. The winning algorithm of DIBCO 2009 [5] detects the text strokes by estimating the document background using an iterative polynomial smoothing technique [9]. One of the two tied winners of H-DIBCO 2010 [7] makes use of the local image contrast that is evaluated by the ratio of local maximum and minimum intensities [10]. The other algorithm iteratively expands the foreground text regions based on the local mean intensity of foreground and background pixels [11]. In DIBCO 2011 [6], the best algorithm first divides the document pixels into three groups: foreground, background and uncertain, then re-classifies the uncertain pixels by modeling the foreground and background pixels [12]. Furthermore, the winner of H-DIBCO 2012 [8] binarizes the document image by optimizing a Laplacian energy function [13].

Despite the great performance achieved on the contest datasets, those state-of-the-art techniques may fail to produce good binary results on certain kinds of document images due to the high intensity variation within both the document background and foreground caused by degradation. In this paper, we present a self learning framework that combines different existing document thresholding methods to produce better and more robust binarization results on different kinds of document images. There are a few binarization techniques that make use of the outputs of different thresholding methods.
Figure 1: The overall flowchart of our proposed system. The input document image I is binarized by the participating binarization algorithms; their binary outputs B1 and B2 are combined into a label image L; features F(1...N) are extracted from the image patches and fed to sparse representation classifiers, which estimate sparse coefficient vectors x; the output binary image is then produced.
Majority voting is a simple and straightforward way to combine several binarization results, and is used as an intermediate result in B. Gatos et al.'s work [14]. E. Badekas and N. Papamarkos [15] apply a neural network technique to learn from different binary results. This method takes advantage of different binarization methods; however, it is time consuming and requires some prior knowledge about the participating binarization techniques. Su et al. [16] also proposed a document image binarization framework that combines the outputs of different binarization methods for better binarization results. This framework iteratively re-classifies those uncertain document pixels that are assigned different labels by the participating thresholding techniques. The uncertain pixels are labeled based on their distance to local foreground and background clusters in a predefined feature space. However, the selection of the intermediate combined result may affect the final binarization results.

We propose a self learning classification technique that makes use of sparse representation (SR) to produce better binary results. SR has achieved great success in the face recognition domain. Recently, Wright et al. [17] proposed the sparse representation-based classifier (SRC), which aims at representing the test data by a linear combination of a small number (sparsity) of training data. The objective function can be written as:

\hat{x} = \arg\min_{x} \|x\|_1 \quad \text{subject to } \|Ax - y\|_2 \leq \epsilon \qquad (1)
where A, y, and x denote the training data, test data, and sparse coefficient vector, respectively; \|\cdot\|_1 and \|\cdot\|_2 denote the L1 and L2 norms, and \hat{x} is the desired sparse coefficient vector. This optimization problem can be solved efficiently by L1-regularized optimization methods [18], [19]. Based on \hat{x}, the SRC evaluates the contribution of different classes to the reconstruction of the test data y in order to classify the test data.
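To illustrate how such a sparse coefficient vector can be estimated in practice, the sketch below solves a penalized (Lagrangian) form of Equation 1 with iterative soft-thresholding and then scores classes by their reconstruction residual, in the spirit of SRC [17]. The solver choice, step size, penalty weight and function names are illustrative assumptions, not the optimizer used in [18], [19].

```python
# A minimal sketch of sparse coding and residual-based classification in the
# spirit of SRC [17]. It solves the penalized form
#     min_x 0.5 * ||A x - y||_2^2 + lam * ||x||_1
# with iterative soft-thresholding; the step size, penalty weight `lam` and
# function names are illustrative assumptions.
import numpy as np

def sparse_code(A, y, lam=0.1, n_iter=500, nonneg=False):
    """Estimate a sparse coefficient vector x such that A @ x ~= y."""
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-12)   # 1 / Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                        # gradient of the data term
        z = x - step * grad
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
        if nonneg:                                      # optional non-negative constraint
            x = np.maximum(x, 0.0)
    return x

def classify_by_residual(A, labels, y, x):
    """Assign y to the class whose training columns best reconstruct it."""
    residuals = {}
    for c in np.unique(labels):
        x_c = np.where(labels == c, x, 0.0)             # keep only coefficients of class c
        residuals[c] = np.linalg.norm(A @ x_c - y)
    return min(residuals, key=residuals.get), residuals
```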
Document image binarization can be viewed as assigning a label L ∈ [0, 1] to each pixel of an image I. If we divide the document image into smaller image patches, two similar image patches should be assigned similar labels. Based on such an assumption, we can first assign initial labels (foreground, background or uncertain) to the document image pixels using the outputs of different binarization methods, and then re-classify those uncertain pixels by looking up similar image patches. Inspired by SRC, we can represent an image patch I^p by a linear combination of a series of image patches I^q, q ∈ [1, ..., n], and obtain a sparse coefficient vector x under a non-negative L1 constraint. We then assume that the corresponding label patch L^p of I^p is the same linear combination of the corresponding label patches L^q of the I^q, using the sparse coefficient vector x. One advantage of the proposed framework is that we only need to construct a classifier for a small image patch, without inducing a classification rule for the whole document image. Furthermore, the proposed classifier captures similarity at the image patch level instead of the image pixel level, which provides more information for better classification results. Experiments on the DIBCO 2009 and 2011 datasets [5], [6] show the superior performance and robustness of our proposed technique.

II. PROPOSED METHOD

The overall flowchart of our proposed technique is illustrated in Figure 1. Given an input image I, the binary outputs B are first obtained using the participating binarization algorithms. Then the label image L is generated by combining the binary outputs. The input document image I is further divided into smaller non-overlapping image patches I^p, and features F^p are extracted for each image patch. Likewise, the label image L is divided into corresponding patches L^p. After that, a sparse representation classifier is constructed for each image patch using the other image patches of the input document image, and the sparse coefficient vectors are estimated accordingly. Finally, the output binary image is created by updating the label patches L^p using their corresponding sparse coefficient vectors. The details of each step are explained in the rest of this section.

A. Feature Extraction

First, to construct the label image L, we combine the binary outputs B based on a majority voting strategy. As each image pixel is labeled either 0 (foreground text) or 1 (document background), we observe that pixels that are classified consistently by the different methods are usually correctly classified, while pixels that are classified as text by some methods and as background by others have a higher probability of being misclassified. Since we do not have any other information about the participating thresholding algorithms, each method is given the same weight during the combination, so L is generated as follows:

L = \frac{\sum_{i=1}^{n} B_i}{n} \qquad (2)
where B_i denotes the i-th binarization method's output and n denotes the number of participating binarization methods. The document pixels are then assigned different probabilities of belonging to the foreground text and the document background. We treat those document pixels with label values 0 or 1 as correctly classified pixels, and the others as uncertain pixels. Only the labels of the uncertain pixels are updated after classification, to avoid introducing noise.

After that, the input document image I and the constructed label image L are divided into corresponding non-overlapping image patches and label patches of the same size. Features are then extracted from each image patch. Generally speaking, the intensity and gradient variation within an image patch can be used to discriminate the document background from the foreground text, because humans recognize text strokes based on the intensity change at the edges and the shape of the text strokes. In this paper, we directly vectorize the intensity matrix of the image patch and use it as one feature vector. To reflect the gradient variation, we adopt the histogram of oriented gradients (HoG) descriptor [20] of the image patch as another feature vector. Since document text strokes have similar shapes, the HoG descriptor can capture the similarity of the text strokes between different image patches. Both the HoG feature and the intensity feature are normalized into [0, 1] before classification.
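As a concrete illustration, the sketch below combines the binary outputs into the label image of Equation 2, splits the image into non-overlapping patches, and builds the intensity feature by vectorizing each patch. The simple gradient-orientation histogram is only a stand-in for the full HoG descriptor of [20], and the patch size and bin count are illustrative assumptions.

```python
# A minimal sketch of the label combination (Equation 2) and patch feature
# extraction, assuming gray scale input in [0, 255] and binary outputs in
# {0, 1}. The orientation histogram is a simplified stand-in for the HoG
# descriptor of [20]; patch size and bin count are illustrative.
import numpy as np

def combine_labels(binary_outputs):
    """Average the binary outputs B_i into the label image L (Equation 2)."""
    return np.mean(np.stack(binary_outputs, axis=0), axis=0)

def split_patches(image, patch=16):
    """Split an image into non-overlapping patch x patch blocks (edges cropped)."""
    h, w = image.shape[0] // patch * patch, image.shape[1] // patch * patch
    img = image[:h, :w]
    return (img.reshape(h // patch, patch, w // patch, patch)
               .swapaxes(1, 2)
               .reshape(-1, patch, patch))

def intensity_feature(patch):
    """Vectorize the intensity matrix and normalize it into [0, 1]."""
    return patch.astype(np.float64).ravel() / 255.0

def orientation_histogram(patch, n_bins=9):
    """Simplified gradient-orientation histogram, normalized into [0, 1]."""
    gy, gx = np.gradient(patch.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation
    hist, _ = np.histogram(angle, bins=n_bins, range=(0, np.pi), weights=magnitude)
    return hist / (hist.max() + 1e-12)
```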
B. Classification using Sparse Representation

In this paper, we propose to use the non-parametric sparse representation [17] to construct a classifier for each document image patch. Roughly speaking, we try to represent the testing document image patch using the other image patches of the document image. More specifically, the testing sample is one patch of the document image under study, and the training set consists of the N image patches of the same document image. We exclude the testing document image patch as well as its neighboring image patches from the training set beforehand; otherwise, the classifier tends to choose the patch itself or its neighbors to represent the testing sample. As only the labels of uncertain pixels are re-labeled, we apply SRC only to those image patches that contain uncertain pixels. The classification algorithm is summarized in Algorithm 1.

Algorithm 1 Sparse representation-based classifier (SRC) for a document image
Require: The n image patches [I^p_1, ..., I^p_n] of the whole document image.
Ensure: The sparse coefficient vectors [x^q_1, ..., x^q_m] of m image patches and the corresponding classifier residuals [r_1, ..., r_m].
1: Construct m image patches [I^q_1, ..., I^q_m] from the whole document image by removing those image patches without uncertain pixels.
2: for each image patch I^q_i in [I^q_1, ..., I^q_m] do
3:   Build the training sample set [I^p_1, ..., I^p_k] from the whole document image by removing I^q_i and its eight neighboring image patches.
4:   Join the training samples into a matrix A = [I^p_1, ..., I^p_k].
5:   Find a sparse coefficient vector x^q_i of length k by solving the L1-norm minimization problem defined in Equation 1.
6:   Calculate the classification residual r_i = \|A \cdot x^q_i - I^q_i\|_2.
7: end for

Since there are two types of feature vectors, one for intensity and one for gradient, instead of joining them into a single vector for classification, we build a separate classifier for each type of feature vector, so that the weight of each feature vector can be estimated adaptively for different image patches. The final sparse coefficient vector x^q_f of an image patch I^q is then obtained by combining the sparse coefficient vectors estimated with the different feature vectors. The weight of each feature vector is determined by the classifier residual r calculated in Algorithm 1: the feature with the larger residual is given a smaller weight, as specified in Equation 3:

x^q_f = \frac{\sum_{t=1}^{c} w_t \cdot x^q_t}{\sum_{t=1}^{c} w_t} \qquad (3)

where x^q_t denotes the sparse coefficient vector obtained by the classifier based on the t-th type of feature vector, and w_t = 1/r_t denotes the weight of the t-th classifier, computed from the corresponding residual r_t. The label patch of the corresponding image patch is then updated as follows:

L^q = \frac{\sum_{i=1}^{k} x^q_{f,i} \cdot L^p_i}{\sum_{i=1}^{k} x^q_{f,i}} \qquad (4)

where L^q and L^p_i denote the corresponding label patches of I^q and I^p_i as in Algorithm 1, respectively, and x^q_{f,i} denotes the i-th element of the calculated sparse coefficient vector x^q_f. The value of L^q calculated using Equation 4 is a weighted combination of the labels of the other image patches. Each image patch contributes a weight towards label 1 (document background) or 0 (foreground text) for the testing image pixels, so L^q can be interpreted as the probability of assigning the corresponding image pixel to 1 (document background). An image pixel is set to 1 if the value of its corresponding label is larger than 0.5, and to 0 otherwise.
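To tie Algorithm 1 and Equations 3 and 4 together, the sketch below loops over the feature types of one test patch, estimates a sparse coefficient vector per type (reusing the hypothetical sparse_code helper sketched earlier), fuses them with residual-based weights and updates the label patch. The function names, data layout and solver settings are illustrative assumptions, not the exact implementation used in our experiments.

```python
# A minimal sketch of the per-patch classification and label update
# (Algorithm 1, Equations 3 and 4). `sparse_code` is the hypothetical solver
# sketched earlier; feature matrices store one patch feature per column.
import numpy as np

def relabel_patch(q, features_by_type, label_patches, exclude, lam=0.1):
    """Re-estimate the label patch of test patch q from the other patches.

    features_by_type : list of (d_t, n) arrays, one per feature type
                       (e.g. intensity and HoG), column j = patch j.
    label_patches    : (n, patch, patch) array of label patches L^p.
    exclude          : indices of q and its eight neighbors, removed from training.
    """
    n = label_patches.shape[0]
    train_idx = np.array([j for j in range(n) if j not in set(exclude)])

    fused = np.zeros(len(train_idx))
    weight_sum = 0.0
    for feats in features_by_type:
        A = feats[:, train_idx]                      # training matrix A = [I^p_1, ..., I^p_k]
        y = feats[:, q]                              # testing sample I^q
        x = sparse_code(A, y, lam=lam, nonneg=True)  # sparse coefficient vector x^q_t
        r = np.linalg.norm(A @ x - y) + 1e-12        # classifier residual r_t
        w = 1.0 / r                                  # weight w_t = 1 / r_t
        fused += w * x                               # numerator of Equation 3
        weight_sum += w
    x_f = fused / weight_sum                         # fused coefficients x^q_f (Equation 3)

    # Equation 4: weighted combination of the other label patches.
    numer = np.tensordot(x_f, label_patches[train_idx], axes=(0, 0))
    L_q = numer / (x_f.sum() + 1e-12)
    return (L_q > 0.5).astype(np.uint8)              # background = 1, text = 0
```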
Table I: Evaluation results on the DIBCO 2009 dataset

| Methods                 | F-Measure(%) | PSNR  | MPM(×10^-3) | DRD   | Rank Score |
|-------------------------|--------------|-------|-------------|-------|------------|
| OTSU [2]                | 78.72        | 15.34 | 13.3        | 20.24 | 223        |
| SAUV [3]                | 85.41        | 16.39 | 3.2         | 7.04  | 254        |
| Combined OTSU and SAUV  | 86.13        | 16.53 | 4.56        | 6.51  | 247        |
| HOWE [13]               | 88.66        | 17.20 | 1.59        | 4.29  | 229        |
| LMM [10]                | 91.06        | 18.5  | 0.46        | 2.82  | 137        |
| Combined HOWE and LMM   | 90.73        | 18.29 | 0.61        | 2.94  | 137        |
| BE [9]                  | 91.24        | 18.6  | 0.55        | 3.05  | 123        |
| Combined LMM and BE     | 91.49        | 18.85 | 0.36        | 2.58  | 90         |
III. EXPERIMENTS AND DISCUSSIONS

The proposed technique is tested and compared with state-of-the-art methods on two competition datasets: the DIBCO 2009 dataset [5] and the DIBCO 2011 dataset [6]. The binarization performance is evaluated using F-Measure, Peak Signal to Noise Ratio (PSNR), Misclassification Penalty Metric (MPM), Distance Reciprocal Distortion (DRD) and the rank score, which are adopted from DIBCO 2011 [6].

We quantitatively compare the performance of state-of-the-art techniques with that of their combination using our proposed technique on the DIBCO 2009 and 2011 datasets [5], [6]. These methods include Otsu's method (OTSU) [2], Sauvola's method (SAUV) [3], Howe's method (HOWE) [13], our previous winning algorithms of DIBCO 2009 [5] and H-DIBCO 2010 [7] (BE [9] and LMM [10], respectively), as well as the method submitted to DIBCO 2011 [6] (SNUS). The two datasets are composed of document images that suffer from several common document degradations such as smear, smudge, bleed-through and low contrast. During the experiments, the image patch size and HoG cell size of our proposed technique are set to 16 and 8, respectively. On average, the sparse representation classification (SRC) takes 1-2 minutes in Matlab on a PC to process a document image in the dataset, excluding the computation time of the participating binarization algorithms; the running time could be reduced by a C/C++ implementation, which we plan to explore in future work.

The binarization performance is improved after combination, as illustrated in Tables I and II. Furthermore, the proposed technique usually has a small rank score, which means that it more robustly produces reasonable results on different kinds of document images, while the other binarization methods often fail on some testing images. Figure 2 shows one degraded document image with its corresponding binarization results produced by different binarization methods and our proposed technique.
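For reference, the sketch below computes two of the simpler evaluation measures, F-Measure and PSNR, from a binary result and its ground truth (text = 0, background = 1), following their standard definitions. It is only an illustrative implementation, not the official DIBCO evaluation tool, and MPM and DRD are omitted.

```python
# A minimal sketch of F-Measure and PSNR for binarization evaluation, assuming
# binary images with text = 0 and background = 1. Standard definitions only;
# not the official DIBCO evaluation tool. MPM and DRD are omitted.
import numpy as np

def f_measure(result, gt):
    """F-Measure (%) with the text pixels (value 0) as the positive class."""
    tp = np.sum((result == 0) & (gt == 0))
    fp = np.sum((result == 0) & (gt == 1))
    fn = np.sum((result == 1) & (gt == 0))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 100.0 * 2 * precision * recall / max(precision + recall, 1e-12)

def psnr(result, gt):
    """Peak Signal to Noise Ratio, with the peak difference C taken as 1."""
    mse = np.mean((result.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10.0 * np.log10(1.0 / max(mse, 1e-12))
```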
Figure 2: Binarization results of the degraded document image in (a): (b) HOWE [13], (c) SNUS, (d) Combined.
IV. CONCLUSION

This paper presents a novel self-training classification method for binarization of badly degraded document images. The proposed method first obtains the binary outputs of the input document image using existing binarization algorithms and combines them into a label image. The input document image and the corresponding label image are then divided into image patches. Features F^p are extracted for each image patch and fed to sparse representation classifiers that reconstruct each image patch as a linear combination of the other image patches of the input document image. The corresponding label patch is then updated using the same linear combination of the other label patches. The advantages of our proposed method can be summarized as follows. First, we only need to construct a classifier for a small image patch, without inducing a classification rule for the whole document image. Second, the proposed classifier captures similarity at the image patch level instead of the image pixel level, which provides more information for better classification results. Experiments on the DIBCO 2009 and 2011 datasets [5], [6] show the superior performance and robustness of our proposed technique.
Table II: Evaluation results on the DIBCO 2011 dataset

| Methods                 | F-Measure(%) | PSNR  | MPM(×10^-3) | DRD   | Rank Score |
|-------------------------|--------------|-------|-------------|-------|------------|
| OTSU [2]                | 82.22        | 15.77 | 15.64       | 8.72  | 245        |
| SAUV [3]                | 82.54        | 15.78 | 9.20        | 8.09  | 208        |
| Combined OTSU and SAUV  | 84.34        | 16.15 | 8.99        | 6.21  | 221        |
| LMM [10]                | 85.56        | 16.75 | 6.42        | 6.02  | 224        |
| BE [9]                  | 81.67        | 15.59 | 11.40       | 11.24 | 234        |
| Combined LMM and BE     | 86.73        | 17.10 | 5.64        | 4.65  | 198        |
| SNUS                    | 85.2         | 17.16 | 9.07        | 15.66 | 161        |
| HOWE [13]               | 88.74        | 17.84 | 8.64        | 5.37  | 159        |
| Combined SNUS and HOWE  | 89.27        | 18.15 | 5.84        | 4.15  | 150        |

V. ACKNOWLEDGMENT

This research is supported in part by MOE Grant MOE2011-T2-2-146 (R252-000-480-112).

REFERENCES

[1] M. Sezgin and B. Sankur, "Survey over image thresholding techniques and quantitative performance evaluation," Journal of Electronic Imaging, vol. 13, no. 1, pp. 146-165, 2004.
[2] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.
[3] J. Sauvola and M. Pietikainen, "Adaptive document image binarization," Pattern Recognition, vol. 33, no. 2, pp. 225-236, 2000.
[4] B. Gatos, I. Pratikakis, and S. Perantonis, "Adaptive degraded document image binarization," Pattern Recognition, vol. 39, no. 3, pp. 317-327, 2006.
[5] B. Gatos, K. Ntirogiannis, and I. Pratikakis, "ICDAR 2009 document image binarization contest (DIBCO 2009)," International Conference on Document Analysis and Recognition, pp. 1375-1382, July 2009.
[6] I. Pratikakis, B. Gatos, and K. Ntirogiannis, "ICDAR 2011 document image binarization contest (DIBCO 2011)," International Conference on Document Analysis and Recognition, September 2011.
[7] I. Pratikakis, B. Gatos, and K. Ntirogiannis, "H-DIBCO 2010 handwritten document image binarization competition," International Conference on Frontiers in Handwriting Recognition, pp. 727-732, November 2010.
[8] I. Pratikakis, B. Gatos, and K. Ntirogiannis, "ICFHR 2012 competition on handwritten document image binarization (H-DIBCO 2012)," International Conference on Frontiers in Handwriting Recognition, September 2012.
[9] S. Lu, B. Su, and C. L. Tan, "Document image binarization using background estimation and stroke edges," IJDAR, vol. 13, pp. 303-314, December 2010.
[10] B. Su, S. Lu, and C. L. Tan, "Binarization of historical handwritten document images using local maximum and minimum filter," International Workshop on Document Analysis Systems, pp. 159-166, June 2010.
[11] I. Bar-Yosef, I. Beckman, K. Kedem, and I. Dinstein, "Binarization, character extraction, and writer identification of historical Hebrew calligraphy documents," IJDAR, vol. 9, no. 2, pp. 89-99, April 2007.
[12] T. Lelore and F. Bouchara, "Super-resolved binarization of text based on the FAIR algorithm," International Conference on Document Analysis and Recognition, pp. 839-843, September 2011.
[13] N. Howe, "A Laplacian energy for document binarization," International Conference on Document Analysis and Recognition, September 2011.
[14] B. Gatos, I. Pratikakis, and S. Perantonis, "Improved document image binarization by using a combination of multiple binarization techniques and adapted edge information," International Conference on Pattern Recognition, pp. 1-4, 2008.
[15] E. Badekas and N. Papamarkos, "Optimal combination of document binarization techniques using a self-organizing map neural network," Engineering Applications of Artificial Intelligence, vol. 20, pp. 11-24, 2007.
[16] B. Su, S. Lu, and C. L. Tan, "Combination of document image binarization techniques," International Conference on Document Analysis and Recognition, pp. 22-26, September 2011.
[17] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210-227, February 2009.
[18] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[19] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, "An interior-point method for large-scale l1-regularized least squares," IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 4, pp. 606-617, December 2007.
[20] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Computer Vision and Pattern Recognition, pp. 886-893, 2005.