
Automatic Feature Selection with Applications to Script Identification of Degraded Documents

Vitaly Ablavsky and Mark R. Stevens
Charles River Analytics Inc., Cambridge, MA 02138 USA*
E-mail: {vablavsky, mstevens}@cra.com

* This work was supported by the Army Research Labs, Adelphi, MD, under contract DAAD17-02-C-0043.

Abstract

Current approaches to script identification rely on hand-selected features and often require processing a significant part of the document to achieve reliable identification. We present an approach that applies a large pool of image features to a small training sample and uses subset feature selection techniques to automatically select the subset with the most discriminating power. At run time we use a classifier coupled with an evidence-accumulation engine to report a script label once a preset likelihood threshold has been reached. We apply the system to a diverse corpus of printed Russian and English documents that suffer from common degradation problems. Our validation study shows promising results, both in terms of script identification accuracy and the ability to identify script at the scale of individual words and text lines.

1. Introduction

Automatic script identification plays an important role in processing large volumes of documents of unknown origin. In addition, the ability to reliably identify script from the least amount of textual data is essential when dealing with document pages that contain multiple languages. Since script analysis takes place prior to the actual text recognition, it cannot rely on character identification and, ideally, should consume only a small fraction of the total processing time.

Current approaches to script identification attempt to capture the difference between scripts from either a local geometric or a global statistical standpoint. For instance, in [1] exemplars are grouped into clusters based on a similarity measure; a representative template is then extracted from each cluster and used to identify characters in a novel test image. Several handcrafted rules characterize the similarity used for cluster creation and the similarity between template and document. A slightly different approach is taken in [2], where Gabor features are used to identify document script. Gabor filters are popular in pattern recognition due to their biological underpinnings, but often require numerous filters to ensure invariance to rotation, translation, and scale. In [3], script detection is accomplished using horizontal and vertical projections of the characters in a line of text. These handcrafted features work quite well but can be sensitive to degradations in image quality. In addition, examining marginal distributions (the result of horizontal and vertical projections) discards covariance information, which is often invaluable in pattern classification. The approach in [4] focuses heavily on reliable character extraction and then uses a variety of decision procedures, some handcrafted and others trainable, to detect document script. In [5], Korean "circles" and vertical strokes are chosen as features to distinguish among the three language scripts examined.

The main contribution of this paper is threefold. First, we dispense with hand-picking image features in favor of automatically deriving an optimal subset from a very large feature library. Second, by relying on the subset with the most discriminating power, we can accurately classify two target scripts based on a small number of extracted textual symbols. Third, we can train directly on the language alphabet instead of a document corpus.

2. Overview of the approach

Our system processes a stream of connected components (textual symbols) and reports a script label when enough evidence has been accumulated to make the decision. In practice the input consists of document pages, so initial preprocessing is necessary to extract both characters and words (section 3). We do not assume that all lines on a given page are written in the same script, but we do assume some continuity of a single script.

At training time, the system analyzes a series of images of the respective alphabets. The alphabet images are generated electronically in the approximate font faces and styles that we wish to recognize at run time. For instance, training images consist of the letter A in Arial and Times Roman in plain and bold font faces (font size is irrelevant due to our feature set's invariance to scale). We also generate degraded versions of the training set by producing photocopy and fax derivative images. At this stage we do not address scripts and languages that are either connected at the baseline or whose character appearance changes depending on position in the word (e.g., Pashto).

The focus of this work is on the automatic discovery of powerful features. We believe that the choice of Latin and Cyrillic scripts, which are fairly similar, makes this a challenging problem. We currently partition the alphabet images into three classes: Latin characters, Cyrillic characters, and common characters, since some characters are shared by both scripts. We apply a large pool of feature extractors to extract geometric properties from each alphabet. In contrast with other systems, training consists not just of partitioning the resulting feature space, but also of determining which small subset of features yields the best classifier performance (section 5). The top two rows of Figure 1 show the training process.

Given the reduced feature set, script identification at run time consists of generating a stream of connected components and applying those features to the resulting textual symbols. A single classifier (for validation purposes we use a nonparametric classifier, as explained in section 6) assigns a script likelihood to each symbol. These likelihood values feed into an evidence-accumulation framework that reports a label once it exceeds a preset confidence threshold (section 6). In practice we are able to identify script at the scale of individual text lines.
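To make this run-time loop concrete, the sketch below (in Python) accumulates per-symbol script likelihoods and stops once one script holds a large enough share of the evidence. The threshold values and all names are illustrative assumptions, not parameters from the paper:

import operator

def identify_script(symbol_scores, threshold=0.95, min_evidence=3.0):
    # symbol_scores: iterable of (label, likelihood) pairs emitted by the
    # per-symbol classifier, e.g. ("latin", 0.9). Hypothetical interface.
    votes = {"cyrillic": 0.0, "latin": 0.0}
    for label, likelihood in symbol_scores:
        if label in votes:                 # the "common" label casts no vote
            votes[label] += likelihood
        total = sum(votes.values())
        if total >= min_evidence:
            best, score = max(votes.items(), key=operator.itemgetter(1))
            if score / total >= threshold:
                return best                # enough evidence: stop early
    return max(votes, key=votes.get)       # fall back to overall majority

# Example stream of classifier outputs for one text line:
print(identify_script([("latin", 0.9), ("common", 0.7), ("latin", 0.8),
                       ("latin", 0.95), ("cyrillic", 0.2), ("latin", 0.9),
                       ("latin", 0.85)]))   # -> latin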

Figure 1. System diagram

3. Preprocessing

While we do not assume that zoning has already taken place, we do assume that the document images contain mostly text and that text in multicolumn newsprint articles shares the same baseline. Although we have already developed a skew detector based on the Radon transform, the system tolerates rotation of several degrees and slight baseline curl. We have found that the current system deals well with the slightly rotated lines often associated with automatic document scanning, and have therefore neglected skew correction in this paper. Consequently, we used projections [7] to separate individual lines and words, then applied 4-connected labeling to extract individual textual symbols. To preserve dots and diacritics above symbols, we considered all pixels within a bounding box and allowed a small tolerance in the vertical direction. Figure 2 shows an example of the regions extracted from clean and degraded documents.

Figure 2. Regions produced by connected components

Our current system processes each of these connected components without filtering obvious errors caused by degraded document quality. Figure 2 shows an example of the word offer, in which the ff letters are segmented as one region due to the degraded quality. Clearly the classification system will fail on such components; however, when the entire word is examined, enough evidence can be accrued to correctly detect the script. In section 8 we describe an approach that promises to extract script evidence from connected components that span multiple characters.
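This preprocessing can be sketched roughly as follows, assuming scipy is available and the page has been binarized with ink pixels set to 1; word splitting and the vertical tolerance for diacritics are omitted for brevity:

import numpy as np
from scipy import ndimage

def extract_lines_and_symbols(binary_page):
    # binary_page: 2D uint8 array, 1 = ink. Returns, per text line, a
    # list of symbol bounding boxes (as slice pairs).
    row_has_ink = binary_page.sum(axis=1) > 0     # horizontal projection
    lines, n_lines = ndimage.label(row_has_ink)   # 1D runs of inked rows
    symbols_per_line = []
    for i in range(1, n_lines + 1):
        rows = np.flatnonzero(lines == i)
        strip = binary_page[rows[0]:rows[-1] + 1, :]
        # scipy's default 2D structuring element is the cross shape,
        # i.e. exactly the 4-connected labeling described above.
        labeled, _ = ndimage.label(strip)
        symbols_per_line.append(ndimage.find_objects(labeled))
    return symbols_per_line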

4. Feature set

Given a binary component extracted from a document image using connected components, the image can be replaced by a reduced set of measurements (features) that characterize the pattern. In many cases these features can be used to reconstruct the original image. A variety of such feature measurements have been proposed for characterizing the 2D shapes present in a binary image. By far the most popular 2D shape descriptors are the Cartesian moments [8]:

$$m_{pq} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^p y^q f(x, y) \, dx \, dy$$

where f(x, y) is the pixel value of the character image at position (x, y); for binary images f is a two-valued function. These moments are sensitive to character placement, orientation, and scale. Typically, nine moments are computed (p, q = 0, 1, 2).
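In the discrete case the integrals become sums over pixel coordinates. A minimal numpy sketch of the nine raw moments, taking x as the column index and y as the row index:

import numpy as np

def cartesian_moments(f, order=2):
    # Raw moments m_pq of a 2D image f; for binary glyphs f holds 0s and 1s.
    ys, xs = np.mgrid[0:f.shape[0], 0:f.shape[1]]
    return {(p, q): float((xs**p * ys**q * f).sum())
            for p in range(order + 1) for q in range(order + 1)}

glyph = np.zeros((8, 8)); glyph[2:6, 3:5] = 1.0    # toy "character"
m = cartesian_moments(glyph)
cx, cy = m[1, 0] / m[0, 0], m[0, 1] / m[0, 0]      # centroid from m10, m01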


To compensate for the sensitivity to character position, centralized moments are introduced:

$$\mu_{pq} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \bar{x})^p (y - \bar{y})^q f(x, y) \, dx \, dy$$

Subtracting the mean location $(\bar{x}, \bar{y})$ makes these moments invariant to character position in the image. Again, nine features are computed, now from the centralized moments. To remove the sensitivity to scale, normalized moments are computed:

$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}, \qquad \text{where} \quad \gamma = \frac{p+q}{2} + 1,$$

which scale each centralized moment by the maximum possible value. Finally, to compensate for changes in rotation, Hu moments [8] are used. Hu moments encode the concept of the shape principal axis to achieve rotational invariance (this is analogous to estimating the principal axis of the shape and performing a rotation). These moments tend to break down when dealing with rotationally symmetric shapes. Additional moments, such as Zernike and Legendre, have not yet been incorporated into our feature set.
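For reference, this whole chain of raw, central, normalized, and Hu moments is available in OpenCV; a brief sketch, assuming cv2 (opencv-python) is installed and the glyph is a binary uint8 image:

import cv2
import numpy as np

img = np.zeros((32, 32), dtype=np.uint8)
cv2.putText(img, "A", (4, 26), cv2.FONT_HERSHEY_SIMPLEX, 1.0, 255, 2)

M = cv2.moments(img, binaryImage=True)   # raw m.., central mu.., normalized nu..
hu = cv2.HuMoments(M).flatten()          # the 7 rotation-invariant Hu moments
# Log-scaling is a common practical step, since Hu values span many magnitudes.
hu_log = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)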


When either p or q is zero, the moments are a parametric representation of the marginal distributions. For instance, when p is zero, the centralized moment is proportional to the variance of the pixel values projected onto the vertical axis; likewise, the Cartesian moment is proportional to the mean. These features therefore encode, parametrically, the same information found to be useful for script detection in [3]. While several of the moment features examined are robust under translation, rotation, and scaling, they can be sensitive to degradations in image quality, so-called image noise. Such degradations are often present when dealing with poorly scanned documents or faxes. Additional shape-based features can be used to reduce sensitivity to image noise. An example of such a feature is the perimeter, which counts the number of pixels on the boundary of the character. To normalize this value, compactness is used:

$$\text{compactness} = \frac{4\pi \cdot \text{area}}{\text{perimeter}^2}$$

Other shape features capture curvature, holes, the lengths of the major and minor axes, eccentricity, elongation, and convexity. The final category of features in our set is co-occurrence features [9]. These features are based on measuring how often corresponding positions in the image share the same pixel value. Co-occurrence features are considered textural features and capture notions of symmetry, holes, and repetitive structure. The first step in computing them is to build a large matrix of counts of co-occurring pixel values; to achieve rotation invariance, this matrix is computed at several different orientations. Twenty-seven feature measurements (mean, variance, entropy, etc.) are extracted as a parametric representation of this matrix.
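A sketch of these two feature families, assuming scikit-image: perimeter and compactness from region properties, and co-occurrence statistics from a gray-level co-occurrence matrix computed at four orientations. Entropy is not among graycoprops's built-in statistics, so it is computed from the matrix directly:

import numpy as np
from skimage.measure import label, regionprops
from skimage.feature import graycomatrix, graycoprops

glyph = np.zeros((16, 16), dtype=np.uint8); glyph[4:12, 5:10] = 1

# Shape features: perimeter and compactness = 4*pi*area / perimeter^2.
region = regionprops(label(glyph))[0]
compactness = 4 * np.pi * region.area / region.perimeter**2

# Co-occurrence matrix at several orientations for rotation tolerance.
glcm = graycomatrix(glyph, distances=[1],
                    angles=[0, np.pi/4, np.pi/2, 3*np.pi/4], levels=2)
contrast = graycoprops(glcm, "contrast").mean()
energy = graycoprops(glcm, "energy").mean()
# Entropy, computed directly from the normalized co-occurrence counts:
p = glcm / glcm.sum()
entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()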

5. Subset feature selection

The previous section discussed a set of generic features that have been found useful in pattern recognition. One approach to exploiting a large pool of features is an ensemble of weak classifiers; AdaBoost [10] is one such method. Another approach is to determine directly which features have the most discriminating potential.

Filters are a technique for isolating good subsets of feature data without expensive search algorithms. The idea is to compute statistics over the features that indicate which features are better than others; features with a low weighting (as set by a threshold) are discarded. Perhaps the most widely used filter is the RELIEF-F algorithm [11]. This algorithm updates a weighting vector based on the nearest hit (the closest instance in feature space with the same class label) and the average of the nearest misses (the closest instance in feature space from each of the other classes). Figure 3 shows pseudocode for the algorithm. The algorithm treats the features as dependent, so combinations of features are examined (because the search for hits and misses is over all attributes). RELIEF-F produces higher weightings for attributes whose instances are close to neighbors of the same class and far from instances with different class labels; likewise, the weights of attributes whose instances lie closer in feature space to the wrong class label are decreased. In practice the complexity of the algorithm is O(N^2), rather than O(2^N), since the nearest-neighbor computation dominates. The approach has been used successfully in many large subset feature selection tasks [12] and shows promise here.

Set all weights W[A] := 0;
For I := 1 to num_instances do
  Find nearest hit H and nearest misses M;
  For A := 1 to num_attributes do
    Sum := 0;
    For C := 1 to num_classes do
      If C != ClassOf(I) then
        Sum := Sum + dist(A, I, M[C]);
      EndIf
    EndFor
    W[A] := W[A] - dist(A, I, H) + Sum;
  EndFor
EndFor

Figure 3. Pseudocode for RELIEF-F algorithm
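A direct numpy transcription of the Figure 3 pseudocode, assuming every class has at least two instances and taking the per-attribute distance dist(A, I, J) to be the absolute difference of attribute A; the prior-weighting of the full RELIEF-F is omitted:

import numpy as np

def relief_f(X, y):
    # X: (n, d) feature matrix; y: (n,) class labels.
    # Returns one weight per attribute; higher = more discriminative.
    n, d = X.shape
    W = np.zeros(d)
    for i in range(n):
        gaps = np.abs(X - X[i])            # per-attribute distances to all
        total = gaps.sum(axis=1)           # overall distance, for "nearest"
        total[i] = np.inf                  # never pick the instance itself
        same = np.flatnonzero(y == y[i])
        hit = same[np.argmin(total[same])]          # nearest hit H
        W -= gaps[hit]                     # W[A] := W[A] - dist(A, I, H)
        for c in np.unique(y):
            if c == y[i]:
                continue
            other = np.flatnonzero(y == c)
            miss = other[np.argmin(total[other])]   # nearest miss M[C]
            W += gaps[miss]                # ... + Sum of dist(A, I, M[C])
    return W

# Rank attributes and keep the best 25 of 72, as in section 7.1:
# top25 = np.argsort(relief_f(X, y))[::-1][:25]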

6. Classification and evidence accumulation

For demonstration purposes we have chosen the K-Nearest Neighbors (KNN) algorithm for classification [6]. This algorithm has no training time: the only training is the storage of all of the training instances for use during classification. During the on-line classification stage, the K nearest neighbors of the current feature vector are found (we used K = 3), and a voting algorithm assigns a class label based on the true labels of these neighbors. The Mahalanobis distance metric is used to reduce sensitivity to the scaling of distances in the feature space [6]. The KNN algorithm tends to perform very accurately in most domains, but its run-time performance is much slower than that of other classifiers because of the required search through the feature space for neighbors. The algorithm is also sensitive to both noisy features and irrelevant information.

The KNN algorithm assigns labels (binary likelihoods) to the output of the connected components algorithm. To mitigate errors associated with degraded documents, voting is used to accumulate evidence over a text line: each connected component on the line is classified using KNN, and the output label is produced by the majority-vote rule.
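A sketch of this stage with scikit-learn, assuming feature matrices X_train, y_train are available: a 3-NN classifier under the Mahalanobis metric, followed by the majority vote over one text line with the common class excluded:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fit_knn(X_train, y_train):
    # The Mahalanobis metric needs the inverse covariance of the training data.
    VI = np.linalg.inv(np.cov(X_train, rowvar=False))
    knn = KNeighborsClassifier(n_neighbors=3, metric="mahalanobis",
                               metric_params={"VI": VI})
    return knn.fit(X_train, y_train)

def label_line(knn, component_features):
    # Majority vote over the components of one text line, ignoring the
    # "common" class, which carries no script evidence.
    labels = knn.predict(component_features)
    counts = {s: int((labels == s).sum()) for s in ("cyrillic", "latin")}
    return max(counts, key=counts.get)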

7. Results

Algorithm validation took place on a corpus of English and Russian document images comprising 1624 lines of text. The sources were newsprint and books scanned at 300 dpi. Degraded documents were first-generation photocopies and first-generation facsimiles of those photocopies; this resulted in all characters being distorted and each line containing merged characters. We created the training set by typing the Latin and Cyrillic alphabets into a word-processing program and printing a single page per alphabet. These printed pages were then scanned and used as labeled training instances. Each page contained its alphabet in a variety of fonts (Arial, Times Roman, etc.) and font faces (bold, plain, etc.).

7.1. Feature subset selection results

The RELIEF-F algorithm was applied to the training set to determine which of the 72 feature measures are most appropriate for this domain. While all of the implemented features have shown promise in other domains, some clearly provide better discrimination here. We retain the top 25 features by rank. Table 1 shows the highest- and lowest-ranked features.

Table 1. Four best and worst features from RELIEF-F

Feature                 Category         RELIEF-F Score
Major Axis Direction    Shape            277
Eccentricity            Shape            181
µ21                     Moment           175
µ12                     Moment           168
Perimeter               Shape            102
Fiber width             Shape            101
Entropy                 Co-occurrence     96
Compactness             Shape             91

7.2. Individual connected components

Our first analysis determines how well the script identification algorithm works at the connected-component level (Table 2). We first ran the classification algorithm on the training data to estimate performance on a larger document corpus (testing on the training data demonstrates the separability of our features).

Table 2. Classification rate on training data

            Common    Cyrillic    Latin
Common       0.94       0.05       0.01
Cyrillic     0.01       0.97       0.02
Latin        0.00       0.03       0.97

7.3. Voting over words on a single line

We chose a text line as the script identification unit for the purpose of evidence accumulation. We accomplish evidence accumulation with a simple voting algorithm: we count the Cyrillic labels and the Latin labels assigned to a given text line, and the higher count labels the line. Note that the common label is not counted in the voting, as it provides no insight into which script is present. Table 3 shows the results of the algorithm. The results are promising when the images are not corrupted and the connected components therefore correspond to glyphs similar to our training set. When dealing with second- and third-generation documents, the results degrade due to multiple glyphs being extracted as single regions. In the next section we discuss a promising approach for extracting characters from such connected components.

Table 3. Results on corpus of corrupted data

Scanned
            Ambiguous    Cyrillic    Latin
Cyrillic       0.01        0.99       0.00
Latin          0.06        0.08       0.86

Copied then scanned
            Ambiguous    Cyrillic    Latin
Cyrillic       0.01        0.99       0.00
Latin          0.13        0.36       0.51

Copied then faxed then scanned
            Ambiguous    Cyrillic    Latin
Cyrillic       0.05        0.94       0.01
Latin          0.14        0.42       0.43

While the confusion matrix is a standard way of displaying results, it does not convey the difficulty of the dataset. Figure 4 illustrates several failure cases. The major sources of error are imperfect zoning, inaccurate line extraction in multicolumn text, and joined characters. By contrast, the actual script classification of degraded but properly segmented characters is accurate. Figure 5 illustrates the algorithm's strengths: it tends to be fairly robust to changes in font size and tolerant of large amounts of image noise and of skew.

Figure 4. Example failure cases


Figure 5. Example success cases

8. Handling merged characters

Since our system is trained to recognize individual alphabet characters, we can either ignore connected components that span multiple symbols or attempt to mine them for additional evidence to support (or reject) a script hypothesis. We are currently developing a preprocessor that recursively breaks up long connected components. The objective is not to separate all characters (a task that requires linguistic context) but to obtain smaller textual units that may be characters. Given subcomponents $c_1, c_2, \dots, c_k$, the likelihood of script $S_i$ is postulated to be

$$p(S_i) = p(S_i \mid c_1) \cdot p(S_i \mid c_2) \cdots p(S_i \mid c_k)$$

which automatically handles conflicting evidence, since two subcomponents voting for two different scripts $S_i$ and $S_j$ will "zero out" each other. For each component whose length exceeds a preset tolerance we construct a line adjacency graph (LAG) [13] whose edge weights $w_{ij}$ are in the range [0, 1]. We then construct a refined graph (which we call r-LAG) by introducing multiple vertices for runs that exceed an average stroke width, with edge weights $\tilde{w}_{ij}$ connecting those additional vertices.
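As a minimal sketch of the product rule above, assuming per-subcomponent script likelihoods are available; working in log space avoids numerical underflow:

import numpy as np

def combine_subcomponents(likelihoods):
    # likelihoods: (k, n_scripts) array, row j = p(S_i | c_j) for
    # subcomponent c_j. Returns the normalized product over subcomponents.
    log_p = np.log(np.clip(likelihoods, 1e-12, None)).sum(axis=0)
    p = np.exp(log_p - log_p.max())   # subtract the max for stability
    return p / p.sum()

# Two subcomponents voting for different scripts largely cancel out:
print(combine_subcomponents(np.array([[0.9, 0.1],
                                      [0.1, 0.9]])))   # -> [0.5 0.5]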