Using Perturbed Handwriting to Support Writer Identification in the Presence of Severe Data Constraints

Jin Chen, Wen Cheng, and Daniel Lopresti
Lehigh University, 19 Memorial Drive West, Bethlehem, PA 18015, U.S.A.

ABSTRACT

Since real data is time-consuming and expensive to collect, label, and use, researchers have proposed approaches using synthetic variations for tasks such as signature verification, speaker authentication, handwriting recognition, and keyword spotting. The limitation of real data is particularly critical in the field of writer identification: in forensics, enemies or criminals usually leave only small amounts of real data, so it is unrealistic to always assume sufficient real data for writer identification. In addition, this field differs from many others in that we strive to preserve as much inter-writer variation as possible, yet model-perturbed handwriting might break such discriminability among writers. In this work, we started by conducting user studies in which human subjects helped calibrate realistic-looking transformations. Next, we measured the effects of incorporating perturbed handwriting into the real training dataset. Experimental results support our hypothesis that, with limited real data, model-perturbed handwriting improves the performance of writer identification. In addition, our experiments show that it is beneficial to search the parameter subspaces for better performance.

Keywords: Writer Identification, Synthetic Variations, Arabic Handwritten Documents

1. INTRODUCTION

Writer identification is the task where, given a query and a set of known writers, the system attempts to output the identity of the handwriting. In general, the output is a list of potential authors with associated confidence scores in descending order.1, 2 Sometimes, a rejection option is available as well.2 Since the survey by Plamondon and Lorette summarizing the state of the art in 1989,3 there have been significant improvements in the field.1, 2, 4, 5 However, only a limited number of databases are available for research, and the data shortage in writer identification is more critical than in other fields. In handwriting recognition, for example, researchers can enlarge datasets if more time and funding are available. This is not the case for writer identification in forensics: we should assume that enemies or criminals usually leave only small amounts of real data.

To address similar data shortages, there exist in general two approaches to handwriting synthesis in the field of handwriting recognition: Model Generated Handwriting (MGH) and Model Perturbed Handwriting (MPH). For the first, some researchers made use of the neuromuscular model from the kinematic theory to generate synthetic handwriting. For example, in Plamondon and Guerfali's work, handwriting was modeled as the time superimposition of discrete stroke segments.6 Bezine et al. presented an algebraic Beta-elliptic model that was shown to better capture the inherent mechanisms underlying people's handwriting movements.7 For the second, researchers work directly on real handwriting and perturb it. Märgner and Pechwitz presented an image distortion method that operated at the character and word level for machine-printed OCR.8 For handwriting recognition, Varga and Bunke proposed a perturbation approach that models an entire handwritten text line and demonstrated its efficacy in their HMM-based handwriting recognition system.9 Beyond handwriting recognition, dataset amplification is also useful for tasks such as signature verification and keyword spotting.10, 11

Further author information: (Send correspondence to J.C.)
J.C.: Email: [email protected]
W.C.: Email: [email protected]
D.L.: Email: [email protected]

In this work, we hypothesize that: (1) with limited real data, MPH-based "realistic-looking" synthetic variations can be useful for writer identification; and (2) we might be able to find better parameter subspaces inside the realistic-looking ones. By realistic-looking we mean model-perturbed handwriting that human subjects consider unaltered. To test these hypotheses, we extended the work of Varga and Bunke.9 To ensure that handwriting synthesis preserves as much of a writer's handwriting idiosyncrasies as possible, we conducted user studies in which 15 human subjects were recruited to filter out unrealistic-looking transformations. Since realistic-looking does not necessarily imply "idiosyncrasy-preserving," we proceeded to find better subspaces, and the results were promising and statistically significant.

The paper is structured as follows: in Section 2 we briefly review the perturbation model proposed by Varga and Bunke and introduce our extensions to their approach. We state our propositions in Section 3. We then explain the experimental data preparation, the "contour-hinge" feature extraction, and the SVM classifier configuration in Section 4. Finally, we present experimental results supporting our hypotheses in Section 5 and conclude in Section 6.

2. PERTURBATION MODEL AND EXTENSION

We adopted Varga and Bunke's perturbation model as our basis. In their model, transformations of each text line are modulated by a "cosine wave" function; by adjusting the amplitude and length of this function, an entire text line can be transformed directly. For completeness, we briefly describe the four transformations used in their model; further details can be found in Varga and Bunke's paper.9 Horizontal scaling shifts each ink pixel horizontally by a pre-defined scaling factor. Vertical scaling is conducted using the estimated baseline of the handwritten text line; its effect is that ink pixels are elongated or shortened along the vertical direction. Baseline bending shifts ink pixels vertically by a pre-defined offset. Shearing is also computed from the estimated baseline: each ink pixel's new position is obtained by transforming its current position by a pre-defined angle. Examples of these four transformations are displayed in Figure 1, and a sketch of the modulation idea follows below.

It should be stressed that Varga and Bunke's perturbation model was originally designed for handwriting recognition. Although handwriting recognition and writer identification share a great deal in feature extraction and classification techniques, one big difference is that the former strives to minimize inter-writer variation, whereas maximizing inter-writer variation is the goal of writer identification. Therefore, it is critical to investigate reasonable parameter spaces for each transformation. To address this issue, we designed user studies to help calibrate the parameter spaces.12 Each of 15 human subjects was asked to filter out handwritten text lines that seemed unrealistic. We then aggregated all subjects' judgments and acquired four discrete parameter spaces: nine values for horizontal scaling, 10 for vertical scaling, four for baseline bending, and 16 for shearing. Although "realistic-looking" does not imply that these transformations preserve all the idiosyncrasies in people's handwriting, our experimental results show that these initial parameter spaces are good enough to support our hypothesis. In addition, by examining these spaces at a finer granularity, we might acquire better subspaces that significantly reduce the amount of model-perturbed handwriting and thus speed up the subsequent procedures. We discuss this issue in Section 5.2.
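To make the cosine-wave modulation concrete, the following is a minimal sketch of how such a function might drive two of the four transformations (shearing and baseline bending) on the ink pixels of one text line. The function and parameter names (amplitude, wavelength) are our illustrative assumptions, not Varga and Bunke's exact formulation.

```python
import numpy as np

def cosine_wave(x, amplitude, wavelength, phase=0.0):
    """Modulating function: a smoothly varying factor along the x axis."""
    return amplitude * np.cos(2.0 * np.pi * x / wavelength + phase)

def shear_line(ink_xy, baseline_y, amplitude, wavelength):
    """Shear one text line: each ink pixel is shifted horizontally in
    proportion to its height above the estimated baseline, with the
    shear angle modulated by a cosine wave along the line."""
    out = ink_xy.astype(float).copy()
    angles = cosine_wave(out[:, 0], amplitude, wavelength)  # radians
    out[:, 0] += (baseline_y - out[:, 1]) * np.tan(angles)
    return out

def bend_baseline(ink_xy, amplitude, wavelength):
    """Baseline bending: shift every ink pixel vertically by a cosine offset."""
    out = ink_xy.astype(float).copy()
    out[:, 1] += cosine_wave(out[:, 0], amplitude, wavelength)
    return out

# Usage: ink_xy is an N x 2 array of (x, y) ink-pixel coordinates taken
# from a binarized line image with an estimated baseline at y = 120.
ink_xy = np.array([[10, 100], [11, 101], [300, 98], [301, 97]])
sheared = shear_line(ink_xy, baseline_y=120.0, amplitude=0.15, wavelength=400.0)
```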

3. EXPERIMENTAL PROPOSITION

Our motivation for investigating handwriting perturbation is that real data is time-consuming and expensive to collect, label, and use. First, enemies or criminals usually leave only small amounts of data. In addition, even if they are compelled to supply extra handwriting, it is likely that they would attempt to disguise their normal handwriting styles. Therefore, not only are we concerned about the expense of real data collection, but, much more importantly, the assumption of real data shortage is more realistic than the assumptions made in laboratory settings before. To make our goals clear, this work is not about designing novel features, nor about proposing powerful classifiers; these modules are treated as black boxes in our experimental evaluation. Instead, we investigate the feasibility of amplifying datasets using model-perturbed handwriting. Specifically, we state the following propositions:

Proposition 1: When real data is limited, dataset amplification can benefit identification performance.

Proposition 2: The "idiosyncrasy-preserving" parameter space can outperform the "realistic-looking" one, and the improvement is statistically significant.

Figure 1: An example of the four synthesizing transformations used in our experiments; dashed lines highlight several altered parts. (a) One original line image. (b) After horizontal scaling, the horizontal distances between components and the component widths changed. (c) After vertical scaling, the component heights changed. (d) After baseline bending, the component bottoms were no longer necessarily aligned on a straight line. (e) After shearing, more components were tilted.

4. EXPERIMENTAL SETUP

In this section, we first introduce the database we work with in Section 4.1, then discuss the "contour-hinge" features in Section 4.2 and the SVM classifier in Section 4.3.

4.1 Data Preparation

The Arabic database we work with is from the DARPA MADCAT project, as provided by the Linguistic Data Consortium (LDC).13 The current release involves 259 native Arabic writers; thus our experiments have 259 classes. All documents are scanned at a resolution of 600 dpi and then binarized. The dataset was pre-processed by Raytheon BBN Technologies. First, they excluded scanning noise by removing connected components smaller than 64 pixels. Second, they employed a technique to remove ruling lines present in the handwritten documents.14 Meanwhile, they preserved any skew at the level of connected components, characters, "PAWs" (parts of Arabic words), and words. Although it is feasible to extract handwriting style information from an entire document page, it is much easier to work on a per-line basis and then combine identification results within each page.

Table 1: A case study of using perturbed data for training.

    Training Set          # of Lines   Accuracy %   Gain %
    Baseline                     259          0.2      N/A
    Horizontal Scaling          2590          5        2400
    Vertical Scaling            2590          6        2900
    Baseline Bending            2072          5        2400
    Shearing                    4144          5        2400
    Mixed                      10356          8        3600

To prepare the data, we randomly selected 10 pages per writer for the testing set; all subsequent experiments rely on this same testing dataset, which contains 50130 handwritten lines from the 259 writers. For the training sets, we started by selecting a single line per writer from the corpus, then multiple lines (2, 5, and 10 lines), and finally multiple pages (1, 2, and 3 pages). These datasets served as baselines in the control experiments. For document pages, we cropped out each text line according to the bounding box associated with that line.

4.2 Feature Extraction

We implemented one particular set of features from Bulacu and Schomaker's work.1 The idea is to capture characteristics along the contours of people's handwriting. In this set of so-called "contour-hinge" features, for every two adjacent contour segments (each 5 pixels long), the angles they form with the horizontal axis are computed and treated as two random variables. Quantizing the angle plane ([0, 2π)) into 24 bins, we accumulated the count in each bin as we traversed all contours. As the authors did in their work, we only considered cases where the second angle is not smaller than the first. Finally, we normalized the count table into a joint probability distribution, which serves as a 300-dimensional feature vector. A sketch of this computation follows.
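The following is a minimal sketch of the contour-hinge computation, assuming contours have already been extracted as ordered lists of (x, y) points; the function name and contour representation are our assumptions, while the 5-pixel segment length and 24 angle bins follow the description above.

```python
import numpy as np

def contour_hinge_features(contours, seg_len=5, n_bins=24):
    """Accumulate the joint distribution of the two hinge angles along all
    contours, keeping only pairs where the second angle >= the first.
    The upper triangle (incl. diagonal) of a 24x24 grid has 300 entries."""
    hist = np.zeros((n_bins, n_bins))
    for c in contours:  # c: ordered (x, y) points of one contour
        c = np.asarray(c, dtype=float)
        for i in range(len(c) - 2 * seg_len):
            p0, p1, p2 = c[i], c[i + seg_len], c[i + 2 * seg_len]
            a1 = np.arctan2(p1[1] - p0[1], p1[0] - p0[0]) % (2 * np.pi)
            a2 = np.arctan2(p2[1] - p1[1], p2[0] - p1[0]) % (2 * np.pi)
            b1 = min(int(a1 / (2 * np.pi) * n_bins), n_bins - 1)
            b2 = min(int(a2 / (2 * np.pi) * n_bins), n_bins - 1)
            if b2 >= b1:  # second angle not smaller than the first
                hist[b1, b2] += 1
    feat = hist[np.triu_indices(n_bins)]  # 300-dimensional vector
    total = feat.sum()
    return feat / total if total > 0 else feat  # normalize to a joint PDF
```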

4.3 Classification

For classification, we use Support Vector Machines (SVMs). An SVM constructs a maximum-margin hyperplane in a higher-dimensional vector space: a classification problem that is not linearly separable in the original vector space may become linearly separable after the feature vectors are mapped into the higher-dimensional space. The mapping functions are called "kernels" in the literature, and the choice of kernel determines how the projection into the higher-dimensional space is performed. Commonly used kernels include the linear, polynomial, radial basis function (RBF), and Gaussian radial basis kernels:

K(x, y) = x · y,  (1)

K(x, y) = (x · y + 1)^d,  (2)

K(x, y) = exp(−γ‖x − y‖²), γ > 0,  (3)

K(x, y) = exp(−‖x − y‖² / (2σ²)).  (4)

Note that the generic form of the SVM applies only to two-class classification. The common way of using it for multi-class classification is to run k(k − 1)/2 two-class classifiers, where k is the number of classes, and then vote for a multi-class decision using the outputs of all the two-class classifiers. In our experiments, we employ the libSVM tool.15 We use the RBF kernel because it offers better discriminability than the linear kernel while requiring fewer parameters than the polynomial kernel. From our experimental results, we found that setting the cost c = 10000 performed best. To facilitate SVM training and testing, we normalize feature vectors into the unit hypercube. In addition, a probability output option is available, which allows us to compute a Top-N list.

Recall that our task is to identify writers given their handwritten document pages; thus, for output, we should determine writer identities based on documents rather than on the individual lines on which we conduct classification. Suppose the probabilistic output of the SVM classification is denoted p(l_{i,k} | x), where x denotes the input feature vector and l_{i,k} means the i-th line is written by the k-th writer. The decision is then made by the maximum vote within the corresponding page, as the following equation shows:

Identity Hypothesis(x) = argmax_k Σ_{i=1}^{|d_j : l_i ∈ d_j|} p(l_{i,k} | x),  (5)

where d_j denotes the document page j that contains handwritten text line l_i, and |·| computes the number of lines in a document page. One common way of presenting identification results is to display a Top-N writer list with confidence scores in descending order. Since our focus is not to compare identification performance against the state of the art, we only present the top choices in the following discussion.

Figure 2: Performance gains from using horizontal scaling, vertical scaling, baseline bending, and shearing transformations to perturb handwritten text lines for training; the "Mixed" curves combine all four transformations into one dataset. (a) The net performance gains: identification performance (%) versus the amount of real data per writer (1 line to 1 page). (b) The relative performance gains (%) versus the amount of real data per writer (2 lines to 2 pages).
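To make the page-level decision of Eq. (5) concrete, here is a minimal sketch using scikit-learn's SVC (a wrapper around libSVM) with the RBF kernel and cost c = 10000 as described above. The stand-in training data and the helper name identify_page are illustrative assumptions, not part of our pipeline.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in data: the paper uses 259 writers; 5 writers suffice for illustration.
X_train = rng.random((50, 300))        # contour-hinge vectors, scaled to [0, 1]
y_train = np.repeat(np.arange(5), 10)  # writer labels, 10 lines per writer

# One-vs-one RBF-kernel SVM with probability outputs (SVC wraps libSVM).
clf = SVC(kernel="rbf", C=10000, probability=True)
clf.fit(X_train, y_train)

def identify_page(line_features):
    """Page-level decision of Eq. (5): sum per-line posteriors over all lines
    of the page and rank writers by the total vote."""
    probs = clf.predict_proba(np.asarray(line_features))  # (n_lines, n_writers)
    page_scores = probs.sum(axis=0)
    order = np.argsort(page_scores)[::-1]                 # Top-N writer list
    return clf.classes_[order], page_scores[order]

writers, scores = identify_page(rng.random((8, 300)))     # one page of 8 lines
```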

5. EXPERIMENTAL RESULTS

We now examine the two propositions in Sections 5.1 and 5.2, respectively, and verify the statistical significance of the performance gains from Section 5.2 in Section 5.3.

5.1 Performance using Perturbed Data

As shown in Table 1, the baseline system was trained on only 259 handwritten lines and, unsurprisingly, its identification performance was not satisfactory. However, we achieved roughly a 24x performance gain using individual transformations, and a 36x gain when all four transformations were combined (this case is not shown in Figure 2b). Next, we investigated how the performance gains vary as the amount of real data increases. To do this, we gradually increased the size of the real training datasets. In each setting, we transformed every handwritten text line using the corresponding parameters, trained the SVM classifier, and then decoded on the same testing dataset. As in the experiment above, we measured the performance gains for the individual transformation datasets as well as the mixed dataset. As expected, the curve trends in Figure 2 align with our hypothesis that the more real data is involved, the less benefit perturbed data provides. Comparing the performance gains between using one line per writer and two lines per writer, we observed a reduction from 36x to 240% (the "Mixed" set in Figure 2b). Still, it is beneficial to include perturbed handwriting in the training process. As more real data became available, the performance gains kept decreasing; eventually, perturbed data no longer helped because the writers' handwriting idiosyncrasies were already covered by the real data.

Figure 3: Experiments with different sizes of parameter spaces for handwriting perturbation, comparing the reduced subspaces against the realistic-looking space. (a) Performance for the parameter subspaces: identification performance (%) versus the number of reduction iterations. (b) Reduction of perturbed data for training: training set size (thousands of lines) versus the number of reduction iterations.

5.2 Performance for Parameter Subspaces

So far we have not taken into account the overhead of using perturbed handwriting. For each handwritten text line, there are nine transformations for horizontal scaling, 10 for vertical scaling, 16 for shearing, and four for baseline bending. Adding these up, the mixed dataset is 39 times the size of the real dataset, so SVM training time can easily explode as the real data grows. Recall that during the user studies, we asked human subjects to select handwritten text lines that looked unaltered. However, realistic-looking handwriting does not necessarily imply that the transformed handwriting preserves the writer's idiosyncrasies; in fact, it is easy to perturb a writer's handwriting so that it seems realistic yet deviates from her normal handwriting style. In other words, the parameter spaces suggested by our earlier user study might be large enough to introduce unnecessary extra intra-writer variation. One straightforward improvement is to detect subspaces that preserve as much of the writers' handwriting idiosyncrasies as possible. This improvement has twofold benefits. First, the number of transformations is reduced, so dataset amplification, feature extraction, and SVM training and decoding all speed up considerably. Second, more accurate parameter spaces might facilitate SVM training and thereby improve performance.

To examine this hypothesis, we reduced the size of each parameter space iteratively. Note that in our current perturbation models, the parameter spaces are discrete-valued intervals. Thus, at each iteration, we cut off one boundary value at each end of the interval, as long as the parameter space contained more than two values. This space reduction was applied to all transformations except baseline bending, whose parameter space currently has only four values. For each of the reduced datasets, we reran the classification pipeline and plotted the performance results in Figure 3a. We observed a 39.59% accuracy, with the baseline at 39.25%. Moreover, after three iterations of reducing the parameter spaces, the total amount of training data shrank from 206k to 113k lines, as shown in Figure 3b. These results support our hypothesis that idiosyncrasy-preserving parameter spaces reside inside the realistic-looking ones. In addition, with this significant reduction of training samples, SVM training completed roughly 3x faster. The trimming loop is sketched below.
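A minimal sketch of the interval-trimming loop described above. Only the space sizes (9, 10, 16, and 4 values) come from the user study; the concrete parameter values are placeholders, and the retraining step is elided.

```python
def trim(space):
    """Cut one boundary value from each end of a discrete parameter
    interval, as long as more than two values remain."""
    return space[1:-1] if len(space) > 2 else space

# Illustrative discrete spaces; sizes match the calibrated spaces,
# but the values here are placeholders.
spaces = {
    "horizontal_scaling": list(range(9)),
    "vertical_scaling":   list(range(10)),
    "shearing":           list(range(16)),
    "baseline_bending":   list(range(4)),  # only four values: left untrimmed
}

for it in range(1, 6):
    spaces = {k: (v if k == "baseline_bending" else trim(v))
              for k, v in spaces.items()}
    n_synth = sum(len(v) for v in spaces.values())
    print(f"iteration {it}: {n_synth} transformations per real line")
    # ... regenerate the perturbed dataset, retrain the SVM, evaluate ...
```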

5.3 Significance Test

Although the best performance seems to differ only marginally from the others, we show that these differences are statistically significant. Denote by F_i the i-th parameter space in Figure 3a and by U(·) its performance. Dietterich16 suggests evaluating the difference between two classification approaches using McNemar's test:

Z² = (|n01 − n10| − 1)² / (n10 + n01),  (6)

where we first divide the misclassified samples into two groups and then state the hypothesis test:

• n01: the number of samples misclassified by F3 but not by the others, i.e., F_n (n ≠ 3).
• n10: the number of samples misclassified by F_n but not by F3, n ≠ 3.
• Null hypothesis H0: U(F3) = U(F_n), n ≠ 3.
• Alternative hypothesis H1: U(F3) > U(F_n), n ≠ 3.

It turns out that the test statistic Z² approximately follows the χ² distribution with 1 degree of freedom. Therefore, after counting n01 and n10 from the classification results, we computed the test statistic for each case. The χ² lookup table shows that Z² > 3.84 is required for a confidence level of 95%, and Z² > 6.64 for 99%. We conclude that F3 performed significantly better than all the others at a confidence level of 95% or above; for example, for F3 and F5, Z² = 5.90, and for F3 and F4, Z² = 34.57. In addition, F3, F4, and F5 performed significantly better than the realistic-looking baseline at a confidence level of 99%.
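For illustration, a minimal sketch of computing Eq. (6) from two classifiers' misclassification sets; the sample ids below are hypothetical.

```python
from scipy.stats import chi2

def mcnemar(miss_a, miss_b):
    """McNemar's test (Eq. 6) on two classifiers' per-sample error sets.
    miss_a, miss_b: sets of test-sample ids misclassified by each classifier."""
    n01 = len(miss_a - miss_b)   # missed by the first classifier only
    n10 = len(miss_b - miss_a)   # missed by the second classifier only
    z2 = (abs(n01 - n10) - 1) ** 2 / (n10 + n01)
    p = chi2.sf(z2, df=1)        # Z^2 ~ chi-square with 1 degree of freedom
    return z2, p

# Hypothetical error sets for two parameter spaces:
z2, p = mcnemar({1, 4, 9, 16},
                {1, 4, 9, 16, 23, 42, 57, 88, 91, 99, 105, 130})
print(f"Z^2 = {z2:.2f}, p = {p:.4f}")  # Z^2 > 3.84 -> significant at 95%
```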

6. CONCLUSIONS

In this work, we investigated the efficacy of model-perturbed handwriting for the task of writer identification. First, we extended Varga and Bunke's perturbation model by involving human subjects to calibrate the parameter spaces of the transformations. Next, we conducted a series of experiments examining the effects of perturbed data as the amount of real data varies. We observed that when real data was limited, incorporating perturbed data greatly improved performance. Finally, the experimental results also showed that it is beneficial to search within the "realistic-looking" parameter spaces for subspaces that yield better performance.

ACKNOWLEDGMENTS This work is supported by a DARPA IPTO grant administered by Raytheon BBN Technologies. The authors would like to thank Huaigu Cao and Rohit Prasad for providing us with the preprocessed MADCAT dataset.

REFERENCES

[1] Bulacu, M. and Schomaker, L., "Text-independent writer identification and verification using textural and allographic features," IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 701–717 (2007).
[2] Schlapbach, A. and Bunke, H., "A writer identification and verification system using HMM based recognizers," Pattern Analysis and Applications 10, 33–43 (2007).
[3] Plamondon, R. and Lorette, G., "Automatic signature verification and writer identification – the state of the art," Pattern Recognition 22, 107–131 (1989).
[4] Li, X. and Ding, X., "Writer identification of Chinese handwriting using grid microstructure feature," in [ICB], 1230–1239 (2009).
[5] Li, B., Sun, Z., and Tan, T., "Hierarchical shape primitive features for online text-independent writer identification," in [Proc. 10th International Conference on Document Analysis and Recognition], 986–990 (August 2009).
[6] Plamondon, R. and Guerfali, W., "The generation of handwriting with delta-lognormal synergies," Biological Cybernetics 78, 119–132 (1998).
[7] Bezine, H., Alimi, A., and Sherkat, N., "Generation and analysis of handwriting script with the beta-elliptic model," in [Proc. 9th International Workshop on Frontiers in Handwriting Recognition], 167–172 (2004).
[8] Märgner, V. and Pechwitz, M., "Synthetic data for Arabic OCR system development," in [Proc. 6th International Conference on Document Analysis and Recognition], 1159–1164 (2001).
[9] Varga, T. and Bunke, H., "Generation of synthetic training data for an HMM-based handwriting recognition system," in [Proc. 7th International Conference on Document Analysis and Recognition], 618–622 (2003).
[10] Brault, J. and Plamondon, R., "A complexity measure of handwritten curves: modeling of dynamic signature forgery," IEEE Transactions on Systems, Man, and Cybernetics 23, 400–413 (1993).
[11] Konidaris, T., Gatos, B., Ntzios, K., Pratikakis, I., Theodoridis, S., and Perantonis, S., "Keyword-guided word spotting in historical printed documents using synthetic data and user feedback," International Journal of Document Analysis and Recognition 9, 167–177 (2007).
[12] Cheng, W. and Lopresti, D., "Parameter calibration for synthesizing realistic-looking variability in offline handwriting," in [Proc. Document Recognition and Retrieval XVIII (IS&T/SPIE International Symposium on Electronic Imaging)], (2011). Accepted for publication.
[13] The Linguistic Data Consortium, http://www.ldc.upenn.edu/.
[14] Cao, H., Prasad, R., and Natarajan, P., "A stroke regeneration method for cleaning rule-lines in handwritten document images," in [Proc. MOCR Workshop at the 10th International Conference on Document Analysis and Recognition], (2007).
[15] Chang, C.-C. and Lin, C.-J., LIBSVM: a library for support vector machines (2001). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[16] Dietterich, T., "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation 10, 1895–1923 (1998).
[11] Konidaris, T., Gatos, B., Ntzios, K., Pratikakis, L., Theodoridis, S., and Perantonis, S., “Keyword-guided word spotting in historical printed documents using synthetic data and user feedback,” International Journal of Document Analysis and Recognition 9, 167–177 (2007). [12] Cheng, W. and Lopresti, D., “Parameter calibration for synthesizing realistic-looking variability in offline handwriting,” in [Proc. Document Recognition and Retrieval XVIII (IS&T/SPIE International Symposium on Electronic Imaging)], (2011). accepted for publication. [13] “The linguistic data consortium.” http://www.ldc.upenn.edu/. [14] Cao, H., Prasad, R., and Natarajan, P., “A stroke regeneration method for cleaning rule-lines in handwritten document images,” in [Proc. of the MOCR workshop at the 10th international Conference on Document Analysis and Recognition ], (2007). [15] Chang, C.-C. and Lin, C.-J. in [LIBSVM: a library for support vector machines], (2001). Software available at http://www.csie.ntu.edu.tw/ cjlin/libsvm. [16] Dietterich, T., “Approximate statistical tests for comparing supervised classification learning algorithms,” Neural Computation 10, 1895–1923 (1998).