Writer Identification Using Directional Element Features and Linear Transform
Xianliang Wang, Xiaoqing Ding, Hailong Liu
State Key Laboratory of Intelligent Technology and Systems, Department of Electronic Engineering, Tsinghua University, Beijing 100084, P.R. China
[email protected]

Abstract
In this paper, we present an effective method for writer identification that uses a single Chinese character as the script, which makes it flexible and easy to use in practice. Directional element features are first extracted from the handwritten character script, and the dimensionality of the features is reduced using PCA in order to cope with the small-sample-size problem. The most discriminative features are then extracted from the reduced feature space using Fisher's Linear Discriminant Analysis, and the Euclidean distance is used for classification. Experimental results verify the effectiveness of the proposed method.
1. Introduction
Writer identification is the process of confirming a writer's identity by comparing specific attributes of his or her handwriting with those of all the writers enrolled in a reference database. According to the script used, identification methods can be classified into two types: text-insensitive (or text-independent) approaches and text-sensitive (or text-dependent) ones [1]. Said et al. [3] proposed a text-insensitive method that treats the handwriting as a textural image and extracts textural features with Gabor filters; this method requires whole pages of handwritten text. Srihari et al. [4] combined macro and micro features for writer identification, but the macro features extracted at the paragraph level are usually not stable.
In this paper, we propose a text-sensitive writer identification method that uses only a single character as the handwriting script. First, directional element features (DEFs), which have been used successfully in Chinese character recognition [5][6], are extracted from the handwriting script. Then, in order to handle the small-sample-size problem and to improve identification accuracy, Principal Component Analysis (PCA) is used to reduce the dimensionality of the features, and Linear Discriminant Analysis (LDA) is used to extract the most discriminative features from the reduced feature space. Finally, a Euclidean distance classifier assigns a handwriting sample of unknown authorship to one of the writers. The DEF extraction method used here is similar to Yoshimura's weighted directional index histogram method [2]; the main difference is that we apply a linear transform to improve the separability of the DEFs.
The paper is organized as follows. Section 2 introduces DEF extraction, and section 3 gives a brief introduction to PCA and LDA. Section 4 describes the classifier design, section 5 presents the experimental results, and conclusions are drawn in section 6.
2. DEFs Extraction
Directional element features are frequently and successfully used in Chinese character recognition [5][6]. In character recognition, characters are first normalized by a nonlinear normalization method in order to suppress variations in writing style. In writer identification, however, we must preserve the different writing styles of different people, so nonlinear normalization cannot be used; instead, we use a linear normalization method called gravity-center normalization.
2.1. Linear Normalization
Let the original script be [F(i,j)] of size W x H, where W is the width of the script and H its height. The gravity center of the script, G = (Gi, Gj), is computed as

$$G_i = \frac{\sum_{i=1}^{W}\sum_{j=1}^{H} i \cdot F(i,j)}{\sum_{i=1}^{W}\sum_{j=1}^{H} F(i,j)}, \qquad G_j = \frac{\sum_{i=1}^{W}\sum_{j=1}^{H} j \cdot F(i,j)}{\sum_{i=1}^{W}\sum_{j=1}^{H} F(i,j)} \qquad (1)$$

The normalized script is [A(i,j)] of size M x M, and the pixel value of the normalized script at (i,j) is equal to the pixel value of the original script at (m,n), where

$$m = \begin{cases} \dfrac{2G_i}{M}\, i, & \text{if } i < M/2 \\[4pt] W - \dfrac{2(G_i - W)}{M}\,(i - M), & \text{if } i \ge M/2 \end{cases} \qquad n = \begin{cases} \dfrac{2G_j}{M}\, j, & \text{if } j < M/2 \\[4pt] H - \dfrac{2(G_j - H)}{M}\,(j - M), & \text{if } j \ge M/2 \end{cases} \qquad (2)$$

Through this linear normalization, we eliminate differences in script size while preserving the writing style of each person.
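For concreteness, the sketch below is a minimal NumPy implementation of the gravity-center normalization of equations (1)-(2). The array name script, the output size M, and the nearest-neighbor sampling are assumptions made for the example, not details specified in the paper.

```python
import numpy as np

def gravity_center_normalize(script: np.ndarray, M: int = 64) -> np.ndarray:
    """Linearly normalize a W x H script image to M x M, mapping the gravity
    center of the ink to the center of the output (equations (1)-(2))."""
    W, H = script.shape
    total = script.sum()
    if total == 0:
        return np.zeros((M, M), dtype=script.dtype)

    # Equation (1): gravity center of the script.
    ii, jj = np.meshgrid(np.arange(1, W + 1), np.arange(1, H + 1), indexing="ij")
    Gi = (ii * script).sum() / total
    Gj = (jj * script).sum() / total

    normalized = np.zeros((M, M), dtype=script.dtype)
    for i in range(1, M + 1):
        # Equation (2): piecewise-linear mapping of the i coordinate.
        if i < M / 2:
            m = 2 * Gi * i / M
        else:
            m = W - 2 * (Gi - W) * (i - M) / M
        for j in range(1, M + 1):
            if j < M / 2:
                n = 2 * Gj * j / M
            else:
                n = H - 2 * (Gj - H) * (j - M) / M
            # Nearest-neighbor sampling (an assumption; the paper does not
            # specify the interpolation), clipped to the valid index range.
            mi = min(max(int(round(m)) - 1, 0), W - 1)
            nj = min(max(int(round(n)) - 1, 0), H - 1)
            normalized[i - 1, j - 1] = script[mi, nj]
    return normalized

# Example with a random gray-level image standing in for a real script.
norm = gravity_center_normalize(np.random.rand(80, 90))
```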
2.2. Feature Vector Extraction
After normalization, contour extraction is performed. Each contour pixel is then assigned a 4-dimensional vector that measures four types of directional element (horizontal, vertical, and two diagonal) attributes according to its neighboring contour pixels. The script is divided into N1 x N1 sub-blocks, and the vectors of the contour pixels in each sub-block are accumulated, which we denote Ckl(h), Ckl(v), Ckl(+), Ckl(-) (1 <= k <= N1, 1 <= l <= N1). The script is then divided into N2 x N2 sub-areas, each containing several sub-blocks; the (x,y) sub-area contains the sub-blocks (k,l) in Dxy, where

$$D_{xy} = \{(k,l) \mid \max(1,\, 2x-2) \le k \le \min(N_1,\, 2x),\; \max(1,\, 2y-2) \le l \le \min(N_1,\, 2y)\}$$

The DEFs of the m-th sub-area (m = N2 * x + y) are extracted as

$$C_m^{(h)}(x,y) = \sum_{(k,l) \in D_{xy}} C_{kl}^{(h)} \cdot w(k - (2x-1),\, l - (2y-1))$$
$$C_m^{(v)}(x,y) = \sum_{(k,l) \in D_{xy}} C_{kl}^{(v)} \cdot w(k - (2x-1),\, l - (2y-1))$$
$$C_m^{(+)}(x,y) = \sum_{(k,l) \in D_{xy}} C_{kl}^{(+)} \cdot w(k - (2x-1),\, l - (2y-1))$$
$$C_m^{(-)}(x,y) = \sum_{(k,l) \in D_{xy}} C_{kl}^{(-)} \cdot w(k - (2x-1),\, l - (2y-1)) \qquad (3)$$

where w(x,y) is a weighting function, N1 = 13, and N2 = 7. Aligning the DEFs extracted from the N2 x N2 sub-areas into a column vector

$$V = [C_1^{(h)}, C_1^{(v)}, C_1^{(+)}, C_1^{(-)}, \ldots, C_{N_2^2}^{(h)}, C_{N_2^2}^{(v)}, C_{N_2^2}^{(+)}, C_{N_2^2}^{(-)}]^T$$

we then get the DEFs of the script.
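As a rough illustration of how the per-sub-block direction counts are pooled into the final 4 * N2^2 = 196-dimensional DEF vector of equation (3), the sketch below assumes the contour extraction and direction assignment have already produced an N1 x N1 x 4 array C of direction counts (those steps are not shown), and it uses a simple pyramid-shaped weighting function as a stand-in for the paper's unspecified w(x,y).

```python
import numpy as np

N1, N2 = 13, 7  # sub-block and sub-area grid sizes used in the paper

def weight(dx: int, dy: int) -> float:
    """Pyramid-shaped weighting function w(x, y); the exact form used in the
    paper is not given, so this choice is an assumption for illustration."""
    return max(0.0, 2 - abs(dx)) * max(0.0, 2 - abs(dy))

def def_vector(C: np.ndarray) -> np.ndarray:
    """Pool per-sub-block direction counts C (shape N1 x N1 x 4, directions
    h, v, +, -) into the 4 * N2**2 dimensional DEF vector of equation (3)."""
    assert C.shape == (N1, N1, 4)
    features = []
    for x in range(1, N2 + 1):
        for y in range(1, N2 + 1):
            # D_xy: the sub-blocks covered by the (x, y) sub-area.
            acc = np.zeros(4)
            for k in range(max(1, 2 * x - 2), min(N1, 2 * x) + 1):
                for l in range(max(1, 2 * y - 2), min(N1, 2 * y) + 1):
                    acc += C[k - 1, l - 1] * weight(k - (2 * x - 1), l - (2 * y - 1))
            features.extend(acc)  # order: h, v, +, - for this sub-area
    return np.asarray(features)  # length 4 * 49 = 196

# Example with random counts standing in for real contour statistics.
V = def_vector(np.random.rand(N1, N1, 4))
```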
3. PCA and LDA for Writer Identification
Although DEFs have considerable classification ability, methods for reducing their dimensionality are required in order to further increase the identification accuracy and to cope with the fact that the number of training samples is small compared with the DEF dimension. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two widely used dimension-reduction methods; we briefly describe them below.

3.1. Principal Component Analysis
PCA [7] is defined by the linear transformation

$$Y = W^T X \qquad (4)$$

where X is the original n-dimensional vector, Y is the transformed m-dimensional vector, and W is the transform matrix. The m columns of W are the m eigenvectors of the covariance matrix Sigma_x whose corresponding eigenvalues are dominant. Let N be the number of training samples, Sigma_x the covariance matrix, and X-bar the mean vector; then

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i, \qquad \Sigma_x = \frac{1}{N}\sum_{i=1}^{N} (X_i - \bar{X})(X_i - \bar{X})^T \qquad (5)$$

Although PCA finds the most expressive features, there is no guarantee that these features are suited for discriminating between data belonging to different classes. The most discriminant features can be found through LDA.

3.2. Linear Discriminant Analysis
Let Y be the original m-dimensional feature vector, and let Z = V^T Y be the new d-dimensional vector obtained by the LDA [7] transform, where V is the transform matrix. Assume there are c classes with class mean vectors M_i, i = 1, 2, ..., c, and that class i has N_i samples. The within-class scatter matrix is defined as

$$S_w = \sum_{i=1}^{c} p_i \sum_{j=1}^{N_i} (Y_j - M_i)(Y_j - M_i)^T \qquad (6)$$

and the between-class scatter matrix is defined as

$$S_b = \sum_{i=1}^{c} p_i (M_i - M)(M_i - M)^T \qquad (7)$$

where M is the total mean vector and p_i, i = 1, 2, ..., c, are the a priori probabilities of the classes. The transform matrix is obtained by maximizing the ratio det(V^T S_b V) / det(V^T S_w V). It has been proven [7] that the d columns of the transform matrix V are the d eigenvectors of the matrix S_w^{-1} S_b whose corresponding eigenvalues are dominant.
3.3. Writer Identification by PCA and LDA
In writer identification, the number of training samples is smaller than the dimension of the DEFs, so LDA cannot be used directly because the within-class scatter matrix S_w is degenerate. This is solved by first using PCA to reduce the dimensionality of the original feature space and then applying the LDA transform. Let W be the PCA transform matrix and V the LDA transform matrix. The transform matrix of PCA plus LDA is then U = WV, and the original feature vector X becomes a new feature vector Z through

$$Z = U^T X \qquad (8)$$
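The following sketch illustrates the PCA-then-LDA pipeline of equations (4)-(8) on synthetic data. The dimensionalities m and d, the random toy training set, and the use of numpy.linalg routines are assumptions made for the example, not choices stated in the paper.

```python
import numpy as np

def pca_lda_transform(X: np.ndarray, labels: np.ndarray, m: int, d: int) -> np.ndarray:
    """Return the combined PCA + LDA transform U = W V of equation (8).

    X      : training samples, one DEF vector per row
    labels : writer index of each training sample
    m, d   : PCA and LDA output dimensions (assumed parameters)
    """
    # PCA, equations (4)-(5): dominant eigenvectors of the covariance matrix.
    X_mean = X.mean(axis=0)
    cov = (X - X_mean).T @ (X - X_mean) / len(X)
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec[:, np.argsort(eigval)[::-1][:m]]         # n x m
    Y = X @ W                                           # equation (4), rows = samples

    # LDA, equations (6)-(7): within- and between-class scatter matrices.
    M_total = Y.mean(axis=0)
    Sw = np.zeros((m, m))
    Sb = np.zeros((m, m))
    for c in np.unique(labels):
        Yc = Y[labels == c]
        Mc = Yc.mean(axis=0)
        p = len(Yc) / len(Y)                            # a priori class probability
        Sw += p * (Yc - Mc).T @ (Yc - Mc)               # equation (6)
        Sb += p * np.outer(Mc - M_total, Mc - M_total)  # equation (7)

    # The d dominant eigenvectors of Sw^{-1} Sb form the LDA transform V.
    eigval, eigvec = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(np.real(eigval))[::-1][:d]
    V = np.real(eigvec[:, order])                       # m x d

    return W @ V                                        # U = W V

# Toy usage: 25 writers, 10 training DEF vectors each, 196-dimensional features.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(25), 10)
X = rng.normal(size=(250, 196)) + 0.5 * labels[:, None] / 25.0
U = pca_lda_transform(X, labels, m=60, d=24)
Z = X @ U   # equation (8): Z = U^T X, applied to each sample (row)
```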
4. Classifier
We use a Euclidean distance classifier to identify the writer. For a test script of unknown authorship, the DEFs are first extracted, and the feature vector is then transformed into a new vector Z using (8). The writer of the test script is identified as writer k if

$$k = \arg\min_{1 \le i \le c} \mathrm{Dist}(i)$$

where Dist(i) is the Euclidean distance between the transformed vector of the test script and the mean vector of writer i.
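A minimal sketch of this nearest-mean decision rule follows; the variable names and the assumption that per-writer mean vectors are computed from the transformed training features are illustrative.

```python
import numpy as np

def identify_writer(z: np.ndarray, writer_means: np.ndarray) -> int:
    """Return the index k of the writer whose mean transformed feature vector
    is closest (in Euclidean distance) to the transformed test vector z."""
    dists = np.linalg.norm(writer_means - z, axis=1)  # Dist(i) for each writer
    return int(np.argmin(dists))

# Example: c writers, each represented by the mean of its transformed training
# vectors Z; the random values below stand in for real data.
c, d = 25, 24
writer_means = np.random.rand(c, d)
z_test = np.random.rand(d)
k = identify_writer(z_test, writer_means)
```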
5. Experimental Results
We use two databases to evaluate the effectiveness of the proposed method. The first consists of 34 Chinese characters used as scripts, each written 16 times by 25 people; we call this database SET1. The second consists of 20 Chinese characters used as scripts, each written 16 times by 27 people and once by each of the remaining 599 writers, so there are 626 writers in total; we call this database SET2.
In the first preprocessing step, linear normalization is carried out as described in section 2.1. In the second step, the DEFs are extracted using the method described in section 2.2. In the last step, the DEFs are projected into a new feature space using the combined PCA and LDA linear transform described in section 3. Writer identification is evaluated separately for each character script.
For SET1, we use 10 samples of each character script from each writer as the training set and the remaining 6 samples as the test set, giving a total of 250 training samples and 150 test samples per character script. For SET2, 10 samples of each character script from each of the first 27 writers and 1 sample from each of the remaining writers are used for training, and the remaining samples are used for testing; this gives 869 training samples and 162 test samples per character script.
The identification results on SET1 for the 34 character scripts are listed in Table 1, together with the results of using the DEFs directly for classification. From Table 1, we can see that the correct identification rate differs from one character script to another depending on the structural complexity of the script (for example, the identification rate of the character script " " using DEFs plus PCA & LDA is 100.00%, while that of " " is 92.00%). We can also see that DEFs plus the PCA & LDA transform outperform direct use of the DEFs for all but two character scripts. The average identification rate for direct use of DEFs is 93.88%, with an average cumulative rate of 98.92% for 5 hypotheses, while for DEFs plus PCA & LDA the corresponding figures are 96.12% and 99.37%, an improvement of 2.24% and 0.45%, respectively.
SET2 is used to evaluate the classification ability of the proposed method when the number of classes is large. Since the identification rate using DEFs plus the PCA & LDA transform is better than that of the DEFs alone, we list only the results for DEFs plus PCA & LDA, which can be found in Table 2. The identification rate decreases as the number of classes increases. One main reason is that only 1 training sample is available for each of the 599 additional writers. It is also natural that, with more writers, the nearest neighbor to a query sample has a higher chance of belonging to a wrong writer rather than to the correct one. Again, the structural complexity of the character script influences the identification rate: in general, the correct rate for a character script with a complex structure is higher than for one with a simple structure.
Table 1. Identification results in SET1 using a single character script (34 scripts; correct rate and cumulative rate of 5 hypotheses, in %)

Direct use of DEFs, per script (correct rate and cumulative rate of 5 hypotheses):
98.67 94.67   94.67 98.67   94.00 98.67   92.00 98.67   96.00 99.33
95.33 98.67   92.00 98.67   97.33 98.67   95.33 99.33   90.00 98.00
96.00 99.33   98.67 94.00   94.00 98.67   94.00 99.33   92.67 100.00
90.67 98.00   92.67 98.67   97.33 100.00  94.67 100.00  95.33 100.00
97.33 99.33   93.33 99.33   97.33 100.00  98.00 100.00  89.33 97.33
92.67 98.00   97.33 100.00  88.67 98.00   94.67 98.67   92.67 98.00
96.67 100.00  88.67 98.00   88.67 97.33   94.00 99.33
Average: correct rate 93.88, cumulative rate of 5 hypotheses 98.92

DEFs + PCA & LDA, per script (correct rate and cumulative rate of 5 hypotheses):
92.67 98.67   100.00 100.00  98.67 95.33   98.67 93.33   100.00 98.00
98.67 97.33   100.00 98.67   99.33 99.33   99.33 96.00   98.67 92.00
99.33 96.67   92.67 98.67    98.67 95.33   100.00 97.33  100.00 98.67
99.33 93.33   99.33 94.00    100.00 97.33  100.00 98.00  100.00 97.33
99.33 99.33   100.00 97.33   100.00 98.00  100.00 98.67  98.00 94.00
100.00 96.00  100.00 98.00   99.33 94.00   98.67 97.33   98.67 92.67
99.33 97.33   100.00 94.67   99.33 93.33   98.67 94.00
Average: correct rate 96.12, cumulative rate of 5 hypotheses 99.37

Table 2. Identification results in SET2 using a single character script (correct rate in %, 20 scripts)

77.16  92.59  82.72  91.36  87.65
86.42  88.89  81.48  74.69  82.10
83.95  84.57  69.14  80.86  83.33
85.80  83.33  83.33  88.89  54.94
Average: 82.16

6. Conclusion
We have proposed a new text-sensitive writer identification method that uses a single Chinese character as the script. Directional element features are first extracted from the handwritten character script, and the dimensionality of the DEFs is reduced using the PCA transform. The most discriminative features are then extracted from the reduced feature space using Fisher's Linear Discriminant Analysis, and the Euclidean distance is used for classification. Experiments demonstrate the effectiveness of the proposed method. We expect that combining several character scripts can further improve the identification accuracy; the combination scheme will be addressed in future research.

Acknowledgements
This work is supported by the 863 Hi-tech Plan (project 2001AA114081) and the National Natural Science Foundation of China (project 69972024).

References
[1] R. Plamondon, G. Lorette, "Automatic Signature Verification and Writer Identification - The State of the Art", Pattern Recognition, vol. 22, no. 2, pp. 107-131, 1989.
[2] I. Yoshimura, M. Yoshimura, "Off-line Writer Verification Using Ordinary Characters as the Object", Pattern Recognition, vol. 24, no. 9, pp. 909-915, 1991.
[3] H. E. S. Said, K. D. Baker, T. N. Tan, "Personal Identification Based on Handwriting", Proc. of the IEEE 14th International Conference on Pattern Recognition, vol. 2, 1998, pp. 1761-1764.
[4] S. N. Srihari, S. Cha, S. Lee, "The Discriminatory Power of Handwriting", Proceedings SPIE, Document Recognition and Retrieval IX, vol. 4670, San Jose, CA, January 2002, pp. 129-142 (invited paper).
[5] N. Kato, M. Suzuki, S. Omachi, et al., "A Handwritten Character Recognition System Using Directional Element Feature and Asymmetric Mahalanobis Distance", IEEE Trans. on PAMI, vol. 21, no. 3, Mar. 1999.
[6] J. Zhang, X. Ding, C. Liu, "Multi-Scale Feature Extraction and Nested-Subset Classifier Design for High Accuracy Handwritten Character Recognition", Proc. 15th Int. Conf. on Pattern Recognition, vol. 2, 2000, pp. 581-584.
[7] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, New York, 1990.