COLOR LASER PRINTER IDENTIFICATION BY ANALYZING STATISTICAL FEATURES ON DISCRETE WAVELET TRANSFORM

Jung-Ho Choi†, Dong-Hyuck Im†, Hae-Yeoun Lee††, Jun-Taek Oh†††, Jin-Ho Ryu†††, and Heung-Kyu Lee†

† School of Electrical Engineering and Computer Science, KAIST
†† School of Computer and Software Engineering, Kumoh National Institute of Technology
††† Information Technology Laboratory, Technology Research Institute, Korea Minting & Security Printing Corporation
Republic of Korea

ABSTRACT

Color laser printers are increasingly abused to print or forge official documents and bills. Identifying the color laser printer that produced a document is therefore a useful step in media forensics. This paper presents a new method to identify color laser printers from printed color images. First, 39 noise features are extracted from printed color images by statistical analysis of the HH sub-band of the discrete wavelet transform. These features are then used to train a support vector machine that classifies the source color laser printer. In the experiments, 9 models from 4 brands (Xerox, Konica, HP, and Canon) are tested on three tasks: classifying the brand, the color toner, and the model of the color laser printer. The results show that the presented identification method performs well using the noise features of printed color images.

Index Terms: Media forensics, Color laser printer identification, Discrete wavelet transform

1. INTRODUCTION

Technical advances in the digital world have led to highly efficient, low-priced digital devices (e.g., digital cameras, scanners, and printers). People can acquire high-quality images and printed documents with these devices. Furthermore, image processing tools allow even novices to modify images conveniently and imperceptibly. Since fake documents can be produced with high-quality laser printers, authenticating images and printed documents is a big challenge, and forensic tools help establish the origin and authenticity of such images and documents.

Recently, digital forensics research has been growing in various fields. One group of methods tries to identify the imaging source by specifying characteristics of the image acquisition device. Finding the device that produced a forged image helps to track down the person who forged it. Another group tries to identify the printing device used to print documents. There have been several printer identification

978-1-4244-5654-3/09/$26.00 ©2009 IEEE


methods for monochrome (black-and-white) laser printers [1][2][3]. Mikkilineni et al. presented printer identification based on a printing defect known as banding in monochrome laser printers [1]. Printers have different sets of banding frequencies that depend on brand and model. These frequency features are easily extracted from halftone images. However, since it is difficult to obtain the banding frequencies from text, Mikkilineni et al. introduced printer identification for text that uses gray level co-occurrence features of monochrome laser printers [2].

This paper presents a color laser printer identification scheme for halftone images. In forensic analysis, there has been no prior work on color laser printers. Because color laser printers are often used to forge official documents and bills, authenticating color printed documents is especially important. After color noise is characterized through wavelet analysis, 39 statistical features are extracted from the color noise. These features are used to train a support vector machine (SVM) classifier [4][5], which identifies the color printer used to print an unknown document.

The paper is organized as follows. Section 2 explains the overall scheme with a block diagram. Section 3 presents the statistical features of color noise from printed color images. Section 4 summarizes the experimental results, and Section 5 concludes.

2. PRINTER IDENTIFICATION SCHEME

The overall structure for identifying color laser printers is depicted in Fig. 1 and consists of two steps: training and classifying. In the training step, the SVM classifier is trained on features extracted from various color images printed by known color laser printers. In the classifying step, when an unknown document is presented, the color laser printer used to print it is identified. The two steps share a common feature extraction process. First, the printed color images are scanned in the RGB domain and transformed to the CMYK domain.
Then, color noise in the transformed image is characterized through wavelet analysis. In the 1-level 2-D discrete wavelet transform (DWT), an image is decomposed into 4 sub-bands: low-low (LL), low-high (LH), high-low (HL), and high-high (HH), as shown in Fig. 2. Among these four sub-bands, the HH sub-band contains the high-frequency components, which represent the color noise of the image. 39 statistical features are extracted from the HH sub-band of each image, as explained in Section 3. Each feature vector is used to train the SVM classifier. In the classifying step, the trained SVM classifier identifies the color printer that printed the unknown document.

ICIP 2009

Fig. 1. Overall structure for color laser printer identification.

Fig. 2. Filter structure of 1-level 2-D DWT and its example.

Fig. 3. Printed image (top) and its HH sub-band of DWT (bottom). (a) Original image, (b) Xerox Document centre C400, (c) Canon iRC3200N, and (d) HP Color LaserJet 4650.

3. COLOR NOISE FEATURES

Determining a set of features that characterizes each color laser printer is critical to the performance of printer identification from printed images. In this paper, a noise-based identification scheme for scanners [6] is adapted to extract features. We observe that the same image printed on different color laser printers looks slightly different to the human visual system (HVS). These differences are clear in the HH sub-band of the DWT image. Fig. 3 shows examples of the HH sub-band after 1-level 2-D DWT for several printer brands. Although the difference cannot be quantified in the spatial domain, the HH sub-band of the DWT image, i.e., the color noise, exhibits characteristics specific to each color laser printer.

To model the characteristics of each HH sub-band, statistical analysis is performed on the HH sub-band image [7]. Three statistical features are extracted from each color channel (red, green, blue): standard deviation, skewness, and kurtosis. Covariance and correlation are calculated between each pair of color channels. Table 1 summarizes these features. Since there are three pairs among the three color channels (red and green, green and blue, blue and red), 15 features (3 × 3 + 2 × 3) can be extracted from the RGB channels of an HH sub-band image.

Images are expressed in the RGB domain for display on a monitor. When they are printed on color laser printers, however, they are expressed in the CMYK domain, since color laser printers use 4 color toners: cyan, magenta, yellow, and black. Hence, we also extract features in the CMYK domain for a more faithful analysis. Since the RGB and CMYK domains are both device-dependent, no simple conversion formula between the two domains exists. We used a MATLAB 7.6 function to convert from the RGB domain to the CMYK domain [8]. As in the RGB domain, statistical features are obtained from the CMYK channels of an HH sub-band image. Since there are 4 color channels and 6 channel pairs in the CMYK domain, 24 features (3 × 4 + 2 × 6) per image can be extracted. In total, 39 statistical features from the HH sub-band of a DWT image form the feature vector that characterizes it.

Table 1. Statistical features for the presented scheme

Feature                    Formula
Standard deviation (Sdv)   $s = \sqrt{\frac{1}{N-1} \sum_{x,y=1}^{N} \left( I(x,y) - \bar{I} \right)^2}$
Skewness                   $\frac{1}{N \cdot s^3} \sum_{x,y=1}^{N} \left( I(x,y) - \bar{I} \right)^3$
Kurtosis                   $\frac{1}{N \cdot s^4} \sum_{x,y=1}^{N} \left( I(x,y) - \bar{I} \right)^4$
Covariance                 $\mathrm{Cov}(I_1, I_2) = \frac{1}{N} \sum_{x,y=1}^{N} I_1(x,y) \cdot I_2(x,y) - \left( \frac{1}{N_1} \sum_{x,y=1}^{N_1} I_1(x,y) \right) \cdot \left( \frac{1}{N_2} \sum_{x,y=1}^{N_2} I_2(x,y) \right)$
Correlation                $\mathrm{Corr}(I_1, I_2) = \frac{\mathrm{Cov}(I_1, I_2)}{\sqrt{\mathrm{Var}(I_1) \cdot \mathrm{Var}(I_2)}}$

where $\bar{I} = \frac{1}{N} \sum_{x,y=1}^{N} I(x,y)$.

4. EXPERIMENTAL RESULTS

This section presents experimental results that demonstrate the performance of our color laser printer identification scheme based on 39 statistical color noise features. 99 digital images with various colors were printed on 9 color laser printers from 4 brands: Xerox, Konica, HP, and Canon. The test images were 512 × 512 pixels. All printed images were scanned at 300 dpi with an Epson Perfection 2400 scanner. For a convenient test environment, each scanned 512 × 512 pixel image was divided equally into four 256 × 256 pixel images. Thus, 3,564 images (99 × 9 × 4) were used in the experiments. The 9 color laser printers and their brands are listed in Table 2. The symbols identify the brand and model in Table 3, Table 5, and Table 6. Some test images are shown in Fig. 4. Of the 3,564 images, half were selected at random to train the SVM classifier, LIBSVM [9][10]. The other half were used as test images to identify their source device.

Table 2. A list of color laser printer models

Symbol  Brand           Model
XA      Xerox           Document centre C400
XB      Xerox           Document centre C450
XC      Xerox           DocuCentre C5540 I
XD      Xerox           DocuCentre C6550 I
KA      Konica Minolta  bizhub C250
HA      HP              Color LaserJet 4650
CA      Canon           iRC2620
CB      Canon           iRC3200N
CC      Canon           iRC2600N

Fig. 4. A few sample images in our experiment.
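The feature computation of Section 3 and Table 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper does not name the wavelet, so a Haar wavelet is assumed; Var is assumed to use the same 1/N normalization as Cov; the helper names (`haar_hh`, `channel_stats`, `pair_stats`, `noise_features`) are hypothetical; and random pixel values stand in for scanned images. Scanning and the RGB to CMYK conversion are not shown.

```python
import math
import random
from itertools import combinations


def haar_hh(channel):
    """HH sub-band of a 1-level 2-D Haar DWT (the Haar wavelet is an assumption).

    `channel` is a 2-D list of numbers with even height and width.
    """
    hh = []
    for r in range(0, len(channel) - 1, 2):
        top, bottom = channel[r], channel[r + 1]
        hh.append([
            (top[c] - top[c + 1] - bottom[c] + bottom[c + 1]) / 2.0
            for c in range(0, len(top) - 1, 2)
        ])
    return hh


def channel_stats(values):
    """Standard deviation, skewness, and kurtosis following Table 1."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if s == 0.0:  # flat sub-band: higher moments are undefined, report zeros
        return [0.0, 0.0, 0.0]
    skew = sum((v - mean) ** 3 for v in values) / (n * s ** 3)
    kurt = sum((v - mean) ** 4 for v in values) / (n * s ** 4)
    return [s, skew, kurt]


def pair_stats(a, b):
    """Covariance and correlation between two flattened channels."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b
    # Assumption: Var uses the same 1/N normalization as Cov.
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    denom = math.sqrt(var_a * var_b)
    corr = cov / denom if denom > 0.0 else 0.0
    return [cov, corr]


def noise_features(channels):
    """3 statistics per channel plus 2 statistics per channel pair."""
    flat = [[v for row in haar_hh(ch) for v in row] for ch in channels]
    features = []
    for values in flat:
        features.extend(channel_stats(values))
    for a, b in combinations(flat, 2):
        features.extend(pair_stats(a, b))
    return features


def random_image(num_channels, size=8, rng=random.Random(0)):
    """Synthetic stand-in for a scanned image: one 2-D list per channel."""
    return [[[rng.randint(0, 255) for _ in range(size)] for _ in range(size)]
            for _ in range(num_channels)]


rgb_features = noise_features(random_image(3))   # 3 * 3 + 2 * 3 features
cmyk_features = noise_features(random_image(4))  # 3 * 4 + 2 * 6 features
print(len(rgb_features), len(cmyk_features),
      len(rgb_features) + len(cmyk_features))  # 15 24 39
```

Concatenating the RGB and CMYK feature lists gives the 39-dimensional vector that would be passed to the classifier.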

We expect the color printing techniques to differ for each brand. Therefore, to test whether the presented scheme can identify the printer brand, we grouped the images into 4 sets by brand and trained the SVM classifier. The results are summarized in the confusion matrix of Table 3. The average classification accuracy was 97.89%. This indicates that different brands use different color mixing and printing processes, and that the presented color noise features can capture these differences between brands.

Table 3. Brand identification of color laser printers

in \ out  Xerox   Konica  HP      Canon
Xerox     98.9%   0.2%    0.7%    0.2%
Konica    2.2%    97.8%   0%      0%
HP        5.4%    0.5%    93.0%   1.1%
Canon     2.4%    0.7%    0.7%    96.1%

Next, we tested identification of the color toners of each laser printer. Since some different color laser printers use the same color toners, we checked whether differences in color toners produce differences in printed images. The color toners are listed in Table 4, and the classification results are summarized in Table 5. The average classification accuracy was 92.28%, lower than the 97.89% achieved in the brand identification test. Nevertheless, the presented scheme can still determine the color toners regardless of the printer model.

Finally, we checked whether the presented scheme could classify the individual printer model rather than its brand. Table 6 shows the model classification results for the 9 color laser printers. The average classification accuracy was 80.24%. The presented scheme thus has difficulty distinguishing all 9 color laser printers exactly, which suggests that the presented 39 features are not sufficient to characterize each printer. To achieve higher performance, novel features characterizing each printer are needed. However, these results demonstrate the potential of the presented scheme for color laser printer identification.
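The three experiments differ only in how the 9 printers are grouped into classes. The sketch below shows one way to derive the class labels for each task from the printer symbols of Table 2 and the toner codes of Table 4. The names `task_label` and `classes` are hypothetical and introduced only for illustration; the paper does not describe how the labels were actually encoded.

```python
# Brand per printer symbol, as listed in Table 2.
BRAND = {
    "XA": "Xerox", "XB": "Xerox", "XC": "Xerox", "XD": "Xerox",
    "KA": "Konica Minolta", "HA": "HP",
    "CA": "Canon", "CB": "Canon", "CC": "Canon",
}

# Toner codes (black, cyan, magenta, yellow) per printer, as listed in Table 4.
TONER = {
    "XA": ("CT200206", "CT200207", "CT200208", "CT200209"),
    "XB": ("CT200539", "CT200540", "CT200541", "CT200542"),
    "XC": ("CT200568", "CT200569", "CT200570", "CT200571"),
    "XD": ("CT200568", "CT200569", "CT200570", "CT200571"),
    "KA": ("8938-505", "8938-508", "8938-507", "8938-506"),
    "HA": ("C9720A", "C9721A", "C9723A", "C9722A"),
    "CA": ("NPG-22", "NPG-22", "NPG-22", "NPG-22"),
    "CB": ("NPG-22", "NPG-22", "NPG-22", "NPG-22"),
    "CC": ("NPG-22", "NPG-22", "NPG-22", "NPG-22"),
}


def task_label(symbol, task):
    """Class label of a printer for the brand, toner, or model task."""
    if task == "brand":
        return BRAND[symbol]
    if task == "toner":
        return TONER[symbol]  # printers sharing a toner set share a label
    if task == "model":
        return symbol
    raise ValueError(f"unknown task: {task}")


def classes(task):
    """Group printer symbols by their label for the given task."""
    groups = {}
    for symbol in BRAND:
        groups.setdefault(task_label(symbol, task), []).append(symbol)
    return list(groups.values())


for task in ("brand", "toner", "model"):
    print(task, len(classes(task)), classes(task))
```

Under this grouping the toner task has 6 classes, with XC and XD merged and CA, CB, and CC merged, which matches the row and column labels of Table 5.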

Table 4. A list of color toners with the printer model

Printer  Black     Cyan      Magenta   Yellow
XA       CT200206  CT200207  CT200208  CT200209
XB       CT200539  CT200540  CT200541  CT200542
XC       CT200568  CT200569  CT200570  CT200571
XD       CT200568  CT200569  CT200570  CT200571
KA       8938-505  8938-508  8938-507  8938-506
HA       C9720A    C9721A    C9723A    C9722A
CA       NPG-22    NPG-22    NPG-22    NPG-22
CB       NPG-22    NPG-22    NPG-22    NPG-22
CC       NPG-22    NPG-22    NPG-22    NPG-22

Table 5. Toner identification results of color laser printers

in \ out   XA     XB     XC,XD  KA     HA     CA,CB,CC
XA         89.8%  2.2%   8.1%   0.0%   0.0%   0.0%
XB         1.1%   82.0%  13.4%  0.8%   1.6%   1.1%
XC,XD      2.3%   3.9%   93.5%  0.1%   0.2%   0.0%
KA         0.0%   1.6%   0.0%   98.4%  0.0%   0.0%
HA         0.0%   1.1%   4.3%   0.5%   93.0%  1.1%
CA,CB,CC   0.0%   1.4%   1.0%   0.7%   0.7%   96.1%

Table 6. Color laser printer model identification results (unit: %)

in \ out  XA    XB    XC    XD    KA    HA    CA    CB    CC
XA        88.7  1.6   6.5   3.2   0     0     0     0     0
XB        1.1   84.1  3.5   8.9   0.8   1.6   0     0     0
XC        6.5   4.0   42.5  46.2  0     0.8   0     0     0
XD        0.4   5.0   8.2   86.3  0.1   0     0     0     0
KA        0     1.6   0     0     97.8  0     0     0.5   0
HA        0.5   0.5   1.1   2.7   0.5   93.5  1.1   0     0
CA        0     2.2   0     0.5   0.5   0.5   82.3  10.2  3.8
CB        0     1.1   0     0.5   0     1.1   7.0   87.6  2.7
CC        0     0     2.4   0     2.4   0     4.8   33.3  57.1

5. CONCLUSION

In this paper, we presented a scheme for identifying color laser printers. 39 statistical features from the HH sub-band image of the DWT image were used to train the SVM classifier. The classifier was used to identify the source of unknown documents. Using 9 color laser printers from 4 brands, 99 sample images were printed, scanned, and used for the experiments. The presented scheme was tested on classifying the brand, the color toner, and the model of the color laser printer. The accuracy was 97.89%, 92.28%, and 80.24%, respectively. In future work, we will study additional features that can represent the characteristics of color laser printers, focusing especially on color toner identification. Further experiments with additional color laser printers and images will also be performed.

Acknowledgement

This work was partially supported by the Defense Acquisition Program Administration and the Agency for Defense Development under the contract, the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (No. R0A-2007-000-20023-0), and KOMSCO.

6. REFERENCES

[1] A. K. Mikkilineni, G. N. Ali, P. J. Chiang, G. T.-C. Chiu, J. P. Allebach, and E. J. Delp, “Signature-embedding in printed documents for security and forensic applications,” in Proc. of the SPIE, 2004, vol. 5306, pp. 455–466.

[2] A. K. Mikkilineni, P. J. Chiang, G. N. Ali, G. T.-C. Chiu, J. P. Allebach, and E. J. Delp, “Printer identification based on graylevel co-occurrence features for security and forensic applications,” in Proc. of the SPIE, 2005, vol. 5681, pp. 430–440.

[3] N. Khanna, A. K. Mikkilineni, A. F. Martone, G. N. Ali, G. T.-C. Chiu, J. P. Allebach, and E. J. Delp, “A survey of forensic characterization methods for physical devices,” in Proc. of Digital Forensic Research Workshop, 2006, vol. 3, pp. 17–28.

[4] E. Alpaydin, Introduction to Machine Learning, The MIT Press, 2004.

[5] Support Vector Machine, http://en.wikipedia.org/wiki/Support_vector_machine.

[6] H. Gou, A. Swaminathan, and M. Wu, “Robust scanner identification based on noise features,” in Proc. of the SPIE, 2007, vol. 6505, p. 65050S.

[7] S. M. Ross, Introduction to Probability and Statistics for Engineers and Scientists, Elsevier Academic Press, 3rd edition, 2004.

[8] R. C. Gonzalez, R. E. Woods, and S. L. Eddins, Digital Image Processing Using MATLAB, Prentice-Hall, 2004.

[9] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001, Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[10] C.-C. Chang, C.-W. Hsu, and C.-J. Lin, A practical guide to support vector classification, http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
