A Pixel-Based Digital Photo Authentication Framework via Demosaicking Inter-Pixel Correlation

Na Fan
Department of Electronic Engineering, East China Normal University, 500 Dongchuan Rd, Shanghai, China, +86-021-54342272

Cheng Jin
Department of Computer Science and Engineering, Shanghai Jiaotong University, 800 Dongchuan Rd, Shanghai, China, +86-021-54744702

Yizhen Huang
Computer Sciences Department, University of Wisconsin-Madison, 1210 West Dayton Street, Madison, Wisconsin, United States, 001-608-262-5086
[email protected] [email protected] [email protected]

ABSTRACT
Demosaicked images possess spatially periodic inter-pixel correlation, because the interpolation strategy at any logically equivalent spatial pixel location is identical. Exploiting this statistical characteristic, much research on imaging forensics has been done recently. We propose a generalized neural network framework that simulates the stylized computational rules of demosaicking through bias and weight adjustment. As experiments show, our framework is effective in recognizing the demosaicking algorithms applied to raw CFA images, as well as in digital photo authentication, compared with state-of-the-art methods.

Categories and Subject Descriptors
K.6.5 [Management of Computing and Information Systems]: Security and Protection – Authentication.

General Terms
Algorithms, Design, Experimentation, Security, Verification

Keywords
Photo authentication, demosaicking, inter-pixel correlation, neural network

1. INTRODUCTION
With the help of popular software such as Photoshop, editing digital photos is no longer a professional task, not to mention up-to-date toolkits like Lazy Snapping [1] and Picasa. Doctored photos have become more and more common, which influences people's attitudes towards the credibility of photographs [2]. To cope with this challenge, digital photo forgery detection has become an active topic since 2000 (see, for example, [3] for a short survey). Aside from a few exceptions, e.g. [4], the majority of these detection methods focus on the image processing pipeline of digital cameras and utilize camera hardware limitations as vital clues.

Demosaicking Inter-Pixel Correlation (DIPC) is one such critical clue: Kharrazi et al. [5] tried to identify source cameras by their CFA configuration and color processing/transformation pipeline with a feature-based approach. Celiktutan et al. [6] used binary similarity measures to select features as a refinement. McKay et al. [7] combined statistical noise features and reported recognition rates as high as 98% for standalone cameras. Gallagher et al. [8] high-pass filtered images and then estimated the positional variance of each diagonal, finding periodicities in the variance signal in spread spectrum that indicate the presence of demosaicking. Huang et al. [9] modeled the problem using a quadratic form; if the 169-element quadratic coefficient matrix is considered as features, this model is also appropriately viewed as feature-based. Swaminathan et al. [10] investigated demosaicking artifacts using an analysis-by-synthesis method. A limitation of feature-based approaches is that they merely extract one aspect of the overall statistical model as their metric for classification. Rough metrics neglect potentially salutary image information, which hinders accurate anomaly detection in small cropped windows and demands more sample images for training.

The main idea of our NN framework is to apply the inverse operation against interpolation by learning DIPC by virtue of NNs. With only rough prior knowledge of the interpolation algorithms to be categorized, the flexible topologies and forms of NNs enable us to represent the diverse and amorphous computational rules of many interpolation algorithms. This framework is not only a classifier; it also reveals a systematic series of traces hidden inside images induced by demosaicking.

The rest of the manuscript is arranged as follows: Section 2 details the proposed NN framework; Section 3 presents experimental results; Section 4 draws conclusions and points out possible future work.

2. TAILORED NN FOR INTERPOLATION RECOGNITION
The most prevalent CFA pattern is the Bayer CFA pattern. Figure 1 is a schematic diagram, where r, g and b denote red, green and blue color filters.
The original intention of the inventors of NNs was to mimic the rationale of biological neurons, and NNs have proved useful in many applications of cognitive science. In our case, however, the NN is applied to learn a kind of mechanical interpolation rule: it models the multiform, unknown non-linear relationship between the pixel under consideration and its neighboring pixels, namely the correlated pixel set. The correlated pixels are selected from a cross-shaped neighborhood, shown as the grayed pixels in Figure 1; the basis for this choice is that natural images feature higher correlation in the horizontal and vertical directions [11].
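To make the source of such inter-pixel correlation concrete, here is a minimal, self-contained sketch (not the paper's code; the 8×8 size and random data are illustrative assumptions): a single color plane is sampled on a checkerboard and the missing sites are filled by bilinear averaging. Afterwards every interpolated pixel satisfies an exact, spatially periodic linear relation with its neighbors — precisely the DIPC that the NN framework learns.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((8, 8))          # stand-in for one color plane of a scene

# Bayer-like sampling of this plane: keep pixels where (x+y) is even
mask = (np.add.outer(np.arange(8), np.arange(8)) % 2) == 0
mosaic = np.where(mask, img, 0.0)

# Bilinear demosaicking: each missing pixel = mean of its 4 neighbours
demos = mosaic.copy()
for y in range(1, 7):
    for x in range(1, 7):
        if not mask[y, x]:
            demos[y, x] = (mosaic[y-1, x] + mosaic[y+1, x] +
                           mosaic[y, x-1] + mosaic[y, x+1]) / 4.0

# Every interpolated pixel now satisfies an exact linear relation with its
# neighbours at a 2-pixel spatial period -- the DIPC exploited in this paper.
resid = [demos[y, x] - (demos[y-1, x] + demos[y+1, x] +
                        demos[y, x-1] + demos[y, x+1]) / 4.0
         for y in range(1, 7) for x in range(1, 7) if not mask[y, x]]
print(max(abs(r) for r in resid))  # 0.0
```

An untouched (non-demosaicked) plane would leave large residuals here, which is the basic signal all the classifiers below build on.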
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM&Sec’09, September 7-8, 2009, Princeton, New Jersey, USA. Copyright 2009 ACM 978-1-60558-492-8/09/09...$10.00.
g-3,3  r-2,3  g-1,3  r0,3  g1,3  r2,3  g3,3
b-3,2  g-2,2  b-1,2  g0,2  b1,2  g2,2  b3,2
g-3,1  r-2,1  g-1,1  r0,1  g1,1  r2,1  g3,1
b-3,0  g-2,0  b-1,0  g0,0  b1,0  g2,0  b3,0
g-3,-1 r-2,-1 g-1,-1 r0,-1 g1,-1 r2,-1 g3,-1
b-3,-2 g-2,-2 b-1,-2 g0,-2 b1,-2 g2,-2 b3,-2
g-3,-3 r-2,-3 g-1,-3 r0,-3 g1,-3 r2,-3 g3,-3

Figure 1. The Bayer CFA pattern, with relative coordinates numbered for reference; (0,0) is the coordinate of the pixel under consideration, and the grayed pixels belong to the correlated pixel set.

2.1 Preprocessing and PCA
Incorporating problem-dependent structural information in the architecture of an NN often lowers the overall complexity. To facilitate the establishment of certain frequently encountered quantitative relations, we expand the input data dimension by appending the ratio hues and pixel value logarithms. Hence an observation set consists of the 25 pixels in the correlated pixel set, where each pixel has 3 color components, 3 corresponding logarithms, and 2 ratio hues. Note that the known interpolation result and its relevant auxiliary parameters should not be fed to the NN as input, which would otherwise entice the NN to degrade into a simple response function to it. This excludes 1 color component, its corresponding logarithm, and the 2 ratio hues at the pixel under consideration from the input data, yielding a 25×(3+3+2)-1-1-2=196 dimensional observation set β.

Color of the same channel varies smoothly in natural images, especially in low-gradient flat image patches, which account for a larger proportion than high-gradient steep patches. This usually results in extraordinarily close values within the same channel of a correlated pixel set, and makes the covariance matrix in PCA almost singular. To avoid singularity when computing eigenvalues, only the first value from a channel remains intact, and the remaining values are replaced by their differences from the first value. The dimensionality of β is excessively large and is deemed to contain data redundancy that ought to be distilled by PCA: assume there are L samples of the observation set {β1, β2…, βL} with average vector β̄. The difference vectors are βi*=βi-β̄ (1≤i≤L), so the estimate of the covariance matrix is C=[β1*, β2*…, βL*][β1*, β2*…, βL*]T/(L-1). Compute the eigenvalues and eigenvectors of C, denoted {λ1, λ2…, λ196} and {ξ1, ξ2…, ξ196} (λ1≥λ2≥…≥λ196) respectively, and choose the M eigenvectors with the M largest eigenvalues to form the feature matrix V=[ξ1, ξ2…, ξM]T. Experiments indicate that M=60 is enough for our case. Finally, each βi* (i=1,2,…,L) is turned into Γi=Vβi* with reduced dimensionality, which removes unnecessary data relevancy and tremendously saves computing expense for the NNs.

2.2 NN Learning and Three Classifiers
4-layer feed-forward back-propagation NNs are adopted. The input, 1st hidden, 2nd hidden and output layers have 60, 76, 2 and 1 neurons respectively. For the sake of adjustability and flexibility, the 1st hidden layer has multiple sigmoid-type transfer functions, each assigned to a specified number of neurons as listed in Table 1. f(x)=x realizes a linear filter, and f(x)=exp(x) recovers logarithmized values. The 2nd hidden layer has 3 transfer functions, f(x)=x, log_sig(x) and radial_basis(x), to maintain both the linearity and the nonlinearity imported from the 1st hidden layer. The transfer function of the output layer is linear. The additional setup of a linear transfer function at every layer leaves a passageway for imitating purely linear interpolation algorithms.

Table 1. Transfer functions in the 1st hidden layer

Transfer function definition                        No. of neurons assigned
f(x)=x                                              1
f(x)=exp(x)                                         5
tan_sig(x)=2/(1+exp(-2x))-1                         10
log_sig(x)=1/(1+exp(-x))                            10
double_sig(x)=exp(-x)/(1+exp(-x))²                  10
double_log_sig(x)=sgn(x)(1-exp(-x²))                10
f(x)=cos(x)                                         10
radial_basis(x)=exp(-x²)                            10
f(x)=1-|x| for -1≤x≤1; f(x)=0 otherwise             10

Both input and target values to the NNs are proportionally scaled so that they correspond to the sensitive region of the sigmoid function. The training algorithm is the Scaled Conjugate Gradient method [12]. The initial weight values are normalized and proportional to the correlation coefficients between the pixel at (0,0) and the pixels at the relative spatial offsets. There is an obvious distinction among the distributions of weight values learnt from images demosaicked by different algorithms. Our 1st classifier ΨA relies on this weight space with proper PCA and a multi-class Support Vector Machine [13]. For a given test image Ω, a set of 12 pre-trained NNs is able to simulate each of its pixels and so construct a new re-interpolated image, say Ω’. The more similar the means used to demosaick Ω and the training images, the closer Ω’ is to Ω. Rigorously speaking, Ω is recognized as the class for which the L1-norm of the image vector Ω-Ω’ is minimal, which is our 2nd classifier ΨB. The difference map |Ω-Ω’| has periodic intensity fluctuation, which manifests as particularly salient symmetric peaks in the frequency domain (see Figure 4). This phenomenon is quantized by the maximum of the dot product between the kernel [1,2,4,2,1][1,2,4,2,1]T and 5×5 image patches in the high band, which is our 3rd classifier ΨC.
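To make the layered design concrete, here is a minimal forward-pass sketch of the 60-76-2-1 network with the transfer-function groups of Table 1 (random placeholder weights, not trained values; with only 2 neurons in the 2nd hidden layer, this sketch assigns linear and log-sigmoid units, one choice among the three functions the paper lists):

```python
import numpy as np

# Transfer functions of Table 1; tan_sig(x)=2/(1+exp(-2x))-1 equals tanh(x)
def double_sig(x):     return np.exp(-x) / (1.0 + np.exp(-x))**2
def double_log_sig(x): return np.sign(x) * (1.0 - np.exp(-x**2))
def log_sig(x):        return 1.0 / (1.0 + np.exp(-x))
def radial_basis(x):   return np.exp(-x**2)
def tri(x):            return np.where(np.abs(x) <= 1.0, 1.0 - np.abs(x), 0.0)

FUNCS_H1 = [(lambda z: z, 1), (np.exp, 5), (np.tanh, 10), (log_sig, 10),
            (double_sig, 10), (double_log_sig, 10), (np.cos, 10),
            (radial_basis, 10), (tri, 10)]        # 1+5+7*10 = 76 neurons

def layer(x, W, b, funcs):
    """Affine map, then each group of neurons applies its own transfer fn."""
    z, out, i = W @ x + b, [], 0
    for f, n in funcs:
        out.append(f(z[i:i+n])); i += n
    return np.concatenate(out)

rng = np.random.default_rng(1)
gamma = rng.standard_normal(60)                   # PCA-reduced observation Γ
h1 = layer(gamma, 0.1 * rng.standard_normal((76, 60)), np.zeros(76), FUNCS_H1)
# 2nd hidden layer (2 neurons): one linear "passageway", one log-sigmoid
h2 = layer(h1, 0.1 * rng.standard_normal((2, 76)), np.zeros(2),
           [(lambda z: z, 1), (log_sig, 1)])
y = float(rng.standard_normal(2) @ h2)            # linear output neuron
print(h1.shape, h2.shape)                         # (76,) (2,)
```

The grouping keeps a strictly linear path (f(x)=x at every layer) alive, which is what lets the network collapse to a pure linear filter when imitating linear interpolators.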
The imposed dimensionality of the logarithms and hues will not be reduced by PCA, because PCA is an orthogonal linear transformation. For the same reason, the NN can always find a linearly transformed vector of weight values to accommodate the inputs after multiplication by the feature matrix.

1 sgn(x) represents the sign function: sgn(x)=0 for x=0; sgn(x)=x/|x| for x≠0.
2 4 phases per channel × 3 channels = 12; see the explanation in Subsection 3.3.
3 It is inadequate to claim "absolute" independence, because cross-correlation between channels exists.

The classifiers ΨA, ΨB and ΨC make 3 "relatively"3 independent classification judgments for each channel. The Majority-Voting Scheme (MVS) [14] integrates them into a global decision: an image is attributed to a demosaicking algorithm only if a consensus
is reached among the 9, i.e. at least 5 make the same judgment; otherwise it is a rejection, and the image is deemed to be non-demosaicked.

Customarily, in a counterfeit image only a small portion of pixels is modified, which is insufficient to impact the status of the NNs, i.e. the weight space; ΨA thereby loses its keenness to localized tampering. Nevertheless, the metrics of ΨB and ΨC remain excellent indicators: Figure 7 shows an image containing perceptually plausible forgeries created using Adobe Photoshop, with the suspicious regions framed, together with the corresponding difference map |Ω-Ω’| for ΨB and the Fourier-transformed suspicious framed windows in |Ω-Ω’| for ΨC. The real photo, from which the fabrication was made, was shot with a Samsung S860 camera. The ΨB metrics, i.e. the L1-norms of the suspicious framed windows in |Ω-Ω’|, rise from 16485 to 49546, and the ΨC metrics, i.e. the kernel-dot-product maxima, fall from 15981 to 7893.
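The two window-level metrics (ΨB from Subsection 2.2: L1-norm of the difference map; ΨC: the [1,2,4,2,1][1,2,4,2,1]T kernel response in the high band of the spectrum) can be sketched as follows. The 64×64 synthetic maps, the 8-bin low-band cutoff and the brute-force patch scan are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def psi_b(diff):
    """ΨB-style metric: L1-norm of a window of the difference map |Ω-Ω'|."""
    return float(np.abs(diff).sum())

def psi_c(diff):
    """ΨC-style metric: maximum dot product between the kernel
    [1,2,4,2,1][1,2,4,2,1]^T and 5x5 patches of the high band of the
    magnitude spectrum of the difference map. Salient peaks => intact DIPC."""
    k = np.array([1.0, 2.0, 4.0, 2.0, 1.0])
    kernel = np.outer(k, k)
    spec = np.abs(np.fft.fftshift(np.fft.fft2(diff)))
    h, w = spec.shape
    spec[h//2 - 8:h//2 + 8, w//2 - 8:w//2 + 8] = 0.0  # suppress the low band
    best = 0.0
    for y in range(h - 4):
        for x in range(w - 4):
            best = max(best, float((spec[y:y+5, x:x+5] * kernel).sum()))
    return best

# Synthetic check: a period-2 difference map (DIPC intact) versus a flat one
yy, xx = np.mgrid[0:64, 0:64]
periodic = 0.5 + 0.5 * np.cos(np.pi * (xx + yy))
flat = np.full((64, 64), 0.5)
print(psi_c(periodic) > psi_c(flat))  # True: only the periodic map has a peak
```

A tampered window behaves like the flat map: its ΨC peak collapses while its ΨB norm grows, matching the direction of the numbers reported for Figure 7.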
2.3 Spatial Phase Alignment
At the beginning, we are unaware of whether a given pixel of a demosaicked image is interpolated or originally sampled. Since interpolation strategies at any logically equivalent spatial pixel location are identical, the spatial period along the x/y axes of most CFA patterns, including RGBG (see Figure 1), CMYG and CMYK, is 2 pixel units, labeled phase 0 and phase 1, leading to a total of 4 possible spatial phases (0,0), (0,1), (1,0) and (1,1):

RGBG: g0,0 b1,0 / r0,1 g1,1     CMYG: c0,0 m1,0 / y0,1 g1,1     CMYK: c0,0 m1,0 / y0,1 k1,1
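The enumeration of the 4 candidate phases can be sketched as below. The helper names (`shift_phase`, `align_phase`) and the toy score are hypothetical illustrations; in the framework itself the score would be one of the ΨX classifier responses:

```python
import numpy as np

# The four candidate spatial phases of a 2x2-periodic CFA pattern
PHASES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def shift_phase(img, phase):
    """View the image as if its top-left pixel sat at the given (dy, dx)
    phase, by cropping the leading rows/columns (hypothetical helper)."""
    dy, dx = phase
    return img[dy:, dx:]

def align_phase(img, score):
    """Enumerate all 4 phases and keep the one the classifier scores best."""
    return max(PHASES, key=lambda p: score(shift_phase(img, p)))

# Toy score: favour the phase whose even-even sites have the larger mean
img = np.zeros((6, 6)); img[1::2, 1::2] = 1.0
best = align_phase(img, lambda v: v[0::2, 0::2].mean())
print(best)  # (1, 1)
```

With three classifiers voting per axis, two agreeing votes always exist, which is why the alignment step can never end in rejection.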
The color values situated at one spatial phase are possibly subject to disparate interpolation compared with other spatial phases, thus requiring separate storage of the NN's weight space. For each spatial phase, separate weight and bias values must be kept for each color channel; accordingly, 12 NNs are designated to memorize 1 demosaicking algorithm. Fortunately, every pixel excluding those on the image borders gives an observation sample, so that a W×H image has L=(W-3)×(H-3)/12 observation samples for each NN, which is more than enough.

For a single training image, it is not necessary to determine its spatial phase, as at the outset the 4 NNs for a color channel are logically equivalent. When more than one image is involved, their spatial phases at a fixed reference point (such as the top-left-corner pixel) need to be aligned: all 4 spatial phases are enumerated for a test image. Along an axis, if phase 0 better matches the pending NNs in terms of ΨX, X ∈ {A,B,C}, then ΨX votes for phase 0, and vice versa. The phase receiving 2 votes wins out, as the MVS [14] promises, and rejections are impossible.

For M training images I1, I2…IM, if Ix (1≤x≤M) is treated as the sole training image and the rest as test images, the spatial phases of the rest can be obtained in the single-training-image mode described in the previous paragraph. The MVS again composes the ultimate spatial phase decision from the (M-1) local ones.

3. EXPERIMENTS
In experiments, each set of 12 NNs is trained on 10 800×600 images and tested on another 50 800×600 images. To shake off interference from the cameras' built-in demosaicking, each color channel of all training and test images is independently blurred with a 3×3 binomial filter and down-sampled by a factor of 2 along both the x and y axes. These down-sampled color images are then re-sampled onto a Bayer array and demosaicked by the 10 algorithms mentioned in Section 2, generating 11 classes of images including the original non-demosaicked ones. Non-demosaicked images have no DIPC; their inter-pixel correlation is ordinarily stochastic and chaotic, arising entirely from scene correlation. Therefore no NN is dispatched to learn their DIPC; instead, they are identified when a rejection arises from the MVS. Figure 5 is a graphical representation of misclassification, showing that accuracy suffers from the attempt to distinguish semblable algorithms, with nearly perfect categorization of the linear interpolators. ΨA orders a set of blank NNs with no prior training to be trained by the test image, and the acquired NN weight space is collected for classification.

4. CONCLUDING REMARKS
Besides imitating the cognitive faculty of humans, machine learning tools are expected to excel at large-scale mechanical cybernetics problems that are too onerous and laborious for humans. Interpolation recognition is one such case. Extensions of this work include the recognition of image enhancement and restoration algorithms, of audio interpolation algorithms, etc. Although Huang [15] theoretically proved that any direct digital forgery detection method can be cracked by an anti-detection method aimed against it, exploiting the statistical characteristics available within digital photos remains beneficial for authentication and for improving robustness against cropping and spread-spectrum filtering in second-generation watermarks [16].

5. REFERENCES
[1] Y.Li, J.Sun, C.K.Tang, and H.Y.Shum, Lazy Snapping, Proceedings of ACM SIGGRAPH 2004, pp.303-308.
[2] D.L.M.Sacchi, F.Agnoli, and E.F.Loftus, Changing history: Doctored photographs affect memory for past public events, Applied Cognitive Psychology, 21(8): 1005-1022, 2007.
[3] T.V.Lanh, K.S.Chong, S.Emmanuel, and M.S.Kankanhalli, A survey on digital camera image forensic methods, Proceedings of IEEE International Conference on Multimedia and Expo 2007, pp.16-19.
[4] P.Nillius, and J.O.Eklundh, Automatic estimation of the projected light source direction, Proceedings of CVPR 2001, pp.1076-1083.
[5] M.Kharrazi, H.T.Sencar, and N.Memon, Blind source camera identification, Proceedings of IEEE International Conference on Image Processing 2004, pp.709-712.
[6] O.Celiktutan, B.Sankur, and I.Avcibas, Blind identification of source cell-phone model, IEEE Transactions on Information Forensics and Security, 3(3): 553-566, 2008.
[7] C.McKay, A.Swaminathan, H.Gou, and M.Wu, Image acquisition forensics: Forensic analysis to identify imaging source, Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing 2008, pp.1657-1660.
[8] A.Gallagher, and T.Chen, Image authentication by detecting traces of demosaicing, Proceedings of CVPRW 2008.
[9] Y.Z.Huang, and Y.J.Long, Demosaicking recognition with applications in digital photo authentication based on a quadratic pixel correlation model, Proceedings of CVPR 2008, pp.1-8.
[10] A.Swaminathan, M.Wu, and K.J.Ray Liu, Non-intrusive
component forensics of visual sensors using output images, IEEE Transactions on Information Forensics and Security, 2(1): 91-106, 2007.
[11] M.Kutter, S.Voloshynovskiy, and A.Herrigel, The watermark copy attack, SPIE vol. 3971, Security and Watermarking of Multimedia Contents II, 2000, pp.371-380.
[12] A.F.Moller, A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks, 6(4): 525-533, 1993.
[13] C.W.Hsu, and C.J.Lin, A comparison of methods for multiclass support vector machines, IEEE Transactions on Neural Networks, 13: 415-425, 2002.
[14] L.Lam, and C.Y.Suen, Application of majority voting to pattern recognition: An analysis of its behavior and performance, IEEE Transactions on Systems, Man and Cybernetics, 27(5): 553-568, 1997.
[15] Y.Z.Huang, Can digital image forgery detection be unevadable? A case study: Color filter array interpolation statistical feature recovery, SPIE vol. 5960, Visual Communications and Image Processing 2005, pp.980-991.
[16] M.Kutter, S.K.Bhattacharjee, and T.Ebrahimi, Towards second generation watermarking schemes, Proceedings of IEEE International Conference on Image Processing 1999, pp.320-323.
Figure 4. The left column of each row, from top to bottom, shows a non-demosaicked image, followed by images demosaicked by the bilinear and Go's NN methods. The middle column is the difference map |Ω-Ω’| between the test image Ω and its re-interpolated image Ω’, corresponding to the left column. The right column is the magnitude of the Fourier transform of |Ω-Ω’|. For non-demosaicked images, Ω’ differs more from Ω, and |Ω-Ω’| has only faint periodic patterns.
4 For display purposes, all pixel intensities of |Ω-Ω’| in this paper are scaled by a factor of 32.
Figure 5. Misclassification rate distributions using the proposed framework to recognize Bayer-array re-sampled images artificially demosaicked by the 10 algorithms introduced in Section 2. The height of each colored pillar above an algorithm's name and recognition rate indicates the percentage misclassified from it to the algorithm marked in the legend. Recognition rates: non-demosaicked 96%, bilinear 100%, bicubic 98%, subtractive CHB 96%, ratio CHB 90%, gradient-based 96%, TBVNG 88%, Go's NN 98%, adaptive NN 92%, Gunturk's 94%, Kimmel's 88%.
Figure 7. An exhibition of a rendered forgery (rows: real, fake; columns: the test image Ω, the difference map |Ω-Ω’|, and the magnitude of the Fourier transform of the suspicious framed window in |Ω-Ω’|). Exceptional highlights in |Ω-Ω’| and the absence of evident peaks in the spectrum over the framed windows unveil the corruption of DIPC, implying probable tampered areas.