Steganalysis of Boundary-based Steganography using Autoregressive Model of Digital Boundaries

M. Jiang*, X. Wu**, E. K. Wong*, and N. Memon*

*Department of Computer and Information Science, Polytechnic University, 5 Metrotech Center, Brooklyn, NY 11201, USA
**Department of Electrical & Computer Engineering, McMaster University, Hamilton, Ontario, L8G 4K1, Canada
Abstract

In this paper, we present a novel technique for the steganalysis of digital documents, when the secret message is embedded along the boundaries of text characters or other symbols in the documents. The proposed technique uses an auto-regressive model to detect marked documents, as well as to estimate the relative length of the embedded messages. Experimental results demonstrate the effectiveness and accuracy of the proposed technique.
1. INTRODUCTION

Over recent years, a number of watermarking and data hiding techniques have been developed for binary document images [1]. An important application of data hiding is steganography, where a secret message is hidden in a document for the purpose of covert communication. The goal of steganalysis is to develop techniques that will automatically distinguish documents with secret messages (stego objects) from documents without secret messages (cover objects). Such techniques are useful as a countermeasure to detect covert communications among enemies, criminals, and terrorists, as well as to protect against the insider threat of unauthorized covert release of information using steganographic techniques within government agencies and private institutions. Depending on the data embedding methods used, different steganalysis techniques need to be developed.

Steganalysis research has been stimulated by the increasing number of digital steganographic techniques developed over recent years. The basic idea for steganalysis is that after evaluating many original images and stego images, it is possible to capture characteristics that are not "normal" [2]. In [2, 3], the palette table of an image is analyzed for abnormality. In [4, 5], the changes in the bit patterns of the least-significant-bit planes of images are examined. Other steganalysis techniques can be found in [6-8]. Recent surveys of steganalysis techniques can be found in [9, 10]. Steganalysis techniques reported in previous work, however, focused on grayscale and color images. They cannot be directly applied to binary document images.

An effective steganographic technique for binary document images is to embed data along the boundaries of characters or symbols in a document. In [11], a boundary-based method was developed for electronic document images. This technique inserts data along the 8-connected boundaries of characters or symbols by using a set of dual five-pixel-long boundary patterns. It was shown to have high data embedding capacity and produce marked images with excellent image quality. In [12], the input binary image is divided into 3 x 3 (or larger) blocks. Data is embedded such that the total number of black pixels is either odd or even in a block. Within each block, data is also inserted along the boundaries of characters or symbols.

In this paper, we propose a technique for the steganalysis of boundary-based steganography using an autoregressive model of digital boundaries. To the best of our knowledge, this is the first steganalysis technique developed for document images. It is a fairly general technique that can be applied to any steganographic technique that embeds messages by flipping pixels along the boundaries of characters or symbols in a document. In Section 2, we present our proposed method. Section 3 gives experimental results, and in Section 4, we present concluding remarks and discussion.

This work was supported by AFOSR Grant # F30602-03-C-0091.

2. PROPOSED APPROACH

2.1 Auto-Regressive Model

In our model, boundary-based data hiding can be viewed as a process that hides data along the boundaries of characters or symbols in a document by mixing small pixel disturbances amongst quantization and digitization noise. By quantization noise we mean the errors caused by rasterization of smooth curves in character font rendering. Digitization noise refers to errors caused by optical scanning of documents. Although quantization and/or digitization noise can visually mask small changes in boundary pixel positions caused by steganography, we strive to statistically distinguish these types of noise from subtle modifications to discrete curve boundaries caused by steganography.

The boundaries of characters or symbols in a document can normally be modeled by a cubic polynomial. Indeed, computer fonts are typically shaped by cubic splines. Even handwritten letters tend to consist of strokes that can be approximated by cubic curves. Therefore, we can model the discrete curve $(x_i, y_i)$, which is a sequence of pixels, of a character font or an object boundary as an autoregressive process. Then a boundary pixel position $(x_i, y_i)$ can be estimated from its neighbors in a window of size $2T+1$ via cubic polynomials:

$$
\hat{x}_i = \sum_{t=1}^{T} \left( \alpha_t^{(1)} x_{i-t} + \alpha_t^{(2)} x_{i-t}^{2} + \alpha_t^{(3)} x_{i-t}^{3} + \beta_t^{(1)} x_{i+t} + \beta_t^{(2)} x_{i+t}^{2} + \beta_t^{(3)} x_{i+t}^{3} \right) \tag{1}
$$

$$
\hat{y}_i = \sum_{t=1}^{T} \left( \gamma_t^{(1)} y_{i-t} + \gamma_t^{(2)} y_{i-t}^{2} + \gamma_t^{(3)} y_{i-t}^{3} + \eta_t^{(1)} y_{i+t} + \eta_t^{(2)} y_{i+t}^{2} + \eta_t^{(3)} y_{i+t}^{3} \right) \tag{2}
$$

Because the curve rasterization algorithm approximates a continuous curve by pixels that are closest to the true curve, the estimation error vector

$$
e_i = (x_i - \hat{x}_i,\; y_i - \hat{y}_i) \tag{3}
$$

is a zero-mean random vector with a joint Laplacian distribution (see Figure 1(a)). But steganography along the discrete curve boundaries shifts pixels away from the underlying continuous curve. This degrades the fit of the autoregressive model and consequently changes the joint distribution of the estimation error vector $e_i$.

Using the autoregressive model, if an image is marked, the variance of the joint probability distribution P(e) of the estimation error vector will increase (see Figure 1(b)). Furthermore, the deviation of the expected estimation error vector from the origin and the variance of P(e) will increase as the length of the embedded message increases.

2.2 Determining the Regression Coefficients

The coefficients of the autoregressive model, namely α, β, γ, and η, are determined using a large training set of unmarked document images. In the current implementation, regression coefficients are trained separately for characters and symbols of different sizes and fonts. First, we assume that the estimation errors in the x and y directions are independent of each other. We then estimate the regression coefficients in Equations (1) and (2) separately by minimizing the sum-of-squared-errors $E_x$ and $E_y$ defined as follows:

$$
E_x = \sum_{i=1}^{N} (x_i - \hat{x}_i)^2 \tag{4}
$$

$$
E_y = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \tag{5}
$$

where N is the total number of boundary pixels around a character. Details on how to compute the regression coefficients based on the minimum sum-of-squared-error criterion can be found in standard textbooks such as [13].

2.3 Steganalysis Based on Statistical Properties of Estimation Errors

In the current implementation, we combine the errors in the x- and y-directions and use the result as a measure in the steganalysis of marked documents. Specifically, we compute

$$
\bar{e}_i = |x_i - \hat{x}_i| + |y_i - \hat{y}_i| \tag{6}
$$
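The prediction in Equations (1) and (2) and the combined error in Equation (6) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the layout of the coefficients (one triple per lag t), and the choice to skip positions that lack a full 2T+1 window are all assumptions, and the coefficient values are hypothetical placeholders, not trained values.

```python
from typing import List, Sequence, Tuple

# Assumed layout: one (c1, c2, c3) triple per lag t = 1..T, weighting the
# neighbouring sample, its square, and its cube.
Coeffs = Sequence[Tuple[float, float, float]]


def predict(seq: Sequence[float], i: int, past: Coeffs, future: Coeffs) -> float:
    """Estimate seq[i] from its T past and T future neighbours (form of Eq. 1/2)."""
    estimate = 0.0
    for t, ((a1, a2, a3), (b1, b2, b3)) in enumerate(zip(past, future), start=1):
        p, f = seq[i - t], seq[i + t]
        estimate += a1 * p + a2 * p**2 + a3 * p**3
        estimate += b1 * f + b2 * f**2 + b3 * f**3
    return estimate


def combined_errors(xs: Sequence[float], ys: Sequence[float],
                    x_past: Coeffs, x_future: Coeffs,
                    y_past: Coeffs, y_future: Coeffs) -> List[float]:
    """Combined error of Eq. (6) at every position that has a full 2T+1 window."""
    T = len(x_past)
    errors = []
    for i in range(T, len(xs) - T):  # assumption: positions near the ends are skipped
        x_hat = predict(xs, i, x_past, x_future)
        y_hat = predict(ys, i, y_past, y_future)
        errors.append(abs(xs[i] - x_hat) + abs(ys[i] - y_hat))
    return errors


# Hypothetical coefficients for T = 1: each coordinate is predicted as the
# average of its two neighbours (illustration only, not trained values).
AVG = [(0.5, 0.0, 0.0)]
xs = [0, 1, 2, 3, 4]
ys = [0, 0, 1, 0, 0]  # a one-pixel bump at i = 2
print(combined_errors(xs, ys, AVG, AVG, AVG, AVG))  # [0.5, 1.0, 0.5]
```

In this toy boundary the single displaced pixel produces the largest error at its own position and smaller errors at its neighbours, which is the kind of deviation the statistics in this section are meant to capture.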
We then compute the mean and variance of the combined error

$$
\mu = \frac{1}{M} \sum_{i=1}^{M} \bar{e}_i \tag{7}
$$

$$
\sigma^2 = \frac{1}{M} \sum_{i=1}^{M} (\bar{e}_i - \mu)^2 \tag{8}
$$
where M is the total number of boundary points among all the characters and symbols in a document. We then use the following rule to decide whether an input document I is marked or not:

$$
I \text{ is marked if } \mu > T_\mu \text{ and } \sigma^2 > T_\sigma; \text{ otherwise } I \text{ is unmarked} \tag{9}
$$

where $T_\mu$ and $T_\sigma$ are thresholds obtained empirically from the training data.

Once a document has been detected as marked, we can further estimate the relative message length l using the computed mean µ and variance σ². Here, relative message length is defined to be the total number of altered pixels divided by the embedding capacity of the document. For a given document, µ and σ² are increasing functions of the relative message length. The relationship between µ and relative message length is illustrated in Figure 2. From the training data, we obtained the following empirical formula for the estimation of relative message length based on the error mean:
$$
l = 43.82\,\mu - 7.724 \tag{10}
$$

A linear equation for estimating relative message length using the error variance was also obtained empirically:

$$
l = 36.99\,\sigma^2 - 1.398 \tag{11}
$$
We found that Equations (10) and (11) have similar accuracies and either one could be used to estimate relative message lengths. These two equations are effective for documents marked with any boundary-based data hiding method. In Section 3, we use the boundary-based data hiding method reported in [11] to perform our experiments.
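The statistics, decision rule, and length estimators of this section can be put together in a short Python sketch. The function names are illustrative, and the threshold values and error list are hypothetical placeholders, since the empirically obtained thresholds are not given in the text. The unit of l is also not stated in this section, so the sketch simply evaluates the formulas as written.

```python
from typing import Sequence, Tuple


def error_statistics(errors: Sequence[float]) -> Tuple[float, float]:
    """Mean (Eq. 7) and variance with a 1/M factor (Eq. 8) of the combined errors."""
    m = len(errors)
    mu = sum(errors) / m
    var = sum((e - mu) ** 2 for e in errors) / m
    return mu, var


def is_marked(mu: float, var: float, t_mu: float, t_sigma: float) -> bool:
    """Decision rule of Eq. (9): both statistics must exceed their thresholds."""
    return mu > t_mu and var > t_sigma


def length_from_mean(mu: float) -> float:
    """Relative message length estimated from the error mean (Eq. 10)."""
    return 43.82 * mu - 7.724


def length_from_variance(var: float) -> float:
    """Relative message length estimated from the error variance (Eq. 11)."""
    return 36.99 * var - 1.398


# Hypothetical thresholds and combined errors, for illustration only.
T_MU, T_SIGMA = 0.2, 0.1
errors = [0.0, 1.0, 0.0, 1.0, 2.0, 0.0, 1.0, 3.0]

mu, var = error_statistics(errors)
if is_marked(mu, var, T_MU, T_SIGMA):
    print("marked; l from mean:", length_from_mean(mu),
          "l from variance:", length_from_variance(var))
else:
    print("unmarked")
```

Requiring both conditions mirrors the conjunction in Equation (9). In practice the thresholds would be set from unmarked training documents, as the text describes.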
3. EXPERIMENTAL RESULTS

Experiments were conducted to evaluate the performance of our proposed steganalysis technique. A set of 73 test document images was generated using the “Paint” program on a Windows PC. The generated images have a resolution of 96 dpi and their spatial dimensions range from 400 x 400 pixels to 1,000 x 700 pixels. They contain text characters and symbols of a variety of fonts and sizes. These documents were then marked with messages using the boundary-based data hiding technique reported in [11]. We successfully detected all marked documents using Equation (9). We also estimated the relative message lengths in these documents using Equation (10), with an average error of about 1.61% relative to the original relative message lengths. As a specific example, one of the images has a message of 4,166 bits embedded among 757 characters and symbols (Figure 3). The maximum data hiding capacity of the document is 4,171 bits. The marked image is visually indistinguishable from its unmarked version but can be detected using our proposed technique. Figure 4 shows the errors in the estimated relative message lengths when messages of different lengths are embedded into the test document. As shown in the figure, the error is quite low (between -1.5% and 2.0%) for different message lengths.
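For the specific example in this section (4,166 bits embedded in a document of 757 characters and symbols with a capacity of 4,171 bits), a back-of-the-envelope calculation gives a sense of scale. These ratios are derived here from the reported numbers and are not figures stated by the authors:

```latex
\frac{4166\ \text{bits}}{4171\ \text{bits}} \approx 0.9988
\quad (\text{about } 99.88\% \text{ of capacity used}),
\qquad
\frac{4171\ \text{bits}}{757\ \text{characters}} \approx 5.51\ \text{bits per character or symbol}
```

The first ratio is the fraction of capacity occupied by message bits. The relative message length l of Section 2.3 is defined in terms of altered pixels, so the two quantities may not coincide exactly.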
4. CONCLUDING REMARKS AND DISCUSSION

We have developed a novel steganalysis technique for binary electronic documents. This technique can detect document images that contain hidden messages along character and symbol boundaries. In addition, it can estimate the relative lengths of the hidden messages with good accuracy. The experimental results in Section 3 validate the effectiveness of our proposed technique for computer-generated fonts. Although we have only tested on computer-generated fonts, we believe our method will work for non-standard fonts such as those created by users, or handwritten characters or symbols, as long as the boundaries are locally smooth when they are created. We intend to perform testing on non-standard fonts created by users, and on handwritten characters and symbols. The results will be published in a future paper.

For computer-generated fonts, one can argue that we can directly compare the characters and symbols on a marked document with images generated with the same computer fonts. But in a fully automated environment, this would first require the computer identification of font styles and sizes in a marked document, followed by the recognition of individual characters and symbols using OCR software, and finally, re-generation of the original document image. (Note that our proposed technique does not require this process.) Manual generation of the original document image would be a tedious process. Besides, a clever person would probably use non-standard or user-created fonts in steganography to decrease the likelihood of being detected.
5. REFERENCES

[1] M. Chen, E. K. Wong, N. Memon, and S. Adams, “Recent Developments in Document Image Watermarking and Data Hiding,” Proc. SPIE Conf. on Multimedia Systems and Applications IV, Denver, CO, Aug. 2001.
[2] N. F. Johnson and S. Jajodia, “Steganalysis: the investigation of hidden information,” Information Technology Conference, IEEE, 1-3 Sept. 1998, pp. 113-116.
[3] N. F. Johnson and S. Jajodia, “Steganalysis of Images Created using Current Steganography Software,” Workshop on Information Hiding Proceedings, Portland, Oregon, USA, 15-17 April 1998.
[4] I. Avcibas, N. Memon, and B. Sankur, “Image steganalysis with binary similarity measures,” IEEE International Conference on Image Processing, Rochester, New York, September 2002, vol. 3, pp. 645-648.
[5] R. Chandramouli and N. Memon, “Analysis of LSB-based Image Steganography techniques,” IEEE International Conference on Image Processing, Thessaloniki, Greece, October 2001.
[6] N. Provos and P. Honeyman, “Hide and Seek: An Introduction to Steganography,” IEEE Security and Privacy, May/June 2003, pp. 32-44.
[7] N. Provos and P. Honeyman, “Detecting Steganographic Content on the Internet,” Proc. 2002 Network and Distributed System Security Symp., Internet Soc., 2002.
[8] J. Fridrich, M. Goljan, and D. Hogea, “Steganalysis of JPEG Images: Breaking the F5 Algorithm,” Proc. 5th Int’l Workshop Information Hiding, Springer-Verlag, 2002.
[9] J. Fridrich and M. Goljan, “Practical Steganalysis - State of the Art,” Proc. SPIE Photonics Imaging 2002, Security and Watermarking of Multimedia Contents, vol. 4675, SPIE Press, 2002, pp. 1-13.
[10] A. Westfeld and A. Pfitzmann, “Attacks on Steganographic Systems,” Information Hiding, LNCS 1768, pp. 61-76, Springer-Verlag Berlin Heidelberg, 1999.
[11] Q. Mei, E. K. Wong, and N. Memon, “Data hiding in binary text documents,” SPIE Proc. Security and Watermarking of Multimedia Contents III, San Jose, CA, Jan. 2001.
[12] M. Wu, E. Tang, and B. Liu, “Data hiding in digital binary images,” Proc. IEEE Int’l Conf. on Multimedia and Expo, Jul 31-Aug 2, 2000, New York, NY.
[13] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 3rd edition, McGraw-Hill, Inc., 1991.
Figure 1. Error distributions of unmarked and marked images
Figure 2. Mean of estimated errors versus relative message length.
Figure 3. A test document image
Figure 4. Error of estimated message length versus actual relative message length.