Modern Trends in Steganography and Steganalysis
Jessica Fridrich
Department of Electrical and Computer Engineering
State University of New York
Fundamental questions
• How to hide information securely?
• What does "securely" mean?
• How much can I embed undetectably?
• How should I embed?
• What are the most relevant open problems?
• What are the main achievements?
[Figure: number of IEEE journal and conference publications per year on steganography and steganalysis]
[Figure: number of software stego applications over time. Data courtesy of Chet Hosmer, WetStone Tech]
[Figure: data hiding applications by media: number of software applications that can hide data in electronic media as of June 2008. Data courtesy of Neil Johnson]
Fundamental questions
• How to hide information securely?
• What does "securely" mean?
• How much can I embed undetectably?
• How should I embed?
• What are the most relevant open problems?
• What are the main achievements?
Answers to these questions depend on whom you ask!
Everything depends on the source

Artificial sources (iid sequences, Markov chains): the abstraction
• Steganographic capacity known (positive rate)
• Capacity-approaching embedding using QIM
• Optimal detectors are LRTs (likelihood-ratio tests)

Empirical sources (digital media): the reality
• Model unknown
• Capacity unknown
• Best embedding algorithms unknown
• Optimal detectors unknown
Steganographic channel
• Cover source: random variable x on X with distribution $p_x$
• Stego source: random variable y on X with distribution $p_y$
• Message source: random variable on M
• Key source: random variable on K
• Channel: error-free or noisy (passive or active Warden)
Steganographic security
• Measure of security: the Kullback–Leibler divergence $D_{KL}(p_x \| p_y)$ (Cachin 1998)
• Perfect security: $D_{KL}(p_x \| p_y) = 0$
• $\epsilon$-security: $D_{KL}(p_x \| p_y) \le \epsilon$
• The KL divergence is the error exponent for the optimal Neyman–Pearson detector: $P_D(P_{FA}) \sim \exp(-n D_{KL})$, where n is the number of pixels
• Other Ali–Silvey distances could be used

Note that x can be
- pixels (DCT coefficients)
- groups of pixels
- features = low-dimensional image representations
- entire images
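A minimal sketch of how this quantity can be estimated from empirical histograms (the smoothing constant eps is an arbitrary choice, not from the talk):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as histograms."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# D_KL = 0 iff the two distributions coincide (perfect security);
# any detectable embedding leaves D_KL > 0.
print(kl_divergence([0.50, 0.25, 0.25], [0.50, 0.25, 0.25]))  # 0.0
print(kl_divergence([0.50, 0.25, 0.25], [0.40, 0.30, 0.30]))  # > 0
```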
Perfect security • Given full knowledge of the cover source px , perfectly secure stegosystems exist (Anderson & Petitcolas, 1997) • Can be realized using the principle of cover synthesis by utilizing a perfect compressor of the cover source
• Sender communicates on average H(x) bits per object sent, the entropy of the cover source
• The encrypted message m is a sequence of random bits
• A source decompressor D generates the stego object directly: y = D(m) (no cover x on the sender's input)
• First practical realization for JPEG images, Model-based steganography (Sallee, IWDW 2003)
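A toy sketch of the decompression idea for a memoryless source with dyadic probabilities; a prefix (Huffman) code stands in for the perfect compressor. This is an illustrative assumption, not Sallee's model-based construction:

```python
# Toy cover-synthesis stegosystem: a prefix code for a memoryless source
# with dyadic probabilities p = (1/2, 1/4, 1/8, 1/8). Decompressing random
# message bits yields symbols distributed like the cover source; the
# receiver recompresses the symbols to read the message back.
CODE = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
DECODE = {v: k for k, v in CODE.items()}

def embed(bits):
    """y = D(m): map message bits to a cover-like symbol sequence."""
    symbols, buf = [], ''
    for b in bits:
        buf += b
        if buf in DECODE:          # complete codeword -> emit one symbol
            symbols.append(DECODE[buf])
            buf = ''
    return symbols, buf            # buf = unused trailing bits, if any

def extract(symbols):
    """Recompress the stego object to recover the message bits."""
    return ''.join(CODE[s] for s in symbols)

message = '110100111010'
stego, leftover = embed(message)
assert extract(stego) + leftover == message
print(stego)                       # ['c', 'b', 'a', 'd', 'a', 'b']
```

With dyadic probabilities the expected codeword length equals H(x), matching the rate stated above.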
Steganographic capacity
• The maximum rate over all perfectly secure stegosystems, measured in bits per cover element (rate)
• A property of the cover + key + message sources and the channel, not of a specific embedding scheme!
• It is positive: the size of the secure payload increases linearly with the cover size
Steganographic capacity
The most general result (Moulin & Wang, 2004, 2008), for an active warden and a distortion-limited embedder:
• Distortion per cover element: $E_{m,k,x}[d(x,y)] \le D_s$
• Channel distortion (y → y'): $E[d(y,y')] \le D_w$
• Steganographic constraint: $p_x = p_y$
• Sender and Warden know the channel and $p_x$, $p_m$, $p_k$, $p_y$

The capacity $C(D_s, D_w)$ can be derived in the limit as the cover size $n \to \infty$.

Explicit formulas are available for some simple cover sources:
- Bernoulli sequence with d(.,.) = Hamming distance (Moulin & Wang, 2004)
- Gaussian source with d(.,.) = squared distance (Gonzales & Comesaña, 2007)
Capacity for empirical sources?
• For empirical sources, $p_x$ is unknown: Alice cannot preserve it, and the Warden cannot build an LRT
• If Alice preserves a simplified model, the Warden will work outside of the model and detect the embedding
• History teaches us that no matter how hard Alice tries, the Warden will be smarter and will find a representation of the covers where detection is possible ($D_{KL} > 0$)
• Alice knows that her scheme is not secure. How big a payload (the secure payload) can she send at a fixed* risk?

*Fixed risk = fixed KL divergence.
The square root law of secure payload
Square root law of imperfect steganography (Ker 2007 for iid sources; Filler 2009 for Markov sources)
• The secure payload is a property of the embedding scheme! This is why we do not use the term "capacity"
• The secure payload grows only with the square root of the cover size
• Unlike in other communication settings, model mismatch does not merely decrease the communication rate: it makes the rate drop to zero (already suspected by Anderson & Petitcolas 1997)
• Experimentally verified on images for a wide spectrum of steganographic methods
The square root law: assumptions
1. The cover source is a first-order Markov chain with transition probability matrix $A = (a_{ij})$, $a_{ij} = \Pr(x_{n+1} = j \mid x_n = i)$
   - the simplest tractable model that captures dependencies among pixels
2. Embedding is a probabilistic mapping applied to each pixel independently, $b_{lk} = \Pr(y_n = k \mid x_n = l)$
   - captures the essence of most practical stegosystems
3. The steganographic system is not perfect ($D_{KL} > 0$)
   - we already know that perfectly secure stegosystems have positive rate (the secure payload increases linearly), but this holds only for artificial covers
[Diagram: the cover Markov chain (MC) with states i, j, k, l and transition probabilities $a_{ij}$, $a_{jk}$, $a_{kl}$; independent per-pixel embedding with probabilities $b_{ii'}$, $b_{jj'}$, $b_{kk'}$, $b_{ll'}$ produces the stego hidden Markov chain (HMC) with states i', j', k', l']
The square root law: theorem
As $n \to \infty$:
• Conservative payload, $M/\sqrt{n} \to 0$: the stegosystem is asymptotically perfectly secure ($D_{KL} \to 0$)
• Unsafe payload, $M/\sqrt{n} \to \infty$: an arbitrarily accurate detector can be built
• Borderline payload, $M/\sqrt{n} \to$ const: the stegosystem is $\epsilon$-secure ($D_{KL} \le \epsilon$)

Here n = cover size (e.g., number of pixels) and M = payload size (message length in bits).
Where does the SRL come from?

$$D_{KL}(p_x \| p_y) = \tfrac{1}{2}\, n \beta^2 I(0) + O(\beta^3)$$

Fixing the risk ($D_{KL} \le \epsilon$) forces $n\beta^2 =$ const, hence $\beta \propto 1/\sqrt{n}$ and the secure payload $M \propto n\beta \propto \sqrt{n}$.

Notation:
- n = number of cover elements
- $p_x$, $p_y$ = cover and stego distributions with a fraction β of changed pixels
- I(0) = Fisher information w.r.t. β at β = 0, $I(0) = E_x\big[(\partial \log p_y(\beta)/\partial \beta)^2\big]$
- M = secure payload (bits)
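A few lines make the scaling concrete (the values of ε and I(0) below are arbitrary illustrative choices):

```python
import math

EPSILON = 0.01   # fixed risk: target KL divergence (illustrative)
I0 = 2.0         # Fisher information I(0) (illustrative)

for n in (10_000, 100_000, 1_000_000):
    beta = math.sqrt(2 * EPSILON / (n * I0))   # from ½ n β² I(0) = ε
    M = n * beta                               # payload grows as nβ
    print(f"n={n:>9,}  beta={beta:.5f}  M≈{M:7.0f}  M/sqrt(n)={M/math.sqrt(n):.3f}")
```

The last column is constant: holding the risk fixed, the payload grows exactly as √n.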
Experimental verification
[Figure: measured secure payload vs. cover size for several steganographic methods]
What does this mean for practitioners?
• A ten-times bigger image can hold only a roughly three-times bigger payload (√10 ≈ 3.16) at the same level of statistical detectability. Alice and Bob: be conservative with your payload!
• It is easier to detect the same relative payload in larger images! Be careful when comparing steganalysis results across different databases; fix the payload/√n, the "root rate"
• Use the root rate for comparing, benchmarking, and designing stegosystems (see the sketch below)
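A sketch of the comparison pitfall: at the same relative payload (bpp), the root rate, and hence the detectability, grows with image size (the image sizes are illustrative):

```python
import math

def root_rate(payload_bits, n_pixels):
    """Payload normalized by the square root of the cover size."""
    return payload_bits / math.sqrt(n_pixels)

# Same 0.1 bpp relative payload on a small and a 10x larger image:
for n in (250_000, 2_500_000):
    print(f"n={n:>9,} pixels: 0.1 bpp -> root rate {root_rate(0.1 * n, n):.1f}")
```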
So, how shall we embed in practice? Options available:
• Force the embedding to resist a given attack
  - necessary but not sufficient; usually rather weak
• Adopt a model (or its sampled version) and preserve it
  - unless the model is sufficiently complex, usually easy to detect
• Mimic natural processing / noise
  - very hard to do right (Franz, 2008)
• Minimize embedding distortion
  - the best steganographic algorithms for digital images known today work precisely in this manner
  - no need to estimate high-dimensional distributions
  - can work with very high-dimensional models
  - well developed; an entire framework is in place
Steganography minimizing distortion: historical development
• d(.,.) = number of embedding changes
  - matrix embedding (Crandall 1998, Westfeld 2001, …)
• d(.,.) = L1 or L2 distortion
  - MMEx (Kim, Duric & Richards 2006), Sachnev & Kim 2009, perturbed quantization 2004, …
• general d(.,.)
  - syndrome-trellis codes (Filler 2009–11), HUGO (Pevný 2010)
Strategy:
a) Fix the embedding operation
b) Define the embedding distortion function d(x, y)
c) Embed by minimizing d(x, y) for a given payload and cover x
d) Search for a d(.,.) that minimizes detectability in practice
Formalization
• During embedding, the cover x is slightly modified into a stego object y that carries the desired payload
• For each x, the sender specifies a set Y of plausible stego images, y ∈ Y
  - LSB embedding: Y = {x, LSBflip(x)}
  - ±k embedding: Y = {x, x+1, x−1, …, x+k, x−k}
  - when x is not to be modified: Y = {x}
• The distortion measure d(x, y) is an arbitrary bounded function of x and y
• The embedding algorithm selects each y ∈ Y with probability π(y)
• Expected distortion: $E_\pi[d] = \sum_{y \in Y} \pi(y)\, d(x,y)$
• Expected payload: $H(\pi) = -\sum_{y \in Y} \pi(y) \log_2 \pi(y)$
Searching for the best π

Payload-limited sender: minimize the expected distortion subject to a payload constraint

$$\min_\pi\; E_\pi[d] = \sum_{y \in Y} \pi(y)\, d(x,y) \quad \text{subject to} \quad H(\pi) = M$$

Distortion-limited sender: maximize the expected payload subject to a distortion bound

$$\max_\pi\; H(\pi) \quad \text{subject to} \quad E_\pi[d] = \sum_{y \in Y} \pi(y)\, d(x,y) \le D$$

The optimization is over all distributions π over Y. It is π and Y that encapsulate the entire embedding algorithm; they also depend on the cover x!
Gibbs distribution

The optimal distribution (algorithm) for both senders is the Gibbs distribution

$$\pi_\lambda(y) = \frac{1}{Z(\lambda)} \exp(-\lambda\, d(x,y))$$

• λ > 0 is a scalar parameter determined from the appropriate constraint (the "inverse temperature")
• $Z(\lambda) = \sum_{y \in Y} \exp(-\lambda\, d(x,y))$ is the normalization factor (partition function)

We can import many useful algorithms from statistical physics into steganography.
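A minimal numeric sketch of the Gibbs distribution over a tiny, enumerable set Y of candidate stego objects (the distortion values and λ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d = np.array([0.0, 1.2, 0.7, 2.5, 1.9, 0.3, 3.1, 1.1])  # d(x,y) for 8 candidates
lam = 1.5                                  # inverse temperature (illustrative)

w = np.exp(-lam * d)
Z = w.sum()                                # partition function Z(lambda)
pi = w / Z                                 # Gibbs distribution over Y
H = float(-(pi * np.log2(pi)).sum())       # expected payload H(pi) in bits
Ed = float((pi * d).sum())                 # expected distortion E_pi[d]
y_idx = rng.choice(len(d), p=pi)           # simulate optimal embedding: y ~ pi
print(f"H(pi)={H:.3f} bits, E[d]={Ed:.3f}, sampled candidate #{y_idx}")
```

Sweeping λ traces out the rate-distortion bound: small λ favors payload, large λ favors low distortion.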
Separation principle
• By sampling from π(y), one can simulate the impact of theoretically optimal embedding!
  - Gibbs sampler, Markov chain Monte Carlo methods
  - detectability can be tested before implementing the scheme
• Near-optimal practical embedding schemes can be constructed using syndrome-trellis codes
  - when d(.,.) is a sum of locally-supported potential functions
  - includes the most common case when d is additive (more on the next slide)
• We can compute the rate-distortion bound: the relationship between expected distortion E[d] and payload H(π)
  - the bound tells us how close to the optimum a given implementation is
  - thermodynamic integration + Gibbs sampler
Additive distortion

$$d(x,y) = \sum_k \rho_k(x, y_k)$$

• $\rho_k(.,.)$ are arbitrary bounded functions
• Embedding changes do not interact
• Sampling is easy, as each pixel can be changed independently of the other pixels (no need for MCMC samplers)
• Embedding with minimal d is source coding with a fidelity constraint (Shannon 1954)
• Near-optimal quantizers exist based on the Viterbi algorithm: syndrome-trellis codes (Filler 2009)

Examples:
• $\rho_k(x, y_k) = 1$ iff $x_k \ne y_k$: d is the number of embedding changes
• $\rho_k(x, y_k) \in \{1, \infty\}$: wet paper codes
• $\rho_k(x, y_k) = \rho_k$ for $x_k \ne y_k$: changes have different weights
• $y_k \in M$, $|M| = m$: m-ary embedding
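In the additive case the Gibbs distribution factorizes over pixels, so optimal embedding can be simulated per pixel. A sketch for binary changes with per-pixel costs rho (the cost values and target payload are illustrative); a binary search finds the λ that meets the payload constraint:

```python
import numpy as np

def flip_probs(rho, lam):
    """Gibbs probability of changing pixel i: exp(-lam*rho_i)/(1+exp(-lam*rho_i))."""
    return 1.0 / (1.0 + np.exp(lam * rho))

def entropy_bits(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def solve_lambda(rho, m_bits, lo=1e-3, hi=1e3, iters=60):
    """Binary search for lambda such that the expected payload equals m_bits."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy_bits(flip_probs(rho, mid)).sum() > m_bits:
            lo = mid      # payload too large -> raise lambda (fewer changes)
        else:
            hi = mid
    return 0.5 * (lo + hi)

rho = np.random.default_rng(4).uniform(0.1, 5.0, size=10_000)  # illustrative costs
lam = solve_lambda(rho, m_bits=2_000)
p = flip_probs(rho, lam)
print(f"lambda={lam:.3f}  payload={entropy_bits(p).sum():.1f} bits  "
      f"E[d]={float((p * rho).sum()):.1f}")
```

Actual embedding (rather than simulation) then realizes these change probabilities with syndrome-trellis codes.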
Embedding while preserving a model
• Represent the image using a feature vector $f \in R^k$
• f is an image model
• f can be high-dimensional, e.g., $10^6$ or higher
• Define the distortion $d(x,y) = \| f(x) - f(y) \|$
• Minimizing d(.,.) then amounts to model preservation
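A toy feature-preservation distortion using a histogram as the (deliberately simplistic) model f:

```python
import numpy as np

def hist_features(img):
    """Toy image model f: the normalized 256-bin histogram (illustrative)."""
    h, _ = np.histogram(img, bins=256, range=(0, 256))
    return h / h.sum()

def model_distortion(x, y):
    """d(x, y) = || f(x) - f(y) ||"""
    return float(np.linalg.norm(hist_features(x) - hist_features(y)))

rng = np.random.default_rng(3)
x = rng.integers(0, 256, size=(64, 64))
y = x.copy()
y[0, 0] ^= 1                        # a single LSB flip
print(model_distortion(x, y))       # small but nonzero model disturbance
```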
HUGO
HUGO (Highly Undetectable steGO) by Pevný, Filler, Bas (IH 2010, Calgary)
- uses a model of dimensionality $10^7$
- steganalyzers of 2010 were unable to detect payloads up to 0.4 bpp
• The message is hidden while minimizing an additive distortion function designed to correlate with statistical detectability
• Incorporates syndrome-trellis codes for near-optimal content-adaptive embedding
[Images: cover image and the actual embedding changes]
Modern steganalysis
• Cast as a supervised pattern-recognition problem (Avcibas & Memon 2001, Farid 2002)
• Represent images using features (dimensionality reduction)
  - features sensitive to embedding but not to content
  - often computed from the noise component
  - sampled joint or conditional distributions (Shi, Chen & Chen 2006–2008)
  - designed to capture dependencies among neighboring pixels
• Train a classifier on examples of cover/stego images (see the sketch below)
  - support vector machines, neural networks, ensemble classifiers
  - binary, one-class, or multi-class
  - regressors can estimate the change rate
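A schematic of the training pipeline on synthetic features; a random forest stands in here for the SVMs and FLD ensembles used in practice, and the feature statistics are fabricated purely for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_cover = rng.normal(0.00, 1.0, size=(500, 100))   # synthetic cover features
X_stego = rng.normal(0.05, 1.0, size=(500, 100))   # slightly shifted by "embedding"
X = np.vstack([X_cover, X_stego])
y = np.r_[np.zeros(500), np.ones(500)]             # 0 = cover, 1 = stego

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"detection accuracy: {clf.score(X_te, y_te):.3f}")
```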
Steganalysis using rich models • Rich cover model - union of many smaller diverse submodels - all dimensions well populated - diversity more important than dimensionality • Scalable machine learning - ensemble classifier - fast w.r.t. dimensionality as well as training set size - allows using rich cover models
Assembling the rich model (spatial domain)
• Define pixel predictors $\mathrm{Pred}^{(k)}(x)$, k = 1, …, p, where $\mathrm{Pred}(x_{ij})$ is the value of $x_{ij}$ predicted from a local neighborhood not containing $x_{ij}$
• Compute the kth noise residual: $R^{(k)} = \mathrm{Pred}^{(k)}(x) - x$
• Quantize: $R^{(k)} \leftarrow \mathrm{round}(R^{(k)}/q)$, with q the quantization step
• Truncate: clamp $R^{(k)}$ to the range [−T, T]
• Form co-occurrence matrices of neighboring residual samples

Example predictors:
- $\mathrm{Pred}(x_{ij}) = x_{i,j+1}$ (horizontally constant model)
- $\mathrm{Pred}(x_{ij}) = (x_{i,j+1} + x_{i,j-1})/2$ (horizontally linear model)
- $\mathrm{Pred}(x_{ij}) = (x_{i,j-1} + 3x_{i,j+1} - x_{i,j+2})/3$ (horizontally quadratic model)
- …
Other options include vertical models, predictors utilizing local 3×3 or 5×5 neighborhoods or parts of them, non-linear operations (min and max), lookup tables of conditional probabilities learned from cover-source images, etc.
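A sketch of one submodel built from the horizontally linear predictor (q and T below are illustrative choices, not the published settings):

```python
import numpy as np

def submodel_features(x, q=1.0, T=2):
    """Residual -> quantize -> truncate -> horizontal co-occurrence."""
    x = x.astype(np.float64)
    pred = (x[:, 2:] + x[:, :-2]) / 2.0              # horizontally linear predictor
    r = np.round((pred - x[:, 1:-1]) / q)            # quantized noise residual
    r = np.clip(r, -T, T).astype(int)                # truncate to [-T, T]
    cooc = np.zeros((2 * T + 1, 2 * T + 1))
    np.add.at(cooc, (r[:, :-1].ravel() + T, r[:, 1:].ravel() + T), 1)
    return (cooc / cooc.sum()).ravel()               # normalized feature vector

img = np.random.default_rng(2).integers(0, 256, size=(64, 64))
print(submodel_features(img).shape)                  # (25,) for T = 2
```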
Complex models better adapt to content. Adaptive schemes are likely to embed along edges, even though the content there is still modelable in the vertical direction.

[Images: close-up of an edge; an image with many edges]

However, simple pixel differences computed across an edge will mostly fall into the marginal (truncated) bins; linear or quadratic models bring the residual back inside the co-occurrence matrix.
Ensemble classifiers
• Full feature space of dimension N
• Each base learner is trained on a random subspace of dimension k ≪ N (see the sketch below)
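A sketch of the random-subspace idea; Fisher linear discriminants are the base learners in the published ensemble, with scikit-learn's LDA standing in here, and L, k, and the data are illustrative:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def ensemble_predict(X_tr, y_tr, X_te, L=51, k=20, seed=0):
    """Train L base learners on random k-dim subspaces; fuse by majority vote."""
    rng = np.random.default_rng(seed)
    N = X_tr.shape[1]
    votes = np.zeros(X_te.shape[0])
    for _ in range(L):
        idx = rng.choice(N, size=k, replace=False)   # random subspace, dim k
        base = LinearDiscriminantAnalysis().fit(X_tr[:, idx], y_tr)
        votes += base.predict(X_te[:, idx])
    return (votes > L / 2).astype(int)               # fused decision

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1, (300, 200)), rng.normal(0.1, 1, (300, 200))])
y = np.r_[np.zeros(300), np.ones(300)]
pred = ensemble_predict(X[::2], y[::2], X[1::2])     # train on half, test on half
print(f"accuracy: {(pred == y[1::2]).mean():.3f}")
```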