Modern Trends in Steganography and Steganalysis

Modern Trends in Steganography and Steganalysis
Jessica Fridrich
Department of Electrical and Computer Engineering
State University of New York

Fundamental questions
• How to hide information securely?
• What does "securely" mean?
• How much can I embed undetectably?
• How should I embed?
• What are the most relevant open problems?
• What are the main achievements?

[Figure: IEEE publications per year on steganography and steganalysis, split into journal and conference papers]

[Figure: number of software stego applications over time. Data courtesy of Chet Hosmer, WetStone Tech]

[Figure: data hiding applications by media; the number of software applications that can hide data in electronic media as of June 2008. Data courtesy of Neil Johnson]

Fundamental questions
• How to hide information securely?
• What does "securely" mean?
• How much can I embed undetectably?
• How should I embed?
• What are the most relevant open problems?
• What are the main achievements?

Answers to these questions depend on whom you ask!

Everything depends on the source

Artificial sources (iid sequences, Markov chains): the abstraction
• Steganographic capacity known (positive rate)
• Capacity-approaching embedding using QIM
• Optimal detectors are likelihood-ratio tests (LRTs)

Empirical sources (digital media): the reality
• Model unknown
• Capacity unknown
• Best embedding algorithms unknown
• Optimal detectors unknown

Steganographic channel

• Cover source: random variable x on X, x ~ p_x
• Stego source: random variable y on X, y ~ p_y
• Message source: random variable on M
• Key source: random variable on K
• Channel: error-free/noisy (passive/active Warden)

[Figure: block diagram of the steganographic channel, with the Warden observing the channel]

Steganographic security

Measure of security: the Kullback-Leibler divergence D_KL(p_x || p_y) (Cachin 1998)
• Perfect security: D_KL(p_x || p_y) = 0
• ε-security: D_KL(p_x || p_y) ≤ ε
• The KL divergence is the error exponent of the optimal Neyman-Pearson detector: at a fixed P_FA, the missed-detection probability decays as exp(−n D_KL), where n is the number of pixels (illustrated below)
• Other Ali-Silvey distances could be used

Note that x can be
- pixels (DCT coefficients)
- groups of pixels
- features = low-dimensional image representations
- entire images
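As a toy illustration of these quantities (not from the talk; the 4-valued iid cover model and the change rate beta are made up for the example), the following Python snippet computes D_KL between a cover distribution and its LSB-flipped stego version, and the resulting Stein exponent n*D_KL:

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) in nats for two discrete distributions.
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

beta = 0.05                                   # fraction of pixels with flipped LSB
p_cover = np.array([0.30, 0.20, 0.35, 0.15])  # toy histogram over 4 pixel values
p_stego = (1 - beta) * p_cover + beta * p_cover[[1, 0, 3, 2]]  # LSB pairs swapped

d = kl_divergence(p_cover, p_stego)
n = 10 ** 6                                   # number of pixels
print(f"D_KL = {d:.3e} nats/pixel, Stein exponent n*D_KL = {n * d:.1f}")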

Perfect security
• Given full knowledge of the cover source p_x, perfectly secure stegosystems exist (Anderson & Petitcolas, 1997)
• Can be realized using the principle of cover synthesis, by utilizing a perfect compressor of the cover source (see the toy sketch below)
• The sender communicates on average H(x) bits per object sent, the entropy of the cover source

Message source: the encrypted message bits m are random bits
Source decompressor D: y = D(m)
The stego object y is generated directly; there is no cover x on the sender's input.

• First practical realization for JPEG images: model-based steganography (Sallee, IWDW 2003)
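A minimal sketch of cover synthesis (not from the talk): a Huffman code over blocks of an iid Bernoulli(p) source stands in for the perfect compressor, so the synthesized objects only approximately follow the cover distribution; an arithmetic coder would make this exact.

import heapq, itertools, random

def huffman(probs):
    # probs: dict block -> probability; returns dict block -> bitstring.
    heap = [(p, i, (sym,)) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    code = {sym: "" for sym in probs}
    counter = itertools.count(len(heap))
    while len(heap) > 1:
        p0, _, syms0 = heapq.heappop(heap)
        p1, _, syms1 = heapq.heappop(heap)
        for s in syms0:
            code[s] = "0" + code[s]
        for s in syms1:
            code[s] = "1" + code[s]
        heapq.heappush(heap, (p0 + p1, next(counter), syms0 + syms1))
    return code

p, k = 0.2, 8                      # iid Bernoulli(p) cover source, blocks of k bits
probs = {"".join(b): p ** b.count("1") * (1 - p) ** b.count("0")
         for b in itertools.product("01", repeat=k)}
enc = huffman(probs)               # near-optimal prefix code for the cover source
dec = {v: s for s, v in enc.items()}

# Cover synthesis: decode random (= encrypted) message bits into cover blocks.
msg = "".join(random.choice("01") for _ in range(200))
out, buf = [], ""
for bit in msg:
    buf += bit
    if buf in dec:                 # a complete codeword -> emit a cover block
        out.append(dec[buf])
        buf = ""
stego = "".join(out)               # distributed (approximately) like a typical cover

# The receiver recovers the message by re-compressing the stego object.
recovered = "".join(enc[stego[i:i + k]] for i in range(0, len(stego), k))
assert msg.startswith(recovered)   # equal up to the unconsumed tail of msg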

Steganographic capacity

Steganographic capacity is the maximum rate over all perfectly secure stegosystems.
• Measured in bits per cover element (a rate)
• A property of the cover+key+message sources and the channel, not of a specific embedding scheme!
• It is positive: the size of the secure payload increases linearly with the cover size.

Steganographic capacity

The most general result (Moulin & Wang, 2004, 2008), for an active warden and a distortion-limited embedder:
• Distortion per cover element: E_{m,k,x}[d(x,y)] ≤ D_s
• Channel distortion (y → y'): E_x[d(y,y')] ≤ D_w
• Steganographic constraint: p_x = p_y
• Sender and Warden know the channel and p_x, p_m, p_k, p_y

The capacity C(D_s, D_w) can be derived as the cover size n → ∞.

Explicit formulas are available for some simple cover sources:
• Bernoulli sequence with d(.,.) = Hamming distance (Moulin & Wang, 2004)
• Gaussian source with d(.,.) = squared distance (Gonzales & Comesaña, 2007)

Capacity for empirical sources?

For empirical sources, p_x is unknown: Alice cannot preserve it, and the Warden cannot build an LRT.

If Alice preserves a simplified model, the Warden will work outside of the model and detect the embedding.

History teaches us that no matter how hard Alice tries, the Warden will be smarter and will find a representation of the covers in which detection is possible (D_KL > 0).

Alice knows that her scheme is not secure. How big a payload (the secure payload) can Alice send at a fixed* risk?

*Fixed risk = fixed KL divergence.

The square root law of secure payload

Square root law of imperfect steganography (Ker 2007 for iid sources; Filler 2009 for Markov sources)
• The secure payload is a property of the embedding scheme! This is why we do not use the term "capacity"
• The secure payload grows only with the square root of the cover size
• Model mismatch does not merely decrease the communication rate; it makes the rate drop to zero (already suspected by Anderson & Petitcolas 1997). This is fundamentally very different from other situations in communication
• Experimentally verified on images for a wide spectrum of steganographic methods

The square root law: assumptions

1. The cover source is a first-order Markov chain with transition probability matrix A = (a_ij), a_ij = Pr(x_{n+1} = j | x_n = i)
   - the simplest tractable model that captures dependencies among pixels
2. Embedding is a probabilistic mapping applied to each pixel independently, Pr(y_n = j | x_n = i) = b_ij
   - captures the essence of most practical stegosystems
3. The steganographic system is not perfect (D_KL > 0)
   - we already know that perfectly secure stegosystems have a positive rate (the secure payload increases linearly); that result holds only for artificial covers

[Figure: a cover Markov chain (MC) with states i → j → k → l and transition probabilities a_ij, a_jk, a_kl; independent embedding maps each cover state to a stego state (i', j', k', l') with probabilities b_ii', b_jj', b_kk', b_ll', so the stego sequence is a hidden Markov chain (HMC)]

The square root law: theorem

As n → ∞:
• Conservative payload, M/√n → 0: the stegosystem is asymptotically perfectly secure (D_KL → 0)
• Unsafe payload, M/√n → ∞: an arbitrarily accurate detector can be built
• Borderline payload, M/√n → const: the stegosystem is ε-secure (D_KL ≤ ε)

n ... cover size (e.g., number of pixels)
M ... payload size (message length in bits)

Where does the SRL come from?

D_KL(p_x || p_y) ≈ ½ n β² I(0) + O(β³)

Fixed risk: D_KL = const ⇒ nβ² = const ⇒ payload ∝ nβ ∝ √n (spelled out below)

n ... number of cover elements
p_x ... cover distribution
p_y ... stego distribution with a fraction β of changed pixels
I(0) ... Fisher information w.r.t. β at β = 0, I(0) = E_x[(∂ log p_y(β)/∂β)²]
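Spelling out the step from fixed risk to the √n scaling (same notation; a standard Taylor-expansion argument):

\[
\tfrac{1}{2}\, n \beta^2 I(0) \le \epsilon
\;\Longrightarrow\;
\beta \le \sqrt{\frac{2\epsilon}{n\, I(0)}}
\;\Longrightarrow\;
M \propto n\beta \le \sqrt{\frac{2\epsilon\, n}{I(0)}} = O(\sqrt{n}).
\]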

[Figure: experimental verification; secure payload M (bits) plotted against cover size n]

What does this mean for practitioners?
• A ten-times bigger image can hold only about a three-times (√10 ≈ 3.16) bigger payload at the same level of statistical detectability
  - Alice and Bob, be conservative with your payload!
• It is easier to detect the same relative payload in larger images!
  - be careful when comparing steganalysis results across different databases
  - need to fix payload/√n, the "root rate"
• Use the root rate for comparing/benchmarking/designing stegosystems (a quick numeric illustration follows below)
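A quick numeric illustration of the root rate, with made-up image sizes:

import math

def root_rate(payload_bits, n_pixels):
    # Root rate = payload / sqrt(cover size): the SRL-fair quantity to fix.
    return payload_bits / math.sqrt(n_pixels)

n_small, n_large = 10 ** 6, 10 ** 7    # a 1 MP and a 10 MP image
m_small = 0.1 * n_small                # 0.1 bpp in the small image
# Keeping the same root rate (same detectability under the SRL), the large
# image may carry only sqrt(10) ~ 3.16x more bits, not 10x more:
m_large = root_rate(m_small, n_small) * math.sqrt(n_large)
print(m_large / m_small)               # ~3.162
print(m_large / n_large)               # relative payload drops to ~0.0316 bpp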

So, how shall we embed in practice?

Options available:
• Force embedding to resist a given attack
  - necessary but not sufficient, usually rather weak
• Adopt a model (or its sampled version) and preserve it
  - unless the model is sufficiently complex, usually easy to detect
• Mimic natural processing / noise
  - very hard to do right (Franz, 2008)
• Minimize embedding distortion
  - the best steganographic algorithms for digital images known today work precisely in this manner
  - no need to estimate high-dimensional distributions
  - can work with very high-dimensional models
  - well developed; an entire framework is in place

Steganography minimizing distortion

Historical development:
• d(.,.) = number of embedding changes
  - matrix embedding (Crandall 1998, Westfeld 2001, ...)
• d(.,.) = L1 or L2 distortion
  - MMEx (Kim, Duric & Richards 2006), Sachnev & Kim 2009, perturbed quantization 2004, ...
• general d(.,.)
  - syndrome-trellis codes (Filler 2009-11), HUGO (Pevný 2010)

Strategy:
a) fix the embedding operation
b) define the embedding distortion function d(x,y)
c) embed by minimizing d(x,y) for a given payload and cover x
d) search for the d(.,.) that minimizes detectability in practice

Formalization
• During embedding, cover x is slightly modified to a stego object y that carries the desired payload
• For each x, the sender specifies a set of plausible stego images y ∈ Y
  - LSB embedding: Y = {x, LSBflip(x)}
  - ±k embedding: Y = {x, x±1, ..., x±k}
  - when x is not to be modified, Y = {x}
• The distortion measure d(x,y) is an arbitrary bounded function of x and y
• The embedding algorithm selects each y ∈ Y with probability π(y)
  - Expected distortion: E_π[d] = Σ_y π(y) d(x,y)
  - Expected payload: H(π) = −Σ_y π(y) log₂ π(y)

Searching for the best π

Payload-limited sender: minimize the expected distortion subject to a payload constraint

  minimize E_π[d] = Σ_y π(y) d(x,y)   subject to   H(π) = M

Distortion-limited sender: maximize the expected payload subject to a distortion bound

  maximize H(π)   subject to   E_π[d] = Σ_y π(y) d(x,y) ≤ D

The optimization is over all distributions π over Y (a derivation sketch follows below).

It is π and Y that encapsulate the entire embedding algorithm. They also depend on the cover x!
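The Gibbs form on the next slide drops out of a standard maximum-entropy / Lagrange-multiplier argument; as a sketch (for the distortion-limited sender, with μ enforcing normalization):

\[
L(\pi,\lambda,\mu) = -\sum_{y}\pi(y)\log\pi(y)
 - \lambda\sum_{y}\pi(y)\,d(x,y)
 - \mu\Big(\sum_{y}\pi(y)-1\Big)
\]
\[
\frac{\partial L}{\partial \pi(y)} = -\log\pi(y) - 1 - \lambda\, d(x,y) - \mu = 0
\;\Longrightarrow\;
\pi_\lambda(y) = \frac{e^{-\lambda d(x,y)}}{Z(\lambda)},\qquad
Z(\lambda)=\sum_{y\in\mathcal{Y}} e^{-\lambda d(x,y)},
\]

with λ ≥ 0 chosen to meet the payload or distortion constraint.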

Gibbs distribution

The optimal distribution (algorithm) for both senders is the Gibbs distribution

  π_λ(y) = (1/Z(λ)) exp(−λ d(x,y))

• λ > 0 is a scalar parameter determined from the appropriate constraint (the "inverse temperature")
• Z(λ) = Σ_y exp(−λ d(x,y)) is the normalization factor (the partition function)

We can import many useful algorithms from statistical physics into steganography.

Separation principle
• By sampling from π(y), one can simulate the impact of theoretically optimal embedding!
  - Gibbs sampler, Markov chain Monte Carlo methods
  - detectability can be tested before implementing the scheme
• Near-optimal practical embedding schemes can be constructed using syndrome-trellis codes
  - when d(.,.) is a sum of locally-supported potential functions
  - includes the most common case when d is additive (more on the next slide)
• We can compute the rate-distortion bound, the relationship between the expected distortion E[d] and the payload H(π)
  - the bound tells us how close to the optimum a given implementation is
  - thermodynamic integration + Gibbs sampler

Additive distortion

  d(x,y) = Σ_k ρ_k(x, y_k)

• ρ_k(.,.) are arbitrary bounded functions
• Embedding changes do not interact
• Sampling is easy, as each pixel can be changed independently of the other pixels (no need for MCMC samplers; see the sketch below)
• Embedding with minimal d is source coding with a fidelity constraint (Shannon 1959)
• Near-optimal quantizers exist based on the Viterbi algorithm: syndrome-trellis codes (Filler 2009)
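A minimal sketch of the simulator this enables for binary embedding (the per-pixel costs rho below are synthetic stand-ins, not any published cost function): each pixel is flipped with the Gibbs probability p_i = exp(−λρ_i)/(1 + exp(−λρ_i)), and λ is found by bisection so that the expected payload Σ_i H(p_i) meets the constraint.

import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def flip_prob(rho, lam):
    # Gibbs probability of changing pixel i under additive distortion rho_i.
    return np.exp(-lam * rho) / (1 + np.exp(-lam * rho))

def simulate_embedding(rho, m_bits, rng, iters=60):
    # Payload-limited sender: bisect lambda so that the expected payload
    # sum_i H(p_i) hits m_bits, then sample the changes independently.
    lo, hi = 1e-6, 1e6             # assumes m_bits is attainable in this range
    for _ in range(iters):
        lam = (lo * hi) ** 0.5     # geometric bisection
        if binary_entropy(flip_prob(rho, lam)).sum() > m_bits:
            lo = lam               # too much payload -> need fewer changes
        else:
            hi = lam
    p = flip_prob(rho, lam)
    return rng.random(rho.size) < p, p

rng = np.random.default_rng(0)
n = 512 * 512
rho = rng.gamma(2.0, 1.0, size=n)  # synthetic stand-in per-pixel costs
changes, p = simulate_embedding(rho, m_bits=0.2 * n, rng=rng)
print(changes.mean(), binary_entropy(p).sum() / n)   # change rate, payload (bpp)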

Examples

• ρ_k(x, y_k) = 1 iff x_k ≠ y_k: d is the number of embedding changes
• ρ_k(x, y_k) ∈ {1, ∞}: wet paper codes
• ρ_k(x, y_k) = ρ_k for x_k ≠ y_k: changes have different weights
• y_k ∈ M, |M| = m: m-ary embedding

Embedding while preserving a model
• Represent the image using a feature vector f ∈ R^k
• f is an image model
• f can be high-dimensional, e.g., 10^6 or higher
• Define the distortion d(x,y) = ||f(x) − f(y)||
• Minimizing d(.,.) ⇒ model preservation

HUGO

HUGO (Highly Undetectable steGO) by Pevný, Filler, Bas (IH 2010, Calgary)
- uses a model of dimensionality 10^7
- steganalyzers of 2010 were unable to detect payloads up to 0.4 bpp

• Message hidden while minimizing an additive distortion function designed to correlate with statistical detectability
• Incorporates syndrome-trellis codes for near-optimal content-adaptive embedding

[Figure: a cover image and the locations of the actual embedding changes]

Modern steganalysis
• Cast as a supervised pattern-recognition problem (Avcibas, Memon 2001; Farid 2002)
• Represent images using features (dimensionality reduction)
  - features sensitive to embedding but not to content
  - often computed from the noise component
  - sampled joint or conditional distributions (Shi, Chen & Chen 2006-2008)
  - designed to capture dependencies among neighboring pixels
• Train a classifier on examples of cover/stego images
  - support vector machines, neural networks, ensemble classifiers
  - binary, one-class, or multi-class
  - regressors can estimate the change rate

Steganalysis using rich models
• Rich cover model
  - union of many smaller diverse submodels
  - all dimensions well populated
  - diversity more important than dimensionality
• Scalable machine learning
  - ensemble classifier
  - fast w.r.t. dimensionality as well as training-set size
  - allows using rich cover models

Assembling the rich model (spatial domain)
• Define pixel predictors Pred^(k)(x), k = 1, ..., p, where Pred(x_ij) is the value of x_ij predicted from a local neighborhood not containing x_ij
• Compute the kth noise residual: R^(k) = Pred^(k)(x) − x
• Quantize: R^(k) = round(R^(k) / q), q = quantization step
• Truncate: if |R^(k)| > T, set R^(k) = ±T
• Form co-occurrence matrices of neighboring residual samples

Example predictors:
• Pred(x_ij) = x_{i,j+1}: horizontally constant model
• Pred(x_ij) = (x_{i,j+1} + x_{i,j−1})/2: horizontally linear model
• Pred(x_ij) = (x_{i,j−1} + 3x_{i,j+1} − x_{i,j+2})/3: horizontally quadratic model

Other options include vertical models, predictors utilizing local 3×3 or 5×5 neighborhoods or their parts, non-linear operations min and max, lookup tables of conditional probabilities learned from cover-source images, etc. (A sketch of one submodel follows below.)
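As a sketch, one such submodel in Python (the horizontally linear predictor from above; q = 1, T = 2, and third-order co-occurrences are illustrative parameter choices, not the published ones):

import numpy as np

def submodel_features(x, q=1.0, T=2, order=3):
    # One submodel: horizontally linear predictor, quantized and truncated
    # residual, and a joint histogram of `order` adjacent residual samples.
    x = x.astype(np.float64)
    pred = 0.5 * (x[:, :-2] + x[:, 2:])   # Pred(x_ij) = (x_{i,j-1} + x_{i,j+1}) / 2
    r = pred - x[:, 1:-1]                 # noise residual R = Pred(x) - x
    r = np.clip(np.round(r / q), -T, T).astype(int) + T   # values in 0..2T
    base = 2 * T + 1
    cols = [r[:, i:r.shape[1] - order + 1 + i] for i in range(order)]
    codes = sum(c * base ** i for i, c in enumerate(cols)).ravel()
    hist = np.bincount(codes, minlength=base ** order).astype(np.float64)
    return hist / hist.sum()              # normalized co-occurrence feature

img = np.random.default_rng(1).integers(0, 256, size=(256, 256))
print(submodel_features(img).shape)       # (125,) for T = 2, order = 3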

Complex models better adapt to content

Adaptive schemes are likely to embed in edge and texture regions, even though the content there is still modelable, e.g., in the vertical direction.

[Figure: an image with many edges and an edge close-up]

However, simple pixel differences will mostly fall into the marginal (truncation) bins there. Linear or quadratic models bring the residual back inside the co-occurrence matrix.

Ensemble classifiers

[Figure: feature space of dimension N; random subspaces of dimension k]
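To illustrate the random-subspace idea, here is a sketch in the spirit of such an ensemble, assuming Fisher linear discriminants as base learners and majority-vote fusion; all parameters and the synthetic data are made up for the example:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

class RandomSubspaceEnsemble:
    # L Fisher linear discriminants, each trained on a random k-dimensional
    # subspace of the N-dimensional feature space; majority-vote fusion.
    def __init__(self, n_learners=51, k=200, seed=0):
        self.n_learners, self.k = n_learners, k
        self.rng = np.random.default_rng(seed)
        self.members = []

    def fit(self, X, y):
        for _ in range(self.n_learners):
            idx = self.rng.choice(X.shape[1], size=min(self.k, X.shape[1]),
                                  replace=False)
            self.members.append((idx, LinearDiscriminantAnalysis().fit(X[:, idx], y)))
        return self

    def predict(self, X):
        votes = sum(clf.predict(X[:, idx]) for idx, clf in self.members)
        return (votes > self.n_learners / 2).astype(int)

# Toy usage with synthetic cover (label 0) / stego (label 1) features:
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 1000))
y = np.repeat([0, 1], 200)
X[y == 1] += 0.05                   # a tiny, hard-to-see embedding shift
perm = rng.permutation(400)
X, y = X[perm], y[perm]
model = RandomSubspaceEnsemble().fit(X[:300], y[:300])
print((model.predict(X[300:]) == y[300:]).mean())   # test accuracy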