Reconstructing Data Perturbed by Random Projections when the Mixing Matrix is Known

Yingpeng Sang (presenter), Hong Shen (School of Computer Science)
Hui Tian (School of Mathematical Sciences)
The University of Adelaide, Adelaide, SA, Australia

{yingpeng.sang,hong.shen,hui.tian}@adelaide.edu.au

ECML PKDD 2009

Overview

• Introduction
• Problems
• Data Reconstruction Methods
• Experiments and Comparisons
• Conclusions

Introduction

Privacy-preserving Computation (Secure Multiparty Computation)

Secure Multiparty Computation: computing F(X, Y, Z, ...) without disclosing the private inputs X, Y, Z. Existing cryptographic solutions for F:


• Circuit evaluation (by Andrew C. Yao): exponential complexity, e.g. O(|X|^N)

• Improvement efforts (based on homomorphic encryption): polynomial complexity, e.g. O((lg |X|)^a N^b)

Privacy-preserving Data Mining

Existing cryptographic solutions:

• not secure: information leakage among the cryptographic blocks

• not efficient: need to process high volumes of data


Privacy-preserving Data Mining (Contd.)

Existing non-cryptographic solutions:

• F can be very efficient;
• Existing perturbation methods: additive distortion, multiplicative distortion, swapping, anonymization, etc.

• This presentation: risk analysis of one multiplicative distortion method (random projection).

Random Projection of a Vector

An m-dimensional vector can be randomly projected into a k-dimensional space by a k × m random matrix:









\[
x = \begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix}, \qquad
R = \begin{pmatrix} r_{1,1} & \cdots & r_{1,m} \\ \vdots & & \vdots \\ r_{k,1} & \cdots & r_{k,m} \end{pmatrix},
\]

in which each entry r_{i,j} is an independent and identically distributed (i.i.d.) random variable. The result of the random projection is a k-dimensional vector:





\[
u = Rx = \begin{pmatrix} r_{1,1} x_1 + \cdots + r_{1,m} x_m \\ \vdots \\ r_{k,1} x_1 + \cdots + r_{k,m} x_m \end{pmatrix} = \begin{pmatrix} u_1 \\ \vdots \\ u_k \end{pmatrix}.
\]
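As a minimal NumPy sketch of this projection (the sizes m = 10, k = 4 and the Gaussian choice for the i.i.d. entries are assumptions for the illustration, not fixed by the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 10, 4                      # original and projected dimensions, k < m
x = rng.uniform(size=m)           # an m-dimensional data vector
R = rng.normal(0.0, 1.0, (k, m))  # k x m mixing matrix with i.i.d. entries
u = R @ x                         # the k-dimensional projection u = Rx
print(u.shape)                    # (4,)
```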

Random Projection of a Database

A database X with n records and m attributes can be treated as an m × n matrix:



\[
X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{pmatrix}
\]

The random projection of X into a k-dimensional space:







\[
\underbrace{\begin{pmatrix} r_{1,1} & \cdots & r_{1,m} \\ \vdots & & \vdots \\ r_{k,1} & \cdots & r_{k,m} \end{pmatrix}}_{k \times m}
\cdot
\underbrace{\begin{pmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{pmatrix}}_{m \times n}
= \underbrace{U}_{k \times n}
\]

U has n records, but each record has k attributes.
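A sketch of the whole-database projection, under the same illustrative assumptions as above (sizes and Gaussian entries are not fixed by the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 10, 100, 4              # attributes, records, projected dimension
X = rng.uniform(size=(m, n))      # database: one record per column
R = rng.normal(0.0, 1.0, (k, m))  # the k x m mixing matrix
U = R @ X                         # perturbed database U = RX, shape (k, n)
print(U.shape)                    # (4, 100): n records, k attributes each
```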


An Example of Random Projection

[Figure: a worked numerical example of a random projection (not recoverable from the extraction).]

Random Projection-based Data Perturbation

Random projection preserves the distances among the data points well when the entries of R are i.i.d. variables with mean zero and variance σ²:

• It can be proved that R is approximately orthogonal: R′R ≈ (kσ²)I, where I is the m × m identity matrix;
• Random projections: u_i = (1/(√k σ)) R x_i, u_j = (1/(√k σ)) R x_j;
• u_i′ u_j ≈ x_i′ x_j, and d(x_i, x_j) ≈ d(u_i, u_j); a numerical check follows the equation below.

Random projection can disguise the attribute values of each record of X:









\[
u_i = \begin{pmatrix} u_{1,i} \\ \vdots \\ u_{k,i} \end{pmatrix} = R x_i = \begin{pmatrix} r_{1,1} x_{1,i} + \cdots + r_{1,m} x_{m,i} \\ \vdots \\ r_{k,1} x_{1,i} + \cdots + r_{k,m} x_{m,i} \end{pmatrix}.
\]
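A sketch checking the distance-preservation claim numerically; the sizes and the Gaussian entries are assumptions for the illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
m, k, sigma = 500, 200, 1.0
R = rng.normal(0.0, sigma, (k, m))   # i.i.d. entries, mean 0, variance sigma^2
xi, xj = rng.uniform(size=m), rng.uniform(size=m)
ui = R @ xi / (np.sqrt(k) * sigma)   # scaled projections
uj = R @ xj / (np.sqrt(k) * sigma)
print(np.linalg.norm(xi - xj))       # original distance
print(np.linalg.norm(ui - uj))       # approximately equal for large k
```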


Privacy-preserving Data Mining by Random Projections

The miner completes the data mining tasks on the data perturbed by random projections:

• the original data X, Y are the private data of the data owners;
• random projection preserves the Euclidean distances inside the original data when the same R is used by all the data providers for the perturbations;
• this perturbation is therefore suitable for data mining tasks that rely on metrics based on Euclidean distance.

[Figure: (a) Centralized Model: there is a third party; (b) Distributed Model: there is no third party.]

Problems

Is random projection a secure method to protect privacy?

Some a priori knowledge is easily leaked to the miner:


• in the centralized model, R will be leaked to the miner by collusion;

• in the distributed model, R is certainly known by the miner;

• the mean vector and covariance matrix of the original data can be estimated:

◦ let C be the population from which X is extracted, C ∼ P(µ, Σ);
◦ if enough samples X′ ⊂ C can be obtained, then µ and Σ can be estimated from the training set X′.

Data Reconstruction & Recovery Rate

Data reconstruction is the reverse process of data perturbation.


The recovery rate evaluates the reconstruction performance:


• let x_{i,j} be the (i, j)-th entry of the original data X;
• let x̂_{i,j} be the (i, j)-th entry of the recovered data X̂;
• the recovery rate is the percentage of reconstructed entries whose relative errors are within ε (ε is a given parameter):

\[
r(\hat{X}, \epsilon) = \frac{\#\{\hat{x}_{i,j} : \left|\frac{x_{i,j} - \hat{x}_{i,j}}{x_{i,j}}\right| \le \epsilon,\ i = 1, \dots, m,\ j = 1, \dots, n\}}{m \cdot n}
\]
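A direct transcription of this formula as a small helper, assuming X and X̂ are m × n NumPy arrays with no zero entries (so the relative error is well defined):

```python
import numpy as np

def recovery_rate(X, X_hat, eps):
    """Fraction of entries recovered to within relative error eps."""
    rel_err = np.abs((X - X_hat) / X)
    return np.mean(rel_err <= eps)
```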

Our Contributions


• We will act as an adversarial miner, who knows the perturbed data U.

• We assume we know the mixing matrix R, the estimated mean vector µ, and the covariance matrix Σ of the original data X.

• We will reconstruct X with high recovery rates,
◦ to capture the privacy in X.


Data Reconstruction Methods

Existing Methods: Solving a Linear System

Given R and U, construct a system of linear equations:










\[
u_i = \begin{pmatrix} u_{1,i} \\ \vdots \\ u_{k,i} \end{pmatrix} = R x_i = \begin{pmatrix} r_{1,1} x_{1,i} + \cdots + r_{1,m} x_{m,i} \\ \vdots \\ r_{k,1} x_{1,i} + \cdots + r_{k,m} x_{m,i} \end{pmatrix}
\]

• when k ≥ m, there is a unique solution for x_i;
• when k < m, there are infinitely many solutions for x_i.
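A sketch of the two regimes; the sizes are assumptions, and for k ≥ m the claim holds when R has full column rank (which a random R has with probability 1):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 5
x = rng.uniform(size=m)

R_tall = rng.normal(size=(8, m))                      # k = 8 >= m
x_rec, *_ = np.linalg.lstsq(R_tall, R_tall @ x, rcond=None)
print(np.allclose(x_rec, x))                          # True: unique solution

R_flat = rng.normal(size=(3, m))                      # k = 3 < m
x_min, *_ = np.linalg.lstsq(R_flat, R_flat @ x, rcond=None)
print(np.allclose(x_min, x))                          # False in general: lstsq
                                                      # returns only one of
                                                      # infinitely many solutions
```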


Existing Methods: Finding a Partition Matrix

When k < m, there may exist a Partition Matrix P for R, e.g.


\[
R = \begin{pmatrix} r_{1,1} & r_{1,2} & r_{1,3} \\ r_{2,1} & r_{2,2} & r_{2,3} \end{pmatrix}, \qquad
P = C \times R = \begin{pmatrix} c_{1,1} & c_{1,2} \\ c_{2,1} & c_{2,2} \end{pmatrix} R = \begin{pmatrix} a & 0 & 0 \\ 0 & b & c \end{pmatrix},
\]
\[
C u_i = C R x_i = P x_i = \begin{pmatrix} a x_{1,i} \\ b x_{2,i} + c x_{3,i} \end{pmatrix}.
\]

• Whether P exists and how it partitions depends on the structure of R;
• At most k − 1 attributes can be separated;
• When m ≥ 2k − 1 and m ≥ 2, it can be proved that P does not exist for any R.

Existing Methods: Maximum A Posteriori Estimation

Let p(x̂|u) be the posterior probability of observing a vector x̂ given u. Maximum A Posteriori (MAP) estimation searches for an x̂ in the space ℝ^m that maximizes p(x̂|u):

\[
\hat{x} = \arg\max p(\hat{x}|u) = \arg\max \frac{p(\hat{x})\, p(u|\hat{x})}{p(u)} = \arg\max_{R\hat{x}=u} p(\hat{x})
\]

In this equation,

• p(u) is a constant;
• p(u|x̂) can be treated as a constraint:

\[
p(u|\hat{x}) = \begin{cases} 1, & \text{if } R\hat{x} = u, \\ 0, & \text{if } R\hat{x} \neq u. \end{cases}
\]

Existing Methods: Maximum A Posteriori Estimation (Contd.)

MAP is a promising method for our problem, but it is only a general framework:


• MAP has been used to separate sparse and independent data,

◦ how to reduce the permutation and scaling ambiguities after the separation?

• what about scenarios with non-sparse and non-independent data?


Existing Methods: the Application of MAP in Blind Source Separation

Underdetermined Blind Source Separation (or Independent Component Analysis):


• There are m series of source signals which are mutually independent and sparse, e.g. voices, images, etc. They can be modeled by the Laplace distribution: p(x_i) ∝ e^{−|x_i|}.

• There are k (k < m) linear receivers. The received data is U = RX.
• Underdetermined BSS (ICA) estimates some of the sources x_i from the received data.


◦ Step 1: estimate R;
◦ Step 2: estimate x_i.

Existing Methods: the Application of MAP in Blind Source Separation (Contd.)

The estimation of x_i needs to solve a linear programming (L1-norm minimization) problem:

\[
x = \arg\max_{Rx=u} p(x) = \arg\max_{Rx=u} e^{-|x_1| - \cdots - |x_m|} = \arg\min_{Rx=u} \sum_{i=1}^{m} |x_i|
\]
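A sketch of this L1-norm minimization as a linear program, using the standard splitting |x_i| ≤ t_i; SciPy's linprog is one choice of solver, any LP solver works:

```python
import numpy as np
from scipy.optimize import linprog

def l1_reconstruct(R, u):
    """Solve min sum|x_i| subject to Rx = u via linear programming."""
    k, m = R.shape
    c = np.concatenate([np.zeros(m), np.ones(m)])   # minimize sum(t)
    A_eq = np.hstack([R, np.zeros((k, m))])         # Rx = u
    I = np.eye(m)
    A_ub = np.vstack([np.hstack([I, -I]),           #  x - t <= 0
                      np.hstack([-I, -I])])         # -x - t <= 0
    b_ub = np.zeros(2 * m)
    bounds = [(None, None)] * m + [(0, None)] * m   # x free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=u, bounds=bounds)
    return res.x[:m]
```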


Our Reconstruction Method I

In scenarios where the attributes are mutually independent and sparse, our reconstruction method is based on underdetermined BSS (or ICA). However, U-BSS (or U-ICA) suffers from permutation and scaling ambiguities. Our method:

• since R is known, we use it to eliminate the permutation ambiguity entirely;
• using the estimated mean and variance vectors, we reduce the scaling ambiguity.


Our Reconstruction Method I (Contd.)

Our method to reduce the scaling ambiguity (sketched in code below):


• we first remove the means and variances from the original data,
• then recover the zero-mean, identical-variance data by solving an L1-norm minimization problem,
• finally add the means and variances back to the recovered data.
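A sketch of one plausible reading of this three-step pipeline, reusing the l1_reconstruct helper above; mu and std are the per-attribute mean and standard deviation the adversary is assumed to have estimated:

```python
import numpy as np

def reconstruct_method1(R, u, mu, std):
    """Recover one record x from u = Rx, given estimated mean and std."""
    # Step 1: remove means and variances. If x = mu + std * z with z
    # standardized, then u - R @ mu = (R * std) @ z.
    u_centered = u - R @ mu
    R_scaled = R * std                    # scales column j of R by std[j]
    # Step 2: recover the standardized data by L1-norm minimization.
    z = l1_reconstruct(R_scaled, u_centered)
    # Step 3: add the means and variances back.
    return mu + std * z
```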


Our Reconstruction Method II

In scenarios where the attributes are neither mutually independent nor sparse, U-BSS (or U-ICA) will not be effective, since it models the sources as m series of Laplace-distributed data (mutually independent and sparse).

• The Laplace distribution is super-Gaussian, with kurtosis > 3.
• The Gaussian (normal) distribution has a kurtosis of 3.


Our Reconstruction Method II: a new p.d.f.

In scenarios where the attributes are neither mutually independent nor sparse, we use the multivariate normal distribution to model the original data X:

• X = (x_1, ..., x_m)′, X ∼ N(µ, Σ_X);
• µ is the mean vector;
• Σ_X is the covariance matrix;

\[
p(x) = \frac{1}{(2\pi)^{m/2} |\Sigma_X|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)' \Sigma_X^{-1} (x-\mu)}
\]


Our Reconstruction Method II: a new optimization problem

We return to the MAP estimation and obtain a constrained Quadratic Programming (QP) problem:


\[
\hat{x} = \arg\max_{R\hat{x}=u} p(\hat{x}) = \arg\min_{R\hat{x}=u} \frac{1}{2}(\hat{x}-\mu)' \Sigma_X^{-1} (\hat{x}-\mu)
\]

When Σ_X is positive definite,

• (1/2)(x̂ − µ)′ Σ_X^{-1} (x̂ − µ) is a convex function,
• the constrained QP problem has a unique solution.


Our Reconstruction Method II: solve it efficiently

We solve the constrained QP problem by the Lagrange Multiplier method:


\[
L(\hat{x}, \Lambda) = \frac{1}{2}(\hat{x}-\mu)' \Sigma_X^{-1} (\hat{x}-\mu) + \Lambda' (R\hat{x} - u)
\]

The solution x̂ of the QP problem is the solution of ∇L = 0:

\[
\hat{x} = \mu + \Sigma_X R' \Sigma_U^{-1} (u - R\mu).
\]

• We can also prove that Σ_U, the covariance matrix of the perturbed data U, is nonsingular with high probability.
• x̂ is a reconstruction of the original data vector x.
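A NumPy sketch of this closed-form solution; taking Σ_U = R Σ_X R′ is an assumption consistent with u = Rx:

```python
import numpy as np

def reconstruct_method2(R, u, mu, Sigma_X):
    """Closed-form MAP reconstruction under a multivariate normal model."""
    Sigma_U = R @ Sigma_X @ R.T               # covariance of u = Rx
    w = np.linalg.solve(Sigma_U, u - R @ mu)  # Sigma_U^{-1} (u - R mu)
    return mu + Sigma_X @ R.T @ w             # mu + Sigma_X R' Sigma_U^{-1} (u - R mu)
```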


Experiments and Comparisons

An Available Method for Comparisons: Principal Component Analysis

As far as we know, the only available reconstruction method for random projection-based perturbation is Principal Component Analysis (PCA)-based reconstruction. However, PCA assumes the attributes of the original data are uncorrelated.

• Attributes that are “uncorrelated” are not necessarily mutually independent.

• When attributes are correlated, PCA is not suitable.
• PCA can separate at most k signals.


Comparisons of Our Method I with PCA

In scenarios where the attributes are mutually independent and sparse (ε = 20%):

[Figure: recovery rates of Method I vs. PCA (plot not recoverable from the extraction).]

Comparisons of Our Method II with PCA

In scenarios where the attributes are neither mutually independent nor sparse (ε = 20%):

[Figure: recovery rates of Method II vs. PCA (plot not recoverable from the extraction).]

Conclusions

Summary

We propose two reconstruction methods to recover the data perturbed by random projections, with high recovery rates:


• one method for attributes that are mutually independent and sparse;
• one method for attributes that are neither mutually independent nor sparse.

The high recovery rates of our methods essentially demonstrate the potential privacy leakage of the original data.


Risk Analysis of Random Projection is Necessary


Random projection-based perturbations are convenient and efficient methods for privacy-preserving data mining; however, a risk analysis should be made before they are used. The following risk factors should be considered:

• is the mixing matrix R prone to leakage (e.g. through collusion)?

• can the p.d.f. of the original data be effectively modeled?

\[
x = \arg\max_{Rx=u} p(x)
\]


Open Problems


• how to reconstruct data without enough a priori knowledge?
◦ e.g. when the mixing matrix is not known.

• is there any better perturbation method than random projections?
◦ one that does not distort the Euclidean distances among the data points (additive value distortion cannot preserve them),
◦ and that is robust against various reconstructions.


The End

Thank You!
