arXiv:1603.01562v1 [cs.NA] 4 Mar 2016

A Randomized Misfit Approach for Data Reduction in Large-Scale Inverse Problems

Ellen B. Le¹, Aaron Myers¹, Tan Bui-Thanh¹,²

¹ Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX 78712, USA
² Department of Aerospace Engineering and Engineering Mechanics, The University of Texas at Austin, Austin, TX 78712, USA

E-mail: {ellenle,aaron,tanbui}@ices.utexas.edu

Abstract. We present a randomized misfit approach as a data reduction method for large-scale inverse problems, based on a novel randomization of the objective function. Derived using stochastic programming, the method may be understood as an application of a Johnson-Lindenstrauss transform or random projection to reduce the dimension of the data-misfit vector. Using large deviation bounds, it is shown that the solution of this reduced inverse problem, with a significantly reduced data dimension, is a discrepancy principle-satisfying solution to the original problem. The purpose of this paper is to develop a systematic framework for the application of random projections in the context of ill-posed inverse problems with large data sets and to give theoretical insight into their efficacy. The randomized misfit approach permits the use of a large class of distributions from the extensive literature on random projections. In particular, we analyze sparse random projections, which have additional data-reduction properties. We also provide a different proof for a variant of the Johnson-Lindenstrauss lemma; this proof provides intuition into the O(ε⁻²) factor. The main contribution of this paper is a theoretical result showing that the method's efficacy is attributable to the combination of both Johnson-Lindenstrauss theory and Morozov's discrepancy principle. This result provides justification for the suitability of the proposed approach for solving inverse problems with big data. Numerical verification of the theoretical findings is presented for model problems of estimating a distributed parameter in an elliptic partial differential equation. Results with different random projections are presented to demonstrate the viability and accuracy of the proposed approach.

Keywords: random projections, stochastic programming, large-scale inverse problems, model reduction, large deviation theory, Johnson-Lindenstrauss lemma, data dimension reduction.
AMS classification scheme numbers: 35Q62, 62F15, 35R30, 35Q93, 65C60


1. Introduction

1.1. The randomized misfit approach for inverse problems

Consider the following additive noise-corrupted pointwise observational model
\[
d_j = w(x_j; u) + \eta_j, \qquad j = 1, \dots, N, \tag{1}
\]
where the objective is to reconstruct the distributed parameter u given N data points d_j, with N large. For a given u, the observations w(x; u) are obtained by solving a complex forward model governed by PDEs. The location of an observational data point in an open and bounded spatial domain Ω is denoted by x_j, and η_j is assumed to be Gaussian random noise with mean 0 and variance σ². Concatenating the observations, we rewrite (1) as
\[
d = F(u) + \eta, \tag{2}
\]

where F(u) := [w(x₁; u), …, w(x_N; u)]ᵀ is the parameter-to-observable map. Although the forward problem is usually well-posed, the inverse problem is ill-posed. An intuitive reason is that the dimension of the observations d is much smaller than that of the parameter u (which is infinite before discretization), and hence the observations provide limited information about u. As a result, the null space of the Jacobian of the parameter-to-observable map F is nontrivial. For a broad class of inverse scattering problems in particular, the Gauss-Newton approximation of the Hessian is a compact operator and therefore its range space is effectively finite-dimensional [1, 2, 3]. The standard deterministic Tikhonov approach mitigates the ill-conditioning by adding a quadratic term to the cost function, so that our inverse problem may now be formulated as

\[
\min_u J(u) := \frac{1}{2} \big\| \hat{d} - \hat{F}(u) \big\|^2 + \frac{1}{2} \big\| R^{\frac{1}{2}} u \big\|^2,
\]
where d̂ − F̂(u) := (1/σ)[d − F(u)] is the data-misfit vector, the Euclidean norm in R^N is denoted by ‖·‖, and ‖R^{1/2}·‖ is a norm weighted by a regularization operator R. This method is representative of the classical non-statistical approach, which does not typically account for the randomness due to measurements and other sources (though one can equip the deterministic solution with a confidence region by postprocessing; see, e.g., [4] and references therein). If the regularization term is replaced by the Cameron-Martin norm of u (the second term in (4) below), then the Tikhonov solution is in fact identical to the maximum a posteriori (MAP) point of (4) [5, 6, 7, 8]. This point estimate, however, offers limited insight regarding the randomness inherent in the problem.
The Bayesian paradigm [7, 9, 10, 11, 12] allows for a systematic accounting of the ill-posedness and error uncertainty in inverse problems. Here, we seek a statistical description of all possible parameter fields that conform to some prior knowledge and are consistent with the observations. The Bayesian approach reformulates the
inverse problem in the framework of statistical inference, incorporating uncertainties in the observations, the forward map F, and prior information into a probability distribution. The probability density is the solution to the Bayesian inverse problem. It requires specification of a likelihood model, which characterizes the probability that the parameter u could have produced the observed data d, and a prior model, which is problem-dependent and encodes all information regarding u before observations are made. The additive-noise model (2) is used to construct the likelihood pdf, which is expressed as
\[
\pi_{\mathrm{like}}(d \mid u) \propto \exp\left( -\frac{1}{2} \big\| \hat{d} - \hat{F}(u) \big\|^2 \right).
\]
The prior is chosen to be the Gaussian measure μ := N(u₀, C), where u₀ is a mean function and C is an appropriate covariance operator [7]. Here C is defined to be the inverse of an elliptic second-order differential operator with high-enough order to guarantee well-posedness [7, 13, 14]. We then discretize the prior, the forward equation, and the parameter u (yielding a finite-dimensional vector u ∈ R^m) through the finite element method (see [13, 14] for a comprehensive treatment), so that the finite-dimensional posterior probability of u is given by Bayes' formula as
\[
\pi(u \mid d) \propto \exp\left( -\frac{1}{2} \big\| \hat{d} - \hat{F}(u) \big\|^2 - \frac{1}{2} \| u - u_0 \|_{C}^2 \right), \tag{3}
\]

where ‖·‖_C := ‖C^{-1/2}·‖_{L²(Ω)} denotes the weighted L²(Ω) norm induced by the L²(Ω) inner product ⟨·, ·⟩_{L²(Ω)}. We then define the MAP point (or estimator) of (3) by
\[
u^{\star} := \arg\min_u J(u, d), \qquad J(u, d) := \frac{1}{2} \big\| \hat{d} - \hat{F}(u) \big\|^2 + \frac{1}{2} \| u - u_0 \|_{C}^2. \tag{4}
\]

We note that the last term in (4) may be viewed as a Tikhonov regularization term, and subsequently the MAP point may be considered as a solution to the deterministic inverse problem with a regularization "inspired" by the prior. It is important to stress that the Bayesian solution provides more than a point solution: the Bayesian posterior encodes all possible solutions with an associated confidence (probability). For the purpose of this paper, we restrict ourselves to the MAP computation, in order to focus on methodology development in addressing the challenge of big data, i.e., large N.
The main idea of the paper is as follows. Let r ∈ R^N be a random vector with mean zero and identity covariance, i.e. E_r[r rᵀ] = I (equivalently, let r be the vector of N i.i.d. random variables ζ with mean zero and variance 1). Then the misfit term of (4) can be rewritten as
\[
\big\| \hat{d} - \hat{F}(u) \big\|^2
= \big( \hat{d} - \hat{F}(u) \big)^{\top} \mathbb{E}_r\big[ r r^{\top} \big] \big( \hat{d} - \hat{F}(u) \big)
= \mathbb{E}_r\Big[ \big( r^{\top} ( \hat{d} - \hat{F}(u) ) \big)^2 \Big],
\]
which allows us to write the objective functional in (4) as
\[
J(u) = \frac{1}{2} \, \mathbb{E}_r\Big[ \big( r^{\top} ( \hat{d} - \hat{F}(u) ) \big)^2 \Big] + \frac{1}{2} \| u - u_0 \|_{C}^2.
\]


We then approximate the expectation E_r[·] using a Monte Carlo approximation (also known as the sample average approximation (SAA) [15, 16]) with n i.i.d. samples {r_j}_{j=1}^n. This leads us to define the following randomized inverse problem
\[
\min_u J_n(u; r) = \frac{1}{2n} \sum_{j=1}^{n} \Big[ r_j^{\top} \big( \hat{d} - \hat{F}(u) \big) \Big]^2 + \frac{1}{2} \| u - u_0 \|_{C}^2
= \frac{1}{2} \big\| \tilde{d} - \tilde{F}(u) \big\|^2 + \frac{1}{2} \| u - u_0 \|_{C}^2, \tag{5}
\]
where
\[
\tilde{d} := \frac{1}{\sqrt{n}} [r_1, \dots, r_n]^{\top} \hat{d}, \qquad \tilde{F}(u) := \frac{1}{\sqrt{n}} [r_1, \dots, r_n]^{\top} \hat{F}(u) \in \mathbb{R}^n.
\]
We call d̃ − F̃(u) the reduced data-misfit vector.
For a reduced misfit vector dimension n ≪ N, we call this randomization the randomized misfit approach (RMA). The reduced data problem (5) may be solved using any scalable, robust iterative optimization algorithm. For our numerical experiments in section 3 we choose an implementation of the state-of-the-art bound-constrained trust-region Newton-CG method [17]. If we define the MAP point of (5) as
\[
u_n^{\star} := \arg\min_u J_n(u), \tag{6}
\]
the optimal RMA cost as J_n⋆ := J_n(u_n⋆), and the optimal true cost as J⋆ := J(u⋆), then we need to address the challenge of estimating the errors |J_n⋆ − J⋆| and ‖u_n⋆ − u⋆‖ as the sample size n increases. This is the subject of section 2.
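To make the data-reduction step concrete, the following minimal sketch (ours, not from the paper; the sizes, seeds, and the explicitly formed dense projection matrix are illustrative assumptions) builds the reduced misfit vector (1/√n)[r₁, …, rₙ]ᵀ(d̂ − F̂(u)) for a few choices of the distribution of r and checks that its squared norm approximates the full misfit norm, as the expectation identity above suggests. In a PDE-constrained setting the projection would be applied to the misfit returned by the forward solve rather than to a synthetic vector.

```python
import numpy as np

def reduce_misfit(misfit, n, dist="rademacher", seed=None):
    """Form (1/sqrt(n)) [r_1, ..., r_n]^T misfit for i.i.d. zero-mean, unit-variance r_j."""
    rng = np.random.default_rng(seed)
    N = misfit.size
    if dist == "gaussian":
        R = rng.standard_normal((N, n))
    elif dist == "rademacher":
        R = rng.choice([-1.0, 1.0], size=(N, n))
    else:
        # l-percent sparse distribution (10): +/- sqrt(s) w.p. 1/(2s), 0 w.p. 1 - 1/s
        ell = float(dist)
        s = 1.0 / (1.0 - ell)
        R = np.sqrt(s) * rng.choice(
            [-1.0, 0.0, 1.0], size=(N, n), p=[0.5 / s, 1.0 - 1.0 / s, 0.5 / s]
        )
    return R.T @ misfit / np.sqrt(n)

# The reduced misfit norm approximates the full one even for n << N.
rng = np.random.default_rng(0)
v = rng.standard_normal(10_000)                  # stand-in for d_hat - F_hat(u), N = 10^4
for n in (10, 50, 200):
    vt = reduce_misfit(v, n, dist=0.95, seed=1)  # 95%-sparse projection
    print(n, np.linalg.norm(vt) ** 2 / np.linalg.norm(v) ** 2)
```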

1.2. A brief literature review and our contributions

Since [18], the majority of randomized methods aimed at reducing the computational complexity of large-scale inverse problems focus on using the randomized SVD (RSVD) algorithm of [19] to generate truncated SVD approximations of the parameter-to-observable operator [20, 21, 22, 23], the regularization operator [24, 25], or the prior-preconditioned Hessian of the objective function [13, 26, 27, 28, 29]. The RSVD algorithm uses a Gaussian random projection matrix to produce a low-rank operator (the "sample matrix"). This operator is subsequently factored to generate an approximate SVD decomposition of the original operator A. Theoretical results in [19] guarantee that the spectral-norm accuracy of this approximation is of order σ_{k+1}(A) with a very high user-defined probability, where k is equal to the reduced parameter dimension n plus a small number of oversampling vectors. The known theoretical results on the accuracy of an inverse solution computed with such a randomized approximation (e.g., Proposition 2 in [30], Theorem 1 in [22]) are derived using this result.
A method referred to as source encoding is shown in [31] to be a fast, effective method for parameter estimation in the context of seismic inversion. Source encoding involves taking random linear combinations of sources to produce a randomized objective function, in an effort to reduce the computational complexity of an inverse problem with multiple sources.
The work in [32] provides an explanation for the efficacy of this method by showing that it is equivalent to the SAA [15, 16] through a stochastic programming reformulation of the inverse optimization problem. Recently, the work in [33] shows that source encoding in its stochastic reformulation (and as a stochastic trace estimator method [34]) is equivalent to an application of the random projection in [35].
This paper extends the above work in several directions. The randomized misfit approach, in contrast with the aforementioned methods, which typically reduce the parameter dimension through low-rank approximations, focuses on data dimension reduction. Our work extends to a much larger class of random projections than the distribution given in [35]. This class includes Gaussian random projections and very sparse random projections. We also offer an insight into why the reduced data dimension n is O(ε⁻²), where ε is the relative error of the randomized cost function. We build a theoretical foundation for why a small number of samples works well in the context of inverse problems. Moreover, we show that the solution of the randomized problem is indeed a reasonable solution to the original problem. Although this result has been suggested by numerical results in the papers cited above, to our knowledge a theoretical guarantee of solution viability has not been derived previously.
The two main theoretical contributions of this paper are: 1) we use large deviation theory and the stochastic reformulation of a general inverse optimization problem [32, 33] to prove a version of the Johnson-Lindenstrauss embedding theorem [36, 37, 38], and 2) we discover that the efficacy of the randomized misfit approach is due to Morozov's discrepancy principle and large deviation theory acting in concert with each other. For the specific application of random projections to inverse problems, we prove that it is the combination of both effects that allows the method to work well, even with very small sample size n. Related methods [32, 33] note that the lower bounds on the reduced data dimension given by the current literature are pessimistic in practice. In particular, it is well known that taking only a handful of randomly weighted combinations is surprisingly effective. In this paper, we present a discrepancy principle-based justification: it is the error inherent to inverse problems that allows for a further reduction in the reduced data dimension beyond that offered by random projection theory alone. We stress that the randomized misfit approach achieves overall data reduction, even for problems with just one source. The approach does not take random combinations of sources as in [31, 32, 33, 39], but rather of the data-misfit vector itself.
The structure of the paper is as follows. Section 2 presents the theoretical analysis for the randomized misfit approach by deriving large deviation bounds on the error for a broad class of distributions. The reduced data dimension is shown to be independent of the original dimension. This derivation leads to a different proof of a variant of the celebrated Johnson-Lindenstrauss embedding theorem. Using Morozov's discrepancy principle, Theorem 2 shows that the effective reduced data dimension is also bounded below by the noise in the problem. Therefore, the RMA solution is a guaranteed solution for the original problem with a high user-defined probability. Section 3 summarizes numerical experiments on a model inverse heat conduction problem in one, two, and three dimensions.
We compare the RMA solutions obtained with four different distributions to the solution of the full problem. We also provide numerical support for Theorem 2.

2. An analysis of the randomized misfit approach (RMA)

For a given u in parameter space, it is clear that J_n(u; r) in (5) is an unbiased estimator of J(u). It is also clear from the law of large numbers that J_n(u) converges almost surely to J(u). However, the efficacy of the randomized misfit approach lies in the exponential decay of its errors which, as shown below, is provided by large deviation theory.
We first show that errors larger than δ/2, for a given δ > 0, decay with a rate at least as fast as the tail of a centered Gaussian. That is, for some distribution in (5) we have
\[
P\left[ \, |J_n(u; r) - J(u)| > \frac{\delta}{2} \, \right] \le e^{-n I(\delta)}, \tag{7}
\]
where
\[
I(\delta) \ge c \, \frac{\delta^2}{2\theta^2}
\]
for some c > 0 and some θ. This rate is sufficient to guarantee that the solution attained from the randomized misfit approach is a discrepancy principle-satisfying solution for the original inverse problem, as will be shown in Theorem 2. Inequality (7) is equivalent to the statement that P[ |J_n(u; r) - J(u)| > δ/2 ] satisfies a large deviation principle with large deviation rate function I(δ) [40].
The following proposition may be viewed as a special case of Cramér's theorem, which states that a sample mean of i.i.d. random variables X asymptotically obeys a large deviation principle with rate I(δ) = sup_k { kδ - ln E[e^{kX}] } [40]. However, we require the exact non-asymptotic bounds derived here to show convergence of the RMA for n = O(1). Recall that a real-valued random variable X is θ-subgaussian if there exists some θ > 0 such that for all t ∈ R, E[e^{tX}] ≤ e^{θ²t²/2}.
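As a quick worked illustration of this definition (ours, not from the original text): a Rademacher variable X with P(X = ±1) = 1/2 is 1-subgaussian, since
\[
\mathbb{E}\big[ e^{tX} \big] = \tfrac{1}{2}\big( e^{t} + e^{-t} \big) = \cosh t
= \sum_{k=0}^{\infty} \frac{t^{2k}}{(2k)!}
\le \sum_{k=0}^{\infty} \frac{(t^2/2)^k}{k!} = e^{t^2/2},
\]
using the inequality (2k)! ≥ 2^k k!.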

Proposition 1 The RMA error |J_n(u; r) - J(u)| has a tail probability that decays exponentially in n with a nontrivial large deviation rate. Furthermore, if the RMA is constructed with r such that 2|J_n(u; r) - J(u)| is the sample mean of i.i.d. θ-subgaussian random variables, then its large deviation rate is bounded below by c δ²/(2θ²) for some c > 0.

Proof. Given r, define the random variable
\[
T(r; u) := \Big[ r^{\top} \big( \hat{d} - \hat{F}(u) \big) \Big]^2 - \big\| \hat{d} - \hat{F}(u) \big\|^2. \tag{8}
\]


By a standard Chernoff bound (see, e.g., [41]), we have that the RMA tail error decays exponentially as
\[
P\left[ \frac{1}{n} \sum_{j=1}^{n} T(r_j; u) > \delta \right] \le e^{-n I(\delta)}, \tag{9}
\]
where I(δ) = max_t { tδ - ln E[e^{tT(r;u)}] } is the large deviation rate.
The second part of the proposition follows with c = 1 by bounding E[e^{tT(r;u)}] in (9) and computing the maximum of tδ - θ²t²/2.
A great number of distributions are known to be subgaussian, notably the Gaussian and Rademacher (also referred to as Bernoulli) distributions; in fact, any bounded random variable is subgaussian. One class of subgaussian distributions that offers theoretically verifiable data dimension reduction is the following.

Definition 1 (ℓ-percent sparse random variables [37, 42]) Let s = 1/(1-ℓ), where ℓ ∈ [0, 1) is the level of sparsity desired. Then
\[
\zeta = \sqrt{s} \times
\begin{cases}
+1 & \text{with probability } \frac{1}{2s}, \\
0 & \text{with probability } \ell = 1 - \frac{1}{s}, \\
-1 & \text{with probability } \frac{1}{2s},
\end{cases} \tag{10}
\]
is an ℓ-percent sparse distribution.

Note that ℓ = 0 corresponds to a Rademacher distribution, and ℓ = 2/3 corresponds to the Achlioptas distribution [35]. By inspection we have E[ζ] = 0 and E[ζ²] = 1, and thus draws from ζ can be used in the randomized misfit approach. Distribution (10) is well suited for the randomized misfit approach: it is easy to implement, and the computation of the randomized misfit vector amounts to only summations and subtractions, adding a further speedup to the method. Increasing from s = 1 to s > 1 results in an s-fold speedup, since only a 1/s fraction of the data enters each random combination. The RMA takes n random combinations of the N-dimensional misfit vector to form an n-dimensional misfit vector. Since each random combination has a different sparsity pattern, we effectively do not exclude any data, yet each individual combination touches only a 1/s fraction of the data.
MCMC-based sampling methods require the number of samples to be large to ensure accuracy. However, RMA is not an MCMC method for attaining the inverse solution. It is a Monte Carlo approximation of the cost function for large-scale inverse problems with a large data dimension, to be subsequently solved with any robust optimization algorithm. Additionally, the noise inherent in the problem, together with the large deviation principles, makes the probability of failure very small for moderate n. This will be made rigorous in Theorem 2. The methods mentioned in section 1.2 rely on similar ideas to obtain accurate results with relatively few random vectors, most notably the RSVD algorithm [18].
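A brief sketch of this sparsity-driven speedup (ours; the sizes and the use of scipy.sparse are illustrative assumptions, not part of the paper): drawing the projection entries from the 95%-sparse distribution (s = 20) and storing them as a sparse matrix means each of the n random combinations only touches roughly N/s entries of the misfit vector.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
N, n, s = 100_000, 50, 20          # data dim, reduced dim, sparsity parameter (95%-sparse)

# Each entry is +/- sqrt(s) with prob 1/(2s) and 0 with prob 1 - 1/s, i.e. distribution (10);
# only the ~N/s nonzeros per row/column are ever stored or used.
values = lambda k: np.sqrt(s) * rng.choice([-1.0, 1.0], size=k)
Sigma_T = sp.random(n, N, density=1.0 / s, format="csr",
                    random_state=0, data_rvs=values) / np.sqrt(n)

v = rng.standard_normal(N)          # stand-in for the misfit vector d_hat - F_hat(u)
v_reduced = Sigma_T @ v             # n random combinations, sums and differences only

print(Sigma_T.nnz / (n * N))                                   # ~ 1/s = 0.05 of entries nonzero
print(np.linalg.norm(v_reduced)**2 / np.linalg.norm(v)**2)     # ~ 1
```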


We note that for the distribution (10) with 1 ≤ s < ∞, the random variable ζ distributed by (10) satisfies† E[e^{tζ}] ≤ e^{b²t²/2} with b = √(s − 2 ln s), for all t ∈ (0, 1]. So we may use it in the following theorem.

† Using the inequality (2k)! ≥ 2^k k! and a Taylor expansion around 0, we have that for t ∈ (0, 1]
\[
\mathbb{E}\big[ e^{t\zeta} \big]
= \frac{1}{s} \sum_{k=0}^{\infty} \frac{(s t^2)^k}{(2k)!}
\le \frac{1}{s} \sum_{k=0}^{\infty} \frac{\left( s t^2 / 2 \right)^k}{k!}
= \frac{1}{s}\, e^{\frac{s}{2} t^2}
= e^{-\ln s + \frac{s}{2} t^2}
\le e^{-t^2 \ln s + \frac{s}{2} t^2}.
\]

Theorem 1 Define v := d̂ − F̂(u) ∈ R^N. If r in (8) has components that are b-subgaussian for some b ≥ 1/√2, then the RMA error has a large deviation rate bounded below by c δ²/(2θ²) for θ = ‖v‖²/√2 and some 0 < c < 1/(8b⁴).

Proof. Let r ∈ R^N be such that r has i.i.d. b-subgaussian components r_i, with b ≥ 1/√2, E[r_i] = 0, and E[r_i²] = 1. Define w = v/‖v‖ and X = rᵀw. Then
\[
\mathbb{E}\big[ e^{tT} \big] = e^{-t\|v\|^2}\, \mathbb{E}\big[ e^{t\|v\|^2 X^2} \big] \qquad \forall t \in \mathbb{R}. \tag{11}
\]

From [37, Lemma 2.2], E[X²] = 1 and X is also b-subgaussian. Then, by [38, Remark 5.1], for 0 ≤ t ≤ 1/(4b²),
\[
\mathbb{E}\big[ e^{tX^2} \big] \le \sqrt{2}. \tag{12}
\]
For 0 < t ≤ 1/(4b²‖v‖²), we have
\begin{align*}
\mathbb{E}\big[ e^{t\|v\|^2 X^2} \big]
&\le 1 + t\|v\|^2 + t^2\|v\|^4 \frac{\mathbb{E}[X^4]}{2}
  + \sum_{k=3}^{\infty} \frac{\big(4b^2 t\|v\|^2\big)^k}{k!} \left(\frac{1}{4b^2}\right)^k \mathbb{E}\big[X^{2k}\big] \\
&\le 1 + t\|v\|^2 + t^2\|v\|^4 \frac{\mathbb{E}[X^4]}{2}
  + \big(4b^2 t\|v\|^2\big)^3 \sum_{k=3}^{\infty} \frac{\left(\frac{1}{4b^2}\right)^k \mathbb{E}\big[X^{2k}\big]}{k!} \\
&\le 1 + t\|v\|^2 + t^2\|v\|^4 \frac{\mathbb{E}[X^4]}{2}
  + \big(4b^2 t\|v\|^2\big)^3 \, \mathbb{E}\Big[ e^{\frac{1}{4b^2}X^2} \Big] \\
&\le 1 + t\|v\|^2 + t^2\|v\|^4 \frac{\mathbb{E}[X^4]}{2} + 64\sqrt{2}\, b^6 t^3 \|v\|^6 \\
&\le 1 + t\|v\|^2 + 8b^4 t^2\|v\|^4 + 64\sqrt{2}\, b^6 t^3 \|v\|^6 \\
&\le e^{\, t\|v\|^2 + 8b^4 t^2\|v\|^4 + 64\sqrt{2}\, b^6 t^3\|v\|^6},
\end{align*}

using (12) in the fourth inequality and [43, p. 93] in the fifth inequality. Let
\[
t^{\star} = \frac{\delta}{8 b^4 \|v\|^4 q},
\]
where q > 1. Assuming ‖v‖² ≫ δ, we have that
\[
\mathbb{E}\big[ e^{t^{\star} T} \big]
\le e^{\, 8b^4 t^{\star 2} \|v\|^4 + 64\sqrt{2}\, b^6 t^{\star 3} \|v\|^6}
= e^{\, \frac{\delta^2}{8 b^4 \|v\|^4 q^2} + \frac{\sqrt{2}\, \delta^3}{8 b^6 \|v\|^6 q^3}}.
\]
Then
\[
I(\delta) \ge \delta t^{\star} - \ln \mathbb{E}\big[ e^{t^{\star} T} \big]
\ge \frac{\delta^2}{8 b^4 \|v\|^4 q} \left( 1 - \frac{1}{q} \right) - \frac{\sqrt{2}\, \delta^3}{8 b^6 \|v\|^6 q^3}
\ge c \, \frac{\delta^2}{\|v\|^4},
\]
where 0 < c < 1/(8b⁴). Taking 2θ² = ‖v‖⁴ concludes the proof.
A sharper result can be obtained for an RMA constructed with b-subgaussian random variables where b ≤ 1. Note that, by the above theorem, this includes the distribution (10) with s = 1 (Rademacher) and s = 3 (Achlioptas). Following [38, (5)], let g be a standard Gaussian random variable, independent of all other random variables. Then we have that, for 0 < t < 1/(2‖v‖²),
\[
\mathbb{E}\big[ e^{t\|v\|^2 X^2} \big]
\le \mathbb{E}_g\Big[ \prod_{i=1}^{N} e^{\, b^2 t \|v\|^2 w_i^2 g^2} \Big]
\le \mathbb{E}_g\big[ e^{\, t\|v\|^2 g^2} \big]
= \frac{1}{\sqrt{1 - 2t\|v\|^2}}.
\]
So from (11) we have that
\[
\mathbb{E}\big[ e^{tT(u, r)} \big] \le \frac{e^{-t\|v\|^2}}{\sqrt{1 - 2t\|v\|^2}}
= e^{-t\|v\|^2 - \frac{1}{2}\ln\left(1 - 2t\|v\|^2\right)}.
\]
Then
\[
t\delta - \ln \mathbb{E}\big[ e^{tT(u, r)} \big] \ge t\delta + t\|v\|^2 + \frac{1}{2}\ln\big(1 - 2t\|v\|^2\big) =: f(t).
\]
Computing the derivative, we have that f(t) attains a maximum at
\[
t_{\max} = \frac{\delta}{2\big(\|v\|^4 + \delta\|v\|^2\big)}.
\]
Thus, we have
\begin{align*}
\max_t f(t)
&= \frac{\delta^2}{2\big(\|v\|^4 + \delta\|v\|^2\big)} + \frac{\delta}{2\big(\|v\|^2 + \delta\big)}
  + \frac{1}{2}\ln\left( 1 - \frac{\delta}{\|v\|^2 + \delta} \right) \\
&= \frac{\delta^2}{2\big(\|v\|^4 + \delta\|v\|^2\big)} - \frac{1}{4}\frac{\delta^2}{\big(\|v\|^2 + \delta\big)^2}
  - \frac{1}{6}\frac{\delta^3}{\big(\|v\|^2 + \delta\big)^3} - \cdots \\
&= \frac{\delta^2}{4\big(\|v\|^4 + \delta\|v\|^2\big)}
  + \frac{1}{4}\left\{ \frac{\delta^2}{\|v\|^4 + \delta\|v\|^2} - \frac{\delta^2}{\big(\|v\|^2 + \delta\big)^2} \right\}
  - \frac{1}{6}\frac{\delta^3}{\big(\|v\|^2 + \delta\big)^3} - \cdots
  \ \ge\ c\,\frac{\delta^2}{\|v\|^4},
\end{align*}

where we employed the Taylor expansion in the second equality, and in the last inequality c is some constant less than 1/4. Note that the last inequality holds for δ ≪ ‖v‖², and taking 2θ² = ‖v‖⁴ concludes the proof.
The next theorem is our main result. It guarantees with high probability that the RMA solution will be a solution of the original problem under Morozov's discrepancy principle, for relatively small n. We first need the following lemma.

Lemma 1 Let v := d̂ − F̂(u). Suppose that r is distributed such that the large deviation rate of the RMA error is bounded below by c δ²/(2θ²) for some c > 0 and θ = ‖v‖²/√2. Given a cost distortion tolerance ε > 0 and a failure rate β > 0, let
\[
n \ge \frac{\beta}{c\,\varepsilon^2}. \tag{13}
\]


Then with probability at least 1 - e^{-β},
\[
(1 - \varepsilon)\, \|v\|^2 \le \frac{1}{n} \sum_{j=1}^{n} \big( r_j^{\top} v \big)^2 \le (1 + \varepsilon)\, \|v\|^2, \tag{14}
\]
and hence,
\[
(1 - \varepsilon)\, J(u) \le J_n(u; r) \le (1 + \varepsilon)\, J(u). \tag{15}
\]

Proof. The proof follows from setting δ = ε‖v‖² in (7).
This lemma demonstrates the remarkable fact that with n i.i.d. draws one can reduce the data dimension from N to n while bearing a relative error of ε = O(1/√n) in the cost function, where the reduced dimension n is independent of the dimension N of the data. This idea is the basis for data-reduction techniques via variants of the Johnson-Lindenstrauss lemma in the current research area of random projections (see, e.g., [44, 45, 46] for examples of recent applications). Unlike other applications of the Monte Carlo method, e.g. Markov chain Monte Carlo, in which n must be large to converge, n can be moderate or small for inverse problems, depending on the noise η in (2). In the following theorem we show this is possible via Morozov's discrepancy principle [47].
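A small empirical check of this concentration (our illustration; the vector, sample sizes, and trial counts are arbitrary choices) estimates how often the reduced misfit norm in (14) misses the (1 ± ε) band as n grows; the observed failure fraction drops rapidly, consistent with the exponential bound.

```python
import numpy as np

rng = np.random.default_rng(0)
N, eps, trials = 2_000, 0.5, 500
v = rng.standard_normal(N)                  # stand-in misfit vector
v2 = v @ v

for n in (5, 10, 25, 50, 100):
    fails = 0
    for _ in range(trials):
        R = rng.integers(0, 2, size=(n, N)) * 2.0 - 1.0   # Rademacher rows r_j^T
        est = np.mean((R @ v) ** 2)                       # (1/n) sum_j (r_j^T v)^2
        fails += abs(est - v2) > eps * v2
    print(n, fails / trials)   # empirical P[(14) fails], decays roughly exponentially in n
```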

To avoid over-fitting the noise, from (1) one seeks a MAP point u⋆ such that |d_j − w(x_j; u⋆)| ≈ σ, i.e., ‖d̂ − F̂(u⋆)‖² ≈ N. We say that an inverse solution u⋆ satisfies Morozov's discrepancy principle with parameter τ if
\[
\big\| \hat{d} - \hat{F}(u^{\star}) \big\|^2 = \tau N
\]
for some τ ≈ 1.

Theorem 2 Suppose that the conditions of Lemma 1 are met. If u_n⋆ is a discrepancy-satisfying solution for the RMA cost, i.e.,
\[
J_n(u_n^{\star}, r) := \frac{1}{n} \sum_{j=1}^{n} \Big[ r_j^{\top} \big( \hat{d} - \hat{F}(u_n^{\star}) \big) \Big]^2 = \tau' N
\]
for some τ' ≈ 1, then with probability at least 1 - e^{-β}, u_n⋆ is also a solution for the original problem that satisfies Morozov's discrepancy principle with parameter τ, i.e.,
\[
J(u_n^{\star}) := \big\| \hat{d} - \hat{F}(u_n^{\star}) \big\|^2 = \tau N,
\]
for τ ∈ [τ'/(1+ε), τ'/(1−ε)].

Proof. The claim is a direct consequence of (14).
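As a concrete numerical illustration of this interval (using values that appear later in Table 1, so no new data is introduced): with ε = 0.5 and a reduced-problem discrepancy parameter τ' = 1.19,
\[
\tau \in \left[ \frac{\tau'}{1+\varepsilon}, \; \frac{\tau'}{1-\varepsilon} \right]
= \left[ \frac{1.19}{1.5}, \; \frac{1.19}{0.5} \right] \approx [0.793,\, 2.380],
\]
so the full-data discrepancy J(u_n⋆) = τN is guaranteed, with the stated probability, to lie within roughly a factor of two of the noise level N.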

We are now in a position to show a different proof of the Johnson-Lindenstrauss embedding theorem using a stochastic programming derivation of the RMA. Following [48], we define a map S from R^N to R^n, where n ≪ N, to be a Johnson-Lindenstrauss transform (or JLT) if
\[
(1 - \varepsilon)\,\|v\|^2 \le \|S v\|^2 \le (1 + \varepsilon)\,\|v\|^2
\]
holds with some probability p = p(n, ε), where ε > 0.


Theorem 3 (Johnson-Lindenstrauss embedding theorem [36, 37, 38]) Suppose that r is distributed such that the large deviation rate of the RMA error is bounded below by c δ²/(2θ²) for some c > 0 and some θ. Let 0 < ε < 1, v_i ∈ R^N, i = 1, …, m, and n = O(ε⁻² ln m). Then there exists a map F : R^N → R^n such that
\[
(1 - \varepsilon)\,\|v_i - v_j\|^2 \le \|F(v_i) - F(v_j)\|^2 \le (1 + \varepsilon)\,\|v_i - v_j\|^2 \qquad \forall i, j. \tag{16}
\]

Proof. The conditions of Lemma 1 hold; thus, for a given v ∈ R^N, note that (14) is equivalent to
\[
(1 - \varepsilon)\,\|v\|^2 \le \|\Sigma v\|^2 \le (1 + \varepsilon)\,\|v\|^2, \tag{17}
\]
where
\[
\Sigma := \frac{1}{\sqrt{n}} [r_1, \dots, r_n]^{\top}.
\]
Define F(v) := Σv. Inequality (16) is then a direct consequence of (17) for a pair (v_i, v_j) with probability at least 1 − e^{−(c/2) n ε²}. Using a union bound over all pairs, claim (16) holds for every pair with probability at least 1 − m^{−α} if n ≥ (2+α) ln m/(cε²).
As discussed above, J_n(u; r) is an unbiased estimator of J(u). It is therefore reasonable to expect that J_n⋆ := min_u J_n(u; r) converges to J⋆ := min_u J(u). The following result [16, Propositions 5.2 and 5.6] states that under mild conditions J_n⋆ in fact converges to J⋆. It is not unbiased, but is instead downward biased.

Proposition 2 Assume that J_n(u; r) converges to J(u) with probability 1 uniformly in u. Then J_n⋆ converges to J⋆ with probability 1. Furthermore, it holds that
\[
\mathbb{E}\big[ J_n^{\star} \big] \le \mathbb{E}\big[ J_{n+1}^{\star} \big] \le J^{\star},
\]
that is, J_n⋆ is a downward-biased estimator of J⋆.

Stochastic programming theory gives a stronger characterization of this convergence. One can show that u_n⋆ converges weakly to u⋆ with the rate n^{-1/2}. If J(u) is convex with finite value, then u_n⋆ = u⋆ with probability exponentially converging to 1. See Chapter 5 in [16] for details.
For a linear forward map F(u) = Fu, that is, when J(u) is quadratic, we can derive a bound on the solution error using the spectral norm of F̂.

Theorem 4 Suppose the conditions of Lemma 1 hold. Let m := rank(F̂). Then
i) (1 − ε) J⋆ ≤ J_n⋆ ≤ (1 + ε) J⋆, and
ii) if F is linear, then with probability at least 1 − m^{−α},
\[
\| u_n^{\star} - u^{\star} \| \le \frac{\varepsilon}{\sigma_{\min}^2(G)} \left( \big\| \hat{F} \big\|^2 \| u^{\star} \| + \big\| \hat{F} \big\| \, \big\| \hat{d} \big\| \right), \tag{18}
\]
where G := \big( \hat{F}^{\top} \Sigma \Sigma^{\top} \hat{F} + C^{-1} \big)^{1/2} and n = O(ε^{-2}(2 + α) \ln m).


Proof. The first assertion follows from (15) and the definition (6) of u_n⋆; indeed,
\[
J_n^{\star} = J_n(u_n^{\star}) \le J_n(u^{\star}) \le (1 + \varepsilon)\, J(u^{\star}) = (1 + \varepsilon)\, J^{\star},
\]
and the other direction is similar. For the second assertion, note that u⋆ and u_n⋆ are solutions of the following first-order optimality conditions
\begin{align}
\big( \hat{F}^{\top} \hat{F} + C^{-1} \big)\, u^{\star} &= \hat{F}^{\top} \hat{d} + C^{-1} u_0, \tag{19a} \\
\big( \hat{F}^{\top} \Sigma \Sigma^{\top} \hat{F} + C^{-1} \big)\, u_n^{\star} &= \hat{F}^{\top} \Sigma \Sigma^{\top} \hat{d} + C^{-1} u_0. \tag{19b}
\end{align}
Define ∆ := u⋆ − u_n⋆. An algebraic manipulation of (19) gives
\[
\big( \hat{F}^{\top} \Sigma \Sigma^{\top} \hat{F} + C^{-1} \big) \Delta
= \big( \hat{F}^{\top} \Sigma \Sigma^{\top} \hat{F} - \hat{F}^{\top} \hat{F} \big) u^{\star}
+ \hat{F}^{\top} \hat{d} - \hat{F}^{\top} \Sigma \Sigma^{\top} \hat{d}.
\]
Taking the inner product of both sides with ∆, we have
\[
\Big\langle \Delta, \big( \hat{F}^{\top} \Sigma \Sigma^{\top} \hat{F} + C^{-1} \big) \Delta \Big\rangle
= \Big\langle \hat{F}\Delta, \; \Sigma\Sigma^{\top}\hat{F}u^{\star} - \hat{F}u^{\star} \Big\rangle
+ \Big\langle \hat{F}\Delta, \; \hat{d} - \Sigma\Sigma^{\top}\hat{d} \Big\rangle. \tag{20}
\]
We can bound the left-hand side of (20):
\[
\Big\langle \Delta, \big( \hat{F}^{\top} \Sigma \Sigma^{\top} \hat{F} + C^{-1} \big) \Delta \Big\rangle
\ge \sigma_{\min}^2(G)\, \|\Delta\|^2. \tag{21}
\]
To bound the terms on the right-hand side of (20), we need the following straightforward variant of (17), valid for all v ∈ R^N and n = O(ε⁻²):
\[
\big\| \Sigma\Sigma^{\top} v - v \big\| \le \varepsilon \|v\|. \tag{22}
\]
Using the Cauchy-Schwarz inequality we have
\begin{align}
\Big| \Big\langle \hat{F}\Delta, \; \Sigma\Sigma^{\top}\hat{F}u^{\star} - \hat{F}u^{\star} \Big\rangle \Big|
&\le \varepsilon\, \big\| \hat{F} \big\|^2 \|\Delta\| \, \|u^{\star}\|, \tag{23a} \\
\Big| \Big\langle \hat{F}\Delta, \; \hat{d} - \Sigma\Sigma^{\top}\hat{d} \Big\rangle \Big|
&\le \varepsilon\, \big\| \hat{F} \big\| \, \|\Delta\| \, \big\| \hat{d} \big\|, \tag{23b}
\end{align}
where we have used (22) and the definition of the matrix norm. Combining (23) and (21) ends the proof.
Note that for the inequalities in (23) to be valid, it is sufficient to choose n, α, ε such that (22) is valid for m basis vectors spanning the column space of F̂, and hence n = O(ε⁻²(2 + α) ln m) by the union bound.

Remark 1 The bound in (18) is not the only possible estimate. One can first rewrite J(u) and J_n(u; r) as

\[
J(u) = \frac{1}{2} \left\| \begin{bmatrix} \hat{F} \\ C^{-1/2} \end{bmatrix} u - \begin{bmatrix} \hat{d} \\ C^{-1/2} u_0 \end{bmatrix} \right\|^2,
\]
\[
J_n(u; r) = \frac{1}{2} \left\| \begin{bmatrix} \Sigma^{\top} & 0 \\ 0 & I \end{bmatrix}
\left( \begin{bmatrix} \hat{F} \\ C^{-1/2} \end{bmatrix} u - \begin{bmatrix} \hat{d} \\ C^{-1/2} u_0 \end{bmatrix} \right) \right\|^2.
\]

If Σ is a Johnson-Lindenstrauss transform, then
\[
S := \begin{bmatrix} \Sigma^{\top} & 0 \\ 0 & I \end{bmatrix}
\]
is also a JLT with the same parameters:
\[
\left\| S \begin{bmatrix} v \\ w \end{bmatrix} \right\|^2
= \|\Sigma v\|^2 + \|w\|^2
\le (1 + \varepsilon)\|v\|^2 + \|w\|^2
\le (1 + \varepsilon) \left\| \begin{bmatrix} v \\ w \end{bmatrix} \right\|^2.
\]
Applying [48, Theorem 12], we conclude that with probability at least 1/3,
\[
\| u_n^{\star} - u^{\star} \| \le \frac{\varepsilon}{\sqrt{\lambda_{\min}}} \, J^{\star},
\]
where λ_min is the minimum nonzero singular value of [F̂ᵀ, C^{-1/2}]ᵀ.
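For the linear case, the two optimality systems (19a) and (19b) can be compared directly on a small synthetic problem. The sketch below is ours and purely illustrative: the operators, sizes, noise level, and the identity prior precision are arbitrary stand-ins, and the dense projection would not be formed explicitly at scale.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, n = 2000, 50, 100                         # data dim, parameter dim, reduced dim

F = rng.standard_normal((N, m)) / np.sqrt(N)    # stand-in for the (whitened) linear map F_hat
C_inv = np.eye(m)                               # stand-in for the prior precision C^{-1}
u0 = np.zeros(m)
u_true = rng.standard_normal(m)
d = F @ u_true + 0.01 * rng.standard_normal(N)  # stand-in for the whitened data d_hat

# Full MAP point from (19a): (F^T F + C^{-1}) u* = F^T d + C^{-1} u0
u_star = np.linalg.solve(F.T @ F + C_inv, F.T @ d + C_inv @ u0)

# RMA MAP point from (19b), with the projection applied to data and forward operator
Sigma = rng.choice([-1.0, 1.0], size=(N, n)) / np.sqrt(n)   # Rademacher columns r_j / sqrt(n)
Fs, ds = Sigma.T @ F, Sigma.T @ d
u_n = np.linalg.solve(Fs.T @ Fs + C_inv, Fs.T @ ds + C_inv @ u0)

print(np.linalg.norm(u_n - u_star) / np.linalg.norm(u_star))   # small relative error for n << N
```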

3. Numerical Experiments

In this section we evaluate the performance of the randomized misfit approach using four different distributions for r in (5). We also verify that the convergence is indeed O(1/√n), as guaranteed by Theorem 3. Lastly, we verify Theorem 2. The distributions that we test with the randomized misfit approach are:
• Gaussian
• Rademacher
• Achlioptas
• 95%-sparse (s-sparse (10) with s = 20)
There are a number of other distributions suitable for RMA in the current literature on Johnson-Lindenstrauss transforms that we do not consider, particularly the subsampled randomized Hadamard transform of [49, 50] and its subsequent fast and sparse variants. These will be tested in future work.
For our model problem we consider the estimation of a coefficient in an elliptic partial differential equation. This Poisson-type problem arises in various inverse applications, such as the heat conductivity or groundwater problem, or in finding a membrane with a given spatially-varying stiffness. For concreteness we consider the heat conduction problem on an open bounded domain Ω, governed by
\begin{align}
-\nabla \cdot \big( e^{u} \nabla w \big) &= 0 && \text{in } \Omega, \nonumber \\
-e^{u} \nabla w \cdot n &= \mathrm{Bi}\, w && \text{on } \partial\Omega \setminus \Gamma_R, \tag{24} \\
-e^{u} \nabla w \cdot n &= -1 && \text{on } \Gamma_R, \nonumber
\end{align}
where u is the logarithm of the distributed thermal conductivity, w is the forward state (the map from log conductivity to temperature), n is the unit outward normal on ∂Ω, and Bi is the Biot number. Here, Γ_R is a portion of the boundary ∂Ω on which the inflow heat flux is 1, and the rest of the boundary is assumed to have a Robin boundary condition.


We are interested in reconstructing the distributed log conductivity u, given some temperature measurements w observed on Ω. The standard H¹(Ω) finite element method is used to discretize the misfit and the regularization operator. The synthetic truths that we seek to recover are a 1D sinusoidal curve, a 2D Gaussian on a thermal fin, and a cube with non-zero log conductivity values on a sphere in the center and semispheres in the opposing corners. Figure 1 shows representations of the truth u on a mesh for these cases. The synthetic noisy temperature observations are then generated at all mesh points through the forward model (24). The misfit vector generated from Figure 1(a) has data dimension N = 1025 (with 1% added noise), from Figure 1(b) has data dimension N = 1333 (with 0.1% added noise), and from Figure 1(c) has data dimension N = 2474 (with 0.2% added noise), respectively.

[Figure 1: the distributed truth log conductivity parameters used in the experiments; panels (a) truth u for the 1D experiment, (b) truth u for the 2D experiment, (c) truth u for the 3D experiment. The parameter fields are used to obtain noise-corrupted temperature data through the forward model (24).]

For the optimization algorithm we use a subspace trust-region inexact Newton conjugate gradient method developed in [51, 17, 52, 53, 54]. The stopping criterion used is that either the Newton step size, the cost function value, or the norm of the gradient falls below 10⁻⁶.
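As an indicative sketch only (not the authors' solver): the reduced problem (5) can be handed to any trust-region Newton-CG implementation once the cost, gradient, and Hessian-vector product are available. Below, SciPy's generic trust-ncg method is used on a small linear stand-in for the projected operators; all names and sizes are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, m = 75, 40                                    # reduced data dim, parameter dim
F_tilde = rng.standard_normal((n, m))            # stand-in for the projected forward map
d_tilde = F_tilde @ rng.standard_normal(m) + 0.01 * rng.standard_normal(n)
u0 = np.zeros(m)

def Jn(u):        # 0.5*||d_tilde - F_tilde u||^2 + 0.5*||u - u0||^2
    r = d_tilde - F_tilde @ u
    return 0.5 * r @ r + 0.5 * (u - u0) @ (u - u0)

def grad(u):
    return -F_tilde.T @ (d_tilde - F_tilde @ u) + (u - u0)

def hessp(u, p):  # Hessian-vector product for the Newton-CG inner solves
    return F_tilde.T @ (F_tilde @ p) + p

res = minimize(Jn, u0, jac=grad, hessp=hessp, method="trust-ncg",
               options={"gtol": 1e-6})
print(res.fun, np.linalg.norm(res.jac))
```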


3.1. Convergence results

We first verify the convergence of the RMA cost J_n(u_0) to the original cost J(u_0) for a fixed distributed parameter u_0, using the model heat problem (24). We choose a random u_0 from the prior distribution and construct the RMA cost J_n(u_0) with the various random projections listed above. Since u_0 lives in the high-dimensional space R^m, where m is the number of finite element nodal values, for the purpose of visualization Figure 2 shows plots of the RMA cost Ĵ_n(κ) := J_n(u_0 + κs) in the direction s := ∇J(u_0) for the 3D example. For each of the random projections tested we observe convergence of Ĵ_n(κ) to Ĵ(κ) as n increases. More importantly, for all distributions, the minimizer of Ĵ(κ) is well approximated by the minimizer of Ĵ_n(κ), even for small n, thus verifying Theorem 2. The plots for the 1D and 2D examples are similar and thus omitted (see http://users.ices.utexas.edu/~ellenle/RMAplots.pdf).
Theorem 4 states that u_n⋆, the minimizer of J_n, and the minimum objective function value J_n⋆ converge at the same rate, given by the distortion tolerance ε, but with different constants. Figure 2 illustrates how an RMA solution û_n⋆ may converge quickly to û⋆, although convergence of the minimum value Ĵ_n(û_n⋆) to Ĵ(û⋆) can be slow due to the different constant. To test this hypothesis at the actual minimizer u_n⋆, we plot the error of the RMA MAP point u_n⋆ and its corresponding optimal value J_n⋆ in Figure 3 for the 3D example and different random projections‡. Both the absolute errors |J_n⋆ − J⋆| and ‖u_n⋆ − u⋆‖ and the normalized errors |J_n⋆ − J⋆|/|J⋆| and ‖u_n⋆ − u⋆‖/‖u⋆‖ are shown, and a reference curve 1/√n is plotted to show that the convergence rate is indeed 1/√n for both u_n⋆ and J_n⋆. However, as can be seen, the absolute error of u_n⋆ is orders of magnitude smaller than that of J_n⋆ for all considered random projections.
The results of minimizing the RMA cost with different n in the 1D, 2D, and 3D examples are shown alongside the true MAP estimate u⋆ in Figures 4, 5, and 6. The figures shown are results with r distributed by the Achlioptas distribution (66% sparse). We see that the original MAP point u⋆ is well approximated by the RMA solution u_n⋆ in all cases with 50 ≤ n ≤ 100.
In a different experiment, we consider a 3D example in which only surface observations are available. The parameters are the same as in the problem of Figure 1(c), but the data are now obtained from 901 observations on the surface of the cube (except the bottom surface), and the truth log conductivity is non-zero within the sphere of radius 0.5 centered at the origin, as seen in Figure 7. Figure 8(d) shows the original MAP estimate u⋆. Compared to the above example the recovery is poorer, but this is expected due to having less observational data. Our interest, however, is in reducing the data dimension while recovering a reasonable MAP estimate. Subsequently, we compare the RMA MAP point u_n⋆ to the true MAP point u⋆ (a minimizer of J).
‡ Again, similar results have been obtained for the 1D and 2D examples and are omitted here.

[Figure 2: convergence of the RMA cost to the original cost for various random projections in the 3D example at a random parameter u_0 drawn from the prior distribution; panels (a) Gaussian, (b) Rademacher, (c) Achlioptas, (d) 95% sparse, (e) all distributions with n = 50. Legend: G = Gaussian, R = Rademacher, A = Achlioptas, S = 95%-sparse.]

[Figure 3: error convergence of the minimizer u_n⋆ and the corresponding optimal cost J_n⋆ for the 3D example; panels (a) Gaussian, (b) Rademacher, (c) Achlioptas, (d) 95% sparse. Both absolute and relative errors are shown against a 1/√n reference curve. Data shown are an average of five runs. Theorem 4 states that u_n⋆ and J_n⋆ converge at the same rate, given by the distortion tolerance ε = O(1/√n), but with different constants.]

The results in Figure 8 show the RMA solutions u_n⋆ as n increases. As can be seen, with n = 150, i.e., a 6-fold reduction in the data dimension, the RMA approximation u_150⋆ is still a good approximation to the original MAP solution u⋆.

3.2. Verification of Theorem 2

Table 1 presents results for solving the model problem in the 1D, 2D, and 3D examples with Morozov's criterion, again using the Achlioptas random projection in the randomized misfit approach. We perform several numerical experiments and choose an n for each example such that Morozov's principle is met for J_n(u_n⋆) with τ' ≈ 1. We then compute the corresponding ranges for τ that are guaranteed with probability at least p ≥ 1 − e^{−β}, after choosing an acceptable cost distortion tolerance of ε = 0.5 and β as large as possible from (13).

[Figure 4: 1D elliptic PDE example, convergence of u_n⋆ to u⋆ as n increases; panels (a) n = 5, (b) n = 10, (c) n = 50, (d) n = 100, (e) n = 300, (f) n = 600, each plotting u_n⋆ against u⋆. The Achlioptas random projection (66% sparse) is used for Σ and the original data dimension is N = 1025.]

As can be seen, evaluating J(u_n⋆) gives a τ within the specified range, which satisfies Morozov's criterion. That is, even for moderately small values of n, if the discrepancy principle is satisfied for J_n(u_n⋆), then the discrepancy principle is also satisfied for J(u_n⋆). Thus u_n⋆ is a discrepancy principle-satisfying solution for both the randomized reduced-data-dimension problem (5) and the original problem (4).

Table 1. Verification of Morozov's discrepancy principle for the RMA solution with ε = 0.5.

      N      n     Jn(u*_n)   τ'      [τ'/(1+ε), τ'/(1-ε)]   p       J(u*_n)   τ
 1D   1025   100   1220       1.190   [0.793, 2.380]         95.6%   1074      1.048
 2D   1333   50    1240       0.930   [0.620, 1.860]         79.0%   1406      1.055
 3D   2474   75    2646       1.070   [0.713, 2.139]         90.4%   3928      1.588
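The interval column of Table 1 can be reproduced from the reported discrepancy values alone; the following few lines (ours, using only the numbers in the table) recompute τ', the guaranteed interval, and τ, confirming that τ falls inside the interval for each example.

```python
eps = 0.5
# (label, N, n, Jn(u*_n), J(u*_n)) taken from Table 1
rows = [("1D", 1025, 100, 1220, 1074),
        ("2D", 1333,  50, 1240, 1406),
        ("3D", 2474,  75, 2646, 3928)]

for label, N, n, Jn, J in rows:
    tau_p = Jn / N                       # reduced-problem discrepancy parameter tau'
    lo, hi = tau_p / (1 + eps), tau_p / (1 - eps)
    tau = J / N                          # full-problem discrepancy parameter tau
    print(f"{label}: tau'={tau_p:.3f}, interval=[{lo:.3f}, {hi:.3f}], "
          f"tau={tau:.3f}, ok={lo <= tau <= hi}")
```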

4. Conclusions and future work

We have presented a randomized misfit approach as a general framework for reducing the data dimension in large-scale inverse problems. The method builds on recent efforts to reduce the computational complexity of large-scale inverse problems via randomization, but applies the random reduction directly to the data-misfit vector.

Randomized Misfit Approach for Data Reduction 0.35

0.35

0.3

0.3

0.25

0.25

0.2

0.2

0.15

0.15

0.1

0.1

0.05

0.05

0

0

-0.05

-0.05

(a) n = 30

(c) n = 100

19

(b) n = 50 0.35

0.35

0.3

0.3

0.25

0.25

0.2

0.2

0.15

0.15

0.1

0.1

0.05

0.05

0

0

-0.05

-0.05

(d) u?

Figure 5. 2D elliptic PDE example: convergence of u?n to u? as n increases. The Achlioptas random projection (66% sparse) is used for Σ and the original data dimension is N = 1333.

Any subgaussian distribution guarantees that the solution obtained from the randomized misfit approach will satisfy Morozov's discrepancy principle with a low failure rate (one that decays exponentially with respect to the reduced dimension n). By a stochastic approximation argument, we have shown that the method is equivalent to applying a random projection to the data-misfit vector. This leads to a different, stochastic programming-based proof (up to a constant) of a Johnson-Lindenstrauss lemma variant proved previously (see, e.g., [55, 56] for proofs based on combinatorics and communication theory, respectively).
Our connection provides two main theoretical insights. The first is an explanation, via Morozov's discrepancy principle, of the numerical accuracy observed in previous work that uses a surprisingly small reduced data dimension n. These methods may be understood as random projection methods, for which this high-accuracy dimension reduction property is well established and drives an active area of research in computer science and machine learning.

Randomized Misfit Approach for Data Reduction

(a) n = 50

(c) n = 150

20

(b) n = 100

(d) u?

Figure 6. 3D elliptic PDE example: convergence of u?n to u? as n increases. The Achlioptas random projection (66% sparse) is used for Σ and the original data dimension is N = 2474.

[Figure 7: truth u for the 3D experiment with surface observations. The same number of mesh elements as in Figure 1(c) is used, except that the synthetic parameter is a single sphere, and observational data are obtained from N = 901 mesh points on the top and side surfaces of the cube.]

The second is an insight into the ubiquitous O(1/√n) factor in Johnson-Lindenstrauss transforms (a rate shown to be tight by [55]), obtained through a Monte Carlo framework.
The focus of this work is on the presentation and analysis of the method. We presented results for medium-size (N = O(10³)) synthetic examples in 1D, 2D, and 3D with four different distributions, for numerical justification of the theoretical results and illustration of the method.

Randomized Misfit Approach for Data Reduction

(a) n = 10

(c) n = 150

21

(b) n = 50

(d) u? (solution of the full cost)

Figure 8. 3D elliptic PDE example with surface observations: convergence of u?n to u? as n increases. The MAP solution is nearly approximated with an RMA reduced data dimension of n = 150 (a 6-fold reduction from the N = 901 observational data points on the surface). The Achlioptas random projection (66% sparse) is used for Σ.

Future work will apply the method to larger problems with big data, e.g., time-dependent data governed by complex forward models, and extend it to a Bayesian framework. Our results are valid for nonlinear inverse problems, with the exception of part (ii) of Theorem 4 (which only applies to linear forward models). We expect that a similar result also holds for nonlinear inverse problems, and this is under investigation. Combining dimension reduction and uncertainty quantification is the broader focus of our current research towards developing scalable methods for large-scale inverse problems in high-dimensional parameter spaces with big data.

Acknowledgments

We would like to thank Prof. Mark Girolami for pointing out the similarity between randomized projections and the randomized misfit approach, which led to the connection with Johnson-Lindenstrauss theory. This in turn allowed us to carry out the analysis of the randomized misfit approach presented here. We also thank Vishwas Rao for careful proofreading. This research was partially supported by DOE grants DE-SC0010518 and DE-SC0011118. We are grateful for the support.


References

[1] Bui-Thanh T and Ghattas O 2012 Inverse Problems 28 055001
[2] Bui-Thanh T and Ghattas O 2012 Inverse Problems 28 055002
[3] Bui-Thanh T and Ghattas O 2013 Analysis of the Hessian for inverse scattering problems. Part III: Inverse medium scattering of electromagnetic waves Inverse Problems and Imaging
[4] Vexler B 2004 Adaptive finite element methods for parameter identification problems Ph.D. thesis University of Heidelberg
[5] Prato G D 2006 An Introduction to Infinite-dimensional Analysis Universitext (Springer)
[6] Prato G D and Zabczyk J 1992 Stochastic Equations in Infinite Dimensions (Cambridge University Press)
[7] Stuart A M 2010 Acta Numerica 19 451–559
[8] Dashti M, Law K J, Stuart A M and Voss J 2013 Inverse Problems 29 095017
[9] Franklin J N 1970 Journal of Mathematical Analysis and Applications 31 682–716
[10] Lehtinen M S, Paivarinta L and Somersalo E 1989 Inverse Problems 5 599
[11] Lasanen S 2002 Discretizations of generalized random variables with applications to inverse problems Ph.D. thesis University of Oulu
[12] Piiroinen P 2005 Statistical measurements, experiments, and applications Ph.D. thesis Department of Mathematics and Statistics, University of Helsinki
[13] Bui-Thanh T, Ghattas O, Martin J and Stadler G 2013 SIAM Journal on Scientific Computing 35 A2494–A2523
[14] Petra N, Martin J, Stadler G and Ghattas O 2014 SIAM Journal on Scientific Computing
[15] Nemirovski A, Juditsky A, Lan G and Shapiro A 2009 SIAM Journal on Optimization 19 1574–1609
[16] Shapiro A, Dentcheva D and Ruszczynski A 2009 Lectures on Stochastic Programming: Modeling and Theory (Society for Industrial and Applied Mathematics)
[17] Branch M A, Coleman T F and Li Y 1999 SIAM Journal on Scientific Computing 21 1–23 (electronic)
[18] Halko N, Martinsson P G and Tropp J A 2011 SIAM Review 53 217–288
[19] Martinsson P G, Rokhlin V and Tygert M 2011 Applied and Computational Harmonic Analysis 30 47–68
[20] Isaac T, Petra N, Stadler G and Ghattas O 2015 Journal of Computational Physics 296 348–368
[21] Xiang H and Zou J 2015 Inverse Problems 31 085008
[22] Xiang H and Zou J 2013 Inverse Problems 29 085008
[23] Chaillat S and Biros G 2012 Journal of Computational Physics 231 4403–4421
[24] Lee J and Kitanidis P 2014 Water Resources Research 50 5410–5427
[25] Kitanidis P and Lee J 2014 Water Resources Research 50 5428–5443
[26] Saibaba A K and Kitanidis P K 2015 Advances in Water Resources 82 124–138
[27] Alexanderian A, Petra N, Stadler G and Ghattas O 2014 SIAM Journal on Scientific Computing 36 A2122–A2148
[28] Bui-Thanh T and Girolami M A 2014 Inverse Problems Special Issue 114014
[29] Bui-Thanh T, Burstedde C, Ghattas O, Martin J, Stadler G and Wilcox L C 2012 Extreme-scale UQ for Bayesian inverse problems governed by PDEs SC12: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
[30] Saibaba A K, Lee J and Kitanidis P K 2015 Numerical Linear Algebra with Applications
[31] Krebs J R, Anderson J E, Hinkley D, Neelamani R, Lee S, Baumstein A and Lacasse M D 2009 Geophysics 74 WCC177–WCC188
[32] Haber E, Chung M and Herrmann F 2012 SIAM Journal on Optimization 22 739–757
[33] Young J and Ridzal D 2012 SIAM Journal on Scientific Computing 34 A2344–A2365
[34] Hutchinson M F 1990 Communications in Statistics - Simulation and Computation 19 433–450

[35] Achlioptas D 2003 Journal of Computer and System Sciences 66 671–687
[36] Dirksen S 2015 Foundations of Computational Mathematics
[37] Matousek J 2008 Random Structures & Algorithms 33 142–156
[38] Indyk P and Naor A 2007 ACM Transactions on Algorithms (TALG) 3 31
[39] van Leeuwen T, Aravkin A Y and Herrmann F J 2011 International Journal of Geophysics 2011
[40] Touchette H 2009 Physics Reports 478 1–69
[41] Kelly F P 1991 Queueing Systems 9 5–15
[42] Li P, Hastie T J and Church K W 2006 Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '06 287
[43] Stroock D W 2011 Probability Theory: An Analytic View 2nd ed (Cambridge University Press, Cambridge)
[44] Holub V and Fridrich J 2013 IEEE Transactions on Information Forensics and Security 8 1996–2006
[45] Liu L, Fieguth P, Clausi D and Kuang G 2012 Pattern Recognition 45 2405–2418
[46] Fowler J E and Du Q 2012 IEEE Transactions on Image Processing 21 184–195
[47] Morozov V A 1966 Soviet Math. Dokl.
[48] Sarlos T 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06) 143–152
[49] Ailon N and Chazelle B 2009 SIAM Journal on Computing 39 302–322
[50] Tropp J A 2010 Advances in Adaptive Data Analysis 03 8
[51] Coleman T F and Li Y 1996 SIAM Journal on Optimization 6 418–445
[52] Heinkenschloss M, Ulbrich M and Ulbrich S 1999 Mathematical Programming 86 615–635
[53] Bui-Thanh T 2007 Model-Constrained Optimization Methods for Reduction of Parameterized Large-Scale Systems Ph.D. thesis Department of Aeronautics and Astronautics, MIT
[54] Bui-Thanh T, Willcox K and Ghattas O 2008 SIAM Journal on Scientific Computing 30 3270–3288
[55] Alon N 2003 Discrete Mathematics 273 31–53
[56] Jayram T and Woodruff D P 2013 ACM Transactions on Algorithms (TALG) 9 26
