arXiv:1603.02004v1 [math.NA] 7 Mar 2016
Posterior Consistency for Gaussian Process Approximations of Bayesian Posterior Distributions

Andrew M. Stuart and Aretha L. Teckentrup
Mathematics Institute, Zeeman Building, University of Warwick, Coventry, CV4 7AL, England.
[email protected], [email protected]

Abstract. We study the use of Gaussian process emulators to approximate the parameter-to-observation map or the negative log-likelihood in Bayesian inverse problems. We prove error bounds on the Hellinger distance between the true posterior distribution and various approximations based on the Gaussian process emulator. Our analysis includes approximations based on the mean of the predictive process, as well as approximations based on the full Gaussian process emulator. Our results show that the Hellinger distance between the true posterior and its approximations can be bounded by moments of the error in the emulator. Numerical results confirm our theoretical findings.

Keywords: inverse problem, Bayesian approach, surrogate model, Gaussian process regression, posterior consistency

AMS 2010 subject classifications: 60G15, 62G08, 65D05, 65D30, 65J22
1 Introduction
Given a mathematical model of a physical process, we are interested in the inverse problem of determining the inputs to the model given some noisy observations related to the model outputs. Adopting a Bayesian approach [18, 37], we incorporate our prior knowledge of the inputs into a probability distribution, referred to as the prior distribution, and obtain a more accurate representation of the model inputs in the posterior distribution, which results from conditioning the prior distribution on the observations. Since the posterior distribution is generally intractable, sampling methods such as Markov chain Monte Carlo (MCMC) [17, 24, 32, 10, 15, 13] are typically used to explore it. A major challenge in the application of MCMC methods to problems of practical interest is the large computational cost associated with numerically solving the mathematical model for a given set of input parameters. Since the generation of each sample by the MCMC method requires a solve of the governing equations, and often millions of samples are required, this process can quickly become very costly. This drawback of fully Bayesian inference for complex models was recognised several decades ago in the statistics literature, and resulted in key papers which had a profound influence on methodology [33, 19, 28]. These papers advocated the use of a Gaussian process surrogate model to approximate the solution of the governing equations, and in particular the likelihood, at a much lower computational cost. This approximation then results in an approximate posterior
distribution, which can be sampled more cheaply using MCMC. However, despite the widespread adoption of the methodology, there has been little analysis of the effect of the approximation on posterior inference. In this work, we study this issue, focussing on the use of Gaussian process emulators [30, 36, 33, 19, 28, 6] as surrogate models. Other choices of surrogate models such as those described in [8, 3], generalised Polynomial Chaos [40, 21], sparse grid collocation [4, 20] and adaptive subspace methods [12, 11] might also be studied similarly, but are not considered here. Indeed we note that the paper [20] studied the effect, on the posterior distribution, of stochastic collocation approximation within the forward model and was one of the first papers to address such questions. That paper used the Kullback-Leibler divergence, or relative entropy, to measure the effect on the posterior, and considered finite dimensional input parameter spaces. The main focus of this work is to analyse the error introduced in the posterior distribution by using a Gaussian process emulator as a surrogate model. The error is measured in the Hellinger distance, which is shown in [37, 14] to be a suitable metric for evaluation of perturbations to the posterior measure in Bayesian inverse problems, including problems with infinite dimensional input parameter spaces. We consider emulating either the parameter-to-observation map or the negative log-likelihood. In both cases, we study three different approximations to the posterior distribution. Firstly, we consider using the mean of the Gaussian process emulator as a surrogate model, resulting in a deterministic approximation to the posterior distribution. Our second approximation is obtained by using the full Gaussian process as a surrogate model, leading to a random approximation in which case we study the expected value of the Hellinger distance between the true and the approximate posterior distribution. Finally, we construct an alternative deterministic approximation by using the full Gaussian process as surrogate model, and taking the expected value (with respect to the distribution of the surrogate) of the likelihood. Our analysis is restricted to finite dimensional input spaces. This reflects the state-of-the-art with respect to Gaussian process emulation itself; the analysis of the effect on the posterior is less sensitive to dimension. For the three approximations discussed above, we show that the Hellinger distance between the true and approximate posterior distribution can be bounded by the error between the true parameter-to-observation map (or log-likelihood) and its Gaussian process approximation, measured in a norm that depends on the approximation considered. Combining this with results from the theory of scattered data interpolation [39, 34, 26], we give convergence estimates of the error in terms of the design points used to construct the Gaussian process emulator. The remainder of this paper is organised as follows. In section 2, we set up the Bayesian inverse problem of interest. We then recall some results on Gaussian process regression in section 3. The heart of the paper is section 4, where we introduce the different approximations to the posterior and perform an error analysis. Our theoretical results are confirmed on a simple model problem in section 5, and some conclusions are finally given in section 6.
2 Bayesian Inverse Problems
Let $X$ and $V$ be separable Banach spaces, and define the measurable mappings $G : X \to V$ and $\mathcal{O} : V \to \mathbb{R}^J$, for some $J \in \mathbb{N}$. Denote by $\mathcal{G} : X \to \mathbb{R}^J$ the composition of $\mathcal{O}$ and $G$. We refer to $G$ as the forward map, to $\mathcal{O}$ as the observation operator and to $\mathcal{G}$ as the parameter-to-observation map. We denote by $\|\cdot\|$ the Euclidean norm on $\mathbb{R}^n$, for $n \in \mathbb{N}$. We consider the setting where the Banach space $X$ is a compact subset of $\mathbb{R}^K$, for some finite $K \in \mathbb{N}$, representing the range of a finite number $K$ of parameters $u$. The inverse problem of interest is to determine the parameters $u \in X$ from the noisy data $y \in \mathbb{R}^J$ given by
$$ y = \mathcal{G}(u) + \eta, $$
where the noise $\eta$ is a realisation of the $\mathbb{R}^J$-valued Gaussian random variable $N(0, \sigma_\eta^2 I)$, for some known variance $\sigma_\eta^2$. We adopt a Bayesian perspective in which, in the absence of data, $u$ is distributed according to a prior measure $\mu_0$. We are interested in the posterior distribution $\mu^y$ on the conditioned random variable $u|y$, which can be characterised as follows.

Proposition 2.1. ([37]) Suppose $\mathcal{G} : X \to \mathbb{R}^J$ is continuous and $\mu_0(X) = 1$. Then the posterior distribution $\mu^y$ on the conditioned random variable $u|y$ is absolutely continuous with respect to $\mu_0$ and given by Bayes' Theorem:
$$ \frac{d\mu^y}{d\mu_0}(u) = \frac{1}{Z} \exp\bigl(-\Phi(u)\bigr), $$
where
$$ \Phi(u) = \frac{1}{2\sigma_\eta^2} \|y - \mathcal{G}(u)\|^2 \qquad \text{and} \qquad Z = \mathbb{E}^{\mu_0}\bigl[\exp(-\Phi(u))\bigr]. \tag{2.1} $$
We make the following assumption on the regularity of the parameter-to-observation map $\mathcal{G}$.

Assumption 2.2. We assume that $\mathcal{G} : X \to \mathbb{R}^J$ satisfies $\mathcal{G} \in H^s(X; \mathbb{R}^J)$, for some $s > K/2$, and that $\sup_{u \in X} \|\mathcal{G}(u)\| =: C_{\mathcal{G}} < \infty$.

Under Assumption 2.2, it follows that the negative log-likelihood $\Phi : X \to \mathbb{R}$ satisfies $\Phi \in H^s(X)$ and $\sup_{u \in X} |\Phi(u)| =: C_\Phi < \infty$. Since $s > K/2$, the Sobolev Embedding Theorem furthermore implies that $\mathcal{G}$ and $\Phi$ are continuous. Examples of model problems satisfying Assumption 2.2 include linear elliptic and parabolic partial differential equations [9, 35] and non-linear ordinary differential equations [38, 16]. A specific example is given in section 5.
3 Gaussian Process Regression
We are interested in using Gaussian process regression to build a surrogate model for the forward map, leading to an approximate Bayesian posterior distribution that is computationally cheaper to evaluate. Generally speaking, Gaussian process regression (or Gaussian process emulation, or kriging) is a way of building an approximation to a function $f$, based on a finite number of evaluations of $f$ at a chosen set of design points. We will here consider emulation of either the parameter-to-observation map $\mathcal{G} : X \to \mathbb{R}^J$ or the negative log-likelihood $\Phi : X \to \mathbb{R}$. Since the efficient emulation of vector-valued functions is still an open question [5], we will focus on the emulation of scalar-valued functions; an emulator of $\mathcal{G}$ in the case $J > 1$ is constructed by emulating each entry independently.

Let now $f : X \to \mathbb{R}$ be an arbitrary function. Gaussian process emulation is in fact a Bayesian procedure, and the starting point is to put a Gaussian process prior on the function $f$. In other words, we model $f$ as
$$ f_0 \sim \mathrm{GP}\bigl(m(u), k(u, u')\bigr), \tag{3.1} $$
with known mean $m : X \to \mathbb{R}$ and two-point covariance function $k : X \times X \to \mathbb{R}$. Here, we use the Gaussian process notation as in, for example, [30]. In the notation of [37], we have $f_0 \sim N(m, C)$, where $m = m(\cdot)$ and $C$ is the integral operator with covariance function $k$ as kernel.
Typical choices of the mean function $m$ include the zero function and polynomials [30]. A family of covariance functions $k$ frequently used in applications are the Matérn covariance functions [22], given by
$$ k_{\nu,\lambda,\sigma_k^2}(u, u') = \sigma_k^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)} \Bigl( \sqrt{2\nu}\, \frac{\|u - u'\|}{\lambda} \Bigr)^{\!\nu} B_\nu\Bigl( \sqrt{2\nu}\, \frac{\|u - u'\|}{\lambda} \Bigr), \tag{3.2} $$
where $\Gamma$ denotes the Gamma function, $B_\nu$ denotes the modified Bessel function of the second kind and $\nu$, $\lambda$ and $\sigma_k^2$ are positive parameters. The parameter $\lambda$ is referred to as the correlation length, and governs the length scale at which $f_0(u)$ and $f_0(u')$ are correlated. The parameter $\sigma_k^2$ is referred to as the variance, and governs the magnitude of $f_0(u)$. Finally, the parameter $\nu$ is referred to as the smoothness parameter, and governs the regularity of $f_0$ as a function of $u$. In the limit $\nu \to \infty$, we obtain the Gaussian covariance
$$ k_{\infty,\lambda,\sigma_k^2}(u, u') = \sigma_k^2 \exp\Bigl( -\frac{\|u - u'\|^2}{2\lambda^2} \Bigr). \tag{3.3} $$
Now suppose we are given data in the form of a set of distinct design points $U := \{u^n\}_{n=1}^N \subseteq X$, together with corresponding function values
$$ f(U) := [f(u^1), \ldots, f(u^N)] \in \mathbb{R}^N. \tag{3.4} $$
Conditioning the Gaussian process (3.1) on the known values $(U, f(U))$, we obtain another Gaussian process $f_N$, known as the predictive process. We have
$$ f_N \sim \mathrm{GP}\bigl(m_N^f(u), k_N(u, u')\bigr), \tag{3.5} $$
where the predictive mean $m_N^f : X \to \mathbb{R}$ and predictive covariance $k_N : X \times X \to \mathbb{R}$ are known explicitly [30], and depend on the modelling choices made in (3.1). In the following discussion, we will focus on the popular choice $m \equiv 0$; the case of a non-zero mean is discussed in Remark 3.7. When $m \equiv 0$, we have
$$ m_N^f(u) = k(u, U)^T K(U, U)^{-1} f(U), \qquad k_N(u, u') = k(u, u') - k(u, U)^T K(U, U)^{-1} k(u', U), \tag{3.6} $$
where $k(u, U) = [k(u, u^1), \ldots, k(u, u^N)] \in \mathbb{R}^N$ and $K(U, U) \in \mathbb{R}^{N \times N}$ is the matrix with $ij$-th entry equal to $k(u^i, u^j)$ [30].

There are several points to note about the predictive mean $m_N^f$ in (3.6). Firstly, $m_N^f$ is a linear combination of the function evaluations $f(U)$, and hence a linear predictor. It is in fact the best linear predictor [36], in the sense that it is the linear predictor with the smallest mean square error. Secondly, $m_N^f$ interpolates the function $f$ at the design points $U$, since the vector $k(u^n, U)$ is the $n$-th row of the matrix $K(U, U)$. In other words, we have $m_N^f(u^n) = f(u^n)$, for all $n = 1, \ldots, N$. Finally, we remark that $m_N^f$ is a linear combination of kernel evaluations,
$$ m_N^f(u) = \sum_{n=1}^N \alpha_n k(u, u^n), $$
where the vector of coefficients is given by $\alpha = K(U, U)^{-1} f(U)$. Concerning the predictive covariance $k_N$, we note that $k_N(u, u) < k(u, u)$ for all $u \in X$, since $K(U, U)^{-1}$ is positive definite. Furthermore, we also note that $k_N(u^n, u^n) = 0$, for $n = 1, \ldots, N$, since $k(u^n, U)^T K(U, U)^{-1} k(u^n, U) = k(u^n, u^n)$.
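To make the construction concrete, the following short Python sketch evaluates the Matérn kernel (3.2) and the predictive mean and variance (3.6) for a set of design points. It is an illustration of our own, not the authors' code; the function names, the choice $\nu = 5/2$, $\lambda = \sigma_k^2 = 1$, the test function and the small jitter added for numerical stability are all our assumptions.

```python
import numpy as np
from scipy.special import kv, gamma

def matern(r, nu=2.5, lam=1.0, sigma2=1.0):
    """Matern covariance k_{nu,lam,sigma2} as a function of the distance r, cf. (3.2)."""
    r = np.asarray(r, dtype=float)
    scaled = np.sqrt(2.0 * nu) * r / lam
    k = np.full_like(r, sigma2)                      # limiting value sigma2 at r = 0
    nz = scaled > 0
    k[nz] = sigma2 * (2.0 ** (1 - nu) / gamma(nu)) * scaled[nz] ** nu * kv(nu, scaled[nz])
    return k

def gp_predict(u_star, U, fU, nu=2.5, lam=1.0, sigma2=1.0, jitter=1e-12):
    """Predictive mean and variance (3.6) of the zero-mean GP emulator at points u_star."""
    K = matern(np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1), nu, lam, sigma2)
    k_star = matern(np.linalg.norm(u_star[:, None, :] - U[None, :, :], axis=-1), nu, lam, sigma2)
    K_reg = K + jitter * np.eye(len(U))              # small jitter for numerical stability
    alpha = np.linalg.solve(K_reg, fU)               # coefficients alpha = K(U,U)^{-1} f(U)
    mean = k_star @ alpha
    var = sigma2 - np.einsum('ij,ij->i', k_star, np.linalg.solve(K_reg, k_star.T).T)
    return mean, np.maximum(var, 0.0)

# Example: emulate f(u) = sin(2*pi*u) on [0, 1] from N = 9 design points
U = np.linspace(0.0, 1.0, 9).reshape(-1, 1)
fU = np.sin(2 * np.pi * U).ravel()
mean, var = gp_predict(np.random.rand(5, 1), U, fU)
```

Note that, as discussed above, the predictive mean interpolates the data: evaluating `gp_predict` at the design points themselves returns the values `fU` (up to the jitter) and a predictive variance of zero.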
For stationary covariance functions $k(u, u') = k(\|u - u'\|)$, the predictive mean is a radial basis function interpolant of $f$, and we can make use of results from the radial basis function literature to investigate the behaviour of $m_N^f$ and $k_N$ as $N \to \infty$. Before doing so in subsection 3.2, we recall some results on native spaces (also known as reproducing kernel Hilbert spaces) in subsection 3.1.
3.1 Native spaces of Matérn kernels
We recall the notion of the reproducing kernel Hilbert space corresponding to the kernel $k$, usually referred to as the native space of $k$ in the radial basis function literature.

Definition 3.1. A Hilbert space $H_k$ of functions $f : X \to \mathbb{R}$, with inner product $\langle \cdot, \cdot \rangle_{H_k}$, is called the reproducing kernel Hilbert space (RKHS) corresponding to a symmetric, positive definite kernel $k : X \times X \to \mathbb{R}$ if
i) for all $u \in X$, $k(u, u')$, as a function of its second argument, belongs to $H_k$,
ii) for all $u \in X$ and $f \in H_k$, $\langle f, k(u, \cdot) \rangle_{H_k} = f(u)$.

By the Moore-Aronszajn Theorem [2], a unique RKHS exists for each symmetric, positive definite kernel $k$. Furthermore, this space can be constructed using Mercer's Theorem [23], and it is equal to the Cameron-Martin space [7] of the covariance operator $C$ with kernel $k$. For covariance kernels of Matérn type, the native space is isomorphic to a Sobolev space [39, 34].

Proposition 3.2. Let $k_{\nu,\lambda,\sigma_k^2}$ be a Matérn covariance kernel as defined in (3.2). Then the native space $H_{k_{\nu,\lambda,\sigma_k^2}}$ is equal to the Sobolev space $H^{\nu+K/2}(X)$ as a vector space, and the native space norm and the Sobolev norm are equivalent.

Native spaces for more general kernels, including non-stationary kernels, are analysed in [39]. For stationary kernels, the native space can generally be characterised by the rate of decay of the Fourier transform of the kernel. The native space of the Gaussian kernel (3.3), for example, consists of functions whose Fourier transform decays exponentially, and is hence strictly contained in the space of analytic functions. Proposition 3.2 shows that, as a vector space, the native space of the Matérn kernel $k_{\nu,\lambda,\sigma_k^2}$ is fully determined by the smoothness parameter $\nu$. The parameters $\lambda$ and $\sigma_k^2$ do, however, influence the constants in the norm equivalence of the native space norm and the standard Sobolev norm.
3.2 Radial basis function interpolation
For stationary covariance functions $k(u, u') = k(\|u - u'\|)$, the predictive mean is a radial basis function interpolant of $f$. In fact, it is the minimum norm interpolant [30],
$$ m_N^f = \operatorname*{arg\,min}_{g \in H_k \,:\, g(U) = f(U)} \|g\|_{H_k}. \tag{3.7} $$
Given the set of design points $U = \{u^n\}_{n=1}^N \subseteq X$, we define the fill distance $h_U$, separation radius $q_U$ and mesh ratio $\rho_U$ by
$$ h_U := \sup_{u \in X} \inf_{u^n \in U} \|u - u^n\|, \qquad q_U := \frac{1}{2} \min_{i \neq j} \|u^i - u^j\|, \qquad \rho_U := \frac{h_U}{q_U} \geq 1. $$
The fill distance is the maximum distance any point in $X$ can be from $U$, and the separation radius is half the smallest distance between any two distinct points in $U$. The mesh ratio provides a measure of how uniformly the design points $U$ are distributed in $X$. We have the following theorem on the convergence of $m_N^f$ to $f$ [39, 25, 26].

Proposition 3.3. Suppose $X \subseteq \mathbb{R}^K$ is a bounded, Lipschitz domain that satisfies an interior cone condition, and the symmetric positive definite kernel $k$ is such that $H_k$ is isomorphic to the Sobolev space $H^\tau(X)$, with $\tau = n + r$, $n \in \mathbb{N}$, $n > K/2$ and $0 \leq r < 1$. Suppose $m_N^f$ is given by (3.6). If $f \in H^\tau(X)$, then there exists a constant $C$, independent of $f$, $U$ and $N$, such that
$$ \|f - m_N^f\|_{H^\beta(X)} \leq C h_U^{\tau - \beta} \|f\|_{H^\tau(X)}, \qquad \text{for any } \beta \leq \tau, $$
for all sets $U$ with $h_U$ sufficiently small.

Proposition 3.3 assumes that the function $f$ is in the RKHS of the kernel $k$. Convergence estimates for a wider class of functions can be obtained using interpolation in Sobolev spaces [26].

Proposition 3.4. Suppose $X \subseteq \mathbb{R}^K$ is a bounded, Lipschitz domain that satisfies an interior cone condition, and the symmetric positive definite kernel $k$ is such that $H_k$ is isomorphic to the Sobolev space $H^\tau(X)$. Suppose $m_N^f$ is given by (3.6). If $f \in H^{\tilde\tau}(X)$, for some $\tilde\tau \leq \tau$ with $\tilde\tau = n + r$, $n \in \mathbb{N}$, $n > K/2$ and $0 \leq r < 1$, then there exists a constant $C$, independent of $f$, $U$ and $N$, such that
$$ \|f - m_N^f\|_{H^\beta(X)} \leq C h_U^{\tilde\tau - \beta} \rho_U^{\tilde\tau - \beta} \|f\|_{H^{\tilde\tau}(X)}, \qquad \text{for any } \beta \leq \tilde\tau, $$
for all sets $U$ with $h_U$ and $\rho_U$ sufficiently small.
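As a small numerical illustration of the quantities appearing above (a sketch of our own, not the authors' code), the following Python snippet approximates $h_U$, $q_U$ and $\rho_U$ by brute force for a uniform tensor grid in $[-1,1]^K$; for such a grid with $N_*$ points per dimension the exact fill distance is $h_U = \sqrt{K}(N_*-1)^{-1}$, the value used in the numerical experiments of section 5. The random candidate set used to approximate the supremum over $X$ is our own shortcut and slightly underestimates $h_U$.

```python
import numpy as np
from itertools import product

def fill_separation(U, domain=(-1.0, 1.0), n_candidates=50000, seed=0):
    """Brute-force approximation of the fill distance h_U, separation radius q_U
    and mesh ratio rho_U for design points U in the cube domain^K."""
    rng = np.random.default_rng(seed)
    K = U.shape[1]
    candidates = rng.uniform(domain[0], domain[1], size=(n_candidates, K))
    dists = np.linalg.norm(candidates[:, None, :] - U[None, :, :], axis=-1)
    h_U = dists.min(axis=1).max()                     # sup_x min_n ||x - u^n|| (approximate)
    pair = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1)
    q_U = 0.5 * pair[pair > 0].min()                  # half the minimal pairwise distance
    return h_U, q_U, h_U / q_U

# Uniform tensor grid with N = N_*^K points in [-1, 1]^K, as used in section 5
K, N_star = 2, 9
grid_1d = np.linspace(-1.0, 1.0, N_star)
U = np.array(list(product(grid_1d, repeat=K)))
h_U, q_U, rho_U = fill_separation(U)
print(h_U, np.sqrt(K) / (N_star - 1))                 # exact fill distance of this grid
```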
Convergence of the predictive variance $k_N(u, u)$ follows under the assumptions of Proposition 3.3 or Proposition 3.4 using the relation in Proposition 3.5 below. This was already noted, without proof, in [34]; we give a proof here for completeness.

Proposition 3.5. Suppose $m_N^f$ and $k_N$ are given by (3.6). Then
$$ k_N(u, u)^{1/2} = \sup_{\|g\|_{H_k} = 1} \bigl| g(u) - m_N^g(u) \bigr|. $$
Proof. For any $u \in X$, we have
$$
\sup_{\|g\|_{H_k} = 1} \bigl| g(u) - m_N^g(u) \bigr|
= \sup_{\|g\|_{H_k} = 1} \Bigl| g(u) - \sum_{j=1}^N \bigl( k(u,U)^T K(U,U)^{-1} \bigr)_j\, g(u^j) \Bigr|
= \sup_{\|g\|_{H_k} = 1} \Bigl| \langle g, k(\cdot,u) \rangle_{H_k} - \sum_{j=1}^N \bigl( k(u,U)^T K(U,U)^{-1} \bigr)_j \langle g, k(\cdot,u^j) \rangle_{H_k} \Bigr|
$$
$$
= \sup_{\|g\|_{H_k} = 1} \Bigl| \Bigl\langle g,\; k(\cdot,u) - \sum_{j=1}^N \bigl( k(u,U)^T K(U,U)^{-1} \bigr)_j k(\cdot,u^j) \Bigr\rangle_{H_k} \Bigr|
= \bigl\| k(\cdot,u) - k(\cdot,U)^T K(U,U)^{-1} k(u,U) \bigr\|_{H_k}.
$$
The final equality follows from the Cauchy-Schwarz inequality, which becomes an equality when the two functions considered are linearly dependent. By Definition 3.1, we then have
$$
\bigl\| k(\cdot,u) - k(\cdot,U)^T K(U,U)^{-1} k(u,U) \bigr\|_{H_k}^2
= \bigl\langle k(\cdot,u) - k(\cdot,U)^T K(U,U)^{-1} k(u,U),\; k(\cdot,u) - k(\cdot,U)^T K(U,U)^{-1} k(u,U) \bigr\rangle_{H_k}
$$
$$
= \langle k(\cdot,u), k(\cdot,u) \rangle_{H_k} - 2 \bigl\langle k(\cdot,u),\, k(\cdot,U)^T K(U,U)^{-1} k(u,U) \bigr\rangle_{H_k} + \bigl\langle k(\cdot,U)^T K(U,U)^{-1} k(u,U),\, k(\cdot,U)^T K(U,U)^{-1} k(u,U) \bigr\rangle_{H_k}
$$
$$
= k(u,u) - 2\, k(u,U)^T K(U,U)^{-1} k(u,U) + k(u,U)^T K(U,U)^{-1} k(u,U)
= k_N(u,u).
$$
The identity which leads to the third term in the penultimate line uses the fact that $\langle k(\cdot,u'), k(\cdot,u) \rangle_{H_k} = k(u,u')$, for any $u, u' \in X$. If $\ell(u) = K(U,U)^{-1} k(u,U)$, then
$$
\bigl\langle k(\cdot,U)^T \ell(u),\, k(\cdot,U)^T \ell(u) \bigr\rangle_{H_k}
= \sum_{j,k} \ell_j(u) \langle k(\cdot,u^j), k(\cdot,u^k) \rangle_{H_k} \ell_k(u)
= \sum_{j,k} \ell_j(u) k(u^j,u^k) \ell_k(u)
= \ell(u)^T K(U,U) \ell(u)
= k(u,U)^T K(U,U)^{-1} k(u,U),
$$
as required. This completes the proof.

The second string of equalities, appearing in the middle part of the proof of Proposition 3.5, might appear counter-intuitive at first glance, in that the left-most quantity is a norm squared of quantities which scale like $k$, whilst the right-most quantity scales like $k$ itself. However, the space $H_k$ itself depends on the kernel $k$, and its norm scales inversely proportionally to $k$, so that the identity is indeed dimensionally correct.

Remark 3.6. (Exponential convergence for the Gaussian kernel) The RKHS corresponding to the Gaussian kernel (3.3) is no longer isomorphic to a Sobolev space; it is contained in $H^\tau(X)$, for any $\tau < \infty$. For functions $f$ in this RKHS, Gaussian process regression with the Gaussian kernel converges exponentially in the fill distance $h_U$. For more details, see [39].

Remark 3.7. (Regression with non-zero mean) If in (3.1) we use a non-zero mean $m(\cdot)$, the formula for the predictive mean $m_N^f$ changes to
$$ m_N^f(u) = m(u) + k(u,U)^T K(U,U)^{-1} \bigl( f(U) - m(U) \bigr), \tag{3.8} $$
where $m(U) := [m(u^1), \ldots, m(u^N)] \in \mathbb{R}^N$. The predictive covariance $k_N(u,u')$ is as in (3.6). As in the case $m \equiv 0$, we have $m_N^f(u^n) = f(u^n)$, for $n = 1, \ldots, N$, and $m_N^f$ is an interpolant of $f$. If $m \in H_k$, then $m_N^f$ given by (3.8) is also in $H_k$, and the proof techniques in [25, 26] can be applied. The conclusions of Propositions 3.3 and 3.4 then hold, with the factor $\|f\|$ in the error bounds replaced by $\|f\| + \|m\|$.
4 Approximation of the Bayesian posterior distribution
In this section, we analyse the error introduced in the posterior distribution $\mu^y$ when we use a Gaussian process emulator to approximate the parameter-to-observation map $\mathcal{G}$ or the negative log-likelihood $\Phi$. The aim is to show convergence, in a suitable sense, of the approximate posterior distributions to the true posterior distribution as the number $N$ of design points tends to infinity. For a given approximation $\mu^{y,N}$ of the posterior distribution $\mu^y$, we will focus on bounding the Hellinger distance [37] between the two distributions, which is defined as
$$ d_{\mathrm{Hell}}(\mu^y, \mu^{y,N}) = \Biggl( \frac{1}{2} \int_X \Biggl( \sqrt{\frac{d\mu^y}{d\mu_0}} - \sqrt{\frac{d\mu^{y,N}}{d\mu_0}} \Biggr)^2 d\mu_0 \Biggr)^{1/2}. $$
As proven in [14, Lemma 6.12 and 6.14], the Hellinger distance provides a bound on the total variation distance,
$$ d_{\mathrm{TV}}(\mu^y, \mu^{y,N}) = \frac{1}{2} \sup_{\|f\|_\infty \leq 1} \bigl| \mathbb{E}^{\mu^y}(f) - \mathbb{E}^{\mu^{y,N}}(f) \bigr| \leq \sqrt{2}\, d_{\mathrm{Hell}}(\mu^y, \mu^{y,N}), $$
and for $f \in L^2_{\mu^y}(X) \cap L^2_{\mu^{y,N}}(X)$, the Hellinger distance also provides a bound on the error in expected values,
$$ \bigl| \mathbb{E}^{\mu^y}(f) - \mathbb{E}^{\mu^{y,N}}(f) \bigr| \leq 2 \bigl( \mathbb{E}^{\mu^y}(f^2) + \mathbb{E}^{\mu^{y,N}}(f^2) \bigr)^{1/2} d_{\mathrm{Hell}}(\mu^y, \mu^{y,N}). $$
Depending on how we make use of the predictive process $\mathcal{G}_N$ or $\Phi_N$ to approximate the Radon-Nikodym derivative $\frac{d\mu^y}{d\mu_0}$, we obtain different approximations to the posterior distribution $\mu^y$. We will distinguish between approximations based solely on the predictive mean, and approximations that make use of the full predictive process.
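The Hellinger distance above is exactly the quantity computed numerically in section 5. As a small, self-contained sketch of our own (not the authors' code), it can be approximated from values of the unnormalised likelihoods at quadrature or Monte Carlo nodes weighted by the prior $\mu_0$:

```python
import numpy as np

def twice_hellinger_sq(like_true, like_approx, weights=None):
    """Approximate 2*d_Hell(mu^y, mu^{y,N})^2 from values of the unnormalised
    likelihoods exp(-Phi(u)) and their approximation at nodes distributed
    according to (or weighted by) the prior mu_0."""
    w = np.full(len(like_true), 1.0 / len(like_true)) if weights is None else weights
    Z = np.sum(w * like_true)        # normalising constant of mu^y
    Z_N = np.sum(w * like_approx)    # normalising constant of the approximation
    integrand = (np.sqrt(like_true / Z) - np.sqrt(like_approx / Z_N)) ** 2
    return np.sum(w * integrand)
```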
4.1 Approximation based on the predictive mean
Using simply the predictive mean of a Gaussian process emulator of the parameter-to-observation map $\mathcal{G}$ or the negative log-likelihood $\Phi$, we can define the approximations $\mu^{y,N,\mathcal{G}}_{\mathrm{mean}}$ and $\mu^{y,N,\Phi}_{\mathrm{mean}}$, given by
$$ \frac{d\mu^{y,N,\mathcal{G}}_{\mathrm{mean}}}{d\mu_0}(u) = \frac{1}{Z^{\mathrm{mean}}_{N,\mathcal{G}}} \exp\Bigl( -\frac{1}{2\sigma_\eta^2} \bigl\| y - m_N^{\mathcal{G}}(u) \bigr\|^2 \Bigr), \qquad Z^{\mathrm{mean}}_{N,\mathcal{G}} = \mathbb{E}^{\mu_0}\Bigl[ \exp\Bigl( -\frac{1}{2\sigma_\eta^2} \bigl\| y - m_N^{\mathcal{G}}(u) \bigr\|^2 \Bigr) \Bigr], $$
$$ \frac{d\mu^{y,N,\Phi}_{\mathrm{mean}}}{d\mu_0}(u) = \frac{1}{Z^{\mathrm{mean}}_{N,\Phi}} \exp\bigl( -m_N^{\Phi}(u) \bigr), \qquad Z^{\mathrm{mean}}_{N,\Phi} = \mathbb{E}^{\mu_0}\bigl[ \exp\bigl( -m_N^{\Phi}(u) \bigr) \bigr], $$
where $m_N^{\mathcal{G}}(u) = [m_N^{\mathcal{G}^1}(u), \ldots, m_N^{\mathcal{G}^J}(u)] \in \mathbb{R}^J$. We have the following lemma concerning the normalising constants $Z^{\mathrm{mean}}_{N,\mathcal{G}}$ and $Z^{\mathrm{mean}}_{N,\Phi}$, which is followed by the main Theorem 4.2 and Corollary 4.3 concerning the approximations $\mu^{y,N,\mathcal{G}}_{\mathrm{mean}}$ and $\mu^{y,N,\Phi}_{\mathrm{mean}}$.
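In practice these mean-based approximations are what an MCMC sampler targets once the emulator has been built: only the unnormalised density is ever evaluated, so the normalising constant $Z^{\mathrm{mean}}_{N,\mathcal{G}}$ is never needed. A minimal sketch (our own illustration; here `m_N` stands for any predictive-mean emulator of the parameter-to-observation map, for instance the one constructed in section 3):

```python
import numpy as np

def unnormalised_density_mean_approx(u, m_N, y, sigma_eta):
    """Unnormalised density of mu^{y,N,G}_mean with respect to the prior:
    the GP predictive mean m_N (a callable returning a vector in R^J)
    replaces the exact parameter-to-observation map."""
    r = y - m_N(u)
    return np.exp(-0.5 * np.dot(r, r) / sigma_eta**2)
```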
Lemma 4.1. Suppose $\sup_{u \in X} \|\mathcal{G}(u) - m_N^{\mathcal{G}}(u)\|$ and $\sup_{u \in X} |\Phi(u) - m_N^{\Phi}(u)|$ converge to 0 as $N$ tends to $\infty$, and assume $\sup_{u \in X} \|\mathcal{G}(u)\| \leq C_{\mathcal{G}}$. Then there exist positive constants $C_1$ and $C_2$, independent of $U$ and $N$, such that
$$ C_1 \leq Z^{\mathrm{mean}}_{N,\mathcal{G}} \leq 1 \qquad \text{and} \qquad C_2 \leq Z^{\mathrm{mean}}_{N,\Phi} \leq C_2^{-1}. $$
Proof. Let us first consider $Z^{\mathrm{mean}}_{N,\mathcal{G}}$. The upper bound follows from a straightforward calculation, since the potential $\frac{1}{2\sigma_\eta^2}\|y - m_N^{\mathcal{G}}(u)\|^2$ is non-negative:
$$ Z^{\mathrm{mean}}_{N,\mathcal{G}} = \mathbb{E}^{\mu_0}\Bigl[ \exp\Bigl( -\frac{1}{2\sigma_\eta^2} \|y - m_N^{\mathcal{G}}(u)\|^2 \Bigr) \Bigr] \leq \mathbb{E}^{\mu_0}(1) = 1. $$
For the lower bound, we have
$$ Z^{\mathrm{mean}}_{N,\mathcal{G}} \geq \mathbb{E}^{\mu_0}\Bigl[ \exp\Bigl( -\frac{1}{2\sigma_\eta^2} \sup_{u \in X} \|y - m_N^{\mathcal{G}}(u)\|^2 \Bigr) \Bigr] = \exp\Bigl( -\frac{1}{2\sigma_\eta^2} \sup_{u \in X} \|y - m_N^{\mathcal{G}}(u)\|^2 \Bigr), $$
since $\int_X \mu_0(du) = 1$. Using the triangle inequality, the assumption $\sup_{u \in X} \|\mathcal{G}(u)\| \leq C_{\mathcal{G}}$ and the fact that every convergent sequence is bounded, we have
$$ \frac{1}{2\sigma_\eta^2} \sup_{u \in X} \|y - m_N^{\mathcal{G}}(u)\|^2 \leq \frac{1}{2\sigma_\eta^2} \Bigl( \sup_{u \in X} \|y - \mathcal{G}(u)\| + \sup_{u \in X} \|\mathcal{G}(u) - m_N^{\mathcal{G}}(u)\| \Bigr)^2 =: -\ln C_1, \tag{4.1} $$
where $C_1$ is independent of $U$ and $N$.

The proof for $Z^{\mathrm{mean}}_{N,\Phi}$ is similar. For the upper bound, we use $\int_X \mu_0(du) = 1$ and the triangle inequality to derive
$$ Z^{\mathrm{mean}}_{N,\Phi} \leq \sup_{u \in X} \exp\bigl( -m_N^{\Phi}(u) \bigr) \leq \exp\Bigl( \sup_{u \in X} |m_N^{\Phi}(u)| \Bigr) \leq \exp\Bigl( \sup_{u \in X} |\Phi(u)| + \sup_{u \in X} |\Phi(u) - m_N^{\Phi}(u)| \Bigr). $$
Since $\sup_{u \in X} |\Phi(u)|$ is bounded when $\sup_{u \in X} \|\mathcal{G}(u)\|$ is bounded, the fact that every convergent sequence is bounded again gives
$$ \sup_{u \in X} |\Phi(u)| + \sup_{u \in X} |\Phi(u) - m_N^{\Phi}(u)| =: -\ln C_2, $$
for a constant $C_2$ independent of $U$ and $N$. For the lower bound, we note that, since $\int_X \mu_0(du) = 1$,
$$ Z^{\mathrm{mean}}_{N,\Phi} \geq \mathbb{E}^{\mu_0}\Bigl[ \exp\Bigl( -\sup_{u \in X} |m_N^{\Phi}(u)| \Bigr) \Bigr] = \exp\Bigl( -\sup_{u \in X} |m_N^{\Phi}(u)| \Bigr) \geq C_2. $$
We may now prove the desired theorem and corollary concerning $\mu^{y,N,\mathcal{G}}_{\mathrm{mean}}$ and $\mu^{y,N,\Phi}_{\mathrm{mean}}$.

Theorem 4.2. Under the assumptions of Lemma 4.1, there exist constants $C_1$ and $C_2$, independent of $U$ and $N$, such that
$$ d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{mean}}) \leq C_1 \bigl\| \mathcal{G} - m_N^{\mathcal{G}} \bigr\|_{L^2_{\mu_0}(X; \mathbb{R}^J)} \qquad \text{and} \qquad d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{mean}}) \leq C_2 \bigl\| \Phi - m_N^{\Phi} \bigr\|_{L^2_{\mu_0}(X)}. $$
Proof. Let us first consider $\mu^{y,N,\mathcal{G}}_{\mathrm{mean}}$. By definition of the Hellinger distance, we have
$$
2\, d_{\mathrm{Hell}}^2(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{mean}}) = \int_X \Biggl( \sqrt{\frac{d\mu^y}{d\mu_0}} - \sqrt{\frac{d\mu^{y,N,\mathcal{G}}_{\mathrm{mean}}}{d\mu_0}} \Biggr)^2 \mu_0(du)
$$
$$
\leq \frac{2}{Z} \int_X \Bigl( \exp\Bigl( -\tfrac{1}{4\sigma_\eta^2} \|y - \mathcal{G}(u)\|^2 \Bigr) - \exp\Bigl( -\tfrac{1}{4\sigma_\eta^2} \|y - m_N^{\mathcal{G}}(u)\|^2 \Bigr) \Bigr)^2 \mu_0(du) + 2 Z^{\mathrm{mean}}_{N,\mathcal{G}} \bigl| Z^{-1/2} - (Z^{\mathrm{mean}}_{N,\mathcal{G}})^{-1/2} \bigr|^2 =: I + II.
$$
For the first term, we use the local Lipschitz continuity of the exponential function, together with the equality $a^2 - b^2 = (a-b)(a+b)$ and the reverse triangle inequality, to bound
$$
\frac{Z}{2} I \leq \frac{1}{16\sigma_\eta^4} \int_X \bigl( \|y - \mathcal{G}(u)\|^2 - \|y - m_N^{\mathcal{G}}(u)\|^2 \bigr)^2 \mu_0(du)
\leq \frac{1}{16\sigma_\eta^4} \int_X \bigl( \|y - \mathcal{G}(u)\| + \|y - m_N^{\mathcal{G}}(u)\| \bigr)^2 \|\mathcal{G}(u) - m_N^{\mathcal{G}}(u)\|^2 \mu_0(du)
$$
$$
\leq \frac{1}{16\sigma_\eta^4} \sup_{u \in X} \bigl( \|y - \mathcal{G}(u)\| + \|y - m_N^{\mathcal{G}}(u)\| \bigr)^2 \; \bigl\| \mathcal{G} - m_N^{\mathcal{G}} \bigr\|^2_{L^2_{\mu_0}(X; \mathbb{R}^J)}.
$$
As in equation (4.1), the first supremum can be bounded independently of $U$ and $N$, from which it follows that
$$ I \leq C \bigl\| \mathcal{G} - m_N^{\mathcal{G}} \bigr\|^2_{L^2_{\mu_0}(X; \mathbb{R}^J)}, $$
for a constant $C$ independent of $U$ and $N$. For the second term, a very similar argument, together with Lemma 4.1 and Jensen's inequality, shows
$$
II = 2 Z^{\mathrm{mean}}_{N,\mathcal{G}} \bigl| Z^{-1/2} - (Z^{\mathrm{mean}}_{N,\mathcal{G}})^{-1/2} \bigr|^2
\leq 2 Z^{\mathrm{mean}}_{N,\mathcal{G}} \max\bigl( Z^{-3}, (Z^{\mathrm{mean}}_{N,\mathcal{G}})^{-3} \bigr) \bigl| Z - Z^{\mathrm{mean}}_{N,\mathcal{G}} \bigr|^2
$$
$$
\leq 2 Z^{\mathrm{mean}}_{N,\mathcal{G}} \max\bigl( Z^{-3}, (Z^{\mathrm{mean}}_{N,\mathcal{G}})^{-3} \bigr) \int_X \Bigl( \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}(u)\|^2 \Bigr) - \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - m_N^{\mathcal{G}}(u)\|^2 \Bigr) \Bigr)^2 \mu_0(du)
\leq C \bigl\| \mathcal{G} - m_N^{\mathcal{G}} \bigr\|^2_{L^2_{\mu_0}(X; \mathbb{R}^J)},
$$
for a constant $C$ independent of $U$ and $N$.

The proof for $\mu^{y,N,\Phi}_{\mathrm{mean}}$ is similar. We use the corresponding splitting of the Hellinger distance, $2\, d_{\mathrm{Hell}}^2(\mu^y, \mu^{y,N,\Phi}_{\mathrm{mean}}) \leq I + II$. Using the local Lipschitz continuity of the exponential function, we have
$$ \frac{Z}{2} I = \int_X \Bigl( \exp\bigl( -\tfrac{1}{2}\Phi(u) \bigr) - \exp\bigl( -\tfrac{1}{2} m_N^{\Phi}(u) \bigr) \Bigr)^2 \mu_0(du) \leq C \bigl\| \Phi - m_N^{\Phi} \bigr\|^2_{L^2_{\mu_0}(X)}. $$
Using Lemma 4.1 and Jensen's inequality, we furthermore have
$$ II \leq 2 Z^{\mathrm{mean}}_{N,\Phi} \max\bigl( Z^{-3}, (Z^{\mathrm{mean}}_{N,\Phi})^{-3} \bigr) \int_X \Bigl( \exp\bigl( -\Phi(u) \bigr) - \exp\bigl( -m_N^{\Phi}(u) \bigr) \Bigr)^2 \mu_0(du) \leq C \bigl\| \Phi - m_N^{\Phi} \bigr\|^2_{L^2_{\mu_0}(X)}, $$
for a constant $C$ independent of $U$ and $N$.

We remark here that Theorem 4.2 does not make any assumptions on the predictive means $m_N^{\mathcal{G}}$ and $m_N^{\Phi}$ other than the requirement that $\sup_{u \in X} \|\mathcal{G}(u) - m_N^{\mathcal{G}}(u)\|$ and $\sup_{u \in X} |\Phi(u) - m_N^{\Phi}(u)|$ converge to 0 as $N$ tends to $\infty$. Whether the predictive means are defined as in (3.6), or are derived by alternative approaches to Gaussian process regression [30], does not affect the conclusions of Theorem 4.2. Under Assumption 2.2, we can combine Theorem 4.2 with Proposition 3.3 (or Proposition 3.4) with $\beta = 0$ to obtain error bounds in terms of the fill distance of the design points.
Corollary 4.3. Suppose $m_N^{\Phi}$ and $m_N^{\mathcal{G}^j}$, $j = 1, \ldots, J$, are defined as in (3.6), with Matérn kernel $k = k_{\nu,\lambda,\sigma_k^2}$. Suppose Assumption 2.2 and the assumptions of Proposition 3.3 and Theorem 4.2 are satisfied. Then there exist constants $C_1$ and $C_2$, independent of $U$ and $N$, such that
$$ d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{mean}}) \leq C_1 h_U^{\nu+K/2} \qquad \text{and} \qquad d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{mean}}) \leq C_2 h_U^{\nu+K/2}. $$
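For later reference, we record a small worked consequence of Corollary 4.3 (this is our own reading of the result; the rates themselves are those recorded in Table 1 of section 5). For a uniform tensor grid of $N = N_*^K$ design points in $[-1,1]^K$ one has $h_U = \sqrt{K}(N_*-1)^{-1} \asymp N^{-1/K}$, so Corollary 4.3 predicts
$$ d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{mean}}) \lesssim h_U^{\nu+K/2} \asymp N^{-(\nu+K/2)/K}, \qquad \text{i.e.} \qquad 2\, d_{\mathrm{Hell}}^2(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{mean}}) \lesssim N^{-(2\nu+K)/K}, $$
which is the rate reported in the right panel of Table 1 in section 5.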
4.2 Approximations based on the predictive process
As an alternative to the mean-based approximations considered in the previous section, we now consider approximations to the posterior distribution $\mu^y$ obtained using the full predictive processes $\mathcal{G}_N$ and $\Phi_N$. For the remainder of this section, we denote by $\nu_N^{\mathcal{G}}$ the distribution of $\mathcal{G}_N$ and by $\nu_N^{\Phi}$ the distribution of $\Phi_N$, for $N \in \mathbb{N} \cup \{0\}$. We note that since the process $\mathcal{G}_N$ consists of $J$ independent Gaussian processes $\mathcal{G}_N^j$, the measure $\nu_N^{\mathcal{G}}$ is a product measure, $\nu_N^{\mathcal{G}} = \prod_{j=1}^J \nu_N^{\mathcal{G}^j}$. $\Phi_N$ is a Gaussian process with mean $m_N^{\Phi}$ and covariance kernel $k_N$, and $\mathcal{G}_N^j$, for $j = 1, \ldots, J$, is a Gaussian process with mean $m_N^{\mathcal{G}^j}$ and covariance kernel $k_N$.

Replacing $\mathcal{G}$ by $\mathcal{G}_N$ in (2.1), we obtain the approximation $\mu^{y,N,\mathcal{G}}_{\mathrm{sample}}$ given by
$$ \frac{d\mu^{y,N,\mathcal{G}}_{\mathrm{sample}}}{d\mu_0}(u) = \frac{1}{Z^{\mathrm{sample}}_{N,\mathcal{G}}} \exp\Bigl( -\frac{1}{2\sigma_\eta^2} \|y - \mathcal{G}_N(u)\|^2 \Bigr), \qquad Z^{\mathrm{sample}}_{N,\mathcal{G}} = \mathbb{E}^{\mu_0}\Bigl[ \exp\Bigl( -\frac{1}{2\sigma_\eta^2} \|y - \mathcal{G}_N(u)\|^2 \Bigr) \Bigr]. $$
Similarly, we define for the predictive process $\Phi_N$ the approximation $\mu^{y,N,\Phi}_{\mathrm{sample}}$ by
$$ \frac{d\mu^{y,N,\Phi}_{\mathrm{sample}}}{d\mu_0}(u) = \frac{1}{Z^{\mathrm{sample}}_{N,\Phi}} \exp\bigl( -\Phi_N(u) \bigr), \qquad Z^{\mathrm{sample}}_{N,\Phi} = \mathbb{E}^{\mu_0}\bigl[ \exp\bigl( -\Phi_N(u) \bigr) \bigr]. $$
The measures $\mu^{y,N,\mathcal{G}}_{\mathrm{sample}}$ and $\mu^{y,N,\Phi}_{\mathrm{sample}}$ are random approximations of the deterministic measure $\mu^y$. Deterministic approximations of the posterior distribution $\mu^y$ can now be obtained by taking the expected value with respect to the predictive processes $\mathcal{G}_N$ and $\Phi_N$. This results in the marginal approximations
$$ \frac{d\mu^{y,N,\mathcal{G}}_{\mathrm{marginal}}}{d\mu_0}(u) = \frac{1}{\mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( Z^{\mathrm{sample}}_{N,\mathcal{G}} \bigr)}\, \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl[ \exp\Bigl( -\frac{1}{2\sigma_\eta^2} \|y - \mathcal{G}_N(u)\|^2 \Bigr) \Bigr], \qquad \frac{d\mu^{y,N,\Phi}_{\mathrm{marginal}}}{d\mu_0}(u) = \frac{1}{\mathbb{E}^{\nu_N^{\Phi}}\bigl( Z^{\mathrm{sample}}_{N,\Phi} \bigr)}\, \mathbb{E}^{\nu_N^{\Phi}}\bigl[ \exp\bigl( -\Phi_N(u) \bigr) \bigr]. $$
Note that by Tonelli's Theorem, the measures $\mu^{y,N,\mathcal{G}}_{\mathrm{marginal}}$ and $\mu^{y,N,\Phi}_{\mathrm{marginal}}$ are indeed probability measures. Before proving bounds on the error in the marginal approximations $\mu^{y,N,\mathcal{G}}_{\mathrm{marginal}}$ and $\mu^{y,N,\Phi}_{\mathrm{marginal}}$ in section 4.2.2, and the error in the random approximations $\mu^{y,N,\mathcal{G}}_{\mathrm{sample}}$ and $\mu^{y,N,\Phi}_{\mathrm{sample}}$ in section 4.2.3, we crucially prove boundedness of the normalising constants $Z^{\mathrm{sample}}_{N,\mathcal{G}}$ and $Z^{\mathrm{sample}}_{N,\Phi}$ in section 4.2.1.
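As a computational aside (a sketch of our own, not the authors' code), the expectation over the predictive process that defines the marginal approximation can be approximated by Monte Carlo over draws of $\Phi_N$ at a fixed set of evaluation nodes; this is how the expectations are handled in the numerical experiments of section 5 (there with MATLAB's mvnrnd). A single draw plugged into $\exp(-\Phi_N)$, rather than the average, gives a realisation of the random sample approximation.

```python
import numpy as np

def marginal_likelihood_factors(mean_phi, cov_phi, n_samples=200, rng=None):
    """Monte Carlo approximation of E_{nu^Phi_N}[exp(-Phi_N(u))] at a fixed set of nodes,
    given the predictive mean vector and covariance matrix of the emulator Phi_N at
    those nodes. Dividing by the mu_0-average of the returned values gives the density
    of mu^{y,N,Phi}_marginal with respect to mu_0 at the nodes."""
    rng = np.random.default_rng(rng)
    draws = rng.multivariate_normal(mean_phi, cov_phi, size=n_samples)  # (n_samples, n_nodes)
    return np.exp(-draws).mean(axis=0)
```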
4.2.1 Moment bounds on $Z^{\mathrm{sample}}_{N,\mathcal{G}}$ and $Z^{\mathrm{sample}}_{N,\Phi}$
Firstly, we recall the following classical results from the theory of Gaussian measures on Banach spaces [29, 1].

Proposition 4.4. (Fernique's Theorem) Let $E$ be a separable Banach space and $\nu$ a centred Gaussian measure on $(E, \mathcal{B}(E))$. If $\lambda, r > 0$ are such that
$$ \log\Biggl( \frac{1 - \nu\bigl( f \in E : \|f\|_E \leq r \bigr)}{\nu\bigl( f \in E : \|f\|_E \leq r \bigr)} \Biggr) \leq -1 - 32 \lambda r^2, $$
then
$$ \int_E \exp\bigl( \lambda \|f\|_E^2 \bigr) \nu(df) \leq \exp\bigl( 16 \lambda r^2 \bigr) + \frac{e^2}{e^2 - 1}. $$

Proposition 4.5. (Borell-TIS Inequality¹) Let $f$ be a scalar, almost surely bounded Gaussian field on a compact domain $T \subseteq \mathbb{R}^K$, with zero mean $\mathbb{E}(f(t)) = 0$ and bounded variance $0 < \sigma_f^2 := \sup_{t \in T} \mathbb{V}(f(t)) < \infty$. Then $\mathbb{E}(\sup_{t \in T} f(t)) < \infty$, and for all $r > 0$,
$$ \mathbb{P}\Bigl( \sup_{t \in T} f(t) - \mathbb{E}\bigl( \sup_{t \in T} f(t) \bigr) > r \Bigr) \leq \exp\bigl( -r^2 / 2\sigma_f^2 \bigr). $$

Proposition 4.6. (Sudakov-Fernique Inequality) Let $f$ and $g$ be scalar, almost surely bounded Gaussian fields on a compact domain $T \subseteq \mathbb{R}^K$. Suppose $\mathbb{E}\bigl( (f(t) - f(s))^2 \bigr) \leq \mathbb{E}\bigl( (g(t) - g(s))^2 \bigr)$ and $\mathbb{E}(f(t)) = \mathbb{E}(g(t))$, for all $s, t \in T$. Then
$$ \mathbb{E}\Bigl( \sup_{t \in T} f(t) \Bigr) \leq \mathbb{E}\Bigl( \sup_{t \in T} g(t) \Bigr). $$

¹The Borell-TIS inequality is named after the mathematicians Borell and Tsirelson, Ibragimov and Sudakov, who independently proved the result.
Using these results, we are now ready to prove bounds on moments of $Z^{\mathrm{sample}}_{N,\mathcal{G}}$ and $Z^{\mathrm{sample}}_{N,\Phi}$, similar to those proved in Lemma 4.1. The reader interested purely in approximation results for the posterior can simply read the statements of the following two lemmas, and then proceed directly to subsections 4.2.2 and 4.2.3.

Lemma 4.7. Let $X \subseteq \mathbb{R}^K$ be compact. Suppose $\sup_{u \in X} \|\mathcal{G}(u) - m_N^{\mathcal{G}}(u)\|$, $\sup_{u \in X} |\Phi(u) - m_N^{\Phi}(u)|$ and $\sup_{u \in X} k_N(u, u)$ converge to 0 as $N$ tends to infinity, and assume $\sup_{u \in X} \|\mathcal{G}(u)\| \leq C_{\mathcal{G}} < \infty$. Suppose the assumptions of the Sudakov-Fernique inequality hold, for $g = \Phi_0$ and $f = \Phi_N - m_N^{\Phi}$, and for $g = \mathcal{G}_0^j$ and $f = \mathcal{G}_N^j - m_N^{\mathcal{G}^j}$, for $j \in \{1, \ldots, J\}$. Then, for any $1 \leq p < \infty$, there exist positive constants $C_1$ and $C_2$, independent of $U$ and $N$, such that for all $N$ sufficiently large
$$ C_1^{-1} \leq \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( (Z^{\mathrm{sample}}_{N,\mathcal{G}})^{p} \bigr) \leq 1, \qquad 1 \leq \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( (Z^{\mathrm{sample}}_{N,\mathcal{G}})^{-p} \bigr) \leq C_1, $$
and
$$ C_2^{-1} \leq \mathbb{E}^{\nu_N^{\Phi}}\bigl( (Z^{\mathrm{sample}}_{N,\Phi})^{p} \bigr) \leq C_2, \qquad C_2^{-1} \leq \mathbb{E}^{\nu_N^{\Phi}}\bigl( (Z^{\mathrm{sample}}_{N,\Phi})^{-p} \bigr) \leq C_2. $$
Proof. We start with $Z^{\mathrm{sample}}_{N,\mathcal{G}}$. Since the potential $\frac{1}{2\sigma_\eta^2}\|y - \mathcal{G}_N(u)\|^2$ is non-negative and $\int_X \mu_0(du) = 1 = \int_{C^0(X;\mathbb{R}^J)} \nu_N^{\mathcal{G}}(d\mathcal{G}_N)$, we have for any $1 \leq p < \infty$,
$$ \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( (Z^{\mathrm{sample}}_{N,\mathcal{G}})^{p} \bigr) = \int_{C^0(X;\mathbb{R}^J)} \Bigl( \int_X \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}_N(u)\|^2 \Bigr) \mu_0(du) \Bigr)^p \nu_N^{\mathcal{G}}(d\mathcal{G}_N) \leq 1. $$
From Jensen's inequality, it then follows that
$$ \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( (Z^{\mathrm{sample}}_{N,\mathcal{G}})^{-p} \bigr) \geq \Bigl( \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( (Z^{\mathrm{sample}}_{N,\mathcal{G}})^{p} \bigr) \Bigr)^{-1} \geq 1. $$
To determine $C_1$, we use the triangle inequality to bound, for any $1 \leq p < \infty$,
$$
\mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( (Z^{\mathrm{sample}}_{N,\mathcal{G}})^{-p} \bigr)
= \int_{C^0(X;\mathbb{R}^J)} \Bigl( \int_X \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}_N(u)\|^2 \Bigr) \mu_0(du) \Bigr)^{-p} \nu_N^{\mathcal{G}}(d\mathcal{G}_N)
\leq \int_{C^0(X;\mathbb{R}^J)} \exp\Bigl( \tfrac{p}{2\sigma_\eta^2} \sup_{u \in X} \|y - \mathcal{G}_N(u)\|^2 \Bigr) \nu_N^{\mathcal{G}}(d\mathcal{G}_N)
$$
$$
\leq \exp\Bigl( \tfrac{p}{\sigma_\eta^2} \sup_{u \in X} \|y - m_N^{\mathcal{G}}(u)\|^2 \Bigr) \int_{C^0(X;\mathbb{R}^J)} \exp\Bigl( \tfrac{p}{\sigma_\eta^2} \sup_{u \in X} \|\mathcal{G}_N(u) - m_N^{\mathcal{G}}(u)\|^2 \Bigr) \nu_N^{\mathcal{G}}(d\mathcal{G}_N).
$$
The first factor can be bounded independently of $U$ and $N$, using the triangle inequality, together with $\sup_{u \in X} \|\mathcal{G}(u)\| \leq C_{\mathcal{G}}$ and $\sup_{u \in X} \|\mathcal{G}(u) - m_N^{\mathcal{G}}(u)\| \to 0$ as $N \to \infty$. For the second factor, we use Fernique's Theorem (Proposition 4.4). First, we note that, using the independence of the components,
$$
\int_{C^0(X;\mathbb{R}^J)} \exp\Bigl( \tfrac{p}{\sigma_\eta^2} \sup_{u \in X} \sum_{j=1}^J |\mathcal{G}_N^j(u) - m_N^{\mathcal{G}^j}(u)|^2 \Bigr) \nu_N^{\mathcal{G}}(d\mathcal{G}_N)
\leq \prod_{j=1}^J \int_{C^0(X)} \exp\Bigl( \tfrac{p}{\sigma_\eta^2} \sup_{u \in X} |\mathcal{G}_N^j(u) - m_N^{\mathcal{G}^j}(u)|^2 \Bigr) \nu_N^{\mathcal{G}^j}(d\mathcal{G}_N^j).
$$
It remains to show that, for $N$ sufficiently large, the assumptions of Fernique's Theorem hold for $\lambda = p \sigma_\eta^{-2}$ and a value of $r$ independent of $U$ and $N$, for $\nu$ equal to the push-forward of $\nu_N^{\mathcal{G}^j}$ under the map $T(f) = f - m_N^{\mathcal{G}^j}$. Denote by $B_{N,r}^{\mathcal{G}^j} \subset C^0(X)$ the set of all functions $f$ such that $\|f - m_N^{\mathcal{G}^j}\|_{C^0(X)} \leq r$, for some fixed $r > 0$ and $1 \leq j \leq J$, and let $\tilde{\mathcal{G}}_N^j = \mathcal{G}_N^j - m_N^{\mathcal{G}^j}$. By the Borell-TIS Inequality, we have, for all $r > \mathbb{E}(\sup_{u \in X} \tilde{\mathcal{G}}_N^j(u))$,
$$ \nu_N^{\mathcal{G}^j}\Bigl( \mathcal{G}_N^j : \sup_{u \in X} \tilde{\mathcal{G}}_N^j(u) > r \Bigr) \leq \exp\Biggl( -\frac{\bigl( r - \mathbb{E}(\sup_{u \in X} \tilde{\mathcal{G}}_N^j(u)) \bigr)^2}{2\sigma_N^2} \Biggr), $$
where $\sigma_N^2 := \sup_{u \in X} k_N(u, u)$. By assumption, $\mathbb{E}^{\nu_N^{\mathcal{G}^j}}\bigl( (\tilde{\mathcal{G}}_N^j(u) - \tilde{\mathcal{G}}_N^j(u'))^2 \bigr) \leq \mathbb{E}^{\nu_0^{\mathcal{G}^j}}\bigl( (\mathcal{G}_0^j(u) - \mathcal{G}_0^j(u'))^2 \bigr)$, and so $\mathbb{E}(\sup_{u \in X} \tilde{\mathcal{G}}_N^j(u)) \leq \mathbb{E}(\sup_{u \in X} \mathcal{G}_0^j(u))$, by the Sudakov-Fernique Inequality. We can hence choose $r > \mathbb{E}(\sup_{u \in X} \mathcal{G}_0^j(u))$, independent of $U$ and $N$, such that the bound
$$ \nu_N^{\mathcal{G}^j}\Bigl( \mathcal{G}_N^j : \sup_{u \in X} \tilde{\mathcal{G}}_N^j(u) > r \Bigr) \leq \exp\Biggl( -\frac{\bigl( r - \mathbb{E}(\sup_{u \in X} \mathcal{G}_0^j(u)) \bigr)^2}{2\sigma_N^2} \Biggr) $$
holds for all $N \in \mathbb{N}$. By assumption we have $\sigma_N^2 \to 0$ as $N \to \infty$, and by the symmetry of Gaussian measures we hence have $\nu_N^{\mathcal{G}^j}(B_{N,r}^{\mathcal{G}^j}) \to 1$ as $N \to \infty$, for all $r > \mathbb{E}(\sup_{u \in X} \mathcal{G}_0^j(u))$. For $N = N(p)$ sufficiently large, the inequality
$$ \log\Biggl( \frac{1 - \nu_N^{\mathcal{G}^j}(B_{N,r}^{\mathcal{G}^j})}{\nu_N^{\mathcal{G}^j}(B_{N,r}^{\mathcal{G}^j})} \Biggr) \leq -1 - 32 \lambda r^2 $$
in the assumptions of Fernique's Theorem is then satisfied, for $\lambda = p \sigma_\eta^{-2}$ and $r > \mathbb{E}(\sup_{u \in X} \mathcal{G}_0^j(u))$, both independent of $U$ and $N$. Hence, $\mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( (Z^{\mathrm{sample}}_{N,\mathcal{G}})^{-p} \bigr) \leq C_1(p)$, for a constant $C_1(p) < \infty$ independent of $U$ and $N$. From Jensen's inequality, it finally follows that
$$ \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( (Z^{\mathrm{sample}}_{N,\mathcal{G}})^{p} \bigr) \geq \Bigl( \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( (Z^{\mathrm{sample}}_{N,\mathcal{G}})^{-p} \bigr) \Bigr)^{-1} \geq C_1(p)^{-1}. $$

The proof for $Z^{\mathrm{sample}}_{N,\Phi}$ is similar. Using $\int_X \mu_0(du) = 1$ and the triangle inequality, we have
$$
\mathbb{E}^{\nu_N^{\Phi}}\bigl( (Z^{\mathrm{sample}}_{N,\Phi})^{p} \bigr)
= \int_{C^0(X)} \Bigl( \int_X \exp\bigl( -\Phi_N(u) \bigr) \mu_0(du) \Bigr)^p \nu_N^{\Phi}(d\Phi_N)
\leq \int_{C^0(X)} \exp\Bigl( p \sup_{u \in X} |\Phi_N(u)| \Bigr) \nu_N^{\Phi}(d\Phi_N)
$$
$$
\leq \exp\Bigl( p \sup_{u \in X} |m_N^{\Phi}(u)| \Bigr) \int_{C^0(X)} \exp\Bigl( p \sup_{u \in X} |\Phi_N(u) - m_N^{\Phi}(u)| \Bigr) \nu_N^{\Phi}(d\Phi_N).
$$
The first factor can be bounded independently of $U$ and $N$, since $\sup_{u \in X} \|\mathcal{G}(u)\| \leq C_{\mathcal{G}}$ and $\sup_{u \in X} |\Phi(u) - m_N^{\Phi}(u)|$ converges to 0 as $N \to \infty$. The second factor can be bounded by Fernique's Theorem. Using the same proof technique as above, we can show that $\nu_N^{\Phi}(B_{N,r}^{\Phi}) \to 1$ as $N \to \infty$ for all $r > \mathbb{E}(\sup_{u \in X} \Phi_0(u))$, where $B_{N,r}^{\Phi} \subset C^0(X)$ denotes the set of all functions $f$ such that $\|f - m_N^{\Phi}\|_{C^0(X)} \leq r$. Hence, it is possible to choose $r > 0$, independent of $U$ and $N$, such that the assumptions of Fernique's Theorem hold for $\nu$ equal to the push-forward of $\nu_N^{\Phi}$ under the map $T(f) = f - m_N^{\Phi}$, for some $\lambda > 0$ also independent of $U$ and $N$. By Young's inequality, we have
$$ \exp\Bigl( p \sup_{u \in X} |\Phi_N(u) - m_N^{\Phi}(u)| \Bigr) \leq \exp\Bigl( \lambda \sup_{u \in X} |\Phi_N(u) - m_N^{\Phi}(u)|^2 + \frac{p^2}{4\lambda} \Bigr), $$
and it follows that $\mathbb{E}^{\nu_N^{\Phi}}\bigl( (Z^{\mathrm{sample}}_{N,\Phi})^{p} \bigr) \leq C_2(p)$, for a constant $C_2(p) < \infty$ independent of $U$ and $N$, for any $1 \leq p < \infty$. Furthermore, we note that
$$ \mathbb{E}^{\nu_N^{\Phi}}\bigl( (Z^{\mathrm{sample}}_{N,\Phi})^{-p} \bigr) \leq \int_{C^0(X)} \exp\Bigl( p \sup_{u \in X} |\Phi_N(u)| \Bigr) \nu_N^{\Phi}(d\Phi_N) \leq C_2(p). $$
By Jensen's inequality, we finally have $\mathbb{E}^{\nu_N^{\Phi}}\bigl( (Z^{\mathrm{sample}}_{N,\Phi})^{p} \bigr) \geq C_2(p)^{-1}$ and $\mathbb{E}^{\nu_N^{\Phi}}\bigl( (Z^{\mathrm{sample}}_{N,\Phi})^{-p} \bigr) \geq C_2(p)^{-1}$.
In Lemma 4.7, we supposed that the assumptions of the Sudakov-Fernique inequality hold, for $g = \Phi_0$ and $f = \Phi_N - m_N^{\Phi}$, and for $g = \mathcal{G}_0^j$ and $f = \mathcal{G}_N^j - m_N^{\mathcal{G}^j}$, for $j \in \{1, \ldots, J\}$. This is an assumption on the predictive variance $k_N$. In the following lemma, we prove this assumption for the predictive variance given in (3.6).

Lemma 4.8. Suppose the predictive variance $k_N$ is given by (3.6). Then the assumptions of the Sudakov-Fernique inequality hold, for $g = \Phi_0$ and $f = \Phi_N - m_N^{\Phi}$, and for $g = \mathcal{G}_0^j$ and $f = \mathcal{G}_N^j - m_N^{\mathcal{G}^j}$, for $j \in \{1, \ldots, J\}$.

Proof. We give a proof for $g = \Phi_0$ and $f = \Phi_N - m_N^{\Phi}$; the proof for $g = \mathcal{G}_0^j$ and $f = \mathcal{G}_N^j - m_N^{\mathcal{G}^j}$, for $j \in \{1, \ldots, J\}$, is identical. For any $u, u' \in X$, we have $\mathbb{E}^{\nu_0^{\Phi}}(\Phi_0(u)) = 0 = \mathbb{E}^{\nu_N^{\Phi}}(\Phi_N(u) - m_N^{\Phi}(u))$, and
$$ \mathbb{E}^{\nu_N^{\Phi}}\Bigl( \bigl( (\Phi_N(u) - m_N^{\Phi}(u)) - (\Phi_N(u') - m_N^{\Phi}(u')) \bigr)^2 \Bigr) = k_N(u, u) - k_N(u, u') - k_N(u', u) + k_N(u', u'), $$
$$ \mathbb{E}^{\nu_0^{\Phi}}\bigl( (\Phi_0(u) - \Phi_0(u'))^2 \bigr) = k(u, u) - k(u, u') - k(u', u) + k(u', u'). $$
By (3.6), we have $k_N(u, u') = k(u, u') - k(u, U)^T K(U, U)^{-1} k(u', U)$, and so
$$ \mathbb{E}^{\nu_0^{\Phi}}\bigl( (\Phi_0(u) - \Phi_0(u'))^2 \bigr) - \mathbb{E}^{\nu_N^{\Phi}}\Bigl( \bigl( (\Phi_N(u) - m_N^{\Phi}(u)) - (\Phi_N(u') - m_N^{\Phi}(u')) \bigr)^2 \Bigr) = \bigl( k(u, U) - k(u', U) \bigr)^T K(U, U)^{-1} \bigl( k(u, U) - k(u', U) \bigr) \geq 0, $$
since the matrix $K(U, U)^{-1}$ is positive definite.

We are now ready to prove bounds on the approximation error in the posterior distributions.

4.2.2 Error in the marginal approximations $\mu^{y,N,\mathcal{G}}_{\mathrm{marginal}}$ and $\mu^{y,N,\Phi}_{\mathrm{marginal}}$
We start by analysing the error in the marginal approximations $\mu^{y,N,\mathcal{G}}_{\mathrm{marginal}}$ and $\mu^{y,N,\Phi}_{\mathrm{marginal}}$.

Theorem 4.9. Under the assumptions of Lemma 4.7, there exist constants $C_1$ and $C_2$, independent of $U$ and $N$, such that
$$ d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{marginal}}) \leq C_1 \Bigl\| \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( \|\mathcal{G} - \mathcal{G}_N\|^{1+\delta} \bigr)^{1/(1+\delta)} \Bigr\|_{L^2_{\mu_0}(X)}, \qquad \text{for any } \delta > 0, $$
$$ d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{marginal}}) \leq C_2 \Bigl\| \mathbb{E}^{\nu_N^{\Phi}}\bigl( |\Phi - \Phi_N| \bigr) \Bigr\|_{L^2_{\mu_0}(X)}. $$
Proof. We start with $\mu^{y,N,\mathcal{G}}_{\mathrm{marginal}}$. By the definition of the Hellinger distance, we have
$$
2\, d_{\mathrm{Hell}}^2(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{marginal}}) = \int_X \Biggl( \sqrt{\frac{d\mu^y}{d\mu_0}} - \sqrt{\frac{d\mu^{y,N,\mathcal{G}}_{\mathrm{marginal}}}{d\mu_0}} \Biggr)^2 \mu_0(du)
$$
$$
\leq \frac{2}{Z} \int_X \Biggl( \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}(u)\|^2 \Bigr)^{1/2} - \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}_N(u)\|^2 \Bigr) \Bigr)^{1/2} \Biggr)^2 \mu_0(du) + 2\, \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( Z^{\mathrm{sample}}_{N,\mathcal{G}} \bigr) \Bigl| Z^{-1/2} - \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( Z^{\mathrm{sample}}_{N,\mathcal{G}} \bigr)^{-1/2} \Bigr|^2 =: I + II.
$$
For the first term, we use the (in)equalities $\sqrt{a} - \sqrt{b} = (a - b)/(\sqrt{a} + \sqrt{b})$ and $(\sqrt{a} + \sqrt{b})^2 \geq a + b$, for $a, b > 0$, to derive
$$
\frac{Z}{2} I \leq \sup_{u \in X} \Bigl( \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}(u)\|^2 \Bigr) + \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}_N(u)\|^2 \Bigr) \Bigr) \Bigr)^{-1} \int_X \Bigl( \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}(u)\|^2 \Bigr) - \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}_N(u)\|^2 \Bigr) \Bigr) \Bigr)^2 \mu_0(du).
$$
For the first factor, using the convexity of $1/x$ on $(0, \infty)$, together with Jensen's inequality, we have for all $u \in X$ the bound
$$
\Bigl( \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}(u)\|^2 \Bigr) + \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}_N(u)\|^2 \Bigr) \Bigr) \Bigr)^{-1}
\leq \exp\Bigl( \tfrac{1}{2\sigma_\eta^2} \sup_{u \in X} \|y - \mathcal{G}(u)\|^2 \Bigr) + \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \exp\Bigl( \tfrac{1}{2\sigma_\eta^2} \sup_{u \in X} \|y - \mathcal{G}_N(u)\|^2 \Bigr) \Bigr).
$$
As in the proof of Lemma 4.7, it then follows by Fernique's Theorem that the right-hand side can be bounded by a constant independent of $U$ and $N$. For the second factor in the bound on $\frac{Z}{2} I$, the linearity of expectation, the local Lipschitz continuity of the exponential function, the equality $a^2 - b^2 = (a-b)(a+b)$, the reverse triangle inequality and Hölder's inequality with conjugate exponents $p = (1+\delta)/\delta$ and $q = 1+\delta$ give, pointwise in $u$,
$$
\Bigl( \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}(u)\|^2 \Bigr) - \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}_N(u)\|^2 \Bigr) \Bigr) \Bigr)^2
\leq \frac{1}{4\sigma_\eta^4} \Bigl( \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl[ \bigl( \|y - \mathcal{G}(u)\| + \|y - \mathcal{G}_N(u)\| \bigr) \|\mathcal{G}(u) - \mathcal{G}_N(u)\| \Bigr] \Bigr)^2
$$
$$
\leq \frac{1}{4\sigma_\eta^4}\, \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \bigl( \|y - \mathcal{G}(u)\| + \|y - \mathcal{G}_N(u)\| \bigr)^{(1+\delta)/\delta} \Bigr)^{2\delta/(1+\delta)}\, \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \|\mathcal{G}(u) - \mathcal{G}_N(u)\|^{1+\delta} \Bigr)^{2/(1+\delta)},
$$
for any $\delta > 0$. Integrating over $X$ and bounding the first factor by its supremum over $u \in X$, which by Fernique's Theorem (as in the proof of Lemma 4.7, since $\sup_{u \in X} \|\mathcal{G}(u)\| \leq C_{\mathcal{G}} < \infty$) is bounded independently of $U$ and $N$, it follows that there exists a constant $C$ independent of $U$ and $N$ such that
$$ I \leq C \Bigl\| \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( \|\mathcal{G} - \mathcal{G}_N\|^{1+\delta} \bigr)^{1/(1+\delta)} \Bigr\|^2_{L^2_{\mu_0}(X)}. $$
For the second term in the bound on the Hellinger distance, we have
$$ \frac{1}{2} II = \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( Z^{\mathrm{sample}}_{N,\mathcal{G}} \bigr) \Bigl| Z^{-1/2} - \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( Z^{\mathrm{sample}}_{N,\mathcal{G}} \bigr)^{-1/2} \Bigr|^2 \leq \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( Z^{\mathrm{sample}}_{N,\mathcal{G}} \bigr) \max\Bigl( Z^{-3}, \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( Z^{\mathrm{sample}}_{N,\mathcal{G}} \bigr)^{-3} \Bigr) \Bigl| Z - \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( Z^{\mathrm{sample}}_{N,\mathcal{G}} \bigr) \Bigr|^2.
$$
Using the linearity of expectation, Tonelli's Theorem and Jensen's inequality, we have
$$ \Bigl| Z - \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( Z^{\mathrm{sample}}_{N,\mathcal{G}} \bigr) \Bigr|^2 \leq \int_X \Bigl( \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl| \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}(u)\|^2 \Bigr) - \exp\Bigl( -\tfrac{1}{2\sigma_\eta^2} \|y - \mathcal{G}_N(u)\|^2 \Bigr) \Bigr| \Bigr)^2 \mu_0(du), $$
which can now be bounded as before. The first claim of the theorem then follows by Lemma 4.7.

The proof for $\mu^{y,N,\Phi}_{\mathrm{marginal}}$ is similar. We use the corresponding splitting of the Hellinger distance, $2\, d_{\mathrm{Hell}}^2(\mu^y, \mu^{y,N,\Phi}_{\mathrm{marginal}}) \leq I + II$. For the first term, we have
$$
\frac{Z}{2} I = \int_X \Bigl( \sqrt{\exp\bigl( -\Phi(u) \bigr)} - \sqrt{\mathbb{E}^{\nu_N^{\Phi}}\bigl( \exp( -\Phi_N(u) ) \bigr)} \Bigr)^2 \mu_0(du)
\leq \sup_{u \in X} \Bigl( \exp\bigl( -\Phi(u) \bigr) + \mathbb{E}^{\nu_N^{\Phi}}\bigl( \exp( -\Phi_N(u) ) \bigr) \Bigr)^{-1} \int_X \Bigl( \exp\bigl( -\Phi(u) \bigr) - \mathbb{E}^{\nu_N^{\Phi}}\bigl( \exp( -\Phi_N(u) ) \bigr) \Bigr)^2 \mu_0(du).
$$
The first factor can again be bounded using Jensen's inequality,
$$ \sup_{u \in X} \Bigl( \exp\bigl( -\Phi(u) \bigr) + \mathbb{E}^{\nu_N^{\Phi}}\bigl( \exp( -\Phi_N(u) ) \bigr) \Bigr)^{-1} \leq \exp\Bigl( \sup_{u \in X} \Phi(u) \Bigr) + \mathbb{E}^{\nu_N^{\Phi}}\Bigl( \exp\Bigl( \sup_{u \in X} \Phi_N(u) \Bigr) \Bigr), $$
which, as in the proof of Lemma 4.7, can be bounded by a constant independent of $U$ and $N$ by Fernique's Theorem. For the second factor in the bound on $\frac{Z}{2} I$, the linearity of expectation and the local Lipschitz continuity of the exponential function give
$$ \int_X \Bigl( \mathbb{E}^{\nu_N^{\Phi}}\bigl( \exp( -\Phi(u) ) - \exp( -\Phi_N(u) ) \bigr) \Bigr)^2 \mu_0(du) \leq 2 \int_X \Bigl( \mathbb{E}^{\nu_N^{\Phi}}\bigl( |\Phi(u) - \Phi_N(u)| \bigr) \Bigr)^2 \mu_0(du) = 2 \Bigl\| \mathbb{E}^{\nu_N^{\Phi}}\bigl( |\Phi - \Phi_N| \bigr) \Bigr\|^2_{L^2_{\mu_0}(X)}. $$
For the second term in the bound on the Hellinger distance, the linearity of expectation, Tonelli's Theorem and Jensen's inequality give
$$ \Bigl| Z - \mathbb{E}^{\nu_N^{\Phi}}\bigl( Z^{\mathrm{sample}}_{N,\Phi} \bigr) \Bigr|^2 \leq \int_X \Bigl( \mathbb{E}^{\nu_N^{\Phi}}\bigl( \exp( -\Phi(u) ) - \exp( -\Phi_N(u) ) \bigr) \Bigr)^2 \mu_0(du), $$
which can now be bounded as before. The second claim of the theorem then follows by Lemma 4.7.
Similar to Theorem 4.2, Theorem 4.9 provides error bounds for general Gaussian process emulators of $\mathcal{G}$ and $\Phi$. An example of a Gaussian process emulator that satisfies the assumptions of Theorem 4.9 is the emulator defined by (3.6); however, other choices are possible.

Corollary 4.10. Suppose $\mathcal{G}_N$ and $\Phi_N$ are defined as in (3.6), with Matérn kernel $k = k_{\nu,\lambda,\sigma_k^2}$. Suppose Assumption 2.2 and the assumptions of Proposition 3.3 and Theorem 4.9 are satisfied. Then there exist constants $C_1$, $C_2$, $C_3$ and $C_4$, independent of $U$ and $N$, such that
$$ d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{marginal}}) \leq C_1 h_U^{\nu+K/2} + C_2 h_U^{\nu} \qquad \text{and} \qquad d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{marginal}}) \leq C_3 h_U^{\nu+K/2} + C_4 h_U^{\nu}. $$
1/2
2 y,N,G y 2 2
dHell (µ , µmarginal ) ≤ C Eν G kG − GN k
N
=C
Z
≤ 2C
ZX
L2µ0 (X)
Eν G kG(u) − GN (u)k2 µ0 (du)
X
N
kG(u) −
mGN (u)k2 µ0 (du) +
2C
Z
X
Eν G kmGN (u) − GN (u)k2 µ0 (du). N
The first term can be bounded by using Assumption 2.2, Proposition 3.2 and Proposition 3.3, Z
X
kG(u) −
mGN (u)k2 µ0 (du)
Z X J J X 2ν+K j Gj 2 kG j k2H ν+K/2 (X) , (G (u) − mN (u)) µ0 (du) ≤ ChU = X j=1
j=1
for a constant C independent of U and N . The second term can be bounded by using Assumption 2.2, Proposition 3.2, Proposition 3.3, Proposition 3.5, the linearity of expectation and the Sobolev Embedding Theorem Z
X
Eν G
N
kmGN (u)
Z J X j j (u))2 µ0 (du) (mGN (u) − GN Eν G − GN (u)k µ0 (du) = 2
N
X
=J
Z
j=1
kN (u, u)µ0 (du) X
≤ J sup
sup |g(u) − mgN (u)|2
u∈X kgkHk =1
≤ Ch2ν U , for a constant C independent of U and N . The claim of the corollary then follows. ν+K/2
, correNote that it in Corollary 4.10, we were not able to recover the convergence rate hU 2 sponding to the convergence rate of the error measured in the L (X)-norm in Proposition 3.3. The reason for this is the supremum over g appearing in the expression for kN (u, u) in Proposition 3.5, 1/2 which allows us only to conclude on the lower rate of convergence hνU for kkN kL2 (X) .
19
4.2.3 Error in the random approximations $\mu^{y,N,\mathcal{G}}_{\mathrm{sample}}$ and $\mu^{y,N,\Phi}_{\mathrm{sample}}$

We have the following result for the random approximations $\mu^{y,N,\mathcal{G}}_{\mathrm{sample}}$ and $\mu^{y,N,\Phi}_{\mathrm{sample}}$.
Theorem 4.11. Under the assumptions of Lemma 4.7, there exist constants $C_1$ and $C_2$, independent of $U$ and $N$, such that
$$ \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( d_{\mathrm{Hell}}^2(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{sample}}) \bigr)^{1/2} \leq C_1 \Bigl\| \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( \|\mathcal{G} - \mathcal{G}_N\|^{2+\delta} \bigr)^{1/(2+\delta)} \Bigr\|_{L^2_{\mu_0}(X)}, \qquad \text{for any } \delta > 0, $$
$$ \mathbb{E}^{\nu_N^{\Phi}}\bigl( d_{\mathrm{Hell}}^2(\mu^y, \mu^{y,N,\Phi}_{\mathrm{sample}}) \bigr)^{1/2} \leq C_2 \Bigl\| \mathbb{E}^{\nu_N^{\Phi}}\bigl( |\Phi - \Phi_N|^{2} \bigr)^{1/2} \Bigr\|_{L^2_{\mu_0}(X)}. $$
Proof. We start with $\mu^{y,N,\mathcal{G}}_{\mathrm{sample}}$. By the definition of the Hellinger distance and the linearity of expectation, we have
$$
\mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( 2\, d_{\mathrm{Hell}}^2(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{sample}}) \bigr) = \mathbb{E}^{\nu_N^{\mathcal{G}}}\Biggl( \int_X \Biggl( \sqrt{\frac{d\mu^y}{d\mu_0}} - \sqrt{\frac{d\mu^{y,N,\mathcal{G}}_{\mathrm{sample}}}{d\mu_0}} \Biggr)^2 \mu_0(du) \Biggr)
$$
$$
\leq \frac{2}{Z}\, \mathbb{E}^{\nu_N^{\mathcal{G}}}\Biggl( \int_X \Bigl( \exp\Bigl( -\tfrac{1}{4\sigma_\eta^2} \|y - \mathcal{G}(u)\|^2 \Bigr) - \exp\Bigl( -\tfrac{1}{4\sigma_\eta^2} \|y - \mathcal{G}_N(u)\|^2 \Bigr) \Bigr)^2 \mu_0(du) \Biggr) + 2\, \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( Z^{\mathrm{sample}}_{N,\mathcal{G}} \bigl| Z^{-1/2} - (Z^{\mathrm{sample}}_{N,\mathcal{G}})^{-1/2} \bigr|^2 \Bigr) =: I + II.
$$
For the first term, Tonelli's Theorem, the local Lipschitz continuity of the exponential function, the equality $a^2 - b^2 = (a-b)(a+b)$, the reverse triangle inequality and Hölder's inequality with conjugate exponents $p = (1+\delta)/\delta$ and $q = 1+\delta$ give
$$
\frac{Z}{2} I \leq \frac{1}{16\sigma_\eta^4} \int_X \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \bigl( \|y - \mathcal{G}(u)\| + \|y - \mathcal{G}_N(u)\| \bigr)^2 \|\mathcal{G}(u) - \mathcal{G}_N(u)\|^2 \Bigr) \mu_0(du)
$$
$$
\leq \frac{1}{16\sigma_\eta^4} \sup_{u \in X} \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \bigl( \|y - \mathcal{G}(u)\| + \|y - \mathcal{G}_N(u)\| \bigr)^{2(1+\delta)/\delta} \Bigr)^{\delta/(1+\delta)} \int_X \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \|\mathcal{G}(u) - \mathcal{G}_N(u)\|^{2(1+\delta)} \Bigr)^{1/(1+\delta)} \mu_0(du),
$$
for any $\delta > 0$. The supremum in the above bound can be bounded independently of $U$ and $N$ by Fernique's Theorem, as in the proof of Lemma 4.7. It follows that there exists a constant $C$ independent of $U$ and $N$ such that
$$ I \leq C \Bigl\| \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( \|\mathcal{G} - \mathcal{G}_N\|^{2(1+\delta)} \bigr)^{1/(2(1+\delta))} \Bigr\|^2_{L^2_{\mu_0}(X)}. $$
For the second term in the bound on the Hellinger distance, we have
$$ \frac{1}{2} II = \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( Z^{\mathrm{sample}}_{N,\mathcal{G}} \bigl| Z^{-1/2} - (Z^{\mathrm{sample}}_{N,\mathcal{G}})^{-1/2} \bigr|^2 \Bigr) \leq \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( Z^{\mathrm{sample}}_{N,\mathcal{G}} \max\bigl( Z^{-3}, (Z^{\mathrm{sample}}_{N,\mathcal{G}})^{-3} \bigr) \bigl| Z - Z^{\mathrm{sample}}_{N,\mathcal{G}} \bigr|^2 \Bigr). $$
By Jensen's inequality and the same argument as above, we have
$$ \bigl| Z - Z^{\mathrm{sample}}_{N,\mathcal{G}} \bigr|^2 \leq \frac{1}{4\sigma_\eta^4} \int_X \bigl( \|y - \mathcal{G}_N(u)\| + \|y - \mathcal{G}(u)\| \bigr)^2 \|\mathcal{G}_N(u) - \mathcal{G}(u)\|^2 \mu_0(du). $$
Together with Tonelli's Theorem and Hölder's inequality with conjugate exponents $p = (1+\delta)/\delta$ and $q = 1+\delta$, we then have
$$
\frac{1}{2} II \leq \frac{1}{4\sigma_\eta^4} \sup_{u \in X} \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \bigl( Z^{\mathrm{sample}}_{N,\mathcal{G}} \max\bigl( Z^{-3}, (Z^{\mathrm{sample}}_{N,\mathcal{G}})^{-3} \bigr) \bigr)^{(1+\delta)/\delta} \bigl( \|y - \mathcal{G}(u)\| + \|y - \mathcal{G}_N(u)\| \bigr)^{2(1+\delta)/\delta} \Bigr)^{\delta/(1+\delta)} \int_X \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \|\mathcal{G}_N(u) - \mathcal{G}(u)\|^{2(1+\delta)} \Bigr)^{1/(1+\delta)} \mu_0(du),
$$
for any $\delta > 0$. The supremum in the bound above can be bounded independently of $U$ and $N$ by Lemma 4.7 and Fernique's Theorem. The first claim of the theorem then follows.

The proof for $\mu^{y,N,\Phi}_{\mathrm{sample}}$ is similar. Using the corresponding splitting of the Hellinger distance, $\mathbb{E}^{\nu_N^{\Phi}}\bigl( 2\, d_{\mathrm{Hell}}^2(\mu^y, \mu^{y,N,\Phi}_{\mathrm{sample}}) \bigr) \leq I + II$, we bound the first term by Tonelli's Theorem and the local Lipschitz continuity of the exponential function:
$$ \frac{Z}{2} I = \int_X \mathbb{E}^{\nu_N^{\Phi}}\Bigl( \bigl( \exp( -\Phi(u)/2 ) - \exp( -\Phi_N(u)/2 ) \bigr)^2 \Bigr) \mu_0(du) \leq C \Bigl\| \mathbb{E}^{\nu_N^{\Phi}}\bigl( (\Phi_N - \Phi)^2 \bigr)^{1/2} \Bigr\|^2_{L^2_{\mu_0}(X)}. $$
For the second term, we have as before
$$ \frac{1}{2} II \leq \mathbb{E}^{\nu_N^{\Phi}}\Bigl( Z^{\mathrm{sample}}_{N,\Phi} \max\bigl( Z^{-3}, (Z^{\mathrm{sample}}_{N,\Phi})^{-3} \bigr) \bigl| Z - Z^{\mathrm{sample}}_{N,\Phi} \bigr|^2 \Bigr) $$
and
$$ \bigl| Z - Z^{\mathrm{sample}}_{N,\Phi} \bigr|^2 = \Bigl( \int_X \bigl( \exp( -\Phi(u) ) - \exp( -\Phi_N(u) ) \bigr) \mu_0(du) \Bigr)^2 \leq 4 \int_X \bigl( \Phi(u) - \Phi_N(u) \bigr)^2 \mu_0(du). $$
Together with Tonelli's Theorem and Hölder's inequality with conjugate exponents $p = (1+\delta)/\delta$ and $q = 1+\delta$, we then have
$$ \frac{1}{2} II \leq 4 \sup_{u \in X} \mathbb{E}^{\nu_N^{\Phi}}\Bigl( \bigl( Z^{\mathrm{sample}}_{N,\Phi} \max\bigl( Z^{-3}, (Z^{\mathrm{sample}}_{N,\Phi})^{-3} \bigr) \bigr)^{(1+\delta)/\delta} \Bigr)^{\delta/(1+\delta)} \int_X \mathbb{E}^{\nu_N^{\Phi}}\Bigl( |\Phi(u) - \Phi_N(u)|^{2(1+\delta)} \Bigr)^{1/(1+\delta)} \mu_0(du), $$
for any $\delta > 0$. The first expected value in the bound above can be bounded independently of $U$ and $N$ by Lemma 4.7. The second claim of the theorem then follows.

Similar to Theorem 4.2 and Theorem 4.9, Theorem 4.11 provides error bounds for general Gaussian process emulators of $\mathcal{G}$ and $\Phi$. As a particular example, we can take the emulators defined by (3.6).

Corollary 4.12. Suppose $\mathcal{G}_N$ and $\Phi_N$ are defined as in (3.6), with Matérn kernel $k = k_{\nu,\lambda,\sigma_k^2}$. Suppose Assumption 2.2 and the assumptions of Proposition 3.3 and Theorem 4.11 are satisfied. Then there exist constants $C_1$, $C_2$, $C_3$ and $C_4$, independent of $U$ and $N$, such that
$$ \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( d_{\mathrm{Hell}}^2(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{sample}}) \bigr)^{1/2} \leq C_1 h_U^{\nu+K/2} + C_2 h_U^{\nu} \qquad \text{and} \qquad \mathbb{E}^{\nu_N^{\Phi}}\bigl( d_{\mathrm{Hell}}^2(\mu^y, \mu^{y,N,\Phi}_{\mathrm{sample}}) \bigr)^{1/2} \leq C_3 h_U^{\nu+K/2} + C_4 h_U^{\nu}. $$

Proof. The proof is similar to that of Corollary 4.10, exploiting that for a Gaussian random variable $X$ we have $\mathbb{E}\bigl( (X - \mathbb{E}(X))^4 \bigr) = 3 \bigl( \mathbb{E}\bigl( (X - \mathbb{E}(X))^2 \bigr) \bigr)^2$.

We furthermore have the following result on a generalised total variation distance [31], defined by
$$ d_{\mathrm{gTV}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{sample}}) = \sup_{\|f\|_{C^0(X)} \leq 1} \mathbb{E}^{\nu_N^{\mathcal{G}}}\Bigl( \bigl| \mathbb{E}^{\mu^y}(f) - \mathbb{E}^{\mu^{y,N,\mathcal{G}}_{\mathrm{sample}}}(f) \bigr|^2 \Bigr)^{1/2} $$
for $\mu^{y,N,\mathcal{G}}_{\mathrm{sample}}$, and defined analogously for $\mu^{y,N,\Phi}_{\mathrm{sample}}$.
Theorem 4.13. Under the assumptions of Lemma 4.7, there exist constants $C_1$ and $C_2$, independent of $U$ and $N$, such that
$$ d_{\mathrm{gTV}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{sample}}) \leq C_1 \Bigl\| \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( \|\mathcal{G} - \mathcal{G}_N\|^{2+\delta} \bigr)^{1/(2+\delta)} \Bigr\|_{L^2_{\mu_0}(X)}, \qquad \text{for any } \delta > 0, $$
$$ d_{\mathrm{gTV}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{sample}}) \leq C_2 \Bigl\| \mathbb{E}^{\nu_N^{\Phi}}\bigl( |\Phi - \Phi_N|^{2} \bigr)^{1/2} \Bigr\|_{L^2_{\mu_0}(X)}. $$
Proof. We give the proof for $\mu^{y,N,\Phi}_{\mathrm{sample}}$; the proof for $\mu^{y,N,\mathcal{G}}_{\mathrm{sample}}$ is identical. By definition, we have
$$
d_{\mathrm{gTV}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{sample}}) = \sup_{\|f\|_{C^0(X)} \leq 1} \mathbb{E}^{\nu_N^{\Phi}}\Bigl( \bigl| \mathbb{E}^{\mu^y}(f) - \mathbb{E}^{\mu^{y,N,\Phi}_{\mathrm{sample}}}(f) \bigr|^2 \Bigr)^{1/2}
= \sup_{\|f\|_{C^0(X)} \leq 1} \mathbb{E}^{\nu_N^{\Phi}}\Biggl( \Bigl| \int_X f(u) \bigl( \exp(-\Phi(u)) Z^{-1} - \exp(-\Phi_N(u)) (Z^{\mathrm{sample}}_{N,\Phi})^{-1} \bigr) \mu_0(du) \Bigr|^2 \Biggr)^{1/2}
$$
$$
\leq \mathbb{E}^{\nu_N^{\Phi}}\Biggl( \Bigl( \int_X \bigl| \exp(-\Phi(u)) Z^{-1} - \exp(-\Phi_N(u)) (Z^{\mathrm{sample}}_{N,\Phi})^{-1} \bigr| \mu_0(du) \Bigr)^2 \Biggr)^{1/2}
$$
$$
\leq \mathbb{E}^{\nu_N^{\Phi}}\Biggl( \Bigl( \frac{1}{Z} \int_X \bigl| \exp(-\Phi(u)) - \exp(-\Phi_N(u)) \bigr| \mu_0(du) \Bigr)^2 \Biggr)^{1/2} + \mathbb{E}^{\nu_N^{\Phi}}\Bigl( (Z^{\mathrm{sample}}_{N,\Phi})^{2} \bigl| Z^{-1} - (Z^{\mathrm{sample}}_{N,\Phi})^{-1} \bigr|^2 \Bigr)^{1/2} =: I + II.
$$
The terms $I$ and $II$ can be bounded by the same arguments as the terms $I$ and $II$ in the proof of Theorem 4.11, by noting that $\bigl| Z^{-1} - (Z^{\mathrm{sample}}_{N,\Phi})^{-1} \bigr|^2 \leq \max\bigl( Z^{-4}, (Z^{\mathrm{sample}}_{N,\Phi})^{-4} \bigr) \bigl| Z - Z^{\mathrm{sample}}_{N,\Phi} \bigr|^2$.
5 Numerical Examples

We consider the model inverse problem of determining the diffusion coefficient of a divergence form elliptic partial differential equation (PDE) from observation of a finite set of noisy continuous functionals of the solution. This type of equation arises, for example, in the modelling of groundwater flow in a porous medium. We consider the one-dimensional model problem
$$ -\frac{d}{dx}\Bigl( \kappa(x; u)\, \frac{dp}{dx}(x; u) \Bigr) = 1 \quad \text{in } (0, 1), \qquad p(0; u) = p(1; u) = 0, \tag{5.1} $$
where the coefficient $\kappa$ depends on the parameters $u = \{u_j\}_{j=1}^K \in [-1, 1]^K$ through the linear expansion
$$ \kappa(x; u) = \frac{1}{100} + \sum_{j=1}^K \frac{u_j}{200(K+1)} \sin(2\pi j x). $$
As prior measure $\mu_0$ on $[-1, 1]^K$, we use the uniform product measure $\mu_0(du) = \bigotimes_{j=1}^K \frac{du_j}{2}$. The observations $y$ are given by noisy point evaluations of the solution, $y_j = p(x_j; u^*) + \eta_j$, with $\eta \sim N(0, I)$ and $\{x_j\}_{j=1}^J$ evenly spaced points in $(0, 1)$. The truth $u^*$ was chosen as a random sample from the prior, and the solution $p$ was approximated by standard, piecewise linear, continuous finite elements on a uniform grid with mesh size $h = 1/1024$. In this setting, the forward map $G : [-1, 1]^K \to H_0^1(D)$, defined by $G(u) = p$, is an analytic function [9]. Since the observation operator $\mathcal{O}$ is linear and bounded, Assumption 2.2 is hence satisfied for any $s > K/2$.

The emulators $\mathcal{G}_N$ and $\Phi_N$ are computed as described in section 3.2, with mean and covariance kernel given by (3.6). In the Gaussian process prior (3.1), we choose $m \equiv 0$ and $k = k_{\nu,1,1}$, a Matérn kernel with variance $\sigma_k^2 = 1$, correlation length $\lambda = 1$ and smoothness parameter $\nu$.
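For concreteness, a minimal Python sketch of the forward solve is given below. It is our own illustration, not the authors' MATLAB code; the placement of the $J$ observation points at $x_j = j/(J+1)$ and the single-midpoint quadrature used for $\kappa$ on each element are our assumptions.

```python
import numpy as np

def solve_forward(u, n_cells=1024, J=1):
    """Piecewise linear finite element approximation of (5.1),
    -(kappa(x;u) p'(x;u))' = 1 on (0,1), p(0;u) = p(1;u) = 0,
    with kappa(x;u) = 1/100 + sum_j u_j/(200(K+1)) sin(2*pi*j*x),
    followed by evaluation of p at J evenly spaced points in (0,1)."""
    u = np.asarray(u, dtype=float)
    K = len(u)
    h = 1.0 / n_cells
    x_mid = (np.arange(n_cells) + 0.5) * h                        # one quadrature point per element
    j_idx = np.arange(1, K + 1)
    kappa = 1.0 / 100 + np.sin(2 * np.pi * np.outer(x_mid, j_idx)) @ (u / (200 * (K + 1)))
    # tridiagonal stiffness matrix on the interior nodes x_1, ..., x_{n-1}
    main = (kappa[:-1] + kappa[1:]) / h                           # node i touches elements i-1 and i
    off = -kappa[1:-1] / h                                        # coupling between nodes i and i+1
    A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    b = np.full(n_cells - 1, h)                                   # load vector for f = 1
    p_int = np.linalg.solve(A, b)
    nodes = np.linspace(0.0, 1.0, n_cells + 1)
    p_full = np.concatenate(([0.0], p_int, [0.0]))                # homogeneous Dirichlet data
    x_obs = np.linspace(0.0, 1.0, J + 2)[1:-1]                    # J evenly spaced points in (0,1)
    return np.interp(x_obs, nodes, p_full)

# Example: the (noise-free) observations for a random draw of u with K = 3 and J = 15
y_clean = solve_forward(np.random.uniform(-1, 1, size=3), n_cells=32, J=15)
```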
For a given approximation $\mu^{y,N}$ to $\mu^y$, we will compute twice the Hellinger distance squared,
$$ 2\, d_{\mathrm{Hell}}(\mu^y, \mu^{y,N})^2 = \int_{[-1,1]^K} \Biggl( \sqrt{\frac{d\mu^y}{d\mu_0}(u)} - \sqrt{\frac{d\mu^{y,N}}{d\mu_0}(u)} \Biggr)^2 d\mu_0(u). $$
The integral over $[-1, 1]^K$ is approximated by a randomly shifted lattice rule with product weight parameters $\gamma_j = 1/j^2$ [27]. The generating vector for the rule used is available from Frances Kuo's website (http://web.maths.unsw.edu.au/~fkuo/) as "lattice-39102-1024-1048576.3600". For the marginal and random approximations, the expected value over the Gaussian process is approximated by Monte Carlo sampling, using the MATLAB command mvnrnd. The solution $p$ to (5.1) is approximated by standard, piecewise linear, continuous finite elements on a uniform grid with mesh size $h = 1/32$.

For the design points $U$, we choose a uniform tensor grid. In $[-1, 1]^K$, the uniform tensor grid consisting of $N = N_*^K$ points, for some $N_* \in \mathbb{N}$, has fill distance $h_U = \sqrt{K}(N_* - 1)^{-1}$. In Table 1, we give the convergence rates in $N$ for $\sup_{u \in X} |f(u) - m_N^f(u)|^2$ and $\|f - m_N^f\|^2_{L^2(X)}$ predicted by Proposition 3.3.

           sup_{u in X} |f(u) - m_N^f(u)|^2        ||f - m_N^f||^2_{L^2(X)}
    K        1      2      3      4                 1      2      3      4
  nu = 1     2      1      0.67   0.5               3      2      1.7    1.5
  nu = 5     10     5      3.3    2.5               11     6      4.3    3.5

Table 1: Convergence rates in $N$ predicted by Proposition 3.3 for uniform tensor grids.
5.1 Mean-based approximations

In Figure 1, we show $2 d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{mean}})^2$ (left) and $2 d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{mean}})^2$ (right), for a variety of choices of $K$ and $\nu$, for $J = 1$. For each choice of the parameters $K$ and $\nu$, we have as a dotted line added the least squares fit of the form $C_1 N^{-C_2}$, for some $C_1, C_2 > 0$, and indicated the rate $N^{-C_2}$ in the legend. By Corollary 4.3, we expect to see the faster convergence rates in the right panel of Table 1. For convenience, we have added these rates in parentheses in the legends in Figure 1. For $\mu^{y,N,\mathcal{G}}_{\mathrm{mean}}$, we observe the rates in Table 1, or slightly faster. For $\mu^{y,N,\Phi}_{\mathrm{mean}}$, we observe rates slightly faster than predicted for $\nu = 1$, and slightly slower than predicted for $\nu = 5$. Finally, we remark that though the convergence rates of the error are slightly slower for $\mu^{y,N,\Phi}_{\mathrm{mean}}$, the actual errors are smaller for $\mu^{y,N,\Phi}_{\mathrm{mean}}$.

In Figure 2, we again show $2 d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{mean}})^2$ (left) and $2 d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{mean}})^2$ (right), for a variety of choices of $K$, with $J = 15$ and $\nu = 1$. We again observe convergence rates slightly faster than the rates predicted in the right panel of Table 1. As in Figure 1, we observe that the errors in $\mu^{y,N,\Phi}_{\mathrm{mean}}$ are smaller, though the rates of convergence are slightly faster for $\mu^{y,N,\mathcal{G}}_{\mathrm{mean}}$.
5.2 Marginal approximations

In Figure 3, we show $2 d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{marginal}})^2$ (left) and $2 d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{marginal}})^2$ (right), for a variety of choices of $K$ and $\nu$, for $J = 1$. For each choice of the parameters $K$ and $\nu$, we have again added the least squares fit of the form $C_1 N^{-C_2}$, and indicated the rate $C_2$ in the legend. By Corollary 4.10, we expect the error to be the sum of two contributions, one of which decays at the rate indicated in the left panel of Table 1, and another which decays at the rate indicated in the right panel of Table 1. For convenience, we have added these rates in parentheses in the legends in Figure 3. For $\mu^{y,N,\mathcal{G}}_{\mathrm{marginal}}$, we observe the faster convergence rates in the right panel of Table 1, although a closer inspection indicates that the convergence is slowing down as $N$ increases. For $\mu^{y,N,\Phi}_{\mathrm{marginal}}$, the observed rates are somewhere between the two rates predicted by Table 1.
5.3 Random approximations

In Figure 4, we show $2 \mathbb{E}^{\nu_N^{\mathcal{G}}}\bigl( d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{sample}})^2 \bigr)$ (left) and $2 \mathbb{E}^{\nu_N^{\Phi}}\bigl( d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{sample}})^2 \bigr)$ (right), for a variety of choices of $K$ and $\nu$, for $J = 1$. For each choice of the parameters $K$ and $\nu$, we have again added the least squares fit of the form $C_1 N^{-C_2}$, and indicated the rate $C_2$ in the legend. By Corollary 4.12, we expect the error to be the sum of two contributions, as for the marginal approximations considered in the previous section, and the corresponding rates from Table 1 have been added in parentheses in the legends. For $\mu^{y,N,\mathcal{G}}_{\mathrm{sample}}$, we again observe the faster convergence rates in the right panel of Table 1, but the convergence again seems to be slowing down as $N$ increases. For $\mu^{y,N,\Phi}_{\mathrm{sample}}$, the observed rates are very close to the slower rates in the left panel of Table 1.
6 Conclusions and further work

Gaussian process emulators are frequently used as surrogate models. In this work, we analysed the error that is introduced in the Bayesian posterior distribution when a Gaussian process emulator is used to approximate the forward model, either in terms of the parameter-to-observation map or the negative log-likelihood. We showed that the error in the posterior distribution, measured in the Hellinger distance, can be bounded in terms of the error in the emulator, measured in a norm dependent on the approximation considered.

An issue that requires further consideration is the efficient emulation of vector-valued functions. A simple solution, employed in this work, is to emulate each entry independently. In many applications, however, it is natural to assume that the entries are correlated, and a better emulator could be constructed by including this correlation in the emulator. Furthermore, there are still many open questions about how to do this optimally [5]. Another interesting question is the optimal choice of the hyper-parameters $\lambda$ and $\sigma_k^2$ in the Gaussian process emulator. Finally, the question of scaling the Gaussian process methodology to both high-dimensional input and output spaces remains open.
References

[1] R. Adler, The Geometry of Random Fields, John Wiley, 1981.

[2] N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society, 68 (1950), pp. 337–404.
[Figure 1 (plots omitted): log-log plots of twice the Hellinger distance squared against N. Least-squares fitted rates (predicted rates from Table 1 in parentheses): left panel, K=2, ν=1: N^{-2.6} (2); K=2, ν=5: N^{-6.2} (6); K=3, ν=1: N^{-2.4} (1.7); K=3, ν=5: N^{-4.5} (4.3); right panel, K=2, ν=1: N^{-2.5} (2); K=2, ν=5: N^{-5.4} (6); K=3, ν=1: N^{-2.0} (1.7); K=3, ν=5: N^{-3.8} (4.3).]

Figure 1: $2 d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{mean}})^2$ (left) and $2 d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{mean}})^2$ (right), for a variety of choices of $K$ and $\nu$, for $J = 1$.
Figure 2: $2\,d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{mean}})^2$ (left) and $2\,d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{mean}})^2$ (right), for a variety of choices of K and ν = 1, for J = 15. [Log-log plots of twice the squared Hellinger distance against N; least squares rate fits are indicated in the legends.]
Figure 3: $2\,d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{marginal}})^2$ (left) and $2\,d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{marginal}})^2$ (right), for a variety of choices of K and ν, for J = 1. [Log-log plots of twice the squared Hellinger distance against N; least squares rate fits are indicated in the legends.]
Figure 4: $2\,\mathbb{E}^{\nu^{\mathcal{G}}_N}\!\big[d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\mathcal{G}}_{\mathrm{sample}})^2\big]$ (left) and $2\,\mathbb{E}^{\nu^{\Phi}_N}\!\big[d_{\mathrm{Hell}}(\mu^y, \mu^{y,N,\Phi}_{\mathrm{sample}})^2\big]$ (right), for a variety of choices of K and ν, for J = 1. [Log-log plots of the expected value of twice the squared Hellinger distance against N; least squares rate fits are indicated in the legends.]
[3] S. Arridge, J. Kaipio, V. Kolehmainen, M. Schweiger, E. Somersalo, T. Tarvainen, and M. Vauhkonen, Approximation Errors and Model Reduction with an Application in Optical Diffusion Tomography, Inverse Problems, 22 (2006), p. 175.

[4] I. Babuška, F. Nobile, and R. Tempone, A stochastic collocation method for elliptic partial differential equations with random input data, SIAM Review, 52 (2010), pp. 317–355.

[5] I. Bilionis, N. Zabaras, B. A. Konomi, and G. Lin, Multi-output separable Gaussian process: Towards an efficient, fully Bayesian paradigm for uncertainty quantification, Journal of Computational Physics, 241 (2013), pp. 212–239.

[6] N. Bliznyuk, D. Ruppert, C. Shoemaker, R. Regis, S. Wild, and P. Mugunthan, Bayesian calibration and uncertainty analysis for computationally expensive models using optimization and radial basis function approximation, Journal of Computational and Graphical Statistics, 17 (2008).

[7] V. I. Bogachev, Gaussian Measures, vol. 62 of Mathematical Surveys and Monographs, American Mathematical Society, 1998.

[8] T. Bui-Thanh, K. Willcox, and O. Ghattas, Model reduction for large-scale systems with high-dimensional parametric input space, SIAM Journal on Scientific Computing, 30 (2008), pp. 3270–3288.

[9] A. Cohen, R. DeVore, and C. Schwab, Analytic regularity and polynomial approximation of parametric and stochastic elliptic PDEs, Analysis and Applications, 9 (2011), pp. 11–47.

[10] P. R. Conrad, Y. M. Marzouk, N. S. Pillai, and A. Smith, Asymptotically exact MCMC algorithms via local approximations of computationally intensive models, arXiv preprint arXiv:1402.1694, (2014).

[11] P. G. Constantine, Active Subspaces: Emerging Ideas for Dimension Reduction in Parameter Studies, vol. 2, SIAM, 2015.

[12] P. G. Constantine, E. Dow, and Q. Wang, Active subspace methods in theory and practice: Applications to kriging surfaces, SIAM Journal on Scientific Computing, 36 (2014), pp. A1500–A1524.

[13] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White, MCMC methods for functions: modifying old algorithms to make them faster, Statistical Science, 28 (2013), pp. 424–446.

[14] M. Dashti and A. M. Stuart, The Bayesian approach to inverse problems, in Handbook of Uncertainty Quantification, R. Ghanem, D. Higdon, and H. Owhadi, eds., Springer.

[15] M. Girolami and B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73 (2011), pp. 123–214.

[16] M. Hansen and C. Schwab, Sparse adaptive approximation of high dimensional parametric initial value problems, Vietnam Journal of Mathematics, 41 (2013), pp. 181–215.
[17] W. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, 57 (1970), pp. 97–109.

[18] J. P. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems, vol. 160, Springer, 2005.

[19] M. C. Kennedy and A. O'Hagan, Bayesian calibration of computer models, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63 (2001), pp. 425–464.

[20] Y. Marzouk and D. Xiu, A Stochastic Collocation Approach to Bayesian Inference in Inverse Problems, Communications in Computational Physics, 6 (2009), pp. 826–847.

[21] Y. M. Marzouk, H. N. Najm, and L. A. Rahn, Stochastic Spectral Methods for Efficient Bayesian Solution of Inverse Problems, Journal of Computational Physics, 224 (2007), pp. 560–586.

[22] B. Matérn, Spatial Variation, vol. 36, Springer Science & Business Media, 2013.

[23] J. Mercer, Functions of positive and negative type, and their connection with the theory of integral equations, Philosophical Transactions of the Royal Society of London, Series A, 209 (1909), pp. 415–446.

[24] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, Equation of state calculations by fast computing machines, The Journal of Chemical Physics, 21 (1953), p. 1087.

[25] F. Narcowich, J. Ward, and H. Wendland, Sobolev bounds on functions with scattered zeros, with applications to radial basis function surface fitting, Mathematics of Computation, 74 (2005), pp. 743–763.

[26] F. J. Narcowich, J. D. Ward, and H. Wendland, Sobolev error estimates and a Bernstein inequality for scattered data interpolation via radial basis functions, Constructive Approximation, 24 (2006), pp. 175–186.

[27] H. Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods, SIAM, 1994.

[28] A. O'Hagan, Bayesian analysis of computer code outputs: a tutorial, Reliability Engineering & System Safety, 91 (2006), pp. 1290–1300.

[29] G. Da Prato and J. Zabczyk, Stochastic Equations in Infinite Dimensions, vol. 44 of Encyclopedia Math. Appl., Cambridge University Press, Cambridge, 1992.

[30] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.

[31] P. Rebeschini and R. Van Handel, Can local particle filters beat the curse of dimensionality?, The Annals of Applied Probability, 25 (2015), pp. 2809–2866.

[32] C. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, 1999.

[33] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn, Design and analysis of computer experiments, Statistical Science, (1989), pp. 409–423.
[34] M. Scheuerer, R. Schaback, and M. Schlather, Interpolation of spatial data – a stochastic or a deterministic problem?, European Journal of Applied Mathematics, 24 (2013), pp. 601–629.

[35] C. Schillings and C. Schwab, Sparsity in Bayesian inversion of parametric operator equations, Inverse Problems, 30 (2014), p. 065007.

[36] M. L. Stein, Interpolation of Spatial Data: Some Theory for Kriging, Springer, 1999.

[37] A. M. Stuart, Inverse problems, vol. 19 of Acta Numerica, Cambridge University Press, 2010, pp. 451–559.

[38] W. Walter, Ordinary Differential Equations, vol. 182 of Graduate Texts in Mathematics, Springer, 1998.

[39] H. Wendland, Scattered Data Approximation, vol. 17, Cambridge University Press, 2004.

[40] D. Xiu and G. E. Karniadakis, Modeling uncertainty in flow simulations via generalized polynomial chaos, Journal of Computational Physics, 187 (2003), pp. 137–167.