CWP-679
A projected Hessian for full waveform inversion
Yong Ma & Dave Hale
Center for Wave Phenomena, Colorado School of Mines, Golden, CO 80401, USA
Figure 1. Update directions for one iteration of the conjugate gradient method (a), the image-guided conjugate gradient method (b), and a quasi-Newton method with application of the inverse projected Hessian (c).
ABSTRACT
A Hessian matrix in full waveform inversion (FWI) is difficult to compute directly because of high computational cost and an especially large memory requirement. Therefore, Newton-like methods are rarely feasible in realistic large-size FWI problems. We modify the BFGS method to use a projected Hessian matrix that reduces both the computational cost and memory required, thereby making a quasi-Newton solution to FWI feasible.

Key words: projected Hessian matrix, BFGS method
1 INTRODUCTION
Full waveform inversion (FWI) (Tarantola, 1984; Pratt, 1999) is usually formulated as an optimization problem (Symes, 2008), in which we consider the minimization of a nonlinear objective function E : ℝ^N → ℝ,
\[
\min_{m \in \mathbb{R}^N} E(m), \tag{1}
\]
where the function variable m denotes N parameters, such as seismic velocities, for an earth model. In FWI, the objective function often takes a least-squares form: E(m) = (1/2)‖d − F(m)‖², where ‖·‖ denotes an L2 norm, d denotes the recorded data, and F is a forward data-synthesizing operator, a nonlinear function of the model m. Various iterative methods (Nocedal and Wright, 2000; Tarantola, 2005) for solving this least-squares optimization problem include the conjugate gradient
method, Newton's method, the Gauss-Newton method, and various quasi-Newton methods. Compared to the conjugate gradient method, Newton's method and other Newton-like methods generally converge faster, in fewer iterations. However, Newton-like methods are rarely used to solve realistic seismic full waveform problems, because for such large problems they are costly. Newton-like methods, especially Newton's method, are suitable only for solving small- or medium-size optimization problems (Pratt et al., 1998; Sheen et al., 2006; Virieux and Operto, 2009), in which the number N of model parameters ranges from hundreds to thousands. Newton's method must compute the 1st- and 2nd-order derivatives of the objective function E(m) with respect to model parameters in m. We refer to these two derivatives as the gradient g(m) ≡ ∂E/∂m and the Hessian H(m) ≡ ∂²E/∂m², respectively. The Hessian matrix, in particular, consists of a large number O(N²) of 2nd
derivatives. To compute each 2nd derivative in the most straightforward way, we must solve one forward problem F(m). Therefore, the Hessian matrix computed in this way requires the solution of O(N²) forward problems. Pratt et al. (1998) show a more efficient method that reduces the number of required forward solutions from O(N²) to O(N), which is still too costly in realistic large-size FWI problems. Moreover, the Hessian matrix consumes large amounts of memory. In a model with N parameters, a Hessian matrix stored in single precision requires 4N² bytes of memory.

Because of the high cost of Newton's method for large problems, various methodologies have been proposed to approximate the Hessian matrix. The Gauss-Newton method, for example, ignores 2nd-order terms that account for nonlinearity in the function F(m). This approximation is valid only when the forward operator F is a linear or quasi-linear function of m. If the initial model in FWI is far from the true model and nonlinearity is significant, the Gauss-Newton method may not converge. Quasi-Newton methods do not compute the Hessian matrix explicitly, but instead iteratively update Hessian approximations. The BFGS method (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) is an efficient way to update the Hessian matrix. However, although the BFGS method reduces the computation time required to approximate a Hessian matrix, it does not reduce the computation time required to use the Hessian matrix to update model parameters, nor does it decrease the amount of memory required to store Hessian approximations. In practical applications of quasi-Newton methods for FWI, we must reduce both computation time and memory consumption significantly.

In this paper, we first review various Newton-like methods. We then pose FWI in a sparse model space and introduce a projected Hessian matrix in that space. We obtain the projected Hessian by using a projected BFGS method, a version of BFGS in the sparse model space. A projected Hessian matrix built in this way significantly reduces both computation time and memory consumption. Tests of our projected Hessian on the Marmousi II model suggest that quasi-Newton methods like ours may be promising in practical applications of FWI.
2 NEWTON-LIKE METHODS
We consider only iterative methods for solution of the nonlinear FWI optimization problem. In the ith iteration of such methods, we update the model m in the direction of a vector p_i:
\[
m_{i+1} = m_i + \alpha_i p_i, \tag{2}
\]
for some step length α_i. Newton's method, the Gauss-Newton method, and quasi-Newton methods differ in the way in which they compute and use the update vector p_i.

2.1 Newton's method
In Newton's method, the update vector p_i is determined by first ignoring the higher-order (> 2) terms in a Taylor series approximation of the objective function E(m):
\[
E(m_i + \delta m_i) \approx E(m_i) + \delta m_i^T g_i + \frac{1}{2}\,\delta m_i^T H_i\,\delta m_i, \tag{3}
\]
where E(m_i) denotes the objective function evaluated at m_i, g_i = g(m_i), and H_i = H(m_i). For models m near the current model estimate m_i, equation 3 approximates the objective function with a quadratic function. We then minimize this approximated E(m) by solving a set of linear equations:
\[
H_i\,\delta m_i = -g_i, \tag{4}
\]
with solution
\[
\delta m_i = -H_i^{-1} g_i. \tag{5}
\]
Therefore, the update direction in Newton's method is simply
\[
p_i = -H_i^{-1} g_i, \tag{6}
\]
and the step length α_i = 1. However, for models with large numbers of parameters, high costs for computing the Hessian matrix H_i prevent the application of Newton's method. To analyze the cost, let us explicitly write each element of H_i as
\[
H_{jk} = \frac{\partial^2 E}{\partial m_j\,\partial m_k}, \tag{7}
\]
where j and k are parameter indices in the model m, ranging from 1 to the total number N of model parameters. If, say, N = 500,000 in a 2D FWI problem, the Hessian H_i is a 500,000 × 500,000 matrix. To compute this Hessian matrix in the most straightforward way, we must solve 2.5 × 10^11 forward problems F(m) in each iteration of FWI. Even using the more efficient way in Pratt et al. (1998), we can only reduce this number to 500,000 per iteration. For FWI, which normally requires multiple iterations, this cost is prohibitive. Furthermore, the memory required to store this Hessian matrix in single precision is 1 terabyte, or about 1/2 terabyte considering symmetry in the Hessian.

2.2 Gauss-Newton method
For the least-squares objective function, each Hessian element in equation 7 expands into a term J_j^T J_k plus a 2nd-order term that involves derivatives ∂²F/(∂m_j ∂m_k) of the forward operator. Discarding this 2nd-order term, we get an approximated Hessian H_a, with elements
\[
H_{jk} \approx J_j^T J_k, \tag{8}
\]
where J_j and J_k are the jth and kth columns of the Jacobian matrix {∂F(m)/∂m_1, ∂F(m)/∂m_2, ..., ∂F(m)/∂m_N}, respectively. Using this approximation we obtain the Gauss-Newton method with update direction
\[
p_i = -H_a^{-1} g_i. \tag{9}
\]
Equation 8 shows that the cost of computing H_a is equivalent to that of computing the gradient g_i; Pratt et al. (1998) show an efficient approach for calculating g_i in which only 2 forward data syntheses F(m) are necessary. However, the approximation H_a in effect assumes that the function F(m_i + δm_i) is linear with respect to the difference δm_i between the true model m and the current model m_i. In practice this assumption is seldom realistic.
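To make this concrete, here is a minimal NumPy sketch of a Gauss-Newton update direction under the simplifying assumption that a Jacobian matrix and a data residual are available explicitly; in FWI the Jacobian is never formed, so the matrix, residual, and sizes below are illustrative stand-ins only.

```python
import numpy as np

def gauss_newton_direction(J, r, damping=1e-8):
    """Update direction p = -Ha^{-1} g, with Ha = J^T J (equation 8)
    and gradient g = -J^T r for the misfit E = 1/2 ||d - F(m)||^2.
    A tiny damping term guards against a rank-deficient J^T J."""
    Ha = J.T @ J + damping * np.eye(J.shape[1])
    g = -J.T @ r                     # gradient of the least-squares misfit
    return np.linalg.solve(Ha, -g)   # equivalent to (J^T J)^{-1} J^T r

# Illustrative sizes only (50 data samples, 10 model parameters).
rng = np.random.default_rng(0)
J = rng.standard_normal((50, 10))    # stand-in Jacobian dF/dm
r = rng.standard_normal(50)          # stand-in residual d - F(m)
p = gauss_newton_direction(J, r)
```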
2.3 Quasi-Newton method
A quasi-Newton method iteratively approximates the Hessian matrix. The BFGS method (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) is currently considered the most popular (Nocedal and Wright, 2000) quasi-Newton update formula. The basic idea of BFGS is to update the Hessian using changes in both the model m_i and the gradient g_i from one iteration to the next:
\[
H_{i+1} = H_i + \frac{y_i y_i^T}{y_i^T \delta m_i} - \frac{H_i\,\delta m_i\,(H_i\,\delta m_i)^T}{\delta m_i^T H_i\,\delta m_i}, \tag{10}
\]
where y_i = g_{i+1} − g_i and δm_i = m_{i+1} − m_i. The BFGS method normally begins with an identity matrix H_0 = I in the first iteration, so that the first iteration of BFGS is similar to a gradient-descent method. Pseudo-code for implementing the BFGS method is as follows (Nocedal and Wright, 2000):

given m_0, g_0 = ∂E/∂m evaluated at m_0, and H_0 = I
for i = 0, 1, 2, ..., until convergence do
    p_i = -H_i^{-1} g_i
    search for α_i, δm_i = α_i p_i
    m_{i+1} = m_i + δm_i
    g_{i+1} = ∇E(m_{i+1})
    y_i = g_{i+1} - g_i
    H_{i+1} = H_i + y_i y_i^T / (y_i^T δm_i) - H_i δm_i (H_i δm_i)^T / (δm_i^T H_i δm_i)
end for
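As a sanity check of this pseudo-code, the following self-contained NumPy sketch runs the same BFGS iteration on a small quadratic test function; the quadratic, its gradient, and the fixed unit step length are illustrative stand-ins for the FWI objective and a proper line search.

```python
import numpy as np

def bfgs(grad, m0, niter=10):
    """BFGS iteration following the pseudo-code above, with a fixed
    unit step length in place of a line search."""
    m = m0.copy()
    g = grad(m)
    H = np.eye(m.size)                      # H_0 = I
    for _ in range(niter):
        if np.linalg.norm(g) < 1e-10:       # converged
            break
        p = -np.linalg.solve(H, g)          # p_i = -H_i^{-1} g_i
        dm = 1.0 * p                        # alpha_i = 1 (placeholder)
        m_new = m + dm
        g_new = grad(m_new)
        y = g_new - g
        # rank-two update of equation 10
        H = (H + np.outer(y, y) / (y @ dm)
               - np.outer(H @ dm, H @ dm) / (dm @ H @ dm))
        m, g = m_new, g_new
    return m

# Quadratic test: E(m) = 1/2 m^T A m - b^T m, so grad E = A m - b.
A = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, 1.0])
m_est = bfgs(lambda m: A @ m - b, np.zeros(2))
# m_est approaches the minimizer A^{-1} b as iterations increase.
```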
The BFGS method is an efficient way to compute Hessian approximations. Like the Gauss-Newton method, BFGS requires computation of only gradients g_i, with the cost of only 2 forward modelings F(m). However, unlike the Gauss-Newton method, BFGS does not simply ignore 2nd-derivative terms in the Hessian. Instead, the BFGS method uses differences in models and gradients between iterations to construct a complete Hessian approximation.
Unfortunately, the BFGS method does not reduce the O(N³) cost of solving the set of linear equations (equation 4) for the update vector p_i, nor does it reduce the O(N²) memory required to store the Hessian approximation.
3 PROJECTED HESSIAN MATRIX
The only way to reduce these costs is to reduce the number N of model parameters. For this purpose, we introduce a projected Hessian matrix in a sparse model space.

3.1 Sparse optimization
In an approach similar to that used in subspace methods (Kennett et al., 1988; Oldenburg et al., 1993) and alternative parameterizations (Pratt et al., 1998, Appendix A), we construct a finely- and uniformly-sampled (dense) model m from a sparse model s that contains a much smaller number n of parameters. The dense and sparse models are related by an interpolation operator R, so that m = Rs and δm_i = R δs_i, and the optimization is carried out with respect to the sparse model s.

3.2 Projected BFGS method

Using the simple relationship δm_i = R δs_i, we can rewrite the BFGS update (equation 10) as
\[
R^T H_{i+1} R = R^T H_i R + \frac{R^T y_i y_i^T R}{y_i^T \delta m_i} - \frac{R^T H_i\,\delta m_i\,(R^T H_i\,\delta m_i)^T}{\delta m_i^T H_i\,\delta m_i}. \tag{18}
\]
Now simplify equation 18 by defining H̃_i ≡ R^T H_i R, g̃_i ≡ R^T g_i, and ỹ_i ≡ R^T y_i = g̃_{i+1} − g̃_i to obtain an update formula for the projected Hessian:
\[
\tilde{H}_{i+1} = \tilde{H}_i + \frac{\tilde{y}_i \tilde{y}_i^T}{\tilde{y}_i^T \delta s_i} - \frac{\tilde{H}_i\,\delta s_i\,(\tilde{H}_i\,\delta s_i)^T}{\delta s_i^T \tilde{H}_i\,\delta s_i}. \tag{19}
\]
With δs_i = α_i p̃_i and H̃_i p̃_i = −g̃_i, equation 19 can be written in the form used in the algorithm below:
\[
\tilde{H}_{i+1} = \tilde{H}_i + \frac{\tilde{y}_i \tilde{y}_i^T}{\alpha_i\,\tilde{y}_i^T \tilde{p}_i} + \frac{\tilde{g}_i \tilde{g}_i^T}{\tilde{g}_i^T \tilde{p}_i}. \tag{20}
\]
Pseudo-code for the projected BFGS method is as follows:

given s_0, g̃_0 = R^T g_0, and H̃_0 = I
for i = 0, 1, 2, ..., until convergence do
    p̃_i = -H̃_i^{-1} g̃_i
    search for α_i, δs_i = α_i p̃_i
    m_{i+1} = m_i + R δs_i
    g_{i+1} = ∇E(m_{i+1}), g̃_{i+1} = R^T g_{i+1}
    ỹ_i = g̃_{i+1} - g̃_i
    H̃_{i+1} = H̃_i + ỹ_i ỹ_i^T / (α_i ỹ_i^T p̃_i) + g̃_i g̃_i^T / (g̃_i^T p̃_i)   (equation 20)
end for

3.3 Wolfe conditions

The projected Hessian H̃_{i+1} in equation 19 is symmetric positive definite (SPD) if ỹ_i^T δs_i > 0. This SPD property is important for an optimization problem, because it guarantees that a quasi-Newton method that uses this projected Hessian converges to a (possibly local) minimum. This SPD property can be achieved if the step length α_i satisfies the sufficient descent condition
\[
E(s_i + \alpha_i \tilde{p}_i) \equiv E(\alpha_i) \le E(0) + c_1 E'(0)\,\alpha_i, \tag{21}
\]
and the curvature condition
\[
E'(\alpha_i) \ge c_2 E'(0), \tag{22}
\]
where E'(0) = g̃_i^T p̃_i and E'(α_i) = g̃_{i+1}^T p̃_i are directional derivatives. In equations 21 and 22, c_1 and c_2 are constants ∈ (0, 1); suggested values for c_1 and c_2 are 0.0001 and 0.9 (Nocedal and Wright, 2000), respectively. Together, equations 21 and 22 are referred to as the Wolfe conditions (Nocedal and Wright, 2000). If the curvature condition in equation 22 is replaced by
\[
|E'(\alpha_i)| \le c_2\,|E'(0)|, \tag{23}
\]
then the Wolfe conditions become the strong Wolfe conditions used in this study.
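For illustration, a minimal NumPy sketch of the rank-two update in equation 20, the resulting dense-space search direction, and a strong-Wolfe check is given below. The interpolation operator R appears as an explicit N × n matrix and the step length is assumed to come from a line search satisfying the strong Wolfe conditions; both are simplifying assumptions, since in practice R is applied as an operator (image-guided interpolation) rather than stored as a matrix.

```python
import numpy as np

def projected_bfgs_update(H_t, g_t, g_t_new, p_t, alpha):
    """Rank-two update of the projected Hessian (equation 20).
    All vectors and the n x n matrix H_t live in the sparse model space:
      g_t, g_t_new : projected gradients R^T g_i and R^T g_{i+1}
      p_t          : search direction -H_t^{-1} g_t
      alpha        : step length from a strong-Wolfe line search
    """
    y_t = g_t_new - g_t
    return (H_t + np.outer(y_t, y_t) / (alpha * (y_t @ p_t))
                + np.outer(g_t, g_t) / (g_t @ p_t))

def dense_update_direction(H_t, g, R):
    """Dense-space direction p_i = -R H_t^{-1} R^T g_i used to update m."""
    return -R @ np.linalg.solve(H_t, R.T @ g)

def strong_wolfe_satisfied(E0, dE0, E_a, dE_a, alpha, c1=1e-4, c2=0.9):
    """Check equations 21 and 23 for a trial step:
    E0, dE0 are E(0) and E'(0); E_a, dE_a are E(alpha) and E'(alpha)."""
    return (E_a <= E0 + c1 * alpha * dE0) and (abs(dE_a) <= c2 * abs(dE0))
```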
Figure 3. The Marmousi II velocity model (a) and an initial velocity model (b).

Figure 4. Structurally constrained sample selections. A total of 165 samples are chosen for our projected BFGS method.
3.4 Cost of projected BFGS
In each iteration of the projected BFGS method, we perform at least one forward calculation F(m) and compute at least one gradient g, which itself costs two forward calculations. Therefore, in the ideal case, the cost of each iteration of the projected BFGS method is that of three forward calculations F(m). Furthermore, if the number of parameters n in the sparse model space s is much smaller than the number N in the dense model m, the amount of memory required to store the projected Hessian is significantly reduced, as shown in Figure 2.
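As a rough, illustrative calculation of the memory savings, using the dense and sparse model sizes of the Marmousi II example in the next section:

```python
# Single-precision storage: 4*N*N bytes for a dense Hessian versus
# 4*n*n bytes for the projected Hessian (sizes from the example below).
N = 391 * 1151                 # dense model parameters (450,041)
n = 165                        # sparse model parameters
dense_gb = 4 * N**2 / 1e9      # ~810 GB
sparse_kb = 4 * n**2 / 1e3     # ~109 KB
print(dense_gb, sparse_kb)
```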
4 EXAMPLES
We test our projected BFGS method on the Marmousi II model. Figure 3a shows the true model m, and Figure 3b shows the initial model m0 , a highly smoothed version of the true model. We use 11 shots uniformly distributed on the surface, and a 15Hz Ricker wavelet as the source for simulating wavefields. The source and receiver intervals are 0.76 km and 0.024 km, respectively. In this example, the dense model space has N = 391 × 1151 parameters. Therefore, computation or storage of the true Hessian matrix for this example is infeasible.
4.1 Projected Hessian and its inverse
Our projected BFGS method poses the optimization problem in a sparse model space s. As described by Ma et al. (2011), we construct the sparse model s using a structurally constrained sample selection scheme. This selection method considers structures apparent in migrated images, and places samples along structural features. Figure 4 shows a total of 165 scattered sample locations. This set of scattered sample locations is representative, because (1) they lie between (not on) reflectors; (2) they are distributed along structural features; (3) there are more samples in structurally complex areas and fewer samples in simple areas. The scattered locations, together with corresponding values at these locations, comprise a sparse model space s that is used in the projected BFGS method. We employ image-guided interpolation (IGI) (Hale, 2009) as the operator R and the adjoint of image-guided interpolation (Ma et al., 2010) as the operator R^T. IGI interpolates values in the sparse model s to a uniformly sampled dense model m, and the interpolant makes good geological sense because it accounts for structures in the model.

Figure 5a shows the initial projected Hessian H̃_0, an identity matrix. Figure 5b shows the updated projected Hessian H̃_1 after one iteration of the projected BFGS method. The update H̃_1 − H̃_0 to the Hessian in this first iteration is shown separately in Figure 5c. As we can see, the projected BFGS method adds significant off-diagonal components to the initial Hessian H̃_0. Because our line search satisfies the strong Wolfe conditions, the inverse of the projected Hessian matrix exists. Figure 5d shows the inverse Hessian H̃_1^{-1}.
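Because R^T must act as the adjoint of R for the projected gradient and Hessian to be consistent, a useful sanity check is the dot-product test sketched below; the explicit matrix standing in for image-guided interpolation and its adjoint is purely illustrative.

```python
import numpy as np

def dot_product_test(apply_R, apply_Rt, n, N, seed=0):
    """Adjoint (dot-product) test: <R s, m> should equal <s, R^T m>."""
    rng = np.random.default_rng(seed)
    s = rng.standard_normal(n)     # random sparse-space vector
    m = rng.standard_normal(N)     # random dense-space vector
    return np.isclose(apply_R(s) @ m, s @ apply_Rt(m))

# Stand-in operator: a random N x n matrix in place of image-guided
# interpolation and its adjoint.
n, N = 165, 1000
R = np.random.default_rng(1).standard_normal((N, n))
print(dot_product_test(lambda s: R @ s, lambda m: R.T @ m, n, N))  # True
```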
Figure 5. The initial Hessian H̃_0 (a), the Hessian matrix H̃_1 after the 1st iteration (b), the Hessian update H̃_1 − H̃_0 in the 1st iteration (c), and the inverse Hessian H̃_1^{-1} (d).
4.2 Eigenvalues and eigenvectors
Note that the last two terms on the right-hand side of equation 19 are two rank-one matrices, so each iteration of the projected BFGS method is a rank-two update to the previous Hessian approximation. Figures 6a, 6b, 6c, and 6d show the eigenvalues of the projected Hessian H̃_i in the 1st, 4th, 7th, and 10th iterations, respectively. As suggested by the rank-two update, the projected BFGS method changes only two eigenvalues in each iteration: one, the smallest, and the other, the largest. Figures 7a and 7b show two eigenvectors of the projected Hessian H̃_1, corresponding to the largest and the smallest eigenvalues, respectively.

The eigenvalues and eigenvectors of the projected Hessian have geometric meaning. Consider the objective function E(m) as a multidimensional parabolic bowl (Thacker, 1989). The eigenvector associated with the largest eigenvalue points in the direction of greatest curvature on the bowl. Updating the model m along the eigenvector direction shown in Figure 7a, we get the largest rate of change in the objective function E(m). This eigenvector indicates the component of the model m that can be best determined in FWI. Likewise, the eigenvector associated with the smallest eigenvalue indicates the direction of least curvature. Updating the model m along the eigenvector direction shown in Figure 7b yields the smallest rate of change of E(m). This eigenvector indicates the component of the model m that can be least well determined in FWI.
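The observation that only two eigenvalues move in the first iteration follows from the rank-two structure; the small NumPy sketch below starts from H̃_0 = I, applies one update of the form in equation 19 with synthetic stand-in vectors (chosen so that ỹ^T δs > 0), and counts how many eigenvalues differ from 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 165
ds = rng.standard_normal(n)              # stand-in sparse model change
y = ds + 0.1 * rng.standard_normal(n)    # stand-in gradient change, y^T ds > 0
# One rank-two update of H~_0 = I (equation 19); note that H~_0 ds = ds.
H1 = (np.eye(n) + np.outer(y, y) / (y @ ds)
                - np.outer(ds, ds) / (ds @ ds))
eig = np.linalg.eigvalsh(H1)
print(np.sum(~np.isclose(eig, 1.0)))     # 2: only two eigenvalues leave 1
```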
4.3 Projected quasi-Newton FWI
The projected BFGS method updates the model in each iteration, and therefore it can be used directly in quasi-Newton solutions to FWI.
Figure 6. Eigenvalues of the projected Hessian in the 1st (a), 4th (b), 7th (c), and 10th (d) iterations.
Figure 7. Eigenvectors of H̃_1 corresponding to the largest (a) and the smallest (b) eigenvalues. Geometrically, an eigenvector is a direction on the multidimensional surface of an objective function E(m), and the associated eigenvalue determines the curvature on the surface. If the model is updated in the eigenvector direction in (a), the rate of change of E(m) is largest. If the model is updated in the eigenvector direction in (b), the rate of change of E(m) is smallest.
The key difference between quasi-Newton FWI and other solutions, such as the conjugate gradient method and Newton's method, is the update direction. Table 1 compares the update directions used in the conjugate gradient method, the image-guided conjugate gradient method (Ma et al., 2011), and the quasi-Newton method.

Table 1. Update directions p_i for the conjugate gradient (CG) method, the image-guided CG method, and our quasi-Newton method.

Conjugate gradient:  p_0 = -g_0;  β_i ≡ β(g_i, g_{i-1}) = g_i^T (g_i - g_{i-1}) / (g_{i-1}^T g_{i-1});  p_i = -g_i + β_i p_{i-1}.
Image-guided CG:     p_0 = -R R^T g_0;  β_i = β(R R^T g_i, R R^T g_{i-1});  p_i = -R R^T g_i + β_i p_{i-1}.
Quasi-Newton:        H̃_0 = I;  p_i = -R H̃_i^{-1} R^T g_i.

As suggested in Ma et al. (2011), the image-guided conjugate gradient method uses a gather-scatter process in applying R R^T, and thereby generates low wavenumbers not present in the simple conjugate gradient due to the lack of low frequencies in recorded data. The conjugate gradient in Figure 1a and the image-guided conjugate gradient in Figure 1b illustrate this difference.

Figures 8a and 8c show inversion results estimated by the conjugate gradient (CG) method and the image-guided CG method, respectively, after two iterations. The corresponding estimated models in the 10th iteration are shown in Figures 8b and 8d, respectively. We can see that both the conventional CG and the image-guided CG methods mainly update the shallow part of the model. This is because both the conjugate gradient and the image-guided conjugate gradient have unbalanced amplitudes: high in the shallow part and low in the deeper part.

In the case of quasi-Newton FWI, the update direction shown in Figure 1c not only contains significant low wavenumbers, as does the image-guided conjugate gradient in Figure 1b, but the amplitudes are more balanced as well. In Figure 1c, the amplitudes are higher in the deeper part. This improvement is due to the use of the inverse projected Hessian matrix H̃_i^{-1} (see the last row in Table 1), which works as a filter applied to the gradient. Figures 8e and 8f show the models estimated in the 2nd and 10th iterations of the quasi-Newton method, respectively. As we can see, quasi-Newton FWI updates the deeper part of the model in early iterations.

Figure 9 shows the convergence rates of the quasi-Newton method in this Marmousi II example with the data misfit function and the model misfit function (an L2 norm of the difference between the true model and the estimated model). Overall, quasi-Newton FWI shows faster convergence.
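A minimal sketch of the three update directions in Table 1, with the gradient, previous direction, interpolation matrix R, and projected Hessian H̃ all as stand-in arrays, is:

```python
import numpy as np

def cg_direction(g, g_prev, p_prev):
    """Conjugate-gradient direction with beta as defined in Table 1."""
    beta = (g @ (g - g_prev)) / (g_prev @ g_prev)
    return -g + beta * p_prev

def igcg_direction(g, g_prev, p_prev, R):
    """Image-guided CG: the same recursion applied to R R^T g."""
    s, s_prev = R @ (R.T @ g), R @ (R.T @ g_prev)
    beta = (s @ (s - s_prev)) / (s_prev @ s_prev)
    return -s + beta * p_prev

def qn_direction(g, R, H_t):
    """Projected quasi-Newton direction p = -R H~^{-1} R^T g."""
    return -R @ np.linalg.solve(H_t, R.T @ g)
```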
5 CONCLUSIONS
A projected BFGS method iteratively approximates the Hessian matrix in FWI, thereby reducing both computation time and required memory. The projected BFGS method enables FWI to be performed using a quasi-Newton method. As demonstrated by the Marmousi II example, quasi-Newton FWI converges in fewer iterations than the conjugate gradient and image-guided conjugate gradient methods. Compared with the conjugate gradient method or the image-guided conjugate gradient method, the primary disadvantage of our projected BFGS method is the computational cost of a single iteration, which is relatively high because the line search must satisfy the Wolfe conditions. Therefore, a further investigation of the coefficients in the Wolfe conditions is necessary.
6 ACKNOWLEDGMENT
This work is sponsored by a research agreement between ConocoPhillips and Colorado School of Mines (SST20090254-SRA). Yong Ma thanks Diane Witters for help in learning how to polish this manuscript.
REFERENCES

Broyden, C. G., 1970, The convergence of a class of double-rank minimization algorithms: IMA Journal of Applied Mathematics, 6, 222–231.
Fletcher, R., 1970, A new approach to variable metric algorithms: The Computer Journal, 13, 317–322.
Gill, P. E., W. Murray, and M. H. Wright, 1981, Practical optimization: Academic Press, London.
Goldfarb, D., 1970, A family of variable-metric methods derived by variational means: Mathematics of Computation, 109, 23–26.
Hale, D., 2009, Image-guided blended neighbor interpolation of scattered data: SEG Technical Program Expanded Abstracts, 28, 1127–1131.
Kennett, B., M. Sambridge, and P. Williamson, 1988, Subspace methods for large inverse problems with multiple parameter classes: Geophysical Journal International, 94, 237–247.
Ma, Y., D. Hale, B. Gong, and Z. Meng, 2011, Image-guided full waveform inversion: CWP Report.
Ma, Y., D. Hale, Z. Meng, and B. Gong, 2010, Full waveform inversion with image-guided gradient: SEG Technical Program Expanded Abstracts, 29, 1003–1007.
Figure 8. Inverted velocity models for the conjugate gradient method (a,b), the image-guided conjugate gradient method (c,d), and the quasi-Newton method (e,f). Left (a,c,e): the 2nd iteration; right (b,d,f): the 10th iteration.
Nocedal, J., and S. J. Wright, 2000, Numerical Optimization: Springer.
Oldenburg, D., P. McGillvray, and R. Ellis, 1993, Generalized subspace methods for large-scale inverse problems: Geophysical Journal International, 114, 12–20.
Pratt, R., C. Shin, and G. Hicks, 1998, Gauss-Newton and full Newton methods in frequency-space seismic waveform inversion: Geophysical Journal International, 133, 341–362.
Pratt, R. G., 1999, Seismic waveform inversion in the frequency domain, part 1: Theory and verification in a physical scale model: Geophysics, 64, 888.
Shanno, D. F., 1970, Conditioning of quasi-Newton methods for function minimization: Mathematics of Computation, 111, 647–656.
Sheen, D. H., K. Tuncay, C. E. Baag, and P. J. Ortoleva, 2006, Time domain Gauss-Newton seismic waveform inversion in elastic media: Geophysical Journal International, 167, 1373–1384.
Symes, W. W., 2008, Migration velocity analysis and waveform inversion: Geophysical Prospecting, 56, 765–790.
Tarantola, A., 1984, Inversion of seismic-reflection data in the acoustic approximation: Geophysics, 49, 1259–1266.
Tarantola, A., 2005, Inverse problem theory and methods for model parameter estimation: Society for Industrial and Applied Mathematics.
Thacker, W. C., 1989, The role of the Hessian matrix in fitting models to measurements: Journal of Geophysical Research, 94, 6177–6196.
Virieux, J., and S. Operto, 2009, An overview of full-waveform inversion in exploration geophysics: Geophysics, 74, WCC1–WCC26.

Figure 9. Convergence of the conjugate gradient method, the image-guided conjugate gradient method, and the projected quasi-Newton method: the data misfit function (a) and the model misfit function (b).