Gauss quadrature for matrix inverse forms with applications

Report 8 Downloads 50 Views
Bounds on bilinear inverse forms via Gaussian quadrature with applications Chengtao Li Suvrit Sra Stefanie Jegelka

[email protected] [email protected] [email protected]

arXiv:1512.01904v1 [stat.ML] 7 Dec 2015

Massachusetts Institute of Technology, Cambridge, MA 02139 Abstract We address quadrature-based approximations of the bilinear inverse form u> A−1 u, where A is a real symmetric positive definite matrix, and analyze properties of the Gauss, Gauss-Radau, and Gauss-Lobatto quadrature. In particular, we establish monotonicity of the bounds given by these quadrature rules, compare the tightness of these bounds, and derive associated convergence rates. To our knowledge, this is the first work to establish these properties of Gauss-type quadrature for computing bilinear inverse forms, thus filling a theoretical gap regarding this classical topic. We illustrate the empirical benefits of our theoretical results by applying quadrature to speed up two Markov Chain sampling procedures for (discrete) determinantal point processes.

1

Introduction

Many applications involve computation of bilinear forms like u T f ( A)v, where A ∈ R N × N is symmetric, u, v are given vectors, and f is a smooth function. When A is large it is desirable to approximate u T f ( A)v directly, rather than to first compute f ( A) which could require O( N 3 ) operations. Consequently, for estimating such bilinear forms substantial effort has been invested in developing theory and algorithms. Of these, algorithms based on quadrature stand out—and we refer the reader to the recent work [18] for an engaging account of appurtenant theory, algorithms, and applications. Among bilinear forms, perhaps the most fundamental one is the bilinear inverse form u> A−1 v, where A is (real) symmetric positive definite. Efficient computation of this form is a central step in several scientific applications. For example, in computational physics it is used to estimate specific entries of the inverse of a large sparse matrix; subsequently it serves as central step in estimating the trace of the inverse, a computational substep in lattice Quantum Chromodynamics [9, 13]. In machine learning and statistics, the quantity u> A−1 v arises in numerous places: in the evaluation of the ubiquitous multivariate Gaussian density at a point; as a subroutine when computing with Gaussian Processes [41]; or as a subroutine when performing Markov Chain Monte Carlo sampling with Determinantal Point Processes [7, 23, 24, 26]. Other settings where this computation arises include selective inversion [29–31], preconditioning [3], signal processing [21], network analysis [4, 10, 11], uncertainty quantification [2], rational approximation [46] and Green’s function evaluation [12]. Na¨ıvely, we can compute u> A−1 v by first solving Ax = v and then computing v> x. The linear system could be solved by a direct method at O( N 3 ) cost or by an iterative solver. Clearly, for large N the direct approach is infeasible. Although an iterative approach such as conjugate gradient is still possible, for applications that require precise bounds (see e.g., Sec. 6 where tight bounds enable early termination and hence savings in time) on u> A−1 v we need a more finessed approach. Gauss quadrature is one such approach. Originally proposed in [14] for approximating integrals, since 1996, Gauss quadrature and Gauss-type quadrature (e.g., Gauss-Lobatto [32] and Gauss-Radau [40] quadrature) have also been applied to computing bilinear inverse forms [1]. Bai et al. [1] in fact show that Gauss and right Gauss-Radau quadrature rules yield lower bounds, while Gauss-Lobatto and left Gauss-Radau rules yield upper bounds on bilinear inverse forms. But despite substantial effort (see e.g., the book [18]), our understanding of these methods is still far from complete. For instance, it is not known how the bounds obtained from Gauss, Gauss-Radau

1

and Gauss-Lobatto quadrature compare with each other, nor is it known how fast the iterates of GaussRadau or Gauss-Lobatto quadrature converge. Knowing this information not only closes a gap in the vast literature on quadrature, but also proves to be valuable for Dpp sampling (Sec. 6). Contributions.

In light of the above, our paper makes the following main contributions:

– It shows that both upper and lower bounds generated by the Gauss-type rules monotonically approach the true value of the inverse bilinear form being approximated (Thm. 10, Thm. 12, Corr. 13). – It proves that for the same number of iterations, the upper and lower bounds given by Gauss-Radau are superior to those given by Gauss and Gauss-Lobatto quadrature (Thm. 10, Thm. 12). – It proves that the lower bounds given by Gauss and right Gauss-Radau quadrature share the same convergence rate, while the upper bounds given by left Gauss-Radau and Gauss-Lobatto share the same convergence rate. Moreover, we obtain these rates explicitly (Thm. 11, Thm. 18, Corr. 19). – It presents an application of our theory to accelerating Monte Carlo sampling for Dpp and kDpp. Summary. The remainder of this paper is organized as follows: In Sec. 2 we mention basics of Gauss quadrature; we also recall the famous Gauss Quadrature Lanczos (GQL) algorithm. Sec. 3 presents our main theoretical results, which include an analysis of bounds attained by different Gauss-type quadrature rules, as well as convergence, monotonicity, and interrelations between them. In Sec. 4 we discuss and analyze a more general setting where the matrix A is only required to be symmetric, but the vector u lies in the column space spanned by the eigenvectors of A corresponding to strictly positive eigenvalues. We discuss practical issues such as numerical instability and preconditioning n Sec. 5 and concludes with potential applications in Sec. 6.

2 Background on Gauss Quadrature The material of this section is standard and has been derived from various sources: [1, 15, 18]. We include a summary here for the reader’s convenience. Experts can skim this section for collecting our notation before moving onto Sec. 3, which contains our new results. We begin with Gauss quadrature for computing the bilinear form u> f ( A)v, where A ∈ R N × N is a symmetric positive definite matrix with simple eigenvalues, u, v are arbitrary vectors, and f is a matrix function. It suffices to consider u> f ( A)u; the general case follows from the polarization identity u> f ( A)v = 41 (u + v)> f ( A)(u + v) − 41 (u − v)> f ( A)(u − v). Let A = Q> ΛQ be the eigendecomposition of A where Q is orthonormal. Then, u> f ( A)u = u> Q> f (Λ) Qu = u˜ > f (Λ)u˜ =

N

∑i=1 f (λi )u˜ 2i ,

where u˜ := Qu.

A key conceptual step is to write the above summation as the Riemann-Stieltjes integral: I [ f ] := u> f ( A)u =

Z λmax

f (λ)dα(λ),

(2.1)

λmin

where λmin ∈ (0, λ1 ), λmax > λ N , and α(λ) is piecewise constant measure defined by  λ < λ1 ,   0, ∑kj=1 u˜ 2j , λk ≤ λ < λk+1 , k < N, α(λ) :=   ∑ N u˜ 2 , λ ≤ λ. N j =1 j

2

(2.2)

Thus our task reduces to approximating the integral (2.1), and this is where we invoke the powerful idea of Gauss-type quadrature [14, 15, 32, 40]. We rewrite the integral (2.1) as I [ f ] := Qn + Rn =

n

m

∑i=1 ωi f (θi ) + ∑i=1 νi f (τi ) + Rn [ f ],

(2.3)

where Qn denotes the nth degree approximation and Rn denotes a remainder term. In representation (2.3) the weights {ωi }in=1 , {νi }im=1 , and quadrature nodes {θi }in=1 are unknown, while the values {τi }im=1 are prescribed and lie outside the interval of integration (λmin , λmax ). Different choices of these parameters lead to different quadrature rules: m = 0 corresponds to Gauss quadrature [14]; m = 1 with τ1 = λmin (τ1 = λmax ) corresponds to the left (right) Gauss-Radau quadrature [40]; m = 2 with τ1 = λmin and τ2 = λmax corresponds to Gauss-Lobatto quadrature [32]; while for general m we obtain Gauss-Christoffel quadrature—see [15] for an excellent survey. A more detailed introduction to Gauss-type quadrature for problems similar to ours may be found in [18].

2.1

Selecting weights and nodes

The weights {ωi }in=1 , {νi }im=1 and nodes {θi }in=1 are chosen such that for all polynomials of degree less than 2n + m − 1, denoted f ∈ P2n+m−1 , we have exact interpolation I [ f ] = Qn . One way to compute weights and nodes is to set f ( x ) = xi for i ≤ 2n + m − 1 and then use this exact nonlinear system. But there is an easier way to obtain weights and nodes, namely by using polynomials orthogonal with respect to the measure α. Specifically, we construct a sequence of orthogonal polynomials p0 (λ), p1 (λ), . . . such that pi (λ) is a polynomial in λ of degree exactly k, and pi , p j are orthogonal, i.e., they satisfy Z λmax

 pi (λ) p j (λ)dα(λ) =

λmin

1, 0,

i=j otherwise.

(2.4)

The roots of pn are distinct, real and lie in the interval of [λmin , λmax ], and form the nodes {θi }in=1 for Gauss quadrature (see, e.g., [18, Ch. 6]). Consider the two monic polynomials whose roots serve as quadrature nodes: πn (λ) =

n

∏ i =1 ( λ − θ i ),

ρm (λ) =

m

∏i=1 (λ − τi ),

(2.5)

+ where ρ0 = 1 for consistency. We further denote ρ+ m = ± ρm , where the sign is taken to ensure ρm ≥ 0 on [λmin , λmax ]. Then, for m > 0, we calculate the quadrature weights as     ρ+ ρ+ m (λ)πn (λ) m (λ)πn (λ) , νj = I , (2.6) ωi = I + 0 ρm (θi )πn0 (θi )(λ − θi ) (ρ+ m ) ( τj ) πn ( τj )( λ − τj )

where f 0 (λ) denotes the derivative of f with respect to λ. When m = 0 the quadrature degenerates to Gauss quadrature and we have   πn (λ) ωi = I . (2.7) πn0 (θi )(λ − θi ) Although we have specified how to select nodes and weights for quadrature, these ideas cannot be applied to our problem because the measure α is unknown. Indeed, calculating the measure explicitly would require knowing the entire spectrum of A, which is as good as explicitly computing f ( A), hence untenable for us. The next section shows how to circumvent the difficulties due to unknown α.

2.2

Gauss Quadrature Lanczos (GQL)

The key idea to circumvent our lack of knowledge of α is to recursively construct polynomials called Lanczos polynomials. The construction ensures their orthogonality with respect to α. Concretely, we

3

construct Lanczos polynomials via the following three-term recurrence: β i pi (λ) = (λ − αi ) pi−1 (λ) − β i−1 pi−2 (λ), i = 1, 2, . . . , n p−1 (λ) ≡ 0; while ensuring

R λmax λmin

p0 (λ) ≡ 1,

(2.8)

dα(λ) = 1. We can express (2.8) in matrix form by writing λPn (λ) = Jn Pn (λ) + β n pn (λ)en ,

where Pn (λ) := [ p0 (λ), . . . , pn−1 (λ)]> , en is  α1  β1   Jn =    

(2.9)

nth canonical unit vector, and Jn is the tridiagonal matrix  β1  α2 β 2   .. .. . . . β2 (2.10)   .. . α n −1 β n −1  β n −1

αn

This matrix is known as the Jacobi matrix, and is closed related to Gauss quadrature. The following well-known theorem makes this relation precise. Theorem 1 ([20, 43]). The eigenvalues of Jn form the nodes {θi }in=1 of Gauss-type quadrature. The weights {ωi }in=1 are given by the squares of the first elements of the normalized eigenvectors of Jn . Thus, if Jn has the eigendecomposition Jn = Pn> ΓPn , then for Gauss quadrature Thm. 1 yields Qn =

n

∑i=1 ωi f (θi ) = e1> Pn> f (Γ) Pn e1 = e1> f ( Jn )e1 .

(2.11)

Specialization. We now specialize to our main focus, f ( A) = A−1 , for which we prove more precise results. In this case, (2.11) becomes Qn = [ Jn−1 ]1,1 . The task now is to compute Qn , and given A, u to obtain the Jacobi matrix Jn . Fortunately, we can efficiently calculate Jn iteratively using the Lanczos Algorithm [28]. Suppose we have an estimate Ji , in iteration (i + 1) of Lanczos, we compute the tridiagonal coefficients αi+1 and β i+1 and add them to this estimate to form Ji+1 . As to Qn , assuming we have already computed [ Ji−1 ]1,1 , letting ji = Ji−1 ei and invoking the Sherman-Morrison identity [44] we obtain the recursion:

[ Ji−+11 ]1,1 = [ Ji−1 ]1,1 +

β2i ([ ji ]1 )2 , αi+1 − β2i [ ji ]i

(2.12)

where [ ji ]1 and [ ji ]i can be recursively computed using a Cholesky-like factorization of Ji [18, p.31]. For Gauss-Radau quadrature, we need to modify Ji so that it has a prescribed eigenvalue. More precisely, we extend Ji to Jilr for left Gauss-Radau (Jirr for right Gauss-Radau) with β i on the off-diagonal rr lr rr and αlr i (αi ) on the diagonal, so that Ji (Ji ) has a prescribed eigenvalue of λmin (λmax ). lo lo For Gauss-Lobatto quadrature, we extend Ji to Jilo with values βlo i and αi chosen to ensure that Ji has the prescribed eigenvalues λmin and λmax . For more detailed on the construction, see [16]. For all methods, the approximated values are calculated as [( Ji0 )−1 ]1,1 , where Ji0 ∈ { Jilr , Jirr , Jilo } is the modified Jacobi matrix. Here Ji0 is constructed at the i-th iteration of the algorithm. The algorithm for computing Gauss, Gauss-Radau, and Gauss-Lobatto quadrature rules with the help of Lanczos iteration is called Gauss Quadrature Lanczos (GQL) and is shown in [17]. We recall its pseudocode in Alg. 1 to make our presentation self-contained (and for our proofs in Sec. 3). The error of approximating I [ f ] by Gauss-type quadrature can be expressed as Rn [ f ] =

f (2n+m) (ξ ) I [ρm πn2 ], (2n + m)! 4

Algorithm 1: Gauss Quadrature Lanczos (GQL) Require: u and A the corresponding vector and matrix, λmin and λmax lower and upper bounds for the spectrum of A Ensure: gi , girr , gilr and gilo the Gauss, right Gauss-Radau, left Gauss-Radau and Gauss-Lobatto quadrature computed at i-th iteration Initialize: u−1 = 0, u0 = u/kuk, α1 = u0> Au0 , β 1 = k( A − α1 I )u0 k, g1 = kuk/α1 , c1 = 1, δ1 = α1 , δ1lr = α1 − λmin , δ1rr = α1 − λmax , u1 = ( A − α1 I )u0 /β 1 , i = 2 while i ≤ N do αi = ui>−1 Aui−1 . Lanczos Iteration u˜ i = Aui−1 − αi ui−1 − β i−1 ui−2 β i = ku˜ i k ui = u˜ i /β i gi = gi − 1 +

kuk β2i−1 c2i−1 δi−1 (αi δi−1 − β2i−1 )

. Update gi with Sherman-Morrison formula

ci = ci−1 β i−1 /δi−1 δi = αi −

β2i−1 lr δi−1 , δi

αlr i = λmin + αlo i = gilr =

β2i , δilr

= αi − λmin −

αrr i = λmax +

β2i δirr

β2i−1 δilr−1

, δirr = αi − λmax −

β2i−1 δirr−1

. Solve for Jilr and Jirr

δilr δirr λmax δilr δirr lo 2 ( δlr − λδmin rr ), ( β i ) = rr lr ( λmax − λmin ) δirr −δilr δ i i i − δi 2 2 2 2 2 2 ( βlo β c kuk β c kuk i ) ci k u k gi + ilr i 2 , girr = gi + δ (αirr iδ − β2 ) , gilo = gi + lo 2 δi (αi δi − β i ) δi (αi δi −( βlo i i i i i ) )

. Solve for Jilo . Update girr , gilr and gilo with

Sherman-Morrison formula i = i+1 end while

for some ξ ∈ [λmin , λmax ] (see, e.g., [48]). Note that ρm does not change sign in [λmin , λmax ]; but with different values of m and τj we obtain different (but fixed) signs for Rn [ f ] using f (λ) = 1/λ and λmin > 0. Concretely, for Gauss quadrature m = 0 and Rn [ f ] ≥ 0; for left Gauss-Radau m = 1 and τ1 = λmin , so we have Rn [ f ] ≤ 0; for right Gauss-Radau we have m = 1 and τ1 = λmax , thus Rn [ f ] ≥ 0; while for Gauss-Lobatto we have m = 2, τ1 = λmin and τ2 = λmax , so that Rn [ f ] ≤ 0. This behavior of the errors clearly shows the ordering relations between the target values and the approximations made by the different quadrature rules. Lemma 2 (see e.g., [33]) makes this claim precise. Lemma 2. Let gi , gilr , girr , and gilo be the approximations at the i-th iteration of Gauss, left Gauss-Radau, right Gauss-Radau, and Gauss-Lobatto quadrature, respectively. Then, gi and girr provide lower bounds on u> A−1 u, while gilr and gilo provide upper bounds. The final connection we recall as background is the method of conjugate gradients. This helps us analyze the speed at which quadrature converges to the true value (assuming exact arithmetic).

2.3

Relation with Conjugate Gradient

While Gauss-type quadrature relates to the Lanczos algorithm, Lanczos itself is closely related to conjugate gradient (CG) [22], a well-known method for solving Ax = b for positive definite A. We recap this connection below. Let xk be the estimated solution at the k-th CG iteration. If x ∗ denotes the true solution to Ax = b, then the error ε k and residual rk are defined as ε k := x ∗ − xk ,

rk = Aε k = b − Axk ,

(2.13)

At the k-th iteration, xk is chosen such that rk is orthogonal to the k-th Krylov space, i.e., the linear space Kk spanned by {r0 , Ar0 , . . . , Ak−1 r0 }. It can be shown [35] that rk is a scaled Lanczos vector from the 5

k-th iteration of Lanczos started with r0 . Noting the relation between Lanczos and Gauss quadrature applied to appoximate r0> A−1 r0 , one obtains the following theorem that relates CG with GQL. Theorem 3 (CG and GQL; [34]). Let ε k be the error as in (2.13), and let kε k k2A := ε Tk Aε k . Then, it holds that −1 ]1,1 − [ Jk−1 ]1,1 ), kε k k2A = kr0 k2 ([ JN

where Jk is the Jacobi matrix at the k-th Lanczos iteration starting with r0 . Finally, the rate at which kε k k2A shrinks has also been well-studied, as noted below. Theorem 4 (CG rate, see e.g. [45]). Let ε k be the error made by CG at iteration k when started with x0 . Let κ be the condition number of A, i.e., κ = λ1 /λ N . Then, the error norm at iteration k satisfies  √κ − 1  k kε k k A ≤ 2 √ kε 0 k A . κ+1 Due to these explicit relations between CG and Lanczos, as well as between Lanczos and Gauss quadrature, we readily obtain the following convergence rate for relative error of Gauss quadrature. Theorem 5 (Gauss quadrature rate). The i-th iterate of Gauss quadrature satisfies the relative error bound  √κ − 1 i g N − gi . (2.14) ≤2 √ gN κ+1 Proof. This is obtained by exploiting relations among CG, Lanczos and Gauss quadrature. Set x0 = 0 and b = u. Then, ε 0 = x ∗ and r0 = u. An application of Thm. 3 and Thm. 4 thus yields the bound −1 kε i k2A = kuk2 ([ JN ]1,1 − [ Ji−1 ]1,1 ) = g N − gi √  √κ − 1 i  √κ − 1 i  κ − 1 i ≤ 2 √ kε 0 k A = 2 √ u > A −1 u = 2 √ gN κ+1 κ+1 κ+1

where the last equality draws from Lemma 6. In other words, Thm. 5 shows that the iterates of Gauss quadrature converge geometrically.

3

Main Results

Equipped with above background on quadrature we are now ready to describe the our main results. The key questions that we address are: (i) how tight are the bounds on u> A−1 u yielded by Gauss-type quadrature; and (ii) how fast do Gauss-Radau and Gauss-Lobatto iterates converge? Our answers not only fill gaps in the literature on quadrature, but also have empirical consequences: Section 6 uses our sharper bounds to accelerate an application involving MCMC sampling. We begin by proving an exactness property of Gauss and Gauss-Radau quadrature. Lemma 6 (Exactness). With A being symmetric positive definite with simple eigenvalues, the iterates g N , glrN , and grr N are exact. Namely, after N iterations they satisfy > −1 g N = glrN = grr N = u A u.

(3.1)

Proof. Observe that the Jacobi tridiagonal matrix can be computed via Lanczos iteration, and Lanczos is essentially essentially an iterative tridiagonalization of A. At the i-th iteration we have Ji = Vi> AVi , where Vi ∈ R N ×i are the first i Lanczos vectors (i.e., a basis for the i-th Krylov space). Thus, JN =

6

VN> AVN where VN is an N × N orthonormal matrix, showing that JN has the same eigenvalues as A. As a result π N (λ) = ∏iN=1 (λ − λi ), and it follows that the remainder RN [ f ] =

f (2N ) (ξ ) I [π 2N ] = 0, (2N )!

for some scalar ξ ∈ [λmin , λmax ], which shows that g N is exact for u> A−1 u. For left and right GaussRadau quadrature, we have β N = 0, αlrN = λmin , and αrr N = λmax , while all other elements of the 0 are zeros. Thus, the eigenvalues of J 0 are λ , . . . , λ , τ , and π ( λ ) ( N + 1)-th row or column of JN N 1 N 1 N again equals ∏iN=1 (λ − λi ). As a result, the remainder satisfies RN [ f ] =

f (2N ) (ξ ) I [(λ − τ1 )π 2N ] = 0, (2N )!

lr from which it follows that both grr N and g N are exact.

The convergence rate in Thm. 4 and the final exactness of iterations in Lemma 6 does not necessarily indicate that we are making progress at each iterations. However, by exploiting the relations to CG we can indeed conclude that we are making progress in each iteration in Gauss quadrature. Theorem 7. The approximation gi generated by Gauss quadrature is monotonically nondecreasing, i.e., gi ≤ gi + 1 ,

for i < N.

Proof. At each iteration ri is taken to be orthogonal to the i-th Krylov space: Ki = span{u, Au, . . . , Ai−1 u}. Let Πi be the projection onto the complement space of Ki . The residual then satisfies

kε i+1 k2A = ε Ti+1 Aε i+1 = ri>+1 A−1 ri+1 = ( Π i +1 r i ) > A −1 Π i +1 r i = ri> (Πi>+1 A−1 Πi+1 )ri ≤ ri A−1 ri , where the last inequality follows from Πi>+1 A−1 Πi+1  A−1 . Thus kε i k2A is monotonically nonincreasing, whereby g N − gi ≥ 0 is monotonically decreasing and thus gi is monotonically nondecreasing. Before we proceed to Gauss-Radau, let us recall a useful theorem and its corollary. Theorem 8 (Lanczos Polynomial [18]). Let ui be the vector generated by Alg. 1 at the i-th iteration; let pi be the Lanczos polynomial of degree i. Then we have u i = p i ( A ) u0 ,

where pi (λ) = (−1)i

det( Ji − λI ) ∏ij=1 β j

.

From the expression of Lanczos polynomial we have the following corollary specifying the sign of the polynomial at specific points. Corollary 9. Assume i < N. If i is odd, then pi (λmin ) < 0; for even i, pi (λmin ) > 0, while pi (λmax ) > 0 for any i < N. Proof. Since Ji = Vi> AVi is similar to A, its spectrum is bounded by λmin and λmax from left and right. Thus, Ji − λmin is positive semi-definite, and Ji − λmax is negative semi-definite. Taking (−1)i into consideration we will get the desired conclusions. We are ready to state our main result that compares (right) Gauss-Radau with Gauss quadrature.

7

Theorem 10. Let i < N. Then, girr gives better bounds than gi but worse bounds than gi+1 ; more precisely, gi ≤ girr ≤ gi+1 , i < N.

(3.2)

Proof. We prove inequality (3.2) using the recurrences satisfied by gi and girr (see Alg. 1) Upper bound: girr ≤ gi+1 . The iterative quadrature algorithm uses the recursive updates girr = gi +

β2i c2i rr δi (αi δi −

β2i )

,

and

gi + 1 = gi +

β2i c2i . δi (αi+1 δi − β2i )

It suffices to thus compare αrr i and αi +1 . The three-term recursion for Lanczos polynomials shows that β i+1 pi+1 (λmax ) = (λmax − αi+1 ) pi (λmax ) − β i pi−1 (λmax ) > 0, β i+1 pi∗+1 (λmax ) = (λmax − αrr i ) pi ( λmax ) − β i pi −1 ( λmax ) = 0, where pi+1 is the original Lanczos polynomial, and pi∗+1 is the modified polynomial that has λmax as a root. Noting that pi (λmax ) > 0, we see that αi+1 ≤ αrr i . Moreover, from Thm. 7 we know that the gi ’s are monotonically increasing, whereby δi (αi+1 δi − β2i ) > 0. It follows that 2 0 < δi (αi+1 δi − β2i ) ≤ δi (αrr i δi − β i ),

and from this inequality it is clear that girr ≤ gi+1 . 2 2 Lower-bound: gi ≤ girr . Since β2i c2i ≥ 0 and δi (αrr i δi − β i ) ≥ δi ( αi +1 δi − β i ) > 0, we readily obtain gi ≤ gi +

β2i c2i = girr . 2) δi (αrr δ − β i i i

Combining Thm. 10 with the convergence rate of relative error for Gauss quadrature (Thm. 5) immediately yields the following convergence rate for right Gauss-Radau quadrature: Theorem 11 (Relative error of right Gauss-Radau). For each i, the right Gauss-Radau girr iterates satisfy  √κ − 1 i g N − girr ≤2 √ . gN κ+1 This results shows that with the same number of iterations, right Gauss-Radau gives superior approximation over Gauss quadrature, though they share the same relative error convergence rate. Our second main result compares Gauss-Lobatto with (left) Gauss-Radau quadrature. Theorem 12. Let i < N. Then, gilr gives better upper bounds than gilo but worse than gilo+1 ; more precisely, gilo+1 ≤ gilr ≤ gilo ,

i < N.

Proof. We prove these inequalities using the recurrences for gilr and gilo from Alg. 1. 2 ( βlo i ) . δilr

gilr ≤ gilo : From Alg. 1 we observe that αlo i = λmin + gilr = gi + gilo = gi +

β2i c2i 2 δi (αlr i δi − β i )

= gi +

Thus we can write gilr and gilo as β2i c2i

λmin δi2 + β2i (δi2 /δilr − δi )

2 2 2 2 ( βlo ( βlo i ) ci i ) ci = g + i lo 2 lr 2 2 δi (αlo λmin δi2 + ( βlo i δi − ( β i ) ) i ) ( δi /δi − δi )

To compare these quantities, as before it is helpful to begin with the original three-term recursion for the Lanczos polynomial, namely β i +1 p i +1 ( λ ) = ( λ − α i +1 ) p i ( λ ) − β i p i −1 ( λ ). 8

In the construction of Gauss-Lobatto, to make a new polynomial of order i + 1 that has roots λmin and λmax , we add σ1 pi (λ) and σ2 pi−1 (λ) to the original polynomial to ensure  β i+1 pi+1 (λmin ) + σ1 pi (λmin ) + σ2 pi−1 (λmin ) = 0, β i+1 pi+1 (λmax ) + σ1 pi (λmax ) + σ2 pi−1 (λmax ) = 0. Since β i+1 , pi+1 (λmax ), pi (λmax ) and pi−1 (λmax ) are all greater than 0, σ1 pi (λmax ) + σ2 pi−1 (λmax ) < 0. To determine the sign of polynomials at λmin , consider the two cases: 1. Odd i. In this case pi+1 (λmin ) > 0, pi (λmin ) < 0, and pi−1 (λmin ) > 0; 2. Even i. In this case pi+1 (λmin ) < 0, pi (λmin ) > 0, and pi−1 (λmin ) < 0. Thus, if S = (sgn(σ1 ), sgn(σ2 )), where the signs take values in {0, ±1}, then S 6= (1, 1), S 6= (−1, 1) and 2 2 2 2 S 6= (0, 1). Hence, σ2 ≤ 0 must hold, and thus ( βlo i ) = ( β i − σ2 ) ≥ β i given that β i > 0 for i < N. lo 2 2 2 2 Using ( β i ) ≥ β i with λmin ci (δi ) ≥ 0, an application of monotonicity of the univariate function g( x ) = b+axcx for ab ≥ 0 to the recurrences defining gilr and gilo yields the desired inequality gilr ≤ gilo . gilo+1 ≤ gilr : From recursion formulas we have gilr = gi +

β2i c2i 2 δi (αlr i δi − β i )

,

gilo+1 = gi+1 +

2 2 ( βlo i +1 ) c i +1 lo 2 δi+1 (αlo i +1 δi +1 − ( β i +1 ) )

.

Establishing gilr ≥ gilo+1 thus amounts to showing that (noting the relations among gi , gilr and gilo ): β2i c2i 2 δi (αlr i δi − β i )

⇔ ⇔



2 2 ( βlo β2i c2i i +1 ) c i +1 ≥ lo 2 δi (αi+1 δi − β2i ) δi+1 (αlo i +1 δi +1 − ( β i +1 ) )

β2i c2i 2 δi (αlr i δi − β i )



2 2 2 ( βlo β2i c2i i +1 ) c i β i ≥ lo 2 δi (αi+1 δi − β2i ) (δi )2 δi+1 (αlo i +1 δi +1 − ( β i +1 ) )

2 ( βlo 1 1 i +1 ) − ≥ lo 2 2 αi+1 δi − β2i αlr δi δi+1 (αlo i δi − β i i +1 δi +1 − ( β i +1 ) )

1 1 1 − ≥ lo 2 (αi+1 − δilr+1 ) − β2i /δi αi+1 − β2i /δi δi+1 (αlo δ i +1 i +1 / ( β i +1 ) − 1 ) 1 1 1 ⇔ − ≥ δ λ δ lr min i + 1 δ δi+1 − δi+1 i +1 δi+1 ( lo 2 + ilr+1 − 1) δ (β )



i +1

(Lemma 14)

i +1

λ δi+1 δ δ ⇔ min + ilr+1 − 1 ≥ ilr+1 − 1 lo 2 δi+1 δi+1 ( β i +1 )



λmin δi+1 ≥ 0, 2 ( βlo i +1 )

where the last inequality holds, thus proving the statement. In summary, we have the following corollary for all the four quadrature rules: Corollary 13 (Monotonicity of Lower and Upper Bounds). As the iteration proceeds, gi and girr gives increasingly better asymptotic lower bounds and gilr and gilo gives increasingly better upper bounds, namely gi ≤ gi + 1 ;

girr ≤ girr+1

gilr ≥ gilr+1 ;

gilo ≥ gilo+1 .

Proof. Directly drawn from Thm. 7, Thm. 10 and Thm. 12.

9

Before proceeding further to our analysis of convergence rates of left Gauss-Radau and GaussLobatto, we note two technical results that we will need. lr lr Lemma 14. Let αi+1 and αlr i be as in Alg. 1. The difference ∆i +1 = αi +1 − αi satisfies ∆i +1 = δi +1 .

Proof. From the Lanczos polynomials in the definition of left Gauss-Radau quadrature we have  β i+1 pi∗+1 (λmin ) = λmin − αlr i pi ( λmin ) − β i pi −1 ( λmin )  = λmin − (αi+1 − ∆i+1 ) pi (λmin ) − β i pi−1 (λmin )

= β i+1 pi+1 (λmin ) + ∆i+1 pi (λmin ) = 0. Rearrange this equation to write ∆i+1 = − β i+1 ∆ i +1

Thm. 8

=

− β i +1

pi+1 (λmin ) , pi (λmin )

which can be further rewritten as

1 (−1)i+1 det( Ji+1 − λmin I )/ ∏ij+ =1 β j

(−1)i det( Ji

− λmin I )/ ∏ij=1

=

βj

det( Ji+1 − λmin I ) = δilr+1 . det( Ji − λmin I )

Remark 15. Lemma 14 has an implication beyond its utility for the subsequent proofs: it provides a new way of calculating αi+1 given the quantities δilr+1 and αlr i ; this saves calculation in Alg. 1. The following lemma relates δi to δilr , which will prove useful in subsequent analysis. Lemma 16. Let δilr and δi be computed in the i-th iteration of Alg. 1. Then, we have the following: δilr < δi , δilr δi

(3.3)

≤ 1−

λmin . λN

(3.4)

Proof. We prove (3.3) by induction. Since λmin > 0, δ1 = α1 > λmin and δ1lr = α − λmin we know that δ1lr < δ1 . Assume that δilr < δi is true for all i ≤ k and considering the (k + 1)-th iteration: β2k

δklr+1 = αk+1 − λmin −

δklr

< α k +1 −

β2k = δk+1 . δk

To prove (3.4), simply observe the following αi − λmin − β2i−1 /δilr−1 δilr = δi αi − β2i−1 /δi−1

(3.3)



λ αi − λmin ≤ 1 − min . αi λN

With aforementioned lemmas we will be able to show how fast the difference between gilr and gi decays. Note that gilr gives an upper bound on the objective while gi gives a lower bound. Lemma 17. The difference between gilr and gi decreases exponentially. More specifically we have √ lr + √κ − 1 i gi − gi ≤ 2κ ( ) gN κ+1 where κ + = λ N /λmin and κ is the condition number of A, i.e., κ = λ N /λ1 . Proof. We rewrite the difference gilr − gi as follows gilr − gi =

=

β2i c2i 2 δi (αlr i δi − β i )

=

β2i c2i δi (αi+1 δi − β2i ) 2 δi (αi+1 δi − β2i ) δi (αlr i δi − β i )

β2i c2i β2i c2i 1 1   = , 2 2 lr 2 2 1 − ∆ δi (αi+1 δi − β i ) αi − β i /δi δi (αi δi − β i ) αi+1 − β i /δi i +1 /δi +1 10

(3.5)

where ∆i+1 = αi+1 − αlr i . Next, recall that

g N − gi gN

≤2

√

√κ −1 κ +1

i

. Since gi lower bounds g N , we have

 √κ − 1 i  1−2 √ g N ≤ gi ≤ g N , κ+1   √ κ − 1  i +1  1−2 √ g N ≤ gi + 1 ≤ g N . κ+1 

Thus, we can conclude that  √κ − 1 i β2i c2i √ = g − g ≤ 2 gN . i +1 i κ+1 δi (αi δi − β2i ) Now we focus on the term 1 − ∆i+1 /δi+1

 −1

. Using Lemma 14 we know that ∆i+1 = δilr+1 . Hence,

1 − ∆i+1 /δi+1 = 1 − δilr+1 /δi+1 ≥ 1 − (1 − λmin /λ N ) = λmin /λ N , Finally we have gilr

1 . κ+

(3.6)

√ i β2i c2i 1 + √κ − 1 − gi = ≤ 2κ gN . κ+1 δi (αi δi − β2i ) 1 − ∆i+1 /δi+1

Theorem 18 (Relative error of left Gauss-Radau). For left Gauss-Radau quadrature where the preassigned node is λmin , we have the following bound on relative error:  √κ − 1 i gilr − g N ≤ 2κ + √ , gN κ+1

where κ + := λ N /λmin , i < N.

Proof. Write gilr = gi + ( gilr − gi ). Since gi ≤ g N , using Lemma 17 to bound the second term we obtain  √κ − 1 i gilr ≤ g N + 2κ + √ gN , κ+1 from which the claim follows upon rearrangement. Due to the relations between left Gauss-Radau and Gauss-Lobatto, we have the following corollary: Corollary 19 (Relative error of Gauss-Lobatto). For Gauss-Lobatto quadrature, we have the following bound on relative error:  √ κ − 1  i −1 gilo − g N ≤ 2κ + √ , where κ + := λ N /λmin and i < N. (3.7) gN κ+1

4

Generalization

In this section we consider the case where u lies in the column space of several top eigenvectors of A, and discuss how the aforementioned theorems vary. In particular, note that the previous analysis assumes that A is positive definite. With our analysis in this section we relax this assumption to the more general case where A is symmetric with simple eigenvalues, though we require u to lie in the space spanned by eigenvectors of A corresponding to positive eigenvalues. We consider the case where A is symmetric and has the eigendecomposition of A = QΛQ> = N ∑i=1 λi qi qi> where λi ’s are eigenvalues of A increasing with i and qi ’s are corresponding eigenvectors. Assume that u lies in the column space spanned by top k eigenvectors of A where all these k eigenvectors correspond to positive eigenvalues. Namely we have u ∈ Span{{qi }iN= N −k+1 } and 0 < λ N −k+1 . 11

Since we only assume that A is symmetric, it is possible that A is singular and thus we consider the value of u> A† u, where A† is the pseudo-inverse of A. Due to the constraints on u we have > † u> A† u = u> QΛ† Q> u = u> Qk Λ†k Q> k u = u B u,

where B = ∑iN= N −k+1 λi qi qi> . Namely, if u lies in the column space spanned by the top k eigenvectors of A then it is equivalent to substitute A with B, which is the truncated version of A at top k eigenvalues and corresponding eigenvectors. Another key observation is that, given that u lies only in the space spanned by {qi }iN= N −k+1 , the Krylov space starting at u becomes Span{u, Au, A2 u, . . .} = Span{u, Bu, B2 u, . . . , Bk−1 u}

(4.1)

This indicates that Lanczos iteration starting at matrix A and vector u will finish constructing the corresponding Krylov space after the k-th iteration. Thus under this condition, Alg. 1 will run at most k iterations and then stop. At that time, the eigenvalues of Jk are exactly the eigenvalues of B, thus they are exactly {λi }iN= N −k+1 of A. Using similar proof as in Lemma 6, we can obtain the following generalized exactness result. Corollary 20 (Generalized Exactness). gk , gkrr and gklr are exact for u> A† u = u> B† u, namely gk = gkrr = gklr = u> A† u = u> B† u. The monotonicity and the relations between bounds given by various Gauss-type quadrature will still be the same as in the original case in Sec. 3, but the original convergence rate cannot apply in this case because now we probably have λmin ( B) = 0, making κ undefined. This crash of convergence rate results from the crash of the convergence of the corresponding conjugate gradient algorithm for solving Ax = u. However, by looking at the proof of, e.g., [45], and by noting that λ1 ( B) = . . . = λ N −k ( B) = 0, with a slight modification of the proof we actually obtain the bound

kεi k2A ≤ min Pi

max

λ∈{λi }iN= N −k+1

[ Pi (λ)]2 kε0 k2A ,

where Pi is a polynomial of order i. By using properties of Chebyshev polynomials and following the original proof (e.g., [18] or [45]) we obtain the following lemma for conjugate gradient. Lemma 21. Let εk be as before (for conjugate gradient). Then,  √κ 0 − 1 k )k kε 0 k A , where κ 0 := λ N /λ N −k+1 . kε k A ≤ 2 √ κ0 + 1 Following this new convergence rate and connections between conjugate gradient, Lanczos iterations and Gauss quadrature mentioned in Sec. 3, we have the following convergence bounds. Corollary 22 (Convergence Rate for Special Case). Under the above assumptions on A and u, due to the connection Between Gauss quadrature, Lanczos algorithm and Conjugate Gradient, the relative convergence rates of gi , girr , gilr and gilo are given by  √κ 0 − 1 i g k − gi ≤2 √ gk κ0 + 1 √  rr g k − gi κ 0 − 1 i ≤2 √ gk κ0 + 1 √ 0 i gilr − gk 0 √κ − 1 ≤ 2κm gk κ0 + 1 √  i lo 0 gi − g k 0 √κ − 1 ≤ 2κm , gk κ0 + 1 0 = λ /λ0 0 where κm N min and 0 < λmin < λ N −k +1 is a lowerbound for nonzero eigenvalues of B.

12

5

Practical Issues

Instability. As seen in Alg. 1, the quadrature algorithm is built upon Lanczos iterations. Although in theory Lanczos iterations construct a set of orthogonal Lanczos vectors, in practice the constructed vectors usually lose orthogonality after some iterations due to rounding errors. One way to deal with this problem is to reorthogonalize the vectors, either completely at each iteration or selectively [39]. Also, an equivalent Lanczos iteration proposed in [37] which uses a different expression to improve local orthogonality. Further discussion on numerical stability of the method lies beyond the scope of this paper. Preconditioning. For Gauss quadrature on u> A−1 u, the convergence rate of bounds is dependent on the condition number of A. We can use preconditioning techniques to get a well-conditioned submatrix and proceed with that. Concretely, observe that for non-singular C, u> A−1 u = u> C > C −> A−1 C −1 Cu = (Cu)(CAC > )−1 (Cu).

(5.1)

Thus, if CAC > is well-conditioned, we can use it with the vector Cu in Gauss quadrature. There exists various ways to obtain good preconditioners for an SPD matrix. A simple choice is to use C = [diag( A)]−1/2 . There also exists methods for efficiently constructing sparse inverse matrix [5]. If L happens to be an SDD matrix, we can use techniques introduced in [8] to construct an approximate sparse inverse in near linear time.

6 Applications In this section we consider our main application — use of Gauss-style quadrature to accelerate largescale Markov chain samplers for Determinantal Point Processes (DPPs). We show where the slowdowns arise and how to circumvent them via retrospective Markov chain samplers with Gauss quadrature. We also illustrate some other potential applications that benefit from the theory developed above.

6.1 6.1.1

Accelerated MCMC for Determinantal Point Processes DPP Basics and Motivation

A determinantal point process Dpp(L) is a distribution over subsets S ⊆ Y of a ground set Y of size N. It is determined by a positive semidefinite kernel L ∈ R N × N . Let LY be the submatrix of L consisting of the entries Lij with i, j ∈ Y ⊆ Y . Then, the probability PL (Y ) of observing Y ⊆ Y is proportional to det( LY ); consequently, PL (Y ) = det( LY )/ det( L + I ). Conditioning on sampling sets of a fixed cardinality k, one obtains a kDpp [25], whose probabilities are defined as PL,k (Y ) , PL (Y | |Y | = k) = det( LY )ek ( L)−1 J|Y | = kK,

where ek ( L) is the k-th coefficient of the polynomial det(λI − L) = ∑kN=0 (−1)k ek ( L)λ N −k . For more details about Dpps, we refer the reader to the detailed survey [26]. Exact sampling from a Dpp or kDpp requires an eigendecomposition of L [23], which is inefficient when N is large. In such cases, Metropolis Hastings or Gibbs sampling, shown in Alg. 2, can be a useful alternative. The bottleneck in each iteration is the computation of the transition probabilities. This computation can be accelerated by maintaining the inverse of the submatrix corresponding to the current state of the Markov Chain, and updating this inverse efficiently [24]. If the cardinality of the initial set Y is k0 , then T iterations of the chain take time O(k30 + Tk2 ) where k2 is the average value of the squared cardinality k2i of Y in iteration i. The corresponding running time for kDpp is O(k3 + k2 T ). The constants k0 , k2i and k2 can however be linear in N, leading to long running times even if L is sparse. Hence, we next propose an alternative sampler that uses Gauss quadrature and runs in time linear in the number of nonzero entries of LY , the principal minor of the respective state of the chain. 13

Algorithm 2: Markov Chain Dpp Require: L the kernel matrix we want to sample Dpp from, Y = [ N ] the ground set Ensure: Y sampled from exact Dpp (L) Randomly Initialize Y ⊆ Y while not mixed do Pick y ∈ Y uniformly randomly Pick p ∈ (0, 1) uniformly randomly Set for MH sampling: ( ) det( LY ∪{y} ) + py (Y ) ← min 1, det( LY ) ) ( det( LY \{y} ) − py (Y ) ← min 1, det( LY )

(6.1) (6.2)

Set for Gibbs sampling: q+ y (Y ) ← q− y (Y ) ←

det( LY ∪{y} ) LY ∪{y} + det( LY ) det( LY \{y} ) det( LY \{y} ) + det( LY )

(6.3) (6.4)

if y ∈ / Y then + Y ← Y ∪ {y} with probability p+ y (Y ) for MH; and with probability qy (Y ) for Gibbs else − Y ← Y \{y} with probability p− y (Y ) for MH; and with probability qy (Y ) for Gibbs end if end while

6.1.2

Retrospective Markov Chain Dpp via Gauss quadrature

To accelerate the computation of the transition probabilities, we combine our results on quadrature with the idea of retrospective sampling, which has previously been applied to the different setting of samplers for Dirichlet process hierarchical models [38]. The key idea is to use an iterative scheme to approximate the probabilities, and, adaptively, only compute the quantities as accurately as needed. Our results in the previous sections enable such a scheme for Dpps. We here discuss the details of MH sampling, the variant for Gibbs sampling is derived analogously. Given the current set Y, assume we propose to add element u to Y. The probability of transitioning to state Y ∪ {u} is q = min{1, det( LY ∪{u} )/ det( LY )} = min{1, Lu,u − Lu,Y LY−1 LY,u }. To decide whether to accept this transition, we sample p ∼ (0, 1) uniformly at random; if p < q then we accept the transition, otherwise we remain at Y. Hence, we need to compute q just accurately enough to decide whether p < q. To do so, we can use the lower and upper bounds on Lu,Y LY−1 LY,u discussed above. Let si and ti be lower and upper bounds for this bilinear form in the i-th iteration of Gauss quadrature. If p ≤ Lu,u − ti , then we can safely accept the transition, if p ≥ Lu,u − si , then we can safely reject the transition. Only if Lu,u − ti < p < Lu,u − si , we cannot make a decision yet, and therefore “retrospectively” perform another iteration of Gauss quadrature to obtain tighter upper and lower bounds si+1 and ti+1 . We continue until the bounds are sharp enough to safely decide whether to transition. Note that in each iteration we make an exact decision without approximation error, and hence the resulting algorithm is an exact Markov chain for the Dpp. The algorithm is shown in Alg. 3. For kDpp, the cardinality of Y remains constant, and therefore one transition correponds to swap-

14

Algorithm 3: Gauss-Dpp (L) Require: L the kernel matrix we want to sample Dpp from, Y = [ N ] the ground set Ensure: Y sampled from exact Dpp (L) Randomly Initialize Y ⊆ Y while not mixed do Pick y ∈ Y uniformly randomly Pick p ∈ (0, 1) uniformly randomly if y ∈ Y then Y 0 = Y \{y} Get lower and upper bounds λmin , λmax of the spectrum of LY 0 if Dpp-JudgeGauss(Ly,y − 1p , LY 0 ,y , LY 0 , λmin , λmax ) = True then Y = Y0 end if else Y 0 = Y ∪ {y} Get lower and upper bounds λmin , λmax of the spectrum of LY if Dpp-JudgeGauss(Ly,y − p, LY,y , LY , λmin , λmax ) = False then Y = Y0 end if end if end while

ping two elements (adding u and removing v at the same time). Say the current set is Y ∪ {v}, and we propose to delete v and add u to this set. The corresponding transition probability is o n L − L L −1 L o n det( L u,u Y ∪{u} ) u,Y Y Y,u = min 1, . q = min 1, det( LY ∪{v} ) Lv,v − Lv,Y LY−1 LY,v

(6.5)

Again, we sample p ∼ Uniform(0, 1), but now we have two quantities to compute, and hence two sets of lower and upper bounds: siu , tiu for Lu,Y LY−1 LY,u in the i-th Gauss quadrature iteration, and svj , tvj for Lv,Y LY−1 LY,v in the j-th Gauss quadrature iteration. Indeed, we do not necessarily need to run the same number of quadrature iterations for Lu,Y LY−1 LY,u and Lv,Y LY−1 LY,v . If we have p≤

Lu,u − Lu,Y LY−1 LY,u Lu,u − tiu ≤ Lv,v − svj Lv,v − Lv,Y LY−1 LY,v

(6.6)

then we can safely accept the transition, and if p≥

Lu,u − Lu,Y LY−1 LY,u Lu,u − siu ≥ Lv,v − tvj Lv,v − Lv,Y LY−1 LY,v

(6.7)

we can safely reject the transition. If neither is the case, then we tighten the bounds via additional Gauss quadrature iterations. We could perform one iteration for both u and v, but it may be the case that one set of bounds is already sufficiently tight, while the other still remains loose. A straightforward idea would be to judge the tightness of the lower and upper bounds by their difference (gap) ti − si , and decide accordingly which quadrature to iterate further. But the bounds for u and v do not contribute equally to the transition decision. Essentially, we need to judge the relation between p and

−1 Lu,u − Lu,Y LY LY,u , −1 Lv,v − Lv,Y LY LY,v

15

or, equivalently, the relation between pLv,v − Lu,u

Algorithm 4: Dpp-JudgeGauss(t, u, A, λmin , λmax ) Require: t the target value, u and A the corresponding vector and matrix, λmin and λmax lower and upper bounds for the spectrum of A Ensure: Return True if t < u> A−1 u, False if otherwise u−1 = 0, u0 = u/kuk, i = 1, β 0 = 0, while True do αi = ui>−1 ( Aui−1 − β i−1 ui−2 ) if i = 1 then g = 1/α1 , c = 1 δ = α1 , δlr = α1 − λmin , δrr = α1 − λmax else β2 c2 g = g + δ(α δi−−1β2 ) , c = cβ i−1 /δ i

β2i−1 δ ,

i −1

δ = αi − δlr = αi − λmin − end if u˜i = Aui−1 − αi ui−1 − β i−1 ui−2 β i = ku˜i k, ui = u˜ i /β i β2i , αrr i δlr 2 2 βi c lr g = g + lr , δ(αi δ− β2i ) if t < kuk2 grr then

αlr i = λmin +

= λmax + grr = g +

β2i−1 , δlr

δrr = αi − λmax −

β2i−1 δrr

β2i δrr

β2i c2 2 δ(αrr i δ− β i )

Return True else if t ≥ kuk2 glr then Return False end if ui = u˜i /β i , i = i + 1 end while and pLv,Y LY−1 LY,v − Lu,Y LY−1 LY,u . Since the left hand side is “easy”, the essential part is the right and side. Assuming that in practice the impact is larger when the gap is larger, we tighten the bounds for Lv,Y LY−1 LY,v if p(tvj − svj ) > (tiu − siu ), and otherwise tighen the bounds for Lu,Y LY−1 LY,u . The resulting algorithm is shown in Alg. 5. Since, with the same number of iterations, the GaussRadau quadrature yields tighter approximations than Gauss and Gauss-Lobatto quadratures, we consider only Gauss-Radau when making comparisons and deciding whether to transition. Algorithm 5: Gauss-kDpp (L, k) Require: L the kernel matrix we want to sample Dpp from, k the size of subset and Y = [ N ] the ground set Ensure: Y sampled from exact kDpp (L) where |Y | = k Randomly Initialize Y ⊆ Y where |Y | = k while not mixed do Pick v ∈ Y and u ∈ Y \Y uniformly randomly Pick p ∈ (0, 1) uniformly randomly Y 0 = Y \{v} Get lower and upper bounds λmin , λmax of the spectrum of LY 0 if kDpp-JudgeGauss(pLv,v − Lu,u , p, LY 0 ,u , LY 0 ,v , λmin , λmax ) = True then Y = Y 0 ∪ {u} end if end while

16

Algorithm 6: kDpp-JudgeGauss(t, p, u, v, A, λmin , λmax ) Require: t the target value, p the scaling factor, u, v and A the corresponding vectors and matrix, λmin and λmax lower and upper bounds for the spectrum of A Ensure: Return True if t < p(v> A−1 v) − u> A−1 u, False if otherwise u−1 = 0, u0 = u/kuk, iu = 1, βu0 = 0, du = ∞ v−1 = 0, v0 = v/kvk, iv = 1, βv0 = 0, dv = ∞ while True do if du > pdv then αiu = ui>−1 ( Aui−1 − βui−1 ui−2 ) if iu = 1 then gu = 1/α1u , cu = 1 δu = α1u , (δlr )u = α1u − λmin , (δrr )u = α1u − λmax else ( β u )2 ( c u )2 gu = gu + δu (αuiδ−u1−( βu )2 ) , cu = cu βui−1 /δu i

δu

=

αiu



( βui−1 )2 δu ,

i −1

(δlr )u

= αiu − λmin −

( βui−1 )2 , (δlr )u

(δrr )u = αiu − λmax −

( βui−1 )2 (δrr )u

end if u˜i = Aui−1 − αiu ui−1 − βui−1 ui−2 βui = ku˜i k, ui = u˜i /βui

( βui )2 ( βui )2 u , (αrr i ) = λmax + (δrr )u (δlr )u ( β u )2 ( c u )2 ( β u )2 ( c u )2 ( glr )u = gu + δu ((αlr i)u δu −( βu )2 ) , ( grr )u = gu + δu ((αrri)u δu −( βu )2 ) i i i i ui = u˜i /βui , iu = iu + 1, du = ( glr )u − ( grr )u

(αlri )u = λmin +

else αiv = vi>−1 ( Avi−1 − βvi−1 vi−2 ) if iv = 1 then gv = 1/α1v , cv = 1 δv = α1v , (δlr )v = α1v − λmin , (δrr )v = α1v − λmax else ( β v )2 ( c v )2 gv = gv + δv (αviδ−v1−( βv )2 ) , cv = cv βvi−1 /δv i

( βvi−1 )2 δv ,

i −1

(δlr )v = αiv − λmin − = − end if v˜i = Avi−1 − αiv vi−1 − βvi−1 vi−2 βvi = kv˜i k, vi = v˜i /βvi δv

αiv

( βvi−1 )2 , (δlr )v

(δrr )v = αiv − λmax −

( βvi )2 ( βvi )2 v , (αrr i ) = λmax + (δrr )v (δlr )v ( β v )2 ( c v )2 ( β v )2 ( c v )2 ( glr )v = gv + δv ((αlr i)v δv −( βv )2 ) , ( grr )v = gv + δv ((αrri)v δv −( βv )2 ) i i i i vi = v˜i /βvi , iv = iv + 1, dv = ( glr )v − ( grr )v

(αlri )v = λmin +

end if if t < pkvk2 ( grr )v − kuk2 ( glr )u then Return True else if t ≥ pkvk2 ( glr )v − kuk2 ( grr )u then Return False end if end while

17

( βvi−1 )2 (δrr )v

6.2

Other Applications

Many other applications could benefit from the derived asymptotic bounds; we discuss a few examples. 6.2.1

Further Applications to Bayesian Models

Dpps (and kDpps) have recently become a popular probabilistic model of diversity: they assign higher probability to subsets that are diverse. Hence, they have been considered as a (repulsive) prior distribution over a ground set of parameters in Bayesian models [27, 42]. Gibbs sampling is a candidate inference method for such latent variable models. If however the number of possible parameter assignments is large, the associated kernel matrix of parameters is large, and this may render standard Gibbs sampling intractable. If the kernel matrix is sparse, the retrospective Gibbs sampling with Gauss quadrature discussed above can help exploit sparsity and accelerate the computation of transition probablities. 6.2.2

Local Rank and centrailty in Large Graph

Measuring the popularicty, centrality or “importance” of a node in a large graph is an important question in network analysis. Several existing measures can be expressed as the solution to a largescale linear system. For example, PageRank [36] is the solution x satsfying

( I − (1 − α) G > ) x = α1/N,

(6.8)

and Bonacich centrality [6] is the solution to

( I − αG ) x = 1

(6.9)

where G denotes the adjacency matrix of the graph. A substantial body of work addresses how to estimate x globally (estimating the whole x) [19, 47, 50] or locally (estimating some entries) [29, 49]. The actual question in practice is often not the actual values of the nodes but the relation between nodes: is vi is more “central” than v j ? Moreover, we may only care about local rankings – for a search engine, the ranking of the top few entries is much more important than the bottom. For such a local ranking, it is sufficient to focus on computing only a small set of entries in x. Moreover, we only need to compute those entries within an accuracy that suffices to determine which is larger, similar to deciding the transition probabilities in the sampler. As another example, we may have a rough or outdated ranking and want to update parts of this ranking. In such cases too, re-computing all values could well be redundant. In all of these situations, the lower and upper bounds for Gauss quadrature help speed up computations tremendously: analogously to the decisions for retrospective sampling, these bounds indicate when we have reached sufficient accuracy to stop. In particular, for a specific entry i we can directly calculate the asymptotically tighter bounds for xi = ei> A−1 b =

1 > −1 ( e A ei − b > A −1 b ) 4 i

(6.10)

for given A and b. Note that b is common for all xi ’s, thus bounds for b> A−1 b can be reused. Once the interval given by the bounds on xi no longer overlaps with any other intervals in consideration, we can stop estimating xi and know the ranking for xi among this small group of entries.

References [1] Z. Bai, G. Fahey, and G. Golub. Some large-scale matrix computation problems. Journal of Computational and Applied Mathematics, 74(1):71–89, 1996. 18

[2] C. Bekas, A. Curioni, and I. Fedulova. Low cost high performance uncertainty quantification. In Proceedings of the 2nd Workshop on High Performance Computational Finance, 2009. [3] M. Benzi and G. H. Golub. Bounds for the entries of matrix functions with applications to preconditioning. BIT Numerical Mathematics, 39(3):417–438, 1999. [4] M. Benzi and C. Klymko. Total communicability as a centrality measure. Journal of Complex Networks, 1(2):124–149, 2013. [5] M. Benzi, C. D. Meyer, and M. Tuma. A sparse approximate inverse preconditioner for the conjugate gradient method. SIAM Journal on Scientific Computing, 17(5):1135–1149, 1996. [6] P. Bonacich. Power and centrality: A family of measures. American Journal of Sociology, 92(5): 1170–1182, 1987. [7] A. Borodin. Determinantal Point Processes. arXiv preprint arXiv:0911.1153, 2009. [8] D. Cheng, Y. Cheng, Y. Liu, R. Peng, and S.-H. Teng. Scalable parallel factorizations of SDD matrices and efficient sampling for Gaussian graphical models. arXiv preprint arXiv:1410.5392, 2014. [9] S.-J. Dong and K.-F. Liu. Stochastic estimation with z2 noise. Physics Letters B, 328(1):130–136, 1994. [10] E. Estrada and D. J. Higham. Network properties revealed through matrix functions. SIAM review, 52(4):696–714, 2010. [11] C. Fenu, D. Martin, L. Reichel, and G. Rodriguez. Network analysis via partial spectral factorization and gauss quadrature. SIAM Journal on Scientific Computing, pages A2046–A2068, 2013. [12] J. K. Freericks. Transport in multilayered nanostructures. The Dynamical Mean-Field Theory Approach, Imperial College, London, 2006. [13] A. Frommer, T. Lippert, B. Medeke, and K. Schilling, editors. Numerical Challenges in Lattice Quantum Chromodynamics, volume 15 of Lecture notes in Computational Science and Engineering. Springer Science & Business Media, 2000. Joint Interdisciplinary Workshop of John Von Neumann Institute for Computing, Julich, and Institute of Applied Computer Science, Wuppertal University, August ¨ 1999. [14] C. F. Gauss. Methodus nova integralium valores per approximationem inveniendi. apvd Henricvm Dieterich, 1815. [15] W. Gautschi. A survey of Gauss-Christoffel quadrature formulae. In E.B. Christoffel, pages 72–147. Birkh¨auser, 1981. [16] G. H. Golub. Some modified matrix eigenvalue problems. Siam Review, 15(2):318–334, 1973. [17] G. H. Golub and G. Meurant. Matrices, moments and quadrature II; how to compute the norm of the error in iterative methods. BIT Numerical Mathematics, 37(3):687–705, 1997. [18] G. H. Golub and G. Meurant. Matrices, moments and quadrature with applications. Princeton University Press, 2009. [19] G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press, 2012. [20] G. H. Golub and J. H. Welsch. Calculation of Gauss quadrature rules. Mathematics of computation, 23(106):221–230, 1969. [21] G. H. Golub, M. Stoll, and A. Wathen. Approximation of the scattering amplitude and linear systems. Electronic Transactions on Numerical Analysis, 31:178–203, 2008. [22] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952. [23] J. B. Hough, M. Krishnapur, Y. Peres, and B. Vir´ag. Determinantal processes and independence. Probability Surveys, 2006. [24] B. Kang. Fast determinantal point process sampling with application to clustering. In Advances in Neural Information Processing Systems, pages 2319–2327, 2013. [25] A. Kulesza and B. Taskar. k-dpps: Fixed-size determinantal point processes. In Proceedings of the 19

28th International Conference on Machine Learning (ICML-11), pages 1193–1200, 2011. [26] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083, 2012. [27] J. T. Kwok and R. P. Adams. Priors for diversity in generative latent variable models. In Advances in Neural Information Processing Systems, pages 2996–3004, 2012. [28] C. Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Governm. Press Office, 1950. [29] C. E. Lee, A. E. Ozdaglar, and D. Shah. Solving systems of linear equations: Locally and asynchronously. arXiv, abs/1411.2647, 2014. [30] L. Lin, C. Yang, J. Lu, and L. Ying. A fast parallel algorithm for selected inversion of structured sparse matrices with application to 2d electronic structure calculations. SIAM Journal on Scientific Computing, pages 1329–1351, 2011. [31] L. Lin, C. Yang, J. C. Meza, J. Lu, L. Ying, and W. E. Selinv–an algorithm for selected inversion of a sparse symmetric matrix. ACM Transactions on Mathematical Software, 2011. [32] R. Lobatto. Lessen over de differentiaal-en integraal-rekening..., volume 2. De Gebroeders Van Cleef, 1852. [33] G. Meurant. The computation of bounds for the norm of the error in the conjugate gradient algorithm. Numerical Algorithms, 16(1):77–87, 1997. [34] G. Meurant. Numerical experiments in computing bounds for the norm of the error in the preconditioned conjugate gradient algorithm. Numerical Algorithms, 22(3-4):353–365, 1999. [35] G. Meurant. The Lanczos and conjugate gradient algorithms: from theory to finite precision computations, volume 19. SIAM, 2006. [36] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120. [37] C. C. Paige. The computation of eigenvalues and eigenvectors of very large sparse matrices. PhD thesis, University of London, 1971. [38] O. Papaspiliopoulos and G. O. Roberts. Retrospective markov chain monte carlo methods for dirichlet process hierarchical models. Biometrika, 95(1):169–186, 2008. [39] B. N. Parlett and D. S. Scott. The Lanczos algorithm with selective orthogonalization. Mathematics of computation, 33(145):217–238, 1979. ´ [40] R. Radau. Etude sur les formules d’approximation qui servent a` calculer la valeur num´erique d’une int´egrale d´efinie. Journal de Math´ematiques Pures et Appliqu´ees, pages 283–336, 1880. [41] C. E. Rasmussen. Gaussian processes for machine learning. MIT Press, 2006. [42] V. Rockov´a and E. I. George. Determinantal priors for variable selection, 2015. URL http://stat. wharton.upenn.edu/~vrockova/determinantal.pdf. [43] L. Schwartz. Mathematics for the physical sciences. New York, 1966. [44] J. Sherman and W. J. Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, pages 124–127, 1950. [45] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994. [46] R. B. Sidje and Y. Saad. Rational approximation to the fermi–dirac function with applications in density functional theory. Numerical Algorithms, pages 455–479, 2011. [47] D. A. Spielman and S.-H. Teng. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications, 35(3):835–885, 2014. [48] J. Stoer and R. Bulirsch. Introduction to numerical analysis, volume 12. Springer Science & Business Media, 2013. 20

[49] W. Wasow. A note on the inversion of matrices by random walks. Mathematical Tables and Other Aids to Computation, pages 78–81, 1952. [50] J. R. Westlake. A handbook of numerical matrix inversion and solution of linear equations, volume 767. Wiley New York, 1968.

21