
Algebraic two-level preconditioners for the Schur complement method

L. M. Carvalho, L. Giraud, P. Le Tallec

CERFACS Technical Report TR/PA/98/18, June 1998

Abstract

The solution of elliptic problems is challenging on parallel distributed memory computers because their Green's functions are global. To address this issue, we present a set of preconditioners for the Schur complement domain decomposition method. They implement a global coupling mechanism, through coarse space components, similar to the one proposed in [3]. The definition of the coarse space components is algebraic: they are defined using the mesh partitioning information and simple interpolation operators. These preconditioners are implemented on distributed memory computers without introducing any new global synchronization in the preconditioned conjugate gradient iteration. The numerical and parallel scalability of these preconditioners is illustrated on two-dimensional model examples that exhibit anisotropy and/or discontinuity phenomena.

Keywords: domain decomposition, two-level preconditioning, Schur complement, parallel distributed computing, elliptic partial differential equations.

1 Introduction

In recent years, there has been an important development of domain decomposition algorithms for the numerical solution of partial differential equations. Elliptic problems are challenging since their Green's functions are global: the solution at each point depends upon the data at all other points. Nowadays some methods possess optimal convergence rates for given classes of elliptic problems: it can be shown that the condition number of the associated preconditioned systems is independent of the number of subdomains and either independent of, or logarithmically dependent on, the size of the subdomains. These optimality and quasi-optimality properties are often achieved thanks to the solution of a coarse problem defined on the whole physical domain. Through the use of coarse spaces, this approach captures the global behaviour of the elliptic equations. Various domain decomposition techniques from the eighties and nineties have suggested different global coupling mechanisms and various combinations between them and the local preconditioners. In the framework of non-overlapping domain decomposition techniques, we refer for instance to BPS (Bramble, Pasciak and Schatz) [3], Vertex Space [7, 17], and to some extent Balancing Neumann-Neumann [13, 14, 15], as well as FETI [10, 16], for the presentation of major two-level preconditioners. We refer to [6] and [18] for a more exhaustive overview of domain decomposition techniques.

Affiliations: L. M. Carvalho, CERFACS, France, and COPPE-UFRJ, Brazil; this work was partially supported by FAPERJ-Brazil under grant 150.177/98. L. Giraud, CERFACS, 42 av. Gaspard Coriolis, 31057 Toulouse Cedex, France. P. Le Tallec, CEREMADE, Université Paris Dauphine, 75775 Paris Cedex 16, France.


In this paper, we present a contribution in the area of two-level domain decomposition methods for solving heterogeneous, anisotropic two-dimensional elliptic problems using algebraic constructions of the coarse space. Through an experimental study of the numerical and parallel scalability of the preconditioned Schur complement method, we investigate some new coarse-space preconditioners. They are closely related to BPS [3], although we propose different coarse spaces to construct their coarse components. Furthermore, we propose a parallel implementation of the preconditioned Schur complement method that does not require any new global synchronization in the iteration loop of the preconditioned conjugate gradient solver. The paper is organized as follows. In Section 2, we formulate the problem and introduce the notation. In Section 3, we describe the various coarse spaces we have considered when defining the preconditioners' coarse-space components. Section 4 provides some details on the parallel distributed implementation of the resulting domain decomposition algorithm. Computational results illustrating the numerical and parallel scalability of the preconditioners are given in Section 5.

2 Preliminaries and notations

This section is twofold. First, we formulate a two-dimensional elliptic model problem. Then, we introduce the notation that will allow us to define the coarse spaces we have considered in our preconditioners. We consider the following 2nd order self-adjoint elliptic problem on an open polygonal domain $\Omega$

included in $\mathbb{R}^2$:

$$
\begin{cases}
-\dfrac{\partial}{\partial x}\left(a(x,y)\dfrac{\partial v}{\partial x}\right) - \dfrac{\partial}{\partial y}\left(b(x,y)\dfrac{\partial v}{\partial y}\right) = F(x,y) & \text{in } \Omega,\\[4pt]
v = 0 & \text{on } \partial\Omega_{Dirichlet} \neq \emptyset,\\[4pt]
\dfrac{\partial v}{\partial n} = 0 & \text{on } \partial\Omega_{Neumann},
\end{cases}
\tag{1}
$$

where $a(x,y)$ and $b(x,y)$ are positive functions on $\Omega$. We assume that the domain $\Omega$ is partitioned into $N$ non-overlapping subdomains $\Omega_1, \ldots, \Omega_N$ with boundaries $\partial\Omega_1, \ldots, \partial\Omega_N$. We discretize (1) either by finite differences or finite elements, resulting in a symmetric positive definite linear system

$$ A u = f. \tag{2} $$
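For concreteness, the sketch below assembles one common variant of the five-point finite difference matrix for Equation (1) on a uniform grid of the unit square with homogeneous Dirichlet boundary conditions, evaluating the coefficients at the cell faces. The grid size and coefficient functions in the example are illustrative placeholders, not the exact settings used later in the paper.

```python
import numpy as np
import scipy.sparse as sp

def assemble_fd_matrix(n, a, b):
    """Five-point finite difference matrix for
    -d/dx(a dv/dx) - d/dy(b dv/dy) on the unit square,
    homogeneous Dirichlet boundary, n x n interior nodes.
    Coefficients are evaluated at the cell faces (one common choice)."""
    h = 1.0 / (n + 1)
    idx = lambda i, j: i * n + j                     # row-major numbering of node (i, j)
    rows, cols, vals = [], [], []
    for i in range(n):
        for j in range(n):
            x, y = (j + 1) * h, (i + 1) * h
            aw, ae = a(x - h / 2, y), a(x + h / 2, y)    # west / east faces
            bs, bn = b(x, y - h / 2), b(x, y + h / 2)    # south / north faces
            rows.append(idx(i, j)); cols.append(idx(i, j)); vals.append(aw + ae + bs + bn)
            if j > 0:     rows.append(idx(i, j)); cols.append(idx(i, j - 1)); vals.append(-aw)
            if j < n - 1: rows.append(idx(i, j)); cols.append(idx(i, j + 1)); vals.append(-ae)
            if i > 0:     rows.append(idx(i, j)); cols.append(idx(i - 1, j)); vals.append(-bs)
            if i < n - 1: rows.append(idx(i, j)); cols.append(idx(i + 1, j)); vals.append(-bn)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n * n, n * n)) / h**2

# Example: constant coefficients a = 10, b = 1 on a 32 x 32 interior grid.
A = assemble_fd_matrix(32, lambda x, y: 10.0, lambda x, y: 1.0)
```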

Let $B$ be the set of all the indices of the discretized points which belong to the interfaces between the subdomains. Grouping the points corresponding to $B$ in the vector $u_B$ and the ones corresponding to the interior $I$ of the subdomains in $u_I$, we get the reordered problem:

$$
\begin{pmatrix} A_{II} & A_{IB} \\ A_{IB}^T & A_{BB} \end{pmatrix}
\begin{pmatrix} u_I \\ u_B \end{pmatrix}
=
\begin{pmatrix} f_I \\ f_B \end{pmatrix}.
\tag{3}
$$

Eliminating $u_I$ from the second block row of (3) leads to the following reduced equation for $u_B$:

$$ S u_B = f_B - A_{IB}^T A_{II}^{-1} f_I, \tag{4} $$

where

$$ S = A_{BB} - A_{IB}^T A_{II}^{-1} A_{IB} $$

is the Schur complement of the matrix $A_{II}$ in $A$, and is usually referred to as the Schur complement matrix. To describe the preconditioners, we need to define a partition of $B$. Let $V_j$ be the singleton sets that contain one index related to one cross point and let $V = \cup_j V_j$ be the set of all those indices; each cross point is marked as a vertex point in Figure 1.
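In practice $S$ is never formed explicitly: its action on an interface vector only requires solves with $A_{II}$, which is block diagonal with one block per subdomain. A minimal serial sketch of this matrix-vector product, assuming the blocks of the reordered system (3) are available as sparse matrices, is given below.

```python
import scipy.sparse.linalg as spla

class SchurOperator:
    """Applies S = A_BB - A_IB^T A_II^{-1} A_IB without forming it,
    given the blocks of the reordered matrix (3)."""
    def __init__(self, A_II, A_IB, A_BB):
        self.A_IB = A_IB
        self.A_BB = A_BB
        self.A_II_lu = spla.splu(A_II.tocsc())    # factorize the interior block once

    def dot(self, u_B):
        w = self.A_II_lu.solve(self.A_IB @ u_B)   # one (block) Dirichlet solve
        return self.A_BB @ u_B - self.A_IB.T @ w

    def reduced_rhs(self, f_I, f_B):
        # right-hand side of (4): f_B - A_IB^T A_II^{-1} f_I
        return f_B - self.A_IB.T @ self.A_II_lu.solve(f_I)
```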


Figure 1: A 4 × 4 box-decomposition with edge and vertex points.

If $j \neq l$, $(j,l) \in \{1, 2, \ldots, N\}^2$, and $\Omega_j$ and $\Omega_l$ are neighboring subdomains (i.e. $\partial\Omega_j$ and $\partial\Omega_l$ share at least one edge of the mesh), then we can define each edge $E_i$ by

$$ E_i = (\partial\Omega_j \cap \partial\Omega_l) \setminus V. \tag{5} $$

In Figure 1, the points belonging to the $m$ edges $(E_i)_{i=1,m}$ are marked as edge points. We can thus describe the set $B$ as

$$ B = \Big( \bigcup_{i=1}^{m} E_i \Big) \cup V, \tag{6} $$

that is, a partition of the interface $B$ into $m$ edges $E_i$ and the set $V$. Here, we realize that we mix continuous curves, $\partial\Omega_i$, with sets of indices. This ambiguity can be disregarded if we consider that, in order to minimize notation, the symbols $\Omega_i$ and $\partial\Omega_i$ can represent either continuous sets or discrete sets of indices associated with the grid points.

3 The preconditioners

The classical BPS preconditioner [3] can be briefly described as follows. We first define a series of projection and interpolation operators. Specifically, for each $E_i$ we define $R_i = R_{E_i}$ as the standard pointwise restriction of nodal values to $E_i$. Its transpose extends grid functions in $E_i$ by zero on the rest of the interface $B$. Similarly, we define $R_V$ as the canonical restriction to $V$. Thus, we set $S_{ij} \equiv R_i S R_j^T$ and $S_V \equiv R_V S R_V^T$. Additionally, let us assume that $\Omega_1, \ldots, \Omega_N$ form the elements of a coarse grid mesh $\Omega_H$ with mesh size $H$. We then define grid transfer operators between the interface and the coarse grid. $R^T$ is an interpolation operator which corresponds to using linear interpolation between two adjacent cross points $V_j$, $V_k$ (i.e. adjacent points in $\Omega_H$ connected by an edge $E_i$) to define values on the edge $E_i$. Finally, $A_H$ is the Galerkin coarse grid operator $A_H = R A R^T$ defined on $\Omega_H$. With these notations, the BPS preconditioner is defined by

$$ M_{BPS} = \sum_{E_i} R_i^T S_{ii}^{-1} R_i + R^T A_H^{-1} R. \tag{7} $$

It can be interpreted as a generalized block Jacobi preconditioner for the Schur complement system (4), where the block diagonal preconditioning for $S_V$ is omitted and a residual correction is used on a coarse grid. The coarse grid term $R^T A_H^{-1} R$ incorporates a global coupling between the interfaces.

This global coupling is critical for scalability. In particular, it has been shown in [3] that, when the original BPS technique is applied to a uniformly elliptic operator, the preconditioned system has a condition number

$$ \kappa(M_{BPS} S) = O\big(1 + \log^2(H/h)\big), \tag{8} $$

where $h$ is the mesh size. This implies that the condition number depends only weakly on the mesh spacing and on the number of processors. Therefore, such a preconditioner is appropriate for large systems of equations on large processor systems. Similarly to BPS, we consider a class of preconditioners that can be written in a generic way as

$$ M = M_{local} + M_{global}, \tag{9} $$

where $M_{local}$ is a block diagonal preconditioner with one block for each edge $E_i$. More precisely, we replace $S_{ii}$ in (7) by an approximation $\tilde{S}_{ii}$ computed using a probing technique [5], [4]. $M_{global}$ is also computed using a Galerkin formula, but one that no longer involves $A$ but $S$, and that uses different coarse spaces and restriction operators. In the rest of this paper, we define various preconditioners that only differ in the definition of $M_{global}$. Our goal is to obtain performance similar to that of the original BPS preconditioner, even in the presence of anisotropy or heterogeneity, with a simple algebraic structure and parallel implementation strategy. Let $U$ be the space where the Schur complement matrix is defined and $U_0$ be a $q$-dimensional subspace of $U$; this subspace will be called the coarse space. Let

$$ R_0 : U \to U_0 \tag{10} $$

be a restriction operator which maps full vectors of $U$ into vectors in $U_0$, and let

$$ R_0^T : U_0 \to U \tag{11} $$

be the transpose of $R_0$, an extension operator which extends vectors from the coarse space $U_0$ to full vectors in the fine space $U$. The Galerkin coarse space operator

$$ A_0 = R_0 S R_0^T, \tag{12} $$

in some way, represents the Schur complement on the coarse space $U_0$. The global coupling mechanism is introduced by the coarse component of the preconditioner, which can thus be defined as

$$ M_{global} = R_0^T A_0^{-1} R_0. \tag{13} $$

The correctness of the operators we build is ensured by the following lemma:

Lemma 1 If the operator $R_0^T$ has full rank and if $S$ is non-singular, symmetric, and positive definite, then the matrix $A_0$, defined in Equation (12), is non-singular, symmetric, and positive definite.

The coarse-space preconditioners will only differ in the choice of the coarse space $U_0$ and the interpolation operator $R_0^T$. For convergence reasons, and similarly to the Neumann-Neumann and Balancing Neumann-Neumann preconditioners [13, 15], $R_0^T$ must be a partition of unity in $U$ in the sense that

$$ \sum_{k=1}^{q} R_0^T(i,k) = 1 \qquad \forall i. \tag{14} $$

With these notations and definitions, all the preconditioners presented in the sequel of this paper can be written as follows:

$$ M = \sum_{E_i} R_i^T \tilde{S}_{ii}^{-1} R_i + R_0^T A_0^{-1} R_0. \tag{15} $$
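To make Equations (12), (13) and (15) concrete, here is a schematic serial sketch of such a two-level preconditioner. It assumes that the probed edge blocks $\tilde{S}_{ii}$, the index set of each edge, a coarse restriction $R_0$ (any of the choices described in the following subsections), and a routine applying $S$ are already available; all names are illustrative, not the authors' code.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def make_two_level_preconditioner(edge_dofs, S_tilde_blocks, R0, apply_S):
    """Return a function r -> M r implementing Equation (15):
    M = sum_i R_i^T S~_ii^{-1} R_i  +  R_0^T A_0^{-1} R_0."""
    # factorize each (small) probed edge block once
    edge_lu = [spla.splu(sp.csc_matrix(Sb)) for Sb in S_tilde_blocks]
    # Galerkin coarse operator A_0 = R_0 S R_0^T, built column by column (Equation (12))
    q = R0.shape[0]
    A0 = np.empty((q, q))
    for k in range(q):
        e_k = np.zeros(q); e_k[k] = 1.0
        A0[:, k] = R0 @ apply_S(R0.T @ e_k)
    A0_lu = spla.splu(sp.csc_matrix(A0))

    def apply_M(r):
        z = np.zeros_like(r)
        for dofs, lu in zip(edge_dofs, edge_lu):   # local, edge-by-edge solves
            z[dofs] += lu.solve(r[dofs])
        z += R0.T @ A0_lu.solve(R0 @ r)            # coarse correction (Equation (13))
        return z
    return apply_M
```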

In the next sections, we define various coarse spaces and restriction operators which can be used in a very general framework. The coarse spaces are defined by a set of piecewise constant vectors $Z_k$ of $U$ that span the subspace $U_0$. The shape defined by the constant non-zero entries of $Z_k$ has inspired the name of each coarse space. Although a large number of different preconditioners can then be proposed, we restrict our study to five combinations of coarse spaces and restriction operators, and we discuss them in the remainder of this paper.

3.1 Vertex based coarse space

The first coarse space we consider is similar to the BPS one. Each degree of freedom is associated with one vertex $V_j$, and the basis functions of the coarse space can be defined as follows. Let $V_k \subset V$ be a singleton set that contains the index associated with a cross point and let $(E_j)_{j \in J_k}$ be the edges adjacent to $V_k$. Let $m_c$ denote the number of vertex points; then

$$ \tilde{I}_k = \bigcup_{j \in J_k} E_j \cup V_k $$

is the set of indices we associate with the cross point $V_k$. Let $Z_k$ be defined on $B$ and $Z_k(i)$ be its $i$-th component. Then, the vertex based coarse space $U_0$ can be defined as

$$ U_0 = \mathrm{span}[Z_k], \quad k = 1, 2, \ldots, m_c, \qquad \text{where } Z_k(i) = \begin{cases} 1 & \text{if } i \in \tilde{I}_k, \\ 0 & \text{elsewhere.} \end{cases} \tag{16} $$

The set $\tilde{I}_k$ associated with a cross point $V_k$ is depicted in Figure 2. The set of vectors $\{Z_1, Z_2, \ldots, Z_{m_c}\}$ forms an ordered basis for the subspace $U_0$, as these vectors span $U_0$ by construction and they are linearly independent. For this coarse space, we consider three different restriction operators $R_0$.

3.1.1 Flat restriction operator

This operator returns, for each set $\tilde{I}_k$, the weighted sum of the values at all the nodes in $\tilde{I}_k$. The weights are determined by the inverse of the number of sets $\tilde{I}_k$, $k \in \{1, 2, \ldots, m_c\}$, that a given node belongs to. For a 2D problem, the weight for the cross points is 1 and for an edge point is 1/2.
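A sketch of this flat restriction operator as a sparse matrix: row $k$ carries weight 1 on the cross point of $\tilde{I}_k$ and weight 1/2 on the points of its adjacent edges, so that the columns of $R_0$ sum to one, which is the partition-of-unity condition (14). The tiny interface used in the example (three cross points joined by two three-node edges) is made up purely for illustration.

```python
import numpy as np
import scipy.sparse as sp

def flat_vertex_restriction(n_B, cross_points, adjacent_edge_dofs):
    """Build R0 (m_c x n_B) for the vertex-based coarse space with flat weights:
    weight 1 at the cross point of I~_k, 1/2 at each adjacent edge point
    (each interior edge point is shared by exactly two sets)."""
    rows, cols, vals = [], [], []
    for k, (v, edges) in enumerate(zip(cross_points, adjacent_edge_dofs)):
        rows.append(k); cols.append(v); vals.append(1.0)   # the cross point itself
        for e in edges:                                    # dofs of the adjacent edges
            for i in e:
                rows.append(k); cols.append(i); vals.append(0.5)
    return sp.csr_matrix((vals, (rows, cols)), shape=(len(cross_points), n_B))

# Toy interface: cross points 0, 4, 8 joined by edges {1,2,3} and {5,6,7}.
R0 = flat_vertex_restriction(
    n_B=9,
    cross_points=[0, 4, 8],
    adjacent_edge_dofs=[[[1, 2, 3]], [[1, 2, 3], [5, 6, 7]], [[5, 6, 7]]])
assert np.allclose(np.asarray(R0.sum(axis=0)).ravel(), 1.0)   # condition (14)
```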


3.1.2 Linear restriction operator

The transpose of this operator, $R_0^T$, is the linear interpolation. When block Jacobi is used as the local preconditioner, combined with the vertex coarse space using this linear restriction operator, the resulting preconditioner is equivalent to the one proposed in [3]. Let $A$ be the matrix defined in Equation (3). Let $A_V$ be the Galerkin coarse grid operator associated with $A$, defined by

$$ A_V = \tilde{R}_0 A \tilde{R}_0^T, \tag{17} $$

where $\tilde{R}_0$ is a restriction operator from $\Omega$ to the coarse space $U_0$. It has been shown [3], [18] that, in general, for elliptic problems the operator $A_V$ is spectrally equivalent to

$$ R_0 S R_0^T, \tag{18} $$

and that in a few cases these coarse operators are even equal. If we had used the approach presented in (17), the construction of the coarse space would have been reduced to a few matrix-vector multiplications with $A$ and the factorization of $A_V$. Nevertheless, we deal with problems for which only the spectral equivalence between (17) and (18) is ensured. For this reason, we have preferred to use the Galerkin coarse grid correction with the Schur complement matrix, as described in (18).

3.1.3 Operator-dependent restriction operator

The origin of the operator-dependent restriction is the operator-dependent transfer operator proposed in the framework of multigrid methods; see [19] and references therein. In [12], the authors proposed an extension of these operators for two-dimensional non-overlapping domain decomposition methods. The general purpose of these operators is to construct, from $u$ defined on $V$, an interpolation $\tilde{u}$ defined on $B$ that is piecewise linear, so that $a \frac{\partial\tilde{u}}{\partial x}$ and $b \frac{\partial\tilde{u}}{\partial y}$ are continuous even when either $a$ or $b$ in (1) is discontinuous along an edge $E_i$. We omit the computational details, and only note that such operators:
- can be constructed by solving one tridiagonal linear system for each $E_i$, whose size is the number of nodes on $E_i$;
- reduce to linear interpolation when $a$ and $b$ are smooth.

Figure 2: Points of $B$ where the coordinates of a basis vector of the "vertex" space are different from zero.


3.2 Subdomain based coarse space

With this coarse space, we associate one degree of freedom with each subdomain. Let $B$ be as defined in Equation (6). Let $\Omega_k$ be a subdomain and $\partial\Omega_k$ its boundary. Then

$$ \bar{I}_k = \partial\Omega_k \cap B $$

is the set of indices we associate with the domain $\Omega_k$. Figure 3 shows the elements of one such set $\bar{I}_k$. Let $Z_k$ be defined on $B$ and $Z_k(i)$ be its $i$-th component. Then, the subdomain-based coarse space $U_0$ can be defined as

$$ U_0 = \mathrm{span}[Z_k], \quad k = 1, 2, \ldots, N, \qquad \text{where } Z_k(i) = \begin{cases} 1 & \text{if } i \in \bar{I}_k, \\ 0 & \text{otherwise.} \end{cases} \tag{19} $$

Figure 3: Points of $B$ where the coordinates of a basis vector of the "subdomain" space are different from zero.

Notice that, for the example depicted in Figure 3, $[Z_k]$ is rank deficient. Indeed, if we consider $\tilde{v} = \sum_{i=1}^{N} \alpha_i Z_i$, where the $\alpha_i$ are equal to $-1$ and $+1$ in a checker-board pattern, it is easy to see that $\tilde{v} = 0$. Nevertheless, this rank deficiency can be easily removed by discarding one of the vectors of $[Z_k]$. In this particular situation, the set of vectors $\{Z_1, Z_2, \ldots, Z_{N-1}\}$ forms an ordered basis for the subspace $U_0$. The restriction operator $R_0$ we consider returns, for each subdomain $(\Omega_i)_{i=1,N-1}$, the weighted sum of the values at all the nodes on the boundary of that subdomain. The weights are determined by the inverse of the number of subdomains in $(\Omega_i)_{i=1,N-1}$ that each node belongs to. For all the nodes except those on $\partial\Omega_N$ (in our particular example), this weight is $1/2$ for the points on an edge and $1/4$ for the cross points.
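The rank deficiency and the checker-board argument can be verified numerically on a toy 2 × 2 box decomposition; the interface numbering below (one cross point and four two-node edges) is invented purely for this check.

```python
import numpy as np

# Interface of a 2 x 2 box decomposition: one cross point (index 0)
# and four edges of two nodes each (indices 1..8); made-up numbering.
n_B = 9
boundary_dofs = [            # dofs of each subdomain boundary intersected with B
    [0, 1, 2, 3, 4],         # Omega_1
    [0, 1, 2, 5, 6],         # Omega_2
    [0, 3, 4, 7, 8],         # Omega_3
    [0, 5, 6, 7, 8],         # Omega_4
]
Z = np.zeros((n_B, 4))
for k, dofs in enumerate(boundary_dofs):
    Z[dofs, k] = 1.0

alpha = np.array([+1, -1, -1, +1])       # checker-board signs
print(np.linalg.norm(Z @ alpha))         # 0.0: the four vectors are linearly dependent
print(np.linalg.matrix_rank(Z))          # 3: dropping one Z_k restores a basis
```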

Remark 1 Although used in a completely different context, this coarse space is similar to the one used in the Balancing Neumann-Neumann preconditioner for Poisson-type problems [14]. We use one basis vector for each subdomain, whereas in Balancing Neumann-Neumann the basis functions are only defined for interior subdomains, because in these subdomains the local Neumann problems are singular.


3.3 Edge based coarse space

We refine the coarse space based on subdomains and introduce one degree of freedom per interface between two neighboring subdomains, that is, whenever $\partial\Omega_i$ and $\partial\Omega_j$ share at least one edge of the mesh. Let $E_k$ be an edge and $V_j$ and $V_l$ its adjacent cross points; then

$$ \hat{I}_k = E_k \cup V_j \cup V_l $$

is the set of indices we associate with the edge $E_k$. Let $Z_k$ be defined on $B$ and $Z_k(i)$ be its $i$-th component. Let $m_e$ denote the number of edges $E_i \subset B$. Then, the edge based coarse space $U_0$ can be defined as

$$ U_0 = \mathrm{span}[Z_k], \quad k = 1, 2, \ldots, m_e, \qquad \text{where } Z_k(i) = \begin{cases} 1 & \text{if } i \in \hat{I}_k, \\ 0 & \text{otherwise.} \end{cases} \tag{20} $$

Figure 4: Points of $B$ where the coordinates of a basis vector of the "edge" space are different from zero.

The set $\hat{I}_k$ associated with an element of the coarse space $U_0$ is depicted in Figure 4. The set of vectors $\{Z_1, Z_2, \ldots, Z_{m_e}\}$ forms an ordered basis for the subspace $U_0$; as before, these vectors span $U_0$ by construction and they are linearly independent. The restriction operator $R_0$ we consider returns, for each edge, the weighted sum of the values at all the nodes on that edge. The weights are determined by the inverse of the number of edges each node belongs to. For the decomposition depicted in Figure 1, the weights for the restriction operator are 1 for points belonging to $E_k$ and $1/4$ for those belonging to $V_l$ and $V_j$.

4 Parallel implementation

The independent local PDE solves arising in domain decomposition techniques make them particularly suitable for parallel distributed computation. In a parallel distributed memory environment, each subdomain can be assigned to a different processor. With this mapping, all the numerical kernels but two in the preconditioned conjugate gradient can be implemented either without communication or with only neighbor-to-neighbor communication. The only two steps that require global communication are the dot product computation and the solution of the coarse problem performed

at each iteration. In the sequel, we describe how the construction of the coarse components of our preconditioners can be performed in parallel with only one reduction. We will also show how the coarse problem solution can be implemented without any extra global communication within the iteration loop.

4.1 Construction of the preconditioner

The linear systems associated with any of the coarse spaces described above are much smaller than the linear systems associated with any local Dirichlet problem, which has to be solved when computing the matrix-vector product by $S$. In this respect, we construct the coarse problem in parallel but solve it redundantly on each processor within a preconditioned conjugate gradient iteration. The parallel assembly of the coarse problem is straightforward thanks to its simple general form $A_0 = R_0 S R_0^T = \sum_{i=1}^{N} R_0 S^{(i)} R_0^T$. Here $S^{(i)}$ denotes the contribution of the $i$-th subdomain to the complete Schur complement matrix. Each processor computes the contribution of its subdomain to some entries of $A_0$, then performs a global sum reduction that assembles on each processor the complete coarse matrix $A_0$. This latter step can be implemented using a message-passing functionality like MPI_Allreduce. The most expensive part of the construction is the solution of the Dirichlet problems on each subdomain. For each processor, the number of solutions is equal to the number of basis function supports that intersect the boundary of the subdomain the processor is in charge of. For a box decomposition of a uniform finite element or finite difference mesh, the number of Dirichlet problem solutions to be performed by an internal subdomain is:

- four for the vertex-based coarse component,
- eight for the subdomain based coarse component (which can reduce to four for a five-point finite difference scheme, as the rows in $A$ associated with the cross points are unchanged in $S$),
- eight for the edge based coarse component (which can reduce to four for a five-point finite difference scheme).

Generally, on each subdomain, one has to solve a Dirichlet problem with non-homogeneous boundary conditions for each of its edges $E_j$ and each of its vertex points $V_j$.
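A schematic mpi4py version of this assembly: each process forms its subdomain's contribution $R_0 S^{(i)} R_0^T$, abstracted here as a callback, and a single MPI_Allreduce sums the small dense $q \times q$ coarse matrix onto every process, which can then factorize it redundantly. The function names and data layout are illustrative assumptions, not the authors' code.

```python
import numpy as np
from mpi4py import MPI

def assemble_coarse_matrix(R0_local, apply_S_local, q, comm=MPI.COMM_WORLD):
    """A_0 = R_0 S R_0^T = sum_i R_0 S^(i) R_0^T, assembled with one global reduction.
    R0_local      -- the q x n_i block of R_0 acting on this subdomain's interface dofs
    apply_S_local -- applies the local Schur contribution S^(i) to a local interface vector
    """
    A0_local = np.zeros((q, q))
    for k in range(q):
        e_k = np.zeros(q); e_k[k] = 1.0
        col = R0_local.T @ e_k                    # local piece of R_0^T e_k
        if np.any(col):                           # skip coarse dofs not touching this subdomain
            A0_local[:, k] = R0_local @ apply_S_local(col)
    A0 = np.empty_like(A0_local)
    comm.Allreduce(A0_local, A0, op=MPI.SUM)      # assembles the full A_0 on every process
    return A0                                     # each process factorizes A_0 redundantly
```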

4.2 Application of the preconditioner

Having made the choice of a redundant solution of the coarse component on each processor, we can further exploit this formulation to avoid introducing any new global synchronization in the preconditioned conjugate gradient (PCG) iterations. If we unroll Equation (21) from Algorithm 1 using the general definition of the preconditioner, we have

$$ z_k = \Big( \sum_{E_i} R_i^T \tilde{S}_{ii}^{-1} R_i + R_0^T A_0^{-1} R_0 \Big) r_k = \sum_{E_i} \big( R_i^T \tilde{S}_{ii}^{-1} R_i r_k \big) + R_0^T A_0^{-1} R_0 r_k. \tag{24} $$

    x^(0) = 0;  r^(0) = b
    repeat
        z^(k-1) = M r^(k-1)                                              (21)
        if k = 1 then
            p^(1) = z^(0)
        else
            β^(k-1) = z^(k-1)T r^(k-1) / z^(k-2)T r^(k-2)                (22)
            p^(k) = z^(k-1) + β^(k-1) p^(k-1)                            (23)
        endif
        q^(k) = A p^(k)
        α^(k) = z^(k-1)T r^(k-1) / p^(k)T q^(k)
        x^(k) = x^(k-1) + α^(k) p^(k)
        r^(k) = r^(k-1) - α^(k) q^(k)
    until convergence

Algorithm 1: PCG algorithm.

Each term of the summation in Equation (24) is computed by one processor with possibly one neighbor-to-neighbor communication. Furthermore, the numerator of Equation (22) can also be rewritten as

$$ (r_k, z_k) = (r_k, M r_k) = \Big( r_k, \big( \sum_{E_i} R_i^T \tilde{S}_{ii}^{-1} R_i + R_0^T A_0^{-1} R_0 \big) r_k \Big) = \sum_{E_i} \big( R_i r_k, \tilde{S}_{ii}^{-1} R_i r_k \big) + \big( R_0 r_k, A_0^{-1} R_0 r_k \big). \tag{25} $$

The right-hand side of (25) has two parts. The first is naturally local because it is related to the diagonal block preconditioner. The second, with the presented formulation, is global but does not require any new global reduction. $R_0 r_k$ is actually composed of entries that are calculated in each subdomain ("interface" coarse space) or group of neighboring subdomains ("vertex" and "domain" coarse spaces). After being locally computed, the $R_0 r_k$ entries are gathered on all the processors thanks to the reduction used to assemble the local partial dot products $(R_i r_k, \tilde{S}_{ii}^{-1} R_i r_k)$. At this stage, the solution $A_0^{-1} R_0 r_k$ can be performed redundantly on each processor, and $\beta^{(k-1)}$ in Equation (22) can be computed by each processor. Rewriting these steps in the iteration loop allows us to introduce the coarse component without any extra global synchronization. With this approach, we avoid a well-known bottleneck when using Krylov methods on parallel distributed memory computers.
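To illustrate this single-reduction organization, here is a schematic mpi4py sketch of one preconditioner application fused with the inner product of Equation (25): the local partial dot product and the locally computed entries of $R_0 r_k$ are packed into one buffer and combined in a single MPI_Allreduce, after which every process solves the coarse problem redundantly. All names, and the assumption that each process owns the interface unknowns of the edges it handles, are illustrative.

```python
import numpy as np
from mpi4py import MPI

def apply_M_and_dot(r_owned, local_solve, R0_owned, A0_solve, R0T_owned,
                    comm=MPI.COMM_WORLD):
    """One application z = M r fused with the inner product (r, z), Equation (25).
    r_owned     -- interface entries of r owned by this process (one edge per summation term)
    local_solve -- applies the owned part of sum_i R_i^T S~_ii^{-1} R_i
    R0_owned    -- maps owned interface values to this process's share of R_0 r
    A0_solve    -- solves with the redundantly stored coarse matrix A_0
    R0T_owned   -- maps a coarse vector back to the owned interface values (rows of R_0^T)
    """
    z_owned = local_solve(r_owned)                # local part of M r
    partial = np.dot(r_owned, z_owned)            # local share of sum_i (R_i r, S~_ii^{-1} R_i r)

    # pack [partial dot | local share of R_0 r] so one Allreduce does both jobs
    buf = np.concatenate(([partial], R0_owned @ r_owned))
    out = np.empty_like(buf)
    comm.Allreduce(buf, out, op=MPI.SUM)          # the only global communication

    local_dot, R0_r = out[0], out[1:]
    y0 = A0_solve(R0_r)                           # coarse solve, redundant on every process
    z_owned = z_owned + R0T_owned @ y0            # add the coarse correction R_0^T A_0^{-1} R_0 r
    r_dot_z = local_dot + np.dot(R0_r, y0)        # (r, z) as in Equation (25)
    return z_owned, r_dot_z
```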

5 Numerical experiments

We consider the solution of Equation (1) discretized by a five-point central difference scheme on a uniform mesh, using a preconditioned Schur complement method. We illustrate the numerical

scalability of the proposed preconditioners on academic two-dimensional model test cases that have both anisotropy and discontinuity. For all the experiments, convergence is attained when the 2-norm of the residual, normalized by the 2-norm of the right-hand side, is less than $10^{-5}$. All the computations are performed using 64-bit arithmetic.

5.1 Model problems

For the numerical experiments, we consider model problems that have both discontinuous and anisotropic phenomena. These difficulties arise in the equations involved in semiconductor device simulation, which was actually one of the main motivations for this study. Figure 5 represents a unit square divided into six regions with piecewise constant functions $g_j$, $j = 1$ or $3$. We consider the problems as having low intensity if $j = 1$ and high intensity if $j = 3$. Let $a$ and $b$ be the functions of the elliptic problem as described in Equation (1). Using this notation, we can define different problems with different degrees of difficulty. In the following description, the acronyms in capital letters are used to refer to the problems in Tables 2 and 3.

- high discontinuity (HD): $a = b = g_3$,
- low discontinuity (LD): $a = b = g_1$,
- high anisotropy (HA): $a = 10^3$ and $b = 1$,
- low anisotropy (LA): $a = 10$ and $b = 1$,
- high discontinuity and high anisotropy (HDA): $a = g_3$ and $b = 1$,
- low discontinuity and low anisotropy (LDA): $a = g_1$ and $b = 1$.

Figure 5: Definition of two discontinuous functions on $\Omega$, the unit square of $\mathbb{R}^2$; in the six regions, $g_j$ alternates between $10^j$ and $1$.

The preconditioners with the coarse components are denoted in the tables as follows:

sd: "subdomain", defined in Section 3.2,
vl: "vertex-linear", defined in Section 3.1 with the linear interpolation,
vo: "vertex-operator-dependent", defined in Section 3.1.3 with the operator-dependent interpolation,
vf: "vertex-flat", defined in Section 3.1 with the flat interpolation,


ed: "edge", defined in Section 3.3,
no: without coarse space; in this case we only use the local preconditioner (block diagonal).

5.2 Numerical behaviour

We study the numerical scalability of the preconditioners by investigating the dependence of the convergence on the number and on the size of the subdomains. First, in Table 1, we give the figures for the five coarse spaces when solving Poisson's problem. As we could expect, the behaviour of all coarse-space options does not depend on the number of subdomains. There is only a small growth in the number of iterations for "subdomain", "vertex-flat" and "edge" when the number of subdomains goes from 16 to 64.

# subdomains   16   64   256   1024
no             15   28    48     90
sd             15   19    19     18
vf             15   18    18     18
vl             10   10    10     10
vo             10   10    10     10
ed             15   18    18     18

Table 1: Number of PCG iterations when varying the number of subdomains, with 16 × 16 points per subdomain, for Poisson's problem.

In Table 2, we report the number of preconditioned conjugate gradient iterations for each model problem. For these tests, we vary the number of subdomains while keeping their sizes constant (i.e. $H$ variable with $H/h$ constant). In this table, each subdomain is a 16 × 16 grid and the number of subdomains goes from 16 up to 1024 using a box decomposition, that is, from a 4 × 4 decomposition up to a 32 × 32 decomposition. In the first row, we see the growth of the number of iterations of a block Jacobi preconditioner without any coarse grid correction. Its numerical behaviour is well known and is governed by the following theoretical bound for its condition number (see for instance [6]):

$$ \mathrm{cond}(M_{bJ} S) \leq C H^{-2} \big( 1 + \log^2(H/h) \big), \tag{26} $$

where $M_{bJ}$ denotes the block Jacobi preconditioner and $C$ is a constant independent of $H$ and $h$. The number of iterations of the block Jacobi preconditioner, reported in the first row of the tables, doubles successively, which is consistent with the theoretical upper bound in (26). For HD and LD, the behaviour of all coarse alternatives is similar to the one observed for Poisson's problem in Table 1. For LA and LDA, this similarity still holds for "vertex-linear" and "vertex-operator-dependent"; the three others exhibit a slight, insignificant increase in the number of iterations. All the preconditioners but "vertex-linear" are quite insensitive to discontinuity. Further, the "vertex-operator-dependent" alternative, specifically designed to handle interfaces that cross a discontinuity, has the best performance. For HA, the coarse spaces do not improve the convergence for a number of subdomains less than 1024. It seems that on this example the condition numbers of the preconditioned linear systems with and without the coarse component are comparable: for instance, $2 \times 10^2$ with and $6 \times 10^2$ without the vertex-linear coarse component using 256 subdomains. More precisely, the smallest eigenvalue is not as affected by the use of the coarse component as it is for the other model problems.


                    LA                      LD                      LDA
# subdomains   16  64  256 1024  |  16  64  256 1024  |  16   64  256 1024
no             17  33   59  114  |  25  47   83  158  |  29   55  104  194
sd             18  25   27   28  |  19  19   19   19  |  22   30   33   34
vf             19  24   29   31  |  20  21   21   21  |  23   28   31   32
vl             15  17   17   17  |  13  13   12   12  |  14   16   17   17
vo             15  17   17   17  |  11  11   11   11  |  14   16   17   17
ed             19  26   27   28  |  20  20   18   18  |  21   26   27   28

                    HA                      HD                      HDA
# subdomains   16  64  256 1024  |  16  64  256 1024  |  16   64  256 1024
no             19  42   69  127  |  25  50   87  172  |  37  149  302  629
sd             30  64   75   86  |  18  19   19   19  |  30   64   81   83
vf             27  52   72   85  |  21  22   22   21  |  31   76   86   99
vl             26  45   66   73  |  16  18   18   16  |  21   63   81   89
vo             26  45   66   73  |  11  11   11   11  |  20   60   81   88
ed             24  43   57   69  |  17  19   19   18  |  31   62   70   77

Table 2: Number of PCG iterations when varying the number of subdomains, with 16 × 16 points per subdomain.

In the presence of high anisotropy (HA and HDA), the convergence rate of all the alternatives is comparable and depends on the number of subdomains, while an asymptotic behaviour tends to appear when the number of subdomains increases. To study the sensitivity of the preconditioners to the size of the subdomains (i.e. $H/h$ variable with $H$ constant), we report in Table 3 the experiments observed with 256 subdomains (16 × 16 box decomposition) when the size of the subdomains varies from 8 × 8 up to 64 × 64. We observe that the convergence of all the preconditioners depends only slightly on the size of the subdomains. Furthermore, in the anisotropic experiments (HA and HDA), this dependence is surprisingly negligible for the "vertex-linear" and the "vertex-operator-dependent" alternatives. The number of iterations of those two preconditioners tends to be stable around 64 and 80 for the problems HA and HDA, respectively. On problems that are not highly anisotropic, all the coarse components give rise to preconditioners that are independent of the number of subdomains and that have a weak dependence on the size of the subdomains. Furthermore, those experiments tend to show that the most dominating component is not the granularity of the coarse space (the finest is not the best) but the restriction/interpolation operator $R_0$. This operator governs, in most cases, the quality of the coarse representation of the complete equation. The only example that seems to contradict this observation is "edge" on the HA and HDA examples. However, it could be argued that, because we consider the anisotropy aligned with the discretization and because we use regular box decompositions, two opposite edges $E_i$ of a subdomain are strongly coupled. The "edge" coarse space captures this strong coupling while the other alternatives mix, and therefore miss, this information. This latter behaviour is related to the fact that the supports of the basis functions of all coarse spaces but "edge" contain at least two weakly coupled edges. So, the transfer operators are not able, in this specific case, to retrieve and to spread the most appropriate information.


                      LA                        LD                        LDA
subdomain size   64  256 1024 4096  |  64  256 1024 4096  |  64  256 1024 4096
no               66   69   78   96  |  73   83  106  133  |  97  104  119  150
sd               30   27   36   42  |  16   19   23   30  |  29   33   36   45
vf               25   29   32   35  |  18   21   24   31  |  27   31   34   40
vl               15   17   19   21  |  12   14   15   19  |  15   18   19   22
vo               15   17   19   21  |   9   11   12   16  |  15   17   19   22
ed               24   27   32   34  |  16   18   22   29  |  24   27   31   36

                      HA                        HD                        HDA
subdomain size   64  256 1024 4096  |  64  256 1024 4096  |  64  256 1024 4096
no               66   69   73   76  |  78   87  116  141  | 280  302  297  389
sd               65   75   76   76  |  17   19   23   31  |  65   81   87   96
vf               66   72   75   76  |  20   22   25   31  |  80   86   89   91
vl               60   64   64   63  |  16   18   18   23  |  79   81   80   77
vo               60   66   64   63  |  10   11   13   16  |  77   81   79   80
ed               51   57   59   63  |  16   19   22   29  |  63   70   79   86

Table 3: Number of PCG iterations when varying the grid size of the subdomains from 8 × 8 (64 points) up to 64 × 64 (4096 points), using a 16 × 16 decomposition.

5.3 Parallel behaviour

We investigate the parallel scalability of the proposed implementation of the preconditioners. For each experiment, we map one subdomain onto each processor of the parallel computer; in the sequel, the number of subdomains and the number of processors are always the same. The target computer is a 128-node Cray T3D located at CERFACS, using MPI as the message-passing library. The solution of the local Dirichlet problems is performed using either the sparse direct solver MA27 [9] from the Harwell Subroutine Library [2] or a skyline solver [11]. The factorization of the banded probed matrices, used in the local part of the preconditioners (see Equation (15)), is performed using a LAPACK [1] band solver. To study the parallel behaviour of the code, we report the maximum elapsed time (in seconds) spent by one of the processors in each of the main steps of the domain decomposition method when the number of processors is varied, for solving the standard Poisson equation. The first row, entitled "init.", corresponds mainly to the time for factorizing the matrices associated with the local Dirichlet problems; "setup local" is the time to construct and factorize the probing approximations $\tilde{S}_{ii}$; "setup coarse" is the time required to construct and factorize the matrix associated with the coarse problem; "iter." is the time spent in the iteration loop of the preconditioned conjugate gradient. Finally, the row "total" allows us to evaluate the parallel scalability of the complete method (i.e. numerical behaviour and parallel implementation), while "time per iter." only illustrates the scalability of the parallel implementation of the preconditioned conjugate gradient iterations. The elapsed time corresponds to a maximum and there is some imbalance among the processors in the different kernels; therefore, the reported total time differs from the sum of the times for the individual kernels. For each number of processors, we give in the left column the results with the considered two-level preconditioner, and in the right column we show the results with only the local part of the preconditioner. We report experiments with the domain-based coarse space in Table 4. Results with the vertex-based coarse space are displayed in Table 5. For those experiments, we use MA27 to solve the local Dirichlet problems defined on 100 × 100 grids; that subdomain size was the largest we could use given the 128 MB of memory available on each node of the target computer.

# procs              4              8             16             32             64            128
                2-level local  2-level local  2-level local  2-level local  2-level local  2-level local
init.            2.58   2.58    2.57   2.57    2.57   2.57    2.57   2.57    2.57   2.57    2.57   2.57
setup local      0.99   0.80    1.00   1.01    1.40   1.31    1.40   1.30    1.40   1.31    1.40   1.32
setup coarse     0.25   0.00    0.48   0.00    0.86   0.00    0.85   0.00    0.87   0.00    0.93   0.00
iter.            1.84   1.98    2.55   3.93    3.01   5.14    4.12   7.16    3.79   9.80    4.91  13.26
total            5.33   5.65    6.23   7.26    7.14   8.50    8.25  10.51    7.93  13.13    9.14  16.60
# iter.            16     20      22     33      26     41      35     58      32     80      40    109
time per iter.   0.12   0.12    0.12   0.12    0.12   0.13    0.12   0.12    0.12   0.12    0.12   0.12

Table 4: Poisson, MA27. Elapsed time (in seconds) in each main numerical step when varying the number of processors, with 100 × 100 points per subdomain, using the domain based coarse space; for each number of processors, the left column is the two-level preconditioner and the right column the local preconditioner only.

# procs              4              8             16             32             64            128
                2-level local  2-level local  2-level local  2-level local  2-level local  2-level local
init.            2.58   2.58    2.57   2.57    2.57   2.57    2.57   2.57    2.57   2.57    2.57   2.57
setup local      0.66   0.80    0.94   1.01    1.24   1.31    1.24   1.30    1.24   1.31    1.25   1.32
setup coarse     0.72   0.00    0.74   0.00    0.90   0.00    0.90   0.00    0.92   0.00    1.07   0.00
iter.            1.85   2.31    1.97   3.93    2.10   5.14    2.36   7.16    2.03   9.80    2.45  13.26
total            5.61   5.65    6.09   7.26    6.65   8.50    6.91  10.51    6.59  13.13    7.16  16.60
# iter.            16     20      17     33      18     41      20     58      17     80      20    109
time per iter.   0.12   0.12    0.12   0.12    0.12   0.13    0.12   0.12    0.12   0.12    0.12   0.12

Table 5: Poisson, MA27. Elapsed time (in seconds) in each main numerical step when varying the number of processors, with 100 × 100 points per subdomain, using the vertex based coarse space; column layout as in Table 4.

We can first observe that the numerical behaviour of those preconditioners is again independent of the number of subdomains. It can be seen that the parallel implementation of the Schur complement method with only a local preconditioner scales perfectly, as the time per iteration is constant and does not depend on the number of processors (i.e. 0.115 seconds on 4 processors and 0.118 on 128 nodes). The same scalable behaviour is also observed when the coarse components, vertex or subdomain, are introduced. For instance, with the vertex-based preconditioner, the time per iteration grows from 0.116 seconds on 4 processors up to 0.122 seconds on 128 processors. There are two main reasons for this scalable behaviour. First, the solution of the coarse problem is negligible compared to the solution of the local Dirichlet problems. Second, the parallel implementation of the coarse components does not introduce any extra global communication. In any case, the methods scale fairly well when the number of processors grows from 8 (to solve a problem with 80 000 unknowns) up to 128 (to solve a problem with 1.28 million unknowns). The ratios between the total elapsed times for running on 128 and on 8 processors are 1.18 with the vertex-based coarse preconditioner and 1.47 with the domain-based one. The latter, larger value is only due to an increase of the number of iterations. One of the most expensive kernels of this method is the factorization of the local Dirichlet problems. Therefore, the tremendous reduction in the number of iterations induced by the use of the coarse alternatives (five times fewer iterations for the vertex-based preconditioner) is not reflected directly in a reduction of the total time. The total time is an affine function of the number of iterations, with an incompressible overhead due to the factorization at the very beginning of the domain decomposition method.

Nevertheless, if we use a less efficient local solver, which slows down the iteration time, the gap between the solution times with and without the coarse component increases. To illustrate this behaviour, we display in Table 6 the parallel performance of the vertex-based coarse preconditioner when MA27 is replaced by a skyline solver. The size of the subdomains is smaller than the size used for the tests with MA27 because the skyline solver is more memory-consuming than MA27, and 80 × 80 is the biggest size that fits in the memory of the computer.

# procs              4              8             16             32             64            128
                2-level local  2-level local  2-level local  2-level local  2-level local  2-level local
init.            2.05   2.05    2.05   2.05    2.05   2.05    2.05   2.05    2.05   2.05    2.05   2.05
setup local      0.79   0.82    1.10   1.14    1.39   1.49    1.39   1.50    1.39   1.49    1.40   1.49
setup coarse     0.49   0.00    0.83   0.00    0.98   0.00    0.98   0.00    1.00   0.00    1.14   0.00
iter.            2.06   2.60    2.26   4.37    2.43   5.83    2.66   8.07    2.31  10.41    2.80  14.38
total            5.45   5.58    6.31   7.67    6.91   9.43    7.15  11.67    6.81  14.02    7.43  17.98
# iter.            15     19      16     31      17     41      19     58      16     73      19    101
time per iter.   0.14   0.14    0.14   0.14    0.14   0.14    0.14   0.14    0.14   0.14    0.15   0.14

Table 6: Poisson, skyline solver. Elapsed time (in seconds) in each main numerical step when varying the number of processors, with 80 × 80 points per subdomain, using the vertex based coarse space; column layout as in Table 4.

6 Concluding remarks

We have presented a set of new two-level preconditioners for Schur complement domain decomposition methods in two dimensions. The definition of the coarse space components is algebraic: they are defined using the mesh partitioning information and simple interpolation operators that follow a partition of unity principle. We have illustrated their numerical behaviour on a set of two-dimensional model problems that have both anisotropy and discontinuity. Those experiments tend to show that the most dominating component is not the granularity of the coarse space (the finest is not the best) but the restriction/interpolation operator $R_0$. This operator governs, in most cases, the quality of the coarse representation of the complete equation. The experiments have been performed on regular meshes, but there is no limitation to the implementation of the proposed two-level preconditioners on unstructured grids, although the possible rank deficiency that appears in the subdomain-based alternative could be trickier to remove. These numerical methods are targeted at parallel distributed memory computers. In this respect, we have proposed a message passing implementation that does not require any new global synchronization in the preconditioned conjugate gradient iterations, which is a well-known bottleneck for Krylov methods in distributed memory environments. Finally, we illustrated that the numerical scalability of the preconditioners, combined with the parallel scalability of the implementation, results in a set of parallel scalable numerical methods.

References

[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, 2nd edition, 1995.
[2] Anon. Harwell Subroutine Library. A Catalogue of Subroutines (Release 11). Theoretical Studies Department, AEA Industrial Technology, 1993.
[3] J. H. Bramble, J. E. Pasciak, and A. H. Schatz. The construction of preconditioners for elliptic problems by substructuring I. Math. Comp., 47(175):103-134, 1986.
[4] L. M. Carvalho. Preconditioned Schur complement methods in distributed memory environments. PhD thesis, INPT/CERFACS, France, October 1997. TH/PA/97/41, CERFACS.
[5] T. F. Chan and T. P. Mathew. The interface probing technique in domain decomposition. SIAM J. Matrix Analysis and Applications, 13, 1992.
[6] T. F. Chan and T. P. Mathew. Domain Decomposition Algorithms, volume 3 of Acta Numerica, pages 61-143. Cambridge University Press, Cambridge, 1994.
[7] M. Dryja, B. F. Smith, and O. B. Widlund. Schwarz analysis of iterative substructuring algorithms for elliptic problems in three dimensions. SIAM J. Numer. Anal., 31(6):1662-1694, 1993.
[8] I. S. Duff and J. K. Reid. MA27 - A set of Fortran subroutines for solving sparse symmetric sets of linear equations. Technical Report AERE R10533, Rutherford Appleton Laboratory, London, 1982.
[9] I. S. Duff and J. K. Reid. The multifrontal solution of indefinite sparse symmetric linear systems. ACM Trans. on Math. Soft., 9:302-325, 1983.
[10] C. Farhat and F.-X. Roux. A method of finite element tearing and interconnecting and its parallel solution algorithm. Int. J. Numer. Meth. Engng., 32:1205-1227, 1991.
[11] A. George and J. W. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall Series in Computational Mathematics, Englewood Cliffs, 1981.
[12] L. Giraud and R. Tuminaro. Grid transfer operators for highly variable coefficient problems. Technical Report TR/PA/93/37, CERFACS, Toulouse, France, 1993.
[13] P. Le Tallec. Domain decomposition methods in computational mechanics, volume 1 of Computational Mechanics Advances, pages 121-220. North-Holland, 1994.
[14] J. Mandel. Balancing domain decomposition. Communications in Numerical Methods in Engineering, 9:233-241, 1993.
[15] J. Mandel and M. Brezina. Balancing domain decomposition: Theory and computations in two and three dimensions. Technical Report UCD/CCM 2, Center for Computational Mathematics, University of Colorado at Denver, 1993.
[16] J. Mandel and R. Tezaur. Convergence of a substructuring method with Lagrange multipliers. Numer. Math., 73:473-487, 1996.
[17] B. F. Smith. Domain Decomposition Algorithms for the Partial Differential Equations of Linear Elasticity. PhD thesis, Courant Institute of Mathematical Sciences, September 1990. Tech. Rep. 517, Department of Computer Science, Courant Institute.
[18] B. F. Smith, P. Bjørstad, and W. Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, New York, 1st edition, 1996.
[19] P. Wesseling. An Introduction to Multigrid Methods. Wiley, West Sussex, 1992.