Journal of Computational Physics 230 (2011) 4437–4453
Contents lists available at ScienceDirect
Journal of Computational Physics journal homepage: www.elsevier.com/locate/jcp
Parallel re-initialization of level set functions on distributed unstructured tetrahedral grids Oliver Fortmeier a,b,⇑, H. Martin Bücker a,b a b
Institute for Scientific Computing, Seffenter Weg 23, 52074 Aachen, Germany Center for Computational Engineering Science (CCES), RWTH Aachen University, Germany
a r t i c l e
i n f o
Article history: Received 12 January 2010 Received in revised form 25 January 2011 Accepted 3 February 2011 Available online 29 March 2011 Keywords: Signed distance function High-performance computing k-d tree Re-parametrization Two-phase flow
a b s t r a c t Level set functions are employed to track interfaces in various application areas including simulation of two-phase flows and image segmentation. Often, a re-initializing algorithm is incorporated to transform a numerically instable level set function to a signed distance function. In this note, we present a parallel algorithm for re-initializing level set functions on unstructured, three-dimensional tetrahedral grids. The main idea behind this new domain decomposition approach is to combine a parallel brute-force re-initializing algorithm with an efficient way to compute distances between the interface and grid points. Time complexity and error analysis of the algorithm are investigated. Detailed numerical experiments demonstrate the accuracy and scalability on up to 128 processes. Ó 2011 Elsevier Inc. All rights reserved.
1. Introduction Moving boundaries occur in various scientific and engineering problems. Burning flames, gas–liquid mixtures, and oil droplets in a surrounding liquid are illustrating examples of multi-phase flows. The boundaries between phases can be described either implicitly or explicitly. In this paper, we are concerned with an implicit representation of boundaries by the level set approach [1]. Here, the boundary is represented by the zero level of a scalar-valued level set function u whose movement is described by a partial differential equation (PDE). In the context of multi-phase flow problems, the flow is commonly governed by a set of time dependent PDEs involving the level set function to distinguish the different phases from each other. When solving the PDEs for the flow and the level set function, numerical stability requires to preserve the level set function to be smooth in the vicinity of the interface. To this end, the aim of extension velocities [2,3] is to construct a flow field for moving u such that the level set function is kept smooth. An alternative is given by re-initializing the level set function whenever the smoothness is lost. The discussion of pros and cons of these two alternatives is beyond the scope of this paper. Here, the focus is on re-initializing the level set function. The objective of the re-initialization procedure is to re-parametrize the level set function u in the computational domain X Rd , where d = 2, 3. More precisely, given a level set function with zero-level on an interface C X, determine u such that the zero level is preserved, i.e.,
uðxÞ ¼ 0; x 2 C;
⇑ Corresponding author at: Institute for Scientific Computing, Seffenter Weg 23, 52074 Aachen, Germany. E-mail addresses:
[email protected] (O. Fortmeier),
[email protected] (H. Martin Bücker). 0021-9991/$ - see front matter Ó 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.jcp.2011.02.005
4438
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
and, furthermore, the Eikonal equation
kruk2 ¼ hðxÞ;
x 2 X;
ð1Þ
is satisfied. Here, h denotes a given speed function in X. In the context of two-phase flow problems on adaptively refined grids, it is useful to consider the special case of (1) where the property
kruk2 ¼ 1;
x 2 X;
ð2Þ
allows to measure the distances of any x 2 X to the interface. Here, u is called a signed distance function with numerical advantages addressed in [4]. This property is also of interest when evaluating the surface tension force [5]. For real-world two-phase problems arising from different scientific and engineering disciplines, parallel computing is mandatory to cope with the high demand of time and storage requirement when solving these problems numerically. From a serial point of view, the time for the re-initialization is typically only a small fraction of the time to solve the flow problem. Thus, one might wonder that parallelization of the re-initialization procedure is not important. However, Amdahl’s law [6] states that, even using infinitely many processes, the speedup of the overall time to solution is limited by the inverse of the fraction spent in the part of the solution that is not parallelized. For the parallel solution of the two-phase flow problem, this suggests to also take care of the parallelization of the re-initialization procedure. Moreover, today’s dominating parallel programming model is based on a distributed-memory approach in which all parts of a simulation need to be parallelized. Therefore, in practice, it is indispensable to employ a parallel re-initialization algorithm. The new contribution of our work is a parallel algorithm for re-initializing level set functions on unstructured, distributed grids. The main feature of this parallel algorithm is that a single grid of this type is used for both, the flow variables as well as the level set function. To this end, we combine two known algorithmic elements in a novel way. The first element consists of a brute-force re-initialization strategy. As the second element, a suitable multidimensional data structure is employed to reduce the time complexity. Neighborhood relations are computed with this tree data structure rather than with the data structure representing the unstructured grid. The main advantage of our approach is a high degree of parallelism. In summary, the main features of the new algorithm are: 1. applicability to unstructured three-dimensional grids, 2. serial expected runtime of OðN logðNÞÞ, where N is the number of degrees of freedom to represent u, 3. high degree of parallelism. The paper is organized as follows. In Section 2, we give a brief overview on related algorithms for solving the Eikonal equation. Our motivation of re-initializing level set functions stems from two-phase flow problems that are described in the context of the level set approach in Section 3. Then, we describe the brute-force algorithm to re-initialize level set functions in Section 4. The use of a suitable tree data structure to handle neighborhood relations in the re-initialization algorithm is sketched in Section 5. The new parallel re-initialization algorithm is described in Section 6 and analyzed in Section 7. Finally, in Section 8, we present various results with respect to complexity, accuracy, and performance. 2. Algorithms for solving the Eikonal equation There are mainly the following classes of numerical techniques for solving the Eikonal equation (1). The first class consists of PDE-based methods. These methods offer the advantage of exploiting the full range of well-studied numerical techniques for structured as well as unstructured grids. In particular, PDE-based methods are known to be amenable to parallel computing [7] and provide a viable alternative for finite volume and finite difference methods. However, previous research using serial techniques based on finite elements indicated that such approaches tend to be numerically inadequate for two-phase flows [8]. Therefore, this class is not considered further in this paper; see [9] for a recent survey on PDE-based methods for the solution of (1). The archetype of another class is the fast marching method (FMM), originally developed by Sethian [10] for Cartesian grids. It was later generalized to arbitrary triangulated surfaces [11], unstructured meshes [12], and any d-dimensional implicit hyper-surfaces [13]. The algorithmic complexity of the FMM is OðN logðNÞÞ where N is the number of grid points. The reader is referred to [1] for a detailed overview. The FMM is based on a front propagation scheme and uses a heap data structure which makes it difficult to parallelize using a domain decomposition strategy. The first discussion on a parallel implementation of the FMM is given in [14] where four different strategies are compared for Cartesian grids. A recent technique [15] determines the flow field on an unstructured grid whereas the level set function is described on a Cartesian grid. Both grids are adaptively refined and distributed among processes. In [16], the FMM is separately applied on each sub-domain. This process is iteratively refined exchanging data across sub-domain boundaries. In contrast to a domain decomposition approach, the parallel FMM presented in [17] is based on distributing the interface and is discussed for Cartesian grids. To the best of our knowledge, there is no publication of any parallel FMM on unstructured grids. In [18], the authors introduce the fast iterative method (FIM) which is designed for parallel computation on Cartesian grids. This recent method relies on a modification of a label correction scheme coupled with an iterative procedure for the grid point update. In contrast to the FIM, the fast sweeping method (FSM) [19] solves the Eikonal equation on a ddimensional grid by performing 2d directional sweeps in a Gauss–Seidel fashion. This sweeping idea of visiting nodes in a
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
4439
predefined order is also present in [20,21]. The FSM has a complexity of OðNÞ and has been extended to triangular meshes [22]. Also, a parallel version of FSM was introduced for structured two-dimensional grids [23]. As far as we know, no parallel version of the FSM or the FIM on unstructured grids is available in the open literature. Although the methods in all of the above classes are designed to solve the Eikonal equation (1), they are often also used for the solution of the special case (2), where the speed function is equal to one. Some of these methods are compared in [24] concerning numerical efficiency for quadrilateral grids using a serial implementation. However, there are classes of methods specifically exploiting the particular structure of (2). For instance, the Euclidian Distance Transform (EDT) is concerned with the following problem on Cartesian grids. Given a subset of grid points, called sites, the EDT computes the distance of all other grid points to the closest site in the Euclidean norm. The algorithms for the EDT are surveyed in [25] and [26] for two- and three-dimensional grids, respectively. Parallel algorithms for the EDT are investigated for parallel random access machine models [27,28] and for graphic processors in [29–34]. Although the literature on parallel EDT algorithms is vast, we do not consider these methods here. The reason is that these methods are tailored toward Cartesian grids and that the points to which the distances are computed are located on the grid. However, we are concerned with unstructured grids and distances to a given manifold which, in general, is not located on grid points. 3. The level set approach for two-phase flows For describing three-dimensional two-phase flow problems, a level set based approach can be used. Here, we give a brief outline of this approach. A detailed description can be found in [35,8]. The level set function u : X ½0; se ! R during a time interval [0, se] is driven by a given vector field u in X [0, se] by the equation
@ u þ u ru ¼ 0 in X ½0; se : @s
ð3Þ
Here, u can be used to divide the computational domain X at time s into two parts, i.e.,
X1 ðsÞ :¼ fx 2 Xjuðx; sÞ < 0g and X2 ðsÞ :¼ fx 2 Xjuðx; sÞ > 0g: For instance, X1(s) represents an oil droplet in a surrounding water phase X2(s). Then, the interface is represented by the manifold described by the zero level of u, in formula
Cu ðsÞ :¼ fx 2 Xjuðx; sÞ ¼ 0g: The shape of the level set function is characterized by the Eikonal equation (1). Rather than using extension velocities, we determine u by solving the incompressible Navier–Stokes equations
.ðuÞ
@u þ ðu rÞu ¼ rp þ .ðuÞg þ divðlðuÞDðuÞÞ þ fCu @s
ð4Þ
div u ¼ 0: Here u denotes the velocity, p the pressure, l the viscosity, . the density, D the viscous stress tensor, and g the external gravity force. The symbol fCu is the so-called continuum surface force [5,36,37] describing the surface tension at the interface Cu(s) by a localized force term. During the course of solving the coupled system (4) and (3) numerically, the property (2) is typically lost. This makes a re-initialization of u necessary after a few timesteps. In the next section, we describe a brute-force re-initialization algorithm to recover the distance property (2). In the remainder, the time s is fixed. To simplify notations, we skip the parameter s while denoting the level set function u(x) and the interface Cu. 4. The brute-force re-initialization algorithm Before presenting the brute-force algorithm to re-initialize the level set function, we introduce some notations. We assume that the computational domain X is discretized by an unstructured tetrahedral grid T . Let uh denote a representation of u and V denote the points where the degrees of freedom of uh are located. We distinguish between two cases: (i) If uh is only described on vertices of T then V is the union of all vertices of T . Here, uh is called linear. (ii) Otherwise, if uh is described on vertices and edges of T then V consists of all vertices and midpoints of all edges of T , and uh is called quadratic. The discrete representation of Cu is given by Chu . For the sake of simplicity, we call u 2 V a vertex whether u is a vertex or a midpoint of an edge of T . Furthermore, we introduce the so-called ‘‘frontier vertex set’’ F V. A vertex w is in F if it is located on a tetrahedron intersected by Chu . The set of the remaining vertices, V n F , is denoted by S and a vertex v 2 S is called ‘‘off-site vertex.’’ Unless otherwise stated, the symbols w and v are used to denote vertices in F and S, respectively. In Fig. 1, a two-dimensional example for these notations is given. Here, all vertices w are depicted by for a piece-wise linear representation of uh. The objective of the re-initialization algorithm is to determine approximations of u(u) for all u 2 V, representing the signed distance between the manifold Cu and u. To this end, let dðChu ; uÞ denote an approximation of the (unsigned) distance between Chu and u 2 V. Then, the re-initialization algorithm consists of three stages: (I) determining dðChu ; wÞ for frontier
4440
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
Fig. 1. Triangulation of a unit square, with zero level Chu of a (piece-wise linear) level set function uh. Here, the frontier vertices w 2 F are denoted by and the local projection Pt(w) in a tetrahedron t by a dotted line.
vertices w 2 F ; (II) determining dðChu ; v Þ for off-site vertices v 2 S; (III) determining uh(u) for all u 2 V. This algorithm is outlined in Algorithm 1. Algorithm 1. Base algorithm to re-initialize a level set function 1: F ; d Chu ; F ; L = InitFrontierSet ()
v2S Compute d Chu ; v 4: end for 5: Compute uh(u) for all u 2 V 2: for all
// stage I // stage II
3:
// stage III
In stage I, the frontier vertex set F is initialized by a loop over all tetrahedra. To determine the distance dðChu ; wÞ for a vertex w 2 F , we first introduce the set T(w) of all tetrahedra which have the vertex w in common. That is, a tetrahedron is in T(w) if one of its vertices is w. In Fig. 1, the set T(w) is indicated by gray shading. The distance dðChu ; wÞ is determined by
d Chu ; w :¼ PðwÞ where the projection P in T(w) is defined by
PðwÞ :¼ min Pt ðwÞ: t2TðwÞ
ð5Þ
The local projection Pt(w) in a single tetrahedron t is defined by the distance between w and Chu in t, in formula
Pt ðwÞ :¼ min kq wk2 : q2Chu \t
ð6Þ
That is, (6) describes the distance between w and Chu in a single tetrahedron t, whereas (5) characterizes the distance between w and Chu in the domain represented by T(w). Ongoing work is concerned with a suitable, alternative definition of T(w) in an attempt to increase the accuracy because the closest interface segment does not necessarily have to be located in an adjacent tetrahedron. In Fig. 1, the projection Pt(w) is depicted by a dotted line for two tetrahedra, t1 and t2. The projection Pt may be either non-orthogonal or orthogonal. Let qw denote the solution of (6) satisfying
kqw wk2 ¼ min kq wk: q2Chu \t
The projection Pt(w) is non-orthogonal onto Chu , if the vector (w qw) is not orthogonal to Chu . Here, the distance dðChu ; wÞ in t is determined by the distance between w and the intersection of Chu with the edges of t. This situation is illustrated by vertex
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
4441
Fig. 2. Local projections P t1 and P t2 for two frontier vertices w1 and w2.
w1 in Fig. 2. If Pt(w) is an orthogonal projection onto the local representation of Chu , then the vector ðw q? w Þ is orthogonal to the representation Chu where q? is the perpendicular foot of the projection P in a tetrahedron t. This is illustrated by vertex t w h w2 in Fig. 2. The set of all perpendicular feet q? w j w 2 F , used to determine dðCu ; wÞ, is denoted by L. The orthogonal projection is differently computed according to the representation of uh: 1. In case of a linear representation of uh, the local projection Pt(w) in (6) for a given tetrahedron t is obtained as follows. The local representation of Chu in t is either a vertex, an edge, a triangle, or a quadrilateral. In all cases the local orthogonal projection can be directly computed by basic geometric calculations. 2. If uh is quadratic, we reduce the problem to the linear one. Therefore, we introduce a finer triangulation T 0 , where all vertices and midpoints of T are the vertices in T 0 , i.e., each tetrahedron in T is regularly refined into eight sub-tetrahedra. On this finer triangulation, uh is linear, and we can apply the local projection described above. After the initialization stage which is introduced in [8], the algorithm determines dðChu ; v Þ for all remaining vertices v 2 S in stage II. Therefore, first, we define a distance dðChu ; v ; wÞ between Chu and an off-site vertex v via a frontier vertex w by
d Chu ; v ; w :¼ kv wk2 þ d Chu ; w : Note that this distance does not define a metric or distance in a mathematical sense. In Fig. 3, the distances dðChu ; v ; wi Þ for a vertex v via wi, i = 1, 2, are illustrated by a line from vertex v to wi followed by a line from wi to q? wi . Second, the distance between an off-site vertex v and Chu is given by
min w2F
n
o :
Chu ; v ; w
That is, the distance of
ð7Þ
v is obtained by taking the shortest of all distances via frontier vertices.
Fig. 3. Computing dðChu ; v Þ for a vertex
v 2 S.
4442
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
To increase the accuracy of determining the distance in (7), the set F is augmented by all perpendicular feet L computed in stage I. Then, the computation of the distance dðChu ; v Þ between a vertex v 2 S and Chu becomes
n o d Chu ; v :¼ min d Chu ; v ; w : w2F [L
ð8Þ
Since perpendicular feet are located at Chu , we define the distance dðChu ; qÞ :¼ 0 for all perpendicular feet q 2 L leading to
d Chu ; v ; q ¼ kv qk2 :
In Fig. 3, the distances of the vertex v via the perpendicular feet q? wi ; i ¼ 1; 2 are depicted by dashed lines. There are four paths from v via different vertices to Chu . Two of them are via the frontier vertices w1 and w2 and the other two are via the perpenh h ? ? dicular feet q? w1 and qw2 . In this figure, the distance dðCu ; v Þ is determined as dðCu ; v ; qw2 Þ because
q?w2 ¼
argmin ? u2fw1 ;w2 ;q? w ;qw g 1
n o d Chu ; v ; u
2
holds. Finally, stage III of the algorithm consists of assigning a value to uh(u) for all u 2 V as follows. Let uhold denote the level set function before the re-initialization algorithm has started. The sign of the new values uh(u) for all u 2 V is determined by
uh ðuÞ
8 > < d Chu ; u ; uhold < 0; > : d Chu ; u ; uhold P 0:
ð9Þ
In summary, the signs are restored whereas the distances are computed by the function dðChu ; Þ. Before outlining the parallel algorithm, we describe an efficient strategy to compute the distance dðChu ; v Þ for each vertex v 2 S. 5. Computing distances to the interface In general two-phase flow simulations, the off-site vertex set S is larger than the frontier vertex set F . Hence, the computational work of determining dðChu ; v Þ for a vertex v 2 S is generally more time consuming than determining dðChu ; wÞ for a vertex w 2 F . This implies that most computational work of the re-initialization algorithm is spent in stage II of Algorithm 1 to determine
d Chu ; v
min d Chu ; v ; w
w2F [L
ð10Þ
for all off-site vertices v 2 S. For a vertex v, a naïve implementation of this formula would lead to a linear search in the set F [ L resulting in the time complexity of €
T naıve ðv Þ ¼ OðjF [ LjÞ: To reduce the linear complexity, we apply the following approach. We restrict the search space F [ L in equation (10) to an ‘‘easy-to-compute’’ set N ðm; v Þ F [ L which depends on the off-site vertex v and a parameter m 2 N. This restriction of (10) leads to
d Chu ; v
min d Chu ; v ; w
w2N ðm;v Þ
ð11Þ
for all off-site vertices v 2 S. Due to the geometric structure of the problem, we choose the search space N ðm; v Þ as the set of m nearest neighbors of v in the set F [ L. So, the parameter m controls the size of the search subspace and influences the computational cost of setting up N ðm; v Þ as well as computing the minimum in (11). An efficient approach to search for m nearest neighbors in a set of k-dimensional data points is given by k-d trees (kdimensional trees) originally introduced by Bentley [38]. Let N denote the number of data points represented by a k-d tree. The expected runtime of setting up this data structure is then given by OðN logðNÞÞ. In [39], an algorithm for searching the m nearest neighbors of a given k-dimensional data point is presented. Here, the nearest neighbors are stored in the k-d tree whereas the data point is not necessarily part of that data structure. The expected runtime of the algorithm is given by Oðm logðNÞÞ. There are alternative data structures and algorithms for finding m nearest neighbors. An extensive overview is presented in [40]. We use k-d trees to solve the minimization problem in (11). Here, k = 3 and the tree represents the set F [ L. Then, the set N ðm; v Þ is determined by the tree as the set of the m nearest neighbors of v 2 S. Building the data structure involves an expected runtime of
T build ¼ OðjF [ Lj logðjF [ LjÞÞ:
ð12Þ
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
The expected runtime of searching for m nearest neighbors of each
T
search
ðm; v Þ ¼ Oðm logðjF [ LjÞÞ:
4443
v is given by ð13Þ
In the next section, we combine the brute-force algorithm and k-d trees to obtain a parallel re-initialization algorithm. 6. The parallel re-initialization algorithm The parallel algorithm is based on a domain decomposition approach. Distributing the tetrahedra of the grid T among P processes leads to a decomposition of V, denoted by V p on process p, and thus a decomposition of the values of uh among the processes. Fig. 4 shows the distribution of the triangulation presented in Fig. 1 among two processes p1 and p2. The domain stored by p2 is gray shaded. The decomposition of V is overlapping. That is, if a vertex u is located on a tetrahedron t1 stored by process p1 and also on a tetrahedron t2 which is assigned to process p2 then u is stored on both processes p1 and p2. This implies that the values of uh on process boundaries are stored on each process. For instance, in Fig. 4, the vertices w1 and w2 are stored by both processes. The parallel algorithm presented in Algorithm 2 follows the approach given in Algorithm 1 where distances are computed based on nearest neighbors. Each process p performs the initialization of the frontier set on its tetrahedra leading to three ‘‘local’’ sets that are computed from the information that is available by process p. The first set is the local frontier vertex set, denoted by F p . In general, F p differs from F \ V p . For instance, in Fig. 4, the vertex w1 2 F p2 because there is a tetrahedron stored on process p2 which is intersected by Chu . However, w1 R F p1 because there is no tetrahedron on process p1 intersected by Chu . The second set consists of the corresponding distances dðChu ; F p Þ. The set of the corresponding perpendicular feet is the third set and is denoted by Lp . Recall from (8) and (11) that the ‘‘global’’ sets F , dðChu ; F Þ and L are needed for computing dðChu ; v Þ. To this end, an additional step is necessary to gather the global sets from the local sets. In particular, F is obtained by
F¼
P [
F p:
p¼1
The sets dðChu ; F Þ and L are obtained similarly. However, a slight modification is necessary if a vertex w is located on a process boundary and is contained in several local frontier vertex sets. Here, dðChu ; wÞ is determined as the minimum of the locally computed distances. For instance, in Fig. 4, w2 2 F p1 and w2 2 F p2 , so dðChu ; w2 Þ is computed as the minimum of the two distances determined locally by process p1 and p2. In this case, the minimum is computed by p1. The set L is accumulated by gathering the perpendicular feet corresponding to the frontier vertices where the minimal distance is obtained. The local offsite vertex set S p consists of the vertices on process p whose distance is still to be computed, in formula
S p :¼ V p n F :
Fig. 4. Distributed triangulation of a unit square on processes p1 and p2, with zero level Chu of a (linear) level set function uh. Here, the frontier vertices w 2 F p are denoted by .
4444
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
After gathering the local sets, all processes store F ; L, and dðChu ; F Þ and build the k-d tree representing F [ L. Now, each process p computes dðChu ; v Þ for all its local off-site vertices v 2 S p by applying (11). Determining uh(u) for all u 2 V is simultaneously performed on each processes p by applying (9) to local vertices u 2 V p . The parallel algorithm for re-initializing a level set function is given in Algorithm 2. Algorithm 2. Parallel algorithm to re-initialize a level set function 1: for p = 1, . . ., P do in parallel InitFrontierSet () F p ; d Chu ; F p ; Lp h Gather F p ; d Chu ; F p ; Lp 3: F ; d Cu ; F ; L
2:
4: 5: 6: 7:
kd BuildKDTree ðF [ LÞ for all v 2 S p N ðm; v Þ GetNearestNeighbors (kd,v, m) n o d Chu ; v minw2N ðm;v Þ d Chu ; v ; w
8: end for 9: Compute uh(u) for all u 2 V p 10: end for
7. Analysis of the re-initialization algorithm In this section, we present the time complexity of Algorithm 2 before approximating its error. In Table 1, the complexity of each line of Algorithm 2 for a single process p is given. For line 2, we assume that the cardinality of the set T(w) is bounded, i.e., jT(w)j 6 const. for all w 2 F . Hence, using the projections (5) and (6) to determine dðChu ; wÞ for each vertex w 2 F p is constant in time. Since the initialization iterates over all tetrahedra to identify vertices as frontier vertices and compute their distance to the interface, the complexity of line 2 is bounded by jV p j. As stated in [41], the communication time for gathering, line 3, is bounded by the sum of the number of gathered elements and the logarithm of the number of processes P, i.e.,
O max jF p j þ logðPÞ : p¼1;...;P
The complexity of building the k-d tree and searching in the tree is given by (12) and (13), leading to the complexity of lines 4 and 6. Computing the minimum in the nearest neighbor set with m ¼ jN ðm; v Þj for each vertex v 2 S p yields the complexity of line 7. Iterating over all local vertices u 2 V p to determine uh leads to the complexity of line 9. In general, the terms in line 4 and 6 dominate the complexity. Therefore, we only discuss these two terms in the following. The number of processes P influences the cardinalities of the local sets, in particular jS p j. When increasing P then jS p j is reduced and the term in line 6 pales in comparison to the term in line 4. Thus, while using a moderate number of processes line 6 dominates the runtime and the complexity is given by
Oðm jS p j logðjF [ LjÞÞ:
ð14Þ
That is, computing distances accounts for most of the runtime. In contrast, using a large number of processes decreases jS p j and results in m jS p j < jF [ Lj. Thus, building the k-d tree in line 4 dominates the runtime leading to the complexity
OðjF [ Lj logðjF [ LjÞÞ:
ð15Þ
If using only one process to execute Algorithm 2 the runtime is governed by line 6. Here, the complexity is given by
Oðm jSj logðjF [ LjÞÞ:
ð16Þ
Table 1 Time complexity of Algorithm 2 for a process p. Line
Time complexity
2 3 4 6 7 9
jV p j Gather communication jF [ Lj logðjF [ LjÞ m jS p j logðjF [ LjÞ (expected) m jS p j jV p j
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
4445
Numerical tests show that m can be chosen relatively small compared to jS p j and jF [ Lj. Thus, m is assumed constant. Additionally, since jLj 6 jF j holds the expected serial runtime given by (16) is bounded by OðjSj log jF jÞ. So, Algorithm 2 has the same serial complexity as the fast marching method which is difficult to parallelize. Next, we focus on the accuracy of the re-initialization algorithm. Using the m nearest neighbors rather than the set F [ L leads to an error which is numerically investigated in the next section. For obtaining a bound for the error eh, we assume m ¼ jF [ Lj meaning that we analyze the error eh of the brute-force algorithm. Here, h denotes the maximal length of an edge of the triangulation. Three sources of approximation errors exist while computing dðChu ; uÞ for a given u 2 V: (i) Since Chu is discretely represented by values on V the algorithm uses an approximation of the interface Cu. This error is bounded by OðhÞ 2 or Oðh Þ if uh is a linear or a quadratic representation of sufficient smooth level set function. (ii) Determining dðChu ; wÞ for a frontier vertex w 2 F is done by applying the projection given in (6). For the linear and the quadratic case this error is bounded by OðhÞ. (iii) Finally, computing dðChu ; v ; wÞ for a vertex v 2 S includes an approximation error which is also bounded by OðhÞ. Hence, the overall error of the method is given by
eh 2 OðhÞ: 8. Numerical results In this section, we present numerical results of the new re-initialization algorithm with respect to accuracy and performance as well as comparisons to the FMM. All experiments are carried out on a cluster at the Center for Computing and Communication of RWTH Aachen University. This cluster consists of Xeon-based quad-core processors (‘‘Harpertown’’) with a clock rate of 3.0 Ghz which are connected by an InfiniBand network. All tests are performed within the parallel C++ finite element solver DROPS [8] which is being developed for two-phase flows. The communication among and synchronization of processes is handled by the message passing interface (MPI). The overall strategy to parallelize this solver without considering the re-initialization is discussed in [42,43]. Applications of DROPS can be found in [44–47]. In this software, the level set function is given by a piece-wise quadratic function. The implementation of k-d trees is based on the library KDTREE 2 [48]. The domain decomposition strategy employs the graph partitioning library ParMETIS [49]. The experiments are performed for different problems defined by a triple (X, u, s) with the following components. There are two different computational domains X: 1. XC = [0, 1]3 denotes a unit cube which is used to easily modify the edge length h in the triangulation. 2. XM describes an hourglass-shaped measurement cell of 10 cm length and a maximal diameter of 0.72 cm which is depicted in Fig. 5. The domain XM is an optimized measurement cell used to investigate levitated droplets, see [50].
Fig. 5. Computational domain XM.
Fig. 6. Domain XC and Cu for usl.
4446
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
Fig. 7. Computational domain XC and umsl 3 .
As u we consider the following three different signed distance functions: 1. usl represents a horizontal slice through the cube XC, depicted in Fig. 6. The idea behind this choice of u is to eliminate errors in computing distances to the interface. 2. umsl describes multiple, equidistant horizontal slices through the cube XC, as illustrated in Fig. 7. The index b denotes the b number of slices and is set either to 30 or 60. This function aims at increasing the number of frontier vertices compared to all vertices, i.e., the ratio jF [ Lj=jVj. It is artificially constructed and does not reflect any of the two-phase flow problems we are interested in where typically the number of frontier vertices is significally smaller than the number of off-site vertices. 3. usp describes a sphere, which is used as initial condition for u if a drop is simulated in the measurement cell XM or in the cube XC. The parameter s describes the level of refinement of the triangulation which is either small or large. At first, we compare the results of the re-initialization algorithm with a known solution. We define the following errors to present the accuracy and the rate of convergence of the re-initialization algorithm:
1 X jucomp ðv Þ uex ðv Þj ; jSj v 2S juex ðv Þj vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi u u 1 X jucomp ðv Þ uex ðv Þj2 comp ; Þ :¼ t e 2 ðu jSj v 2S juex ðv Þj
e1 ðucomp Þ :¼
e1 ðucomp Þ :¼ max jucomp ðv Þ uex ðv Þj; v 2S
comp
where u denotes a computed, re-initialized level set function, and uex a given signed distance function. To investigate the error, we use XC as the computational domain to control the edge length h. Furthermore, we set uex to usp, i.e.,
uex ðxÞ ¼ kx ð0:5; 0:5; 0:5ÞT k2 0:25; where the interface Cu is a sphere of radius 0.25 located in the middle of XC. The parallel re-initialization algorithm is applied to disturbed level set functions, either a linear disturbance ex udist 1 ðxÞ :¼ 100 u ðxÞ;
ð17Þ
a piece-wise linear disturbance
udist 2 ðxÞ :¼
2:5 uex ðxÞ; uex ðxÞ P 0; 3:0 uex ðxÞ; uex ðxÞ < 0;
or a nonlinear disturbance
(
u
dist 3 ðxÞ
:¼
ðuex ðxÞÞ2 ; uex ðxÞ P 0; ðuex ðxÞÞ3 ; uex ðxÞ < 0:
Five regular triangulations of XC are analyzed. The coarsest triangulation consists of 384 tetrahedra. Refining each tetrahedron regularly, i.e., each edge is bisected, leads to the next finer triangulation. Here, we refine each tetrahedron up to four times recursively, leading to five triangulations with up to 12 582 912 tetrahedra on the finest level. On the coarsest triangulation, uh is discretized by 729 scalar values and on the finest triangulation by 16 974 593 values. To avoid the approximation error introduced by the usage of k-d trees, we chose m ¼ jF [ Lj as previously used for the error analysis in Section 7. In Table 2, the error and the rate of convergence of the re-initialization algorithm are shown if disturbing usp by (17). The rate of convergence (ROC) is given by the quotient of the error on two consecutive triangulations.
4447
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453 Table 2 Errors e1, e2, and e1 of ucomp in XC when disturbing uex = usp by udist 1 . h
e1
ROC1
e2
ROC2
e1
ROC1
0.2165 0.1082 0.0541 0.0271 0.0135
0.08822 0.02924 0.01111 0.004922 0.002262
— 3.0 2.6 2.2 2.2
0.1308 0.04954 0.02574 0.01701 0.01133
— 2.6 1.9 1.5 1.5
0.06988 0.02374 0.01383 0.006520 0.003354
— 2.9 1.7 2.1 1.9
Table 3 Error eChu of rucomp in XC when disturbing uex = usp. h
T(w)
TðwÞ
eCh
ROC
eCh
ROC
(a) Re-initialization of u 0.2165 0.1082 0.0541 0.0271 0.0135
0.4144 0.2125 0.1755 0.2649 0.3497
— 2.0 1.2 0.7 0.8
0.3796 0.2099 0.1286 0.06815 0.03485
— 1.8 1.6 1.9 2.0
(b) Re-initialization of udist 2 0.2165 0.1082 0.0541 0.0271 0.0135
0.4055 0.2062 0.1924 0.2423 0.3491
— 2.0 1.0 0.8 0.7
0.3791 0.2017 0.1020 0.08900 0.09248
— 1.9 2.0 1.2 1.0
(c) Re-initialization of udist 3 0.2165 0.1082 0.0541 0.0271 0.0135
0.6614 0.8175 0.7312 0.7993 0.7364
— 0.8 1.1 0.9 1.1
0.6733 0.8056 0.7233 0.7771 0.7596
— 0.8 1.1 0.9 1.0
u
u
dist 1
Since h is bisected in each refinement step, the theoretical rate of convergence is two. Although the theoretical rate of convergence is not obtained with respect to the e2- and e1-error, the rate of convergence is always greater than 2.2 with respect to the e1-error. When solving two-phase flow problems, the gradient of uh is needed to determine the forces at Cu, cf. [5]. Therefore, another metric to assess the quality of the algorithm is to consider the error of ru in the vicinity of Cu. We measure this error by
eCh ðrucomp Þ :¼ max krucomp ruex k2 u
Cu
where the gradients are evaluated on tetrahedra that are intersected by Cu. The error is primarily caused by the routine InitFrontierSet () of Algorithm 1. For a frontier vertex w, it is possible that the tetrahedra containing the closest interface segment is not located in T(w). Therefore, rather than considering T(w), we also investigate a superset TðwÞ TðwÞ which additionally contains neighboring tetrahedra. The error eChu for both approaches is given in Table 3. Here, we consider the dist dist same triangulations as in Table 2 and re-initialize the functions udist for uex = usp. With increasing grid 1 ; u2 , and u3 refinement, indicated by h, the error stagnates or even increases in all three Table 3(a), (b), and (c) for T(w). However, using TðwÞ, the rate of convergence is close to the factor of 2 for udist 1 . Although TðwÞ is used, the error only decreases up to dist h = 0.0271 for udist and there is no decrease for u . 2 3 An explanation for the divergence of the error for udist and udist is given by the following observation. Since in DROPS the 2 3 level set function is represented by a quadratic function, Chu changes if disturbing uex by udist or udist 2 3 . That is, the discrete representation Chu moves although the roots of udist and uex are identical. This situation is illustrated for a one-dimensional 2 example in Fig. 8. In this figure, I(g) denotes an interpolation of a function g by a quadratic polynomial and u describes the h signed distance to 0.25. The root of Iðudist 1 Þ is identical to the one of u. Hence, Cu does not change if applying this disturbance. dist dist dist In contrast, the roots of Iðu2 Þ and Iðu3 Þ are not located at the root of u. Overall, the disturbances Iðudist 2 Þ and Iðu3 Þ change the interface in such a way that the algorithm is not capable of correctly determining distances at frontier vertices. To investigate the influence of the parameter m on the error and runtime, we use four different problem settings. For the cube XC, the level set function usl is used. Using the measurement cell XM, the function usp is applied and Cu is given by a
4448
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
Fig. 8. Interpolated disturbed 1d signed distance function.
Table 4 Problem characteristics. Problem
jT j
jVj
jF j
jL n F j
jSj
jF [ Lj=jVj [%]
(XC, usl, small) (XC, usl, large) (XM, usp, small) (XM, usp, large)
224 640 14 380 416 420 187 2 733 447
274 625 16 974 593 482 342 3 152 034
4 225 66 049 37 132 148 602
0 0 1 835 7 360
270 400 16 908 544 445 210 3 003 432
1.54 0.39 8.00 4.95
Fig. 9. Error for problem (XM, usp, small) and (XM, usp, large) with respect to m when re-initializing udist 1 .
sphere of radius 0.1 cm located at the narrowing part of the cell. To vary the problem size, we refine the triangulation in the vicinity of Chu leading to two different problem sizes for each domain. A summary of the characteristics of all four used problems is given in Table 4. Consider the column depicting jL n F j in that table. If using XC and usl the perpendicular feet are located on frontier vertices for both problem sizes. Thus, the set L n F is empty and the accuracy cannot be increased by augmenting the set F by L when considering (8) rather than (7). The influence of the parameter m on the e1- and e2-error is depicted in Fig. 9 for problems (XM, usp, small) and (XM, usp,large). Here, we can see that setting m = 100 results in the best approximation. Increasing the value of m does not lead to any improvement for these two problems with respect to the e1- and e2-error. The runtime of the parallel re-initialization Algorithm 2 depends on the choice of m. To investigate this dependency, let Tm(P) denote the runtime of this algorithm on P processes with the user-given parameter m. In Fig. 10, the normalized times
T m ðPÞ ; T 1 ðPÞ
ð18Þ
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
4449
Fig. 10. Normalized runtime of the re-initializing algorithm for P = 1, 8, 64 and 128 processes.
are illustrated for P 2 {1, 8, 64, 128} for all four problems given in Table 4. For a given number of processes P, the ratio (18) increases for growing number of nearest neighbors m. The behavior for moderate numbers of processes given in that figure is expected from the analysis of the complexity given in (14). However, the time ratios do not significantly depend on m. For instance, the ratio for m = 500 and P = 1 in problems (XC, usl, small) and (XC, usl, large) is about 45 and not 500 as suggested by (14), cf. Fig. 10(a) and (b). The corresponding ratio is less than 18 for problems (XM, usp, small) and (XM, usp, large), cf. Fig. 10(c) and (d). For fixed m, the ratio (18) tends to decrease with increasing P for all four problems. That is, the influence of the parameter m on the runtime becomes less and less important when increasing the number of processes. An explanation is that the time for building the k-d tree (15), which is independent of m, starts to dominate the runtime. The parallel speedup is limited by three factors: (i) communication costs of gathering in line 3 of Algorithm 2; (ii) load imbalances in V p ; F p and S p ; and (iii) redundant generation of the k-d tree. In this paper, we focus on the aspect (iii). To this end, let am(P) denote the fraction of Tm(P) that is used to build the k-d tree. In Table 5, the runtime Tm(P) of Algorithm 2 as well as am(P) are presented for 1 and 128 processes using m = 25 and m = 100. For a given m, increasing P also increases am(P) which can be explained by the redundant computation involved in building the k-d tree. If P is kept fixed and m is increased, more time is needed to search in the tree so that building the tree becomes less time dominant. Hence, am(P) is decreased. For both m in problems (XC, usl, small) and (XC, usl, large), the fraction am(P) is reduced by increasing the problem size because the ratio jF [ Lj=jVj is decreased, cf. Table 4. This behavior also applies for problems (XM, usp,small) and (XM, usp, large) on one process. For these two problems, the fractions a25(128) and a100(128) increase by a factor of about 1.25 if switching from the small to the large problem size. This can be explained by the following argument. The sizes of all sets are increased. However, from inspection of the values in Table 4, this increase by factors less than 7 is smaller than the increase in the number of processes from 1 to 128. Therefore, (15) starts to dominate the runtime of the parallel algorithm leading to an increase of am(128).
4450
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
Table 5 Time for re-initialization a level set function and fraction of building the k-d tree on 1 and 128 processes. Problem
m
Tm(1) [s]
am(1) [%]
Tm(128) [s]
am(128) [%]
(XC, usl, small) (XC, usl, large) (XM, usp, small) (XM, usp, large) (XC, usl, small) (XC, usl, large) (XM, usp, small) (XM, usp, large)
25 25 25 25 100 100 100 100
1.28 99.4 4.68 39.7 3.56 256 10.2 83.8
0.07 0.02 0.16 0.09 0.03 0.007 0.075 0.042
0.02 0.88 0.20 0.75 0.05 2.24 0.35 1.33
3.2 1.6 4.4 5.5 1.3 0.6 2.5 3.1
Fig. 11. Speedup of re-initializating the level set function.
Assuming perfect speedup for all parts except for building the tree, Amdahl’s law [6] implies that, for an infinite number of processes, the speedup is bounded by Sm(1) :¼ 1/am(1). For instance, in problem (XM, usp, small), the ratio a25(1) = 0.16% implies that the speedup is bounded by S25(1) = 625, whereas for problem (XC, usl, large) the ratio a100(1) = 0.007% yields a theoretic limit for the speedup of S100(1) = 14 286. In Fig. 11, the speedup of the re-initialization algorithm with respect to a serial implementation is depicted. Here, we compare m = 25 with m = 100. Since the computational work per process is larger while using m = 100 the speedup is slightly greater than using m = 25. For both choices of m, the runtime of problems using XC and usl scales better for all number of
4451
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453 Table 6 Ratio of serial runtime of the FMM and Algorithm 2. Problem
TFMM/TAlg.
(XC, usl, small) (XC, usl, large) (XM, usp, small) (XM, usp, large)
1.29 — 1.16 1.24
2
processes than using XM and usp. For problem (XM, usp, large), we observe a speedup of 63 for m = 100 and P = 128. In problem (XC, usl, small) using 128 processes with m = 100, the speedup reaches 76 and 114 if using problem (XC, usl, large). Table 6 compares the serial runtimes of our implementation of the FMM and Algorithm 2 with m = 100. These ratios TFMM/ TAlg. 2 indicate that the runtime of the FMM is larger than the runtime of Algorithm 2 by a factor of about 1.2. For problem (XC, usl, large), it was not possible to solve the re-initialization problem by the FMM because its memory requirement exceeds the maximum available memory of the platform, i.e., 16 GB. Finally, we address some limitations of the current implementation of the algorithm. As described in Section 7, the k-d tree is built and stored by each process. Therefore, this part of the re-initialization algorithm may constitute the performance bottleneck in terms of runtime and memory consumption. However, in the following discussion we show that—even for artificially constructed problems with no practical relevance—a serial implementation of the k-d tree is reasonable to avoid any communication. The characteristics of these two artificial problems are given in Table 7 where the distance function describes multiple horizontal slices. These problems are designed such that the number of frontier vertices and perpendicular feet compared to the total number of vertices is significantly larger than for the previous investigated problems from Table 4. In particular, the ratios jF [ Lj=jVj in Table 4 are below 10%, whereas the corresponding ratios in Table 7 are 106% and 68%. In these problems, the fraction a100(P) of the time spent in serially building the k-d tree is given in Table 8. These fractions are by orders of magnitude larger than those in Table 5 for P = 1 and m = 100. For instance, the value of the serial fraction is given by a100(1) = 5.16% in Table 8 whereas a100(1) = 0.03% for the problem (XC, usl, small) in Table 5. So, the resulting limits Sm(1) for the speedup by Amdahl are lower. The serial fraction a100(1) = 5.16% leads to Sm(1) = 19.4. Clearly, the fraction of the time spent in serially building the k-d tree increases when increasing the number of processes. Consider the problem ðXC ; umsl 60 ; largeÞ as an example. Here, roughly 3% of the runtime is spent in building the tree using one process. However, that part consumes about 45.5% of the time on P = 64 processes. The corresponding speedup of S100(64) = 14 is indeed low. Although the serial runtime of Algorithm 2 is about 3 times smaller than that of the FMM applied to the problem ðXC ; umsl 30 ; smallÞ, the serial use of the k-d tree is a substantial bottleneck for these problems. However, the storage requirement for the k-d tree is only a small fraction of the memory used to solve the corresponding two-phase flow problem by DROPS. Therefore, storing the k-d tree redundantly on all processes offers the advantage to avoid communication between processes while determining the m nearest neighbors. Table 9 gives an overview of the storage requirements for all problems including the previous ones. Here, the symbol MKD denotes the memory used to store the k-d tree on a single process. The Table 7 Problem characteristics of the artificially multi-sliced problems. Problem
jT j
jVj
jF j
jL n F j
jSj
jF [ Lj=jVj [%]
ðXC ; umsl 30 ; smallÞ
224 688
274 625
236 600
54 470
38 025
106.0
ðXC ; umsl 60 ; largeÞ
14 380 464
16 974 593
10 039 448
1 554 798
6 935 145
68.3
Table 8 Parallel performance of artificial multi-sliced problems.
a100(1) [%]
a100(64) [%]
S100(1)
S100(16)
ðX ; u
5.16
37.0
19.38
5.0
5.8
ðXC ; u
2.81
45.5
35.59
8.2
14.0
Problem C
msl 30 ; smallÞ msl 60 ; largeÞ
S100(64)
Table 9 Memory consumption of DROPS. Problem C
msl 30 ; smallÞ msl 60 ; largeÞ sl
ðX ; u
ðXC ; u (XC, u ,small) (XC, usl, large) (XM, usp, small) (XM, usp, large)
MKD [MB] 15.3 610.6 0.2 3.7 11.7 85.0
MDROPS(8) [GB] 3.4
8MKD/MDROPS(8)[%]
MFMM[MB]
3.8
—
—
2.9 — 4.4 32.1
0.053 — 2.1 2.1
201.2 —
MAlg.
2
[MB]
33.1 1 498.6
201.0 — 367.3 2 408.1
9.0 539.9 33.4 233.6
4452
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
storage for the solution of the two-phase flow problem accumulated over P processes is given by MDROPS(P). The symbols MAlg. 2 and MFMM are used for the memory requirements of Algorithm 2 and the FMM on one process. As expected, the memory requirement of the k-d tree for problem ðXC ; umsl 60 ; largeÞ is larger than for all other problems. Clearly, the memory requirement MDROPS(8) is much larger than MKD for all problems. Indeed, the memory requirement MDROPS(8) for problems C sl ðXC ; umsl 60 ; largeÞ and (X , u , large) exceeds the memory capacity of the platform. For the other problems, the ratio 8 MKD/ MDROPS(8) represents the fraction of redundantly storing the k-d tree compared to the total memory requirement of solving the two-phase flow problem on 8 processes. This fraction ranges from 0.05% to 3.8%. The largest fration is observed for the artificial problem ðXC ; umsl 30 ; smallÞ where the k-d tree is intentionally constructed large. The largest fraction is actually quite small. Hence, storing the k-d tree redundantly is affordable. For all problems, Algorithm 2 needs less memory than the FMM. Recall from the previous discussion that the implementation of the FMM cannot solve the large problem (XC, usl, large) with the available memory. This also holds for the problem ðXC ; umsl 60 ; largeÞ. 9. Conclusion and future work While most previous approaches to re-initialize level set functions on unstructured grids for two-phase problems are difficult to parallelize, we present a novel algorithm whose key feature consists of its high degree of parallelism. The algorithm is designed by carefully combining a brute-force strategy and nearest neighbor searches in k-d trees. Moreover, the serial complexity of this algorithm is comparable to the complexity of the fast marching method. Although the number of nearest neighbors controlling the accuracy is an additional user-specified parameter, the distances between the interface and arbitrary vertices are computed efficiently. We show the applicability of the new algorithm for a real-life problem arising from investigating a levitated droplet. Numerical experiments demonstrate that only a moderate number of nearest neighbors needs to be chosen to achieve good accuracy. Furthermore, the algorithm exhibits good scalability on using up to 128 processes for different problem settings. There is still room for further improvements. The distance between the interface and the vertices in the vicinity of the interface can be computed more accurately which is currently being analyzed. Also, the sequential use of k-d trees, though adequate for certain problem classes and computer architectures, may become a bottleneck of the re-initialization algorithm when increasing the number of processes further. To cope with this problem, parallel data structures to determine the set of nearest neighbors are being investigated. It is also interesting to extend the proposed method from computing a signed distance function to solving the Eikonal equation with general speed functions. Another topic is to find an efficient re-initialization technique in an attempt to decrease the parameter controlling the number of nearest neighbors used in the search subspace. Acknowledgments The software DROPS is being developed in a collaboration with the Chair for Numerical Mathematics, RWTH Aachen University, Germany, and was partially funded by the Deutsche Forschungsgemeinschaft (DFG) within the SFB 540 ‘‘Modelbased experimental analysis of kinetic phenomena in fluid multi-phase reactive systems.’’ We thank Arnold Reusken, Sven Groß, and Jörg Grande of the Chair for Numerical Mathematics for stimulating discussions. Also, we are grateful to the anonymous referees for their suggestions. References [1] J.A. Sethian, Level Set Methods and Fast Marching Methods—Evolving Interfaces in Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science, Second ed., Cambridge University Press, 1999. [2] D. Adalsteinsson, J.A. Sethian, Transport and diffusion of material quantities on propagating interfaces via level set methods, J. Comput. Phys. 185 (1) (2003) 271–288. [3] D.L. Chopp, Another look at velocity extensions in the level set method, SIAM J. Sci. Comput. 31 (5) (2009) 3255–3273. [4] D.L. Chopp, Computing minimal surfaces via level set curvature flow, J. Comput. Phys. 106 (1) (1993) 77–91. [5] S. Groß, A. Reusken, Finite element discretization error analysis of a surface tension force in two-phase incompressible flows, SIAM J. Numer. Anal. 45 (2007) 1679–1700. [6] G.M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, in: Proc. of AFIPS Spring Joint Comput. Conf., ACM, New York, NY, USA, 1967, pp. 483–485. [7] E. Dejnozkova, P. Dokladal, A parallel algorithm for solving the Eikonal equation, in: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), vol. 3, 2003, pp. III.325–III.328. [8] S. Groß, V. Reichelt, A. Reusken, A finite element based level set method for two-phase incompressible flows, Comput. Vis. Sci. 9 (4) (2006) 239–257. [9] C.-W. Shu, High order numerical methods for time dependent Hamilton–Jacobi equations, in: S. Goh, A. Ron, Z. Shen (Eds.), Mathematics and Computation in Imaging Science and Information Processing, Lecture Notes Series, vol. 11, Institute for Mathematical Sciences National University of Singapore, World Scientific Press, Singapore, 2007, pp. 47–91. [10] J.A. Sethian, A fast marching level set method for monotonically advancing fronts, Proc. Natl. Acad. Sci. USA 93 (4) (1996) 1591–1595. [11] R. Kimmel, J.A. Sethian, Computing geodesic paths on manifolds, Proc. Natl. Acad. Sci. USA 95 (15) (1998) 8431–8435. [12] J.A. Sethian, A. Vladimirsky, Fast methods for the Eikonal and related Hamilton–Jacobi equations on unstructured meshes, Proc. Natl. Acad. Sci. USA 97 (11) (2000) 5699–5703. [13] F. Mémoli, G. Sapiro, Fast computation of weighted distance functions and geodesics on implicit hyper-surfaces, J. Comput. Phys. 173 (1) (2001) 730– 764.
O. Fortmeier, H. Martin Bücker / Journal of Computational Physics 230 (2011) 4437–4453
4453
[14] M. Herrmann, A domain decomposition parallelization of the fast marching method, in: Annual Research Briefs 2003, Center for Turbulence Research, Stanford University, Stanford, CA, USA, 2003, pp. 213–225. [15] M. Herrmann, A parallel Eulerian interface tracking/Lagrangian point particle multi-scale coupling procedure, J. Comput. Phys. 229 (3) (2010) 745–759. [16] M.C. Tugurlan, Fast marching methods—parallel implementation and analysis, Ph.d. thesis, Department of Mathematics, Louisiana State University, USA, 2008. [17] M. Breuß, E. Cristiani, P. Gwosdek, O. Vogel, A domain-decomposition-free parallelisation of the fast marching method, Preprint of the Department of Mathematics 250, Saarland University, Saarbrücken, submitted for publication, 2009. [18] W.-K. Jeong, R.T. Whitaker, A fast iterative method for Eikonal equations, SIAM J. Sci. Comput. 30 (5) (2008) 2512–2534. [19] H.-K. Zhao, A fast sweeping method for Eikonal equations, Math. Comp. 74 (250) (2004) 603–627. [20] P.-E. Danielsson, Euclidean distance mapping, Comput. Graphics Image Proc. 14 (1980) 227–248. [21] M. Boué, P. Dupuis, Markov chain approximations for deterministic control problems with affine dynamics and quadratic cost in the control, SIAM J. Numer. Anal. 36 (3) (1999) 667–695. [22] J. Qian, Y.-T. Zhang, H.-K. Zhao, Fast sweeping methods for Eikonal equations on triangular meshes, SIAM J. Numer. Anal. 45 (1) (2007) 83–107. [23] H. Zhao, Parallel implementations of the fast sweeping method, J. Comput. Math. 25 (4) (2007) 421–429. [24] S. Hysing, S. Turek, The Eikonal equation: numerical efficiency vs. algorithmic complexity on quadrilateral grids, in: Proc. of Algoritmy 2005, Conference on Scientific Computing, Slovak University of Technology, Bratislava, 2005, pp. 22–31. [25] R. Fabbri, L.D.F. Costa, J.C. Torelli, O.M. Bruno, 2D Euclidean distance transform algorithms: a comparative survey, ACM Comput. Surv. 40 (1) (2008) 1– 44. [26] M.W. Jones, J.A. Baerentzen, M. Sramek, 3D distance fields: a survey of techniques and applications, IEEE Trans. Vis. Comput. Graphics 12 (4) (2006) 581–599. [27] Y.-H. Lee, S.-J. Horng, J. Seitzer, Parallel computation of the Euclidean distance transform on a three-dimensional image array, IEEE Trans. Parallel Distrib. Syst. 14 (3) (2003) 203–212. [28] Y.-R. Wang, S.-J. Horng, An O(1)time algorithm for the 3D Euclidean distance transform on the CRCW PRAM model, IEEE Trans. Parallel Distrib. Syst. 14 (2003) 973–982. [29] K.E. Hoff III, T. Culver, J. Keyser, M. Lin, D. Manocha, Fast computation of generalized Voronoi diagrams using graphics hardware, in: Proc. of 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’99), ACM Press, New York, NY, USA, 1999, pp. 277–286. [30] A. Sud, N. Govindaraju, R. Gayle, D. Manocha, Interactive 3D distance field computation using linear factorization, in: Proc. of 2006 Symposium on Interactive 3D Graphics and Games (I3D ’06), ACM, New York, NY, USA, 2006, pp. 117–124. [31] G. Rong, T.-S. Tan, Jump flooding in GPU with applications to Voronoi diagram and distance transform, in: Proc. of 2006 Symposium on Interactive 3D Graphics and Games (I3D ’06), ACM, New York, NY, USA, 2006, pp. 109–116. [32] G. Rong, T.-S. Tan, Variants of jump flooding algorithm for computing discrete Voronoi diagrams, in: Proc. of Fourth International Symposium on Voronoi Diagrams in Science and Engineering (ISVD ’07), IEEE Computer Society, Washington, DC, USA, 2007, pp. 176–181. [33] T.-T. Cao, K. Tang, A. Mohamed, T.-S. Tan, Parallel banding algorithm to compute exact distance transform with the GPU, in: Proc. of 2010 Symposium on Interactive 3D Graphics and Games (I3D ’10), ACM, New York, NY, USA, 2010, pp. 83–90. [34] J. Schneider, M. Kraus, R. Westermann, GPU-based real-time discrete Euclidean distance transforms with precise error bounds, in: Proc. of International Conference on Computer Vision Theory and Applications (VISAPP), INSTICC Press, Setubal, Portugal, 2009, pp. 435–442. [35] M. Sussman, P. Smereka, S. Osher, A level set approach for computing solutions to incompressible two-phase flow, J. Comput. Phys. 114 (1) (1994) 146– 159. [36] J.U. Brackbill, D.B. Kothe, C. Zemach, A continuum method for modeling surface tension, J. Comput. Phys. 100 (2) (1992) 335–354. [37] Y.C. Chang, T.Y. Hou, B. Merriman, S. Osher, A level set formulation of Eulerian interface capturing methods for incompressible fluid flows, J. Comput. Phys. 124 (2) (1996) 449–464. [38] J.L. Bentley, Multidimensional binary search trees used for associative searching, Commun. ACM 18 (9) (1975) 509–517. [39] J.H. Friedman, J.L. Bentley, R.A. Finkel, An algorithm for finding best matches in logarithmic expected time, ACM Trans. Math. Softw. 3 (3) (1977) 209– 226. [40] H. Samet, Foundations of Multidimensional and Metric Data Structures, Morgan Kaufman Publishers Inc., San Francisco, CA, USA, 2006. [41] R. Thakur, R. Rabenseifner, W. Gropp, Optimization of collective communication operations in MPICH, Int. J. High Perform. Comput. Appl. 19 (1) (2005) 49–66. [42] S. Groß, A. Reusken, Parallel multilevel tetrahedral grid refinement, SIAM J. Sci. Comput. 26 (4) (2005) 1261–1288. [43] O. Fortmeier, H.M. Bücker, A parallel strategy for a level set simulation of droplets moving in a liquid medium, in: J.M.L.M. Palma, M. Daydé, O. Marques, J.C. Lopes (Eds.), High Performance Computing for Computational Science – VECPAR 2010, Lecture Notes in Computer Science, vol. 6449, Springer, Berlin, 2011, pp. 200–209. [44] S. Groß, M. Soemers, A. Mhamdi, F. Al-Sibai, A. Reusken, W. Marquardt, U. Renz, Identification of boundary heat fluxes in a falling film experiment using high resolution temperature measurements, Int. J. Heat Mass Transfer 48 (2005) 5549–5562. [45] M. Karalashvili, S. Groß, A. Mhamdi, A. Reusken, W. Marquardt, Incremental identification of transport coefficients in convection–diffusion systems, SIAM J. Sci. Comput. 3 (2008) 3249–3269. [46] E. Bertakis, S. Groß, J. Grande, O. Fortmeier, A. Reusken, A. Pfennig, Validated simulation of droplet sedimentation with finite-element and level-set methods, Chem. Eng. Sci. 65 (6) (2010) 2037–2051. [47] H.M. Bücker, J. Willkomm, S. Groß, O. Fortmeier, Discrete and continuous adjoint approaches to estimate boundary heat fluxes in falling films, Optim. Methods Softw. 26 (1) (2011) 105–125. [48] M.B. Kennel, Kdtree 2: Fortran 95 and C++ software to efficiently search for near neighbors in a multi-dimensional Euclidean space (2004). . [49] G. Karypis, V. Kumar, A parallel algorithm for multilevel graph partitioning and sparse matrix ordering, J. Parallel Distrib. Comput. 48 (1) (1998) 71–95. [50] E. Gross-Hardt, E. Slusanschi, H.M. Bücker, A. Pfennig, C.H. Bischof, Practical shape optimization of a levitation device for single droplets, Optim. Eng. 9 (2) (2008) 179–199.