IEEE CG&A, SPECIAL ISSUE ON HIGH PERFORMANCE VISUALIZATION AND ANALYSIS, SEPTEMBER 2015


Supplemental Material: In situ Visualization and Analysis of Ion Accelerator Simulations using Warp and VisIt

Oliver Rübel*, Burlen Loring*, Jean-Luc Vay, David P. Grote, Remi Lehe, Stepan Bulanov, Henri Vincenti, and E. Wes Bethel


CONTENTS

1 Memory Profiling and Optimization
2 Yee Grid Recentering Optimization
References

• B. Loring*, O. Rübel*, and E. W. Bethel are with the Data Visualization and Analytics Group of the Computational Research Division, Lawrence Berkeley National Laboratory (LBNL), 1 Cyclotron Road, Berkeley, CA, 94720. E-mail: [email protected]
• J. L. Vay, D. Grote, R. Lehe, S. Bulanov, and H. Vincenti are with the Accelerator Technology & Applied Physics Division, LBNL.
• D. Grote is with the Fusion Energy Sciences Program, Lawrence Livermore National Laboratory.

Manuscript received September 1, 2015; revised February 2, 2016.
*These authors contributed equally to this work.

1 MEMORY PROFILING AND OPTIMIZATION

We analyzed the memory use of Warp [1], [2], WarpIV [3], and VisIt [4] during large-scale runs on the Cray XC30, Edison. Fig. 1 shows coarse-grained profiling of a two-color laser plasma simulation, collected with WarpIV’s internal instrumentation. Two line plots track the number of particles over time (Fig. 1, magenta and cyan). Three filled area plots (Fig. 1, gray, blue, and green), each bounded by the minimum and maximum per-process memory consumption, show the total memory used over time for three distinct runs. Each filled area plot includes the average memory use (dashed line) and a long-term trend line showing the average consumption once the number of particles stabilizes (dashed line with circles). Ideally, the long-term trend line is flat, indicating no growth on average in per-process memory consumption.

The first run, shown in green, details the memory consumption of Warp without any visualization and analysis. Warp’s memory use flattens out once the number of particles stabilizes. The second run, in black, details WarpIV’s memory consumption before our profiling and optimization work. The slope of the trend line indicates growth of approximately 4.3 MB per visualization update, which resulted in out-of-memory (OOM) conditions after approximately 300k iterations. We profiled the application using the Massif heap analyzer [5, Chapter 9] and identified and addressed a number of memory leaks and inefficiencies in VisIt. The third run, in blue, shows WarpIV after our profiling and optimization work. Growth of approximately 0.2 MB per update remains. We traced this growth to VisIt pipeline internals but did not find a way to safely reclaim the memory during the run. Fortunately, the remaining growth is small enough that our runs are not drastically impacted. Note that all of this work took place after the NumPy zero-copy optimization, so the figure does not show the improvements gained in that regard.

Fig. 1. Comparing Warp’s memory use (green) with WarpIV before (black) and after (blue) memory profiling and optimization.
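The per-update growth figures quoted above are slopes of a long-term trend line. The text does not state how that line was fitted, so the following is only a minimal sketch that assumes an ordinary least-squares fit over the samples taken after the particle count stabilizes. The function name growth_per_update, its parameters, and the synthetic samples are hypothetical and are not the data shown in Fig. 1.

```python
import numpy as np

def growth_per_update(mem_mb, first_stable_update=0):
    """Least-squares slope, in MB per update, of per-update memory samples.

    mem_mb holds one memory sample (in MB) per visualization update.
    Samples before first_stable_update are ignored, mirroring a trend
    line that starts once the number of particles has stabilized.
    """
    mem = np.asarray(mem_mb, dtype=float)[first_stable_update:]
    updates = np.arange(first_stable_update, first_stable_update + mem.size)
    slope, _intercept = np.polyfit(updates, mem, 1)
    return slope

# Synthetic, illustrative samples only: a 1000 MB baseline that grows linearly.
samples = 1000.0 + 4.3 * np.arange(50)
print(f"{growth_per_update(samples, first_stable_update=10):.2f} MB per update")
```

A slope near zero corresponds to the flat trend line described as ideal above.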

2 YEE GRID RECENTERING OPTIMIZATION

This section gives an overview of the performance of three implementations that interpolate staggered “Yee” grids to a node-centered mesh. In total, there are 11 cases to consider for 2D and 3D Yee cell conversion. Here we focus on the most computationally expensive case to highlight performance issues. We compare our implementation, written in modern C++ and exposed to Python via bindings automatically generated by SWIG (Listing 3), with i) a Python-based version that iterates over the mesh directly in Python (Listing 1) and ii) an implementation that uses NumPy’s optimized broadcasting implementation (Listing 2). The C++ implementation is designed to minimize cache thrashing and enable compiler auto-vectorization [6]. We have employed these and other optimization techniques where applicable in our C++ extensions. Fig. 2 provides an overview of the performance of the three codes for varying mesh sizes. Note the logarithmic scale along the time axis. Our implementation shows an order of magnitude improvement over the NumPy version and three orders of magnitude improvement over the Python implementation.

[Figure 2: time in seconds (log scale, axis ticks from 0.002 to 200) versus mesh resolution (512 x 512 x 4, 1024 x 1024 x 4, 2048 x 2048 x 4) for the Python, NumPy, and C++ implementations.]

Fig. 2. Performance for the interpolation of Yee meshes using i) Python, ii) NumPy broadcasting, and iii) our C++ implementation.

Listing 1. Pure Python

# i_in, j_in, k_in are input indices, offset from the output indices by ng
k = 0
while k < nco[2]:
    k_in = k + ng[2]
    j = 0
    while j < nco[1]:
        j_in = j + ng[1]
        i = 0
        while i < nco[0]:
            i_in = i + ng[0]
            # average of the eight surrounding input values
            var[i, j, k] = (
                  ai[i_in    , j_in    , k_in - 1]
                + ai[i_in    , j_in    , k_in    ]
                + ai[i_in - 1, j_in    , k_in    ]
                + ai[i_in - 1, j_in    , k_in - 1]
                + ai[i_in    , j_in - 1, k_in    ]
                + ai[i_in - 1, j_in - 1, k_in - 1]
                + ai[i_in - 1, j_in - 1, k_in    ]
                + ai[i_in    , j_in - 1, k_in - 1]
                ) / 8.0
            i += 1
        j += 1
    k += 1

Listing 2. NumPy broadcasting

# Sum of eight shifted slices of ai, one per corner offset, then averaged.
# Each of the eight offsets appears exactly once, matching Listing 1.
varout = (ai[ng[0]  :-ng[0]  , ng[1]-1:-ng[1]-1, ng[2]-1:-ng[2]-1]
        + ai[ng[0]-1:-ng[0]-1, ng[1]-1:-ng[1]-1, ng[2]-1:-ng[2]-1]
        + ai[ng[0]-1:-ng[0]-1, ng[1]  :-ng[1]  , ng[2]-1:-ng[2]-1]
        + ai[ng[0]  :-ng[0]  , ng[1]  :-ng[1]  , ng[2]-1:-ng[2]-1]
        + ai[ng[0]  :-ng[0]  , ng[1]-1:-ng[1]-1, ng[2]  :-ng[2]  ]
        + ai[ng[0]-1:-ng[0]-1, ng[1]-1:-ng[1]-1, ng[2]  :-ng[2]  ]
        + ai[ng[0]-1:-ng[0]-1, ng[1]  :-ng[1]  , ng[2]  :-ng[2]  ]
        + ai[ng[0]  :-ng[0]  , ng[1]  :-ng[1]  , ng[2]  :-ng[2]  ]) / 8.0
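Both Python listings are meant to compute the same eight-point average, so a quick way to gain confidence in either one is to compare a loop version against a sliced version on a small array. The sketch below does this under stated assumptions: the function names recenter_loops and recenter_slices, the array sizes, and the single layer of padding per side are hypothetical, and the slices use explicit stop indices instead of the negative indices in Listing 2.

```python
import itertools
import numpy as np

def recenter_loops(ai, ng, nco):
    """Average the eight surrounding input values with explicit Python loops."""
    var = np.empty(nco, dtype=ai.dtype)
    for k in range(nco[2]):
        for j in range(nco[1]):
            for i in range(nco[0]):
                ii, jj, kk = i + ng[0], j + ng[1], k + ng[2]
                total = 0.0
                # visit offsets (0 or -1) in each of the three directions
                for di, dj, dk in itertools.product((0, -1), repeat=3):
                    total += ai[ii + di, jj + dj, kk + dk]
                var[i, j, k] = total / 8.0
    return var

def recenter_slices(ai, ng, nco):
    """Same average, written as a sum of eight shifted array slices."""
    out = np.zeros(nco, dtype=ai.dtype)
    for di, dj, dk in itertools.product((0, -1), repeat=3):
        out += ai[ng[0] + di : ng[0] + di + nco[0],
                  ng[1] + dj : ng[1] + dj + nco[1],
                  ng[2] + dk : ng[2] + dk + nco[2]]
    return out / 8.0

# Hypothetical small case: output of 6 x 5 x 4 values, padding of 1 per side.
ng = (1, 1, 1)
nco = (6, 5, 4)
rng = np.random.default_rng(0)
ai = rng.random(tuple(n + 2 * g for n, g in zip(nco, ng)))

assert np.allclose(recenter_loops(ai, ng, nco), recenter_slices(ai, ng, nco))
```

The sliced version performs eight whole-array additions instead of one Python-level iteration per output value, which is the difference the NumPy curve in Fig. 2 reflects.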


Listing 3. C++ implementation

template <typename T>
void cellToNode(T * __restrict__ ain,
    mesh_id_t nxi, mesh_id_t nyi,
    mesh_id_t nxo, mesh_id_t nyo, mesh_id_t nzo,
    mesh_id_t ngx, mesh_id_t ngy, mesh_id_t ngz,
    T * __restrict__ aout)
{
    mesh_id_t nxyo = nxo*nyo;
    mesh_id_t nxyi = nxi*nyi;
    T dxyzi = 0.125;
    for (mesh_id_t k = 0; k < nzo; ++k)
    {
        for (mesh_id_t j = 0; j < nyo; ++j)
        {
            // output row (j, k); each pass below streams one input row into it
            T *ao = &aout[k*nxyo + j*nxo];

            // the first pass initializes the output row
            T *ai = &ain[(k + ngz    )*nxyi + (j + ngy    )*nxi + ngx    ];
            for (mesh_id_t i = 0; i < nxo; ++i)
                ao[i] = dxyzi*ai[i];

            // the remaining seven passes accumulate into it
            ai = &ain[(k + ngz    )*nxyi + (j + ngy    )*nxi + ngx - 1];
            for (mesh_id_t i = 0; i < nxo; ++i)
                ao[i] += dxyzi*ai[i];

            ai = &ain[(k + ngz    )*nxyi + (j + ngy - 1)*nxi + ngx - 1];
            for (mesh_id_t i = 0; i < nxo; ++i)
                ao[i] += dxyzi*ai[i];

            ai = &ain[(k + ngz    )*nxyi + (j + ngy - 1)*nxi + ngx    ];
            for (mesh_id_t i = 0; i < nxo; ++i)
                ao[i] += dxyzi*ai[i];

            ai = &ain[(k + ngz - 1)*nxyi + (j + ngy    )*nxi + ngx    ];
            for (mesh_id_t i = 0; i < nxo; ++i)
                ao[i] += dxyzi*ai[i];

            ai = &ain[(k + ngz - 1)*nxyi + (j + ngy    )*nxi + ngx - 1];
            for (mesh_id_t i = 0; i < nxo; ++i)
                ao[i] += dxyzi*ai[i];

            ai = &ain[(k + ngz - 1)*nxyi + (j + ngy - 1)*nxi + ngx - 1];
            for (mesh_id_t i = 0; i < nxo; ++i)
                ao[i] += dxyzi*ai[i];

            ai = &ain[(k + ngz - 1)*nxyi + (j + ngy - 1)*nxi + ngx    ];
            for (mesh_id_t i = 0; i < nxo; ++i)
                ao[i] += dxyzi*ai[i];
        }
    }
}
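The pointer arithmetic in Listing 3 treats x as the fastest-varying index: element (i, j, k) of the input sits at offset k*nxi*nyi + j*nxi + i, so each inner loop over i is a unit-stride sweep. The following Python sketch checks that this offset formula agrees with NumPy's column-major ("F") ordering. The helper name flat_index and the array sizes are hypothetical and chosen only for illustration; they are not part of WarpIV.

```python
import numpy as np

def flat_index(i, j, k, nxi, nyi):
    """Offset of element (i, j, k) when x varies fastest, as in Listing 3."""
    return k * nxi * nyi + j * nxi + i

# Hypothetical input dimensions, chosen only for illustration.
nxi, nyi, nzi = 5, 4, 3
a = np.arange(nxi * nyi * nzi, dtype=float).reshape((nxi, nyi, nzi), order="F")
flat = a.ravel(order="F")  # 1D array in column-major element order

# Every (i, j, k) maps to the same element through either indexing scheme.
for k in range(nzi):
    for j in range(nyi):
        for i in range(nxi):
            assert flat[flat_index(i, j, k, nxi, nyi)] == a[i, j, k]

# Consecutive i values are adjacent in memory.
assert flat_index(1, 2, 1, nxi, nyi) - flat_index(0, 2, 1, nxi, nyi) == 1
```

Unit-stride inner loops of this kind are what a compiler can typically auto-vectorize, which is consistent with the design goal stated for the C++ implementation.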


REFERENCES

[1] A. Friedman, R. Cohen, D. Grote, S. Lund, W. Sharp, J.-L. Vay, I. Haber, and R. Kishek, “Computational methods in the Warp code framework for kinetic simulations of particle beams and plasmas,” IEEE Transactions on Plasma Science, vol. 42, no. 5, pp. 1321–1334, May 2014.
[2] Warp. [ONLINE] http://warp.lbl.gov.
[3] WarpIV. [ONLINE] https://bitbucket.org/berkeleylab/warpiv.
[4] H. Childs, E. Brugger, B. Whitlock, J. Meredith, S. Ahern, D. Pugmire, K. Biagas, M. Miller, G. H. Weber, H. Krishnan, T. Fogal, A. Sanderson, C. Garth, E. W. Bethel, D. Camp, O. Rübel, M. Durant, J. Favre, and P. Navratil, “VisIt: An End-User Tool for Visualizing and Analyzing Very Large Data,” in High Performance Visualization—Enabling Extreme-Scale Scientific Insight, ser. Chapman & Hall, CRC Computational Science, E. W. Bethel, H. Childs, and C. Hansen, Eds. Boca Raton, FL, USA: CRC Press/Francis–Taylor Group, Nov. 2012, pp. 357–372.
[5] Valgrind Developers, Valgrind User Manual, September 2015. [ONLINE] http://valgrind.org/docs/manual/ms-manual.html.
[6] L. Borges and P. Thierry, “3D finite differences on multi-core processors,” Intel, Tech. Rep., 2011.