School of Computational Science and Engineering, United States of America
The p4est library implements octree-based adaptive mesh refinement (AMR) and has demonstrated parallel scalability beyond 100,000 MPI processes in previous weak scaling studies. This work focuses on the strong scalability of mesh adaptivity in p4est, where the communication pattern of the existing 2:1-balance algorithm is a latency bottleneck. The sorting-based algorithm of Malhotra and Biros has balanced communication but synchronizes all processes. We propose an algorithm that combines sorting and neighbor-to-neighbor exchange to minimize the number of processes with which each process synchronizes.
We measure the performance of these algorithms on several test problems on Stampede2 at TACC. Both the parallel-sorting and minimally-synchronous algorithms significantly outperform the existing algorithm and have nearly identical performance on up to 1,024 Xeon Phi KNL nodes, so the asymptotic advantage of the minimally-synchronous algorithm does not translate to improved performance at this scale. We conclude by showing that global metadata communication will limit future strong scaling.