Stencil computations are at the core of Computational Fluid Dynamics (CFD). Because they are memory-bound, numerous temporal tiling algorithms have been proposed to improve their performance. Although efficient, most of these algorithms target a single iteration space on shared-memory machines. In CFD, however, we are confronted with multiple connected iteration spaces distributed across many nodes.
We propose Pencil, a pipelined stencil algorithm for multiple iteration spaces in distributed computing. We identify the optimal combination of MPI and OpenMP for temporal tiling based on an in-depth analysis of single-node performance, and we exploit deep halos to decouple connected iteration spaces. Moreover, Pencil pipelines computation and communication so that the two overlap. Evaluated on 4 different stencils across 6 numerical schemes, our algorithm achieves a speedup of up to 1.9x over Pluto on a single node and of 1.3x to 3.41x over an MPI+OpenMP Funneled implementation with space tiling for a multi-block grid on 32 nodes.
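The deep-halo idea mentioned above can be illustrated with a minimal serial sketch. It is not the Pencil implementation: it assumes a 1D three-point averaging stencil with fixed end cells, two blocks, and a halo depth equal to the number of time steps, and the names `step`, `advance`, and `advance_two_blocks` are hypothetical. It only shows that a block padded with a k-deep halo can advance k time steps without communicating with its neighbour, at the cost of some redundant computation in the halo.

```python
def step(u):
    """One Jacobi sweep of a 3-point averaging stencil; both end cells stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def advance(u, steps):
    """Apply `steps` sweeps to the whole array."""
    for _ in range(steps):
        u = step(u)
    return u


def advance_two_blocks(u, mid, k):
    """Advance k steps on two blocks split at `mid`, each padded with a k-deep halo.

    Each block is advanced independently. Stale values enter from the halo
    side one cell per step, so after k steps only the halo cells are invalid
    and the owned cells match the single-domain result.
    """
    n = len(u)
    assert k <= mid <= n - k, "halo must fit inside the neighbouring block"
    left = advance(u[:mid + k], k)    # owned cells [0, mid) plus k halo cells
    right = advance(u[mid - k:], k)   # k halo cells plus owned cells [mid, n)
    return left[:mid] + right[k:]     # discard the halo cells


if __name__ == "__main__":
    u0 = [float(i * i % 7) for i in range(16)]
    reference = advance(u0, 3)
    blocked = advance_two_blocks(u0, mid=8, k=3)
    print(max(abs(a - b) for a, b in zip(reference, blocked)))
```

In a distributed setting, the two slices would live on different ranks and the halo would be refreshed with one message every k steps rather than one message per step, which is what loosens the coupling between connected iteration spaces.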