Ohio State University, Columbus, United States of America
The Message Passing Interface (MPI) is the de facto standard for designing and executing applications on massively parallel hardware. MPI collectives provide a convenient abstraction for multiple processes or threads to communicate with one another. Mellanox’s HDR InfiniBand switches provide Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) capabilities, which offload collective communication to the network and reduce CPU involvement. In this paper, we propose, design, and implement SHARP-based solutions for MPI_Reduce and MPI_Barrier in MVAPICH2-X. We evaluate the performance impact of the proposed and existing SHARP-based solutions for the MPI_Allreduce, MPI_Reduce, and MPI_Barrier operations on the 8th-ranked TACC Frontera HPC system. Compared with a host-based solution, the SHARP-based designs reduce latency by up to 5.4x for Reduce, 5.1x for Allreduce, and 7.1x for Barrier at the full system scale of 7,861 nodes.