Ohio State University, Columbus, United States of America
The MPI-3.0 standard introduced neighborhood collectives to support the sparse communication patterns used in many applications. In this paper, we propose a hierarchical and distributed graph topology that considers both the physical topology of the system and the virtual communication pattern of processes to improve the performance of large-message neighborhood collectives. Moreover, we propose two design alternatives on top of the hierarchical design: 1) LAG-H, which assumes the same communication load for all processes; and 2) LAW-H, which accounts for the communication load of each process so that load is distributed fairly. We propose a mathematical model that determines the communication capacity of each process, and we use the derived capacity to distribute the load fairly among processes. Our experimental results on up to 28,672 processes show up to 9x speedup for various process topologies. We also observe up to 8.2% performance gain and 34x speedup for the NAS-DT and SpMM application kernels, respectively.