University of California, Irvine, United States of America
The supercomputing community holds an outdated view: the network is a single device. Modern interconnects, however, feature multiple network hardware contexts that serve as parallel interfaces into the network from a single node. Additionally, as we approach the limits of a single network link’s throughput, supercomputers are deploying multiple NICs per node to provide higher bandwidth per node. Hence, the modern reality is that the network features abundant parallelism. The outdated view drastically hurts the communication performance of the MPI+threads model, which is increasingly being adopted over the traditional MPI-everywhere model to better map to modern processors, which feature a smaller share of resources per core than their predecessors. Domain scientists typically do not expose logical parallelism in their MPI+threads communication, and MPI libraries still use conservative approaches, such as a global critical section, to maintain MPI’s ordering constraints, thus serializing access to the parallel network resources and limiting performance.

The goal of this dissertation is to dissolve the communication bottleneck in MPI+threads. Existing solutions either sacrifice correctness for performance or jump to MPI standard extensions without fairly comparing the capabilities of the existing standard. The holistic bottom-up analysis in this dissertation first investigates the limits of multithreaded communication on modern network hardware and then devises a new MPI-3.1 implementation with virtual communication interfaces (VCIs) for fast MPI+threads communication. The domain scientist can use the VCIs either explicitly (MPI Endpoints) or implicitly (MPI-3.1). The dissertation compares the two solutions through both performance and usability lenses.