Pacific Northwest National Laboratory (PNNL), Washington State University, United States of America
The CCSD(T) coupled-cluster model with perturbative triples is considered a gold standard for computational modeling of the correlated behavior of electrons in molecular systems. A fundamental constraint in NWChem's GPU-accelerated implementation of the CCSD(T) method is that the global memory capacity of GPUs is small compared to the main memory capacity of host nodes, which forces smaller tile sizes for high-dimensional tensor contractions. We describe a coordinated redesign that addresses this limitation and the associated data movement overheads. It includes a novel fused GPU kernel for a set of tensor contractions, along with inter-node communication optimization and data caching. The new implementation of GPU-accelerated CCSD(T) improves overall performance by 3.4x. We discuss the trade-offs in using this fused algorithm on current and future supercomputing platforms.
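To give a rough sense of why memory capacity limits tile size, the sketch below estimates the footprint of a single dense tile of a six-index tensor (the triples amplitudes in CCSD(T) carry six indices) in double precision. The function names and the 16 GiB and 256 GiB capacities are hypothetical values chosen for illustration, not figures from this work. The estimate also ignores input tensors and workspace, so real limits would be tighter.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def tile_bytes(tile_size: int, rank: int = 6, bytes_per_element: int = 8) -> int:
    """Bytes for one dense tile with `tile_size` entries along each of `rank` indices."""
    return tile_size ** rank * bytes_per_element


def max_tile_size(capacity_bytes: int, rank: int = 6, bytes_per_element: int = 8) -> int:
    """Largest tile size for which a single dense tile fits in `capacity_bytes`."""
    t = 0
    while tile_bytes(t + 1, rank, bytes_per_element) <= capacity_bytes:
        t += 1
    return t


if __name__ == "__main__":
    # Hypothetical capacities, for illustration only.
    for label, capacity in (("16 GiB device", 16 * GIB), ("256 GiB host", 256 * GIB)):
        print(f"{label}: max tile size {max_tile_size(capacity)}")
```

Because the footprint grows with the sixth power of the tile size, doubling the tile size from 16 to 32 raises the single-tile footprint from 128 MiB to 8 GiB in this simplified model.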