Scalable Machines Research, United States of America
ARM is an attractive CPU architecture for exascale systems because of its energy efficiency. As a recent entrant in HPC, however, ARM lags in its software stack, especially in performance tooling. Notably, there is a lack of fine-grained measurement tools for analyzing fully optimized HPC binary executables on ARM processors. In this paper, we introduce DRCCTPROF, a fine-grained call path profiling framework for binaries running on ARM architectures. The unique ability of DRCCTPROF is that it obtains the full calling context at every machine instruction that executes, which provides detailed diagnostic feedback for performance optimization and correctness tools. Furthermore, DRCCTPROF not only associates each instruction with source code along the call path, but also attributes memory access instructions to the data objects they access. Finally, DRCCTPROF incurs moderate overhead and provides a compact view for visualizing the profiles collected from parallel executions.