Data-parallel training is a powerful family of methods for efficiently training deep neural networks on large datasets. However, recent studies have shown that the benefit of an increased batch size, in terms of both training speed and model performance, diminishes rapidly beyond a certain point. This appears to hold even for LARS, the state-of-the-art large-batch stochastic optimization method.
In this paper, we combine LARS with online codistillation, a recently developed, efficient deep learning algorithm built on a different philosophy: stabilizing the training procedure through a collaborative ensemble of models. We show that the combination of large-batch training and online codistillation is substantially more efficient than either technique alone. We also present a novel implementation of online codistillation that further speeds up the computation. We demonstrate the efficacy of our approach on various benchmark datasets.
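To make the collaborative-ensemble idea concrete, the following is a minimal sketch of a two-model online-codistillation objective, in which each model minimizes its own cross-entropy plus a term pulling its predictions toward its peer's. The weighting `alpha` and the exact form of the distillation term are illustrative assumptions, not the formulation used in this paper.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def codistillation_loss(logits_a, logits_b, labels, alpha=0.5):
    """Illustrative two-model codistillation objective (hypothetical
    weighting `alpha`; the paper's exact formulation may differ).

    Returns the per-model losses: standard cross-entropy on the labels
    plus a distillation cross-entropy against the peer's predictions
    (which would be treated as constants, i.e. stop-gradient, in
    an actual training step).
    """
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    n = len(labels)
    ce_a = -np.log(p_a[np.arange(n), labels]).mean()
    ce_b = -np.log(p_b[np.arange(n), labels]).mean()
    # Cross-entropy of each model's prediction against its peer's.
    dist_a = -(p_b * np.log(p_a)).sum(axis=-1).mean()
    dist_b = -(p_a * np.log(p_b)).sum(axis=-1).mean()
    return ce_a + alpha * dist_a, ce_b + alpha * dist_b
```

In an online setting both models are trained from scratch simultaneously, exchanging predictions periodically rather than requiring a pre-trained teacher.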