Enhancing Collective Communication in MCM Accelerators for Deep Learning Training | IEEE Conference Publication | IEEE Xplore

Enhancing Collective Communication in MCM Accelerators for Deep Learning Training


Abstract:

With the widespread adoption of Deep Learning (DL) models, the demand for DL accelerator hardware has risen. On top of that, DL models are becoming massive in size. To ac...Show More

Abstract:

With the widespread adoption of Deep Learning (DL) models, the demand for DL accelerator hardware has risen. On top of that, DL models are becoming massive in size. To accommodate those models, multi-chip-module (MCM) emerges as an effective approach for implementing large-scale DL accelerators. While MCMs have shown promising results for DL inference, its potential for Deep Learning Training remains largely unexplored. Current approaches fail to fully utilize available links in a mesh interconnection network of an MCM accelerator. To address this issue, we propose two novel AllReduce algorithms for mesh-based MCM accelerators - RingBiOdd and Three Tree Overlap (TTO). RingBiOdd is a ring-based algorithm that enhances the bandwidth of AllReduce by creating two unidirectional rings using bidirectional interconnects. On the other hand, TTO is a tree-based algorithm that improves AllReduce performance by overlapping data chunks. TTO constructs three topology-aware disjoint trees and runs different steps of the AllReduce operation in parallel. We present a detailed design and implementation of the proposed approaches. Our experimental results over seven DL models indicate that RingBiOdd achieves 50% and 8% training time reduction over unidirectional Ring AllReduce and MultiTree. Furthermore, TTO demonstrates 33% and 29% training time reduction over state-ofthe-art MultiTree and Bidirectional Ring AllReduce, respectively.
Date of Conference: 02-06 March 2024
Date Added to IEEE Xplore: 02 April 2024
ISBN Information:

ISSN Information:

Conference Location: Edinburgh, United Kingdom

Contact IEEE to Subscribe

References

References is not available for this document.