A Fine-Grained Packet Loss Tolerance Transmission Algorithm for Communication Optimization in Distributed Deep Learning


Abstract:

Communication overhead is a significant challenge in distributed deep learning (DDL) training, often hindering efficiency. Existing solutions such as gradient compression, compute/communication overlap, and layer-wise flow scheduling are often coarse-grained and insufficient, especially under network congestion. These congestion-unaware methods can lead to long flow completion times, known as tail latency, resulting in extended training time. In this paper, we argue that packet loss tolerance methods can mitigate tail latency without sacrificing training accuracy, with the tolerance bound varying across different DDL model layers. We introduce PLOT, a fine-grained packet loss tolerance algorithm, which reduces communication overhead by leveraging the layer-specific loss tolerance of the DNN model. PLOT employs a UDP-based transmission mechanism for gradient transfer, addressing tail latency while maintaining training accuracy through packet loss tolerance. Our evaluations on both small-scale testbeds and large-scale simulations show that PLOT outperforms other congestion algorithms, effectively reducing tail latency and DDL training time.
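
The idea of a per-layer loss tolerance bound can be illustrated with a minimal sketch. This is not PLOT's algorithm, which the abstract does not specify; it only shows one plausible way a receiver might accept a layer's gradient when the fraction of lost chunks stays within that layer's bound. The function name `reassemble_layer`, the chunking scheme, the zero-filling of missing chunks, and the tolerance values are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def reassemble_layer(
    received: Dict[int, List[float]],
    num_chunks: int,
    chunk_size: int,
    tolerance: float,
) -> Tuple[Optional[List[float]], float]:
    """Rebuild one layer's gradient from the chunks that arrived.

    Hypothetical helper: `received` maps a chunk index to the values carried
    by that chunk. Returns (gradient, loss_rate); gradient is None when the
    loss rate exceeds the layer's tolerance bound.
    """
    if num_chunks <= 0 or chunk_size <= 0:
        raise ValueError("num_chunks and chunk_size must be positive")
    for index, values in received.items():
        if not 0 <= index < num_chunks or len(values) != chunk_size:
            raise ValueError(f"malformed chunk {index}")

    loss_rate = (num_chunks - len(received)) / num_chunks
    if loss_rate > tolerance:
        # Too much was lost for this layer; the caller would need another
        # strategy (for example, requesting the data again).
        return None, loss_rate

    # Assumption: missing chunks are filled with zeros, i.e. treated as
    # "no update" for those gradient entries.
    gradient = [0.0] * (num_chunks * chunk_size)
    for index, values in received.items():
        start = index * chunk_size
        gradient[start:start + chunk_size] = values
    return gradient, loss_rate


# Chunks 1 and 3 of 4 were lost in transit (50% loss).
arrived = {0: [1.0, 2.0], 2: [5.0, 6.0]}
tolerant_result = reassemble_layer(arrived, 4, 2, tolerance=0.5)
strict_result = reassemble_layer(arrived, 4, 2, tolerance=0.25)
```

With the hypothetical bound of 0.5 the layer is accepted with zeros in place of the lost chunks, while the stricter bound of 0.25 rejects the same data. Assigning different bounds to different layers is what makes such a scheme fine-grained.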
Published in: IEEE Transactions on Network and Service Management (Volume: 21, Issue: 6, December 2024)
Page(s): 6112 - 6125
Date of Publication: 16 September 2024
