Abstract:
Communication overhead is a significant challenge in distributed deep learning (DDL) training, often hindering efficiency. While existing solutions such as gradient compression, compute/communication overlap, and layer-wise flow scheduling have been proposed, they are often coarse-grained and insufficient, especially under network congestion. These congestion-unaware methods can suffer from long flow completion times, known as tail latency, resulting in extended training time. In this paper, we argue that packet loss tolerance methods can mitigate the tail latency issue without sacrificing training accuracy, and that the tolerance bound varies across DDL model layers. We introduce PLOT, a fine-grained packet loss tolerance algorithm that optimizes communication overhead by leveraging the layer-specific loss tolerance of the DNN model. PLOT employs a UDP-based transmission mechanism for gradient transfer, addressing the tail latency issue while maintaining training accuracy through packet loss tolerance. Our evaluations on both small-scale testbeds and large-scale simulations show that PLOT outperforms other congestion control algorithms, effectively reducing tail latency and DDL training time.
Published in: IEEE Transactions on Network and Service Management (Volume: 21, Issue: 6, December 2024)
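The abstract does not specify PLOT's wire protocol or how its per-layer tolerance bounds are derived. The sketch below is only a minimal illustration of the general idea of layer-specific packet loss tolerance for gradient transfer: the per-layer bounds (LOSS_TOLERANCE), the datagram payload size, and the zero-fill recovery policy are all assumptions for illustration, and simulated drops stand in for real UDP loss rather than PLOT's actual transmission mechanism.

```python
# Illustrative sketch only: PLOT's packet format, tolerance bounds, and
# recovery policy are not given in the abstract; everything here is assumed.
import numpy as np

# Hypothetical per-layer loss-tolerance bounds: the fraction of gradient
# packets that may be dropped without triggering retransmission.
LOSS_TOLERANCE = {"conv1": 0.01, "conv2": 0.05, "fc": 0.10}

PAYLOAD_ELEMS = 256  # gradient values per UDP-sized datagram (assumed)

def packetize(grad: np.ndarray) -> list:
    """Split a flattened gradient tensor into fixed-size payloads."""
    flat = grad.ravel()
    return [flat[i:i + PAYLOAD_ELEMS] for i in range(0, flat.size, PAYLOAD_ELEMS)]

def transmit_lossy(packets, drop_rate, rng):
    """Simulate unreliable UDP transfer: each datagram is lost with drop_rate."""
    return [p if rng.random() > drop_rate else None for p in packets]

def reassemble(received, layer, shape):
    """Zero-fill lost packets if loss stays within the layer's tolerance bound;
    otherwise signal that a retransmission would be needed."""
    lost = sum(p is None for p in received) / len(received)
    if lost > LOSS_TOLERANCE[layer]:
        raise RuntimeError(f"{layer}: loss {lost:.2%} exceeds tolerance, retransmit")
    chunks = [p if p is not None else np.zeros(PAYLOAD_ELEMS) for p in received]
    return np.concatenate(chunks)[:int(np.prod(shape))].reshape(shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal((64, 128))          # one layer's gradient tensor
    pkts = packetize(grad)
    recv = transmit_lossy(pkts, drop_rate=0.03, rng=rng)
    approx = reassemble(recv, layer="fc", shape=grad.shape)
    print("relative error:", np.linalg.norm(approx - grad) / np.linalg.norm(grad))
```

Under this (assumed) scheme, a layer whose bound is exceeded would fall back to reliable retransmission, while losses within the bound are absorbed as small gradient perturbations, which is the intuition behind trading a bounded amount of packet loss for lower tail latency.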