Abstract:
By offloading part of the aggregation computation from the logically centralized parameter servers to network devices such as programmable switches, In-Network Aggregation (INA) is a general, effective, and widely used approach to reducing network load, thereby alleviating the communication bottlenecks suffered by large-scale distributed training. Because INA takes effect if and only if the associated traffic passes through the same in-network aggregator, the key to exploiting INA lies in routing control. However, existing proposals fall short here and are thus far from optimal, since they select routes for INA-supported traffic without comprehensively considering the characteristics, limitations, and requirements of the network environment, aggregator hardware, and distributed training jobs. To fill this gap, we systematically establish a mathematical model that formulates i) the up-down routing constraints of Clos datacenter networks, ii) the limitations imposed by the pipeline hardware structure of modern programmable switches, and iii) the various aggregator-aware routing optimization goals required by distributed training tasks under different parallelism strategies. Based on the model, we develop ARO, an Aggregator-aware Routing Optimization solution for INA-accelerated distributed training applications. For efficiency, ARO employs a suite of search space pruning designs that exploit the model's characteristics, reducing solving time by tens of times with negligible performance loss. Extensive experiments show that ARO finds near-optimal results for large-scale routing optimization in tens of seconds, achieving 1.8× to 4.0× higher throughput than the state-of-the-art solution.
Published in: IEEE/ACM Transactions on Networking (Volume: 32, Issue: 5, October 2024)
- IEEE Keywords
  - Training
  - Routing
  - Pipelines
  - Optimization
  - Hardware
  - Throughput
  - Synchronization
- Index Terms
  - Pathfinding
  - In-network Aggregation
  - Performance Loss
  - Job Training
  - Tens Of Seconds
  - Parameter Server
  - Number Of Workers
  - Multiple Tasks
  - Heuristic Algorithm
  - Mixed Integer Linear Programming
  - Network Throughput
  - Route Planning
  - Traffic Load
  - Training Scenarios
  - Multiple Flow
  - Multicast
  - Routing Scheme
  - Quantization Levels
  - Total Throughput
  - Access Link
  - Multiple Aggregates
  - Worker Nodes
  - Heuristic Design
  - Updated Model Parameters
  - AA Group
  - Distributed Machine Learning
  - Modern Networks
  - OpenFlow
  - Model Grid
  - Reasonable Time