Releasing the Power of In-Network Aggregation With Aggregator-Aware Routing Optimization


Abstract:

In-Network Aggregation (INA) offloads part of the aggregation computation from the logically centralized parameter servers to network devices such as programmable switches. It is a general, effective, and widely used approach to reducing network load and thereby alleviating the communication bottlenecks of large-scale distributed training. Because INA takes effect if and only if the associated traffic passes through the same in-network aggregator, the key to exploiting INA lies in routing control. However, existing proposals fall short here and are thus far from optimal: they select routes for INA-supported traffic without comprehensively considering the characteristics, limitations, and requirements of the network environment, the aggregator hardware, and the distributed training jobs. To fill this gap, in this paper we systematically establish a mathematical model that formulates i) the up-down routing constraints of Clos datacenter networks, ii) the limitations imposed by the pipeline hardware structure of modern programmable switches, and iii) the various aggregator-aware routing optimization goals required by distributed training tasks under different parallelism strategies. Based on the model, we develop ARO, an Aggregator-aware Routing Optimization solution for INA-accelerated distributed training applications. For efficiency, ARO incorporates a suite of search-space pruning designs that exploit the model's characteristics, improving the solving time by tens of times with negligible performance loss. Extensive experiments show that ARO finds near-optimal results for large-scale routing optimization in tens of seconds, achieving 1.8× to 4.0× higher throughput than the state-of-the-art solution.
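
The claim that INA helps only when the associated traffic shares an aggregator can be illustrated with a toy sketch. This is not the ARO model from the paper. It assumes a hypothetical two-tier leaf-spine fabric in which every worker's gradient stream crosses exactly one spine on its way to the parameter server, and it ignores the switch pipeline limits and link capacities that the paper's model covers. The names `flows_reaching_ps`, `worker_to_spine`, and `aggregators` are made up for illustration.

```python
from collections import Counter


def flows_reaching_ps(worker_to_spine, aggregators):
    """Count gradient streams that leave the spine layer toward the parameter server.

    worker_to_spine: dict mapping each worker to the spine its stream crosses (assumed input).
    aggregators: set of spines assumed to be capable of in-network aggregation.
    """
    per_spine = Counter(worker_to_spine.values())
    total = 0
    for spine, n_streams in per_spine.items():
        # An aggregator merges all streams it sees into one; a plain switch forwards each.
        total += 1 if spine in aggregators else n_streams
    return total


workers = ["w0", "w1", "w2", "w3"]
aggregators = {"s0"}  # only spine s0 can aggregate in this toy setup

# Streams spread over four spines, as load balancing might do.
spread = {"w0": "s0", "w1": "s1", "w2": "s2", "w3": "s3"}
# Streams steered through the single aggregator-capable spine.
steered = {w: "s0" for w in workers}

print(flows_reaching_ps(spread, aggregators))   # 4
print(flows_reaching_ps(steered, aggregators))  # 1
```

With the spread assignment the aggregator sees only one stream and has nothing to merge, so four streams still reach the parameter server. Steering all four through `s0` leaves a single aggregated stream, which is the effect that aggregator-aware routing aims for.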
Published in: IEEE/ACM Transactions on Networking (Volume: 32, Issue: 5, October 2024)
Page(s): 4488 - 4502
Date of Publication: 08 July 2024
