Experimental Assessment of Reversibility-Aware Deep Reinforcement Learning for Optical Data Center Network Reconfiguration


Abstract:

The performance of communication-intensive distributed machine learning (DML) workloads and other emerging applications can suffer from a traffic-topology mismatch in traditional data center networks. Reconfiguring the logical network topology can alleviate this degradation. However, it remains an open question how to reconfigure the logical topology dynamically and steer bandwidth with a control plane that adapts to current data center traffic patterns without considerable overhead. This paper presents a reversibility-aware deep reinforcement learning (RA-DRL) algorithm for optical switch reconfiguration in data center networks and validates it on an experimental testbed. Using this testbed, we show that appropriate optical switch reconfiguration, whether driven by a baseline DRL method or by the RA-DRL method, can improve the training performance of DML workloads under network congestion. More importantly, by incorporating the concept of reversibility into the training of the DRL agent, we demonstrate a 5x decrease in training time for a distributed computer-vision application and an improvement in convergence time of up to 64%.
Date of Conference: 08-11 May 2023
Date Added to IEEE Xplore: 12 June 2023
Conference Location: Coimbra, Portugal
