JointPS: Joint Parameter Server Placement and Flow Scheduling for Machine Learning Clusters


Abstract:

To distill more information from training data, machine learning models are given ever more parameters. As a result, communication becomes the bottleneck of Distributed Machine Learning (DML) systems. In machine learning clusters, contention for communication resources among DML jobs prolongs model training. This paper proposes JointPS to alleviate that contention. JointPS first minimizes the completion time of a single training epoch for each DML job by jointly optimizing parameter server placement and flow scheduling, and predicts the number of remaining training epochs for each job using a dynamic model fitting method. With both in hand, JointPS estimates the remaining time to complete each DML job and schedules the jobs following the Minimum Remaining Time First (MRTF) principle to minimize the average job completion time. To the best of our knowledge, JointPS is the first work that minimizes the average completion time of network-intensive DML training jobs by jointly optimizing parameter server placement and flow scheduling without modifying the DML models or training procedures. Through both testbed experiments and extensive simulations, we demonstrate that JointPS can reduce the average completion time of DML jobs by up to 88% compared with state-of-the-art technology.
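
The MRTF step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `Job` class, its field names, and the sample numbers are hypothetical, the remaining time is assumed to be the per-epoch time multiplied by the predicted number of remaining epochs, and jobs are assumed to run one after another, which a real cluster would not require.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical DML job record; the fields are illustrative assumptions."""
    name: str
    epoch_time_s: float    # assumed: optimized completion time of one epoch
    remaining_epochs: int  # assumed: predicted number of epochs still to run

    @property
    def remaining_time_s(self) -> float:
        # Assumption: remaining time = per-epoch time * remaining epochs.
        return self.epoch_time_s * self.remaining_epochs


def mrtf_order(jobs):
    """Order jobs by Minimum Remaining Time First."""
    return sorted(jobs, key=lambda job: job.remaining_time_s)


def average_completion_time(jobs):
    """Average completion time when jobs run serially in the given order."""
    elapsed = 0.0
    total = 0.0
    for job in jobs:
        elapsed += job.remaining_time_s
        total += elapsed
    return total / len(jobs)


# Made-up example values, used only to exercise the functions.
jobs = [Job("a", 30.0, 10), Job("b", 5.0, 8), Job("c", 12.0, 5)]
ordered = mrtf_order(jobs)
```

Under this simplified serial model, running the shortest remaining job first lowers the average completion time relative to the arrival order, which is the intuition behind scheduling by estimated remaining time.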
Published in: IEEE Transactions on Computers (Volume: 72, Issue: 12, December 2023)
Page(s): 3503 - 3518
Date of Publication: 28 August 2023
