Abstract:
To distill more information from training data, more parameters are introduced into machine learning models. As a result, communication becomes the bottleneck of Distributed Machine Learning (DML) systems. To alleviate the communication resource contention among DML jobs in machine learning clusters, which prolongs the time to train machine learning models, this paper proposes JointPS. JointPS first minimizes the completion time of a single training epoch for each DML job by jointly optimizing parameter server placement and flow scheduling, and predicts the number of remaining training epochs for each DML job using a dynamic model fitting method. From these two results, JointPS estimates the remaining time to complete each DML job. Based on this estimation, JointPS schedules DML jobs following the Minimum Remaining Time First (MRTF) principle to minimize the average job completion time. To the best of our knowledge, JointPS is the first work that minimizes the average completion time of network-intensive DML training jobs by jointly optimizing parameter server placement and flow scheduling without modifying the DML models or training procedures. Through both testbed experiments and extensive simulations, we demonstrate that JointPS reduces the average completion time of DML jobs by up to 88% compared with state-of-the-art approaches.
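The scheduling idea in the abstract can be illustrated with a minimal sketch: each job's remaining time is estimated as its (optimized) per-epoch time multiplied by its predicted number of remaining epochs, and jobs are then ordered by Minimum Remaining Time First. The job names, fields, and helper below are illustrative assumptions, not part of the JointPS implementation described in the paper.

```python
from dataclasses import dataclass

@dataclass
class DMLJob:
    name: str
    epoch_time: float      # per-epoch completion time after placement/flow optimization (assumed given)
    remaining_epochs: int  # predicted via dynamic model fitting (assumed given)

    @property
    def remaining_time(self) -> float:
        # Estimated remaining job completion time = per-epoch time * predicted remaining epochs
        return self.epoch_time * self.remaining_epochs

def mrtf_order(jobs: list[DMLJob]) -> list[DMLJob]:
    """Order jobs by the Minimum Remaining Time First principle."""
    return sorted(jobs, key=lambda job: job.remaining_time)

if __name__ == "__main__":
    # Hypothetical jobs for illustration only
    jobs = [
        DMLJob("job-a", epoch_time=120.0, remaining_epochs=30),
        DMLJob("job-b", epoch_time=45.0, remaining_epochs=20),
    ]
    for job in mrtf_order(jobs):
        print(job.name, job.remaining_time)
```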
Published in: IEEE Transactions on Computers (Volume: 72, Issue: 12, December 2023)