Abstract:
To distill more information from training data, more parameters are introduced into machine learning models. As a result, communication becomes the bottleneck of Distributed Machine Learning (DML) systems. To alleviate the communication resource contention among DML jobs in machine learning clusters, which prolongs the time to train machine learning models, this paper proposes JointPS. JointPS first minimizes the completion time of a single training epoch for each DML job by jointly optimizing parameter server placement and flow scheduling, and predicts the number of remaining training epochs for each DML job using a dynamic model fitting method. From these two results, JointPS estimates the remaining time to complete each DML job. Based on this estimation, JointPS schedules DML jobs following the Minimum Remaining Time First (MRTF) principle to minimize the average job completion time. To the best of our knowledge, JointPS is the first work that minimizes the average completion time of network-intensive DML training jobs by jointly optimizing parameter server placement and flow scheduling without modifying the DML models or training procedures. Through both testbed experiments and extensive simulations, we demonstrate that JointPS reduces the average completion time of DML jobs by up to 88% compared with state-of-the-art approaches.
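The scheduling idea in the abstract can be illustrated with a minimal sketch: each job's remaining time is estimated as its (optimized) per-epoch time multiplied by its predicted number of remaining epochs, and jobs are then ordered by Minimum Remaining Time First. The job names, fields, and helper below are illustrative assumptions, not part of the JointPS implementation described in the paper.

```python
from dataclasses import dataclass

@dataclass
class DMLJob:
    name: str
    epoch_time: float      # per-epoch completion time after placement/flow optimization (assumed given)
    remaining_epochs: int  # predicted via dynamic model fitting (assumed given)

    @property
    def remaining_time(self) -> float:
        # Estimated remaining job completion time = per-epoch time * predicted remaining epochs
        return self.epoch_time * self.remaining_epochs

def mrtf_order(jobs: list[DMLJob]) -> list[DMLJob]:
    """Order jobs by the Minimum Remaining Time First principle."""
    return sorted(jobs, key=lambda job: job.remaining_time)

if __name__ == "__main__":
    # Hypothetical jobs for illustration only
    jobs = [
        DMLJob("job-a", epoch_time=120.0, remaining_epochs=30),
        DMLJob("job-b", epoch_time=45.0, remaining_epochs=20),
    ]
    for job in mrtf_order(jobs):
        print(job.name, job.remaining_time)
```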
Published in: IEEE Transactions on Computers (Volume: 72, Issue: 12, December 2023)