Abstract:
As more distributed deep learning (DDL) jobs run in public clouds, scheduling them effectively becomes a major challenge. Current studies prioritize jobs with less remaining time, a policy known to be the best at reducing average job completion time (JCT). However, we observe that this approach breaks down once the preemption overhead of pausing and loading jobs is taken into account; for DDL jobs, this overhead can reach hundreds of seconds. The resulting schedules are so inefficient that, in some cases, a first-in-first-out policy performs much better. This paper proposes a new scheduling framework called Xion that takes preemption overheads into account and preempts DDL jobs only when doing so is beneficial. Our evaluation results demonstrate that Xion reduces the average JCT by 19% and improves the waiting time by 1.64×.
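The trade-off described above can be illustrated with a minimal two-job sketch. This is not Xion's algorithm, which the abstract does not specify; it is a toy model under stated assumptions: one running job, one newly arrived job, a single shared resource, and hypothetical parameters `pause_cost` and `load_cost` (in seconds) standing in for the preemption overheads. It preempts only if doing so lowers the summed completion time of the two jobs.

```python
def completion_times(running_left, arriving_left, pause_cost, load_cost, preempt):
    """Return (finish time of running job, finish time of arriving job).

    Times are in seconds, measured from the moment the new job arrives.
    Toy model: pause_cost and load_cost are hypothetical stand-ins for the
    preemption overheads; they are not taken from the paper.
    """
    if preempt:
        # Pause the running job, load the arriving job, run it to completion.
        arriving_done = pause_cost + load_cost + arriving_left
        # Then reload the paused job and run its remaining work.
        running_done = arriving_done + load_cost + running_left
    else:
        # Let the running job finish, then load and run the arriving job.
        running_done = running_left
        arriving_done = running_done + load_cost + arriving_left
    return running_done, arriving_done


def should_preempt(running_left, arriving_left, pause_cost, load_cost):
    """Preempt only if it lowers the summed completion time of both jobs."""
    with_preemption = sum(
        completion_times(running_left, arriving_left, pause_cost, load_cost, True)
    )
    without_preemption = sum(
        completion_times(running_left, arriving_left, pause_cost, load_cost, False)
    )
    return with_preemption < without_preemption


if __name__ == "__main__":
    # The arriving job is shorter (900 s vs 1000 s), so a pure
    # shortest-remaining-time policy would preempt, but 100 s pause and
    # 100 s load overheads make preemption a net loss in this toy model.
    print(should_preempt(1000, 900, pause_cost=100, load_cost=100))
    # With a much shorter arriving job, preemption pays off despite overheads.
    print(should_preempt(1000, 100, pause_cost=100, load_cost=100))
```

In this model, a shortest-remaining-time policy would preempt in both cases, while the overhead-aware check declines the first one, where the two jobs have similar remaining times and the switching cost dominates.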
Date of Conference: 02-08 July 2023
Date Added to IEEE Xplore: 25 September 2023