Towards Scalable Resource Management for Supercomputers | IEEE Conference Publication | IEEE Xplore


Abstract:

Today's supercomputers offer massive computation resources to execute a large number of user jobs. Effectively managing such large-scale hardware parallelism and workloads is essential for supercomputers. However, existing HPC resource management (RM) systems fail to capitalize on this hardware parallelism because they follow a centralized design that dates back decades. They scale poorly and perform inefficiently on today's supercomputers, a problem that will worsen in exascale computing. We present ESlurm, a more scalable RM for supercomputers. Departing from existing HPC RMs, ESlurm implements a distributed communication structure. It employs a new communication tree strategy and uses job runtime estimation to improve the efficiency of communication and job scheduling. ESlurm has been deployed in production on a real supercomputer. We evaluate ESlurm on up to 20K nodes. Compared to state-of-the-art RM solutions, ESlurm exhibits better scalability, significantly reducing the resource usage of master nodes and improving data transfer and job scheduling efficiency by a large margin.
Date of Conference: 13-18 November 2022
Date Added to IEEE Xplore: 23 February 2023
Conference Location: Dallas, TX, USA
