Centralized Composable HPC Management with the OpenFabrics Management Framework

Abstract:

Traditional HPC systems are provisioned with static, fixed quantities of memory, storage, accelerators, and CPU resources to execute requested computation. This is not sufficient for today’s datacenters, which run modern dynamic workloads; as a result, workloads execute on systems that are not optimized for their needs. Workloads may require hardware resources, e.g., GPUs, that are present in the datacenter but not on the server on which the workload is executing. Conversely, compute resources on a given server may be underutilized because they are not required by the workload running on that server. Thus, datacenters often overprovision hardware resources in an attempt to enable any workload to run on any server. Composable Disaggregated Infrastructure (CDI) enables servers to be dynamically composed, as needed by a given workload, out of hardware resources that are physically disaggregated in the datacenter. Centralized resource management can potentially mitigate out-of-memory conditions, IO thrashing, and the stranding of available resources such as CPUs, GPUs, and memory, and can provide dynamic network fail-over. Resource management through a standardized interface can enable clients to monitor, compose, and intelligently provision resources. The OpenFabrics Alliance, in collaboration with the DMTF, SNIA, and the CXL Consortium, is developing an OpenFabrics Management Framework (OFMF) and a hardware Composability Manager. The OFMF is an open-source Resource Manager designed for configuring fabric interconnects and managing composable disaggregated resources in dynamic HPC infrastructures using client-friendly abstractions. The goal of the OFMF is to provide interoperability through common interfaces, so that client Managers can efficiently connect workloads with resources in a complex heterogeneous ecosystem without having to worry about the underlying network technology.
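
The composition idea described in the abstract can be illustrated with a toy model. The sketch below is not the OFMF interface and does not reflect any real Redfish, Swordfish, or CXL API; the class name `ResourcePool`, its methods, and the resource kinds are all hypothetical, chosen only to show how a central manager might bind pooled resources to a logical server and return them afterwards so they are not stranded.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ResourcePool:
    """Hypothetical datacenter-wide pool of disaggregated resources."""

    # Resource kind (e.g. "gpu") -> number of unallocated units.
    free: Dict[str, int]
    # Logical server name -> resources currently bound to it.
    composed: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def compose(self, name: str, request: Dict[str, int]) -> bool:
        """Bind the requested resources to a logical server.

        All-or-nothing: if any resource kind is short, nothing is allocated.
        """
        if name in self.composed:
            return False
        if any(self.free.get(kind, 0) < qty for kind, qty in request.items()):
            return False
        for kind, qty in request.items():
            self.free[kind] -= qty
        self.composed[name] = dict(request)
        return True

    def release(self, name: str) -> None:
        """Return a logical server's resources to the shared pool."""
        for kind, qty in self.composed.pop(name).items():
            self.free[kind] += qty


# Illustrative usage with made-up quantities.
pool = ResourcePool(free={"cpu": 64, "gpu": 4, "memory_gib": 1024})
pool.compose("job-a", {"cpu": 16, "gpu": 4, "memory_gib": 256})
pool.compose("job-b", {"cpu": 8, "gpu": 1})  # refused: no GPUs left
pool.release("job-a")                          # GPUs become available again
```

In this simplified picture, a statically provisioned cluster corresponds to resources that never return to `free`; composition lets the same GPUs serve `job-b` once `job-a` finishes.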
Date of Conference: 15-19 May 2023
Date Added to IEEE Xplore: 04 August 2023
Conference Location: St. Petersburg, FL, USA