Abstract:
Traditional HPC systems are provisioned with static, fixed quantities of memory, storage, accelerators, and CPU resources to execute requested computation. This is not sufficient for today’s datacenters, which run modern dynamic workloads, and it results in workloads executing on systems that are not optimized for their needs. Workloads may require hardware resources, e.g., GPUs, that are present in the datacenter but not on the server on which the workload is executing. Conversely, compute resources on a given server may be underutilized because they are not required by the workload running on that server. Thus, datacenters often end up overprovisioning hardware resources in an attempt to let any workload run on any server. Composable Disaggregated Infrastructure (CDI) enables servers to be dynamically composed, as needed by a given workload, out of hardware resources that are physically disaggregated in the datacenter. Centralized resource management can potentially mitigate out-of-memory conditions, IO thrashing, and stranding of available resources such as CPUs, GPUs, and memory, and can provide dynamic network fail-over. Resource management through a standardized interface can enable clients to monitor, compose, and intelligently provision resources. The OpenFabrics Alliance, in collaboration with the DMTF, SNIA, and the CXL Consortium, is developing an OpenFabrics Management Framework (OFMF) and a hardware Composability Manager. The OFMF is an open-source Resource Manager designed for configuring fabric interconnects and managing composable disaggregated resources in dynamic HPC infrastructures using client-friendly abstractions. The goal of the OFMF is to enable interoperability through common interfaces so that client Managers can efficiently connect workloads with resources in a complex heterogeneous ecosystem, without having to worry about the underlying network technology.
Published in: 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Date of Conference: 15-19 May 2023
Date Added to IEEE Xplore: 04 August 2023