Abstract:
Enterprise and high-performance computing (HPC) data centers must handle thousands of sensor metrics and their associated data. A top-end target for exascale machines is 10 million data points per second. The escalating volume and speed of data generation make such systems harder to manage, and outages are increasing. The Uptime Institute’s latest Outage Analysis report [1], published in June 2022, states that 30% of all outages in 2021 lasted more than 24 hours, a disturbing increase from 8% in 2017. Although equipment is idle during downtime, it often continues to consume power, especially for cooling, which wastes energy and raises operational costs. We propose an AIOps solution that uses advanced data analytics, machine learning, and deep learning methods to provide automated anomaly detection and predictive tools for data centers. These tools operate at scale and speed and improve data center resiliency and energy efficiency, thereby promoting data center sustainability.
Published in: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
Date of Conference: 17-22 November 2024
Date Added to IEEE Xplore: 08 January 2025