Hard Disk Drive Failure Analysis and Prediction: An Industry View


Abstract:

Storage media devices are fundamental to Meta’s hardware infrastructure, which supports a diverse family of applications such as Facebook, Instagram, and WhatsApp. Understanding the factors that impact the reliability of storage devices is important for setting application expectations on specifications such as throughput, latency, and read/write success rate. Improving hardware reliability helps us meet those expectations. In this paper, we examine the impact that age and workload have on the annualized failure rate (AFR) of Hard Disk Drives (HDDs), one of the most widely used types of storage devices for Meta’s applications. We analyze the correlation based on data collected from our production hardware fleet. In our datacenter environment, we observe that HDD AFR increases as either age or lifetime cumulative workload increases. We discuss the difference between the observed AFR curves and the projections that manufacturers make using statistical modeling. Additionally, we use XGBoost, a decision tree-based predictive machine learning (ML) model, to analyze the correlation between SMART (Self-Monitoring, Analysis, and Reporting Technology) metrics and the health of HDDs. Based on the trained ML model, we observe that age- and workload-related SMART parameters are most strongly correlated with the health of a drive. Moreover, we find that the difference in SMART metrics over a 30-day time window could improve the prediction performance for HDD failures.
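
The two quantities the abstract relies on, AFR and the 30-day difference of a SMART metric, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the per-day dictionary layout, and the sample readings are hypothetical. The AFR shown is the conventional definition (failures per drive-year of exposure), which may differ in detail from the one used in the paper.

```python
from datetime import date, timedelta
from typing import Dict, Optional


def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Return AFR as a fraction: failures divided by drive-years of exposure.

    Assumes the conventional definition; drive_days is the total number of
    days all drives in the population were in service.
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    return failures / (drive_days / 365.0)


def smart_delta(
    readings: Dict[date, float], day: date, window_days: int = 30
) -> Optional[float]:
    """Return the change in one SMART metric over the trailing window.

    readings maps a collection date to the metric value on that date
    (hypothetical layout). Returns None when either endpoint is missing,
    so the caller can decide how to handle gaps.
    """
    earlier = day - timedelta(days=window_days)
    if day not in readings or earlier not in readings:
        return None
    return readings[day] - readings[earlier]


# Hypothetical example: 5 failures across 365,000 drive-days of service.
example_afr = annualized_failure_rate(5, 365_000)

# Hypothetical example: a cumulative counter sampled on two dates 30 days apart.
example_readings = {date(2023, 1, 1): 10.0, date(2023, 1, 31): 14.0}
example_delta = smart_delta(example_readings, date(2023, 1, 31))
```

A delta computed this way could be appended to the raw SMART values as an extra feature column before training a classifier such as XGBoost; which SMART attributes to difference is left open here.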
Date of Conference: 27-30 June 2023
Date Added to IEEE Xplore: 10 August 2023
Conference Location: Porto, Portugal

I. Introduction

Meta deploys large-scale distributed storage services across datacenters. Storage applications are often categorized by the type and temperature of the data stored: hot, warm, or cold. At Meta, we have an exabyte-scale distributed file system known as Tectonic [1]. Tectonic's tenants include a warm Binary Large Object (BLOB) storage tier and a data warehouse tier. The warm BLOB tier is used for external media storage (photos, videos, documents) and internal application data (traces, heap dumps, logs) [2]. The data warehouse tier is designed to store data analytics for business intelligence, as well as objects such as massive map-reduce tables, snapshots of the social graph, and AI training data and models. Both tiers run on specialized storage servers populated with Hard Disk Drives (HDDs); these servers are known as Just-a-Bunch-of-Drives (JBODs) [3]. In industry, HDDs are widely used as either boot devices or data devices. The HDDs discussed in this paper are used as data devices. In our infrastructure, one compute module facilitates concurrent I/O across all HDDs in a JBOD.
