Abstract:
As a proactive failure tolerance mechanism in large-scale cloud storage systems, drive failure prediction protects data by issuing early warnings before drives actually fail, and thereby improves system dependability and cloud storage service quality. Solid state drives (SSDs) are now widely used in cloud storage systems because of their high performance, so SSD failures seriously affect both the dependability of the system and the quality of service. Existing proactive failure tolerance mechanisms for storage systems mostly target HDD failure detection and use classification (supervised learning), which relies on enough failure data to build a classification model. However, the low failure rate of SSDs leads to a severe imbalance between positive and negative samples, which makes it hard to build a classification-based proactive failure tolerance mechanism for SSD storage systems. In this paper, we propose a proactive failure tolerance mechanism for SSD storage systems based on unsupervised learning. It trains the failure prediction model using only data from normal SSDs, so our method is not limited by the imbalance in SSD data. The core idea is to use a VAE-LSTM to learn the pattern of normal SSDs, so that an alert is raised for an SSD whose pattern differs substantially from the normal ones. Our method can provide early warning of failures, thereby effectively protecting data and improving the quality of cloud storage service. We also propose a drive failure cause location mechanism, which helps operators analyze failure modes by providing guiding suggestions. To evaluate the effectiveness of our method, we use cross-validation and online testing on SSD data from a technology company. The results show that our method outperforms the baselines in FDR and FAR by 17.25% and 2.39% on average, respectively.
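The train-on-normal-only idea can be illustrated with a much simpler stand-in than the VAE-LSTM named in the abstract. The sketch below is not the authors' model: it swaps in a per-attribute z-score profile, learns it from healthy drives alone, and raises an alert when a drive's deviation exceeds a threshold taken from the healthy training scores. The attribute names, the synthetic data, the 0.99 quantile, and every function name are assumptions made for illustration. FDR and FAR are read here as failure detection rate and false alarm rate, their usual meaning in drive failure prediction work. The `most_deviant_attribute` helper is only a rough analogue of a failure cause location step.

```python
import random
import statistics


def fit_normal_profile(normal_rows):
    """Learn per-attribute mean and std from healthy drives only (assumed stand-in model)."""
    columns = list(zip(*normal_rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 when an attribute is constant, to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def attribute_deviations(row, profile):
    """Squared z-score of each attribute relative to the healthy profile."""
    means, stds = profile
    return [((x - m) / s) ** 2 for x, m, s in zip(row, means, stds)]


def anomaly_score(row, profile):
    """Mean squared z-score; larger means further from the healthy pattern."""
    return statistics.fmean(attribute_deviations(row, profile))


def pick_threshold(normal_rows, profile, quantile=0.99):
    """Choose the alert threshold as a quantile of scores on healthy training data."""
    scores = sorted(anomaly_score(r, profile) for r in normal_rows)
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[index]


def fdr_and_far(healthy_rows, failed_rows, profile, threshold):
    """FDR = fraction of failed drives alerted; FAR = fraction of healthy drives alerted."""
    detected = sum(anomaly_score(r, profile) > threshold for r in failed_rows)
    false_alarms = sum(anomaly_score(r, profile) > threshold for r in healthy_rows)
    return detected / len(failed_rows), false_alarms / len(healthy_rows)


def most_deviant_attribute(row, profile, names):
    """Name the attribute contributing most to the score (hypothetical cause hint)."""
    deviations = attribute_deviations(row, profile)
    return names[deviations.index(max(deviations))]


# Synthetic data with hypothetical attribute names; failing drives drift on attr_b.
rng = random.Random(0)
NAMES = ["attr_a", "attr_b", "attr_c"]


def healthy_sample():
    return [rng.gauss(50, 2), rng.gauss(10, 1), rng.gauss(0, 0.5)]


def failing_sample():
    return [rng.gauss(50, 2), rng.gauss(25, 1), rng.gauss(0, 0.5)]


train = [healthy_sample() for _ in range(500)]      # healthy drives only
healthy_test = [healthy_sample() for _ in range(200)]
failing_test = [failing_sample() for _ in range(20)]

profile = fit_normal_profile(train)
threshold = pick_threshold(train, profile)
fdr, far = fdr_and_far(healthy_test, failing_test, profile, threshold)
print(f"FDR={fdr:.2f} FAR={far:.3f}")
print("suspect attribute:", most_deviant_attribute(failing_test[0], profile, NAMES))
```

No failed drive is needed to fit the profile or set the threshold, which is why this family of methods sidesteps the class imbalance described above. A sequence model such as a VAE-LSTM would replace the z-score profile with a learned reconstruction of each drive's attribute time series, but the thresholding and evaluation steps would keep the same shape.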
Date of Conference: 25-28 June 2021
Date Added to IEEE Xplore: 26 August 2021
ISBN Information:
Print on Demand (PoD) ISSN: 1548-615X