Abstract:
Disk and memory faults are the leading causes of server breakdown. A proactive solution is to predict such hardware failures at runtime, then isolate the at-risk hardware and back up the data. However, current model-based predictors are incapable of using discrete time-series data, such as the values of device attributes, which convey high-level information about device behavior. In this paper, we propose a novel deep-learning-based scheme for system-level hardware failure prediction. We normalize the distribution of samples' attributes from different vendors to make use of diverse training sets. We propose a temporal Convolutional Neural Network based model that is insensitive to noise in the time dimension. Finally, we design a loss function to train the model effectively with extremely imbalanced samples. Experimental results from an open S.M.A.R.T. data set and an industrial data set show the effectiveness of the proposed scheme.
Published in: 2019 56th ACM/IEEE Design Automation Conference (DAC)
Date of Conference: 02-06 June 2019
Date Added to IEEE Xplore: 22 August 2019
ISBN Information:
Print on Demand (PoD) ISSN: 0738-100X
Conference Location: Las Vegas, NV, USA