Understanding Permanent Hardware Failures in Deep Learning Training Accelerator Systems | IEEE Conference Publication | IEEE Xplore