Revisiting Reliability in Large-Scale Machine Learning Research Clusters | IEEE Conference Publication | IEEE Xplore