Enhancing Training Efficiency: A Novel Approach to Handling GPU Failures in Large-Scale Distributed System for LLM Training | IEEE Conference Publication | IEEE Xplore