Large-Scale AI Infra Reliability: Challenges, Strategies, and Llama 3 Training Experience | IEEE Conference Publication | IEEE Xplore