Abstract:
Due to the scale and complexity of Distributed Deep Learning (DDL) systems, AI researchers and operations engineers face an enormous challenge in analyzing, diagnosing, and locating performance bottlenecks during the training stage. Existing performance models and frameworks offer little insight into the performance degradation that a straggler induces. In this paper, we introduce MD-Roofline, a training performance analysis model that extends the traditional roofline model with a communication dimension. The model considers layer-wise attributes at the application level and a series of achievable peak performance metrics at the hardware level. With the assistance of MD-Roofline, AI researchers and DDL operations engineers can locate the system bottleneck along three dimensions: intra-GPU computation capacity, intra-GPU memory access bandwidth, and inter-GPU communication bandwidth. We demonstrate that our performance analysis model provides great insight for bottleneck analysis when training 12 classic CNNs.
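To illustrate the idea behind a communication-extended roofline, the sketch below bounds a layer's attainable throughput by three hardware ceilings (compute, memory bandwidth, communication bandwidth) and reports which one dominates. This is only a minimal illustration of the general technique, not the paper's actual formulation; the function name and all peak values are hypothetical.

```python
def roofline_attainable(flops, mem_bytes, comm_bytes,
                        peak_compute, peak_mem_bw, peak_comm_bw):
    """Return (attainable FLOP/s, bottleneck dimension) for one layer.

    flops        -- floating-point operations performed by the layer
    mem_bytes    -- bytes moved between GPU memory and compute units
    comm_bytes   -- bytes exchanged over the inter-GPU interconnect
    peak_*       -- achievable hardware peaks (hypothetical values)
    """
    # Lower bound on execution time imposed by each hardware dimension.
    t_compute = flops / peak_compute
    t_memory = mem_bytes / peak_mem_bw
    t_comm = comm_bytes / peak_comm_bw
    # The slowest dimension sets the overall time, hence the bottleneck.
    t_total, bottleneck = max(
        (t_compute, "computation"),
        (t_memory, "memory access"),
        (t_comm, "communication"),
    )
    return flops / t_total, bottleneck
```

For example, a layer doing 1 TFLOP of work but moving 10 GB over a 100 GB/s interconnect would be reported as communication-bound, since the interconnect's time bound exceeds the compute and memory bounds.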
Date of Conference: 30 June 2022 - 03 July 2022
Date Added to IEEE Xplore: 19 October 2022
- IEEE Keywords
- Index Terms
- Deep Learning,
- Training Performance,
- Performance Analysis Model,
- Distributed Deep Learning,
- Performance Metrics,
- Training Stage,
- Peak Performance,
- Communication Bandwidth,
- Dimensions Of Communication,
- Bottleneck Analysis,
- Throughput,
- Convolutional Layers,
- Feature Maps,
- Parallel Data,
- Forward Pass,
- Floating-point Operations,
- Hardware Resources,
- Communication Capacity,
- Backward Pass,
- Bottleneck Layer,
- Memory Bandwidth,
- DNN Model,
- Ring Topology,
- Synchronization Mechanism