Abstract:
The recent surge of Deep Learning (DL) models and applications can be attributed to the rise in computational resources, the availability of large-scale datasets, and accessible DL frameworks such as TensorFlow and PyTorch. Because these frameworks have been heavily optimized for NVIDIA GPUs, several performance characterization studies exist for GPU-based Deep Neural Network (DNN) training, but very few focus on CPU-based DNN training. In this paper, we provide an in-depth performance characterization of state-of-the-art DNNs such as ResNet(s) and Inception-v3/v4 on multiple CPU architectures, including Intel Xeon Broadwell, three variants of Intel Xeon Skylake, and AMD EPYC, as well as NVIDIA GPUs such as the K80, P100, and V100. We provide three key insights: 1) multi-process (MP) training should be used even on a single node, because the single-process (SP) approach cannot fully exploit all the cores; 2) the performance of both SP and MP depends on various factors such as the number of cores, the number of processes per node (ppn), and the DNN architecture; and 3) there is a non-linear and complex relationship between CPU/system characteristics (core count, ppn, hyper-threading, etc.) and DNN specifications such as the inherent parallelism between layers. We further provide a comparative analysis of CPU- and GPU-based training and a profiling analysis of Horovod. The fastest Skylake we had access to is up to 2.35× faster than a K80 GPU but up to 3.32× slower than a V100 GPU. For ResNet-152 training, we observed that MP is up to 1.47× faster than SP and achieves a 125× speedup on 128 Skylake nodes.
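For readers unfamiliar with the multi-process (MP), Horovod-based setup the abstract refers to, the sketch below illustrates what data-parallel training of a ResNet model on CPUs can look like. It is a minimal example assuming TensorFlow 2 with the horovod.tensorflow.keras API; the core count, thread settings, batch size, and learning-rate scaling are illustrative assumptions, not the authors' exact configuration.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Split the node's cores among the local processes (ppn); the core count below
# is an illustrative assumption, not a value taken from the paper.
cores_per_node = 28
ppn = hvd.local_size()
tf.config.threading.set_intra_op_parallelism_threads(max(1, cores_per_node // ppn))
tf.config.threading.set_inter_op_parallelism_threads(2)

# ResNet-50 on synthetic data, just to exercise one data-parallel training run.
model = tf.keras.applications.ResNet50(weights=None)
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())  # common LR scaling with world size
opt = hvd.DistributedOptimizer(opt)  # wraps the optimizer with gradient allreduce
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

images = tf.random.uniform((32, 224, 224, 3))
labels = tf.random.uniform((32,), maxval=1000, dtype=tf.int64)

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights across ranks
model.fit(images, labels, batch_size=32, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)

A script like this would typically be launched with one process per core group, e.g. horovodrun -np <ppn> python train_cpu.py (or an equivalent mpirun invocation), which is how the ppn parameter discussed above is varied.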
Date of Conference: 23-26 September 2019
Date Added to IEEE Xplore: 07 November 2019