Abstract:
In recent years, deep learning has been widely used across a variety of fields and applications. The constant growth of the data used to train complex models has opened up research in distributed learning. In this domain, two main architectures are used to train models in a distributed fashion: all-reduce and parameter server. Both support synchronous learning, while parameter server also supports asynchronous learning. These architectures have been adopted by tech companies, which have developed multiple systems for this purpose. Among the most popular and widely used distributed deep learning systems are Google TensorFlow, Facebook PyTorch, and Apache MXNet. In this paper, we quantify the performance gap between these systems and present a detailed analysis of the parameters that affect their execution time. Overall, in synchronous learning setups, TensorFlow is slower than PyTorch by 2.65X on average, while the latter lags MXNet by 1.38X on average. In asynchronous learning, MXNet is faster than TensorFlow by 3.22X on average.
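For context, the synchronous all-reduce pattern the abstract refers to can be sketched with PyTorch's DistributedDataParallel, where each worker holds a model replica and gradients are all-reduced during the backward pass. This is a minimal illustrative sketch on CPUs; the model, data, and hyperparameters are placeholders, not the paper's actual benchmark setup.

```python
# Minimal sketch of synchronous all-reduce training (PyTorch DDP, CPU).
# Illustrative only: the model, data, and settings are assumptions,
# not the configuration evaluated in the paper.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Rank and world size come from the launcher (e.g. torchrun) via env vars.
    dist.init_process_group(backend="gloo")  # "gloo" backend for CPU clusters
    rank = dist.get_rank()

    model = nn.Linear(10, 2)      # placeholder model
    ddp_model = DDP(model)        # wraps the replica; syncs gradients via all-reduce
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(10):
        inputs = torch.randn(32, 10)           # synthetic batch
        labels = torch.randint(0, 2, (32,))
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), labels)
        loss.backward()                        # gradients all-reduced here
        optimizer.step()                       # all replicas stay in lockstep
        if rank == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nproc_per_node=2 train.py`. The asynchronous mode the abstract mentions instead relies on the parameter-server architecture, where workers push gradients and pull parameters independently; that pattern is not covered by this sketch.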
Date of Conference: 15-18 December 2021
Date Added to IEEE Xplore: 13 January 2022
Index Terms:
- Deep Learning
- Image Classification
- CPU Cluster
- Distributed Deep Learning
- PyTorch
- TensorFlow
- Performance Gap
- Distributed Learning
- Deep Learning System
- Asynchronous Learning
- Parameter Server
- Neural Network
- Convolutional Neural Network
- Series Of Experiments
- Experimental Evaluation
- Network Size
- Programming Model
- Work Tasks
- Communication Protocol
- AlexNet
- Forward Pass
- Backward Pass
- Communication Cost
- Training Modalities
- Communication Time
- Asynchronous Mode
- Synchronous Mode
- Linear Algebra
- Virtual Machines
- Total Execution Time