Abstract:
Training deep learning models is time-consuming, so extensive research has been conducted on accelerating training through distributed processing. Data parallelism is one of the most widely used distributed training schemes, and various algorithms for data parallelism have been studied. However, because most studies assume a homogeneous computing environment, they do not consider clusters of graphics processing units (GPUs) with heterogeneous performance. Such heterogeneity leads to differences in computation time between GPU workers under synchronous data parallelism. Because of this per-iteration difference, the straggler problem, in which fast workers must wait for the slowest worker, slows down training. Therefore, in this paper, we propose a batch-orchestration algorithm (BOA) that reduces training time by improving hardware efficiency in a heterogeneous-performance GPU cluster. The proposed algorithm coordinates the local mini-batch sizes of all workers to reduce the per-iteration training time. We confirmed that the proposed algorithm improves performance by 23% over synchronous SGD with one backup worker when training ResNet-194 using 8 GPUs of three different types.
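To illustrate the core idea of coordinating local mini-batch sizes, the following is a minimal Python sketch. The abstract does not give BOA's exact formula, so this sketch simply assumes each worker's share of a fixed global batch is set proportional to its measured throughput; the function name, parameters, and example numbers are illustrative, not taken from the paper.

def orchestrate_batches(global_batch, throughputs):
    """Split `global_batch` across workers in proportion to `throughputs`
    (e.g., images/sec measured per GPU), so that each worker finishes a
    synchronous iteration in roughly the same wall-clock time."""
    total = sum(throughputs)
    sizes = [int(global_batch * t / total) for t in throughputs]
    # Assign any rounding remainder to the fastest worker.
    sizes[throughputs.index(max(throughputs))] += global_batch - sum(sizes)
    return sizes

# Hypothetical example: 8 GPUs of three different types (4 slow, 2 medium, 2 fast).
print(orchestrate_batches(256, [100, 100, 100, 100, 150, 150, 220, 220]))

Under such a proportional split, slower GPUs process fewer samples per iteration, which shortens the wait that fast workers would otherwise incur for the slowest worker.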
Date of Conference: 15-17 January 2018
Date Added to IEEE Xplore: 28 May 2018
Electronic ISSN: 2375-9356