Abstract:
It remains challenging to train billion-scale DNN models on a single modern multi-GPU server due to the GPU memory wall. Unfortunately, existing memory-saving techniques such as GPU-CPU swap, recomputation, and the ZeRO series come at the price of extra computation, communication overhead, or limited memory reduction. We present MPress, a new single-server multi-GPU system that breaks the GPU memory wall of billion-scale model training while minimizing extra cost. MPress first analyzes the trade-offs of various memory-saving techniques and offers a holistic solution: it chooses inter-operator parallelism with low cross-GPU communication traffic and combines it with recomputation and swapping to balance training performance and the supported model size. Additionally, MPress employs a novel, fast D2D swap technique that simultaneously uses multiple high-bandwidth NVLinks to swap tensors to lightly loaded GPUs. This design builds on a key observation: inter-operator parallel training can leave GPU memory utilization imbalanced, so the spare memory on the least-used devices, together with the high-end interconnects among them, can support low-overhead swapping. Finally, we integrate MPress with PipeDream and DAPPLE, two representative inter-operator parallel training systems. Experimental results with two popular DNN models, BERT and GPT, on two modern GPU servers of the DGX-1 and DGX-2 generations, equipped with 8 V100 or A100 cards respectively, demonstrate that MPress significantly improves training throughput over the ZeRO series at the same memory reduction, while training larger models than the recomputation baseline.
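To make the D2D swap idea concrete, the following minimal sketch (not MPress's actual implementation, which is not shown in the abstract) illustrates offloading a tensor from a heavily loaded GPU to a lightly loaded peer GPU over a GPU-to-GPU link and fetching it back when needed. It uses PyTorch; the device indices and helper names are illustrative assumptions.

```python
# Hypothetical sketch of device-to-device (D2D) tensor swapping between GPUs.
# On NVLink-equipped servers, .to(peer_device) performs a direct GPU-to-GPU copy.
import torch


def d2d_swap_out(tensor: torch.Tensor, donor_device: torch.device) -> torch.Tensor:
    """Copy a tensor to a peer GPU with spare memory; the caller frees the local copy."""
    return tensor.to(donor_device, non_blocking=True)


def d2d_swap_in(remote: torch.Tensor, home_device: torch.device) -> torch.Tensor:
    """Copy a previously swapped-out tensor back to its home GPU."""
    return remote.to(home_device, non_blocking=True)


if __name__ == "__main__" and torch.cuda.device_count() >= 2:
    home, donor = torch.device("cuda:0"), torch.device("cuda:1")
    activation = torch.randn(1024, 1024, device=home)

    stashed = d2d_swap_out(activation, donor)  # move to the lightly loaded GPU
    del activation                             # release memory on the home GPU

    restored = d2d_swap_in(stashed, home)      # bring it back, e.g., for backward
```

In a real system the donor GPU would be chosen dynamically from memory-utilization statistics and the copies overlapped with computation on separate CUDA streams; this sketch only shows the basic swap-out/swap-in mechanism.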
Date of Conference: 25 February 2023 - 01 March 2023
Date Added to IEEE Xplore: 24 March 2023