DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale


Abstract:

The landscape of transformer model inference is increasingly diverse in model size, model characteristics, latency and throughput requirements, hardware requirements, etc. With such diversity, designing a versatile inference system is challenging. DeepSpeed-Inference addresses these challenges with (1) a multi-GPU inference solution that minimizes latency while maximizing throughput for both dense and sparse transformers when the model fits in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU/NVMe/GPU memory to enable high-throughput inference for models larger than aggregate GPU memory. DeepSpeed-Inference reduces latency by 6.4× and increases throughput by 1.5× over the state of the art. It enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can run inference on models 25× larger than GPU-only solutions allow, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
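As a rough illustration of the two usage modes the abstract describes, the sketch below shows (a) the multi-GPU fast path via `deepspeed.init_inference`, which shards a model across GPUs and injects DeepSpeed's fused transformer kernels, and (b) a ZeRO stage-3 offload configuration for the heterogeneous path. This is a minimal sketch, not code from the paper: the model name, GPU count, and NVMe path are placeholders, and the argument names follow the DeepSpeed API of this era (`mp_size`, `replace_with_kernel_inject`), which may differ in later releases.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# (a) Multi-GPU inference: shard the model across GPUs and replace
# transformer layers with DeepSpeed's fused inference kernels.
model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

engine = deepspeed.init_inference(
    model,
    mp_size=2,                        # tensor-parallel degree (number of GPUs)
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # swap in DeepSpeed's optimized kernels
)

inputs = tokenizer("DeepSpeed-Inference enables", return_tensors="pt").to("cuda")
print(tokenizer.decode(engine.module.generate(**inputs, max_new_tokens=32)[0]))

# (b) Heterogeneous inference: a ZeRO stage-3 config that offloads
# parameters to NVMe (or CPU), letting models larger than aggregate GPU
# memory run; it would be passed to deepspeed.initialize(...).
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",            # or "cpu"
            "nvme_path": "/local_nvme",  # placeholder path
        },
    },
    "train_micro_batch_size_per_gpu": 1,  # required field, unused at inference
}
```

Launched with `deepspeed --num_gpus 2 script.py`, part (a) splits each transformer layer's weights across the two GPUs to cut per-token latency; part (b) trades latency for capacity, which matches the paper's throughput-oriented framing of heterogeneous inference.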
Date of Conference: 13-18 November 2022
Date Added to IEEE Xplore: 23 February 2023
Conference Location: Dallas, TX, USA

