Full-Stack Optimizing Transformer Inference on ARM Many-Core CPU


Abstract:

The past several years have witnessed the tremendous success of transformer models in natural language processing (NLP), and their current landscape is increasingly diverse. Although the GPU has gradually become the dominant workhorse and de facto standard for deep learning, there are still many scenarios where the CPU remains a prevalent choice. Recently, ARM many-core processors have begun migrating into cloud computing and high-performance computing, making them a promising platform for deploying transformer inference. In this paper, we identify several performance bottlenecks of existing inference runtimes on many-core CPUs, including low core usage, isolated thread configuration, inappropriate implementation of general matrix multiplication (GEMM), and redundant computation for variable-length inputs. To tackle these problems, we conduct full-stack optimizations, from the service level down to the operator level. We explore multi-instance parallelization at the service level to improve CPU core usage. To improve the parallel efficiency of the inference runtime, we design NUMA-aware thread scheduling and a look-up table of optimal parallel configurations. The GEMM implementation is tailored for several critical modules to exploit the characteristics of the transformer workload. To eliminate redundant computation, we design and implement a novel storage format that packs sparse data, together with a load-balancing strategy for tasks with different sparsity. Experiments show that our implementation outperforms existing solutions by 1.1x to 6x with fixed-length inputs. For variable-length inputs, it achieves 1.9x to 8x speedups on different ARM many-core processors.
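To make the variable-length bottleneck concrete, the following sketch (a hypothetical illustration in NumPy, not the paper's actual storage format) contrasts the usual zero-padded batch layout with a packed layout that concatenates only the valid tokens, so the batched GEMM performs no work on padding rows:

```python
import numpy as np

# Hypothetical mini-batch of variable-length sequences (lengths are illustrative).
lengths = [3, 7, 2, 5]
hidden = 8
max_len = max(lengths)

# Padded layout: every sequence is zero-padded to max_len, so a GEMM over the
# batch processes max_len * batch rows even though many rows are padding.
padded_rows = max_len * len(lengths)

# Packed layout: concatenate only the valid tokens, so the GEMM sees exactly
# sum(lengths) rows and the padding work disappears.
packed_rows = sum(lengths)

x_packed = np.random.rand(packed_rows, hidden)
w = np.random.rand(hidden, hidden)
y = x_packed @ w  # one GEMM over valid tokens only

print(padded_rows, packed_rows)  # 28 vs 17 rows of GEMM work
```

In this toy batch, packing cuts the GEMM workload from 28 rows to 17; the savings grow with the spread of sequence lengths, which is the redundancy the paper's sparse storage format and load-balancing strategy target.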
Published in: IEEE Transactions on Parallel and Distributed Systems ( Volume: 34, Issue: 7, July 2023)
Page(s): 2221 - 2235
Date of Publication: 29 May 2023


I. Introduction

The tremendous success of transformer-based models, such as BERT [1] and GPT-2 [2], has been witnessed over the past several years. They achieve state-of-the-art performance and have become dominant in most NLP tasks. Their application scenarios and scale also continue to grow rapidly. Transformer-based models are mainly composed of a word embedding layer, self-attention layers, and feed-forward layers.
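The self-attention layer mentioned above dominates the GEMM workload that the rest of the paper optimizes. As a minimal sketch (single head, no masking, illustrative weight matrices), scaled dot-product self-attention can be written as:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a (seq, d) input."""
    q, k, v = x @ wq, x @ wk, x @ wv          # three GEMMs: query, key, value
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)             # (seq, seq) similarity matrix
    # Row-wise softmax, shifted by the row max for numerical stability.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                        # (seq, d) attended output

rng = np.random.default_rng(0)
seq, d = 4, 8
x = rng.standard_normal((seq, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # (4, 8)
```

Every line of this sketch that touches `@` is a matrix multiplication, which is why tailoring the GEMM implementation to the shapes produced by self-attention and the feed-forward layers pays off on many-core CPUs.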
