A 28nm 343.5fps/W Vision Transformer Accelerator with Integer-Only Quantized Attention Block


Abstract:

Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision tasks. For mobile/edge devices, energy efficiency is the most important concern; however, ViT requires large amounts of computation and storage, which makes it difficult to deploy on such devices. In this work, we improve the efficiency of ViT inference at both the algorithm level and the hardware level. At the algorithm level, we propose an energy-efficient ViT model that adopts 4-bit quantization and low-rank approximation to convert all the non-linear functions with floating-point (FP) values in Multi-Head Attention (MHA) into linear functions with integer (INT) values, reducing the computation and storage overhead. The accuracy drop compared with the full-precision model is small (<1.5%). At the hardware level, we design an energy-efficient row-based pipelined ViT accelerator for on-device inference. The proposed accelerator consists of an integer-only quantizer, an integer MAC PE array for executing quantization and matrix operations, and an approximated linear block for executing the low-rank approximation. To the best of our knowledge, this is the first ViT accelerator that uses 4-bit quantization and includes a dedicated quantizer for integer-only on-device inference. This work achieves an energy efficiency of 343.5 fps/W, improving energy efficiency by up to 8x compared with state-of-the-art works.
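As a concrete illustration of the integer-only path described above, the following is a minimal sketch of symmetric 4-bit quantization followed by an integer matrix multiplication, written in NumPy. The function names, scale-selection rule, and rounding scheme are illustrative assumptions and do not reproduce the paper's quantizer or PE-array design.

```python
import numpy as np

def quantize_int4_symmetric(x: np.ndarray):
    """Symmetric 4-bit quantization: map FP values to integers in [-8, 7].
    Scale selection and rounding here are assumptions for illustration."""
    qmax = 7  # signed 4-bit range is [-8, 7]
    max_abs = np.max(np.abs(x))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)  # INT4 values stored in int8
    return q, scale

def int_matmul_dequant(qa, sa, qb, sb):
    """Integer matrix multiply with a single dequantization by the product of scales,
    mimicking integer-only accumulation in a MAC PE array."""
    acc = qa.astype(np.int32) @ qb.astype(np.int32)  # integer-only accumulation
    return acc.astype(np.float32) * (sa * sb)        # dequantize once at the output

# Usage: quantize a toy Q and K^T and compare the INT4 attention scores to FP.
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((4, 8)), rng.standard_normal((8, 4))
qQ, sQ = quantize_int4_symmetric(Q)
qK, sK = quantize_int4_symmetric(K)
approx = int_matmul_dequant(qQ, sQ, qK, sK)
print(np.max(np.abs(approx - Q @ K)))  # quantization error of the INT4 path
```

The point of the sketch is that once both operands are integers, the inner-product accumulation needs no floating-point hardware; a single per-tensor scale is applied at the output.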
Date of Conference: 22-25 April 2024
Date Added to IEEE Xplore: 19 July 2024
Conference Location: Abu Dhabi, United Arab Emirates

I. Introduction

Vision Transformers (ViTs) have been shown to be highly effective in various computer vision tasks, including image classification, segmentation, and object detection. ViT [1] was the first work to apply the encoder structure of the transformer model to image classification, achieving improved accuracy. However, as ViT models grow larger, they become more difficult to deploy on edge/mobile devices, where both accuracy and efficiency are crucial. There are several ways to achieve high-efficiency on-device inference, such as reducing computational cost, memory storage, and memory footprint. DeiT [2] and Swin [3] have improved ViTs to make them more efficient. Nevertheless, the massive computation and memory access requirements still make ViT models difficult to deploy in edge applications. Therefore, lightweight and efficient ViT models have become a recent trend.
