I. Introduction
Vision Transformers (ViTs) have proven highly effective across a range of computer vision tasks, including image classification, segmentation, and object detection. ViT [1] was the first work to apply the transformer encoder to image classification, achieving strong accuracy. However, as ViT models grow larger, they become increasingly difficult to deploy on edge and mobile devices, where both accuracy and efficiency are crucial. High-efficiency on-device inference can be pursued along several dimensions, such as reducing computational cost, model size, and memory footprint. DeiT [2] and Swin [3] have made ViTs more efficient, but their heavy computation and memory-access requirements still hinder deployment in edge applications. Lightweight and efficient ViT models have therefore become a recent research trend.