1 Introduction
The International Data Corporation predicts that by 2025, there will be 41.6 billion connected Internet of Things (IoT) devices [1]. Meanwhile, recently proposed vision transformer (ViT) models, when supported by large datasets, have outperformed the convolutional neural network models that dominated for many years across a wide range of vision tasks, such as image classification [2], [3], object detection [4], [5], and semantic segmentation [6], [7]. Consequently, deploying high-performance ViT models on ubiquitous IoT devices to provide high-quality vision services has attracted great attention from both industry and academia.