1. INTRODUCTION
In recent years, the Transformer architecture has dominated the development of Computer Vision models [1], [2], [3], [4], [5]. Many Vision Transformer (ViT) variants, such as ASformer [5], MTFormer [1], MulT [2], and STAR [4], serve as backbones for different video understanding tasks. However, as model parameters continue to grow, a problem arises: fully fine-tuning a Video Transformer on a specific downstream task incurs significant overhead. How, then, can the excellent capabilities of these models be transferred to benefit the models of other tasks?