1. Introduction
Video-language pre-training (VidL) aims to learn generalizable multi-modal models from large-scale video-text samples so as to better solve various challenging video-language understanding tasks, such as text-video retrieval [1], [4], [38], [55] and video question answering [16], [47], [52]. Recent studies [9], [11], [23], [24], [49], [56], [58], [61] have shown that VidL leads to significant performance improvements and achieves state-of-the-art results on various downstream text-video retrieval and video question answering (VQA) benchmarks.