
DLM-VMTL: A Double Layer Mapper For Heterogeneous Data Video Multi-Task Prompt Learning



Abstract:

In recent years, the parameter counts of backbones for video understanding tasks have kept growing, even reaching the billion level. Whether fine-tuning a Video Foundation Model (VFM) on a specific task or pre-training a model designed for that task, the overhead is significant. How to let these models contribute beyond their own tasks therefore becomes a question worth studying. Multi-Task Learning (MTL) allows a visual task to acquire rich, shareable knowledge from other tasks through joint training. It has been thoroughly explored for image recognition, especially dense prediction tasks, but is rarely applied in the video domain because multi-label video data are scarce. In this paper, a heterogeneous-data video multi-task prompt learning (VMTL) method is proposed to address this problem. Unlike its counterpart in the image domain, a Double-Layer Mapper (DLM) is proposed to extract shareable knowledge into visual prompts and align it with the representation of the primary task, which is then fine-tuned. Extensive experiments show that DLM-VMTL outperforms the baselines on 6 different video understanding tasks and 11 datasets.
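The abstract describes the mechanism only at a high level. As a rough illustration of the idea (not the authors' implementation), a two-layer mapper that turns an auxiliary task's features into visual prompt tokens prepended to the primary task's token sequence might look like the following sketch; all module names, dimensions, and the number of prompts are assumptions made for exposition.

```python
# Hypothetical sketch of a double-layer mapper producing visual prompts.
# Names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class DoubleLayerMapper(nn.Module):
    """Maps an auxiliary task's features into prompt tokens for the primary task."""
    def __init__(self, aux_dim: int, prompt_dim: int, num_prompts: int = 8):
        super().__init__()
        # First layer: project auxiliary-task features into a shared space.
        self.shared_proj = nn.Linear(aux_dim, prompt_dim)
        # Second layer: map the shared representation onto prompt tokens.
        self.prompt_proj = nn.Linear(prompt_dim, prompt_dim)
        self.num_prompts = num_prompts

    def forward(self, aux_feats: torch.Tensor) -> torch.Tensor:
        # aux_feats: (batch, tokens, aux_dim) features from an auxiliary task's backbone
        shared = torch.relu(self.shared_proj(aux_feats))            # (B, T, prompt_dim)
        prompts = self.prompt_proj(shared)[:, : self.num_prompts]   # keep a few tokens as prompts
        return prompts

# Usage: prepend the generated prompts to the primary task's token sequence.
mapper = DoubleLayerMapper(aux_dim=768, prompt_dim=768)
aux_feats = torch.randn(2, 32, 768)        # features from an auxiliary video task
primary_tokens = torch.randn(2, 196, 768)  # primary-task patch tokens
prompted = torch.cat([mapper(aux_feats), primary_tokens], dim=1)
```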
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025

Conference Location: Hyderabad, India

1. INTRODUCTION

In recent years, the Transformer architecture has become dominant, driving the development of computer vision models [1], [2], [3], [4], [5]. Many variants of the Vision Transformer (ViT), such as ASformer [5], MTFormer [1], MulT [2], and STAR [4], serve as backbones for different video understanding tasks. However, as model parameters continue to expand, a problem arises: fully fine-tuning a video Transformer on a specific downstream task incurs significant overhead, so how can the excellent capabilities of these models be transferred to benefit the models of other tasks?
