
DLM-VMTL: A Double Layer Mapper For Heterogeneous Data Video Multi-Task Prompt Learning



Abstract:

In recent years, the parameter counts of backbones for video understanding tasks have kept growing, even reaching the billion level. Whether fine-tuning a Video Foundation Model (VFM) on a specific task or pre-training a model designed for that task, the overhead is significant. How to let these models contribute beyond their own tasks therefore becomes a question worth studying. Multi-Task Learning (MTL) allows a visual task to acquire rich, shareable knowledge from other tasks through joint training. It has been thoroughly explored for image recognition, especially dense prediction tasks, but is rarely applied in the video domain because multi-label video data are scarce. In this paper, a heterogeneous-data video multi-task prompt learning (VMTL) method is proposed to address this problem. Unlike its counterpart in the image domain, a Double-Layer Mapper (DLM) is proposed to extract shareable knowledge into visual prompts and align it with the representation of the primary task, which is then fine-tuned. Extensive experiments show that DLM-VMTL outperforms the baselines on 6 different video understanding tasks and 11 datasets.
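The abstract describes the mechanism only at a high level. As a rough illustration of the idea (not the authors' implementation), a two-layer mapper that turns an auxiliary task's features into visual prompt tokens prepended to the primary task's token sequence might look like the following sketch; all module names, dimensions, and the number of prompts are assumptions made for exposition.

```python
# Hypothetical sketch of a double-layer mapper producing visual prompts.
# Names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class DoubleLayerMapper(nn.Module):
    """Maps an auxiliary task's features into prompt tokens for the primary task."""
    def __init__(self, aux_dim: int, prompt_dim: int, num_prompts: int = 8):
        super().__init__()
        # First layer: project auxiliary-task features into a shared space.
        self.shared_proj = nn.Linear(aux_dim, prompt_dim)
        # Second layer: map the shared representation onto prompt tokens.
        self.prompt_proj = nn.Linear(prompt_dim, prompt_dim)
        self.num_prompts = num_prompts

    def forward(self, aux_feats: torch.Tensor) -> torch.Tensor:
        # aux_feats: (batch, tokens, aux_dim) features from an auxiliary task's backbone
        shared = torch.relu(self.shared_proj(aux_feats))            # (B, T, prompt_dim)
        prompts = self.prompt_proj(shared)[:, : self.num_prompts]   # keep a few tokens as prompts
        return prompts

# Usage: prepend the generated prompts to the primary task's token sequence.
mapper = DoubleLayerMapper(aux_dim=768, prompt_dim=768)
aux_feats = torch.randn(2, 32, 768)        # features from an auxiliary video task
primary_tokens = torch.randn(2, 196, 768)  # primary-task patch tokens
prompted = torch.cat([mapper(aux_feats), primary_tokens], dim=1)
```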
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025

Conference Location: Hyderabad, India

1. INTRODUCTION

In recent years, the Transformer architecture has become dominant, driving the development of computer vision models [1], [2], [3], [4], [5]. Many variants of the Vision Transformer (ViT), such as ASformer [5], MTFormer [1], MulT [2], and STAR [4], serve as backbones for different video understanding tasks. However, as model parameters continue to expand, a problem arises: fully fine-tuning a video Transformer on a specific downstream task incurs significant overhead, so how can the excellent capabilities of these models be transferred to benefit the models of other tasks?
