Progressive Semantic-Visual Alignment and Refinement for Vision-Language Tracking

Abstract:

In recent years, vision-language tracking has drawn increasing attention in the tracking field. The critical challenge of the task is to fuse the semantic representations derived from language with the visual representations derived from the visual input. To this end, several vision-language tracking methods fuse visual and semantic features through early or late fusion. However, these methods cannot take full advantage of the transformer architecture to extract useful cross-modal context at multiple levels. We therefore propose a new progressive joint vision-language transformer (PJVLT) that progressively aligns and refines visual embeddings with semantic embeddings for vision-language tracking. Specifically, to align visual signals with semantic signals, we insert a semantic-aware instance encoder layer (SAIEL) into each intermediate layer of the transformer encoder to perform progressive alignment of visual and semantic features. Furthermore, to highlight the multi-modal feature channels and patches corresponding to target objects, we propose a unified channel communication patch interaction layer (CCPIL), which is plugged into each intermediate layer of the transformer encoder to progressively activate target-aware channels and patches of the aligned multi-modal features for fine-grained tracking. Overall, by progressively aligning and refining visual features with semantic features in the transformer encoder, PJVLT can adaptively extract well-aligned vision-language context at coarse-to-fine levels, thereby highlighting target objects at various levels for more discriminative tracking. Experiments on several tracking datasets show that the proposed PJVLT achieves favorable performance in comparison with both conventional trackers and other vision-language trackers.
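
The abstract does not specify the internals of SAIEL or CCPIL, so the following is only a minimal NumPy sketch of the general pattern it describes: an alignment step followed by a channel-and-patch refinement step, repeated in every encoder layer. All function names, shapes, and formulas here are assumptions made for illustration. Cross-attention stands in for the alignment layer, sigmoid gating stands in for the refinement layer, there are no learned weights, and the usual self-attention and feed-forward sublayers are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def align(visual, text):
    """Hypothetical stand-in for an alignment layer (not the paper's SAIEL).

    Each visual patch attends to the language tokens, and the attended
    language context is added back to the patch residually.
    """
    d = visual.shape[-1]
    attn = softmax(visual @ text.T / np.sqrt(d), axis=-1)  # (patches, words)
    return visual + attn @ text                            # (patches, dim)


def refine(tokens, text):
    """Hypothetical stand-in for a refinement layer (not the paper's CCPIL).

    Channels are reweighted first, then patches, using a pooled sentence
    embedding as the reference for what counts as target-related.
    """
    query = text.mean(axis=0)                             # (dim,)
    channel_gate = sigmoid(tokens.mean(axis=0) * query)   # (dim,)
    tokens = tokens * channel_gate
    d = tokens.shape[-1]
    patch_gate = sigmoid(tokens @ query / np.sqrt(d))     # (patches,)
    return tokens * patch_gate[:, None]


def progressive_encoder(visual, text, num_layers=4):
    # Alignment and refinement are applied in every layer, rather than
    # once before the encoder (early fusion) or once after it (late fusion).
    for _ in range(num_layers):
        visual = align(visual, text)
        visual = refine(visual, text)
    return visual


rng = np.random.default_rng(0)
visual = rng.standard_normal((16, 8))  # 16 image patches, 8 channels (toy sizes)
text = rng.standard_normal((5, 8))     # 5 language tokens, 8 channels
out = progressive_encoder(visual, text)
print(out.shape)  # (16, 8)
```

The point of the sketch is the loop in `progressive_encoder`: the language signal is reinjected at every depth, so later layers operate on features that have already been aligned and gated by earlier ones.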

Published in: IEEE Transactions on Circuits and Systems for Video Technology (Volume: 35, Issue: 5, May 2025)

Page(s): 4271 - 4286

Date of Publication: 19 December 2024

DOI: 10.1109/TCSVT.2024.3520354
