
LAVViT: Latent Audio-Visual Vision Transformers for Speaker Verification

Abstract:

Recently, Vision Transformers (ViTs) have shown remarkable success in various computer vision applications. In this work, we explore the potential of ViTs, pre-trained on visual data, for audio-visual speaker verification. To cope with the challenges of large-scale training, we introduce the Latent Audio-Visual Vision Transformer (LAVViT) adapters, where we exploit existing models pre-trained on visual data without fine-tuning their parameters and train only the parameters of the LAVViT adapters. The LAVViT adapters are injected into every layer of the ViT architecture to effectively fuse the audio and visual modalities using a small set of latent tokens, forming an attention bottleneck and thereby reducing the quadratic computational cost of cross-attention across the modalities. The proposed approach has been evaluated on the VoxCeleb1 dataset and shows promising performance using only a few trainable parameters. Code is available at https://github.com/praveena2j/LAVViT
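
The sketch below illustrates the idea of a latent attention bottleneck adapter as described in the abstract: a small set of learnable latent tokens gathers information from the audio and visual token sequences, and each modality then attends back to the fused latents, keeping the cost linear in sequence length. The class name, latent count, and the exact routing of information are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

import torch
import torch.nn as nn

class LatentBottleneckAdapter(nn.Module):
    """Minimal sketch of a latent audio-visual adapter (assumed design).

    A few learnable latent tokens cross-attend to the concatenated audio and
    visual tokens (the attention bottleneck), then each modality queries the
    fused latents. Only the adapter parameters would be trained; the ViT
    backbone stays frozen.
    """

    def __init__(self, dim: int, num_latents: int = 4, num_heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.collect = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.distribute = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens: (B, Na, D), visual_tokens: (B, Nv, D)
        b = audio_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)        # (B, L, D)
        # Latents gather information from both modalities: cost is O(L * (Na + Nv)),
        # not O((Na + Nv)^2) as full cross-attention would be.
        joint = torch.cat([audio_tokens, visual_tokens], dim=1)      # (B, Na+Nv, D)
        fused, _ = self.collect(latents, joint, joint)               # (B, L, D)
        fused = self.norm(fused)
        # Each modality queries the fused latents; residual connections keep
        # the frozen backbone's representations intact.
        a_out, _ = self.distribute(audio_tokens, fused, fused)
        v_out, _ = self.distribute(visual_tokens, fused, fused)
        return audio_tokens + a_out, visual_tokens + v_out

In this sketch, one such adapter would be inserted alongside every ViT layer, so the only trainable parameters are the latent tokens, the two attention blocks, and the layer norm.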
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India

