Abstract:
Recently, Vision Transformers (ViTs) have shown remarkable success in various computer vision applications. In this work, we explore the potential of ViTs, pre-trained on visual data, for audio-visual speaker verification. To cope with the challenges of large-scale training, we introduce Latent Audio-Visual Vision Transformer (LAVViT) adapters: we exploit existing models pre-trained on visual data without fine-tuning their parameters and train only the parameters of the LAVViT adapters. The LAVViT adapters are injected into every layer of the ViT architecture to effectively fuse the audio and visual modalities through a small set of latent tokens that form an attention bottleneck, thereby reducing the quadratic computational cost of cross-attention across the modalities. The proposed approach is evaluated on the VoxCeleb1 dataset and shows promising performance using only a few trainable parameters. Code is available at https://github.com/praveena2j/LAVViT
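The sketch below illustrates the latent attention-bottleneck idea summarized in the abstract: a small set of learnable latent tokens first attends over the concatenated audio and visual tokens, and each modality then attends back to the fused latents, so no token-to-token cross-attention between modalities is computed. This is a minimal PyTorch illustration under our own assumptions, not the authors' implementation; all module and parameter names (e.g., LatentBottleneckAdapter, num_latents) are hypothetical.

```python
import torch
import torch.nn as nn

class LatentBottleneckAdapter(nn.Module):
    """Illustrative latent-bottleneck fusion adapter (not the released LAVViT code)."""

    def __init__(self, dim=768, num_latents=4, num_heads=8):
        super().__init__()
        # Small set of learnable latent tokens forming the attention bottleneck.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        # Latents attend to the concatenated audio-visual tokens: cost O(L * N) with L << N.
        self.collect = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Each modality attends back to the fused latents: cost O(N * L).
        self.distribute = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens: (B, Na, dim); visual_tokens: (B, Nv, dim)
        B = audio_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)
        av = torch.cat([audio_tokens, visual_tokens], dim=1)
        # Step 1: latents gather cross-modal information through the bottleneck.
        fused, _ = self.collect(latents, av, av)
        fused = self.norm(fused)
        # Step 2: each modality reads the fused latents back as a residual update.
        audio_out, _ = self.distribute(audio_tokens, fused, fused)
        visual_out, _ = self.distribute(visual_tokens, fused, fused)
        return audio_tokens + audio_out, visual_tokens + visual_out
```

In a frozen pre-trained ViT, such an adapter would be inserted at every layer and only its parameters trained, which keeps the number of trainable parameters small and avoids the quadratic cost of full audio-visual cross-attention.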
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025