Abstract:
Audio-visual generalized zero-shot learning (AV-GZSL) for video classification is a task where the model learns to identify unseen video classes from multimodal audio-visual inputs. It combines two challenging problems: classifying videos from two different input modalities, and doing so in a zero-shot setting. The natural alignment between the audio and visual modalities is the key to addressing this relatively unexplored task. The predominant approach in AV-GZSL has been to learn better cross-modal attention between the two input domains and to leverage large-scale language pretraining. However, a semantic gap remains between the embeddings of the different modalities, and closing it requires a more diverse and less sparse representation of the joint embedding space. To overcome this, we propose an approach complementary to the existing research direction: we simulate unseen audio-visual features using a generative model and regularize it with a combination of contrastive and discriminative losses. To demonstrate the effectiveness of our approach, we benchmark our model on VGGSound-GZSL, ActivityNet-GZSL, and UCF-GZSL and report state-of-the-art performance, and we qualitatively show that unseen classes cluster together better with our generative approach.
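To make the abstract's description concrete, the following is a minimal, hypothetical sketch of the kind of pipeline it outlines: a conditional generator synthesizes joint audio-visual features from class semantic embeddings, trained with a contrastive loss (aligning generated features to their class embeddings) plus a discriminative classification loss. All module names, dimensions, and loss weights are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch (assumed names/dims), not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureGenerator(nn.Module):
    """Maps a class semantic embedding plus noise to a joint audio-visual feature."""
    def __init__(self, sem_dim=512, noise_dim=64, feat_dim=512):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, 1024), nn.ReLU(),
            nn.Linear(1024, feat_dim),
        )

    def forward(self, sem_emb):
        z = torch.randn(sem_emb.size(0), self.noise_dim, device=sem_emb.device)
        return self.net(torch.cat([sem_emb, z], dim=-1))

def contrastive_loss(feats, sem_embs, temperature=0.1):
    """InfoNCE-style loss aligning generated features with their class embeddings."""
    feats = F.normalize(feats, dim=-1)
    sem_embs = F.normalize(sem_embs, dim=-1)
    logits = feats @ sem_embs.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(feats.size(0), device=feats.device)
    return F.cross_entropy(logits, targets)

# One illustrative training step: synthesize features for a batch of class
# embeddings, then combine the contrastive loss with a discriminative
# cross-entropy loss from a simple linear classifier head.
gen = FeatureGenerator()
classifier = nn.Linear(512, 100)  # 100 classes, assumed
opt = torch.optim.Adam(list(gen.parameters()) + list(classifier.parameters()), lr=1e-4)

sem_embs = torch.randn(32, 512)        # batch of class semantic embeddings (placeholder)
labels = torch.randint(0, 100, (32,))  # their class indices (placeholder)

fake_feats = gen(sem_embs)
loss = contrastive_loss(fake_feats, sem_embs) + F.cross_entropy(classifier(fake_feats), labels)
opt.zero_grad(); loss.backward(); opt.step()
```

In such a setup, the synthesized features for unseen classes would let the downstream classifier be trained on both seen and simulated-unseen data, which is the usual motivation for generative zero-shot approaches.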
Date of Conference: 18-23 June 2023
Date Added to IEEE Xplore: 02 August 2023