A Generative Approach to Audio-Visual Generalized Zero-Shot Learning: Combining Contrastive and Discriminative Techniques


Abstract:

Audio-visual generalized zero-shot learning (AV-GZSL) for video classification is a task where the model learns to identify unseen video classes from multimodal audio-visual inputs. It combines two equally challenging problems: performing video classification from two different input modalities, and doing so in a zero-shot setting. The natural alignment between the audio and visual modalities is the key to addressing this relatively unexplored task. The predominant approach in AV-GZSL has been to learn better cross-modal attention between the two input domains and to leverage large language pretraining. However, a semantic gap persists between the embeddings of the different modalities, and bridging it requires a more diverse and less sparse representation of the joint embedding space. To overcome this, we propose an approach complementary to the existing research direction: we simulate unseen audio-visual features using a generative model and regularize it with a combination of contrastive and discriminative losses. To demonstrate the effectiveness of our approach, we benchmark our model on VGGSound-GZSL, ActivityNet-GZSL, and UCF-GZSL, report state-of-the-art performance, and qualitatively show that unseen classes cluster more tightly under our generative approach.
Date of Conference: 18-23 June 2023
Date Added to IEEE Xplore: 02 August 2023
Conference Location: Gold Coast, Australia

I. Introduction

Traditional deep learning frameworks learn discriminative features from a large training set and are evaluated on a test set drawn from the same classes. In the open world, however, humans constantly encounter concepts they have never seen before. In such circumstances, humans typically link what they already know to infer the unknown concept from various sources of input (e.g., text, audio, visual). For deep learning frameworks, the holy grail is to mimic this human ability to infer unknown concepts. The task of classifying these unseen concepts is termed zero-shot learning (ZSL) and is a very active area of research. In this paper, we address a challenging video classification task with audio-visual inputs in a generalized zero-shot learning (GZSL) setting. The problem we solve is shown in Figure 1.
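To make the method concrete, below is a minimal, hypothetical sketch (in PyTorch) of a conditional feature generator trained with a combined contrastive and discriminative objective, in the spirit of the approach outlined in the abstract. All module names, dimensions, and the specific InfoNCE-style contrastive formulation are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureGenerator(nn.Module):
    # Maps (noise, semantic class embedding) -> synthetic joint audio-visual feature.
    def __init__(self, noise_dim=64, class_dim=300, feat_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + class_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, feat_dim),
        )

    def forward(self, z, c):
        return self.net(torch.cat([z, c], dim=-1))

def contrastive_loss(gen_feat, real_feat, temperature=0.1):
    # InfoNCE-style loss: the i-th generated feature should match the
    # i-th real feature (same class) against all other features in the batch.
    gen = F.normalize(gen_feat, dim=-1)
    real = F.normalize(real_feat, dim=-1)
    logits = gen @ real.t() / temperature
    targets = torch.arange(gen.size(0), device=gen.device)
    return F.cross_entropy(logits, targets)

# One illustrative training step on seen classes (all tensors are dummy data).
feat_dim, num_seen = 512, 100
generator = FeatureGenerator(feat_dim=feat_dim)
classifier = nn.Linear(feat_dim, num_seen)  # discriminative head over seen classes
optimizer = torch.optim.Adam(
    list(generator.parameters()) + list(classifier.parameters()), lr=1e-4)

real_feat = torch.randn(32, feat_dim)   # fused audio-visual features from a backbone
class_emb = torch.randn(32, 300)        # semantic embeddings of the ground-truth classes
labels = torch.randint(0, num_seen, (32,))

z = torch.randn(32, 64)
fake_feat = generator(z, class_emb)
# Combined objective: align generated features with real ones (contrastive)
# while keeping them class-discriminative (cross-entropy).
loss = contrastive_loss(fake_feat, real_feat) + F.cross_entropy(classifier(fake_feat), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

At test time, such a generator could synthesize features for unseen classes from their semantic embeddings, reducing GZSL to supervised classification over both real seen-class and synthesized unseen-class features.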
