I. Introduction
Traditional deep learning frameworks learn discriminative features from a large training set and are evaluated on a test set containing the same set of classes. In the open world, however, humans constantly encounter concepts they have never seen before. Under these circumstances, humans typically link what they already know to infer the unknown concept from various sources of input (e.g., text, audio, visual). For deep learning frameworks, the holy grail is to mimic this human capacity to reason about unknown concepts. The task of classifying such unseen concepts is coined zero-shot learning (ZSL) and is a very active area of research. In this paper, we address a challenging video classification task with audio-visual inputs in the generalized zero-shot learning (GZSL) setting, where test samples may come from both seen and unseen classes. The problem we address is illustrated in Figure 1.