I. Introduction
Many datasets exhibit a long-tailed distribution, i.e., a large number of classes have few or even no instances [1], [2], [3], [4], [5], [6], [7], [8]. Insufficient data has therefore become a bottleneck limiting the universality of deep learning [9], [10], [11], [12], [13], [14], [15]. In contrast, humans can intuitively recognize a concept they have never encountered (e.g., canvas tree) once they understand its underlying primitives (e.g., canvas and tree). Inspired by this, recent works [16], [17], [18], [19], [20] propose a learning paradigm named Compositional Zero-Shot Learning (CZSL). CZSL models images as compositions of primitive state and object concepts [21], [22], [23], [24]. It aims to learn states and objects from seen images and to transfer this knowledge from seen to unseen compositions, thereby recognizing state-object compositions that never appear during training. For example, given images of canvas shoe and brown tree, a model can learn the primitives shoe and brown, and thus directly recognize the unseen composition brown shoe in images.