I. Introduction
Imitation learning is a powerful paradigm in which robots learn to perform tasks by imitating an expert's behavior [1], [2]. Previous studies [3], [4] require experts to provide one or more demonstrations in the form of state-action pairs, from which the robot learns task-related policies. However, acquiring action information usually requires additional sensors [5], [6], action recognition modules [7], and human participation. In some cases, such as tutorial videos on YouTube [8], the robot cannot access expert actions at all. This reliance on expert actions increases the cost of deploying imitation learning methods on real robots. It would therefore be highly beneficial to devise imitation learning algorithms that do not need expert actions.
Zero-shot imitation learning aims to endow robots with the ability to imitate behavior from observation sequences alone. Here, zero-shot means that the robot never has access to expert actions, either during training or in the task demonstration at inference [9]. Zero-shot imitation is a meaningful setting for robotic applications because it enables users to teach robots to perform tasks easily, without additional technology or long interaction times. With these characteristics, zero-shot imitation is expected to ease the burden of cumbersome programming and specialized expertise that small and medium-sized enterprises currently face when deploying robots [10].