I. Introduction
Comprehending a user’s ambiguous verbal instructions, such as “Take that one,” and carrying out specific tasks based on information obtained from the surroundings is one of the most significant challenges in the application of human support robots [1]. The word “that” in this instruction is a demonstrative: it refers to an object that is not specified in the verbal instruction itself. Understanding the referent of a demonstrative requires anaphora resolution, which is the identification of the object to which a pronoun or a demonstrative refers. Previous research on anaphora resolution [2], [3] in the domain of natural language processing has predominantly focused on linguistic information as context. There are two types of anaphora resolution: “endophora resolution” and “exophora resolution.” Endophora resolution determines the word or phrase within the text that corresponds to the target expression, considering the text both preceding and following that expression. In contrast, exophora resolution predicts the object in the on-site environment that corresponds to the target expression. Endophora resolution is applicable to problems where the answer or a clue is present in the text. However, comprehending verbal instructions that contain a demonstrative, such as “Get that one,” requires exophora resolution, which relies on non-verbal information [4]–[6].