1. INTRODUCTION
In our daily lives, we often rely on both visual and auditory cues to answer questions [1]–[8]. To endow machines with this human-like perception capability in question answering, the Audio-Visual Question Answering (AVQA) task has emerged. This task requires machines to comprehend a question and answer it by jointly exploiting the audio-visual information relevant to the question text. Owing to this property, AVQA is closely tied to practical real-world applications, including autonomous navigation [9] and interactive education [10].