I. Introduction
Non-contact facial expression recognition based on facial skin temperature has recently attracted attention as a means of assessing viewers' interest in digital content, driven by advances in sensing technology and the demand for contactless measurement during the COVID-19 pandemic. Such technology is valuable for measuring viewer engagement and evaluating content quality. Previous studies have shown that facial skin temperature, particularly around the nose and cheeks, changes during emotional arousal [1], and that thermal images are more effective than visible images for emotion estimation [2]. However, the integration of time-series data from thermal and visible images remains underexplored: existing studies tend to focus on static analyses or single-modality approaches, and therefore fail to capture dynamic emotional responses.