Viewpoint-Adaptive Representation Disentanglement Network for Change Captioning


Abstract:

Change captioning aims to describe the fine-grained change between a pair of images. Pseudo changes caused by viewpoint changes are the most typical distractors in this task, because they perturb and shift the features of the same objects and thus overwhelm the real change representation. In this paper, we propose a viewpoint-adaptive representation disentanglement network to distinguish real changes from pseudo changes and to explicitly capture the features of change for generating accurate captions. Concretely, position-embedded representation learning is devised to help the model adapt to viewpoint changes by mining the intrinsic properties of the two image representations and modeling their position information. To learn a reliable change representation for decoding into a natural language sentence, unchanged representation disentanglement is designed to identify and disentangle the unchanged features between the two position-embedded representations. Extensive experiments show that the proposed method achieves state-of-the-art performance on four public datasets. The code is available at https://github.com/tuyunbin/VARD.
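
As a rough illustration of the idea of separating unchanged from changed features, the following toy sketch may help. It is not the architecture of the paper or of the released code: the function name `split_changed_unchanged`, the cosine-similarity soft matching, and the `temperature` parameter are all assumptions made for this example, and the position-embedding step is omitted. Each region feature of the "before" image is softly matched against all regions of the "after" image, so an object that merely moved (as under a viewpoint shift) is still explained by its counterpart, while a genuinely changed object leaves a large residual.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def split_changed_unchanged(before, after, temperature=0.05):
    """Toy split of `before` features into unchanged and changed parts.

    before, after: (N, D) arrays of region features (hypothetical inputs).
    Returns (unchanged, changed), each of shape (N, D).
    """
    # L2-normalise rows so that the dot product is a cosine similarity.
    b = before / np.linalg.norm(before, axis=1, keepdims=True)
    a = after / np.linalg.norm(after, axis=1, keepdims=True)
    similarity = b @ a.T                      # (N, N) cross-image similarity
    weights = softmax(similarity / temperature, axis=1)
    # Part of each `before` feature that the `after` image can account for.
    unchanged = weights @ after
    # Residual that no region of `after` explains: the candidate real change.
    changed = before - unchanged
    return unchanged, changed


# Four one-hot "objects" in a 6-dimensional feature space.
before = np.eye(4, 6)
# Simulate a viewpoint shift by cyclically moving every object to a new slot...
after = before[[1, 2, 3, 0]].copy()
# ...and a real change by replacing the object that was region 2 with a new one.
after[1] = np.eye(6)[5]

unchanged, changed = split_changed_unchanged(before, after)
norms = np.linalg.norm(changed, axis=1)
# Region 2 has the largest residual; the moved-only regions are close to zero.
print(int(np.argmax(norms)))
```

In this toy setting the shifted objects are matched to their new positions and cancel out, which is the intuition behind treating viewpoint-induced differences as pseudo changes. A learned model would replace the fixed cosine matching with trained components.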
Published in: IEEE Transactions on Image Processing (Volume: 32)
Page(s): 2620 - 2635
Date of Publication: 25 April 2023

PubMed ID: 37097800

I. Introduction

Change captioning aims to generate a natural language sentence that describes the difference between a pair of similar images. Compared to conventional change detection [1], [2], [3], change captioning must not only localize object changes accurately but also have the high-level linguistic ability to refer semantically to the object that has changed. Hence, this task not only provides a deeper understanding of changes in a scene but also has many practical applications, such as automatically generating reports about changes in monitored facilities and areas [4] and about pathological changes between medical images [5].
