From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning | IEEE Journals & Magazine | IEEE Xplore

Abstract:

With the growing volume of remote sensing images, automatic understanding of image content has attracted considerable research interest in deep learning for remote sensing. Inspired by natural image captioning, models that use a convolutional neural network (CNN)-recurrent neural network (RNN) backbone supplemented by attention have been widely used in remote sensing image captioning. However, current attention layers are inefficient at simultaneously mining the hidden foreground from the background of a remote sensing image and performing interactive feature learning. Meanwhile, newer mainstream language models have recently surpassed the traditional long short-term memory (LSTM) in sentence generation. To address these problems, in this article we propose a novel approach that makes flat remote sensing images stereoscopic by separating the foreground from the background. Based on this hierarchical image information, we design a novel Deformable Transformer equipped with deformable scaled dot-product attention to learn multiscale features from the foreground and background through its powerful interactive learning ability. Evaluations are conducted on four classic remote sensing image captioning datasets. Compared with state-of-the-art methods, our Transformer variant achieves higher captioning accuracy.
Page(s): 7704 - 7717
Date of Publication: 16 August 2023
