Abstract:
The fusion classification of hyperspectral image (HSI) and light detection and ranging (LiDAR) data has gained widespread attention because of its ability to obtain more comprehensive spatial and spectral information. However, the heterogeneous gap between HSI and LiDAR data also adversely affects classification performance. Despite the excellent performance of traditional multimodal fusion classification models, they largely overlook language information, whose rich linguistic prior knowledge can enrich visual representations. Therefore, we design a spectral–spatial–language fusion network (S2LFNet), which fuses visual and language features to broaden the semantic space using linguistic prior knowledge commonly shared between spectral and spatial features. First, we propose a dual-channel cascaded image fusion encoder (DCIFencoder) for visual feature extraction and progressive multilevel feature fusion of HSI and LiDAR data. Then, text data covering three aspects are designed, from which linguistic prior knowledge is extracted with a text encoder. Finally, contrastive learning is utilized to construct a unified semantic space, and spectral–spatial–language fusion features are obtained for classification. We evaluate the classification performance of the proposed S2LFNet on three datasets through extensive experiments, and the results show that it outperforms state-of-the-art fusion classification methods.
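For readers unfamiliar with this style of vision–language alignment, the following is a minimal PyTorch sketch of the contrastive step outlined above: fused HSI–LiDAR visual embeddings and text embeddings are projected into a shared semantic space and aligned with a symmetric InfoNCE loss. The module names, feature dimensions, and the simple concatenation-based fusion are illustrative assumptions, not the S2LFNet implementation.

```python
# Minimal sketch (not the authors' code) of contrastive visual-language alignment:
# fused HSI + LiDAR features and text features are projected into one space and
# pulled together with a CLIP-style symmetric InfoNCE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualLanguageAligner(nn.Module):
    def __init__(self, hsi_dim=144, lidar_dim=1, text_dim=512, embed_dim=256):
        super().__init__()
        # Stand-in for the dual-channel image fusion encoder: two branches
        # followed by concatenation and a projection head (assumed layout).
        self.hsi_branch = nn.Sequential(nn.Linear(hsi_dim, 256), nn.ReLU())
        self.lidar_branch = nn.Sequential(nn.Linear(lidar_dim, 64), nn.ReLU())
        self.visual_proj = nn.Linear(256 + 64, embed_dim)
        # Stand-in projection for the text-encoder output.
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # exp() ~ 1/0.07

    def forward(self, hsi, lidar, text_feat):
        v = self.visual_proj(torch.cat([self.hsi_branch(hsi),
                                        self.lidar_branch(lidar)], dim=-1))
        t = self.text_proj(text_feat)
        return F.normalize(v, dim=-1), F.normalize(t, dim=-1)

def contrastive_loss(v, t, logit_scale):
    # Symmetric InfoNCE: matching visual/text pairs lie on the diagonal.
    logits = logit_scale.exp() * v @ t.t()
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

if __name__ == "__main__":
    model = VisualLanguageAligner()
    hsi = torch.randn(8, 144)    # e.g., per-pixel spectral vectors
    lidar = torch.randn(8, 1)    # e.g., per-pixel elevation values
    text = torch.randn(8, 512)   # placeholder text-encoder outputs
    v, t = model(hsi, lidar, text)
    print(contrastive_loss(v, t, model.logit_scale))
```

In this sketch the fused visual embedding plays the role of the spectral–spatial representation, and minimizing the loss drives it toward the corresponding class-text embedding, which is the mechanism the abstract refers to as constructing a unified semantic space.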
Published in: IEEE Transactions on Geoscience and Remote Sensing (Volume: 62)