I. Introduction
Due to hardware limitations, images captured by a single type of sensor can characterize only partial information about a scene. For instance, the reflected-light information captured by visible sensors can describe scene textures, but it is susceptible to illumination conditions and shading. Complementarily, the thermal radiation information captured by infrared sensors is insensitive to illumination and can reflect the essential attributes of scenes and objects. Multi-modal image fusion aims to synthesize a single image that integrates the complementary information from different types of sensors. As shown in Fig. 1, the fused image exhibits a more complete scene representation and better visual quality, which benefits various downstream tasks such as semantic segmentation [1], object detection and tracking [2], and scene understanding [3]. Therefore, image fusion has a wide variety of applications, ranging from security to industrial and civilian fields [4], [5].