Attention-Based Cross-Modality Feature Complementation for Multispectral Pedestrian Detection

Multispectral pedestrian detection based on deep learning can provide robust and accurate detection under different illumination conditions, which is of great significance for safety-related applications. In order to reduce the log-average miss rate under different illumination conditions, a new one-stage detector suitable for multispectral pedestrian detection is proposed. First, to realize complementarity between the information flows of the two modalities in the feature extraction stage and reduce object loss, a low-cost cross-modality feature complementary module (CFCM) is proposed. Second, to suppress background noise in different environments and enhance the semantic and location information of the object, thereby reducing false detections, an attention-based feature enhancement fusion module (AFEFM) is proposed. Third, through the feature complementarity of color-thermal image pairs and the multi-scale fusion of deep feature layers, horizontal and vertical multi-dimensional data mining of the parallel deep neural network is realized, which provides effective data support for the object detection algorithm. Finally, through a reasonable arrangement of the proposed modules, a robust multispectral detection framework is constructed. Experimental results on the Korea Advanced Institute of Science and Technology (KAIST) pedestrian benchmark show that the proposed method has the lowest log-average miss rate compared with other state-of-the-art multispectral pedestrian detectors and achieves a good balance between speed and accuracy.


I. INTRODUCTION
Pedestrian detection is a key task in various computer vision applications, such as autonomous driving [1] and video monitoring [2]. In recent years, deep learning has brought great improvements to pedestrian detection. However, most detectors are based on RGB images, and their performance often degrades significantly under insufficient lighting and severe weather. It is thus difficult to detect pedestrians accurately by relying on visible-light images alone. As a promising alternative for overcoming the limited accuracy of single-spectrum approaches, multispectral pedestrian detection has been actively studied and has attracted increasing attention, especially in the field of safety. The different modal information collected by color and thermal sensors provides complementary visual cues for pedestrian detection: RGB images provide rich color, texture, and other detailed information, while thermal images filter out some environmental interference and provide clearer contour information under insufficient lighting [3]. Therefore, multispectral pedestrian detection based on color and thermal images can provide robust detection under different lighting conditions.

Integrating color and thermal images is a key task in multispectral pedestrian detection. Most feature fusion methods adopt a simple addition of the two modal features [4] or dimension reduction after feature map concatenation [5]. These methods consider neither the complementarity between the two modal features nor the importance of each part of the feature map within a single modality [6]. Further, many pedestrian detection methods use a two-stage object detector to ensure detection accuracy. However, such detectors sacrifice inference speed, which is also crucial in practical applications, such as autonomous driving.
To address the above problems, this paper proposes a new one-stage detector suitable for multispectral pedestrian detection, as shown in Fig. 1. First, we adopt CSPDarknet53 as the backbone network to extract multi-scale features under the RGB and thermal modalities, respectively. Note that we chose a one-stage object detection network to ensure real-time detection. We also propose the CFCM module to realize complementarity between the information flows of the two modalities in the feature extraction stage and reduce object loss. Next, the AFEFM module fuses the features of different layers extracted from the two modalities to enhance the feature information of the detected object. Then, we perform multi-scale fusion on the output features of the AFEFM module to obtain the input of the detection head. Finally, the detection head predicts objects based on these input features.
The main contributions of this paper are as follows:
• First, we propose the CFCM module to realize cross-modality complementation of features in the feature extraction stages. It promotes information interaction between the two modalities, so that an object lost in one modality can be partially recovered through the complementary information provided by the other modality, reducing object loss and transmitting more information.
• Second, we propose the AFEFM module to fuse the same level of depth information in the parallel network of visible and thermal modalities, so that the visual saliency of the two modal images can be mapped to the depth features. This module can suppress interference information and enhance object features.
• Third, through the feature complementarity of color and thermal images and the multi-scale fusion of deep feature layers, we realize horizontal and vertical multi-dimensional data mining in the parallel network, fully enrich the deep semantic information of the object, and provide effective data support for the object detection algorithm.
• Finally, a new cross-modal network based on YOLOv5 is proposed through the reasonable arrangement of proposed modules, which is named YOLO_CMN.
The rest of this paper is structured as follows. Section II introduces the related work. Section III describes the proposed method in detail. The experimental results and analysis are given in Section IV. Section V concludes this paper and puts forward the future work.

II. RELATED WORKS
Pedestrian detection has been actively studied due to its significance in many fields. In particular, multispectral pedestrian detection has attracted much attention due to its robustness under varying illumination. Traditional multispectral pedestrian detection [8][9][10][11] has the limitations of relying on hand-crafted design and low detection accuracy. In recent years, inspired by the rapid advance of deep learning in other computer vision tasks, multispectral pedestrian detection networks have also achieved great improvements. Since single-spectral object detectors are not directly applicable to multispectral sensors, a naive approach for multispectral pedestrian detection separately trained and fused different spectral images. In order to improve detection accuracy, four common frameworks [5] were proposed: early fusion, halfway fusion, late fusion, and score fusion. These fusion frameworks were based on Faster R-CNN [13], a common two-stage object detection algorithm. Many other frameworks were also built on Faster R-CNN, such as ConvNets, SAF R-CNN [14], and IAF R-CNN [15]. Although two-stage detection achieves better detection results, its detection speed is often too low for many practical applications. As an alternative for real-time pedestrian detection, one-stage detection networks were studied. In [16], an effective one-stage object detection network, YOLO_TLV, was proposed to achieve real-time detection with little accuracy reduction. In order to balance the accuracy and speed of multispectral pedestrian detection, [17] proposed the MSFFN fusion network based on YOLOv3. In addition to YOLO, SSD is also a typical one-stage object detection network. GFD-SSD [18] realized a balance between detection accuracy and speed.
In addition, MSDS-RCNN [12] was proposed, which can be learned by jointly optimizing the pedestrian detection and semantic segmentation tasks.
The feature fusion method is key to achieving good detection performance in multispectral pedestrian detection. The MIN fusion method [5] reduces the dimensions of multimodal features with a 1 × 1 convolution layer after concatenation. The SUM fusion method [4] adopts element-wise sum, which can be regarded as linear feature fusion with equal weights. Experiments show that the contributions of different modal features to the detection results vary under different lighting conditions; thus, linear feature fusion is not suitable for all cases. Considering this factor, [6,15,20] used illumination-aware fusion methods, and [18] introduced gated fusion units in the middle layers of SSD for feature fusion and pedestrian detection. In recent years, the attention mechanism has received extensive attention due to its promising performance in many networks. In [21,22], attention-based feature fusion modules were adopted to improve multispectral pedestrian detection.
Previous works showed that color and thermal image information are complementary and that deep learning-based networks can integrate multispectral features to improve pedestrian detection performance. Although existing multispectral pedestrian detection methods have achieved high performance, there is still a large gap between detector performance and human vision. In this paper, a new network architecture is proposed to reduce the log-average miss rate and improve detection speed.

III. PROPOSED METHOD
A. CROSS-MODALITY FUSION FRAMEWORK
In order to meet the accuracy and speed requirements of multispectral pedestrian detection, we propose a cross-modality fusion framework based on YOLOv5. YOLOv5 is a one-stage object detection network that consists of four parts: Input, Backbone, Neck, and Output. The backbone network extracts features and comprises three modules: the Focus module, the CSP module, and the CBL module. The neck network performs multi-scale feature fusion, which enables the network to detect objects at different scales and thus improves detection accuracy. Four variants of YOLOv5 are available according to the depth and width of the network structure. Owing to the integration of many effective algorithms, YOLOv5 provides high detection accuracy and speed despite being a one-stage detection network.
The proposed network adopts YOLOv5s as the base network to achieve fast and accurate multispectral pedestrian detection. RGB images and thermal images are fed into two sub-networks as the input. Shallow layers generate simple geometric features, while deep layers generate rich semantic information, so feature fusion at different locations has different effects. We first introduce the commonly used multispectral feature fusion frameworks: input fusion, early fusion, and halfway fusion. Then, we construct a cross-modal network based on YOLOv5s. As depicted in Fig. 2, the proposed fusion framework uses both the AFEFM module and the CFCM module to obtain a better detection effect.
All modules are integrated into the network and trained end-to-end with the loss function

$$L = L_{local} + L_{conf} + L_{cls},$$

where $L_{local}$ represents the localization loss, $L_{conf}$ the confidence loss, and $L_{cls}$ the classification loss.

Input fusion is the data fusion of the RGB image and the thermal image before feature extraction. Using the AFEFM module, the color and thermal images are fused before being fed into the feature extraction network. Its structural form is the simplest, and the specific structure is shown in Fig. 2(a).
Early fusion integrates the RGB and thermal information via low-level feature fusion. The RGB image and thermal image are respectively fed into two sub-networks for feature extraction. The RGB feature map $F_R$ (128 channels) is extracted after the CSP1_1 module, and the thermal feature map $F_T$ (128 channels) is extracted after the corresponding CSP1_1 module. The AFEFM module fuses these two features and generates a fused feature map whose channel size is also 128. The fused feature map then passes through the remainder of the network. The specific network structure is shown in Fig. 2(b).
Halfway fusion deploys the AFEFM modules behind CSP1_2, CSP1_3, and CSP2_1, respectively. The fused features are sent to the Neck network for multi-scale feature fusion. The specific network structure is shown in Fig. 2(c). Since this fusion architecture showed good performance in the experiment, it was selected as the benchmark.
YOLO_CMN refers to the multispectral pedestrian detection architecture proposed in this paper. YOLO_CMN consists of the halfway fusion structure with CFCM modules added after the CBL modules. The network structure is depicted in Fig. 2(d). The CFCM module makes the two modalities complement each other in the feature extraction stage so that the network can learn more information and reduce object loss. The AFEFM module enhances the important features of the two modalities while suppressing less important features before fusion. The fused features are then fed into the Neck network for multi-scale feature fusion to obtain the final prediction scores. The two modules are described in the following sections.
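The overall arrangement can be sketched schematically as below. This is only an illustration of where the two modules sit in the parallel streams: the backbone stages and modules are placeholder callables, and for brevity every stage is treated as a fusion level, whereas the paper fuses only behind CSP1_2, CSP1_3, and CSP2_1.

```python
def yolo_cmn_forward(rgb, thermal, backbone_stages, cfcm, afefm):
    """Schematic two-stream forward pass.

    backbone_stages: list of (rgb_stage_fn, thermal_stage_fn) pairs
                     standing in for the CSP/CBL blocks.
    cfcm(f_r, f_t) -> (f_r', f_t')  cross-modality complementation
    afefm(f_r, f_t) -> fused        attention-based fusion for the Neck
    Returns the list of fused multi-scale features sent to the Neck.
    """
    fused_levels = []
    f_r, f_t = rgb, thermal
    for rgb_stage, thermal_stage in backbone_stages:
        # Each stream extracts features independently ...
        f_r, f_t = rgb_stage(f_r), thermal_stage(f_t)
        # ... then CFCM lets the streams complement each other ...
        f_r, f_t = cfcm(f_r, f_t)
        # ... and AFEFM fuses this level for multi-scale fusion.
        fused_levels.append(afefm(f_r, f_t))
    return fused_levels
```

With stub stages the sketch simply shows that one fused feature map is produced per level for the Neck.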

B. CROSS-MODAL FEATURE COMPLEMENTARY MODULE
RGB images have detailed information and contour information under good illumination conditions. On the other hand, thermal images are less disturbed by illumination conditions and have clear pedestrian contour information. In order to enable the detection network to realize all-day pedestrian detection, the CFCM module makes the features of the two modalities interact, so that an object lost in one modality can be partially recovered through the complementary information provided by the other modality, reducing object loss and transmitting more information. The CFCM module injects the difference features of one modality into the other during feature extraction, so that each modality obtains complementary information from its counterpart. In this way, the two modalities can learn more complementary features. The CFCM module operates as follows. First, channel-wise differential weighting is used to obtain the difference features of the two modal feature maps. Second, the difference features of each modality are amplified and fused with the features of the other modality. Finally, in order to make the network focus on important features, a channel attention operation is performed on the feature maps of the two modalities, and the features are fused.
Let $F_R$ denote the RGB convolutional feature map and $F_T$ the thermal convolutional feature map. The difference feature $F_D$ is obtained by channel-wise differential weighting of the two feature maps:

$$F_D = \sigma(\mathrm{GAP}(F_R - F_T)) \otimes (F_R - F_T),$$

where GAP refers to global average pooling, $\sigma$ is the sigmoid activation function, and $\otimes$ represents element-wise multiplication. The amplified difference feature is fused with the features of the other modality, channel attention is applied to each enhanced map, and the results are fused:

$$F_{out} = \mathcal{F}\left(\hat{F}_R \,\|\, \hat{F}_T\right) \oplus \left(F_R \oplus F_T\right),$$

where $\|$ denotes the channel-wise concatenation operation, $\oplus$ represents element-wise sum, and $\mathcal{F}(\cdot)$ is the residual function; $\hat{F}_T$ and $\hat{F}_R$ denote $F_T$ and $F_R$ after feature enhancement. After the above operations, the information of the two modalities is complementary and fused, and the resulting feature map, containing more information, is sent to the network for further feature extraction.
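The first two CFCM steps can be illustrated numerically. The sketch below operates on plain [C][H][W] nested lists rather than tensors, and it omits the channel-attention and residual-fusion parts of the real module; it only shows the idea of weighting the modal difference by GAP + sigmoid and injecting it into the opposite stream.

```python
import math

def gap(fmap):
    """Global average pooling: [C][H][W] -> one mean per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in fmap]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cfcm_complement(f_r, f_t):
    """Toy sketch of CFCM's difference weighting and injection."""
    C, H, W = len(f_r), len(f_r[0]), len(f_r[0][0])
    # Step 1: channel-wise difference of the two modal feature maps.
    diff = [[[f_r[c][i][j] - f_t[c][i][j] for j in range(W)]
             for i in range(H)] for c in range(C)]
    # Step 2: channel weights from GAP followed by a sigmoid.
    w = [sigmoid(v) for v in gap(diff)]
    # Step 3: inject the weighted difference into the *other* modality:
    # RGB receives what thermal had extra (-diff), and vice versa.
    f_r_out = [[[f_r[c][i][j] - w[c] * diff[c][i][j] for j in range(W)]
                for i in range(H)] for c in range(C)]
    f_t_out = [[[f_t[c][i][j] + w[c] * diff[c][i][j] for j in range(W)]
                for i in range(H)] for c in range(C)]
    return f_r_out, f_t_out
```

On a 1-channel example where the RGB map is all ones and the thermal map all zeros, the thermal stream is pulled toward the RGB values and vice versa, which is the complementation effect the text describes.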

C. ATTENTION-BASED FEATURE ENHANCEMENT FUSION MODULE
The two modal features extracted by the two sub-networks are fused with the AFEFM module, where important features are enhanced and noise interference is suppressed based on the attention mechanism.
Global average pooling is often used to encode global spatial information in channel attention; it compresses features from three dimensions to one, losing spatial information. We therefore decompose global average pooling along the horizontal and vertical coordinate directions so that spatial information is retained in the channel attention module. Further, max pooling gathers another important clue about the features of different objects to infer finer channel attention. Thus, both average pooling and max pooling are employed in the proposed network.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
$z^h$ and $z^w$ are the feature maps generated by pooling along the horizontal and vertical directions, respectively. The two directional descriptors are concatenated and transformed:

$$f = \delta\left(\mathrm{Conv}\left(\left[z^h, z^w\right]\right)\right),$$

where $[\cdot\,,\cdot]$ represents the join operation along the spatial dimension and $\delta$ is a non-linear activation function.
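The directional decomposition of global average pooling can be sketched as follows. This toy version only computes the two pooled descriptors on a [C][H][W] nested list; the max-pooling branch and the learned transforms of the actual module are omitted.

```python
def directional_pool(fmap):
    """Decompose global average pooling along the two spatial axes.

    For a [C][H][W] feature map, return per-channel descriptors pooled
    along the horizontal direction (one value per row) and along the
    vertical direction (one value per column), instead of a single
    scalar per channel.
    """
    z_h, z_w = [], []
    for ch in fmap:
        H, W = len(ch), len(ch[0])
        # Average over the width: one value per row (height position).
        z_h.append([sum(row) / W for row in ch])
        # Average over the height: one value per column (width position).
        z_w.append([sum(ch[i][j] for i in range(H)) / H
                    for j in range(W)])
    return z_h, z_w
```

Unlike a single global average, these descriptors still tell the attention module *where* along each axis the activation mass sits.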

IV. EXPERIMENTAL RESULTS
This section describes the dataset, evaluation indicators, and implementation details of the experiment. Then, the performance of the proposed method is evaluated in comparison with other state-of-the-art methods. Lastly, the ablation studies are given on the two different modules and model architectures.

A. DATASET
The evaluation was conducted on the commonly used multispectral pedestrian dataset, the KAIST dataset. The KAIST dataset contains 95,328 color-thermal image pairs captured with a visible-light camera and an infrared thermal imaging camera, including 50,172 pairs in the training set and 45,156 pairs in the test set. Following the sampling principle in [7,12], every two frames were sampled from the training videos, and heavily occluded and small person instances (< 50 pixels) were removed. The resulting training set contains 7,601 color-thermal image pairs. The test set was sampled every 20 frames, resulting in 2,252 color-thermal image pairs. The annotations of the KAIST dataset have been improved for both the training and test sets. The dataset contains objects under different light conditions, at different scales, and with different degrees of occlusion, which are difficult to detect.
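The sampling and filtering protocol above can be sketched as a small preprocessing routine. The annotation field names (`height`, `occluded`) are hypothetical, chosen only for illustration; real KAIST annotations use their own format.

```python
def build_training_set(frames, stride=2, min_height=50):
    """Sample every `stride`-th frame and drop heavily occluded or
    small person instances, mirroring the KAIST preprocessing above.

    `frames` is a list of (image_pair, annotations) tuples, where each
    annotation is a dict with hypothetical 'height' and 'occluded' keys.
    """
    kept = []
    for idx, (pair, anns) in enumerate(frames):
        if idx % stride != 0:      # keep every `stride`-th frame only
            continue
        anns = [a for a in anns    # drop small / heavily occluded boxes
                if a["height"] >= min_height and not a["occluded"]]
        kept.append((pair, anns))
    return kept
```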

B. EVALUATION METRICS
The log-average miss rate (MR), the evaluation standard proposed in [28], is widely used in pedestrian detection. Specifically, a generated detection bounding box is matched to a ground-truth box when their intersection over union (IoU) exceeds 0.5, and the miss rate is averaged over nine false-positives-per-image (FPPI) reference points evenly spaced in log space over the range [10^-2, 10^0].
Accordingly, the smaller the MR of an algorithm, the better its detection performance.
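A sketch of the usual MR computation, assuming the standard protocol of averaging the miss rate in log space over nine FPPI reference points; the lookup of the operating point per reference is simplified here to the largest FPPI not exceeding it.

```python
import math

def log_average_miss_rate(fppi, miss_rate, n_points=9):
    """Log-average miss rate over reference FPPI values in [1e-2, 1e0].

    `fppi` and `miss_rate` are parallel lists describing the detector's
    operating curve, sorted in ascending order of FPPI.
    """
    # Nine reference points evenly spaced in log space.
    refs = [10 ** (-2 + 2 * i / (n_points - 1)) for i in range(n_points)]
    mrs = []
    for ref in refs:
        # Miss rate at the operating point just below this FPPI;
        # if the curve never reaches it, the miss rate is 1.0.
        candidates = [mr for f, mr in zip(fppi, miss_rate) if f <= ref]
        mrs.append(candidates[-1] if candidates else 1.0)
    # Average in log space (geometric mean), clamped away from zero.
    return math.exp(sum(math.log(max(mr, 1e-10)) for mr in mrs) / len(mrs))
```

For a flat curve with a constant miss rate the log-average reduces to that constant, which is a convenient sanity check.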

C. IMPLEMENTATION DETAILS
Our method uses the same environment configuration as YOLOv5. Before training the model on the KAIST dataset, K-means clustering is applied to obtain anchors. The anchor boxes obtained after clustering are (18,42), (23,55), (34,84), (48,114), (59,141), (84,205), (110,259), (142,367), and (207,498). The proposed model was trained using a stochastic gradient descent optimizer. The initial learning rate was 0.001; when the training loss no longer decreased and the validation recall no longer improved, the learning rate was reduced by a factor of ten. After reducing the learning rate twice, training was stopped. All models were trained on a GeForce RTX 2080Ti GPU with a batch size of 8. We kept all hyperparameters the same as the original settings of the YOLOv5 model. For a fair comparison, the MR is used to evaluate the performance of the different models.
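The anchor clustering step can be sketched with the YOLO-style 1 − IoU distance between (width, height) pairs. This is an illustrative implementation under that assumption, not the authors' exact code; real pipelines typically also average over restarts and refine the result.

```python
import random

def iou_wh(a, b):
    """IoU of two boxes given only (w, h), anchored at a shared corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs using 1 - IoU as the distance."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        # Assign each box to the center with the highest IoU.
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: iou_wh(box, centers[i]))
            clusters[best].append(box)
        # Recompute each center as the mean (w, h) of its cluster.
        new_centers = [
            (sum(b[0] for b in cl) / len(cl),
             sum(b[1] for b in cl) / len(cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)
```

On two well-separated groups of boxes the procedure recovers the two group means as anchors.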

D. COMPARISON WITH THE STATE-OF-THE-ART METHODS
The proposed network is compared with state-of-the-art methods on the KAIST test dataset, including ACF+T+THOG, Halfway Fusion, YOLO_TLV, Fusion RPN+BDT, IAF R-CNN, IATDNN+IASS, MSDS-RCNN, AR-CNN, and MBNet. Among these methods, YOLO_TLV, MBNet, and ours are one-stage methods, and the rest are two-stage methods. The experimental results are depicted in Fig. 5, which shows that our detection method performs best, with the lowest MR values of 7.85%, 8.03%, and 7.82% under the all-day, daytime, and nighttime conditions, respectively. Although the proposed network is a one-stage detection model, it can still provide good detection performance due to the base network YOLOv5 and the effectiveness of the proposed fusion method.
Compared with YOLO_TLV, the MR values are reduced by 16.97%, 18.12%, and 13.13%, respectively; compared with MBNet, they are reduced by 0.36%, 0.42%, and 0.2%, respectively. The proposed method thus outperforms the compared networks. From Fig. 5(b) and (c), we can observe that our method achieves good performance under the reasonable setting during both day and night, and works better at night than during the day. This shows that the proposed detection method is well suited to pedestrian detection at night.

Fig. 5 shows the experimental results on a reasonable subset. A reasonable subset selects from the complete dataset the image data that can reflect the performance of the algorithm, based on the requirements of the algorithm, because public datasets are often designed for applications rather than for a specific algorithm. It is therefore necessary to filter the image data according to the specific needs of the algorithm. The experimental strategy of Fig. 5 is the common strategy used by the compared methods: extremely difficult objects, such as those smaller than 55 pixels or heavily occluded, are eliminated.

In order to verify the effectiveness of our method on difficult samples, we also conducted a global experiment on all samples of the test set. The results are shown in Fig. 6, in which (a), (b), and (c) are the results under all-day, daytime, and nighttime conditions, respectively. The results show that the proposed method again performs best. A horizontal comparison shows that the performance improvement of our method is more obvious at night, which is clearly the result of fusing thermal images. A vertical comparison shows that the improvement in Fig. 6 is significantly larger than that in Fig. 5. Since the only difference between the two groups of experiments is the dataset, we can conclude that our method has certain advantages in detecting difficult samples, which indicates that its deep information fusion strategy handles difficult samples better.

For quantitative analysis, we examine the experimental results from two perspectives. From the perspective of scale, the dataset divides objects into near, medium, and far according to the pedestrian height in pixels: size < 45, 45 ≤ size ≤ 115, and 115 < size. To further verify the performance of the proposed method on small objects, we compare different methods on these three subsets. As can be seen from Table I, compared with MBNet, our method reduces the MR by 0%, 2.56%, and 3.79%, respectively. To verify the ability of our method to detect occluded objects, we evaluate three subsets: no occlusion, partial occlusion, and heavy occlusion. The statistical results show that our method has a strong ability to detect occluded objects. These results further verify the ability of our method to handle difficult samples. By analyzing some samples of the experimental results, we find that our method still struggles with very small and heavily occluded objects, which shows that, for a deep network, the completeness of an object's own information is an extremely important factor affecting detection results. Therefore, a good visual sensor and a suitable image and video acquisition strategy are key factors for the final application of artificial intelligence. The performance of a deep learning algorithm is often strongly related to the scale of the network and of the training dataset, so networks and datasets of different scales are not directly comparable.
When the dataset is fixed, the network scale is generally positively correlated with the computational workload. The size of the network determines the demand for computing resources, which in turn determines the application fields and scope of the algorithm. Comparing the computing resource requirements of algorithms is therefore important when evaluating their performance, and the usual strategy is to compare processing speed on the same computing platform. To this end, Fig. 7 compares the computation speed of our method with the other methods mentioned in this paper. The results show that our method performs best in terms of speed, reaching 50 FPS.

To further verify the effectiveness of the proposed algorithm, it is qualitatively compared with other advanced algorithms, and the detection results under different environmental conditions are visualized. Figs. 8 and 9 show the detection results during the day and at night, respectively; in each figure, (a), (b), and (c) show the pedestrian detection results under different illumination conditions. The visualization results show that, under different lighting conditions, our algorithm misses the fewest objects and has the best detection performance. Fig. 8(c) and Fig. 9(b) contain objects at different scales, and the results verify that our algorithm also detects objects of different scales well. Fig. 8(a), (b) and Fig. 9(a) contain objects occluded by the background, while Fig. 8(b) and Fig. 9(c) contain objects occluded by other pedestrians; under these different occlusions, our algorithm produces better visualization results.
The visual experimental results show a good detection effect for multi-scale and dense pedestrians under different background environments and illumination conditions. Compared with other advanced multispectral pedestrian detection frameworks, YOLO_CMN misses fewer pedestrians of different forms, and its predicted bounding boxes are closer to the ground-truth bounding boxes, which further verifies the effectiveness of the proposed algorithm. The detection results show that, after the enhancement and fusion of the two modal image features, the position and category information of the object are integrated, strengthening the feature expression ability of the object so that the algorithm can detect objects better.
In conclusion, we believe that the better performance of the proposed algorithm under different lighting conditions is inseparable from its improved detection of difficult samples. The comparative experiments verify that the proposed algorithm achieves a good balance between speed and accuracy.

E. ABLATION STUDIES
The CFCM module helps one modality obtain complementary feature information from the other modality during feature extraction. To verify its effectiveness, we conducted ablation studies on the module. The CFCM module is applied in the feature extraction stage of the two sub-networks; the specific deployment is shown in Fig. 2(d). This part of the experiment takes the architecture in Fig. 2(c), which does not use the CFCM module, as the baseline. On the basis of this architecture, we use different numbers of CFCM modules to fuse and complement the feature maps of the different modalities; the results are summarized in Table II.

The feature maps obtained after the CSP modules of the backbone network are visualized in Fig. 10. Comparing the feature maps of the baseline and YOLO_CMN shows that the addition of the CFCM module enhances pedestrian features and suppresses background features. As the network deepens, the CFCM module helps refine and integrate the background features, reducing background noise interference and enhancing the features of pedestrian areas. Especially under insufficient illumination, it is difficult to extract pedestrian features from RGB images, while the pedestrian features in thermal images are relatively prominent. The CFCM module enables the feature information of thermal images to be learned simultaneously when extracting pedestrian features in the RGB modality. Owing to the fusion and complementarity of the two modal features, the network can learn richer feature information and improve detection performance. In general, the CFCM module promotes modal interaction in the network, reduces object loss, highlights pedestrian features, reduces redundant learning, transmits more information, and improves detection under different illumination conditions.
The AFEFM module first enhances and suppresses the features of the two modalities and then fuses them to obtain richer information. To evaluate its effectiveness, we conducted the following experiments. The feature fusion strategy of the AFEFM module is shown in Fig. 2. As shown in Fig. 11, the AFEFM module can integrate features of different modalities to further highlight pedestrian features. In general, the proposed AFEFM module fully integrates the color and thermal streams so that the information of the two modal feature maps is further complemented.

To evaluate our dual-modality feature fusion architecture, we compared YOLO_CMN with some classical architectures, namely input fusion, early fusion, and halfway fusion. The experimental results are shown in Table IV. Further, the proposed network is compared with methods using only color images or only thermal images, denoted Color Only and Thermal Only in Table IV. The MR of single-modality detection is significantly higher than that of dual-modality detection, which proves that dual-modality detection performs better. Among the dual-modality architectures, YOLO_CMN has the lowest MR, while Input Fusion has the highest. Compared with Input Fusion, the MR values are reduced by 7.06%, 6.81%, and 7.3%, respectively. The results show that the proposed network can effectively fuse features and improve detection performance.
These ablation studies show that the proposed architecture has good detection performance and fast detection speed. Overall, the network achieves a good balance between detection accuracy and speed, which can be applied in practical engineering.

V. CONCLUSION AND FUTURE WORKS
This paper proposes a cross-modal detection network for all-day pedestrian detection. A low-cost CFCM module is added to the feature extraction stage of the lightweight backbone (CSPDarknet53). It promotes information interaction between the modalities during feature extraction, so that the network can realize the complementarity between the information flows of the two modalities and reduce object loss. We also propose the AFEFM module to fuse the color and thermal streams, further enhancing features and reducing false detections; essential features of the two modalities are learned through the enhancement and suppression processes. Through the feature complementarity of color and thermal images and the multi-scale fusion of the deep feature layers, we realize horizontal and vertical multi-dimensional data mining in the parallel deep network and fully enrich the deep semantic information of the object, improving the detection performance of the detector. The experimental results show that the proposed model can effectively integrate visible and infrared features and can effectively detect pedestrians of different scales under various illumination conditions and occlusions. Further, the proposed model is applicable to real-time applications. Future work will explore a more effective attention mechanism for fusing dual-modality features to achieve better detection performance, and a lighter module to improve the detection speed of the network.