Using Deep Learning in Infrared Images to Enable Human Gesture Recognition for Autonomous Vehicles



I. INTRODUCTION
Human detection, as a key technology for autonomous vehicles, has attracted considerable research attention. In addition, human gesture recognition is necessary to predict the trend of human behavior, which is significantly important information to formulate effective collision avoidance strategies for autonomous vehicles.
At present, most methods for human gesture recognition involve the following processes: data collection, data preprocessing, feature extraction, and classifier learning, among which feature extraction is the most important component. Researchers are constantly improving this step to increase the accuracy of gesture recognition models. To enable the feature extraction of human gestures, several researchers have proposed various methods. Hemati and Mirzakuchaki [1] focused on recognizing human gestures by considering the appearance (Harris features) and motion information (oriented optical flow) by constructing spatiotemporal features. Luvizon et al. [2] presented a framework for human gesture recognition, taking into account the depth maps of skeleton sequences by using spatial and temporal local features from the subgroups of joints aggregated using a robust method based on the VLAD algorithm and a pool of clusters. Murtaza et al. [3] used the features pertaining to the histograms of oriented gradients (HOG) to describe the motion history image (MHI) low dimensional representation to enable silhouette based view independent human gesture recognition. Ghamdi et al. [4] reported on the use of a space-time extension of the scale invariant feature transform (SIFT), which was originally applied to 2 dimensional (2D) volumetric images, for human gesture recognition. Murray et al. [5] proposed a multiview human gesture recognition model that relies entirely on the active acoustic sonar data to infer human action. Palhang et al. [6] addressed the problem of categorizing human gestures by devising bag of words models based on the covariance matrices of spatiotemporal features, with the features obtained using the histograms of optical flow. Dawn and Shaikh [7] presented a comprehensive review regarding STIP based methods for human gesture recognition and concluded that STIP based detectors could robustly detect the interest points from video in the spatiotemporal domain. Li et al. [8] proposed a framework combining the fast HOG3D features and self-organization feature map (SOM) network to enable gesture recognition from unconstrained videos, thereby bypassing the demanding preprocessing required for human detection, tracking or contour extraction. However, these methods usually involve handcrafted feature extraction, which requires researchers to have a deep understanding of the acquired data features and feature extraction algorithms. In addition, the performance of handcrafted feature extraction approaches is generally not sufficiently stable owing to the complexity of environmental factors (weather changes, illumination changes, background changes, etc.), and thus, human gesture recognition under such conditions is challenging when using traditional methods.
The associate editor coordinating the review of this manuscript and approving it for publication was Chao Yang.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In recent years, the use of convolutional neural networks [9], [10] in visual recognition has become increasingly popular, and their excellent performance in such tasks has been demonstrated. To enable the feature extraction of human gestures, Kim et al. [11] proposed a modified convolutional neural network (CNN) having a three dimensional receptive field, to generate a set of feature maps from the human gesture descriptors derived from a spatiotemporal volume. Le et al. [12] presented a framework for human gesture recognition by using the temporal and spatial features extracted simultaneously by utilizing a fine to coarse (F2C) CNN architecture optimized for human skeleton sequences. Wang et al. [13] proposed a visual attribute augmented 3D CNN framework that integrated the visual attributes (including detection, encoding and classification) to enable gesture recognition in trimmed videos. Meng et al. [14] presented a hierarchical dropped CNN architecture with a dropped CNN (d-CNN) to extract deep human gesture features from a probabilistic speed insensitive color image; furthermore, the authors extended the d-CNN to a hierarchical structure (h-CNN), in which multiple scales of temporal information are encoded, to enhance the temporal discriminative power. Meng et al. [15] proposed a deep learning network for gesture recognition, which integrated a quaternion spatiotemporal convolutional neural network (QST-CNN) and long short term memory network (LSTM); in this approach, a quaternion expression for an RGB image was employed, and the values of the red, green, and blue channels were considered simultaneously as a whole in a spatial convolutional layer, thereby avoiding the loss of spatial features. Yang et al. [16] proposed a sequential convolutional neural network to extract the effective spatiotemporal features of human gesture from videos, thereby incorporating the strengths of both convolutional and recurrent operations. Li et al. [17] proposed an end to end deep convolutional neural network in which the skeleton sequences were transformed into images, and the spatial temporal information was learned to enable 3D human gesture recognition. Ji et al. [18] developed a novel 3D CNN model to enable gesture recognition by extracting features from both the spatial and temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. Sun et al. [19] proposed a human gesture recognition approach by using factorized spatiotemporal convolutional networks that factorize the original 3D convolution kernel learning as a sequential process of learning the 2D spatial kernels in the spatial convolutional layers, followed by the learning of the 1D temporal kernels in the temporal convolutional layers. Bhattacharjee and Das [20] reported upon the exploration of a two stream convolutional neural network (2S-CNN) architecture involving the fusion of the dense optical flow features of the RGB frames and the salient object regions detected using a fast space-time saliency method to categorize the human gestures in videos.
In general, most of these methods are based on RGB images. However, visible light cameras require proper illumination to function effectively. In contrast, infrared cameras produce images based on the heat radiated by the human body; consequently, these cameras can be used regardless of the external illumination conditions, and they can overcome the influence of illumination changes, while still achieving satisfactory results under partial occlusion and overlap conditions. Figure 1 shows a comparison of the RGB images and infrared images with different human gestures in some specific scenes. Figure 1 indicates that in rainy and foggy weather, when the colors of the background and the clothing are similar, under occlusion and overlap, and in the evening and at night, the features of the humans (contour, brightness, orientation, etc.) are more clearly differentiated from the background in infrared images than in RGB images. However, research regarding human motion recognition methods based on infrared images is still insufficient. Akula et al. [21] demonstrated the use of IR cameras in the field of ambient assisted living and discussed their performance in human gesture recognition. Li et al. [22] proposed a human gesture silhouette energy histogram algorithm by using the statistical background model and background subtraction method to extract the human gesture silhouettes to address the problem of night time human gesture recognition. Osada et al. [23] proposed a human gesture pattern monitor (HPM) constructed using a film infrared sensor (MFI), without employing a monitor camera to ensure the clients' privacy, as a tableware system.
The contribution of this work is threefold:
• Considering the low resolution of convolutional feature maps, three DenseNet blocks were added before the residual components in the YOLO-V3 network to enhance the convolutional feature propagation, thereby developing a novel human gesture detection network architecture;
• Infrared images lack information regarding the sharp edges and boundaries of variable human gestures. In addition, the temperature changes considerably influence the imaging results. To address this problem, the saliency maps of infrared images were detected as an additional input channel to the gesture detection network to improve the robustness of the proposed model;
• An infrared image dataset of human gestures in four main categories (''squatting'', ''lying'', ''standing'', and ''walking'') in an outdoor environment was generated for training and testing the human gesture detection network.
The remainder of this article is organized as follows. Section 2 describes the proposed human gesture detection algorithm, along with the improved YOLO-V3 algorithm and infrared image saliency detection algorithm. Section 3 describes the creation of the human gesture infrared image dataset, including the process of obtaining the original infrared image and image dataset enhancement methods. The experimental content and results are described in detail in Section 4. Section 5 summarizes the employed methods and conclusions of this work.

II. METHODOLOGY
The architecture of the proposed human gesture detection model for infrared images is illustrated in Figure 2.
The input of this human gesture detection model includes infrared images with a resolution of 640 × 480, as obtained using an infrared camera in outdoor environments. These images are processed and resized into grayscale images and saliency images, each with a resolution of 416 × 416, as the inputs of the proposed network. The feature maps from the two modalities are concatenated, and the 1 × 1 convolution operation is performed on the concatenated feature maps to reduce the dimensions and linearly merge the features. Furthermore, the parameters of the 1 × 1 convolution kernel are trained to reduce the dimensions of the concatenated feature maps to 52 × 52 × 128.
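The fusion step can be illustrated with a minimal pure-Python sketch. A 1 × 1 convolution is a per-pixel linear map across channels, so applying it to the concatenated grayscale and saliency feature maps mixes the two modalities while changing the channel count. The function names, the toy 2 × 2 map size, and the kernel weights below are illustrative assumptions and not the trained parameters of the proposed network.

```python
def concat_channels(a, b):
    """Concatenate two H x W x C feature maps along the channel axis."""
    return [[pa + pb for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def conv1x1(fmap, weights):
    """Apply a 1x1 convolution: every output channel is a weighted sum of
    the input channels at the same pixel. `weights` is C_out x C_in."""
    return [[[sum(w * v for w, v in zip(w_row, pixel)) for w_row in weights]
             for pixel in row]
            for row in fmap]


# Toy 2 x 2 maps: 2 channels from the grayscale branch, 1 from the saliency branch.
gray = [[[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]]]
sal = [[[0.5], [0.0]],
       [[1.0], [0.25]]]

merged = concat_channels(gray, sal)   # 2 x 2 x 3
weights = [[1.0, 0.0, 2.0],           # hypothetical kernel reducing 3 channels to 2
           [0.0, 1.0, -1.0]]
fused = conv1x1(merged, weights)      # 2 x 2 x 2
```

In the proposed network the same operation is applied at a much larger scale, with learned weights that reduce the concatenated maps to 52 × 52 × 128.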
Considering the lower resolution of the convolutional layers in traditional YOLO-V3, three DenseNet blocks are added before the residual networks to improve the network performance from the perspective of feature reuse. The transfer function $H_i$ ($i = 1, 2, 3, 4$) employs the following network architecture: BN-ReLU-Conv, where BN denotes batch normalization. Let $x_0$ denote the input of a block and $x_i$ the output of $H_i$. The transfer functions $H_1$, $H_2$, $H_3$ and $H_4$ enable the nonlinear transformation of $x_0$, $[x_0, x_1]$, $[x_0, x_1, x_2]$ and $[x_0, x_1, x_2, x_3]$, respectively, and the feature layer $[x_0, x_1, x_2, x_3, x_4]$ continues to propagate forward as the input of a transition layer. Finally, the feature layers are spliced into feature maps with resolutions of 52 × 52 × 256, 26 × 26 × 512, and 13 × 13 × 1024, and these feature maps are propagated forward.
The specific structure and parameters of the proposed human gesture detection network are shown in Figure 3.

A. HUMAN GESTURE DETECTION MODEL BASED ON THE IMPROVED YOLO-V3 NETWORK
In contrast to the faster R-CNN, which is a state of the art target detection and recognition network, the YOLO network generates both the coordinates and the probability of each category directly through regression, which makes the YOLO network considerably faster than the faster R-CNN network. The YOLO-V3 [26] network exhibits a higher detection accuracy than the YOLO, YOLO-V2 [27] and SSD [28] target detection networks.
The YOLO network divides the input image into an S × S grid. Each grid cell predicts C conditional class probabilities and B bounding boxes, each of which corresponds to five predicted values: the center coordinates of the bounding box (x and y), the size of the bounding box (height and width), and a confidence score. The confidence can be obtained as
$$\text{Confidence} = p_r \times \mathrm{IoU}^{t}_{p}, \quad p_r \in \{0, 1\},$$
where $p_r = 1$ if the object is in the grid cell, and $p_r = 0$ otherwise; $\mathrm{IoU}^{t}_{p}$ denotes the accuracy of the predicted bounding box relative to the ground truth. If the same target is detected by multiple bounding boxes, the bounding box with the highest score is selected by using nonmaximum suppression.
Although the YOLO network has a significant advantage in terms of the computational speed compared to the faster R-CNN, its accuracy in predicting the bounding box and the class is lower. To improve the object positioning accuracy and recall rate, the YOLO-V2 network adopts anchor boxes, as in the faster R-CNN, to improve the design of the network structure. Compared with the YOLO-V2 network, the most notable aspect of the YOLO-V3 is that a multiscale prediction method is employed, which leads to a qualitative improvement in the detection accuracy and average detection time.
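The selection of the highest scoring box among overlapping detections can be sketched as a greedy nonmaximum suppression routine. This is a minimal illustration rather than the implementation used in the YOLO networks: the corner box format (x1, y1, x2, y2), the suppression threshold of 0.5, and the sample detections are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy nonmaximum suppression over a list of (box, score) pairs."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # highest remaining score
        kept.append(best)
        # Drop every box that overlaps the kept box too strongly.
        remaining = [d for d in remaining
                     if iou(best[0], d[0]) < iou_threshold]
    return kept


dets = [((10, 10, 50, 90), 0.92),
        ((12, 14, 52, 88), 0.80),        # heavy overlap with the first box
        ((200, 40, 240, 120), 0.75)]     # separate target
kept = nms(dets)
```

Here the second detection is suppressed because it overlaps the higher scoring first detection, while the third detection survives as a separate target.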
Compared with the target objects in visible light images, the target objects in infrared images possess a considerably smaller amount of feature information. In the process of convolution and pooling of the YOLO-V3 network, a large amount of feature information is lost, which is unfavorable for the accurate localization and classification of targets. To address this problem, DenseNet [29] blocks are added before the residual components of Darknet-53, which is the basic network for the YOLO-V3. The use of the DenseNet improves the network performance through feature reuse, which allows the limited feature information of the targets in infrared images to be exploited more effectively. The DenseNet blocks concatenate the outputs of all the convolutional modules, so that the input to each layer of the network includes the outputs of all the previous layers of the network. The use of such DenseNet blocks improves the flow of information and gradients through the network, because each layer has direct access to the gradients from the loss function and to the original input signal. This network structure also has a regularizing effect.
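The dense connectivity pattern can be sketched in a few lines. The sketch below only models the bookkeeping of concatenation: feature maps are reduced to flat lists of per-channel scalars, and the stand-in layer replaces BN-ReLU-Conv with a toy nonlinearity. The names `dense_block` and `make_layer` and the growth rate of 2 are assumptions for illustration.

```python
def dense_block(x0, layers):
    """Run a dense block: each layer receives the concatenation of the
    block input and the outputs of all earlier layers."""
    features = [x0]
    for h in layers:
        concatenated = [c for f in features for c in f]
        features.append(h(concatenated))
    # The concatenation of everything is passed on to the transition layer.
    return [c for f in features for c in f]


def make_layer(growth_rate):
    """Toy stand-in for BN-ReLU-Conv: emits `growth_rate` channels, each
    the ReLU of the mean of the input channels."""
    def h(channels):
        mean = sum(channels) / len(channels)
        return [max(0.0, mean)] * growth_rate
    return h


x0 = [1.0, -1.0, 2.0, 2.0]                            # 4 input channels
out = dense_block(x0, [make_layer(2) for _ in range(4)])
# 4 input channels + 4 layers x 2 new channels each
```

With four layers, the output carries the original four channels plus two new channels per layer, which is the feature reuse property described above.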

B. DETECTION OF SALIENCY MAP OF INFRARED IMAGES
Because infrared images lack information regarding the colors, textures, sharp edges and boundaries of variable human gestures, it is difficult to accurately distinguish the human body and background considering only the features of the brightness information in infrared images. In addition, changes in the temperature considerably influence the imaging results. To address this problem, the saliency maps of infrared images are detected as an additional input channel for the gesture detection network to improve the robustness of the proposed model. In this work, we propose an algorithm for the multiscale optimization of cellular automata to enable the detection of the saliency maps in infrared images, as shown in Figure 4.
First, the original infrared images are segmented into superpixel maps with five different scales by using the superpixel segmentation algorithm named simple linear iterative clustering (SLIC). Next, the numbers of superpixel blocks are reduced using the density based spatial clustering of applications with noise (DBSCAN) algorithm. The saliency features of the clustered superpixel maps are then detected via the improved cellular automaton. Finally, using the framework of the fusion algorithm in Bayesian theory, the final saliency image is obtained.
K. Geng, G. Yin: Using Deep Learning in Infrared Images to Enable Human Gesture Recognition for Autonomous Vehicles

1) SUPERPIXEL SEGMENTATION
Superpixel segmentation was introduced by Ren and Malik in 2003 [30]. Such algorithms use the similarity of the features between the pixels to group the pixels, and a small number of superpixels are used to describe the characteristics of the images. Consequently, the complexity and redundant information of images can be considerably reduced, which helps improve the speed of the subsequent image processing calculations and the real time performance of target detection. The superpixel algorithm used in this paper is simple linear iterative clustering (SLIC), which represents each pixel of the original infrared image as a 5 dimensional feature vector (in the standard formulation, three color components and two spatial coordinates).
The process employed by the SLIC superpixel segmentation algorithm can be described as follows: initialize the clustering centers of the infrared image and set the number of superpixels and the step size; calculate the gradient values of all the pixels in the 3 × 3 neighborhood around each clustering center and move the clustering center to the position with the minimum gradient value; label each pixel in the neighborhood around each clustering center; calculate the distance between the pixels in the surrounding neighborhood and the clustering center. The distance between two pixels in the infrared image is obtained as
$$D' = \sqrt{\left(\frac{d_c}{N_c}\right)^2 + \left(\frac{d_s}{N_s}\right)^2},$$
where $N_s$ represents the maximum spatial distance within the cluster; $N_c$ represents the maximum color distance; $d_c$ represents the color distance; and $d_s$ represents the spatial distance.
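As a small worked example of the distance computation, the sketch below assumes the standard SLIC measure, $\sqrt{(d_c/N_c)^2 + (d_s/N_s)^2}$, and the usual initial grid spacing $\sqrt{N/K}$ for N pixels and K superpixels. The pixel representation (a tuple of color components plus an (x, y) position) and the sample values are assumptions for illustration; a single intensity component is used because infrared images are grayscale.

```python
import math


def slic_distance(pixel, center, n_c, n_s):
    """Combined SLIC distance between a pixel and a cluster center.
    Each argument is (color_tuple, (x, y)); n_c and n_s are the maximum
    color and spatial distances used for normalization."""
    color_p, (xp, yp) = pixel
    color_c, (xc, yc) = center
    d_c = math.sqrt(sum((a - b) ** 2 for a, b in zip(color_p, color_c)))
    d_s = math.hypot(xp - xc, yp - yc)
    return math.sqrt((d_c / n_c) ** 2 + (d_s / n_s) ** 2)


def grid_step(num_pixels, num_superpixels):
    """Initial spacing between cluster centers on a regular grid."""
    return math.sqrt(num_pixels / num_superpixels)


# A 640 x 480 image segmented into 200 superpixels.
step = grid_step(640 * 480, 200)

# Intensity difference of 20 and spatial offset of (3, 4), i.e. 5 pixels.
d = slic_distance(((120.0,), (10, 10)), ((100.0,), (13, 14)),
                  n_c=10.0, n_s=5.0)
```

In this example the normalized color term is 2 and the normalized spatial term is 1, so the combined distance is the square root of 5. A larger `n_c` would weight spatial proximity more heavily and produce more compact superpixels.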
The following scales were selected for the superpixel maps in this work: 50, 100, 200, 500, and 1000, as shown in Figure 4. Replacing the pixel information with the information of the superpixel blocks can effectively reduce the redundant information of the infrared images and increase the speed of the subsequent processing algorithms. Furthermore, by using different scales, human targets with different spatial scales in the infrared images can be segmented more effectively.

2) SUPERPIXEL CLUSTERING
DBSCAN, which is a clustering algorithm proposed by Ester et al. [31], has become one of the most widely used clustering algorithms as it can discover arbitrarily shaped clusters and eliminate noisy data. DBSCAN is a density based spatial clustering method in which each high density region is grouped into a cluster, so that clusters of arbitrary shape can be found in a spatial dataset. We present the following definitions to elucidate the mechanism of the algorithm, where $N_{Eps}(q)$ denotes the Eps neighborhood of a point q, i.e., the set of points within a distance Eps of q:
• Core point: a point whose Eps neighborhood contains at least MinPts points;
• Border point: a point that is contained in the Eps neighborhood of a core point p but is not itself a core point;
• Noise point: a point that is neither a core point nor a border point;
• Directly density reachable: a point p is directly density reachable from a point q with regard to Eps and MinPts, if $p \in N_{Eps}(q)$ and $|N_{Eps}(q)| \geq MinPts$;
• Density reachable: a point p is density reachable from a point q with regard to Eps and MinPts, if there exists a chain of points $p_1, p_2, \cdots, p_n$, with $p_1 = q$ and $p_n = p$, such that $p_{i+1}$ is directly density reachable from $p_i$;
• Density connected: a point p is density connected to a point q with regard to Eps and MinPts, if there exists a point w such that both p and q are density reachable from w.
The key concept of this algorithm is to find all the core points and form the clusters by grouping each core point with all the points that are density reachable from it. The specific algorithm process can be described as follows: arbitrarily select a point p from the database; retrieve all the points density reachable from p with regard to Eps and MinPts; if p is a core point, a cluster is formed; if p is a border point, no points are density reachable from p, and DBSCAN visits the next point of the database; the process is continued until all the points have been processed.
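The procedure above can be sketched as a compact pure-Python routine. It is a minimal illustration on 2D points rather than the superpixel feature vectors used in this work, and the sample coordinates and the values of Eps and MinPts are assumptions chosen so that the result is easy to verify by hand.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points in the Eps neighborhood of points[idx]."""
    return [j for j, q in enumerate(points)
            if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks noise points."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may be relabeled as a border point
            continue
        labels[i] = cluster              # i is a core point: start a cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue                 # already assigned, do not expand
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
        cluster += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),      # dense group
          (10, 10), (10, 11), (11, 10),        # second dense group
          (50, 50)]                            # isolated point
labels = dbscan(points, eps=1.5, min_pts=3)
```

The two dense groups form separate clusters, and the isolated point is labeled as noise because its Eps neighborhood contains only itself.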

3) SALIENCY DETECTION BASED ON CELLULAR AUTOMATA
In this paper, the cellular automaton is used to detect the saliency features of the infrared images. The clustered superpixel maps are taken as the input, and the accuracy of the saliency maps is improved by optimizing the update rules.
• Impact Factor Matrix: It is intuitive that neighbors with more similar color features have a greater influence on a cell's next state. The similarity of any pair of superpixels is measured using a defined distance in the CIELAB color space. We construct the impact factor matrix $F = [f_{ij}]_{N \times N}$ by defining the impact factor $f_{ij}$ of superpixel i to j as
$$f_{ij} = \begin{cases} \exp\left(-\dfrac{\|c_i, c_j\|}{\sigma_3^2}\right), & j \in NB(i), \\ 0, & i = j \text{ or otherwise}, \end{cases}$$
where $\|c_i, c_j\|$ denotes the Euclidean distance in the CIELAB color space between the superpixels i and j; $\sigma_3$ is a parameter to control the degree of the similarity; $NB(i)$ is the set of neighbors of cell i. To normalize the impact factor matrix, a degree matrix $D = \mathrm{diag}\{d_1, d_2, \cdots, d_N\}$ is generated, where $d_i = \sum_j f_{ij}$. Finally, a row normalized impact factor matrix can be determined as follows:
$$F^* = D^{-1} F.$$
• Coherence Matrix: Because the subsequent state of each cell is determined by its current state as well as the state of its neighbors, the importance of the two decisive factors must be balanced. In particular, if a superpixel is considerably different from all the neighbors in the color space, its next state will be primarily dependent on the cell itself. However, if a cell is similar to the neighbors, it is more likely to be assimilated by the local environment. We build a coherence matrix $C = \mathrm{diag}\{c_1, c_2, \cdots, c_N\}$ to better promote the evolution of all the cells. The coherence of each cell toward its current state can be calculated as
$$c_i = \frac{1}{\max_j (f_{ij})}.$$
To control $c_i \in [b, a + b]$, we construct the coherence matrix $C^* = \mathrm{diag}\{c^*_1, c^*_2, \cdots, c^*_N\}$ using the following formulation:
$$c^*_i = a \cdot \frac{c_i - \min(c_j)}{\max(c_j) - \min(c_j)} + b,$$
where $j = 1, 2, \cdots, N$. We set the constants a and b as 0.6 and 0.2, respectively. Using the coherence matrix $C^*$, each cell can automatically evolve into a more accurate and steady state. Furthermore, the salient object can be more easily detected under the influence of the neighbors.
• Synchronous Updating Rule: In single layer cellular automata, all the cells update their states simultaneously according to the update rule. Given an impact factor matrix and a coherence matrix, the synchronous updating rule $f: S^{NB} \to S$ can be defined as follows:
$$S^{t+1} = C^* S^t + (I - C^*) F^* S^t,$$
where $S^t$ denotes the saliency states of all the cells at step t, I is the identity matrix, and $C^*$ and $F^*$ denote the coherence matrix and the row normalized impact factor matrix, respectively. By applying this updating mechanism to the original saliency map in each scale space, the respective optimized saliency maps can be obtained.
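The interaction of the impact factor matrix, the coherence matrix and the synchronous update can be illustrated with a small sketch. It assumes the update $S^{t+1} = C^* S^t + (I - C^*) F^* S^t$, with impact factors $\exp(-\|c_i, c_j\| / \sigma^2)$ between neighbors, coherence $1 / \max_j f_{ij}$ rescaled to $[b, a + b]$, and every cell having at least one neighbor. A scalar intensity stands in for the CIELAB color of a superpixel, and the four-cell chain, the value of `sigma_sq` and the number of steps are assumptions for illustration.

```python
import math


def impact_matrix(colors, neighbors, sigma_sq=0.1):
    """f[i][j] = exp(-|c_i - c_j| / sigma^2) for neighbors, 0 elsewhere."""
    n = len(colors)
    f = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            f[i][j] = math.exp(-abs(colors[i] - colors[j]) / sigma_sq)
    return f


def row_normalize(f):
    """F* = D^-1 F, so that every row sums to one."""
    out = []
    for row in f:
        total = sum(row)
        out.append([v / total for v in row])
    return out


def coherence(f, a=0.6, b=0.2):
    """c_i = 1 / max_j f_ij, rescaled into [b, a + b]."""
    c = [1.0 / max(row) for row in f]
    span = (max(c) - min(c)) or 1.0      # avoid division by zero
    return [a * (ci - min(c)) / span + b for ci in c]


def evolve(s, f_star, c_star, steps=20):
    """Apply the synchronous update rule for a fixed number of steps."""
    for _ in range(steps):
        fs = [sum(w * v for w, v in zip(row, s)) for row in f_star]
        s = [c * v + (1.0 - c) * m for c, v, m in zip(c_star, s, fs)]
    return s


# Four superpixels in a chain: two bright (warm) cells, two dark cells.
colors = [0.9, 0.85, 0.1, 0.2]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
f = impact_matrix(colors, neighbors)
s0 = [1.0, 0.2, 0.8, 0.0]                # noisy initial saliency
s = evolve(s0, row_normalize(f), coherence(f))
```

Because similar neighbors influence each other strongly, the saliency values of the two bright cells converge toward each other, as do those of the two dark cells, while the weak coupling across the intensity boundary keeps the two groups apart.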

4) BAYESIAN THEORY FUSION METHOD
Due to the different scales of superpixel segmentation, the optimized saliency maps obtained in each scale space have their own advantages and disadvantages. This paper uses a fusion method based on the Bayesian theory, which can be used to obtain the optimal saliency by combining the saliency values of each scale. In particular, the saliency map of any scale $S_i$ ($i = 1, 2, \cdots, 5$) is selected as the Bayesian prior probability, and the other four saliency maps $S_j$ ($j \neq i$, $j = 1, 2, \cdots, 5$) are defined as the likelihood probability. Let the current $S_i$ merge with each $S_j$ separately. The resulting four posterior probability maps are added, and the average is considered as the final saliency map. The detailed steps can be described as follows:
• Use $F_i$ and $B_i$ to represent the foreground and background regions of $S_i$, respectively. $N_{F_i}$ and $N_{B_i}$ represent the number of pixels in the foreground and background regions, respectively.
• Calculate the distribution characteristics of $S_j$ in the foreground and background regions. In the normalized statistical distribution histogram for the saliency value $S_j$, the observation likelihood probability of pixel x can be expressed considering the value of the corresponding bin of $S_j(x)$, as follows:
$$p(S_j(x) \mid F_i) = \frac{N_{bF_i(S_j(x))}}{N_{F_i}}, \qquad p(S_j(x) \mid B_i) = \frac{N_{bB_i(S_j(x))}}{N_{B_i}},$$
where $N_{bF_i(S_j(x))}$ and $N_{bB_i(S_j(x))}$ respectively represent the number of pixels falling in the bin of $S_j(x)$ in the foreground and background statistical histograms.
• If $S_i$ is the Bayesian prior probability, the posterior probability can be calculated as follows:
$$p(F_i \mid S_j(x)) = \frac{S_i(x)\, p(S_j(x) \mid F_i)}{S_i(x)\, p(S_j(x) \mid F_i) + (1 - S_i(x))\, p(S_j(x) \mid B_i)}.$$
• Add the results of the four pairwise fusions and average these values to obtain the final saliency map.
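The fusion steps can be sketched as follows. The sketch assumes that the foreground and background regions of the prior map are obtained by thresholding it at its mean value, which is a common choice but is not specified in the text, and it uses the Bayes formula with $S_i(x)$ as the prior and histogram-based likelihoods of $S_j(x)$. Saliency maps are flat lists of values in [0, 1]; the function names, the number of histogram bins and the sample maps are assumptions for illustration.

```python
def bayes_posterior(prior, evidence, bins=4):
    """Posterior foreground probability per pixel with `prior` as S_i and
    `evidence` as S_j."""
    threshold = sum(prior) / len(prior)      # assumed foreground split rule
    fg = [k for k, v in enumerate(prior) if v >= threshold]
    bg = [k for k, v in enumerate(prior) if v < threshold]

    def bin_of(v):
        return min(int(v * bins), bins - 1)

    # Histograms of the evidence map inside the foreground and background.
    hist_f, hist_b = [0] * bins, [0] * bins
    for k in fg:
        hist_f[bin_of(evidence[k])] += 1
    for k in bg:
        hist_b[bin_of(evidence[k])] += 1

    posterior = []
    for s_i, s_j in zip(prior, evidence):
        like_f = hist_f[bin_of(s_j)] / len(fg) if fg else 0.0
        like_b = hist_b[bin_of(s_j)] / len(bg) if bg else 0.0
        denom = s_i * like_f + (1.0 - s_i) * like_b
        posterior.append(s_i * like_f / denom if denom > 0 else s_i)
    return posterior


def fuse(maps, i, bins=4):
    """Fuse map i (prior) with every other map and average the posteriors."""
    others = [m for j, m in enumerate(maps) if j != i]
    posts = [bayes_posterior(maps[i], m, bins) for m in others]
    return [sum(vals) / len(posts) for vals in zip(*posts)]


prior = [0.9, 0.8, 0.2, 0.1]
evidence = [0.95, 0.7, 0.3, 0.05]
post = bayes_posterior(prior, evidence)
fused = fuse([prior, evidence, evidence], i=0)
```

In this toy case the two maps agree on which pixels are salient, so the posterior sharpens the prior toward 1 for the foreground pixels and toward 0 for the background pixels.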

III. DATASET DESCRIPTION

A. IMAGE DATA ACQUISITION
The infrared images for training and testing the proposed deep learning human gesture recognition model were obtained using an infrared camera with a pixel resolution of 640 × 480. The images were acquired at the Southeast University Jiulong Lake Campus in Nanjing, Jiangsu, China. Our data acquisition platform consisted of an infrared camera and two RGB cameras, one of which had a polarizing filter, as shown in Figure 5. To facilitate the data collection process, the platform was fixed on top of an autonomous ground vehicle. The image registration of multiple cameras was achieved by using a calibration matrix. The infrared images were captured under sunny, cloudy, misty, and slightly rainy weather conditions (the light illuminance ranged from less than 50 lux to more than 50,000 lux). The acquisition period was relatively long and lasted from June to October. The acquisition was initiated at 9 a.m., 4 p.m., and 8 p.m. Furthermore, images were gathered in foggy weather and night time conditions.
A total of 2492 infrared images were initially collected, involving 4 main gesture classes, each of which grouped several morphologically similar gestures: ''squatting'' (sitting, kneeling, squatting, etc.), ''lying'' (lying on the back, side, stomach, etc.), ''standing'' (facing the lens, with side to the lens, with back to the lens, etc.), and ''walking'' (running, brisk walking, slow walking, etc.). Among all these collected images, the proportion of images with squatting, lying, standing, and walking gestures was 29%, 8%, 39%, and 24%, respectively. In this work, 2000 of these images were randomly chosen to generate the training dataset, which was used to train the human gesture detection model. The remaining 492 images were used as testing data to verify the performance of the proposed detection model.
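A random split of this kind can be reproduced with a seeded shuffle, as in the following minimal sketch. The integer image identifiers and the seed value are assumptions for illustration; the actual selection used in this work is not specified beyond being random.

```python
import random


def split_dataset(image_ids, n_train, seed=0):
    """Randomly split identifiers into training and testing subsets."""
    rng = random.Random(seed)        # fixed seed keeps the split reproducible
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


train_ids, test_ids = split_dataset(range(2492), n_train=2000)
```

Fixing the seed makes the split repeatable, which helps when comparing several detection models on the same testing images.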

B. IMAGE DATA AUGMENTATION
Since different weather conditions, times of the day, seasons and other factors considerably influence the illumination, the generalization ability of the proposed human motion detection model depends on the integrity of the training dataset. To enhance the richness of the experimental dataset, the collected images were augmented in terms of brightness adjustment, rotation, horizontal mirroring, noise addition and blurring, as shown in Figure 6. After data augmentation, the numbers of images in the training and testing datasets increased to 16000 and 3936, respectively. The training and testing datasets contained 33264 labeled human gestures, with the numbers of ''squatting'', ''lying'', ''standing'' and ''walking'' being 7392, 8928, 9032 and 7912, respectively. The numbers of labeled ''lying'' and ''standing'' gestures were larger than that of the ''walking'' gesture, and the number of ''squatting'' gestures was the smallest. One image may contain multiple different human gesture targets, and the completed dataset is shown in Table 1.
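The augmentation operations can be sketched on a grayscale image stored as a nested list of 8 bit intensities. The specific parameters (brightness factor, rotation by 180°, noise standard deviation, 3 × 3 mean filter) are assumptions for illustration, since the settings used in this work are not listed. The reported counts imply eight images per original image (16000 / 2000 and 3936 / 492).

```python
import random


def clamp(v):
    """Round and clip a value to the 8 bit intensity range."""
    return min(255, max(0, round(v)))


def adjust_brightness(img, factor):
    return [[clamp(p * factor) for p in row] for row in img]


def mirror_horizontal(img):
    return [row[::-1] for row in img]


def rotate_180(img):
    return [row[::-1] for row in img[::-1]]


def add_noise(img, sigma, seed=0):
    """Add zero mean Gaussian noise with standard deviation `sigma`."""
    rng = random.Random(seed)
    return [[clamp(p + rng.gauss(0, sigma)) for p in row] for row in img]


def box_blur(img):
    """3 x 3 mean filter with edge replication."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[min(h - 1, max(0, y + dy))][min(w - 1, max(0, x + dx))]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            row.append(clamp(sum(vals) / 9))
        out.append(row)
    return out


img = [[10, 20, 30],
       [40, 50, 60]]
mirrored = mirror_horizontal(img)
brighter = adjust_brightness(img, 1.5)
rotated = rotate_180(img)
noisy = add_noise(img, sigma=5.0)
blurred = box_blur([[7, 7], [7, 7]])

variants_per_image = 16000 // 2000   # ratio implied by the reported counts
```

Note that geometric operations such as mirroring and rotation also require the bounding box labels to be transformed accordingly, which is omitted here.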

IV. EXPERIMENT AND DISCUSSION
An image processing server with two NVIDIA GTX 1080 Ti graphics cards was used to train and test the proposed human gesture detection model. The initialization parameters of the proposed network are listed in Table 2.
Considering the memory limit of the server, the input images were resized to a resolution of 416 × 416, and the batch size was set as 16. We used 50,400 training steps to better analyze the training process. The initial learning rate was 0.001, and it was reduced to 0.0001 and 0.00001 after 30,000 and 45,000 steps, respectively. The momentum of the network was 0.9, and the weight decay regularization was set as 0.0005. A series of testing experiments were conducted on the trained human gesture detection model by using testing images with a resolution of 640 × 480. The following indicators were used to evaluate the effectiveness of the human gesture detection model: precision and recall, F1 score, loss function, IoU, detection time, average precision (AP) and mean average precision (mAP), which have been widely used in the existing literature. Herein, we introduce the definition and function of these indicators. The precision (P) and recall (R) can be defined as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},$$
where TP represents the true positive samples, FP denotes the false positive samples, TN represents the true negative samples, and FN denotes the false negative samples. The AP represents the quality of the model in each category, and it can be obtained as
$$AP = \int_0^1 P(R)\, dR.$$
The mAP indicates the quality of the model in all the categories, and it can be obtained as
$$mAP = \frac{1}{C} \sum_{c=1}^{C} AP(c),$$
where C is the number of categories.
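The F1 score was used to evaluate the performance of the model, and it was determined as
$$F_1 = \frac{2 P R}{P + R}.$$
In addition, the loss function was used to evaluate the performance of the network model, and it was determined as follows:
$$Loss = Error_{coord} + Error_{iou} + Error_{cls}.$$
Following the original YOLO formulation, the coordinate prediction error $Error_{coord}$ can be expressed as
$$Error_{coord} = \lambda_{coord} \sum_{i=1}^{s^2} \sum_{j=1}^{B} L^{obj}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{coord} \sum_{i=1}^{s^2} \sum_{j=1}^{B} L^{obj}_{ij} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right],$$
where $\lambda_{coord}$ is the weight of the coordinate error; $s^2$ is the number of grids in the image; B is the number of bounding boxes generated by each grid; $L^{obj}_{ij} = 1$ if the object falls into the j th bounding box in grid i, and $L^{obj}_{ij} = 0$ otherwise; $(\hat{x}_i, \hat{y}_i, \hat{h}_i, \hat{w}_i)$ and $(x_i, y_i, h_i, w_i)$ denote the predicted and true values of the center coordinates, height, and width of the bounding box, respectively.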
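Following the original YOLO formulation, the error $Error_{iou}$ can be defined as follows:
$$Error_{iou} = \sum_{i=1}^{s^2} \sum_{j=1}^{B} L^{obj}_{ij} (C_i - \hat{C}_i)^2 + \lambda_{noobj} \sum_{i=1}^{s^2} \sum_{j=1}^{B} L^{noobj}_{ij} (C_i - \hat{C}_i)^2,$$
where $\lambda_{noobj}$ is the weight of the IoU error; $L^{noobj}_{ij} = 1 - L^{obj}_{ij}$; $\hat{C}_i$ and $C_i$ denote the predicted confidence and true confidence, respectively. The classification error $Error_{cls}$ can be defined as follows:
$$Error_{cls} = \sum_{i=1}^{s^2} \sum_{j=1}^{B} L^{obj}_{ij} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2,$$
where c is the class that the target belongs to; $p_i(c)$ and $\hat{p}_i(c)$ respectively refer to the true and predicted probability that the object belonging to class c lies in grid i. The IoU is another criterion used to evaluate the detection accuracy, and it can be obtained by calculating the overlap ratio between the predicted and true bounding boxes, as follows:
$$IoU = \frac{S_{overlap}}{S_{union}},$$
where $S_{overlap}$ is the intersection area of the predicted and true bounding boxes, and $S_{union}$ denotes the union area of the predicted and true bounding boxes.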
The average detection times were also compared, as reported in this paper.
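As a brief worked example of the counting based indicators, the following sketch computes the precision, recall and F1 score from the numbers of true positive, false positive and false negative detections, using the standard definitions P = TP / (TP + FP), R = TP / (TP + FN) and F1 = 2PR / (P + R). The counts are toy values chosen for illustration and are not results of this work.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy counts: 80 correct detections, 20 spurious ones, 20 missed targets.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```

The F1 score is the harmonic mean of the precision and recall, so it is high only when both are high, which is why it is reported alongside the P-R curves.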

A. EFFECT OF DATA CATEGORIES
To verify the effect of the different categories in the dataset on the detection results, the infrared images with squatting, lying, standing and walking human gestures were used to train and test the proposed human gesture detection neural network. The P-R curves of the different categories for the proposed model are shown in Figure 7. Mathematically, the AP is defined as the area under the P-R curve, and it reflects the average performance of the algorithm at a given IoU threshold; the IoU threshold was set as 0.5 in this work. The AP and F1 scores of the different data categories are presented in Table 3.
The mAP is the mean of the AP values over all the classes, and its value was 76.93% for the proposed human gesture detection model.
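The relation between the P-R curve, the AP and the mAP can be made concrete with a short sketch. It approximates the area under the P-R curve with the trapezoidal rule, which is one of several conventions in use (interpolated variants are also common), and averages the per-class values to obtain the mAP. The curve points and the per-class AP values are toy numbers for illustration, not results of this work.

```python
def average_precision(recalls, precisions):
    """Area under the P-R curve by the trapezoidal rule."""
    pts = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_average_precision(ap_by_class):
    """Mean of the per-class AP values."""
    return sum(ap_by_class.values()) / len(ap_by_class)


# Toy P-R curve with three operating points.
ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.4])

# Toy per-class AP values for two hypothetical classes.
m_ap = mean_average_precision({"class_a": 0.75, "class_b": 0.25})
```

For the toy curve, the two trapezoids contribute 0.45 and 0.30, giving an AP of 0.75.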
We considered the confusion matrix of the classification prediction results to evaluate the performance of the proposed method, as shown in Table 4. The values in the main diagonal denote the percentages of the correctly classified categories, and the remaining values correspond to the percentages of the incorrectly classified categories. It was noted that the main errors occurred when ''squatting'' was classified as ''lying'', ''lying'' was classified as ''squatting'', ''standing'' was classified as ''walking'', and ''walking'' was classified as ''standing''. We believe that ''squatting'' and ''lying'', as well as ''standing'' and ''walking'', are considerably similar in terms of the feature information. Furthermore, the sizes of the datasets are relatively small, and the datasets are relatively unbalanced. These aspects need to be further considered to solve the problem of interest.
From the above training and testing results, it can be noted that the categories of the targets affect the detection results of the proposed network. The number of humans with squatting gestures is relatively small, and thus, the detection results for human targets with squatting gestures are worse than those for human targets with other gestures. The proposed model obtained the best detection results for infrared images with walking human gestures, which exhibit more notable features in the infrared images. This can be attributed to the rise in body temperature, to different degrees, after walking and running motions.

B. COMPARISON WITH DIFFERENT DETECTION MODELS
The detection performance of the proposed model was compared with that of several other detection models, including the single shot multibox detector (SSD), the original YOLO-V3, RetinaNet [32], YOLACT [33] and faster R-CNN, to verify the superiority of the proposed human gesture detection model. The loss function curves of these detection models during training are shown in Figure 9. The training results of the different target detection models indicate that the average loss decreases continuously with an increase in the number of iterations, and the proposed detection model consistently exhibits a faster convergence in the training process. The final losses of the SSD, original YOLO-V3, RetinaNet, YOLACT, faster R-CNN and proposed detection models are approximately 1.21, 0.84, 0.72, 0.69, 0.62 and 0.51, respectively. In addition, compared to the other target detection models, the loss curve of the proposed method exhibits a continuously decreasing trend during the training process until after 45,000 training steps. These results demonstrate the better training performance of the proposed human gesture detection model.
The P-R curves for the SSD, original YOLO-V3, RetinaNet, YOLACT, faster R-CNN and the proposed model are shown in Figure 10. The detection results pertaining to the F1 scores, IoU values, mAP and average detection time of the target detection models are summarized in Table 5.
The F1 score and IoU value for the proposed network are approximately 0.862 and 0.873, respectively, which are higher than those for the SSD, original YOLO-V3, RetinaNet and YOLACT models. Although the F1 score and IoU value of the faster R-CNN model are slightly higher than those of the proposed model (by 0.013 and 0.023, respectively), its average detection time is 0.968 s, which is approximately 8 times longer than that of the proposed model. This analysis indicates that the proposed model exhibits excellent processing speed while ensuring a high detection accuracy.
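For reference, the two metrics compared here can be computed as in the following minimal Python sketch. The box format `(x1, y1, x2, y2)` and the function names are assumptions made for the example, and the sketch is not the evaluation code used in this work.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A predicted box is typically counted as a true positive when its IoU with a ground-truth box of the same category reaches the chosen threshold, and the resulting counts of true positives, false positives and false negatives feed into the F1 score.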

C. COMPARISON OF CLASSIFICATION RESULTS
The data were collected between 9 a.m. and 8 p.m., and the light intensity varied considerably (from less than 50 lux to more than 50,000 lux). It is difficult to identify the gestures of humans from RGB images under weak light intensity, especially in the evening and at night. To further verify the effectiveness of the proposed method, we trained and tested the proposed neural network model using pure RGB images, pure infrared images, and saliency infrared image pairs. Several RGB image samples with less observable human gestures and the corresponding infrared images and saliency infrared image pairs are shown in Figure 11.
As shown in Figure 11, the gestures of humans cannot be easily recognized in many cases when using an RGB image. Examples include scenes with a street light at dusk, scenes with a street light in the evening, scenes without a street light in the late evening, scenes with occlusion, and scenes in which the human and the background exhibit similar colors. However, in these cases, the contour and brightness features of the humans are relatively more notable in the infrared images and the corresponding saliency images. The classification results obtained by the proposed network model using the different types of images are compared in Table 6.
The classification results presented in Table 6 confirm that by using the saliency infrared image pairs and data fusion method, we can effectively improve the detection accuracy of the proposed network while maintaining a reasonable detection time.
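One simple way to form such a saliency infrared image pair as a network input is to stack the saliency map onto the infrared image as an additional channel. The sketch below illustrates only this stacking step, under the assumptions that both inputs are single-channel 8-bit images of the same size stored as nested lists and that the values are scaled to [0, 1]. The function name is hypothetical, the saliency map is assumed to have been computed beforehand, and the fusion used in this work may differ in detail.

```python
def stack_ir_and_saliency(ir_image, saliency_map):
    """Pair each infrared pixel with its saliency value.

    Both inputs are H x W nested lists of 8-bit values (assumed format).
    The result is an H x W x 2 nested list scaled to the range [0, 1].
    """
    if len(ir_image) != len(saliency_map):
        raise ValueError("height mismatch between image and saliency map")
    stacked = []
    for ir_row, sal_row in zip(ir_image, saliency_map):
        if len(ir_row) != len(sal_row):
            raise ValueError("width mismatch between image and saliency map")
        # Channel 0 holds the infrared intensity, channel 1 the saliency.
        stacked.append([[ir / 255.0, sal / 255.0]
                        for ir, sal in zip(ir_row, sal_row)])
    return stacked
```

The first convolutional layer of the network then needs to accept the increased number of input channels.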

D. DETECTION RESULTS UNDER THE CONDITION OF OCCLUSION AND OVERLAP
In outdoor scenes, the presence of occlusion due to trees, buildings and other structures, and the overlap between humans can affect the detection accuracy. The F1 scores, IoU and AP for the proposed human gesture detection model under the conditions of occlusion and overlap are shown in Figure 12 and Table 7.
These detection results indicate that under occlusion and overlap conditions, the accuracy of the human gesture detection models is reduced. However, in most cases of occlusion and overlap, the proposed model still exhibits a satisfactory performance for human gesture detection in infrared images.

E. DETECTION RESULTS IN SCENES WITHOUT HUMANS
In an outdoor scene, it is possible for an infrared camera to capture images that do not contain human targets. We used 50 infrared images that did not contain human targets to verify the performance of the proposed detection model and to test whether the model would identify some humanoid targets as humans. Specifically, infrared images containing backgrounds of the sky, grass and buildings were collected. The detection results indicated that some humanoid branches were recognized as humans, as shown in Figure 13.
These test results indicate that the proposed detection model can detect most of the human targets in infrared images, even under some severe occlusion and overlap conditions. However, in some scenarios, humanoid targets may still be identified as humans. This problem can likely be solved by increasing the scale of the target dataset under a larger number of scenarios and environmental conditions.

V. CONCLUSION
This work reports a deep learning approach for human gesture detection in infrared images, based on an improved YOLO-V3 network. The proposed model uses three DenseNet blocks, added before the residual components in the YOLO-V3 network, to enhance the convolutional feature propagation and improve the human gesture detection performance. The saliency maps of the infrared images were extracted and used as an additional input channel for the network to improve the robustness and performance of the proposed human gesture detection model. To verify the detection performance of the proposed model, several experiments were conducted, and the results indicated that the proposed network has a better detection performance than the original YOLO-V3 and SSD networks. The detection accuracy of the proposed method is comparable to that of the faster R-CNN, which is a state-of-the-art network in terms of accuracy. However, the proposed network exhibits a notable advantage in terms of the detection time. The proposed model is capable of human gesture detection in low-visibility images, such as those captured in rainy and foggy weather, at night, and under conditions in which the colors of the targets and the background are similar. In addition, the proposed model maintains a satisfactory performance for human gesture detection under conditions involving occlusion and overlap. In the future, we aim to further optimize the human gesture dataset of the infrared images and predict the dynamic behavior of humans.