Object Recognition at Night Scene Based on DCGAN and Faster R-CNN

In recent years, with the rapid growth of computing power, target recognition methods based on convolutional neural networks have become the mainstream algorithms in the field of object detection. However, owing to blurred backgrounds and dim lighting, the object detection task in night environments still faces considerable visual challenges. This article is strongly inspired by DCGAN (Deep Convolutional Generative Adversarial Networks): we take night images as input and, through adversarial training of a generator and a discriminator, generate virtual target scenes that resemble the daytime environment. To obtain high-precision detection results, we combine this with the state-of-the-art Faster R-CNN (Region-based Convolutional Neural Network) detection system, enhanced with deep convolutional feature fusion and multi-scale ROI (Region of Interest) pooling. A series of experiments shows that our method achieves an mAP of 82.6% on our own night-scene dataset, significantly higher than the 80.4% of the original Faster R-CNN alone. Our method can therefore meet the practical needs of target detection tasks in night scenes. We sincerely hope that our approach will contribute to future research.


I. INTRODUCTION
(The associate editor coordinating the review of this manuscript and approving it for publication was Tomasz Trzcinski.)

The improved performance of computer hardware makes it possible to efficiently process large-scale training data. In recent years, with the rise of machine learning and deep learning algorithms, computer vision has performed well in fields such as visual recognition, speech recognition, and natural language processing. In the field of object detection, recognizing night scenes is of great significance for discovering potential objects in the environment in time. Some scholars have used excellent algorithms to reduce the negative effects of weak light on the judgment of foreground objects, and some of these models have been applied in daily life and military use. Meis et al. [1] proposed a three-part algorithm to improve the accuracy of night traffic obstacle detection and classification. It uses a passive far-infrared sensor to focus on non-luminous objects: a classifier first finds the region of interest that contains hidden targets, a region-based segmentation algorithm then re-segments the ROI obtained by the classifier, and the type of object is finally determined by a quadratic polynomial classifier. In 2012, Zhang et al. [2] proposed a night-time moving target detection algorithm based on dual-scale Approximate Median Filter background models. The algorithm first performs block-wise down-sampling on the original image to reconstruct a low-resolution image used to roughly detect the ROI of the moving object; the moving object contour is then refined using the coarsely detected ROI and the original image. Govardhan et al. [3] proposed a robust algorithm with a high detection rate and a low false-alarm rate for night pedestrian detection systems. It trains a tree-like classifier on different image subsets and sizes to cope with the large intra-class variation in pedestrian poses.
Sonu et al. [4] introduced the hot-spot and background subtraction algorithms for detecting humans in night-vision video: the former uses black-body radiation theory, while the latter uses difference images obtained from the input image and a generated background image. In 2018, Bazán Caballero and Zamudio Beltrán [5] proposed a computer vision system capable of identifying traffic panels (TP) in night scenes. During the preprocessing stage, the original image is reduced in size by 74%, cropped to its upper two thirds, and finally converted to grayscale. In the processing stage, a cascade object detector (COD) processes the output image of the final stage to identify regions of interest (ROI). In the classification stage, the algorithm uses the COD to classify each ROI as TP or non-TP, and finally highlights each ROI classified as TP with a bounding box. Recently, X. Guo proposed LIME [6], an illumination-map estimation algorithm that constructs illumination maps by finding the maximum intensity across the original image's channels. Similarly, some scholars have further proposed Robust-Retinex [7] based on Retinex theory. However, the decomposition of observed brightness is an ill-posed problem that has not yet been solved well. For such low-illumination enhancement algorithms, it is very difficult to obtain the ground truth of the corresponding objects. LLNet [8] first applied deep autoencoding to identifying signals in low-brightness images and adaptively brightening them without over-amplifying the brighter parts. Tao et al. [9] proposed a two-step strategy based on the atmospheric scattering illumination model to enhance low-light images. Wei et al. [10] proposed the Retinex-Net algorithm, which includes DecomNet for decomposition and EnhanceNet for brightness adjustment. In the last two years, Xiao et al. [11] used a specially designed feature pyramid network and a context fusion network to enhance low-illumination images and improve detection results. Li et al. [12] used HSV color-space features instead of the traditional RGB space to improve robustness to video contrast and color distortion. Zhu et al. [13] proposed a novel change detection algorithm for multi-temporal monitoring images under low illumination that fuses different images to remove noise, achieving good experimental results. The dimness of light is usually the key factor affecting our judgment of foreground objects, so how to accurately classify and learn the distribution of target subjects is our concern. The game theory behind Generative Adversarial Networks inspired our night-time object detection task. Given the difficulty of night target recognition, we propose a night object detection method based on Faster R-CNN and Deep Convolutional Generative Adversarial Networks. Traditional methods suffer, to some extent, from high cost and complicated processing. Our deep learning model uses only two main modules, which effectively reduces network complexity while handling dark, non-luminous objects better than previous work.
The convolutional neural network is a common deep learning framework inspired by the visual nervous system through which organisms perceive external things. It is characterized by local connections, weight sharing, and automatic feature extraction. Its development can be traced back to Hubel and Wiesel's study of the visual system in the cat brain [14] in 1962. In 1980, Kunihiko and Sei [15] proposed a neural network structure consisting of convolution and pooling layers. In 1990, Yann LeCun first applied the back-propagation (BP) algorithm to training such a structure, forming the prototype of the modern convolutional neural network; however, because the network was difficult to train and performed worse than SVM and Boosting algorithms on practical tasks, it once fell out of favor.
With the improvement of GPU accelerator performance and the expansion of public datasets, deeper CNNs and improved detection accuracy revealed the network's potential, once again attracting researchers' attention. In 2012, AlexNet [16] used a classic CNN structure to achieve a major performance breakthrough in image recognition. After the success of AlexNet, Ross Girshick et al. proposed the region-based convolutional neural network in 2014; it uses a selective search algorithm to generate region proposals and trains an SVM for classification. To overcome the constraints on input image size and the large resource consumption, Kaiming He et al. proposed SPP-net, which uses a spatial pyramid pooling algorithm [17]. In 2015, Ross Girshick proposed the fast region-based CNN (Fast R-CNN) [18], which uses two different fully connected layers to complete target classification and bounding box localization instead of training an SVM separately, saving a great deal of storage space. Unfortunately, Fast R-CNN still relies on the selective search algorithm to obtain fixed-size region proposals, so the system cannot be trained end-to-end and back-propagation cannot improve the region proposal extraction process. Building on Fast R-CNN, S. Ren et al. proposed Faster R-CNN [19], which uses a novel RPN to obtain region proposals instead of the previous selective search algorithm. It reduces computation through parameter sharing and reached a new height in the field of object detection. Afterwards, a large number of scholars made further improvements, such as [20]–[22]. The continuous improvement of network performance has benefited from the continuous innovation and optimization of the network structure.
In order to obtain more comprehensive information from night images, a good strategy is convolutional feature fusion, which combines rich detailed information with abstract semantic information [23].
GAN [24] is a model proposed by Goodfellow et al. in 2014; its core idea is the two-player game from game theory. Experimental results show that the network can generate sample images that look genuine. DCGAN [25] introduces the ideas of GAN into the convolutional neural network, exploiting the latter's powerful feature extraction capabilities in image processing to generate higher-quality sample images. Subsequently, the GAN family of networks received continued attention and development from researchers [26]–[29]. Recently, some scholars have studied combining DCGAN and convolutional networks for target detection [30], [31]. Low-light scenes are closely related to our daily life, yet the development of corresponding intelligent vision systems still needs further research. In this paper, we use the preprocessed night image as the input signal of the DCGAN network instead of the original 100-dimensional random noise. The system generates virtual images similar to daytime scenes and feeds them to an advanced detector to complete the actual task. This method achieves good performance in night object detection.

II. RELATED WORK

A. DCGAN ALGORITHM PRINCIPLE
DCGAN introduces convolutional networks into the structure of GAN. By optimizing hyper-parameters and network topology, it bridges the gap between supervised and unsupervised learning for CNNs. It is mainly composed of two multi-layer perceptrons: a generator and a discriminator. The former attempts to capture the potential distribution of real samples, establishing spatial pixel relationships while generating sample data. The latter is essentially a binary classifier used to determine whether the model's input comes from real samples or from samples produced by the generative model. In other words, D and G play the following two-player minimax game with value function V(G, D). During training, both players maximize their own payoff and eventually reach a Nash equilibrium, at which the generator produces samples whose authenticity the discriminator cannot judge and the value function converges to V*. The objective function is defined as

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))],

where P_data and P_G denote the true sample distribution and the model distribution, respectively. Fig. 1 illustrates the details of the deep convolutional generative adversarial networks. The generator G takes a 100-dimensional noise vector as input, projects and reshapes it into a small-scale convolutional volume, then applies four fractionally-strided convolution layers to expand the spatial structure, yielding a sample image of size 64×64×3. The structure of the discriminator can be regarded as a mirror of the generator; the difference is that the final network outputs only a single discriminant value judging the source of the data. It is worth noting that the generator and the discriminator are trained alternately: when one side's parameters are updated, the other side's remain fixed. That is, G is fixed first and the parameters of D are trained to maximize V(G, D); then, keeping the weights of D stable, G is trained to minimize max_D V(G, D) to obtain the desired generator.
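The value function above can be made concrete with a small numerical sketch (an illustrative NumPy toy of ours, not the paper's TensorFlow implementation): V(D, G) is estimated on batches of discriminator outputs, and at the Nash equilibrium, where the discriminator outputs 0.5 for every sample, it converges to V* = −2 log 2.

```python
import numpy as np

def value_function(d_real, d_fake):
    """Batch estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs on real samples, each in (0, 1)
    d_fake: discriminator outputs on generated samples, each in (0, 1)
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At equilibrium the discriminator outputs 0.5 everywhere,
# so V* = log(1/2) + log(1/2) = -2 log 2.
v_star = value_function([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```

The discriminator's training step ascends this quantity while the generator's step descends it, which is exactly the alternating scheme described above.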

B. FASTER R-CNN
Faster R-CNN is mainly composed of two modules: the first is a fully convolutional network for generating region proposals, and the second is the Fast R-CNN detector. The entire system can be trained end-to-end by back-propagation and stochastic gradient descent (SGD), and the RPN uses a popular attention mechanism to tell Fast R-CNN where the target is located. The biggest highlight of this object detection network is its method of effectively locating the target area and then indexing the corresponding features on the feature map, which greatly reduces the time spent on convolution computation. Fig. 2 shows the structural details of Faster R-CNN.
The deep convolutional network in this detection system, such as VGGNet [32] or ResNet [33], extracts the main target features from grayscale or color images. However, as the network deepens, the semantic information is enriched while the low-resolution feature map loses much detailed information. The anchor box algorithm assigns k (usually k = 9) bounding boxes to each pixel of the shared convolutional feature map output by the deep network, and a 3 × 3 sliding window performs the convolution operation from the upper-left to the lower-right corner of the convolutional layer. Two parallel 1 × 1 convolution layers then produce, respectively, the category (foreground or background) and the four coordinates of the k bounding boxes. For a bounding box with a positive (foreground) label, the predicted bounding box is obtained through a first coordinate fine-tuning; the region of interest is cropped in the ROI Align pooling layer, where bilinear interpolation extracts the ROI and normalizes it to an H×W feature map. Finally, two fully connected layers and a softmax complete the accurate classification and bounding box regression (the second fine-tuning operation, together with a non-maximum suppression algorithm; a predicted bounding box with IOU > 0.7 is assigned a positive label, and one with IOU < 0.3 a negative label). In Fig. 3, the red box A is the raw predicted bounding box output by the fully connected layer, the green box G is the ground truth (gt), and the blue box G' is the regression result we expect after the fine-tuning operation.
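The IoU-based label assignment described above can be sketched as follows (an illustrative Python fragment; the function names are ours, not identifiers from the paper's code, and boxes are assumed to be corner tuples):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(iou_with_gt):
    """Positive above 0.7, negative below 0.3, ignored in between."""
    if iou_with_gt > 0.7:
        return 1    # foreground
    if iou_with_gt < 0.3:
        return 0    # background
    return -1       # ignored during training
```

Anchors falling between the two thresholds contribute to neither the classification nor the regression loss.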

1) BOUNDING BOX REGRESSION
Bounding box regression is a key algorithm in Faster R-CNN. The system outputs the category of the foreground target and the predicted four-dimensional coordinate vector of the bounding box through two parallel fully connected layers. However, the directly generated coordinate information is inaccurate, so we need to perform bounding box correction to optimize it. The details of the algorithm are described below. Fig. 3 graphically shows the position of the boxes and our intent: we try to find a mapping that brings the predicted bounding box closer to the ground truth and reduces the localization bias. Previous work has also shown that target detection relies on bounding box regression to achieve accurate localization.
We have the window coordinate vectors A = (x_a, y_a, w_a, h_a) and G = (x, y, w, h); we now need to find a transformation F such that F(A) = G' ≈ G, with

x' = w_a t_x + x_a, \quad y' = h_a t_y + y_a, \quad w' = w_a e^{t_w}, \quad h' = h_a e^{t_h}.

The translation and scaling factors from A to G' are

t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a).

In the regression task, the translations (t_x, t_y) and the scale factors (t_w, t_h) complete the fine-tuning of the box. In the Faster R-CNN network there are two corrections of the box coordinates: the first is the fine-tuning of the anchor box in the RPN, and the second is in the classification part of the network, together with the NMS (non-maximum suppression) algorithm that retains mature bounding boxes. The formulas above also apply to the former.
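The translation and scaling factors above can be sketched in code (illustrative NumPy of ours, with boxes as center/width/height tuples); the round trip A → (t_x, t_y, t_w, t_h) → G' recovers the ground-truth box exactly:

```python
import numpy as np

def bbox_transform(anchor, gt):
    """Regression targets (t_x, t_y, t_w, t_h) from anchor A to ground truth G.

    Boxes are (center_x, center_y, width, height)."""
    xa, ya, wa, ha = anchor
    x, y, w, h = gt
    return ((x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha))

def bbox_transform_inv(anchor, t):
    """Apply predicted offsets to an anchor to obtain the refined box G'."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return (tx * wa + xa, ty * ha + ya, wa * np.exp(tw), ha * np.exp(th))
```

The log parameterization of width and height keeps the predicted scale factor positive regardless of the network's raw output.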

2) LOSS FUNCTION
The RPN uses the shared convolutional feature map as input for end-to-end training and produces high-quality region proposals while adjusting its parameters. As the parallel branch, Fast R-CNN uses the feature map and the region proposals, and back-propagates gradients based on the cross-entropy loss function to adjust its parameters. The loss of Faster R-CNN mainly includes a classification loss and a bounding box regression loss, computed as

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*),

where i is the index of a box in the mini-batch and p_i is the probability that the predicted bounding box contains a target. If the box with index i is assigned a positive label (IOU > 0.7 with the ground-truth bounding box), then p_i^* = 1; otherwise the box is assigned a negative label and p_i^* = 0. The factor p_i^* in the second term indicates that bounding box regression is only performed for positive samples. t_i and t_i^* are two independent four-dimensional coordinate vectors, where t_i = (t_x, t_y, t_w, t_h) are the parameterized coordinates of the predicted bounding box and t_i^* = (t_x^*, t_y^*, t_w^*, t_h^*) corresponds to the ground-truth bounding box. N_cls equals the network mini-batch size (for example, N_cls = 256), and N_reg is the number of anchor positions (i.e., N_reg ≈ 2400). Due to the needs of actual tasks, RPN and Fast R-CNN are usually trained alternately. The classification loss of the RPN is a binary cross-entropy loss, whereas that of Fast R-CNN is a multi-class cross-entropy loss. The adjustable parameter λ balances the classification and regression losses. L_cls and L_reg denote the classification and regression losses, respectively, defined as

L_{reg}(t_i, t_i^*) = R(t_i - t_i^*),

where R is the smooth L1 function:

R(x) = \begin{cases} 0.5\,x^2 & |x| < 1 \\ |x| - 0.5 & \text{otherwise.} \end{cases}
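The smooth L1 function and the combined loss can be sketched as follows (illustrative NumPy of ours; the value λ = 10 and the default normalizers are assumptions following common Faster R-CNN practice, since the section leaves λ unspecified):

```python
import numpy as np

def smooth_l1(x):
    """R(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5, applied element-wise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1.0, 0.5 * x * x, np.abs(x) - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """Classification + regression loss; lam balances the two terms."""
    p, p_star = np.asarray(p, float), np.asarray(p_star, float)
    eps = 1e-12  # numerical safety inside the log terms
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    # Regression term is gated by p_star: only positive samples contribute.
    l_reg = smooth_l1(np.asarray(t) - np.asarray(t_star)).sum(axis=-1)
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg
```

The quadratic region of smooth L1 keeps gradients small near zero, while the linear region caps the influence of outlier boxes compared with a plain L2 loss.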

III. DETAILS OF OUR ALGORITHM
Inspired by the game-theoretic idea behind generative adversarial networks, we propose an algorithm specifically for night-scene object detection. We use night-time color images instead of the 100-dimensional random noise of DCGAN as the network input, and daytime images of the same scene as the real data samples, so that the network learns the spatial pixel mapping between the two as well as possible. We then use the virtual samples produced by the generator as the input of the improved Faster R-CNN network and display the bounding box and target category on the night image. The experiments in Section 4 show that step-wise training converges well; when the two steps are trained simultaneously, we observe that the model parameters are hard to learn and the model easily collapses. The complete structure of our network is shown in Fig. 4. Without bells and whistles, the first step trains the generator and discriminator models unevenly and freezes the generator's weights once the model is stable; the second step uses the generator together with Faster R-CNN to complete the final training, and the discriminator of the first step no longer participates. The training details are covered in Section 4.2 and are not elaborated here.

A. DIFFERENT DCGAN
How to establish the latent relationship between night-time and daytime images is the core problem we address. Using DCGAN's image generation capability seems a good strategy for solving it. Initially, we directly flattened the original night image pixel-wise into a 1 × 1 × C tensor to replace the random noise of the traditional method. However, this approach oscillated unstably, was accompanied by a large amount of noise, and generated irregular, unrecognizable images. We believe this is the result of a forced conversion that ignores the spatial dimension. We then used only a 3 × 3 convolution kernel to exploit the spatial target features. Unfortunately, during training with this strategy, multiple images tended to collapse into the same scene, and the expected sample diversity was severely lacking. Encouragingly, however, the output images began to show virtual features and regular patterns distinguishable by human vision, indicating that the model had begun to learn features and to establish mappings between different scenes. Still, this method could not avoid the numerical problems caused by the lack of normalization; reference [25] notes the significant impact of normalization on the performance of gradient descent. After exploring the above methods, we therefore settled on a series of model architectures that train stably on our night dataset and allow training of more complex, higher-resolution generative models. As shown in Fig. 4, due to device limitations, we first preprocess the image, reducing network parameters by compressing images of size H × W × 3 to H/r × W/r × 3.
A 3 × 3 convolution with a stride of 1 generates the intermediate relationship image, and the tanh activation function maps the final output to the (−1, 1) interval. The result is then flattened and transformed into a 4 × 4 × 1024 feature map through a project-and-reshape operation similar in role to a fully connected layer. After the generator's output, part of the discriminator's convolutional layers performs feature extraction: the features of the data sample are propagated through convolutions to a specific layer f, and the resulting feature matrix is flattened and used as the classifier's input. As in the original DCGAN, we use fractionally-strided convolutions in the generator and strided convolutions in the discriminator, allowing the models to learn their own spatial up-sampling and down-sampling. To accelerate convergence and mitigate overfitting, we use Batch Normalization, a very important method in deep learning. It is mentioned in [18] that parameter initialization at the start of training is uncertain; Batch Normalization greatly alleviates the training difficulties caused by improper initialization and also helps gradients flow in deeper models. Previous work has shown that applying BN in all layers causes the generated samples to oscillate, so in this paper we apply it only in the input layer of the discriminator and the output layer of the generator. It is worth mentioning that, unlike the usual training rule, we no longer alternate training one-for-one but adjust the training ratio of the two: we train the generator x times before updating the discriminator's parameters again. This is necessary because, given the complexity of the images, the generator can easily fail to learn the spatial distribution of the data samples in time; the most intuitive symptom is that the discriminator's loss gradually approaches 0 while the generator's loss gradually increases. Subsequent experiments prove that our practice is correct.
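The uneven training schedule can be sketched as a simple loop (illustrative Python of ours; `g_per_d` generalizes the ratio of x generator updates per discriminator update):

```python
def training_schedule(num_g_steps, g_per_d=2):
    """List which network to update at each step: the generator is
    trained g_per_d times for every single discriminator update."""
    schedule = []
    for step in range(num_g_steps):
        schedule.append('G')
        if (step + 1) % g_per_d == 0:
            schedule.append('D')
    return schedule
```

With `g_per_d=2` the schedule interleaves two generator updates with one discriminator update, slowing the discriminator down so its loss does not collapse to 0 before the generator has learned the sample distribution.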

B. MULTI-SCALE CONVOLUTION FEATURE FUSION
A traditional convolutional neural network extracts ever-smaller feature maps related to the object's features through a series of convolution and pooling operations. This usually loses some image information, making it difficult to fully express the target details of a night-scene image. This is especially problematic in our case, where the DCGAN-generated data samples, which already carry some information distortion, serve as the network's training set.
To help express the idea of our algorithm, Fig. 5 illustrates the structure of the classic VGG-16 network. The network is mainly composed of 6 sub-modules. In general, some advanced object detectors, such as Faster R-CNN, use only the feature map output by Conv5_3 as the input of the subsequent network.
The network structure based on same-scale convolutional feature fusion is shown in Fig. 6. As the network deepens, detailed information gradually decreases. We use down-sampling and up-sampling to combine the features extracted by the Conv1_2, Conv3_3, and Conv5_3 layers. Specifically, Conv1_2 and Conv5_3 carry, respectively, the most comprehensive detailed information and the richest semantic information among the convolutional layers, and they complement the balanced information in the Conv3_3 layer (Section 4.3 presents a series of ablation experiments). To handle the different sizes of the feature maps produced by each convolutional layer, we take the size of Conv3_3 as the benchmark and apply down-sampling to the Conv1_2 feature map and up-sampling to the Conv5_3 feature map so that both reach the same spatial resolution as Conv3_3. We thus obtain a feature map containing three different levels of information.
A sensible step before feature map fusion is to apply local response normalization to each extracted feature map so that they have comparable activation scales. The fused feature map contains both rich detailed information and abstract semantic information, as shown in Fig. 7.
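The fusion of the three VGG-16 levels can be sketched as follows (an illustrative NumPy toy of ours; it uses max pooling and nearest-neighbour up-sampling with factor 4, since in VGG-16 Conv1_2 has 4× and Conv5_3 has 1/4 the spatial resolution of Conv3_3, and it omits the normalization step for brevity):

```python
import numpy as np

def max_pool(fmap, k):
    """k x k max pooling with stride k (spatial down-sampling)."""
    h, w, c = fmap.shape
    return fmap.reshape(h // k, k, w // k, k, c).max(axis=(1, 3))

def upsample(fmap, k):
    """Nearest-neighbour up-sampling by factor k."""
    return fmap.repeat(k, axis=0).repeat(k, axis=1)

def fuse_features(conv1_2, conv3_3, conv5_3):
    """Bring Conv1_2 down and Conv5_3 up to Conv3_3's resolution, then
    concatenate the three maps along the channel axis."""
    low = max_pool(conv1_2, 4)    # detail-rich shallow features
    high = upsample(conv5_3, 4)   # semantic-rich deep features
    return np.concatenate([low, conv3_3, high], axis=-1)
```

With VGG-16 channel counts (64, 256, 512) the fused map keeps Conv3_3's spatial size and stacks 832 channels.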

C. MULTI-SCALE ROI POOLING
After the RPN generates a fine-tuned region proposal, the ROI Align pooling layer first uses bilinear interpolation to obtain pixel values at fractional coordinates according to the four-dimensional coordinate vector (x, y, w, h), where x and y are the horizontal and vertical coordinates of the region proposal's center, and w and h are the width and height of the corresponding region; these four parameters are independent of each other. After cropping, many ROIs of different sizes are produced. Due to the constraints of the fully connected layer, max pooling is usually used to transform these ROIs into smaller feature maps of a fixed H × W (e.g., 7 × 7) size.
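The bilinear interpolation step can be sketched as follows (illustrative NumPy of ours, operating on a single-channel feature map):

```python
import numpy as np

def bilinear_sample(fmap, y, x):
    """Value of a 2-D feature map at fractional coordinates (y, x),
    as a weighted average of the four surrounding integer positions."""
    h, w = fmap.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)  # clamp at the border
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * fmap[y0, x0]
            + (1 - wy) * wx * fmap[y0, x1]
            + wy * (1 - wx) * fmap[y1, x0]
            + wy * wx * fmap[y1, x1])
```

Because the proposal coordinates are kept fractional rather than rounded, ROI Align avoids the quantization misalignment of plain ROI pooling.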
ROI pooling reduces the number of parameters but loses important features of the target object. To alleviate this problem, we use ROI pooling at several sizes; the motivation of this design is to enable the system to capture more feature information. We designed three ROI pooling grids of different sizes (5 × 5, 7 × 7, 9 × 9). After max pooling the feature map in parallel at these sizes, the results are concatenated for subsequent classification and bounding box regression.
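The parallel pooling at the three grid sizes can be sketched as follows (illustrative NumPy of ours; `adaptive_max_pool` divides the ROI into an out_h × out_w grid of bins and takes the per-bin maximum):

```python
import numpy as np

def adaptive_max_pool(roi, out_h, out_w):
    """Max-pool an H x W x C ROI down to a fixed out_h x out_w x C grid."""
    h, w, c = roi.shape
    out = np.empty((out_h, out_w, c))
    for i in range(out_h):
        # bin bounds: floor(i*h/out_h) .. ceil((i+1)*h/out_h)
        y0, y1 = (i * h) // out_h, -((-(i + 1) * h) // out_h)
        for j in range(out_w):
            x0, x1 = (j * w) // out_w, -((-(j + 1) * w) // out_w)
            out[i, j] = roi[y0:y1, x0:x1].max(axis=(0, 1))
    return out

def multi_scale_roi_pool(roi, sizes=(5, 7, 9)):
    """Pool the same ROI at several grid sizes and concatenate the
    flattened results for the fully connected layers."""
    return np.concatenate([adaptive_max_pool(roi, s, s).ravel()
                           for s in sizes])
```

For C channels the concatenated vector has (25 + 49 + 81) · C elements, which the subsequent fully connected layers consume.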

IV. EXPERIMENT AND ANALYSIS
In this paper, we perform experimental analysis on the night and daytime datasets we annotated. To demonstrate the effectiveness of our method, we primarily evaluate detection mean Average Precision (mAP), a recognized and strict performance indicator for target detection.

A. PRODUCTION OF DATA SETS
We directly use the computer's camera to capture images of daytime scenes. These images constitute our daytime dataset A, a multi-target dataset with a total of 6,833 images in 10 categories: "person," "dog," "car," "cat," "bicycle," "bus," "chair," "sofa," "boat," and "TV monitor." The limitations of the dataset would greatly restrict the model's generative ability, so to keep DCGAN from generating poor images, we selectively collected additional images from the Internet and extended the daytime dataset, enabling the model to generate higher-resolution images. This produced an enlarged daytime dataset B containing 62,511 pictures. As mentioned earlier, in order for the model to learn the approximate pixel mapping from night to daytime, we need sample data of the same scene at different times of day, which is almost impossible to obtain by shooting manually. Therefore, we manually adjust the brightness and saturation of the daytime images to simulate night scenes as closely as possible, generating our night dataset C. This is very necessary: the experiments show that the number of samples in the training set directly affects the quality of the images generated by the DCGAN model, because more samples mean richer diversity. Benefiting from DCGAN's label-free training, we can use the expanded dataset B and the corresponding brightness- and saturation-adjusted dataset C to train DCGAN specifically to generate better images; this can intuitively be regarded as a pre-training process. However, this strategy no longer suits Faster R-CNN's supervised mechanism, which requires labels for training. To reduce the time spent manually labeling objects, we chose to label dataset A instead of the large-capacity dataset B to complete the overall training.
Later experiments show that this method fully achieves the expected detection effect. In addition, to meet the needs of actual tasks, we finally test the model's performance on real night images (dataset D, which we shot directly and use only for model testing). To increase background diversity, different images undergo different brightness and saturation changes, and the images in the daytime and night datasets correspond one-to-one, as shown in Fig. 8. Although DCGAN training requires no labels, the network's parameters are extremely numerous; yet reducing the image size too much to cope with this would also hurt Faster R-CNN's detection accuracy. To balance the requirements of these two parts, we first unified the size of all collected images (day and night) to a base size of 256 × 256, and then completed the annotation work on the daytime dataset.
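The brightness and saturation adjustment that turns a daytime image into a simulated night image can be sketched as follows (illustrative NumPy of ours; the factor values are hypothetical, since the section does not state the exact ones used):

```python
import numpy as np

def simulate_night(img, brightness=0.3, saturation=0.6):
    """Darken and desaturate a daytime RGB image (H x W x 3, uint8)
    to approximate a night scene. Factors below 1 dim/desaturate."""
    x = img.astype(np.float64)
    gray = x.mean(axis=-1, keepdims=True)
    x = gray + saturation * (x - gray)   # pull colours toward gray
    x = x * brightness                   # reduce overall brightness
    return np.clip(x, 0.0, 255.0).astype(np.uint8)
```

Varying the two factors per image is one simple way to realize the "different brightness and saturation changes" used to diversify the backgrounds.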
Due to device limitations, for the DCGAN part we preprocess the image by proportional compression (for example, 128 × 128 after halving each side) before feeding it to the network. For the Faster R-CNN part, the data samples (128 × 128) generated by the previous network are enlarged proportionally back to the reference size as the input of the detection network. Faster R-CNN training requires a large amount of data and labels; however, data labeling is very time-consuming and tedious, which is the main reason for limiting the size of our dataset A. A lack of data samples increases the probability of overfitting, causing the model to simply memorize the training set. To enhance the model's generalization ability, we expanded the original dataset by flipping, bringing the final dataset A to 13,666 images. To make full use of our data, we use 60% of the samples (8,199 images) for training, 30% (4,100 images) for validation, and the remaining 10% (1,367 images) for testing; the role of each sample is selected randomly by the computer.
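The random 60/30/10 split can be sketched as follows (illustrative Python of ours; the exact rounding that reproduces the 8,199/4,100/1,367 counts is an assumption on our part):

```python
import random

def split_dataset(samples, seed=0):
    """Randomly split samples into 60% train, 30% validation, 10% test."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)   # fixed seed for reproducibility
    n = len(samples)
    n_train = int(0.6 * n)                 # 13,666 -> 8,199
    n_val = round(0.3 * n)                 # 13,666 -> 4,100
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]       # remaining 10% -> 1,367
    return train, val, test
```

Shuffling before slicing realizes the requirement that each sample's role is assigned randomly.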

B. MODEL TRAINING
Traditionally, target detection in night scenes relies on image-enhancement strategies, but their effect is very limited. We therefore adopt a mechanism that breaks this limitation by reconstructing the features: the DCGAN algorithm, with its generative-adversarial game between generator and discriminator. We build our overall detection model in the Python programming language under the TensorFlow deep-learning framework. The basic DCGAN program used in this paper is open source and can be obtained from this link: https://github.com/carpedm20/DCGAN-tensorflow. Only a small amount of program modification is needed: we switch the random-noise input port of DCGAN to the night image we need, and then a convolution operation and a normalization strategy are used to build the new input while preserving the gradient flow. In addition, we keep the file names of the night and day images fed into DCGAN consistent to guarantee the correct relationship mapping during batch training. In the experiment, model training is divided into DCGAN training and Faster R-CNN training; the parameter details are shown below.
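The modified input path (replace the noise vector z with a convolved and normalized night image) can be illustrated with a minimal NumPy sketch. This is our simplification, not the paper's TensorFlow graph: the kernel is random, the convolution is single-channel with stride 2, and the normalization is a batch-norm-style whitening.

```python
import numpy as np

def night_image_to_latent(img, kernel, eps=1e-5):
    """Simplified stand-in for the modified DCGAN input path: a strided
    convolution followed by normalization turns a night image into a
    well-scaled tensor that replaces the random noise vector z.
    (Illustrative only; the real network is a TensorFlow graph.)"""
    h, w = img.shape
    k = kernel.shape[0]
    stride = 2
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel)
    # batch-norm-style normalization keeps the activations well scaled
    out = (out - out.mean()) / np.sqrt(out.var() + eps)
    return out.ravel()

rng = np.random.default_rng(0)
z_like = night_image_to_latent(rng.random((8, 8)), rng.standard_normal((3, 3)))
```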
First, for the DCGAN model, in order to establish a latent spatial transformation relationship, we use the daytime data set B as the real-sample input of the discriminator, and the matching nighttime data set as the generator input. The network is trained with mini-batch stochastic gradient descent (SGD) at a learning rate of 0.0002, where the mini-batch size is set a priori to 128. A normal distribution with zero mean and standard deviation 0.02 is used to initialize all layer weights. As mentioned in [25], applying the Batch Normalization (BN) algorithm to all layers of the network leads to sample oscillation and model instability, so we apply BN to all layers except the output layer of the generator and the input layer of the discriminator. The input and output layers of the generator use the tanh activation function, and the remaining generator layers use ReLU; all layers of the discriminator use the leaky ReLU activation function, with the slope of the leak set to 0.2. In addition, the system uses the Adam optimizer with tuned hyper-parameters and momentum β1 = 0.5. We train all the images for 75 epochs with an uneven schedule instead of the traditional alternating training, setting the generator-to-discriminator update ratio to 2:1. In other words, the discriminator updates its parameters only once after the generator has been trained twice. This trains the generator model more fully and enables it to generate high-resolution images.
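The uneven 2:1 update schedule can be sketched in a few lines of Python; the loop below only records which network would be stepped (the real optimizer updates are elided), so it serves purely to pin down the ratio.

```python
def train_schedule(n_steps):
    """Sketch of the uneven update schedule described in the text: the
    generator is updated twice for every single discriminator update
    (a 2:1 ratio). 'G' and 'D' stand in for the real optimizer steps."""
    log = []
    for _ in range(n_steps):
        log.append('G')   # first generator update
        log.append('G')   # second generator update
        log.append('D')   # one discriminator update
    return log

log = train_schedule(100)
```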
After this first training step, we freeze the generator weights and pair the DCGAN generator model with the Faster R-CNN detector for the second training step; the discriminator no longer participates in this process. The RPN of Faster R-CNN can be trained end-to-end through back-propagation and stochastic gradient descent (SGD). We use VGG-16 pre-trained on ImageNet classification as the deep convolutional feature-extraction network, and randomly initialize all other new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. We use a learning rate of 0.001 for the first 30k mini-batches and 0.0001 for the next 10k mini-batches on our benchmark nighttime data set, where the mini-batch size is 128; we use a momentum of 0.9 and a weight decay of 0.0005. Dropout with a ratio of 0.5 is applied in the fully connected layers to prevent overfitting, and the model is trained with GPU acceleration. Because the night and day images of a scene contain the same target information, the labels produced for one scene can be applied to the other, which is convenient for the recognition task here. The system extracts the feature information of the virtual image generated by the generator, and draws each target's bounding box and category on the night image.
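The step learning-rate schedule for the detector stage can be written down directly; the function name is ours, but the values (0.001 for 30k mini-batches, then 0.0001 for 10k) are from the text.

```python
def faster_rcnn_lr(step):
    """Step learning-rate schedule for the Faster R-CNN stage:
    0.001 for the first 30k mini-batches, then 0.0001 for the next 10k.
    (Hypothetical helper name; the values come from the training setup.)"""
    return 0.001 if step < 30_000 else 0.0001
```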

C. EXPERIMENTAL RESULTS AND EVALUATION
In this paper, to combine the DCGAN model smoothly with the improved Faster R-CNN model, we study how to fully establish the feature-distribution relationship between the night scene image and the daytime scene image. The key to the problem is to feed the image into the network through its pixel encoding and realize back-propagation of the gradient flow. After a series of attempts, a 3 × 3 convolution kernel is used for the initial feature extraction, and the tanh activation function maps the data into the interval (−1, 1). After flattening, the result is a directed vector approximately equivalent to the original noise. With this design the model tends to be stable, and the samples produced by the generator are of good quality with distinctive object characteristics, as shown in Fig. 9. As can be seen from the randomly presented virtual images, the generator model produces multi-channel images with clear outlines that are easy to identify. We use them as intermediate images and, exploiting the invariance of the object labels, train the subsequent detection network, finally displaying the bounding boxes and specific category information on the night-scene image, which is very meaningful.
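The tanh-and-flatten step above is easy to make concrete; the sketch below applies tanh to an arbitrary feature map and flattens it into a noise-like vector. The function name is our invention, and a random array stands in for the real 3 × 3 convolution output.

```python
import numpy as np

def encode_to_noise_like(feature_map):
    """tanh maps the extracted features into (-1, 1), and flattening yields
    a directed vector that plays the role of the original noise input.
    (Illustrative sketch; the real feature map comes from a 3x3 conv.)"""
    squashed = np.tanh(feature_map)   # every value strictly in (-1, 1)
    return squashed.ravel()           # flatten to a noise-like vector

vec = encode_to_noise_like(np.random.default_rng(1).standard_normal((16, 16)))
```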
It is worth mentioning that this latent relationship between night and day images can be described as a spatial mapping [25], and establishing this relationship is DCGAN's task. In this paper, for convenience, we set the night-image sample set fed to the generator to ψ = {x_1, x_2, ..., x_n} and the day-image sample set fed to the discriminator to φ = {y_1, y_2, ..., y_n}, where n is the number of samples in each set and x_i corresponds to y_i. The key to our solution is to find an intermediate mapping set f = {f_1, f_2, ..., f_n} satisfying f(ψ) ≈ φ, where each sub-mapping in f is obtained by training; they can essentially be regarded as nonlinear combinations of network weights. Specifically, different night images x_1 and x_3 have different characteristics, so when combined with well-trained weight parameters they yield different mapping results f_1 and f_3, giving f_1(x_1) ≈ y_1 and f_3(x_3) ≈ y_3. The mapping f can be divided into two phases. In the training phase, sample iteration makes the virtual image generated from a night image tend toward the characteristics of the corresponding day image, and the weighted combination of network parameters forms the bridge that establishes the relationship between the two. When the loss between the virtual image generated by the algorithm and the day image (the standard, or final target) falls within an acceptable range, the weight parameters are retained because they encode good feature maps. In the testing phase, we directly use the generator as the front-end network, and feed its generated virtual image, rather than the night-scene image, to the subsequent detection network as the input. The process by which DCGAN gradually establishes the latent relationship (mapping) between night and day images is visualized in Fig. 10.
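A toy version of f(ψ) ≈ φ can be demonstrated numerically. The sketch below fits a single linear map W from synthetic "night" feature vectors to "day" feature vectors by least squares; the real f is a nonlinear combination of generator weights, so this only illustrates the train-then-reuse idea, not the actual model.

```python
import numpy as np

# Toy illustration of the mapping set f with f(psi) ≈ phi: fit a linear
# map W from synthetic "night" features to "day" features.
rng = np.random.default_rng(0)
psi = rng.standard_normal((100, 8))            # night features x_i
W_true = rng.standard_normal((8, 8))
phi = psi @ W_true                             # day features y_i

# Training phase: learn the mapping from paired samples.
W, *_ = np.linalg.lstsq(psi, phi, rcond=None)

# Testing phase: reuse the retained weights; here f(psi) ≈ phi holds
# exactly because the toy relationship is linear.
recon = psi @ W
```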
To demonstrate the effectiveness of our method, we compared the detection results of different algorithms on the night data set. Fig. 11 shows the curves of the two loss functions when Faster R-CNN and our method are trained for 40,000 iterations on the same nighttime data set. The trend graphs show that both methods train stably and gradually converge on our nighttime data set; the difference is that our model performs better on the object-classification and bounding-box regression tasks. In particular, we plot the loss over the last 5,000 iterations of the system to make the oscillation and convergence details of the curves easier to observe, as shown in Fig. 12.
The experiments show the effectiveness of our method on the bounding-box positioning and target-classification tasks. In the last 5,000 iterations of the system, our method reduced the loss of both tasks to below 0.7, while the original method still fluctuated widely, reaching around 0.9 at its maximum (see Fig. 12). Next, we evaluated the mean average precision (mAP) of the different methods to make the comparison more convincing. Table 1 shows the detailed results of the ablation experiment: our method reaches an mAP of 82.6%, higher than the 80.4% of the original Faster R-CNN (7 × 7). The line chart in Fig. 13 visually compares the per-category average precision of the different network models, and Fig. 14 presents the comparison of mAP values. Overall, the model is competitive and has the potential to handle more kinds of practical target-detection tasks.
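For readers unfamiliar with the metric, the per-class average precision behind mAP is commonly computed with 11-point interpolation in the PASCAL VOC style; the sketch below shows that convention (the paper does not state which exact AP protocol it uses, so treat this as one standard possibility). The mAP is then the mean of these per-class AP values.

```python
def average_precision(recalls, precisions):
    """11-point interpolated average precision (PASCAL VOC style), a common
    way the per-class AP behind mAP is computed. Sketch only; not
    necessarily the exact protocol used in the paper's evaluation."""
    ap = 0.0
    for t in [i / 10 for i in range(11)]:     # recall thresholds 0.0 .. 1.0
        # interpolated precision: max precision over all recalls >= t
        candidates = [p for r, p in zip(recalls, precisions) if r >= t]
        ap += max(candidates) if candidates else 0.0
    return ap / 11

ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])
```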

D. RESEARCH ON FUSION OF DIFFERENT CONVOLUTIONAL LAYERS
In this part, we focus on how the fusion of different combinations of convolutional layers in VGG-16 affects detection accuracy. The core of the feature-fusion algorithm is to combine the complete detail information in the low layers with the rich semantic information in the high layers, achieving a compensation effect against the non-robust information loss caused by the pooling operation. Table 2 shows the detailed experimental results for fusing each combination of three convolutional levels in VGG-16. The feature map output by Conv1_2 retains the most complete details because it has not gone through the parameter reduction of any pooling operation. At the same time, the feature map output by Conv5_3 condenses the adjustments of many preceding convolutional layers and therefore contains the most abstract semantic information. Conv3_3, as an intermediate feature between the two, also needs to be considered. The experimental details confirm the advantage of this fusion algorithm.
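One simple way to realize this fusion, sketched below under our own assumptions (average pooling to a common grid, then channel concatenation; the paper does not specify its exact fusion operator), is to bring the three VGG-16 feature maps to the spatial size of Conv5_3 and stack them along the channel axis. The shapes used correspond to a 256 × 256 input.

```python
import numpy as np

def pool_to(fmap, target):
    """Average-pool an (H, W, C) feature map down to (target, target, C)."""
    h, _, c = fmap.shape
    f = h // target
    return fmap.reshape(target, f, target, f, c).mean(axis=(1, 3))

def fuse(conv1_2, conv3_3, conv5_3):
    """Sketch of the fusion idea: bring the three feature maps to a common
    spatial size and concatenate along channels, so low-level detail and
    high-level semantics coexist in one tensor. (Our assumed operator.)"""
    target = conv5_3.shape[0]
    parts = [pool_to(conv1_2, target), pool_to(conv3_3, target), conv5_3]
    return np.concatenate(parts, axis=-1)

# VGG-16 feature-map shapes for a 256x256 input (64, 256, 512 channels)
fused = fuse(np.zeros((256, 256, 64)),
             np.zeros((64, 64, 256)),
             np.zeros((16, 16, 512)))
```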

E. COMPARISON OF DIFFERENT ALGORITHMS
To verify the detection performance of the proposed DCGAN-Faster R-CNN model, other state-of-the-art detection methods and image-enhancement strategies are used for evaluation and comparison: R-CNN, Fast R-CNN, Faster R-CNN, SSD, LIME, Retinex-Net, Robust-Retinex, and the methods of Xiao et al., Li et al., and Zhu et al. Without bells and whistles, we use the traditional methods LIME, Retinex-Net, and Robust-Retinex to enhance the low-illumination night data set, obtaining three enhanced data sets. For fairness, we still use Faster R-CNN as the base network to run detection on those enhanced data sets, which isolates the effect of each enhancement algorithm. Table 3 shows the experimental results of the different methods on the night data set. Our method reaches an mAP of 82.6%, improving on Retinex-Net, LIME, and Robust-Retinex by 1.1%, 0.7%, and 1.2%, respectively. Compared with the other deep-learning object-detection algorithms, the model's recognition advantage is even more significant. It can also be seen from Table 3 that for R-CNN, applying the detector directly without DCGAN's feature-space transformation costs 1.2% in mAP; for Fast R-CNN, this value is 1.7%. This means DCGAN successfully reconstructs the daytime scene from night features, producing images with distinctive features that are more easily distinguished than the night images. In addition, because we combine two modules to solve the actual night-scene problem, the number of network parameters increases; interestingly, our method adds only 0.04 s of time consumption compared with using the Faster R-CNN detector alone (0.31 s vs. 0.27 s).
This is because in the testing phase we apply only part of the DCGAN model (the generator) and establish the image mapping between the two scenes directly from the mature weight parameters, rather than running the entire network. Compared with the popular low-light detection algorithms recently announced by Xiao et al., Li et al., and Zhu et al., our algorithm improves accuracy to a certain extent, and its speed is also competitive. Our method takes almost the same execution time as the method of Xiao et al., yet gains 1.6% in accuracy (82.6% vs. 81.0%); compared with the algorithm of Zhu et al., it gains only 0.1% in mAP (82.6% vs. 82.5%) but shows a significant speed advantage. In summary, we believe the algorithm in this paper is more effective for recognition tasks in night environments. Unlike traditional night-image enhancement methods, which essentially operate on the image itself (for example, on features, brightness, saturation, and noise), DCGAN applies a spatial mapping on the features to directly establish the connection between night and daytime scenes. This method of reconstructing virtual features breaks the limitation of previous work, which could only enhance the image itself to a very limited degree; this is also the biggest difference between DCGAN and the existing techniques. A series of experiments shows that DCGAN has better enhancement advantages in image processing. We hope the research in this paper can be helpful to future engineering development.

F. RESEARCH ON DIFFERENT BACKBONES
We also present research results on other backbones to demonstrate the flexibility of the core ideas of this paper. As shown in Fig. 15, ResNet-18, ResNet-50, ResNet-101, and Inception-v3 are additionally used. Combining DCGAN with these different backbones improves the performance of each original network (see Fig. 15). For ResNet-50, adding DCGAN raises the mAP by 0.9% (85.2% vs. 86.1%), while for ResNet-101 the improvement is 1.1% (86.8% vs. 87.9%). Furthermore, introducing DCGAN also yields a 1.2% improvement under the Inception-v3 backbone, which has the highest mAP. This shows that the model can find, for an input night image, a feature map similar to the daytime scene and thereby improve the quality of the initial image. As a supplement, Table 4 shows the detection details of the algorithm for different object categories. We hope this provides a positive reference for the work of future researchers.

G. MODEL TESTING OF ACTUAL TASKS
The test task is carried out on night images taken in daily life. Fig. 16 shows the overall detection results of the algorithm in this paper compared with the original Faster R-CNN. From these renderings it can be seen that our method localizes and recognizes targets in the night background better and with higher detection accuracy. Our model successfully mines the hidden object information in the dark environment to complete the task of night target recognition in daily life, which is very practical.

V. CONCLUSION
Based on DCGAN and Faster R-CNN, we propose a detection algorithm that efficiently identifies targets in night scenes, and we evaluate the performance of our model in a real night environment. First, we use the specified nighttime and daytime data sets as network input and exploit the powerful generative capability of the generator model to establish a unidirectional spatial distribution relationship. Next, we use the generated virtual samples, rather than the night images themselves, as the training input of Faster R-CNN. So that the network captures as much of the image's feature information as possible, a multi-scale feature-fusion strategy is used in the feature-extraction network, combining the more comprehensive details in the low convolutional layers with the rich semantic information in the higher-level convolutional layers. In addition, a multi-scale pooling strategy reduces the losses caused by the traditional ROI Pooling operation. Experimental results show that our method significantly improves accuracy in classification and target positioning. The ideas proposed in this paper can also be flexibly extended to other fields, such as semantic segmentation, human pose estimation, and gesture recognition. This work contributes to practical engineering applications, and in future work we expect to develop more rigorous theory and contribute further to deep-learning research.
KUN WANG received the bachelor's degree in automation, the master's degree in control theory and control engineering, and the Ph.D. degree in pattern recognition and intelligent systems from Northeastern University, in 2000, 2005, and 2008, respectively. She was a Visiting Scholar with Purdue University, USA, in 2011. She is currently an Associate Professor and a Master Tutor with the Civil Aviation University of China. Her research interests include pattern recognition and intelligent systems, and fault detection and analysis.
MAO ZHEN LIU was born in Jining, Shandong, China, in 1995. He received the bachelor's degree from Dezhou University. He is currently pursuing the master's degree with the Civil Aviation University of China. His research interests include pattern recognition and object detection.