Global Context Instructive Network for Extreme Crowd Counting

Crowd counting has gained popularity due to its wide applications, such as intelligent security and urban planning. However, scale variation and perspective distortion make it a challenging task. Most existing works focus on multi-scale feature extraction to address these challenges. In this paper, we propose a novel Global Context Instructive Network (GCINet), which is devoted to making full use of extracted features and obtaining precise counts. The main contributions are fourfold. First, we construct a three-column Feature Processor to generate features at different scales. Second, an Instructive Module is proposed to introduce global context, which is the basis for generating adaptive features. Guided by global context, the three-column Feature Processor becomes an adaptive feature generator. Third, a novel loss function integrating Euclidean distance and spatial correlation is proposed to enhance the spatial correlation and consistency between pixels. We no longer regard a pixel as an independent point in the calculation, but consider its neighborhood to achieve complementary effects. Finally, we conduct experiments on the ShanghaiTech, UCF_CC_50, UCF_QNRF and UCSD datasets, which show that our approach achieves state-of-the-art performance.


I. INTRODUCTION
With the development of urbanization, human social activities have become increasingly frequent, which raises the probability of congestion. When crowd density exceeds the safety limit, the chance of a stampede increases greatly. Congestion can also cause other potential hazards, such as traffic jams, which pose a great threat to social security. Crowd counting thus becomes ever more important. However, massive challenges exist in crowd analysis, such as occlusion, high aggregation, uneven distribution of people, complex backgrounds, scale variation and perspective distortion.
The study of crowd counting and density estimation plays a critical role in the maintenance of social security, and relevant advanced methods have been proposed by researchers. Early crowd counting methods are mainly based on detection [1]-[3] and regression [4]-[6]. However, both kinds of methods fail in dense conditions. Recently, methods based on density estimation [7]-[10] have become a new research hotspot. By deploying a density estimation-based method, we can obtain the location of each person by pixel-wise regression and obtain the crowd count through integration.
Hand-crafted representations were often used in early density estimation-based methods, but these methods usually failed in extremely dense crowds. Attracted by their strong feature extraction ability, researchers have introduced Convolutional Neural Networks (CNNs) into density estimation methods for crowd counting [10]-[16], and the effectiveness of CNN representations for density estimation has been demonstrated. Although CNN-based density map estimation is superior to traditional methods, it turns out that most methods underestimate the real counts in extremely dense scenes with large scale variation. To overcome the problem shown in Fig. 1, various approaches [16]-[18] have been proposed. These methods have made improvements by deploying multi-column network structures. The features learned by a multi-column CNN are adapted to human heads of different sizes in the same scene. This architecture has achieved better performance than single-column networks, which shows that multi-scale features play an important role in crowd counting. However, two issues remain in CNN density estimation-based crowd counting methods. Most works try to address the scale variation problem and focus on extracting multi-scale features. The first issue is that they ignore decoding high-quality density maps from multi-scale features adaptively. The other is that most methods treat pixels as isolated by deploying Euclidean distance as the loss function. Euclidean distance plays an important role in crowd counting by measuring the difference between the output density map and the ground truth at the pixel level. However, it assumes pixels are independent from each other [20]. For example, a head is an area consisting of several pixels instead of a single pixel. Thus, the neighborhood of a pixel should be considered when determining the position of the head.
To mitigate the above problems in current crowd counting networks and obtain better performance, an instructive network named Global Context Instructive Network (GCINet) is proposed in this paper. The overall architecture is shown in Fig. 2. Different from previous works, we design an instructive network for the density map decoder: we not only focus on mining multi-scale features, but also dedicate ourselves to finding out how to obtain more accurate density maps from the extracted features, that is, to make the features adaptive.
In the initial feature extraction, we deploy the first 13 layers of pre-trained VGG16 [19] as the feature extractor. Two models are commonly used to extract features, VGG16 and ResNet [20]. We verify both by experiments and find that the backbone has an important impact on the experimental results; finally, we select VGG as the backbone for initial feature extraction. The density map decoder consists of three parts: the three-column Feature Processor, the Instructive Module, and the Merging Module. The three-column Feature Processor is composed of three parallel columns with different filter sizes. The Instructive Module is designed to introduce global context guidance information. For the first issue, we deploy the three-column Feature Processor to generate high-level features under the instruction of global context, then we use the Merging Module to obtain the final density map. For the second issue, we propose a novel distance measurement, which integrates Euclidean distance and a spatial correlation loss. Euclidean distance measures the difference between the output and the ground truth pixel-wise, while the spatial correlation loss takes advantage of spatial information.

FIGURE 2. The architecture of GCINet. Images are sent to the feature extractor (VGG) for extracting initial features. Initial features are sent to the three-column Feature Processor and the Instructive Module for calculating multi-scale features and instructive weights. Under the guidance of the Instructive Module, the three-column Feature Processor generates adaptive features. The final density map is generated by the Merging Module. The four blocks are jointly learned on the training data.
Our model takes the whole image as input, generates high-level features under the guidance of global context information, and finally produces a crowd density map. The final density map has a resolution of one-sixteenth of the original image. Experiments have shown that our method is superior to existing methods in counting accuracy. This paper makes the following main contributions. 1) We propose a three-column Feature Processor composed of a three-column convolutional neural network, which allows the network to extract multi-scale features from the same image. 2) We establish an additional column named the Instructive Module to exploit low-level features for capturing global context. Under the instruction of global context information, the network generates adaptive features. 3) We propose a novel distance measurement, which integrates Euclidean distance and a spatial correlation loss (SCL). The former guarantees pixel-wise accuracy, while the latter preserves spatial correlation. 4) We conduct experiments to verify the performance of the proposed network. Our model is proved to be superior to state-of-the-art methods on the ShanghaiTech, UCF_CC_50, UCF_QNRF and UCSD datasets.

II. RELATED WORK
A variety of methods have been proposed to solve crowd counting problems in images and videos. In this section, we give a brief review of these methods, which are roughly classified into traditional approaches and CNN-based methods [21].
A. TRADITIONAL APPROACHES
In early research, the locations and counts of crowds were mostly determined by detection-based methods, which use a sliding window to detect the location of people in the scene and count the targets [1]-[3]. Detection-based methods have achieved a certain degree of success in sparse cases. However, due to problems such as occlusion and complex backgrounds in dense cases, these methods perform poorly. Part-based detection methods [22], [23], which detect parts of the body, have been adopted to address this. However, the effect is still not satisfactory in extremely dense crowd scenes.
Regression-based methods [4]-[6] remove the computationally complex detector. However, these methods cannot obtain spatial location information; their output is only a predicted crowd count.
Lempitsky and Zisserman [7] introduced a method that learns a linear mapping to an estimated density map, which gives the count of each region by integration. Pham et al. [8] introduced an approach to learn a non-linear mapping due to the intractability of learning a linear mapping.
With the efforts of the researchers, the task of crowd counting has made great progress. However, hand-crafted representations [24], [25] which are widely used in early approaches are not sufficient for the task.

B. CNN-BASED METHODS
In recent years, more works have been based on CNNs, and the features extracted by these methods [19] have been verified to bring great improvements in multiple fields. More researchers have turned to CNN-based approaches [11]-[15], [27]-[30], and a variety of methods have been proposed. First, high-dimensional feature information is extracted by a powerful feature extractor. Then, the crowd density distribution map is predicted by analyzing this high-dimensional feature information. Finally, the crowd density distribution map is integrated to obtain the crowd count.
Wang et al. [27] applied CNNs to crowd counting by adopting the AlexNet network [26]. To overcome the issue that performance drops when switching scenarios, Zhang et al. [28] proposed a network trained with two objective functions, namely crowd counting and density estimation. When applying this network to another scene, the network is fed with similar images.
In [29], Shang et al. proposed an end-to-end network which reduces complexity compared with the above-mentioned patch-based methods. Boominathan et al. [30] proposed a combination of deep and shallow fully convolutional networks, leading to robustness to crowd scale and perspective variation. MCNN [11] adopted three columns with receptive fields at different scales. Compared with MCNN [11], Sam et al. [10] proposed Switching-CNN, which trains a switch layer to select the optimal column for accurate density estimation.
More recently, CSRNet [12] deployed a convolutional neural network as the feature extractor and then used a dilated convolutional network as the decoder to obtain the density map. CAN [13] concatenated features with multiple receptive field sizes and calculated the importance of each location.
Through research on crowd counting methods, especially state-of-the-art ones, we find that most works focus on extracting multi-scale features by skip connections [32] and multi-column structures. As discussed in Section I, multi-scale features promote network performance. However, the diversity of multi-scale features seems to be insufficient for existing networks in scenes with large scale variation. These works do not exploit the extracted multi-scale information adaptively, so the potential adaptability of such networks is not fully realized. Recently, some researchers [33]-[35] have tried to resolve this problem and have begun to explore the generation of adaptive features, but the improvement is limited. Besides, the pixel-wise Euclidean loss widely used in crowd counting networks regards each pixel as isolated. We argue that the prediction of each pixel should also depend on its neighborhood.
ASD [14] contributed to adaptability by proposing an adaptive scenario discovery framework to weigh two different density maps generated by a sparse column and a dense column. PaDNet [31] committed to making full use of extracted features by capturing global and local features.
By researching on the existing methods, we propose Global Context Instructive Network (GCINet) which places emphasis on decoding extracted features in an adaptive way.
In terms of adaptability, we assign weights to high-level features by introducing global context, and the density map is then determined from the adaptive features. ASD [14] directly weights the sparse and dense density maps and outputs the final density map. In contrast, our approach makes the network more adaptive.
In [31], global and local features are extracted from multi-scale features, whereas we operate on the initial features generated by VGG16 to obtain the global context. PaDNet refines the density map, while we decode adaptive features directly. Besides, we also integrate a spatial correlation loss (SCL) to take neighborhood information into account.
For extreme scale variations, many methods focus on generating more multi-scale features rather than on how to exploit them; others refine the density map. To this end, we design GCINet to introduce global context, which generates adaptive features and directly outputs accurate density maps. At the same time, because a head is a region, we also introduce the SCL to make use of neighborhood information.
Extensive experiments show that our method achieves good results under extreme scale variations, and also achieves satisfactory results in sparse scenes.

III. METHODOLOGY
In this paper, we are committed to making full use of the extracted features to obtain accurate counts. As shown in Fig. 2, our network is mainly composed of two parts: a backbone feature extractor and a density map decoder. Density map decoder consists of three modules: three-column Feature Processor, Instructive Module and Merging Module. In this section, we will introduce the proposed architecture and training details.

A. BACKBONE
VGG and ResNet are the most widely used network structures for feature extraction at present. The common depth of VGG is 16, while the common depths of ResNet are 18, 50 and 101. Various methods based on these backbones adopt new optimization schemes and multi-model fusion, and have made improvements. We compare the two typical models, each with a pre-trained strategy, on the ShanghaiTech Part A dataset to determine which is more effective for feature extraction in our method. The verification results are shown in Fig. 3. CNNs extract rich features; as the network deepens, we obtain more abstract information but lose detailed information. In fact, deeper networks do not always lead to better performance, and we need to adopt specific strategies for different tasks. Fig. 3 shows that, when ResNet is selected as the backbone, performance improves as the depth increases, while the number of parameters increases sharply. However, even the best ResNet result is inferior to that of VGG16. In the task of crowd counting, deeper networks lose detail information, and VGG16 with a depth of only 16 layers performs better than the others.
Spatial information and detail information are required in the crowd counting task. However, a deeper network leads to more semantic information and less detail information. At the same time, the residual block in ResNet makes the feature maps shuttle-shaped, that is, the number of feature maps decreases first and then increases. In the process of reducing the number of feature maps, a lot of information is lost. Therefore, VGG16 is selected as the feature extraction module in this task. RGB images as input are converted into a series of feature maps by the first 13 layers of VGG16 pretrained on ImageNet.

B. THREE-COLUMN FEATURE PROCESSOR
The low-level features extracted by the backbone VGG16 need to be decoded according to the characteristics of the scene. Hence, we design a three-column convolutional neural network named the three-column Feature Processor to decode low-level features according to the characteristics of different scenes. The scale variation of heads requires sufficiently diverse features to be represented. For dense crowds, more detailed information is needed, while for relatively sparse targets, more semantic information is needed. The three-column Feature Processor is designed to capture both kinds of information.
As illustrated in Fig. 2, our three-column Feature Processor consists of three convolutional networks. The three columns are identical in structure but have different convolution kernel sizes. The first column is designed for decoding large heads; its four convolution layers have large receptive fields. The second column is designed for scenes with smaller heads; to decode small heads, more detail information is needed, so we adopt convolution layers with a kernel size of 3 × 3. The third column is a compromise between the large-head and small-head cases. A convolutional kernel size of 1 × 1 is used to reduce the number of channels, and each convolutional layer is followed by a ReLU activation function and Batch Normalization.
Overall, we use three columns with different filter sizes to construct feature maps corresponding to the scale variation, because a larger receptive field provides more global, high-level semantic information, while a smaller receptive field provides more local, detailed information. To obtain more comprehensive information, we decode features at multiple scales.
The three-column Feature Processor makes the network structure somewhat adaptive, but this adaptability is limited: it simply generates fixed multi-scale features for each image without adjusting to the characteristics of different images. To make the network truly adaptive, we design an Instructive Module.
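A minimal PyTorch sketch of the three-column Feature Processor follows. The kernel sizes (7, 5, 3) and channel counts are illustrative assumptions, not the paper's exact configuration; what the sketch shows is the shared per-column structure: a 1 × 1 channel-reduction layer followed by four same-padded convolutions, each with Batch Normalization and ReLU, with the three column outputs concatenated.

```python
import torch
import torch.nn as nn

def column(in_ch, out_ch, k):
    # One column: 1x1 channel reduction, then four same-padded
    # convolutions with kernel size k, each followed by BN and ReLU.
    layers = [nn.Conv2d(in_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    for _ in range(4):
        layers += [nn.Conv2d(out_ch, out_ch, k, padding=k // 2),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class ThreeColumnProcessor(nn.Module):
    def __init__(self, in_ch=512, out_ch=128):
        super().__init__()
        # Kernel sizes are illustrative: large heads, compromise, small heads.
        self.large = column(in_ch, out_ch, 7)
        self.mid = column(in_ch, out_ch, 5)
        self.small = column(in_ch, out_ch, 3)

    def forward(self, x):
        # Concatenate the three scale-specific feature sets channel-wise.
        return torch.cat([self.large(x), self.mid(x), self.small(x)], dim=1)
```

Because every convolution is same-padded, the spatial resolution of the backbone features is preserved; only the channel dimension changes.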

C. INSTRUCTIVE MODULE
In this subsection, we introduce the global context, which is utilized to guide low-level features in generating adaptive high-level features.
We regard acquiring the locations of human heads as a fine-grained semantic segmentation task in which the location of each head is a class. Therefore, density estimation-based methods also face the challenge of intra-class consistency, which is mainly caused by a lack of context information. We therefore utilize Global Average Pooling (GAP) to introduce the global context. However, the global context only carries high-level semantic information, which does not help to recover spatial information [36]. Therefore, we also need multi-scale receptive fields and context to refine the spatial information, which complements our three-column Feature Processor.
Global average pooling was first used for semantic segmentation in ParseNet [37]. The work in [38] also employs this strategy and shows that the global context has a significant impact on network performance. Such attention helps the network focus on what we value.
The above works use up-sampling and concatenation. Our method abandons this strategy: we directly multiply the high-level features by the global context features. Overall, we deploy the three-column Feature Processor to obtain high-level features. The three-column Feature Processor considers more local information when decoding low-level features, but lacks global context. We therefore obtain global context information from the low-level features and multiply the high-level features by the global features generated by the Instructive Module.
The final output is a series of feature maps rescored by the global context, and the calculation process can be expressed as the following equations:

s_k = C_1(x) = Σ_{i=1}^{M} w_{k,i} x_i + b_k,   k ∈ {1, 2, . . . , K}   (1)

δ_k = Sigmoid(s_k) = 1 / (1 + e^{−s_k})   (2)

Y_k = δ_k · F_k   (3)

In (1), x represents the low-level feature maps processed by Global Average Pooling, x_i is the i-th feature map of x, and M is the number of input feature map channels. C_1 is a convolution with 1 × 1 kernel size, with weights w_{k,i} and bias b_k, and K is the number of output feature channels. In (2), δ_k is the probability corresponding to the k-th channel. In (3), Y represents the final output, where F_k denotes the k-th high-level feature map from the three-column Feature Processor. These three formulas represent Global Average Pooling followed by convolution layers, the Sigmoid activation function, and multiplication, respectively.
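The GAP → 1 × 1 convolution → Sigmoid → multiplication pipeline of (1)-(3) can be sketched in PyTorch as follows; the channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InstructiveModule(nn.Module):
    # Produces per-channel instructive weights from low-level features
    # and rescales the high-level features channel-wise.
    def __init__(self, low_ch=512, high_ch=384):
        super().__init__()
        self.fc = nn.Conv2d(low_ch, high_ch, kernel_size=1)  # C_1 in (1)

    def forward(self, low, high):
        x = low.mean(dim=(2, 3), keepdim=True)   # Global Average Pooling
        delta = torch.sigmoid(self.fc(x))        # delta_k in (2)
        return high * delta                      # channel-wise product, (3)
```

Because the sigmoid outputs lie strictly in (0, 1), the module can only attenuate channels, acting as a soft channel-selection gate over the multi-scale features.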

D. MERGING MODULE
Under the guidance of the Instructive Module, the low-level features generated by VGG16 are processed by the three-column Feature Processor, and adaptive high-level features are obtained from the joint calculation of the three-column Feature Processor and the Instructive Module. To obtain a density map, a merging module is needed to convert these adaptive feature maps into a density map. Therefore, we design a feature Merging Module that generates a crowd density map from the high-level features.
We construct a Merging Module consisting of two convolutional layers with kernel sizes of 1 × 1 and 3 × 3. The first 1 × 1 convolutional layer is used to reduce the number of channels. Finally, the feature maps are processed by the last convolutional layer, which outputs a crowd density map. In most works [14], [45], the final output is obtained by a simple 1 × 1 convolution layer, an operation functionally equivalent to a fully connected layer: it only considers the value of a single pixel. In contrast, the 3 × 3 convolution kernel operates on the neighborhood and captures more spatial information.
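The two-layer design described above reduces to a short PyTorch sketch; the intermediate channel count is an illustrative assumption.

```python
import torch
import torch.nn as nn

class MergingModule(nn.Module):
    # 1x1 conv reduces channels; the 3x3 conv aggregates each pixel's
    # neighborhood before emitting the single-channel density map.
    def __init__(self, in_ch=384, mid_ch=64):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.out = nn.Conv2d(mid_ch, 1, kernel_size=3, padding=1)

    def forward(self, x):
        return self.out(torch.relu(self.reduce(x)))
```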

E. LOSS FUNCTION
The crowd count is calculated by integrating over all regions of the density map. To obtain accurate counts, Euclidean distance is usually used to limit the distance between the output density map and the ground truth. This measurement judges the probability pixel-wise and guarantees accuracy at the pixel level. However, it assumes that pixels are isolated and ignores neighborhood relationships; for the location of a head in the image, the head occupies an area instead of a single pixel. Therefore, we propose a novel distance measurement which integrates a spatial correlation loss (SCL) and Euclidean distance to achieve complementary effects. Our loss function is described as follows:

L_final(θ) = L(θ) + λ L_SCL   (4)

where λ is a parameter balancing the weight of the two terms, L(θ) is the Euclidean distance between the predicted density map and the ground truth, and L_SCL, proposed in [33], is the spatial correlation loss.
L(θ) = (1 / 2N) Σ_{i=1}^{N} ||D(X_i; θ) − D(X_i)||_2^2   (5)

where D(X_i; θ) is the predicted density map of GCINet, θ denotes the parameters of the model, D(X_i) is the ground truth density map of image X_i, and N is the training batch size.
L_SCL = 1 − (Σ_{p=1}^{P} Σ_{q=1}^{Q} O_pq G_pq) / (√(Σ_{p,q} O_pq^2) · √(Σ_{p,q} G_pq^2))   (6)

where O_pq and G_pq represent pixels of the output density map and the ground truth density map, respectively, and each density map has P × Q pixels.
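For clarity, here is a plain-Python sketch of the combined loss on a single flattened density map; the normalized cross-correlation form of the SCL follows the description above and is an assumption about the exact formulation of [33].

```python
import math

def scl(pred, gt):
    # Spatial correlation loss: one minus the normalized cross-correlation
    # between the two maps; zero when pred is a positive multiple of gt.
    num = sum(o * g for o, g in zip(pred, gt))
    den = math.sqrt(sum(o * o for o in pred)) * math.sqrt(sum(g * g for g in gt))
    return 1.0 - num / den

def combined_loss(pred, gt, lam=1.0):
    # Euclidean (squared L2) term plus lambda-weighted SCL term.
    l2 = sum((o - g) ** 2 for o, g in zip(pred, gt)) / 2.0
    return l2 + lam * scl(pred, gt)
```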

F. TRAINING DETAILS 1) GROUND TRUTH GENERATION
Since a head occupies an area while the annotations are absolute point coordinates, using the annotation data directly would cause inaccurate results. Therefore, density maps applicable to crowd counting need to be generated from the marked coordinates. The density map generation method [18] used in this paper is described below. A density map is defined as the convolution of an impulse function with a Gaussian kernel. Assuming an annotation point at position x_i, the label of an image with N heads can be expressed as:

H(x) = Σ_{i=1}^{N} δ(x − x_i)   (7)

and the density map is obtained by convolving it with a Gaussian kernel:

F(x) = Σ_{i=1}^{N} δ(x − x_i) ∗ G_{σ_i}(x)   (8)

where x_i indicates the pixel location of a human head, δ(x − x_i) represents the impulse function of the head position in the image, N is the total number of heads in the image, and G_{σ_i}(·) is a Gaussian function with standard deviation σ_i.
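A minimal pure-Python sketch of density map generation with a fixed Gaussian bandwidth (the geometry-adaptive variant of [18] would instead set σ_i per head from neighbor distances; σ = 4 here is an illustrative assumption):

```python
import math

def density_map(points, h, w, sigma=4.0):
    # Each annotated head (row, col) contributes a 2-D Gaussian,
    # normalized over the image so the map integrates to the head count.
    dmap = [[0.0] * w for _ in range(h)]
    for cy, cx in points:
        kernel = [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
                   for x in range(w)] for y in range(h)]
        total = sum(map(sum, kernel))
        for y in range(h):
            for x in range(w):
                dmap[y][x] += kernel[y][x] / total
    return dmap
```

Summing the map recovers the annotated count, which is exactly the property the network's integral-based counting relies on.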

2) EVALUATION METRICS
In the task of crowd counting, we use Mean Absolute Error (MAE) and Mean Squared Error (MSE) for evaluation, which are defined as follows:

MAE = (1/N) Σ_{i=1}^{N} |C_i − C_i^GT|   (9)

MSE = √((1/N) Σ_{i=1}^{N} (C_i − C_i^GT)^2)   (10)

where N is the number of testing images, and C_i and C_i^GT represent the estimated count and ground truth count of the i-th image, respectively. C_i is the integral of the output density map.
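Both metrics can be computed directly from per-image counts, as in this short sketch:

```python
import math

def mae(pred_counts, gt_counts):
    # Mean absolute error over N test images.
    n = len(pred_counts)
    return sum(abs(c - g) for c, g in zip(pred_counts, gt_counts)) / n

def mse(pred_counts, gt_counts):
    # Root of the mean squared error, as conventionally reported
    # under the name "MSE" in crowd counting.
    n = len(pred_counts)
    return math.sqrt(sum((c - g) ** 2 for c, g in zip(pred_counts, gt_counts)) / n)
```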

3) DATA PREPROCESSING
In data preprocessing, we crop images to 400 × 400 for data augmentation. Images whose height or width is less than 400, such as some in the ShanghaiTech dataset [18], are first resized to 420 × 420 and then cropped. In each epoch, we randomly select a point as the top-left corner of the patch to be cropped. After cropping, the image is randomly flipped horizontally, and finally random noise is added. The UCSD dataset provides a region of interest (ROI); we first mask the image with the ROI. Since the resolution of the images is too small, we then resize them to four times their original resolution. Finally, we randomly crop a 400 × 400 patch from the images.
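The cropping step can be sketched as follows; the function name is hypothetical, and `pad_to` reflects the 420 × 420 resize used for undersized ShanghaiTech images.

```python
import random

def random_crop_coords(h, w, size=400, pad_to=420):
    # Images smaller than the crop on either side are assumed to have
    # been resized to pad_to x pad_to first (as for ShanghaiTech).
    if h < size or w < size:
        h = w = pad_to
    top = random.randint(0, h - size)
    left = random.randint(0, w - size)
    return top, left, top + size, left + size
```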

IV. EXPERIMENT
In this section, we introduce the experimental process and results. Our experiments include the following parts: the efficiency of the Instructive Module, the parameter selection of the loss function, and the evaluation on four challenging datasets: the ShanghaiTech dataset [11], UCF_CC_50 dataset [39], UCF_QNRF dataset [9] and UCSD dataset [5]. The overall results are shown in Table 1 and Table 2. In addition, the dataset used in the comparative experiments is ShanghaiTech Part A.
In the training process, the Adam optimizer with a learning rate of 1e-4 is adopted to optimize the parameters of our model. The batch size is set to 16.
A. DATA DESCRIPTION
1) SHANGHAITECH DATASET
Kang and Chan [18] introduced a large crowd counting dataset. The dataset contains a total of 1,198 images with 330,165 head annotations and is divided into two parts: Part A and Part B. The 482 images of Part A are randomly collected from the Internet, while Part B is obtained from busy Shanghai streets. The scenes in Part A vary widely and are dense, while the images in Part B are sparser.

2) UCF_CC_50 DATASET
This is the first challenging dataset created from public websites with different density levels, scenarios, and distorted perspectives. The authors collected images of different scenes, including concerts, protests, stadiums and marathons, to maximize the diversity of scene types. The dataset contains a total of 50 images with different resolutions, annotating a total of 63,075 people. Each image has an average of 1,280 heads, but the count per image ranges from 94 to 4,543, so the density varies significantly. The dataset has one major drawback: only a limited number of images are available for training and evaluation. Considering this, we test using five-fold cross-validation.

3) UCF_QNRF DATASET
UCF_QNRF is the largest crowd counting dataset in terms of the number of annotations. It contains a total of 1,535 images; the training set contains 1,201 images, and the test set contains 334 images. Because it contains orders of magnitude more annotated people in crowd scenes, it is better suited to training very deep convolutional neural networks (CNNs). As UCF_QNRF is a new dataset, only a few methods have been validated on it.

4) UCSD DATASET
The UCSD dataset contains 2,000 frames taken from a video, each with a size of 238 × 158. The dataset has relatively low crowd density, averaging about 15 people per frame, and also provides a region of interest (ROI). In our experiments, frames with indices 600-1399 form the training set, and the remaining 1,200 frames form the test set. Since the dataset was captured from a fixed camera, there is no change of scene perspective across frames.

B. EFFICIENCY OF INSTRUCTIVE MODULE AND MERGING MODULE
In Section III, we described the Instructive Module and explained its construction principles and purpose in detail, but no empirical verification was carried out there. In this section, we describe the experimental process and conduct experiments for verification. The first group of experiments uses the network with the Instructive Module, while the other group removes it. The results of the two models with the same parameter λ = 1 are shown in Fig. 4(a). The three-column Feature Processor with the Instructive Module improves the counting accuracy by 2.3 (3.97%) in terms of the MAE metric compared with the three-column Feature Processor alone, and by 8.6 (9.32%) in terms of the MSE metric.
It turns out that the Instructive Module improves the performance of the entire network. The Instructive Module mainly obtains global context information from low-level features, and our network uses this global context as an indicator to guide the generation of high-level features. For the Merging Module, a convolution layer with only a 1 × 1 kernel is widely used to merge features and obtain the density map. However, we deploy convolution layers with both 1 × 1 and 3 × 3 kernels in our model, since the neighborhood should also be considered, even in the last module of the network. The effectiveness of the 3 × 3 convolution is demonstrated by experiment, as shown in Fig. 4(b). The Merging Module using 1 × 1 and 3 × 3 kernels improves the counting accuracy by 2.8 (4.84%) in terms of the MAE metric compared with the Merging Module using only a 1 × 1 kernel, and by 7.9 (8.5%) in terms of the MSE metric.

C. LOSS FUNCTION ABLATION AND PARAMETER SELECTION
As shown in Fig. 4(d) and Fig. 4(e), when we experiment on ShanghaiTech Part A, the performance of TEDNet [40] and our model follows a similar trend with respect to the parameter λ: the MAE metric first decreases and then increases as λ grows, though with slight differences between the two models. When λ = 0.1, our model performs best in MAE, while λ = 1 is optimal for MSE. When λ = 1, the MAE is only 0.4% (0.24) higher than at λ = 0.1. Hence, λ = 1 is adopted in our model.
It is worth noting that, on ShanghaiTech Part A, even if we deploy only the Euclidean distance as the loss function, our model still outperforms other state-of-the-art methods in terms of MAE. The evaluation error is shown in Fig. 4(c), demonstrating that our model is effective in decoding low-level features. The performance of our model in MSE is shown in Fig. 4(f).
In particular, when λ = 0.1 or λ = 1, the network performs better than when λ = 0, which suggests that the introduction of the SCL loss function has a significant impact on network performance.

D. EVALUATION AND COMPARISON
Based on the above experiments, we use the network with the Instructive Module and parameter λ = 1 to verify performance on four datasets: the ShanghaiTech, UCF_CC_50, UCF_QNRF and UCSD datasets. The results demonstrate that our method performs well and exceeds state-of-the-art methods.
The performance of our method on the ShanghaiTech dataset is shown in Table 1. We achieve an improvement of 10.3 (15.1%) compared to CSRNet [12] and 3.8 (6.2%) compared to state-of-the-art SPN [25] in MAE. Fig. 5 shows the density maps generated by our model. The left images belong to Part A and the right one belongs to Part B. It is demonstrated that our model effectively responds to scale changes, and also has a good effect on the complex background scene. Our model adapts to dense scenes and sparse scenes at the same time.
In Fig. 6, we display the predicted value and the ground truth of each test image for two models (CSRNet and ours) on ShanghaiTech Part A, and Fig. 7 shows the same for Part B. We choose the ShanghaiTech dataset because its diversity of scene types and density levels makes it representative. There are two reasons for choosing CSRNet. First, its structure is similar to ours: features extracted by VGG are decoded to obtain accurate results. Second, CSRNet performs relatively well and is representative of this class of methods. The counts increase from left to right according to the ground truth. It can be seen intuitively that the predictions of our model are usually accurate and its absolute deviation is usually smaller, so our model adapts to different density levels and scenarios.
However, due to the small number of high-density image samples in the training set, when the ground truth of an image exceeds 1000, our estimation error becomes larger and the prediction is often smaller than the ground truth.
In [42], the estimation error grows once the ground truth of an image exceeds 600. In contrast, our model shows a clear performance improvement in extremely dense situations, achieving the best results in both accuracy and adaptability.
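The density-level analysis above can be reproduced by bucketing test images by their ground-truth count and reporting the mean absolute error per bucket. The bin edges 600 and 1000 below follow the thresholds discussed in the text; the function itself is a hypothetical sketch, not code from the paper.

```python
import numpy as np

def error_by_density_level(pred_counts, gt_counts, bins=(0, 600, 1000, np.inf)):
    # Group test images by ground-truth count and compute the mean absolute
    # error within each density level, as in the dense-scene analysis above.
    pred, gt = np.asarray(pred_counts, float), np.asarray(gt_counts, float)
    errors = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (gt >= lo) & (gt < hi)
        if mask.any():
            errors[(lo, hi)] = float(np.mean(np.abs(pred[mask] - gt[mask])))
    return errors
```

Comparing the per-bucket errors of two models makes visible where, e.g., predictions start to undershoot in the densest bucket.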
For UCF_CC_50, the MAE and MSE of the proposed method compared with other state-of-the-art methods are shown in Table 1. Our approach achieves the best MAE. Comparing various methods, they usually cannot retain the same performance across different datasets; this is mainly due to the differing characteristics of the datasets, which require the network to have better adaptability. Taking the two typical networks (CSRNet and SPN) in Table 1 for analysis, both perform well on the ShanghaiTech dataset but are inferior on UCF_CC_50. Although it likewise focuses on decoding, our model is more adaptive under the guidance of global context information.
UCF_QNRF is a new dataset, so only a few methods have been validated on it. We compare with recent methods on UCF_QNRF in Table 1, which shows the evaluation errors for our model and others. Our network beats the second-best approach in MSE and shows comparable performance in MAE. Table 2 shows the results of our network on the UCSD dataset; our experimental results are superior to all methods except MCNN in MAE and MSE. This shows that our method not only works in extremely dense scenes but also achieves good results in sparse scenes.

V. CONCLUSION
In this paper, we have presented an end-to-end trainable network called GCINet for crowd counting and density map generation. The architecture of GCINet consists of a low-level extractor and a density map decoder. We use the Adaptive Features Processor to decode the low-level features generated by VGG16, and design the Instructive Module to obtain global context information from these features. We verify the effectiveness of the Instructive Module, the Merging Model, and the proposed loss function on ShanghaiTech Part A. Finally, experiments on the ShanghaiTech, UCF_CC_50, UCF_QNRF, and UCSD datasets show that our model obtains state-of-the-art performance.

He has authored more than 100 international journal and conference papers in the fields of mathematical morphology, fuzzy theory, image analysis, pattern recognition, and bioinformatics. He holds 25 national invention patents. He is an active reviewer for around 70 international journals and conferences.
TAO ZHAO is currently pursuing the Ph.D. degree in optical engineering. He is also a Senior Engineer of the CETC 3rd Research Institute. His major research interests are 3D display, photoelectric alarm, and image processing.
YONGCE CHENG is currently a Senior Engineer of the CETC 3rd Research Institute. His major research interests are photoelectric detection and tracking, photoelectric alarm, and image processing.