SRNet: Scale-Aware Representation Learning Network for Dense Crowd Counting

Huge variations in the scales of people in images create an extremely challenging problem in the task of crowd counting. Currently, many researchers apply multi-column structures to solve the scale variation problem. However, multi-column structures usually have complex structures with large numbers of parameters and are difficult to optimize. To this end, we propose a scale-aware representation learning network (SRNet) that uses a commonly used encoder-decoder framework. An image is converted into deep features by the first ten layers of VGG16 in the encoder. Then, the features are regressed to a crowd density map via the decoder. The decoder mainly consists of two modules: the scale-aware feature learning module (SAM) and the pixel-aware upsampling module (PAM). SAM models the multi-scale features of a crowd at each level with different sizes of receptive fields, and PAM enlarges the spatial resolution and enhances the pixel-level semantic information, thereby improving the overall counting accuracy. We conduct extensive crowd counting experiments on ShanghaiTech Part_A, UCF-QNRF, and UCF_CC_50 datasets. Furthermore, to obtain the locations of each person, we conduct crowd localization experiments on UCF-QNRF and NWPU-Crowd datasets. The qualitative and quantitative results prove the effectiveness of the SRNet in dense crowd counting and crowd localization tasks.


I. INTRODUCTION
Crowd counting is a classic computer vision task that aims to automatically count the number of people in a given image. It has a wide range of applications, such as estimating the scale of political gatherings and sports events and monitoring the flow of people in stations or tourist areas. It is of great significance to public safety [1] and has drawn widespread interest from researchers. Some early methods [2]-[4] formulated crowd counting as a pedestrian detection problem; that is, the number of people in a scene is obtained by counting the number of pedestrian detection boxes. However, in high-density crowd scenes, the counting accuracy of these methods is unsatisfactory due to the mutual occlusion of heads and shoulders. Other methods [5], [6] improve the accuracy by manually extracting features and using a regressor to directly regress the number of people. However, these methods ignore the location information of people, and the expressiveness of handcrafted features is limited: features must be designed manually for each specific scenario, which limits the generalization of such methods. In recent years, many convolutional neural network (CNN)-based methods [7]-[15] have been proposed to solve the crowd counting problem, benefiting from the powerful feature learning capabilities of deep learning. Most of these networks, such as MCNN [7], Switching-CNN [8], CAN [10], and CP-CNN [16], aim to model multi-scale variations in crowds. However, some of these methods produce redundant features [7], require complex structures to train multiple regressors [8], [16], or include a large number of parameters [10].
Other network structures have been designed to solve multi-scale problems in other computer vision tasks. Image pyramids and feature pyramids are often used in object detection tasks to extract multi-scale features. In semantic segmentation, multi-scale problems are commonly addressed with spatial pyramid pooling, as in SPPNet [17] and ASPPNet [18], or with multi-scale feature fusion realized by splicing feature maps. However, multi-scale feature extraction usually requires a complicated model, which leads to a sharp increase in the number of parameters. To decrease the amount of computation while maintaining accuracy, HS-ResNet [19] was recently proposed; its Hierarchical-Split Block (HSB) splits the input features into two groups: one is used for identity mapping, and the other is used for extracting refined features. By combining different features and enhancing the information interaction between groups, the network achieves a better trade-off between parameter count and performance than other complex networks.
Inspired by HSB, we propose a scale-aware representation learning network (SRNet) for dense crowd counting. To avoid feature redundancy across multiple columns of features and to reduce the number of network parameters, we use feature splitting and feature splicing to extract and fuse features. In particular, in the process of extracting multi-scale features, we do not treat multi-column convolutions as mutually independent structures; instead, we continue to refine the extracted features across a multi-layer structure with different receptive fields. The network is designed with the commonly used encoder-decoder framework. In the encoder part, an image is converted into deep features. In the decoder part, the deep features are regressed to the crowd density map. The encoder is constructed with the first 10 layers of VGG16, and the decoder contains two modules, a scale-aware feature learning module (SAM) and a pixel-aware upsampling module (PAM). SAM models the multi-scale features of the crowd at each level with different sizes of receptive fields, and PAM improves the resolution and enhances the pixel-level semantic information, thereby improving the overall counting accuracy.
We conduct a large number of crowd counting and crowd localization experiments on mainstream datasets, and our SRNet model achieves competitive results compared with the latest methods.
In summary, our main contributions are as follows:
• We propose a scale-aware feature learning module with layer-level correlations that makes full use of the contextual information and multi-scale features extracted by convolutional layers with different receptive fields to perform more accurate counting regression.
• We propose a pixel-aware upsampling module to perform semantic enhancement at a fine-grained level to improve the pixel-level regression performance.
• The above two modules are used to construct the scale-aware representation learning network for dense crowd counting. Crowd counting and crowd localization experiments are carried out on mainstream datasets, and competitive results are achieved.
The rest of this paper is organized as follows: we introduce the related work in Section II, demonstrate the details of the proposed method in Section III, describe experiments about crowd counting, crowd localization, and ablation studies about the design of SRNet in Section IV, and conclude in Section V.

II. RELATED WORK
In this section, we first briefly review traditional crowd counting methods and CNN-based crowd counting methods. Then, we summarize common deep-learning-based multi-scale feature learning methods in computer vision tasks.

1) DETECTION-BASED METHODS
Detection-based crowd counting methods primarily design a target detector to detect pedestrians in a crowd and obtain the number of pedestrians in a scene by counting the detection boxes. Such methods usually extract features of individual objects [20] or body parts [21]-[25] to train the target detector. Since the head, shoulders, and other parts of the human body are prone to occlude each other, detection-based methods easily miss pedestrians. Therefore, detection-based methods are only suitable for crowd counting in sparse scenes.

2) REGRESSION-BASED METHODS
Regression-based crowd counting methods were proposed to adapt crowd estimation to high-density scenes. The main idea is to extract crowd characteristics and train a regressor that establishes a nonlinear mapping between a crowd scene image and the number of people in the scene. Commonly used crowd characteristics include foreground characteristics [26] and gradient characteristics [20], and commonly used regressors include linear regression [27] and Bayesian Poisson regression [28]. The counting accuracy of these methods is better than that of detection-based methods, but they ignore the location distribution of the crowd.

3) CROWD DENSITY MAP ESTIMATION METHODS
To obtain location information while estimating the number of people, Lempitsky et al. [5] first proposed a method based on density estimation. The main idea is to learn a linear mapping between local features and the corresponding density map. To reduce the difficulty of learning a linear mapping, [29] proposes a nonlinear mapping, random forest regression, by introducing a crowdedness prior and using it to train two different forests. However, these methods require manual feature extraction and depend on specific scenarios; thus, their ability to extract features is limited, and features must be designed manually for each scenario, which limits the generalization of the algorithm. To alleviate these problems and eliminate the disadvantages of manual feature extraction and the low accuracy of traditional density estimation methods, an increasing number of researchers have turned to CNN-based density map methods for crowd counting, leveraging the powerful feature extraction capability of CNNs. Earlier heuristic models typically leverage basic CNNs to predict the density of crowds [6], [30], [31]. Recently, many CNN-based density map methods [7], [8], [32] have been proposed that employ CNN branches with different receptive fields to extract multi-scale information. Li et al. [9] first applied dilated convolution to a crowd counting network, where a pretrained VGG16 [33] model was used as the encoder and dilated convolution was applied in the decoder. Sindagi and Patel [34] adopted a multi-task method and designed a cascaded CNN to perform density classification and crowd counting simultaneously. Shen et al. [35] used a generative adversarial network (GAN) and generated a density map through the U-Net [36] structure to improve counting accuracy via the counting consistency between subimages and the total image. Zhou et al.
[37] combined a multi-scale convolutional neural network (generator) and an adversarial network (discriminator) to generate high-quality density maps and adapt crowd counting to complex crowd scenes. There are also other CNN-based methods: methods based on attention mechanisms [38]-[49], which segment the foreground and the background to strengthen the expression of foreground features and suppress that of background features; methods based on super-resolution [50], [51], which improve the image quality of high-density areas and thereby the counting accuracy; methods based on the perspective relationship of a scene [6], [12], [52]-[56], which use the perspective view to eliminate the influence of human head scale variation; and methods based on transfer learning [57]-[60], which reduce the dependence of the network model on labeled data and improve the generalization ability of the model. In conclusion, an increasing number of CNN-based density estimation methods [16], [40], [61]-[67] have been proposed and have become the mainstream approach to crowd counting.

C. MULTI-SCALE FEATURE LEARNING FOR CROWD COUNTING
Scale variation is the primary problem to be solved in crowd counting; thus, multi-scale feature learning for CNN-based density map estimation has attracted the attention of researchers. Zhang et al. [7] proposed a multi-column network structure, called MCNN, that contains three branches with different kernel sizes (large, medium, and small); features of different scales are extracted through the convolution kernels of different sizes in each column and then fused. Zeng et al. [32] proposed MSCNN, which also uses convolution kernels of different sizes to extract features; MSCNN differs from MCNN in that its feature fusion block uses a convolutional layer instead of simple feature stacking. Sam et al. [8] added a switching branch on the basis of MCNN, used the three branches as three CNN regressors, and trained a classifier to select the branch that best matches the input so as to better learn features of different scales. Sindagi and Patel [16] used a global context estimator to estimate the image crowd density level, a local context estimator to estimate the patch density level, and a density estimator to estimate the crowd density; a fusion-CNN at the back end of the network fuses these three estimators to predict a high-quality multi-scale crowd density map. Liu et al. [10] added a context-aware module between VGG16 and the decoder to extract multi-scale features. Chen et al. [68] proposed a scale pyramid network (SPN) that uses a shared deep single-column network and extracts high-level multi-scale information through a scale pyramid module containing four parallel dilated convolutions with different dilation rates. In addition, many researchers have proposed various other methods based on multi-scale features, such as ADCrowdNet [46], SCNet [69], RSANet [70], efficient and switchable CNN [71], and CASA-Crowd [72].
The abovementioned multi-scale methods ignore the feature correlations between columns and suffer from the large number of parameters generated by increasingly complex network architectures. Different from these methods, we propose a scale-aware representation learning network built around a scale-aware feature learning module, which not only transfers the extracted feature information between layers and refines it but also avoids the large increase in parameters caused by growing network complexity.

D. MULTI-SCALE FEATURE LEARNING FOR OTHER COMPUTER VISION TASKS
Image pyramids are often used in object detection tasks: convolution operations are performed on versions of an image at different resolutions to extract multi-scale features from the same image. However, the inference cost of image pyramids is unsatisfactory. Hence, some researchers have proposed replacing the image pyramid with a feature pyramid (via direct multi-scale feature prediction, multi-scale feature fusion plus single-scale feature prediction, or multi-scale feature fusion plus multi-scale feature prediction) so that a single input image yields feature maps of different resolutions.
Spatial pyramid pooling methods are commonly used to solve multi-scale problems in semantic segmentation tasks. After pooling, multi-scale features corresponding to different resolutions are generated in some methods, such as SPPNet [17] and ASPPNet [18], or multi-scale feature fusion is realized through feature map splicing in other methods, such as U-Net [36].

(Caption of Fig. 1: The decoder part uses SAM to extract and fuse multi-scale information and uses PAM to increase the resolution and pixel-level semantic information; finally, the multi-scale features are fed into a 1 × 1 convolution layer to predict the crowd density map. ⊕ stands for the addition operation and feature fusion, ⊗ represents element-wise multiplication, n indicates the n-th convolution processing, and ''split'' means feature separation.)

III. PROPOSED METHOD
Following the current mainstream methods [7]-[10], we formulate crowd counting as a pixel-level regression problem. The network uses an encoder-decoder framework. An image I_i ∈ R^(3×H×W) is input, Eq. (1) is used to generate feature maps F_i ∈ R^(1×C×H×W), and then Eq. (2) is used to obtain a predicted density map G_i ∈ R^(1×1×H×W):

F_i = F_vgg(I_i),        (1)
G_i = D_decoder(F_i),        (2)

where F_vgg represents the encoder (with C = 512) and D_decoder denotes the decoder designed in this paper. In particular, we explore a new multi-scale feature learning method called SRNet. The proposed SRNet model consists of two parts: an encoder and a decoder. The encoder uses the first ten layers of VGG16 [33] for feature extraction. Our main work is the design of the decoder, whose details are introduced below.
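As a concrete illustration of Eqs. (1) and (2), the encoder-decoder pipeline can be sketched in PyTorch. This is a minimal sketch, not the released implementation: the class and function names are ours, and the decoder is left as a placeholder for the SAM/PAM decoder described in the following subsections.

```python
import torch
import torch.nn as nn


def make_vgg16_front():
    """First ten convolutional layers of VGG16 (up to conv4_3) with three
    max-pooling layers; the output has 512 channels at 1/8 input resolution."""
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512]
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(2))
        else:
            layers += [nn.Conv2d(in_ch, v, 3, padding=1), nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)


class SRNetSketch(nn.Module):
    """Minimal sketch of the encoder-decoder pipeline of Eqs. (1)-(2).
    `decoder` is a placeholder module (e.g. the SAM/PAM stack)."""

    def __init__(self, decoder):
        super().__init__()
        self.encoder = make_vgg16_front()  # F_vgg in Eq. (1)
        self.decoder = decoder             # D_decoder in Eq. (2)

    def forward(self, image):              # image: 1 x 3 x H x W
        return self.decoder(self.encoder(image))
```

With a trivial 1 × 1 convolution as the decoder, a 64 × 64 input yields a 1 × 1 × 8 × 8 density map, matching the 1/8 spatial resolution produced by the three pooling layers.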

A. SCALE-AWARE FEATURE LEARNING MODULE
Multi-scale features play a crucial role in image tasks, and experiments in [7], [8], [32] have verified their effectiveness in crowd counting. Existing multi-scale methods extract information of different scales through multi-column convolution or dilated convolution. However, these methods pay no attention to the connections between different scales, force each column to learn features independently, and introduce a large number of additional parameters. In this section, we design an effective SAM. The green block in Fig. 1 shows the feature mapping process of the main part of the proposed SAM.
The main part of the module is composed of four layers, and each layer begins with a different convolution layer to extract features of a different scale. Since the scale of a crowd in congested scenes varies within a relatively small range, we use convolutions with relatively small dilation rates and kernel sizes. Dilated convolution increases the receptive field without sacrificing the spatial resolution of the feature map and without adding convolution layers. The input features of the module are fed into each layer after feature separation, which halves the number of channels. During feature extraction, the features of adjacent layers are connected. Taking the second layer of the first SAM component as an example, the input is the feature map F0 with a size of 1 × C/2 × H/8 × W/8; this layer first applies a convolution layer (kernel = 3 and dilation = 2) and outputs the feature map F2. Prior to this, the output feature map F1 of the first layer has dimensions of 1 × C/2 × H/8 × W/8. F2 and the first half of F1 are connected along the channel dimension, yielding a tensor of size 1 × 3C/4 × H/8 × W/8. The concatenated feature maps then go through a convolution layer and the pixel enhancement (PE) submodule shown in Eq. (5). Feature weighting enhances the crowd features, and the convolution layer provides learnable parameters for the weighting; the adaptive enhancement of semantic information is achieved through parameter adjustment and feature weighting of the convolution layer. The size of the final output tensor is 1 × C/2 × H/8 × W/8. Unlike the second layer, the first layer does not need to be connected to the upper-layer feature maps, and the output tensor of the fourth layer has dimensions of 1 × C/4 × H/8 × W/8. The equations for the entire SAM process are as follows:

f_{i,j} = C^n_{i,j}(h_i),        (3)

where i represents the i-th input image and j represents the j-th layer in this module, h_i denotes the input features of this module (the first input feature of SAM is F_i), and C^n_{i,j} indicates the n-th convolution processing, in which the sizes of the convolution kernels and the dilation rates differ so that multi-scale features can be extracted, as shown in Fig. 1.
f̃_{i,j} = C^conv_{i,j}([f_{i,j}, (1/2)c(g_{i,j-1})]),        (4)

where f_{i,j} represents the feature map after the first convolution operation in this layer, that is, the extracted features of a particular scale, and f̃_{i,j} ∈ R^(1×C_{j,k}×H_k×W_k) denotes the fusion of the upper-layer feature map and the feature map after the first convolution operation in this layer; C_{j,k} indicates the channel number of f̃_{i,j}, where C_{j,k} is C_k/2 (j = 1, 2, 3) and C_{j,k} is C_k/4 (j = 4). g_{i,j-1} is the output feature map of the previous layer. [..., ...] indicates that the extracted features are connected along the channel dimension, and C^conv_{i,j} denotes the convolution processing that converts the number of channels. (1/2)c(·) indicates that half of the feature map is taken along the channel dimension.
g_{i,j} = f̃_{i,j} ⊗ S_g(C_conv(f̃_{i,j})),        (5)

where ⊗ represents element-wise multiplication and S_g indicates the sigmoid function. g_{i,j} is the feature map after PE and represents the output feature map of the j-th layer in SAM. C_conv denotes convolution processing performed on the input feature maps with a 1 × 1 kernel.
g_i = [g_{i,1}, g_{i,2}, g_{i,3}, g_{i,4}],        (6)

where g_i represents the multi-scale features fused by SAM.
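Under the description above, the per-layer data flow of SAM can be sketched in PyTorch. This is a rough sketch: the specific dilation rates, the choice of feeding each layer the same half of the split input, and the fusion-convolution shapes are our assumptions for illustration; only the split-refine-concatenate pattern and the sigmoid-gated PE submodule follow the text.

```python
import torch
import torch.nn as nn


class PixelEnhancement(nn.Module):
    """PE submodule: sigmoid-gated weighting via a learnable 1x1 convolution."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))


class SAMSketch(nn.Module):
    """Rough sketch of one SAM component with four layers.
    Assumptions: dilation rates 1-4, layers 1-3 output C/2 channels,
    layer 4 outputs C/4 channels."""

    def __init__(self, c):
        super().__init__()
        dil = [1, 2, 3, 4]                      # hypothetical per-layer rates
        out = [c // 2, c // 2, c // 2, c // 4]  # per-layer output channels
        self.split = c // 2                     # feature separation halves channels
        self.scale = nn.ModuleList(
            nn.Conv2d(self.split, out[j], 3, padding=dil[j], dilation=dil[j])
            for j in range(4))
        # fusion convs restore each layer's channel count after concatenation
        self.fuse = nn.ModuleList(
            nn.Conv2d(out[j] + out[j - 1] // 2, out[j], 1)
            for j in range(1, 4))
        self.pe = nn.ModuleList(PixelEnhancement(out[j]) for j in range(4))

    def forward(self, h):
        x = h[:, :self.split]                     # "split": take half the channels
        outs, prev = [], None
        for j in range(4):
            f = self.scale[j](x)                  # scale-specific convolution
            if j > 0:                             # connect with the first half of
                half = prev[:, :prev.shape[1] // 2]   # the previous layer's output
                f = self.fuse[j - 1](torch.cat([f, half], dim=1))
            prev = self.pe[j](f)                  # PE submodule
            outs.append(prev)
        return torch.cat(outs, dim=1)             # fuse all layer outputs
```

For an input with C = 64 channels, the four layers emit 32 + 32 + 32 + 16 channels, so the fused output has 112 channels at the same spatial resolution.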

B. PIXEL-AWARE UPSAMPLING MODULE
We use three pooling layers to extract high-level features while reducing the number of parameters, so upsampling is needed to increase the resolution and restore the original image size. To further refine the features of SAM, we propose a feature upsampling module, shown in the orange box in Fig. 1. This module consists of a pixel attention module and an upsampling layer. The pixel attention module uses a 1 × 1 convolution layer and a sigmoid function to add a learnable weight coefficient to each pixel, which is then applied to the input features to increase pixel-level semantic information. In the upsampling layer, we use bilinear interpolation to gradually increase the resolution and restore the original size of the image. We take g_i, the feature map output of SAM, as the input and output the feature map u_i. The module is defined as Upsampling → Conv2d → Pixel attention (Pa) → Conv2d, and pixel attention is expressed as:

Pa(f_in) = f_in ⊗ S_g(C_conv(f_in)),        (7)

where f_in represents the input feature maps and C_conv denotes convolution processing on the input feature maps with a 1 × 1 kernel. ⊗ denotes element-wise multiplication. 1 × C_k × H_k × W_k represents the input feature size of the k-th (k = 1, 2, 3) PAM.
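A sketch of PAM following the stated Upsampling → Conv2d → Pixel attention (Pa) → Conv2d layout. The 1 × 1 attention convolution and bilinear upsampling come from the text; the kernel sizes and channel counts of the two outer convolutions are assumptions.

```python
import torch
import torch.nn as nn


class PAMSketch(nn.Module):
    """Sketch of the pixel-aware upsampling module:
    Upsampling -> Conv2d -> Pixel attention (Pa) -> Conv2d."""

    def __init__(self, channels):
        super().__init__()
        # bilinear interpolation doubles the spatial resolution
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)  # assumed 3x3
        self.pa_conv = nn.Conv2d(channels, channels, 1)           # 1x1 per the text
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)  # assumed 3x3

    def forward(self, g):
        x = self.conv1(self.up(g))
        # Pa: a learnable per-pixel weight, gated by a sigmoid
        x = x * torch.sigmoid(self.pa_conv(x))
        return self.conv2(x)
```

Each PAM stage doubles the spatial resolution, so the three stages together restore the factor-of-8 downsampling introduced by the encoder's pooling layers.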

C. SCALE-AWARE REPRESENTATION LEARNING NETWORK
We present a schematic diagram of the model in Fig. 1 and call this model SRNet, which is used for dense crowd counting. Our model offers the following advantages. First, SAM can effectively use the feature connections between levels to refine and fuse the features extracted from each layer. In particular, it does not simply concatenate the features of all layers; it splits the features and connects only part of them, and part of each layer's output is passed to the next layer for further refinement, thereby preventing a sudden increase in the number of parameters as network layers are added. Second, PAM can restore the image resolution while enhancing pixel-level semantic information. Third, a pixel enhancement submodule is added to SAM to enhance pixel-level semantic information and reduce the weights of background pixels.
In summary, our model not only extracts multi-scale features while avoiding a sudden increase in the number of parameters but also adjusts the attention of the crowd and the background at the pixel level. Therefore, our model performs better in the crowd counting task. Although our model can prevent a sudden increase in parameters when the network becomes complicated, the problem of a large number of parameters still exists. Determining how to continue to design a lighter model without affecting the network performance remains to be explored in the future.

D. DENSITY MAP GENERATION AND EVALUATION METRICS
Each head positioning point in an image is processed by a Gaussian kernel function to generate a density map:

F_i(x) = Σ_{j=1}^{M} δ(x − x_j) ∗ G_{σ_j}(x),  with σ_j = β d̄_j,        (8)

where G_{σ_j} represents the two-dimensional standard Gaussian kernel function, δ(·) denotes the Dirac delta function, σ_j represents the standard deviation, and M is the total number of people in I_i, which equals the sum of all pixel values in the crowd density map F_i. β is a constant, and d̄_j approximates the diameter of a human head in the image, as in Zhang et al. [7]. In this paper, β is 0.3, and d̄_i is the average distance to the k nearest neighbors (k = 7). The crowd counting network establishes a nonlinear transformation between an input image I_i and the corresponding crowd density map F_i. In crowd counting tasks, we use the mean squared error (MSE) and mean absolute error (MAE) as the metrics for comparing the performance of SRNet against other crowd counting methods. For a test sequence with N images, MSE and MAE are defined as follows:

MSE = sqrt( (1/N) Σ_{i=1}^{N} (C_i − C_i^GT)^2 ),        (9)
MAE = (1/N) Σ_{i=1}^{N} |C_i − C_i^GT|,        (10)

where C_i and C_i^GT denote the predicted and ground-truth counts of the i-th test image, respectively.
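The geometry-adaptive density map generation and the two counting metrics can be sketched in plain NumPy. This is a simplified sketch: points are assumed to be (x, y) pixel coordinates, and the fallback sigma for an image with a single head is our assumption.

```python
import numpy as np


def density_map(points, h, w, beta=0.3, k=7):
    """Geometry-adaptive Gaussian density map (sketch of the scheme from
    Zhang et al. [7]): each head point is smoothed with a Gaussian whose
    sigma is beta times the mean distance to its k nearest neighbours."""
    dmap = np.zeros((h, w), dtype=np.float64)
    pts = np.asarray(points, dtype=np.float64)
    yy, xx = np.mgrid[0:h, 0:w]
    for j, (x, y) in enumerate(pts):
        if len(pts) > 1:
            d = np.sort(np.hypot(*(pts - pts[j]).T))[1:k + 1]  # skip self-distance
            sigma = beta * d.mean()
        else:
            sigma = 4.0  # assumed fallback for a lone head
        g = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()  # normalise so each head contributes exactly 1
    return dmap


def mae_mse(pred_counts, gt_counts):
    """MAE and MSE over N test images from predicted and ground-truth counts."""
    diff = np.asarray(pred_counts, float) - np.asarray(gt_counts, float)
    return np.abs(diff).mean(), np.sqrt((diff ** 2).mean())
```

Because each per-head Gaussian is normalized by its in-image sum, the integral of the density map equals the number of annotated heads, so the predicted count is simply the sum of the predicted map's pixels.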

E. LOCALIZATION TASK
Compared with crowd density maps, crowd localization can more accurately reflect the distribution of crowds. Therefore, we extract head positioning points from the crowd density map, using 0.5 as the threshold to determine whether a point is a head position. Then, the extracted head positioning points are matched one-to-one with the ground-truth points. If a detection falls within a particular distance threshold of a ground-truth location, it is recorded as a true positive (TP); if the distance is greater than the threshold, it is recorded as a false positive (FP); if no detection falls within the threshold of a ground-truth location, it is treated as a false negative (FN). We use precision (Pre), recall (Rec), and F1-measure (F1) to evaluate the localization performance under the threshold.
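The matching and scoring protocol described above can be sketched as follows. This greedy nearest-neighbor matcher is a simplification for illustration; the official benchmarks [14], [73] use their own matching implementations.

```python
import numpy as np


def evaluate_localization(pred_pts, gt_pts, threshold):
    """Greedy one-to-one matching of predicted head points to ground-truth
    points within a distance threshold, then precision/recall/F1."""
    pred = list(map(tuple, pred_pts))
    gt = list(map(tuple, gt_pts))
    tp = 0
    unmatched_gt = list(gt)
    for p in pred:
        if not unmatched_gt:
            break
        d = [np.hypot(p[0] - g[0], p[1] - g[1]) for g in unmatched_gt]
        j = int(np.argmin(d))
        if d[j] <= threshold:
            tp += 1                 # true positive: within the threshold
            unmatched_gt.pop(j)     # one-to-one: consume the matched GT point
    fp = len(pred) - tp             # unmatched predictions
    fn = len(gt) - tp               # unmatched ground-truth points
    pre = tp / max(tp + fp, 1)
    rec = tp / max(tp + fn, 1)
    f1 = 2 * pre * rec / max(pre + rec, 1e-12)
    return pre, rec, f1
```

For the UCF-QNRF protocol, this evaluation would be repeated for each threshold radius from 0 to 100 pixels and the results averaged.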

F. LOSS FUNCTION
The L2 loss is chosen as the loss function for the crowd counting task:

L(λ) = (1/(2N)) Σ_{i=1}^{N} ||Ĝ(I_i; λ) − G_i||_2^2,        (11)

where λ represents the learnable parameters of the crowd counting network, Ĝ(I_i; λ) is the output of the crowd counting network, G_i is the ground truth, and N represents the number of images in the training set.
The loss function of the localization task is binary cross-entropy:

L_loc = −(1/N) Σ_{i=1}^{N} [ M_i log M̂_i + (1 − M_i) log(1 − M̂_i) ],        (12)

where i denotes the i-th image of the training set and N represents the number of images in the training set. M̂ is the prediction result, and M is the ground truth; here, M is set to 1 or 0, representing the presence and absence of a person, respectively.
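Both training objectives can be sketched in PyTorch; the reduction conventions (mean over pixels and images) are our assumptions.

```python
import torch


def counting_loss(pred_density, gt_density):
    """L2 (pixel-wise squared error) loss for the counting task."""
    return torch.mean((pred_density - gt_density) ** 2)


def localization_loss(pred_map, gt_map):
    """Binary cross-entropy for the localization task; gt_map holds 1 where
    a head is present and 0 elsewhere."""
    return torch.nn.functional.binary_cross_entropy(pred_map, gt_map)
```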

IV. EXPERIMENTS
In this section, we mainly perform density estimation tasks and crowd localization tasks. First, we describe the configuration details of our experiments. Then, we conduct the ablation study on ShanghaiTech Part_A. Afterwards, we report the results of the proposed SRNet model on three mainstream datasets. Finally, we show the results of the localization task on NWPU-Crowd and UCF-QNRF.
A. DATASETS
1) ShanghaiTech PART_A
ShanghaiTech Part_A includes 482 images collected from the Internet (300 images for training and 182 images for testing). The number of people per image ranges between 33 and 3139, and most of the images contain dense crowds.

2) UCF_CC_50
UCF_CC_50 [26] includes 50 images from the Internet, and there are at most 4543 people in an image. The dataset contains 63,075 labeled individuals. This dataset contains images of extremely crowded scenes. These images are mainly acquired from FLICKR. Similar to other widely accepted methods, we also use a standard 5-fold cross-validation protocol to evaluate the algorithms.

3) UCF-QNRF
UCF-QNRF [14] contains 1535 images (1201 images for training and 334 images for testing). Compared with previous crowd statistics datasets, this dataset contains more annotations in congested scenes. The crowd populations in this dataset have large variations from 49 to 12,865, which poses a challenge for crowd counting.

4) NWPU-CROWD
The NWPU-Crowd dataset was proposed in 2020 [73]. The dataset consists of 5109 images (3109 for training, 500 for validation, and 1500 for testing) and 2,133,375 annotated instances. NWPU-Crowd introduces 351 negative samples, i.e., images without people whose texture characteristics are similar to those of crowded scenes. The average resolution of the images in NWPU-Crowd is 2191 × 3209, higher than that of the other datasets; the maximum image size is 4028 × 19,044. The crowd population in this dataset varies from 0 to 20,033, which covers a wider range of variation and thus makes the crowd counting task more challenging.

B. IMPLEMENTATION DETAILS
Our method adopts the first ten layers of VGG16 [33] as the feature extractor and adaptive moment estimation (Adam) [74] as the optimizer. During training, the batch size is set to 1 and the learning rate to 1e-6. All experiments are performed on an NVIDIA GTX 1080Ti GPU and an Intel(R) Core(TM) i9-7900 CPU. The data preprocessing settings follow the C-3 framework [75].

1) THE EFFECTIVENESS OF SAM
In this section, we analyze the effectiveness of SAM. The comparison experiments are as follows: all experiments adopt the first ten layers of VGG16 as the encoder, and the decoder is designed with and without SAM. In the case without SAM, the SAM part of SRNet is replaced with corresponding convolution layers. As shown in the results of (1) Conv w. Up and (3) SAM w. Up in Table 1, compared to plain convolution layers (kernel size = 3), SAM clearly improves performance owing to its extraction of multi-scale information.
The effect of the PE submodule is also explored. As shown in Table 1, the results of (2) SAM (w/o PE) w. Up versus (3) SAM w. Up, as well as the results of (4) SAM (w/o PE) w. PAM versus (5) SAM w. PAM, indicate that the method with PE in SAM outperforms the method without PE in SAM.

2) THE EFFECTIVENESS OF PAM
In this section, we perform comparison experiments to analyze the effectiveness of PAM. Specifically, the comparison experiment is designed as follows: all experiments adopt the first ten layers of VGG16 as the encoder. The structure of the decoder is designed with PAM and without PAM. PAM includes both the pixel attention submodule and bilinear interpolation upsampling layer, and the other experiment uses only the bilinear interpolation upsampling layer. The other parts are the same as those in SRNet. As seen from Table 1, the experimental result with PAM (MAE = 66.0 and MSE = 96.7) is better than that with bilinear interpolation upsampling (MAE = 70.1 and MSE = 106.5) due to the addition of the pixel attention submodule. The pixel-level semantic information is enhanced in the experiment with PAM, which makes the crowd counting result of the congested area more accurate. The overall performance is thereby improved.
In addition, we explore the effect of pixel attention in PAM. We replace pixel attention with channel attention and spatial attention, and other conditions remain unchanged. As shown in Table 2, the results show that pixel attention is effective at dense crowd counting tasks and surpasses the effects of the channel attention module and spatial attention module.
As shown in Table 1, the results of (1) Conv w. Up, (2) SAM (w/o PE) w. Up, (3) SAM w. Up, and (5) SAM w. PAM indicate that performance improves step by step, which proves the effectiveness of each module in SRNet; the best performance is achieved when SAM and PAM are both employed. As shown in Table 3, the best performance is achieved when the number of SAM w. PAM components is three. (The experiments with one set and two sets of SAM w. PAM components replace the remaining SAM w. PAM components with convolution layers that have the same number of layers and channels. In the experiment with four sets of SAM w. PAM components, the PAM upsampling factor in the fourth set is set to 1.) Therefore, the design of SRNet is promising for dense crowd counting.
Table 4 shows the results of thirteen mainstream methods (MCNN [7], Switching-CNN [8], IG-CNN [76], CSRNet [9], TEDNet [15], DUBNet [77], PACNN [12], PSODC [78], GSP [11], MSCANet [79], PACNN+CSRNet [12], CAN [10], ASNet [38]) and SRNet (ours) on the ShanghaiTech Part_A dataset, among which eleven are pretrained methods and ten are multi-scale methods. Compared with the other pretrained methods, SRNet achieves the second-best MSE score of 96.7. Compared with the other multi-scale methods, SRNet achieves the second-best result, and SRNet achieves the best MSE result among all methods that use the L2 loss.
Table 5 reports the performance of fifteen other mainstream algorithms and SRNet, among which ten are pretrained methods and thirteen are multi-scale methods. Compared with the other pretrained methods, SRNet achieves the fifth-best MAE score of 184.1 and the best MSE score.
Table 6 shows the performance of the proposed SRNet model and nine other algorithms. Among the ten methods, five are pretrained methods, and seven are multi-scale methods. SRNet outperforms the other pretrained methods on MSE.
Compared with the other multi-scale methods, our model achieves better results (an MAE of 108 and an MSE of 177.5). From the above comparisons, we conclude that SRNet attains a competitive result on UCF-QNRF: second place with respect to MAE and the best result with respect to MSE.
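The MAE and MSE scores quoted throughout these comparisons follow the standard convention in crowd counting, where the reported "MSE" is actually the root of the mean squared counting error. A short sketch of both metrics:

```python
import numpy as np

def counting_metrics(pred_counts, gt_counts):
    # Standard crowd counting metrics over a set of test images:
    # MAE  = mean absolute error of the per-image counts,
    # MSE  = root mean squared error (RMSE, by the field's convention).
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    mae = np.abs(pred - gt).mean()
    mse = np.sqrt(((pred - gt) ** 2).mean())
    return mae, mse
```

Because MSE squares the errors before averaging, it penalizes large per-image failures more heavily than MAE, which is why a method can rank differently on the two metrics, as SRNet does on UCF-QNRF.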

G. COMPARISON OF THE PARAMETERS AND FLOPs
Current state-of-the-art network models usually have complex designs and large numbers of parameters. Our network model performs similarly to the current state-of-the-art methods with significantly fewer parameters. We compare the parameters, FLOPs and counting performance on the dense crowd dataset UCF_CC_50, as shown in Table 7. The input tensor size is 1 × 3 × 224 × 224 when measuring the FLOPs. The counting performance of our method is similar to that of other state-of-the-art approaches while requiring the fewest parameters (9255k) and FLOPs (17.98 G), which verifies that the satisfactory performance of our model is due not to an increase in the number of parameters but to the reasonable design of the network structure.
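As background for how the Table 7 numbers are typically derived (the paper's exact profiling tool is not specified here, so this is an illustrative sketch), the parameter and multiply-accumulate (MAC) cost of one convolution layer follows directly from its shape; note that some papers report FLOPs as MACs and others as 2 × MACs:

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out, bias=True):
    # Parameters: one k*k*c_in kernel (plus optional bias) per output channel.
    params = c_out * (c_in * k * k + (1 if bias else 0))
    # MACs: each output pixel of each output channel needs k*k*c_in
    # multiply-accumulates.
    macs = c_in * k * k * c_out * h_out * w_out
    return params, macs

# Example: the first layer of the VGG16 encoder on a 224x224 RGB input
# (3x3 kernel, stride 1, 'same' padding, so the output stays 224x224).
p, m = conv2d_cost(3, 64, 3, 224, 224)
# p = 1792 parameters, m = 86,704,128 MACs
```

Summing such per-layer costs over the encoder and decoder gives the totals reported in Table 7.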

H. LOCALIZATION RESULTS ON UCF-QNRF AND NWPU-CROWD
To obtain the locations of each person, we conduct crowd localization experiments on UCF-QNRF and NWPU-Crowd. For the UCF-QNRF dataset, we calculate the values of the evaluation metrics (Pre, Rec and F1) with a radius of 0-100 pixels and regard the average of these results as the final value [14]. For the NWPU-Crowd dataset, we follow the configuration of [73], using the inscribed circle and the circumscribed circle of the human head ground-truth box as the radii of the acceptable domain when calculating the evaluation metrics; in Table 9, '(s)' denotes the small threshold (the radius of the circle internally tangent to the ground-truth box), and '(l)' denotes the large threshold (the circumcircle radius of the ground-truth box). Fig. 3 illustrates partial localization results on the UCF-QNRF dataset, and Fig. 4 illustrates partial localization results on the NWPU-Crowd dataset. In Table 8, we compare our results with those of the other methods on UCF-QNRF; our SRNet model achieves the best result in terms of the F1 metric. Table 9 shows the localization performance of the proposed SRNet model and several mainstream algorithms on NWPU-Crowd, where SRNet again performs best in terms of F1. These localization results achieve satisfactory accuracy and prove that SRNet offers an additional advantage in localization tasks.
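The UCF-QNRF protocol above can be sketched as follows. This is a simplified illustration under stated assumptions: predictions are matched to ground-truth head points greedily rather than by optimal (Hungarian) assignment, and the radii are swept over 1-100 pixels before averaging, following the protocol of [14]:

```python
import numpy as np

def match_points(pred, gt, radius):
    # One-to-one matching: a prediction is a true positive if an
    # unmatched ground-truth point lies within `radius` pixels.
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    if len(pred) == 0 or len(gt) == 0:
        return 0
    dist = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=2)
    tp, used = 0, set()
    for i in range(len(pred)):
        for j in np.argsort(dist[i]):       # nearest unused ground truth
            if j not in used:
                if dist[i, j] <= radius:
                    used.add(int(j))
                    tp += 1
                break
    return tp

def localization_metrics(pred, gt, radii=range(1, 101)):
    # Average Pre, Rec and F1 over all pixel-radius thresholds.
    pres, recs, f1s = [], [], []
    for r in radii:
        tp = match_points(pred, gt, r)
        pre = tp / max(len(pred), 1)
        rec = tp / max(len(gt), 1)
        f1 = 2 * pre * rec / max(pre + rec, 1e-12)
        pres.append(pre); recs.append(rec); f1s.append(f1)
    return np.mean(pres), np.mean(recs), np.mean(f1s)
```

The NWPU-Crowd protocol differs only in the threshold: instead of a fixed pixel sweep, each ground-truth head uses its own box-derived radius (the '(s)' or '(l)' threshold of Table 9).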

V. CONCLUSION
In this paper, we propose a scale-aware representation learning network (SRNet) for dense crowd counting; SRNet provides a new method for extracting and fusing features of different scales. SRNet includes multiple SAM-PAM components, where SAM is a scale-aware feature learning module and PAM is a pixel-aware upsampling module. SAM extracts multi-scale features through convolutions and dilated convolutions of different scales, and it employs 'split' and 'concat' operations to fuse the multi-scale features, which reduces feature redundancy. The extracted features are then enhanced at the pixel level: PAM uses a pixel attention module to strengthen semantic information and bilinear interpolation upsampling to improve the resolution. Extensive experiments have demonstrated that the proposed SRNet model achieves competitive counting performance in dense crowd scenes and offers a better trade-off between the number of parameters and performance. Since SRNet reduces the number of parameters and thus the computational cost, its application to crowd counting in video sequences can be explored in the future.