U-ASD Net: Supervised Crowd Counting Based on Semantic Segmentation and Adaptive Scenario Discovery

Crowd counting is one of the most significant and challenging issues in the computer vision and deep learning communities, with applications in a wide variety of tasks. While the problem is well studied, managing perspective distortions and scale variations remains an open challenge, and how well these problems are resolved has a huge impact on predicting a high-quality crowd density map. In this study, a hybrid and modified deep neural network (U-ASD Net), based on U-Net and adaptive scenario discovery (ASD), is proposed to achieve precise and effective crowd counting. The U part is produced by replacing the nearest upsampling in the decoder of U-Net with max-unpooling. This modification provides better crowd counting performance by capturing more spatial information. The max-unpooling layers upsample the feature maps based on the max locations retained from the downsampling process. The ASD part is constructed with three light pathways, two of which are learned to reflect various crowd densities and define the appropriate geometric configuration using different receptive field sizes. The third pathway is an adaptation path, which implicitly discovers and models complex scenarios to recalibrate pathway-wise responses adaptively. ASD has no additional branches, to avoid increasing the complexity. The designed model is end-to-end trainable. This integration provides an effective model for counting crowds in both dense and sparse datasets. It also predicts a high-quality density map with a high structural similarity index and a high peak signal-to-noise ratio. Several comprehensive experiments on four popular crowd counting datasets have been carried out to demonstrate the proposed method's promising performance compared with other state-of-the-art approaches. Furthermore, a new dataset with manual annotations, called Haramain, consisting of three different scenes with different densities, is introduced and used for evaluating the U-ASD Net.


I. INTRODUCTION
In situations involving crowd movements, such as religious gatherings, sporting events, and public protests, crowd analysis and management are critical and of supreme significance in avoiding stampedes and saving lives. Crowd analysis can be a powerful tool in these situations for early prediction of crowding and for selecting the appropriate measures for crowd control and management, thus avoiding any disaster before it happens. The variety of crowd management applications has prompted and inspired researchers from different disciplines to propose innovative and efficient methods for crowd analysis and related tasks, including counting [1], [2], behavior analysis [3], tracking [4], density estimation [5], [6], anomaly detection [3], [7], [8], scene understanding [9], segmentation [10]-[12], and mobile crowd sensing [13], [14]. Among these, density estimation and crowd counting are critical elements that serve as the foundation for various purposes. Crowd surveillance and analysis are not trivial problems and bring along different obstructions, such as occlusion, background noise, and changes in lighting, scale, people distribution, and perspective. Researchers in this area have come a long way in tackling some of these issues. Current crowd scene analysis methods range from straightforward crowd counting, which predicts the total number of individuals in a scene, to density map estimation, which shows crowd distribution characteristics. The density map assists in obtaining more precise and intensive details, which may be crucial in making appropriate decisions, especially in risky scenarios. Notwithstanding, producing precise distribution models is very challenging. One significant difficulty stems from the way the estimation is performed.
Because the produced density values are based on pixel-by-pixel estimation, the generated density maps should have spatial coherence to demonstrate the smooth transition between adjacent pixels. This is challenging because of the wide range of crowd density values. As shown in Fig. 1, some samples consist of hundreds of pedestrians, while other samples contain only a few. It is difficult for a single CNN to deal with the full range of crowd densities. To tackle this challenge, multicolumn CNN architectures have been introduced widely in the literature. Such architectures can have different parallel CNN branches with various receptive field sizes. In this kind of architecture, a network branch with smaller receptive fields can effectively address high crowd density images, whereas a network branch with larger receptive fields can address low crowd density images well [15]. In addition, the task is hard to complete due to the variety of views, which include infrequent crowd clusters and multiple camera viewpoints, particularly when using conventional methods without deep neural networks.
The proposed U-ASD model is inspired by U-Net [10] with an additional adaptive scenario discovery (ASD) component [16]. The model is built on an encoder-decoder layout with three light parallel branches. The encoder part of the U-Net is replaced by VGG16-bn [17]. In addition, the output of the U-Net encoder is used as a backbone for the branches that represent the adaptive scenario discovery. Adding the ASD as a binary classifier improves the model's crowd counting efficiency. The ground-truth attention map is fed into the adaptive scenario discovery branches, and the output is combined with U-Net using a combined loss.
To sum up, the following contributions are made:
• A hybrid and modified network structure capable of predicting a precise density map at half the size (resolution) of the input is proposed.
• A modified U-Net is produced by replacing the nearest upsampling with max-unpooling. The max-unpooling layers are proposed for U-Net to extract more spatial information, and thus better crowd counting accuracy is achieved. To the best of our knowledge, max-unpooling has not yet been utilized in the literature in a U-Net architecture for crowd counting. In this study, a comparison between the nearest upsampling and max-unpooling in terms of counting accuracy, number of parameters, and training runtime is presented in Section V.
• A new dataset, dubbed Haramain, with its manual annotations is presented. The Haramain dataset consists of three different scenes with different densities.
• The efficacy of the proposed U-ASD Net is tested on four challenging datasets for crowd counting. Interestingly, it surpasses state-of-the-art approaches, according to our findings.
The other sections of this article are arranged as follows. Section II highlights some critical and timely relevant research. The proposed model architecture is presented in Section III. In Section IV, the evaluation metrics, experimental setup, and qualitative and quantitative findings are presented. Section V presents discussions and analyses of the findings. The proposed work is concluded in Section VI.

II. RELATED WORK
Significant CNN-based crowd counting methods and related density map prediction methods are reviewed in this section. Furthermore, since the proposed U-ASD Net uses segmentation and spatial CNNs to address the crowd counting task, related research on those methods is briefly reviewed.

A. PATCH-BASED AND IMAGE-BASED METHODS
Because of their effectiveness in capturing local features and generating a huge number of training samples, patch-based methods have been utilized in many works [4], [18]. Patch-based methods train a model on image crops of various sizes and estimate over sliding windows during the testing stage. Convolutional neural networks (CNNs) have been utilized in several methods for crowd counting purposes [19]-[21]. Zhang et al. [19] developed a deep CNN for crowd counting and estimating the level of crowd density. Li et al. [20] suggested using the VGG16 encoder as well as dilated convolutional layers as a decoder to assemble contextual features at a variety of scales. Cao et al. [21] introduced a scale aggregation network that extracts multi-scale features using an encoder with scale aggregation modules and estimates high-resolution density maps using a decoder with a collection of transposed convolutions. Fu et al. [18] suggested categorizing the image into five diverse classes, where each class represents a different density intensity, rather than estimating density maps. Layered boosting and selective sampling procedures were presented by Walach and Wolf [22]. In layered boosting, CNN layers are added to the model in an iterative manner so that each new layer learns to predict the residual error of the previous estimation. Kumagai et al. [23] introduced the Mixture of CNNs (MoCNN) model, which comprises a combination of expert CNNs as well as a gating CNN. The gating CNN assigns a probability to each expert CNN, and the expert CNNs estimate the crowd count. The weighted average count over all the expert CNNs is the final output crowd count. According to Sam et al. [24], better output is obtained by training regressors with a specific group of training patches, exploiting variation in crowd density. Moreover, Sam et al. put forward a switching CNN that intelligently determines the best regressor suited for each input patch. Because patch-based methods seem unable to represent global contextual information, whole image-based methods have been the focus of several works [5], [25], [26]. Zhang et al. [19] put forward a multi-column CNN that processes the input image at adjustable resolution and uses every column to comply with various scales. Sheng et al. [24] introduced a novel image representation that integrates semantic attributes and spatial cues to enhance the discriminative power of feature representations. Marsden et al. [27] proposed incorporating scale into models with fewer parameters and put forward a single-column fully convolutional network (FCN) to estimate the density map. A cascaded CNN architecture (Cascaded-MTL) was proposed by Sindagi and Patel [26]. The Cascaded-MTL integrates learning of a high-level prior to lift the performance of the density estimation.

FIGURE 1. Crowd counting can be posed as CNN-based density estimation, but because of the large fluctuation in densities across pixels in different samples, the problem is challenging for a single CNN. This figure presents two images from the ShanghaiTech dataset with substantially varying crowd densities.

B. SEGMENTATION AND SPATIAL CNN
Utilizing pixel-wise regression, density estimation-based techniques can predict a density estimation map and thus localize the crowd. The crowd count is then obtained by computing the integral of the density map [28]. To create density maps that keep the same spatial size as the inputs, or half that size, encoder-decoder architectures are widely used [10]-[12], [29], [30]. In 2015, U-Net was developed by Ronneberger et al. [10] for biomedical image segmentation and was then extensively employed for image segmentation in many other fields with different encoders such as ResNet, Inception, and DenseNet modules. By applying skip connections between the corresponding encoding and decoding path blocks, U-Net forms a symmetric network structure in which convolutional features from activations of the encoder are stacked onto the decoder parts. In [31], Shen et al. made use of a U-Net structure with an adversarial loss to produce high-quality density maps. Huynh et al. [32] put forward an inception U-Net-based multi-task learning method for crowd counting, density map generation, and density level classification. In [30], the authors proposed a U-Net-like architecture called W-Net, which applies a reinforcement branch to improve the crowd counting accuracy and converge faster.

III. PROPOSED METHOD
The workflow of the proposed method is shown in Fig. 2. The estimated maps from the modified U-Net and ASD Net are multiplied to generate the final crowd density map. The overall U-ASD network architecture is illustrated in Fig. 3. The complete design of the U-ASD model is described in detail in subsection A, followed by the specifics of its implementation.

A. U-ASD OVERALL ARCHITECTURE
The proposed U-ASD model is built on U-Net [10] and ASD [16]. The model is built on an encoder-decoder layout with three light parallel branches. The U-Net architecture comprises both an "encoder part" to capture context and a "symmetric decoder part" to provide precise localization and estimate the density map. The backbone is utilized to extract multi-scale visual features from the input image sequences. Following [30], a pre-trained VGG16-bn model, a variant of VGG16 with batch normalization, is utilized. The first VGG16-bn layers, up to the fourth max-pooling, are used as the backbone, replacing the encoder block of the U-Net. Table 1 presents the configuration of the proposed U-ASD model. The ASD part assists the network in converging faster and providing better performance with lower errors. The ASD Net's output density map is 1/16 the size of the input image. To fuse the U-Net's output with the ASD Net's output, a nearest-neighbor upsampling layer (US) is introduced, as shown in Fig. 3, to upsample the output.
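The fusion step described above can be sketched in PyTorch as follows. This is an illustrative snippet, not the authors' code: the function name `fuse_outputs` and the element-wise multiplication of the two maps (stated in Section III and Fig. 2) are our rendering, and the tensor shapes assume the 576 × 768 training resolution used in the experiments.

```python
import torch
import torch.nn.functional as F

def fuse_outputs(unet_map: torch.Tensor, asd_map: torch.Tensor) -> torch.Tensor:
    """Fuse the U-Net and ASD density maps by element-wise multiplication.

    unet_map: (N, 1, H/2, W/2) density map from the modified U-Net.
    asd_map:  (N, 1, H/16, W/16) density map from the ASD branch.
    The ASD map is upsampled by 8x with nearest-neighbor interpolation so
    both maps share the same spatial size before multiplication.
    """
    asd_up = F.interpolate(asd_map, scale_factor=8, mode="nearest")
    return unet_map * asd_up

# Example with an input of 576 x 768 (the training resolution in the paper):
unet_out = torch.rand(1, 1, 288, 384)  # half the input resolution
asd_out = torch.rand(1, 1, 36, 48)     # 1/16 of the input resolution
fused = fuse_outputs(unet_out, asd_out)
```

The fused map keeps the U-Net branch's half-input resolution, matching the contribution statement that the final density map is half the size of the input.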

B. DENSITY MAP ESTIMATION
Semantic segmentation and density map estimation are classification and pixel-wise regression problems, respectively. Accordingly, numerous studies in crowd counting adopt concepts and hypotheses from semantic segmentation. Ronneberger et al. [10] designed U-Net (named after its U-shaped architecture) to concentrate on the pixel-wise classification of an image sequence. U-Net can focus on low-level abstract features (extracted from the first convolutional layers of the encoder part) and high-level semantic abstraction features (extracted from the decoder part's final layers). In the proposed U-Net, max-unpooling operations utilizing the memorized max-pooling indices from the corresponding encoder layer replace the nearest upsampling in the U-Net structure. Further details about the U-Net are given in subsection E.
Kang et al. [4] investigated the generated maps produced by density estimation approaches for crowd analysis applications such as detection, counting, and tracking. They investigated the performance of those applications in great detail when employing full-resolution density maps. Their findings revealed that full resolution density maps enhanced the effectiveness of localization tasks, including tracking and detection. Furthermore, they mentioned that good counting accuracy does not always necessitate full-resolution density maps, and adopting reduced-resolution maps can speed up the predictions while maintaining good counting performance as in [19] and [25]. Because of downsampling strides in the convolution layers and the pooling layers, most existing CNN algorithms normally create density maps with a resolution lower than the source images.

C. CLASSIFICATION VS. REGRESSION FOR COUNTING
As is well known, the network output of a CNN-based classification model is a vector of the same size as the number of classes. The confidence score that the input image belongs to the i-th class is expressed by the i-th element of the vector. The final classification result is chosen according to the index with the highest confidence score during the testing stage. For most classification problems, the softmax loss is extensively utilized [33]. Taking the human count as the class index, on the other hand, is not appropriate for crowd counting problems. The difference between the ground-truth map and the predicted map can be better retained in the proposed U-ASD model while determining the estimation error. Such information is extremely useful for more precisely optimizing the CNN weights during the back-propagation stage. To allow the entire model to implicitly detect all crowd scenarios and respond to varied crowd images in a precise way, two types of architectures are used in our model: U-Net and ASD Net, which are explained in the next subsections. For each architecture, a different loss function is defined. For the modified U-Net, the 2-D pixel-wise mean square error (MSE) loss is utilized for the density regression task, which can be defined as in Equation 1:

L_mse = (1/n) Σ_{i=1}^{n} (d_g(i) − d_p(i))^2   (1)

where d_g and d_p refer to the ground-truth map and the predicted density map, respectively, and n is the total number of pixels in each.
In the proposed U-ASD model, the main aim of training the ASD is to minimize the binary cross-entropy (BCE) loss, which measures the reconstruction error and is defined as follows [30], [34], [35]:

L_bce(g, p) = −(1/m) Σ_{i=1}^{m} [g_i log(p_i) + (1 − g_i) log(1 − p_i)]

where g_i ∈ {0, 1} is the ground-truth attention map with two classes, background ('0') and foreground ('1'), p_i is the predicted attention map, and m is the total number of pixels. In other words, the predicted foreground mask is compared to the ground-truth map using the binary cross-entropy error function, and a low value of L_bce(g, p) means better accuracy.
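The two loss terms can be sketched with PyTorch's built-in criteria. This is an illustrative snippet, not the authors' training code: the tensor shapes are placeholders, and `nn.BCELoss` assumes the predicted attention map has already passed through a sigmoid.

```python
import torch
import torch.nn as nn

# Per-pixel losses for the two heads: MSE for the U-Net density regression
# (Eq. 1) and BCE for the ASD foreground/background attention prediction.
mse_loss = nn.MSELoss()  # averages the squared error over all n pixels
bce_loss = nn.BCELoss()  # expects probabilities in (0, 1)

d_p = torch.rand(1, 1, 64, 64)                # predicted density map (placeholder)
d_g = torch.rand(1, 1, 64, 64)                # ground-truth density map
p = torch.rand(1, 1, 64, 64)                  # predicted attention map (post-sigmoid)
g = (torch.rand(1, 1, 64, 64) > 0.5).float()  # binary ground-truth attention map

loss = bce_loss(p, g) + mse_loss(d_p, d_g)    # unweighted combination
```

In the actual model these two terms are balanced with weights λ_1 and λ_2, as described in the experimental setup.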

D. POOLING AND UNPOOLING
To lower the size of the representation and make it more manageable, a pooling layer is employed. It processes the input and downsamples it spatially without affecting the depth [36]; the input and output depths stay the same. Unpooling is used to achieve upsampling in the network. For density map estimation, precise pixel prediction is required to acquire accurate counting. If naive upsampling is utilized, the feature map will be heterogeneous because of the loss of spatial information produced by max-pooling from the low-resolution image: after max-pooling, no information remains regarding the locations of the feature activations within each receptive field. With max-unpooling, the maxima locations inside each pooling zone are captured in a set of switch variables, stored in a contiguous array when max-pooling is applied. These switches are utilized in the corresponding max-unpooling to place the signal from the present feature map into the relevant locations of the upsampled feature map. Therefore, finer detail can be recovered, preserving the spatial information lost during max-pooling.
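The switch mechanism can be demonstrated with PyTorch's paired pooling/unpooling operators. This is an illustrative snippet, not part of the proposed model: `max_pool2d` with `return_indices=True` records the flat location of each maximum (the "switches"), and `max_unpool2d` places each value back at its remembered position, filling the rest with zeros.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])

# Max-pooling with return_indices=True keeps the switches: the location of
# each maximum inside its 2x2 pooling window.
pooled, indices = F.max_pool2d(x, kernel_size=2, return_indices=True)

# Max-unpooling restores each pooled value to its original spatial location
# in the upsampled map; all other positions are zero.
unpooled = F.max_unpool2d(pooled, indices, kernel_size=2)
```

Here `pooled` contains the four window maxima (6, 8, 14, 16), and `unpooled` is a 4 × 4 map that is zero everywhere except at the positions those maxima originally occupied, which is why spatial detail survives the round trip.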
E. MODIFIED U-NET

1) ENCODER PART
Only the first layers of the pre-trained VGG16-bn network are used as the encoder for U-Net to generate the multi-scale feature maps; the fully connected layers for the classification process are excluded. Following the U-Net structure [10], the feature maps (FMs) resulting from the encoder part (FM_1, FM_2, FM_3, and FM_4, shown in Fig. 3) are employed as inputs to the decoder part.

2) DECODER PART
The decoder part is illustrated in Fig. 4. First, the FM_4 output and its index Idx_4 are used to upscale the input using max-unpooling, and the output of FM_3 is then concatenated with it. After that, this concatenated input is passed to Block 1, shown in Fig. 4. Block 1 includes a 2× max-unpool and two convolutional layers of size 1 × 1 × 256 and 3 × 3 × 256, respectively, each followed by batch normalization (BN) and a rectified linear unit (ReLU). The output of this block is upscaled using a 2× max-unpool and concatenated with the output of FM_2. Similarly, the process of increasing the size is reiterated prior to feeding Block 2 (with the same architecture as Block 1 but with a different channel size). Finally, another upscaling and concatenation from Block 2 are performed, and Block 3 generates the final feature map. In the training phase, the loss function of the U-Net is the 2-D MSE loss defined in Equation 1.
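A decoder stage of this kind can be sketched as follows. This is a hedged sketch rather than the exact architecture: the class name, the `in_ch` value, and the placement of the unpool and concatenation inside `forward` are our assumptions based on the description of Block 1 above; only the 1 × 1 and 3 × 3 convolutions with 256 channels, BN, and ReLU are stated in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Sketch of one decoder stage: 2x max-unpool with the encoder's
    memorized indices, concatenation with the encoder skip feature map,
    then a 1x1 conv and a 3x3 conv (256 channels each), each followed by
    batch normalization and ReLU, as described for Block 1."""

    def __init__(self, in_ch: int, out_ch: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip, indices):
        # 2x max-unpool using the encoder's pooling indices, then
        # concatenate with the skip feature map before the convolutions.
        x = F.max_unpool2d(x, indices, kernel_size=2)
        x = torch.cat([x, skip], dim=1)
        return self.body(x)
```

Usage mirrors the FM_4/Idx_4 step: the pooled feature map, its indices, and the higher-resolution skip map (e.g. FM_3) are passed together, and the output has the skip map's spatial size.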

F. ASD NET
Recent papers [30], [37]-[39] have utilized VGG16-bn for crowd counting, and the proposed models of these papers achieved high performance. Therefore, following these studies, VGG16-bn is used as the backbone for our model instead of the VGG16 used in the original ASD Net. After the backbone, the ASD part incorporates three light parallel paths, as in Fig. 3. The first path, B_1, is intended to address sparse crowds. It contains a deconvolutional layer (DC), which upscales the inputs. After the DC, there are five convolutional layers with larger receptive fields followed by max-pooling. Fig. 5 presents the details of the structure of convolutional layers group 1 (CLs1) in the B_1 pathway. The second pathway, B_2, is intended for dense crowds. The structure of convolutional layers group 2 (CLs2) in B_2 is shown in Fig. 6. Both the B_1 and B_2 pathways are relative and can each estimate a density map. For fusing the density maps, a dynamic weighting method named adaptation discovery is used. Adaptation discovery is a process that permits the network to carry out feature recalibration for the weights of the B_1 and B_2 branches, through which it learns to utilize global information to selectively emphasize informative features while suppressing less helpful ones. B_3 in Fig. 3 presents the details of this process. It contains a global average pooling (GAP) layer and two fully connected layers (FCs) followed by ReLU and sigmoid normalization. The GAP calculates the global average value of each channel. The multi-layer feature map M extracted by the convolutional layers of the U-Net encoder is the input to the GAP. Its dimensions are h × w × c, where h, w, and c denote the height, width, and number of channels, respectively. M_c(i, j) is the element at location (i, j) in the c-th channel. The output of the GAP is 1 × 1 × c.
To capture the interdependencies between channels, two fully connected layers followed by a sigmoid activation function, which are not shown in Fig. 3, are added after the GAP. The first FC layer reduces the dimension from c to c/16, and the second FC reduces the dimension from c/16 to c/32. An initial response w is obtained after the sigmoid function; w adaptively recalibrates the weights of the B_1 and B_2 pathways. Thus, w is normalized into the interval [0, 0.5] [16], [40] and is used to weight the outputs of the B_1 and B_2 paths.

IV. EXPERIMENTS
The evaluation metrics and experimental details are initially addressed in this section. The results of the proposed U-ASD Net are then reported and analyzed on five challenging crowd counting datasets.

A. EVALUATION METRICS
The counting accuracy of the CNN-based crowd counting networks can be measured by mean absolute error (MAE), mean squared error (MSE), and the resolution of the density map [33]. Further details are explained in the following paragraphs.
• The most well-known evaluation metrics for crowd counting methods are the MAE and MSE, which can be described as follows [41]-[43]:

MAE = (1/N) Σ_{i=1}^{N} |c_i − ĉ_i|

MSE = sqrt( (1/N) Σ_{i=1}^{N} (c_i − ĉ_i)^2 )

where N represents the total number of patterns in the test set, c_i is the count label, and ĉ_i is the predicted count value for the i-th test pattern. The MAE metric represents the precision of the estimated count, and the MSE metric is a measure of the robustness of counting.
• The peak signal-to-noise ratio (PSNR), computed from the mean square error between the predicted density map and its ground truth over all pixels, is preferred for determining the accuracy of the predicted map. Mathieu et al. [44] argued that the PSNR is a better metric for assessing quality. The PSNR is defined as follows [45], [46]:

PSNR = 10 · log_10( max_I^2 / ((1/N) Σ_{i=1}^{N} (M_g(i) − M_p(i))^2) )

where M_g and M_p refer to the ground-truth and predicted density map images, max_I is the highest possible value of the image intensities, and N denotes the total number of pixels in the map image. Generally, a higher PSNR value indicates higher image quality.
• The structural similarity index (SSIM) is frequently utilized to assess the quality of the estimated density map [47]. The SSIM estimates image similarity based on contrast, structure, and luminance, and is calculated by multiplying the three corresponding terms. The SSIM value lies in the range [−1, 1]; the higher the SSIM value, the lower the distortion. The SSIM formula is defined as follows [48]:

SSIM(g, p) = [l(g, p)]^α · [c(g, p)]^β · [s(g, p)]^γ

where l, c, and s are the luminance, contrast, and structure terms, and µ_g, µ_p, σ_g, σ_p, and σ_gp are the local means, standard deviations, and cross-covariance for the ground-truth density (g) and predicted density (p) maps, respectively. If α = β = γ = 1 and C_3 = C_2/2, the SSIM can be written as:

SSIM(g, p) = ((2µ_g µ_p + C_1)(2σ_gp + C_2)) / ((µ_g^2 + µ_p^2 + C_1)(σ_g^2 + σ_p^2 + C_2))
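The counting metrics above can be computed with a few lines of NumPy. This is an illustrative sketch; note that the "MSE" reported in the crowd counting literature is the root of the mean squared count error, and the PSNR here assumes density maps already scaled so that max_I = 1.

```python
import numpy as np

def mae(c_true, c_pred):
    """Mean absolute count error over the N test patterns (precision)."""
    return np.mean(np.abs(np.asarray(c_true) - np.asarray(c_pred)))

def mse(c_true, c_pred):
    """Root of the mean squared count error -- the 'MSE' reported in the
    crowd counting literature (a robustness measure)."""
    return np.sqrt(np.mean((np.asarray(c_true) - np.asarray(c_pred)) ** 2))

def psnr(map_true, map_pred, max_i=1.0):
    """Peak signal-to-noise ratio between two density maps; higher is better."""
    err = np.mean((map_true - map_pred) ** 2)
    return 10.0 * np.log10(max_i ** 2 / err)
```

For example, with true counts [100, 250, 80] and predictions [110, 240, 85], the MAE is 25/3 ≈ 8.33 and the MSE is sqrt(75) ≈ 8.66.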

B. EXPERIMENTAL SETUP
The U-ASD was tested on three image crowd counting benchmarks (ShanghaiTech Part A, ShanghaiTech Part B, and UCF CC 50) and three video crowd counting benchmarks (UCSD, Mall, and Haramain). Fig. 7 depicts some of their typical scenes. Table 2 lists the basic statistics of each dataset and shows the total number of people in each dataset. As shown in Table 2, the datasets have varying crowd densities, and ShanghaiTech Part A, ShanghaiTech Part B, and UCF CC 50 are highly imbalanced. The training and evaluation were carried out in Python using PyTorch on a Tesla V100 GPU. Following [42], the original training images and frames, which have different resolutions as described in Table 2, are first resized to a resolution of 576 × 768, and the ground truths are formed at the same resolution.
1) Ground-truth generation: Since CNN-based methods for crowd counting require continuous data while the available ground-truth information is discrete [16], a conversion process is required to generate the density map and attention map from the discrete key points that represent the head annotations, as shown in Fig. 8.
• Density map generation: To obtain a density map (D_i) for each image in a dataset utilizing the available ground-truth information (labeled people's heads), [25] is followed. The presence of a head at pixel p_i is represented by a delta function δ(p − p_i). This allows the following representation of an image with N labeled heads:

H(p) = Σ_{i=1}^{N} δ(p − p_i)

This function can be convolved with a Gaussian kernel G_σ to transform it into a continuous density function. Thus, the density can be formulated as:

D(p) = Σ_{i=1}^{N} δ(p − p_i) ∗ G_σ(p)

However, if the crowd is assumed to be uniformly distributed around each head, the average distance between a head and its nearest k neighbors gives a reasonable approximation of the geometric distortion (resulting from the perspective effect). Consequently, the spread parameter σ for each individual in the image could be determined based on the size of their head. A kernel with a window size of µ = 15 and a spread parameter of σ = 4 is used in the experiments described in this paper.
• Attention map generation: The attention map (A_i) is generated following the methods in [30], [38] by first generating the density map with a larger spread parameter, σ = 6. Then, a threshold is applied to the corresponding density map. The attention map is obtained as follows:

A_i(p) = 1 if D_i(p) > T, and A_i(p) = 0 otherwise

In our experiments, the threshold is set to T = 0.001, which performed best experimentally. Different threshold settings change the performance, as shown in Table 3.

2) Loss function: The total loss combines the two terms defined in Section III C:

L_Total = λ_1 L_bce + λ_2 L_mse   (12)

where L_mse and L_bce are the loss functions for the U and ASD parts, respectively. λ_1 and λ_2 are parameters used to balance the loss values; the optimum values, chosen empirically, are 20 and 1000, respectively.

3) Qualitative results: Fig. 10 presents the qualitative results of the U-ASD method on different test scenes. The sub-figures for each scene show, respectively, the original image, the ground-truth density map, and the estimated density map. As shown in the sample results from UCSD, Mall, and Haramain H1 in Fig. 10 (d, e, and f), the U-ASD model counts well not only under highly dense crowds but also in sparse scenarios. As crowd density rises, people appear to partially occlude one another, limiting the capacity of classic detection methods and prompting the development of density estimation models. Such situations can be noticed in Fig. 10 (a), (b), (c), (f), (h).
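The ground-truth generation steps in 1) above can be sketched in NumPy. This is a simplified illustration using the fixed kernel (µ = 15, σ = 4 for the density map; σ = 6 and T = 0.001 for the attention map); the function names are ours, and heads very close to the border lose a little kernel mass to cropping in this sketch.

```python
import numpy as np

def gaussian_kernel(size=15, sigma=4.0):
    """2-D Gaussian kernel with window size mu = 15 and spread sigma = 4,
    normalized to sum to 1 so the density map integrates to the head count."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def make_density_map(shape, head_points, size=15, sigma=4.0):
    """Place a normalized Gaussian at each annotated head location
    (a discrete delta convolved with G_sigma)."""
    h, w = shape
    k = gaussian_kernel(size, sigma)
    r = size // 2
    padded = np.zeros((h + 2 * r, w + 2 * r), dtype=np.float64)
    for (y, x) in head_points:
        padded[int(y):int(y) + size, int(x):int(x) + size] += k
    return padded[r:h + r, r:w + r]  # crop back to the image size

def make_attention_map(shape, head_points, threshold=0.001):
    """Threshold a density map built with the larger spread (sigma = 6)
    to obtain the binary foreground/background attention map."""
    density = make_density_map(shape, head_points, sigma=6.0)
    return (density > threshold).astype(np.float32)
```

For a head well inside the image, the density map sums to exactly 1 per annotation, which is what makes the integral of the map equal the crowd count.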
Interestingly, the proposed model can locate these occluded people and count the crowds very well by producing high-quality density maps, thus providing accurate counting. The scenarios in the Mall dataset have strong perspective distortions, which result in significant variations in the scale and appearance of individual objects. Also, the occlusion caused by some potted plants raises the complexity. As seen in Fig. 10 (e), the density map produced by the U-ASD model on the Mall dataset locates the individuals correctly and provides accurate counting.

4) Evaluation details: In the evaluation stage, a patch-based assessment as in [21] and [30] is used. The test images are cropped into patches, generating nine overlapping units for each image. Then, a sliding window is run over the test image during the prediction process. Predictions are determined for each window before being aggregated to obtain the total count in the image.
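One plausible way to form the nine overlapping evaluation patches is a 3 × 3 grid of half-size crops, which is a common convention in the cited works [21], [30]; the exact cropping scheme of those papers may differ, so this is a hedged sketch with an assumed function name.

```python
import numpy as np

def nine_patches(img):
    """Crop an image into 9 overlapping patches: a 3x3 grid of half-size
    windows, where adjacent windows overlap by half a patch. Predictions
    on the patches are later aggregated into the image-level count."""
    h, w = img.shape[:2]
    ph, pw = h // 2, w // 2
    patches = []
    for i in range(3):
        for j in range(3):
            y = i * (h - ph) // 2  # 0, h/4, h/2
            x = j * (w - pw) // 2  # 0, w/4, w/2
            patches.append(img[y:y + ph, x:x + pw])
    return patches
```

Because every pixel is covered by at least one window, aggregating the per-patch predictions (with overlap accounted for) recovers a count for the full image.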

D. UCF CC 50 DATASET EVALUATION
The UCF CC 50 dataset was created by [61], and it covers various views with different perspective distortions. UCF CC 50 consists of only fifty images but has a large number of head annotations (63,074), and the images differ in the number of individuals, ranging from 96 to 4,633 with an average of 1,279. Since this dataset has only fifty images, state-of-the-art approaches utilize the traditional 5-fold cross-validation procedure to assess their methods [19], [20], [42], [61]. Thus, 5-fold cross-validation is also applied to assess the proposed U-ASD method. Fig. 11 illustrates the estimated errors from the 5-fold cross-validation. As shown in Fig. 11, the average MAE and MSE are 232.3 and 217.8, respectively. Table 5 shows that the U-ASD method presents the third-best result in terms of the MAE metric and the best result in terms of the MSE metric, reducing the MSE by about 50.5 compared with the ASANet method.

E. UCSD DATASET EVALUATION
The UCSD dataset [64] consists of video frames of a single scene captured by surveillance cameras at a resolution of 238 × 158, and it includes 2,000 frames. To evaluate the performance of the U-ASD method, the original settings in [64] are followed: image sequences 601-1400 are used as the training set, and the remaining 1,200 image sequences as the testing set. The results for the UCSD dataset are recorded in Table 6. The results of U-ASD are comparable with the state-of-the-art methods, and U-ASD obtained the best MSE with 2.1.

G. HARAMAIN DATASET EVALUATION
The Haramain dataset includes various scenes at the holy haram mosques in Mecca and Al-Madinah. People from all over the globe gather at the holy haram places for the sake of worship. Therefore, maintaining people's comfort while praying is considered a major management goal. More than three million people visit the holy haram in Madinah each year. It covers an area of over 98,000 m² and has 42 multi-door entrances [74]. Consequently, maintaining a smooth flow at all areas and entrances is a challenging task. Estimating the number of people in the crowd scenes helps to smooth the distribution of up to 167,000 people throughout the holy haram at a time.
To help address crowd management in the holy places, the Haramain dataset with its manual annotations is introduced, consisting of three parts for three different scenes. The first and second parts, called H1 and H2, include 70 and 60 image sequences, respectively, from two scenes at the Madinah mosque. The third part, called H3, comprises 60 image sequences from the al-sahn area at the al-haram al-sharif mosque in Mecca, Saudi Arabia, during the pilgrimage season. The resolutions for each part and other details are shown in Table 2. Since the annotation process requires a lot of time, the length of the video clips for this dataset has been limited, and 5-fold cross-validation is applied. Fig. 12 shows the estimated errors from the 5-fold cross-validation. Table 8 shows the results of the proposed U-ASD on the Haramain dataset. As clearly seen, applying the 5-fold cross-validation improves the performance metrics by 8.1 and 6.9 on average for the MAE and MSE metrics, respectively.

TABLE 9. Detailed information on the U-ASD Net and the main state-of-the-art methods on the ShanghaiTech Part A dataset; U-ASD Net* uses the nearest upsampling in the U-Net part.

V. DISCUSSION AND ANALYSIS
The proposed U-ASD model was contrasted with the state-of-the-art methods (Cascaded-MTL [26], Switching-CNN [24], CP-CNN [5], and PCC Net [42]) to demonstrate its superiority. The four main metrics for evaluating density estimation efficiency, MAE, MSE, PSNR, and SSIM, are calculated in Table 9 on ShanghaiTech Part A. As can be observed, U-ASD Net performs best. The integration of U-Net with ASD Net is responsible for this performance, since it allows the whole model to implicitly identify all crowd scenarios and respond to diverse crowd images in a highly scenario-specific manner. U-ASD Net* denotes the variant that uses nearest upsampling in the U-Net part; as is clear from Table 9, the max-unpooling U-ASD provides better counting accuracy (i.e., lower MAE and MSE), which comes at the expense of runtime. The experiments also measure the computational complexity, in terms of the number of parameters and training runtime, and the quality of the estimated density maps. Further details are given in the next subsections.

A. COMPUTATIONAL COMPLEXITY
To reduce the complexity of the U-ASD Net, the VGG16-bn network (except the fully connected layers) is used for the encoder part of the U-Net and as a backbone for the ASD branches. In addition, for simplicity and to avoid adding complexity to the U-ASD Net, the original ASD Net is used without extra layers, except for a nearest-upsample layer added at the output of the net to fuse it with the output from the U-Net. Table 9 includes information on the computational complexity in terms of the number of parameters and execution runtime. Even though Cascaded-MTL [26] is the lightest model among those compared [5], [24], [42], with only 0.12M parameters and a 3 ms runtime, it has the worst estimation performance. During the evaluation phase, U-ASD takes 94 ms to process a 512 × 680 frame from the ShanghaiTech Part A dataset on one Tesla V100 GPU. Since people generally do not move very fast, and not every frame needs to be analyzed, this speed is adequate for many realistic applications [75]. Moreover, among the models pretrained on ImageNet, U-ASD provides a faster execution time than the Switching-CNN and CP-CNN models. Thus, taking into account the performance metrics (MAE, MSE, PSNR, and SSIM) and the number of parameters, the proposed U-ASD Net is very competitive.
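The two quantities reported in Table 9 can be measured as sketched below. The tiny CNN is a placeholder standing in for the actual U-ASD Net, used only to show the parameter-count and timing mechanics on a ShanghaiTech Part A-sized frame:

```python
import time
import torch
import torch.nn as nn

# Placeholder network (NOT U-ASD): two conv layers, just for illustration.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1),
)

# Model size: total number of learnable parameters.
n_params = sum(p.numel() for p in model.parameters())

# Per-frame inference time on a 512 x 680 input.
frame = torch.randn(1, 3, 512, 680)
model.eval()
with torch.no_grad():
    start = time.perf_counter()
    out = model(frame)
    runtime_ms = (time.perf_counter() - start) * 1000
```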

B. QUALITY OF THE PREDICTED DENSITY MAP
To assess the quality of the estimated density maps produced by U-ASD Net, the PSNR and SSIM were computed on the ShanghaiTech Part A and Part B, UCF CC 50, and UCSD datasets for the MCNN [25], CP-CNN [5], CSRNet [20], ADCrowdNet [76], PCC Net [42], and U-ASD methods. Table 10 shows the PSNR and SSIM comparison; clearly, U-ASD preserves the structural fidelity best.
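For reference, PSNR and a simplified SSIM between a predicted density map and its ground truth can be computed as below. This is a single-window (global) SSIM for illustration; published comparisons typically use the 11×11 Gaussian-windowed variant, and the normalization of the maps is an assumption here.

```python
import numpy as np

np.random.seed(0)

def psnr(pred, gt, max_val=1.0):
    # Peak signal-to-noise ratio in dB for maps scaled to [0, max_val].
    mse = np.mean((pred - gt) ** 2)
    if mse == 0:
        return float("inf")
    return 10 * np.log10(max_val ** 2 / mse)

def ssim_global(pred, gt, max_val=1.0):
    # Single-window SSIM over the whole map (simplified; no local windows).
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_x, mu_y = pred.mean(), gt.mean()
    var_x, var_y = pred.var(), gt.var()
    cov = ((pred - mu_x) * (gt - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

gt = np.random.rand(64, 64)                          # toy "density map"
pred = np.clip(gt + 0.05 * np.random.randn(64, 64), 0, 1)
```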
The estimated counts over time for each frame of the ShanghaiTech Part A, ShanghaiTech Part B, UCSD, and Mall datasets, together with their ground truth, are illustrated in Figs. 13, 14, and 15. Interestingly, the predicted counts are almost identical to the ground-truth counts.
C. SELECTION OF THE LOSS WEIGHTS
Comparative experiments on Part A of the ShanghaiTech dataset were conducted to determine the best values of λ1 and λ2 in Equation 12. Fig. 16 (a) illustrates that as the value of λ1 increases, the MAE decreases, with the lowest error obtained at λ1 = 20. The error then increases, since the weight of the Lmse loss becomes too significant in comparison to the Lbce loss. As a result, λ1 is set to 20 in our experiments. Similarly, as shown in Fig. 16 (b), the lowest MAE is obtained when λ2 is set to 1000.
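The weight sweep described above amounts to a simple grid search over candidate values. The sketch below uses a toy convex proxy with its minimum at λ1 = 20, mirroring only the qualitative shape reported for Fig. 16(a); a real sweep would retrain and evaluate the full model at each candidate value.

```python
def evaluate_mae(lam1):
    # Toy stand-in for "train U-ASD with this lambda1, return validation
    # MAE"; the quadratic shape is illustrative, not measured data.
    return 66.0 + 0.01 * (lam1 - 20) ** 2

candidates = [1, 5, 10, 20, 50, 100]
best_lambda1 = min(candidates, key=evaluate_mae)
```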

D. THE PERFORMANCE OF U-ASD NET COMPONENTS
In the conducted experiments, it is noted that training the U-Net without the ASD Net on ShanghaiTech Part B achieved its best MAE at epoch 7, as shown in Fig. 17. After epoch 7, the counting performance of the U-Net degrades drastically and the loss goes up, so the training of the U-Net is stopped early at this point. This was the main reason the ASD Net was introduced as a binary classifier. The ASD Net, when trained independently, counts better than the U-Net, whereas the quality of its estimated density map is lower. The two networks are combined using the combined loss function described in Equation 12 (i.e., BCE loss and MSE loss), and the whole U-ASD Net is trained in an end-to-end fashion. As shown in Table 11, integrating the U and ASD networks increases the counting accuracy and improves the quality of the produced density maps.
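A combined loss of this kind can be sketched as below: a λ1-weighted MSE term on the density map plus a BCE term on the binary-classifier output. This is an illustrative two-term version only; the exact form of Equation 12 (including the role of λ2) is not restated in this section.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class CombinedLoss(nn.Module):
    """Illustrative combined loss: lambda1 * MSE(density) + BCE(class)."""
    def __init__(self, lambda1=20.0):
        super().__init__()
        self.lambda1 = lambda1
        self.mse = nn.MSELoss()
        self.bce = nn.BCELoss()

    def forward(self, d_pred, d_gt, c_pred, c_gt):
        return self.lambda1 * self.mse(d_pred, d_gt) + self.bce(c_pred, c_gt)

loss_fn = CombinedLoss(lambda1=20.0)
d_pred = torch.rand(2, 1, 32, 32)            # predicted density maps
d_gt = torch.rand(2, 1, 32, 32)              # ground-truth density maps
c_pred = torch.sigmoid(torch.randn(2, 1))    # binary-classifier output
c_gt = torch.randint(0, 2, (2, 1)).float()   # binary labels
loss = loss_fn(d_pred, d_gt, c_pred, c_gt)
```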

VI. CONCLUSION
This paper proposes an end-to-end trainable hybrid modified network architecture, named U-ASD Net, which integrates two novel architectures designed for image segmentation and crowd counting. The proposed U-ASD model predicts precise, high-quality density maps at half the resolution of the input. The PSNR and SSIM metrics confirm the superiority of the proposed model in generating high-quality density maps. Moreover, the proposed model helps alleviate the drawbacks of the state-of-the-art methods by handling both sparse and dense crowd scenes efficiently.
In the modified U-Net, the upsampling algorithm is changed from nearest-neighbor interpolation to max-unpooling, which upsamples using the pooling indices memorized during downsampling. This change yields high counting accuracy. The proposed model achieves the lowest count error in terms of MAE on the ShanghaiTech Part A, Part B, and Mall datasets, with 64.6, 7.5, and 1.8, respectively. Moreover, it achieves the lowest count error in terms of MSE on the ShanghaiTech Part B, UCF CC 50, UCSD, and Mall datasets, with 12.4, 217.8, 2.1, and 2.2, respectively. In addition, the proposed model produces the best-quality density maps on all the utilized datasets.
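The upsampling replacement summarized above can be sketched with PyTorch's paired pooling operations; this is a minimal illustration of the mechanism, not the full U-ASD encoder/decoder:

```python
import torch
import torch.nn as nn

# Encoder side: max-pool while memorizing the location of each maximum.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
# Decoder side: scatter values back to those memorized locations.
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 32, 32)   # an encoder feature map
down, indices = pool(x)          # indices record where each max was taken
up = unpool(down, indices)       # maxima restored in place; zeros elsewhere
```

Unlike nearest-neighbor upsampling, this preserves the exact spatial positions of the strongest activations, which is the extra spatial information the U part exploits.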
To assist in crowd management and control in the holy places at Mecca and Al-Madinah, a new dataset, named the Haramain dataset, is introduced, consisting of three parts for three different scenes. The proposed U-ASD model is applied to this dataset, and the MAE, MSE, PSNR, and SSIM metrics all show promising results.
Extensive experiments on four benchmark datasets, together with comparisons against recent state-of-the-art methods, demonstrate the substantial improvements achieved by the proposed model.