Crowd Counting and Localization Beyond Density Map

Crowd analysis in general, and counting in congested scenes in particular, is an active and vibrant research domain in computer vision due to its numerous applications. Understanding the risk analysis and safety aspects of crowd dynamics at vital occasions related to sports, cultural, and religious activities, specifically at Hajj and Umrah, is essential: thousands of people gather in a small area to carry out their rites. Localizing and counting the annotated head points is quite challenging due to occlusion and large scale variation in the congested environment. A small and effective way to deal with these problems is to generate density maps. However, a significant flaw of the density map is its blurry Gaussian blobs, which are less effective for counting and localizing head annotations in the congested scene. To overcome these issues, we propose the Congested Scene Crowd Counting and Localization Network (CSCCL-Net) with a Focal Inverse Distance Transform (FIDT) map that can count and localize people simultaneously in highly congested scenes. To evaluate the proposed model's efficiency, extensive tests were performed on the ShanghaiTech part A, ShanghaiTech part B, and JHU-CROWD++ datasets. The proposed model outperforms existing state-of-the-art techniques in terms of accuracy and achieves low Mean Absolute Error (MAE) and Mean Square Error (MSE) values.


I. INTRODUCTION
Object counting and localization are evolving research areas in computer vision. Both crowd counting and localization have drawn considerable attention from the research community. They play a significant role in maintaining public safety and security at various events such as traffic management, urban planning, monitoring of smart cities, political rallies, and the religious gatherings of Hajj and Umrah. Since crowd counting and localization are closely related to each other, it is necessary to describe them
briefly. The primary aim of counting is to estimate the number of objects in images or videos. In contrast, localization means predicting the precise location of those objects in the images or videos. Crowd monitoring and management in congested scenes are crucial due to the recent Covid-19 pandemic across the globe. The counting and localization of people in congested scenes encounter many key challenges: scale variation, overlapping with severe occlusion, and blurry backgrounds [1], [2], [3], [4], [5]. There are three common approaches generally used for crowd counting: detection-based crowd counting [6], [7], [8], [9], regression-based crowd counting, and density-based crowd counting [10], [11]. In detection-based crowd counting, a sliding window is used to detect and count the number of people in an image. The disadvantage of this approach is that it cannot detect and count people in a very crowded scene [12]. Regression-based approaches have been used to regress low-level feature information to overcome this drawback. This approach can do well in a highly congested scene because it collects generalized density information; however, regression-based approaches are unable to regress in occluded and cluttered areas. Density-based approaches can address these challenges to some extent but pose their own, namely blurry Gaussian blobs and severe overlapping with the nearest head annotations, as shown in Figure 1. For crowd localization, the literature describes three major methods: detection-based crowd localization, heuristic-based crowd localization, and point-supervision crowd localization [13], [14]. Object detection is a valuable method for crowd counting and localization [15], as the count corresponds to the number of detected bounding boxes in the image; however, the drawback of the detection method is that it suffers from severe occlusion and overlapping of bounding boxes.
Additionally, annotating bounding boxes in congested scenes is quite difficult, expensive, and laborious. The concept of density maps, formed by the Gaussian kernel, was proposed to handle these challenges in [16], [17], [18], [19], and [20]. Convolutional Neural Networks (CNNs) are used in several counting approaches to regress density maps with good results. However, these methods [17], [18], [21], [22], [23] fail to predict the precise location of each head in the congested scene, and they cannot count and locate objects in an image or video simultaneously due to the series of blurry, overlapping Gaussian blobs. Traditional deep learning methods are thus unable to simultaneously count and localize the crowd in highly congested scenes [16], [17], [21], [24], owing to the tiny and occluded objects and the blurry crowded background. Because of the above-mentioned issues, we propose CSCCL-Net, which can simultaneously count and localize people in congested scenes. The FIDT map is used to pinpoint the exact position of people. The density map depicts blurry Gaussian blobs in a crowded area and is rendered unrecognizable by the Gaussian kernel filtering on each head annotation, whereas in an FIDT map every point is visible and distinguishable even in the congested scene. The Local Maxima Detection Strategy (LMDS) [25] finds the head coordinates by pinpointing the local maxima and exhibits robustness to negative background data. The K-nearest neighbor distance is used to build bounding boxes for each head. Whereas in [26] and [27] the bounding boxes were produced from ground-truth dotted annotations before training, we generate the bounding boxes from the predicted positions. Counting by localization requires comprehensive structural information in the FIDT map at the local level.
The SSIM loss can be used to increase the resemblance between the predicted FIDT map and the ground-truth map. However, the conventional SSIM loss may result in high, unstable background responses that create false local maxima, amplifying the localization and counting errors. We therefore utilize an independent SSIM loss to improve the model's ability to capture the structural information of local maxima while decreasing false local maxima in the background. VGG-16 is used as the core network, with its first ten layers as the front-end of the proposed framework. A dilated convolutional neural network is employed as the back-end to enlarge the receptive fields, extract more information from the image, and decrease false local maxima in the background area.
To address the issues mentioned above, the contributions of this work are as follows:
1. We propose a practical model, based on a congested scene recognition network and the focal inverse distance transform, that can cope with both counting and localization tasks at the same time.
2. The FIDT map is adopted to eliminate the blurry Gaussian blobs of the density map.
3. The proposed model is evaluated on the ShanghaiTech part A, ShanghaiTech part B, and JHU-CROWD++ datasets.
The structure of the article is as follows: Section II summarizes previous work on crowd counting and localization research. Section III presents the crowd counting and localization architecture. The experimental setup is described in Section IV, followed by results and discussion in Section V. Section VI presents an ablation study, and Section VII summarizes the conclusions.

II. PREVIOUS WORK
During the past decade, deep learning has made significant progress in computer vision. Several deep learning approaches for image segmentation, classification, and detection have been developed with significant success in the literature. Motivated by this success, different CNNs and deep CNNs with multi-view models have been published to count and find individuals in images and videos [2], [28].

A. CROWD COUNTING

1) DETECTION-BASED CROWD COUNTING
Throughout the early phases of deep learning, researchers concentrated on detection-based methods for crowd counting. Sliding windows are employed in these approaches to detect, count, and localize the objects in images or videos [10], [13], [15], [29]. To extract low-level features from the whole human body, many methods, such as Haar wavelets [30] and the Histogram of Oriented Gradients (HOG) [15], have been reported, all requiring well-trained classifiers. There are two common styles of detection-based crowd counting: 1. monolithic style and 2. part-based detection style. In the former, a classifier is trained on features taken from the entire body; in the latter, the classifier is trained to recognize specific parts, such as the head, to estimate the number of individuals. The authors in [31] have suggested a technique for calculating the number of individuals using MID-based foreground segmentation and head detection. Both sub-methods are unable to detect and count the number of people in highly congested scenes.

2) REGRESSION-BASED CROWD COUNTING
The two issues mentioned above make detection-based methods inapplicable in highly congested scenes. Therefore, regression-based crowd counting has been introduced to overcome them [2], [32]. This approach contains two steps: low-level feature extraction and regression modeling. Image features and crowd size are utilized to estimate the crowd count using regression-based algorithms [17], [32], [33], [34], [35], [36]. Low-level features are extracted from image patches, and various regression techniques are applied to them, including Gaussian process regression, linear regression, and piecewise linear regression [10]. Likewise, in [32] a Fourier series has been employed to extract features from the image. Similarly, [17] uses a Multi-Column Convolutional Neural Network that employs three columns with different filter sizes to compensate for scale variation. CNN models with two configurations estimate the crowd density in a single image [16], while switching CNN uses multiple CNNs for crowd counting [21]. In addition, some regression-based models rely on attention mechanisms to cope with crowd counting [37], [38], [39]. Specifically, the authors in [37] have developed local and global self-attention to capture short-range and long-range information. ADCrowdNet uses a binary classification network for the regression of the crowd counting region [39]. Similarly, density maps have been used as probability maps to obtain the probability of each pixel in the image [33]. The Distribution Matching (DM) approach compares a normalized predicted density map to a normalized ground-truth density map using optimal transport [34]. Kernel-Based Density Map Generation (KDMG) is a learnable density map representation that uses an adaptive density map generator; it focuses on counting rather than localization [40].
However, regression-based methods still suffer from occlusion and scale variation.

3) DENSITY-BASED CROWD COUNTING
Since the effectiveness of crowd counting improves dramatically when spatial information is integrated, density-based crowd counting has become increasingly popular. Dot maps, with each dot representing a person in the image, are typically blurred to make density maps [40]. Most approaches construct density maps in advance by convolving dot maps with Gaussian kernels of fixed or adaptive bandwidths. Then, to overcome various barriers, such as scale fluctuation, improving density map quality, encoding more contextual information, or adapting to changing conditions, alternative network models are built. To accommodate scale fluctuation, the Multi-Column Neural Network (MCNN) extracts features from multiple columns with various kernel sizes [17], [19]. Switch-CNN suggests selecting the right column with an adequate receptive field rather than combining multi-scale features. To manage the diversity of individuals in crowds, a tree-structured CNN has been proposed by the authors in [21] and [41], which presents a hierarchical encoder-decoder architecture for encoding multi-scale characteristics, while [39] offers a framework for filtering background and the authors in [42] introduce a unique feature fusion strategy. The comparison of different crowd counting approaches is shown in Figure 2.

B. CROWD LOCALIZATION
Recently, crowd localization has become a new and active research topic in computer vision. It takes the individual as the basic unit instead of the scene [13]. The research community has paid less attention to crowd localization in densely populated areas. Finding a person's exact position in a crowd is the central challenge of localization. The distribution of individuals in the environment may be determined via localization, which is critical for crowd managers to maintain public safety and security. Furthermore, it may be utilized in a dense crowd to recognize and track a person, and to generate ground-truth data that can be used to correct counting errors [2]. To locate the head position of each individual in a congested crowd scene, researchers have focused on three types of crowd localization: 1) detection-based, 2) point-based, and 3) heuristic-based crowd localization [14].

1) DETECTION-BASED CROWD LOCALIZATION
There are various ways of locating individuals' heads in a crowd using bounding boxes. Detection-based methods train object detectors to pinpoint each person's location [29], [43], [44]. In [45] and [46], the authors have proposed head detection methods using hand-crafted features for real-time surveillance systems. As hand-crafted features are limited, several CNN techniques have been used to enhance performance in complex scenes. Similarly, the authors in [47] present a context-aware head detection method for crowd localization. A recurrent LSTM technique has been built for sequence creation, and its detector has been used to obtain the detection results [6].

2) POINT-BASED CROWD LOCALIZATION
Most crowd datasets contain images annotated by points instead of bounding boxes. As a result, it is more practical and productive to employ point annotations for a crowd than bounding box annotations. Recasting the crowd localization problem as a foreground/background segmentation problem, several works use the cross-entropy loss to optimize the network [23], [48], [49], [50].

3) HEURISTIC-BASED CROWD LOCALIZATION
Heuristic-based crowd localization extracts the crowd locations from density maps, as proposed in [51] and [52]. Non-maximum suppression is used to obtain the highest local values, which indicate the head annotations of each individual. After that, one-to-one matching is adopted to match each prediction with a true head location. Finally, a feasible solution is obtained by applying the Hungarian algorithm for the evaluation of crowd localization.
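A minimal sketch of such a one-to-one evaluation step, using SciPy's Hungarian-algorithm solver; the function name and distance threshold are illustrative, not taken from [51] or [52]:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_points(pred, gt, sigma):
    """One-to-one matching of predicted vs. ground-truth head points.

    Returns (pred_idx, gt_idx) pairs whose matched distance is within
    the threshold sigma; unmatched points count as FP/FN.
    """
    cost = cdist(pred, gt)                    # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)  # Hungarian 1-1 assignment
    keep = cost[rows, cols] <= sigma          # reject distant "matches"
    return list(zip(rows[keep], cols[keep]))
```

True positives are the returned pairs; precision and recall then follow from the counts of matched and unmatched points.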
Apart from the methods proposed above, we use the focal inverse distance transform, which can count and localize the annotated heads in congested scenes. Compared to density maps, Focal Inverse Distance Transform (FIDT) maps appropriately indicate a person's location in a congested scene without overlapping heads. We regress the FIDT maps to determine the person's position and the count at the same time.

C. LIMITATIONS
Existing approaches mainly rely on crowd counting while employing a density map. The primary disadvantage of a density map is that the specific positions of individuals in dense images are difficult to detect due to blurry Gaussian blobs. These methods seldom pay attention to the particular placement of persons in crowded images. This leads our study toward developing a viable combined solution for crowd counting and localization based on the FIDT map.

III. PROPOSED ARCHITECTURE
The proposed architecture is depicted in Figure 3. Random cropping and horizontal flipping are first used to augment the input training images. We use data augmentation techniques to boost efficiency and minimize overfitting because most crowd-counting datasets are small. We execute random cropping and resizing of the training set with different sizes to overcome the problem of scale variation and retain performance on zoomed-in images. These cropped images are then used to generate the ground truths. The crop size is 256 × 256 for the ShanghaiTech part A and part B datasets and 512 × 512 for the JHU-CROWD++ dataset. A regressor is used to regress the predicted FIDT maps. We use a congested scene recognition network with different dilation rates as the core network to train the model.
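The cropping and flipping step might be sketched as follows, assuming point annotations stored as (row, col) coordinates; the function name and details are illustrative, not the paper's implementation:

```python
import numpy as np

def augment(image, points, crop=256, rng=np.random.default_rng()):
    """Random crop plus horizontal flip, keeping head annotations aligned."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    img = image[top:top + crop, left:left + crop]
    # Shift annotations into crop coordinates and drop those outside
    pts = points - [top, left]
    inside = (pts >= 0).all(axis=1) & (pts < crop).all(axis=1)
    pts = pts[inside]
    if rng.random() < 0.5:                 # horizontal flip
        img = img[:, ::-1]
        pts[:, 1] = crop - 1 - pts[:, 1]   # mirror the column coordinate
    return img, pts
```

The ground-truth FIDT map would then be generated from the surviving points of each crop.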

A. MATHEMATICAL ELABORATION OF FOCAL INVERSE DISTANCE TRANSFORM MAP
Many deep learning-based approaches use the FIDT map as an image processing tool [25], [53]. The FIDT map is used to count and localize crowds simultaneously without overlapping heads in the congested scene. The distance transform map is mathematically expressed in (1):

P(x, y) = min_{(x', y') ∈ B} √((x − x')² + (y − y')²)  (1)

where B is the set of head annotations, so that P(x, y) is the distance between an arbitrary pixel (x, y) and its nearest head annotation. Due to the high distance variations, the distance transform map is not easy to regress directly. An Inverse Distance Transform (IDT) map is used to suppress these variations, as defined by (2):

I = 1 / (P(x, y) + C)  (2)

where I represents the IDT map and C is an additional constant, set to C = 1. The IDT map is a special form of the iKNN map; the iKNN map is employed for crowd counting rather than crowd localization. The IDT (i1NN) map, on the other hand, can properly depict the specific locations that correspond to local maxima, compared with commonly used density maps. The response should decay quickly when moving away from a head, and the background should approach 0 as fast as possible, so that the model emphasizes the foreground head regions. As a result, the Focal Inverse Distance Transform (FIDT) map is proposed, expressed in (3):

I = 1 / (P(x, y)^(α · P(x, y) + β) + C)  (3)

where I represents the FIDT map and α, β are parameters that control the focal decay of the response away from each head annotation.
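Putting equations (1)-(3) together, a ground-truth FIDT map can be generated from point annotations with an exact Euclidean distance transform. This is a minimal sketch; the default α = 0.02 and β = 0.75 are the settings reported in the original FIDT work [25]:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def fidt_map(shape, points, alpha=0.02, beta=0.75, c=1.0):
    """Generate a FIDT map for an image of the given (H, W) shape
    from a list of (row, col) head annotations."""
    mask = np.ones(shape, dtype=bool)
    for r, col in points:
        mask[int(r), int(col)] = False     # zeros at annotation points
    # Eq. (1): distance from every pixel to its nearest annotation
    dist = distance_transform_edt(mask)
    # Eq. (3): focal inverse distance transform
    return 1.0 / (dist ** (alpha * dist + beta) + c)
```

Each annotated pixel attains the maximum value 1, and the response decays toward 0 with distance, which is exactly the local-maxima structure the LMDS exploits.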

B. THE EFFECTIVENESS OF FIDT MAPS FOR CROWD LOCALIZATION
HRNet was used as the regressor to assess the efficiency of the FIDT maps. High-resolution representations are required for position-sensitive vision tasks such as human pose estimation, semantic segmentation, and object detection [54]. The counting result of the proposed FIDT map corresponds to its local maxima; therefore, a high-resolution representation is required, and we can locate each person's position using the local maxima of the FIDT map.

C. ELABORATION OF LOCAL MAXIMA DETECTION STRATEGY
The position of each person in the highly congested scene can easily be found via the predicted FIDT map. In the Local Maxima Detection Strategy (LMDS), we first use 3 × 3 max-pooling to obtain candidate points. However, some false positives may exist in the background among these candidate points. We observe that true positives have significantly higher pixel values than false positives, meaning that a local maximum is likely present. As a result, threshold values are used to eliminate false positives. Hence, given a set of candidate points M, the final chosen points are those whose value is greater than or equal to an adaptive threshold, equal to 100/255.0 times the maximum value of M. The most recent dataset [55] contains some negative samples comparable to crowd scenes and some scenes without humans. Based on predicted density maps alone, we cannot tell whether the original images contain people. Given a predicted FIDT map, if the maximum of M is less than a small number (set at 0.10), the input image is treated as a negative sample, and the LMDS sets the counting result to 0. We can build pseudo individual sizes using the proposed FIDT map, even when actual individual sizes are unavailable. We first extract the coordinates of the head centers from the predicted FIDT map, which can be effectively implemented using the suggested LMDS. The instance size is then estimated by applying the K-nearest neighbor distance, as defined:

S_x = min(d̄, f · W),  S_y = min(d̄, f · H)

where S_x and S_y denote the size of the instance on the x-axis and y-axis, respectively, P is the collection of predicted head points, and d̄ is the average distance between a predicted point P(x, y) and its k-nearest neighbors. In particularly sparse locations, d̄ may be larger than the actual size of individuals; the factor f therefore confines the size to a threshold based on the image width W and height H.
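The candidate-extraction and thresholding steps of LMDS can be sketched as follows; the max-filter formulation is an equivalent stand-in for 3 × 3 max-pooling, and the k-nearest-neighbor size estimation is omitted:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def lmds(fidt, threshold_ratio=100 / 255.0, negative_cutoff=0.10):
    """Local Maxima Detection Strategy (sketch): extract head
    coordinates from a predicted FIDT map."""
    # A pixel is a candidate if it equals the max of its 3x3 neighborhood
    peaks = (fidt == maximum_filter(fidt, size=3)) & (fidt > 0)
    peak_max = fidt.max()
    # Negative-sample check: treat the whole image as empty
    if peak_max < negative_cutoff:
        return np.empty((0, 2), dtype=int)
    # Adaptive threshold removes false positives in the background
    keep = peaks & (fidt >= threshold_ratio * peak_max)
    return np.argwhere(keep)  # (row, col) head coordinates
```

The count is simply the number of returned coordinates, so counting and localization come from the same prediction.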

IV. EXPERIMENTAL EVALUATION
The experiments are conducted on different mainstream datasets, namely, ShanghaiTech parts A and B and JHU-CROWD++. The latest dataset, i.e., JHU-CROWD++, is the most challenging for crowd counting and localization owing to its several illumination variations and complex backgrounds. The key information of the datasets is listed in Table 1.

A. IMPLEMENTATION DETAILS
We augment the datasets with random cropping and horizontal flipping, with a crop size of 256 × 256 for ShanghaiTech parts A and B and 512 × 512 for JHU-CROWD++, which has more diversified scenes. To produce the bounding boxes, we set k to 4 and f to 0.1. For optimization of the model, Adam [56] has been used with a learning rate of 1e−4 and a weight decay of 5 × 1e−4. For threshold selection, the distance between a given predicted point P_p and ground-truth point P_g must be less than a distance threshold α (the real head size); this indicates that P_p and P_g have been successfully matched. The MSE loss has been used as the loss function, with Adam as the optimizer, throughout the training phases.
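A minimal sketch of this optimization setup, with a placeholder one-layer regressor standing in for the actual network:

```python
import torch

# Placeholder regressor (the real network is the VGG-16-based model);
# optimizer settings follow the text: Adam, lr 1e-4, weight decay 5e-4.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
criterion = torch.nn.MSELoss()

# One illustrative training step on dummy data
x = torch.randn(1, 3, 256, 256)       # random-cropped input image
target = torch.rand(1, 1, 256, 256)   # ground-truth FIDT map
loss = criterion(model(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```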

B. EVALUATION METRICS
This section uses the Mean Absolute Error (MAE) and Mean Squared Error (MSE) as statistical measures to evaluate counting performance. These metrics are defined mathematically as:

MAE = (1/N) Σ_{i=1}^{N} |C_i − Ĉ_i|,  MSE = √((1/N) Σ_{i=1}^{N} (C_i − Ĉ_i)²)

where N is the number of test images, C_i is the ground-truth count of the i-th image, and Ĉ_i is the predicted count. We also examined crowd localization performance using Precision, Recall, and the F1 measure in addition to the crowd counting metrics. They are defined as:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F1 = 2 · Precision · Recall / (Precision + Recall)

where TP, FP, and FN denote true positives, false positives, and false negatives obtained from the point matching.

C. DATASETS
We evaluate the approach on three popular datasets, each discussed in detail below.
1) ShanghaiTech DATASETS
The ShanghaiTech datasets, proposed in [17], are divided into two parts: part A and part B. Part A has 482 images and 241,677 instances, whereas part B has 716 images and 88,488 labeled heads.

2) JHU-CROWD++
There are 4,372 images with 1.51 million annotations in this dataset. Unlike existing datasets, the proposed dataset was obtained in various scenarios and environments. In addition, the dataset includes a broader range of annotations, such as dots, approximate bounding boxes, and blur levels. It consists of 2722 training, 500 validation, and 1600 test images. It contains diverse crowd scenes, such as fog, rain, snow, and low illumination [57].

D. COMPARISON WITH STATE-OF-THE-ART-METHODS
We compared our proposed framework with other state-of-the-art methods, such as [21], [58], [59], and [60], on the ShanghaiTech part A and part B datasets. The results of these recent existing methods are shown in Table 2, which demonstrates that our proposed method competes with and outperforms them. Reference [61] has proposed a deep fusion network that combines multi-scale features from shallow to deep layers, with the final count obtained from the peaks of density maps. But the drawback of density maps is their blurry Gaussian blobs, which are especially challenging in highly congested scenes with overlapping.

V. RESULTS AND DISCUSSION
We have tested and assessed our proposed model on various datasets. The training and validation losses have been shown in Figure 4. The experimental results of the proposed model on the ShanghaiTech part A and part B datasets have been compared with recent state-of-the-art methods for crowd counting and localization. Table 2 and Table 3 demonstrate that the proposed model achieves remarkable performance in terms of MAE, MSE, average precision, average recall, and F1 score. For the ShanghaiTech part A and part B datasets, as listed in Table 2, the proposed model reports SOTA counting performance compared with recent models [14], [59], [61]. Similarly, the proposed model has been compared with recent existing methods for crowd localization and achieved notable results, as shown in Table 3; for crowd localization, the proposed model can identify the exact position of each head.

A. JHU-CROWD++ BENCHMARKING
Crowd counting datasets have evolved over time in terms of size, crowd density, image resolution, and various other factors. The UCSD dataset was created early and consists of 2,000 low-resolution video frames with 49,885 annotations [70]. Similarly, UCF_CC_50, with 50 high-density crowd images, was proposed in [32]. The drawback of this dataset is its small number of images, which makes it unreliable for training deep learning networks. The ShanghaiTech dataset was introduced by [17] for better variety in terms of both scene and density levels.

1) EVALUATION OF COUNTING AND LOCALIZATION
First, we compare the proposed approach with existing, cutting-edge crowd counting and localization methods. The qualitative results on the JHU-CROWD++ dataset are shown in Figure 5. The compared existing models address both crowd counting and localization simultaneously. Based on the results in Table 4 and Table 5, our model achieves good performance in terms of both MAE and MSE and produces correct counts. The CSCCL-Net plus FIDT model improves the counting and localization performance in highly dense crowd images, including fog, rain, snow, and low illumination. A multi-scale contextual approach has been proposed in [71] that combines the features obtained from multiple receptive fields to predict the crowd density. An encoder-decoder network has also been used to extract multi-scale features using a set of transposed networks. Further, the model's efficiency has been improved by the combination of Euclidean loss and local pattern consistency loss [19].
To leverage the information presented in different layers of the network, a multi-scale fusion has been used to improve the model's effectiveness [42]. The point annotations create a density contribution probability model using Bayesian loss.
The predicted count at each annotated location is then determined by adding the contribution probability and estimated density at each pixel, reliably guided by the ground-truth count value [33]. A crowd counting network with a rich set of annotations at the image and head levels progressively generates the density maps; that model used VGG-16 as the core network, with the generated density map providing the final prediction of the people in the crowd [57]. An adaptive density map generator that accepts an annotated dot map as input and learns a density map representation for a counter has also been presented for crowd counting [40]. As shown in Table 4 and Table 5, we compared and assessed our model's crowd counting performance against recent existing SOTA methods. The graphical representation of the JHU-CROWD++ results is shown in Figure 6. On the JHU-CROWD++ dataset, the proposed model steadily reduces the MAE and MSE, and the findings show that it achieves exceptionally good numerical results.

VI. ABLATION STUDY
We evaluate the performance of our proposed model with respect to the backbone network against the existing baseline model. The backbone network in the proposed model is VGG-16. We conduct the experiments using the JHU-CROWD++ dataset, and the results clearly show that it performs better than the current standard. Table 5 displays the efficacy of the suggested model, with improvements of 5% in average precision and 3% in F1 measure, respectively.

VII. CONCLUSION AND FUTURE WORK
This paper has proposed simultaneous crowd counting and localization for congested scenes using VGG-16 as the core network with the addition of a dilated convolutional neural network. The proposed framework is based on point-level annotations. Notably, using the FIDT map removes the blurry Gaussian blobs of the density map and promptly enhances performance. Extensive experiments demonstrate that this framework achieves notable crowd counting and localization results in highly congested scenarios, including diverse environments. As shown in Table 4, our proposed CSCCL-Net has achieved an MAE of 61.5 and an MSE of 226.2 compared to other networks. Likewise, our proposed network has achieved localization performance on JHU-CROWD++ of 68.2 average precision, 62.6 average recall, and 65.4 F1 measure, as shown in Table 5.
In future work, the proposed model can be extended to localize and count people in complex videos.