Single Convolutional Neural Network with Three Layers Model for Crowd Density Estimation

Crowd density estimation is an important topic in computer vision due to its widespread applications in surveillance, urban planning, and intelligence gathering. Extensive analysis shows that crowd density estimation is affected by many factors, such as the similarity of appearance between people, background components, and mutual occlusion in dense crowds. In this paper, we apply machine learning to crowd management in order to monitor populated areas and prevent congestion. We propose a Single Convolutional Neural Network with Three Layers (Single-CNN3) model that counts the number of people in a scene and infers the crowd density level. A comparative study then evaluates the proposed model against a convolutional neural network with four layers (Single-CNN4) and the Switching Convolutional Neural Network (Switch-CNN). The ShanghaiTech dataset, considered the largest database for crowd counting, is used in this work. The proposed model proves highly effective and efficient for crowd density estimation, with an average test accuracy of 99.88% and an average validation loss of 0.02. These results outperform the existing state-of-the-art models.


I. INTRODUCTION
In recent years, more people have chosen to live in cities, which enriches cultural life, makes good use of accessible urban infrastructure, and attracts a wide range of people to coordinated activities. Global and national events attract large crowds, whether indoors or outdoors. Typically, these events involve at least one activity requiring the simultaneous participation of attendees, such as viewing a display, watching an open-air show, passing through checkpoints, or entering areas that restrict the number of people present. However, such events are prone to overcrowding because there is no rapid and efficient way to obtain an overview of the whole venue and to communicate effectively with those in other areas, which often results in collisions and congestion. In contrast, a centralized monitoring system capable of estimating crowd density in several locations at the same time is a preferable option for effective and reliable decision making, ensuring people's safety while allowing them to continue enjoying the event.
In the Kingdom of Saudi Arabia (KSA), there are several seasons, namely Hajj, Umrah, and Ramadan (especially the 27th of Ramadan), when people from different countries crowd into Mecca to perform religious ceremonies within a limited period of the year. In various countries, during the celebration of founding, national, or independence days, the security authorities are mobilized in different places to prevent the negative impacts of crowding. In this context, an Artificial Intelligence (AI)-based method helps to count the crowds present in different places, to protect them from accidents, and to prevent the spread of contagious diseases such as Covid-19. Overcrowding, wherever it occurs, can be measured, monitored, and controlled using a centralized and automated system that integrates crowd density estimation and machine learning.
Research on automatic detection, counting, and density estimation in large-scale crowds plays a significant security and management role in controlling huge crowds in different places. Monitoring large events with huge crowds requires great effort to ensure the attendees' safety and the delivery of adequate services. Many studies have been conducted to design a good monitoring system.
Estimating crowd density supports management authorities in coordinating and scheduling crowds between different areas during their transitions. With the recent advances in machine learning, various CNN models have been proposed to better resolve some estimation issues and to improve both crowd density estimation and classification accuracy [1]. However, those models used crowd size as a discriminator for crowd density.
Deep Learning (DL) is a specialized form of machine learning in which multilayered neural networks learn from vast amounts of data. It is considered one of the most advanced machine learning techniques and achieves high efficiency in many applications such as face recognition and crowd counting. Deep learning architectures, such as CNN, LSTM, RNN, GAN, RBFN, MLP, SOM, DBN, RBM, and autoencoders, support this mission because they include more than three layers.
Many studies have discussed ways to automate the monitoring and organizational processes in order to obtain a macroscale view of the entire event and make centralized decisions. Several technological solutions have been suggested; some rely purely on sensors, while others employ image processing and computer vision. However, the most recent solutions focus on neural network approaches. In this work, we develop a Convolutional Neural Network (CNN) model with three layers for crowd density estimation. We recast the density estimation challenge from a regression problem into a classification problem by defining specific classes that correspond to various density levels and provide different crowd indicators. The model is thus able to classify a scene into a crowd density class, which enables appropriate and timely actions to mitigate congestion and reduce the risk of hazardous incidents. For a credible evaluation, we used the well-known ShanghaiTech dataset for performance measurements.
This paper is organized as follows. Section II introduces the background on crowd density estimation and convolutional neural networks. Section III reviews the relevant state-of-the-art studies covering convolutional neural networks for crowd counting. Section IV presents a detailed description of our proposed CNN model. Section V describes the simulation environment and the experimentation details. Section VI presents a comparison of results between the proposed model and some existing models. Section VII summarizes the contributions of this work. Finally, Section VIII concludes the paper and outlines some future work.

II. BACKGROUND
Event organizers monitor crowds with the help of volunteers who are distributed around the venue to direct and guide the attendees. However, the decisions on when and where to direct the crowd are all made individually, without a global and updated overview of the event, because each person responsible only monitors and organizes the movement of one small group in one area. Random and unsupervised crowd movement can cause congestion, overcrowding, and, in more serious cases, injuries due to stampedes. For this reason, many studies have investigated crowd density classification and estimation as elements of crowd management solutions. Deep learning, a field of Artificial Intelligence (AI) that uses algorithms to give computers the ability to identify patterns in large amounts of data and make predictions, has been used for crowd density counting. This learning method allows computers to perform specific tasks autonomously. Crowd density estimation solutions are based on different technologies: sensor technology approaches, computer vision approaches, and neural network approaches.

A. GENERAL PROCEDURE FOR CROWD DENSITY ESTIMATION
Crowd counting consists of estimating the crowd density or counting the number of persons in a certain scene (image). Previously, the crowd was counted manually by a crowd scientist, who tallied the number of people in certain areas of an image and then extrapolated the tally to an overall estimate. This method wastes time and effort and is highly error-prone. Crowds are common in sports, festivals, and political and religious activities because such events attract and gather a huge number of people in a confined place [2].
Jacobs' method is the most widely used method for counting crowds during religious ceremonial meetings, protests, and rallies. It divides the area occupied by a crowd into sections, calculates the average number of people per section, and multiplies this average by the total number of occupied sections [3]. Crowd density estimation helps in the development of management techniques such as the design of safe public spaces and emergency evacuation plans. The procedure for crowd density estimation is to establish relationships between image parameters obtained from various image processing techniques and the actual crowd densities at an investigation site [4]. There are two main approaches: direct and indirect. The direct approach tracks and counts people simultaneously, provided that people are correctly segmented. The indirect approach relates a set of measured features of the whole crowd to the count by means of learning algorithms, using pixel-based analysis, texture-based analysis, or corner point-based analysis [5].
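As an illustration of the Jacobs' method arithmetic described in this subsection, the following minimal sketch multiplies the average count of a few sampled sections by the number of occupied sections. The function name and the sample numbers are hypothetical and are not taken from [3].

```python
def jacobs_estimate(sampled_section_counts, occupied_sections):
    """Estimate a crowd size from a few manually counted sections.

    sampled_section_counts: head counts of the sections that were counted.
    occupied_sections: total number of sections filled with people.
    """
    if not sampled_section_counts:
        raise ValueError("at least one sampled section is required")
    # Average number of people per sampled section.
    average_per_section = sum(sampled_section_counts) / len(sampled_section_counts)
    # Extrapolate the average to every occupied section.
    return average_per_section * occupied_sections


# Hypothetical example: four counted sections, 120 occupied sections overall.
estimate = jacobs_estimate([18, 22, 20, 20], occupied_sections=120)
print(estimate)  # 2400.0
```

The quality of the estimate depends on how representative the sampled sections are, which is one reason why automated counting is attractive.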
The largest and most famous crowd in the world gathers at the sacred sites of Mecca. The Muslim holy pilgrimage is considered the world's largest human gathering and attracts millions of people to Mecca every year for Al Hajj. According to statistics, the pilgrimage attracted almost 2.5 million pilgrims in 2019 and over 3 million pilgrims at its height in 2012 [6]. A CNN can be one of the best solutions for managing the Al Hajj crowd, as it provides a framework for estimating the level of crowding in order to avoid accidents [7].

B. CONVOLUTIONAL NEURAL NETWORK (CNN)
A Convolutional Neural Network (CNN or ConvNet) is a type of artificial neural network [8] that takes image pixels as input to perform tasks such as image identification, classification, object detection, and face recognition. In image classification, a CNN takes image pixels as input and outputs a class such as "cat", "dog", or "bird"; it can also be used to count the number of objects in the image.
A CNN consists of an input layer, multiple hidden layers, and an output layer. The hidden layers usually consist of a series of convolutional layers, ReLU layers, pooling layers, and a fully connected layer. The convolutional layer has a set of filters whose parameters need to be learned. The ReLU layers improve neural networks by speeding up the training process. The pooling layer is used to reduce the number of parameters and the amount of computation. The Fully Connected (FC) layer makes the final classification decision [9].
Convolution and pooling are two essential operations in a CNN. Convolution with several filters extracts features (feature maps) from the input data while preserving their spatial information. Pooling, also known as subsampling, reduces the dimensionality of the feature maps created by convolution. The most frequent pooling techniques in CNNs are maximum pooling and average pooling [10]. Figure 1 describes the CNN's structure.
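To make the two pooling variants concrete, the sketch below applies non-overlapping 2 x 2 maximum and average pooling to a small feature map stored as nested lists. The window size, the stride equal to the window size, and the sample values are assumptions chosen for illustration only.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over a 2-D feature map (list of rows)."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            # Collect the values inside the current size x size window.
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                pooled_row.append(max(window))
            else:
                pooled_row.append(sum(window) / len(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 5, 7],
    [1, 1, 3, 9],
]
print(pool2d(feature_map, mode="max"))  # [[6, 2], [2, 9]]
print(pool2d(feature_map, mode="avg"))  # [[3.5, 1.0], [1.0, 6.0]]
```

In both cases the 4 x 4 map shrinks to 2 x 2, which is the dimensionality reduction described above; maximum pooling keeps the strongest response in each window, while average pooling smooths it.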

1) SENSOR-BASED SOLUTIONS
Several studies have addressed the problem of crowd estimation using sensor technologies such as Wireless Sensor Networks (WSN) and Radio Frequency Identification (RFID) [11]-[12]. Sensor technology approaches rely on the assumption that crowd members possess network devices that can transmit wireless signals (e.g., RFID wristbands, sensor tags, or smartphones). Certain sensor-based technologies estimate crowd density by counting the Unique Identifiers (UIDs) of the signals sensed or read in the area of interest. Other solutions relate crowdedness to the Received Signal Strength Indicator (RSSI), Channel State Information (CSI), and Link Quality Indicator (LQI). The latter approaches are less complex in terms of processing, but they suffer from serious deployment problems in terms of the number of devices as well as the reliability of their signal communication. Furthermore, their success strongly depends on the cooperation of crowd members. Finally, these approaches are a laborious and expensive choice for the targeted crowd size.

2) COMPUTER VISION SOLUTIONS
In contrast, computer vision solutions require fewer devices and do not depend on crowd members' cooperation. They only need a camera installed at the location to capture images of the crowded scene. Those images are then processed to extract useful information pertaining to the crowd density level [13]-[14], and statistical analysis is performed to find similarities in crowd texture between different images.
In [15], the authors used a mapping based on the crowd count and the foreground pixels obtained from binary and infrared images. However, statistical analysis and supervised feature extraction are computationally complex and time consuming, and they do not produce universal solutions because the process is application-oriented. Overall, while computer vision solutions require fewer devices for a particular area than sensor-based solutions, they do not generalize to arbitrary crowd scenes and cannot adapt to new features.

III. LITERATURE REVIEW
Since estimating the crowd density is important for managing and controlling crowding, crowd management researchers are studying the effectiveness of neural networks for crowd density estimation [17]-[18], seeking to improve both crowd density estimation and classification accuracy.
Fu et al. [19] suggested an optimized convolutional neural network to meet the accuracy and speed requirements of engineering applications that existing methods do not satisfy. This method constitutes the first research work using a convolutional neural network for crowd density estimation. It optimizes the multi-stage convolutional network structure by removing some network connections that have similar feature maps. This contribution increases the speed of estimation and reduces the computation cost for both the training and testing phases. In addition, the authors designed a cascade of two convolutional neural network classifiers to improve the accuracy. The first classifier identifies samples that are clearly misclassified, whereas the second one reclassifies those samples. They applied the method to three datasets: PETS_2009, a Subway video, and a Chunxi_Road video. They defined five classes; each class covers a different range of people counts in each dataset. Finally, they established a comparative study in which their method outperforms other related work.
Oñoro & López-Sastre [20] proposed two deep learning approaches to count objects in images. The first approach, named Counting CNN (CCNN), is formulated as a regression model in which the network learns how to map the appearance of image patches to their corresponding object density maps. The second approach, called Hydra CNN, is a scale-aware counting model able to estimate object densities in different very crowded scenarios without information about the scene. Hydra CNN learns a multiscale non-linear regression model, which uses a pyramid of image patches extracted at multiple scales to perform the final density prediction. Each approach is evaluated on three datasets: the UCSD pedestrian, the UCF CC 50, and the TRANCOS datasets.
Sam et al. [21] used a Switch Convolutional Neural Network (Switch-CNN) model for crowd counting. This model can be regarded as a multi-column CNN (MCNN) built on three CNN regressors with different architectures. The input image is divided into nine non-overlapping patches, and a classifier (the switch) selects the best regressor for each input crowd scene patch. This model assumes that the characteristics of the crowd, such as density and appearance, are consistent within a given patch of a crowd scene. Each CNN regressor has a similar architecture: four convolutional layers with two pooling layers. The authors used three different datasets including ShanghaiTech. The classification accuracy reaches 64.39% for CNN-small and 73.75% for VGG-16. Their model learns to group crowd patches based on latent factors correlated with crowd density.
Kumagai et al. [22] proposed an architecture of expert and gating CNNs to select the most suitable feature extractor CNN. Using a filter suited to a specific image perspective, each column was trained only on the images of a particular domain. This work used two challenging crowd counting datasets: the UCF CC 50 dataset and the Mall dataset.
Later on, Al-Hadhrami et al. [23] provided a Single Convolution Neural Network with four convolution layers (Single-CNN4) for crowd counting, treated as a classification problem. They divided the input images into nine non-overlapping patches. Then, they examined the model in eight different phases distinguished by the labeling process, the training epochs, and the dataset splitting.
Hu et al. [24] used a single-column network with multiple filter sizes to capture features at various scales by reducing the filter size after alternate sets of convolution layers. Two supervisory signals, crowd count and crowd density, are employed to learn crowd features and estimate the specific count. They tested the approach on a dataset containing 107 crowd images with 45,000 annotated humans, with headcounts per image ranging from 58 to 2201.
Dai et al. [25] proposed DSNet, a Dense Scale Network for crowd counting that is simple and easily trained yet effective. DSNet is made up of blocks of densely linked dilated convolutional layers. As a result, it can generate features with various receptive fields and capture crowds at various scales. They evaluated DSNet on four public datasets for crowd counting (ShanghaiTech, UCF-QNRF, UCF-CC 50, and UCSD). DSNet attained the best performance and made substantial gains (20% on ShanghaiTech and UCSD, and 30% on the others).
Sindagi & Patel [26] proposed an end-to-end cascaded network of CNNs with two columns that combines count estimation and count group classification. The first column classifies the crowd group and shares the extracted features to enhance the accuracy of the crowd count mapping of the second column. Extensive experiments on highly challenging publicly available datasets show that this method achieves a lower count error and better density maps.
Saqib et al. [27] proposed a Motion Guided Filter (MGF) based on a Deep Convolution Neural Network (DCNN) and Faster-RCNN for crowd counting. They evaluated the performance of their approach on three publicly available datasets (PETS2009, the UCSD dataset, and the Mall dataset). They used the MGF to recover misdetections and improve the mean average accuracy of the overall detections. This improved crowd counting and density estimation, as measured by the Mean Absolute Error (MAE) of the method (VGG16 + MGF). Compared with other techniques, the method achieved an MAE of 1.27 on the UCSD dataset, 1.89 on the Mall dataset, and 1.21 on the PETS2009 dataset.
Marsden et al. [28] proposed a scale-aware model that reduces the computational complexity of multiple columns by feeding multiple scales of an image into the network and estimating the crowd count for each scale. The final count is the average of all estimates. The model was evaluated on the ShanghaiTech and UCF CC 50 datasets.
Zeng et al. [29] proposed a multi-scale convolutional neural network (MSCNN) for single-image crowd counting based on a scale-relevant feature identification process. They evaluated their model for crowd counting on two separate datasets, the UCF CC 50 and ShanghaiTech datasets. The MSCNN model outperforms other related works in terms of accuracy and robustness, with an MAE of 83.8 on the ShanghaiTech dataset and 363.7 on the UCF CC 50 dataset.
Zhou et al. [30] proposed a Multiscale Generative Adversarial Network (MS-GAN) for generating high-quality crowd density maps of scenes with arbitrary crowd density. The MS-GAN combines a multiscale convolutional neural network (generator) and an adversarial network (discriminator). The multiscale generator utilizes the fused features from multiple hierarchical layers to detect people with large scale variation. The density map produced by the multiscale generator is processed by a discriminator network trained to solve a binary classification task between poor-quality density maps and real ground-truth ones. The experiments showed that adversarial training improved the performance of density map prediction.
Pu et al. [31] used a deep convolutional neural network (ConvNet) in which each layer only receives input from the immediately preceding layer. The classification of the Subway carriage dataset into five density classes was also based on pre-determined ranges of crowd count. This deep model provided an accuracy of 82% and 86% on five and three density classes, respectively, a result attributed to the close boundaries between class ranges and the slight variance in crowd count among the samples of the same class.
Kasmani et al. [32] proposed an Adaptive Counting Convolutional Neural Network (A-CCNN) that uses an ideally trained CCNN model. This model analyzes each component of an input image in order to properly estimate the appropriate density map. The most notable features of the proposed model for crowd analysis are its capacity to manage large differences in the sizes of the people appearing in the image and its ability to create local density maps within a crowd scene. Therefore, the proposed model can give a complete view of how a crowd is distributed. They evaluated their method on different datasets in terms of the MAE metric, achieving 367.3 for the UCSD dataset and an average of 1.35 for the UCF-CC dataset.
Tripathy & Srivastava [33] proposed a two-input stream multi-column multi-stage convolutional neural network (TIS-MCMS-CNN) model to classify the crowds of the PETS2009 and UCSD datasets into five density classes based on crowd count. This model is shallower but wider, consisting of two three-column networks trained in parallel. It aims to enhance the accuracy of previous models by capturing spatial and temporal features.
The work in [35] used only the first layer of a CNN pre-trained offline on ImageNet to compute statistical CNN features and then performed the counting with a support vector machine (SVM). On the UCSD dataset, this approach demonstrated better accuracy and efficiency than other related methods, with an MAE of 2.29 and an MSE of 9.05.
Various network architectures and learning parameters have been proposed to better resolve some estimation issues. However, most existing models in the literature estimate the actual crowd count [36]-[37], while very few studies go further and classify the overall crowd density.

IV. PROPOSED APPROACH
The main goal of this work is to use deep learning to develop a model capable of learning and analyzing crowd features and then predicting the crowd density class. This work supports appropriate decision making for efficient management and incident prevention in a crowded environment.
The motivation for this research is to count crowds automatically, without human intervention, so that crowds can be handled professionally and quickly. To this end, we used a dataset containing images with different numbers of people. The model counts the heads in a specific spot where people are crowded. This helps the security authorities to break up the crowd and prevent accidents; at present, distancing and avoiding crowds is also essential to prevent the spread of Covid-19.

A. PREPROCESSING
The preprocessing phase increases the size of the dataset by generating new images from the available data with a segmentation method. The segmentation divides each image into nine non-overlapping sub-images and then calculates the density map for each part. On the one hand, the sum of a density map gives the number of heads in that map. On the other hand, it is necessary to know how the total count is distributed over the image segments. Thus, the dot-map is first created for the entire image and then segmented. The sum and the positions of the '1's in a dot-map segment then represent the headcount and the locations of the heads in the corresponding image segment. The full dot-map is also needed to perform labeling. Figure 2 shows the segmentation method.
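The segmentation step can be sketched as follows, assuming the dot-map is a binary 2-D grid with a '1' at each annotated head center and that the nine segments form a 3 x 3 grid. The helper names and the toy 6 x 6 dot-map are hypothetical; the authors' MATLAB implementation is not shown in the text.

```python
def split_into_patches(dot_map, grid=3):
    """Split a 2-D dot-map into grid x grid non-overlapping patches."""
    rows, cols = len(dot_map), len(dot_map[0])
    # Patch boundaries; integer division spreads any remainder across patches.
    row_edges = [rows * k // grid for k in range(grid + 1)]
    col_edges = [cols * k // grid for k in range(grid + 1)]
    patches = []
    for i in range(grid):
        for j in range(grid):
            patch = [row[col_edges[j]:col_edges[j + 1]]
                     for row in dot_map[row_edges[i]:row_edges[i + 1]]]
            patches.append(patch)
    return patches


def head_count(dot_map):
    """The sum of the '1's in a dot-map (or patch) is its headcount."""
    return sum(sum(row) for row in dot_map)


# Toy dot-map with eight annotated heads.
dot_map = [
    [1, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
]
patch_counts = [head_count(p) for p in split_into_patches(dot_map)]
print(patch_counts)  # [1, 1, 1, 1, 0, 1, 1, 1, 1]
```

Because the patches do not overlap, the patch counts always add up to the headcount of the full dot-map, which is why the full dot-map can be segmented after it is created.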

B. FEATURES EXTRACTION AND LABELING
In this phase, the main feature of an image, namely the number of heads, is extracted. All images are then labeled according to their number of heads and assigned to 20 or 33 labels (classes), which are used in two different experiments.
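The text does not state the count ranges assigned to each class. Purely as an illustration, the sketch below maps a headcount to a class index using equal-width ranges; the equal-width rule, the function name, and the maximum count of 400 are assumptions and not the authors' labeling scheme.

```python
def count_to_class(head_count, num_classes, max_count):
    """Map a headcount to a class index in [0, num_classes - 1].

    Assumes equal-width count ranges (an illustrative choice only).
    """
    if head_count < 0:
        raise ValueError("head_count must be non-negative")
    width = max_count / num_classes
    # Counts at or above max_count fall into the last class.
    return min(int(head_count // width), num_classes - 1)


# Hypothetical setting: patch headcounts between 0 and 400, 20 classes.
print(count_to_class(45, num_classes=20, max_count=400))   # 2
print(count_to_class(400, num_classes=20, max_count=400))  # 19
```

With more classes the ranges become narrower, so the same headcount error is more likely to cross a class boundary.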

C. PROPOSED MODEL ARCHITECTURE DETAILS
In this phase, we propose a Single Convolution Neural Network with three convolution layers (Single-CNN3) to count the crowd in a scene. First, to turn the problem from counting into density estimation, several classes are created, each covering a different range of crowd counts. Each range defines a specific density level. These levels serve as crowd indicators, where high levels indicate a risk of congestion.
Second, we built the Single-CNN3 model with a simple structure to solve the crowd density estimation problem as a classification rather than a regression problem. It consists of a single CNN column with three convolution layers, each with a filter size of [1 1]. The first and third layers use 64 filters, whereas the second layer uses 32 filters. In addition, the architecture includes two max-pooling layers, three batch normalization layers, three ReLU layers, one fully connected layer, and one softmax layer, besides the input layer and an output layer of n neurons representing the n density classes. The model weights are initialized from a Gaussian distribution, and the learning rate is 0.0001.
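As a rough check of how small the convolutional part of such a network is, the sketch below counts the learnable parameters of three [1 1] convolution layers with 64, 32, and 64 filters. The three input channels (RGB) and the presence of one bias per filter are assumptions; batch normalization and fully connected parameters are left out because they depend on details not given in the text.

```python
def conv_params(in_channels, out_channels, kernel=(1, 1)):
    """Learnable parameters of a convolution layer: weights plus biases."""
    kernel_height, kernel_width = kernel
    weights = kernel_height * kernel_width * in_channels * out_channels
    biases = out_channels
    return weights + biases


# (input channels, number of filters) per layer; 3 input channels is assumed.
layers = [(3, 64), (64, 32), (32, 64)]
per_layer = [conv_params(c_in, c_out) for c_in, c_out in layers]
print(per_layer)       # [256, 2080, 2112]
print(sum(per_layer))  # 4448
```

A [1 1] filter mixes information across channels at each pixel but does not look at neighboring pixels, so spatial context in this sketch would come only from the pooling layers.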
The max-pooling layers help to maintain the dominant features and reduce the size of the input. The earlier convolution layers extract low-level features such as edges, while high-level features are extracted in the later convolution layers. Those features are combined in a fully connected layer followed by a softmax layer before the output layer. The softmax layer must have the same number of nodes (n) as the output layer so that the input is classified into one of the n density classes. Since the dataset is small whereas the number of labels (classes) is large, the images are divided into nine non-overlapping patches to make the dataset nine times larger. This process requires creating a density map for each patch. Figure 3 shows the architecture of our proposed model (Single-CNN3), whereas Figure 4 provides more details in terms of layers and mechanisms.
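The role of the softmax layer can be illustrated with a short sketch that turns n raw class scores into n probabilities and picks the most likely density class. The scores below are made-up values for a three-class case, not outputs of the proposed model.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical fully connected outputs, n = 3
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))
print(predicted_class)  # 0
```

The predicted density class is simply the index of the largest probability, so the softmax output needs exactly one node per class.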

A. TOOLS AND EQUIPMENT
The proposed model was implemented in MATLAB with specific training options. Table 1 lists our training parameters. The simulation equipment is an MSI laptop with an Intel Core(TM) i7-9750H CPU @ 2.60 GHz, 16 GB of RAM, and an NVIDIA GeForce RTX 2060 GPU.

B. THE SHANGHAITECH DATASET
ShanghaiTech is a large-scale crowd density estimation dataset including 1198 annotated images with a total of 330,165 persons whose head centers are annotated. This dataset is the largest one in terms of the number of annotated people [38]. The dataset consists of two parts. The first part contains 482 images, which are randomly crawled from the Internet as well as from Arabian countries and Makkah. The second part contains 716 images taken from the busy streets of metropolitan areas in Shanghai. Both parts are divided into training and testing groups. The crowd density varies significantly between the two subsets, making accurate estimation of the crowd more challenging than in most existing datasets [39]. Figure 5 shows samples of the dataset.

C. PERFORMANCE METRICS
To evaluate the performance of our proposed model, three performance metrics are calculated over several experiments.

1) CLASSIFICATION ACCURACY
Accuracy is defined as the percentage of correct predictions out of the total number of inputs [40]. It is given by equation 1.
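The equation referenced here is not reproduced in this version of the text. Based on the verbal definition, it can plausibly be written as follows, where the symbols $N_{\text{correct}}$ (number of correct predictions) and $N_{\text{total}}$ (total number of inputs) are notation introduced for this sketch rather than the authors' own:

```latex
\begin{equation}
\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}} \times 100\%
\end{equation}
```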

2) LOSS RATE
The loss metric measures how bad the model's prediction is on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. Thus, the goal of training a model is to find a set of weights and biases that have a low loss, on average, across all examples. The loss is calculated on the training and validation sets and indicates how well the model is doing on each of them. Unlike accuracy, the loss is not a percentage; it is a summation of the errors made for each example in the training or validation set [41].
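The text does not name the loss function. A common choice for a softmax classifier is the cross-entropy loss, which is assumed in the sketch below; the function names and the sample probabilities are illustrative only.

```python
import math


def cross_entropy(predicted_probs, true_class):
    """Loss of one example: zero when the true class gets probability 1."""
    # The small floor avoids log(0) for a confidently wrong prediction.
    return -math.log(max(predicted_probs[true_class], 1e-12))


def mean_loss(batch_probs, batch_labels):
    """Sum the per-example losses and average them over the set."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, batch_labels)]
    return sum(losses) / len(losses)


# Hypothetical two-class predictions for two examples.
batch_probs = [[1.0, 0.0], [0.5, 0.5]]
batch_labels = [0, 1]
print(cross_entropy(batch_probs[0], batch_labels[0]) == 0)  # True
print(mean_loss(batch_probs, batch_labels))
```

The first example is predicted perfectly and contributes no loss, while the uncertain second example contributes a positive loss, which matches the behavior described above.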

3) CONFUSION MATRIX
We used the confusion matrix to evaluate, visualize, and summarize the performance of the classification model [42], as it shows the actual and predicted classifications made by our model [43].
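A confusion matrix can be built from actual and predicted class indices as in the sketch below, where rows are actual classes and columns are predicted classes. The labels are toy values for three classes, not results from the experiments.

```python
def confusion_matrix(actual, predicted, num_classes):
    """matrix[a][p] counts the examples of actual class a predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Correct predictions lie on the diagonal of the matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


actual = [0, 0, 1, 2, 2, 2]
predicted = [0, 1, 1, 2, 2, 0]
matrix = confusion_matrix(actual, predicted, num_classes=3)
print(matrix)  # [[1, 1, 0], [0, 1, 0], [1, 0, 2]]
print(accuracy_from_matrix(matrix))
```

Off-diagonal entries show which density classes are confused with each other, which is information that the accuracy value alone does not provide.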

A. EXPERIMENT AND RESULTS
The Single-CNN3 model is applied to the ShanghaiTech dataset [44]. This crowd density estimation dataset includes the labels used by the neural network to train a model and provide context. It contains a large number of images with various crowd levels, which allows efficient learning over a variety of congestion situations. The model is trained using multiple GPUs in parallel to speed up the training process. Part A and Part B of the dataset are merged and randomly split into training and test subsets of 80% and 20%, respectively. In this work, a number of classes are created, each covering a specific range of head counts and representing a specific level of crowdedness. Several experiments are performed with a total of 20 and 33 classes. The classification problem becomes more difficult as the number of classes increases. This step therefore turns our problem from regression into classification with a large number of classes.
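The random 80%/20% split can be sketched as follows. The function name, the fixed seed, and the choice to split at the image level (rather than at the patch level) are assumptions, since the text does not specify these details.

```python
import random


def random_split(items, train_fraction=0.8, seed=0):
    """Shuffle the items and split them into training and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# 1198 image identifiers stand in for the merged Part A and Part B images.
image_ids = range(1198)
train_ids, test_ids = random_split(image_ids)
print(len(train_ids), len(test_ids))  # 958 240
```

Fixing the seed makes a run repeatable, while changing it gives the different random splits needed when an experiment is repeated several times.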
In the first phase (20 classes), the best validation accuracy was 100%, whereas the worst was 99.36%. In terms of loss, the results ranged from 0.017 to 0.05. Finally, the test accuracy ranged between 99.42% and 99.88%.
In the second phase, the same simulation scenario is performed, except that the dataset is labeled with 33 classes. The results are slightly higher than those obtained in the first phase: the average validation accuracy is 99.86%, the average loss is 0.01748, and the average test accuracy is 99.8794%. The best validation accuracy was 100%, whereas the worst was 99.61%. The loss ranged between 0.011 and 0.027. Finally, the best test accuracy was 100%, and the worst was 99.78%. Figures 6 and 7 show the evolution of the accuracy and loss metrics during training for the first and second phases, respectively. Tables 2 and 3 show the results for the two phases. Figure 8 shows the confusion matrix of run #1 in the first phase (Table 2). Figure 9 shows the confusion matrix of run #1 in the second phase (Table 3).
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.

B. PERFORMANCE EVALUATION
The performance evaluation compares the results and other characteristics of our proposed Single-CNN3 model with those of the Single-CNN4 and Switch-CNN models on the ShanghaiTech dataset. Table 4 shows the results of the three models in terms of number of classes, validation loss, validation accuracy, and test accuracy. The Single-CNN3 model achieved the highest validation accuracy and test accuracy using 20 labels. With 33 labels, the Single-CNN4 model achieved the highest validation accuracy, whereas the Single-CNN3 model achieved the highest test accuracy. In Table 4, Switch-CNN achieved the lowest test accuracy (76.3%), whereas Single-CNN3 reached the highest test accuracy of 99.88%, obtained in the second phase. Single-CNN3 also achieved a high validation accuracy of 99.86%, close to the 99.88% obtained by Single-CNN4 (8th phase). Moreover, our proposed model achieved the minimum validation loss of 0.02 as the average result of the second phase. As a result, the Single-CNN3 model performs better than the Single-CNN4 and Switch-CNN models in terms of accuracy and loss, although both Single-CNN3 and Single-CNN4 have simple structures, use a single-column CNN, and solve the problem as a classification problem.

VII. WORK CONTRIBUTIONS
The most significant contributions of this research are summarized as follows:
1. It contributes to establishing a system capable of estimating crowds automatically and quickly during specific events such as pilgrimages and religious ceremonies. It also saves time, effort, and cost compared with traditional methods for counting and controlling crowded places.
2. The density estimation challenge has been recast from a regression problem into a classification problem by defining specific classes that correspond to various density levels.
3. The performance evaluation has been carried out using the well-known ShanghaiTech dataset, which reflects crowded events.
The targeted applications of this system include helping the security authorities to control and organize crowds in touristic and recreational places as well as scientific edifices (e.g., universities, beaches, stadiums, and commercial centers). The system can also help to secure the supervision of Covid-19 vaccine supply in the available centers.

VIII. CONCLUSION
In this paper, we studied convolutional neural network-based approaches designed to accurately estimate the crowd density level in different environments. Recently, deep learning has attracted the interest of the research community and industry in various applications such as image classification and speech recognition. Our proposed approach, a CNN with three layers, is applied to the ShanghaiTech dataset, which is a large crowd counting dataset in terms of annotated heads. The approach achieved a high accuracy of up to 100% and a low loss rate. A comparative study was then established between our proposed model and the Switch-CNN model in terms of accuracy and loss metrics. In summary, the model is built on a three-layer CNN, uses a very large and recognized crowd counting dataset, and is evaluated against existing state-of-the-art models.
In conclusion, our approach contributes to establishing a system capable of estimating crowds automatically and quickly during pilgrimages and religious ceremonies. Counting crowds faster and more efficiently helps the security authorities to disperse crowds for the safety of all. Furthermore, the system can help to secure the supervision of Covid-19 vaccine supply in the available centers.
As future work, our approach can be applied in a more general context, including various databases. Furthermore, the model can be used by the concerned authorities in the governmental and non-governmental sectors for crowd density estimation and control.