Wi-CaL: WiFi Sensing and Machine Learning based Device-Free Crowd Counting and Localization

Wireless sensing based on WiFi channel state information (CSI) now enables a variety of applications such as person identification, human activity recognition, occupancy detection, localization, and crowd estimation. So far, WiFi CSI-based methods have mostly treated these fields as separate topics, whereas some camera and vision-based crowd estimation systems intuitively estimate both crowd size and location at the same time. Our work is inspired by the idea that WiFi CSI may be able to do what a camera does. In this paper, we construct Wi-CaL, a simultaneous crowd counting and localization system that uses ESP32 modules for WiFi links. We extract several features that reflect the dynamic state (moving crowd) and the static state (location of the crowd) from the CSI bundles, then assess our system with both conventional machine learning (ML) and deep learning (DL). In the ML-based evaluation, we achieved 0.35 median absolute error (MAE) of counting and 91.4% localization accuracy with five people in a small-sized room, and 0.41 MAE of counting and 98.1% localization accuracy with 10 people in a medium-sized room, under leave-one-session-out cross-validation. We compared our result with the percentage of non-zero elements metric (PEM), which is a state-of-the-art metric for crowd counting, and confirmed that our system shows higher performance (0.41 MAE, 81.8% of within-1-person error) than PEM (0.62 MAE, 66.5% of within-1-person error).


I. INTRODUCTION
Technological prediction of how people will behave or make decisions has become increasingly important in modern society, since the human population has grown beyond the range of manual processing. Before making such predictions, we naturally first need to estimate the current situation of people in an area of interest. Crowd estimation is one of the techniques that can contribute to understanding various situations. Take, for example, a retail store or supermarket with separate sections divided by product type: if we can recognize how many people are passing by and gathering at a certain area or passage at a specific time (current situation), we can predict sales trends of particular goods as well as time-specific section congestion (prediction). This enables the shop manager to arrange the products appropriately and assign the optimal work schedule for staff, and it could be especially meaningful for crowd dispersal in a situation such as the COVID-19 pandemic that has spread since 2019. Since the same applies to museums, exhibitions, and expositions, we can acquire real-time crowd information for each area in those places by deploying the crowd estimation system area by area.
Today, the most universal method for crowd estimation is the vision-based technique, and wireless sensing-based approaches are rapidly catching up with it. Camera and vision-based techniques can intuitively count humans with good accuracy thanks to well-developed head counting and pattern recognition in images [1]-[3]. In particular, they have an advantage in estimating an extensive crowd over a huge outdoor area. However, vision-based approaches also have some critical weaknesses, such as unavailability under dim light, the impracticality of widespread camera installation, underestimation due to occlusion by objects, and privacy concerns. Nowadays, many technical approaches for both indoor and outdoor crowd estimation have been attempted using various wireless sensing technologies, e.g., WiFi [4], PIR sensor [5], Bluetooth [6], wireless sensor network [7], and also combinations of multiple wireless sensing technologies such as WiFi, UWB, and light sensor [8]. Among them, WiFi sensing-based methods are now in the spotlight because of WiFi's pervasiveness and fine-grained source data such as channel state information (CSI).
WiFi sensing can be divided into two major approaches: CSI and passive WiFi radar. In [9], Li et al. compared the fundamentals and activity recognition results of both systems. They evaluated the systems by machine learning, then concluded that the CSI-based system performs better in a line of sight (LoS) condition, whereas the radar-based system shows better performance in a non-LoS environment. In this paper, we address a WiFi CSI-based crowd estimation approach, because our target area is the indoor LoS-link environment. Meanwhile, we adopt machine learning to assess our system performance. Recent WiFi sensing techniques are often combined with IoT and machine learning technologies, as is spectrum sensing [10], [11], which is the basis of wireless channel sensing in the field of cognitive radio.
Although numerous WiFi-based human sensing techniques have been studied so far [12], [13], most of those studies focus on resolving only a single issue such as person identification [14], respiration detection [15], activity recognition [16], or human detection [17]. In particular, crowd counting and localization are treated as separate issues in most cases. On the other hand, one thing we need to note regarding vision-based methods is that they can count people while also recognizing, from the image or video, which part of the area the people are gathered in. In practice, there are several camera-based studies addressing both crowd counting and localization [18], [19]. Knowing the location of a crowd has great advantages in terms of system distribution cost and energy efficiency. If the system can recognize not only the number of people in a crowd but also where the people are gathered, we will be able to deploy the sensing devices sparsely in an area of interest instead of installing them densely to estimate the situation of every small separate section. Also, we can provide a targeted air-conditioning service toward a more crowded location by graded adjustment of multiple air conditioners in a large room or area.
In our previous work [20], we were inspired by the idea that what a camera can do can also be done by wireless sensing, and to the best of our knowledge, it was the first attempt at simultaneous crowd estimation using WiFi CSI. Through this work, we further reveal the potential of WiFi CSI toward a comprehensive crowd estimation system. We propose Wi-CaL, a method for device-free crowd counting and localization, and evaluate the system through experiments with further enhanced features and more people than in the previous work, at two different test areas. To examine the new WiFi CSI platform, we utilize ESP32 nodes, a compact IoT solution for WiFi/Bluetooth communication and sensing, instead of the less accessible conventional WiFi CSI tools. We show convincing results obtained by machine learning using practical experiment data from two test fields with up to 10 people. Finally, we provide diverse and detailed analytic comparisons covering several conditions that influence system performance. The main contributions of this paper are as follows:
• First, we demonstrated the feasibility of a real-time simultaneous crowd estimation system that can precisely estimate not only the crowd count but also the location of the crowd in parallel.
• Second, we examined the potential of ESP32 nodes and the CSI toolkit to become a promising WiFi sensing platform, and confirmed that they have sufficient sensing resolution for medium-scale crowd estimation.
• Third, practical validation was conducted in two different real environments: a small-sized meeting room with five people and a medium-sized seminar room with 10 people.
• Fourth, we evaluated the system performance by leave-one-session-out cross-validation to reflect changes in CSI tendency caused by time-varying environmental factors, as well as by continuous data series (k-fold cross-validation).
• Finally, diverse analytic results were obtained by machine learning (regression analysis for crowd counting and classification for crowd localization) with comparisons depending on conditions and parameters. Additionally, we compared these results with those obtained by deep learning.
The rest of this paper is organized as follows. In Section II, we first briefly review the studies related to crowd estimation. We then address the background of WiFi CSI and its solutions, and our observations of CSI characteristics, in Section III. The proposed system Wi-CaL for crowd counting and localization is described in Section IV. We present the evaluation method of our system and the results in Section V. Finally, we discuss the current state and future work in Section VI and then conclude this paper in Section VII.

II. RELATED WORK
In this section, we review the literature related to WiFi CSI-based sensing techniques, focusing mainly on crowd estimation. Since a significant variation of CSI can be observed only when the multipath environment changes or when LoS blockage events occur on a WiFi link, most WiFi sensing-based human sensing approaches rely on the mobility of the target object. Therefore, all of the following crowd estimation systems assume situations in which people are walking in or passing through the WiFi channels, the same as our work.
Depatla and Mostofi [21] presented a technique for through-wall crowd counting based only on WiFi received signal strength (RSS). In the paper, they emphasized that through-wall counting should be demonstrated in case there is no available WiFi device in an area, pointing out that transceivers are located within the area of interest in all the conventional counting methods. They proposed a motion model for multi-people walking to estimate the number of people walking inside with one pair of WiFi transceivers behind walls. Ibrahim et al. [22] proposed a deep learning system for WiFi-based human counting. They also used WiFi RSS measurements to detect temporal line of sight (LoS) blockage of a single WiFi link. They utilized LoS blockage detector to measure its timing and long short-term memory (LSTM) model to overcome the vanishing gradient problem during long sequences training. They showed that the system is able to count the people with 63% of count accuracy in a small room with up to seven people, and 55% of count accuracy in a medium-sized room with up to 10 people.
Liu et al. [23] proposed a deep learning-based crowd counting approach using WiFi CSI. Both CSI amplitude and phase are used as source data in the system, and they attempted to use two filters to smooth those measurements. They provided a performance comparison depending on the time window size, neural network structure, and preprocessing method. The system showed 82.3% of average recognition accuracy with up to five people. Di Domenico et al. [24] presented a differential CSI approach for counting with a trained-once classification model. The normalized Euclidean distance between two CSI vectors is used as the basic metric of the system to reduce the dependence on the background environment. They trained a classifier with the data from a medium-sized room, and tested it with the data from small-sized and large-sized rooms. The system showed 74% of classification accuracy on the small room data, and 52% on the large room data.
Zou et al. [25] proposed FreeCount, a device-free crowd counting scheme using a modified CSI tool running on commercial WiFi devices. They adopted the transfer kernel learning (TKL) model to take account of the temporal variation of CSI measurements, and trained the model with 20 features based on CSI data de-noised by a wavelet filter, which are categorized into common statistics, transformation-based, and shape-based features. In addition, they extended and further developed their system into WiFree in [26]. They mainly measured the shape similarity between adjacent time-series CSI curves to distinguish the number of people. A feature selection method was also presented in the paper to identify the most informative features for the system. They demonstrated the system in three different-sized rooms with four, seven, and 11 participants, respectively, and achieved 99.1% of occupancy detection accuracy and 92.8% of crowd counting accuracy.
Xi et al. [27] proposed a device-free crowd counting approach using the percentage of non-zero elements (PEM) and the Grey Theory, where PEM is a metric of the dilated CSI matrix for crowd counting proposed in the paper. The value of PEM reflects the fluctuation of the CSI signal through a matrix with '0' or '1' elements, based on the idea that the more unstable the signal is, the more '1' elements the dilated CSI matrix contains. This is the grounds for a monotonic relation between the number of people and PEM. They evaluated their system with the Intel 5300 NIC-based CSI tool, and their results showed that the ratio of estimation errors within two people was 98% in the indoor area and 70% in the outdoor area.
Some works use this PEM as a main metric of their system. Li et al. [28] presented a device-free indoor people-counting method based on WiFi CSI and PEM. To calculate PEM, they made dilated matrix by the covariance matrix of both CSI amplitude and phase. Their system achieved robustness and detection performance by combining the amplitude and phase information in CSI data, and validated a monotonic relation between CSI variation and crowd number. It is shown that the system can get 92% of accuracy with up to eight people. Meanwhile, Zhou et al. [29] proposed the crowd counting technique by using WiFi CSI and deep neural networks (DNN), and PEM. They also leveraged PEM to construct the monotonic relationship between the change of CSI amplitude and people count by the DNN regression model. One pair of WiFi links was used in their experiment with Intel 5300 NIC-based CSI tool. They achieved 0.11 of mean counting error in a medium-sized meeting room with up to five people and 0.14 of mean counting error in a hall with up to 34 people.
In [30], Xu et al. described SCPL system which can perform the counting and localization in parallel. The system consists of two phases, first is counting subjects by successive cancellation (iteratively subtracting an impact of one target from the measurements) and the other is localizing each subject by indoor human tracking model. They tested their system in two indoor environments with four people, then achieved up to 86% of counting accuracy and 1.3 m of average localization error. However, they only used WiFi RSS as their system's source data, leading to very extensive distribution of necessary WiFi devices (about 20 nodes for each test area) for high accuracy. Since this work is addressing multi-subject counting and individual tracking, it is essentially different from our work which is estimating the number of people in the crowd and the sectioned location of the human cluster itself.
Mohammadmoradi et al. [8] presented multi-modal people counting by a combined system of multiple wireless sensors such as WiFi, UWB, and light sensors. Their estimation is performed based on the detection of the flow of people getting into a room or going out of the room through the sensor sets installed on both sides of the door. They described that each sensor can independently detect a person's passage by variation of the sensor signal, then the final decision is made by a majority vote between the different sensors. Also, they tested that each sensor can tell the obvious difference of when multiple people move in/out together at the same time. As a result, WiFi and UWB could distinguish the cases of the movement of multiple targets (up to three people), and the system showed 96% of overall performance in passage counting.
Finally, Zheng et al. [31] examined the impact of radio frequency interference (RFI) on WiFi CSI measurements, and proposed cyclostationary analysis-based RFI detection algorithms. They pointed out that, even though CSI-based sensing applications have been widely studied in recent years, the RFI problem has been overlooked and unexplored in the field of WiFi sensing. Therefore, they conducted real-world experiments with WiFi (main signal source), and ZigBee, Bluetooth, and microwave (RFI sources). They provided several comparisons depending on the evaluation metric, interference type, RFI-Rx distance, and Tx-Rx distance, and the system eventually showed over 90% of RFI detection accuracy.
All the above-mentioned studies utilized conventional WiFi routers and old CSI platforms that require particular WiFi modules such as the Intel 5300 NIC or Qualcomm Atheros WiFi chips. In our work, we leverage ESP32 transceivers, the latest WiFi IoT CSI solution, as the signal source. Although conventional WiFi routers can obtain more fine-grained and stable CSI measurements, we will show that our system can also achieve promising, convincing, and even better performance. Most of all, we differentiate our work from other related works by revealing the possibility and potential of WiFi IoT sensing-based simultaneous crowd estimation for both counting and localization.

III. WIFI CSI PRELIMINARIES
In this section, we briefly describe the basics of WiFi CSI, currently usable solutions and a new promising CSI IoT platform, and our observations.

A. BACKGROUND
As mentioned earlier, many research works leverage WiFi sensing thanks to publicly available solutions for accessing WiFi CSI. CSI represents an estimate of the impulse response of the propagation channel between a transmitter and a receiver in an orthogonal frequency-division multiplexing (OFDM) transmission system. In the frequency domain, the OFDM system is modeled as:

$$\mathbf{y} = H\mathbf{x} + \mathbf{n}$$

where $\mathbf{x}$ and $\mathbf{y}$ are the transmitted and received complex vectors, and $\mathbf{n}$ and $H$ are the noise vector and channel information matrix, respectively. Since CSI is an estimate of $H$, it can be denoted as $\hat{H}$, which is obtained from a transmitter. $\hat{H}$ contains the amplitude attenuation and phase shift of each subcarrier in the form of complex numbers; therefore, these measurements can be denoted as:

$$\hat{H} = \|\hat{H}\| \, e^{j \angle \hat{H}}$$

where $\|\hat{H}\|$ and $\angle \hat{H}$ are the CSI measurements of amplitude attenuation and phase shift, respectively.
The Linux 802.11n CSI tool and the Atheros CSI tool have been widely utilized as CSI-enabled platforms in various publications so far. However, both tools require a laptop or WiFi router equipped with particular WiFi modules: the Intel 5300 NIC for the former, and specific Qualcomm Atheros WiFi chips for the latter. This fundamentally restricts the accessibility of CSI data, and some of those modules can be purchased only on the used-item market. They may also cause inconvenience in device deployment due to the requirement of a laptop or router. Moreover, the Linux 802.11n CSI tool has the constraint that it can provide CSI readings for only 30 of the 64 subcarriers. Therefore, some researchers have modified those CSI tools to fit them into their systems.
In early 2020, an ESP32 CSI toolkit was presented as a new CSI solution, emphasizing its convenience and accessibility [34]. Using this toolkit, the authors of [34] performed further research on human occupancy and direction monitoring in [35]. They conducted a hallway experiment to investigate the capability of ESP-based device-free WiFi sensing for single-person detection and walking direction prediction, even when the Tx/Rx ESP nodes are lined up behind the same side of a wall. In addition, they presented a method of soil sensing using ESP nodes in [36], demonstrating that ESP-based WiFi sensing is effective beyond human sensing. Through [35], [36], they showed the feasibility of this compact ESP32 becoming an alternative solution for WiFi sensing. In this paper, we also adopt the ESP32 CSI toolkit and ESP32 WiFi nodes, which are shown in Figure 1, as the CSI reading devices for WiFi sensing. Since the ESP32 module has a single antenna, it can exploit signals from fewer channels than two-by-two or three-by-three MIMO WiFi architectures. Consequently, we could obtain only a relatively small amount of CSI data. Nevertheless, this low-cost, low-power, compact WiFi node has a great advantage in terms of easy and flexible deployment. We suppose that these compact devices have the potential to become a promising WiFi IoT sensing solution.
For this work, we set several ESP32 nodes as transmitters (access point, AP), and the others as receivers (station, STA), to make multiple WiFi links. We assign a dedicated SSID and password to each Tx/Rx pair for one-to-one communication at a configured packet rate, using the ESP32 CSI toolkit operated from the Linux terminal. Once the ESP32 nodes are powered on, the AP continuously sends CSI requests to the STA. Then, the STA returns the observed CSI information to the AP so that we can get the channel state between AP and STA from the AP side. The ESP32 nodes are operated in 802.11n legacy mode WiFi, which uses the 2.4 GHz band (bandwidth: 20 MHz) and consists of 52 non-null subcarriers [35], [36].
If there are multiple WiFi links in the system, a measured CSI vector $\mathbf{h}_{i,k}$ from the $i$-th packet can be denoted as:

$$\mathbf{h}_{i,k} = (h_{i,1,k}, \cdots, h_{i,j,k}, \cdots, h_{i,n_s,k})$$

where $h_{i,j,k}$ is the complex CSI value of the $j$-th subcarrier measured in the $k$-th link, and $n_s$ is the total number of available subcarriers. Since the complex CSI values contain information on both amplitude $a_{i,j,k}$ and phase $\phi_{i,j,k}$, these can be calculated by the following equations:

$$a_{i,j,k} = \sqrt{\mathrm{Re}(h_{i,j,k})^2 + \mathrm{Im}(h_{i,j,k})^2}, \qquad \phi_{i,j,k} = \mathrm{atan2}\big(\mathrm{Im}(h_{i,j,k}), \mathrm{Re}(h_{i,j,k})\big)$$

where $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ return the real and imaginary parts of a complex number, respectively, and $\mathrm{atan2}(y, x)$ is the 2-argument arctangent function.
In this paper, we use only the amplitude values $a_{i,j,k}$ for our system. This is because the purpose of this work does not strictly require the contribution of the phase shift value. The phase shift value is required for some applications that need angle of arrival (AoA) or time of flight (ToF), but it is excluded in some cases because its severe offset, caused by hardware and software errors, makes it difficult to clarify the signal pattern, as described in [12].
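The amplitude and phase calculation above can be illustrated with a minimal Python sketch. The function name `csi_amplitude_phase` and the three-subcarrier example vector are hypothetical and are not part of the ESP32 CSI toolkit; an actual CSI vector would contain 52 values per packet.

```python
import math

def csi_amplitude_phase(csi_vector):
    """Return (amplitudes, phases) for one packet's complex CSI values.

    csi_vector: list of complex numbers, one per subcarrier.
    """
    # Amplitude: sqrt(Re^2 + Im^2) for each subcarrier
    amplitudes = [math.hypot(h.real, h.imag) for h in csi_vector]
    # Phase: atan2(Im, Re) for each subcarrier, in radians
    phases = [math.atan2(h.imag, h.real) for h in csi_vector]
    return amplitudes, phases

# Hypothetical CSI vector with three subcarriers (illustration only)
h = [complex(3, 4), complex(0, 2), complex(-1, 0)]
amp, phase = csi_amplitude_phase(h)
```

Only `amp` would be passed on to the later processing stages, since the system uses amplitude values alone.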

C. OBSERVATIONS
WiFi CSI provides measurements of the signal amplitude and phase information at the subcarrier level. To investigate the CSI amplitude data, we look into a subcarrier-amplitude plot that shows the signal magnitudes of each subcarrier within a certain time interval. In our system, for example, the time-series CSI data is segmented into six-second time windows to convert it into overlapped CSI curves (as we will describe in Section IV). In one time window, we call the overlapped CSI curves a CSI bundle. Figure 2 visualizes the CSI bundles in several different situations. CSI bundle shows a specific tendency in terms of the width and shape, therefore, it reveals a couple of characteristics in accordance with the propagation condition between WiFi AP and STA, which is changed by moving objects or channel circumstances. Those characteristics can be represented in dynamic and static state-dependent characteristics, which are described in the following subsections.

1) Dynamic State-dependent Characteristic
For crowd counting, we associate the bundle-width variation with the number of people. If there is no person between a WiFi link, the signal multipath or scattering effect is nearly constant and signal variation only comes from observational error, thermal noise, or signal interference. Therefore, the CSI amplitudes across all the subcarriers are relatively stable. On the other hand, as the number of people in the area increases, the multipath environment becomes more and more complicated due to increased moving objects. As a result, the amplitudes fluctuate widely and the CSI bundle width consequently gets thicker. In Figure 2(a) and (b), the black curves form the CSI bundles of the cases when an area is empty and four people are walking within the area, respectively, and the green lines represent the lower and upper quartile values across all subcarriers, which can reveal the difference of bundle width.

2) Static State-dependent Characteristic
In a CSI bundle, we can also recognize a particular shape that depends on the inner structure of the target space and/or the distribution of objects, including human bodies. The basic shapes of CSI curves are determined by the inner structure of the target area. However, a cluster of people consistently moving around within a limited area constantly affects the multipath environment of the WiFi signal. Consequently, this continuous influence also affects the shape tendency of the CSI bundle. Figure 2(c) and (d) show the difference in CSI bundle shape, with the average line in yellow, between two situations in which three people are freely walking within one section and within another section of the target area.

IV. WI-CAL: CROWD COUNTING AND LOCALIZATION
In this section, we propose a WiFi sensing based crowd counting and localization system Wi-CaL that enables both crowd counting and crowd localization.

A. OUTLINE
The final goal of this study is to investigate if the proposed system can estimate not only how many people are in a particular area, but also which specific section of that area people are gathering at. Therefore, we devise effective features for dynamic and static state-dependent characteristics as well as using common statistical features. Since we found that some features extracted from CSI data generally have the monotonic relationship to people count, ML regressor is used for crowd counting. On the other hand, crowd localization should be estimated by ML classifier because we divide the test area into discrete sections. Figure 3 shows the comprehensive flow of our system. We describe the system flow in the following sections, including the scheme and method of data processing and feature extraction in detail.

B. CSI PRE-PROCESSING
In order to leverage CSI readings as informative and effective resources for crowd estimation, it is essential to pre-process the data before the feature extraction. We present the CSI segmentation and smoothing process in this section.

1) Data Segmentation
After receiving the CSI data, which is obtained in the form of a complex vector, the system first calculates amplitude values across all subcarriers as mentioned in Section III-B. After that, the time-series amplitude values are accumulated and segmented into time windows of a given size. Here, we omit the link index $k$ because all the following CSI processing is performed identically regardless of the link number. We can then define a CSI curve vector $\mathbf{a}_i$ of each packet and a time-series amplitude vector $\mathbf{a}_j$ of each subcarrier as follows:

$$\mathbf{a}_i = (a_{i,1}, \cdots, a_{i,j}, \cdots, a_{i,n_s}), \qquad \mathbf{a}_j = (a_{1,j}, \cdots, a_{i,j}, \cdots, a_{n_p,j})^{T}$$

where $i$ and $j$ are the indices of packet and subcarrier, respectively, and $n_s$ and $n_p$ are the total numbers of subcarriers and of packets in a time window, respectively. Then, a CSI bundle $A^{(w)}$ in a time window can be denoted as:

$$A^{(w)} = \begin{pmatrix} \mathbf{a}_1 \\ \vdots \\ \mathbf{a}_i \\ \vdots \\ \mathbf{a}_{n_p} \end{pmatrix} = \begin{pmatrix} a_{1,1} & \cdots & a_{1,n_s} \\ \vdots & \ddots & \vdots \\ a_{n_p,1} & \cdots & a_{n_p,n_s} \end{pmatrix}$$

where $w$ is the index of the time window. We empirically set each time window to contain six seconds of CSI data with a three-second overlap. Since we configure the packet rate of the ESP32 nodes as 100 packets/sec, each time window contains 600 packets ($n_p = 600$). Also, we can obtain CSI readings for a total of 52 available subcarriers ($n_s = 52$). This CSI bundle $A^{(w)}$, which consists of the CSI curves in a 6 s time window, becomes the base unit for our feature extraction process.
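The windowing described above can be sketched as follows. This is a minimal illustration under the stated settings (100 packets/sec, 6 s windows, 3 s overlap); the function name `segment_windows` and the placeholder data are hypothetical, and incomplete trailing windows are simply dropped, which is an assumption.

```python
def segment_windows(packets, packet_rate=100, window_sec=6, overlap_sec=3):
    """Split a time-ordered list of CSI curves into overlapping CSI bundles.

    packets: list of per-packet amplitude lists (each of length n_s).
    Returns a list of bundles, each holding n_p = packet_rate * window_sec curves.
    """
    window = packet_rate * window_sec                 # n_p = 600 packets
    step = packet_rate * (window_sec - overlap_sec)   # advance by 300 packets
    bundles = []
    # Assumption: a trailing window with fewer than n_p packets is discarded
    for start in range(0, len(packets) - window + 1, step):
        bundles.append(packets[start:start + window])
    return bundles

# 15 s of placeholder data: 1500 packets, each with n_s = 52 amplitudes
packets = [[0.0] * 52 for _ in range(1500)]
bundles = segment_windows(packets)
```

With 15 s of data, the windows start at 0 s, 3 s, 6 s, and 9 s, so four bundles of 600 × 52 values are produced.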

2) CSI Smoothing
Since the CSI readings are considerably noisy, it is necessary to remove the redundant components from the calculated amplitude values. For this smoothing process, we apply two filters: a Hampel filter to eliminate spike noise, and a Savitzky-Golay filter to remove overall white noise without distorting the tendency of the signal. These filters are used in several existing studies for WiFi CSI noise reduction because of their low computational cost, as described in [37]. Figure 4 shows the amplitudes of the time-series CSI before applying filters, after applying the Hampel filter, and after applying both the Hampel and Savitzky-Golay filters, respectively.
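As an illustration of the spike-removal step, the following is a minimal Hampel filter sketch in plain Python. The window half-width, the 3-sigma threshold, and the truncated windows at the edges are assumptions, since the filter parameters are not stated here. The subsequent Savitzky-Golay step is not reproduced; in practice it could be applied with a library routine such as SciPy's `scipy.signal.savgol_filter`.

```python
from statistics import median

def hampel_filter(series, half_window=5, n_sigmas=3.0):
    """Replace outliers in a 1-D amplitude series with the local median.

    A sample is treated as a spike when it deviates from the median of its
    neighbourhood by more than n_sigmas scaled median absolute deviations.
    """
    scale = 1.4826  # converts MAD to a standard-deviation estimate (Gaussian data)
    filtered = list(series)
    for i in range(len(series)):
        lo = max(0, i - half_window)                  # window is truncated at the edges
        hi = min(len(series), i + half_window + 1)
        window = series[lo:hi]
        med = median(window)
        mad = median(abs(x - med) for x in window)
        if abs(series[i] - med) > n_sigmas * scale * mad:
            filtered[i] = med                         # replace the spike
    return filtered

# Hypothetical amplitude series of one subcarrier with a single spike
noisy = [10.0, 10.2, 9.9, 10.1, 25.0, 10.0, 9.8, 10.1, 10.0]
cleaned = hampel_filter(noisy)
```

In this example only the spike at index 4 is replaced by the local median; the remaining samples pass through unchanged.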

C. FEATURE EXTRACTION
In this section, we describe all the features extracted from the amplitude signal of WiFi CSI for crowd counting and localization. The features are categorized by three extraction sources for each dynamic and static state, as summarized in Table 1.

1) Common Statistical Features
We calculate common statistical features from the time-series CSI amplitudes. Several statistical functions are applied independently to each subcarrier signal. First of all, we can simply use the standard deviation of the amplitudes of each subcarrier. Intuitively, the more people there are within the WiFi channels, the more complicated the multipath fading channel becomes. This makes the signal amplitude fluctuate more severely across all subcarriers than when there are no people in the area. We have confirmed that the number of people shows a monotonic relationship with the degree of signal fluctuation. A standard deviation vector of subcarriers $\mathbf{std}^{(w)}$ can be denoted as:

$$\mathbf{std}^{(w)} = \big( \sigma(\mathbf{a}_1^{(w)}), \cdots, \sigma(\mathbf{a}_j^{(w)}), \cdots, \sigma(\mathbf{a}_{n_s}^{(w)}) \big)$$

where $\sigma(\mathbf{x})$ denotes the standard deviation of any vector $\mathbf{x}$.
As we can see from the CSI bundles in Figure 2(a) and (b), the uppermost and lowermost CSI curves in a time window gradually rise and fall, respectively, as the number of people increases. This characteristic also represents the linearity between crowd size and CSI signals. The CSI minima vector $\mathbf{min}^{(w)}$ and maxima vector $\mathbf{max}^{(w)}$ can be denoted as:

$$\mathbf{min}^{(w)} = \big( \min(\mathbf{a}_1^{(w)}), \cdots, \min(\mathbf{a}_j^{(w)}), \cdots, \min(\mathbf{a}_{n_s}^{(w)}) \big)$$
$$\mathbf{max}^{(w)} = \big( \max(\mathbf{a}_1^{(w)}), \cdots, \max(\mathbf{a}_j^{(w)}), \cdots, \max(\mathbf{a}_{n_s}^{(w)}) \big)$$

where $\min(\mathbf{x})$ and $\max(\mathbf{x})$ return the minimum and maximum of any vector $\mathbf{x}$, respectively. Similarly, the lower and upper quartile values of all subcarriers also show linear downward and upward trends as the number of people increases. We can denote the lower quartile $\mathbf{qtl}^{(w)}$ and the upper quartile $\mathbf{qtu}^{(w)}$ as:

$$\mathbf{qtl}^{(w)} = \big( q_1(\mathbf{a}_1^{(w)}), \cdots, q_1(\mathbf{a}_j^{(w)}), \cdots, q_1(\mathbf{a}_{n_s}^{(w)}) \big)$$
$$\mathbf{qtu}^{(w)} = \big( q_3(\mathbf{a}_1^{(w)}), \cdots, q_3(\mathbf{a}_j^{(w)}), \cdots, q_3(\mathbf{a}_{n_s}^{(w)}) \big)$$

where $q_1(\mathbf{x})$ and $q_3(\mathbf{x})$ denote the first quartile and the third quartile of any vector $\mathbf{x}$, respectively. The average line of a CSI bundle shows the general shape of the CSI curves in a time window. This mean vector across all subcarriers mainly contributes to the localization part of the system, because it conveys to the learning model the particular bundle shape associated with the specific section of the area of interest in which the crowd is gathered. The mean vector $\mathbf{avg}^{(w)}$ can be denoted as:

$$\mathbf{avg}^{(w)} = \big( \mu(\mathbf{a}_1^{(w)}), \cdots, \mu(\mathbf{a}_j^{(w)}), \cdots, \mu(\mathbf{a}_{n_s}^{(w)}) \big)$$

where $\mu(\mathbf{x})$ is the mean value of any vector $\mathbf{x}$.
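The per-subcarrier statistics described above can be sketched with the Python standard library. The function name `statistical_features` is hypothetical, and two details are assumptions because they are not specified in the text: the population standard deviation (`pstdev`) is used, and quartiles are computed with the inclusive method of `statistics.quantiles`.

```python
from statistics import mean, pstdev, quantiles

def statistical_features(bundle):
    """Compute std, min, max, qtl, qtu, and avg for each subcarrier.

    bundle: list of n_p CSI curves, each a list of n_s amplitudes.
    Returns a dict mapping feature name to a list of length n_s.
    """
    columns = list(zip(*bundle))  # one amplitude time series per subcarrier
    feats = {"std": [], "min": [], "max": [], "qtl": [], "qtu": [], "avg": []}
    for series in columns:
        # Assumption: inclusive quartile method; other conventions differ slightly
        q1, _, q3 = quantiles(series, n=4, method="inclusive")
        feats["std"].append(pstdev(series))  # assumption: population std
        feats["min"].append(min(series))
        feats["max"].append(max(series))
        feats["qtl"].append(q1)
        feats["qtu"].append(q3)
        feats["avg"].append(mean(series))
    return feats

# Hypothetical bundle: 5 packets x 2 subcarriers (a real bundle is 600 x 52)
bundle = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 10.0], [5.0, 10.0]]
feats = statistical_features(bundle)
```

The first subcarrier fluctuates and yields a non-zero spread, while the second is constant, which mirrors the contrast between an occupied and an empty link.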

2) CSI bundle-based Features
To further enhance our system's performance, we need more effective features in addition to the statistical ones. Therefore, we now address the features that can be extracted from the CSI bundles. The interquartile range (IQR) is the width between the lower quartile and the upper quartile. As the number of people within a WiFi link increases, the lower quartile falls and the upper quartile rises; consequently, the IQR also increases, as we can see in Figure 5. We can obtain an IQR vector, which intuitively represents the vertical width of a CSI bundle, by subtracting the lower quartile from the upper quartile as: $\mathrm{iqr}^{(w)} = \mathrm{qtu}^{(w)} - \mathrm{qtl}^{(w)}$. The amplitude difference with adjacent subcarriers is the summation of the absolute differences between one subcarrier and its adjacent subcarriers on both sides. It conveys the relationship between adjacent subcarriers to the ML model, in terms of lightly varying or heavily varying subcarriers depending on the state of the measured space. In this difference with adjacent subcarriers, $\mathrm{adj}^{(w)}$, N is the number of adjacent subcarriers on both sides that are included in the calculation. In this paper, we set N = 2 based on empirical tests. The Euclidean distance between the CSI curve vectors of adjacent packets also contains information on how intensely the multipath fading channel is changing. The Euclidean distance stays relatively low when the channel is not being disturbed by moving people, but a larger crowd in the channel makes the value gradually increase, as we can see in Figure 6. Let med(x) be the median value of any vector x; then $\mathrm{euc}^{(w)}$ is the median, taken with med(x), of the Euclidean distances in a time window.
In localization, we use the coefficients of the polynomial curve fitted to the CSI bundle's average line ($\mathrm{cur}^{(w)}$) and of its first derivative ($\mathrm{der}^{(w)}$), to leverage the particular shape of the CSI bundle as a feature. $\mathrm{cur}^{(w)}$ reflects the shape of the CSI bundle itself, and $\mathrm{der}^{(w)}$ indicates at which points the fitted curve has peaks, valleys, or steep slopes. We empirically apply curve fitting with a 6-term polynomial, then use its polynomial coefficients as the features. Therefore, the $\mathrm{cur}^{(w)}$ and $\mathrm{der}^{(w)}$ feature vectors contain six and five components, respectively.
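As a rough illustration of the bundle-based features, the sketch below computes the IQR vector, the adjacent-subcarrier difference, the median Euclidean distance between consecutive packets, and the coefficients of a 6-term polynomial fit with its derivative. Several details are assumptions, since the original equations are not reproduced here: N is treated as the number of neighbours on each side, neighbours beyond the band edge are skipped, the fit is a plain least-squares fit over subcarrier positions rescaled to [-1, 1], and coefficients are stored in ascending powers. All function names are hypothetical.

```python
import math
from statistics import median


def iqr_vector(qtl, qtu):
    # Vertical width of the bundle per subcarrier: upper minus lower quartile.
    return [u - l for l, u in zip(qtl, qtu)]


def adjacent_difference(amps, n_adj=2):
    # Sum of absolute differences to up to n_adj neighbours on each side.
    # Assumption: neighbours that fall outside the band are skipped.
    out = []
    for i, a in enumerate(amps):
        total = 0.0
        for j in range(1, n_adj + 1):
            if i - j >= 0:
                total += abs(a - amps[i - j])
            if i + j < len(amps):
                total += abs(a - amps[i + j])
        out.append(total)
    return out


def median_euclidean(bundle):
    # Median of the distances between CSI vectors of consecutive packets.
    return median(math.dist(p, q) for p, q in zip(bundle, bundle[1:]))


def polyfit_coeffs(y, n_terms=6):
    # Least-squares polynomial fit with n_terms coefficients (ascending powers).
    # Assumption: x positions are rescaled to [-1, 1] for numerical stability.
    m = len(y)
    xs = [2.0 * i / (m - 1) - 1.0 for i in range(m)]
    # Normal equations (V^T V) c = V^T y for the Vandermonde matrix V.
    a = [[sum(x ** (r + c) for x in xs) for c in range(n_terms)]
         for r in range(n_terms)]
    b = [sum((x ** r) * yi for x, yi in zip(xs, y)) for r in range(n_terms)]
    # Gaussian elimination with partial pivoting.
    for col in range(n_terms):
        piv = max(range(col, n_terms), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n_terms):
            f = a[r][col] / a[col][col]
            for c in range(col, n_terms):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n_terms
    for r in reversed(range(n_terms)):
        s = sum(a[r][c] * coeffs[c] for c in range(r + 1, n_terms))
        coeffs[r] = (b[r] - s) / a[r][r]
    return coeffs


def derivative_coeffs(coeffs):
    # d/dx of sum(c_k x^k) is sum(k c_k x^(k-1)): one coefficient fewer.
    return [k * c for k, c in enumerate(coeffs)][1:]
```

With the default `n_terms=6`, `polyfit_coeffs` returns six values and `derivative_coeffs` returns five, matching the sizes of the two localization feature vectors stated above.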

3) RSS-based Features
Lastly, we use the RSS measurements that are recorded together with the CSI readings. Similar to the statistical features of CSI, the variation of WiFi RSS shows a monotonic relation with the number of people within the link coverage. If we define ρ as the RSS measurement of a packet, the standard deviation of ρ over a time window gives the feature $\mathrm{rss}^{(w)}$.

D. STANDARDIZATION & LEARNING MODELS
The extracted features are concatenated to form the datasets for training each machine learning model of crowd counting and localization. In this study, we treat counting and localization as regression and classification problems, respectively. Each feature vector or feature value is stacked vertically in the order of time windows and horizontally in the order of links. For example, the feature matrix of standard deviation, STD, has one row per time window and one block of columns per link, where $n_k$ and $n_w$ are the total number of WiFi links in the system ($n_k = 4$ in this work) and the total number of time windows for training, respectively. The other feature matrices, such as MIN, MAX, ..., RSS, are produced by the same procedure. Then, all the feature matrices are placed side by side to become the final training dataset. After the training data are formed, all datasets are standardized to the standard normal distribution N(0, 1) to match the scales of the different features before training. Then, machine learning regressors and classifiers are trained with the datasets to evaluate the performance of simultaneous crowd counting and localization.
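The dataset assembly and standardization described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released implementation: each time window is represented as a dictionary mapping a feature name to a list of per-link value lists, the standardization statistics are fitted per column, and a constant column is left at zero instead of being divided by a zero deviation. The names `build_rows`, `fit_standardizer`, and `standardize` are hypothetical.

```python
from statistics import fmean, pstdev


def build_rows(windows, feature_order):
    """One row per time window; features side by side, links side by side.

    windows: list of dicts, windows[w][feature] -> list over links of value lists.
    """
    rows = []
    for window in windows:
        row = []
        for name in feature_order:            # feature matrices side by side
            for link_values in window[name]:  # links in order within a feature
                row.extend(link_values)
        rows.append(row)
    return rows


def fit_standardizer(rows):
    # Column-wise mean and population standard deviation.
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    # Assumption: a zero deviation is replaced by 1 to avoid division by zero.
    stds = [pstdev(c) or 1.0 for c in cols]
    return means, stds


def standardize(rows, means, stds):
    # Map every column to zero mean and unit variance.
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
```

A reasonable design choice, again an assumption here, is to fit `means` and `stds` on the training sessions only and reuse them for the held-out session, so that no test statistics leak into training.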

V. PERFORMANCE EVALUATION
In this section, we present the experimental setup, the data gathering scheme, and several comparisons depending on learning models and adjustable parameters, then evaluate the system performance through experiments in two rooms of different sizes.

A. EXPERIMENTAL SETUP
We collected the CSI data through a multi-scenario experiment with up to five participants in a small-sized meeting room and up to 10 participants in a medium-sized seminar room. Unlike conventional studies, which usually installed a single pair of WiFi routers using Intel or Atheros CSI solutions, we placed four pairs of ESP32 nodes to make four WiFi links that cross the target area vertically, horizontally, and diagonally. This enables the system to faithfully observe the changes in CSI measurements caused by the movement of walking people over the whole target area. For our experiment, all transmitters were set to send CSI request packets to their paired receivers at a packet rate of 100 Hz. We performed our experiments in a small-sized meeting room (5.5 m by 5.5 m) and a medium-sized seminar room (11 m by 5.5 m), each of which was equally divided into four sections for the assessment of crowd localization, as shown in Figure 7(a) and (c). Figure 7(b) and (d) show the actual scenes of our experiment. In both rooms, each WiFi link k consists of AP k (Tx) and STA k (Rx).

B. DATA GATHERING SCHEME
To confirm the effectiveness of our insight into simultaneous crowd estimation, we designed and conducted experiments that contain five scenarios with a given number of people walking in the experiment area. Here, the five scenarios are situations in which the cluster of people walks in different sections of the area. The number of people is denoted as $P_{n_{peo}}$ ($n_{peo} = 0, 1, \cdots, 5$ in the meeting room, $n_{peo} = 0, 1, \cdots, 10$ in the seminar room), and the scenarios related to the section number are denoted as $S_{n_{sect}}$ ($n_{sect} = 1, \cdots, 4$, and oth, which indicates the other pattern, i.e., a full-area walk). To be specific, the scenarios $S_{n_{sect}}$ correspond to the situation in which the participants walk freely within a particular section $n_{sect}$. In the scenario $S_{oth}$, on the contrary, the participants walk freely all over the experiment area. Examples of the $P_{n_{peo}}$/$S_{n_{sect}}$ scenarios are depicted in Figure 8. In every scenario ($S_1$, $S_2$, $S_3$, $S_4$, and $S_{oth}$), all the participants walk randomly within the given space, without any guidance or limitation on how to walk. We collected two minutes of CSI data in each scenario for all combinations of $P_{n_{peo}}$ and $S_{n_{sect}}$. That is, a total of 60 minutes of data (2 min × 5 sections × 0-5 people) in the meeting room and 110 minutes of data (2 min × 5 sections × 0-10 people) in the seminar room were collected in a single experiment. Then, we carried out the identical experiment three times in each of the meeting room and the seminar room, on different days. This is to check the difference in system performance originating from changes in circumstances, such as temperature, humidity, or signal interference. The experiments on different days are distinguished as Session 1, 2, and 3.

C. COMPARISONS
In this section, we first compare our system performance depending on the learning models, including conventional ML models and DNN, then provide further comparisons between LGBM and DNN. We also present the result of a performance comparison between our method and a method based on the conventional metric (PEM), then show how the system performance changes under several different conditions and parameters, such as the time window size, the number of used subcarriers, the number of used links, and the scenario length. All comparisons are based on the results of leave-one-session-out cross-validation from the seminar room (up to 10 people). We report the counting performance by median absolute error (MAE) because a few error outliers are included in the results due to observational error. Here, MAE is the median value of the absolute crowd counting errors, calculated by median( | Real Counts − Estimated Counts | ).
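The two counting metrics used in this paper can be written down directly from their definitions. The sketch below is illustrative: the function names are hypothetical, and treating "within-1-person error" as an absolute error of at most one person (inclusive) is an assumption.

```python
from statistics import median


def median_absolute_error(real_counts, estimated_counts):
    # MAE as defined above: median(|Real Counts - Estimated Counts|).
    return median(abs(r - e) for r, e in zip(real_counts, estimated_counts))


def within_k_ratio(real_counts, estimated_counts, k=1.0):
    # Share of predictions whose absolute error is at most k people.
    errors = [abs(r - e) for r, e in zip(real_counts, estimated_counts)]
    return sum(err <= k for err in errors) / len(errors)
```

Because the median ignores the magnitude of the largest errors, a single outlier prediction changes the within-1-person ratio but barely moves the MAE, which is the motivation given above for reporting the median.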

1) Impact of Learning Model
As we mentioned in Section IV-D, we test four different ML models and DNN for each of counting and localization, and among them we finally select LGBMR and LGBMC for the overall evaluation. In the case of counting, LGBMR shows the second-best performance (0.41 MAE) after DNNR (0.35 MAE), but we use LGBMR as the primary learning model for the reasons discussed in Section V-C2. In localization, LGBMC shows the highest accuracy, 98.1%, and also the smallest error range across the session-wise test results. Figure 9(a) and (b) present the comparison of results by learning model.

2) Further Comparison between ML and DL
As we described in Section II, the authors of [29] assessed their crowd counting system by DNN and used the PEM metric as their system's feature. Hence, we set this related work as our comparison target to weigh the pros and cons of ML (LGBMR) and DL (DNNR), and also of PEM and our features.
To that end, we first calculated the PEM values from our datasets in the same way, then constructed our DL model with the same DNN architecture described in [29], as follows: four hidden layers with [1000, 500, 100, 10] neurons, a learning rate of $10^{-4}$, a batch size of 100, the Adam optimizer, and the ReLU activation function. Figure 10 shows the differences in accuracy and training time depending on the used model, used feature, and epoch setting. The descriptions of the trials in Figure 10 are as follows:
• LGBM-OF: LGBMR trained with our features. It shows 0.41 MAE and requires 4.6 seconds of training time.
• DNN-OF-ES: DNNR trained with our features, with early stopping.
• DNN-OF-22K: DNNR trained with our features, 22000 epochs. 22000 is the same number of epochs as in [29]. It shows 0.03 improved MAE compared to the DNN-OF-ES case, but the required training time is unrealistic (15,867 seconds).
• DNN-PEM-22K: DNNR trained with PEM, 22000 epochs. This is the identical condition to [29]. It shows 0.18 worse MAE than the case of our features (DNN-OF-22K).
Here, early stopping is a method for avoiding overfitting in DNN models by halting model fitting once the validation MAE no longer improves, and patience is the early stopping parameter that sets how many epochs the DNN waits without improvement of the validation MAE. All the above results were obtained on a PC with the following specification: Intel(R) Core(TM) i7-10750H CPU (2.60GHz, 2.59GHz), 16GB RAM, and 64-bit Windows OS. Although DNN shows slightly better performance, we evaluate our system with LGBMR and LGBMC in the rest of this paper. There are several reasons why we use the conventional ML models rather than DL. First, the absence of a significant gap between the ML-based and DL-based results suggests that the features are well designed. Our work focuses on effective feature engineering, that is, finding attributes in the raw data that correspond to the system's goal, rather than on using an advanced learning model. Second, LGBM shows a considerably shorter training time than DNN. Generally, DNN requires a large number of epochs and a longer training time to reach its best performance. We adopted the ESP32 nodes as our CSI reading devices with IoT deployment in mind, so a low-computing-power environment also needs to be considered. In addition, since retraining is currently required for each new target area, the cost of training a DL model would be a high barrier.
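The early stopping rule with a patience parameter, as described above, can be illustrated independently of any deep learning framework. The sketch below replays a list of per-epoch validation MAE values and reports where training would halt; the function name and the exact stopping condition (stop once `patience` consecutive epochs bring no improvement) are assumptions for illustration, not the configuration used in the experiments.

```python
def simulate_early_stopping(val_mae_per_epoch, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation MAE values.

    Training halts at the first epoch where `patience` epochs have passed
    since the best validation MAE without any improvement.
    """
    best_mae = float("inf")
    best_epoch = -1
    for epoch, mae in enumerate(val_mae_per_epoch):
        if mae < best_mae:
            # Improvement: remember it and reset the waiting period.
            best_mae, best_epoch = mae, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return epoch, best_epoch
    # Ran through all epochs without triggering the stop.
    return len(val_mae_per_epoch) - 1, best_epoch
```

A small patience stops early and keeps training cheap but may miss a later improvement, while a large patience approaches a fixed-epoch run such as the 22000-epoch setting.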

3) Comparison with conventional metric PEM by LGBMR
We also compared PEM (with 52 subcarriers) and our features (with 13 subcarriers) by LGBMR. In our testing environment, our features show better performance (0.41 MAE, 81.8% of within-1-person error) than PEM (0.62 MAE, 66.5% of within-1-person error), as shown in Figure 11. To objectively compare the feature importance with PEM, we include the PEM values together with our features in LGBMR for crowd counting. As a result, several of our features, including $\mathrm{adj}^{(w)}$ and $\mathrm{euc}^{(w)}$, rank higher in feature importance than PEM in links 1, 2, and 4, as shown in Table 2. Only in link 3 does PEM show the highest feature importance.

4) Impact of Time Window Size
Since our approach extracts statistical and designed features from the CSI bundle of a single time window, the time window size influences the system performance. In other words, the performance must be evaluated for each time window length, because this length determines the base unit of data for both the learning phase and the online phase. Since a longer time window contains more information and its statistical values are more stable, the system performance improves as the length of the time window increases, as we can see in Figure 12(a) and (b). However, taking the system's real-time estimation capability into account, we decided to set the time window size of our system to six seconds with a three-second overlap.
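The chosen windowing can be sketched as index ranges over the packet stream. This is a simplified illustration that assumes an ideal 100 Hz packet rate with no packet loss, so that windows can be cut by packet index rather than by timestamp; the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(n_packets, rate_hz=100, window_s=6, overlap_s=3):
    """Return (start, end) packet-index pairs, end exclusive.

    Assumption: packets arrive at exactly rate_hz with no loss.
    """
    size = rate_hz * window_s                # packets per window
    step = rate_hz * (window_s - overlap_s)  # hop between window starts
    return [(s, s + size) for s in range(0, n_packets - size + 1, step)]
```

Under these assumptions, a two-minute scenario (12000 packets) yields 39 windows of 600 packets each, and each new estimate becomes available every three seconds.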

5) Impact of Number of Subcarriers
In terms of the number of subcarriers, the difference in system performance is not very significant. Even so, we decided to use the data of 13 subcarriers in our system, since it shows slightly higher performance than the other cases using 4, 26, and 52 subcarriers in both counting and localization, as shown in Figure 12(c) and (d). Here, the used subcarriers are selected at equal intervals from subcarrier 1 to 52 (e.g., 13 subcarriers: 1, 5, 9, ..., 49). A small number of subcarriers also has an advantage in terms of shorter training time. For instance, we measured the training time for each number of subcarriers as 1.4 s (4 subcarriers), 4.6 s (13 subcarriers), 9.5 s (26 subcarriers), and 16.7 s (52 subcarriers) by LGBMR-based leave-one-session-out cross-validation with a 600-minute dataset (10 min of data × 10 people × 3 days × 2-session data for each day). Nevertheless, we use 13 subcarriers here because we also need to consider the performance degradation caused by the mutual similarity between the signal tendencies of the chosen subcarriers, which leads to overfitting.
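The equal-interval selection can be reproduced with a one-line range. The sketch below matches the stated example for 13 subcarriers (1, 5, 9, ..., 49); applying the same rule to the 4- and 26-subcarrier cases, and requiring the count to divide 52 evenly, are assumptions, and the function name is hypothetical.

```python
def select_subcarriers(n_selected, n_total=52):
    """Return 1-based subcarrier indices spaced at equal intervals."""
    if n_selected <= 0 or n_total % n_selected != 0:
        raise ValueError("n_selected must evenly divide n_total")
    step = n_total // n_selected
    # Start at subcarrier 1 and hop by a constant step.
    return list(range(1, n_total + 1, step))
```

For example, `select_subcarriers(13)` starts at subcarrier 1 and ends at subcarrier 49 with a step of 4.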

6) Impact of Number of Links
We placed four WiFi links to cover the whole experiment area without any blind spots. Naturally, the number of WiFi links impacts the system performance; therefore, we compare the accuracy when only a subset of the links' data is used in the learning and testing phases. As we can see in Figure 12(e) and (f), the system performance drops when we include the data of only a single link, gradually improves as the number of links increases, and is best when we use all four links. We can also see that the cases including link 1 show higher MAE than the others. A possible reason is that link 1 in the seminar room was too short to cover the entire area compared to the other links.

7) Impact of Scenario Length
As mentioned in Section V-B, two-minute-long CSI readings were collected for each scenario ($P_{n_{peo}}$/$S_{n_{sect}}$). To figure out how long the scenario data must be for higher accuracy, we compared the performance when only a part of the scenario data or the whole two minutes of data is used in the training phase. We set the scenario length to 30, 60, 90, and 120 seconds, and the corresponding results were 0.47, 0.44, 0.43, and 0.41 MAE in counting, and 97.5%, 98.0%, 98.0%, and 98.1% in localization, respectively. The scenario length does not seem to have a drastic impact on our system performance; nonetheless, the accuracy improves slightly with longer scenario data.

D. OVERALL PERFORMANCE
For the final results, we fixed the optimal conditions and parameters that were confirmed in Section V-C. Our overall performance was obtained under the following conditions: LGBMR and LGBMC models were used for counting and localization, respectively. The time window size for a single CSI bundle was set to six seconds with a three-second overlap. We used 13 subcarriers out of 52, and all four WiFi links. We set the scenario length to two minutes. In the training and testing process, the counting dataset for each crowd count contains the data of all sections ($S_1$-$S_4$, and $S_{oth}$), and the localization dataset for each section contains the data of all crowd counts ($P_1$-$P_5$ in the meeting room, $P_1$-$P_{10}$ in the seminar room).
To compare the overall differences between the performance of ML (LGBM) and DL (DNN), we present all the numerical results from both learning methods in Table 3. In the table, the better result between ML and DL is marked with a shaded background. According to the results of leave-one-session-out cross-validation, in counting, DL showed worse MAE in the meeting room but achieved better MAE in the seminar room; in localization, on the other hand, ML showed better accuracy in both the meeting room and the seminar room. In other words, we cannot conclude that DL always has a clear advantage or always achieves better performance than ML, as mentioned in Section V-C1 and V-C2. We have released the corresponding Python code and feature datasets for the results in Table 3 to the public through GitHub.

1) k-fold Cross-validation
The k-fold cross-validation is a machine learning evaluation method that we use to assess a trained model within a single session's dataset. The whole dataset is first split into k folds. When one fold is selected as test data, the other k − 1 folds become training data. After repeating this process k times, the system performance is derived by averaging the results of all k trials. Specifically, we adopt the stratified k-fold method, which splits the folds so that each fold contains the same ratio of data from each target class. In this study, we empirically set the number of folds to k = 7.
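A stratified split of the kind described above can be sketched without any external library. The version below deals the samples of each class round-robin into k folds, so every fold keeps roughly the same class ratio; it does not shuffle, which is a simplification, and the function name is hypothetical. For counting, the crowd count of each window would play the role of the class label, which is also an assumption.

```python
from collections import defaultdict


def stratified_kfold(labels, k=7):
    """Yield (train_indices, test_indices) for k stratified folds."""
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Deal each class round-robin so folds share the same class ratio.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    # Each fold serves as test data exactly once.
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train, test
```

The reported score would then be the average of the k per-fold results.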
In the meeting room experiment, we achieved 0.16, 0.18, and 0.13 MAE in crowd counting, and 96.5%, 97.1%, and 95.9% classification accuracy in crowd localization, for Sessions 1, 2, and 3, respectively. Meanwhile, in the seminar room experiment, we achieved 0.32, 0.36, and 0.32 MAE in crowd counting, and 95.7%, 96.7%, and 97.3% classification accuracy in crowd localization, for Sessions 1, 2, and 3, respectively. These results are summarized in Table 3.

2) Leave-one-session-out Cross-validation
We have separate datasets from three sessions, which were collected in the same room and with the same scenarios, but on different days. This is to confirm our assumption that the tendency of CSI data changes over time due to differences in temperature, humidity, signal interference, and so on. In that case, a regressor or classifier trained only on a certain session's data might not be adequate for the others. However, only a few existing studies address this time-variant influence on CSI measurements. Hence, to confirm this variation between different sessions, we conducted leave-one-session-out cross-validation. Here, leave-one-session-out means that one whole session is selected as test data to test a regressor or classifier trained on the other sessions. This process continues until every session has served as the test session once. Finally, the system performance is calculated by averaging all the session results.
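The leave-one-session-out procedure can be sketched as a generator over session labels. This is a minimal illustration; the function name and the representation of the data as one session identifier per sample are assumptions.

```python
def leave_one_session_out(session_ids):
    """Yield (held_out_session, train_indices, test_indices).

    session_ids: one session identifier per sample (e.g., 1, 2, or 3).
    """
    for held_out in sorted(set(session_ids)):
        # Train on every sample from the other sessions.
        train = [i for i, s in enumerate(session_ids) if s != held_out]
        # Test on the whole held-out session.
        test = [i for i, s in enumerate(session_ids) if s == held_out]
        yield held_out, train, test
```

With three sessions this produces three train/test splits, and the final score is the average of the three per-session results.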
As summarized in Table 3, we achieved 0.35 MAE with 89.8% of counting predictions falling within a 1-person error in the meeting room experiment, and 0.41 MAE with 81.8% of counting predictions falling within a 1-person error in the seminar room. In crowd localization, we achieved 91.4% and 98.1% classification accuracy in the meeting room and seminar room, respectively. Figures 13 and 14 present the error CDFs of the counting results and the confusion matrices of the localization results from both the meeting room and the seminar room, respectively.

3) Feature Importance
We checked the ranking of feature importance for both counting and localization, from the results of leave-one-session-out cross-validation. As shown in Table 4, the bundle-based features, which are dedicated metrics designed separately for counting and for localization, mostly hold the highest ranks across all links in both estimations. On the other hand, we use the statistical features as a common input. This is because each statistical feature shows a different feature importance depending on the link number, regardless of the kind of estimation (counting or localization) it contributes to. Therefore, it is hard to determine which specific statistical features are always effective for counting or localization, as we can see in the middle-ranked and lower-ranked features of Table 4.
the lack of enough distinct features. Hence, the algorithmic investigation of selective subcarriers for a certain target area would be needed as one of our future works.

B. LAYOUT-INDEPENDENT LEARNING
It is also necessary to conduct leave-one-room-out cross-validation. We implemented our system in a meeting room and a seminar room, which have relatively simple inner structures; however, to examine the feasibility of the system in the real world, it should be tested in public spaces such as supermarkets, museums, and even outdoors. In our future work, we will proceed in stages to examine the system's robustness across diverse indoor layouts, structures, and outdoor environments.

C. LARGE-SCALE HUMAN DENSITY ESTIMATION
In the same context, validating the system's detection limit in terms of the number of people is essential. Our system uses the statistical values and features in time windows of a given size as training data for machine learning. In particular, the crowd count estimation is based on CSI variation and regression analysis, but the fluctuation level of CSI signals is expected to converge at a certain crowd size. Therefore, we need to examine, through larger-scale experiments, the possibility of massive crowd estimation, which is currently possible with vision-based approaches.

D. MULTI-CLUSTER CROWD ESTIMATION
Our system currently has the restriction that it can estimate the crowd information only when a crowd is gathered within a single section ($S_1$-$S_4$) or randomly spread across the entire area ($S_{oth}$). Undoubtedly, the assumption that all people are gathered in a single section of an area is a generous precondition. However, this work is at least significant as a first foundation stone in WiFi sensing-based crowd localization, which can contribute to predicting which part of an area is the most crowded spot in real-world places such as retail stores, supermarkets, or exhibitions. Indeed, the ideal case is to estimate the number of people in each section, like "five people in Section A, three people in Section B", i.e., when the crowd is split into multiple clusters that exist in multiple sections. This detailed estimation will, for instance, help disperse people to a less crowded area in an emergency evacuation. Even though it requires more time and effort to devise a new metric or design a different algorithmic approach, this multi-cluster crowd estimation is the final objective of our future work.

E. COEXISTENCE OF MULTIPLE TYPES OF SIGNALS
As we mentioned in Section II, some studies address WiFi sensing together with other types of wireless signals, such as UWB and visible light [8] or Zigbee, Bluetooth, and microwave [31]. A considerable advantage of WiFi CSI-based human sensing is that it can detect people without installing any other devices, by utilizing pervasive WiFi signals. Nevertheless, different types of wireless sensing could be helpful in some cases; for example, visible light sensors can recognize the obvious change of luminance caused by the passage of a person or a change of crowd size, as presented in [8]. Meanwhile, if we use multiple types of wireless signals, the impact of coexisting radio frequency (RF) signals on WiFi must be considered, so the signal interference should be detected. In this case, the RFI detection algorithm introduced in [31] can be the basis of a solution for eliminating the redundant components in CSI measurements.

VII. CONCLUSIONS
In this paper, we examined the potential and feasibility of a simultaneous crowd estimation system that can predict both the number and the location of a crowd, by using a WiFi IoT CSI solution and machine learning. We also confirmed the pros and cons of conventional machine learning and deep learning in crowd estimation through empirical comparisons. For the first time, we utilized ESP32 nodes and their CSI toolkit as the WiFi sensing source for medium-scale crowd counting and localization instead of conventional WiFi; we therefore provided the initial foundation of this new CSI platform through various comparisons. We conducted empirical experiments with up to 10 people (for crowd counting) in two four-sectioned real environments (for crowd localization) on three different days. By leave-one-session-out cross-validation, our system achieved 0.35 MAE of counting error (89.8% of within-1-person error) and 91.4% localization accuracy with five people in a small-sized meeting room, and 0.41 MAE of counting error (81.8% of within-1-person error) and 98.1% localization accuracy with 10 people in a medium-sized seminar room, through machine learning.

From April 2015 to September 2021, he was an assistant professor at the Graduate School of Science and Technology, Nara Institute of Science and Technology (NAIST). Since October 2021, he has been an associate professor at the Graduate School of Engineering, Osaka City University. He is also an adjunct associate professor with the NAIST. His research interests include ubiquitous computing, wireless networks, sensing technology, and elderly monitoring. He is a member of the IEEE, IEICE, and IPSJ.