Deep Learning-Based Packet Detection and Carrier Frequency Offset Estimation in IEEE 802.11ah

Wi-Fi systems based on the IEEE 802.11 standards are the most popular wireless interfaces that use the Listen Before Talk (LBT) method for channel access. The distinctive feature of a majority of LBT-based systems is that the transmitters use preambles that precede the data to allow the receivers to perform packet detection and carrier frequency offset (CFO) estimation. Preambles usually contain repetitions of training symbols with good correlation properties, while conventional digital receivers apply correlation-based methods for both packet detection and CFO estimation. However, in recent years, data-based machine learning methods have been disrupting physical layer research. Promising results have been presented, in particular, in the domain of deep learning (DL)-based channel estimation. In this paper, we present a performance and complexity analysis of packet detection and CFO estimation using both the conventional and the DL-based approaches. The goal of the study is to investigate under which conditions the performance of the DL-based methods approaches or even surpasses that of the conventional methods, but also under which conditions their performance is inferior. Focusing on the emerging IEEE 802.11ah standard, our investigation uses both a standard-based simulated environment and a real-world testbed based on Software Defined Radios.


I. INTRODUCTION
WIRELESS communication systems based on orthogonal frequency division multiplexing (OFDM) dominate current wireless research and development. In order to ensure fairness, wireless systems operating in unlicensed bands share a common channel using the Listen Before Talk (LBT) methodology. A common approach in a majority of LBT systems is that the transmitters send preambles prepended to data packets in order to ensure that the receivers detect the signal and acquire initial synchronization. Preambles usually contain a sequence of symbols with good correlation properties, allowing the receiving end to identify packet start samples and establish initial timing and frequency offset synchronization. Conventional model-based signal processing methods at OFDM receivers are well understood and are currently used as a basis for the receiver design [1]-[9].
Conventional methods have recently been challenged by data-based approaches relying on deep learning (DL) [10]-[12]. DL-based methods have been evaluated across the physical layer (PHY), ranging across signal detection [13], channel estimation [14], [15] and error correction coding [16], demonstrating promising performance as compared to the conventional methods. Moreover, DL-based positioning services that exploit channel state information as fingerprints have been explored recently [17]. However, in most of the DL-based PHY studies, the outcomes of the signal detection procedures that precede channel estimation at the receiver, such as packet detection and carrier frequency offset (CFO) estimation, are assumed to be perfectly known. In addition, studies on DL-based PHY methods focusing specifically on preamble-based LBT OFDM systems are also missing, with an exception in the domain of channel estimation [18].
In this paper, we fill this gap by focusing on DL-based methods for packet detection and CFO estimation in IEEE 802.11 systems. In order to provide a detailed, standard-specific investigation, we consider the emerging IEEE 802.11ah standard for low-power Internet of Things (IoT) applications [20]. We use both a standard-based simulated environment and a real-world testbed based on Software Defined Radios (SDRs) to evaluate our results.
The paper is topically divided into two parts. In the first part of the paper, we focus on the packet detection problem and provide a detailed complexity vs. performance evaluation and comparison between the conventional and the DL-based packet detection. Our results demonstrate that DL methods based on one-dimensional Convolutional Neural Networks (1D-CNN) may outperform conventional methods under reduced computational effort, while being inferior in miss detection and false alarm rates.
In the second part of the paper, we investigate the DL-based CFO estimation methods and compare them to the conventional methods. Our results show that, for the CFO estimation at the IEEE 802.11ah receiver, long short-term memory (LSTM)-based recurrent neural networks (RNNs) are able to match the performance of the conventional methods, and even surpass them at low-to-medium signal-to-noise ratios (SNRs). However, despite their excellent accuracy, DL-based methods suffer from higher complexity as compared to the conventional methods.
Our goal in this paper is to discuss both the benefits and drawbacks of DL-based methods in the context of a specific wireless standard (IEEE 802.11ah) and provide a fair comparison with the conventional methods. In other words, the main message of the paper is not in advocating the usage of DL-based solutions, but in pointing out, in a given scenario, when it is advantageous to use such methods and when it is not.
©2021 IEEE. DOI: 10.1109/ACCESS.2021.3096853

A. Related Work and Paper Contributions
Using DL for PHY processing is a very active research area. However, most of the recent work is focused on channel estimation, assuming that signal detection and synchronization are ideal. Nevertheless, several recent papers address DL-based signal detection in several scenarios.
Authors in [21] address the problem of CFO in the uplink of the OFDM access (OFDMA) system, where DL is used to suboptimally estimate the CFOs corresponding to different users. DL-based CFO estimation for signals received after low-resolution analog-to-digital conversion in emerging mmWave multiple-input multiple-output (MIMO) systems is investigated in [22], demonstrating improved performance as compared to the conventional methods. For OFDM-based unmanned aerial vehicle communications, DL methods for CFO estimation are proposed in [23]. Our work on the CFO estimation part is influenced by [13], an early study on DL-based CFO estimation in single-carrier systems. Finally, a comprehensive overview of DL methods for the IEEE 802.11ax receiver design is presented in [24].
The contributions of this work are summarized as follows:
• We introduce a DL-based packet detection in preamble-based IEEE 802.11 systems and provide a systematic performance and complexity comparison with the conventional packet detectors. The initial results, presented in [19], are here expanded with additional numerical results and SDR-based real-world demonstrations;
• We present a systematic performance and complexity comparison of the DL-based and the conventional CFO estimation methods in preamble-based IEEE 802.11 systems;
• Our results are demonstrated using a standard-based IEEE 802.11ah simulated environment and verified in a real-world setup using SDRs;
• The study provides clear insights into the conditions under which the performance of the DL-based methods may approach or even surpass that of the conventional methods for packet detection and CFO estimation, but also the conditions under which their performance is inferior.
To summarize, compared to [19], this paper extends our work to a more challenging problem of CFO estimation, provides extensive simulation and SDR-based real-world performance results, and presents a detailed discussion on implementation complexity for both packet detection and CFO estimation.
The paper is organized as follows. In Sec. II, we present a system model and review the IEEE 802.11ah frame structure. Sec. III deals with the packet detection problem, where the conventional and the DL-based methods are first described, and then evaluated using numerical simulations and real-world SDR experiments. In a similar manner, Sec. IV describes and compares the conventional and the DL-based CFO estimation methods, including simulated and real-world SDR-based results. The paper is concluded in Sec. V.

II. BACKGROUND AND SYSTEM MODEL

A. OFDM Communication System Model
We consider a conventional OFDM system with N subcarriers separated by ∆f in the frequency domain. At the transmitter side, the binary information sequence is mapped onto the sequence of complex modulation symbols $\mathbf{X}$ allocated to different subcarriers and converted into the time-domain signal $\mathbf{x}$ via the Inverse Discrete Fourier Transform (IDFT) [1]. The resulting discrete-time complex baseband signal is obtained as:

$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{j 2 \pi k n / N}, \quad n = 0, \ldots, N-1, \tag{1}$$

where $X_k$ are the complex samples in the frequency domain.
A cyclic prefix (CP) of length greater than the expected channel delay spread is inserted in order to mitigate Inter-Symbol Interference (ISI) and preserve the orthogonality of the subcarriers [2]. After oversampling and filtering, the oversampled signal $\mathbf{x}_{os}$ propagates through the indoor multipath channel. Focusing on the discrete-time complex-baseband model, the channel is represented via an equivalent discrete-time impulse response $\mathbf{h}$. After the complex additive white Gaussian noise (AWGN) samples $\mathbf{w}$ are added, the discrete-time complex-baseband signal at the receiver side can be obtained as:

$$\mathbf{y}_{os} = \mathbf{x}_{os} \circledast \mathbf{h} + \mathbf{w}, \tag{2}$$

where $\circledast$ represents the circular convolution. The receiver side, which is in the focus of this paper, is illustrated in Fig. 1. After the signal passes through a reverse pulse shaping filter, it is downsampled, time and frequency offsets are corrected, and the cyclic prefix is removed. In order to demodulate the received signal, the DFT is performed, and the frequency-domain signal is written as:

$$Y_k = \sum_{n=0}^{N-1} y_n \, e^{-j 2 \pi k n / N}, \quad k = 0, \ldots, N-1, \tag{3}$$

where $y_n$ are the received time-domain samples after CP removal. Next, the signal correction using channel estimation techniques (usually based on inserted pilot symbols) is executed and the data is passed to the signal demapper block for demodulation and channel decoding. Lastly, the binary information data is recovered. Note that, besides the channel impairment and the noise, the received signal ($\mathbf{y}_{os}$) is affected by the time sampling offset ($\varepsilon = \tau_{off}/T$, where T represents the duration of one OFDM symbol) and the carrier frequency offset ($\Delta = f_{off}/\Delta f$), which need to be estimated and corrected. A carrier frequency offset (CFO) of $f_{off}$ causes a phase rotation of $2 \pi t f_{off}$. If uncorrected, this causes both a rotation of the constellation and a spread of the constellation points similar to the AWGN. A timing error will have little effect as long as all the taken samples are within the length of the cyclically-extended OFDM symbol [3].
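The transmit and receive chain described above can be illustrated with a short sketch. It is a minimal example under stated assumptions, not the simulator used in this paper: it assumes a 1/N-normalized IDFT, an ideal channel without noise, sampling at N·∆f, and a deterministic QPSK payload chosen only for illustration. The helper names (`idft`, `dft`, `add_cyclic_prefix`, `apply_cfo`, `demodulate`) are hypothetical. It shows that an uncorrected CFO of one tenth of the subcarrier spacing already distorts the demodulated symbols noticeably.

```python
import cmath
import math

def idft(X):
    """Time-domain samples from frequency-domain symbols (1/N normalization assumed)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def dft(y):
    """Frequency-domain samples from time-domain samples."""
    N = len(y)
    return [sum(y[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def add_cyclic_prefix(x, cp_len):
    """Prepend the last cp_len samples of the OFDM symbol."""
    return x[-cp_len:] + x

def apply_cfo(x, delta, n_fft):
    """Rotate sample n by 2*pi*delta*n/n_fft, where delta = f_off / subcarrier spacing."""
    return [s * cmath.exp(2j * math.pi * delta * n / n_fft) for n, s in enumerate(x)]

N_FFT, CP_LEN = 32, 8
# Deterministic QPSK symbols, one per subcarrier (illustrative payload only).
X = [complex(1 if k % 2 else -1, 1 if (k // 2) % 2 else -1) / math.sqrt(2)
     for k in range(N_FFT)]
tx = add_cyclic_prefix(idft(X), CP_LEN)

def demodulate(rx):
    """Drop the cyclic prefix and return to the frequency domain."""
    return dft(rx[CP_LEN:])

err_ideal = max(abs(a - b) for a, b in zip(demodulate(tx), X))
err_cfo = max(abs(a - b) for a, b in zip(demodulate(apply_cfo(tx, 0.1, N_FFT)), X))
print(f"max symbol error without CFO: {err_ideal:.2e}, with CFO: {err_cfo:.2e}")
```

Without CFO the symbols are recovered up to rounding error, while the rotated signal shows both a common phase rotation and inter-carrier leakage.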

B. IEEE 802.11ah Frame Structure
In this paper, we focus on listen-before-talk (LBT)-based IEEE 802.11 OFDM technologies, whose frame structure is shown in Fig. 2. In LBT systems, the sequence of data symbols is preceded by a preamble of known data needed to establish the initial synchronization and/or channel estimation [4]. The initial synchronization includes the frame detection (estimation of the initial time sample of the incoming frame) and frequency offset estimation. The preamble structure is usually based on a certain repeated pattern, representing sequences with good correlation properties that provide for good time and frequency synchronization [5].
For the purpose of detailed implementation and evaluation, in this paper, we restrict our attention to the IEEE 802.11ah (Wi-Fi HaLow) standard. The 802.11ah 1 MHz packet preamble is a pilot sequence with a fixed length of 14 OFDM symbols (for single-antenna transmission), where each OFDM symbol has N = 32 subcarriers with a subcarrier spacing of ∆f = 31.25 kHz. A normal cyclic prefix of 8 µs duration is applied, resulting in a 40 µs OFDM symbol [6]. Note that the composition of the preamble remains the same as in conventional 802.11 systems, further adapted to specific 802.11ah requirements [4], [6], [20]:
Signal Field (SIG) - The signal field, which is made of 6 OFDM symbols, contains packet information to configure the receiver: rate (modulation and coding), length (amount of data being transmitted in octets), etc.
Long Training Field 2 (LTF2) - The second long training field is used for MIMO channel estimation, and in our case, because only SISO transmission is applied, this part does not exist.
In this paper, we focus on the problem of initial synchronization, which depends only on the packet preamble. To reduce the complexity of both the simulations and real-world experiments, the 802.11ah Null Data Packet (NDP) [25] is used, containing only the preamble (without data field). The transmit waveform of the NDP packet is shown in Fig. 3.
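The timing figures quoted above follow directly from the subcarrier spacing and the cyclic prefix duration. The short calculation below is a sketch that assumes the receiver samples at N·∆f = 1 MHz; the variable names are illustrative.

```python
N_SUBCARRIERS = 32
SUBCARRIER_SPACING_HZ = 31_250
CP_DURATION_US = 8
PREAMBLE_SYMBOLS = 14

# Assumed sampling rate: one sample per subcarrier spacing interval, N * delta_f.
sample_rate_hz = N_SUBCARRIERS * SUBCARRIER_SPACING_HZ

useful_symbol_us = 1_000_000 / SUBCARRIER_SPACING_HZ       # useful part, 1 / delta_f
symbol_us = useful_symbol_us + CP_DURATION_US               # useful part plus CP
cp_samples = CP_DURATION_US * sample_rate_hz // 1_000_000   # CP length in samples
samples_per_symbol = N_SUBCARRIERS + cp_samples
preamble_us = PREAMBLE_SYMBOLS * symbol_us
preamble_samples = PREAMBLE_SYMBOLS * samples_per_symbol

print(sample_rate_hz, symbol_us, samples_per_symbol, preamble_us, preamble_samples)
```

Under this assumption, one OFDM symbol spans 40 samples and the 14-symbol NDP preamble spans 560 samples (560 µs).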

III. PREAMBLE-BASED PACKET DETECTION

A. Conventional Packet Detection Methods
Conventional algorithms for packet detection, which are widely used today, exploit the repetitive preamble structure through the complex correlation between two subsequently received training symbols. If we suppose that the number of complex samples in one training symbol is L, such complex correlation can be expressed as:

$$\Lambda_\tau = \sum_{m=0}^{L-1} y_{\tau+m}^{*} \, y_{\tau+m+L}. \tag{4}$$

In [3] and [8], the authors proposed a packet detection algorithm which relies on the assumption that the channel effects will be annulled if the conjugated sample from one training symbol is multiplied by the corresponding sample from the adjacent training symbol. Consequently, products of these sample pairs at the start of the frame will have approximately the same phase, thus the magnitude of their sum will be a large value. In order to reduce the complexity of the algorithm, they introduced a window of 2L samples which slides along the time τ as the receiver searches for the first training symbol, i.e., the packet start sample $\tau_S$. The timing metric used for the packet detection is:

$$M(\tau) = \frac{|\Lambda_\tau|^2}{(P_\tau)^2}, \tag{5}$$

where $P_\tau$ is the sum of the powers of L subsequent samples:

$$P_\tau = \sum_{m=0}^{L-1} |y_{\tau+m+L}|^2. \tag{6}$$

From the timing metric M(τ), one may find the initial packet sample by finding the sample that maximizes M(τ). In addition to finding the maximum sample-point, observing the points to the left and right in the time domain which are at 90% of the maximum, and averaging these two 90%-time samples, may result in a more accurate timing estimate. A threshold which triggers the above algorithm should be chosen in a way that the algorithm minimizes the probability of miss detection while controlling for the probability of false alarm.
Packet detection in IEEE 802.11 is usually separated into two steps: coarse and fine synchronization, where the main principles from conventional algorithms are reused and adapted to the specific system requirements. The coarse packet detection estimate, denoted as $\hat{\tau}_S$, may follow [3] (Eq. 5), setting L = 80 samples (one half of the STF duration): where $l_S$ is the STS sample-length and $L_S$ represents the sample-length of the STF field. After calculating $\hat{\tau}_S$, we can extract the whole preamble, since the peaks of the correlation between a single long training symbol and the entire preamble are used to derive a more accurate time estimate [4].
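The sliding-window timing metric can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the detector evaluated in this paper: it uses the standard form of the metric from the cited literature (window correlation normalized by the power of the second half-window), a toy preamble of two identical random-phase symbols of L = 16 samples, and a short deterministic tail (a negated half symbol) chosen only so that the toy signal has a single sharp peak. All names and signal parameters are hypothetical.

```python
import cmath
import math
import random

def timing_metric(y, L):
    """M(tau) = |Lambda_tau|^2 / P_tau^2 for every tau where a 2L-sample window fits."""
    metric = []
    for tau in range(len(y) - 2 * L + 1):
        corr = sum(y[tau + m].conjugate() * y[tau + m + L] for m in range(L))
        power = sum(abs(y[tau + m + L]) ** 2 for m in range(L))
        metric.append(abs(corr) ** 2 / power ** 2 if power > 0 else 0.0)
    return metric

rng = random.Random(0)
L, START, NOISE_STD = 16, 50, 0.01

# Toy training symbol: L unit-magnitude samples with random phases.
symbol = [cmath.exp(2j * math.pi * rng.random()) for _ in range(L)]
tail = [-s for s in symbol[: L // 2]]            # illustrative stand-in for what follows
clean = [0j] * START + symbol + symbol + tail    # silence, two repetitions, tail
y = [s + complex(rng.gauss(0, NOISE_STD), rng.gauss(0, NOISE_STD)) for s in clean]

metric = timing_metric(y, L)
tau_hat = max(range(len(metric)), key=metric.__getitem__)
print("estimated packet start sample:", tau_hat)
```

In this toy signal the metric peaks at the first sample of the first training symbol. With a preamble that repeats the training symbol many times, as the STF does, the metric forms a plateau instead, which is why a coarse step is usually followed by a fine one.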

B. Deep-Learning Based Packet Detection
The packet detection problem can be formulated as a regression problem, where a DNN needs to learn a mapping between the input signal and the output value representing the packet start instant, while distinguishing packets from the noise. We suppose that DNN-based packet detection operates over the consecutive fixed-length blocks $|\mathbf{y}|$ of the received signal amplitude samples: after the received signal is downsampled and filtered. Next, we describe the DNN architecture used for the packet detection task, as well as the training procedure.
1) Convolutional Neural Networks for Packet Detection: Motivated by the recent investigation in [13] and the initial results obtained in [19], we consider Wi-Fi packet detection using one-dimensional convolutional neural networks (1D-CNN).
CNNs are DL architectures that have achieved outstanding results in computer vision and image classification problems, due to their ability to extract features from local input patches through the application of relevant filters. CNNs can effectively learn hierarchical features to construct a final feature set of a high-level abstraction, which is then used to form more complex patterns within higher layers [26]. The same ideas can be applied to 1D sequences of data, where 1D-CNNs have proven effective in deriving features from fixed-length segments of the data set. This characteristic of the 1D-CNN, together with the fact that the 1D convolution layers are translation invariant, which means that a pattern learned at a certain position in the signal can later be recognized at a different position (e.g., the start instant of the packet), makes this architecture suitable for the packet detection task.
Two types of layers are applied in compact 1D-CNNs: i) the 1D-CNN layer, where 1D convolution occurs, and ii) the Fully Connected (FC) layer. Each hidden CNN layer performs a sequence of convolutions, whose sum is passed through the activation function [27]. The main advantage of 1D-CNNs is that they fuse the feature extraction and classification operations into a single process that can be optimized to maximize the network performance, because the CNN layers process the raw 1D data and extract the features used by the FC layers for prediction tasks (Fig. 4). As a consequence, the computational complexity is low and, compared to 2D-CNNs, 1D-CNNs can use larger filter and convolution window sizes, since the only expensive operation is a sequence of 1D convolutions.
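The two layer types can be illustrated with a plain forward pass. The sketch below is not the architecture of Table I: the single convolution layer, the four output channels, the absence of padding, the random weights, and the function names are all assumptions made for illustration. It shows how a block of 40 amplitude samples is turned into per-channel feature maps by filters of length 8 with stride 1, and how an FC layer maps the flattened features to one regression output.

```python
import random

def conv1d_relu(x, filters, stride=1):
    """x: [ch_in][n] samples; filters: [ch_out][ch_in][F]; returns [ch_out][K] features."""
    n_in, n = len(x), len(x[0])
    F = len(filters[0][0])
    K = (n - F) // stride + 1                  # output width without padding
    out = []
    for w in filters:                          # one filter bank per output channel
        row = []
        for k in range(K):
            start = k * stride
            acc = sum(w[c][j] * x[c][start + j] for c in range(n_in) for j in range(F))
            row.append(max(acc, 0.0))          # ReLU activation
        out.append(row)
    return out

def fully_connected(features, weights, bias):
    """Flatten the feature maps and apply a single linear output unit."""
    flat = [v for row in features for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias

rng = random.Random(0)
BLOCK_LEN, FILTER_LEN, N_CHANNELS = 40, 8, 4   # hypothetical layer sizes

block = [[abs(rng.gauss(0, 1)) for _ in range(BLOCK_LEN)]]   # one channel of amplitudes
filters = [[[rng.uniform(-0.1, 0.1) for _ in range(FILTER_LEN)]]
           for _ in range(N_CHANNELS)]
features = conv1d_relu(block, filters)
fc_weights = [rng.uniform(-0.1, 0.1) for _ in range(N_CHANNELS * len(features[0]))]
tau_pred = fully_connected(features, fc_weights, bias=0.0)
print(len(features), len(features[0]), tau_pred)
```

With untrained random weights the output value is meaningless; the point is only the data flow and the feature-map width K = 40 - 8 + 1 = 33.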
2) Training procedure: To train the DNN models, the mean-squared error (MSE) loss $L_{MSE}(\tau_S, \hat{\tau}_S) = \sum_i (\tau_{S_i} - \hat{\tau}_{S_i})^2$ is minimized, which achieves better performance as compared to the mean-absolute error (MAE) and Huber loss functions. The training set is separated into mini-batches of size 80, and 400 epochs are sufficient for the loss function convergence. In order to optimize the network parameters, stochastic gradient descent (SGD) with Adam at the learning rate α = 0.001, $\beta_1$ = 0.9 and $\beta_2$ = 0.999 is used [28]. The same 1D-CNN architecture (Table I) is used for all experiments. The filter size of the first convolution layer is chosen as a half of the STS sample-length (8 samples), and a stride of 1 sample is applied (Fig. 4). Note that we do not exploit the full flexibility of the 1D-CNN architecture, since we apply a fixed number of input channels as well as fixed-length filters. We apply such a fixed architecture to make the analysis of the proposed algorithm in terms of its performance and complexity easier. We note that further optimization of the number of input channels and the input filter lengths may further improve the performance vs. complexity trade-off.
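To make the optimizer settings concrete, the following sketch applies the Adam update rule with α = 0.001, β1 = 0.9 and β2 = 0.999 to a toy one-parameter regression problem. The toy model (a constant predictor), the target values, and the stabilizing constant ε = 1e-8 are assumptions made for illustration and are not taken from the experiments in this paper.

```python
import math

ALPHA, BETA1, BETA2 = 0.001, 0.9, 0.999
EPS = 1e-8   # assumed numerical-stability constant, not stated in the text

def squared_error_loss(predictions, targets):
    """Sum of squared errors between predictions and targets."""
    return sum((t - p) ** 2 for p, t in zip(predictions, targets))

def adam_step(theta, grad, m, v, t):
    """One Adam update of a scalar parameter; t is the 1-based step index."""
    m = BETA1 * m + (1 - BETA1) * grad
    v = BETA2 * v + (1 - BETA2) * grad ** 2
    m_hat = m / (1 - BETA1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - BETA2 ** t)          # bias-corrected second moment
    theta -= ALPHA * m_hat / (math.sqrt(v_hat) + EPS)
    return theta, m, v

targets = [3.0, 5.0]                      # toy labels
theta, m, v = 0.0, 0.0, 0.0               # the toy model predicts theta for every input
losses, thetas = [], []
for t in range(1, 101):
    predictions = [theta] * len(targets)
    losses.append(squared_error_loss(predictions, targets))
    grad = sum(2 * (p - y) for p, y in zip(predictions, targets))
    theta, m, v = adam_step(theta, grad, m, v, t)
    thetas.append(theta)
print(f"loss at step 1: {losses[0]:.2f}, loss at step 100: {losses[-1]:.2f}")
```

Because Adam normalizes the gradient by its running magnitude, the first update moves the parameter by approximately α regardless of the gradient scale.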

C. Data Set Generation
1) Simulated environment: The data set consists of $(|\mathbf{y}|, \tau_S)$ pairs, where $\tau_S$ indicates a packet start sample inside the block. Within the data set, we included about 50% of blocks that do not contain a packet start instance, tagged with the value $\tau_S = -1$. Among such blocks, roughly half contain only noise samples, while the other half contain intermediate or tail parts of NDP packets. For data set blocks containing a packet start instant $\tau_S$, its value is set uniformly at random among the input block samples. Data sets are created for input blocks $|\mathbf{y}|$ of lengths 40, 80, 160, 320, 800 and 1600 samples, where the number of received blocks in each data set is 50000. Note that the longer the input block, the higher the complexity of the first layer; however, the number of blocks to be processed per unit time decreases. A careful complexity analysis is presented in Sec. III-D. From the data set, 70% of the records are used for training, 15% for validation and 15% for testing.
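The labeling rule and the data set split described above can be sketched as follows. This is an illustration only: the 0-based sample indexing, the helper names, and the use of Python's `random` module are assumptions, and the amplitude blocks themselves are not generated here.

```python
import random

def make_labels(n_blocks, block_len, rng, p_no_start=0.5):
    """Label each block with its packet start index, or -1 if it contains no packet start."""
    labels = []
    for _ in range(n_blocks):
        if rng.random() < p_no_start:
            labels.append(-1)                         # noise-only or mid/tail-of-packet block
        else:
            labels.append(rng.randrange(block_len))   # uniform start position (0-based, assumed)
    return labels

def split_counts(n_records, train=0.70, val=0.15):
    """Number of records in the training, validation and test subsets."""
    n_train = round(n_records * train)
    n_val = round(n_records * val)
    return n_train, n_val, n_records - n_train - n_val

rng = random.Random(0)
labels = make_labels(50_000, block_len=40, rng=rng)
share_without_start = labels.count(-1) / len(labels)
print(split_counts(len(labels)), round(share_without_start, 3))
```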
Regardless of the input block size, all packets are simulated under the same conditions using the standard-compliant IEEE 802.11ah physical layer simulator. In order to examine estimator robustness to varying signal-to-noise ratio (SNR), SNR values are selected uniformly at random from the range [1 dB, 25 dB]. During the simulations, the indoor multipath fading channel model B [29] is applied.
2) Real-World Environment: In order to evaluate the proposed method in a real-world environment, we collect data sets using a Software-Defined Radio (SDR) implementation. We deploy our real-world setup in an indoor environment, placing the transmitter along a sequence of predefined grid points, while the receiver is stationary, as shown in Fig. 5. Note that 12 out of 20 transmitter positions are in the same room as the receiver, while the remaining 8 are in the neighboring room, thus providing us with a data set with a wider range of received SNRs.
Both the transmitter and the receiver include a standard-compliant MATLAB-based 802.11ah PHY implementation and USRP B210 SDR platforms, as shown in Fig. 6. From each point, the transmitter sends 1000 NDP packets in the 1 MHz mode, with the measured SNR in the range [−6 dB, 31 dB]. At the receiver side, the complex baseband data samples obtained after filtering and downsampling are collected and separated into input blocks $|\mathbf{y}|$ of lengths 40, 80, 160, 320, 800 and 1600 samples. Roughly 50% of blocks that do not contain a packet start instance are included, resulting in a data set that consists of 40,000 $(|\mathbf{y}|, \tau_S)$ pairs (70% for training, 15% for validation and 15% for testing). Other system assumptions and parameters are the same as in the simulated environment.

D. Numerical Results
In this subsection, we discuss the packet detection performance of both CNN-based and conventional methods in terms of the mean absolute error (MAE) under different SNRs in the simulated environment. In the real-world environment, we present the MAE averaged across the entire SNR range. Also, miss detection and false alarm rates are investigated and taken into account. Furthermore, we investigate the computational complexity of the proposed CNN-based algorithm for packet detection for different input block lengths, and compare it to the conventional method in terms of the approximate number of floating point operations per second (FLOPS).
The complexity of the DL-based algorithms considered in this paper is evaluated for the inference phase only. In other words, we assume that the training process is done offline. Note that the offline training can be made more efficient by first pretraining the model on a realistic system simulator, and then extending the training with an additional, usually smaller, set of training samples collected from a real-world environment [14]. This process can be further improved by techniques of deep transfer learning, which can speed up the model design, as suggested in [30]. Also, the authors in [31] propose effective combining of the trained models using the concept of federated learning in order to arrive at more robust and efficient models.
1) Simulated Environment: In the simulated experiments, the number of 1D-CNN input channels is set to 4 for all input block lengths. Fig. 7 presents the MAE packet detection performance of the 1D-CNN architectures as a function of the received SNR evaluated over the test set. The figure also includes the results obtained by using the conventional method after both coarse and fine packet start sample estimation is applied. We note that the 1D-CNN approach demonstrates better robustness to the variations of SNR as compared to the conventional method, which deteriorates at lower SNRs. In addition, as the input block lengths decrease, the 1D-CNN packet detector outperforms the conventional method. Although this can be attributed to the fact that the estimated packet start sample value $\hat{\tau}_S$ is bounded by the input block size (thus the estimation error naturally reduces by decreasing the input block length), we still note that the 1D-CNN processing input blocks as large as 320 samples performs comparably with the conventional detector that slides across input blocks of 80 samples (Sec. III-A), while outperforming the conventional detector for SNRs below 7 dB. Finally, it is interesting to compare the performance of different algorithms at an SNR of 10 dB, since the authors in [32] emphasize this SNR value as critical for different IEEE 802.11ah use cases. From Fig. 7 we note that the conventional algorithm has comparable performance with the CNN-based algorithm for an input block of 320 samples, while for the smaller input blocks, the CNN-based algorithm outperforms the conventional one.
For the same setup, Fig. 8 presents the miss detection and false alarm rates for different input block sizes. The results are expressed as a percentage of missed or falsely detected packets averaged across the entire test set (i.e., across all SNRs). For comparison, for the same testing conditions, the conventional method exhibits superb performance, with a miss detection rate equal to 0.0012% and a false alarm rate equal to 0.0016%. For 1D-CNN-based packet detectors, although the results vary across the range of input block lengths, showing particularly high false alarm rates for small input block sizes, the performance gradually improves for larger input block lengths, achieving sub-0.1% miss detection and false alarm rates.
2) Real-World Environment: Next, we explore the performance of the 1D-CNN-based packet detector in the real-world environment. The number of 1D-CNN input channels is kept at 4 for all input block lengths. Note that, in the simulated environment, the test data set contains approximately the same number of packets at each SNR value, thus we present the MAE performance as a function of the received SNR (Fig. 7). However, in a real-world environment, we do not have such control over received SNRs, and our data set is highly irregular in terms of recorded received SNR values. For this reason, the average MAE across the whole range of SNR values is presented for each input block size, along with the performance of the conventional algorithm included as a benchmark.
Fig. 9 shows that the proposed CNN-based algorithm outperforms the conventional method in terms of the averaged MAE. Moreover, such performance is achieved for input block lengths up to 800 samples, while for the input block length of 1600 samples, the performance of the two methods is similar. Fig. 10 shows the probability of miss detection for different input block sizes averaged across all SNRs. The obtained rates are promising as, even in the worst-case input block size of 40 samples, they are below 0.5%. For larger block lengths, the miss detection rates drop significantly, reaching as low as 0.01% for an input block size of 1600 samples. The conventional method performs better still, achieving a miss detection rate of 0.0026%.
In terms of the false alarm rate, the proposed CNN-based algorithm for packet detection shows deteriorated performance. From Fig. 11, one can note that for small input block sizes, the false alarm rate can be as high as 5%, while with the increase in the input block length, the false alarm performance improves. The best achieved false alarm rate for the CNN-based estimator of 0.015% (block length of 1600 samples) still falls short of the conventional algorithm, whose false alarm rate is 0.0027%. Finally, we note that the performance trends observed in the simulated environment are preserved in the real-world environment.
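The three reported quantities can be computed from the true and predicted packet start values as sketched below. Several details are assumptions made for illustration and are not taken from the evaluation in this paper: a block is declared to contain a packet when the rounded regression output is non-negative, the MAE is averaged over correctly detected packets only, the miss rate is normalized by the number of blocks with a packet start, and the false alarm rate by the number of blocks without one.

```python
def detection_metrics(true_starts, predicted_starts):
    """Return (MAE over detected packets, miss detection rate, false alarm rate)."""
    misses = false_alarms = n_packet = n_empty = 0
    abs_errors = []
    for truth, pred in zip(true_starts, predicted_starts):
        detected = round(pred) >= 0            # assumed decision rule
        if truth >= 0:                         # block contains a packet start
            n_packet += 1
            if detected:
                abs_errors.append(abs(pred - truth))
            else:
                misses += 1
        else:                                  # block labeled with tau_S = -1
            n_empty += 1
            if detected:
                false_alarms += 1
    mae = sum(abs_errors) / len(abs_errors) if abs_errors else float("nan")
    miss_rate = misses / n_packet if n_packet else 0.0
    false_alarm_rate = false_alarms / n_empty if n_empty else 0.0
    return mae, miss_rate, false_alarm_rate

# Toy example: five blocks, two of them without a packet start.
truth = [12, -1, 30, -1, 5]
pred = [13.2, -0.9, -1.1, 4.0, 5.0]
mae, miss_rate, false_alarm_rate = detection_metrics(truth, pred)
print(mae, miss_rate, false_alarm_rate)
```

In the toy example, one of three packets is missed, one of two empty blocks raises a false alarm, and the two detected packets have errors of 1.2 and 0 samples.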
3) Computational Complexity Analysis: Assuming the sampling rate of 1 MHz for the IEEE 802.11ah scenario used in our experiments, Fig. 12 shows the approximate number of FLOPS of the 1D-CNN architecture as a function of the input block length, with the conventional method included for reference. The complexity of each layer of the 1D-CNN may be computed by calculating the number of additions and multiplications within each layer. The total number of FLOPS for a CNN depends on the input block size; note, however, that although larger input blocks lead to a more complex network, they also reduce the number of blocks processed per second. According to [13], the complexity of a single convolution layer depends on the filter length F, the number of input ($ch_i$) and output ($ch_o$) channels, and the output width K, while the complexity of an FC layer is determined by the input ($N_i$) and the output ($N_o$) size. Mathematical expressions used for calculating an approximate number of FLOPS (multiplications and additions) in a single layer are given in Table II [13].
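The trade-off between per-block cost and block rate can be illustrated with a simple operation count. The expressions below are common approximations assumed here for illustration (one multiplication and one addition per multiply-accumulate), not a reproduction of Table II, and the network is a hypothetical one with a single convolution layer (1 input channel, 4 output channels, F = 8, stride 1, no padding) followed by a single-output FC layer.

```python
def conv1d_flops(filter_len, ch_in, ch_out, out_width):
    """Approximate multiplications plus additions of one 1D convolution layer (assumed form)."""
    return 2 * filter_len * ch_in * ch_out * out_width

def fc_flops(n_in, n_out):
    """Approximate multiplications plus additions of one fully connected layer (assumed form)."""
    return 2 * n_in * n_out

def flops_per_second(block_len, sample_rate_hz=1_000_000, filter_len=8, channels=4):
    """Operations per second when consecutive, non-overlapping blocks are processed."""
    out_width = block_len - filter_len + 1                  # stride 1, no padding
    per_block = (conv1d_flops(filter_len, 1, channels, out_width)
                 + fc_flops(channels * out_width, 1))
    blocks_per_second = sample_rate_hz / block_len
    return per_block * blocks_per_second

for block_len in (40, 80, 160, 320, 800, 1600):
    print(block_len, round(flops_per_second(block_len) / 1e6, 1), "MFLOPS")
```

For a block of 40 samples this hypothetical network needs 2376 operations per block at 25,000 blocks per second, i.e., about 59.4 MFLOPS.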
Regarding the conventional method, it consists of two parts: coarse and fine estimation. During the coarse estimation, it
For smaller input block lengths, Fig. 12 shows that the complexity of 1D-CNNs is lower than or comparable with that of the conventional algorithm. Taking the overall results into account, the 1D-CNN offers a relatively wide operational range for balancing between MAE, computational complexity in MFLOPS, and miss detection and false alarm rates. We summarize our findings as follows: 1D-CNNs are able to outperform conventional methods under reduced computational effort, while being inferior in miss detection and false alarm rates.

IV. PREAMBLE-BASED CFO ESTIMATION
In the second part of the paper, we consider implementation of deep-learning based CFO estimation in IEEE 802.11ah and compare its performance with the conventional method.

A. Conventional CFO Estimation
A common approach to CFO estimation uses the fact that the samples of two consecutive identical short training symbols differ by a phase shift proportional to the CFO $f_{off}$: where $T_s$ represents the sample period [7]. The maximum likelihood CFO estimate uses the phase of the complex correlation $\Lambda_\tau$ (Eq. 4) between the repeated training symbols, denoted as $\phi = \angle(\Lambda_\tau)$, to estimate the CFO [3], [8]: where $f_s = \frac{N}{T_s}$ is the sample frequency. In the IEEE 802.11ah scenario, the CFO estimation can be separated into two steps. The coarse CFO, denoted as $f^{(1)}_{off}$, is estimated using the auto-correlation of two adjacent STSs within the STF, taken at the estimated packet start sample time $\tau_S$ [4]: where P is equal to or is a multiple of $l_S$. Using (10) and (11), and $\phi^{(1)} = \angle(\Lambda_{\tau_S})$, we get: After correcting $f^{(1)}_{off}$ over the signal $\mathbf{y}$, the coarse CFO-compensated signal $\hat{\mathbf{y}}$ is obtained. Using the LTF field of $\hat{\mathbf{y}}$, the fine CFO estimate $f^{(2)}_{off}$ can be expressed as [4]: where $\tau_L = \tau_S + L_S$ is the initial LTF sample, $L_L$ is the sample-length of the LTF field, and $l_L$ is the sample-length of a long training symbol. Using $\phi^{(2)} = \angle(\Lambda_{\tau_L})$, the fine CFO is estimated as: Finally, the CFO of the received signal is estimated as the sum of the coarse and fine CFOs:

$$f_{off} = f^{(1)}_{off} + f^{(2)}_{off}. \tag{15}$$
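The correlation-based principle can be sketched with the standard textbook form of the estimator, in which the phase of the correlation between samples spaced P samples apart is scaled by the sampling rate divided by 2πP. The sketch below assumes a noiseless signal, a sampling rate of 1 MHz, and a placeholder 16-sample training symbol that is not the standard-defined STS; the function names and the notation for the sampling rate are illustrative and may differ from the notation used in this section.

```python
import cmath
import math

def estimate_cfo(y, start, period, n_pairs, sample_rate_hz):
    """Estimate the CFO from the phase of the correlation between samples `period` apart."""
    corr = sum(y[start + i].conjugate() * y[start + i + period] for i in range(n_pairs))
    return cmath.phase(corr) * sample_rate_hz / (2 * math.pi * period)

def apply_cfo(x, f_off_hz, sample_rate_hz):
    """Rotate sample n by 2*pi*f_off*n / sample_rate."""
    return [s * cmath.exp(2j * math.pi * f_off_hz * n / sample_rate_hz)
            for n, s in enumerate(x)]

SAMPLE_RATE_HZ = 1_000_000
STS_LEN, REPETITIONS = 16, 10

# Placeholder training symbol (chirp-like unit phasors), repeated ten times.
sts = [cmath.exp(2j * math.pi * k * k / STS_LEN) for k in range(STS_LEN)]
stf = sts * REPETITIONS

true_cfo_hz = 5_000.0
rx = apply_cfo(stf, true_cfo_hz, SAMPLE_RATE_HZ)
f_hat = estimate_cfo(rx, 0, STS_LEN, len(stf) - STS_LEN, SAMPLE_RATE_HZ)
max_unambiguous_hz = SAMPLE_RATE_HZ / (2 * STS_LEN)
print(f"estimated CFO: {f_hat:.3f} Hz, unambiguous range: +/- {max_unambiguous_hz} Hz")
```

A shorter repetition period widens the unambiguous range, while a longer period (as with long training symbols) improves resolution, which is the usual motivation for a coarse step followed by a fine step on the compensated signal.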

B. Deep-Learning Based CFO Estimation
In this paper, we test the ability of selected DNN architectures to estimate the CFO from the phase of received STF samples: In other words, a DNN architecture learns the mapping between the received $\angle(\mathbf{y}_{STF})$ and $f_{off}$. Note that we test the DNN-based CFO estimation only on the STF field, unlike the conventional methods that use both STF and LTF fields. Finally, we note that in both simulation and real-world experiments in this paper, the CFO estimation is applied sequentially after the conventional packet detection is applied. Thus the effects of imperfect packet detection are included in the CFO estimation results in Sec. IV-D. Next, we detail the DNN architectures considered for CFO estimation, and describe the data set and training procedure.
1) Fully Connected Feed-Forward Neural Networks: This neural network architecture consists of an input layer, an output layer and a set of hidden layers, and is a simple and well-understood DNN model. The relation between the input x and the output y is a layer-wise composition of computational units: (16) where $\Theta$ denotes the set of network parameters: weights $W_i$ and biases $b_i$; $f_i(x) = W_i x + b_i$ and $g_i(\cdot)$ are the linear pre-activation and the activation function of the $i$-th hidden layer, respectively; $f_o(\cdot)$ represents the linear function of the output layer; and M is the number of layers. Among the non-linear activation functions, we focus on rectified linear units (ReLU), as ReLU DNNs are known universal piece-wise linear function approximators for a large class of functions [33].
2) Recurrent Neural Networks: RNNs represent sequence-based models able to establish temporal correlations between the previous and the current circumstances. As such, RNNs represent a suitable solution for the CFO estimation problem, given that the estimated CFO values between the samples of the subsequent symbols in the past have influence on the current CFO estimate.
A simple example of a single-layer RNN is given in Fig. 13, where the output of the previous time step $t-1$ becomes a part of the input of the current time step $t$, thus capturing past information. The computation performed by one RNN cell can be expressed as the following function [34]:

$$h_t = \tanh(W_{ih} x_t + b_{ih} + W_{hh} h_{t-1} + b_{hh}),$$

where $\tanh$ represents the hyperbolic tangent function, $h_t$ and $h_{t-1}$ are the hidden states at time steps $t$ and $t-1$, respectively, $W_{ih}$, $W_{hh}$ and $b_{ih}$, $b_{hh}$ are the weights and the biases that need to be learned, and the input at time $t$ is denoted as $x_t$. Basic RNN cells fail to learn long-range dependencies due to vanishing or exploding gradients. To solve this, Long Short-Term Memory (LSTM) [35] cells were put forward. They contain special units called memory blocks in the recurrent hidden layer, which enhance its capability to model long-term dependencies. Such a block is a recurrently connected subnet that contains functional modules called memory cells and gates. The former remember the temporal state of the network, while the latter control the information flow from the previous cell state.
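The recurrence of the basic RNN cell can be sketched directly from its definition. The code below is a minimal illustration with hypothetical helper names; it also shows how a 160-sample STF phase vector would be split into 10 time steps of 16 samples (one STS per step), which is the input arrangement the paper describes for the recurrent models.

```python
import math


def rnn_cell(x_t, h_prev, w_ih, w_hh, b_ih, b_hh):
    """h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)."""
    h_t = []
    for u in range(len(h_prev)):
        pre = b_ih[u] + b_hh[u]
        pre += sum(w * v for w, v in zip(w_ih[u], x_t))
        pre += sum(w * v for w, v in zip(w_hh[u], h_prev))
        h_t.append(math.tanh(pre))
    return h_t


def run_rnn(sequence, units, w_ih, w_hh, b_ih, b_hh):
    """Feed the time steps in order, starting from a zero hidden state."""
    h = [0.0] * units
    for x_t in sequence:
        h = rnn_cell(x_t, h, w_ih, w_hh, b_ih, b_hh)
    return h


def split_into_steps(phases, step_len=16):
    """Split a phase vector into consecutive steps of `step_len` samples."""
    return [phases[i:i + step_len] for i in range(0, len(phases), step_len)]
```

LSTM and GRU cells replace `rnn_cell` with gated updates but keep the same outer loop over time steps.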
Besides standard LSTM cells, we also consider the Gated Recurrent Unit (GRU) [36]. The main ideas of the LSTM are preserved, but the GRU uses only two gates, an update gate and a reset gate, to control the information flow. GRUs perform similarly to LSTMs, but with reduced execution time [37].
3) Training Procedure: To train the DNN models, we minimize the MSE loss $L_{MSE}(f_{off}, \hat{f}_{off}) = \sum_i (f_{off,i} - \hat{f}_{off,i})^2$. The training set is divided into mini-batches of size 100, and 500 epochs are sufficient for the loss function to converge. Network parameters are optimized in the same way as in Sec. III-B, i.e., by using SGD with Adam at the learning rate $\alpha = 0.001$, with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ [28].
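As a rough illustration of the training objective and the optimizer settings quoted above, the following sketch implements the squared-error loss as written and a single Adam update on a flat list of parameters. The function names are hypothetical, and the small constant `eps = 1e-8` is an assumption (the text does not state it); a real training loop would obtain the gradients from automatic differentiation.

```python
import math


def mse_loss(f_true, f_pred):
    """Sum of squared errors over the mini-batch, as in the loss above."""
    return sum((t - p) ** 2 for t, p in zip(f_true, f_pred))


def adam_step(params, grads, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns new params and moment estimates (t starts at 1)."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)             # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - alpha * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly $\alpha$ regardless of the gradient scale.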
The parameters of the proposed ReLU DNN and RNN architectures are described in Table III and Table IV, respectively.

C. Data Set Generation
1) Simulated Environment: Using the simulated environment, we generate a data set of pairs $(\angle(\mathbf{y}_{STF}), f_{off})$, where $f_{off}$ represents the CFO introduced during transmission. After downsampling and filtering, $\mathbf{y}_{STF}$ consists of 160 samples (10 repetitions of the 16-sample STS). We simulated the transmission of 50,000 NDP packets and extracted the STF phase vectors, while the corresponding true CFO values are generated within the simulation uniformly at random from $[-\Delta f/2, \Delta f/2] = [-15.625$ kHz, $15.625$ kHz$]$. From the data set, 70% of the records are used for training, 15% for validation and 15% for testing. In order to examine estimator robustness, NDP packets are received with different SNRs ranging between 1 dB and 25 dB. Depending on the simulated channel model, two data sets are created: i) AWGN channel, and ii) indoor multipath fading channel (model B) [29].
2) Real World Environment: The setup used for data set generation in the real-world environment is the same as in Sec. III-C. From each grid point, the transmitter sends 1000 NDP packets of 1 MHz bandwidth, with the measured SNR in the range [−6 dB, 31 dB]. At the receiver side, after the packet detection, the STF phase vectors $\angle(\mathbf{y}_{STF})$ are extracted. The collected data set consists of 20,000 $(\angle(\mathbf{y}_{STF}), \hat{f}_{off})$ pairs (70% for training, 15% for validation, and 15% for testing), where as the label $\hat{f}_{off}$ we use the CFO estimated using the conventional algorithm. This is because, in real-world conditions, we do not have a priori knowledge of the CFO introduced during the transmission. Thus, in this case, we train the DL-based CFO estimator to replicate the performance of the conventional method. Note also that, in contrast to the simulated environment, where the CFO values are generated uniformly at random from a given interval, in the real-world experiments the estimated CFO values between two SDR devices are nearly stationary.
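The label generation and the 70%/15%/15% partition used for both data sets can be sketched as follows. This is an illustrative sketch only: the function names and the fixed seed are hypothetical, and shuffling before the split is an assumption, since the text does not say how records are assigned to the three subsets.

```python
import random


def random_cfo_labels(count, delta_f=31250.0, seed=0):
    """Draw CFO labels uniformly from [-delta_f/2, delta_f/2] in Hz."""
    rng = random.Random(seed)
    return [rng.uniform(-delta_f / 2.0, delta_f / 2.0) for _ in range(count)]


def split_dataset(records, train=0.70, val=0.15, seed=0):
    """Shuffle (assumed) and split into train/validation/test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train * len(shuffled))
    n_val = round(val * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

Applied to the 50,000 simulated records this yields 35,000/7,500/7,500 samples, and applied to the 20,000 real-world records it yields 14,000/3,000/3,000.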

D. Numerical Results
In this subsection, the performance of the DL-based method is compared with the conventional one in both simulated and real-world environments. In addition, we compare the two methods in terms of computational complexity, evaluated using the approximate number of FLOPs per packet. As described in Sec. III-D, training for CFO estimation is again done offline, so the complexity analysis for DL-based algorithms is conducted only for the inference phase.

DNN-based methods use only STF samples as input, while conventional methods use both STF and LTF samples through the two-step coarse and fine CFO estimation. We note that certain DNN approaches are more robust to varying SNR values than the conventional algorithm, which, however, outperforms all DNN architectures at higher SNRs (above 8 dB). We also note that the more challenging indoor fading channel (model B) increases the MAE of all methods by approximately 15 Hz. As for the packet detection task, we observe that, at an SNR of 10 dB, the conventional algorithm slightly outperforms the RNN-based method for both channel models.

We identify the existence of outliers as the main reason why the RNN is not able to follow the MAE performance of the conventional method at high SNRs. Indeed, taking a closer look at Fig. 16, the majority of test samples are predicted with high accuracy, except for a few that deviate and strongly affect the MAE. To address this problem, we pursue two approaches: i) we extend the data set with an additional 20,000 samples, and ii) we increase the complexity of the RNN architecture (using a single GRU layer with 50 units followed by two ReLU FC layers with 30 and 20 neurons, respectively, and a single-neuron output layer). Our preliminary results demonstrate a slight improvement only with the second approach, however, at a high complexity cost (complexity is discussed in Sec. IV-D3).
The problem of outliers can be addressed by designing additional outlier detection methods.For example, one can include unsupervised methods such as deep autoencoders for outlier detection [38].We are currently investigating such methods, however, we note they will additionally contribute to the complexity of the proposed RNN-based method.
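To see why a handful of outliers can dominate the MAE, consider the toy calculation below. The error values are hypothetical numbers invented purely for illustration and are not taken from the experiments; the function names are likewise assumptions.

```python
import statistics


def mean_absolute_error(f_true, f_pred):
    """Average absolute deviation between labels and predictions."""
    return sum(abs(t - p) for t, p in zip(f_true, f_pred)) / len(f_true)


def median_absolute_error(f_true, f_pred):
    """Median absolute deviation, which is insensitive to a few outliers."""
    return statistics.median(abs(t - p) for t, p in zip(f_true, f_pred))


# Hypothetical example: 98 accurate predictions and 2 large misses (in Hz).
f_true = [0.0] * 100
f_pred = [10.0] * 98 + [5000.0] * 2
```

In this made-up case, 98% of the predictions are off by only 10 Hz, yet the two large misses contribute most of the mean error, while the median stays at 10 Hz. This is the kind of gap between typical and average error that an outlier detection stage would aim to close.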
2) CFO Estimation Performance in the Real World Environment: We explore the ability of the proposed algorithm to replicate the results obtained using the conventional algorithm.Based on the MAE obtained in a simulated environment, we use RNNs with LSTM as a DNN-based method.Fig. 17 shows that, except for a few outliers, the proposed RNN-based estimator is able to replicate the performance achievable with the conventional one.
3) Computational Complexity Analysis: For the proposed RNN-based algorithm, the approximate number of FLOPs for processing a single packet is presented and compared with the complexity of the conventional algorithm in Table VI. We count FLOPs per packet because CFO estimation occurs only upon a packet detection event. To calculate the number of FLOPs for the conventional algorithm, we take into account the number of multiplications and additions per packet for both coarse and fine CFO estimation. This number is favourable since, given the STF and LTF fields, the only task is to calculate the phase of the complex correlation, as described in Sec. IV-A. On the other hand, the number of FLOPs for DNN-based algorithms is calculated as the number of multiplications and additions within each network layer. Since the ReLU DNN comprises only FC layers, the mathematical expressions for calculating its approximate number of FLOPs are those given in Table II. For the recurrent layer, Table V [39] shows that the number of multiplications and additions in one recurrent cell depends on the number of recurrent units in a layer ($U$) and on the number of features in one time step ($N_F$; in our case we have 10 time steps, each with 16 features). Besides the recurrent layer, the proposed RNN also has a single FC layer whose complexity needs to be taken into account (Table II) in order to obtain the total number of FLOPs.
In Table V we provide the expressions used to evaluate the computational complexity of a simple recurrent cell.In addition, LSTM or GRU units introduce additional memory cells and gates, having higher complexity than a simple recurrent cell.For example, the total number of FLOPs for a single LSTM cell is approximately 4 times higher than for a simple recurrent cell, while for a GRU cell, it is approximately 3 times higher than for a simple RNN cell.Finally, Table VI shows that, despite their excellent accuracy in terms of MAE, DNN-based methods suffer from high complexity in terms of the number of FLOPs per packet.
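A back-of-the-envelope version of this accounting can be sketched as below. The per-layer counts are an assumption (one multiplication and one addition per weight, with biases and activation functions ignored) and are not the exact expressions of Tables II and V; only the approximate 3x and 4x factors for GRU and LSTM cells come from the text. All names are hypothetical.

```python
def simple_rnn_flops(units, features, steps):
    """Rough FLOPs of a simple recurrent layer: one multiply and one add
    per weight, for W_ih (units x features) and W_hh (units x units)."""
    return 2 * units * (features + units) * steps


def fc_flops(n_in, n_out):
    """Rough FLOPs of a fully connected layer (multiply-add per weight)."""
    return 2 * n_in * n_out


def recurrent_model_flops(units, fc_sizes, cell="gru", features=16, steps=10):
    # Gate multipliers follow the approximate factors quoted in the text.
    gate_factor = {"rnn": 1, "gru": 3, "lstm": 4}[cell]
    total = gate_factor * simple_rnn_flops(units, features, steps)
    n_in = units
    for n_out in fc_sizes:
        total += fc_flops(n_in, n_out)
        n_in = n_out
    return total
```

Under these assumptions the recurrent layer dominates the total, because its cost is multiplied by the number of time steps and by the gate factor, while the FC layers are evaluated only once per packet.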
The complexity of the RNN architecture is the main reason why, instead of a single architecture, we used different neural network architectures for the packet detection and CFO estimation tasks. As our preliminary results show, when an RNN is applied to the packet detection task in the real-world environment (using the same parameters described in Table IV), the MAE performance improves slightly compared to the 1D-CNN, i.e., it becomes comparable to that of the conventional algorithm, but at the price of a significant increase in computational complexity.

V. CONCLUSION
We performed an in-depth performance and complexity study of DL-based packet detection and CFO estimation in preamble-based IEEE 802.11 systems. For both packet detection and CFO estimation, we clearly presented the conditions under which the performance of the DL-based methods approaches or even surpasses that of the conventional methods, as well as the conditions under which their performance is inferior.
For the case of packet detection, 1D-CNNs are identified as the best-performing architecture, able to achieve excellent accuracy that matches or even surpasses that of the conventional method (at low-to-medium SNRs) at favourable computational complexity. In contrast, the conventional method is always superior in terms of the false alarm and miss detection rates. For the case of CFO estimation, RNNs are identified as the best-performing architecture, able to match the accuracy of the conventional method (at low-to-medium SNRs); however, their computational complexity is always higher than that of the conventional methods. Our findings are supported by numerical simulation results and by a real-world testbed using SDRs. According to our preliminary results for both packet detection and CFO estimation tasks, the proposed methods could be extended to other preamble-based IEEE 802.11 standards operating in the 2.4/5 GHz bands.
Finally, in our future work, we plan to extend our investigation to multiple-input multiple-output (MIMO) modes of operation of the IEEE 802.11ah standard, to investigate the effects of imperfect DL-based packet detection and CFO estimation on DL-based channel estimation, and to implement the proposed methods in field-programmable gate array (FPGA) hardware in order to estimate realistic latency and resource requirements.
Short Training Field (STF) - The short training field, which lasts 160 µs, consists of 4 OFDM symbols in the frequency domain which, after IDFT, represent 10 repetitions of the same short training symbol (16 µs each) in the time domain. The short training symbol is a sequence with good correlation properties and a low peak-to-average power ratio, whose features are preserved even after clipping or compression by an overloaded analog front end. Because of that, the short training field is suitable for coarse timing synchronization (packet detection) and coarse frequency offset estimation.

Long Training Field 1 (LTF1) - The first long training field also contains 4 OFDM symbols of 160 µs duration. Two repetitions of the same long training symbol enable fine timing synchronization, fine frequency offset estimation and channel estimation.

Fig. 7. MAE performance of 1D-CNN vs. conventional packet detection for different received SNRs.

Fig. 8. 1D-CNN packet detection miss detection and false alarm rates for different input block sizes.

Fig. 9. 1D-CNN MAE performance for different input block sizes in the real-world environment.

Fig. 10. 1D-CNN miss detection rate for different input block lengths in the real-world environment.

Fig. 11. 1D-CNN false alarm rates for different input block lengths in the real-world environment.

Fig. 14. MAE performance of different CFO algorithms for different received SNRs under the AWGN channel.

Fig. 15. MAE performance of different CFO algorithms for different received SNRs under the indoor channel model.

Fig. 17. CFO predicted by the RNN vs. CFO estimated by the conventional algorithm.

TABLE I
1D-CNN NETWORK PARAMETERS FOR PACKET DETECTION.
Unlike the ReLU DNN, where the input is the whole sequence $\angle(\mathbf{y}_{STF})$, in the RNN this sequence is split into STSs (16 samples each), and one STS is fed into one LSTM/GRU unit (the RNN parameters are given in Table IV).

TABLE III
RELU DNN NETWORK PARAMETERS FOR CFO ESTIMATION.
1) CFO Estimation Performance in Simulated Environment: The MAE of CFO estimation as a function of channel SNR is presented in Figs. 14 and 15 for the two simulated channel models (see Sec. III-C), respectively.

TABLE VI
APPROXIMATE NUMBER OF FLOPS FOR CFO ESTIMATION