Neural Network Approaches for Data Estimation in Unique Word OFDM Systems

Data estimation has been conducted with model-based estimation methods since the beginning of digital communications. However, motivated by the growing success of machine learning, current research focuses on replacing model-based data estimation methods with data-driven approaches, mainly neural networks (NNs). In this work, we particularly investigate the incorporation of existing model knowledge into data-driven approaches, which is expected to lead to complexity reduction and/or performance enhancement. We describe three different options, namely “model-inspired” pre-processing, choosing an NN architecture motivated by the properties of the underlying communication system, and inferring the layer structure of an NN with the help of model knowledge. Most of the current publications on NN-based data estimation deal with general multiple-input multiple-output (MIMO) communication systems. In this work, we investigate NN-based data estimation for so-called unique word orthogonal frequency division multiplexing (UW-OFDM) systems. We highlight differences between UW-OFDM systems and general MIMO systems one has to be aware of when using NNs for data estimation, and we introduce measures for a successful utilization of NN-based data estimators in UW-OFDM systems. Further, we investigate the use of NNs for data estimation when channel coded data transmission is conducted, and we present adaptations to be made such that NN-based data estimators provide satisfying performance in this case. We compare the presented NNs in terms of achieved bit error ratio (BER) performance and computational complexity, we show the peculiar distributions of their data estimates, and we also point out their downsides compared to model-based equalizers.


I. INTRODUCTION
On the receiver side of wireless digital communication systems, data estimation, also referred to as equalization, is conducted to reconstruct the transmitted data that have been disturbed during transmission. Traditionally, this task is accomplished with model-based estimation methods. That is, the data transmission is described by physical and mathematical models, such that statistical estimation methods can be developed on the basis of these models to estimate the transmitted data. This established approach has many advantages, e.g., the derived estimation methods are well interpretable, and often performance bounds can be derived. However, there are also some downsides. Model-based estimation methods yielding optimal performance are generally computationally infeasible, which requires resorting to less complex, suboptimal methods in practice. Furthermore, modeling inaccuracies may lead to severe performance degradation, and the empirical statistical behavior of available data cannot be utilized for improving the estimation results. With data-driven machine learning methods, some of the aforementioned issues of model-based approaches can be resolved. Hence, current research focuses on employing data-driven methods, particularly neural networks (NNs), for equalization. The majority of available publications on NN-based data estimators assume a MIMO system model, often with data transmission over an uncorrelated Rayleigh fading channel. Due to the different system properties of UW-OFDM systems in comparison to MIMO systems over uncorrelated Rayleigh fading channels, pre-processing steps are required to obtain well-performing NN-based equalizers, which we detail in this paper. We compare the NN-based approaches with model-based methods in terms of performance and complexity. We conduct our investigations for both channel coded and uncoded data transmission, where the former case has rarely been covered in publications on NN-based data estimators yet. For channel coded transmission, the equalizers have to provide
reliability information about their estimates. It turns out that NN-based data estimators tend to be overconfident in their decisions, which impairs the overall system performance. We suggest a measure to counteract the overconfidence of the NNs, which allows achieving approximately the same BER performance as with an optimal equalizer. Furthermore, we plot the empirical distributions of the estimates of model-based and NN-based equalizers in the in-phase/quadrature-phase (I/Q) diagram, which highlights peculiarities of some of the considered equalizers.
The remainder of this paper is structured as follows: we start by reviewing the UW-OFDM signaling scheme in Sec. II. In Sec. III, we present optimal and suboptimal model-based data estimation methods, and we visualize their decision boundaries in a toy example. We address the NN-based equalizers, as well as the utilized data normalization scheme, in Sec. IV. In Sec. V, we provide BER performance results for both channel coded and uncoded data transmission, we conduct a complexity analysis, and we compare the distributions of the estimates provided by model-based and NN-based equalizers. Finally, we conclude with our findings in Sec. VI.

Notation
Throughout this paper, the ith element of a vector x, the element in the ith row and the jth column of a matrix X, and the ith row of a matrix X are denoted as x_i, [X]_ij, and [X]_i,*, respectively. The operators Re{.} and Im{.} deliver the real and the imaginary part of a complex-valued quantity, and (.)^T and (.)^H indicate the transposition and the conjugate transposition of a vector/matrix, respectively. Furthermore, p(.), p[.], Pr(.), p[a|b], p[a = ã], and E_a[.] describe the probability density function (PDF) of a continuous random variable, the probability mass function (PMF) of a discrete random variable, the probability operator, a conditional PMF of the random variable a given b, a PMF evaluated at the value ã, and the expectation operator averaging over the PDF/PMF of a, respectively. The subscript of the expectation operator is omitted when the averaging PDF/PMF is clear from context.

II. PRELIMINARIES
In this section, we describe the basics of UW-OFDM. For more detailed information on UW-OFDM, we refer to [15], [17]-[19]. The UW-OFDM signaling scheme mainly exhibits two differences from CP-OFDM. Firstly, a deterministic sequence, the so-called UW, is employed as a guard interval. Secondly, the guard interval is part of a UW-OFDM time domain symbol resulting from an inverse discrete Fourier transform (IDFT) operation. That is, the guard interval is not removed on the receiver side, but is transformed to the frequency domain together with the preceding payload. With this approach, redundancy is introduced in the frequency domain, which can be exploited beneficially for spectral shaping [20], and for achieving a better bit error ratio (BER) performance [15] than with CP-OFDM, however, at the cost of receiver complexity. In the following, we elucidate the data transmission in a UW-OFDM system and its associated system model.
As in CP-OFDM, the data symbols, drawn from a phase-shift keying (PSK) or quadrature amplitude modulation (QAM) alphabet S, are defined in the frequency domain. In contrast to a CP-OFDM symbol, a UW-OFDM symbol x ∈ C^N, containing N_d data symbols d ∈ S^{N_d}, has to fulfill some conditions. To reveal these conditions, we consider the structure and the generation of a UW-OFDM time domain symbol x_t ∈ C^N of length N. In a first step, a time domain symbol is generated that consists of payload data x_pl and a succeeding sequence of zeros of length N_u. This requested structure imposes the condition F_N^{-1} x = [x_pl^T 0^T]^T on the corresponding UW-OFDM symbol x in the frequency domain, where F_N^{-1} is the N-point IDFT matrix. To fulfill this constraint, the number of data symbols N_d per UW-OFDM symbol has to be at least by N_u smaller than the length N of a UW-OFDM symbol, reduced by the number of zero subcarriers N_z, i.e., N_d ≤ N − N_z − N_u. Throughout this paper, we consider the case of equality. The matrix A, in turn, can be any non-singular matrix, which can be chosen according to the so-called systematic or non-systematic UW-OFDM signaling scheme. In this work, the non-systematic approach is used, where A is optimized for the BER performance of the linear minimum mean square error (LMMSE) data estimator as in [18]. In case A is chosen to be a permutation matrix placing the data symbols and the redundant values on their intended subcarrier positions, the signaling scheme is termed systematic UW-OFDM. For further details on systematic and non-systematic UW-OFDM, we refer to [15], [17]-[19]. The last step on the transmitter side is generating a transmit symbol x_t by inserting the deterministic UW x_u ∈ C^{N_u} at the position of the zero sequence of the UW-OFDM time domain symbol, i.e., x_t = F_N^{-1} x + [0^T x_u^T]^T. After transmission of x_t over a multipath channel and additional disturbance by additive white Gaussian noise (AWGN), the
corresponding received vector is transformed to the frequency domain, and the zero subcarriers are removed. The resulting downsized vector y_d follows to

y_d = H(Gd + x̃_u) + ñ, (1)

where the diagonal matrix H ∈ C^{(N_d+N_u)×(N_d+N_u)} contains the sampled channel frequency response excluding the positions of the zero subcarriers, G denotes the UW-OFDM generator matrix, F_N is the N-point discrete Fourier transform (DFT) matrix, x̃_u = F_N [0^T x_u^T]^T denotes the UW in the frequency domain (downsized accordingly), and ñ is the frequency domain version of the circularly symmetric complex white Gaussian noise n ∼ CN(0, σ_n² I), where σ_n² is the variance of the AWGN in the time domain. Removing the influence of the known UW on y_d yields the equivalent complex baseband system model

y = H̃d + w, (2)

with H̃ = HG and w ∼ CN(0, Nσ_n² I). In case of channel coded data transmission, reliability information of the estimates, also referred to as soft information or soft decision estimates, has to be provided, e.g., in form of log-likelihood ratios (LLRs)

L_ji = ln( Pr[b_ji = 1|y] / Pr[b_ji = 0|y] ), (3)

with L_ji being the LLR of the jth bit b_ji of the ith data symbol, j ∈ {0, ..., log₂(|S|) − 1}, i ∈ {0, ..., N_d − 1}. The LLRs serve as input for the channel decoder. For uncoded data transmission, the data symbol estimates are sliced to the nearest symbol in the symbol alphabet, which is also termed hard decision estimation.
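As a minimal numerical sketch of the equivalent baseband model (2), the following snippet sets up y = H̃d + w with randomly drawn stand-ins for the diagonal channel matrix H and the generator matrix G (the actual matrices depend on the channel realization and the system design); all dimensions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not the paper's system parameters)
Nd, Nu = 4, 2                      # data symbols / UW length
M = Nd + Nu                        # downsized frequency-domain dimension
N = 8                              # DFT length (assumed)
sigma_n2 = 0.01                    # time-domain noise variance

# Assumed stand-ins for the diagonal channel matrix H and generator matrix G
H = np.diag(rng.standard_normal(M) + 1j * rng.standard_normal(M))
G = rng.standard_normal((M, Nd)) + 1j * rng.standard_normal((M, Nd))

# QPSK data vector d with unit symbol energy
rho = 1 / np.sqrt(2)
d = rho * ((2 * rng.integers(0, 2, Nd) - 1) + 1j * (2 * rng.integers(0, 2, Nd) - 1))

# Equivalent complex baseband model (2): y = H G d + w, w ~ CN(0, N*sigma_n^2*I)
w = np.sqrt(N * sigma_n2 / 2) * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
H_tilde = H @ G                    # effective system matrix
y = H_tilde @ d + w
```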
III. MODEL-BASED DATA ESTIMATION

In this section, we review some traditional, model-based approaches for equalization. The aim is to estimate the data vector d based on the received vector y, the channel state information in form of the matrix H̃, and the system model (2). We start by elaborating on optimal estimators, which are, however, in general computationally infeasible. Consequently, one usually has to resort to suboptimal estimation methods in practice. We describe two state-of-the-art suboptimal estimators, where one is a linear and the other a non-linear estimator. Further, we present the decision boundaries of the aforementioned equalizers in a toy example to visualize their differences.

A. Bit-Wise Maximum A-Posteriori Estimator
The optimal estimator in terms of the BER performance is the bit-wise maximum a-posteriori (MAP) estimator [23], yielding the bit value featuring the highest probability for a given received vector y:

b̂_ji = arg max_{b∈{0,1}} Pr[b_ji = b | y] = arg max_{b∈{0,1}} Σ_{d ∈ D_ji^b} p(y|d), (4)

where D_ji^b ⊂ S^{N_d} denotes the set of data vectors with the bit b_ji fixed to the value b ∈ {0, 1}. For the second step in (4), the data symbols in the data vector are assumed to be independent and identically distributed (i.i.d.) with a uniform prior probability.

B. Vector Maximum Likelihood Estimator
The estimated data vector d̂ produced by the vector maximum likelihood (ML) estimator maximizes the likelihood function p(y|d), whereby all possible data vectors d ∈ S^{N_d} are considered. Since we assume i.i.d. data symbols in the data vector, the vector ML estimator coincides with the vector MAP estimator. The vector ML estimator is given by

d̂ = arg max_{d ∈ S^{N_d}} p(y|d) = arg min_{d ∈ S^{N_d}} ||y − H̃d||₂². (5)

In the literature, this estimator is often considered to be the optimal equalizer. In fact, it is optimal with respect to the error probability of the data vector estimate [23], but not with respect to the BER, which is the usual figure of merit in communications. By examining (5), a noteworthy peculiarity of the vector ML estimator can be observed, namely, this estimator does not depend on the noise variance σ_n², which is in contrast to the bit-wise MAP estimator (4).
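The exhaustive search in (5) can be sketched in a few lines; this is a hypothetical toy setup, and the exponential complexity in N_d is evident from the loop over all |S|^{N_d} candidates. Note that the noise variance appears nowhere in the search.

```python
import numpy as np
from itertools import product

# Brute-force vector ML (5): with white Gaussian noise, maximizing p(y|d)
# is equivalent to minimizing ||y - H d||^2 over all candidate data vectors.
def vector_ml(y, H, alphabet):
    Nd = H.shape[1]
    best, best_metric = None, np.inf
    for cand in product(alphabet, repeat=Nd):
        d = np.array(cand)
        metric = np.linalg.norm(y - H @ d) ** 2
        if metric < best_metric:
            best, best_metric = d, metric
    return best

rng = np.random.default_rng(1)
rho = 1 / np.sqrt(2)
alphabet = [rho * (a + 1j * b) for a in (-1, 1) for b in (-1, 1)]  # QPSK
H = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
d_true = np.array([alphabet[0], alphabet[3], alphabet[1]])
d_hat = vector_ml(H @ d_true, H, alphabet)   # noiseless sanity check
```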

C. Minimum Mean Square Error Estimator
The nonlinear MMSE estimator is, in contrast to the ML and the MAP estimators, very rarely regarded in communications literature.Especially when it comes to NN-based data estimators, which try to approximate the nonlinear MMSE estimator, we believe that a detailed consideration of the nonlinear MMSE estimator is quite meaningful.
When employing the Bayesian mean square error E_{y,d}[||d − d̂||₂²] as a performance measure, the minimum mean square error (MMSE) estimator is the optimal estimator. The MMSE estimator is obtained by computing the mean of the posterior PMF [24], i.e.,

d̂ = Σ_{d ∈ S^{N_d}} d p[d|y] = ( Σ_{d ∈ S^{N_d}} d p(y|d) ) / ( Σ_{d ∈ S^{N_d}} p(y|d) ), (6)

where again a uniform prior probability distribution of the data vectors is assumed. Interestingly, as shown in Appendix A, for a QPSK modulation alphabet (which is employed as the modulation alphabet in this paper), the hard decision estimates of the MMSE estimator coincide with those of the bit-wise MAP estimator. Hence, the MMSE estimator also serves as a benchmark for the best achievable BER performance. For higher-order modulation alphabets, e.g., 16-QAM or 64-QAM, the MMSE estimator has to be formulated for the transmitted bit vector (instead of the complex-valued data symbol vector) to obtain optimal BER performance.
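The posterior mean (6) can be sketched directly as a weighted sum over all candidate data vectors (hypothetical tiny dimensions; like the vector ML search, this is exponential in N_d):

```python
import numpy as np
from itertools import product

# Sketch of the nonlinear MMSE estimator (6): the posterior mean under a
# uniform prior. With complex white Gaussian noise of per-element variance v,
# p(y|d) is proportional to exp(-||y - H d||^2 / v).
def mmse_estimate(y, H, alphabet, v):
    Nd = H.shape[1]
    cands = np.array(list(product(alphabet, repeat=Nd)))
    metrics = np.array([np.linalg.norm(y - H @ d) ** 2 for d in cands])
    metrics -= metrics.min()                 # numerical stability
    post = np.exp(-metrics / v)
    post /= post.sum()                       # posterior PMF p[d|y]
    return post @ cands                      # posterior mean (soft estimate)

rng = np.random.default_rng(6)
rho = 1 / np.sqrt(2)
alphabet = [rho * (a + 1j * b) for a in (-1, 1) for b in (-1, 1)]  # QPSK
H = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))
d_true = np.array([alphabet[2], alphabet[1]])
d_hat = mmse_estimate(H @ d_true, H, alphabet, v=1e-3)
```

For low noise, the posterior concentrates on the true data vector, so the soft estimate approaches a constellation point.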
Reliability Information for MMSE Estimates: As obvious from (3), the posterior probabilities Pr[b_ji = 1|y] and Pr[b_ji = 0|y] have to be determined to obtain the desired LLRs L_ji. For the employed QPSK modulation alphabet, the LLRs L_0i and L_1i, corresponding to the zeroth and the first bit of the ith data symbol d_i, respectively, can be computed with low complexity on the basis of the MMSE estimates d̂_i, which is presented in the following. To this end, let us consider the QPSK bit-to-symbol mapping (b_1i b_0i) → d_i, where the bits b_0i and b_1i are mapped to the real part and the imaginary part of d_i, respectively. The bit values 0 and 1 are mapped to the symbol values −ρ and ρ, respectively, with the energy normalization factor ρ = 1/√2. Hence, as given in (38), the real part of the ith MMSE estimate follows to

Re{d̂_i} = ρ Pr[b_0i = 1|y] − ρ Pr[b_0i = 0|y]. (7)

Since Pr[b_0i = 0|y] + Pr[b_0i = 1|y] = 1, (7) can be expressed as

Re{d̂_i} = ρ (2 Pr[b_0i = 1|y] − 1), (8)

or as

Re{d̂_i} = ρ (1 − 2 Pr[b_0i = 0|y]). (9)

Rearranging (8) and (9) with respect to the posterior probabilities, and plugging the results into the LLR definition (3), yields

L_0i = ln( (ρ + Re{d̂_i}) / (ρ − Re{d̂_i}) ), (10)

where for obtaining L_1i the same steps as above have to be conducted for the imaginary part of d̂_i.
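The low-complexity LLR computation can be sketched as follows, assuming the LLR convention L = ln(Pr[b = 1|y] / Pr[b = 0|y]) and the QPSK mapping described above:

```python
import numpy as np

rho = 1 / np.sqrt(2)   # QPSK energy normalization factor

# LLRs from an MMSE estimate d_hat_i: with Re{d_hat_i} = rho*(2*Pr[b0=1|y] - 1),
# solving for the posteriors gives L = ln((rho + Re) / (rho - Re)); the first
# bit is obtained analogously from the imaginary part.
def llrs_from_mmse(d_hat_i):
    L0 = np.log((rho + d_hat_i.real) / (rho - d_hat_i.real))
    L1 = np.log((rho + d_hat_i.imag) / (rho - d_hat_i.imag))
    return L0, L1
```

For example, Pr[b_0i = 1|y] = 0.9 corresponds to Re{d̂_i} = 0.8ρ and thus L_0i = ln(0.9/0.1) = ln 9, while an imaginary part of zero yields the noninformative LLR L_1i = 0.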

D. Linear Minimum Mean Square Error Estimator
The aforementioned optimal equalizers all suffer from a complexity that is exponential in the length of the data vector. To obtain low-complexity equalizers, one can constrain the estimator to be linear. The best linear estimator in terms of the Bayesian mean square error is the LMMSE estimator. By applying the Bayesian Gauss-Markov theorem [24] to (2), the LMMSE estimator follows to

d̂ = ( H̃^H H̃ + (N σ_n² / σ_d²) I )^{-1} H̃^H y = E_LMMSE y, (11)

where σ_d² is the variance of the data symbols, and E_LMMSE is the LMMSE estimator matrix.
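A sketch of (11), solving the regularized normal-equation system instead of forming the matrix inverse explicitly (toy dimensions, near noise-free sanity check):

```python
import numpy as np

# LMMSE estimator (11) for the model (2): y = H d + w, w ~ CN(0, N*sigma_n^2*I).
def lmmse(y, H, N, sigma_n2, sigma_d2=1.0):
    Nd = H.shape[1]
    A = H.conj().T @ H + (N * sigma_n2 / sigma_d2) * np.eye(Nd)
    return np.linalg.solve(A, H.conj().T @ y)

rng = np.random.default_rng(7)
H = rng.standard_normal((6, 4)) + 1j * rng.standard_normal((6, 4))
d = ((rng.integers(0, 2, 4) * 2 - 1) + 1j * (rng.integers(0, 2, 4) * 2 - 1)) / np.sqrt(2)
d_hat = lmmse(H @ d, H, N=8, sigma_n2=1e-9)   # (near) noise-free sanity check
```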
Algorithm 1 Decision Feedback Equalization
1: H_0 ← H̃, y_0 ← y, r_idx ← (0, ..., N_d − 1)
2: for k = 0, ..., N_d − 1 do
3:   E_k ← (H_k^H H_k + (N σ_n²/σ_d²) I)^{-1} H_k^H
4:   C_ee,k ← N σ_n² (H_k^H H_k + (N σ_n²/σ_d²) I)^{-1}
5:   j ← arg min_l [C_ee,k]_{ll}   ▷ smallest error variance
6:   i ← r_idx,j   ▷ data symb. index to be estim.
7:   d̂_i ← [E_k y_k]_j   ▷ LMMSE estimate
8:   d̄_i ← Q(d̂_i)   ▷ hard decision (slicing)
9:   y_{k+1} ← y_k − [H_k]_{*,j} d̄_i
10:  H_{k+1} ← H_k without the jth column, remove r_idx,j from r_idx
11: end for

Reliability Information for LMMSE Estimates: By assuming a Gaussian conditional distribution p(d̂_i|d_i) for the LMMSE estimates (which is valid for large N_d following central limit theorem arguments), the LLR can be computed from the conditional likelihood ratio

L_0i = ln( p(d̂_i | b_0i = 1) / p(d̂_i | b_0i = 0) ), (12)

and it can be shown [25] that (12) is equivalent to the LLR definition (3). Following the derivation described in [22], the LLRs for the zeroth and the first bit are given by

L_0i = 4ρ g_i Re{d̂_i} / ν_i²  and  L_1i = 4ρ g_i Im{d̂_i} / ν_i², (13)

respectively, where

g_i = [E_LMMSE]_{i,*} h̃_i,  ν_i² = [E_LMMSE]_{i,*} ( σ_d² H̃_i H̃_i^H + N σ_n² I ) ([E_LMMSE]_{i,*})^H, (14)

h̃_i is the ith column of H̃, and H̃_i is H̃ without the ith column.

E. Decision-Feedback Equalizer
A performance-complexity trade-off is provided by the decision-feedback equalizer (DFE), which is summarized in Alg. 1. In this iterative method, LMMSE estimation of a single data symbol is conducted in every iteration. As the criterion for deciding which data symbol is estimated in the kth iteration, we use the diagonal of the LMMSE error covariance matrix C_ee,k, which contains the error variances of the LMMSE estimates. That is, in a single iteration the data symbol corresponding to the smallest error variance is estimated (cf. Alg. 1), followed by updating the system model by removing the influence of the hard decision estimate d̄_i from the received vector, and by deleting the appropriate column from the system matrix H_k of the kth iteration.
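A compact sketch of the DFE procedure, assuming QPSK and the model (2); the index bookkeeping is simplified compared to the paper's algorithm listing:

```python
import numpy as np

# DFE sketch: in each iteration, LMMSE-estimate the remaining symbols, slice
# the one with the smallest error variance, and remove its contribution from
# the received vector and its column from the system matrix.
def dfe(y, H, N, sigma_n2, sigma_d2=1.0):
    rho = 1 / np.sqrt(2)
    Nd = H.shape[1]
    remaining = list(range(Nd))                 # indices still to be estimated
    d_hat = np.zeros(Nd, dtype=complex)
    Hk, yk = H.copy(), y.copy()
    for _ in range(Nd):
        A = Hk.conj().T @ Hk + (N * sigma_n2 / sigma_d2) * np.eye(len(remaining))
        Cee = N * sigma_n2 * np.linalg.inv(A)   # LMMSE error covariance
        j = int(np.argmin(np.diag(Cee).real))   # smallest error variance
        est = np.linalg.solve(A, Hk.conj().T @ yk)[j]
        sliced = rho * (np.sign(est.real) + 1j * np.sign(est.imag))
        d_hat[remaining[j]] = sliced            # hard decision feedback
        yk = yk - Hk[:, j] * sliced             # remove influence from y
        Hk = np.delete(Hk, j, axis=1)           # delete column from H
        remaining.pop(j)
    return d_hat

rng = np.random.default_rng(9)
rho = 1 / np.sqrt(2)
H = rng.standard_normal((8, 4)) + 1j * rng.standard_normal((8, 4))
d_true = rho * ((rng.integers(0, 2, 4) * 2 - 1) + 1j * (rng.integers(0, 2, 4) * 2 - 1))
d_hat = dfe(H @ d_true, H, N=8, sigma_n2=1e-6)
```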
Reliability Information for the DFE: Due to the non-linear iterative equalization process, the best BER results for channel coded data transmission are obtained by incorporating channel decoding into the iterations of the DFE. However, in this work, we do not consider information feedback from the channel decoder to any of the regarded equalizers. As a workaround, we utilize the LLRs of the LMMSE data symbol estimation in every iteration as reliability information of the DFE. Hence, the LLRs L_0i and L_1i corresponding to the data symbol estimate d̂_i obtained in the kth iteration are computed as given in (13), whereby H̃ is replaced by H_k.

F. Decision Boundaries of Model-Based Equalizers
To illustrate the differences between the model-based data estimators elaborated above, we plot the decision boundaries of their hard decision estimates for a small toy example with two real-valued data symbols, where the bit values 0 and 1 are mapped to the symbols −1 and 1, respectively. The decision boundaries are plotted for different noise power levels, i.e., for σ² = 0.5 and σ² = 0.05 in Fig. 1 and Fig. 2, respectively. As already mentioned in Sec. III-B, the vector ML estimator, which is optimal regarding the estimation error probability of the whole data symbol vector, does not depend on the noise variance. This is visible in the identical decision boundaries of the vector ML estimator in Figs. 1b and 2b. The decision boundaries of the MMSE estimator (Figs. 1a and 2a) change with the noise variance, whereby the decision boundaries (and thus also the performance) of the MMSE estimator converge towards those of the vector ML estimator for σ² → 0. That is, only for higher values of the noise variance might a BER performance difference between these two equalizers be observable. As shown in Sec. V-C, the BER performance difference between the vector ML estimator and the MMSE estimator is negligible for the considered UW-OFDM system. Clearly, the decision boundaries of the LMMSE estimator (Figs. 1c and 2c) can only be straight lines. They distinctly deviate from those of the MMSE estimator, indicating a considerable performance degradation for hard decision estimation due to the linearity constraint. With the DFE, in turn, a symbol is estimated in each iteration using a linear estimation step, leading to a smaller deviation of the decision boundaries from the optimal ones, which is visible in Figs. 1d and 2d.
IV. NEURAL NETWORK BASED DATA ESTIMATION

In this section, we introduce the utilized NN-based data estimators. We describe the actions that have to be taken specifically for UW-OFDM systems to achieve well-performing NNs, and we detail how to counteract the overconfidence of NN-based data estimators to obtain the reliable soft information required for channel coded data transmission.
To reuse existing knowledge of NN architectures and NN training methods, real-valued input data for the NNs are generated. Hence, we map the complex-valued system model (2) to an equivalent real-valued model of double dimension,

y = Hd + w, (15)

where

y = [Re{y}^T Im{y}^T]^T, H = [Re{H̃} −Im{H̃}; Im{H̃} Re{H̃}], d = [Re{d}^T Im{d}^T]^T, w = [Re{w}^T Im{w}^T]^T. (16)

Assuming a symmetric alphabet S, d ∈ S^{2N_d} contains data symbols d_i, i ∈ {0, ..., 2N_d − 1}, drawn from the real-valued symbol alphabet S = Re{S} = Im{S}. The NN-based data estimators are, however, not trained to directly estimate the data symbols d_i, but to estimate the corresponding so-called one-hot vectors d_oh,i ∈ {0, 1}^{|S|}. Let s_j ∈ S, j ∈ {0, ..., |S| − 1}, be the uniquely numbered symbols of the symbol alphabet S. Then, a one-hot vector d_oh,i corresponding to a data symbol d_i that exhibits the value s_j contains all zeros but a one at the jth position. The one-hot vectors d_oh,i are stacked to a vector d_oh, serving as ground truth for training the NN-based data estimators. Further, a quadratic loss function ℓ(d_oh, d̂_oh) is employed to quantify the error between the output d̂_oh ∈ R^{2N_d|S|} of an NN and d_oh. It can be shown (cf., e.g., [3]) that with this approach the estimates d̂_oh,i of a properly trained NN approximately contain the posterior probabilities Pr[d_i = s_j|y]. With the approximate posterior probabilities, LLRs can be computed using (3). Hence, soft information of the data symbol estimates required for coded data transmission is available. A hard decision estimate, in turn, is the symbol corresponding to the maximum entry in d̂_oh,i.
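The one-hot encoding and the corresponding hard decision mapping can be sketched as follows, assuming the QPSK-derived real-valued alphabet S = {−ρ, ρ}:

```python
import numpy as np

# One-hot encoding sketch for the real-valued QPSK alphabet S = {-rho, +rho}.
rho = 1 / np.sqrt(2)
S = np.array([-rho, rho])                      # uniquely numbered symbols s_j

def one_hot(d):
    """Stack the one-hot vectors d_oh,i of a real-valued symbol vector d."""
    d_oh = np.zeros((len(d), len(S)))
    d_oh[np.arange(len(d)), np.abs(S - d[:, None]).argmin(axis=1)] = 1.0
    return d_oh.ravel()

def hard_decision(d_oh_hat):
    """Map NN outputs (approximate posteriors) back to symbols via argmax."""
    return S[d_oh_hat.reshape(-1, len(S)).argmax(axis=1)]
```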

A. Data Normalization
Proper normalization of the input data of an NN is generally very important for well-behaved training, and thus for the performance of trained NNs [26], [27]. Interestingly, in the majority of currently available publications on NNs for data estimation in MIMO systems (e.g., [3]-[5], [10]), the input data of the NNs are not normalized. As we show in Sec. V-C, applying DetNet [3] as a data estimator in a UW-OFDM system without any data normalization (which is done in [3] for general MIMO systems) leads to poor BER performance. A major reason for this issue can be found by investigating the relation between the noise variance σ_n² and the signal-to-noise ratio (SNR) at the receiver side. The performance of an equalizer is typically determined by evaluating the achieved BER at a specified E_b/N_0, which is a measure for the SNR, where E_b is the mean energy per bit, and N_0 is the noise power spectral density. For the following considerations, we define an SNR measure γ which is proportional to E_b/N_0. For a specified SNR γ at the input of the equalizer, the noise variance σ_n² in the time domain can therefore be expressed as a function of γ and of the energy of the current channel realization (17). In case of a general MIMO system over an uncorrelated Rayleigh fading channel, which is mainly used in, e.g., [3]-[5], [10] for the performance comparison of different NN-based data estimators, the elements of H ∈ R^{2N×2N_d} are drawn independently from a standard normal distribution, i.e., [H]_lk ∼ N(0, 1), such that the channel energy is approximately constant across realizations. Plugging this approximation into (17) yields a noise variance that depends on γ only (18). That is, for a general MIMO system over an uncorrelated Rayleigh fading channel, the noise variances var(w_l) = Nσ_n² of the elements w_l of the noise vector w are independent of the current channel realization, and for a fixed SNR, they are constant. Hence, the data are implicitly normalized for this communication system. This is not the case for UW-OFDM systems, and thus we normalize the data such that the variances of the elements of the noise vector become independent of the channel realization. The data
normalization is conducted by multiplying the real-valued system model (15) by the normalization factor √N/||H||_F, with ||H||_F = √(tr(H^T H)) denoting the Frobenius norm of H. Consequently, every element of the noise vector after normalization has the variance var(w̄_l) = N²σ_n²/||H||_F², which is independent of the channel realization. This data normalization is implemented by multiplying both y and H by the above-given normalization factor, which is conducted as a pre-processing step for all the NN-based data estimators presented subsequently. In the remainder of this paper, we omit the normalization factor for the sake of better readability.
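The normalization pre-processing step can be sketched in a few lines (illustrative dimensions):

```python
import numpy as np

# Channel-dependent data normalization: scaling the real-valued model (15)
# by sqrt(N)/||H||_F makes the per-element noise variance equal to
# N^2 * sigma_n^2 / ||H||_F^2, independent of the channel realization.
def normalize(y, H, N):
    c = np.sqrt(N) / np.linalg.norm(H, 'fro')
    return c * y, c * H

rng = np.random.default_rng(10)
H = rng.standard_normal((8, 6))
y = rng.standard_normal(8)
y_n, H_n = normalize(y, H, N=8)   # after scaling, ||H_n||_F^2 = N
```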

B. DetNet
DetNet is proposed in [3] for data estimation in MIMO systems. Its network architecture is deduced by deep unfolding [13] of a projected gradient descent method applied to the optimization problem of the vector ML estimator for the model (15). The kth step of the iterative optimization method can be expressed as

d̂_k = Π( d̂_{k−1} + δ_k H^T y − δ_k H^T H d̂_{k−1} ), (19)

where Π(.) denotes a non-linear projection onto a convex subspace D containing all possible data vectors d, i.e., S^{2N_d} ⊂ D ⊂ R^{2N_d}, and δ_k is the step width in the kth iteration. The structure of the kth layer of the L DetNet layers is inspired by a projected gradient descent iteration step (19). Firstly, the affine mapping

q_k = d̂_{k−1} + δ_{k1} H^T y − δ_{k2} H^T H d̂_{k−1} (20)

is applied to the layer input d̂_{k−1} to obtain the temporal variable q_k, where δ_{k1} and δ_{k2} are learned parameters. Secondly, the temporal variable is forwarded to a fully-connected neural network (FCNN) with a single hidden layer consisting of d_h hidden neurons and ReLU activation, which replaces the (unknown) nonlinear projection Π(.). To ease the training of DetNet, weighted residual connections [28] with weighting factor α, as well as an auxiliary loss inspired by the loss function employed for the training of GoogLeNet [29], are utilized. Further, d_v-dimensional auxiliary variables v_k passing unconstrained information from layer to layer are used to improve the performance of DetNet. We refer to [3] for more detailed information.
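A single projected gradient descent step (19), the template from which a DetNet layer is unfolded, can be sketched as follows; here, the projection Π(.) (learned by a small FCNN in DetNet) is replaced by a simple elementwise clipping onto the box [−ρ, ρ] containing the symbol alphabet:

```python
import numpy as np

rho = 1 / np.sqrt(2)

# One projected gradient descent step (19) for the vector ML problem:
# a gradient step on ||y - H d||^2 followed by a projection onto a convex set D.
def pgd_step(d_prev, y, H, delta):
    grad_step = d_prev + delta * H.T @ y - delta * H.T @ H @ d_prev
    return np.clip(grad_step, -rho, rho)    # stand-in for the learned Pi(.)

# Sanity check with H^T H = 2I: a single step with delta = 1/2 is exact.
H = np.vstack([np.eye(4), np.eye(4)])
d_true = rho * np.array([1.0, -1.0, 1.0, 1.0])
d1 = pgd_step(np.zeros(4), H @ d_true, H, delta=0.5)
```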
Preconditioning: Due to the deduction of the layer structure of DetNet by deep unfolding, the number of layers corresponds to the number of required iterations of the underlying projected gradient descent method. It is well known that the condition number of the Hessian matrix in an optimization problem influences the number of iterations required for an iterative optimization method to converge. Hence, preconditioning the system model (15) may reduce the number of required DetNet layers and thus the number of trainable parameters, which, in turn, enhances both the training behavior and the inference complexity. As also stated in [7], we have observed [1] that NN-based equalizers suffer from severe performance degradation for ill-conditioned channel matrices. We showed in [1] that preconditioning distinctly narrows the eigenvalue spectrum of the Hessian matrix S ∈ R^{P×P}, [S]_rs = ∂²ℓ(d_oh, d̂_oh)/(∂p_r ∂p_s), of the NN learning problem, where p_r and p_s are two of the P trainable parameters of the NN. This, in turn, allows using higher learning rates, which leads to a faster and probably better optimization of the NN parameters. We show the influence of preconditioning on the DetNet performance in Sec. V-C.
In the following, we show that preconditioning only adds a further processing step for the layer input data, while the layer structure of DetNet remains unchanged. To this end, let us rewrite the optimization problem of the vector ML estimator in the form

min_{d ∈ D} ||y − H L^{-1} L d||₂², (21)

where L ∈ R^{2N_d×2N_d} is an invertible matrix. Temporarily neglecting the projection operator, a gradient descent step for the linearly transformed vector d_pr = Ld is given by

d̂_pr,k = d̂_pr,k−1 + δ_k L^{-T} H^T (y − H L^{-1} d̂_pr,k−1), (22)

with d̂_pr,k/k−1 = L d̂_k/k−1, and L^{-T} = (L^{-1})^T = (L^T)^{-1}. Hence, the kth iteration of the projected gradient descent for d follows to

d̂_k = Π( d̂_{k−1} + δ_k P^{-1} H^T y − δ_k P^{-1} H^T H d̂_{k−1} ), (23)

where P = L^T L is the so-called preconditioning matrix. In this paper, we utilize a Jacobi preconditioning matrix, which is a diagonal matrix containing diag(H^T H) on its main diagonal. Hence, the computation of P^{-1}, P^{-1}H^T y, and P^{-1}H^T H can be carried out with low complexity. A comparison of a projected gradient descent step (19) and its preconditioned version (23) reveals that preconditioning does not change the structure of the equation. Hence, the layer architecture of DetNet remains unchanged, while H^T y and H^T H have to be replaced by P^{-1}H^T y and P^{-1}H^T H, respectively.
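The low cost of Jacobi preconditioning is evident in a short sketch: P is diagonal, so the preconditioned layer inputs require only elementwise scaling.

```python
import numpy as np

# Jacobi preconditioning sketch: P = diag(H^T H), so P^{-1}, P^{-1} H^T y,
# and P^{-1} H^T H are cheap to compute. The preconditioned Hessian
# P^{-1} H^T H has a unit diagonal and typically a narrower eigenvalue spread.
rng = np.random.default_rng(11)
H = rng.standard_normal((12, 6))
y = rng.standard_normal(12)

G = H.T @ H                                    # Hessian of the LS problem
P_inv = np.diag(1.0 / np.diag(G))              # inverse Jacobi preconditioner
HTy_pre = P_inv @ (H.T @ y)                    # preconditioned layer inputs
HTH_pre = P_inv @ G
```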

C. Fully-Connected Neural Network
According to the universal approximation theorem [12], an FCNN with a single hidden layer and sufficiently many hidden neurons can approximate any continuous function, and thus should also be able to accomplish the task of data estimation. However, as stated in [3], it is challenging to employ an FCNN for equalization under changing channel realizations when using the columns of H concatenated with y as input data. That is, training an FCNN for different channels might be a hard task. One reason for this issue might be that no model knowledge is included in the structure of an FCNN. We therefore suggest to include model knowledge in the data pre-processing.
To motivate the choice of the proposed data pre-processing, whose purpose is to reduce redundant information, we elucidate three observations. Firstly, the FCNN should approximate the estimator function of the optimal MMSE estimator (6). An inspection of (6) reveals that an MMSE estimate is a sum of exponential terms, where the exponents contain ||y − Hd||₂² = y^T y − 2d^T H^T y + d^T H^T H d, ∀d ∈ S^{2N_d}. That is, the MMSE estimator does not use the isolated data y and H, but the terms y^T y, H^T y, and H^T H. Secondly, it can be shown with the help of the Fisher-Neyman factorization theorem [30] that H^T y provides a sufficient statistic for the data estimation problem. Consequently, multiplying y by H^T, which modifies the system model to

H^T y = H^T H d + H^T w, (24)

preserves all the relevant information contained in y for the estimation of d, while reducing the dimension of the available data. Thirdly, the matched filter equalizer for the system model (15) is given by d̂_MF = H^T y, which is the linear filter designed for maximizing the output SNR [31].
With the above-given arguments, we conclude that multiplying both H and y by H^T before using them as inputs of an FCNN compresses the input data while preserving all the information required for data estimation. Interestingly, also for DetNet the quantities H^T y and H^T H are utilized instead of y and H, however, due to a different motivation, and in a different manner. Since H^T H is a symmetric matrix, the dimension of the input data is further reduced by utilizing only the upper triangular part of H^T H including its main diagonal. That is, the input vector of the FCNN data estimator is the concatenation of H^T y and the vectors [H^T H]_{0:l,l}, l ∈ {0, ..., 2N_d − 1}, where [H^T H]_{0:l,l} denotes the vector containing the first l + 1 entries of the lth column of H^T H.
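The construction of this compressed input vector can be sketched as follows (toy dimensions):

```python
import numpy as np

# Compressed FCNN input: H^T y concatenated with the upper triangular part
# (including the main diagonal) of the symmetric matrix H^T H.
def fcnn_input(y, H):
    HTy = H.T @ y
    HTH = H.T @ H
    iu = np.triu_indices(HTH.shape[0])          # upper triangle incl. diagonal
    return np.concatenate([HTy, HTH[iu]])

rng = np.random.default_rng(12)
H = rng.standard_normal((10, 4))                # real-valued model, 2Nd = 4
y = rng.standard_normal(10)
x_in = fcnn_input(y, H)                         # length 2Nd + 2Nd*(2Nd+1)/2
```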
The utilized FCNNs for equalization comprise L layers, d_h neurons per hidden layer, and weighted residual connections with weighting factor α. The employed activation functions ϕ(.) are stated in Tab. I.

D. Attention Detector
Due to the arguments given in Sec. IV-C, we use the compressed system model (24) for defining the inputs of the so-called Attention Detector. Investigations on the compressed system model (24) revealed that the entries in H^T y are correlated. This observation motivates the use of the self-attention mechanism [16] to exploit these correlations for enhancing the estimation performance and/or reducing the required computational complexity. For further elaborations on the network architecture of the Attention Detector, let us start by defining its inputs, which are the rows m_k^T, k ∈ {0, ..., 2N_d − 1}, of the matrix

M = [P^{-1}H^T y  P^{-1}H^T H], (25)

where P is the Jacobi preconditioning matrix as described in Sec. IV-B. Although the layer architecture of the Attention Detector is not deduced by deep unfolding, we apply preconditioning for obtaining a narrower eigenvalue spectrum of the Hessian matrix of the NN learning problem, cf. Sec. IV-B. The vectors m_k^T serve as an input sequence of an encoder. Since the rows of the equation system (24) are interchangeable, no positional encoding is applied to the vectors. The encoder is very similar to that of the Transformer [16]. It is comprised of L_enc stacked encoder layers, whereby the lth encoder layer, l ∈ {0, ..., L_enc − 1}, is schematically shown in Fig. 3. An encoder layer with inputs m_k^{(l)T}, where m_k^{(0)T} = m_k^T, consists of a self-attention layer [16], followed by a batch norm layer, a single hidden layer FCNN with d_h,enc hidden neurons and ReLU activation function, and another batch norm layer. Around both the self-attention layer and the single hidden layer FCNN, residual connections are employed. Further, dropout [32] with a dropout rate D is applied to the outputs of the self-attention layer and the single hidden layer FCNN, as well as to the input layer outputs of the latter. The outputs m_k^{(L_enc)T} of the last encoder layer are forwarded to a shallow FCNN with L_fcnn hidden layers, d_h,fcnn neurons per hidden layer, and an activation function ϕ(.) specified in Tab. I. The outputs of this shallow FCNN are the final estimation results d̂_oh.
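The core of each encoder layer, scaled dot-product self-attention as in [16], can be sketched as follows; the dimensions and weight matrices are purely illustrative, and a single attention head without the surrounding batch norm, dropout, and residual connections is shown:

```python
import numpy as np

# Scaled dot-product self-attention (single head): every input row m_k
# attends to all other rows, which lets the network exploit the correlations
# between the entries of H^T y.
def self_attention(M, Wq, Wk, Wv):
    Q, K, V = M @ Wq, M @ Wk, M @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])     # pairwise attention logits
    scores -= scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)          # row-wise softmax
    return A @ V                               # weighted combination of values

rng = np.random.default_rng(5)
M = rng.standard_normal((6, 7))                # 2Nd input rows m_k^T (toy size)
d_model = 7
Wq = rng.standard_normal((d_model, d_model))
Wk = rng.standard_normal((d_model, d_model))
Wv = rng.standard_normal((d_model, d_model))
out = self_attention(M, Wq, Wk, Wv)
```

Note that the output is equivariant under permutations of the input rows, which is consistent with the interchangeability of the rows of (24) and the absence of positional encoding.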

V. RESULTS
In this section, we compare the presented NN-based data estimators with state-of-the-art model-based equalizers in terms of the BER performance achieved over a specified SNR range by means of simulations. We study the equalizers for channel coded and uncoded data transmission, and we regard their computational complexity. Further, we highlight the peculiar distribution of the estimates provided by NN-based data estimators. Due to the multitude of possible combinations of system settings, only selected simulation cases are presented, while those setups that do not provide further insights are omitted. Since a better BER performance is achievable with non-systematic UW-OFDM signaling [18], we focus on this signaling scheme in our investigations.

A. Simulation Setup
The evaluation is conducted for two different system dimensions. The parameter setup for system I is N = 12, N_d = 8, N_u = 4, N_z = 0, and N_p = 0, and for system II N = 64, N_d = 32, N_u = 16, N_z = 12, and N_p = 4. System II is meant to represent a real-world communication system, where N_p pilot subcarriers can be utilized for synchronization purposes. However, in our simulations, the pilot subcarriers are unused and do not influence the presented results. For system I, in turn, the BER performance of the optimal model-based data estimators can be simulated in a reasonable time, which allows providing insights concerning the gap between the performance achieved with an NN-based equalizer and the lower BER bound.
We assume data transmission over a multipath channel in the form of data bursts comprised of a sequence of 1000 UW-OFDM symbols. The channel is assumed to be stationary for a single data burst, but to change independently of all other channel realizations from burst to burst. We utilize the statistical channel model [33] of an indoor frequency selective environment, where the channel impulse responses are modeled in the form of tapped delay lines. The complex tap values exhibit a uniformly distributed phase and a Rayleigh distributed magnitude with an exponentially decaying power profile. As in the referenced works on UW-OFDM [15], [18], [22], we use for system II a sampling time T_s = 50 ns, and we choose a channel delay spread of τ_RMS = 100 ns. For system I, we specify the sampling time to be 200 ns while keeping the same channel delay spread as for system II. We assume perfect channel knowledge at the receiver side. The presented BER curves are obtained by averaging over 8000 channels.
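A tap generation of this kind can be sketched as follows. The sketch assumes taps normalized to unit total average power and a tap spacing equal to the sampling time; a complex Gaussian draw yields exactly the Rayleigh magnitude and uniform phase named above. The function name `sample_channel` is our own.

```python
import numpy as np

def sample_channel(n_taps, tau_rms, t_s, rng):
    """Draw one tapped-delay-line channel realization: complex Gaussian taps
    (Rayleigh magnitude, uniform phase) with an exponentially decaying
    power delay profile of RMS delay spread tau_rms and tap spacing t_s."""
    k = np.arange(n_taps)
    power = np.exp(-k * t_s / tau_rms)         # exponential power profile
    power /= power.sum()                        # unit total average power
    h = np.sqrt(power / 2) * (rng.standard_normal(n_taps)
                              + 1j * rng.standard_normal(n_taps))
    return h
```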
For channel coded data transmission, a convolutional code with generator polynomials (133, 171)_8, constraint length 7, and rate R = 1/2 is used, whereby a Viterbi channel decoder is employed. As already mentioned, the data symbols are drawn from a QPSK modulation alphabet.
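For reference, a compact encoder for this code can be sketched as below. The paper does not state how the trellis is terminated, so the sketch assumes zero-flushing with K − 1 tail bits; the function name is our own.

```python
def conv_encode(bits, g1=0o133, g2=0o171, K=7):
    """Rate-1/2 convolutional encoder with generators (133,171)_8 and
    constraint length K = 7; the shift register is flushed with K-1 zeros."""
    state = 0
    out = []
    for b in list(bits) + [0] * (K - 1):        # payload plus tail bits
        state = ((state << 1) | b) & ((1 << K) - 1)
        out.append(bin(state & g1).count("1") % 2)   # parity of taps g1
        out.append(bin(state & g2).count("1") % 2)   # parity of taps g2
    return out
```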

B. Neural Network Training
The dataset for training the NNs is obtained by simulating sample data transmissions with known payload data over randomly generated multipath channels following the employed channel model described in Sec. V-A. Since data estimation is most challenging for transmissions over deep fading channels, we emphasize those cases by adding a set of sample data transmissions to the training set that solely contains transmissions over deep fading channels. The channels for this subset of the training set are found by creating 5000 times more channels than needed and picking the channels with the most severe fading holes. Including particularly bad channels in the training set turns out to be beneficial for the BER performance of the NN-based data estimators (a similar observation has also been mentioned in [34]). Empirical investigations show that a proportion of this subset of specifically generated bad channels of 10 % for system I and 50 % for system II is a good choice. Overall, the training set consists of 30000 channels for system I and 40000 channels for system II. The selection of the E_b/N_0 values for the sample data transmissions, which turns out to have a major impact on the performance of the NNs, differs for the simulated system setups, and thus is given with the results for the chosen system setup. Furthermore, we pre-trained the NNs with noiseless data transmissions, i.e., the sent data is only disturbed by a multipath channel, over 2000 different channels, which leads to a faster training convergence.
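The oversample-and-select step for bad channels can be sketched as follows. The paper does not specify its fading-severity criterion, so the sketch assumes channels are ranked by the minimum magnitude of their frequency response (deepest spectral fade); function name and FFT size are our own.

```python
import numpy as np

def pick_bad_channels(pool, n_keep, n_fft=64):
    """From a pool of impulse responses (one per row), keep the n_keep
    channels with the deepest spectral fades, i.e., the smallest minimum
    frequency-response magnitude."""
    H = np.fft.fft(pool, n=n_fft, axis=1)      # frequency responses
    depth = np.abs(H).min(axis=1)              # fade depth per channel
    idx = np.argsort(depth)[:n_keep]           # deepest fades first
    return pool[idx]
```

In practice the pool would be 5000 times larger than `n_keep`, matching the oversampling factor stated above.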
For the training, we employ an Adam optimizer [35] with default settings. The learning rate is decreased exponentially, such that the learning rate in the final optimization step is 5 % of the initial learning rate η. All NNs are trained with a batch size of 1024 and for 60 epochs. Further, early stopping is utilized as a regularization technique. The hyperparameters of the NN-based equalizers are found with an extensive grid search by evaluating the trained NNs on a validation set; the best settings found are summarized in Tab. I.
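The decay schedule can be written in closed form. Whether the decay is applied per step or per epoch is not stated, so the sketch below is parameterized over generic optimization steps; the function name is our own.

```python
def lr_schedule(eta, step, total_steps, final_frac=0.05):
    """Exponentially decayed learning rate that reaches final_frac * eta
    exactly at the last optimization step (step = total_steps - 1)."""
    gamma = final_frac ** (1.0 / (total_steps - 1))  # per-step decay factor
    return eta * gamma ** step
```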

C. Bit Error Ratio Performance -Uncoded Transmission
We start the performance comparison of the NN-based equalizers by highlighting the importance of data pre-processing. Without data normalization, the NNs exhibit even worse performance than the LMMSE estimator, which is exemplarily shown for DetNet in Fig. 4 (dotted line). Utilizing the normalized data leads to a major performance improvement (dashed line in Fig. 4). For DetNet, the BER performance can be further boosted by employing preconditioning, such that with this NN close to optimal MMSE performance can be achieved. The FCNN performs approximately equivalently to the DetNet without preconditioning, while the Attention Detector can outperform the FCNN, which confirms the idea of exploiting correlations via the self-attention mechanism to enhance the estimation performance. It turns out that the SNR utilized for the sample transmissions contained in the training set has a large influence on the performance of the NN-based data estimators. Training at too low SNRs leads to flattening out of the BER curves of the NN-based data estimators at higher SNRs. Training solely at higher SNRs, in turn, impairs the overall performance of the NNs, presumably because too few data samples are located around the optimal decision boundaries (these samples are very important for the NNs to learn good decision boundaries). Hence, the E_b/N_0 training range is another hyperparameter of the NN-based data estimators, whereby the E_b/N_0 values for the data burst transmissions contained in the training set are chosen randomly, with uniform distribution on a linear scale within the specified range. For system I, all NNs are trained in the E_b/N_0 range [9 dB, 18 dB].
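Note that "uniform on a linear scale within a dB range" means the dB limits are first converted to linear values, the draw is uniform there, and the result is converted back. A short sketch of this sampling (function name our own):

```python
import numpy as np

def sample_ebn0_db(low_db, high_db, size, rng):
    """Draw Eb/N0 values uniformly on a linear scale between the given
    dB limits; the samples are returned in dB."""
    lo, hi = 10 ** (low_db / 10), 10 ** (high_db / 10)  # dB -> linear
    lin = rng.uniform(lo, hi, size)                      # uniform in linear
    return 10 * np.log10(lin)                            # linear -> dB
```

This skews the samples toward the upper end of the dB range compared to drawing uniformly in dB.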
Regarding the model-based equalizers, we observe a large performance gap between the LMMSE estimator and the other estimators. With the DFE, a performance close to the optimal MMSE performance can be achieved, while the BER performance difference between the vector ML estimator and the MMSE estimator is negligible for the considered system. As illustrated in Fig. 5, for system II, DetNet can slightly outperform the DFE. Similar to system I, the Attention Detector exhibits a small performance gap compared to the DetNet, while it outperforms the FCNN. All NNs considered clearly outperform the LMMSE baseline performance. While for DetNet the E_b/N_0 training range is chosen to be [18 dB, 27.5 dB], the Attention Detector and the FCNN exhibit better performance for an E_b/N_0 training range of [15 dB, 27.5 dB].

D. Bit Error Ratio Performance -Coded Transmission
As already described in Sec. IV, the NNs are trained to provide estimates of the posterior probabilities of every data symbol for both coded and uncoded data transmission. That is, we expect a trained NN-based data estimator to be applicable for coded and uncoded transmission without requiring retraining. As detailed in Sec. V-C, for uncoded transmission it is beneficial to train the NNs for different SNRs, where the SNR training range limits can be viewed as hyperparameters; with this approach a good, or even close to optimal, BER performance can be achieved. However, when these trained NNs are employed for coded transmission, their performance is unsatisfactory. As shown in Fig. 8b for DetNet, the NN-based equalizer trained in an E_b/N_0 range of [1 dB, 9 dB] performs distinctly worse than the DFE and the LMMSE estimator, while the same NN outperforms both model-based equalizers for uncoded transmission (Fig. 8a). This result can be explained by investigating the empirical distribution of the LLRs provided by DetNet. Comparing the LLRs of DetNet trained in an E_b/N_0 range of [1 dB, 9 dB] (Fig. 6a) with the true LLRs at E_b/N_0 = 4 dB (Fig. 6c) reveals that a vast number of the LLRs provided by DetNet have a high absolute value, while this is not the case for the true LLRs, nor for the LLRs of the LMMSE estimator (Fig. 6b). That is, the NN is overconfident in many of its decisions, which harms the performance of the Viterbi channel decoder.
To tackle this problem, we investigated treating the data estimation problem as a classification task, i.e., we utilized Softmax as the output activation function of the NNs, combined with a cross-entropy loss for training. Then, so-called label smoothing can be applied, which is a common approach for combating overconfidence of classification NNs [36]. Unfortunately, this approach did not lead to significant performance improvements in our experiments. However, we observed that the training E_b/N_0 range has a large impact on the distribution of the LLRs provided by DetNet. More specifically, the overconfidence of an NN-based equalizer can be greatly reduced by training at low SNRs. This highlights the importance of the training SNR as a hyperparameter, which has to be chosen differently for coded and uncoded data transmission.
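For concreteness, a standard formulation of cross-entropy with label smoothing is sketched below; it mixes the one-hot target with a uniform distribution over the classes. This is the common textbook variant, not necessarily the exact one used in the experiments, and the function name is our own.

```python
import numpy as np

def smoothed_ce(logits, labels, eps=0.1):
    """Mean cross-entropy with label smoothing: each one-hot target is
    replaced by (1 - eps) on the true class plus eps spread uniformly
    over all classes."""
    n, c = logits.shape
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    target = np.full((n, c), eps / c)           # uniform smoothing mass
    target[np.arange(n), labels] += 1 - eps     # remaining mass on true class
    return -(target * logp).mean(axis=0).sum()
```

With eps = 0 this reduces to the ordinary cross-entropy; larger eps penalizes confident (large-magnitude) logits, which is exactly the overconfidence effect discussed above.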
Investigating solely the distribution of the LLRs, however, is only an indicator of their reliability. For an assessment of the LLR quality, we utilize an approach described in [37] for turbo equalizers. To this end, we apply the trained NNs on the validation set to obtain the estimated LLRs L_est,i for all bits b_i contained in the validation set. The estimated LLRs L_est,i are grouped according to their value into K bins with the value L_k, k ∈ {0, ..., K − 1} (L_k is the mean of the estimated LLRs in bin k). The signs of L_est,i are used for a hard decision estimate of the corresponding bits b_i. With these hard decision estimates at hand, the empirical bit error probability

P_emp,k = (# wrong hard decisions in bin k) / (# bits in bin k)

can be computed for all K bins. These empirical bit error probabilities, in turn, can be utilized to determine the empirical LLRs L_emp,k for all K bins via

L_emp,k = sign(L_k) ln((1 − P_emp,k) / P_emp,k).

Assuming a sufficiently large number of LLR values per bin, the empirical LLRs L_emp,k provide an approximation of the true LLRs. The quality of the estimated LLRs L_est,i can be ascertained by plotting L_emp,k against L_k. Since the estimated LLRs should match the empirical ones, the plotted graph is ideally a linear function with slope one. However, slopes not equal to one also allow optimal channel decoding performance of the Viterbi decoder, since all LLRs are under- or overrated in the same fashion. Nonlinear graphs, in turn, indicate a loss in BER performance, since some estimated LLRs are overrated while others are underrated at the same time. This may lead to wrong decisions of the Viterbi channel decoder when searching the optimum path in the trellis diagram of the convolutional code. As shown in Fig. 7 for the LLRs provided by DetNet when being trained at 1.5 dB, the number of LLRs with too high a value could be drastically lowered. Further, the empirical LLRs and the estimated LLRs are related nearly linearly for the majority of the estimated LLRs; in the regions where the relation is nonlinear, the counts per LLR bin are comparatively small. As the BER curves in Fig. 8b show, DetNet trained at 1.5 dB achieves close to optimal BER performance. However, for uncoded transmission, the DetNet trained at E_b/N_0 = 1.5 dB performs distinctly worse than the DetNet trained in the E_b/N_0 range of [1 dB, 9 dB], as depicted in Fig. 8a. For the Attention Detector and the FCNN, E_b/N_0 = 0.8 dB is utilized as the training SNR; all other hyperparameters are chosen as for uncoded data transmission. Both achieve close to optimal BER performance, too.
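The binning procedure described above can be sketched as follows. The sketch assumes quantile-based bins and the sign convention that a positive LLR maps to bit 0; the bin count, clipping floor, and function name are our own choices.

```python
import numpy as np

def empirical_llrs(l_est, bits, n_bins=20):
    """Bin estimated LLRs, compute the empirical bit error probability per
    bin, and map it to an empirical LLR L_emp = sign(L) * ln((1-P)/P)."""
    edges = np.quantile(l_est, np.linspace(0, 1, n_bins + 1))
    which = np.clip(np.digitize(l_est, edges[1:-1]), 0, n_bins - 1)
    L_bin, L_emp = [], []
    for k in range(n_bins):
        m = which == k
        if not m.any():
            continue
        l_mean = l_est[m].mean()                    # bin value L_k
        hard = (l_est[m] < 0).astype(int)           # sign-based hard decision
        p = max((hard != bits[m]).mean(), 1e-6)     # clip to avoid log(0)
        L_bin.append(l_mean)
        L_emp.append(np.sign(l_mean) * np.log((1 - p) / p))
    return np.array(L_bin), np.array(L_emp)
```

Plotting `L_emp` against `L_bin` yields the calibration curve discussed above: an (approximately) linear relation indicates well-calibrated LLRs.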
For system II, we compare the LMMSE estimator, the DFE, and the DetNet, which is trained at E_b/N_0 = 4 dB. As shown in Fig. 9, all three investigated equalizers exhibit approximately the same BER performance for coded data transmission. Although simulating the optimal BER performance is computationally infeasible, it can be stated that the achieved performance of the three equalizers is very close to the optimal performance. This statement can be verified by considering the LLRs provided by the LMMSE estimator. They are equivalent to the true LLRs when the conditional distribution p(d̂_i | d_i) is Gaussian (cf. Sec. III-D). Since this condition is well fulfilled for the dimensions of system II, the LLRs of the LMMSE estimator are close to the true LLRs, leading to close to optimal BER performance for coded data transmission.

E. Distributions of the Data Estimates
We also want to highlight the differences in the distributions of the estimates of the MMSE estimator, the NN-based estimators (exemplarily shown for DetNet), and the LMMSE estimator. To this end, we visualize the conditional distributions of their estimates, given a transmitted symbol (1 + j)/√2, for system I at E_b/N_0 = 4 dB in I/Q diagrams. The empirical distributions of the data symbol estimates are plotted as histograms along the I-axis and the Q-axis. As shown in Fig. 10a, the conditional LMMSE estimates follow, as expected, (approximately) a Gaussian distribution. The MMSE estimates, however, are distributed in a completely different manner. As indicated by the histograms in Fig. 10b, the vast majority of the estimates are located very close to the constellation point. Since the MMSE estimator yields the posterior expectation of a data symbol as an estimate, no estimate can lie outside the square connecting the four constellation points (marked by red crosses). The estimates of DetNet, plotted in Fig. 10c, exhibit a distribution similar to that of the MMSE estimates. This is in fact expected: since the NNs are trained with a quadratic loss function, they minimize the cost metric that the MMSE estimator minimizes, namely the Bayesian mean square error. Hence, the trained NNs approximate the MMSE estimator function.
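The confinement to the square can be illustrated with the simplest special case, a scalar QPSK symbol in complex AWGN, where the posterior mean has a closed form via tanh. This is only an illustration of the geometric argument above, not the paper's MMSE estimator for the full UW-OFDM model; the function name is our own.

```python
import numpy as np

def mmse_qpsk_scalar(y, sigma2, rho=1 / np.sqrt(2)):
    """Posterior-mean (MMSE) estimate of a QPSK symbol with components
    +/- rho, observed in complex AWGN with total noise variance sigma2
    (sigma2 / 2 per real dimension)."""
    s = 2 * rho / sigma2                        # = rho / (sigma2 / 2)
    return rho * np.tanh(s * y.real) + 1j * rho * np.tanh(s * y.imag)
```

Because |tanh(.)| < 1, both the real and imaginary parts of the estimate stay strictly inside [−ρ, ρ], i.e., inside the square spanned by the four constellation points.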

F. Complexity Analysis
In this section, we provide a brief analysis of the inference complexity of the presented NN-based data estimators, as well as of the LMMSE estimator and the DFE, in terms of the number of scalar, real-valued multiplications required for the equalization of one UW-OFDM data symbol. In this paper, we count one complex-valued multiplication as four real-valued multiplications. Neither data normalization nor the complexity required for training the NNs is regarded in this analysis.
For DetNet, we first determine the complexity of a single layer. Given H^T H, the number of multiplications carried out in a layer according to (20), including the projection by an FCNN with a single hidden layer, one-hot demapping, and the weighted residual connections, follows accordingly. Overall, DetNet has an inference complexity in real-valued multiplications in which the subtracted term accounts for the fact that no one-hot decoding is conducted in the last layer, while the three added terms account for the computations of H^T H and H^T y, and for the preconditioning.
Determining the number of multiplications required by the FCNN is straightforward; in the corresponding expression, the last two terms account for the computations of the upper triangular part (including the main diagonal) of H^T H and of H^T y.
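A generic helper for this kind of count can be sketched as follows. It counts only the weight-matrix products of a fully connected network between consecutive layers; the paper's exact expression additionally includes the H^T H and H^T y pre-processing terms mentioned above, and the function name is our own.

```python
def fcnn_mults(layer_sizes):
    """Number of scalar multiplications of one forward pass through a fully
    connected network, i.e., the sum of the products of consecutive layer
    widths (biases and activations contribute no multiplications)."""
    return sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
```

For example, a network with layer widths 4, 8, and 2 performs 4·8 + 8·2 = 48 multiplications per input vector.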
For the Attention Detector, we start by evaluating the complexity of a single encoder layer, which consists of a self-attention layer, a single hidden layer FCNN, layer normalization, and residual connections. The inputs of the self-attention layer are mapped to so-called queries q_i, keys k_i, and values v_i, i ∈ {0, ..., 2N_d − 1}, by multiplications with learned matrices [16]. Then, self-attention scores between each query and each key are computed, followed by a weighting of the values v_i by these scores. The number of multiplications conducted in one encoder layer thus follows; additionally, the multiplications given in (31) have to be performed. Therefore, the complexity of the Attention Detector follows accordingly, where the last three terms account for the computations of H^T H and H^T y, as well as for the preconditioning.
For the LMMSE estimator, we first regard the complexity of obtaining the estimator matrix E_LMMSE. Since the channel is assumed to be stationary for a whole data burst, the estimator matrix has to be computed only once per burst. Assuming that the inversion in (11) is computed by a Cholesky decomposition as in [38], the computation of E_LMMSE entails the corresponding complexity. Then, given E_LMMSE, the number of required multiplications for the equalization of every received UW-OFDM vector follows. We determine the DFE complexity in Alg. 1 by first considering those computations that have to be done once for every data burst, namely the computations of the estimator vectors e_k^H and the error covariance matrices C_ee,k for every iteration step. We note that H^H H needs to be computed only once, and the matrices H_k^H H_k can then be retrieved by deleting the appropriate rows and columns. The size of H_k decrements in every iteration, and thus we elaborate the complexity of computing A_k given H_k^H H_k ∈ C^{C×C}, with C ∈ {2, ..., N_d}. Furthermore, the scaling of A_k by N σ_n^2, as described in Alg. 1, line 6, can be omitted, since only the minimum value on the diagonal of C_ee,k is needed for finding the index j. In summary, M_DFE,burst = 4N multiplications have to be carried out once for every data burst to obtain the N_d estimator vectors e_k^H. For both the estimation of a single data symbol and the removal of the influence of this estimate on the received vector, (N_d + N_u) complex-valued multiplications have to be accounted for. Hence, given the estimator vectors, equalization of every received UW-OFDM vector with the DFE has the corresponding complexity. The particular complexity numbers of the considered equalizers are stated in Tab. II for both system I and system II. Obviously, the NN-based equalizers exhibit a distinctly higher complexity than the considered model-based ones. However, a comparison of the complexities of the DetNet and the DFE reveals that the complexity of the DFE grows significantly faster with the dimension of the UW-OFDM system model than that of the DetNet. Among the considered NNs, the DetNet is the least complex equalizer. That is, incorporating model knowledge directly into the layer structure of an NN seems to be most promising for obtaining well-performing and comparably low-complexity NN-based data estimators.

VI. CONCLUSION
In this paper, we investigated three NN-based approaches for data estimation in UW-OFDM systems, whereby model knowledge was utilized in different ways. Moreover, we described state-of-the-art model-based equalizers, and we discussed the equivalence of the MMSE estimator and the bit-wise MAP estimator for the considered system setup. We pointed out the importance of proper data normalization for NN-based data estimators and proposed a data normalization scheme specifically for UW-OFDM signaling. With preconditioning, we introduced adaptions for DetNet to boost its BER performance and decrease its computational complexity. Further, we showed a model-inspired data pre-processing approach, and we proposed an NN-based data estimator inspired by the Transformer network. We highlighted the difficulties when employing NNs for data estimation in channel coded data transmission, and we introduced a measure for obtaining reliable LLRs from NN-based equalizers. Finally, we provided BER performance results, we conducted a complexity analysis, and we visualized the distribution of the estimates of selected model-based and NN-based equalizers.

APPENDIX

where, in the last step, the QPSK bit-to-symbol mapping described in Sec. III-C is applied. In case of hard decision, d̂_i is sliced to the closest constellation symbol, i.e., Re{d̂_i} is sliced to ρ for Re{d̂_i} > 0, and to −ρ otherwise (accordingly for Im{d̂_i}). Hence, the real and the imaginary part of an MMSE hard decision estimate coincide with the corresponding hard decision estimates of the bit-wise MAP estimator.
The generation of a UW-OFDM symbol is described by x = BGd, where d ∈ S^{N_d} is the data vector, B ∈ {0, 1}^{N×(N_d+N_u)} models the optional insertion of zero subcarriers, and G ∈ C^{(N_d+N_u)×N_d} is the so-called generator matrix. The generator matrix G can be decomposed into G = A [I^T T^T]^T, with the N_d × N_d identity matrix I, and an appropriately chosen matrix T ∈ C^{N_u×N_d}, ensuring N_u trailing zeros in the UW-OFDM time domain symbol. The matrix

Fig. 3. Structure of one encoder layer of the Attention Detector.

Fig. 6. Empirical distribution of the LLRs provided by DetNet trained in an E_b/N_0 range of [1 dB, 9 dB] (a) and by the LMMSE estimator (b), compared with the empirical distribution of the true LLRs (c) at E_b/N_0 = 4 dB.

Fig. 7.

Fig. 8. Comparison of uncoded and coded BER performance for system I, non-systematic UW-OFDM. DetNet is once trained in an E_b/N_0 range of [1 dB, 9 dB], and once at E_b/N_0 = 1.5 dB. The FCNN and the Attention Detector are trained at E_b/N_0 = 0.8 dB.

Fig. 10. Distribution of the conditional data symbol estimates for system I at E_b/N_0 = 4 dB.

TABLE II. Number of required multiplications of the considered equalizers, rounded to hundreds.
For deriving the hard decision estimate of the MMSE estimator, we consider the MMSE estimate of the ith data symbol, i ∈ {0, ..., N_d − 1}:

d̂_i = Σ_{d' ∈ S^{N_d}} d'_i p[d' | y]
    = Σ_{d'_{i,Re} ∈ S_Re} Σ_{d'_{i,Im} ∈ S_Im} (d'_{i,Re} + j d'_{i,Im}) p[(d'_{i,Re} + j d'_{i,Im}) | y]
    = Σ_{d'_{i,Re} ∈ S_Re} Σ_{d'_{i,Im} ∈ S_Im} d'_{i,Re} p[(d'_{i,Re} + j d'_{i,Im}) | y] + j Σ_{d'_{i,Re} ∈ S_Re} Σ_{d'_{i,Im} ∈ S_Im} d'_{i,Im} p[(d'_{i,Re} + j d'_{i,Im}) | y]
    = Σ_{d'_{i,Re} ∈ S_Re} d'_{i,Re} p[d'_{i,Re} | y] + j Σ_{d'_{i,Im} ∈ S_Im} d'_{i,Im} p[j d'_{i,Im} | y],