Direct Data Detection of OFDM Signals Over Wireless Channels

This paper presents a novel, efficient receiver design for wireless communication systems that employ orthogonal frequency division multiplexing (OFDM) transmission. The proposed receiver does not require channel estimation or equalization to perform coherent data detection. Instead, channel estimation, equalization, and data detection are combined into a single operation, and hence, the proposed receiver is denoted as the direct data detector ($D^3$). The $D^3$ is applied to key practical wireless systems such as Long Term Evolution (LTE) and the New Radio (NR) of the fifth-generation (5G) system. The performance of the proposed $D^3$ is analyzed theoretically in terms of the bit error rate (BER), where accurate closed-form approximations are derived for several cases of interest and validated by Monte Carlo simulations. Moreover, an extensive complexity analysis is performed to evaluate the suitability of the system for implementation. The theoretical and simulation results demonstrate that the BER of the proposed $D^3$ is only 3 dB away from that of coherent detectors with perfect channel state information (CSI) in flat and frequency-selective fading channels over a wide range of signal-to-noise ratios (SNRs). If the CSI is not known perfectly, the $D^3$ substantially outperforms the coherent detector, particularly at high SNRs when the coherent detector relies on linear interpolation.
The computational complexity of the $D^3$ depends on the length of the sequence to be detected; nevertheless, a significant complexity reduction can be achieved using the Viterbi algorithm.


I. INTRODUCTION
ORTHOGONAL frequency division multiplexing (OFDM) is widely adopted in several wired and wireless communication standards, such as fourth-generation (4G) wireless networks [1], [2], Digital Video Broadcasting (DVB), Terrestrial (DVB-T) and Hand-held (DVB-H) [3], and optical wireless communications (OWC) [4], [5]; more recently, it has been adopted for the fifth-generation (5G) New Radio (NR) [6], [7]. OFDM has also been adopted for power-line communications (PLC), satellite communications, and the narrow-band Internet of Things (NB-IoT). The key to the popularity of OFDM is that each subcarrier experiences flat fading even though the overall signal spectrum suffers from frequency-selective fading. Moreover, appending the cyclic prefix (CP) prevents intersymbol interference (ISI), and hence, a low-complexity single-tap equalizer can be used to eliminate the impact of the multipath fading channel. Under such circumstances, OFDM demodulation can be performed once the fading parameters at each subcarrier, commonly denoted as the channel state information (CSI), are estimated. Therefore, OFDM is suitable for the frequency-selective channels experienced in 4G and 5G wireless systems, and for the flat fading channels commonly experienced in OWC under the effect of atmospheric turbulence [4], [5]. However, accurate CSI must be available at the receiver to recover the data symbols reliably. Although OFDM has several advantages, it does not possess any special immunity against channel fading, and hence, its bit error rate (BER) is generally similar to that of single-carrier transmission over flat fading channels [8]. Therefore, additional error mitigation techniques such as forward error control coding (FECC), space diversity, or precoding are typically used. However, the selection of a particular supporting technique depends on the targeted application.
For example, applications such as OWC, PLC, NB-IoT, and satellite communications are better suited to a single transmit antenna, and thus, space diversity in the form of multiple-input multiple-output (MIMO) can be replaced by time or frequency diversity, FECC, or space diversity in the form of single-input multiple-output (SIMO).
In the literature, reducing the complexity of the receiver has received extensive attention due to the limited size, energy, and computational capabilities of handheld and Internet of Things (IoT) devices. Among many receiver designs, amplitude-coherent detection (ACD) has been recognized as an efficient approach [9]-[12]. The key concept of ACD is to estimate only the amplitude of the channel frequency response and use it for equalization. Although this approach has been demonstrated to be robust in the presence of phase noise, phase estimation errors, and frequency offsets, it requires one-dimensional modulation, which may limit its spectral efficiency.

A. Preliminaries
Generally speaking, CSI estimation (CSIE) techniques can be classified into blind [8], [14]-[18] and pilot-aided techniques [19]-[25]. Blind CSIE techniques are spectrally efficient because they do not require any overhead to estimate the CSI. Nevertheless, such techniques have not yet been adopted in practical OFDM systems. Conversely, pilot-based CSIE is preferred in practical systems because it is typically more robust and less complex. In pilot-based CSIE, the pilot symbols are embedded within the subcarriers of the transmitted OFDM signal in the time and frequency domains; hence, the pilots form a two-dimensional (2-D) grid [2], [6], [7]. The channel response at the pilot symbols can be obtained using least-squares (LS) frequency-domain estimation, and the channel parameters at the other subcarriers can be obtained using various interpolation techniques [26]. Optimal interpolation requires a 2-D Wiener filter that exploits the time and frequency correlation of the channel; however, it is substantially complex to implement [27], [28]. The complexity can be reduced by decomposing the 2-D interpolation process into two cascaded 1-D processes and then using less computationally involved interpolation schemes [29], [30]. Low-complexity interpolation, however, is usually accompanied by error rate performance degradation [30]. It is also worth noting that most practical OFDM-based systems use fixed pilot grid patterns [2], [6], [7].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Once the CSI is obtained for all subcarriers, the received samples at the output of the fast Fourier transform (FFT) are equalized to compensate for the channel fading. Fortunately, equalization in OFDM is performed in the frequency domain using single-tap equalizers. The equalizer output samples, which are denoted as the decision variables, are then applied to a coherent maximum likelihood detector (MLD) to regenerate the information symbols. It is worth noting that, in addition to the channel frequency/time selectivity and additive white Gaussian noise (AWGN), impulsive noise is another source of distortion that can affect the channel estimation and data detection processes in particular applications [13].

B. Related Work
In addition to explicit CSIE, several techniques have been proposed in the literature to estimate the CSI or to detect the data symbols indirectly by exploiting the correlation among the channel coefficients. For example, the per-survivor processing (PSP) approach has been used to approximate the maximum likelihood sequence estimator (MLSE) for coded and uncoded sequences [31]-[33]. The PSP uses the Viterbi algorithm (VA) to recursively estimate the CSI without interpolation using the least mean squares (LMS) algorithm. Although the PSP provides superior performance when the channel is flat over the entire sequence, its performance degrades severely if this condition is not satisfied, even when the LMS step size is adaptive [32]. Multiple symbol differential detection (MSDD) can also be used for sequence estimation without explicit CSIE. In such systems, the information is embedded in the phase difference between adjacent symbols, and hence, differential encoding is needed. Although differential detection is only 3 dB worse than coherent detection in flat fading channels, its performance may deteriorate significantly in frequency-selective channels [34], [35]. Consequently, Wu and Kam [36] proposed a generalized likelihood ratio test (GLRT) receiver whose performance without CSI is comparable to that of the coherent detector. Although the GLRT receiver is more robust than differential detectors in frequency-selective channels, its performance in such channels is still significantly worse than that of coherent detectors.
Maximum likelihood sequence detection techniques have been widely studied in the literature. For example, a block differentially encoded OFDM scheme for transmission over frequency-selective fading channels is proposed in [37]. Depending on the channel and system parameters, the correlated OFDM subchannels are grouped into a set of independent subchannels, and hence, the transmission can be seen as a multiple-input single-output (MISO) transmission, with the advantage of additional diversity at the transmitter. However, grouping the subcarriers is infeasible over fast fading channels or when the OFDM symbol duration is short. A non-coherent maximum likelihood sequence detector for differential OFDM with multiple receivers is presented in [38]. The system uses the widely adopted suboptimal detector [39, Eq. 2] for multicarrier transmission scenarios. However, the system has a complex receiver structure due to the brute-force search. The work in [40] presents an MSDD scheme that can be incorporated with MIMO-OFDM for differential space-frequency modulation (DSFM). The system exploits the time and frequency correlation to perform the detection. Nevertheless, it exhibits high error floors for moderate time and frequency selectivity and suffers from increased computational complexity. A more general class of non-coherent sequence detection (NCD) algorithms that supports general constellations such as quadrature amplitude modulation (QAM) is presented in [41].
The estimator-correlator (EC) cross-correlates the received signal with an estimate of the channel output signal corresponding to each possible transmitted signal [42], [43]. The channel output signal is estimated using a minimum mean square error (MMSE) estimator from the received signal and the second-order statistics of the channel and noise. The EC may provide a BER that is about 1 dB from that of the ML coherent detector in flat fading channels, but at the expense of a large number of pilots. Moreover, the BER performance of EC detectors is generally poor in frequency-selective channels, where the EC BER is significantly worse than that of the ML coherent detector [43]. Decision-directed techniques can also be used to avoid conventional CSIE. For example, the authors in [8] proposed a hybrid frame structure that enables blind decision-directed CSIE. Although the proposed system offers reliable CSI estimates and BER in various channel conditions, its structure follows the typical coherent detector design, where equalization and symbol detection are required.

C. Motivation and Key Contributions
To avoid the separate channel estimation, equalization, and detection processes typically used in conventional OFDM detectors, this work presents a new detector, denoted as the direct data detector ($D^3$), that recovers the information symbols directly from the received samples at the FFT output. By using the $D^3$, there is no need to perform CSIE, interpolation, equalization, or symbol decision operations. The $D^3$ exploits the fact that the channel coefficients of adjacent subcarriers in the time and frequency domains are highly correlated and approximately equal, and hence, it is derived by minimizing the difference between the channel coefficients of adjacent subcarriers. The main limitation of the $D^3$ is that it suffers from a phase ambiguity problem, which can be solved using pilot symbols, which are part of the transmission frame in most practical standards, including 4G and 5G systems [1], [2], [6], [7]. To the best of the authors' knowledge, no work reported in the published literature uses the proposed principle. More specifically, the main contributions of this work are: 1) proposing a novel and efficient detector for OFDM systems, denoted as the $D^3$; 2) designing a low-complexity implementation using the Viterbi algorithm; 3) evaluating the BER in flat and frequency-selective fading channels, where accurate closed-form expressions are derived for several cases of interest; 4) evaluating the complexity and computational power; 5) applying the proposed system to 4G and 5G resource blocks; 6) applying the $D^3$ to coded systems; and 7) comparing the $D^3$ performance to other widely used detectors such as the maximum likelihood (ML) coherent detector [44] with perfect and imperfect CSI, MSDD [34], the ML sequence detector (MLSD) with no CSI [36], and the per-survivor processing detector [31].
The obtained results show that the $D^3$ outperforms the other considered detectors in various aspects, particularly in frequency-selective channels at moderate and high SNRs.

D. Paper Organization and Notations
The rest of this paper is organized as follows. The OFDM system and channel models are described in Section II. The proposed $D^3$ is presented in Section III, and its efficient implementation is explored in Section IV. The error probability analysis is presented in Section V, and the complexity analysis is presented in Section VI. Numerical results are discussed in Section VII, and finally, conclusions are drawn in Section VIII.
In what follows, unless otherwise specified, uppercase boldface and blackboard letters, such as $\mathbf{H}$ and $\mathbb{H}$, denote $N \times N$ matrices, whereas lowercase boldface letters, such as $\mathbf{x}$, denote row or column vectors with $N$ elements. Uppercase, lowercase, or bold letters with a tilde, such as $\tilde{d}$, denote trial values, and symbols with a hat, such as $\hat{x}$, denote the estimate of $x$. Letters with an acute accent, such as $\acute{v}$, denote the next value, i.e., $\acute{v} \triangleq v + 1$. Furthermore, $E[\cdot]$ denotes the expectation operation.

II. SYSTEM AND CHANNEL MODELS
A. Transmitted Signal
Consider an OFDM system with $N$ subcarriers modulated by a sequence of $N$ complex data symbols $\mathbf{d} = [d_0, d_1, \ldots, d_{N-1}]^T$. The data symbols are selected uniformly from a general constellation such as $M$-ary phase shift keying (MPSK) or quadrature amplitude modulation (QAM). In practical OFDM systems [6], [7], [45], $N_P$ of the subcarriers are allocated to pilot symbols, which can be used for CSIE and synchronization purposes. The modulation process in OFDM can be implemented efficiently using an $N$-point inverse FFT (IFFT), whose output during the $\ell$th OFDM block can be written as $\mathbf{x}(\ell) = \mathbf{F}^H \mathbf{d}(\ell)$, where $\mathbf{F}$ is the normalized $N \times N$ FFT matrix, and hence, $\mathbf{F}^H$ is the IFFT matrix. To simplify the notation, the block index $\ell$ is dropped in the remainder of the paper unless it is necessary. Then, a CP of length $N_{CP}$ samples, no less than the channel maximum delay spread $D_h$, is appended to compose the OFDM symbol with a total length of $N_t = N + N_{CP}$ samples and a duration of $T_t$ seconds.
To increase the spectral efficiency, the ratio $N_P/N$ should generally be small. Therefore, the transmitted symbols are typically arranged into a 2-D grid denoted as a resource block (RB), whose two dimensions correspond to time and frequency. For example, an LTE-A RB [8, Fig. 1] has 168 symbols, of which 8 are pilots, and thus, the spectral loss is about 4.8%. In 5G NR, the pilots are not scattered as in LTE-A; instead, two OFDM symbols, symbols 3 and 12, are entirely dedicated to pilot transmission. Therefore, there is a total of 24 pilots within 168 symbols, which results in a spectral loss of about 14.3%. Within an RB, some data symbols are enclosed by two pilots, i.e., double-sided (DS), while others are bounded by a pilot on one side only, i.e., single-sided (SS). Figs. 1 and 2 show the SS and DS segments, where $K$ is the total number of symbols in the segment. It is worth noting that the segmentation can be in the time or frequency dimension.
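The overhead percentages above follow from simple counting over one RB. The short sketch below, assuming a 12-subcarrier by 14-symbol RB and the pilot counts stated in the text (8 pilots for LTE-A and two full DMRS symbols for NR), computes both ratios; the function name is illustrative only.

```python
# Pilot overhead of one resource block (RB) of 12 subcarriers x 14 OFDM symbols.
# Pilot counts follow the text: 8 pilots for LTE-A, and 2 full DMRS symbols
# (2 x 12 = 24 pilots) for 5G NR.
SUBCARRIERS_PER_RB = 12
SYMBOLS_PER_RB = 14


def pilot_overhead(num_pilots: int) -> float:
    """Fraction of the RB resource elements occupied by pilots."""
    total = SUBCARRIERS_PER_RB * SYMBOLS_PER_RB  # 168 resource elements
    return num_pilots / total


lte_a = pilot_overhead(8)
nr = pilot_overhead(2 * SUBCARRIERS_PER_RB)
print(f"LTE-A: {100 * lte_a:.2f}%, NR: {100 * nr:.2f}%")  # LTE-A: 4.76%, NR: 14.29%
```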

B. Channel and Received Signal Models
At the receiver front-end, the received signal is downconverted to baseband and sampled with a sampling period $T_s = T_t/N_t$. In this work, the channel is assumed to be composed of $D_h + 1$ independent multipath components, each of which has a gain $h_m \sim \mathcal{CN}(0, 2\sigma_{h_m}^2)$ and a delay $m \times T_s$, where $m \in \{0, 1, \ldots, D_h\}$. A quasi-static channel is assumed throughout this work; thus, the channel taps are considered constant over one OFDM symbol, but they may change between two consecutive symbols. Therefore, the received sequence after dropping the CP samples and applying the FFT can be expressed as [12]
$$\mathbf{r} = \mathbf{H}\mathbf{d} + \mathbf{w},$$
where $\mathbf{w} = [w_0, w_1, \ldots, w_{N-1}]^T$ is the AWGN vector and $\mathbf{H} = \mathrm{diag}\{H_0, H_1, \ldots, H_{N-1}\}$ denotes the channel frequency response (CFR) [12]. By noting that $\mathbf{r} \mid \mathbf{H}, \mathbf{d} \sim \mathcal{CN}(\mathbf{H}\mathbf{d}, 2\sigma_w^2 \mathbf{I}_N)$, where $\mathbf{I}_N$ is the $N \times N$ identity matrix, it is straightforward to show that the coherent MLD can be expressed as [8]
$$\hat{\mathbf{d}} = \arg\min_{\tilde{\mathbf{d}}} \left\| \mathbf{r} - \mathbf{H}\tilde{\mathbf{d}} \right\|^2, \tag{3}$$
where $\|\cdot\|$ denotes the Euclidean norm and $\tilde{\mathbf{d}} = [\tilde{d}_0, \tilde{d}_1, \ldots, \tilde{d}_{N-1}]^T$ denotes the trial values of $\mathbf{d}$. As can be noted from (3), the coherent MLD requires knowledge of $\mathbf{H}$. Moreover, because (3) describes the detection of more than one symbol, it is typically denoted as the maximum likelihood sequence detector (MLSD). If the elements of $\mathbf{d}$ are independent, the MLSD can be replaced by a symbol-by-symbol coherent MLD [12]
$$\hat{d}_v = \arg\min_{\tilde{d}_v} \left| r_v - H_v \tilde{d}_v \right|^2, \quad v = 0, 1, \ldots, N-1. \tag{4}$$
Since perfect knowledge of $\mathbf{H}$ is infeasible, an estimate of $\mathbf{H}$, denoted as $\hat{\mathbf{H}}$, can be used in (3) and (4) instead of $\mathbf{H}$. Another possible approach to implement the detector is to equalize $\mathbf{r}$ and then use a symbol-by-symbol coherent MLD. Because the considered system is assumed to have no ISI or intercarrier interference (ICI), a single-tap frequency-domain zero-forcing equalizer can be used. Therefore, the equalized received sequence can be expressed as [8]
$$\hat{\mathbf{H}}^{-1}\mathbf{r} = \left[\frac{r_0}{\hat{H}_0}, \frac{r_1}{\hat{H}_1}, \ldots, \frac{r_{N-1}}{\hat{H}_{N-1}}\right]^T, \tag{5}$$
and each symbol is detected as [8]
$$\hat{d}_v = \arg\min_{\tilde{d}_v} \left| \frac{r_v}{\hat{H}_v} - \tilde{d}_v \right|^2. \tag{6}$$
In LTE, the channel estimates $\hat{H}_v$ at the pilot locations can be obtained using the LS algorithm, and then linear, spline, or other interpolation techniques can be used to obtain the channel estimates at the information symbols' locations.
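For reference, the conventional equalize-then-detect receiver described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes LS estimates at the pilots, 1-D linear interpolation across subcarriers (holding the nearest pilot estimate outside the pilot span), single-tap ZF equalization, and minimum-distance symbol decisions; the function names are hypothetical.

```python
from bisect import bisect_right


def interpolate_channel(n, pilot_pos, h_pilot):
    """Linearly interpolate LS pilot estimates to all n subcarriers.

    pilot_pos must be sorted; subcarriers outside the pilot span reuse the
    nearest pilot estimate (a simple, assumed extrapolation rule).
    """
    h = []
    for v in range(n):
        if v <= pilot_pos[0]:
            h.append(h_pilot[0])
        elif v >= pilot_pos[-1]:
            h.append(h_pilot[-1])
        else:
            i = bisect_right(pilot_pos, v) - 1      # last pilot at or before v
            p0, p1 = pilot_pos[i], pilot_pos[i + 1]
            w = (v - p0) / (p1 - p0)
            h.append((1 - w) * h_pilot[i] + w * h_pilot[i + 1])
    return h


def coherent_detect(r, pilot_pos, pilot_val, constellation):
    """LS estimation at pilots, linear interpolation, single-tap ZF, minimum-distance decisions."""
    h_pilot = [r[p] / x for p, x in zip(pilot_pos, pilot_val)]  # LS estimates r_p / d_p
    h_hat = interpolate_channel(len(r), pilot_pos, h_pilot)
    # ZF output r_v / H_v is mapped to the closest constellation point.
    return [min(constellation, key=lambda s: abs(rv / hv - s))
            for rv, hv in zip(r, h_hat)]


H = [1 + 0.1j * v for v in range(8)]      # hypothetical channel, linear in v
d = [1, -1, -1, 1, -1, 1, 1, 1]           # pilots at subcarriers 0 and 7 (value 1)
r = [h * x for h, x in zip(H, d)]         # noiseless FFT outputs
print(coherent_detect(r, [0, 7], [1, 1], (-1, 1)) == d)  # True
```

With noiseless samples and a channel that varies linearly across subcarriers, linear interpolation is exact, so the detected symbols match the transmitted ones; in practice, the interpolation error grows with the frequency selectivity of the channel.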
It is interesting to note that solving (3) does not necessarily require explicit knowledge of $\mathbf{H}$ under some special circumstances. For example, Wu and Kam [36] noticed that in flat fading channels, i.e., $H_v = H\ \forall v$, it is possible to detect the data symbols using the following MLSD [36]
$$\hat{\mathbf{d}} = \arg\max_{\tilde{\mathbf{d}}} \frac{\left|\tilde{\mathbf{d}}^H \mathbf{r}\right|^2}{\|\tilde{\mathbf{d}}\|^2}. \tag{7}$$
Although the detector described in (7) is efficient in the sense that it does not require knowledge of $\mathbf{H}$, its BER is very sensitive to channel variations. It is also worth noting that the metric in [39, Eq. 23] with optimized parameters can be used to combat the frequency selectivity of the channel.

III. PROPOSED $D^3$ SYSTEM MODEL
One of the distinctive features of OFDM is that the channel coefficients of adjacent subcarriers in the frequency domain are highly correlated and approximately equal. Given that the channel multipath components vector is defined as $\mathbf{h} = [h_0, h_1, \ldots, h_{D_h}]^T$, where $D_h$ is the maximum delay spread of the channel, and that $H_v = \sum_{n=0}^{D_h} h_n e^{-j2\pi n v/N}$, the correlation coefficient between two adjacent subcarriers can be defined as [8]
$$\rho_f \triangleq \frac{E[H_v H_{\acute{v}}^*]}{E[|H_v|^2]} = \sum_{n=0}^{D_h} \sigma_{h_n}^2 e^{j2\pi n/N}, \tag{8}$$
where $E[h_m] = 0\ \forall m$, and, given that $h_n$ and $h_m$ are mutually independent for $n \neq m$, which is typically the case in several applications [14], [22], [27], [36], $E[h_n h_m^*] = 0$. The multipath channel gains are normalized such that $\sum_{n=0}^{D_h} \sigma_{h_n}^2 = 1$. The mean-square difference between two adjacent channel coefficients is
$$\Delta_f \triangleq E\left[|H_v - H_{\acute{v}}|^2\right] = 4\left(1 - \Re\{\rho_f\}\right). \tag{9}$$
For large values of $N$, it is straightforward to show that $\rho_f \to 1$ and $\Delta_f \to 0$. Similar to the frequency domain, the time-domain correlation according to Clarke's model can be computed as [48]
$$\rho_t \triangleq \frac{E[H_v(\ell) H_v^*(\ell+1)]}{E[|H_v(\ell)|^2]} = J_0(2\pi f_d T_t), \tag{10}$$
where $\ell$ is the OFDM symbol index, $J_0(\cdot)$ is the Bessel function of the first kind and zeroth order, and $f_d$ is the maximum Doppler frequency. For practical values of $f_d$ and $T_t$, $2\pi f_d T_t \ll 1$, and hence, $J_0(2\pi f_d T_t) \approx 1$, and thus, $\rho_t \approx 1$. Using the same argument, the corresponding time-domain difference satisfies $\Delta_t \approx 0$. Although the proposed system can be applied in the time domain, the frequency domain, or both, the focus of this work is the frequency domain.
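The sketch below evaluates the frequency correlation of (8) and Clarke's time correlation numerically. It assumes $H_v = \sum_n h_n e^{-j2\pi n v/N}$, independent zero-mean taps, and a power delay profile normalized to unit sum; under these assumptions, the mean-square difference between adjacent coefficients is $E[|H_v - H_{\acute{v}}|^2] = 4(1 - \Re\{\rho_f\})$. The exponential profile, the Doppler frequency, and the LTE-like symbol duration are illustrative choices, and $J_0$ is computed from its power series to avoid external dependencies.

```python
import cmath
import math


def freq_correlation(pdp, n_subcarriers):
    """rho_f = sum_n sigma_{h_n}^2 exp(j*2*pi*n/N), with the PDP normalized to unit sum.

    Assumes H_v = sum_n h_n exp(-j*2*pi*n*v/N) and independent zero-mean taps.
    """
    total = sum(pdp)
    return sum(p / total * cmath.exp(2j * math.pi * n / n_subcarriers)
               for n, p in enumerate(pdp))


def bessel_j0(x, terms=30):
    """J_0(x) from its power series sum_k (-1)^k (x/2)^(2k) / (k!)^2 (adequate for small x)."""
    return sum((-1) ** k * (x / 2) ** (2 * k) / math.factorial(k) ** 2
               for k in range(terms))


# Hypothetical exponentially decaying PDP with D_h = 15.
pdp = [math.exp(-n / 4) for n in range(16)]
for n_sc in (64, 1024):
    rho = freq_correlation(pdp, n_sc)
    # |rho_f| and the mean-square difference 4 (1 - Re{rho_f})
    print(n_sc, round(abs(rho), 4), round(4 * (1 - rho.real), 4))

# Time correlation for f_d = 100 Hz and T_t = 1 ms / 14 (assumed LTE-like values).
print(round(bessel_j0(2 * math.pi * 100 * 1e-3 / 14), 5))
```

Increasing $N$ shrinks the phase step $2\pi n/N$ in (8), which is why $|\rho_f|$ approaches one and the mean-square difference approaches zero.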
Based on the aforementioned properties of OFDM, a simple approach to extract the information symbols from the received sequence $\mathbf{r}$ can be designed by minimizing the difference between the channel coefficients of adjacent subcarriers, which can be expressed as
$$\hat{\mathbf{d}} = \arg\min_{\tilde{\mathbf{d}}} \sum_{v=0}^{N-2} \left| \frac{r_v}{\tilde{d}_v} - \frac{r_{\acute{v}}}{\tilde{d}_{\acute{v}}} \right|^2. \tag{11}$$
As can be noted from (11), the estimated data sequence $\hat{\mathbf{d}}$ can be obtained without knowledge of $\mathbf{H}$. Moreover, there is no requirement for the channel coefficients over the considered sequence to be equal, and hence, the $D^3$ should perform fairly well even in frequency-selective fading channels. Nevertheless, (11) does not have a unique solution because, for example, $\tilde{\mathbf{d}}$ and $-\tilde{\mathbf{d}}$ yield the same value of the objective. To resolve this phase ambiguity, one or more pilot symbols can be included in the sequence $\mathbf{d}$. In such scenarios, the performance of the $D^3$ is affected indirectly by the frequency selectivity of the channel, because the capability of the pilot to resolve the phase ambiguity depends on its fading coefficient. Another advantage of using pilot symbols is that it is not necessary to detect the $N$ symbols simultaneously. Instead, it is sufficient to detect $K$ symbols at a time, which can be exploited to simplify the system design and analysis.
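To make (11) concrete, the following sketch evaluates the $D^3$ objective and performs an exhaustive search over one segment with a single known pilot. It is a minimal illustration under stated assumptions (noiseless samples and a slowly varying, hypothetical channel in the usage example); its cost grows as $M^{K-1}$, which motivates the Viterbi implementation discussed in Section IV.

```python
import math
from itertools import product


def d3_metric(r, d):
    """Objective of (11): sum_v |r_v/d_v - r_{v+1}/d_{v+1}|^2."""
    g = [rv / dv for rv, dv in zip(r, d)]  # per-subcarrier channel hypotheses r_v / d_v
    return sum(abs(g[v] - g[v + 1]) ** 2 for v in range(len(g) - 1))


def d3_brute_force(r, constellation, pilot_pos, pilot_val):
    """Exhaustive D^3 search over one segment with a single known pilot."""
    data_pos = [v for v in range(len(r)) if v != pilot_pos]
    best, best_metric = None, math.inf
    for trial in product(constellation, repeat=len(data_pos)):  # M^(K-1) trials
        d = [pilot_val] * len(r)           # pilot position keeps pilot_val
        for v, s in zip(data_pos, trial):
            d[v] = s
        metric = d3_metric(r, d)
        if metric < best_metric:
            best, best_metric = d, metric
    return best


H = [0.9 + 0.05j * v for v in range(5)]   # hypothetical slowly varying channel
d = [1, -1, 1, 1, -1]                      # d[0] is the pilot
r = [h * x for h, x in zip(H, d)]          # noiseless FFT outputs
print(d3_brute_force(r, (-1, 1), 0, 1))    # [1, -1, 1, 1, -1]
print(d3_metric(r, d) == d3_metric(r, [-x for x in d]))  # True: the phase ambiguity
```

Fixing the pilot value is what breaks the tie between $\tilde{\mathbf{d}}$ and $-\tilde{\mathbf{d}}$; without it, both sequences attain the same metric.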
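Using the same approach as in the frequency domain, the $D^3$ can be designed to work in the time domain by minimizing the difference between the channel coefficients of two consecutive subcarriers in time, i.e., two subcarriers with the same index in two consecutive OFDM symbols, which is also applicable to single-carrier systems. It can also be designed to work in both the time and frequency domains, where the detector can be described as
$$\hat{\mathbf{D}}_{L,K} = \arg\min_{\tilde{\mathbf{D}}_{L,K}} J(\tilde{\mathbf{D}}_{L,K}), \tag{12}$$
where $\mathbf{D}_{L,K}$ is an $L \times K$ data matrix, $L$ and $K$ are the time and frequency detection window sizes, and the objective function $J(\mathbf{D})$ is given by
$$J(\tilde{\mathbf{D}}) = \sum_{\ell=0}^{L-1}\sum_{k=0}^{K-2}\left|\frac{r_{\ell,k}}{\tilde{d}_{\ell,k}}-\frac{r_{\ell,k+1}}{\tilde{d}_{\ell,k+1}}\right|^2+\sum_{\ell=0}^{L-2}\sum_{k=0}^{K-1}\left|\frac{r_{\ell,k}}{\tilde{d}_{\ell,k}}-\frac{r_{\ell+1,k}}{\tilde{d}_{\ell+1,k}}\right|^2, \tag{13}$$
where $r_{\ell,k}$ and $\tilde{d}_{\ell,k}$ are the received sample and the trial symbol at subcarrier $k$ of OFDM symbol $\ell$. For example, if the detection window is chosen to be the LTE resource block, then $L = 14$ and $K = 12$. Moreover, the system presented in (13) can be extended to multi-branch (SIMO) receivers by summing the objective in (13) over the receiving branches, i.e., $\sum_{i=1}^{N} J_i(\tilde{\mathbf{D}})$, where $J_i$ is evaluated using the samples of the $i$th branch and $N$ is the number of receiving antennas.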
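A sketch of a 2-D objective of the kind described above is given below. It assumes that $J(\mathbf{D})$ sums the squared differences of $r/\tilde{d}$ between frequency-adjacent and time-adjacent resource elements, and that the SIMO extension simply adds the per-antenna objectives; both are our reading of the text rather than a verbatim transcription, and the function names are hypothetical.

```python
def d3_objective_2d(r, d):
    """2-D D^3 objective: squared differences of r/d between frequency-adjacent
    (same row) and time-adjacent (same column) resource elements.

    r and d are L x K nested lists indexed [OFDM symbol][subcarrier].
    """
    g = [[rv / dv for rv, dv in zip(r_row, d_row)] for r_row, d_row in zip(r, d)]
    L, K = len(g), len(g[0])
    freq = sum(abs(g[l][k] - g[l][k + 1]) ** 2 for l in range(L) for k in range(K - 1))
    time = sum(abs(g[l][k] - g[l + 1][k]) ** 2 for l in range(L - 1) for k in range(K))
    return freq + time


def d3_objective_simo(branches, d):
    """Assumed multi-branch extension: sum of the per-antenna objectives."""
    return sum(d3_objective_2d(r, d) for r in branches)


H = 0.7 - 0.2j                        # hypothetical flat, static channel
d = [[1, -1, 1], [-1, -1, 1]]         # L = 2 OFDM symbols, K = 3 subcarriers
r = [[H * x for x in row] for row in d]
d_bad = [[1, 1, 1], [-1, -1, 1]]      # one wrong symbol
print(d3_objective_2d(r, d), d3_objective_2d(r, d_bad))
```

A single wrong symbol is penalized by every frequency and time neighbor it has, which is how the 2-D objective pools evidence from both dimensions.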

IV. LOW-COMPLEXITY IMPLEMENTATION OF $D^3$
It can be noted from (12) and (13) that solving for $\hat{\mathbf{D}}$, given that $N_P$ pilot symbols are used, requires $M^{KL-N_P}$ trials if brute-force search is adopted, which is prohibitively complex; thus, reducing the computational complexity is crucial. Towards this goal, the RB can be divided into a number of one-dimensional (1-D) segments in the time and frequency domains in order to reduce the complexity order. In other words, the time complexity grows exponentially with the detection size in the 2-D block, whereas it grows linearly in the cascaded 1-D segments, which is a significant complexity reduction. The decomposition of the 2-D LTE-A RB into several 1-D segments over time and frequency is shown in Fig. 3.
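The gap between the two search strategies can be quantified with simple counting. The sketch below assumes $M^{KL-N_P}$ candidates for the exhaustive 2-D search, as stated above, and, for the VA, $M^2$ branch metrics per trellis stage with $K-1$ stages per $K$-symbol segment (a simplified counting assumption that ignores pilot-terminated stages).

```python
def brute_force_trials(M, K, L, n_pilots):
    """Number of candidate L x K data matrices for an exhaustive 2-D search: M^(KL - N_P)."""
    return M ** (K * L - n_pilots)


def viterbi_branch_metrics(M, K):
    """Branch metrics for one 1-D segment of K symbols: M^2 per stage, K - 1 stages (assumed)."""
    return M * M * (K - 1)


# LTE-A RB with BPSK: 14 x 12 grid with 8 pilots, per the text.
print(brute_force_trials(2, 12, 14, 8))   # equals 2**160
print(viterbi_branch_metrics(2, 12))      # 44 branch metrics for a 12-symbol segment
```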

A. The Viterbi Algorithm (VA)
By noting that the expression in (11) is a sum of terms, each of which involves only two adjacent symbols, and can therefore be modeled as a first-order Markov process, MLSD techniques such as the VA can be used to implement the $D^3$ efficiently. For example, the trellis diagram of the VA with binary phase shift keying (BPSK) is shown in Fig. 4, and the VA can be implemented as follows: 1) Initialize the path metrics of the upper (U) and lower (L) branches. Since BPSK is used, the number of states is 2.
2) Initialize the counter, $c = 0$. 3) Compute the branch metrics $J^{c}_{m,n} = |r_c/\tilde{d}_m - r_{\acute{c}}/\tilde{d}_n|^2$, where $m$ is the current state index, with $m = 0 \rightarrow \tilde{d} = -1$ and $m = 1 \rightarrow \tilde{d} = 1$, and $n$ is the next state index using the same mapping as $m$. 4) Compute the path metrics by adding each branch metric to the path metric of its originating state and retaining, for each next state, the path with the smallest accumulated metric. 5) Track the surviving paths, i.e., 2 paths in the case of BPSK. 6) Increase the counter, $c = c + 1$. 7) If $c = K - 1$, the algorithm ends; otherwise, go to step 3. It is worth mentioning that placing a pilot symbol at the edge of a segment terminates the trellis. For example, if the pilot value is $-1$, only $J^{c}_{0,0}$ and $J^{c}_{1,0}$ need to be computed at that stage. Consequently, long data sequences can be divided into smaller segments bounded by pilots, which can reduce the delay by performing the detection over the sub-segments in parallel without sacrificing the error rate performance.
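The seven steps above can be sketched as follows. This is a minimal illustration, assuming BPSK, a known pilot in the first position (value $+1$ by default, rather than the $-1$ used in the example above), and an optional pilot at the end of the segment that terminates the trellis; the function name and the sample values are hypothetical. Because (11) is minimized, each state keeps the predecessor with the smallest accumulated metric.

```python
import math


def viterbi_d3_bpsk(r, first_pilot=1, last_pilot=None):
    """Viterbi search minimizing the D^3 metric (11) over one BPSK segment r[0..K-1].

    r[0] carries a known pilot; if last_pilot is given, r[K-1] is also a pilot,
    which terminates the trellis.
    """
    states = (-1, 1)                     # state = hypothesized symbol on the current subcarrier
    K = len(r)
    # Step 1: the pilot fixes the start state; the other state is unreachable.
    metric = {s: (0.0 if s == first_pilot else math.inf) for s in states}
    paths = {s: [s] for s in states}
    for c in range(K - 1):               # steps 2-7: one trellis stage per pair (c, c+1)
        is_last = c + 1 == K - 1
        next_states = (last_pilot,) if (is_last and last_pilot is not None) else states
        new_metric, new_paths = {}, {}
        for n in next_states:
            # Branch metric J^c_{m,n} = |r_c/m - r_{c+1}/n|^2; keep the best predecessor.
            best_metric, best_m = min(
                (metric[m] + abs(r[c] / m - r[c + 1] / n) ** 2, m) for m in states
            )
            new_metric[n] = best_metric
            new_paths[n] = paths[best_m] + [n]
        metric, paths = new_metric, new_paths
    return paths[min(metric, key=metric.get)]


r = [1 + 0.2j, -0.9 + 0.1j, 0.3 - 1j, -0.2 + 0.8j]   # hypothetical noisy FFT outputs
print(viterbi_d3_bpsk(r))                 # [1, -1, 1, -1]
print(viterbi_d3_bpsk(r, last_pilot=1))   # [1, -1, -1, 1]
```

The second call shows the effect of trellis termination: forcing the last symbol to the pilot value changes the surviving path even though the received samples are identical.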

B. LTE-A RB Detection
As can be noted from Fig. 3, the segmentation process can be applied directly to any row or column that has one or more pilots. Nevertheless, some rows and columns do not have pilots. In such scenarios, the detection can be performed, for example, in two steps as follows: 1) Detect all rows (frequency-domain subcarriers) with pilots, i.e., rows 1, 5, 8, and 12. 2) As a result of the first step, each column (time-domain subcarriers) contains pilots, data symbols whose values are known from the first step, or both, as in the case of columns 1, 4, 7, and 10. Therefore, all remaining subcarriers can be detected using the symbols detected in the first step as references.
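The two-step procedure can be checked with a small scheduling sketch. The pilot map below is hypothetical, chosen only to match the description in the text (8 pilots located in rows 1, 5, 8, and 12 and in columns 1, 4, 7, and 10); the sketch verifies that, after the pilot-bearing rows are detected, every column contains at least one known symbol.

```python
ROWS, COLS = 14, 12  # OFDM symbols x subcarriers of one LTE-A RB

# Hypothetical pilot map consistent with the text: 8 pilots in rows 1, 5, 8, 12 and
# columns 1, 4, 7, 10 (1-indexed), stored here 0-indexed as (row, column).
PILOTS = {(0, 0), (0, 6), (4, 3), (4, 9), (7, 0), (7, 6), (11, 3), (11, 9)}


def two_step_schedule(pilots, rows=ROWS, cols=COLS):
    """Return the rows detected in step 1 and the columns detectable in step 2."""
    step1 = sorted({row for row, _ in pilots})          # rows that contain a pilot
    known = set(pilots) | {(row, c) for row in step1 for c in range(cols)}
    # A column can be detected in step 2 if it contains at least one known symbol.
    step2 = [c for c in range(cols) if any((row, c) in known for row in range(rows))]
    return step1, step2


step1, step2 = two_step_schedule(PILOTS)
print([row + 1 for row in step1])  # [1, 5, 8, 12] in the 1-indexed numbering of the text
print(len(step2) == COLS)          # True: every column has a reference after step 1
```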

C. 5G NR Block Detection
For the NR case, the demodulation reference signals (DMRS) of each RB are formed from contiguous subcarriers that span the entire OFDM symbol, and at least two OFDM symbols are reserved for DMRS. Therefore, each row in the NR block, i.e., a sequence of consecutive time-domain subcarriers, can be considered as a segment and detected separately using the $D^3$. Consequently, there is no need for segmentation in the frequency domain.

D. System Design With Error Control Coding
Forward error correction (FEC) coding can be integrated with the $D^3$ in two ways, depending on whether hard- or soft-decision decoding is used. For hard-decision decoding (HDD), the integration is straightforward: the $D^3$ output is directly applied to the hard-decision decoder. For soft-decision decoding (SDD), the coded data can be exploited to enhance the performance of the $D^3$, and the $D^3$ output can then be used to estimate the channel coefficients in a decision-directed manner. The $D^3$ with coded data can be expressed as
$$\hat{\mathbf{u}} = \arg\min_{\tilde{\mathbf{u}} \in \mathcal{U}} \sum_{v=0}^{N-2} \left| \frac{r_v}{\tilde{u}_v} - \frac{r_{\acute{v}}}{\tilde{u}_{\acute{v}}} \right|^2,$$
where $\mathcal{U}$ is the set of all codewords modulated using the same modulation as at the transmitter. Therefore, the trial sequences $\tilde{\mathbf{u}}$ are restricted to valid modulated codewords. For convolutional codes, the detection and decoding processes can be integrated smoothly because both use the VA. Such an approach can be adopted with linear block codes as well, because trellis-based decoding can also be applied to block codes [49]. The $D^3$ can also be smoothly integrated with turbo product codes (TPCs) with SDD. In the SDD process for TPCs [50], the first step is to make hard decisions to recover the data symbols and then compute the reliability of each bit within the symbol. For BPSK, the reliability factors can be evaluated directly from the received signal without the need for CSI. For higher-order modulations, the LS algorithm can be applied to obtain the CSI using the hard decisions obtained in the first step.
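Restricting the trial sequences to modulated codewords can be illustrated with a short block code. The sketch below assumes a systematic (7,4) Hamming code, the BPSK mapping $0 \rightarrow +1$, $1 \rightarrow -1$, and a single pilot preceding the codeword; it searches the 16 codewords exhaustively rather than using a trellis, so it illustrates the principle and not an efficient decoder. The generator matrix and channel values are illustrative.

```python
import math
from itertools import product

# Hypothetical example code: systematic (7,4) Hamming generator matrix G = [I_4 | P].
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0, 1],
]


def encode(msg):
    """Codeword bits c_j = sum_i m_i G[i][j] mod 2."""
    return [sum(m * g for m, g in zip(msg, col)) % 2 for col in zip(*G)]


def d3_metric(r, d):
    """D^3 objective: sum_v |r_v/d_v - r_{v+1}/d_{v+1}|^2."""
    g = [rv / dv for rv, dv in zip(r, d)]
    return sum(abs(g[v] - g[v + 1]) ** 2 for v in range(len(g) - 1))


def coded_d3(r, pilot=1):
    """D^3 restricted to modulated codewords: r[0] is a pilot, r[1:8] carry one BPSK codeword."""
    best_msg, best_metric = None, math.inf
    for msg in product((0, 1), repeat=4):                 # 16 codewords instead of 2^7 sequences
        d = [pilot] + [1 - 2 * b for b in encode(msg)]    # BPSK: bit 0 -> +1, bit 1 -> -1
        metric = d3_metric(r, d)
        if metric < best_metric:
            best_msg, best_metric = list(msg), metric
    return best_msg


msg = [1, 0, 1, 1]
H = [0.8 + 0.05j * v for v in range(8)]      # hypothetical slowly varying channel
tx = [1] + [1 - 2 * b for b in encode(msg)]
r = [h * x for h, x in zip(H, tx)]           # noiseless FFT outputs
print(coded_d3(r))                           # [1, 0, 1, 1]
```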

V. ERROR RATE ANALYSIS OF THE $D^3$
Although sequence detection has been widely considered in the literature, to the best of the authors' knowledge, exact BER analysis in frequency-selective channels remains an open problem [36], [41], [46], [47]. Therefore, the BER analysis in this work is presented in terms of accurate approximations for several cases of interest, each of which is discussed in a separate subsection. To make the analysis tractable, we consider BPSK modulation, while the BER of higher-order modulations is obtained via Monte Carlo simulation.

A. Single-Sided Pilot
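To detect a data segment that contains $K$ symbols, at least one pilot symbol should be part of the segment in order to resolve the phase ambiguity problem. Consequently, the analysis in this subsection considers the case where there is only one pilot within the $K$ symbols, as shown in Fig. 1. Given that the FFT output vector $\mathbf{r} = [r_0, r_1, \ldots, r_{N-1}]$ is divided into $L$ segments, each of which consists of $K$ symbols including the pilot symbol, the frequency-domain $D^3$ can be written as
$$\hat{\mathbf{d}}_l = \arg\min_{\tilde{\mathbf{d}}_l} \sum_{v=l}^{l+K-2} \left| \frac{r_v}{\tilde{d}_v} - \frac{r_{\acute{v}}}{\tilde{d}_{\acute{v}}} \right|^2, \tag{18}$$
where $l$ denotes the index of the first subcarrier in the segment; without loss of generality, we consider $l = 0$. Therefore, by expanding and simplifying (18), we obtain
$$\hat{\mathbf{d}}_0 = \arg\max_{\tilde{\mathbf{d}}_0} \sum_{v=0}^{K-2} \left( 2\Re\left\{ \frac{r_v}{\tilde{d}_v} \frac{r_{\acute{v}}^*}{\tilde{d}_{\acute{v}}^*} \right\} - \left|\frac{r_v}{\tilde{d}_v}\right|^2 - \left|\frac{r_{\acute{v}}}{\tilde{d}_{\acute{v}}}\right|^2 \right). \tag{19}$$
For BPSK, $|r_v/\tilde{d}_v|^2 = |r_v|^2$, which is constant with respect to the maximization in (19), and thus, these terms can be dropped. Therefore, using $1/\tilde{d}_v = \tilde{d}_v$, the detector reduces to
$$\hat{\mathbf{d}}_0 = \arg\max_{\tilde{\mathbf{d}}_0} \sum_{v=0}^{K-2} \tilde{d}_v \tilde{d}_{\acute{v}} \Re\left\{ r_v r_{\acute{v}}^* \right\}.$$
Given that the pilot symbol is placed in the first subcarrier and noting that $d_v \in \{-1, 1\}$, then $\tilde{d}_0 = 1$, and $\hat{\mathbf{d}}_0$ can be written as
$$\hat{\mathbf{d}}_0 = \arg\max_{\tilde{d}_1, \ldots, \tilde{d}_{K-1}} \left( \tilde{d}_1 \Re\{r_0 r_1^*\} + \sum_{v=1}^{K-2} \tilde{d}_v \tilde{d}_{\acute{v}} \Re\{r_v r_{\acute{v}}^*\} \right).$$
The sequence error probability ($P_S$), conditioned on the channel frequency response over the $K$ symbols ($\mathbf{H}_0$) and the transmitted data sequence $\mathbf{d}_0$, can be defined as
$$P_{S|\mathbf{H}_0,\mathbf{d}_0} = \Pr\left(\hat{\mathbf{d}}_0 \neq \mathbf{d}_0 \mid \mathbf{H}_0, \mathbf{d}_0\right),$$
which can also be written in terms of the conditional probability of correct detection $P_C$ as
$$P_{S|\mathbf{H}_0,\mathbf{d}_0} = 1 - P_{C|\mathbf{H}_0,\mathbf{d}_0}.$$
Without loss of generality, we assume that $\mathbf{d}_0 = [1, 1, \ldots, 1]$. Because each product $\tilde{d}_v \tilde{d}_{\acute{v}}$ can be selected independently once $\tilde{d}_0$ is fixed, this sequence is detected correctly only if $\Re\{r_v r_{\acute{v}}^*\} > 0$ for all $v \in \{0, 1, \ldots, K-2\}$; the resulting expression for $P_{C|\mathbf{H}_0,\mathbf{1}}$ is simplified in Appendix I to (26). To evaluate $P_{C|\mathbf{H}_0,\mathbf{1}}$ given in (26), it is necessary to compute $\Pr(\Re\{r_v r_{\acute{v}}^*\} > 0)$. Given that $\mathbf{d}_0 = [1, 1, \ldots, 1]$, $r_v = H_v + w_v$, and hence,
$$\Re\{r_v r_{\acute{v}}^*\} = r_v^I r_{\acute{v}}^I + r_v^Q r_{\acute{v}}^Q, \tag{27}$$
where, conditioned on $\mathbf{H}_0$, the in-phase and quadrature components satisfy $r_v^I \sim \mathcal{N}(H_v^I, \sigma_w^2)$ and $r_v^Q \sim \mathcal{N}(H_v^Q, \sigma_w^2)$. Although the product of two Gaussian variables is generally not Gaussian [51], the moment-generating function of the product of two random variables $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and $Y \sim \mathcal{N}(\mu_y, \sigma_y^2)$ tends to that of a Gaussian variable as the ratios $\mu_x/\sigma_x$ and $\mu_y/\sigma_y$ increase [51], and thus, the PDF of the product $XY$ can be approximated by the Gaussian distribution $XY \sim \mathcal{N}(\mu_x\mu_y, \mu_x^2\sigma_y^2 + \mu_y^2\sigma_x^2)$.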
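To illustrate the Gaussian approximation of the product $XY$ invoked above, the following Monte Carlo sketch compares the sample moments of $XY$ with $\mu_x\mu_y$ and $\mu_x^2\sigma_y^2 + \mu_y^2\sigma_x^2$; the parameter values, sample size, and seed are arbitrary assumptions. The exact variance of the product of independent Gaussian variables contains the extra term $\sigma_x^2\sigma_y^2$, which becomes negligible as $\mu_x/\sigma_x$ and $\mu_y/\sigma_y$ grow, consistent with the approximation.

```python
import random


def product_moments(mu_x, sd_x, mu_y, sd_y, n=100_000, seed=7):
    """Sample mean and (population) variance of XY for independent Gaussian X and Y."""
    rng = random.Random(seed)
    samples = [rng.gauss(mu_x, sd_x) * rng.gauss(mu_y, sd_y) for _ in range(n)]
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var


mu_x, sd_x, mu_y, sd_y = 3.0, 0.5, 4.0, 0.5          # mu/sigma ratios of 6 and 8
mean, var = product_moments(mu_x, sd_x, mu_y, sd_y)
approx_var = mu_x**2 * sd_y**2 + mu_y**2 * sd_x**2    # variance used by the Gaussian approximation
exact_var = approx_var + sd_x**2 * sd_y**2            # exact variance of XY for independent X, Y
print(round(mean, 3), round(var, 3), approx_var, exact_var)
```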
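By noting that in (27) $\mu_x/\sigma_x \gg 1$ and $\mu_y/\sigma_y \gg 1$ for all $\{x, y\}$, each product in (27) can be approximated as Gaussian. Moreover, because the sum or difference of two Gaussian random variables is also Gaussian, the required probabilities can be expressed in terms of the Gaussian Q-function, $Q(x) \triangleq \frac{1}{\sqrt{2\pi}}\int_x^{\infty} e^{-t^2/2}\,dt$. As can be noted from (31), the resulting expression is a function of the correlation coefficient $\rho_f$ defined in (8).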
Due to the difficulty of evaluating $2K$ integrals, we consider the special case of flat fading, which implies that $H_v = H\ \forall v$ and $(H_I)^2 + (H_Q)^2 \triangleq \alpha^2$, where $\alpha = |H|$ is the channel fading envelope. Therefore, the SEP expression in (29) reduces to (32), which, using the binomial theorem [44], can be written as (33). The conditioning on $\alpha$ can be removed by averaging over the PDF of $\alpha$, $f(\alpha)$, which is the Rayleigh PDF given in [12, Eq. 8], and hence,
$$P_S = \int_0^{\infty} P_{S|\alpha}\, f(\alpha)\, d\alpha. \tag{34}$$
Because the expression in (32) contains high-order powers of the Q-function, $Q^n(x)$, evaluating the integral analytically becomes intractable for $K > 2$. For the special case of $K = 2$, $P_S$ can be evaluated by substituting (33) and [12, Eq. 8] into (34), which yields the simple expression
$$P_S = \frac{1}{2(1 + \bar{\gamma}_s)}, \tag{35}$$
where $\bar{\gamma}_s$ is the average signal-to-noise ratio (SNR). Moreover, because all data sequences have an equal probability of error, $P_{S|\mathbf{1}} = P_S$, which, for $K = 2$, where the segment carries a single data bit, is also equivalent to the BER. It is interesting to note that (35) has the same form as the BER of differential binary phase shift keying (DBPSK) [44]. However, the two techniques are essentially different, because the $D^3$ does not require differential encoding, has no constraints on the shape of the signal constellation, and performs well even in frequency-selective fading channels. To evaluate $P_S$ for $K > 2$, we use the approximation of $Q(x)$ given in [52], (36). Therefore, by substituting (36) into the conditional SEP (33) and averaging over the Rayleigh PDF [12, Eq. 8], the evaluation of the SEP becomes straightforward. For example, evaluating the integral for $K = 3$ gives a closed-form expression in terms of the exponential integral $\mathrm{Ei}(x)$; similarly, a closed-form expression for $P_S$ can be obtained for $K = 7$. Although the SEP is a very useful indicator of the error probability performance, the BER is more informative.
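Because (35) is noted to be similar to the DBPSK BER, a quick Monte Carlo check against the DBPSK-in-Rayleigh expression $1/(2(1+\bar{\gamma}_s))$ is a useful sanity test. The sketch below assumes flat Rayleigh fading with $E[|H|^2] = 1$, a pilot $d_0 = 1$ followed by one BPSK symbol ($K = 2$), and applies the $D^3$ metric directly; the function names, SNR points, trial count, and seed are arbitrary choices.

```python
import math
import random


def d3_k2_error_rate(snr_db, trials=100_000, seed=1):
    """Monte Carlo BER of the D^3 with K = 2 (pilot d_0 = 1, one BPSK symbol) in flat Rayleigh fading."""
    rng = random.Random(seed)
    snr = 10 ** (snr_db / 10)           # average SNR with E|H|^2 = 1
    sd_h = math.sqrt(0.5)               # per-dimension std of H ~ CN(0, 1)
    sd_w = math.sqrt(0.5 / snr)         # per-dimension noise std, E|w|^2 = 1/snr
    errors = 0
    for _ in range(trials):
        h = complex(rng.gauss(0, sd_h), rng.gauss(0, sd_h))   # same H on both subcarriers
        d1 = rng.choice((-1, 1))
        r0 = h + complex(rng.gauss(0, sd_w), rng.gauss(0, sd_w))
        r1 = h * d1 + complex(rng.gauss(0, sd_w), rng.gauss(0, sd_w))
        # D^3 metric |r0/1 - r1/x|^2 minimized over the trial symbol x in {-1, +1}
        d1_hat = min((-1, 1), key=lambda x: abs(r0 - r1 / x) ** 2)
        errors += d1_hat != d1
    return errors / trials


def d3_k2_closed_form(snr_db):
    """DBPSK-type expression 1 / (2 (1 + mean SNR))."""
    return 1 / (2 * (1 + 10 ** (snr_db / 10)))


for snr_db in (0, 10, 20):
    print(snr_db, d3_k2_error_rate(snr_db), round(d3_k2_closed_form(snr_db), 5))
```

For $K = 2$, minimizing $|r_0 - r_1/\tilde{d}_1|^2$ is equivalent to deciding on the sign of $\Re\{r_0^* r_1\}$, which is why the simulated error rate tracks the DBPSK-type expression.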
For a sequence that contains $K_D$ information bits, the BER can be expressed as $P_B = \frac{\Lambda}{K_D} P_S$, where Λ denotes the average number of bit errors given a sequence error. Because the SEP is independent of the transmitted data sequence, we assume, without loss of generality, that the transmitted data sequence is $\mathbf{d}_0^{(0)}$. Therefore, Λ can be expressed in terms of $\|\hat{\mathbf{d}}_0\|^2$, which in this case corresponds to the Hamming weight of the detected sequence $\hat{\mathbf{d}}_0$, and of the pairwise error probability (PEP). Since the PEP differs for different pairs of sequences, deriving it for all cases of interest is intractable. As an alternative, a simple approximation is derived.
For a sequence that consists of $K_D$ information bits, the BER is bounded as in (42). In practical systems, the number of bits in the detected sequence is generally not large, which implies that the upper and lower bounds in (42) are relatively tight; hence, the BER can be approximated by the midpoint between the two bounds, as given in (43).
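A minimal sketch of the midpoint rule, assuming (42) has the usual form $P_S/K_D \le P_B \le P_S$ (a sequence error flips at least one and at most $K_D$ of the $K_D$ bits). Under that assumption the midpoint is $P_B \approx \frac{K_D+1}{2K_D} P_S$, which reduces to $P_B = P_S$ when $K_D = 1$. The helper names are hypothetical.

```python
def ber_bounds(p_s: float, k_d: int) -> tuple[float, float]:
    """Lower/upper BER bounds for a K_D-bit sequence with sequence error probability p_s.
    Assumed form: each sequence error causes between 1 and K_D bit errors."""
    if k_d < 1:
        raise ValueError("k_d must be >= 1")
    return p_s / k_d, p_s

def ber_midpoint(p_s: float, k_d: int) -> float:
    """Midpoint between the two bounds, i.e. (K_D + 1) / (2 K_D) * P_S."""
    lower, upper = ber_bounds(p_s, k_d)
    return 0.5 * (lower + upper)

# Example: an SS segment with K = 7 carries K_D = K - 1 = 6 BPSK data bits.
print(ber_midpoint(1e-3, 6))
```

The relative spread between the two bounds grows with $K_D$, which is why the approximation is best suited to the short segments used in practice.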
The analysis of the general 1 × N SIMO system is a straightforward extension of the single-input single-output (SISO) case. To simplify the analysis, we consider the flat channel case, where the conditional SEP can be written as (44). Given that all the receiving branches are independent, the fading envelopes are Rayleigh distributed, $\alpha_i \sim \mathcal{R}(2\sigma_H^2)\ \forall i$, and their joint PDF is the product of the individual Rayleigh PDFs. Therefore, the unconditional SEP can be evaluated by averaging (44) over this joint PDF. For the special case of N = 2 and K = 2, $P_S$ can be evaluated in closed form. Closed-form formulas for other values of N and K can be obtained following the same approach used in the SISO case. The analysis for the DS is given in Appendix II.

VI. COMPLEXITY ANALYSIS
The computational complexity is evaluated as the total number of primitive operations needed to perform the detection. The operations considered are real additions ($R_A$), real multiplications ($R_M$), and real divisions ($R_D$) required to produce the set of detected symbols $\hat{\mathbf{d}}$ for each technique. It is worth noting that one complex multiplication ($C_M$) is equivalent to four $R_M$ and three $R_A$ operations, while one complex addition ($C_A$) requires two $R_A$. To simplify the analysis, we first assume that constant modulus (CM) constellations such as MPSK are used; then, we evaluate the complexity for higher-order modulation such as quadrature amplitude modulation (QAM).

A. Complexity of Conventional OFDM Detectors
The conventional OFDM receiver consists of the following main steps, with the corresponding computational complexities:
1) Channel estimation at the pilot symbols, which computes $\hat{H}_k$ at all pilot subcarriers. Assuming that the pilot symbol $d_k$ is selected from a CM constellation, $\hat{H}_k = r_k d_k^*$, and hence $N_P$ complex multiplications are required. Therefore, $R_A^{(1)} = 3N_P$, $R_M^{(1)} = 4N_P$, and $R_D^{(1)} = 0$.
2) Interpolation, which is used to estimate the channel at the non-pilot subcarriers. The complexity of the interpolation process depends on the interpolation algorithm used. For comparison purposes, we assume linear interpolation, which is the least complex interpolation algorithm. Linear interpolation requires one complex multiplication and two complex additions per interpolated sample. Therefore, $N - N_P$ complex multiplications and $2(N - N_P)$ complex additions are required, and hence $R_A^{(2)} = 7(N - N_P)$, $R_M^{(2)} = 4(N - N_P)$, and $R_D^{(2)} = 0$.
3) Equalization: a single-tap equalizer requires $N - N_P$ complex divisions to compute the decision variables $\check{r}_k = r_k/\hat{H}_k$. One complex division requires two complex multiplications and one real division; therefore, $R_A^{(3)} = 6(N - N_P)$, $R_M^{(3)} = 8(N - N_P)$, and $R_D^{(3)} = N - N_P$.
4) Detection: assuming that CM modulation is used, expanding the cost function $J(d_i) = |\check{r}_k - d_i|^2$ and dropping the constant terms, we can write $J(d_i) = -\check{r}_k d_i^* - \check{r}_k^* d_i$. We can also drop the minus sign from the cost function, in which case the objective becomes maximizing it, i.e., $\hat{d}_k = \arg\max_{d_i} J(d_i)$. Since the two terms are a complex conjugate pair, $\check{r}_k d_i^* + \check{r}_k^* d_i = 2\Re\{\check{r}_k d_i^*\}$, and thus the detected symbols can be written as $\hat{d}_k = \arg\max_{d_i} \Re\{\check{r}_k d_i^*\}$. Therefore, the number of real multiplications required for each information symbol is $2M$, and the number of additions is $M$.
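Therefore, $R_A^{(4)} = M(N - N_P)$, $R_M^{(4)} = 2M(N - N_P)$, and $R_D^{(4)} = 0$, assuming that all $N - N_P$ non-pilot subcarriers carry information symbols. Finally, the total computational complexity per OFDM symbol can be obtained by adding the complexities of the individual steps 1 → 4 as $R_A = 3N_P + (13 + M)(N - N_P)$, $R_M = 4N_P + (12 + 2M)(N - N_P)$, and $R_D = N - N_P$. For higher modulation orders, such as QAM, the complexity of conventional OFDM receivers, including the required division operations, is computed following the same steps 1 → 4 above and is found to be: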
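The four steps of the conventional coherent-L receiver described above can be sketched for one OFDM symbol as follows. This is an illustrative sketch, not the paper's simulator: the comb pilot grid, the smooth two-tap channel, and the function name are hypothetical, and the example is noiseless so that the decisions can be checked exactly. The last lines confirm that the CM shortcut of step 4, $\arg\max_{d_i}\Re\{\check{r}_k d_i^*\}$, picks the same symbols as the full minimum-distance rule.

```python
import numpy as np

def coherent_l_receiver(r, pilot_idx, pilot_sym, const):
    """Illustrative coherent-L receiver for one OFDM symbol with a CM constellation.

    r: FFT outputs (length N); pilot_idx: sorted pilot subcarrier indices;
    pilot_sym: unit-modulus pilot symbols; const: unit-modulus constellation.
    Returns (data subcarrier indices, detected symbols, equalized samples).
    """
    n = np.arange(len(r))
    # Step 1: LS estimate at the pilots, H_k = r_k d_k^* (valid because |d_k| = 1)
    h_p = r[pilot_idx] * np.conj(pilot_sym)
    # Step 2: linear interpolation of the real and imaginary parts across subcarriers;
    # np.interp holds the end values constant outside the first/last pilot
    h_hat = np.interp(n, pilot_idx, h_p.real) + 1j * np.interp(n, pilot_idx, h_p.imag)
    data_idx = np.setdiff1d(n, pilot_idx)
    # Step 3: single-tap zero-forcing equalization, r_check = r / H_hat
    r_check = r[data_idx] / h_hat[data_idx]
    # Step 4: CM shortcut, argmax_i Re{r_check d_i^*} (2 real mults + 1 real add per candidate)
    metric = r_check.real[:, None] * const.real + r_check.imag[:, None] * const.imag
    return data_idx, const[np.argmax(metric, axis=1)], r_check

# Usage with a hypothetical smooth channel and comb pilot grid (noiseless)
rng = np.random.default_rng(1)
N = 64
const = np.exp(1j * (np.pi / 4 + np.pi / 2 * np.arange(4)))  # QPSK, unit modulus
pilot_idx = np.arange(0, N, 7)                                # 0, 7, ..., 63
tx = const[rng.integers(0, 4, N)]
H = 1 + 0.3 * np.exp(-2j * np.pi * np.arange(N) / N)          # two-tap channel response
r = H * tx
data_idx, d_hat, r_check = coherent_l_receiver(r, pilot_idx, tx[pilot_idx], const)

# Cross-check step 4 against the full minimum-distance rule |r_check - d_i|^2
d_md = const[np.argmin(np.abs(r_check[:, None] - const) ** 2, axis=1)]
print(np.array_equal(d_hat, d_md), np.array_equal(d_hat, tx[data_idx]))
```

Each operation in the sketch maps to one of the counted steps, so replacing the interpolator (for example, with a spline) changes only step 2, which is where coherent-S differs from coherent-L.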

B. Complexity of the $D^3$
The complexity of the $D^3$ based on the VA is mostly determined by the branch and path metric calculations. The branch metrics can be computed as in (55). For CM constellations, the first and last terms are constants and hence can be dropped, which gives (56). By noting that the two terms in (56) are a complex conjugate pair, we obtain (57). From the expression in (57), the constant "−2" can be dropped from the cost function; however, the problem then becomes a maximization problem. Expanding (57) yields (58), and by defining $\tilde{u}_{m,n} \triangleq \hat{d}_m \hat{d}_n^*$ and using complex-number identities, we obtain (59). For CM, $\Re\{\tilde{u}_{m,n}\}^2 + \Im\{\tilde{u}_{m,n}\}^2 = |\tilde{u}_{m,n}|^2$ is constant, and hence it can be dropped from the cost function, which implies that no division operations are required.
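The CM simplification can be checked numerically under an assumption about the form of the metric. Consistent with the operation count later given for the general branch metric (63) (one complex addition, one complex multiplication, and two complex divisions per branch), the sketch assumes the branch metric between positions m and n is $|r_m/\hat{d}_m - r_n/\hat{d}_n|^2$. For unit-modulus symbols, $1/\hat{d} = \hat{d}^*$, and the metric equals $|r_m|^2 + |r_n|^2 - 2\Re\{r_m r_n^* \tilde{u}_{m,n}^*\}$ with $\tilde{u}_{m,n} = \hat{d}_m \hat{d}_n^*$, so minimizing it is the same as maximizing the real part, without any division. The function names are hypothetical.

```python
import numpy as np

def bm_general(r_m, r_n, d_m, d_n):
    """Assumed general branch metric |r_m/d_m - r_n/d_n|^2: two complex divisions,
    one complex subtraction, and one squared magnitude."""
    return abs(r_m / d_m - r_n / d_n) ** 2

def bm_cm(r_m, r_n, d_m, d_n):
    """CM form to be maximized: Re{r_m r_n^* u^*} with u = d_m d_n^*; no divisions."""
    u = d_m * np.conj(d_n)
    return float(np.real(r_m * np.conj(r_n) * np.conj(u)))

rng = np.random.default_rng(2)
const = np.exp(2j * np.pi * np.arange(8) / 8)                 # 8-PSK, unit modulus
r_m, r_n = rng.normal(size=2) + 1j * rng.normal(size=2)      # arbitrary received samples
pairs = [(a, b) for a in const for b in const]

# For |d| = 1: general = |r_m|^2 + |r_n|^2 - 2 * cm, so argmin(general) = argmax(cm)
identity_ok = all(
    np.isclose(bm_general(r_m, r_n, a, b),
               abs(r_m) ** 2 + abs(r_n) ** 2 - 2 * bm_cm(r_m, r_n, a, b))
    for a, b in pairs
)
best_cm = max(pairs, key=lambda p: bm_cm(r_m, r_n, *p))
min_general = min(bm_general(r_m, r_n, a, b) for a, b in pairs)
print(identity_ok, np.isclose(bm_general(r_m, r_n, *best_cm), min_general))
```

Because the metric depends only on the product $\hat{d}_m \hat{d}_n^*$, several symbol pairs reach the same optimum; this is the phase ambiguity that the pilot symbols in each segment resolve.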
To compute $J^{c}_{m,n}$, it is worth noting that the two terms in brackets are independent of $\{m, n\}$, and hence they are computed only once for each value of $c$. Therefore, the complexity at each step in the trellis can be computed as $R_A = 3 \times 2^M$, $R_M = 4 + 2 \times 2^M$, and $R_D = 0$, where $2^M$ is the number of branches at each step in the trellis. However, if the trellis starts or ends with a pilot, then only $M$ computations are required. By noting that the number of full steps is $N - 2N_P - 1$ and the number of steps that require $M$ computations is $2(N_P - 1)$, the total branch metric (BM) computations follow by summing over both types of steps. The path metrics (PM) require $R_A^{PM} = (N - 2N_P - 1) + M(N_P - 1)$ real additions, and the total complexity is the sum of the BM and PM counts. For QAM modulation, the most general form of the $D^3$ branch metric, (63), is used. The branch metric in (63) requires one complex addition ($C_A = 1$), one complex multiplication ($C_M = 1$), and two complex divisions ($C_D = 2$) per branch, from which the total path metric complexity follows. To compare the complexity of the $D^3$, we use as a benchmark the conventional detector with LS channel estimation, linear interpolation, zero-forcing (ZF) equalization, and coherent MLD, denoted as coherent-L, due to its low complexity. The relative complexity is denoted by η, which corresponds to the ratio of the $D^3$ complexity to that of the conventional detector, i.e., $\eta_{R_A}$ denotes the ratio of real additions and $\eta_{R_M}$ the ratio of real multiplications. As depicted in Table I, $R_A$ for the $D^3$ is less than that of coherent-L only for BPSK with N = 128, and it becomes larger for all the other considered values of N. For $R_M$, the $D^3$ always requires fewer operations than coherent-L, particularly for large values of N, where the ratio drops to 0.61 for N = 2048. It is worth noting that $R_D$ in the table corresponds to the number of divisions in the conventional OFDM receiver, since the $D^3$ does not require any division operations.
For a more informative comparison between the two systems, we use the computational power analysis presented in [53], where the total power for each detector is estimated based on the total number of operations. Table I shows the relative computational power $\eta_P$, which indicates that the $D^3$ detector requires only 0.2 of the power required by the coherent-L detector for N = 128 and 0.31 for N = 2048.
It is also worth considering the complexity for higher modulation orders that require division operations, such as 16-QAM and 64-QAM, since they are widely used in modern wireless broadband systems [1], [2]. Table II shows the ratios of real additions, multiplications, and divisions, and lastly the ratio of the overall computational power, for 16-QAM and 64-QAM with N = 512 and N = 2048. Unlike the CM case, the $D^3$ requires division operations, although it is comparable to conventional OFDM receivers in terms of division resources. Although the $D^3$ requires 25% to 65% more additions, its overall computational power is 6% to 20% lower than that of conventional OFDM receivers, owing to the significant savings in multiplication operations. Moreover, it is worth noting that linear interpolation has lower complexity than more accurate interpolation schemes such as spline interpolation [54], [55], which comes at the expense of error rate performance. Therefore, the results presented in Table I can generally be considered as upper bounds on the relative complexity of the $D^3$; when more accurate interpolation schemes are used, the relative complexity will drop even further compared to the results in Table I.
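The trellis search behind the VA-based $D^3$ can be illustrated with a compact Viterbi sketch for a single-pilot (SS) segment. It assumes the cost $\sum_k |r_k/\hat{d}_k - r_{k+1}/\hat{d}_{k+1}|^2$ over adjacent positions, which is our assumption about the form of the $D^3$ metric (consistent with the per-branch operation count stated for (63)); the exact expressions (11) and (63) are not reproduced here. Because the first symbol is a known pilot, the first step has only M branches, while each later step has M × M branches in this sketch. The function names are hypothetical, and the exhaustive search is included only to check that the VA returns the same sequence.

```python
import itertools
import numpy as np

def d3_viterbi(r, pilot, const):
    """Viterbi search for an SS segment: r[0] carries the known pilot, r[1:] carry data.
    Assumed cost: sum_k |r_k/d_k - r_{k+1}/d_{k+1}|^2 over adjacent positions."""
    K = len(r)
    # First step starts from the known pilot state, so only M branch metrics are needed
    pm = np.abs(r[0] / pilot - r[1] / const) ** 2
    survivors = []
    for k in range(2, K):
        z_prev = r[k - 1] / const                            # per-state estimates at k-1
        z_curr = r[k] / const                                # candidate estimates at k
        bm = np.abs(z_prev[:, None] - z_curr[None, :]) ** 2  # M x M branch metrics
        total = pm[:, None] + bm                             # candidate path metrics
        survivors.append(np.argmin(total, axis=0))           # best predecessor per state
        pm = np.min(total, axis=0)
    state = int(np.argmin(pm))                               # best final state
    path = [state]
    for surv in reversed(survivors):                         # trace back
        state = int(surv[state])
        path.append(state)
    return const[path[::-1]]

def d3_exhaustive(r, pilot, const):
    """Brute-force reference over all M^(K-1) data sequences (same assumed cost)."""
    best, best_cost = None, np.inf
    for cand in itertools.product(const, repeat=len(r) - 1):
        d = np.array((pilot,) + cand)
        cost = np.sum(np.abs(np.diff(r / d)) ** 2)
        if cost < best_cost:
            best, best_cost = d[1:], cost
    return best

rng = np.random.default_rng(3)
const = np.exp(1j * (np.pi / 4 + np.pi / 2 * np.arange(4)))   # QPSK
K = 6
d = const[rng.integers(0, 4, K)]                               # d[0] acts as the pilot
h = (rng.normal() + 1j * rng.normal()) / np.sqrt(2)            # flat Rayleigh gain
r = h * d + 0.05 * (rng.normal(size=K) + 1j * rng.normal(size=K))
d_va = d3_viterbi(r, d[0], const)
print(np.allclose(d_va, d3_exhaustive(r, d[0], const)))
```

The VA applies because the assumed cost is a sum of terms that each involve only two adjacent symbols, so its work grows linearly with K instead of as $M^{K-1}$. A DS segment would additionally fix the final state to the second pilot.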

VII. NUMERICAL RESULTS
This section presents the performance of the $D^3$ detector in terms of BER for several operating scenarios. The system model follows the LTE-A physical layer (PHY) specifications [2], where the adopted OFDM symbol has N = 512, $N_{CP}$ = 64, sampling frequency $f_s$ = 7.68 MHz, subcarrier spacing Δf = 15 kHz, and the pilot grid follows the pilot/data distribution of Fig. 3, except for Fig. 12, which is generated using the NR pilot grid. The total OFDM symbol period is 75 μs, and the CP period is 4.69 μs. The channel models used are the flat Rayleigh fading channel and the typical urban (TUx) multipath fading model [56], which consists of 6 taps with normalized delays of [0, 2, 3, 9, 13, 29] and the corresponding average tap gains of the TUx model. Throughout this section, the ML coherent detector with perfect CSI is denoted as coherent, while the coherent pilot-based detectors with linear and spline interpolation are denoted as coherent-L and coherent-S, respectively. The interpolation is performed in the frequency direction when a frequency-domain segment is detected, and in the time direction when a time-domain segment is detected. Moreover, the results are presented for the SISO system, N = 1, unless mentioned otherwise. The SNR in the obtained results is defined as the ratio of the average received signal power to the average noise power, regardless of the number of pilots. This approach is followed because the proposed system is evaluated in the context of the LTE RB, which has a fixed structure. For more general comparisons, the power and spectral efficiency of all considered systems should be identical. Fig. 5 shows the theoretical and simulated BERs for the SS and DS $D^3$ over flat fading channels for K = 2, 6 and K = 3, 7, respectively, using BPSK. The number of data symbols is $K_D = K - 1$ for the SS and $K_D = K - 2$ for the DS, because the DS has two pilot symbols, one at each end of the data segment.
The results in the figure for the SS show that K has a noticeable impact on the BER, where the difference between the K = 2 and K = 6 cases is about 1.6 dB at a BER of $10^{-3}$. For the DS segment, the BER follows the same trends as the SS, except that it becomes closer to the coherent detector, because using more pilots reduces the probability of sequence inversion due to the phase ambiguity problem. The figure shows that the approximated and simulation results match very well for all cases, which confirms the accuracy of the derived approximations.
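The OFDM numerology listed at the start of this section can be cross-checked with a few lines of arithmetic. The sketch uses only the stated parameters ($f_s$, N, $N_{CP}$, and the TUx tap delays in samples); the variable names are illustrative.

```python
fs = 7.68e6                        # sampling frequency (Hz)
N, N_cp = 512, 64                  # FFT size and CP length (samples)
tux_delays = [0, 2, 3, 9, 13, 29]  # TUx tap delays in samples (normalized)

Ts = 1 / fs                        # sample period (s)
delta_f = fs / N                   # subcarrier spacing (Hz)
T_sym = (N + N_cp) * Ts            # total OFDM symbol period (s)
T_cp = N_cp * Ts                   # CP duration for N_cp samples (s)
delays_us = [round(t * Ts * 1e6, 3) for t in tux_delays]

print(f"delta_f = {delta_f:.0f} Hz, T_sym = {T_sym * 1e6:.2f} us, T_cp = {T_cp * 1e6:.2f} us")
print("TUx delays (us):", delays_us)
print("max TUx delay shorter than CP:", max(tux_delays) * Ts < T_cp)
```

With these parameters, Δf = 15 kHz and the 75 μs symbol period follow from $(N + N_{CP})/f_s$, and the longest TUx tap (29 samples, about 3.78 μs) lies inside the CP. Note that $N_{CP}/f_s$ gives about 8.33 μs for 64 samples, whereas 4.69 μs corresponds to a 36-sample CP at 7.68 MHz (36/7.68 MHz = 4.6875 μs).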
The effect of frequency selectivity is illustrated in Fig. 6 for the SS and DS configurations using $K_D$ = 1. As can be noted from the figure, frequency-selective channels introduce error floors at high SNRs, which is due to the difference between adjacent channel values caused by the channel frequency selectivity. Furthermore, the figure shows a close match between the simulation and the derived approximations. The approximation results are presented only for K = 2 because evaluating the BER for K > 2 becomes computationally prohibitive; for example, evaluating the integral (30) for K = 3 requires solving a 6-fold integral. The results for the frequency-selective channels are quite different from the flat fading cases. In particular, the BER performance improves drastically when the DS pilot segment is used. Moreover, the impact of frequency selectivity is significant, particularly for the SS pilot case. Fig. 7 shows the simulated BER over various degrees of frequency selectivity. The figure illustrates the performance of the $D^3$, coherent-L, and coherent-S, ranging from $|f| = 0.97$, which represents a severely frequency-selective channel as in [8, Tab. 1], to $|f| = 1$, which corresponds to the flat case. As the figure indicates, the $D^3$ is more immune to error floors at both 30 and 40 dB SNR compared to the other detectors. Fig. 8 shows the theoretical and simulated BERs for a 1 × 2 SIMO $D^3$ over a flat fading channel using SS and DS pilot segments. The figure shows that the maximum ratio combiner (MRC) BER with perfect CSI outperforms the DS and SS systems by about 2 and 3 dB, respectively. Moreover, the figure shows that the MLSD [36] and the $D^3$ have equivalent BERs for the SISO and SIMO scenarios. The figure also compares the BER of the 1 × 2 SIMO with the SISO case.
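For reference, the perfect-CSI MRC benchmark has a well-known closed form for coherent BPSK over L independent, identically distributed flat Rayleigh branches: $P_B = \left(\frac{1-\mu}{2}\right)^L \sum_{k=0}^{L-1}\binom{L-1+k}{k}\left(\frac{1+\mu}{2}\right)^k$ with $\mu = \sqrt{\bar{\gamma}/(1+\bar{\gamma})}$, where $\bar{\gamma}$ is the average SNR per branch. The sketch evaluates it; whether the MRC curve in Fig. 8 uses exactly this per-branch SNR normalization is an assumption, and the function name is hypothetical.

```python
import math

def ber_bpsk_mrc_rayleigh(gamma_bar: float, L: int) -> float:
    """Coherent BPSK with L-branch MRC over i.i.d. flat Rayleigh fading (perfect CSI).
    gamma_bar: average SNR per branch (linear)."""
    mu = math.sqrt(gamma_bar / (1.0 + gamma_bar))
    s = sum(math.comb(L - 1 + k, k) * ((1 + mu) / 2) ** k for k in range(L))
    return ((1 - mu) / 2) ** L * s

for snr_db in (10, 20):
    g = 10 ** (snr_db / 10)
    print(snr_db, ber_bpsk_mrc_rayleigh(g, 1), ber_bpsk_mrc_rayleigh(g, 2))
```

For L = 1 the expression reduces to the SISO coherent BPSK result, and for L = 2 it decays as $3/(16\bar{\gamma}^2)$ at high SNR, which shows the second-order diversity slope of the 1 × 2 SIMO curves.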
For the remainder of this section, the results are presented for frequency-selective channels with large values of K, and hence the BER is obtained using Monte Carlo simulation. Fig. 9 shows the BER for SISO and 1 × 2 SIMO systems using the $D^3$, MLSD, coherent, coherent-S, and coherent-L systems over a frequency-selective channel. For both SISO and SIMO, the BERs of all considered techniques converge at low SNRs because the AWGN dominates the BER in the low-SNR range. For moderate and high SNRs, the $D^3$ outperforms the other considered techniques except for the coherent detector, where the difference is about 3.5 and 2.75 dB at a BER of $10^{-3}$ for the SISO and SIMO systems, respectively.
Fig. 10. BER of the $D^3$ for K = 7 DS using BPSK compared with PSP [31], MLSD [36], and differential coding modulation [34] over the 6-tap frequency-selective channel.
Fig. 10 compares the BER of the $D^3$, PSP [31], MLSD [36], MSDD [34], and the coherent detector over the 6-tap channel using BPSK. As can be noted from the figure, the $D^3$ noticeably outperforms the other detectors for SNRs above about 15 dB, which indicates that the $D^3$ is more robust to the frequency selectivity of the channel. Moreover, the figure shows the $D^3$ BER using the VA, which, as expected, is identical to the BER obtained using (11). It is worth noting that all the systems considered in the figure are implemented using the DS segment with K = 7, and thus they are evaluated under similar throughput conditions. However, the BER sensitivity of each technique to the number of pilot symbols could differ, which implies that some of these techniques might be able to provide roughly the same BER using fewer pilot symbols. The same argument applies to the power efficiency as well, because the power allocated per information bit differs among systems. However, because the LTE RB is used as the basis for testing all systems, the current comparison can be considered generally fair. In the worst-case scenario, i.e., considering that all other systems are fully blind, the throughput and power loss is only 4.7%, as described in Subsection IV-B, which has a negligible effect on the BER.
Fig. 11 shows the BER for the $D^3$, MLSD [36], coherent, coherent-L, and coherent-S using 16-QAM. As can be noted from the figure, the MLSD slightly outperforms the $D^3$ at low SNRs, and coherent-S outperforms the $D^3$ at high SNRs. However, coherent-S generally has much higher complexity. Fig. 12 shows the simulated BER of the $D^3$ with SISO deployment according to the 3GPP Release 15 specifications [6] using the type-B DMRS configuration [57]. In this configuration, the NR block has 13 time-domain OFDM symbols, and each symbol has 12 subcarriers. Symbols 1, 6, and 11 are entirely reserved for DMRS; thus, the spectral and power resources allocated to the pilot symbols are about 23%. Moreover, having pilots at every subcarrier index enables estimating the CSI for the entire block using 1-D time-domain interpolation. Consequently, the BER will generally be close to that of the coherent detector with perfect CSI, even for speeds up to 50 km/h using linear interpolation, as shown in the figure. The same argument applies to the $D^3$ because each time-domain segment has three pilots, and most of the symbols are DS. At high speeds, such as 100 km/h, the channel variation in the time domain becomes severe and the channel becomes time-selective. Consequently, the BER of all the considered systems increases noticeably. Nevertheless, the $D^3$ exhibits high robustness compared to coherent detection with linear and spline interpolation. It is worth mentioning that the Doppler effect caused by channel mobility results in intercarrier interference (ICI), and hence reduces the effective SNR. It is also worth noting that the time-frequency distribution of the pilots may have a significant impact on the system performance. For example, in high-mobility scenarios the LTE-A provides lower BER than the NR, even though it has a much smaller number of pilots. Therefore, a high-efficiency design may require an adaptive pilot grid configuration.
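The pilot overheads behind this comparison can be made concrete with exact fractions. The NR figure follows directly from the block described above (3 DMRS symbols out of 13); the LTE figure is an assumption on our part, taking a single-antenna-port CRS pattern with 8 pilot resource elements in a 12 × 14 resource block.

```python
from fractions import Fraction

# NR block from the text: 13 OFDM symbols x 12 subcarriers, symbols 1, 6 and 11 are DMRS-only.
nr_pilot_res = 3 * 12
nr_total_res = 13 * 12
nr_overhead = Fraction(nr_pilot_res, nr_total_res)   # reduces to 3/13

# LTE comparison (assumption): one-antenna-port CRS, 8 pilot REs in a 12 x 14 resource block.
lte_overhead = Fraction(8, 12 * 14)                  # reduces to 1/21

print(f"NR DMRS overhead: {float(nr_overhead):.1%}, LTE CRS overhead: {float(lte_overhead):.1%}")
```

Under these assumptions the NR grid spends roughly five times more resources on pilots than the LTE grid, which puts the observation about LTE-A at high mobility in perspective: pilot placement, not only pilot count, matters.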
As can be noted from the results in Figs. 5 to 12, the main parameters that determine the BER performance of the $D^3$ with respect to the conventional coherent detector are the SNR, modulation order, channel selectivity, and segment length. As can be noted from Figs. 9 to 12, the BERs of all systems generally converge to the same value for SNRs below about 20 dB for the SISO case and about 12 dB for the 1 × 2 SIMO. However, the $D^3$ outperforms the other considered systems, including the pilot-based systems with linear/spline interpolation, MSDD, MLSD, and PSP, at moderate and high SNRs. For the 16-QAM case, however, the pilot-based system with spline interpolation slightly outperforms the $D^3$ at high SNRs. Nevertheless, spline interpolation generally has higher computational complexity. Fig. 13 shows the simulated BER of the $D^3$ using turbo product codes (TPCs) with soft-decision decoding, using the extended Bose-Chaudhuri-Hocquenghem (eBCH) $(32 \times 26)^2$ code. In addition to the encoder interleaver, a 512 × 512 channel block interleaver is also used. The $D^3$ results are compared to the differentially encoded PSK (DPSK) system, and both schemes are based on binary signaling. As can be noted from the figure, the BERs of the $D^3$ and DPSK are comparable over the considered range of SNR, with a small advantage for the $D^3$. As the strength of the code decreases by reducing the number of iterations, the advantage of the $D^3$ becomes more considerable. In fact, using less powerful codes with higher code rates, or using less complex hard-decision decoding, is expected to increase the difference between the two schemes significantly.

VIII. CONCLUSION AND FUTURE WORK
This work proposed a new receiver design for OFDM-based broadband communication systems. The new receiver performs detection directly from the FFT output symbols without the conventional steps of CSIE, interpolation, and equalization, which leads to a considerable complexity reduction. Moreover, the $D^3$ can be implemented efficiently using the VA. The proposed system was analyzed theoretically, and simple closed-form expressions were derived for the BER in several cases of interest. The analytical and simulation results show that the $D^3$ outperforms the coherent pilot-based receiver in various channel conditions, particularly in frequency-selective channels, where the $D^3$ demonstrated high robustness.
Although the $D^3$ has been considered in this work for SIMO systems, it can also be applied to MIMO systems, as demonstrated in [58]. Nevertheless, the system design and performance analysis require a dedicated article, and hence they will be considered in our future work. Moreover, it is crucial to evaluate the sensitivity of the $D^3$ to various practical imperfections such as phase noise, synchronization errors, and IQ imbalance.
As can be noted from the obtained results, the $D^3$ performance generally depends on the frequency selectivity of the channel. Therefore, combining the time-domain interleaving (TDI) system [13] with the $D^3$ could substantially improve the $D^3$ performance, because TDI converts a frequency-selective fading channel into a flat-fading one. Moreover, the $D^3$ may gain some robustness to impulsive noise, since TDI can efficiently mitigate such noise.
APPENDIX II
Embedding more pilots in the detection segment can improve the detector's performance. Consequently, it is worth investigating the effect of embedding more pilots in the SEP analysis. More specifically, we consider the DS segment, $\hat{d}_0 = 1$ and $\hat{d}_{K-1} = 1$, as illustrated in Fig. 2. In this case, the detector can be expressed as in (75). From the definition in (75), the probability of receiving the correct sequence can be derived based on the reduced number of trials compared to (21), which, similar to the SS case, can be written as (77). Therefore, the SEP is given by (78). For flat fading channels, the SEP expression in (78) can be simplified by following the same procedure as in Subsection V-A, which for the special case of K = 3 yields a closed-form SEP. For K > 3, the approximation of $Q^n(x)$ described in Subsection V-A can be used in (78) to average over the Rayleigh PDF given in [12, Eq. 8]. The cases K = 4 and K = 6 can be evaluated in this way, yielding closed-form expressions that involve the constants $\Omega_1$ and $\Omega_2$, respectively. For the DS pilot, $P_B = P_S$ for K = 3, while for K > 3 it can be computed using (43).
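Why an exponential-type approximation of Q(x) makes the Rayleigh average tractable can be shown with one concrete choice. The sketch assumes the widely used two-exponential approximation of Chiani et al., $Q(x) \approx \frac{1}{12}e^{-x^2/2} + \frac{1}{4}e^{-2x^2/3}$; whether this is the approximation of [52] is an assumption. Raising it to the power n and expanding with the Binomial Theorem gives a sum of terms $c\,e^{-a\alpha^2}$, and each averages in closed form over a Rayleigh envelope with $E[\alpha^2] = \Omega$ as $E[e^{-a\alpha^2}] = 1/(1 + a\Omega)$. The argument $Q(\beta\alpha)$ and the function names are hypothetical; the numerical integral is only a reference.

```python
import math
import numpy as np

def q_func(x: float) -> float:
    """Gaussian Q-function via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def avg_qn_rayleigh_approx(n: int, beta: float, omega: float) -> float:
    """E[Q(beta*alpha)^n] over Rayleigh alpha with E[alpha^2] = omega, using the
    assumed approximation Q(x) ~ (1/12) exp(-x^2/2) + (1/4) exp(-2x^2/3)."""
    total = 0.0
    for i in range(n + 1):
        coeff = math.comb(n, i) * (1 / 12) ** i * (1 / 4) ** (n - i)
        a = i * beta**2 / 2 + (n - i) * 2 * beta**2 / 3  # coefficient of alpha^2 in the exponent
        total += coeff / (1 + a * omega)                 # E[exp(-a alpha^2)] = 1 / (1 + a*omega)
    return total

def avg_qn_rayleigh_numeric(n: int, beta: float, omega: float, pts: int = 100_001) -> float:
    """Reference value by trapezoidal integration against the Rayleigh PDF."""
    alpha = np.linspace(0.0, 10.0 * math.sqrt(omega), pts)
    pdf = 2 * alpha / omega * np.exp(-alpha**2 / omega)
    qn = np.array([q_func(beta * a) for a in alpha]) ** n
    y = qn * pdf
    return float(np.sum((y[1:] + y[:-1]) * np.diff(alpha)) / 2)

for n in (1, 2, 3):
    approx = avg_qn_rayleigh_approx(n, math.sqrt(2), 100.0)
    exact = avg_qn_rayleigh_numeric(n, math.sqrt(2), 100.0)
    print(n, approx, exact)
```

The approximation is least accurate in deep fades (it gives 1/3 instead of 1/2 at x = 0), so its accuracy for higher powers n is worth checking numerically, as done here.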