Differential Data-Aided Channel Estimation for Up-Link Massive SIMO-OFDM

Pilot symbol assisted modulation (PSAM) is widely used to obtain the channel state information (CSI) needed for coherent demodulation. It allows the density of pilot symbols to be dynamically chosen depending on the channel conditions. However, the insertion of pilots reduces the spectral efficiency, more severely when the channel is highly time-variant and/or frequency-selective. In these cases a significant amount of pilots is required to properly track the channel variations in both time and frequency dimensions. Alternatively, non-coherent demodulation does not require any CSI for the demodulation independently of the channel conditions. For the particular case of up-link (UL) based on massive single input - multiple output (SIMO) combined with orthogonal frequency division multiplexing (OFDM), we propose to replace the traditional reference signals of PSAM by a new differentially-encoded data stream that can be non-coherently detected. The latter can be demodulated without the knowledge of the CSI and subsequently used for the channel estimation. We denote our proposal as hybrid demodulation scheme (HDS) because it exploits both the benefits of a coherent demodulation scheme (CDS) and a non-coherent demodulation scheme (NCDS) to increase the spectral efficiency. The mean squared error (MSE) of the channel estimation, bit error rate (BER), achieved throughput and complexity are analyzed to highlight the benefits of this differential data-aided channel estimation as compared to other approaches. We show that the channel estimation is almost as good as PSAM, while the BER performance and throughput are improved for different channel conditions with a very small complexity increase.


I. INTRODUCTION
N EW DEMANDING applications have emerged in the last years, which require a tremendous increase in the capabilities of the communications networks.In this context, the Third Generation Partnership Project (3GPP) proposed the Fifth Generation (5G) of mobile communications, commonly known as the New-Radio (NR) [1].Different scenarios are considered, from indoor to outdoor and from low to high mobility, where the latter has attracted much attention from industry [2] due to the commercial potential of new services related to the area of transportation, such as autonomous vehicles, wide-band communications in high speed trains, etc.
Orthogonal frequency division multiplexing (OFDM) [3] and massive multiple input -multiple output (MIMO) [4] are key radio technologies adopted by 5G for the physical layer.They can provide a significant improvement in terms of spectral and energy efficiency with a constrained complexity.Thanks to the use of OFDM, the multi-path channel is converted into a set of independent flat-fading channels, and the MIMO pre/post-coding techniques can be applied to each subcarrier to obtain the equalized symbols.Moreover, when the number of antennas is large enough, linear pre/postcoding techniques have a close to optimum performance [5], avoiding the use of other higher-complexity techniques.
Coherent demodulation is generally used at the receiver to compensate the propagation channel effects.Pilot symbol assisted modulation (PSAM) [6] is one of the mostfrequently employed techniques to obtain the accurate channel state information (CSI) needed for the coherent processing with a reduced complexity.In PSAM several time-frequency resources are exclusively reserved to transmit some known reference signals, typically called pilots, where the density and position of these pilots in the resource grid can be dynamically chosen according to the channel conditions.However, even with this flexibility, the insertion of the PSAM pilots causes a transmit overhead, especially when the channel is fast-varying and/or very frequency-selective.In these cases an important quantity of pilots must be transmitted for periodically tracking the channel variations in each dimension.
Alternatives to PSAM exist, such as blind channel estimation [7], [8].It requires no reference signals and consequently the spectral efficiency is increased, at the expense of sacrificing the complexity and/or the quality of the CSI.Blind methods mainly exploit the separability of the noise and signal sub-spaces by applying an eigenvalue decomposition (EVD) to the correlation matrix of the received signals.However, the CSI cannot be always obtained due to the ambiguity problem caused from the computation of EVD.Then, semi-blind techniques are proposed to solve the ambiguity issue and increase the quality of the channel estimates, now transmitting a reduced number of reference signals [9], [10].Then, the spectral efficiency is again decreased.Moreover, the computational complexity of blind and semi-blind techniques is generally higher than PSAM's due to their iterative structure.Another alternative is superimposed training (ST) [11], [12], which is able to effectively alleviate the complexity of blind techniques, where the data and pilots share the same time-frequency resources.Hence, the spectral efficiency is increased at the expense of degrading the quality of the channel estimates due to the presence of data self-interference.All these techniques, blind, semi-blind or ST, are appropriate for time-invariant and/or flat-fading environments, since they need to average a significant number of samples to obtain an acceptable quality of the CSI.
Another interesting alternative is a non-coherent detection scheme (NCDS), capable of obtaining the transmitted data information without CSI knowledge.Though traditionally dismissed due to the SNR loss with respect to coherent approaches, the combination of non-coherent detection with massive MIMO has recently gained attention, since this SNR loss may become negligible when considering the gains in spectral efficiency and complexity reduction.A significant number of works in the literature focus on the up-link (UL) [13]- [16], where single antenna users transmit to a base station (BS) equipped with an array of a very large number of elements, which corresponds to a multiuser single-input multiple-output (SIMO) scenario.The large number of elements at the receiver can leverage spatial diversity, improving the link performance.Reference [13] proposed a non-coherent scheme based on energy detection, that provides an insightful asymptotical performance analysis, while it practically requires an excessive number of antennas in order to provide an acceptable performance.References [14], [15] proposed the use of differential phase shift keying (DPSK), where the BS receiver non-coherently combines two contiguous symbols for each antenna and performs an averaging process over the antennas to mitigate the channel effects.Later, the combination of [14] with OFDM is studied in [16] assuming a doubly dispersive channel model.Moreover, the differential modulation showed great robustness against extremely high Doppler shifts, significantly outperforming the traditional coherent schemes in these scenarios.However, the system performance is compromised when the number of users simultaneously served in the same time and frequency resources is increased.
Taking into account the benefits and limitations of the existing techniques, we propose to take the best of both approaches.The idea is to maintain the SNR efficiency of coherent demodulation schemes (CDS) while not wasting resources for pilot transmission thanks to NCDS, even for those scenarios which suffer from high values of Doppler and/or delay spread.We focus on the particular case of UL since the recent literature has shown that the NCDS can make use of the large number of antennas at the BS to improve its performance [14]- [16].Besides, most massive MIMO systems are based on a time division duplex (TDD) where training is performed in the UL, making the UL CSI estimation indispensable.Our proposal is a differential data-aided channel estimation scheme, where the reference symbols of traditional PSAM are replaced by differentially-encoded data that are non-coherently detected in a very robust manner thanks to the massive number of antennas.NCDS and CDS data are multiplexed in the same way as pilots and data in traditional PSAM and in most wireless communication standards based on OFDM.Then, these non-coherently detected symbols are used for channel estimation to produce the CSI to coherently demodulate the main data stream.
We denote our proposal as hybrid demodulation scheme (HDS), and we provide a complete description of how to multiplex coherent and differential data in the OFDM frame in a flexible way that may be adapted to different channel conditions.By combining two data streams, denoted as CDS and NCDS, we are able to exploit the benefits of the traditional CDS for different channel conditions without sacrificing the data-rate usually spent in pilot overhead.However, this combination is not straightforward.If not carefully designed, the CSI obtained with the NCDS stream may not have the required quality and decrease the achievable rate of the CDS stream with also a poor achievable rate of NCDS, so nothing is gained.Then, an analysis of the influence of the system parameters, particularly the constellation size and the packet length of the NCDS stream, is crucial in order to understand their effects on the CSI accuracy, bit-errorrate (BER) and throughput, and provide a good design.The contributions of this paper are summarized as follows: • An analytical derivation of the BER for the NCDS stream is provided, which differs from that of the CDS and is not available in the literature for non-coherent massive MIMO.• An analysis of the MSE of the channel estimation is provided when the CSI is obtained using a NCDS data stream (unknown to the receiver).The analysis is based on the previous derivation of the BER and it allows us to determine the proper value of the parameters to ensure that the increment of the MSE with respect to the classical PSAM-based estimation is negligible.
• A comparative analysis of the complexity and achievable throughput of the CDS and HDS is provided.An alternative scheme such as ST is also evaluated for comparison.It is shown that HDS outperforms both ST and CDS with a very reduced increment in complexity.The analytical expressions of the throughput can be also used to choose the best constellation sizes for the HDS scheme in a certain scenario without the need of setting up a simulation environment.• This approach is validated by evaluating the BER and throughput that are obtained using a realistic geometric channel model with link-level simulations.The numerical results offer the same conclusions as the analytical approach.The remainder of the paper is organized as follows.
Section II provides the system model based on SIMO-OFDM.It also provides an overview of the traditional CDS, PSAM and NCDS, as basic building blocks or our proposal.Section III describes the proposed HDS, which is based on the combination of CDS and NCDS, explaining the details of the proposed channel estimation and demodulation.The MSE obtained with the channel estimation is also analyzed.Section IV offers the throughput and complexity evaluation.Section V presents some numerical results to verify the analysis and compare the performance of our proposal with respect to other alternatives.Finally, in Section VI, some conclusions are pointed out.
Notation: matrices, vectors and scalar quantities are denoted by boldface uppercase, boldface lowercase, and normal letters, respectively.[A] m,n denotes the element in the m-th row and n-th column of A. [a] n represents the n-th element of vector a. ∠(x) and x represent the angle and closest superior integer number of x, respectively.The superscripts (•) H , (•) * and (•) † denote Hermitian, complex conjugate and pseudoinverse of Moore-Penrose, respectively.E{• } represents the expected value.Var{• } denotes variance.CN (0, σ 2 ) represents the circularly-symmetric and zero-mean complex normal distribution with variance σ 2 .f (a|b) and P(a|b) is the conditional probability density function (PDF) of a conditioned to b for continuous and discrete random variables, respectively.(k, ξ) is the Gamma distribution with shape parameter k and scale parameter ξ , with mean μ = k/ξ and variance σ = k/ξ 2 .R and I refer, respectively, to the real and imaginary parts of a complex number.N indicates the set of integer numbers.

II. SYSTEM MODEL FOR CDS AND NCDS
In this section we first introduce the system model of an UL based on SIMO-OFDM.Focusing on this system model, we detail the channel estimation and equalization procedures of a classical CDS with PSAM.Then, a NCDS based on differential modulation is also presented.CDS and NCDS are building blocks of the HDS proposed in Section III.

A. MASSIVE SIMO-OFDM
We consider one BS equipped with an array of R antenna elements which is serving several single-antenna users.For the purpose of channel estimation, we focus on the UL of a particular user (see Fig. 1) that transmits N consecutive OFDM symbols to the BS.The OFDM signal has K subcarriers and the length of the cyclic prefix (L CP ) is long enough to absorb the effects of the multi-path channel (L CP ≥ L CH − 1).The number of channel taps is denoted by L CH .At the BS, after removing the cyclic prefix and performing a fast-Fourier transform (FFT), we can process each subcarrier as one of a set of K independent sub-channels.Moreover, the whole time-frequency resource grid (K × N resource elements) is split into unit blocks of K B consecutive subcarriers and N B contiguous OFDM symbols.Each unit block can be independently processed for channel estimation and detection.This model has practical interest since it is used, for example, in the Fourth Generation (4G) and 5G, where these generic unit blocks are denoted as physical resource blocks [1].Hence, for the sake of conciseness and without loss of generality, we focus on a particular unit block.
The signal received at the BS on the k-th subcarrier and n-th OFDM symbol is denoted by the vector y n k (R×1) given by where x n k denotes the symbol transmitted by the user on the k-th subcarrier and n-th OFDM symbol, ν n k (R × 1) represents the additive white Gaussian noise vector, whose elements are distributed according to CN (0, σ 2 ν ), and h n k is a (R × 1) vector that corresponds to the frequency channel response experienced at the R antennas, whose elements are distributed as CN (0, 1).Following [17], the channel coefficients are spatially uncorrelated but suffer from time variability according to where f is the subcarrier spacing in Hertz, J 0 (•) denotes the zero-th order Bessel function of the first kind and f d represents the maximum Doppler shift experienced by the transmitted signal, also in Hertz.Finally, assuming unit-power transmitted symbols, the SNR is defined as ρ = 1/σ 2 ν .

B. PILOT SYMBOL ASSISTED MODULATION (PSAM)
In PSAM, some of the time-frequency resources of the unit block are exclusively reserved for the transmission of pilot symbols while the others are dedicated to data symbols.Firstly, let A denote the set of pairs of frequency and time indexes that describe the resources of the unit block, defined as with cardinality |A| = K B × N B .Let A d denote the subset of the pairs of frequency and time indexes intended for the data transmission, and A p denote the subset that contains the pairs of frequency and time indexes reserved for the pilot symbols.These subsets must satisfy and their cardinality is given by In this equation K p and N p refer to the number of pilots in the frequency and time domain, respectively, in a unit block.According to [18], these pilot symbols must be equally spaced over the available resources.The distance between two contiguous pilot symbols in the frequency and time dimensions is denoted by L K and L N , measured in number of resources, and they must satisfy Hence, x n k can be built in the following way from the data symbol denoted by s n k that belongs to a quadrature amplitude modulation (QAM) constellation of size M C and the pilot symbol at the k-th subcarrier and n-th OFDM symbol denoted by p n k .
At the BS, the channel estimation is performed at the resources occupied by the reference signals and obtained as By applying the LS criterion [19], the channel estimate is and its corresponding MSE is σ 2 e = σ 2 ν .Finally, these channel estimates h n k are interpolated to obtain the channel estimates at the data symbol positions (A d ) by using some interpolation function.The channel estimation error after interpolation (σ 2 d ) depends on the channel variations in time and frequency and on the chosen interpolation function [20], whose detailed analysis is out of scope of this work.

C. COHERENT DEMODULATION SCHEME (CDS)
Once the CSI is available, the post-coding vectors can be computed to equalize the received data symbols.Using a zero-forcing (ZF) criterion, they are obtained as Then, the following decision variable is obtained in at the receiver to recover the transmitted data symbols According to [21], the BER for a M C -QAM constellation using a ZF post-equalization may be approximated as where ) and we assume the use of a rectangular pulse.

D. NON-COHERENT DEMODULATION SCHEME (NCDS)
We will use the frequency domain implementation of the NCDS presented in [16].This method has the advantage of involving a lower processing delay.In turn, only for strongly frequency-selective channels, an additional phase shift needs to be estimated and compensated.This phase shift remains quasi-static, even for extremely high Doppler scenarios and/or strongly frequency selective channels, when the number of antennas at the BS is large enough.Hence, if needed, this phase error can be estimated by using only two pilots for the whole time-frequency resource grid (K × N), which has a minimum impact in terms of data rate.
For a given unit block of size A, the transmitted symbols x n k are given by In this equation p n 1 is a reference symbol known by both ends of the communication link placed at the first subcarrier and each OFDM symbol (equivalent to one pilot in PSAM) and q n k is the complex data symbol to be transmitted in subcarrier k and symbol n, which belongs to a DPSK constellation of size M N .
At the receiver, two symbols contiguously transmitted in the frequency domain are non-coherently combined as After some manipulations, ( 15) can be expanded as Assuming that the number of antenna elements at the BS is large enough, and making use of the Law of Large Numbers [22], the following is satisfied and since (x n k−1 ) H x n k = q n k , ( 16) asymptotically tends to Otherwise, when the number of antennas is not large enough, there are some distortion and noise terms that are computed in [14].Additionally, Appendix A shows the development of the analytical expressions for the symbol error rate (SER) and BER of NCDS combined with massive SIMO.The latter is given by rdrdγ, (20) where D ∈ [−π/M N , π/M N ] denotes the decision region for the correspondent symbol of interest and (μ x , μ y ) = (1, 0) (set for the DPSK symbol placed in the positive part of the real axis of the complex plane without loss of generality).
The parameters σ x and σ y are as follows

III. PROPOSED DIFFERENTIAL DATA-AIDED CHANNEL ESTIMATION
In the previous section we showed that the CDS is limited by the additional overhead produced by the transmission of pilots.This inefficiency is more severe in scenarios with a high variability in time and/or frequency since the pilot density needs to be increased.We propose a new scheme, that we denote as the HDS, based on the combination of the CDS and the NCDS techniques to alleviate the overhead problem while maintaining a good performance.Our proposal consists of replacing the pilot symbols of the traditional PSAM-based CDS by an additional data stream following the NCDS approach, as shown in Fig. 2. At the BS, the data transmitted on the NCDS stream is firstly demodulated, without CSI knowledge, and then used to obtain the channel estimates.The diversity introduced by the large number of antennas at the BS and a careful selection of the parameters of the NCDS stream ensure a good performance, so it can provide an accurate channel estimation for the subsequent demodulation of the CDS data.Based on this channel estimates, the post-coding matrices are computed and the data transmitted on the CDS stream is equalized.Since the whole time-frequency grid is occupied by data symbols, to be demodulated either non-coherently (NCDS) or coherently (CDS), the system efficiency is increased.
In the following we provide the details of how the two data streams are multiplexed in the time-frequency grid of the unit block, the proposed procedure to estimate the CSI from the received differential data, and the analysis of the MSE that is obtained.

A. MULTIPLEXING CDS AND NCDS DATA STREAMS
The proposed HDS consists of multiplexing two data streams, namely a data stream transmitted following the traditional CDS (s n k ), and another data stream transmitted by the NCDS (q n k ) (see Fig. 2).Comparing the HDS with the traditional NCDS, the traditional reference symbols are replaced by a new data stream to be non-coherently detected.Hence, the distribution of resources expressed in ( 7) is modified as where k n min is the minimum value of the index k for the n-th OFDM symbol belonging to the subset A p (that is, the first sub-carrier of each OFDM symbol that carries the NCDS stream).For example, k n min = 1 in Fig. 2. Interestingly, our proposed differential data-aided system can be adapted to any existing pilot distribution based on PSAM, by simply replacing the pilots by a NCDS data stream.

B. DETECTION OF THE NCDS STREAM AND CHANNEL ESTIMATION
As described in (15), for those resources that carry the NDCS data stream, ((k, n) ∈ A p ), the non-coherently combined symbols (z n k ) can be obtained at the BS receiver.Then, the transmitted data symbols ( q n k ) can be recovered from z n k by using the maximum-likelihood (ML) criterion [23] as where Once the decision is taken on the NCDS data symbols, they can be used for the channel estimation as follows.Firstly, the differential symbols are reconstructed ( q n k → x n k ) by applying (21).Then, the LS criterion is used to estimate the channel at the (k, n) ∈ A p resources as Inspecting (23), we can see that there is no noise enhancement since | x n k | = 1.However, unlike in traditional PSAM, there is an additional error term in the channel estimation produced by the possible mismatch between the transmitted and the reconstructed differential symbol.This error in the channel estimation is studied in the next subsection.

C. ANALYSIS OF THE MSE OF THE CHANNEL ESTIMATION
The two terms given in (23) are independent to each other, which can be verified as due to the fact that the transmitted symbol, channel and noise are uncorrelated random variables.Hence, the channel estimation error incurred when using ( 23) is given by where the expectation is performed over both time and frequency dimensions.Recall that σ 2 ν is the noise variance, and the first term can be obtained as Here α accounts for the channel estimation error produced by the use of an unknown data sequence.It depends on the error probability of the decision of ( q n k ).Its value is derived in Appendix B and shown to be It depends on the NCDS stream length (K p ) and the error probability of the symbols q n k , denoted as P s,N .Moreover, α is bounded by 0 ≤ α ≤ 1.The case for α = 0 is equivalent to PSAM, where all symbol decisions are correct, while α = 1 means that all decisions are wrong, degrading the channel estimates.Therefore, the channel estimation error for our proposed scheme σ e is bounded by For low values of SNR, the dominant term is given by the noise since σ 2 ν >> 1.For high and moderate SNR values, both terms in (28) decrease.Moreover, according to (26) and ( 27), σ 2 q can be minimized by properly setting P s,N and K p , thus assuring a performance similar to the traditional PSAM with the advantage of being capable of transmitting a higher data rate.The performance of P s,N depends on the constellation size (M N ), number or antennas (R) and the SNR (ρ).In order to improve the error performance of the symbols transmitted by the NCDS and the quality of the channel estimation, M N should be specifically chosen for a particular scenario.On the other hand, the length of the sequence K p should be also constrained.However, this is not restrictive since K p is already constrained by using the unit block.
Finally, the estimated channel ( h n k , (k, n) ∈ A p ) is interpolated to provide the CSI for the whole resources of the unit block (A d ), the post-coding matrices based on ZF are computed, and the CDS data symbols are equalized, following the procedure explained in the previous section.

IV. ANALYSIS OF THE THROUGHPUT AND COMPLEXITY
In this section we analyze and compare in terms of throughput and complexity the traditional CDS and our proposed HDS.We show that HDS can increase the throughput with a negligible increment of the complexity.

A. THROUGHPUT ANALYSIS
For a typical packet-based transmission, let us define the total throughput of the HDS as where T C and T N refer to the throughput of the CDS and NCDS streams, respectively, for each unit block.The throughput of each scheme can be found as In these equations, L P denotes the number of bits in one packet, P b,N is given by (20) and P b,C by (12).The efficiencies of CDS and NCDS in terms of the occupied resources in the unit block are Inspecting ( 29)-( 31), we can see that the total throughput of our proposed HDS is the sum of the throughput of each scheme, CDS and NCDS given in (30) and (31), respectively.If some parameters are given by the system design, such as the number of antennas, SNR, bandwidth and length of the packet, the throughput of the NCDS only depends on the selected modulation scheme (M N ) and its corresponding BER (P b,C given in (20)).In turn, the BER if CDS (P b,C given in (12)) not only depends on the constellation size (M C ), but it also relies on the performance of the channel estimation provided by the NCDS stream.Hence, the proposed HDS outperforms the traditional CDS due to the additional throughput provided by the NCDS if the channel estimation error (σ 2 e ) is properly constrained in order to avoid the increment of P b,C .
In order to obtain the best pair of constellation sizes (M C and M N ) for a given scenario with some specific parameters, the throughput given in ( 29)-( 31) should be maximized.Due to the difficulty to find closed-form expressions from ( 12) and ( 20) and motivated by the fact that the search space is very small (log 2 (max(M N ))×log 2 (max(M C )) combinations), we propose to resort to an exhaustive search over these few options.The throughput expressions are evaluated for a particular scenario and the values that provide the maximum throughput given in (29) are chosen.As we show in the next section, the search is limited to a small number of options since the possible values of M C and M N are usually constrained to a few, e.g., 8 × 6.

B. COMPLEXITY
In general, non-coherent processing is much less complex than coherent processing.Anyway, as the NCDS stream is an addition, its demodulation will incur some extra complexity that would not exist in a CDS system alone.In order to show that the proposed HDS does not significantly increase the complexity of the system, the number of complex multiplications (NCM) required for each scheme is accounted as follows.
For the particular case of CDS, the expression of the NCM is given by where the LS channel estimation given in ( 23) is considered, as well as the post-coding matrix computation given in (10) and the equalization process given in (11).The complexity introduced by the interpolation process is not taken into account, so this is an optimistic evaluation of the complexity of CDS.If we consider it, the additional relative complexity of NCDS is even lower.
For the case of HDS, the NCM can be expressed as where D N accounts for the differential decoding given in (15), required for the NCDS.Note that the differential encoding required for computing the differential symbols q n k at the transmitter, and for the reconstruction of the differential symbols q n k at the receiver are considered negligible.The reason for this is the fact that the phase difference can be computed in polar coordinates, which corresponds to adding/subtracting phases, and the conversion from polar to binomial coordinates can be implemented by using look-up tables, since the differential symbols (q n k ) belong to a DPSK constellation, which is a finite set.On the contrary, these simplifications cannot be implemented for the differential decoding, since the received symbols, which are modified by the channel and noise effects, no longer belong to a finite set.
Comparing (33) and (34), we can see that the dominant term in (34) is the first one, which corresponds to the CDS, and therefore, both techniques have the same complexity order O(3RK B M B ), rendering the additional complexity of the NCDS stream negligible, as predicted.

V. NUMERICAL RESULTS
In this section, we provide some numerical results to verify our analysis and show the validity of the proposed HDS, which is capable of outperforming the traditional CDS for several scenarios of interest, as well as other alternative schemes such as ST.In Table 1, we provide a summary of some numerical values for the different system parameters, where they have been chosen taking into account the numerology given in 5G [1].Besides, the size of the unit block has been set K B = 12, N B = 14 and the interpolation method to 'spline' [24].The number of pilots placed in the unit block (given by K p and N p ) must be incremented as the variations of the channel in both dimensions increase.Note that the time variability of the channel is modelled by using (2).According to [18], the number of pilots placed in the frequency dimension should be at least twice the number of taps of the multi-path channel, and in the time dimension there must be at least two pilots within the coherence time [17].However, realistic communication links increase the number of pilots beyond the minimum, especially in the frequency dimension, to improve the quality of the estimators.Then, by changing the values of K p and N p we illustrate different scenarios of time and frequency variability of the channel.
The ST scheme of [12] is also taken into account for the comparison of throughput.This scheme has a parameter denoted as β that indicates the power allocated for the superimposed pilot symbols, while 1 − β corresponds to the power of the data symbols.According to [12], we set β = 0.2 which is the most frequently used value in the literature.ST requires an averaging process, which is implemented in both time and frequency dimensions and the number of resources to average is dynamically adapted for different values of Doppler and delay spreads to avoid the degradation of the channel estimates due to these effects.Hence, considering the definitions of ( 6), for the ST the number of averaged samples has been set equal to the amount of resources between two contiguous pilot symbols for each dimension (L K × L N ).

A. BER FOR NCDS COMBINED WITH MASSIVE SIMO
We verify the validity of ( 20) to obtain the BER of the DPSK symbols received using the NCDS with a massive number of antennas at the BS (P b,N ).In Fig. 3, we provide the simulation results and the analytical approximation of the BER for different values of M N and R. The chosen power delay profile for the channel impulse response corresponds to the type B, given in [25], with a spatially uncorrelated channel of delay spread 363 ns.It can be confirmed that the difference between the simulation and the analytical approximation is negligible, in particular when the number of antennas is large enough.

B. MSE OF THE DIFFERENTIAL DATA-AIDED CHANNEL ESTIMATION
We check the performance of the proposed channel estimation using a differential data-aided sequence as compared to the traditional PSAM.The number of antennas is constrained to R = 64 for feasibility reasons, due to the fact that this is a typical number of antennas for the deployment of 5G [25].In Fig. 4, we show the MSE of the channel estimation for different values of K p , given in (25), when M N = 16 and N p = 4.Note that generally we process the NCDS streams divided in unit blocks, so K p is lower than or equal to K B .The motivation for this choice is to show the feasibility of our proposed scheme for a realistic implementation, since the unit block (or resource block) is the minimum unit of processing in the frequency domain in 4G or 5G [1].Only for this subsection, both K p and K B are exceptionally allowed to be increased up to 24 subcarriers for illustrative purposes and discussion of the performance in other cases.The chosen power profile for the channel impulse response is the same as the one defined in the previous subsection.For medium and high SNR, our scheme and the PSAM have the same performance.For very low SNR, our proposed scheme is only slightly worse than PSAM, where the relative degradation increases with K p .However, we can properly adjust K p and M N to obtain a required performance even for very low SNR regimes, as the subsequent BER and throughput analysis show.Furthermore, we can see that the analytical approximation (denoted upper-bound (UB)) is very close to the results obtained by simulation, saturating to σ 2 ν + 2 (α = 1) for low SNRs, verifying the validity of the analysis shown in Appendix B.

C. THROUGHPUT EVALUATION FOR DIFFERENT CHANNEL SCENARIOS
In Fig. 5, we show the throughput comparison for the CDS, NCDS, HDS and ST for a unit block where R = 64, K p = 6 and N p = 7, which is an interesting case in terms of medium to high delay and Doppler spreads.There is a substantial improvement of performance for the HDS with respect to the CDS, since the latter is highly penalized by the great amount of necessary pilots.The throughput of the HDS is the highest, not only due to a low average BER, but also to the additional throughput provided by the NCDS stream (recall (30) and ( 31)).Moreover, we also show the performance of the pure NCDS, the DPSK symbols are occupying the entire unit block.We can see that NCDS is not able to outperform neither CDS nor HDS in this scenario.This is due to the fact that a QAM constellation has always a better performance than DPSK in terms of BER when M C = M N , provided that the channel can be adequately estimated.Also, it is known that the differential detection process increases the noise [14].On the other hand, ST has a very poor performance since in this scenario there are not enough resources to perform the required averaging process in order to avoid the self-interference produced by the data, degrading its performance in terms of throughput.
According to the throughput analysis, for each particular scenario of interest imposed by the system design (number of antennas, length of the packet, delay and Doppler spreads, etc.), an optimum value of M N and M C can be chosen for each case of SNR (ρ) in order to obtain the maximum throughput.Tables 2 and 3  SNR (ρ).Both constellation sizes can be increased when the SNR is high enough, maximizing the data-rate of the system.Note that the optimum value of M N is not only chosen to increase the throughput of the NCDS stream, but it also has to constrain the channel estimation error, guaranteeing the performance of the CDS stream (M C ).Hence, given the expressions of the throughput, we can easily choose the best configuration for a given case.For example M C = 16 and M N = 8 should be chosen for ρ = 5 dB, while M C = 64 and M N = 32 are best for ρ = 15 dB.
Table 4 shows a throughput comparison among the CDS, HDS and ST for different values of K p and N p where R = 64, M C = 16, M N = 8 and ρ = 5 dB.NCDS is not shown in this Table since its throughput remains constant for any of the considered values of delay and Doppler spread (K p and N p respectively).Its throughput is 49.5 × 10 3 packets/s for M N = 8 and ρ = 5.This value is the same given in Table 4 for HDS when K p = 12 and N p = 14, since in this extreme case all the symbols in the HDS are differentially encoded and HDS boils down to NCDS.Again, the different values of K p and N p correspond to different maximum supported values of delay and Doppler spreads.Once more, the use of ST does not bring in general an adequate performance.When there are enough resources for averaging (K p and M p low), the results are acceptable, but they degrade fast when K p and M p increase.In Table 5, for the same scenario, we provide the percentage of throughput increment of the proposed HDS with respect to the CDS, which is defined as We can see in both Tables 4 and 5 that for low mobility scenarios with a low or medium delay spread (low K p and N p ), the NCDS does not provide a significant increase of the throughput (approximately 0 − 5%) over the HDS.However, when either the Doppler or delay spreads are significantly increased, the throughput of the CDS is decreased while the throughput of NCDS is increased, improving T H (approximately 6 − 70%).For the extreme case of extremely high Doppler and delay spread (K p = 12 and N p = 14),  4.
only NCDS can provide an acceptable performance (HDS collapses to pure NCDS), while CDS cannot be used.In summary, for increasing K p and N p values the performance of CDS worsens faster than that of HDS.ST can only provide an acceptable performance for the particular case of K p = N p = 1, where the channel has very mild variations in both dimensions and the averaging processing can be performed satisfactorily.
According to this numerical evaluation of the throughput, we provide a graphical summary of the advisable technique for different scenarios in Fig. 6.CDS is not recommendable in scenarios with high delay and/or Doppler spreads, where an excessive number of pilots must be transmitted to provide a continuous tracking of the channel at both time and frequency dimensions (K p ↑ and N p ↑), so that an acceptable quality of the channel estimates (σ 2 e,p → 0 given in ( 25)) is provided.In these cases, T C = 0 since η C → 0. On the other hand, NCDS is suitable for these extreme cases where resources are not wasted to transmit reference signals.Indeed, [14]- [16] showed that NCDS has a great robustness against high Doppler scenarios, no matter the delay spread when OFDM is used.On the contrary, for low mobility or fixed communication scenarios with low or moderate values of delay spread, the number of reference signals can be greatly reduced (K p ↓ and N p ↓) since the channel remains quasi-static in both dimensions.Therefore, the throughput increment provided by the addition of the NCDS stream is not significant as compared to the throughput of the CDS (T C >> T N ).Finally, in those scenarios where the channel variations are not excessively high, at least in one dimension, the combination of CDS and NCDS, that is the proposed HDS, outperforms the existing solutions in terms of throughput.In these scenarios, the proportion of resources allocated to the reference signals in CDS is significant and the effective data-rate of the link is reduced if only the CDS stream is sent.In the example provided in Fig. 2, which is a typical configuration in 5G-NR [1], the pilot symbols correspond to 28.6% of the grid, which is an important overhead, and our proposal can take advantage of this overhead by transmitting an additional data stream using NCDS.

D. COMPLEXITY EVALUATION FOR DIFFERENT CHANNEL SCENARIOS
In order to compare the complexity of CDS and HDS, we define the complexity increment of HDS with respect to CDS as This complexity increment is presented in Table 6 where we can see that it is proportional to the number of symbols transmitted in the NCDS stream (given by K p and N p ).This complexity increment is below 10% for a low to medium amount of resources dedicated to NCDS, which corresponds to a throughput increment of up to 21% as shown in Table 5.For the extreme case of very fast timevarying and/or strongly frequency-selective channels, many more resources can be dedicated to the NCDS stream and the additional complexity can be increased up to 30%, while the throughput increase is almost 70%.In summary, the additional complexity produced by the proposed scheme is always much lower than the additional amount of throughput that it can provide, showing that HDS is more efficient than the traditional PSAM-based CDS.

E. BER AND THROUGHPUT EVALUATION WITH A GEOMETRIC CHANNEL MODEL
In this subsection we provide some simulation results with a more realistic channel model to compare CDS and HDS.Particularly, we adopt a geometric wide-band channel model  which enables the characterization of the effects of the propagation channel and the antenna arrays [26].It is characterized by the geometric superposition of several separate clusters, where each of them has a different value of delay and gain.Moreover, each cluster is made of a certain number of rays with different angle of arrival and departure.The chosen array configuration corresponds to a uniform linear array (ULA), where the distance of two contiguous elements is half the wavelength.The delay spread is set to 363 ns and the angular spread is set to 5 degrees, which are example values defined in [25].Additionally, with interest in high mobility scenarios, we set a Doppler frequency of 1.6 KHz, which corresponds to a carrier frequency of 3.5 GHz and an approximated speed of 500 km/h.Moreover, we set K p = 6 and N p = 4, which is the configuration given in Fig. 2.
In Fig. 7, we provide the throughput comparison for the different values of constellation size.We can see that the HDS provides a significant additional throughput as compared to CDS.Under this particular realistic channel model, the achieved throughput of both CDS and HDS is lower than in the previous case, due to the effect of the spatial correlation produced by the chosen array configuration.However, HDS still outperforms the traditional CDS by approximately a 11% of throughput increment, which is similar to what was obtained in the equivalent case with spatially uncorrelated channels.Hence, these results also show the advantages of our proposed scheme in a more realistic environment.

VI. CONCLUSION
For an UL massive SIMO-OFDM system, we have proposed a differential data-aided channel estimation scheme, where the traditional reference signals are replaced by a differential data stream.This data stream is demodulated using noncoherent detection at the BS and is used to perform the channel estimation.Therefore, by proposing a hybrid scheme denoted as HDS, the channel can be accurately estimated to perform coherent demodulation of the CDS stream, while the resources typically occupied by pilots are now leveraged for transmission of the NCDS data stream.The benefits of our proposal have been evaluated in terms of the channel estimation MSE, BER, throughput and complexity.
We have provided the analytical expressions of the channel estimation MSE and we have shown that it has very close performance to PSAM-based estimation.We have provided analytical expressions of the BER and evaluated the throughput of HDS for different configurations of the resources in the time-frequency grid, which correspond to different values of Doppler and delay spreads, showing that it outperforms the CDS with up to a 75% of throughput increment, which is obtained for high mobility scenarios.The different theoretical derivations given in this work are shown to match the numerical results, showing the accuracy of the analysis, and facilitating an optimization of the system parameters.Illustrative values of the number of antennas and the length of the NCDS data stream were chosen to shown the feasibility of this solution in the frame of the current mobile communications standards and deployments.With the analytical tools provided in this work, the same optimization can be performed for any other values of these parameters.
In summary, with the HDS scheme we are able to replace the reference signals with a differential data stream, which in most of the analyzed scenarios provides a higher throughput than what can be achieved with a coherent system based on PSAM.Other alternatives such as ST perform worse when there is a mild frequency selectivity or channel variability.Then, this work contributes to the improvement of the spectral efficiency of massive MIMO-OFDM systems, which is crucial for the evolution of wireless communications.
It is worth noting that the NCDS stream can be used, for example, to establish a low latency service in parallel with the regular high data rate application provided by the CDS, since the detected NCDS data stream can be quickly forwarded to upper protocol layers while the channel is estimated for its use to equalize the CDS data stream.Therefore, it is possible to use this system for a single service with enhanced throughput or to multiplex two parallel services at the physical level.

APPENDIX A DERIVATION OF SER AND BER OF NON-COHERENTLY DETECTED M N -DPSK COMBINED WITH MASSIVE MIMO
From Eq. ( 16), we can characterize the statistical distribution of the received symbol.Without loss of generality and to ease the notation, we focus on a particular subcarrier k and time instant n.According to [15], each term of ( 16) is independent to each other and ( 16) can be redefined as where The three terms (z n k ) g,l described in (37) are composed of the sum of the product of independent complex circularly symmetric Gaussian variables.Focusing on the first one, each term of (z n k ) g,1 can be decomposed as According to [27], the PDF of the product of two Gaussian random variables with the same parameters is proportional to the Modified Bessel function of the second kind and zero-th order [28].Hence, both real and imaginary parts of (42) are a sum of 2R independent product terms, and assuming the number of antennas at the BS (R) is large enough, the central limit theorem (CLT) [29] can be used.As shown in [30], and due to the independence among the different terms of (z n k ) g , its distribution can be approximated by According to [31], the second term of (37) is a onedimensional random variable that follows a Gamma distribution, and it can also be approximated as an one-dimensional Gaussian distribution when the number of terms in the sum is large enough given by Hence, the distribution of (z n k ) r is given by Note that the distortion produced by (z n k ) r only affects the amplitude of the received signal.Then, the interference and noise terms can be characterized as the sum of a complex circularly symmetric Gaussian variable affecting both real and imaginary parts, and a unidimensional Gaussian variable affecting only the amplitude.
To calculate the SER for the M N -DPSK, it is sufficient to calculate the error probability for one symbol, due to the symmetry in the distribution of the symbols over the complex plane.Therefore, and without loss of generality, we take the symbol placed over the positive part of the real axis (x, y) = (1, 0).We now use two variables x and y to define the real and imaginary axis of the complex plane C .Then, we can approximate the distribution of z n k with real and imaginary parts as Due to the independence of the real and imaginary parts of z n k , the bidimensional PDF of z n k (f (R{z n k }, I{z n k })) can be computed as the product of both parts, distributed according to (48) and ( 49 Finally, according to [21], the BER of a M N -DPSK symbol in NCDS can be approximated with a lower bound as P b,N ≈ 2 log 2 (M N ) P s,N . (53)

APPENDIX B ANALYSIS OF THE MSE OF THE DIFFERENTIAL DATA-AIDED CHANNEL ESTIMATION
Given (21), the obtained differential symbol at the k-th position is given by The first differential symbol of the block has the same performance as PSAM since the reference symbol is known.Moreover, the differential symbols x n k belong to the same constellation as the complex data symbols q n k with possibly an additional phase rotation produced by the phase of the first reference symbol.
The channel estimation error due to the wrong decision on the symbols transmitted by the NCDS was shown in (26) to depend on α and can be defined as cos ∠ x n k (i t ) − ∠ x n k (i r ) × × P x r |x t x r = x n k (i r )|x t = x n k (i t ) P x t x t = x n k (i t ) , (55) where x n k (i t ) and x n k (i r ) denote that the transmitted and decided differential symbols correspond to the i t -th and i r -th element of the constellation, respectively.According to (54) and inspecting (55), we realize that each differential symbol of the constellation ( x n k (i r )) can be produced by different combinations of the decided symbols ( q n k ).Hence, let us define the following transmission and reception vector indices as and they must satisfy that i−L K c t,i−1 , (k, n) ∈ A p , (58) x n k (i r ) = p n k n min k i=2 q n i−L K c r,i−1 , (k, n) ∈ A p .(59) Therefore, the second term of the product given in (55) can be expressed as Given the analytical expression of α, it can be solved by numerical computation.However, we provide the upper and lower bounds of α in order to provide some insights.Given (55), we can split it as × P x r |x t x r = x n k (i r )|x t = x n k (i t ) P x t x t = x n k (i t ) , (63) where α 1 accounts for the probability of taking correct decisions in a block of K p differential symbols, and α 2 corresponds to the average probability of taking wrong decisions weighted by the cosine of the angular distance of the symbols in one block of K p differential symbols.Hence, the lower-bound of α corresponds to taking correct decisions of all transmitted symbols by the NCDS, which is the same case as PSAM.Hence Inspecting ( 63) and taking into account that the constellation is DPSK, we have that α 2 > 0 since it is more likely to wrongly choose those symbols that belong to the contiguous decision regions from the correct one (which corresponds to low phase difference).Therefore, after performing the sum over all the possible values of the differential symbol, the positive terms (cos(∠(x n k (i t ))−∠( x n k (i r ))) > 0 ) have a higher probability than the negative ones.Hence, the upper-bound of α is given by and α 1 can be obtained as where P s,N is the probability of the symbol error of the NCDS, which is computed in the Appendix A. Hence, α can be bounded as α ≤ 0, P s,N = 0

FIGURE 1 .
FIGURE 1. SIMO-OFDM scenario, where a BS equipped with R antennas is serving a particular single-antenna user.

FIGURE 2 .
FIGURE 2. Example of a unit block for the proposed HDS, where K p = 6 and N p = 4.The green boxes are data modulated by CDS, the yellow boxes are data modulated by NCDS and "p" denotes a reference symbol.

FIGURE 3 .
FIGURE 3. BER of NCDS with different values of R and M N .

FIGURE 4 .
FIGURE 4. MSE of the channel estimation for R = 64, M N = 16, N p = 4 and different values of K p .

FIGURE 5 .
FIGURE 5. Throughput comparison of CDS, HDS, ST and NCDS for different constellation sizes, R = 64, K p = 6 and N p = 7.

TABLE 3 .
show the throughput evaluation of the HDS calculated for M C = 16 and M C = 64, respectively, where R = 64, K p = 6 and N p = 7.The maximum throughput values are highlighted in bold letters, and their corresponding optimal values of M N and M C depend on the TABLE 2. Throughput for HDS (10 3 × packets/second) for increasing ρ and increasing M N for R = 64, K p = 6, N p = 7 and M C = 16.Throughput for HDS (10 3 × packets/second) for increasing ρ and increasing M N for R = 64, K p = 6, N p = 7 and M C = 64.

TABLE 4 .TABLE 5 .
Throughput comparison among CDS, HDS and ST (10 3 × packets/second) for ρ = 5, R = 64, M N = 8 and M C = 16.Increasing K p and N p means increasing the maximum supported values of delay and Doppler spread.Percentage improvement of the throughput for the HDS with respect to the CDS; same parameters as in Table

FIGURE 6 .
FIGURE 6. of the chosen scheme for different scenarios.

FIGURE 7 .
FIGURE 7. Throughput comparison of CDS and HDS for different constellation sizes, K p = 6 and N p = 7.

− 2 σ√ 2 2 −r 2 2σ
), resulting in f R z n k , I z n k = f (x, y) = e x σ y 2π .(50)To simplify the integration to calculate the SER, we integrate the PDF in polar coordinates, for which we have to make the following change of variables (x = r cos(γ ) and y = r sin(γ )) asf (r, φ) = e − r cos(γ )−μx σx sin(γ )−μy σy √ x σ y 2π .(51)Hence, the SER probability of the q n k is given byP s,N = 1 − D y 2π ,(52)where D denotes the decision region for the particular symbol of interest, which in the selected case (x, y) = (1, 0) is defined as D ∈ [−π/M N , π/M N ].