Channel Estimation for Quantized Systems Based on Conditionally Gaussian Latent Models

This work introduces a novel class of channel estimators tailored for coarse quantization systems. The proposed estimators are founded on conditionally Gaussian latent generative models, specifically Gaussian mixture models (GMMs), mixture of factor analyzers (MFAs), and variational autoencoders (VAEs). These models effectively learn the unknown channel distribution inherent in radio propagation scenarios, providing valuable prior information. Conditioning on the latent variable of these generative models yields a locally Gaussian channel distribution, thus enabling the application of the well-known Bussgang decomposition. By exploiting the resulting conditional Bussgang decomposition, we derive parameterized linear minimum mean square error (MMSE) estimators for the considered generative latent variable models. In this context, we explore leveraging model-based structural features to reduce memory and complexity overhead associated with the proposed estimators. Furthermore, we devise necessary training adaptations, enabling direct learning of the generative models from quantized pilot observations without requiring ground-truth channel samples during the training phase. Through extensive simulations, we demonstrate the superiority of our introduced estimators over existing state-of-the-art methods for coarsely quantized systems, as evidenced by significant improvements in mean square error (MSE) and achievable rate metrics.


I. INTRODUCTION
Massive multiple-input multiple-output (MIMO) and millimeter wave (mmWave) systems help meet the ever-increasing requirements on bandwidth and throughput in wireless communications. However, deploying a large number of high-precision analog-to-digital converters (ADCs) for each antenna's radio frequency (RF) chain with bandwidths sufficient for mmWave systems is unaffordable in terms of cost and power consumption [1], [2]. One of the most direct and promising ways to resolve the power consumption bottleneck and achieve high energy efficiency is to use low-resolution ADCs at the base station (BS). In recent years, considerable research efforts have been devoted to analyzing the performance of low-resolution quantization systems [3]-[6]. Remarkably, although low-resolution quantization causes nonlinear distortions at the receiver, the capacity is not severely reduced, especially at low signal-to-noise ratios (SNRs) [7].
In order to realize these favorable characteristics in practical low-resolution systems of the next generation of cellular systems (6G), accurate channel estimation is a crucial task. However, the severe nonlinearity of the ADCs degrades the performance of conventional channel estimation algorithms [1], [2]; thus, it is necessary to design novel channel estimators that provide good performance together with reasonable complexity and robustness in quantized systems.
Various channel estimation algorithms for coarsely quantized systems have been proposed in recent years. In [8], least squares (LS) estimation is considered, which is computationally simple but results in rather poor estimation quality. An iterative channel estimation technique based on the expectation-maximization (EM) algorithm is proposed in [9], which exhibits limitations due to high complexity and convergence to local optima. Iterative maximum likelihood (ML) methods are investigated in [10], [11]; however, they typically require a large number of pilot signals, resulting in an unaffordable signaling overhead [2]. In [12]-[14], joint channel estimation and decoding is investigated, where payload data is used to assist in channel estimation. Due to the iterative nature of the approaches, these methods are considered to have too high complexity for commercial massive MIMO systems [2]. The works in [15]-[17] take into account the sparsity of wireless channels and utilize compressive sensing (CS) approaches such as iterative hard thresholding and generalized approximate message passing (GAMP); their main disadvantages are the sensitivity to the (estimated) sparsity level and the high complexity of the iterative procedure.
Minimum mean square error (MMSE) channel estimation approaches in the case of a Gaussian channel with a diagonal covariance matrix are studied in [18], [19]; however, the MMSE estimator has no closed form in the general case and is intractable to compute [20]. In fact, even the linear MMSE estimator generally has no closed-form solution. Only in the special case of a jointly Gaussian quantizer input can the linear MMSE channel estimator be computed efficiently, based on Bussgang's theorem [21] or the additive quantization noise model (AQNM) [22], which is a special case of the Bussgang decomposition tailored to quantization [23]. In [24], [25], the Bussgang estimator is derived for the one-bit as well as the multi-bit quantization case. On the one hand, it has the advantage that only a few pilot observations are necessary to achieve a good channel estimation quality. On the other hand, the estimator's applicability is seriously limited by the prerequisite of a Gaussian distributed channel with known second-order statistics.
Another recent branch of channel estimation techniques in quantized systems, based on deep learning, was investigated in [26]-[29]. Although providing good performance for low numbers of pilots, these approaches generally lack generalization ability with respect to different numbers of quantization bits, pilot signals, antennas, and SNRs. Additionally, accumulating a large representative training dataset consisting of perfect channel state information (CSI) samples is necessary, which may be costly to acquire in practical systems, especially with coarse quantization. In this paper, we address this major issue by deriving adapted training procedures for our models that enable learning directly from quantized training data without requiring any ground-truth CSI in the whole training phase.
In recent works, conditionally Gaussian latent generative models were utilized to learn the underlying unknown channel distribution of a radio propagation environment and to leverage this prior information for designing wireless communication functionalities [30], especially channel estimators for high-resolution systems [31]-[36]. As mentioned before, the highly nonlinear distortion of low-resolution ADCs makes it impractical to directly utilize these channel estimation techniques in coarsely quantized systems. We extend this class of channel estimators based on conditionally Gaussian latent generative models by establishing a novel connection between these models and the linear MMSE estimator through a newly proposed conditional version of the Bussgang decomposition.
Our contributions are concisely summarized as follows.
1) We establish a theoretical foundation for extending the conventional Bussgang estimator to arbitrary channel probability density functions (PDFs) by introducing conditional events such that the channel becomes conditionally Gaussian. Through this, we derive an approximation of the mean square error (MSE)-optimal conditional mean estimator (CME) via a conditional Bussgang estimator, which is the conditional linear MMSE estimator. To find the conditional events of interest, we establish a novel connection to the topic of variational inference (VI), i.e., conditionally Gaussian latent models.
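2 )dx and the error function is given as $\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2)\,\mathrm{d}t$. We denote the indicator function as $\chi(x \in \mathcal{A})$, which returns one if $x \in \mathcal{A}$ and zero otherwise. The $n$th entry of a vector is denoted by $[\mathbf{x}]_n$. Conditional cross-covariance matrices are denoted as $\mathbf{C}_{xy|c}$, including the unconditional case $\mathbf{C}_{xy}$, and auto-covariance matrices are abbreviated as $\mathbf{C}_{x|c} = \mathbf{C}_{xx|c}$. The diagonals and off-diagonals of a matrix are denoted by $\operatorname{diag}(\mathbf{A})$ and $\operatorname{nondiag}(\mathbf{A}) = \mathbf{A} - \operatorname{diag}(\mathbf{A})$. We denote the Moore-Penrose inverse of a matrix $\mathbf{A}$ as $\mathbf{A}^{\dagger}$.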

II. PRELIMINARIES
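1) System Model: We consider the uplink transmission of $P$ pilot signals from a single-antenna mobile terminal (MT) to an $N$-antenna BS, which operates ADCs with $B$ quantization bits. The quantized receive signal is therefore written as $\mathbf{R} = Q_B(\mathbf{Y}) = Q_B(\mathbf{h}\mathbf{a}^{\mathrm{T}} + \mathbf{N})$, where $\mathbf{R} = [\mathbf{r}_1, \dots, \mathbf{r}_P] \in \mathbb{C}^{N \times P}$ contains the $P$ quantized receive signals as columns, $\mathbf{Y} \in \mathbb{C}^{N \times P}$ describes the unquantized receive signal, $\mathbf{h} \in \mathbb{C}^{N}$ denotes the wireless channel, $\mathbf{a} \in \mathbb{C}^{P}$ is the pilot vector, which fulfills the power constraint $\|\mathbf{a}\|_2^2 = P$, $\mathbf{N} = [\mathbf{n}_1, \dots, \mathbf{n}_P] \in \mathbb{C}^{N \times P}$ is additive white Gaussian noise (AWGN) with $\mathbf{n}_i \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \sigma^2 \mathbf{I})$, and $Q_B$ denotes the $B$-bit quantization function, which is discussed below. By columnwise vectorization, the system model can be written as
$$\mathbf{r} = Q_B(\mathbf{y}) = Q_B(\mathbf{A}\mathbf{h} + \mathbf{n}) \tag{1}$$
with $\mathbf{r} = \operatorname{vec}(\mathbf{R})$, $\mathbf{y} = \operatorname{vec}(\mathbf{Y})$, $\mathbf{n} = \operatorname{vec}(\mathbf{N})$, and $\mathbf{A} = \mathbf{a} \otimes \mathbf{I}$.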
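The vectorization step can be checked numerically. The following is a minimal NumPy sketch, not part of the original derivation; the dimensions, the noise variance, and the unit-modulus placeholder pilots are illustrative assumptions. It builds $\mathbf{Y} = \mathbf{h}\mathbf{a}^{\mathrm{T}} + \mathbf{N}$ and verifies that its columnwise vectorization equals $\mathbf{A}\mathbf{h} + \mathbf{n}$ with $\mathbf{A} = \mathbf{a} \otimes \mathbf{I}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, sigma2 = 4, 3, 0.1   # antennas, pilots, noise variance (illustrative values)

def crandn(*shape, var=1.0):
    """Circularly symmetric complex Gaussian samples with variance `var`."""
    return np.sqrt(var / 2) * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))

h = crandn(N)                                   # channel vector
a = np.exp(1j * rng.uniform(0, np.pi / 2, P))   # placeholder pilots, |a_p| = 1 so ||a||^2 = P
noise = crandn(N, P, var=sigma2)                # AWGN matrix

Y = np.outer(h, a) + noise                      # unquantized receive signal Y = h a^T + N
A = np.kron(a[:, None], np.eye(N))              # A = a (kron) I, shape (N*P, N)

y = Y.flatten(order="F")                        # columnwise vectorization vec(Y)
n = noise.flatten(order="F")
vec_identity_holds = np.allclose(y, A @ h + n)  # vec(h a^T + N) = A h + n
```

Stacking the columns of $\mathbf{Y}$ in this order is what makes the Kronecker factor $\mathbf{a}$ appear on the left of $\mathbf{I}$.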
Note that an extension to a multi-user setup can, in principle, be straightforwardly achieved by stacking the channels of all users, cf., e.g., [24]; however, the analysis of multi-user systems is out of the scope of this work. By normalizing the channels as $\mathrm{E}[\|\mathbf{h}\|_2^2] = N$, the SNR of the quantizer input is defined as $\mathrm{SNR} = 1/\sigma^2$.
Typically, several pilot observations are required to achieve reasonable channel estimation performance in coarsely quantized systems. For the case of one-bit quantization, it is shown in [20] that a pilot sequence with equidistant phase shifts in the range $[0, \frac{\pi}{2})$ is MSE-optimal with respect to the CME for jointly Gaussian inputs in the asymptotic high SNR regime. In the general case, the optimal pilot sequence depends on many system parameters and is thus intractable to derive. Therefore, in this work, we consider pilots that have an equidistant spacing in both the amplitude and the angle, cf. (2). In order to fulfill the power constraint $\|\mathbf{a}\|_2^2 = P$, the unnormalized pilot vector $\tilde{\mathbf{a}}$ is normalized as $\mathbf{a} = \frac{\sqrt{P}}{\|\tilde{\mathbf{a}}\|_2} \tilde{\mathbf{a}}$. The choice in (2) has proven robust across all scenarios considered in our simulations. However, the presented class of estimators is independent of the specific pilot sequence and can be deployed with any other desired pilot sequence.
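The normalization step can be illustrated with a short NumPy sketch. The unnormalized pilots below are a hypothetical example with equidistant amplitudes and phases; the amplitude spacing of 0.25 and the function name `normalize_pilots` are assumptions made for illustration and are not the sequence defined in (2).

```python
import numpy as np

def normalize_pilots(a_tilde):
    """Scale an unnormalized pilot vector so that ||a||_2^2 equals its length P."""
    P = a_tilde.shape[0]
    return np.sqrt(P) / np.linalg.norm(a_tilde) * a_tilde

P = 4
# Hypothetical unnormalized pilots: equidistant amplitudes, equidistant phases in [0, pi/2).
amplitudes = 1.0 + 0.25 * np.arange(P)      # illustrative amplitude spacing of 0.25
phases = np.arange(P) * (np.pi / 2) / P
a_tilde = amplitudes * np.exp(1j * phases)

a = normalize_pilots(a_tilde)               # satisfies the power constraint ||a||^2 = P
```

The scaling is real and positive, so it changes only the amplitudes and leaves the phase pattern of the pilots untouched.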
2) Quantizer Design: In this work, we consider scalar quantizers, i.e., the quantization is performed elementwise on the input, where the real and imaginary parts are quantized independently. The quantizer can be described by means of the $2^B$ quantization labels $\ell_i$, $i \in \{1, \dots, 2^B\}$, and the quantization thresholds $\tau_i$, $i \in \{0, \dots, 2^B\}$, where $\tau_0 = -\infty$ and $\tau_{2^B} = \infty$ by definition. The quantization function of the real/imaginary part of the signal can be denoted as
$$Q_B(x) = \ell_i \quad \text{if } x \in [\tau_{i-1}, \tau_i). \tag{3}$$
For the case of one-bit quantization $B = 1$, the quantization function of the complex-valued signal $y$ can be expressed via the signs of its real and imaginary parts, cf. (4). Practicable ADCs usually have uniformly spaced quantization thresholds with a constant step size $\Delta$, which depends on the input distribution, and the quantization labels are placed in the middle of two quantization thresholds. For the case of zero-mean Gaussian input with variance one, the optimal values for $\Delta$ are computed numerically in [37].
Under the assumption that the elementwise quantizer input is zero-mean Gaussian distributed with variance $1 + \sigma^2$, following the considered SNR definition, we choose the SNR-dependent step size $\Delta$ as suggested in [38], cf. (5), by scaling $\Delta^*$, the step size for the standard Gaussian input [37].
Although the quantizer input is generally not Gaussian distributed, this choice gives a reasonable performance with regard to practical feasibility. The necessary scaling in (5) is resolved by automatic gain control in practice. We note that there exist more sophisticated quantizer designs, e.g., non-uniform scalar quantization [37], [39]; however, the uniform quantizer is considered the most practicable choice in wireless communications [5]. Furthermore, the channel estimation techniques proposed in this work can be straightforwardly extended to non-uniform quantization.
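The uniform scalar quantizer described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the thresholds are placed symmetrically around zero, the outermost labels extend the uniform label grid, and the function names and step sizes are illustrative rather than taken from the original.

```python
import numpy as np

def uniform_quantize_real(x, B, delta):
    """Uniform quantizer with 2**B labels and step size delta (real-valued input)."""
    L = 2 ** B
    # Finite thresholds tau_1..tau_{L-1}; tau_0 = -inf and tau_L = +inf are implicit.
    thresholds = delta * (np.arange(1, L) - L / 2)
    # Labels lie in the middle of adjacent thresholds (outer labels extend the grid).
    labels = delta * (np.arange(L) - (L - 1) / 2)
    idx = np.searchsorted(thresholds, x, side="right")  # number of thresholds <= x
    return labels[idx]

def uniform_quantize(y, B, delta):
    """Quantize the real and imaginary parts independently."""
    return uniform_quantize_real(np.real(y), B, delta) + 1j * uniform_quantize_real(np.imag(y), B, delta)

# One-bit example with delta = sqrt(2): outputs lie in {(+-1 +- 1j) / sqrt(2)}.
q1 = uniform_quantize(np.array([0.3 - 0.2j, -5.0 + 5.0j]), B=1, delta=np.sqrt(2))
# Two-bit example with delta = 1 on real inputs.
q2 = uniform_quantize_real(np.array([-2.0, -0.5, 0.2, 3.0]), B=2, delta=1.0)
```

Using `side="right"` assigns an input that falls exactly on a threshold to the upper cell, which matches half-open cells of the form $[\tau_{i-1}, \tau_i)$.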
3) Channel Models: We work with the 3rd Generation Partnership Project (3GPP) spatial channel model [40], [41], where channels are modeled as conditionally Gaussian: $\mathbf{h} \mid \boldsymbol{\delta} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \mathbf{C}_{\delta})$. The random vector $\boldsymbol{\delta}$ collects the angles of arrival/departure and path gains of the main propagation clusters between an MT and the BS. The main angles are drawn independently and uniformly from the interval $[0, 2\pi]$; the path gains are also drawn uniformly and are subsequently normalized such that they sum up to one. The BS employs a uniform linear array (ULA) such that the spatial channel covariance matrix is given by (6). Here, $\mathbf{t}(\gamma) = [1, e^{j\pi \sin(\gamma)}, \dots, e^{j\pi (N-1) \sin(\gamma)}]^{\mathrm{T}}$ is the array steering vector for an angle of arrival $\gamma$, and $\omega$ is a power density consisting of a sum of weighted Laplace densities whose standard deviations describe the angle spread of the propagation clusters [40]. For every channel sample, we generate random angles and path gains, combined in $\boldsymbol{\delta}$, and then draw the sample as $\mathbf{h} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \mathbf{C}_{h|\delta})$, which results in an overall non-Gaussian channel distribution [31]. Note that the conditional Gaussianity of the channel model is not connected to the conditional Gaussianity of the proposed latent models since the inference of (6) from a single snapshot is intractable.
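To make the sampling procedure more concrete, the sketch below builds a covariance matrix for a single propagation cluster as a discretized, Laplace-weighted sum of steering-vector outer products and then draws one channel sample. It is a simplified illustration and not the exact expression in (6): the single cluster, the integration window of $\pm\pi/2$ around the cluster center, the grid size, and the function names are all assumptions.

```python
import numpy as np

def steering_vector(gamma, N):
    """ULA steering vector t(gamma) = [1, e^{j pi sin(gamma)}, ..., e^{j pi (N-1) sin(gamma)}]^T."""
    return np.exp(1j * np.pi * np.arange(N) * np.sin(gamma))

def cluster_covariance(center, spread, N, num_points=2001):
    """Riemann-sum approximation of the integral of w(gamma) t(gamma) t(gamma)^H."""
    gammas = np.linspace(center - np.pi / 2, center + np.pi / 2, num_points)
    b = spread / np.sqrt(2)                       # Laplace scale for a given standard deviation
    w = np.exp(-np.abs(gammas - center) / b)
    w /= w.sum()                                  # normalize the discretized power density
    T = np.stack([steering_vector(g, N) for g in gammas], axis=1)   # shape (N, num_points)
    return (T * w) @ T.conj().T

rng = np.random.default_rng(0)
N = 8
C = cluster_covariance(center=0.3, spread=np.deg2rad(5.0), N=N)

# Draw h ~ N_C(0, C) through the eigendecomposition (robust for near-singular C).
evals, evecs = np.linalg.eigh(C)
white = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
h = evecs @ (np.sqrt(np.clip(evals, 0.0, None)) * white)
```

Because a new covariance is generated for every sample in the channel model, the marginal distribution of the samples is a continuous mixture of Gaussians and hence non-Gaussian.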
To ensure a broader evaluation of different channel models, version 2.4 of the QuaDRiGa channel simulator [42], [43] is used to generate channel samples. We simulate an urban macrocell scenario at a center frequency of 6 GHz. The BS's height is 25 meters, and it covers a 120° sector. The distances between the MTs and the BS are in the range of 35-500 meters. We either consider a pure line-of-sight (LOS) scenario or a mixed LOS/non-line-of-sight (NLOS) scenario, where in 80% of the cases, the MTs are located indoors at different floor levels, whereas the MTs' height is 1.5 meters in the case of outdoor locations. The BS is equipped with a ULA with $N$ "3GPP-3D" antennas, and the MTs employ an omnidirectional antenna. The generated channels are post-processed to remove the effective path gain [43, Sec. 2.7].

4) Training Datasets:
In this work, we consider the channel distribution to be unknown and arbitrarily complex, as reflected by the sophisticated channel models, cf. Section II-3. However, we assume the availability of a representative dataset comprising samples that stem from the respective channel model. In practice, this means that data samples from the respective BS cell are available. In this work, we discuss two different setups. First, we assume the availability of a training dataset consisting of $T$ ground-truth channel samples $\mathcal{H} = \{\mathbf{h}_t\}_{t=1}^{T}$. This can be achieved in practice via measurement campaigns or digital twins, e.g., ray tracing. Afterward, we consider a training dataset consisting solely of noisy and quantized pilot observations $\mathcal{R} = \{\mathbf{r}_t\}_{t=1}^{T}$. In this case, pilot observations from the regular BS operation can be utilized, and no ground-truth channel samples are needed.

III. BUSSGANG ESTIMATOR
In this section, we briefly review the linear MMSE estimator based on the Bussgang decomposition, which is a direct consequence of Bussgang's theorem [21]. As stated above, an equivalent derivation can be done via the AQNM, cf. [23]. Although the Bussgang decomposition generally exists, the Bussgang linear MMSE estimator is analytically tractable only if the channel and noise follow a zero-mean Gaussian distribution [24]. However, even if this is generally not true, an approximation to the linear MMSE estimator, assuming the channel is zero-mean Gaussian, is a reasonable baseline for channel estimation in quantized systems.
In particular, under the assumption of a jointly zero-mean Gaussian quantizer input, the Bussgang decomposition implies that the system in (1) can be written as a linear combination of the desired signal part and an uncorrelated distortion $\mathbf{q}$ as
$$\mathbf{r} = Q_B(\mathbf{y}) = \mathbf{B}\mathbf{A}\mathbf{h} + \mathbf{q} \tag{7}$$
where $\mathbf{B}$ is the Bussgang gain that can be obtained from the linear MMSE estimation of $\mathbf{r}$ from $\mathbf{y}$ as $\mathbf{B} = \mathbf{C}_{ry} \mathbf{C}_{y}^{-1}$, cf. [44, Sec. 9.2], and where the distortion term $\mathbf{q} = \mathbf{B}\mathbf{n} + \boldsymbol{\eta}$ contains both the AWGN $\mathbf{n}$ and the quantization noise $\boldsymbol{\eta}$. The Bussgang gain matrix for a uniform quantizer with jointly Gaussian input is derived in [45] and is given in (8) in terms of $\mathbf{D}_y = \operatorname{diag}(\mathbf{C}_y)$. In the case of one-bit quantization $B = 1$, by choosing $\Delta = \sqrt{2}$, we get the well-known solution
$$\mathbf{B} = \sqrt{\frac{2}{\pi}}\, \mathbf{D}_y^{-\frac{1}{2}}. \tag{9}$$
As the statistically equivalent model (7) is linear, one can formulate the linear MMSE estimator
$$\hat{\mathbf{h}} = \mathbf{C}_{hr} \mathbf{C}_{r}^{-1} \mathbf{r}. \tag{10}$$
The cross-correlation matrix between the channel and the received signal is calculated as $\mathbf{C}_{hr} = \mathrm{E}[\mathbf{h}(\mathbf{B}\mathbf{A}\mathbf{h} + \mathbf{q})^{\mathrm{H}}] = \mathbf{C}_h \mathbf{A}^{\mathrm{H}} \mathbf{B}^{\mathrm{H}}$, which follows from the fact that the noise term $\mathbf{q}$ is uncorrelated with the channel $\mathbf{h}$, see [24, Appendix A]. Note that this property only holds in the case of a jointly Gaussian quantizer input. For the one-bit quantization case, the autocorrelation matrix is equal to the covariance matrix $\mathbf{C}_r$ due to the elimination of the amplitude information and can be calculated in closed form via the so-called arcsine law [46], cf. (11). Unfortunately, for the multi-bit quantization case, no closed-form expression for $\mathbf{C}_r$ exists. Besides that, the computation of the variances after the quantization is no longer trivial. As shown in [5, eq. (2.14)] (adapted for the complex-valued case), for Gaussian input and the uniform quantizer, the variances are computed via (12), which involves the Gaussian CDF. Although there exist various practicably feasible approximations for the evaluation of the involved Gaussian CDF, cf. [47], [48], the evaluation of (12) may still be problematic in time-critical systems; thus, a reasonable approximation is used in this work. By assuming that the signal's variance does not change for different antennas, the Bussgang gain becomes a scaled identity of the form $\mathbf{B} = \rho \mathbf{I}$, cf. (8). By further neglecting cross-correlations of the quantization distortion, the quantized covariance matrix $\mathbf{C}_r$ is well approximated, especially in the low SNR regime, by (13), cf. [3]. Importantly, the resulting covariance matrix $\mathbf{C}_r$ remains positive semi-definite (PSD) if $0 \leq \rho^2 \leq 1$. We note that the expression for $\mathbf{C}_r$ with respect to $\mathbf{C}_y$ generally depends on the quantizer choice and the input distribution, and useful approximations can be found differently, cf., e.g., [4]. However, the design of the channel estimation algorithms in this work is not founded upon the choice in (13), and different approximations can be utilized.
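For the one-bit case, the complete estimator can be sketched in a few lines of NumPy. The sketch assumes the standard one-bit expressions from the literature, cf. [24]: the gain $\sqrt{2/\pi}\,\operatorname{diag}(\mathbf{C}_y)^{-1/2}$ and the arcsine law applied separately to the real and imaginary parts of the normalized covariance, together with a quantizer that outputs $(\pm 1 \pm j)/\sqrt{2}$. The function name, the random channel covariance, and the placeholder pilots are illustrative.

```python
import numpy as np

def bussgang_lmmse_one_bit(r, A, C_h, sigma2):
    """One-bit Bussgang LMMSE estimate h_hat = C_h A^H B^H C_r^{-1} r."""
    C_y = A @ C_h @ A.conj().T + sigma2 * np.eye(A.shape[0])
    d = np.real(np.diag(C_y))
    B = np.sqrt(2 / np.pi) * np.diag(d ** -0.5)              # one-bit Bussgang gain
    # Arcsine law on the normalized correlation matrix of y (clipping guards rounding).
    C_norm = C_y / np.sqrt(np.outer(d, d))
    C_r = (2 / np.pi) * (np.arcsin(np.clip(C_norm.real, -1.0, 1.0))
                         + 1j * np.arcsin(np.clip(C_norm.imag, -1.0, 1.0)))
    C_hr = C_h @ A.conj().T @ B.conj().T
    return C_hr @ np.linalg.solve(C_r, r), B, C_r

rng = np.random.default_rng(1)
N, P, sigma2 = 4, 2, 0.5
G = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
C_h = G @ G.conj().T
C_h *= N / np.real(np.trace(C_h))                 # normalize so that E[||h||^2] = N
a = np.exp(1j * np.array([0.0, np.pi / 4]))       # placeholder pilots with ||a||^2 = P
A = np.kron(a[:, None], np.eye(N))

h = np.linalg.cholesky(C_h) @ (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
n = np.sqrt(sigma2 / 2) * (rng.standard_normal(N * P) + 1j * rng.standard_normal(N * P))
y = A @ h + n
r = (np.sign(y.real) + 1j * np.sign(y.imag)) / np.sqrt(2)    # one-bit quantization, delta = sqrt(2)

h_hat, B, C_r = bussgang_lmmse_one_bit(r, A, C_h, sigma2)
```

Solving the linear system with `np.linalg.solve` avoids forming $\mathbf{C}_r^{-1}$ explicitly; in a deployed estimator the filter $\mathbf{C}_{hr}\mathbf{C}_r^{-1}$ would typically be precomputed once per covariance.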

IV. PARAMETERIZED BUSSGANG CHANNEL ESTIMATORS
The prerequisite of the Bussgang estimator in Section III that the quantizer input is jointly Gaussian imposes a severe limitation. This becomes especially evident when considering the channel distribution of a whole BS cell, which is strongly shaped by the propagation environment and is generally poorly represented by a simple Gaussian distribution. Although the Bussgang decomposition in principle also exists for the non-Gaussian case, the corresponding Bussgang gain matrix is non-diagonal and not analytically tractable, which, in turn, yields no analytic solution for the linear MMSE estimator [23]. This motivates us to utilize the powerful concept of conditional Gaussianity, which is already used for channel estimation in high-resolution systems [31]-[36]. In the following Lemma 1, we establish the theoretical foundation for the estimation framework in quantized systems through a conditional Bussgang decomposition.

Lemma 1. Consider the system model (1) with a uniform quantizer $Q_B(\cdot)$ and let $\mathbf{h} \sim p(\mathbf{h})$ with an arbitrary PDF $p(\mathbf{h})$. Let $c$ be a conditional event independent of $\mathbf{n}$ such that
$$\mathbf{h} \mid c \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \mathbf{C}_{h|c}) \tag{14}$$
and $\mathbf{C}_{y|c} = \mathbf{A}\mathbf{C}_{h|c}\mathbf{A}^{\mathrm{H}} + \sigma^2 \mathbf{I}$. Then, there exists a unique conditional Bussgang gain $\mathbf{B}_c$ such that (1) can be decomposed as the statistically equivalent model
$$\mathbf{r} = \mathbf{B}_c \mathbf{A}\mathbf{h} + \mathbf{B}_c \mathbf{n} + \boldsymbol{\eta} \tag{15}$$
where $\boldsymbol{\eta}$ and $\mathbf{h}$ are conditionally uncorrelated given $c$, with the following properties: the diagonal entries of $\mathbf{C}_{r|c}$ are computed via (12), and $\mathbf{C}_{r|c}$ is well approximated by (13) for $B > 1$, where $\mathbf{C}_y$ is substituted with $\mathbf{C}_{y|c}$ in (8), (9), and (11)-(13). Further, the conditional linear MMSE estimator of $\mathbf{h}$ given $\mathbf{r}$ and $c$ is computed as
$$\hat{\mathbf{h}}_c(\mathbf{r}) = \mathbf{C}_{h|c} \mathbf{A}^{\mathrm{H}} \mathbf{B}_c^{\mathrm{H}} \mathbf{C}_{r|c}^{-1} \mathbf{r}. \tag{16}$$
Proof: See Appendix A.
The main result of Lemma 1 shows that the Bussgang estimator from Section III can be extended to arbitrary channel PDFs when finding conditional events such that the channel becomes (conditionally) zero-mean Gaussian.Although it is a promising result, finding such conditional events is generally highly non-trivial, especially because the conditional event must not be a function of the observation, as this would introduce a dependency with respect to the noise realization.
However, when modeling the conditional event $c$ as a latent random variable with a prior distribution $p(c)$ of choice and enforcing (14) via a parameterized distribution, aiming to maximize the likelihood $p(\mathbf{h})$ with a given dataset, we end up in the classical framework of VI [49, Ch. 10] in the special case of conditionally Gaussian latent variable models. The Bayesian modeling of the conditional event further allows us to approximate the MSE-optimal CME, which is generally intractable [20], as
$$\mathrm{E}[\mathbf{h} \mid \mathbf{r}] = \int \mathrm{E}[\mathbf{h} \mid \mathbf{r}, c]\, p(c \mid \mathbf{r})\, \mathrm{d}c \tag{17}$$
$$\approx \int \hat{\mathbf{h}}_c(\mathbf{r})\, p(c \mid \mathbf{r})\, \mathrm{d}c \tag{18}$$
where in (17) we use the law of total expectation and in (18) we approximate the conditional mean $\mathrm{E}[\mathbf{h} \mid \mathbf{r}, c]$ by the linear MMSE estimate (16) derived in Lemma 1. Note that in (17), we have implicitly assumed that the latent variable is continuous, although discrete latent variables can equivalently be used, as shown later. Although (18) remains an approximation, the linear MMSE estimator is widely adopted for channel estimation, especially because it allows for a low-cost implementation due to the desirable linearity of the filter. The marginalization over the latent variable $c$ in (18) remains to be solved in a tractable manner by the VI formalism of choice, for which we discuss several variants in the remainder of this section, including the GMM, the MFA, and the VAE. Throughout this section, it is assumed that a training dataset $\mathcal{H}$ of ground-truth channel samples is available, cf. Section II-4.

A. GMM-based Bussgang Estimator
We start by deriving the GMM-based estimator, which parameterizes a componentwise Bussgang estimator. Generally, a GMM is a PDF of the form
$$p(\mathbf{h}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}_{\mathbb{C}}(\mathbf{h}; \boldsymbol{\mu}_{h|k}, \mathbf{C}_{h|k}) \tag{19}$$
where $K$ is the number of mixture components and $\{\pi_k, \boldsymbol{\mu}_{h|k}, \mathbf{C}_{h|k}\}_{k=1}^{K}$ is the set of parameters of the GMM, namely the mixing coefficients, the means, and the covariances of the Gaussian components. The parameters of the GMM are fitted via the EM algorithm for a given training dataset $\mathcal{H}$ of channel samples, cf. [49, Ch. 9]. An essential property of GMMs is that for a given data sample, the responsibility of each component can be computed as, cf. [49, Ch. 9], $p(k \mid \mathbf{h}) \propto \pi_k\, \mathcal{N}_{\mathbb{C}}(\mathbf{h}; \boldsymbol{\mu}_{h|k}, \mathbf{C}_{h|k})$. The GMM can be described via a discrete latent variable with a categorical distribution, which conditions on one of the $K$ components [49, Ch. 9] and, thus, yields a conditionally Gaussian latent variable model.
To ensure the validity of (14) in Lemma 1, we enforce the component means to be zero, i.e., $\boldsymbol{\mu}_{h|k} = \mathbf{0}$ for all $k \in \{1, \dots, K\}$. To reflect this constraint in the fitting process, the component means are set to zero in every M-step of the EM algorithm. Naturally, the zero-mean constraint reduces, to a certain extent, the GMM's ability to approximate the true underlying distribution. Nevertheless, since a feasible wireless channel distribution is considered to be zero-mean with a decreasing probability density towards higher amplitudes, cf., e.g., [40], the loss of accuracy of the model can be considered to be small. Moreover, restricting the component means prevents overfitting and allows modeling high-dimensional data [50].
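A zero-mean EM iteration can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: complex Gaussian components, no covariance regularization, and a toy two-component dataset with hand-picked initial values; the function names are not from the original. Keeping the means at zero amounts to skipping the mean update and computing each covariance as a responsibility-weighted sum of outer products.

```python
import numpy as np

def log_cgauss_zero_mean(H, C):
    """Log-density of N_C(0, C) evaluated at the rows of H."""
    N = H.shape[1]
    _, logdet = np.linalg.slogdet(C)
    quad = np.real(np.einsum("ti,ij,tj->t", H.conj(), np.linalg.inv(C), H))
    return -N * np.log(np.pi) - logdet - quad

def em_step_zero_mean(H, weights, covs):
    """One EM iteration for a GMM whose component means are fixed to zero."""
    T = H.shape[0]
    # E-step: responsibilities p(k | h_t), computed stably in the log domain.
    log_p = np.stack([np.log(w) + log_cgauss_zero_mean(H, C)
                      for w, C in zip(weights, covs)], axis=1)
    log_p -= log_p.max(axis=1, keepdims=True)
    resp = np.exp(log_p)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: update weights and covariances; the means stay zero by construction.
    Nk = resp.sum(axis=0)
    new_weights = Nk / T
    new_covs = [(H.T * resp[:, k]) @ H.conj() / Nk[k] for k in range(len(weights))]
    return new_weights, new_covs, resp

rng = np.random.default_rng(0)
T, N = 400, 2
scales = np.where(rng.random(T) < 0.5, 0.1, 5.0)          # toy data: two zero-mean components
H = np.sqrt(scales / 2)[:, None] * (rng.standard_normal((T, N)) + 1j * rng.standard_normal((T, N)))

weights = np.array([0.5, 0.5])
covs = [np.eye(N, dtype=complex), 4 * np.eye(N, dtype=complex)]
for _ in range(20):
    weights, covs, resp = em_step_zero_mean(H, weights, covs)
```

A practical implementation would add a convergence check and a small diagonal loading of the covariances to guard against degenerate components.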
After the GMM is trained, it is used for channel estimation similar to [31] but with multiple adaptations in order to take the quantization effect into account, as outlined in Lemma 1. We first note that if the channel distribution is modeled as a zero-mean GMM, the distribution of the unquantized receive signal $\mathbf{y}$ also follows a zero-mean GMM with covariances $\mathbf{C}_{y|k} = \mathbf{A}\mathbf{C}_{h|k}\mathbf{A}^{\mathrm{H}} + \sigma^2 \mathbf{I}$ for all $k \in \{1, \dots, K\}$, cf. (1). For each GMM component, we can apply the conditional Bussgang decomposition to find a statistically equivalent model, cf. (15) in Lemma 1:
$$\mathbf{r} = \mathbf{B}_k \mathbf{A}\mathbf{h} + \mathbf{q}_k \tag{20}$$
where $\mathbf{B}_k$ is the conditional Bussgang gain of component $k$, and $\mathbf{q}_k = \mathbf{B}_k \mathbf{n} + \boldsymbol{\eta}$. According to Lemma 1, the computation of the conditional Bussgang gain $\mathbf{B}_k$ is done via the closed-form solutions in (8) or (9), respectively. We note that the integral in (18) to approximate the CME simplifies to a sum because of the discrete latent variable of the GMM. In contrast to the high-resolution case, the evaluation of the discrete distribution $p(\mathbf{r} \mid k)$ is intractable in general since the cardinality of its discrete support increases exponentially in the number of dimensions [20]. Thus, in order to evaluate the responsibility for a given pilot signal, we assume that the quantized receive signal follows a zero-mean GMM distribution with the same second-order moments; this assumption effectively resembles approximate inference [49, Ch. 10]. The covariance matrix of component $k$, named $\mathbf{C}_{r|k}$, is thereby computed via (11) or (13) for the one- or multi-bit quantization case, respectively, by plugging in the component's unquantized covariance $\mathbf{C}_{y|k}$, cf. Lemma 1.
Thereby, in the case of a covariance matrix $\mathbf{C}_{h|k}$ with a non-constant diagonal, the scaling parameter $\rho_k$ in (13) for the $k$th component is replaced by a scalar approximation, which ensures that the resulting matrix is PSD. This yields the following responsibility evaluation of the quantized receive signal:
$$p(k \mid \mathbf{r}) \approx \frac{\pi_k\, \mathcal{N}_{\mathbb{C}}(\mathbf{r}; \mathbf{0}, \mathbf{C}_{r|k})}{\sum_{i=1}^{K} \pi_i\, \mathcal{N}_{\mathbb{C}}(\mathbf{r}; \mathbf{0}, \mathbf{C}_{r|i})}. \tag{21}$$
The final channel estimate is computed via the convex combination of the componentwise Bussgang estimators, which follows from (18) together with Lemma 1, parameterized by the GMM covariances, which yields
$$\hat{\mathbf{h}}_{\mathrm{BGMM}}(\mathbf{r}) = \sum_{k=1}^{K} p(k \mid \mathbf{r})\, \mathbf{C}_{h|k} \mathbf{A}^{\mathrm{H}} \mathbf{B}_k^{\mathrm{H}} \mathbf{C}_{r|k}^{-1} \mathbf{r}. \tag{22}$$
We note that the GMM with fully parameterized covariance matrices is independent of any array geometries at the BS. However, as shown in [33], it is possible to enforce different structural constraints for the GMM's covariances, such as a circulant ("GMM circ") or Toeplitz ("GMM toep") structure. These structural constraints reflect the typical array geometries of a BS, e.g., a ULA or uniform planar array (UPA), and result in a reduced number of parameters and a lower online complexity of the estimator due to the usage of 1D or 2D fast Fourier transforms (FFTs) [33], respectively. The covariance matrix of the $k$th GMM component is thereby constrained to be of the form $\mathbf{C}_{h|k} = \mathbf{Q}^{\mathrm{H}} \operatorname{diag}(\mathbf{c}_{h|k}) \mathbf{Q}$, where $\mathbf{Q}$ is an (oversampled) 1D or 2D discrete Fourier transform (DFT) matrix in the case of a ULA or UPA, respectively, and $[\mathbf{c}_{h|k}]_i \in \mathbb{R}_+$. The structural constraints are, without limitation, also applicable in coarsely quantized systems since the quantization is performed elementwise on the input and thus does not alter the imposed array structure. In this work, we consider solely the case of a ULA, as mentioned in Section II-3. The necessary memory overhead and computational complexity of the resulting estimators are discussed in more detail in Section VII.
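The responsibility evaluation and the convex combination can be sketched for the one-bit case as follows. The sketch is illustrative rather than the authors' implementation: it assumes the standard one-bit gain and arcsine-law expressions from the literature, a Gaussian surrogate for $p(\mathbf{r} \mid k)$ as described above, and a toy two-component model; the helper names are hypothetical.

```python
import numpy as np

def one_bit_stats(C_y):
    """One-bit Bussgang gain and arcsine-law covariance for a given C_y."""
    d = np.real(np.diag(C_y))
    B = np.sqrt(2 / np.pi) * np.diag(d ** -0.5)
    C_norm = C_y / np.sqrt(np.outer(d, d))
    C_r = (2 / np.pi) * (np.arcsin(np.clip(C_norm.real, -1, 1))
                         + 1j * np.arcsin(np.clip(C_norm.imag, -1, 1)))
    return B, C_r

def gmm_bussgang_estimate(r, A, weights, covs, sigma2):
    """Convex combination of componentwise one-bit Bussgang LMMSE estimates."""
    M = A.shape[0]
    log_resp, estimates = [], []
    for pi_k, C_k in zip(weights, covs):
        C_y = A @ C_k @ A.conj().T + sigma2 * np.eye(M)
        B_k, C_r = one_bit_stats(C_y)
        C_r_inv_r = np.linalg.solve(C_r, r)
        _, logdet = np.linalg.slogdet(C_r)
        # Gaussian surrogate for p(r | k), up to a constant shared by all components.
        log_resp.append(np.log(pi_k) - logdet - np.real(r.conj() @ C_r_inv_r))
        estimates.append(C_k @ A.conj().T @ B_k.conj().T @ C_r_inv_r)
    log_resp = np.array(log_resp)
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()
    h_hat = sum(p * e for p, e in zip(resp, estimates))
    return h_hat, resp

rng = np.random.default_rng(2)
N, P, sigma2 = 4, 2, 0.5
idx = np.arange(N)
covs = [0.9 ** np.abs(idx[:, None] - idx[None, :]) + 0j, np.eye(N) + 0j]  # toy components
weights = [0.6, 0.4]
A = np.kron(np.exp(1j * np.array([0.0, np.pi / 4]))[:, None], np.eye(N))
y = rng.standard_normal(N * P) + 1j * rng.standard_normal(N * P)
r = (np.sign(y.real) + 1j * np.sign(y.imag)) / np.sqrt(2)

h_hat, resp = gmm_bussgang_estimate(r, A, weights, covs, sigma2)
```

Since the componentwise filters depend only on the model parameters, the pilots, and the SNR, they can be precomputed offline, so that the online cost is dominated by the $K$ responsibility evaluations and matrix-vector products.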

B. MFA-based Bussgang Estimator
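A related concept to the GMM is the MFA model, which, in addition to a discrete latent variable $k$ that describes the mixture component, also contains a continuous latent variable $\mathbf{z} \in \mathbb{C}^{L}$ of lower dimension, i.e., $L < N$ holds [51, Ch. 12], [34]. This effectively models the data on a piecewise linear subspace. After integrating out the continuous latent variable $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, the PDF of the MFA model is a special form of a GMM with low-rank plus diagonal-constrained covariances of the form
$$p(\mathbf{h}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}_{\mathbb{C}}(\mathbf{h}; \boldsymbol{\mu}_{h|k}, \mathbf{W}_{h|k}\mathbf{W}_{h|k}^{\mathrm{H}} + \boldsymbol{\Psi}_{h|k}) \tag{23}$$
where $\mathbf{W}_{h|k} \in \mathbb{C}^{N \times L}$ is the factor loading matrix and $\boldsymbol{\Psi}_{h|k} \in \mathbb{C}^{N \times N}$ is a diagonal matrix. In order to fit the parameters $\{\pi_k, \boldsymbol{\mu}_{h|k}, \mathbf{W}_{h|k}, \boldsymbol{\Psi}_{h|k}\}_{k=1}^{K}$ of the MFA model for a given dataset $\mathcal{H}$ of channel realizations, an EM algorithm can be used [51, Ch. 12]. After training, by defining $\mathbf{C}_{h|k} = \mathbf{W}_{h|k}\mathbf{W}_{h|k}^{\mathrm{H}} + \boldsymbol{\Psi}_{h|k}$, the model can be effectively treated as a GMM. A zero-mean MFA model with $\boldsymbol{\mu}_{h|k} = \mathbf{0}$ for all $k \in \{1, \dots, K\}$ can be enforced in the same way as in the GMM case. Similar to [34], we set $\boldsymbol{\Psi}_{h|k} = \psi_{h|k} \mathbf{I}$ for all $k \in \{1, \dots, K\}$.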
The main advantage of the MFA model in the context of channel estimation, in contrast to the GMM, lies in the reduced number of parameters due to the low-dimensional latent space, which mitigates overfitting effects during training. Thus, it is a more robust model for lower numbers of training data, as demonstrated in [34]. Since the MFA also parameterizes a conditionally Gaussian distribution, we can apply Lemma 1, resulting in a componentwise Bussgang estimator similar to the GMM case, see Section IV-A, which yields the MFA-parameterized Bussgang channel estimator $\hat{\mathbf{h}}_{\mathrm{BMFA}}$ by substituting the respective covariances in (22). We note that the MFA model enforces the covariance structure via training and is thus independent of the BS's array geometry.
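The parameter reduction can be made tangible with a small NumPy sketch. The parameter counts below are a simple tally of real degrees of freedom per component (assuming an isotropic $\boldsymbol{\Psi}_{h|k} = \psi_{h|k}\mathbf{I}$ and ignoring the rotational ambiguity of the loading matrix); the function names and dimensions are illustrative.

```python
import numpy as np

def mfa_covariance(W, psi):
    """Component covariance C = W W^H + psi * I of an MFA with isotropic noise term."""
    N = W.shape[0]
    return W @ W.conj().T + psi * np.eye(N)

def real_params_full(N):
    """Real degrees of freedom of an N x N Hermitian covariance matrix."""
    return N * N

def real_params_mfa(N, L):
    """Real parameters of a complex N x L loading matrix plus one scalar psi."""
    return 2 * N * L + 1

rng = np.random.default_rng(0)
N, L, psi = 16, 3, 0.1
W = rng.standard_normal((N, L)) + 1j * rng.standard_normal((N, L))
C = mfa_covariance(W, psi)

# The N - L smallest eigenvalues equal psi: the signal part lives in an L-dim subspace.
smallest = np.linalg.eigvalsh(C)[: N - L]
full_count, mfa_count = real_params_full(64), real_params_mfa(64, 8)
```

For example, with $N = 64$ and $L = 8$ this tally gives 1025 real parameters per MFA component instead of 4096 for a full Hermitian covariance.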
Fig. 1: Proposed adapted VAE architecture with the encoder, latent space, and decoder together with the parameterized distributions and the reparameterization trick.

C. VAE-based Bussgang Estimator
The VAE was introduced in [52] and has attracted a lot of interest in the area of generative modeling due to its strong performance, which builds on neural networks (NNs) that are used for the encoder and decoder of the VAE, cf. Fig. 1. In [35], [36], the VAE was successfully utilized for channel estimation in high-resolution systems. In contrast to the GMM and MFA, the VAE comprises a continuous and nonlinear latent space, encoded by the low-dimensional latent vector $\mathbf{z} \in \mathbb{R}^{L}$. The most common design choice is a Gaussian model for the latent vector, i.e., $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Since the variational inference task is no longer tractable by the classical EM algorithm, NNs in combination with the reparameterization trick are used to train the VAE [52], [53]. In this work, we choose the parameterized encoder and decoder distributions as given in (24) and (25), respectively. Note that (25) is chosen to fulfill (14) in Lemma 1. The matrix $\mathbf{F}$ is a DFT matrix such that the parameterized channel covariance matrix of the VAE is a circulant matrix, cf. [36]. This choice results in a reduced number of parameters and is justified by the imposed structure of the ULA at the BS, similar to the GMM case, cf. Section IV-A. The extension to the UPA case is once again straightforward by replacing the 1D DFT matrix with its 2D counterpart.
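The effect of the DFT parameterization can be illustrated numerically. The sketch below assumes a covariance of the form $\mathbf{F}^{\mathrm{H}} \operatorname{diag}(\mathbf{c}) \mathbf{F}$ with a unitary DFT matrix $\mathbf{F}$, analogous to the structured GMM covariances in Section IV-A; the vector `c` is a hypothetical stand-in for a nonnegative decoder output and is not taken from a trained model.

```python
import numpy as np

N = 8
F = np.fft.fft(np.eye(N)) / np.sqrt(N)        # unitary DFT matrix
c = np.linspace(1.0, 0.1, N)                  # hypothetical nonnegative decoder output
C = F.conj().T @ np.diag(c) @ F               # circulant, Hermitian, PSD covariance

def apply_cov_fft(c, x):
    """Compute (F^H diag(c) F) x with FFTs instead of a dense matrix product."""
    return np.fft.ifft(c * np.fft.fft(x))

rng = np.random.default_rng(0)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

is_circulant = np.allclose(C, np.roll(np.roll(C, 1, axis=0), 1, axis=1))
fft_matches_dense = np.allclose(apply_cov_fft(c, x), C @ x)
```

Only the $N$ real entries of `c` need to be produced by the decoder, and products with the covariance cost $O(N \log N)$ via the FFT instead of $O(N^2)$.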
For every training data point $\mathbf{h}_t \in \mathcal{H}$, the VAE computes the evidence lower bound (ELBO) on the log-likelihood [52], cf. (26). By plugging in (24) and (25) and ignoring the constant terms (they do not influence the optimization), the ELBO is utilized as the loss function $\mathcal{L}_{\phi,\theta}$ of the VAE, which is given in (27), cf. [36]. Following Lemma 1, the VAE can be utilized to parameterize a channel estimator by means of the conditional Gaussianity at the output of the decoder of the VAE in combination with linear MMSE filters. After the training, we utilize the VAE to parameterize a channel covariance matrix for each quantized receive pilot $\mathbf{r}$ by forwarding the pilot through the encoder and then using the latent mean $\boldsymbol{\mu}_{\phi}(\mathbf{r})$ as input to the decoder, which yields the channel covariance matrix $\mathbf{C}_{\theta}$ in (28). This procedure approximates (18) with a low-complexity implementation, which has been shown to work well in the high-resolution case [35], [36]. The VAE-parameterized Bussgang estimator then reads as
$$\hat{\mathbf{h}}_{\mathrm{BVAE}}(\mathbf{r}) = \mathbf{C}_{\theta} \mathbf{A}^{\mathrm{H}} \mathbf{B}_z^{\mathrm{H}} \mathbf{C}_{r|z}^{-1} \mathbf{r} \tag{29}$$
where $\mathbf{B}_z$ is computed by plugging $\mathbf{C}_{y|z} = \mathbf{A}\mathbf{C}_{\theta}\mathbf{A}^{\mathrm{H}} + \sigma^2 \mathbf{I}$ into (8) or (9). Similarly, the covariance $\mathbf{C}_{r|z}$ is computed by plugging $\mathbf{C}_{y|z}$ into (11) or (13), cf. Lemma 1.
For the encoder and the decoder, we each use a four-layer feedforward NN with rectified linear unit (ReLU) activation functions, where the real and imaginary parts of the pilot observation are stacked at the input to the encoder. In the case of multiple pilot observations, we add a convolutional NN (CNN) with $P/2$ layers and ReLU activation functions before the encoder, which performs $1 \times 1$ convolutions in order to always have a $2N$-dimensional input at the encoder. This modification drastically reduces the number of parameters and simplifies a forward pass through the VAE. The complete VAE architecture is detailed in Fig. 1.

V. ENABLING LEARNING FROM QUANTIZED PILOT OBSERVATIONS AS TRAINING DATA
As discussed above, the availability of a large training dataset H consisting of representative ground-truth channel samples for a whole BS cell (radio propagation scenario) is questionable in practical communication systems. Although different approaches exist, such as ray tracing, to generate a training dataset that mimics the underlying channel distribution, it is unclear whether they can sufficiently capture the characteristics of a real communication scenario. A different idea is to train the respective models directly on pilot observations and mitigate imperfections, e.g., additive noise or sparsely allocated pilots, through model-based training adaptations. This has already been shown to work well for high-resolution scenarios [36], [54]. However, learning a generative model from a training dataset R consisting of quantized data poses a significant challenge due to the pronounced nonlinear distortion resulting from low-resolution quantization. In the remainder of this section, we propose two novel training adaptations to the GMM and the VAE to learn the underlying channel distributions, although only quantized training data and no ground-truth CSI are available.

A. Covariance Recovery for GMM Approximation
Recovering the unquantized covariance matrix of the input to a one-bit quantizer solely from quantized data has gained a lot of interest very recently [55]-[57]. Since the amplitude information is lost in the case of a zero-threshold one-bit quantization, only the normalized correlation matrix can be obtained via the inverse arcsine law [46]. To resolve this issue, non-zero or time-varying quantizer thresholds were considered in order to be able to estimate the variances of the input signal and, thus, the whole covariance matrix [55]-[57]. The work in [58] validates that the covariance recovery technique from [55] can be used in combination with the Bussgang estimator in order to perform channel estimation. However, the considered quantizer designs with non-zero thresholds are challenging to implement in communication systems and may require more sophisticated analog and digital signal processing, ultimately resulting in performance losses. In contrast, considering multi-bit quantization of the input signal, coarse amplitude information is preserved because of the multi-level quantization, even with a fixed zero-threshold. However, up to now, no covariance recovery algorithm has been proposed for this case.
We derive a novel low-complexity covariance recovery algorithm for the multi-bit case by splitting the task into estimating the correlation matrix and the variances independently. Let us define the following problem statement, where we reuse the notation from above for simplicity. Consider a dataset $\mathcal{R} = \{r_t\}_{t=1}^{T}$ of $T$ samples of the form $r_t = Q_B(y_t)$, where $y_t \sim \mathcal{N}_{\mathbb{C}}(0, C_y)$ and $B > 1$. The task is to recover $C_y$ from the $T$ quantized samples. Since no closed-form solution for the correlation matrix $R_y = \operatorname{diag}(C_y)^{-\frac{1}{2}} C_y \operatorname{diag}(C_y)^{-\frac{1}{2}}$ exists in the case of multi-bit quantization, we simply discard the samples' amplitude information, effectively treating them as one-bit quantized data. As a result, the unquantized correlation matrix $R_y$ can be estimated in closed form from the one-bit sample covariance matrix via the inverse arcsine law, cf. (28). Unfortunately, despite having a closed-form solution, the resulting correlation estimate is not necessarily PSD [55], [59]. However, this can be resolved by a projection onto the set of PSD matrices, as discussed later.
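The correlation step can be illustrated with a minimal sketch. For brevity, it assumes real-valued zero-mean Gaussian inputs (the complex case applies the same elementwise mapping to the real and imaginary parts of the one-bit sample covariance); the function names are hypothetical and not taken from the authors' code.

```python
import numpy as np

def arcsine_law(corr):
    """Forward arcsine law: covariance of sign(y) for a real zero-mean
    Gaussian y with correlation matrix `corr`."""
    return (2.0 / np.pi) * np.arcsin(np.clip(corr, -1.0, 1.0))

def inverse_arcsine_law(sign_cov):
    """Elementwise inverse of the arcsine law (correlation estimate).
    The result is not guaranteed to be PSD for finite sample sizes."""
    return np.sin(0.5 * np.pi * sign_cov)

def one_bit_sample_covariance(samples):
    """Sample covariance of the signs only; amplitudes are discarded.
    samples: real array of shape (T, N), e.g., multi-bit quantized data."""
    signs = np.where(samples >= 0, 1.0, -1.0)
    return signs.T @ signs / samples.shape[0]
```

Applying `inverse_arcsine_law(one_bit_sample_covariance(samples))` returns a matrix with unit diagonal, which reflects that only the correlation, and not the variances, is recoverable from the signs.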
Since the quantization acts elementwise on the real and imaginary parts independently, it is sufficient to derive the variance estimation for a real-valued scalar $y \sim \mathcal{N}(0, \xi^2)$ and $r = Q_B(y) \in \mathbb{R}$ for ease of notation. We note that the amplitude of $y$ follows the half-normal distribution, whose CDF is given by $P(|y| \leq \tau) = \operatorname{erf}(\tau / \sqrt{2\xi^2})$. Because the CDF is fully parameterized by the input signal's variance $\xi^2$, one can utilize the coarse amplitude information after the quantizer for its estimation. By defining the positive quantization thresholds as $\tilde{\tau}_i < \infty$, $i \in \{1, \dots, 2^{B-1} - 1\}$, i.e., $\tilde{\tau}_i = \tau_{i+2^{B-1}}$, one can estimate the probability $P(|y| \leq \tilde{\tau}_i)$ of observing a sample with an amplitude of at most $\tilde{\tau}_i$ by the corresponding sample probability, i.e., the fraction of quantized samples whose amplitude does not exceed $\tilde{\tau}_i$. For circularly symmetric Gaussian distributed complex-valued input and multiple quantization thresholds, i.e., $B > 2$, an overdetermined system of equations can be constructed using the different quantizer thresholds for both the real and imaginary parts, yielding $2^B - 2$ equations. The subtraction of two comes from the fact that the last quantization regions up to infinity are uninformative since, in this case, the (sample) probability is always one. In summary, the equation system accounting for the real part of the input equates the half-normal CDF $\operatorname{erf}(\tilde{\tau}_i / \sqrt{2\xi^2})$ with the sample probability of the real part for each threshold, with $i \in \{1, \dots, 2^{B-1} - 1\}$. The remaining half of the equation system is built similarly by replacing the real with the imaginary part. Note that the equation system only depends on the unknown variance parameter $\xi^2$. Geometrically, we aim to interpolate the sample probabilities belonging to the different thresholds by a Gaussian CDF curve in an LS sense with the adjustable variance parameter $\xi^2$. Since the derivative of the CDF is trivially given by the Gaussian PDF, a simple Gauss-Newton approach can be utilized for solving the nonlinear LS problem. For a robust initial starting point $\xi_0^2$, a solution to the equation with the quantizer's largest $\tilde{\tau}_i$ is used.
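The variance fit can be sketched as follows for a real-valued scalar. This is an illustrative implementation under stated assumptions, not the authors' code: the thresholds are passed in ascending order, the sample probabilities are computed beforehand, and the function names, the clipping constants, and the positivity safeguard are choices made here.

```python
import math
from statistics import NormalDist

import numpy as np

def halfnormal_cdf(tau, var):
    """P(|y| <= tau) for y ~ N(0, var), evaluated for each threshold."""
    return np.array([math.erf(t / math.sqrt(2.0 * var)) for t in tau])

def estimate_variance(tau, p_hat, tol=1e-5, max_iter=50):
    """Gauss-Newton fit of the variance to sample probabilities p_hat,
    where p_hat[i] estimates P(|y| <= tau[i]) and tau is ascending."""
    tau = np.asarray(tau, dtype=float)
    # Clipping avoids an infinite quantile when a sample probability is 0 or 1.
    p_hat = np.clip(np.asarray(p_hat, dtype=float), 1e-6, 1.0 - 1e-6)
    # Initial point: solve 2 * Phi(tau / xi) - 1 = p for the largest threshold.
    z = NormalDist().inv_cdf(0.5 * (1.0 + p_hat[-1]))
    var = (tau[-1] / z) ** 2
    for _ in range(max_iter):
        resid = p_hat - halfnormal_cdf(tau, var)
        # Derivative of erf(tau / sqrt(2 var)) with respect to var.
        jac = (-(2.0 / math.sqrt(math.pi)) * np.exp(-tau**2 / (2.0 * var))
               * tau * (2.0 * var) ** -1.5)
        step = float(jac @ resid) / float(jac @ jac)
        var_new = max(var + step, 1e-12)  # safeguard: keep the variance positive
        if abs(var_new - var) < tol:
            return var_new
        var = var_new
    return var
```

Because the quantization labels lie between the thresholds, counting quantized samples with amplitude below a threshold gives the same sample probability as counting the unquantized samples, so `p_hat` can be formed from quantized data alone.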
In the multi-dimensional case, if the input signal's variance is assumed to be different for each antenna, the nonlinear LS problem can be solved for each dimension independently, yielding an estimate of $\operatorname{diag}(C_y)$. Otherwise, the equation systems for each dimension can be combined to yield a more accurate estimate of the single variance parameter. Note that in the complex-valued case, the estimated variance has to be scaled by a factor of two to account for the sum of the real and imaginary parts. Finally, the full covariance matrix estimate is computed as $\hat{C}_y = \operatorname{diag}(\hat{C}_y)^{\frac{1}{2}} \hat{R}_y \operatorname{diag}(\hat{C}_y)^{\frac{1}{2}}$, where $\operatorname{diag}(\hat{C}_y)$ contains the estimated variances and $\hat{R}_y$ is the correlation estimate.
The derived covariance recovery scheme is now used in order to fit the GMM's covariances by only using quantized data. For simplicity, we assume that the training data stems from single snapshot observations, i.e., $r = Q_B(h + n)$ with $P = 1$, which can always be enforced by pre-processing. In each iteration of the EM algorithm, the M-step is adapted by using the proposed covariance recovery algorithm for estimating the unquantized covariance matrix, which is possible due to the Gaussianity of each GMM component. The necessary change to the purely Gaussian setting from before is that the responsibilities, computed in each E-step, are used to weight each sample's contribution to the sample probability for component $k$ by $p(k|r_t)/N_k$ for all $i \in \{1, \dots, 2^{B-1} - 1\}$, where $N_k = \sum_{t=1}^{T} p(k|r_t)$. Note that $\Re([r_t]_n)$ is replaced by $\Im([r_t]_n)$ for the second half of the equation system. Since the quantizer's input signal is also distorted with AWGN, the M-step adaptation from [54, Th. 1] for noisy data is used in addition by means of subtracting the noise covariance and afterward projecting to the set of PSD matrices by performing an eigenvalue decomposition (EVD) and truncating the negative eigenvalues. This also accounts for the possibly non-PSD correlation estimate from (28). After estimating the channel covariance $\hat{C}_{h|k}$ of component $k$ in this way, one first determines $\hat{C}_{y|k}$ to eventually construct the covariance of the quantized observation $\hat{C}_{r|k}$ by using one of the approximations given in (12) or (13). Since the complexity is not crucial in the offline learning, we utilize the accurate formula for the variance (12). The necessary adaptations in the M-step are concisely summarized in Algorithm 1. The so-found covariance matrix $\hat{C}_{r|k}$ is afterward used to compute the responsibilities in the E-step, similar to (21).
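The noise subtraction and PSD projection described above can be sketched in a few lines. The sketch assumes the single-snapshot case with an identity observation matrix, so that the channel covariance is the observation covariance minus $\sigma^2 I$; the function names are hypothetical.

```python
import numpy as np

def project_psd(mat):
    """Project a (nearly) Hermitian matrix onto the PSD cone by
    truncating the negative eigenvalues of its EVD."""
    herm = 0.5 * (mat + mat.conj().T)  # remove numerical asymmetry
    eigvals, eigvecs = np.linalg.eigh(herm)
    eigvals = np.clip(eigvals, 0.0, None)
    return (eigvecs * eigvals) @ eigvecs.conj().T

def denoise_covariance(cov_y, noise_var):
    """Subtract the AWGN covariance and project the result to the PSD set
    (assumes r = Q_B(h + n), i.e., an identity observation matrix)."""
    return project_psd(cov_y - noise_var * np.eye(cov_y.shape[0]))
```

Truncation sets the offending eigenvalues to zero, so the eigenvectors belonging to them no longer contribute to the recovered covariance.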
By the law of large numbers, the variance estimate based on the sample probability is a consistent estimator and, together with the closed-form solution for the correlation matrix, yields a consistent estimator of the unquantized covariance matrix.
Algorithm 1: Adapted M-step for quantized training data.

B. Loss Function Adaptation for the VAE
Similar to the GMM, the VAE model can be trained directly from noisy pilot observations by properly adapting the ELBO loss function [35], [36]. The idea, thereby, is to modify the parameterized channel's covariance matrix, which stems from the output of the decoder, such that it represents the covariance matrix of the pilot observations in the loss function (26). Consequently, only the channel covariance matrix is learned by the VAE through gradient updates. For the case of coarsely quantized training samples from R, the expression of the quantized covariance matrix with respect to the channel covariance matrix is not in closed form and thus not differentiable, cf. (12). However, one can use the approximation from (13) as shown in the following.
Since the training is done in the Fourier-transformed domain in order to parameterize a circulant covariance matrix C θ = F H diag(c θ ) F, we first simplify the expression for the diagonal term diag( and where ρ θ is computed via (8) by plugging in diag(C y,θ ).
Thus, the VAE model can be learned from quantized data by replacing $c_\theta$ with $c_{r,\theta}$ from (32) and $h$ with $r$ in the loss function (26).

VI. BASELINE CHANNEL ESTIMATORS
We compare the proposed parameterized generative modeling-aided channel estimators with state-of-the-art baseline channel estimators for coarse quantization systems. First, for the 3GPP channel model from Section II-3, we have genie access to the true underlying channel covariance matrix $C_{h|\delta}$ from (6) in the simulation. This allows us to evaluate the genie-aided Bussgang estimator $\hat{h}_{\text{Buss-genie}} = C_{h|\delta} A^H B_\delta^H C_{r|\delta}^{-1} r$, where $B_\delta$ and $C_{r|\delta}$ are found by plugging $C_{h|\delta}$ into (8) or (9) and (13) or (11) for the multi-bit or one-bit case, respectively. Note that this estimator is not feasible in practice but only serves as a lower bound on the performance of the Bussgang estimator, which is the best linear estimator. The corresponding curves are labeled as "Buss-genie".
A practically feasible approach that is primarily used in the literature is to use the sample covariance matrix $\hat{C}_h = \frac{1}{T} \sum_{t=1}^{T} h_t h_t^H$ in combination with the Bussgang estimator (10), labeled as "Buss-Scov". Note that for this case, a training dataset of ground-truth channels H is necessary.
A simple baseline is the LS estimate based on the Bussgang decomposition (7), i.e., $\hat{h}_{\text{BLS}} = A^\dagger B^\dagger r$, labeled as "BLS". For computing the Bussgang gain (8) or (9), we use the sample covariance matrix $\hat{C}_h$ from above.
In [17], a CS-based channel estimator is proposed, which is a combination of the EM algorithm for approximating the channel PDF in the sparse angular domain and the GAMP algorithm to solve the sparse recovery problem. Note that in this case, an EM algorithm is deployed online for each transmission link and pilot observation, which is fundamentally different from the GMM approach, which exploits the EM algorithm solely in the offline phase. We applied the EM-GM-GAMP algorithm to estimate the channel parameters $x$ in the angular domain, such that the final channel estimate is computed as $\hat{h}_{\text{EM-GM-GAMP}} = F x$, labeled as "EM-GM-GAMP".
We also evaluate a deep learning-based estimator, similar to [28], where a three-layered feed-forward NN is trained to directly map the pilot observation to a channel estimate. The ReLU function is used as the activation function in all layers except the output layer. To achieve a fair comparison to the proposed approaches, a single network is trained for the whole SNR range. Similar to [28], the best performance was achieved by a drastic increase of the neurons in the hidden layers. We, therefore, set the number of neurons in both hidden layers to $2N^2$. The corresponding curves are labeled as "DNN".

VII. MEMORY AND COMPLEXITY ANALYSIS
The offline memory requirements of data-based techniques and the algorithmic online complexity are key features for channel estimation in real-time systems.The number of parameters of the (structured) zero-mean GMM and the MFA model is determined by the K covariances and the number of mixing coefficients.The corresponding linear MMSE filters for each component and SNR value are fixed after the offline training, which means that they can be pre-computed.Since this can be similarly done for the evaluation of the responsibilities in (21), the overall online complexity is determined by matrix-vector products for each component [31].Notably, the computation of the K filters/responsibilities is trivially parallelizable, which is of great importance in practical systems.For the case of circulant-structured GMM covariances, cf.Section IV-A, the complexity reduces due to the usage of FFTs [31].
For the VAE approach, the number of parameters and the complexity for a forward pass through the network depend on the network architecture, cf.Section IV-C.The resulting filter is computable by means of FFTs since a circulant covariance matrix is parameterized, similar to the circulant GMM case.We further note that the approaches that learn from quantized data, cf.Section V, are only adapted in the training procedure and thus have the same memory overhead and online complexity as the models learned with perfect CSI.
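To illustrate why a circulant parameterization $C = F^H \operatorname{diag}(c) F$ reduces the online complexity, the following sketch applies such a matrix and its inverse with FFTs instead of explicit matrix-vector products. It assumes a unitary DFT matrix $F$ and a strictly positive eigenvalue vector `c`; the function names are hypothetical.

```python
import numpy as np

def circulant_matvec(c, x):
    """Compute (F^H diag(c) F) x with O(N log N) operations,
    where F is the unitary DFT matrix."""
    # The 1/sqrt(N) factors of the unitary DFT cancel between fft and ifft.
    return np.fft.ifft(c * np.fft.fft(x))

def circulant_solve(c, x):
    """Compute (F^H diag(c) F)^{-1} x, assuming all entries of c are nonzero."""
    return np.fft.ifft(np.fft.fft(x) / c)
```

Only the length-$N$ vector `c` has to be stored per covariance matrix, instead of $N^2$ matrix entries.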
Table I summarizes the memory overhead and computational online complexity of all proposed approaches as well as the baseline methods.It can be seen that the proposed approaches vary in the number of parameters and online complexity to allow for a smooth trade-off with respect to the desirable performance and practical system requirements.Of particular importance is the comparison to the deep NN approach, which is adapted from [28] and directly provides a channel estimate at the output, i.e., it does not parameterize an analytical estimator.It becomes apparent that the proposed approaches exhibit a much lower number of parameters as well as a lower online complexity compared to the deep NN approach.The main reason for this is the drastic increase of neurons in the hidden layers in the NN approach, cf.[28]; in contrast, the proposed models are comprised of a latent space which enforces a compression, and thus, a reduced memory and complexity overhead.As shown in the following numerical results, the estimation performance of the proposed approaches is, in most cases, even better, although having fewer parameters and reduced online complexity.

VIII. ACHIEVABLE RATE LOWER BOUND
The achievable rate is of great interest in quantized systems [3], [24]. We evaluate a lower bound on the corresponding achievable rate of a respective data transmission system that takes the CSI mismatch into account. To this end, after estimating the channel with the pilot transmission in (1), the data symbol $s$ is transmitted over the same channel, i.e., $r = Q_B(hs + n) = Bhs + q$; in the second equality, the linearized model with Bussgang's decomposition is used, where $q = Bn + \eta$. We make the worst-case assumption that the aggregated noise is Gaussian, i.e., $q \sim \mathcal{N}_{\mathbb{C}}(0, C_q = C_r - B C_h B^H)$, cf. [60]. Furthermore, the BS is assumed to perform maximum-ratio combining (MRC) with the normalized filter $g_{\text{MRC}}^H = \hat{h}^H / \|\hat{h}\|_2^2$. Note that the variance of the data symbol $s$ is assumed to be one without loss of generality. We further assume that the SNR is the same during pilot and data transmission. Thus, we can evaluate the use-and-then-forget (UF) bound as a lower bound on the achievable rate, cf. (33).
TABLE I: Computational complexity and number of parameters of the discussed estimators with example numbers for the case of K = N = 64, L = 16, and P = 4.

A. Covariance Recovery
Before investigating the channel estimation performance, we evaluate the proposed covariance recovery algorithm from Section V-A in a purely Gaussian setting without AWGN, comparing it to reasonable baselines. By assuming genie-knowledge of the unquantized samples, we can evaluate the unquantized sample covariance matrix, i.e., $\hat{C}_{\text{unquant}} = \frac{1}{T} \sum_{t=1}^{T} h_t h_t^H$. Note that this baseline requires perfect CSI and thus only serves as a reference. A feasible approach is to neglect the quantization effect and evaluate the quantized sample covariance matrix $\hat{C}_{\text{quant}} = \frac{1}{T} \sum_{t=1}^{T} r_t r_t^H$, where $r_t = Q_B(h_t)$. This baseline becomes more accurate with more quantization bits $B$ but introduces a systematic error due to the coarse quantization.
We construct 100 random covariance matrices $C_{h|\delta}$ from (6) and draw a fixed number $T$ of samples from each covariance matrix. The normalized MSE of the covariance estimate is computed by using those 100 covariance realizations.
The left plot in Fig. 2 shows the normalized MSE versus different numbers $T$ of samples for different numbers $B$ of quantization bits. It can be seen that the proposed covariance recovery scheme performs equally well for all quantization levels since it yields a consistent estimator, i.e., the estimation error steadily decreases for a larger number of samples. This is similar to the unquantized sample covariance matrix ("Scov ∞-bit") but with a more or less constant offset, which is mainly caused by the correlation estimate that is unchanged for varying bits $B$, cf. (28). In contrast, the quantized sample covariance matrix ("Scov B-bit") is a biased estimator and shows a relatively high error floor, which decreases for a higher number of quantization bits, as expected. We note that the consistency and unbiasedness of the proposed estimator are key characteristics since, for the training with quantized pilot observations, it can be expected that a large dataset R can be acquired cheaply during regular operation of the BS; this is in contrast to a dataset H consisting of ground-truth channels, which either requires costly measurement campaigns or intricate modeling of the underlying propagation environment.
In the right plot in Fig. 2, we evaluate the necessary number of iterations of the Gauss-Newton algorithm for solving the nonlinear LS problem until convergence, i.e., until the absolute change of the estimated variances is smaller than $10^{-5}$. It can be seen that in all cases, only a few iterations are necessary for the convergence; the number of iterations also decreases for a higher number of data samples $T$ and for fewer quantization bits (due to an increasing number of equations for larger $B$), which makes the variance estimation very fast. In combination with the closed-form solution for the correlation matrix (28), the covariance estimator exhibits low complexity and is applicable for any given number $B > 2$ of quantization bits.

B. Channel Estimation
This section provides numerical results to evaluate the proposed channel estimators, cf. Section IV and Section V, against the discussed state-of-the-art baselines from Section VI. In all simulations, we have fixed the number of training samples for both H and R, cf. Section II-4, to T = 100,000. The normalized MSE and the achievable rate lower bound (33) are computed by means of $T_{\text{test}}$ = 10,000 channel samples, which are not part of the training dataset. If not otherwise stated, the number of components for the GMM/MFA is K = 64, the latent dimension for the MFA/VAE is L = N/4, and the data-aided approaches are trained with the channel dataset H. For both the VAE and the DNN approach, a single NN architecture is trained for the whole SNR range of [−10, 20] dB. Since a low pilot overhead is considered to be a key aspect in practical systems [2], we especially focus on the single snapshot scenario in this paper. The simulation code of the proposed channel estimators is publicly available.
In Fig. 3, we evaluate the MSE performance of the proposed channel estimators in comparison to the baseline methods over the SNR for the 3GPP channel model from Section II-3 with one (top row) and three (bottom row) propagation clusters for B ∈ {1, 2, 3} quantization bits, N = 64 BS antennas, and P = 1 pilot. In all cases, the approaches "BLS", "Buss-Scov", and "EM-GM-GAMP" are outperformed with a considerable performance gap over the whole SNR range by the proposed approaches. This is due to the fact that the Bussgang theorem does not hold for the non-Gaussian distributed channels, which is assumed by "Buss-Scov"; moreover, the channels are generally not perfectly sparse in the angular domain (leakage effect), which substantially degrades the CS approach "EM-GM-GAMP". Interestingly, for the case of one propagation cluster, the GMM-based approach is close to the "Buss-genie" approach, which is the Bussgang estimator with utopian knowledge of the true channel covariance matrix for a single snapshot; this underlines the powerful estimation abilities of the GMM. For the considered case of a ULA, the Toeplitz-structured GMM version is almost on par with the full GMM approach, whereas the circulant-structured approach exhibits a small performance gap. The observation of an increasing MSE for higher SNR values beyond a certain SNR level in some cases is due to stochastic resonance, which is a well-known effect in quantized systems [61]; thereby, the effect can vary for different estimators, depending on the parameterization. The "DNN" approach also shows good estimation results for the different scenarios but is consistently outperformed by at least one of the proposed estimators, although having a much larger number of parameters and a higher online complexity, cf. Table I. Overall, the simulation results in Fig. 3 demonstrate the great potential of the proposed class of parameterized estimators based on Gaussian latent models in combination with the Bussgang estimator.
Fig. 4 assesses the achievable rate lower bound from (33) for the 3GPP channel model with one (left) and three (right) propagation clusters for N = 64 antennas, P = 1 pilot observation, and different numbers of quantization bits. For the case of one propagation cluster, the achievable rate lower bound of the GMM and VAE approach is almost on par with the "Buss-genie" approach. Moreover, a substantial gap to the achievable rate lower bound of the "Buss-Scov" approach is apparent for all considered numbers of quantization bits. This behavior similarly translates to the case of three propagation clusters but with an overall reduced gap to the baseline approach. These results indicate that the better estimation performance of the proposed estimators can be effectively converted into a higher data rate or into a lower resolution while preserving the same throughput as the baseline approach "Buss-Scov".
In Fig. 5 (left), the MSE performance is compared for different numbers $B$ of quantization bits, now for the QuaDRiGa LOS channel model, cf. Section II-3. Once again, the approaches "BLS", "Buss-Scov", and "EM-GM-GAMP" are outperformed over the whole range of quantization bits. Interestingly, the "DNN" approach is comparably good for B = 2 and B = 3 but, in turn, suffers in performance in the extreme cases of B = 1 and infinite resolution, which indicates better robustness of the proposed approaches in comparison. Notably, the MFA estimator is ranked among the best estimators in this case, which highlights the fact that the estimators' performances may vary slightly across different channel models; however, it can also be seen that the overall performance of the proposed class of estimators is stable and robust with respect to a different channel model. In Fig. 5 (right), the corresponding achievable rate lower bound from (33) is evaluated. It can be seen that the better estimation qualities of the proposed approaches in terms of the MSE directly translate to a higher achievable rate guarantee, which is only approximately 1-2 bits/s/Hz below the achievable rate lower bound with perfect CSI knowledge at the receiver.
The left plot in Fig. 6 examines the estimation quality over different numbers N of antennas at the BS for B = 1 bit, P = 1 pilot, and an SNR of 10dB for the QuaDRiGa LOS channel model, cf.Section II-3.In contrast to the baseline approaches "BLS", "Buss-Scov", and "EM-GM-GAMP", the performance of the proposed estimators significantly increases for a higher number of antennas, which is particularly important in massive MIMO systems.The GMM approach performs best for all considered antenna numbers, whereas the VAE approach is especially strong in the high number of antennas case; this is reasoned by the circulant parameterization of the covariances by the VAE, which only holds asymptotically for high numbers of antennas.A similar behavior is observed for the "GMMcirc" estimator.The "DNN" approach, which has a quartic scaling of the number of parameters in the number of antennas, cf.Table I, is outperformed over the whole range.
The right plot in Fig. 6 shows the MSE performance for an increasing number of pilot observations by utilizing the pilot design from Section II-1 for N = 64 antennas and a fixed SNR of 5dB.The proposed estimators outperform all baseline approaches over the whole range of pilot observations, including the "DNN" estimator.Especially the MFA and VAE models, which are comprised of a nonlinear latent space, perform well in the high number of pilots regime.Although we focus our analysis primarily on the single-snapshot case, we see that also with an increasing number of pilots, the proposed estimators perform very well.
Next, the number K of GMM components that are necessary to achieve a certain performance is discussed.In the left plot of Fig. 7, the MSE over the number of GMM components is evaluated for B = 1 bit, N = 64 antennas, P = 1 pilot, and for varying SNRs for the QuaDRiGa LOS as well as mixed LOS/NLOS channel model, cf.Section II-3.It can be expected that the superposition of many sub-paths, as it is the case in the mixed LOS/NLOS scenario, results in a less structured wireless channel, and thus, less structural information can be inferred as prior knowledge by the data-aided models.Therefore, it can be observed that the overall performance is worse for the mixed scenario.However, in both scenarios, the increase of GMM components continuously enhances the estimation performance, with a greater improvement in the pure LOS case.This points towards applications in mmWave communications, where the high frequency in combination with smaller BS cells results in high LOS probabilities.The right plot of Fig. 7 analyzes the same setup but now for B = 3 quantization bits.In this case, we evaluate the approximation quality of computing C r for a given C y via ( 13) by comparing it with the estimator that computes the exact variances via (12) and otherwise uses the same approximation for the off-diagonals, labeled "GMM-ex.".As expected, the approximation is highly accurate in the low to medium SNR region, which is the considered operating range of low-resolution systems.In high SNR, the approximation is less accurate and shows saturation effects when increasing the number of GMM components.However, the overall approximation loss is small, and it still results in a high estimation accuracy as compared to the baseline approaches.Besides that, the estimation performance generally steadily increases for a higher number of GMM components with an overall saturation for high numbers K of components.
Fig. 8 evaluates the proposed training adaptations for the GMM and the VAE, as detailed in Section V, in order to learn from noisy and quantized pilot observations R, cf. Section II-4, without having ground-truth channel samples during training. We refer to the adapted GMM, cf. Algorithm 1, as "GMM R", and the GMM learned with ground-truth channels as "GMM H". The VAEs are denoted likewise. Note that a main difference between "GMM R" and "VAE R" is that the training of the adapted GMM is performed for a fixed SNR, but afterward, the model can be utilized for the whole SNR range, whereas the VAE is trained directly for the whole SNR range in order to generalize properly. In order to have a meaningful comparison, the "GMM R" is trained for each SNR point, whereas the "VAE R" is trained for the whole SNR range (no performance gain was seen in the simulation results for an SNR-dependent training).
In Fig. 8 (a) and (b), the case of N = 64, P = 1, B = 2, and the 3GPP channel model with one and three propagation clusters are considered, respectively, whereas in Fig. 8 (c) and (d) the same setup with B = 3 bits is investigated.Astonishingly, through the model-based adaptations, the models trained with the dataset R consisting of coarsely quantized and noisy data samples are almost on par with their counterparts that are trained with ground-truth noise-free channel samples from H. Overall, the estimation quality seems to be most accurate in the low to medium SNR range, where the approximation in (13) and thus the training adaptations are highly accurate.This implies that the generally costly dataset H can be replaced by R with almost no performance loss in this regime.In the high SNR regime, the performance loss of "GMM R" tends to increase, which is a consequence of the generally indefinite closed-form correlation estimate (28).After the projection onto the set of PSD matrices by truncating the negative eigenvalues, cf.Algorithm 1, the resulting covariance estimate is missing the corresponding eigenvectors, which has a higher impact on the performance in the high SNR regime.This correlates with the observation that the adaptations are working best in cases with fewer multi-path components.

X. CONCLUSION
In this work, we presented a novel and promising framework for channel estimation in coarse quantization systems by utilizing Gaussian latent models such as GMMs, MFAs, and VAEs.These models successfully learn the unknown and complex channel distributions present in radio propagation scenarios and, afterward, utilize this valuable prior information to enable the development of tractable parameterized linear MMSE estimators based on a conditional Bussgang decomposition.We have shown that all of the presented estimators perform well for various channel and system parameters with only minor differences.This allows for selecting the preferred model using the discussed memory and complexity overhead.
In addition, we derived model-based training adaptations, i.e., a covariance recovery algorithm for the GMM and a loss function adaptation for the VAE, in order to learn these models directly from quantized training data, with only marginal performance losses.Extensive simulations verified a superior performance over classical and deep learning-based approaches in terms of MSE and achievable rate metrics.
The presented work outlines several directions for further investigating the proposed estimation framework, e.g., the analysis of multi-user systems, pilot contamination in multicell systems, and different receive strategies, e.g., zero-forcing.With the recent advances of the discussed VAE concept to model time-varying channels [62], an extension of the presented estimation framework to time-varying systems is part of future work.
which can be simply shown by writing diag(C θ ) = 1 N tr(C θ ) I and utilizing the properties of the trace together with the unitary DFT matrix F. Using this, we can now write diag(C y,θ ) = 1 N N n=1 ([c θ ] n + σ 2 ) I when assuming single snapshot pilot observations as r =

Fig. 2: Performance evaluation for covariance estimation with N = 64 dimensions and 100 Monte Carlo iterations using covariances obtained from the 3GPP model (6).

Fig. 8: MSE performance evaluation of the models trained with quantized data for the 3GPP channel model (cf. Section II-3) for P = 1 pilot and N = 64 antennas.
overhead. Notably, all three models share the advantage of being adaptable to different SNRs, pilot sequences, and quantization bits without requiring re-training.
3) To facilitate practical feasibility, we propose training adaptations that allow us to learn the corresponding models solely from quantized pilot observations as training data, which eliminates the need for perfect CSI in the training stage. This feature enables us to use quantized pilot observations collected during the regular BS operation for training. In the case of the GMM, we introduce a novel covariance recovery method that serves as an unbiased and consistent estimator of the unquantized input covariance matrix, using only quantized samples. Additionally, for the VAE, we adapt